Academic literature on the topic 'Text tokenization'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles

Consult the lists of relevant articles, books, theses, conference reports, and other scholarly sources on the topic 'Text tokenization.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Press it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Journal articles on the topic "Text tokenization"

1

Sekhar, Sowmik. "Tokenization for Text Analysis." International Journal of Scientific Research and Engineering Trends 10, no. 1 (2024): 149–52. http://dx.doi.org/10.61137/ijsret.vol.10.issue1.127.

2

Oparauwah, Nnaemeka M., Juliet N. Odii, Ikechukwu I. Ayogu, and Vitalis C. Iwuchukwu. "A boundary-based tokenization technique for extractive text summarization." World Journal of Advanced Research and Reviews 11, no. 2 (2021): 303–12. https://doi.org/10.5281/zenodo.5336977.

Abstract:
The need to extract and manage vital information contained in copious volumes of text documents has given birth to several automatic text summarization (ATS) approaches. ATS has found application in academic research, medical health record analysis, content creation and search engine optimization, finance, and media. This study presents a boundary-based tokenization method for extractive text summarization. The proposed method performs word tokenization by defining word boundaries in place of specific delimiters. An extractive summarization algorithm was further developed based on the proposed boundary-based tokenization method, as well as word length consideration to control redundancy in the summary output. Experimental results showed that the proposed approach enhanced word tokenization by improving the selection of appropriate keywords from the text document to be used for summarization.
3

Oparauwah, Nnaemeka M., Juliet N. Odii, Ikechukwu I. Ayogu, and Vitalis C. Iwuchukwu. "A boundary-based tokenization technique for extractive text summarization." World Journal of Advanced Research and Reviews 11, no. 2 (2021): 303–12. http://dx.doi.org/10.30574/wjarr.2021.11.2.0351.

Abstract:
The need to extract and manage vital information contained in copious volumes of text documents has given birth to several automatic text summarization (ATS) approaches. ATS has found application in academic research, medical health record analysis, content creation and search engine optimization, finance, and media. This study presents a boundary-based tokenization method for extractive text summarization. The proposed method performs word tokenization by defining word boundaries in place of specific delimiters. An extractive summarization algorithm was further developed based on the proposed boundary-based tokenization method, as well as word length consideration to control redundancy in the summary output. Experimental results showed that the proposed approach enhanced word tokenization by improving the selection of appropriate keywords from the text document to be used for summarization.
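As a rough illustration of the general idea described in this abstract (and not the authors' exact algorithm), the following minimal Python sketch tokenizes by detecting boundaries between character classes rather than by splitting on a fixed list of delimiters; the boundary rule and the delimiter baseline are both assumptions chosen for demonstration.

    import re

    def boundary_tokenize(text):
        # A token is a maximal run of word characters, optionally joined by
        # internal hyphens or apostrophes; boundaries are implied wherever
        # the character class changes, so no delimiter list is needed.
        return re.findall(r"\w+(?:[-']\w+)*", text)

    def delimiter_tokenize(text):
        # Baseline for comparison: split on an explicit delimiter list.
        return [t for t in re.split(r"[ ,;.!?]+", text) if t]

    sample = "ATS (automatic text summarization): extract key sentences."
    print(boundary_tokenize(sample))   # no stray punctuation in the tokens
    print(delimiter_tokenize(sample))  # leaves '(' and '):' attached to tokens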
4

Nazir, Shahzad, Muhammad Asif, Mariam Rehman, and Shahbaz Ahmad. "Machine learning based framework for fine-grained word segmentation and enhanced text normalization for low resourced language." PeerJ Computer Science 10 (January 31, 2024): e1704. http://dx.doi.org/10.7717/peerj-cs.1704.

Abstract:
In text applications, pre-processing is deemed a significant factor in enhancing the outcomes of natural language processing (NLP) tasks. Text normalization and tokenization are two pivotal pre-processing procedures whose importance cannot be overstated. Text normalization refers to transforming raw text into scriptural standardized text, while word tokenization splits the text into tokens or words. Well-defined normalization and tokenization approaches exist for most widely spoken languages in the world. However, the world's 10th most widely spoken language has been overlooked by the research community. This research presents improved text normalization and tokenization techniques for the Urdu language. For Urdu text normalization, multiple regular expressions and rules are proposed, including removing diacritics, normalizing single characters, separating digits, etc. For word tokenization, core features are defined and extracted for each character of the text. A machine learning model is combined with specified handcrafted rules to predict spaces and thus tokenize the text. For the experiments, the largest human-annotated dataset in Urdu script was created, covering five different domains. The results have been evaluated using precision, recall, F-measure, and accuracy, and compared with the state of the art. The normalization approach produced a 20% improvement and the tokenization approach a 6% improvement.
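To make the two pre-processing steps above concrete, here is a minimal Python sketch of the kind of rule-based normalization the abstract mentions (diacritic removal and digit separation); the specific rules and character ranges are illustrative assumptions, and the machine-learning space-prediction model is omitted entirely.

    import re
    import unicodedata

    def normalize(text):
        # Remove diacritics: drop Unicode combining marks, which covers the
        # Arabic-script diacritics used in Urdu (illustrative rule only).
        text = "".join(ch for ch in text if not unicodedata.combining(ch))
        # Separate digits: put spaces around runs of ASCII or Urdu digits.
        text = re.sub(r"([0-9\u06F0-\u06F9]+)", r" \1 ", text)
        # Collapse any repeated whitespace introduced above.
        return re.sub(r"\s+", " ", text).strip()

    print(normalize("report2024text"))  # -> 'report 2024 text'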
5

Bar-Haim, Roy, Khalil Sima'an, and Yoad Winter. "Part-of-speech tagging of Modern Hebrew text." Natural Language Engineering 14, no. 2 (2008): 223–51. http://dx.doi.org/10.1017/s135132490700455x.

Abstract:
Words in Semitic texts often consist of a concatenation of word segments, each corresponding to a part-of-speech (POS) category. Semitic words may be ambiguous with regard to their segmentation as well as to the POS tags assigned to each segment. When designing POS taggers for Semitic languages, a major architectural decision concerns the choice of the atomic input tokens (terminal symbols). If the tokenization is at the word level, the output tags must be complex, and represent both the segmentation of the word and the POS tag assigned to each word segment. If the tokenization is at the segment level, the input itself must encode the different alternative segmentations of the words, while the output consists of standard POS tags. Comparing these two alternatives is not trivial, as the choice between them may have global effects on the grammatical model. Moreover, intermediate levels of tokenization between these two extremes are conceivable, and, as we aim to show, beneficial. To the best of our knowledge, the problem of tokenization for POS tagging of Semitic languages has not been addressed before in full generality. In this paper, we study this problem for the purpose of POS tagging of Modern Hebrew texts. After extensive error analysis of the two simple tokenization models, we propose a novel, linguistically motivated, intermediate tokenization model that gives better performance for Hebrew over the two initial architectures. Our study is based on the well-known hidden Markov models (HMMs). We start out from a manually devised morphological analyzer and a very small annotated corpus, and describe how to adapt an HMM-based POS tagger for both tokenization architectures. We present an effective technique for smoothing the lexical probabilities using an untagged corpus, and a novel transformation for casting the segment-level tagger in terms of a standard, word-level HMM implementation. The results obtained using our model are on par with the best published results on Modern Standard Arabic, despite the much smaller annotated corpus available for Modern Hebrew.
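A tiny Python illustration of the architectural choice discussed above, using the transliterated Hebrew word-form 'bbyt' ('in the house'); the segmentation, tags, and tagset here are invented for demonstration and are not taken from the paper.

    # Word-level tokenization: one input token, one complex tag that encodes
    # both the segmentation and the POS of each segment.
    word_level = [("bbyt", "IN+NN")]

    # Segment-level tokenization: the input must encode the alternative
    # segmentations, while the output is a sequence of standard POS tags.
    candidate_segmentations = [["bbyt"], ["b", "byt"]]
    chosen_analysis = [("b", "IN"), ("byt", "NN")]

    print(word_level)
    print(candidate_segmentations, chosen_analysis)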
6

Vadlapati, Praneeth. "TokEncryption: Enhanced Hashing of Text using Tokenization." International Journal of Scientific Research in Engineering and Management 08, no. 12 (2024): 1–8. https://doi.org/10.55041/ijsrem20280.

Abstract:
In the current digital space, the security of sensitive information, such as passwords and private data, is of high importance. Traditional hashing methods might not adequately address data privacy concerns or vulnerabilities created due to weak passwords. Tokenization methods are utilized in natural language processing (NLP). This paper introduces a method called “TokEncryption” to utilize tokens for one-way encryption of text called hashing. A tokenizer is used to generate tokens for an input text, which are utilized to encrypt the text to create secure encrypted text. Different characters of the text are encrypted using distinct tokens to ensure a variation in encryption patterns throughout the resultant text. The process enhances data security and privacy using an unconventional approach that makes it secure from attackers attempting to reconstruct the text. The system can be used in addition to the existing encryption approaches. The results show a successful encryption of text. Weak passwords are successfully encrypted to create strong passwords with multiple types of characters, including alphabets, numbers, and special characters. The code is available at github.com/Pro-GenAI/TokEncryption.
Keywords: data privacy, data security, data encryption, natural language processing (NLP), tokenization, cryptography, information security
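The following Python sketch illustrates the general flavour of token-dependent hashing described above; it is not the paper's TokEncryption scheme. The stand-in tokenizer and the mixing step are assumptions made purely for demonstration, with SHA-256 providing the final one-way step.

    import hashlib

    def toy_tokenize(text):
        # Stand-in tokenizer: map each two-character chunk to a small integer
        # id. A real system would use an NLP tokenizer here (assumption).
        chunks = [text[i:i + 2] for i in range(0, len(text), 2)]
        return [sum(chunk.encode("utf-8")) % 251 for chunk in chunks] or [1]

    def token_salted_hash(text):
        # Mix a token-derived value into each byte position before hashing,
        # so identical characters are transformed differently depending on
        # their token context, then apply a standard one-way hash.
        ids = toy_tokenize(text)
        mixed = bytes((b + ids[i % len(ids)]) % 256
                      for i, b in enumerate(text.encode("utf-8")))
        return hashlib.sha256(mixed).hexdigest()

    print(token_salted_hash("weak password"))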
7

Mullen, Lincoln A., Kenneth Benoit, Os Keyes, Dmitry Selivanov, and Jeffrey Arnold. "Fast, Consistent Tokenization of Natural Language Text." Journal of Open Source Software 3, no. 23 (2018): 655. http://dx.doi.org/10.21105/joss.00655.

8

Bartenyev, Oleg. "Evaluating the Effectiveness of Text Tokenization Methods." Vestnik MEI, no. 6 (December 25, 2023): 144–56. http://dx.doi.org/10.24160/1993-6982-2023-6-144-156.

9

Vijayarani, S., and R. Janani. "Text Mining: Open Source Tokenization Tools – An Analysis." Advanced Computational Intelligence: An International Journal (ACII) 3, no. 1 (2016): 37–47. http://dx.doi.org/10.5121/acii.2016.3104.

10

Hosni Mahmoud, Hanan A., Alaaeldin M. Hafez, and Eatedal Alabdulkreem. "Language-Independent Text Tokenization Using Unsupervised Deep Learning." Intelligent Automation & Soft Computing 35, no. 1 (2023): 321–34. http://dx.doi.org/10.32604/iasc.2023.026235.


Dissertations / Theses on the topic "Text tokenization"

1

Aliwy, Ahmed Hussein. "Arabic Morphosyntactic Raw Text Part of Speech Tagging System." Doctoral thesis, 2013. http://depotuw.ceon.pl/handle/item/241.

Abstract:
We present a comprehensive Arabic tagging system: from the raw text to tagging disambiguation. For each processing step in the tagging system, we analyze the existing solutions (if any) and use one of them or propose, implement and evaluate a new one. This work began with designing a new Arabic tagset suitable for Classical Arabic (CA) and Modern Standard Arabic (MSA). In addition to the classical constructions in tag systems, we introduce interleaving of tags. Interleaving is a relation between tags which, in certain situations, can be attached to the same occurrence of a word, but each of them can also appear alone. Our tagset makes this relation explicit. Then we deal with the preparatory stages of the tagging system. The first stage is tokenization and segmentation. We use rule-based and statistical methods for this task. The second stage is analyzing and extracting the lemma from the word. We have created our own analyzer compatible with our requirements. Its main part is a dictionary which provides features, POS and lemma for each word. The last part of our work is the tagging algorithm, which produces one tag for each word. We use a hybrid method combining rule-based and statistical methods. Three taggers, Hidden Markov Model (HMM), maximum match and Brill, are combined by a new method, which we call master and slaves. A handwritten rule-based tagger is then added to the master-slaves combination. The rule-based tagger eliminates incorrect tags, and the master chooses the best one among the remaining ones, assisted by the other slaves. Our complete system is ready to be used for annotation of Arabic corpora. (No Polish-language summary is available.)
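A very small Python sketch of the "master and slaves" combination idea summarized above: a rule-based filter first discards tags it rules out, then the master picks among the remaining candidates using the slave taggers' votes. The candidate tags, votes, and rule are invented for illustration and are not taken from the thesis.

    from collections import Counter

    def combine(word, candidate_tags, slave_votes, rule_allows):
        # 1. Rule-based elimination of tags judged impossible for this word.
        allowed = {t for t in candidate_tags if rule_allows(word, t)} or candidate_tags
        # 2. The master chooses the allowed tag with the most slave votes.
        votes = Counter(t for t in slave_votes if t in allowed)
        return votes.most_common(1)[0][0] if votes else next(iter(allowed))

    # Hypothetical example: HMM, maximum-match and Brill taggers disagree.
    rule = lambda w, t: not (w == "book" and t == "VB")  # made-up constraint
    print(combine("book", {"NN", "VB"}, ["NN", "VB", "NN"], rule))  # -> 'NN'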

Books on the topic "Text tokenization"

1

Mikheev, Andrei. Text Segmentation. Edited by Ruslan Mitkov. Oxford University Press, 2012. http://dx.doi.org/10.1093/oxfordhb/9780199276349.013.0010.

Abstract:
This article discusses electronic text as essentially just a sequence of characters. Text needs to be segmented at least into linguistic units such as words, punctuation, numbers, alphanumerics, etc. This process is called tokenization. The article notes that most natural language processing techniques require text to be segmented into sentences as well. It briefly reviews some evaluation metrics and standard resources commonly used for text segmentation tasks. It also discusses writing systems that present substantial challenges for computational analysis because tokens are attached directly to each other, as in pictographic and other native writing systems, and outlines various computational approaches to tackling them in different languages. The article focuses on low-level tasks such as tokenization and sentence segmentation.
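As a concrete (and deliberately simplistic) illustration of the two low-level tasks the chapter focuses on, the Python sketch below tokenizes with a single regular expression and splits sentences with a naive rule; real segmenters need many more rules for abbreviations, numbers, and scripts without explicit word delimiters.

    import re

    def tokenize(text):
        # Words, numbers and alphanumerics form one class of token; any other
        # single non-space character (punctuation) becomes its own token.
        return re.findall(r"\w+|[^\w\s]", text)

    def split_sentences(text):
        # Naive rule: split after . ! or ? when followed by whitespace and
        # an uppercase letter. Abbreviations will break this rule.
        return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)

    doc = "Tokenization splits text into units. Sentence splitting is harder."
    print(split_sentences(doc))
    print(tokenize(doc))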

Book chapters on the topic "Text tokenization"

1

Grefenstette, Gregory. "Tokenization." In Text, Speech and Language Technology. Springer Netherlands, 1999. http://dx.doi.org/10.1007/978-94-015-9273-4_9.

2

Hvitfeldt, Emil, and Julia Silge. "Tokenization." In Supervised Machine Learning for Text Analysis in R. Chapman and Hall/CRC, 2021. http://dx.doi.org/10.1201/9781003093459-3.

3

Fares, Murhaf, Stephan Oepen, and Yi Zhang. "Machine Learning for High-Quality Tokenization Replicating Variable Tokenization Schemes." In Computational Linguistics and Intelligent Text Processing. Springer Berlin Heidelberg, 2013. http://dx.doi.org/10.1007/978-3-642-37247-6_19.

4

Domingo, Miguel, Mercedes García-Martínez, Alexandre Helle, Francisco Casacuberta, and Manuel Herranz. "How Much Does Tokenization Affect Neural Machine Translation?" In Computational Linguistics and Intelligent Text Processing. Springer Nature Switzerland, 2023. http://dx.doi.org/10.1007/978-3-031-24337-0_38.

5

Graña, Jorge, Miguel A. Alonso, and Manuel Vilares. "A Common Solution for Tokenization and Part-of-Speech Tagging." In Text, Speech and Dialogue. Springer Berlin Heidelberg, 2002. http://dx.doi.org/10.1007/3-540-46154-x_1.

6

Graña, Jorge, Fco Mario Barcala, and Jesús Vilares. "Formal Methods of Tokenization for Part-of-Speech Tagging." In Computational Linguistics and Intelligent Text Processing. Springer Berlin Heidelberg, 2002. http://dx.doi.org/10.1007/3-540-45715-1_22.

7

Piergiovanni, AJ, Kairo Morton, Weicheng Kuo, Michael S. Ryoo, and Anelia Angelova. "Video Question Answering with Iterative Video-Text Co-tokenization." In Lecture Notes in Computer Science. Springer Nature Switzerland, 2022. http://dx.doi.org/10.1007/978-3-031-20059-5_5.

8

Kamps, Jaap, Sisay Fissaha Adafre, and Maarten de Rijke. "Effective Translation, Tokenization and Combination for Cross-Lingual Retrieval." In Multilingual Information Access for Text, Speech and Images. Springer Berlin Heidelberg, 2005. http://dx.doi.org/10.1007/11519645_12.

9

Tailor, Chetana, and Bankim Patel. "Sentence Tokenization Using Statistical Unsupervised Machine Learning and Rule-Based Approach for Running Text in Gujarati Language." In Advances in Intelligent Systems and Computing. Springer Singapore, 2018. http://dx.doi.org/10.1007/978-981-13-2285-3_38.

10

Nuankaew, Wongpanya S., Ronnachai Thipmontha, Phaisarn Jeefoo, Patchara Nasa-ngium, and Pratya Nuankaew. "Using Text Mining and Tokenization Analysis to Identify Job Performance for Human Resource Management at the University of Phayao." In Recent Challenges in Intelligent Information and Database Systems. Springer Nature Switzerland, 2023. http://dx.doi.org/10.1007/978-3-031-42430-4_47.


Conference papers on the topic "Text tokenization"

1

Kayalı, Nihal Zuhal, and Sevinç İlhan Omurca. "Hybrid Tokenization Strategy for Turkish Abstractive Text Summarization." In 2024 8th International Artificial Intelligence and Data Processing Symposium (IDAP). IEEE, 2024. http://dx.doi.org/10.1109/idap64064.2024.10711036.

2

Goldman, Omer, Avi Caciularu, Matan Eyal, Kris Cao, Idan Szpektor, and Reut Tsarfaty. "Unpacking Tokenization: Evaluating Text Compression and its Correlation with Model Performance." In Findings of the Association for Computational Linguistics ACL 2024. Association for Computational Linguistics, 2024. http://dx.doi.org/10.18653/v1/2024.findings-acl.134.

3

Hassler, M., and G. Fliedl. "Text preparation through extended tokenization." In DATA MINING AND MIS 2006. WIT Press, 2006. http://dx.doi.org/10.2495/data060021.

4

Prakrankamanant, Patawee, and Ekapol Chuangsuwanich. "Tokenization-based data augmentation for text classification." In 2022 19th International Joint Conference on Computer Science and Software Engineering (JCSSE). IEEE, 2022. http://dx.doi.org/10.1109/jcsse54890.2022.9836268.

5

Cruz Diaz, Noa P., and Manuel Maña López. "An Analysis of Biomedical Tokenization: Problems and Strategies." In Proceedings of the Sixth International Workshop on Health Text Mining and Information Analysis. Association for Computational Linguistics, 2015. http://dx.doi.org/10.18653/v1/w15-2605.

6

Hiraoka, Tatsuya, Hiroyuki Shindo, and Yuji Matsumoto. "Stochastic Tokenization with a Language Model for Neural Text Classification." In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2019. http://dx.doi.org/10.18653/v1/p19-1158.

7

Islam, Tanzirul, Mofazzal Hossain, and MD Fahim Arefin. "Comparative Analysis of Different Text Summarization Techniques Using Enhanced Tokenization." In 2021 3rd International Conference on Sustainable Technologies for Industry 4.0 (STI). IEEE, 2021. http://dx.doi.org/10.1109/sti53101.2021.9732589.

8

Posokhov, P. A., S. S. Skrylnikov, and O. V. Makhnytkina. "Artificial text detection in Russian language: a BERT-based Approach." In Dialogue. RSUH, 2022. http://dx.doi.org/10.28995/2075-7182-2022-21-470-476.

Abstract:
This paper describes our solution for the RuATD (Russian Artificial Text Detection) competition held within the Dialogue 2022 conference. Our approach is based on the idea of transfer learning, using the pre-trained RuRoBERTa, RuBERT, RuGPT3, and RuGPT2 models. The final solution included byte-level Byte-Pair Encoding tokenization and a fine-tuned RuRoBERTa model. The system achieved an accuracy of 0.65 and took first place in the multiclass classification task.
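For orientation, here is a minimal Python sketch of the kind of pipeline the abstract describes: a byte-level BPE tokenizer feeding a RuRoBERTa-style sequence classifier via the Hugging Face transformers library. The checkpoint name and the two-label head are assumptions made for illustration, not details taken from the paper, and the fine-tuning loop itself is omitted.

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    MODEL_NAME = "ai-forever/ruRoberta-large"  # assumed checkpoint identifier

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)  # byte-level BPE
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME, num_labels=2)  # e.g. human-written vs. machine-generated

    inputs = tokenizer("Пример текста для проверки.",  # "Example text to check."
                       return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    print(logits.softmax(dim=-1))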
9

Horsmann, Tobias, and Torsten Zesch. "LTL-UDE $@$ EmpiriST 2015: Tokenization and PoS Tagging of Social Media Text." In Proceedings of the 10th Web as Corpus Workshop. Association for Computational Linguistics, 2016. http://dx.doi.org/10.18653/v1/w16-2615.

10

Huang, Zien. "An Ensemble LLM Framework of Text Recognition Based on BERT and BPE Tokenization." In 2024 5th International Seminar on Artificial Intelligence, Networking and Information Technology (AINIT). IEEE, 2024. http://dx.doi.org/10.1109/ainit61980.2024.10581466.
