
Journal articles on the topic 'Text tokenization'



Consult the top 50 journal articles for your research on the topic 'Text tokenization.'

Next to every source in the list of references there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse journal articles across a wide variety of disciplines and organise your bibliography correctly.

1

Sekhar, Sowmik. "Tokenization for Text Analysis." International Journal of Scientific Research and Engineering Trends 10, no. 1 (2024): 149–52. http://dx.doi.org/10.61137/ijsret.vol.10.issue1.127.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

Oparauwah, Nnaemeka M., Juliet N. Odii, Ikechukwu I. Ayogu, and Vitalis C. Iwuchukwu. "A boundary-based tokenization technique for extractive text summarization." World Journal of Advanced Research and Reviews 11, no. 2 (2021): 303–12. https://doi.org/10.5281/zenodo.5336977.

Full text
Abstract:
The need to extract and manage vital information contained in copious volumes of text documents has given birth to several automatic text summarization (ATS) approaches. ATS has found application in academic research, medical health record analysis, content creation and search engine optimization, finance, and media. This study presents a boundary-based tokenization method for extractive text summarization. The proposed method performs word tokenization by defining word boundaries in place of specific delimiters. An extractive summarization algorithm was further developed based on the proposed boundary-based tokenization method, with word length taken into account to control redundancy in the summary output. Experimental results showed that the proposed approach enhanced word tokenization by improving the selection of appropriate keywords from the text document to be used for summarization.
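To make the idea concrete, the sketch below illustrates boundary-based word tokenization in Python with a regular-expression word-boundary pattern and a simple word-length filter. It is an assumed illustration of the general approach, not the boundary rules defined in the paper.

```python
import re

def boundary_tokenize(text: str) -> list[str]:
    # Identify tokens by where word boundaries occur (runs of word characters,
    # optionally joined by hyphens/apostrophes), rather than by splitting on a
    # fixed list of delimiter characters.
    return re.findall(r"\b\w+(?:[-']\w+)*\b", text)

def summary_keywords(tokens: list[str], min_len: int = 4) -> list[str]:
    # Crude keyword filter: keep longer, unseen tokens, mirroring the
    # word-length consideration the abstract mentions for controlling redundancy.
    seen, keywords = set(), []
    for tok in tokens:
        low = tok.lower()
        if len(low) >= min_len and low not in seen:
            seen.add(low)
            keywords.append(tok)
    return keywords

if __name__ == "__main__":
    text = "Automatic text summarization extracts the most informative sentences."
    toks = boundary_tokenize(text)
    print(toks)
    print(summary_keywords(toks))
```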
APA, Harvard, Vancouver, ISO, and other styles
3

Oparauwah, Nnaemeka M., Juliet N. Odii, Ikechukwu I. Ayogu, and Vitalis C. Iwuchukwu. "A boundary-based tokenization technique for extractive text summarization." World Journal of Advanced Research and Reviews 11, no. 2 (2021): 303–12. http://dx.doi.org/10.30574/wjarr.2021.11.2.0351.

Full text
Abstract:
The need to extract and manage vital information contained in copious volumes of text documents has given birth to several automatic text summarization (ATS) approaches. ATS has found application in academic research, medical health record analysis, content creation and search engine optimization, finance, and media. This study presents a boundary-based tokenization method for extractive text summarization. The proposed method performs word tokenization by defining word boundaries in place of specific delimiters. An extractive summarization algorithm was further developed based on the proposed boundary-based tokenization method, with word length taken into account to control redundancy in the summary output. Experimental results showed that the proposed approach enhanced word tokenization by improving the selection of appropriate keywords from the text document to be used for summarization.
APA, Harvard, Vancouver, ISO, and other styles
4

Nazir, Shahzad, Muhammad Asif, Mariam Rehman, and Shahbaz Ahmad. "Machine learning based framework for fine-grained word segmentation and enhanced text normalization for low resourced language." PeerJ Computer Science 10 (January 31, 2024): e1704. http://dx.doi.org/10.7717/peerj-cs.1704.

Full text
Abstract:
In text applications, pre-processing is deemed a significant factor in enhancing the outcomes of natural language processing (NLP) tasks. Text normalization and tokenization are two pivotal pre-processing procedures whose importance cannot be overstated. Text normalization refers to transforming raw text into scripturally standardized text, while word tokenization splits the text into tokens or words. Well-defined normalization and tokenization approaches exist for most spoken languages in the world. However, the world’s 10th most widely spoken language has been overlooked by the research community. This research presents improved text normalization and tokenization techniques for the Urdu language. For Urdu text normalization, multiple regular expressions and rules are proposed, including removing diacritics, normalizing single characters, separating digits, etc. For word tokenization, core features are defined and extracted for each character of text, and a machine learning model combined with specified handcrafted rules is used to predict spaces and tokenize the text. The experiments are performed on the largest human-annotated dataset composed in Urdu script, created for this study and covering five different domains. The results have been evaluated using precision, recall, F-measure, and accuracy, and compared with the state of the art. The normalization approach produced a 20% improvement and the tokenization approach a 6% improvement.
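The abstract lists rule-based normalization steps (diacritic removal, single-character normalization, digit separation). The sketch below is a minimal, assumed illustration of such regex rules for Arabic-script text; the Unicode ranges and character mappings are illustrative and not taken from the paper.

```python
import re

# Illustrative Urdu text normalization rules (assumed, not the paper's exact set):
# strip Arabic-script diacritics, unify common character variants, and put
# spaces around digit runs so numbers become separate tokens.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0653-\u0655\u0670]")  # harakat, hamza marks, superscript alef
CHAR_MAP = {
    "\u064A": "\u06CC",  # Arabic yeh  -> Urdu yeh  (assumed normalization)
    "\u0643": "\u06A9",  # Arabic kaf  -> Urdu keheh (assumed normalization)
}

def normalize_urdu(text: str) -> str:
    text = DIACRITICS.sub("", text)
    for src, dst in CHAR_MAP.items():
        text = text.replace(src, dst)
    # Separate ASCII, Arabic-Indic, and Extended Arabic-Indic digit runs.
    text = re.sub(r"([0-9\u0660-\u0669\u06F0-\u06F9]+)", r" \1 ", text)
    return re.sub(r"\s+", " ", text).strip()
```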
APA, Harvard, Vancouver, ISO, and other styles
5

BAR-HAIM, ROY, KHALIL SIMA'AN, and YOAD WINTER. "Part-of-speech tagging of Modern Hebrew text." Natural Language Engineering 14, no. 2 (2008): 223–51. http://dx.doi.org/10.1017/s135132490700455x.

Full text
Abstract:
Words in Semitic texts often consist of a concatenation of word segments, each corresponding to a part-of-speech (POS) category. Semitic words may be ambiguous with regard to their segmentation as well as to the POS tags assigned to each segment. When designing POS taggers for Semitic languages, a major architectural decision concerns the choice of the atomic input tokens (terminal symbols). If the tokenization is at the word level, the output tags must be complex, and represent both the segmentation of the word and the POS tag assigned to each word segment. If the tokenization is at the segment level, the input itself must encode the different alternative segmentations of the words, while the output consists of standard POS tags. Comparing these two alternatives is not trivial, as the choice between them may have global effects on the grammatical model. Moreover, intermediate levels of tokenization between these two extremes are conceivable, and, as we aim to show, beneficial. To the best of our knowledge, the problem of tokenization for POS tagging of Semitic languages has not been addressed before in full generality. In this paper, we study this problem for the purpose of POS tagging of Modern Hebrew texts. After extensive error analysis of the two simple tokenization models, we propose a novel, linguistically motivated, intermediate tokenization model that gives better performance for Hebrew over the two initial architectures. Our study is based on the well-known hidden Markov models (HMMs). We start out from a manually devised morphological analyzer and a very small annotated corpus, and describe how to adapt an HMM-based POS tagger for both tokenization architectures. We present an effective technique for smoothing the lexical probabilities using an untagged corpus, and a novel transformation for casting the segment-level tagger in terms of a standard, word-level HMM implementation. The results obtained using our model are on par with the best published results on Modern Standard Arabic, despite the much smaller annotated corpus available for Modern Hebrew.
APA, Harvard, Vancouver, ISO, and other styles
6

Vadlapati, Praneeth. "TokEncryption: Enhanced Hashing of Text using Tokenization." INTERANTIONAL JOURNAL OF SCIENTIFIC RESEARCH IN ENGINEERING AND MANAGEMENT 08, no. 12 (2024): 1–8. https://doi.org/10.55041/ijsrem20280.

Full text
Abstract:
In the current digital space, the security of sensitive information, such as passwords and private data, is of high importance. Traditional hashing methods might not adequately address data privacy concerns or vulnerabilities created due to weak passwords. Tokenization methods are utilized in natural language processing (NLP). This paper introduces a method called “TokEncryption” to utilize tokens for one-way encryption of text called hashing. A tokenizer is used to generate tokens for an input text, which are utilized to encrypt the text to create secure encrypted text. Different characters of the text are encrypted using distinct tokens to ensure a variation in encryption patterns throughout the resultant text. The process enhances data security and privacy using an unconventional approach that makes it secure from attackers attempting to reconstruct the text. The system can be used in addition to the existing encryption approaches. The results show a successful encryption of text. Weak passwords are successfully encrypted to create strong passwords with multiple types of characters, including alphabets, numbers, and special characters. The code is available at github.com/Pro-GenAI/TokEncryption. Keywords: data privacy, data security, data encryption, natural language processing (NLP), tokenization, cryptography, information security
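The paper's exact TokEncryption scheme is not reproduced here; the sketch below only illustrates the general idea of folding a different token value in with each character of a one-way hash, using a hypothetical toy tokenizer in place of the pretrained NLP tokenizer the abstract describes.

```python
import hashlib
import itertools

def toy_token_ids(text: str) -> list[int]:
    # Hypothetical stand-in for an NLP tokenizer: map each whitespace-separated
    # word to a deterministic 32-bit ID. The paper uses a pretrained tokenizer.
    return [int.from_bytes(hashlib.sha256(w.encode()).digest()[:4], "big")
            for w in text.split()] or [0]

def token_mixed_hash(text: str) -> str:
    # One-way digest that mixes a distinct token ID in with each character,
    # so the tokenization itself becomes part of what an attacker must know.
    h = hashlib.sha256()
    for ch, tok in zip(text, itertools.cycle(toy_token_ids(text))):
        h.update(ch.encode("utf-8"))
        h.update(tok.to_bytes(4, "big"))
    return h.hexdigest()

print(token_mixed_hash("correct horse battery staple"))
```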
APA, Harvard, Vancouver, ISO, and other styles
7

A. Mullen, Lincoln, Kenneth Benoit, Os Keyes, Dmitry Selivanov, and Jeffrey Arnold. "Fast, Consistent Tokenization of Natural Language Text." Journal of Open Source Software 3, no. 23 (2018): 655. http://dx.doi.org/10.21105/joss.00655.

Full text
APA, Harvard, Vancouver, ISO, and other styles
8

Bartenyev, Oleg. "Evaluating the Effectiveness of Text Tokenization Methods." Vestnik MEI, no. 6 (December 25, 2023): 144–56. http://dx.doi.org/10.24160/1993-6982-2023-6-144-156.

Full text
APA, Harvard, Vancouver, ISO, and other styles
9

S, Vijayarani, and Janani R. "Text Mining: open Source Tokenization Tools – An Analysis." Advanced Computational Intelligence: An International Journal (ACII) 3, no. 1 (2016): 37–47. http://dx.doi.org/10.5121/acii.2016.3104.

Full text
APA, Harvard, Vancouver, ISO, and other styles
10

A. Hosni Mahmoud, Hanan, Alaaeldin M. Hafez, and Eatedal Alabdulkreem. "Language-Independent Text Tokenization Using Unsupervised Deep Learning." Intelligent Automation & Soft Computing 35, no. 1 (2023): 321–34. http://dx.doi.org/10.32604/iasc.2023.026235.

Full text
APA, Harvard, Vancouver, ISO, and other styles
11

Sodhar, Irum Naz, Akhtar Hussain Jalbani, Abdul Hafeez Buller, and Anam Naz Sodhar. "Tokenization of Sindhi Text on Information Retrieval Tool." PJEST 1, no. 1 (2021): 7. https://doi.org/10.5281/zenodo.4774104.

Full text
APA, Harvard, Vancouver, ISO, and other styles
12

Vadlapati, Praneeth. "Tokenization Beyond NLP: Potential Applications in Data Analytics, Cybersecurity, and Beyond." INTERANTIONAL JOURNAL OF SCIENTIFIC RESEARCH IN ENGINEERING AND MANAGEMENT 08, no. 12 (2024): 1–7. https://doi.org/10.55041/ijsrem9532.

Full text
Abstract:
Tokenization is the process of segmenting redundant patterns of input data, such as text, into tokens that are suitable for model training and computational analysis. Tokenization plays a foundational role in Natural Language Processing (NLP). Additionally, tokenization methods exhibit significant potential in domains outside of NLP, where combining redundant patterns in data can enhance the efficiency, scalability, analytical capabilities, and accuracy of predictions. This paper explores the potential applications of tokenization in fields beyond NLP in multiple areas, including but not limited to bioinformatics, cybersecurity, and healthcare. These applications demonstrate the ability of tokenization to simplify complex data patterns, thereby enhancing predictive accuracy. By leveraging the pattern recognition strengths of tokenization, multiple domains could receive benefits from efficient data processing and pattern recognition, which indicates a promising future for custom tokenization techniques across disciplines. Keywords: Pattern Recognition, tokenization, tokens, Machine Learning (ML), Natural Language Processing (NLP)
APA, Harvard, Vancouver, ISO, and other styles
13

Cho, Danbi, Hyunyoung Lee, and Seungshik Kang. "An Empirical Study of Korean Sentence Representation with Various Tokenizations." Electronics 10, no. 7 (2021): 845. http://dx.doi.org/10.3390/electronics10070845.

Full text
Abstract:
It is important how the token unit is defined in a sentence for natural language processing tasks such as text classification, machine translation, and generation. Many recent studies utilized subword tokenization in language models such as BERT, KoBERT, and ALBERT. Although these language models achieved state-of-the-art results in various NLP tasks, it is not clear whether subword tokenization is the best token unit for Korean sentence embedding. Thus, we carried out sentence embedding based on word, morpheme, subword, and submorpheme units, respectively, on Korean sentiment analysis. We explored two sentence representation methods for sentence embedding: one considering the order of tokens in a sentence and one not considering the order. While inputting a sentence decomposed by each token unit to the two sentence representation methods, we construct the sentence embedding with various tokenizations to find the most effective token unit for Korean sentence embedding. In our work, we confirmed the robustness of the subword unit for out-of-vocabulary (OOV) problems compared to other token units, the disadvantage of replacing whitespace with a particular symbol in the sentiment analysis task, and that the optimal vocabulary size is 16K in subword and submorpheme tokenization. We empirically noticed that the subword unit, tokenized with a vocabulary size of 16K without replacement of whitespace, was the most effective for sentence embedding on the Korean sentiment analysis task.
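A common way to build a 16K-entry subword vocabulary of the kind the abstract finds optimal is SentencePiece; the sketch below assumes a hypothetical corpus file and is not the authors' training setup.

```python
import sentencepiece as spm

# Train a 16K-vocabulary subword model, the size the abstract found optimal.
# 'korean_corpus.txt' is a hypothetical one-sentence-per-line training file.
spm.SentencePieceTrainer.train(
    input="korean_corpus.txt",
    model_prefix="ko_subword",
    vocab_size=16000,
    model_type="bpe",           # subword units; swap to "unigram" to compare
    character_coverage=0.9995,  # typical setting for non-Latin scripts
)

sp = spm.SentencePieceProcessor(model_file="ko_subword.model")
print(sp.encode("자연어 처리는 재미있다", out_type=str))  # subword pieces for a sample sentence
```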
APA, Harvard, Vancouver, ISO, and other styles
14

Vadlapati, Praneeth. "TokenizedDB: Text Tokenization using NLP for Enhanced Storage Efficiency and Data Privacy." INTERANTIONAL JOURNAL OF SCIENTIFIC RESEARCH IN ENGINEERING AND MANAGEMENT 08, no. 12 (2024): 1–6. https://doi.org/10.55041/ijsrem11413.

Full text
Abstract:
SQL databases are widely utilized for data storage across various domains. However, storing text values in SQL databases requires a considerable amount of storage, even if numerous parts of the text are redundant. Traditional compression methods, while helpful, are limited in optimizing storage within databases. Tokenizers used for natural language processing (NLP) can convert text to tokens. This paper proposes a method called TokenizedDB to utilize tokenization to store text as tokens. This approach leads to benefits such as a reduction of storage space. Storage of text as tokens leads to increased data privacy in the event of database breaches since the attacker would be unaware of the text being stored as tokens and the method used for the tokenization process. The attackers fail to understand the method of text storage and fail to reconstruct the original text without being aware of the methods used for the tokenization. The system can be implemented in addition to the existing encryption methods to ensure data privacy. The results display up to 66% reduction in text storage requirements in some cases. The system proved its potential to enhance storage space efficiency to reduce storage costs while strengthening security protocols for text. The code is available at github.com/Pro-GenAI/TokenizedDB. Keywords: natural language processing (NLP), information retrieval, SQL databases, data storage optimization, data privacy, data security, storage efficiency
APA, Harvard, Vancouver, ISO, and other styles
15

McNamee, Paul, and James Mayfield. "Character N-Gram Tokenization for European Language Text Retrieval." Information Retrieval 7, no. 1/2 (2004): 73–97. http://dx.doi.org/10.1023/b:inrt.0000009441.78971.be.

Full text
APA, Harvard, Vancouver, ISO, and other styles
16

Zalmout, Nasser, and Nizar Habash. "Optimizing Tokenization Choice for Machine Translation across Multiple Target Languages." Prague Bulletin of Mathematical Linguistics 108, no. 1 (2017): 257–69. http://dx.doi.org/10.1515/pralin-2017-0025.

Full text
Abstract:
Tokenization is very helpful for Statistical Machine Translation (SMT), especially when translating from morphologically rich languages. Typically, a single tokenization scheme is applied to the entire source-language text, regardless of the target language. In this paper, we evaluate the hypothesis that SMT performance may benefit from different tokenization schemes for different words within the same text, and also for different target languages. We apply this approach to Arabic as a source language, with five target languages of varying morphological complexity: English, French, Spanish, Russian and Chinese. Our results show that different target languages indeed require different source-language schemes; and a context-variable tokenization scheme can outperform a context-constant scheme with a statistically significant performance enhancement of about 1.4 BLEU points.
APA, Harvard, Vancouver, ISO, and other styles
17

Si, Chenglei, Zhengyan Zhang, Yingfa Chen, et al. "Sub-Character Tokenization for Chinese Pretrained Language Models." Transactions of the Association for Computational Linguistics 11 (May 18, 2023): 469–87. http://dx.doi.org/10.1162/tacl_a_00560.

Full text
Abstract:
Tokenization is fundamental to pretrained language models (PLMs). Existing tokenization methods for Chinese PLMs typically treat each character as an indivisible token. However, they ignore the unique feature of the Chinese writing system where additional linguistic information exists below the character level, i.e., at the sub-character level. To utilize such information, we propose sub-character (SubChar for short) tokenization. Specifically, we first encode the input text by converting each Chinese character into a short sequence based on its glyph or pronunciation, and then construct the vocabulary based on the encoded text with sub-word segmentation. Experimental results show that SubChar tokenizers have two main advantages over existing tokenizers: 1) They can tokenize inputs into much shorter sequences, thus improving the computational efficiency. 2) Pronunciation-based SubChar tokenizers can encode Chinese homophones into the same transliteration sequences and produce the same tokenization output, hence being robust to homophone typos. At the same time, models trained with SubChar tokenizers perform competitively on downstream tasks. We release our code and models at https://github.com/thunlp/SubCharTokenization to facilitate future work.
APA, Harvard, Vancouver, ISO, and other styles
18

Li, Angela W., and Konstantinos Mamouras. "Efficient Algorithms for the Uniform Tokenization Problem." Proceedings of the ACM on Programming Languages 9, OOPSLA1 (2025): 1492–518. https://doi.org/10.1145/3720498.

Full text
Abstract:
Tokenization (also known as scanning or lexing) is a computational task that has applications in the lexical analysis of programs during compilation and in data extraction and analysis for unstructured or semi-structured data (e.g., data represented using the JSON and CSV data formats). We propose two algorithms for the tokenization problem that have linear time complexity (in the length of the input text) without using large amounts of memory. We also show that an optimized version of one of these algorithms performs well compared to prior approaches on practical tokenization workloads.
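The paper's linear-time algorithms are not reproduced here; the sketch below shows a conventional single-pass lexer over a prioritized rule list (first match wins over non-overlapping rules), which is the kind of tokenization workload the paper targets.

```python
import re

# A toy lexer in the spirit of lexical analysis: named token rules tried at each
# position via one compiled alternation. This is an illustration of the problem
# setting, not the paper's algorithm.
TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("IDENT",  r"[A-Za-z_]\w*"),
    ("STRING", r'"[^"\\]*(?:\\.[^"\\]*)*"'),
    ("PUNCT",  r"[{}\[\],:]"),
    ("SKIP",   r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text: str):
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"no rule matches at position {pos}")
        if m.lastgroup != "SKIP":
            yield m.lastgroup, m.group()
        pos = m.end()  # single forward pass: never re-scans earlier input

print(list(tokenize('{"items": [1, 2.5, "three"]}')))
```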
APA, Harvard, Vancouver, ISO, and other styles
19

Vadlapati, Praneeth. "TokChat: Tokenization of Text for Secured Peer-to-Peer Communication." INTERANTIONAL JOURNAL OF SCIENTIFIC RESEARCH IN ENGINEERING AND MANAGEMENT 08, no. 12 (2024): 1–6. https://doi.org/10.55041/ijsrem10800.

Full text
Abstract:
This paper presents TokChat, a system that utilizes a new approach designed for enhanced security in peer-to-peer communication using text messages. To improve the security of messages in a conversation, the system converts a message into tokens by leveraging existing pre-trained tokenizers that are commonly used across numerous natural language processing (NLP) tasks. The tokens represent a compact numerical form of text that captures the original message. A new private key is generated for every new conversation to ensure the confidentiality of the messages in the conversation. The tokens are encrypted using an existing AES-based cryptographic encryption approach, ensuring they can be securely transmitted to the recipient. The receiver receives the encrypted tokens, which are decrypted using the conversation-specific key. The decrypted tokens are converted to readable text using the tokenizer. The usage of the system ensures that the messages fail to be converted into readable text by successful attackers, even with the usage of the private conversation-specific key, without being aware of the tokenization process. The system offers a higher security standard for secure messaging in the use cases in which the confidentiality of the messages is crucial. The system has been tested using multiple text values and successfully tokenized the text, encrypted the tokens, decrypted the transmitted ciphertext, and reproduced the original text. The code is available at github.com/Pro-GenAI/TokChat. Keywords: natural language processing (NLP), tokenization, cryptographic messaging, secure communication
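The sketch below illustrates the tokenize-then-encrypt flow described in the abstract, using the Hugging Face GPT-2 tokenizer and Fernet (an AES-based scheme) from the cryptography package. The choice of tokenizer and cipher wrapper is an assumption for illustration, not the paper's implementation.

```python
import json
from cryptography.fernet import Fernet          # AES-128-CBC + HMAC under the hood
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
conversation_key = Fernet.generate_key()        # fresh key per conversation
fernet = Fernet(conversation_key)

def send(message: str) -> bytes:
    token_ids = tokenizer.encode(message)                   # text -> token IDs
    return fernet.encrypt(json.dumps(token_ids).encode())   # encrypt the ID list

def receive(ciphertext: bytes) -> str:
    token_ids = json.loads(fernet.decrypt(ciphertext))
    return tokenizer.decode(token_ids)                      # token IDs -> text

ct = send("Meet at 10am, usual place.")
assert receive(ct) == "Meet at 10am, usual place."
```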
APA, Harvard, Vancouver, ISO, and other styles
20

Rehman, Zobia, Waqas Anwar, Usama Ijaz Bajwa, Wang Xuan, and Zhou Chaoying. "Morpheme Matching Based Text Tokenization for a Scarce Resourced Language." PLoS ONE 8, no. 8 (2013): e68178. http://dx.doi.org/10.1371/journal.pone.0068178.

Full text
APA, Harvard, Vancouver, ISO, and other styles
21

Adepu Rajesh and Tryambak Hiwarkar. "Exploring Preprocessing Techniques for Natural Language Text: A Comprehensive Study Using Python Code." International Journal of Engineering Technology and Management Sciences 7, no. 5 (2023): 390–99. http://dx.doi.org/10.46647/ijetms.2023.v07i05.047.

Full text
Abstract:
The paper highlights the significance of efficient text preprocessing strategies in Natural Language Processing (NLP), a field focused on enabling machines to understand and interpret human language. Text preprocessing is a crucial step in converting unstructured text into a machine-understandable format. It plays a vital role in various text classification tasks, including web search, document classification, chatbots, and virtual assistants. Techniques such as tokenization, stop word removal, and lemmatization are carefully studied and applied in a specific order to ensure accurate and efficient information retrieval. The paper emphasizes the importance of selecting and ordering preprocessing techniques wisely to achieve high-quality results. Effective text preprocessing involves cleaning and filtering textual data to eliminate noise and enhance efficiency. Furthermore, it provides insights into the impact of different techniques, such as raw text, tokenization, stop word removal, and stemming, using a Python implementation.
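A minimal NLTK pipeline illustrating the tokenization, stop-word removal, and lemmatization steps discussed above might look as follows (resource names are NLTK's; newer NLTK releases may additionally require the punkt_tab resource):

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time downloads of the required NLTK resources.
for pkg in ("punkt", "stopwords", "wordnet"):
    nltk.download(pkg, quiet=True)

def preprocess(text: str) -> list[str]:
    tokens = word_tokenize(text.lower())                   # 1) tokenization
    stops = set(stopwords.words("english"))
    tokens = [t for t in tokens if t.isalpha() and t not in stops]  # 2) stop-word removal
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(t) for t in tokens]       # 3) lemmatization

print(preprocess("The cats were chasing the mice across the gardens."))
```

The ordering mirrors the point made in the abstract: tokenize first, filter noise next, and only then normalize the surviving tokens.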
APA, Harvard, Vancouver, ISO, and other styles
22

Schwarz, Carlo. "Estimating text regressions using txtreg_train." Stata Journal: Promoting communications on statistics and Stata 23, no. 3 (2023): 799–812. http://dx.doi.org/10.1177/1536867x231196349.

Full text
Abstract:
In this article, I introduce new commands to estimate text regressions for continuous, binary, and categorical variables based on text strings. The command txtreg_train automatically handles text cleaning, tokenization, model training, and cross-validation for lasso, ridge, elastic-net, and regularized logistic regressions. The txtreg_predict command obtains the predictions from the trained text regression model. Furthermore, the txtreg_analyze command facilitates the analysis of the coefficients of the text regression model. Together, these commands provide a convenient toolbox for researchers to train text regressions. They also allow sharing of pretrained text regression models with other researchers.
APA, Harvard, Vancouver, ISO, and other styles
23

Alkesaiberi, Abdulelah, Ali Alkhathlan, and Ahmed Abdelali. "Ara--CANINE: Character-Based Pre-Trained Language Model for Arabic Language Understanding." International Journal on Cybernetics & Informatics 13, no. 2 (2024): 45–59. http://dx.doi.org/10.5121/ijci.2024.130204.

Full text
Abstract:
Recent advancements in the field of natural language processing have markedly enhanced the capability of machines to comprehend human language. However, as language models progress, they require continuous architectural enhancements and different approaches to text processing. One significant challenge stems from the rich diversity of languages, each characterized by its distinctive grammar, resulting in decreased accuracy of language models for specific languages, especially for low-resource languages. This limitation is exacerbated by the reliance of existing NLP models on rigid tokenization methods, rendering them susceptible to issues with previously unseen or infrequent words. Additionally, models based on word and subword tokenization are vulnerable to minor typographical errors, whether they occur naturally or result from adversarial misspellings. To address these challenges, this paper presents the utilization of a recently proposed free-tokenization method, CANINE, to enhance the comprehension of natural language. Specifically, we employ this method to develop a tokenization-free Arabic language model. In this research, we precisely evaluate our model’s performance across a range of eight tasks using the Arabic Language Understanding Evaluation (ALUE) benchmark. Furthermore, we conduct a comparative analysis, pitting our free-tokenization model against existing Arabic language models that rely on sub-word tokenization. By making our pre-training and fine-tuning models accessible to the Arabic NLP community, we aim to facilitate the replication of our experiments and contribute to the advancement of Arabic language processing capabilities. To further support reproducibility and open-source collaboration, the complete source code and model checkpoints will be made publicly available on our Hugging Face page. In conclusion, the results of our study demonstrate that the free-tokenization approach exhibits comparable performance to established Arabic language models that utilize sub-word tokenization techniques. Notably, in certain tasks, our model surpasses the performance of some of these existing models. This evidence underscores the efficacy of free-tokenization in processing the Arabic language, particularly in specific linguistic contexts.
APA, Harvard, Vancouver, ISO, and other styles
24

Qin, Honglun, Meiwen Li, Lin Wang, Youming Ge, Junlong Zhu, and Ruijuan Zheng. "A Radical-Based Token Representation Method for Enhancing Chinese Pre-Trained Language Models." Electronics 14, no. 5 (2025): 1031. https://doi.org/10.3390/electronics14051031.

Full text
Abstract:
In the domain of natural language processing (NLP), a primary challenge pertains to Chinese tokenization, which remains difficult due to the lack of explicit word boundaries in written Chinese. The existing tokenization methods often treat each Chinese character as an indivisible unit, neglecting the finer semantic features embedded in the characters, such as radicals. To tackle this issue, we propose a novel token representation method that integrates radical-based features into the process. The proposed method extends the vocabulary to include both radicals and original character tokens, enabling a more granular understanding of Chinese text. We also conduct experiments on seven datasets covering multiple Chinese natural language processing tasks. The results show that our method significantly improves model performance on downstream tasks. Specifically, the accuracy of BERT on the BQ Corpus dataset was enhanced to 86.95%, showing an improvement of 1.65% over the baseline. Additionally, the BERT-wwm performance demonstrated a 1.28% enhancement, suggesting that the incorporation of fine-grained radical features offers a more efficacious solution for Chinese tokenization and paves the way for future research in Chinese text processing.
APA, Harvard, Vancouver, ISO, and other styles
25

Akkasi, Abbas, Ekrem Varoğlu, and Nazife Dimililer. "ChemTok: A New Rule Based Tokenizer for Chemical Named Entity Recognition." BioMed Research International 2016 (2016): 1–9. http://dx.doi.org/10.1155/2016/4248026.

Full text
Abstract:
Named Entity Recognition (NER) from text constitutes the first step in many text mining applications. The most important preliminary step for NER systems using machine learning approaches is tokenization where raw text is segmented into tokens. This study proposes an enhanced rule based tokenizer, ChemTok, which utilizes rules extracted mainly from the train data set. The main novelty of ChemTok is the use of the extracted rules in order to merge the tokens split in the previous steps, thus producing longer and more discriminative tokens. ChemTok is compared to the tokenization methods utilized by ChemSpot and tmChem. Support Vector Machines and Conditional Random Fields are employed as the learning algorithms. The experimental results show that the classifiers trained on the output of ChemTok outperforms all classifiers trained on the output of the other two tokenizers in terms of classification performance, and the number of incorrectly segmented entities.
APA, Harvard, Vancouver, ISO, and other styles
26

N. Venkatesan and N. Arulanand. "Implications of Tokenizers in BERT Model for Low-Resource Indian Language." Journal of Soft Computing Paradigm 4, no. 4 (December 2022): 264–71. http://dx.doi.org/10.36548/jscp.2022.4.005.

Full text
Abstract:
For any deep learning language model, the initial tokens are prepared as part of the text preparation process known as tokenization. De facto standard models like BERT and GPT utilize WordPiece and Byte Pair Encoding (BPE), respectively. Tokenization may have a distinct impact on models for low-resource languages, such as the south Indian Dravidian languages, where many words may be produced by adding prefixes and suffixes. In this paper, four tokenizers are compared at various granularity levels, i.e., their outputs range from the tiniest individual letters to words in their most basic form. Using the BERT pretraining process on Tamil text, these tokenizers as well as the language models are trained. The model is then fine-tuned, with numerous parameters adjusted, for improved performance on a subsequent Tamil text categorization task. A custom-built tokenizer for Tamil text is created and trained with BPE, WordPiece vocabulary, Unigram, and WordLevel mechanisms, and the compared results are presented after the downstream task of Tamil text categorization is performed using the BERT algorithm.
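The four tokenizer types named in the abstract can all be trained with the Hugging Face tokenizers library; the sketch below assumes a hypothetical Tamil corpus file and an illustrative vocabulary size, not the paper's configuration.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE, WordPiece, Unigram, WordLevel
from tokenizers.trainers import BpeTrainer, WordPieceTrainer, UnigramTrainer, WordLevelTrainer
from tokenizers.pre_tokenizers import Whitespace

# Hypothetical corpus file of Tamil text, one sentence per line.
files = ["tamil_corpus.txt"]

configs = [
    (BPE(unk_token="[UNK]"),       BpeTrainer(vocab_size=30000, special_tokens=["[UNK]", "[PAD]"])),
    (WordPiece(unk_token="[UNK]"), WordPieceTrainer(vocab_size=30000, special_tokens=["[UNK]", "[PAD]"])),
    (Unigram(),                    UnigramTrainer(vocab_size=30000, special_tokens=["[UNK]", "[PAD]"])),
    (WordLevel(unk_token="[UNK]"), WordLevelTrainer(special_tokens=["[UNK]", "[PAD]"])),
]

for model, trainer in configs:
    tok = Tokenizer(model)
    tok.pre_tokenizer = Whitespace()   # split on whitespace/punctuation before subword training
    tok.train(files, trainer)
    print(type(model).__name__, tok.get_vocab_size())
```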
APA, Harvard, Vancouver, ISO, and other styles
27

Ibrahim Abdelfattah Almajali. "Comprehensive Analysis of Arabic Tokenization System Preprocessing using the Matching Model." Journal of Information Systems Engineering and Management 10, no. 4 (2025): 210–16. https://doi.org/10.52783/jisem.v10i4.8981.

Full text
Abstract:
This research paper proposes a novel knowledge-based Arabic word tokenization system. Word tokenization is the first stage for higher-order Natural Language Processing (NLP) tasks like Part-of-Speech (PoS) tagging, parsing, and named entity recognition. The amount of text on the World Wide Web is growing daily in the present era of technology, necessitating the use of advanced instruments. Since more and more people speak Arabic around the world, Arabic language processing systems must be improved. Due to the writing style of Arabic, with its lack of capitalization and its use of compound words, it is difficult to perform word tokenization. Arabic's inconsistent use of spaces between words, a consequence of its cursive form, makes words difficult to tokenize. Word tokenization plays a vital role in all aspects of natural language processing, and different applications can be created once words have been tokenized. To develop this system, a maximum matching model with its two variations, namely forward and reverse maximum matching, is used. The proposed system is implemented in Python. The results produced during system evaluation report high performance.
APA, Harvard, Vancouver, ISO, and other styles
28

Almaaytah, Shahab Ahmad. "Arabic word tokenization system using the maximum matching model." Edelweiss Applied Science and Technology 8, no. 6 (2024): 3210–17. http://dx.doi.org/10.55214/25768484.v8i6.2682.

Full text
Abstract:
Word tokenization is the first stage for higher-order Natural Language Processing (NLP) tasks like Part-of-Speech (PoS) tagging, parsing, and named entity recognition. The amount of text on the World Wide Web is growing daily in the present era of technology, necessitating the use of advanced instruments. Since more and more people speak Arabic around the world, Arabic language processing systems must be improved. Due to the writing style of Arabic, with its lack of capitalization and its use of compound words, it is difficult to perform word tokenization. This research paper proposes a novel knowledge-based Arabic word tokenization system. To develop this system, a maximum matching model with its two variations, namely forward and reverse maximum matching, is used. The proposed system is implemented in Python. The results produced during system evaluation report high performance.
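A forward maximum matching pass can be sketched as follows; the lexicon and example are illustrative (Latin script for readability), not the paper's Arabic resources, and a reverse variant would scan from the end of the string instead.

```python
def forward_maximum_matching(text: str, lexicon: set[str]) -> list[str]:
    """Greedy longest-match tokenization: at each position, take the longest
    dictionary word that matches; fall back to a single character otherwise."""
    max_len = max(map(len, lexicon))
    tokens, i = [], 0
    while i < len(text):
        for span in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + span]
            if candidate in lexicon or span == 1:
                tokens.append(candidate)
                i += span
                break
    return tokens

# Toy example; an Arabic system would use an Arabic lexicon instead.
lexicon = {"the", "there", "after", "thereafter", "noon"}
print(forward_maximum_matching("thereafternoon", lexicon))  # ['thereafter', 'noon']
```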
APA, Harvard, Vancouver, ISO, and other styles
29

Aravind Ayyagiri, Om Goel, and Shalu Jain. "Innovative Approaches to Full-Text Search with Solr and Lucene." Universal Research Reports 11, no. 1 (2024): 209–24. http://dx.doi.org/10.36676/urr.v11.i1.1336.

Full text
Abstract:
Full-text search engines help efficiently process large volumes of textual material and provide appropriate results. Apache Solr and Apache Lucene are popular full-text search tools for indexing and querying huge datasets. This research study examines Solr and Lucene's strengths, weaknesses, and unique methods for improving full-text search efficiency and accuracy. Apache Lucene, the fundamental full-text search framework, has extensive indexing and querying features. Developers may tailor the search process using its flexible and extendable framework. Advanced indexing algorithms like inverted indices and tokenization underpin Lucene's search capabilities. However, complicated query needs and efficient large-scale data management remain problems. Solr, founded on Lucene, adds faceting, distributed searching, and rich text analysis to its search engine. Enterprise applications may use Solr's high availability, fault tolerance, and large-scale deployments. Solr has performance tuning and setup complexity issues despite these benefits. This study explores novel solutions to these issues and improves full-text search. Advanced tokenization and normalization may improve indexing tactics. Machine learning algorithms increase search relevancy, providing more accurate and contextual results. Query processing optimization is another invention. Caching, query rewriting, and parallel processing may minimize query latency and boost throughput. GPUs are also used to improve query execution. The article also discusses integrating Solr and Lucene with big data platforms and cloud services. Distributed computing frameworks and cloud storage may improve scalability and real-time search. How Solr and Lucene may incorporate AI and NLP to improve search accuracy and user experience is also investigated.
APA, Harvard, Vancouver, ISO, and other styles
30

Aravind Ayyagiri, Om Goel, and Shalu Jain. "Innovative Approaches to Full-Text Search with Solr and Lucene." Innovative Research Thoughts 10, no. 3 (2024): 144–59. http://dx.doi.org/10.36676/irt.v10.i3.1473.

Full text
Abstract:
Full-text search engines help efficiently process large volumes of textual material and provide appropriate results. Apache Solr and Apache Lucene are popular full-text search tools for indexing and querying huge datasets. This research study examines Solr and Lucene's strengths, weaknesses, and unique methods for improving full-text search efficiency and accuracy. Apache Lucene, the fundamental full-text search framework, has extensive indexing and querying features. Developers may tailor the search process using its flexible and extendable framework. Advanced indexing algorithms like inverted indices and tokenization underpin Lucene's search capabilities. However, complicated query needs and efficient large-scale data management remain problems. Solr, founded on Lucene, adds faceting, distributed searching, and rich text analysis to its search engine. Enterprise applications may use Solr's high availability, fault tolerance, and large-scale deployments. Solr has performance tuning and setup complexity issues despite these benefits. This study explores novel solutions to these issues and improves full-text search. Advanced tokenization and normalization may improve indexing tactics. Machine learning algorithms increase search relevancy, providing more accurate and contextual results. Query processing optimization is another invention. Caching, query rewriting, and parallel processing may minimize query latency and boost throughput. GPUs are also used to improve query execution. The article also discusses integrating Solr and Lucene with big data platforms and cloud services. Distributed computing frameworks and cloud storage may improve scalability and real-time search. How Solr and Lucene may incorporate AI and NLP to improve search accuracy and user experience is also investigated.
APA, Harvard, Vancouver, ISO, and other styles
31

Nafea, Ahmed Adil, Muhmmad Shihab Muayad, Russel R. Majeed, et al. "A Brief Review on Preprocessing Text in Arabic Language Dataset: Techniques and Challenges." Babylonian Journal of Artificial Intelligence 2024 (May 18, 2024): 46–53. http://dx.doi.org/10.58496/bjai/2024/007.

Full text
Abstract:
Text preprocessing plays an important role in natural language processing (NLP) tasks including text classification, sentiment analysis, and machine translation. The preprocessing of Arabic text still presents unique challenges due to the language's rich morphology, complex grammar, and varied character sets. This brief review surveys various techniques used for preprocessing Arabic text data. It discusses the challenges specific to Arabic text and presents an overview of key preprocessing steps, including normalization, tokenization, stemming, stop-word removal, and noise reduction. The survey analyzes the impact of preprocessing techniques on NLP tasks and focuses on current research trends and future directions in Arabic text preprocessing.
APA, Harvard, Vancouver, ISO, and other styles
32

Sharma, Kartik. "Text to SQL Query." International Journal for Research in Applied Science and Engineering Technology 13, no. 5 (2025): 6870–73. https://doi.org/10.22214/ijraset.2025.71793.

Full text
Abstract:
With the exponential growth of data in modern organizations, the ability to extract meaningful insights from databases has become crucial. However, interacting with structured databases often requires knowledge of SQL (Structured Query Language), which presents a barrier for non-technical users. To address this challenge, this project proposes a Text to SQL system that enables users to retrieve data from relational databases by simply expressing their queries in natural language. The goal is to bridge the gap between human language and machine-readable SQL commands through the use of Natural Language Processing (NLP) and machine learning techniques. The system is designed to accept a natural language input, process and understand its intent, and then convert it into an equivalent SQL query that can be executed on a target database. The core methodology involves text preprocessing, tokenization, semantic parsing, and SQL query generation. Modern NLP models, including transformer-based architectures, are explored to improve the understanding of context and the mapping between language and database schema.
APA, Harvard, Vancouver, ISO, and other styles
33

Alzahrani, Ibrahim R. "Unlocking Potential Score Insights of Sentimental Analysis with a Deep Learning Revolutionizes." Emerging Science Journal 9, no. 1 (2025): 25–44. https://doi.org/10.28991/esj-2025-09-01-03.

Full text
Abstract:
Online hate has emerged as a rapidly growing issue worldwide, often stemming from differences in opinion. It is crucial to use appropriate language and words on social media platforms, as inappropriate communication can negatively impact others. Consequently, detecting hate speech is of significant importance. While manual methods are commonly employed to identify hate and offensive content on social media, they are time-consuming, labor-intensive, and prone to errors. Therefore, AI-based approaches are increasingly being adopted for the effective classification of hate and offensive speech. The proposed model incorporates various text preprocessing techniques, such as removing extraneous elements like URLs, emojis, and blank spaces. Following preprocessing, tokenization is applied to break down the text into smaller components or tokens. The tokenization technique utilized in this study is TF-IDF (Term Frequency–Inverse Document Frequency). After tokenization, the model performs the classification of hate and offensive speech using the proposed BiLSTM-based SM-CJ (Scalable Multi-Channel Joint) framework. The BiLSTM-based SM-CJ model is particularly effective in detecting hate, offensive, and neutral tweets due to its ability to capture both forward and backward contexts within a given text. Detecting hate speech requires a comprehensive understanding of the text and the identification of patterns spanning across multiple words or phrases. To achieve this, the LSTM component of the BiLSTM model is designed to capture long-term dependencies by utilizing information from earlier parts of the text. The proposed SM-CJ framework further aligns the input sequence lengths fetched from the input layer, enabling the model to focus on specific segments of the input sequence that are most relevant for hate speech detection. This approach allows the model to accurately capture derogatory language and subtle nuances present in hate speech. Finally, the performance of the proposed framework is evaluated using various metrics, including accuracy, recall, F1-score, and precision. The results are compared with state-of-the-art approaches, demonstrating the effectiveness of the proposed model.
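As a small illustration of TF-IDF features feeding a text classifier (a logistic regression stand-in is used here rather than the paper's BiLSTM-based SM-CJ model, and the toy examples are invented):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny invented dataset; real hate-speech corpora are far larger and noisier.
texts = [
    "I really enjoyed this community event",
    "people like you are worthless and should disappear",
    "what time does the match start tonight",
    "we hate your kind, go back to where you came from",
]
labels = ["neutral", "hate", "neutral", "hate"]

model = make_pipeline(
    TfidfVectorizer(lowercase=True, ngram_range=(1, 2)),  # tokenize and weight terms
    LogisticRegression(max_iter=1000),                    # stand-in classifier, not the paper's BiLSTM
)
model.fit(texts, labels)
print(model.predict(["we hate people like you"]))
```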
APA, Harvard, Vancouver, ISO, and other styles
34

Mashtalir, Sergii V., and Oleksandr V. Nikolenko. "Data preprocessing and tokenization techniques for technical Ukrainian texts." Applied Aspects of Information Technology 6, no. 3 (2023): 318–26. http://dx.doi.org/10.15276/aait.06.2023.22.

Full text
Abstract:
The field of Natural Language Processing (NLP) has witnessed significant advancements fueled by machine learning, deep learning, and artificial intelligence, expanding its applicability and enhancing human-computer interactions. However, NLP systems grapple with issues related to incomplete and error-laden data, potentially leading to biased model outputs. Specialized technical domains pose additional challenges, demanding domain-specific fine-tuning and custom lexicons. Moreover, many languages lack comprehensive NLP support, hindering accessibility. In this context, we explore novel NLP data preprocessing and tokenization techniques tailored for technical Ukrainian texts. We address a dataset comprising automotive repair labor entity names, known for errors and domain-specific terms, often in a blend of Ukrainian and Russian. Our goal is to classify these entities accurately, requiring comprehensive data cleaning, preprocessing and tokenization. Our approach modifies classical NLP preprocessing, incorporating language detection, specific Cyrillic character recognition, compounded word disassembly, and abbreviation handling. Text line normalization standardizes characters, punctuation, and abbreviations, improving consistency. Stopwords are curated to enhance classification relevance. Translation of Russian to Ukrainian leverages detailed classifiers, resulting in a correspondence dictionary. Tokenization addresses concatenated tokens, spelling errors, common prefixes in compound words, and abbreviations. Lemmatization, crucial in languages like Ukrainian and Russian, builds dictionaries mapping word forms to lemmas, with a focus on noun cases. The results yield a robust token dictionary suitable for various NLP tasks, enhancing the accuracy and reliability of applications, particularly in technical Ukrainian contexts. This research contributes to the evolving landscape of NLP data preprocessing and tokenization, offering valuable insights for handling domain-specific languages.
APA, Harvard, Vancouver, ISO, and other styles
35

Avhad, Pranjali. "WordCanvas: Text-to-Image Generation." INTERANTIONAL JOURNAL OF SCIENTIFIC RESEARCH IN ENGINEERING AND MANAGEMENT 08, no. 05 (2024): 1–5. http://dx.doi.org/10.55041/ijsrem32152.

Full text
Abstract:
This project investigates the novel use of stable diffusion techniques to generate high-quality images from detailed text descriptions. The combination of natural language understanding and computer vision in text-to-image conversion opens up new possibilities for content creation and communication. Using cutting-edge stable diffusion models, our project builds a solid foundation for the generation process, which includes tokenization, pre-processing, specialized architecture design, and post-processing techniques. The advantages include eye-catching images, increased user engagement, content personalization, and improved accessibility. Automation of content generation has applications in marketing, education, data visualization, and creative expression. However, challenges such as model accuracy, ethical concerns, and biases need addressing. Achieving a balance between automation and human supervision is critical for the responsible application of this transformative capability. Index Terms—Stable diffusion, Text-to-image conversion, Natural language understanding, Pre-processing, Post-processing techniques, Content personalization
APA, Harvard, Vancouver, ISO, and other styles
36

Wiguna, Ratu Aghnia raffaidy, and Andri Irfan Rifai. "Analisis Text Clustering Masyarakat Di Twitter Mengenai Omnibus Law Menggunakan Orange Data Mining." Journal of Information Systems and Informatics 3, no. 1 (2021): 1–12. http://dx.doi.org/10.33557/journalisi.v3i1.78.

Full text
Abstract:
Social media, particularly the Twitter platform, has become an interesting subject of study. Trending topics generate comments from the Indonesian public that contain opinions in the form of emotions. This study analyzes opinions on Twitter using the VADER analysis method, which produces a tweet profiler followed by a distribution visualization. The research uses the Orange Data Mining tool, applying text preprocessing that includes transformation, tokenization, normalization, and filtering so that the text can be analyzed. The study concludes that public responses to the Omnibus Law Job Creation Act fall into six emotion categories, the most frequent of which is surprise.
APA, Harvard, Vancouver, ISO, and other styles
37

Challapalli, Srinivasa Sai Abhijit. "Sentiment Analysis of the Twitter Dataset for the Prediction of Sentiments." Journal of Sensors, IoT & Health Sciences 2, no. 4 (2024): 1–15. https://doi.org/10.69996/jsihs.2024017.

Full text
Abstract:
Sentiment analysis of Twitter data involves using natural language processing (NLP) and machine learning techniques to classify the sentiments expressed in tweets. The goal is to classify tweets based on the sentiment, emotion, or behavior in the text, using labels such as positive, negative, or neutral. Twitter sentiment analysis is an important tool for instantly understanding public opinion on various topics, events, or brands. This process usually begins with the collection of large tweet datasets, followed by preliminary steps such as tokenization, outlier removal, and text normalization to clean the data. The text is then converted into a numerical representation suitable for machine learning models using feature extraction techniques such as Bag-of-Words, Term Frequency-Inverse Document Frequency (TF-IDF), and word embeddings. Deep learning models, such as convolutional neural networks (CNN), long short-term memory (LSTM) networks, and bidirectional LSTM (BiLSTM), are generally used to train on and predict sentiment. This paper presents an effective method for sentiment analysis of Twitter data using deep learning methods, especially CNN, LSTM, and BiLSTM. The dataset contains 7000 tweets, which are pre-processed using text cleaning methods including tokenization, stop-word removal, and lemmatization. The data is then converted into numerical vectors by methods such as bag-of-words and word embeddings. The models are trained: the CNN model reaches a training accuracy of 0.95 and a test accuracy of 0.92, the LSTM model a training accuracy of 0.97 and a test accuracy of 0.90, and the BiLSTM model a training accuracy of 1.0 and a test accuracy of 0.9. The results show a tendency to overfit, with the models performing well on training data but worse on test data. However, the models successfully classified tweets into positive, negative, and neutral groups, demonstrating the potential of deep learning in capturing sentiment from social media.
APA, Harvard, Vancouver, ISO, and other styles
38

Manikkannan, Prof. "SMS Spam Detection using Machine Learning." INTERANTIONAL JOURNAL OF SCIENTIFIC RESEARCH IN ENGINEERING AND MANAGEMENT 07, no. 11 (2023): 1–11. http://dx.doi.org/10.55041/ijsrem27463.

Full text
Abstract:
SMS spam detection using the Naive Bayes algorithm is a widely used technique in the field of text classification. The main aim of this approach is to classify incoming messages into spam or ham categories. The Naive Bayes algorithm works by calculating the probability of a message belonging to a particular class, based on the occurrence of different words in the message. In this paper, we present an efficient and accurate approach for SMS spam detection using the Naive Bayes algorithm. The proposed approach utilizes a pre-processing step for feature extraction, which includes tokenization, stop-word removal, and stemming. The Naive Bayes algorithm is then trained on a dataset of labeled messages to learn the probability distributions of different words in spam and ham messages. Finally, the trained model is used to classify incoming messages into spam or ham categories. The results of our experiments show that the proposed approach achieves high accuracy in detecting SMS spam messages. Keywords: Naive Bayes Algorithm, Text Classification, Probability, Feature Extraction, Tokenization, Stop-Word Removal, Stemming, Labeled Dataset, Probability Distribution, Accuracy
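A compact scikit-learn version of the described pipeline (count features plus multinomial Naive Bayes; the tiny message set is invented and the stemming step is omitted) might look like this:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny labeled sample; the paper trains on a full labeled SMS dataset.
messages = [
    "WIN a free prize now, click this link",
    "Are we still meeting for lunch today?",
    "URGENT: your account will be suspended, verify now",
    "Can you send me the report by 5pm?",
]
labels = ["spam", "ham", "spam", "ham"]

# CountVectorizer handles tokenization and stop-word removal; stemming would
# need an extra step (e.g., a custom analyzer) and is left out of this sketch.
classifier = make_pipeline(
    CountVectorizer(lowercase=True, stop_words="english"),
    MultinomialNB(),   # P(class | message) via Bayes' rule over word-count features
)
classifier.fit(messages, labels)
print(classifier.predict(["Free prize! verify your account now"]))
```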
APA, Harvard, Vancouver, ISO, and other styles
39

Amiri, Amin, Alireza Ghaffarnia, Nafiseh Ghaffar Nia, Dalei Wu, and Yu Liang. "Harmonizer: A Universal Signal Tokenization Framework for Multimodal Large Language Models." Mathematics 13, no. 11 (2025): 1819. https://doi.org/10.3390/math13111819.

Full text
Abstract:
This paper introduces Harmonizer, a universal framework designed for tokenizing heterogeneous input signals, including text, audio, and video, to enable seamless integration into multimodal large language models (LLMs). Harmonizer employs a unified approach to convert diverse, non-linguistic signals into discrete tokens via its FusionQuantizer architecture, built on FluxFormer, to efficiently capture essential signal features while minimizing complexity. We enhance features through STFT-based spectral decomposition, Hilbert transform analytic signal extraction, and SCLAHE spectrogram contrast optimization, and train using a composite loss function to produce reliable embeddings and construct a robust vector vocabulary. Experimental validation on music datasets such as E-GMD v1.0.0, Maestro v3.0.0, and GTZAN demonstrates high fidelity across 288 s of vocal signals (MSE = 0.0037, CC = 0.9282, Cosine Sim. = 0.9278, DTW = 12.12, MFCC Sim. = 0.9997, Spectral Conv. = 0.2485). Preliminary tests on text reconstruction and UCF-101 video clips further confirm Harmonizer’s applicability across discrete and spatiotemporal modalities. Rooted in the universality of wave phenomena and Fourier theory, Harmonizer offers a physics-inspired, modality-agnostic fusion mechanism via wave superposition and interference principles. In summary, Harmonizer integrates natural language processing and signal processing into a coherent tokenization paradigm for efficient, interpretable multimodal learning.
APA, Harvard, Vancouver, ISO, and other styles
40

Bogdanov, M. R., G. R. Shakhmametova, and N. N. Oskin. "Possibility of Using the Attention Mechanism in Multimodal Recognition of Cardiovascular Diseases." Programmnaya Ingeneria 15, no. 11 (2024): 578–88. http://dx.doi.org/10.17587/prin.15.578-588.

Full text
Abstract:
The paper studies the possibility of using the attention mechanism in diagnosing various cardiovascular diseases. Biomedical data were presented in different modalities (text, images, and time series). A comparison of the efficiency of five transformers based on the attention mechanism (the Dosovitskiy vision transformer, a compact convolutional transformer, a transformer with external attention, a transformer based on tokenization with patch shift and local self-attention, and a transformer based on multiple deep attention) was carried out against the Xception convolutional neural network, three fully connected neural networks (MLP-Mixer, FNet, and gMLP), and the YOLO architecture on a multi-class classification problem (16 classes of dangerous arrhythmias). High efficiency of transformers in diagnosing cardiac diseases was shown. The transformer based on tokenization with patch shift and local self-attention showed the greatest efficiency.
APA, Harvard, Vancouver, ISO, and other styles
41

Jain, Madhur, Shilpi Jain, Shruti Daga, and Roshni. "Unveiling Text Representation with 'Bag of Words'." International Journal of Scientific Research in Computer Science, Engineering and Information Technology 10, no. 3 (2024): 146–54. http://dx.doi.org/10.32628/cseit2410314.

Full text
Abstract:
Techniques for natural language processing (NLP) have grown to be essential tools for deciphering and drawing insightful conclusions from massive volumes of text data. A thorough review of numerous NLP techniques, including tokenization, stemming, lemmatization, named entity recognition, sentiment analysis, and topic modelling, is provided in this abstract. These methods are essential for applications like sentiment analysis, machine translation, text categorization, and information retrieval. Furthermore, the capabilities of NLP systems have been greatly improved by recent developments in deep learning, especially with models like BERT and GPT. This has allowed them to reach state-of-the-art performance in a variety of language understanding tasks. The difficulties and potential paths for future study in NLP, including managing ambiguity, comprehending context, and enhancing multilingual assistance, are also highlighted in this abstract. By using NLP tools to their full potential, researchers can continue to extract meaningful insights from text data.
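A minimal bag-of-words example with scikit-learn, showing the learned vocabulary and the document-term count matrix:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)       # sparse document-term count matrix

print(vectorizer.get_feature_names_out())  # learned vocabulary
print(bow.toarray())                       # one row of word counts per document
```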
APA, Harvard, Vancouver, ISO, and other styles
42

Somanathan Pillai, Sanjaikanth E. Vadakkethil, Srinivas A. Vaddadi, Rohith Vallabhaneni, Santosh Reddy Addula, and Bhuvanesh Ananthan. "TextBugger: an extended adversarial text attack on NLP-based text classification model." Indonesian Journal of Electrical Engineering and Computer Science 38, no. 3 (2025): 1735. https://doi.org/10.11591/ijeecs.v38.i3.pp1735-1744.

Full text
Abstract:
Adversarial inputs have recently raised serious security concerns for deep learning (DL) techniques. The main motivation for enhancing natural language processing (NLP) models is to learn such attacks and secure the models against adversarial text. Existing adversarial attack techniques suffer from issues such as high error rates, and traditional prevention approaches cannot accurately secure data against harmful attacks; as a result, some attacks fail to expose further flaws in NLP models, which motivates the introduction of enhanced adversarial mechanisms. The proposed article introduces an extended adversarial text generation method, TextBugger. Initially, preprocessing steps such as stop word removal (SR) and tokenization are performed to remove noise from the text data. Then, various NLP models, including Bidirectional Encoder Representations from Transformers (BERT), robustly optimized BERT (RoBERTa), and XLNet, are analyzed for generating adversarial texts. The simulation process is carried out on the Python platform, and a publicly available text classification attack database is utilized for the training process. Various assessment measures, such as success rate, time consumption, positive predictive value (PPV), Kappa coefficient (KC), and F-measure, are analyzed for the different TextBugger models. The overall success rates achieved by BERT, RoBERTa, and XLNet are about 98.6%, 99.7%, and 96.8%, respectively.
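The sketch below illustrates the character-level "bug" operations commonly associated with TextBugger-style attacks (insert, delete, swap, and visually similar substitution). It is not the paper's pipeline: the victim-model scoring loop that decides which words to perturb (BERT/RoBERTa/XLNet in the study) is deliberately omitted, and the substitution map is an assumption for illustration.

```python
# Illustrative character-level perturbations in the spirit of TextBugger.
import random

VISUAL_SUBS = {"o": "0", "l": "1", "a": "@", "e": "3", "i": "1", "s": "5"}

def bug_word(word: str, rng: random.Random) -> str:
    if len(word) < 3:
        return word
    op = rng.choice(["insert", "delete", "swap", "substitute"])
    i = rng.randrange(1, len(word) - 1)           # avoid first/last character
    if op == "insert":
        return word[:i] + " " + word[i:]          # split the word with a space
    if op == "delete":
        return word[:i] + word[i + 1:]
    if op == "swap":
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    # substitute with a visually similar character when one is defined
    return word[:i] + VISUAL_SUBS.get(word[i], word[i]) + word[i + 1:]

def bug_text(text: str, rate: float = 0.3, seed: int = 0) -> str:
    rng = random.Random(seed)
    return " ".join(bug_word(w, rng) if rng.random() < rate else w for w in text.split())

if __name__ == "__main__":
    print(bug_text("this movie was absolutely wonderful and inspiring"))
```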
APA, Harvard, Vancouver, ISO, and other styles
43

Badry Ali Mustofa and Wawan Laksito Yuly Saptomo. "Use of Natural Language Processing in Social Media Text Analysis." Journal of Artificial Intelligence and Engineering Applications (JAIEA) 4, no. 2 (2025): 1235–38. https://doi.org/10.59934/jaiea.v4i2.875.

Full text
Abstract:
Social media generates enormous volumes of text data, creating both opportunities and challenges for analysis. Natural Language Processing (NLP) enables in-depth analysis of public opinion and the identification of trends and language patterns in social media texts. However, social media texts often pose problems of informal language, slang, and spelling errors. This research discusses the application of NLP techniques such as sentiment analysis, tokenization, and text classification, and compares classical machine learning models (Naive Bayes and SVM) with deep learning models (BERT). The results show that deep learning-based models excel at understanding informal language in context, producing more accurate analyses. This study makes an important contribution to the development of AI-based applications for social media analysis.
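For reference, the following is a small sketch of the classical side of the comparison described above: Naive Bayes and a linear SVM over TF-IDF features. The toy tweets and labels are invented, and the BERT fine-tuning side of the comparison is not shown.

```python
# Classical baselines for social-media text classification (toy data).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

tweets = [
    "love this new phone, works great",
    "worst service ever, totally disappointed",
    "such a fun day at the beach",
    "this update broke everything, so annoying",
]
labels = ["positive", "negative", "positive", "negative"]

for name, clf in [("Naive Bayes", MultinomialNB()), ("Linear SVM", LinearSVC())]:
    model = make_pipeline(TfidfVectorizer(lowercase=True), clf)
    model.fit(tweets, labels)
    print(name, model.predict(["broke my phone, disappointed"]))
```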
APA, Harvard, Vancouver, ISO, and other styles
44

Bakaev, Ilkhom Izatovich. "The development of stemming algorithm for the Uzbek language." Кибернетика и программирование, no. 1 (January 2021): 1–12. http://dx.doi.org/10.25136/2644-5522.2021.1.35847.

Full text
Abstract:
The automatic processing of unstructured texts in natural languages is one of the relevant problems of computer analysis and text synthesis. Within this problem, the author singles out the task of text normalization, which usually involves processes such as tokenization, stemming, and lemmatization. Existing stemming algorithms are for the most part oriented towards synthetic languages with inflectional morphemes. The Uzbek language is an example of an agglutinative language, characterized by the polysemy of affixal and auxiliary morphemes. Although the Uzbek language differs considerably from, for example, English, which is successfully processed by stemming algorithms, there are virtually no examples of an effective implementation of stemming algorithms for Uzbek; therefore, this question is of scientific interest and defines the goal of this work. In the course of this research, the author solved the task of bringing given texts in the Uzbek language to normal form; at a preliminary stage the texts were tokenized and cleared of stop words. The author developed a method for normalizing texts in the Uzbek language based on a stemming algorithm. The development of the stemming algorithm employed a hybrid approach combining an algorithmic method, a lexicon of linguistic rules, and a database of normal word forms of the Uzbek language. The precision of the proposed algorithm depends on the precision of the tokenization algorithm. At the same time, the article does not explore the question of finding the roots of paired words separated by spaces, as this task is solved at the tokenization stage. The algorithm can be integrated into various automated systems for machine translation, information extraction, data retrieval, etc.
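The sketch below illustrates the hybrid normalization idea described in the abstract: consult a dictionary of normal word forms first, then fall back to iterative suffix stripping. The tiny dictionary and suffix list are placeholders invented for demonstration, not the paper's actual Uzbek linguistic resources or rules.

```python
# Hybrid stemming sketch: dictionary lookup, then iterative suffix stripping.
NORMAL_FORMS = {"kitoblar": "kitob", "maktabga": "maktab"}   # placeholder entries
SUFFIXES = ["lari", "ning", "dan", "lar", "ga", "da", "ni"]  # placeholder affixes

def stem(word: str, min_root_len: int = 3) -> str:
    word = word.lower()
    if word in NORMAL_FORMS:                  # step 1: database of normal forms
        return NORMAL_FORMS[word]
    changed = True
    while changed:                            # step 2: strip suffixes repeatedly
        changed = False
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= min_root_len:
                word = word[: -len(suffix)]
                changed = True
                break
    return word

if __name__ == "__main__":
    for w in ["kitoblar", "maktabga", "talabalarni"]:
        print(w, "->", stem(w))
```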
APA, Harvard, Vancouver, ISO, and other styles
45

Aher, Sakshi Bhaulal. "INTELLIGENT PERSONAL MEMORY ASSISTANT." International Scientific Journal of Engineering and Management 04, no. 05 (2025): 1–7. https://doi.org/10.55041/isjem03380.

Full text
Abstract:
The growing need for personalized digital assistants has led to the development of systems that can effectively store, process, and retrieve user-specific data through natural language interactions. This paper introduces a Personal Memory Assistant (PMA), designed to handle both voice and text inputs, enabling users to store queries and retrieve relevant information when required. The system incorporates a range of Natural Language Processing (NLP) methods, such as tokenization, stopword removal, and Term Frequency-Inverse Document Frequency (TF-IDF), to efficiently analyze and interpret user inputs. For voice-based inputs, a speech-to-text mechanism is employed, offering users the flexibility to switch between voice and text seamlessly. User queries and data are stored in a cloud environment using Firebase, which ensures real-time synchronization and scalability of the stored information. Upon receiving a query, the system applies TF-IDF to match the input with previously stored data, facilitating accurate and contextually relevant retrieval. This approach allows the system to manage structured and unstructured data efficiently. By combining advanced NLP techniques with cloud-based storage and real-time data processing, the PMA delivers personalized responses, enhancing user engagement and interaction. The paper demonstrates the potential of integrating cloud technologies and NLP methods to improve the functionality of digital assistants in providing context-aware, tailored responses. Keywords: Personal Memory Assistant (PMA), Natural Language Processing (NLP), Tokenization, Stopword Removal, TF-IDF, Speech-to-Text, Firebase, Cloud.
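The snippet below sketches only the TF-IDF matching step described in the abstract: a user query is matched against previously stored notes by cosine similarity. Firebase storage and the speech-to-text front end are out of scope here, and scikit-learn is used as one possible TF-IDF implementation; the stored notes are invented.

```python
# TF-IDF retrieval sketch: return the stored note most similar to a query.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

stored_notes = [
    "dentist appointment on friday at 10 am",
    "wifi password for the home router is on the fridge",
    "mom's birthday gift ideas: scarf or cookbook",
]

vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
note_matrix = vectorizer.fit_transform(stored_notes)

def retrieve(query: str) -> str:
    """Return the stored note with the highest cosine similarity to the query."""
    scores = cosine_similarity(vectorizer.transform([query]), note_matrix)[0]
    return stored_notes[scores.argmax()]

print(retrieve("when is my dentist visit?"))
```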
APA, Harvard, Vancouver, ISO, and other styles
46

Darú, Gilsiley Henrique, Felipe Daltrozo da Motta Motta, Antonio Castelo, and Gustavo Valentim Loch. "Short text classification applied to item description: Some methods evaluation." Semina: Ciências Exatas e Tecnológicas 43, no. 2 (2022): 189–98. http://dx.doi.org/10.5433/1679-0375.2022v43n2p189.

Full text
Abstract:
The increasing demand for content-based classification of information in the age of social media and e-commerce has led to the need for automated product classification from item descriptions. This study evaluates various techniques for this task, with a focus on descriptions written in Portuguese. A pipeline is implemented to preprocess the data, including lowercasing, accent removal, and unigram tokenization. The bag-of-words method is then used to convert text into numerical data, and five classification techniques are applied: argmaxtf, argmaxtfnorm, and argmaxtfidf from information retrieval, and two machine learning methods, logistic regression and support vector machines. The performance of each technique is evaluated using simple accuracy via thirty-fold cross-validation. The results show that logistic regression achieves the highest mean accuracy among the evaluated techniques.
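The following is a minimal version of the preprocessing and one of the evaluated classifiers (logistic regression over a unigram bag of words with lowercasing and accent removal). The Portuguese item descriptions are invented, and the argmax-tf style information-retrieval baselines are not reproduced.

```python
# Preprocessing + bag-of-words + logistic regression for short item descriptions.
import unicodedata
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def strip_accents(text: str) -> str:
    """Lowercase and remove accents, e.g. 'Café Torrado' -> 'cafe torrado'."""
    nfkd = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in nfkd if not unicodedata.combining(ch))

descriptions = [
    "cafe torrado e moido 500g",
    "camiseta masculina algodao azul",
    "cafe soluvel em po 200g",
    "camiseta infantil estampada",
]
categories = ["alimentos", "vestuario", "alimentos", "vestuario"]

model = make_pipeline(
    CountVectorizer(preprocessor=strip_accents, ngram_range=(1, 1)),  # unigrams
    LogisticRegression(max_iter=1000),
)
model.fit(descriptions, categories)
print(model.predict(["camiseta polo algodao"]))
```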
APA, Harvard, Vancouver, ISO, and other styles
47

Držík, Dávid, and Frantisek Forgac. "Slovak morphological tokenizer using the Byte-Pair Encoding algorithm." PeerJ Computer Science 10 (November 19, 2024): e2465. http://dx.doi.org/10.7717/peerj-cs.2465.

Full text
Abstract:
This study introduces a new approach to text tokenization, the SlovaK Morphological Tokenizer (SKMT), which integrates the morphology of the Slovak language into the training process of the Byte-Pair Encoding (BPE) algorithm. Unlike conventional tokenizers, SKMT focuses on preserving the integrity of word roots within individual tokens, which is crucial for maintaining lexical meaning. The methodology involves segmenting and extracting word roots from morphological dictionaries and databases, followed by corpus preprocessing and training SKMT alongside a traditional BPE tokenizer. Comparative evaluation against existing tokenizers demonstrates SKMT's outstanding ability to maintain root integrity, achieving 99.7% root integrity compared to SlovakBERT (90.5%) and a pure BPE tokenizer (93.1%). Further validation involved fine-tuning models on a sentiment classification NLP task, where models trained with SKMT achieved an F1-score improvement of 3.5% over those trained with conventional BPE tokenization, followed by evaluation on the Semantic Textual Similarity (STS) task. These findings suggest that training language models with the SKMT tokenizer significantly enhances model performance and quality.
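For orientation, the sketch below trains a plain BPE tokenizer with the Hugging Face `tokenizers` library and runs a naive root-integrity check: the share of test words whose root survives inside a single token. The tiny corpus, the word/root pairs, and the vocabulary size are placeholders; the SKMT morphological dictionaries and its modified training procedure are not reproduced.

```python
# Plain BPE training plus a naive root-integrity check (illustrative only).
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

corpus = [
    "ucitelka ucila ucitelov v skole",
    "skola a skolske triedy",
    "ucenie je cesta k poznaniu",
]  # placeholder text, deliberately without diacritics

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
tokenizer.train_from_iterator(corpus, BpeTrainer(vocab_size=200, special_tokens=["[UNK]"]))

def root_integrity(pairs):
    """Share of (word, root) pairs whose root appears intact inside one token."""
    ok = 0
    for word, root in pairs:
        tokens = tokenizer.encode(word).tokens
        ok += any(root in tok for tok in tokens)
    return ok / len(pairs)

print(root_integrity([("ucitelka", "ucitel"), ("skolske", "skol")]))
```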
APA, Harvard, Vancouver, ISO, and other styles
48

Qarah, Faisal, and Tawfeeq Alsanoosy. "A Comprehensive Analysis of Various Tokenizers for Arabic Large Language Models." Applied Sciences 14, no. 13 (2024): 5696. http://dx.doi.org/10.3390/app14135696.

Full text
Abstract:
Pretrained language models have achieved great success in various natural language understanding (NLU) tasks due to their capacity to capture deep contextualized information in text using pretraining on large-scale corpora. Tokenization plays a significant role in the process of lexical analysis. Tokens become the input for other natural language processing (NLP) tasks, like semantic parsing and language modeling. However, there is a lack of research on the evaluation of the impact of tokenization on the Arabic language model. Therefore, this study aims to address this gap in the literature by evaluating the performance of various tokenizers on Arabic large language models (LLMs). In this paper, we analyze the differences between WordPiece, SentencePiece, and BBPE tokenizers by pretraining three BERT models using each tokenizer while measuring the performance of each model on seven different NLP tasks using 29 different datasets. Overall, the model pretrained with text tokenized using the SentencePiece tokenizer significantly outperforms the other two models that utilize WordPiece and BBPE tokenizers. The results of this paper will assist researchers in developing better models, making better decisions in selecting the best tokenizers, improving feature engineering, and making models more efficient, thus ultimately leading to advancements in various NLP applications.
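As a small illustration of the tokenizer family the study found strongest, the snippet below trains and applies a SentencePiece model. The toy corpus and all training parameters are illustrative only, not the study's Arabic setup.

```python
# Train a tiny SentencePiece tokenizer and segment a sentence with it.
import tempfile
import sentencepiece as spm

corpus = [
    "tokenization splits text into subword units",
    "subword units help language models handle rare words",
    "sentencepiece learns its vocabulary directly from raw text",
]

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("\n".join(corpus))
    corpus_path = f.name

spm.SentencePieceTrainer.train(
    input=corpus_path,
    model_prefix="demo_sp",
    vocab_size=60,            # tiny vocabulary for the toy corpus
    model_type="unigram",     # "bpe" is also supported
    hard_vocab_limit=False,   # tolerate the very small corpus
)

sp = spm.SentencePieceProcessor(model_file="demo_sp.model")
print(sp.encode("tokenization of rare words", out_type=str))
```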
APA, Harvard, Vancouver, ISO, and other styles
49

Yunania, Nanda, and Yulian Findawati. "Hate Speech and Emotion Detection on Twitter Using LSTM Model." JICTE (Journal of Information and Computer Technology Education) 7, no. 1 (2023): 1–5. https://doi.org/10.21070/jicte.v7i1.1645.

Full text
Abstract:
This research aims to develop a classification model to detect hate speech and emotions on the Twitter platform using the Long Short-Term Memory (LSTM) method. With the increasing volume of data on social media, especially Twitter, automatic identification of negative content is crucial for maintaining a healthy digital ecosystem. The dataset used in this study consists of tweets labeled for hate speech and various emotion categories. Preprocessing is carried out to clean and prepare the data, including steps such as punctuation removal, tokenization, and text normalization. After preprocessing, the dataset is split into training and testing data at a ratio of 60:40 to ensure accurate model evaluation. The experimental results show that the LSTM model achieves an accuracy of 89% in hate speech classification and 71% in emotion classification. These results demonstrate the potential of the LSTM method in text analysis tasks and can serve as a basis for developing automatic detection systems on social media platforms. Highlights: LSTM achieves 89% accuracy in detecting hate speech and 71% in emotion classification. The model processes Indonesian-language tweets to identify hate speech and emotional tone. Preprocessing steps like tokenization and stopword removal are crucial for accurate classification. Keywords: Hate Speech, LSTM, Twitter
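The sketch below shows a minimal LSTM text classifier of the kind described, built with Keras. A binary hate-speech head is shown; the emotion task would use a softmax layer over more classes. The two example tweets, labels, and all hyperparameters are placeholders, not the study's dataset or configuration.

```python
# Minimal Keras LSTM classifier for short texts (placeholder data).
import numpy as np
import tensorflow as tf

texts = np.array(["kamu hebat sekali hari ini", "dasar bodoh tidak berguna"])
labels = np.array([0, 1])  # 0 = non-hate, 1 = hate (placeholder labels)

vectorizer = tf.keras.layers.TextVectorization(max_tokens=10000, output_sequence_length=30)
vectorizer.adapt(texts)
X = vectorizer(texts)                        # (2, 30) integer token ids

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=10000, output_dim=64),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1, activation="sigmoid"),   # hate / non-hate
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, labels, epochs=2, verbose=0)

print(model.predict(vectorizer(np.array(["dasar tidak berguna"]))))
```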
APA, Harvard, Vancouver, ISO, and other styles
50

Le, Duy Nguyen Minh, Huy Gia Le, Hai Thanh Hoang, and Vu Anh Hoang. "XBert - A Model for Hate Speech Detection in Vietnamese Text." International Journal of Emerging Technology and Advanced Engineering 13, no. 12 (2023): 1–5. http://dx.doi.org/10.46338/ijetae1223_01.

Full text
Abstract:
In the digital age, social media's pervasive influence has inadvertently escalated the prevalence of hate speech and offensive comments, with alarming implications for mental health. There is increasing evidence of a clear correlation between exposure to such toxic online content and the onset of depression among users, particularly affecting vulnerable groups like content creators and channel owners. Addressing this critical issue, our research introduces XBert, a model for detecting hostile and provocative language in Vietnamese. We propose an approach involving data preprocessing, improved tokenization, and model fine-tuning. We modified the architecture of the RoBERTa model, used the EDA technique, and added a dropout parameter to the tokenizer. Our model achieved an accuracy of 99.75% and an F1-Macro score of 98.05%. This is a promising result for a model detecting provocative and hostile language in Vietnamese.
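For context, the following is a minimal sketch of fine-tuning a RoBERTa-style pretrained model for binary toxic-comment classification with Hugging Face Transformers. The checkpoint name, the two placeholder comments, and the training loop are assumptions for illustration; the paper's specific modifications (EDA augmentation, tokenizer dropout, architectural changes) are not reproduced.

```python
# Generic RoBERTa-style fine-tuning sketch for binary toxicity classification.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

CHECKPOINT = "xlm-roberta-base"  # placeholder; the paper builds on a RoBERTa variant

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=2)

texts = ["binh luan lich su va than thien", "binh luan thu dich va xuc pham"]  # placeholders
labels = torch.tensor([0, 1])

batch = tokenizer(texts, padding=True, truncation=True, max_length=64, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(2):                          # tiny demonstration loop
    out = model(**batch, labels=labels)     # returns loss and logits
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

model.eval()
with torch.no_grad():
    logits = model(**batch).logits
print(logits.argmax(dim=-1))
```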
APA, Harvard, Vancouver, ISO, and other styles