
Journal articles on the topic 'Text dataset'

Consult the top 50 journal articles for your research on the topic 'Text dataset.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse journal articles from a wide variety of disciplines and organise your bibliography correctly.

1

Khan, Shafiq Ur Rehman, and Muhammad Arshad Islam. "Event-Dataset: Temporal information retrieval and text classification dataset." Data in Brief 25 (August 2019): 104048. http://dx.doi.org/10.1016/j.dib.2019.104048.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

Assad, Ali, Abdul Hadi M. Alaidi, Amjad Yousif Sahib, Haider TH Salim ALRikabi, and Ahmed Magdy. "Transformer-based automatic Arabic text diacritization." Sustainable Engineering and Innovation 6, no. 2 (2024): 285–96. https://doi.org/10.37868/sei.v6i2.id305.

Full text
Abstract:
In Arabic natural language processing (NLP), automatic text diacritization is a major obstacle, and progress has been slow compared to other language processing tasks. This work proposes automatic diacritical marking of Arabic text using the first transformer-based paradigm designed solely for this task. By taking advantage of the attention mechanism, our system is able to capture more of the innate patterns in Arabic, surpassing the performance of both rule-based alternatives and neural network techniques. The model trained on the Clean-50 dataset had a diacritic error rate (DER) of 2.03%, while the model trained on the Clean-400 dataset had a DER of 1.37%. Compared to state-of-the-art results, the improvement on the Clean-50 dataset is minimal. However, for the larger Clean-400 dataset it is a notable improvement, indicating that this approach can deliver more accurate solutions for applications requiring precise diacritical marks with larger datasets. Additionally, this method achieves a DER of 1.21% on the Clean-400 dataset, and performs even better when given extended input text with overlapping windows.
APA, Harvard, Vancouver, ISO, and other styles
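The diacritic error rate (DER) reported in the abstract above can be made concrete. The paper does not give its exact formula, so the sketch below assumes one common definition: the fraction of base characters whose attached diacritics differ from the reference, over pre-aligned strings. All function names are illustrative only, not the authors' code.

```python
# Hypothetical sketch of a diacritic error rate (DER) computation.
# Assumes the predicted and reference strings are already aligned and
# contain the same sequence of base characters.
ARABIC_DIACRITICS = set("\u064B\u064C\u064D\u064E\u064F\u0650\u0651\u0652")

def diacritics_of(text):
    """Group each base character with the set of diacritics that follow it."""
    groups = []
    for ch in text:
        if ch in ARABIC_DIACRITICS:
            if groups:
                groups[-1][1].add(ch)
        else:
            groups.append((ch, set()))
    return groups

def der(predicted, reference):
    """Fraction of base characters whose diacritics differ from the reference."""
    p, r = diacritics_of(predicted), diacritics_of(reference)
    assert len(p) == len(r), "sketch assumes equal base-character counts"
    errors = sum(1 for (_, pd), (_, rd) in zip(p, r) if pd != rd)
    return errors / len(r)
```

Real evaluations also handle alignment mismatches and often report a separate word error rate; this sketch covers only the per-character count.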
3

Assad, Ali, Abdul Hadi M. Alaidi, Amjad Yousif Sahib, Haider TH Salim ALRikabi, and Ahmed Magdy. "Transformer-based automatic Arabic text diacritization." Sustainable Engineering and Innovation 6, no. 2 (2024): 285–96. http://dx.doi.org/10.37868/sei.v6i2.id392.

Full text
Abstract:
In Arabic natural language processing (NLP), automatic text diacritization is a major obstacle, and progress has been slow compared to other language processing tasks. This work proposes automatic diacritical marking of Arabic text using the first transformer-based paradigm designed solely for this task. By taking advantage of the attention mechanism, our system is able to capture more of the innate patterns in Arabic, surpassing the performance of both rule-based alternatives and neural network techniques. The model trained on the Clean-50 dataset had a diacritic error rate (DER) of 2.03%, while the model trained on the Clean-400 dataset had a DER of 1.37%. Compared to state-of-the-art results, the improvement on the Clean-50 dataset is minimal. However, for the larger Clean-400 dataset it is a notable improvement, indicating that this approach can deliver more accurate solutions for applications requiring precise diacritical marks with larger datasets. Additionally, this method achieves a DER of 1.21% on the Clean-400 dataset, and performs even better when given extended input text with overlapping windows.
APA, Harvard, Vancouver, ISO, and other styles
4

Васильев, А. А., and А. С. Нестеров. "APPLYING TEXT QUESTIONS GENERATION ALGORITHMS FOR AUTOMATIC TEST GENERATION." Proceedings in Cybernetics 22, no. 3 (2023): 17–22. http://dx.doi.org/10.35266/1999-7604-2023-3-17-22.

Full text
Abstract:
The article presents findings of manual, semi-automatic, and automatic approaches to generating test questions based on such methods as annotation, keyword extraction, and learning datasets for compiling tests on study material, along with a description of each method's algorithm, examples of generated questions, and their quality assessment. These examples demonstrate the advantages of a dataset-based generation algorithm and of combining methods, as well as their possible practical applications.
APA, Harvard, Vancouver, ISO, and other styles
5

Saeed, Ari M. "AN AUTOMATED NEW APPROACH IN FAST TEXT CLASSIFICATION: A CASE STUDY FOR KURDISH TEXT." Science Journal of University of Zakho 12, no. 3 (2024): 330–36. http://dx.doi.org/10.25271/sjuoz.2024.12.3.1296.

Full text
Abstract:
With the rapid development of internet technology, text classification has become a vital part of obtaining quick and accurate data. Traditional machine learning methods often suffer from poor performance and high-dimensional feature spaces, which reduce their accuracy. In this paper, the FastText model is proposed as the first-ever such classifier for Kurdish text, and the results are compared with traditional machine learning methods to show the effects on Kurdish text. To evaluate the model, four datasets are utilized: Kurdish News Dataset Headlines (KNDH), Medical Kurdish Dataset (MKD), Kurdish-Emotional-Dataset (KMD-77000), and KurdiSent. The results are compared with traditional machine learning algorithms such as Random Forest (RF), k-Nearest Neighbor (k-NN), Logistic Regression (LR), Multinomial Naïve Bayes (MNB), Support Vector Machine (SVM), Decision Tree (DT), and Stochastic Gradient Descent (SGD), as well as the deep learning model Bidirectional Encoder Representations from Transformers (BERT). The outcomes indicate that the FastText model achieved the highest performance on the KNDH dataset, with 89% for each of precision, recall, and F1-score and 89.10% accuracy. Moreover, on the KMD dataset the FastText model outperforms all others by approximately 2%. In addition, the comparative analysis showed that FastText is superior on KurdiSent, with precision, recall, F1-score, and accuracy of 81.32%, 81.83%, 81.57%, and 81.4%, respectively. Likewise, on MKD the FastText model obtained the highest performance, with a precision of 93.32%, recall of 93.36%, F1-score of 93.34%, and accuracy of 93.1%.
APA, Harvard, Vancouver, ISO, and other styles
6

O, Hyon-Gwang, Myong-Chol Kim, Il-Nam Pak, Un-Hyok Choe, and Chol-Jun O. "RanPil: New Dataset and Benchmark for Offline Handwritten Korean Text Recognition." International Journal on Data Science and Technology 11, no. 2 (2025): 27–34. https://doi.org/10.11648/j.ijdst.20251102.12.

Full text
Abstract:
In recent years, since deep learning technology has been applied to handwritten text recognition, the need for handwritten document image datasets has been growing. In particular, the development of such a dataset is of great significance for improving the performance of handwritten Korean text recognition, because no dataset for handwritten Korean text recognition has been published. In this paper, we present "RanPil", a new training and performance-evaluation dataset for handwritten Korean text recognition, which consists of a total of 8,600 pages of images (182,000 text lines and 4,300,000 characters) written by 1,804 people. We evaluate the writing diversity of the handwritten document images, such as text line spacing, text line slope, character size, word spacing, and character compactness. In addition, we propose an MOS (Mean Opinion Score) evaluation method for the scrawl level. Finally, we evaluate the performance of TrOCR, based on a vision encoder and decoder, on a test dataset classified by scrawl level.
APA, Harvard, Vancouver, ISO, and other styles
7

Maekawa, Aru, Satoshi Kosugi, Kotaro Funakoshi, and Manabu Okumura. "DiLM: Distilling Dataset into Language Model for Text-level Dataset Distillation." Journal of Natural Language Processing 32, no. 1 (2025): 252–82. https://doi.org/10.5715/jnlp.32.252.

Full text
APA, Harvard, Vancouver, ISO, and other styles
8

Kolesov, Anton, Dmitry Kamyshenkov, Maria Litovchenko, Elena Smekalova, Alexey Golovizin, and Alex Zhavoronkov. "On Multilabel Classification Methods of Incompletely Labeled Biomedical Text Data." Computational and Mathematical Methods in Medicine 2014 (2014): 1–11. http://dx.doi.org/10.1155/2014/781807.

Full text
Abstract:
Multilabel classification is often hindered by incompletely labeled training datasets; for some items of such a dataset (or even for all of them), some labels may be omitted. In this case, we cannot know whether any item is labeled fully and correctly. When we train a classifier directly on an incompletely labeled dataset, it performs poorly. To overcome the problem, we added an extra step, training-set modification, before training a classifier. In this paper, we try two algorithms for training-set modification: weighted k-nearest neighbor (WkNN) and soft supervised learning (SoftSL). Both of these approaches are based on similarity measurements between data vectors. We performed the experiments on AgingPortfolio (a text dataset) and then rechecked them on Yeast (non-text genetic data). We tried SVM and RF classifiers on the original datasets and then on the modified ones. For each dataset, our experiments demonstrated that both classification algorithms performed considerably better when preceded by the training-set modification step.
APA, Harvard, Vancouver, ISO, and other styles
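The weighted k-nearest-neighbor (WkNN) training-set modification described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses cosine similarity between feature vectors and adds to each item any label whose similarity-weighted vote among its k nearest neighbours exceeds a threshold.

```python
import math

def cosine(u, v):
    """Cosine similarity between two numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def wknn_complete(items, k=2, threshold=0.5):
    """items: list of (vector, label_set). Returns a copy in which each item
    also receives any label whose similarity-weighted neighbour vote
    exceeds `threshold` (a hypothetical completion rule)."""
    completed = []
    for i, (vec, labels) in enumerate(items):
        # k most similar other items
        neigh = sorted(
            ((cosine(vec, v2), l2) for j, (v2, l2) in enumerate(items) if j != i),
            key=lambda t: t[0], reverse=True)[:k]
        total = sum(s for s, _ in neigh) or 1.0
        votes = {}
        for s, neigh_labels in neigh:
            for lab in neigh_labels:
                votes[lab] = votes.get(lab, 0.0) + s
        new_labels = set(labels) | {l for l, w in votes.items() if w / total > threshold}
        completed.append((vec, new_labels))
    return completed
```

In the paper's setting, a classifier (SVM or RF) is then trained on the completed label sets rather than the raw incomplete ones.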
9

Tian, Jing, Wushour Slamu, Miaomiao Xu, Chunbo Xu, and Xue Wang. "Research on Aspect-Level Sentiment Analysis Based on Text Comments." Symmetry 14, no. 5 (2022): 1072. http://dx.doi.org/10.3390/sym14051072.

Full text
Abstract:
Sentiment analysis is the processing of textual data to assign a positive or negative opinion to sentences. In the ABSA dataset, most sentences contain one aspect of sentiment polarity, or sentences with one aspect have multiple identical sentiment polarities, which weakens the sentiment polarity of the ABSA dataset. Therefore, this paper uses the SemEval-14 Restaurant Review dataset, in which each document is symmetrically divided into individual sentences, and two versions of the dataset are created: ATSA (Aspect Term Sentiment Analysis) and ACSA (Aspect Category Sentiment Analysis). In order to symmetrically simulate the complex relationship between aspect contexts and accurately extract the polarity of emotional features, this paper follows the latest developments in NLP, combines a capsule network with BERT, and proposes the baseline model CapsNet-BERT. The experimental results verify the effectiveness of the model.
APA, Harvard, Vancouver, ISO, and other styles
10

Zhao, Huanhuan, Haihua Chen, Thomas A. Ruggles, Yunhe Feng, Debjani Singh, and Hong-Jun Yoon. "Improving Text Classification with Large Language Model-Based Data Augmentation." Electronics 13, no. 13 (2024): 2535. http://dx.doi.org/10.3390/electronics13132535.

Full text
Abstract:
Large Language Models (LLMs) such as ChatGPT possess advanced capabilities in understanding and generating text. These capabilities enable ChatGPT to create text based on specific instructions, which can serve as augmented data for text classification tasks. Previous studies have approached data augmentation (DA) by either rewriting the existing dataset with ChatGPT or generating entirely new data from scratch. However, it is unclear which method is better without comparing their effectiveness. This study investigates the application of both methods to two datasets: a general-topic dataset (Reuters news data) and a domain-specific dataset (Mitigation dataset). Our findings indicate that: 1. New data generated by ChatGPT consistently enhanced the model's classification results for both datasets. 2. Generating new data generally outperforms rewriting existing data, though crafting the prompts carefully is crucial to extract the most valuable information from ChatGPT, particularly for domain-specific data. 3. The augmentation data size affects the effectiveness of DA; however, we observed a plateau after incorporating 10 samples. 4. Combining rewritten samples with newly generated samples can potentially further improve the model's performance.
APA, Harvard, Vancouver, ISO, and other styles
11

Mediakov, Oleksandr, Dmytro Martjanov, and Vasyl Lytvyn. "SED-UA-SMALL: Ukrainian synthetic dataset for text embedding models." Vìsnik Nacìonalʹnogo unìversitetu "Lʹvìvsʹka polìtehnìka". Serìâ Ìnformacìjnì sistemi ta merežì 17 (June 2025): 403–10. https://doi.org/10.23939/sisn2025.17.403.

Full text
Abstract:
This paper presents Small Synthetic Embedding Dataset, a fully synthetic dataset in Ukrainian designed for training, fine-tuning, and evaluating text embedding models. The use of large language models (LLMs) allows for controlling the diversity of generated data in aspects such as NLP tasks, asymmetry between queries and documents, the presence of instructions, support for various languages, and avoidance of social biases. A zero-shot generation approach was used to create a set of Ukrainian query-document pairs with corresponding similarity scores. The dataset can be used to evaluate the quality of multilingual embedding models, as well as to train or fine-tune models to improve their effectiveness when working with Ukrainian texts. The paper covers a comprehensive description of the dataset construction process, including the parameters influencing the diversity of generated texts, the large language models used for the actual generation of the data, and an example of using the dataset to evaluate and compare selected multilingual embedding models on the task of semantic text similarity. Unlike existing Ukrainian datasets, which are mainly based on real texts, SED-UA-small is fully synthetic, providing greater flexibility in controlling the diversity and specificity of data for the needs of training and evaluating embedding models, and allowing for fast and cost-effective expansion of the dataset with high-quality entries if needed. We used a combination of open and proprietary large language models of different sizes to generate the first version of the dataset, consisting of 112 thousand text pairs, divided into training (~50%), testing (25%), and validation (25%) sets. The data is publicly available at https://huggingface.co/datasets/suntez13/sed-ua-small-sts-v1.
APA, Harvard, Vancouver, ISO, and other styles
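A 50/25/25 train/test/validation split like the one described for SED-UA-small can be reproduced with a simple shuffle-and-slice helper. The function below is a generic sketch, not the authors' code, and the seed value is arbitrary.

```python
import random

def split_dataset(pairs, seed=13, train=0.5, test=0.25):
    """Shuffle `pairs` deterministically and split into
    train/test/validation portions (~50/25/25 by default)."""
    data = list(pairs)
    random.Random(seed).shuffle(data)  # reproducible shuffle
    n = len(data)
    n_train = int(n * train)
    n_test = int(n * test)
    return (data[:n_train],
            data[n_train:n_train + n_test],
            data[n_train + n_test:])
```

Fixing the seed matters here: without it, re-running the split would leak items between the training and evaluation portions across runs.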
12

Clérice, Thibault, Malamatenia Vlachou-Efstathiou, and Alix Chagué. "CREMMA Medii Aevi: Literary Manuscript Text Recognition in Latin." Journal of Open Humanities Data 9 (April 12, 2023): 4. http://dx.doi.org/10.5334/johd.97.

Full text
Abstract:
This paper presents a novel segmentation and handwritten text recognition dataset for Medieval Latin from the 11th to the 16th century. It connects with Medieval French datasets, as well as earlier Latin datasets, by enforcing common guidelines, bringing 263,000 new characters and now totaling over a million characters for medieval manuscripts in both languages. We provide our own addition to Ariane Pinche’s Old French guidelines to deal with specific Latin cases. We also offer an overview of how we addressed this dataset compilation through the use of pre-existing resources. With a higher abbreviation ratio and a better representation of abbreviating marks, we offer new models that outperform the Old French base model on Latin datasets, improving accuracy by 5% on unknown Latin manuscripts.
APA, Harvard, Vancouver, ISO, and other styles
13

ALAEI, ALIREZA, UMAPADA PAL, and P. NAGABHUSHAN. "DATASET AND GROUND TRUTH FOR HANDWRITTEN TEXT IN FOUR DIFFERENT SCRIPTS." International Journal of Pattern Recognition and Artificial Intelligence 26, no. 04 (2012): 1253001. http://dx.doi.org/10.1142/s0218001412530011.

Full text
Abstract:
In document image analysis (DIA), especially in handwritten document recognition, standard databases play a significant role in evaluating the performance of algorithms and comparing results obtained by different groups of researchers. The field of DIA with regard to Indo-Persian documents is still in its infancy compared to Latin-script-based documents, and standard datasets are still not available in the literature. This paper is an effort towards alleviating this gap. In this paper, an unconstrained handwritten dataset containing documents in Persian, Bangla, Oriya, and Kannada (PBOK) is introduced. PBOK contains 707 text pages written in four different languages (Persian, Bangla, Oriya, and Kannada) by 436 individuals. The total numbers of text lines, words/subwords, and characters are 12,565, 104,541, and 423,980, respectively. Most documents in the PBOK dataset contain overlapping or touching text lines. The average number of text lines per page in the PBOK dataset is 18. Two types of ground truth, based on pixel information and content information, are generated for the dataset. Because of these ground truths, the PBOK dataset can be utilized in many areas of document image processing, e.g. text-line segmentation, word segmentation, and word recognition. To provide an insight for other researchers, recent text-line segmentation results on this dataset are also reported.
APA, Harvard, Vancouver, ISO, and other styles
14

Landro, Nicola, Ignazio Gallo, Riccardo La Grassa, and Edoardo Federici. "Two New Datasets for Italian-Language Abstractive Text Summarization." Information 13, no. 5 (2022): 228. http://dx.doi.org/10.3390/info13050228.

Full text
Abstract:
Text summarization aims to produce a short summary containing the relevant parts of a given text. Due to the lack of data for abstractive summarization in low-resource languages such as Italian, we propose two new original datasets, collected from two Italian news websites with multi-sentence summaries and corresponding articles, and from a dataset obtained by machine translation of a Spanish summarization dataset. These two datasets are currently the only ones available in Italian for this task. To evaluate the quality of these two datasets, we used them to train a T5-base model and an mBART model, obtaining good results with both. To better evaluate the results, we also compared the same models trained on automatically translated datasets, comparing the resulting summaries in the training language with the automatically translated summaries, which demonstrated the superiority of the models obtained from the proposed datasets.
APA, Harvard, Vancouver, ISO, and other styles
15

Pan, Hangyu, Yaoyi Xi, Ling Wang, Yu Nan, Zhizhong Su, and Rong Cao. "Dataset construction method of cross-lingual summarization based on filtering and text augmentation." PeerJ Computer Science 9 (March 28, 2023): e1299. http://dx.doi.org/10.7717/peerj-cs.1299.

Full text
Abstract:
Existing cross-lingual summarization (CLS) datasets suffer from inconsistent sample quality and small scale. To address these problems, we propose a method that jointly supervises quality and scale to build CLS datasets. In terms of quality supervision, the method adopts a multi-strategy filtering algorithm to remove low-quality samples of monolingual summarization (MS) from the perspectives of characters and semantics, thereby improving the quality of the MS dataset. In terms of scale supervision, the method adopts a text augmentation algorithm based on a pretrained model to increase the size of CLS datasets with quality assurance. This method was used to build an English-Chinese CLS dataset and evaluate it with a reasonable data quality evaluation framework. The evaluation results show that the dataset is of good quality and large size. These outcomes show that the proposed method may comprehensively improve quality and scale, thereby resulting in a high-quality and large-scale CLS dataset at a lower cost.
APA, Harvard, Vancouver, ISO, and other styles
16

Dong, Qianqian, Mingxuan Wang, Hao Zhou, Shuang Xu, Bo Xu, and Lei Li. "Consecutive Decoding for Speech-to-text Translation." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 14 (2021): 12738–48. http://dx.doi.org/10.1609/aaai.v35i14.17508.

Full text
Abstract:
Speech-to-text translation (ST), which directly translates the source language speech to the target language text, has attracted intensive attention recently. However, the combination of speech recognition and machine translation in a single model poses a heavy burden on the direct cross-modal cross-lingual mapping. To reduce the learning difficulty, we propose COnSecutive Transcription and Translation (COSTT), an integral approach for speech-to-text translation. The key idea is to generate source transcript and target translation text with a single decoder. It benefits the model training so that additional large parallel text corpus can be fully exploited to enhance the speech translation training. Our method is verified on three mainstream datasets, including Augmented LibriSpeech English-French dataset, TED English-German dataset, and TED English-Chinese dataset. Experiments show that our proposed COSTT outperforms the previous state-of-the-art methods. The code is available at https://github.com/dqqcasia/st.
APA, Harvard, Vancouver, ISO, and other styles
17

Blanco-Medina, Pablo, Eduardo Fidalgo, Enrique Alegre, Rocío Alaiz-Rodríguez, Francisco Jáñez-Martino, and Alexandra Bonnici. "Rectification and Super-Resolution Enhancements for Forensic Text Recognition." Sensors 20, no. 20 (2020): 5850. http://dx.doi.org/10.3390/s20205850.

Full text
Abstract:
Retrieving text embedded within images is a challenging task in real-world settings. Multiple problems such as low-resolution and the orientation of the text can hinder the extraction of information. These problems are common in environments such as Tor Darknet and Child Sexual Abuse images, where text extraction is crucial in the prevention of illegal activities. In this work, we evaluate eight text recognizers and, to increase the performance of text transcription, we combine these recognizers with rectification networks and super-resolution algorithms. We test our approach on four state-of-the-art and two custom datasets (TOICO-1K and Child Sexual Abuse (CSA)-text, based on text retrieved from Tor Darknet and Child Sexual Exploitation Material, respectively). We obtained a 0.3170 score of correctly recognized words in the TOICO-1K dataset when we combined Deep Convolutional Neural Networks (CNN) and rectification-based recognizers. For the CSA-text dataset, applying resolution enhancements achieved a final score of 0.6960. The highest performance increase was achieved on the ICDAR 2015 dataset, with an improvement of 4.83% when combining the MORAN recognizer and the Residual Dense resolution approach. We conclude that rectification outperforms super-resolution when applied separately, while their combination achieves the best average improvements in the chosen datasets.
APA, Harvard, Vancouver, ISO, and other styles
18

Lee, Seungsoo, Gyunyeop Kim, and Sangwoo Kang. "Cross-Domain Document Summarization Model via Two-Stage Curriculum Learning." Electronics 13, no. 17 (2024): 3425. http://dx.doi.org/10.3390/electronics13173425.

Full text
Abstract:
Generative document summarization is a natural language processing technique that generates short summary sentences while preserving the content of long texts. Various fine-tuned pre-trained document summarization models have been proposed, each using a specific single text-summarization dataset. However, each text-summarization dataset usually specializes in a particular downstream task, so it is difficult to cover cases involving multiple domains with a single dataset. Accordingly, when a generative document summarization model is fine-tuned to a specific dataset, it performs well on that dataset, whereas performance degrades by up to 45% on datasets not used during learning. In short, summarization models perform well on in-domain inputs, where the training and evaluation domains are the same, but poorly on out-of-domain inputs. In this paper, we propose a new curriculum-learning method that uses mixed datasets while training a generative summarization model, making it more robust on out-of-domain datasets. Compared to baseline model performance, our method showed 10%, 20%, and 10% lower performance degradation on CNN/DM, one of the two test datasets used, relative to XSum.
APA, Harvard, Vancouver, ISO, and other styles
19

Zhang, Geng, and Jianpeng Hu. "Enhanced industrial text classification via hyper variational graph-guided global context integration." PeerJ Computer Science 10 (January 5, 2024): e1788. http://dx.doi.org/10.7717/peerj-cs.1788.

Full text
Abstract:
Background: Joint local context, primarily processed by pre-trained models, has emerged as a prevailing technique for text classification. Nevertheless, there are relatively few classification applications on small samples of industrial text datasets.
Methods: In this study, an approach that employs the global enhanced context representation of a pre-trained model to classify industrial-domain text is proposed. To realize the proposed technique, we extract primary text representations and local context information as embeddings by leveraging the BERT pre-trained model. Moreover, we create a text information entropy matrix through statistical computation, fusing features to construct the matrix. Subsequently, we adopt the BERT embeddings and a hyper variational graph to guide the updating of the text information entropy matrix; this process is iterated three times and produces a hypergraph primary text representation that includes global context information. Additionally, we feed the primary BERT text feature representation into capsule networks for purification and expansion. Finally, the two representations are fused to obtain the final text representation, which is applied to text classification through a feature fusion module.
Results: The effectiveness of this method is validated through experiments on multiple datasets. Specifically, on the CHIP-CTC dataset, it achieves an accuracy of 86.82% and an F1 score of 82.87%. On the CLUEEmotion2020 dataset, the proposed model obtains an accuracy of 61.22% and an F1 score of 51.56%. On the N15News dataset, the accuracy and F1 score are 72.21% and 69.06%, respectively. Furthermore, when applied to an industrial patent dataset, the model produced promising results, with an accuracy of 91.84% and an F1 score of 79.71%. On all four datasets, the proposed model significantly improves over the baselines. The evaluation results on the four datasets indicate that our proposed model effectively solves the classification problem.
APA, Harvard, Vancouver, ISO, and other styles
20

Tran, Duc Chung, Duc Long Nguyen, and Mohd Fadzil Hassan. "Development and testing of an FPT.AI-based voicebot." Bulletin of Electrical Engineering and Informatics 9, no. 6 (2020): 2388–95. http://dx.doi.org/10.11591/eei.v9i6.2620.

Full text
Abstract:
In recent years, the voicebot has become a popular communication tool between humans and machines. In this paper, we introduce our voicebot integrating text-to-speech (TTS) and speech-to-text (STT) modules provided by FPT.AI. This voicebot can be considered a critical improvement over a typical chatbot because it can respond to human queries with both text and speech. FPT Open Speech, LibriSpeech datasets, and music files were used to test the accuracy and performance of the STT module. The TTS module was tested using text from news pages in both Vietnamese and English. To test the voicebot, Homestay Service topic questions and off-topic messages were input to the system. The TTS module achieved 100% accuracy in the Vietnamese text test and 72.66% accuracy in the English text test. In the STT module test, the accuracy for the FPT Open Speech dataset (Vietnamese) is 90.51% and for the LibriSpeech dataset (English) is 0%, while the accuracy in the music-file tests is 0% for both. The voicebot achieved 100% accuracy in its test. Since the FPT.AI STT and TTS modules were developed to support only Vietnamese, targeting the Vietnam market, it is reasonable that the test with the LibriSpeech dataset resulted in 0% accuracy.
APA, Harvard, Vancouver, ISO, and other styles
21

Tsiourlini, Maria, Katerina Tzafilkou, Dimitrios Karapiperis, and Christos Tjortjis. "Text Analytics on YouTube Comments for Food Products." Information 15, no. 10 (2024): 599. http://dx.doi.org/10.3390/info15100599.

Full text
Abstract:
YouTube is a popular social media platform in the contemporary digital landscape. The primary focus of this study is to explore the underlying sentiment in user comments about food-related videos on YouTube, specifically within two pivotal food categories: plant-based and hedonic products. We labeled comments using sentiment lexicons such as TextBlob, VADER, and Google's Sentiment Analysis (GSA) engine. Comment sentiment was classified using advanced machine learning (ML) algorithms, namely Support Vector Machines (SVM), Multinomial Naive Bayes, Random Forest, Logistic Regression, and XGBoost. The evaluation of these models encompassed key macro-average metrics, including accuracy, precision, recall, and F1-score. The results from GSA showed a high accuracy level, with SVM achieving 93% accuracy on the plant-based dataset and 96% on the hedonic dataset. In addition to sentiment analysis, we delved into user interactions within the two datasets, measuring crucial metrics such as views, likes, comments, and engagement rate. The findings illuminate significantly higher levels of views, likes, and comments in the hedonic food dataset, while the plant-based dataset maintains a superior overall engagement rate.
APA, Harvard, Vancouver, ISO, and other styles
22

AL-Banna, Alaa Ahmed, and Abeer K. AL-Mashhadany. "Natural Language Processing For Automatic text summarization [Datasets] - Survey." Wasit Journal of Computer and Mathematics Science 1, no. 4 (2022): 156–70. http://dx.doi.org/10.31185/wjcm.72.

Full text
Abstract:
Natural language processing has developed significantly in recent years, which has advanced the text summarization task. It is no longer limited to reducing text size or obtaining helpful information from a long document. It has begun to be used for getting answers from summaries, measuring the quality of sentiment analysis systems, research and mining techniques, document categorization, and natural language inference, which has increased the importance of scientific research into producing good summaries. This paper reviews the most used datasets in text summarization across different languages and types, with the most effective methods for each dataset. The results are reported using text summarization metrics. The review indicates that pre-training models achieved the highest results on the summary measures in most of the reviewed works. English datasets made up about 75% of those available to researchers, due to the extensive use of the English language. Other languages, such as Arabic and Hindi, suffered from a shortage of dataset sources, which limited progress in the academic field.
APA, Harvard, Vancouver, ISO, and other styles
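The "text summarization metrics" the survey above refers to are typically ROUGE scores. As a minimal illustration (not taken from the survey), ROUGE-1 F1 measures unigram overlap between a candidate summary and a reference:

```python
from collections import Counter

def rouge1_f(candidate, reference):
    """ROUGE-1 F1: harmonic mean of unigram precision and recall
    between a candidate summary and a single reference summary."""
    c = Counter(candidate.lower().split())
    r = Counter(reference.lower().split())
    overlap = sum((c & r).values())  # clipped unigram matches
    if not overlap:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)
```

Production evaluations usually also report ROUGE-2 and ROUGE-L and apply stemming; this sketch keeps only the core unigram computation.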
23

Zeng, Moran. "Research on AI-Generated Text Detection Based on Machine Learning Models." Transactions on Computer Science and Intelligent Systems Research 7 (November 25, 2024): 229–33. https://doi.org/10.62051/8k1jga32.

Full text
Abstract:
The purpose of this research is to ensure the authenticity of information, guarantee the reliability and credibility of information sources, and prevent the spread of false information, fabricated data, and misleading content. In the academic field, detecting AI-generated papers, articles, and assignments helps maintain academic integrity and prevent academic fraud and plagiarism. This study summarizes the characteristics of the three selected models, which are Logistic Regression, Support Vector Machine (SVM), and the Naive Bayes (NB) classifier, and provides recommendations and directions for improvement in the choice of detection models for AI-generated content. Through a comparison of the three models on the same dataset in terms of accuracy, precision, and F1-score, it is determined that logistic regression performs the best for this type of dataset, achieving superior performance with all metrics exceeding 90%. SVM shows suitability for large datasets, with metrics around 70% on this dataset. Naive Bayes, however, which is typically suited to smaller datasets, performs poorly here, achieving only 50% accuracy.
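The accuracy, precision, recall, and F1-score used to rank the three detectors all come straight from the binary confusion matrix; a minimal sketch, with made-up counts rather than the paper's results:

```python
# Precision, recall, F1, and accuracy from binary confusion-matrix counts.
# The counts below are illustrative numbers, not results from the paper.
def prf(tp, fp, fn, tn):
    precision = tp / (tp + fp)        # of predicted positives, how many are right
    recall = tp / (tp + fn)           # of actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return accuracy, precision, recall, f1

acc, p, r, f1 = prf(tp=45, fp=5, fn=5, tn=45)
print(round(acc, 2), round(p, 2), round(r, 2), round(f1, 2))  # 0.9 0.9 0.9 0.9
```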
APA, Harvard, Vancouver, ISO, and other styles
24

Kohila, R., and K. Arunesh. "Text Mining: Text Similarity Measure for News Articles Based on String Based Approach." Global Journal of Engineering Science and Research Management 3, no. 7 (2016): 35–42. https://doi.org/10.5281/zenodo.57373.

Full text
Abstract:
Nowadays, measuring document similarity plays an important role in text-related research. Document similarity measures have many applications, such as plagiarism detection, document clustering, automatic essay scoring, information retrieval, and machine translation. String-based similarity, knowledge-based similarity, and corpus-based similarity are the three major approaches most researchers propose for document similarity problems. In this paper, cosine similarity, a term-based algorithm from the string-based similarity family, is used to measure the similarity between documents. The nouns in the documents are extracted, and context-word synsets are also extracted using WordNet. A bigram dataset is created based on the context words. In the proposed method, the similarity between documents is measured with the cosine similarity algorithm over three document sets: the preprocessing dataset, the context-word dataset, and the bigram dataset. The context-word document set gives better similarity than the bigram and preprocessing document sets.
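The term-based cosine similarity the paper relies on can be sketched with the standard library alone; the two example sentences are illustrative:

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    """Cosine of the angle between raw term-frequency vectors."""
    a, b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

sim = cosine_similarity("the market fell sharply", "the market rose sharply")
print(round(sim, 3))  # 0.75
```

In the paper's pipeline the vectors would be built from extracted nouns, context words, or bigrams rather than raw tokens, but the cosine step is the same.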
APA, Harvard, Vancouver, ISO, and other styles
25

Gifu, Daniela, and Covaci Silviu-Vasile. "Artificial Intelligence vs. Human: Decoding Text Authenticity with Transformers." Future Internet 17, no. 1 (2025): 38. https://doi.org/10.3390/fi17010038.

Full text
Abstract:
This paper presents a comprehensive study on detecting AI-generated text using transformer models. Our research extends the existing RODICA dataset to create the Enhanced RODICA for Human-Authored and AI-Generated Text (ERH) dataset. We enriched RODICA by incorporating machine-generated texts from various large language models (LLMs), ensuring a diverse and representative corpus. Methodologically, we fine-tuned several transformer architectures, including BERT, RoBERTa, and DistilBERT, on this dataset to distinguish between human-written and AI-generated text. Our experiments examined both monolingual and multilingual settings, evaluating the model’s performance across diverse datasets such as M4, AICrowd, Indonesian Hoax News Detection, TURNBACKHOAX, and ERH. The results demonstrate that RoBERTa-large achieved superior accuracy and F-scores of around 83%, particularly in monolingual contexts, while DistilBERT-multilingual-cased excelled in multilingual scenarios, achieving accuracy and F-scores of around 72%. This study contributes a refined dataset and provides insights into model performance, highlighting the transformative potential of transformer models in detecting AI-generated content.
APA, Harvard, Vancouver, ISO, and other styles
26

ALBayari, Reem, and Sherief Abdallah. "Instagram-Based Benchmark Dataset for Cyberbullying Detection in Arabic Text." Data 7, no. 7 (2022): 83. http://dx.doi.org/10.3390/data7070083.

Full text
Abstract:
(1) Background: the ability to use social media to communicate without revealing one’s real identity has created an attractive setting for cyberbullying. Several studies targeted social media to collect their datasets with the aim of automatically detecting offensive language. However, the majority of the datasets were in English, not in Arabic, and among the few Arabic datasets that were collected, none focused on Instagram, despite it being a major social media platform in the Arab world. (2) Methods: we use the official Instagram APIs to collect our dataset. To establish the dataset as a benchmark, we use SPSS (Kappa statistic) to evaluate the inter-annotator agreement (IAA), as well as examine and evaluate the performance of various learning models (LR, SVM, RFC, and MNB). (3) Results: in this research, we present the first Instagram Arabic corpus (multi-class sub-class categorization) focusing on cyberbullying. The dataset is primarily designed for the purpose of detecting offensive language in texts. We end up with 200,000 comments, of which 46,898 were annotated by three human annotators. The results show that the SVM classifier outperforms the other classifiers, with an F1-score of 69% for bullying comments and 85% for positive comments.
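The Kappa statistic used for inter-annotator agreement can be sketched for a pair of annotators (Cohen's kappa; agreement among all three annotators would typically use Fleiss' kappa instead). The label sequences below are hypothetical:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each annotator's marginal label frequencies.
    expected = sum(ca[c] * cb[c] for c in ca.keys() | cb.keys()) / (n * n)
    return (observed - expected) / (1 - expected)

a = ["bully", "positive", "bully", "positive", "bully", "positive"]
b = ["bully", "positive", "bully", "positive", "positive", "positive"]
print(round(cohens_kappa(a, b), 3))  # 0.667
```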
APA, Harvard, Vancouver, ISO, and other styles
27

Saleh Al-Sheikh, Idris, Masnizah Mohd, and Lia Warlina. "A Review of Arabic Text Recognition Dataset." Asia-Pacific Journal of Information Technology and Multimedia 09, no. 01 (2020): 69–81. http://dx.doi.org/10.17576/apjitm-2020-0901-06.

Full text
APA, Harvard, Vancouver, ISO, and other styles
28

Gaikwad, Mayur, Swati Ahirrao, Shraddha Phansalkar, and Ketan Kotecha. "Multi-Ideology ISIS/Jihadist White Supremacist (MIWS) Dataset for Multi-Class Extremism Text Classification." Data 6, no. 11 (2021): 117. http://dx.doi.org/10.3390/data6110117.

Full text
Abstract:
Social media platforms are a popular choice for extremist organizations to disseminate their perceptions, beliefs, and ideologies. This information is generally based on selective reporting and is subjective in content. However, the radical presentation of this disinformation and its outreach on social media leads to an increased number of susceptible audiences. Hence, detection of extremist text on social media platforms is a significant area of research. The unavailability of extremism text datasets is a challenge in online extremism research, as is the lack of emphasis on classifying extremism text into propaganda, radicalization, and recruitment classes; the lack of data validation methods also limits the accuracy of extremism detection. This research addresses these challenges and presents a multi-ideology, multi-class extremism text seed dataset: the multi-ideology ISIS/Jihadist White Supremacist (MIWS) dataset, constructed from recent tweets collected from Twitter. The presented dataset can be employed effectively to classify extremist text into popular types like propaganda, radicalization, and recruitment. Additionally, the seed dataset is statistically validated with the coherence score of Latent Dirichlet Allocation (LDA) and word mover’s distance using a pretrained Google News vector. The dataset shows effectiveness in its construction, with good coherence scores within a topic and appropriate distance measures between topics. This dataset is the first publicly accessible multi-ideology, multi-class extremism text dataset to reinforce research on extremism text detection on social media platforms.
APA, Harvard, Vancouver, ISO, and other styles
29

Borandağ, Emin. "LSRM: A New Method for Turkish Text Classification." Applied Sciences 14, no. 23 (2024): 11143. http://dx.doi.org/10.3390/app142311143.

Full text
Abstract:
The text classification method is one of the most frequently used approaches in text mining studies. Text classification requires a model generation using a predefined dataset, and this model aims to assign uncategorized data to a correct category. In line with this purpose, this study used machine learning algorithms, deep learning algorithms, word embedding algorithms, and transfer-learning algorithms to classify Turkish texts using three diverse datasets, one of which is new, to analyze text classification performances for the Turkish language. The preparation process of the newly added dataset involved the variations in Turkish word usage patterns over the years, since it consisted of timestamp-enabled data. The study also developed a novel method named LSRM to increase the text classification performance for agglutinative languages such as Turkish. After testing the new method on datasets, the statistical ANOVA method revealed that applying the proposed LSRM method increased the classification performance.
APA, Harvard, Vancouver, ISO, and other styles
30

Mohammad, Adel Hamdan. "Arabic Text Classification: A Review." Modern Applied Science 13, no. 5 (2019): 88. http://dx.doi.org/10.5539/mas.v13n5p88.

Full text
Abstract:
Text classification is an important topic: the number of electronic documents available online is massive, and text classification aims to assign documents to a set of predefined categories. The number of studies conducted on English datasets is large in comparison with the number conducted on Arabic datasets, so this research can serve as a reference for researchers who deal with Arabic data. This research applies the best-known text classification algorithms to an Arabic dataset that is large in comparison with the Arabic datasets used in most other studies. In addition, different feature selection and weighting methods are applied to the documents. Researchers writing studies on Arabic datasets should find this work helpful. The algorithms used in this research are naïve Bayes, support vector machines, artificial neural networks, k-nearest neighbors, the C4.5 decision tree, and the Rocchio classifier.
APA, Harvard, Vancouver, ISO, and other styles
31

Baidari, Ishwar, and Channamma Patil. "A Criterion for Deciding the Number of Clusters in a Dataset Based on Data Depth." Vietnam Journal of Computer Science 07, no. 04 (2020): 417–31. http://dx.doi.org/10.1142/s2196888820500232.

Full text
Abstract:
Clustering is a key method in unsupervised learning with various applications in data mining, pattern recognition, and intelligent information processing. However, the number of groups to be formed, usually denoted as k, is a vital parameter for most of the existing clustering algorithms, as their clustering results depend heavily on this parameter. The problem of finding the optimal k value is very challenging. This paper proposes a novel idea for finding the correct number of groups in a dataset based on data depth. The idea is to avoid the traditional process of running the clustering algorithm over a dataset for k times and, further, to find the k value for a dataset without setting any specific search range for the k parameter. We experiment with different indices, namely CH, KL, Silhouette, Gap, CSP, and the proposed method on different real and synthetic datasets to estimate the correct number of groups in a dataset. The experimental results on real and synthetic datasets indicate good performance of the proposed method.
APA, Harvard, Vancouver, ISO, and other styles
32

Yaseen, Asif. "Data Classification Using Decision Trees J48 Algorithm for Text Mining of Business Data." Lahore Garrison University Research Journal of Computer Science and Information Technology 5, no. 2 (2021): 73–81. http://dx.doi.org/10.54692/lgurjcsit.2021.0502210.

Full text
Abstract:
The business industry generates a lot of data on daily business deals and financial transactions. These businesses are data-intensive: they place customer satisfaction on top priority, fulfill customers' needs, and so on, and data is produced at every step. This data has great value that is hidden from regular users, and data analytics is used to uncover those values. In our project, we use a business-related dataset that contains strings and their class (0 or 1), where 0 or 1 denotes the positive or negative string label. To analyze this data, we use a decision tree classification algorithm (specifically J48) to perform text mining (classification) on our target dataset. Text mining falls under supervised learning. In text mining, we generally use two datasets: one is used to train the model, and the second is used for prediction, with the missing class labels in the second dataset predicted by the model trained on the first.
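The two-dataset workflow described above (train on labeled strings, then predict the missing labels of a second dataset) can be sketched with a toy vote-counting classifier standing in for J48; the training strings and labels are made up:

```python
from collections import Counter, defaultdict

def train(labeled):
    """Count how often each token appears under each label
    (a toy stand-in for the tree J48 would learn)."""
    token_label = defaultdict(Counter)
    for text, label in labeled:
        for tok in set(text.lower().split()):
            token_label[tok][label] += 1
    return token_label

def predict(model, text):
    """Vote with per-token label counts; unknown text falls back to 0."""
    votes = Counter()
    for tok in set(text.lower().split()):
        votes.update(model.get(tok, Counter()))
    return votes.most_common(1)[0][0] if votes else 0

train_set = [("payment received thanks", 1), ("great service", 1),
             ("refund denied complaint", 0), ("order lost complaint", 0)]
model = train(train_set)
print(predict(model, "service complaint"))  # 0
```

An actual J48 run would instead learn information-gain splits over word features, but the train/predict separation across the two datasets is the same.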
APA, Harvard, Vancouver, ISO, and other styles
33

Aurelia, Miracle, Sheila Monica, and Abba Suganda Girsang. "Transformer-based abstractive indonesian text summarization." International Journal of Informatics and Communication Technology (IJ-ICT) 13, no. 3 (2024): 388. http://dx.doi.org/10.11591/ijict.v13i3.pp388-399.

Full text
Abstract:
The volume of data created, captured, copied, and consumed worldwide has increased from 2 zettabytes in 2010 to over 97 zettabytes in 2020, with an estimated 181 zettabytes in 2025. Automatic text summarization (ATS) eases the extraction of key points of information and increases efficiency in the time needed to understand it. Therefore, improving ATS performance in summarizing news articles is the goal of this paper. This work fine-tunes the BART model on the IndoSum, Liputan6, and augmented Liputan6 datasets for abstractive summarization; data augmentation for Liputan6 is performed with the ChatGPT method. This work uses recall-oriented understudy for gisting evaluation (ROUGE) as the evaluation metric. The data augmentation with ChatGPT used 10% of the clean news articles from the Liputan6 training dataset, with ChatGPT generating an abstractive summary for each input, culminating in over 36 thousand examples for the model’s fine-tuning. The BART model fine-tuned on the IndoSum, Liputan6, and augmented Liputan6 datasets has the best ROUGE-2 score, outperforming ORACLE’s model, although ORACLE still has the best ROUGE-1 and ROUGE-L scores. This shows that fine-tuning the BART model with multiple datasets increases its performance on abstractive summarization tasks.
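ROUGE-N, the recall-oriented n-gram overlap used for evaluation here, reduces to a short function; the reference and candidate summaries below are illustrative, not drawn from Liputan6:

```python
from collections import Counter

def rouge_n(reference, candidate, n=2):
    """Recall-oriented n-gram overlap (ROUGE-N) between two summaries."""
    def ngrams(text):
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    ref, cand = ngrams(reference), ngrams(candidate)
    overlap = sum(min(ref[g], cand[g]) for g in ref)  # clipped counts
    return overlap / sum(ref.values()) if ref else 0.0

ref = "banjir melanda jakarta pagi ini"
cand = "banjir melanda jakarta hari ini"
print(round(rouge_n(ref, cand, n=2), 2))  # 0.5
```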
APA, Harvard, Vancouver, ISO, and other styles
34

Aurelia, Miracle, Sheila Monica, and Abba Suganda Girsang. "Transformer-based abstractive indonesian text summarization." International Journal of Informatics and Communication Technology 13, no. 3 (2024): 388–99. https://doi.org/10.11591/ijict.v13i3.pp388-399.

Full text
Abstract:
The volume of data created, captured, copied, and consumed worldwide has increased from 2 zettabytes in 2010 to over 97 zettabytes in 2020, with an estimated 181 zettabytes in 2025. Automatic text summarization (ATS) eases the extraction of key points of information and increases efficiency in the time needed to understand it. Therefore, improving ATS performance in summarizing news articles is the goal of this paper. This work fine-tunes the BART model on the IndoSum, Liputan6, and augmented Liputan6 datasets for abstractive summarization; data augmentation for Liputan6 is performed with the ChatGPT method. This work uses recall-oriented understudy for gisting evaluation (ROUGE) as the evaluation metric. The data augmentation with ChatGPT used 10% of the clean news articles from the Liputan6 training dataset, with ChatGPT generating an abstractive summary for each input, culminating in over 36 thousand examples for the model’s fine-tuning. The BART model fine-tuned on the IndoSum, Liputan6, and augmented Liputan6 datasets has the best ROUGE-2 score, outperforming ORACLE’s model, although ORACLE still has the best ROUGE-1 and ROUGE-L scores. This shows that fine-tuning the BART model with multiple datasets increases its performance on abstractive summarization tasks.
APA, Harvard, Vancouver, ISO, and other styles
35

Itsnaini, Qurrota A’yuna, Mardhiya Hayaty, Andriyan Dwi Putra, and Nidal A. M. Jabari. "Abstractive Text Summarization using Pre-Trained Language Model "Text-to-Text Transfer Transformer (T5)"." ILKOM Jurnal Ilmiah 15, no. 1 (2023): 124–31. http://dx.doi.org/10.33096/ilkom.v15i1.1532.124-131.

Full text
Abstract:
Automatic Text Summarization (ATS) is an application of text processing technology that assists humans in producing a summary or the key points of documents in large quantities. We use Indonesian as the object language because there are few resources in NLP research for Indonesian. This paper utilizes a PLTM (Pre-Trained Language Model) from the transformer architecture, namely T5 (Text-to-Text Transfer Transformer), which was previously pre-trained on a larger dataset. Evaluation in this study was measured through a comparison of ROUGE (Recall-Oriented Understudy for Gisting Evaluation) scores between the reference summary and the model summary. The experiments with the pre-trained t5-base model (220M parameters) fine-tuned on the Indonesian news dataset yielded relatively high ROUGE values, namely ROUGE-1 = 0.68, ROUGE-2 = 0.61, and ROUGE-L = 0.65. The evaluation scores are good, but the resulting model has not achieved fully satisfactory results because, in terms of abstraction, the model did not work optimally. We also found several errors in the reference summaries of the dataset used.
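ROUGE-L, reported above alongside ROUGE-1 and ROUGE-2, scores the longest common subsequence of tokens; a minimal recall-only sketch with illustrative sentences:

```python
def rouge_l_recall(reference, candidate):
    """ROUGE-L recall: token-level LCS length over reference length."""
    r, c = reference.lower().split(), candidate.lower().split()
    if not r:
        return 0.0
    # Classic dynamic-programming LCS table.
    dp = [[0] * (len(c) + 1) for _ in range(len(r) + 1)]
    for i, rt in enumerate(r):
        for j, ct in enumerate(c):
            dp[i + 1][j + 1] = dp[i][j] + 1 if rt == ct else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(r)][len(c)] / len(r)

print(round(rouge_l_recall("the cat sat on the mat", "the cat lay on the mat"), 2))  # 0.83
```

Unlike ROUGE-N, the subsequence need not be contiguous, so ROUGE-L rewards summaries that preserve sentence-level word order.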
APA, Harvard, Vancouver, ISO, and other styles
36

Qasim, Rukhma, Waqas Haider Bangyal, Mohammed A. Alqarni, and Abdulwahab Ali Almazroi. "A Fine-Tuned BERT-Based Transfer Learning Approach for Text Classification." Journal of Healthcare Engineering 2022 (January 7, 2022): 1–17. http://dx.doi.org/10.1155/2022/3498123.

Full text
Abstract:
The text classification problem has been thoroughly studied in information retrieval and data mining tasks. It is beneficial in multiple applications, including medical diagnosis in the health care domain, targeted marketing, the entertainment industry, and group filtering processes. Recent innovations in both data mining and natural language processing have drawn the attention of researchers from all over the world to developing automated systems for text classification. NLP allows categorizing documents containing different texts. A huge amount of data is generated on social media sites by social media users. Three datasets have been used for experimental purposes: the COVID-19 fake news dataset, the COVID-19 English tweet dataset, and the extremist-non-extremist dataset, which contain news blogs, posts, and tweets related to coronavirus and hate speech. Transfer learning approaches had not previously been applied to the COVID-19 fake news and extremist-non-extremist datasets; therefore, the proposed work applies transfer learning classification models to both datasets to check their performance. Models are trained and evaluated on accuracy, precision, recall, and F1-score, and heat maps are also generated for every model. In the end, future directions are proposed.
APA, Harvard, Vancouver, ISO, and other styles
37

Rahma, Iftitah Athiyyah, and Lya Hulliyyatus Suadaa. "Penerapan Text Augmentation untuk Mengatasi Data yang Tidak Seimbang pada Klasifikasi Teks Berbahasa Indonesia." Jurnal Teknologi Informasi dan Ilmu Komputer 10, no. 6 (2023): 1329–40. http://dx.doi.org/10.25126/jtiik.1067325.

Full text
Abstract:
Text classification is one of the fundamental tasks in natural language processing (NLP). In real-world applications, however, the data and resources available for text classification are limited. One constraint on labeled data is imbalanced data, the condition in which one label has far more examples than the others. Imbalanced data affects the performance and accuracy of the model because the model focuses only on the majority-label data, while minority-label data tends to be classified incorrectly, even though in some cases the model's ability to predict data with minority labels is more important. To address this, this research uses an oversampling approach, adding data to balance the dataset; applied to text data, oversampling is known as text augmentation. This research applies two text augmentation techniques, synonym replacement and back translation, under several imbalance conditions and augmentation scenarios on two datasets. Based on the experimental results, augmentation can increase the F1-score of the minority label, and the effect is more significant on small datasets and under severe imbalance conditions. The results of the back translation technique are better than those of synonym replacement. In addition, the results show that the amount of augmentation also affects the increase in F1-score; however, more augmented data does not necessarily give better results, as overfitting on the training data was indicated. Non-standard words in informal text datasets affect the augmentation process, so the synthetic text obtained is not as good as on formal text datasets.
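Of the two augmentation techniques, synonym replacement is the simpler to sketch; the synonym table below is a hypothetical stand-in for a real Indonesian thesaurus, not a resource the study names:

```python
import random

# Toy synonym-replacement augmenter; SYNONYMS is a made-up mini-thesaurus.
SYNONYMS = {"bagus": ["baik", "hebat"], "cepat": ["kilat", "lekas"]}

def synonym_replacement(sentence, n=1, seed=0):
    """Replace up to n replaceable words with a random synonym."""
    rng = random.Random(seed)  # seeded for reproducible augmentation
    words = sentence.split()
    replaceable = [i for i, w in enumerate(words) if w in SYNONYMS]
    for i in rng.sample(replaceable, min(n, len(replaceable))):
        words[i] = rng.choice(SYNONYMS[words[i]])
    return " ".join(words)

print(synonym_replacement("layanan bagus dan cepat", n=2))
```

Each augmented sentence keeps the label of its source sentence, which is how oversampling the minority class rebalances the training set.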
APA, Harvard, Vancouver, ISO, and other styles
38

Rahma, Iftitah Athiyyah, and Lya Hulliyyatus Suadaa. "Penerapan Text Augmentation untuk Mengatasi Data yang Tidak Seimbang pada Klasifikasi Teks Berbahasa Indonesia." Jurnal Teknologi Informasi dan Ilmu Komputer 10, no. 6 (2023): 1329–40. https://doi.org/10.25126/jtiik.2023107325.

Full text
Abstract:
Text classification is one of the fundamental tasks in natural language processing (NLP). In real-world applications, however, the data and resources available for text classification are limited. One constraint on labeled data is imbalanced data, the condition in which one label has far more examples than the others. Imbalanced data affects the performance and accuracy of the model because the model focuses only on the majority-label data, while minority-label data tends to be classified incorrectly, even though in some cases the model's ability to predict data with minority labels is more important. To address this, this research uses an oversampling approach, adding data to balance the dataset; applied to text data, oversampling is known as text augmentation. This research applies two text augmentation techniques, synonym replacement and back translation, under several imbalance conditions and augmentation scenarios on two datasets. Based on the experimental results, augmentation can increase the F1-score of the minority label, and the effect is more significant on small datasets and under severe imbalance conditions. The results of the back translation technique are better than those of synonym replacement. In addition, the results show that the amount of augmentation also affects the increase in F1-score; however, more augmented data does not necessarily give better results, as overfitting on the training data was indicated. Non-standard words in informal text datasets affect the augmentation process, so the synthetic text obtained is not as good as on formal text datasets.
APA, Harvard, Vancouver, ISO, and other styles
39

Tran, Duc Chung, Duc Long Nguyen, and Mohd Fadzil Hassan. "Development and testing of an FPT.AI-based voicebot." Bulletin of Electrical Engineering and Informatics 9, no. 6 (2020): 2388–95. https://doi.org/10.11591/eei.v9i6.2620.

Full text
Abstract:
In recent years, the voicebot has become a popular communication tool between humans and machines. In this paper, we introduce our voicebot integrating the text-to-speech (TTS) and speech-to-text (STT) modules provided by FPT.AI. This voicebot can be considered a critical improvement over a typical chatbot because it can respond to human queries in both text and speech. The FPT Open Speech and LibriSpeech datasets and music files were used to test the accuracy and performance of the STT module. The TTS module was tested using text from news pages in both Vietnamese and English. To test the voicebot, Homestay Service topic questions and off-topic messages were input to the system. The TTS module achieved 100% accuracy in the Vietnamese text test and 72.66% accuracy in the English text test. In the STT module test, the accuracy on the FPT Open Speech dataset (Vietnamese) is 90.51% and on the LibriSpeech dataset (English) is 0%, while the accuracy in the music files test is 0% for both. The voicebot achieved 100% accuracy in its test. Since the FPT.AI STT and TTS modules were developed to support only Vietnamese for dominating the Vietnam market, i
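One common way to score an STT transcript against a reference is word accuracy (1 minus word error rate), computed from word-level Levenshtein distance; this is a generic metric sketch, not FPT.AI's actual scoring, and the strings are illustrative:

```python
def word_accuracy(reference, hypothesis):
    """1 - WER, from word-level edit distance (single-row Levenshtein)."""
    r, h = reference.lower().split(), hypothesis.lower().split()
    if not r:
        return 0.0
    dp = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        prev, dp[0] = dp[0], i
        for j, hw in enumerate(h, 1):
            cur = dp[j]
            # deletion, insertion, or (mis)match
            dp[j] = min(dp[j] + 1, dp[j - 1] + 1, prev + (rw != hw))
            prev = cur
    return max(0.0, 1 - dp[len(h)] / len(r))

print(word_accuracy("xin chao cac ban", "xin chao cac banh"))  # 0.75
```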
APA, Harvard, Vancouver, ISO, and other styles
40

Ali, Hazrat, Khalid Iqbal, Ghulam Mujtaba, et al. "Urdu text in natural scene images: a new dataset and preliminary text detection." PeerJ Computer Science 7 (September 16, 2021): e717. http://dx.doi.org/10.7717/peerj-cs.717.

Full text
Abstract:
Text detection in natural scene images for content analysis is an interesting task. The research community has seen great developments in English/Mandarin text detection; however, Urdu text extraction in natural scene images is a task not well addressed. In this work, firstly, a new dataset is introduced for Urdu text in natural scene images. The dataset comprises 500 standalone images acquired from real scenes. Secondly, the channel-enhanced Maximally Stable Extremal Region (MSER) method is applied to extract Urdu text regions as candidates in an image. A two-stage filtering mechanism is applied to eliminate non-candidate regions. In the first stage, text and noise are classified based on their geometric properties. In the second stage, a support vector machine classifier is trained to discard non-text candidate regions. After this, text candidate regions are linked using centroid-based vertical and horizontal distances. Text lines are further analyzed by a different classifier based on HOG features to remove non-text regions. Extensive experimentation is performed on the locally developed dataset to evaluate the performance. The experimental results show good performance on test set images. The dataset will be made available for research use. To the best of our knowledge, this work is the first of its kind for the Urdu language and provides a good dataset for free research use, serving as a baseline on the task of Urdu text extraction.
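The first-stage geometric filtering of MSER candidates can be sketched as simple shape tests on each region's bounding box; the thresholds and region tuples (width, height, pixel area) below are illustrative assumptions, not the paper's tuned values:

```python
# First-stage geometric filtering of MSER candidate regions.
# Thresholds are illustrative, not the paper's tuned values.
def is_text_candidate(w, h, area, min_fill=0.2, max_aspect=8.0, min_h=8):
    """Keep regions whose shape is plausible for a text component."""
    aspect = max(w, h) / min(w, h)
    fill = area / (w * h)          # fraction of bounding box covered
    return h >= min_h and aspect <= max_aspect and fill >= min_fill

# (width, height, pixel area) per candidate region
regions = [(40, 20, 500), (300, 10, 900), (15, 30, 60)]
kept = [r for r in regions if is_text_candidate(*r)]
print(kept)  # [(40, 20, 500)]
```

Survivors of this cheap geometric pass are then handed to the trained SVM for the second, learned filtering stage.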
APA, Harvard, Vancouver, ISO, and other styles
41

Dashenkov, Dmytro, and Kirill Smelyakov. "Extending the ImageNET dataset for multimodal text and image learning." INNOVATIVE TECHNOLOGIES AND SCIENTIFIC SOLUTIONS FOR INDUSTRIES, no. 1(31) (March 31, 2025): 20–31. https://doi.org/10.30837/2522-9818.2025.1.020.

Full text
Abstract:
Subject matter: image processing methods for classification and other computer vision tasks using multimodal data, including text descriptions of classes and images. Goal: development of a multimodal dataset for image classification using textual meta-information analysis. The resulting dataset should consist of image data; image classes, namely the 1000 classes of objects depicted in photos from the ImageNet set; textual descriptions of individual images; and textual descriptions of image classes as a whole. Tasks: 1) based on the images of the ImageNet dataset, compile a dataset for training classifier models with text descriptions of image classes and individual images; 2) based on the obtained dataset, conduct an experiment on training a language neural network to confirm the effectiveness of this approach to the classification problem. Methods: manual dataset compilation and training of language neural networks based on the RoBERTa architecture. The neural network training was carried out using the fine-tuning method, namely adding a neural network layer to an existing model to obtain a new machine learning model capable of performing the selected task. Results: the result of the work is a dataset that combines image data with text data. The resulting dataset is useful for establishing a connection between the information that a machine learning model is able to extract from photos and the information that the model can extract from text data. The multimodal approach can be used to solve a wide range of problems, as demonstrated by the example of training a language neural network: the trained language model processes the descriptions of images contained in the dataset and predicts the class of the image with which each description is associated. The model is designed to filter out irrelevant text metadata, improving the quality of the dataset.
Conclusions: data sets that combine multiple types of data can provide a broader context for solving problems that are typically associated with only one type of data, allowing for more effective application of machine learning methods.
APA, Harvard, Vancouver, ISO, and other styles
42

Shao, Shiyu. "Research on Story Text Generation Based on Transformer Model." Applied and Computational Engineering 175, no. 1 (2025): 8–17. https://doi.org/10.54254/2755-2721/2025.ast24685.

Full text
Abstract:
A transformer model was used to train and generate story text because certain parts or endings of the original stories were not satisfactory; this study used model training to obtain alternative story paths. Two approaches are examined: how to fine-tune a pre-trained model to achieve the desired effect, and how to build and train a model from scratch to achieve the same effect. DeepSeek R1 was used as a control group to evaluate the generation quality. According to the results, the pre-trained model performs better on smaller datasets, generating logical sentences and paragraphs, while the model trained from scratch has not yet achieved good results on smaller datasets. As an improvement measure, a larger dataset will be used to enhance the model's generation performance, with new hyperparameters adjusted to fit the dataset.
APA, Harvard, Vancouver, ISO, and other styles
43

Tsimperidis, Ioannis, Olga-Dimitra Asvesta, Eleni Vrochidou, and George A. Papakostas. "IKDD: A Keystroke Dynamics Dataset for User Classification." Information 15, no. 9 (2024): 511. http://dx.doi.org/10.3390/info15090511.

Full text
Abstract:
Keystroke dynamics is the field of computer science that exploits data derived from the way users type. It has been used in authentication systems, in the identification of user characteristics for forensic or commercial purposes, and to identify the physical and mental state of users for purposes that serve human–computer interaction. Studies of keystroke dynamics have used datasets created from volunteers recording fixed-text typing or free-text typing. Unfortunately, there are not enough keystroke dynamics datasets available on the Internet, especially from the free-text category, because they contain sensitive and personal information from the volunteers. In this work, a free-text dataset is presented, which consists of 533 logfiles, each of which contains data from 3500 keystrokes, coming from 164 volunteers. Specifically, the software developed to record user typing is described, the demographics of the volunteers who participated are given, the structure of the dataset is analyzed, and the experiments performed on the dataset justify its utility.
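Keystroke dynamics data such as that recorded in IKDD is typically reduced to timing features like dwell time (how long a key is held) and flight time (the gap between releasing one key and pressing the next). A minimal sketch of this feature extraction, with an event format and timing values that are purely illustrative and not taken from the dataset:

```python
# Sketch: extracting dwell and flight times from key press/release timestamps.
# The event tuples and millisecond values below are hypothetical examples.

def extract_features(events):
    """events: list of (key, press_time_ms, release_time_ms) in typing order.
    Returns dwell times per keystroke and flight times between keystrokes."""
    dwell = [release - press for _, press, release in events]
    flight = [events[i + 1][1] - events[i][2] for i in range(len(events) - 1)]
    return dwell, flight

# Example: three keystrokes
events = [("h", 0, 90), ("i", 150, 230), ("!", 400, 470)]
dwell, flight = extract_features(events)
print(dwell)   # dwell times in ms: [90, 80, 70]
print(flight)  # flight times in ms: [60, 170]
```

Features like these, aggregated per logfile, are what classifiers in keystroke dynamics studies usually consume.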
APA, Harvard, Vancouver, ISO, and other styles
44

Atliha, Viktar, and Dmitrij Šešok. "Text Augmentation Using BERT for Image Captioning." Applied Sciences 10, no. 17 (2020): 5978. http://dx.doi.org/10.3390/app10175978.

Full text
Abstract:
Image captioning is an important task for improving human-computer interaction as well as for a deeper understanding of the mechanisms underlying image description by humans. In recent years, this research field has rapidly developed and a number of impressive results have been achieved. The typical models are based on neural networks, including convolutional ones for encoding images and recurrent ones for decoding them into text. Moreover, attention mechanisms and transformers are actively used for boosting performance. However, even the best models are limited in quality by a lack of data: generating a variety of descriptions of objects in different situations requires a large training set. The currently used datasets, although rather large in terms of the number of images, are quite small in terms of the number of different captions per image. We expanded the training dataset using text augmentation methods: augmentation with synonyms as a baseline and the state-of-the-art language model called Bidirectional Encoder Representations from Transformers (BERT). As a result, models trained on the augmented datasets show better results than models trained on the dataset without augmentation.
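The synonym-replacement baseline mentioned above can be sketched in a few lines; the BERT-based variant would instead mask a caption word and substitute the language model's top predictions. The synonym table below is a toy stand-in for illustration, not the authors' actual resource:

```python
import random

# Toy synonym table; a real baseline would use a resource such as WordNet,
# and the BERT variant would use masked-token predictions instead.
SYNONYMS = {
    "big": ["large", "huge"],
    "dog": ["puppy", "hound"],
}

def augment(caption, synonyms, seed=0):
    """Produce a caption variant by replacing each word that has synonyms
    with a randomly chosen alternative."""
    rng = random.Random(seed)
    words = caption.split()
    return " ".join(rng.choice(synonyms[w]) if w in synonyms else w
                    for w in words)

print(augment("a big dog runs", SYNONYMS))
```

Applying this to every caption multiplies the number of distinct captions per image, which is exactly the shortage the abstract identifies.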
APA, Harvard, Vancouver, ISO, and other styles
45

Yang, Jingwen, and Ruohua Zhou. "Whisper40: A Multi-Person Chinese Whisper Speaker Recognition Dataset Containing Same-Text Neutral Speech." Information 15, no. 4 (2024): 184. http://dx.doi.org/10.3390/info15040184.

Full text
Abstract:
Whisper speaker recognition (WSR) has received extensive attention from researchers in recent years, and it plays an important role in medical, judicial, and other fields. The establishment of a whisper dataset is very important for the study of WSR. However, the existing whisper datasets suffer from a small number of speakers, short speech duration, and the lack of neutral speech with the same text as the whispered speech in the same dataset. To address this issue, we present Whisper40, a multi-person Chinese WSR dataset containing same-text neutral speech spanning around 655.90 min, sourced from volunteers. In addition, we use the current state-of-the-art speaker recognition model to build a WSR baseline system and apply transfer learning: the speaker recognition model is pre-trained on neutral speech datasets, and the empirical knowledge of specific network layers is transferred to the WSR system. The Whisper40 and CHAINS datasets are then used to fine-tune the model with the transferred layers. The experimental results show that the Whisper40 dataset is practical, and the time delay neural network (TDNN) model performs well in both the same-scene and cross-scene experiments. The equal error rate (EER) of Chinese WSR after transfer learning is reduced by 27.62% compared with the baseline.
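The equal error rate reported above is the operating point at which the false acceptance rate and the false rejection rate meet. A minimal sketch of estimating it from trial scores by sweeping a decision threshold (the score values are illustrative, not from the paper):

```python
def eer(genuine, impostor):
    """Approximate the equal error rate by sweeping a threshold over all
    observed scores. genuine: similarity scores for same-speaker trials
    (higher = more similar); impostor: scores for different-speaker trials."""
    best = 1.0
    for t in sorted(genuine + impostor):
        frr = sum(g < t for g in genuine) / len(genuine)    # false rejections
        far = sum(i >= t for i in impostor) / len(impostor)  # false acceptances
        best = min(best, max(frr, far))
    return best

# Perfectly separated scores give an EER of 0.0
print(eer([0.9, 0.8, 0.85], [0.1, 0.2, 0.15]))  # 0.0
```

Production toolkits interpolate the ROC curve rather than sweeping discrete thresholds, but the quantity being estimated is the same.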
APA, Harvard, Vancouver, ISO, and other styles
46

Ng, Chun Chet, Che-Tsung Lin, Zhi Qin Tan, et al. "When IC meets text: Towards a rich annotated integrated circuit text dataset." Pattern Recognition 147 (March 2024): 110124. http://dx.doi.org/10.1016/j.patcog.2023.110124.

Full text
APA, Harvard, Vancouver, ISO, and other styles
47

Hsiao, Tzu‐Kun. "Document Conflation of a Large Scholarly Full‐text Dataset." Proceedings of the Association for Information Science and Technology 60, no. 1 (2023): 977–79. http://dx.doi.org/10.1002/pra2.917.

Full text
Abstract:
The availability of large scholarly full‐text datasets with in‐text citations annotated opens the opportunity to investigate how articles have been cited in scientific literature at scale. However, duplicate documents may exist in a dataset, and these duplicates may impact downstream analysis such as calculating citation counts. Document conflation is the task of identifying documents that are nearly identical to each other. This study evaluates document conflation in the Semantic Scholar Open Research Corpus (S2ORC), a dataset containing over 12 million scholarly articles. The evaluation was based on 6,099,232 full‐text S2ORC documents with PubMed IDs (PMIDs) or PubMed Central IDs (PMCIDs). Our findings showed that a portion of S2ORC might contain duplicates. Of the 6,099,232 full‐text documents, 1,280,196 (20.99%) had the same PMIDs or PMCIDs as at least one other document. Pairwise comparisons of their full text found that at least 9.44% of the documents in S2ORC had duplicates.
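The duplicate statistic above follows from grouping documents by their shared PubMed identifiers; a minimal sketch of that grouping step, with record contents invented for illustration:

```python
from collections import defaultdict

def count_duplicated(docs):
    """docs: list of (doc_id, pmid) pairs. Returns the number of documents
    that share a PMID (or PMCID) with at least one other document."""
    groups = defaultdict(list)
    for doc_id, pmid in docs:
        groups[pmid].append(doc_id)
    return sum(len(ids) for ids in groups.values() if len(ids) > 1)

docs = [("s2-1", "PMID:100"), ("s2-2", "PMID:100"), ("s2-3", "PMID:200")]
print(count_duplicated(docs))  # 2: the two documents sharing PMID:100
```

Identifier-based grouping is only the first pass; as the abstract notes, full-text pairwise comparison is still needed to confirm which grouped documents are truly near-identical.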
APA, Harvard, Vancouver, ISO, and other styles
48

Kim, Jaehyun, Seongwook Yoon, Taehyeon Choi, and Sanghoon Sull. "Unsupervised Video Anomaly Detection Based on Similarity with Predefined Text Descriptions." Sensors 23, no. 14 (2023): 6256. http://dx.doi.org/10.3390/s23146256.

Full text
Abstract:
Research on video anomaly detection has mainly been based on video data. However, many real-world cases involve users who can conceive potential normal and abnormal situations within the anomaly detection domain. This domain knowledge can be conveniently expressed as text descriptions, such as “walking” or “people fighting”, which can be easily obtained, customized for specific applications, and applied to unseen abnormal videos not included in the training dataset. We explore the potential of using these text descriptions with unlabeled video datasets. We use large language models to obtain text descriptions and leverage them to detect abnormal frames by calculating the cosine similarity between the input frame and text descriptions using the CLIP visual language model. To enhance the performance, we refined the CLIP-derived cosine similarity using an unlabeled dataset and the proposed text-conditional similarity, which is a similarity measure between two vectors based on additional learnable parameters and a triplet loss. The proposed method has a simple training and inference process that avoids the computationally intensive analyses of optical flow or multiple frames. The experimental results demonstrate that the proposed method outperforms unsupervised methods by showing 8% and 13% better AUC scores for the ShanghaiTech and UCFcrime datasets, respectively. Although the proposed method shows 6% and 5% lower AUC scores than weakly supervised methods for those datasets, in abnormal videos it shows 17% and 5% better AUC scores, meaning that the proposed method achieves results comparable to weakly supervised methods that require resource-intensive dataset labeling. These outcomes validate the potential of using text descriptions in unsupervised video anomaly detection.
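The core scoring step, cosine similarity between a frame embedding and the text-description embeddings, can be sketched with plain vectors; the embeddings below are toy two-dimensional values standing in for actual CLIP outputs:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def anomaly_score(frame_emb, normal_embs, abnormal_embs):
    """Score a frame by how much closer it lies to abnormal descriptions
    (e.g. "people fighting") than to normal ones (e.g. "walking").
    Positive scores suggest an anomaly; negative scores suggest normality."""
    return (max(cosine(frame_emb, e) for e in abnormal_embs)
            - max(cosine(frame_emb, e) for e in normal_embs))

frame = [0.9, 0.1]        # toy frame embedding
normal = [[1.0, 0.0]]     # embedding of "walking"
abnormal = [[0.0, 1.0]]   # embedding of "people fighting"
print(anomaly_score(frame, normal, abnormal))  # negative: frame looks normal
```

The paper's text-conditional similarity then refines this raw cosine score with learnable parameters trained via a triplet loss, which the sketch above does not attempt to reproduce.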
APA, Harvard, Vancouver, ISO, and other styles
49

Liu, Hua. "Realization of Text Categorization for Small-Scaled Dataset." Advanced Materials Research 532-533 (June 2012): 1239–42. http://dx.doi.org/10.4028/www.scientific.net/amr.532-533.1239.

Full text
Abstract:
Text categorization and comparison tests were carried out on a small-scale dataset. When no training set is available, the indexed text keywords are used without training to categorize against the expert subject terms, achieving a top-level categorization accuracy of 0.82. When a small training set is available, the characteristic vectors acquired from training are added to the expert subject terms before categorization, raising the top-level accuracy to 0.94 and the level-3 accuracy to 0.73, so the results are satisfactory.
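The untrained setting described above amounts to matching indexed keywords against each category's expert subject terms and picking the category with the largest overlap; a minimal sketch with invented terms:

```python
def categorize(keywords, categories):
    """Assign the category whose subject-term set overlaps the document's
    indexed keywords the most. categories: dict mapping category name to a
    set of expert subject terms."""
    return max(categories, key=lambda c: len(categories[c] & set(keywords)))

# Toy category vocabulary, purely for illustration
categories = {
    "materials": {"alloy", "corrosion", "coating"},
    "computing": {"classification", "dataset", "keywords"},
}
print(categorize(["dataset", "classification", "alloy"], categories))  # computing
```

In the trained setting, the feature vectors learned from the training set would simply enlarge each category's term set before the same overlap comparison is applied.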
APA, Harvard, Vancouver, ISO, and other styles
50

Yuan, Tai-Ling, Zhe Zhu, Kun Xu, Cheng-Jun Li, Tai-Jiang Mu, and Shi-Min Hu. "A Large Chinese Text Dataset in the Wild." Journal of Computer Science and Technology 34, no. 3 (2019): 509–21. http://dx.doi.org/10.1007/s11390-019-1923-y.

Full text
APA, Harvard, Vancouver, ISO, and other styles