Journal articles on the topic 'TF-IDF algorithm'

Author: Grafiati

Published: 4 June 2025

Last updated: 23 June 2025

Consult the top 50 journal articles for your research on the topic 'TF-IDF algorithm.'

1

Li, Jinye. "A comparative study of keyword extraction algorithms for English texts." Journal of Intelligent Systems 30, no. 1 (2021): 808–15. http://dx.doi.org/10.1515/jisys-2021-0040.

Abstract:

This study analyzes keyword extraction from English text. First, two commonly used algorithms, the term frequency–inverse document frequency (TF–IDF) algorithm and the keyphrase extraction algorithm (KEA), were introduced. Then, an improved TF–IDF algorithm was designed, which improved the calculation of word frequency and combined it with a position weight to improve keyword extraction performance. Finally, 100 English texts were selected from the British Academic Written English Corpus for the experiment. The results showed that the improved TF–IDF algorithm had the shortest running time, taking only 4.93 s to process 100 texts, and that the precision of the algorithms decreased as the number of extracted keywords increased. The comparison between the algorithms demonstrated that the improved TF–IDF algorithm had the best performance, with a precision of 71.2%, a recall of 52.98%, and an F1 score of 60.75% when five keywords were extracted from each article. The experimental results show that the improved TF–IDF algorithm is effective in extracting keywords from English text and can be further promoted and applied in practice.
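The evaluation described in this abstract (extract the top k keywords, then score them with precision, recall, and F1) can be sketched as follows. This is a minimal illustration of plain TF-IDF only; it does not reproduce the paper's improved word-frequency calculation or its position weight, and the function names, toy documents, and gold keyword set are all assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_keywords(docs, k):
    """Return the k highest-scoring TF-IDF terms for each tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    results = []
    for doc in docs:
        counts = Counter(doc)
        scores = {
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
        # Sort by descending score, then alphabetically so ties are deterministic.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:k])
    return results


def precision_recall_f1(extracted, gold):
    """Score extracted keywords against a reference (gold) keyword set."""
    hits = len(set(extracted) & set(gold))
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Toy corpus and gold keywords, invented for illustration.
docs = [
    "tfidf weighting ranks tfidf keyword candidates".split(),
    "the corpus contains academic essays by students".split(),
    "keyword extraction from academic essays".split(),
]
top2 = tfidf_keywords(docs, k=2)
p, r, f1 = precision_recall_f1(top2[0], gold={"tfidf", "keyword", "weighting"})
```

The same F1 definition applied to the figures reported in the abstract gives 2 × 0.712 × 0.5298 / (0.712 + 0.5298) ≈ 0.6075, which agrees with the stated F1 score of 60.75%.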

2

R Wahyudi, M. Didik. "Evaluation of TF-IDF Algorithm Weighting Scheme in The Qur'an Translation Clustering with K-Means Algorithm." Journal of Information Technology and Computer Science 6, no. 2 (2021): 117–29. http://dx.doi.org/10.25126/jitecs.202162295.

Abstract:

The Al-Quran translation index issued by the Ministry of Religion can be used in text mining to search for similar patterns in Al-Quran translations. This study groups sentences using the K-Means clustering algorithm and three weighting schemes of the TF-IDF algorithm to find the best-performing TF-IDF scheme. Of the three weighting schemes, the traditional TF-IDF scheme obtained the highest percentage, 62.16%, with an average of 36.12% and a standard deviation of 12.77%. The TF-IDF normalization 1 scheme gave the lowest results: a highest percentage of 48.65%, an average of 25.65%, and a standard deviation of 10.16%. The TF-IDF normalization 2 scheme gave the smallest standard deviation, 8.27%, with an average of 28.15% and a highest percentage of 48.65%, the same as the normalization 1 scheme.

3

Xu, Dong Dong, and Shao Bo Wu. "An Improved TFIDF Algorithm in Text Classification." Applied Mechanics and Materials 651-653 (September 2014): 2258–61. http://dx.doi.org/10.4028/www.scientific.net/amm.651-653.2258.

Abstract:

Term frequency–inverse document frequency (TF-IDF), borrowed from information retrieval, is widely used in text classification. Based on the conventional TF-IDF formula, we present a new TF-IDF weighting scheme named CTF-IDF. Experiments show that the improved method is feasible and effective. Furthermore, subsequent evaluations using 10-fold cross-validation show that CTF-IDF greatly improves the accuracy of text classification.
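The abstract does not state the conventional formula it starts from. One common textbook form is given below as an assumption; the CTF-IDF variant itself is not described in the abstract and is not reproduced here. In this notation, f_{t,d} is the number of times term t occurs in document d, N is the number of documents in the collection, and df(t) is the number of documents that contain t.

```latex
\[
\mathrm{tf}(t,d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \log\frac{N}{\mathrm{df}(t)}, \qquad
\text{tf-idf}(t,d) = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)
\]
```

As a worked example with the natural logarithm: a term that occurs 3 times in a 100-word document and appears in 10 of 1,000 documents has tf = 0.03, idf = ln(100) ≈ 4.605, and tf-idf ≈ 0.138.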

4

Hidayah Mazlan, Nurul, and Isredza Rahmi A Hamid. "Evaluation of Feature Selection Algorithm for Android Malware Detection." International Journal of Engineering & Technology 7, no. 4.31 (2018): 311–15. http://dx.doi.org/10.14419/ijet.v7i4.31.23387.

Abstract:

This paper evaluates a feature selection algorithm that uses Term Frequency–Inverse Document Frequency (TF-IDF) as the main algorithm in Android malware detection. The Android features were filtered with the TF-IDF algorithm before the detection process. However, IDF is unaware of the training class labels and gives incorrect weights to some features. Therefore, the proposed approach modifies the TF-IDF algorithm so that it focuses on both the sample and the feature. The proposed algorithm considers each feature according to its level of importance. The best related features in the sample are selected using a weighting and priority ranking process. This increases the effect of the important malware features selected in the Android application sample. The experiments were conducted on a sample collected from the DREBIN dataset. The existing TF-IDF algorithm and the modified TF-IDF (MTF-IDF) algorithm were compared under various conditions, such as different numbers of samples, different numbers of features, and combinations of different feature types. The results show that feature selection using MTF-IDF can improve malware detection analysis. MTF-IDF remained effective for Android malware detection across various kinds of features and dataset sizes. The MTF-IDF algorithm also gave appropriate scaling for all features when analyzing Android malware.

5

Silalahi, Natalia, and Guidio Leonarde Ginting. "Rekomendasi Berita Berkaitan dengan Menerapkan Algoritma Text Mining dan TF-IDF." Bulletin of Computer Science Research 3, no. 4 (2023): 276–82. http://dx.doi.org/10.47065/bulletincsr.v3i4.266.

Abstract:

News presentation is generally structured so that the information is well grouped, but electronic media do not necessarily offer complete news categories because not all of the available space can be filled with good presentation, so special treatment is needed to give readers the news they need, arranged on the basis of recommendations. To structure the research, the authors carried out several stages: problem identification, literature study, data collection, implementation of the text mining and TF-IDF algorithms, and conclusions. The authors apply the text mining and TF-IDF algorithms to news title data, starting with text mining, a preprocessing stage that reduces the data to base words so that the weighting process in the TF-IDF algorithm is not too broad. The TF-IDF stage then weights the terms in each document. The text mining and TF-IDF algorithms are able to provide appropriate news recommendations based on the highest similarity in meaning, in terms of both the topic and the object of the news title. For future research, the authors recommend other algorithms such as cosine similarity, so that recommendations are generated not only from matching words but also from similarity of meaning, which could further improve the results.

6

Goswami, Puneet, and Vidya Kamath. "The DF-ICF Algorithm- Modified TF-IDF." International Journal of Computer Applications 93, no. 13 (2014): 28–30. http://dx.doi.org/10.5120/16276-6036.

7

Saputra, Nova Adi, Khurotul Aeni, and Nurul Mega Saraswati. "Indonesian Hate Speech Text Classification Using Improved K-Nearest Neighbor with TF-IDF-ICSρF." Scientific Journal of Informatics 11, no. 1 (2024): 21–30. http://dx.doi.org/10.15294/sji.v11i1.48085.

Abstract:

Purpose: Freedom in social media gives rise to the possibility of disturbing users through the sentences that are sent, which is limited by the Electronic Information and Transactions Law (UU ITE). This research aims to find an effective method for classifying hate speech text data, especially in Indonesian, with many categories, which is expected to minimize such cases. Methods: This study used 1,000 data items from Twitter with five labels: religion, race, physical, gender, and other (invective or slander). The process started with several preprocessing steps, followed by data transformation using TF-IDF-ICSρF term weighting and data mining using an Improved KNN algorithm. The results were then compared with the TF-IDF and KNN methods to evaluate the differences. Result: The TF-IDF-ICSρF and Improved KNN algorithms achieve an average accuracy of 88.11%, which is 17.81 percentage points higher than the 70.30% obtained by the K-Nearest Neighbor and TF-IDF algorithms on the same data and parameters. Novelty: Based on the comparison results, the TF-IDF-ICSρF and Improved KNN methods can effectively classify hate speech sentences that have many labels with fairly good accuracy.

8

Hashemzahde, Bahare, and Majid Abdolrazzagh-Nezhad. "Improving keyword extraction in multilingual texts." International Journal of Electrical and Computer Engineering (IJECE) 10, no. 6 (2020): 5909. http://dx.doi.org/10.11591/ijece.v10i6.pp5909-5916.

Abstract:

The accuracy of keyword extraction is a leading factor in information retrieval systems and marketing. In the real world, text is produced in a variety of languages, and the ability to extract keywords based on information from different languages improves the accuracy of keyword extraction. In this paper, the available information from all languages is applied to improve a traditional keyword extraction algorithm for multilingual text. The proposed keyword extraction procedure is an unsupervised algorithm that selects a word as a keyword of a given text if, in addition to ranking highly in that text's language, it also ranks highly under the keyword criteria in the other languages. To this end, the average TF-IDF of the candidate words was calculated for the same language and the other languages. The words with the highest average TF-IDF were then chosen as the extracted keywords. The results indicate that the accuracies on multilingual texts of the term frequency-inverse document frequency (TF-IDF) algorithm, the graph-based algorithm, and the improved proposed algorithm are 80%, 60.65%, and 91.3%, respectively.

9

Bahareh, Hashemzadeh, and Abdolrazzagh-Nezhad Majid. "Improving keyword extraction in multilingual texts." International Journal of Electrical and Computer Engineering (IJECE) 10, no. 6 (2020): 5909–16. https://doi.org/10.11591/ijece.v10i6.pp5909-5916.

10

Chang, Hsien-Tsung, Shu-Wei Liu, and Nilamadhab Mishra. "A tracking and summarization system for online Chinese news topics." Aslib Journal of Information Management 67, no. 6 (2015): 687–99. http://dx.doi.org/10.1108/ajim-10-2014-0147.

Abstract:

Purpose – The purpose of this paper is to design and implement new tracking and summarization algorithms for Chinese news content. Based on the proposed methods and algorithms, the authors extract the important sentences that are contained in topic stories and list those sentences according to timestamp order to ensure ease of understanding and to visualize multiple news stories on a single screen. Design/methodology/approach – This paper encompasses an investigational approach that implements a new Dynamic Centroid Summarization algorithm in addition to a Term Frequency (TF)-Density algorithm to empirically compute three target parameters, i.e., recall, precision, and F-measure. Findings – The proposed TF-Density algorithm is implemented and compared with the well-known algorithms Term Frequency-Inverse Word Frequency (TF-IWF) and Term Frequency-Inverse Document Frequency (TF-IDF). Three test data sets are configured from Chinese news web sites for use during the investigation, and two important findings are obtained that help the authors provide more precision and efficiency when recognizing the important words in the text. First, the authors evaluate three topic tracking algorithms, i.e., TF-Density, TF-IDF, and TF-IWF, with the said target parameters and find that the recall, precision, and F-measure of the proposed TF-Density algorithm is better than those of the TF-IWF and TF-IDF algorithms. In the context of the second finding, the authors implement a blind test approach to obtain the results of topic summarizations and find that the proposed Dynamic Centroid Summarization process can more accurately select topic sentences than the LexRank process. Research limitations/implications – The results show that the tracking and summarization algorithms for news topics can provide more precise and convenient results for users tracking the news. The analysis and implications are limited to Chinese news content from Chinese news web sites such as Apple Library, UDN, and well-known portals like Yahoo and Google. Originality/value – The research provides an empirical analysis of Chinese news content through the proposed TF-Density and Dynamic Centroid Summarization algorithms. It focusses on improving the means of summarizing a set of news stories to appear for browsing on a single screen and carries implications for innovative word measurements in practice.

11

Aohana, Mizanul Ridho, and Fitri Bimantoro. "Tourism Destination Article Search Features using TF-IDF and Cosine similarity." DIELEKTRIKA 10, no. 2 (2023): 145–53. http://dx.doi.org/10.29303/dielektrika.v10i2.338.

Abstract:

In the current digital era, the increasing public interest in searching for information about travel destinations necessitates an effective and accurate search system. However, search results for travel destination articles often yield irrelevant or inadequate outcomes. To address this issue, this paper proposes applying the Term Frequency-Inverse Document Frequency (TF-IDF) algorithm and Cosine Similarity in the search feature for travel destination articles. By employing these algorithms, the search system is anticipated to deliver more relevant and accurate results according to user needs. This research contributes to developing an effective search system for travel destination articles, assisting users in obtaining relevant and high-quality information about the destinations they are searching for. The methodology involves collecting data on travel destination articles, implementing the TF-IDF algorithm to evaluate term importance, and utilizing Cosine Similarity to measure the similarity between articles and user queries. The study results demonstrate that implementing the TF-IDF algorithm and Cosine Similarity in the search feature for travel destination articles enhances the accuracy and relevance of search results. Users can quickly discover articles that align with their queries, improving their search experience. In conclusion, this research highlights that applying the TF-IDF algorithm and Cosine Similarity in the search feature significantly improves the accuracy and relevance of search results for travel destination articles. This enhances the search experience for users seeking information about travel destinations.
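The pipeline outlined in this abstract (weight terms with TF-IDF, then rank articles by cosine similarity to the user query) can be sketched in pure Python as below. This is an illustrative sketch, not the authors' implementation: the smoothed IDF variant, the whitespace tokenizer, the function names, and the three sample articles are assumptions.

```python
import math
from collections import Counter


def build_tfidf(docs):
    """Return (idf, vectors) for a list of tokenized documents."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (an assumption) keeps weights positive for terms in every document.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    vectors = [vectorize(doc, idf) for doc in docs]
    return idf, vectors


def vectorize(tokens, idf):
    """Map tokens to a sparse TF-IDF vector, ignoring terms unseen in the corpus."""
    counts = Counter(t for t in tokens if t in idf)
    return {term: count * idf[term] for term, count in counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def search(query, docs):
    """Rank document indices by cosine similarity to the query, best first."""
    tokenized = [doc.lower().split() for doc in docs]
    idf, vectors = build_tfidf(tokenized)
    q = vectorize(query.lower().split(), idf)
    scored = [(cosine(q, vec), i) for i, vec in enumerate(vectors)]
    # Drop documents that share no terms with the query.
    return [i for score, i in sorted(scored, key=lambda s: (-s[0], s[1])) if score > 0]


# Hypothetical travel articles, invented for illustration.
articles = [
    "white sand beach with snorkeling and coral reefs",
    "mountain hiking trail with a waterfall",
    "beach resort near the harbor",
]
ranking = search("snorkeling beach", articles)
```

Here the first article matches both query terms and ranks first, the third matches only "beach", and the second shares no terms with the query and is dropped.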

12

Wisky, Irzal Arief, Sarjon Defit, and Gunadi Widi Nurcahyo. "Development of extraction features for Detecting Adolescent Personality with Machine Learning Algorithms." JOIV : International Journal on Informatics Visualization 8, no. 3-2 (2024): 1606. https://doi.org/10.62527/joiv.8.3-2.3091.

Abstract:

This study aims to develop a Natural Language Processing (NLP)-based feature extraction algorithm optimized for personality type classification in adolescents. The algorithm used is TF-IDF + N-Gram Z, which combines Term Frequency-Inverse Document Frequency (TF-IDF) with the N-Gram Z technique to improve the feature representation of the analyzed text. TF-IDF functions to measure the importance of words in a document, while N-Gram Z enriches the context by considering the order of words that appear sequentially. The dataset in this study consists of 3,200 sentences generated by adolescent respondents through a survey designed to explore aspects of their personality. After the feature extraction process is complete, three variants of the Naïve Bayes method are applied for classification, namely Multinomial Naïve Bayes, Bernoulli Naïve Bayes, and Complement Naïve Bayes. Each variant has distinctive characteristics in handling certain data types, such as binomial and multinomial data. The results of the study show that the combined TF-IDF + N-Gram Z algorithm can produce highly representative features, as evidenced by high classification performance. The Multinomial Naïve Bayes and Complement Naïve Bayes variants each achieved 98% accuracy. These findings provide significant contributions to the development of NLP-based personality classification methods for Detecting Adolescent Personality. The combination of the TF-IDF + N-Gram Z algorithm with various Naïve Bayes variants produces an exceedingly high level of accuracy and can be applied in practice in the fields of psychology and adolescent education.
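The abstract does not define the "N-Gram Z" technique beyond saying that it considers words that appear sequentially. As a hedged illustration of that general idea only, the sketch below extracts ordinary contiguous word n-grams, which could then be weighted with TF-IDF like single words. The function names and the sample sentence are assumptions, and nothing here should be read as the authors' algorithm.

```python
def word_ngrams(tokens, n):
    """Return contiguous word n-grams, preserving word order."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def unigram_and_bigram_features(sentence):
    """Combine single words and adjacent word pairs into one feature list."""
    tokens = sentence.lower().split()
    return word_ngrams(tokens, 1) + word_ngrams(tokens, 2)


# Hypothetical survey sentence, invented for illustration.
features = unigram_and_bigram_features("I enjoy meeting new people")
```

Bigrams such as "new people" keep local word order that single-word features lose.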

13

SUHAIB, MD. "RECOMMENDAITON SYSTEM ENGINE FOR MOVIES USING MACHINE LEARNING ALGORITHM (TF-IDF VECTORIZATION)." International Scientific Journal of Engineering and Management 03, no. 04 (2024): 1–9. http://dx.doi.org/10.55041/isjem01577.

Abstract:

By offering tailored movie recommendations, Movie Recommendation Systems (MRS) are crucial for improving the user experience on streaming services. This research paper proposes and evaluates a Movie Recommendation System utilizing TF-IDF vectorization and cosine similarity. TF-IDF vectorization is used to analyze textual information related to movies, such as plot summaries, cast bios, and genres, in order to give users precise and pertinent suggestions. The similarity between the user's preferences and the movies in the dataset is then calculated using cosine similarity. The results of the study show that the suggested Movie Recommendation System, which makes use of cosine similarity and TF-IDF vectorization, greatly improves user happiness and recommendation accuracy. The developed system offers an effective solution for providing personalized movie recommendations, contributing to the advancement of recommendation systems in the entertainment industry. This study provides valuable insights for streaming platforms to improve their recommendation systems and enhance user engagement. Keywords: Movie Recommendation System, Machine Learning (ML), Natural Language Processing (NLP) TF-IDF Vectorization, Cosine Similarity, Content-Based Filtering, Personalized Recommendations, User Engagement.

14

Kim, Hyun-Jin, Ji-Won Baek, and Kyungyong Chung. "Optimization of Associative Knowledge Graph using TF-IDF based Ranking Score." Applied Sciences 10, no. 13 (2020): 4590. http://dx.doi.org/10.3390/app10134590.

Abstract:

This study proposes the optimization method of the associative knowledge graph using TF-IDF based ranking scores. The proposed method calculates TF-IDF weights in all documents and generates term ranking. Based on the terms with high scores from TF-IDF based ranking, optimized transactions are generated. News data are first collected through crawling and then are converted into a corpus through preprocessing. Unnecessary data are removed through preprocessing including lowercase conversion, removal of punctuation marks and stop words. In the document term matrix, words are extracted and then transactions are generated. In the data cleaning process, the Apriori algorithm is applied to generate association rules and make a knowledge graph. To optimize the generated knowledge graph, the proposed method utilizes TF-IDF based ranking scores to remove terms with low scores and recreate transactions. Based on the result, the association rule algorithm is applied to create an optimized knowledge model. The performance is evaluated in rule generation speed and usefulness of association rules. The association rule generation speed of the proposed method is about 22 seconds faster. And the lift value of the proposed method for usefulness is about 0.43 to 2.51 higher than that of each one of conventional association rule algorithms.
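The abstract reports usefulness in terms of lift without defining it. The standard definition for an association rule A ⇒ B is given below, where supp(X) is the fraction of transactions containing itemset X and conf is the rule's confidence. It is assumed, not confirmed by the abstract, that the paper uses this definition.

```latex
\[
\mathrm{lift}(A \Rightarrow B)
  = \frac{\mathrm{supp}(A \cup B)}{\mathrm{supp}(A)\,\mathrm{supp}(B)}
  = \frac{\mathrm{conf}(A \Rightarrow B)}{\mathrm{supp}(B)}
\]
```

For example, with supp(A) = 0.2, supp(B) = 0.25, and supp(A ∪ B) = 0.1, the lift is 0.1 / (0.2 × 0.25) = 2. A lift above 1 means A and B co-occur more often than they would if they were independent.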

15

Samidi, Samidi, and Devy Fatmawati. "Sentiment Classification of IT Service Feedback via TF-IDF." CogITo Smart Journal 10, no. 2 (2024): 403–17. https://doi.org/10.31154/cogito.v10i2.701.403-417.

Abstract:

Handling user complaints and feedback is a key strategy of Pusintek, the Ministry of Finance of the Republic of Indonesia, to enhance user satisfaction. The challenge faced is the difficulty in accurately analyzing feedback due to differences in comments and categories chosen by users, which requires manual category correction. This study aims to automate feedback comment categorization using classification algorithms. Specifically, Naïve Bayes, Support Vector Machine (SVM), and K-Nearest Neighbors (K-NN) algorithms were applied to 11,108 user feedback records. The CRISP-DM framework was used, with dataset preparation involving sentiment analysis techniques (cleansing, case folding, normalization, filtering, and tokenization) and Term Frequency-Inverse Document Frequency (TF-IDF) weighting. Accuracy values for each algorithm were evaluated. Results show that the SVM algorithm performed the best, achieving an accuracy of 94.10% and consistently delivering the highest precision, recall, and f1-score across all sentiment categories. This research contributes to the development of an automatic feedback classification system that improves categorization accuracy, minimizes manual intervention, and optimizes user feedback analysis. It is expected to enrich the understanding of text classification and natural language processing techniques and open up opportunities for further research.

16

Sulartopo, Sulartopo. "The thesis topic similarity test with TF-IDF method." E-Bisnis : Jurnal Ilmiah Ekonomi dan Bisnis 13, no. 1 (2020): 13–16. http://dx.doi.org/10.51903/e-bisnis.v13i1.140.

Abstract:

This research clarifies how to test thesis topic similarity, making it easy to check whether a thesis topic has already been used by a previous student. The central issue is how to automate a thesis topic similarity test that is currently done manually. The purpose of this study is to develop a method for testing the similarity of thesis topics using the TF-IDF method. The system has two stages. The first is text mining, in which the theses are processed with the TF-IDF algorithm, which counts the occurrences of each word in the document contents. In the second stage, the results of the TF-IDF algorithm are processed with the vector space model (VSM) algorithm. The end result of the program is the names of the documents that have a degree of similarity with the keywords.

17

Ni, Jianjun, Yu Cai, Guangyi Tang, and Yingjuan Xie. "Collaborative Filtering Recommendation Algorithm Based on TF-IDF and User Characteristics." Applied Sciences 11, no. 20 (2021): 9554. http://dx.doi.org/10.3390/app11209554.

Abstract:

The recommendation algorithm is a very important and challenging issue for a personal recommender system. The collaborative filtering recommendation algorithm is one of the most popular and effective recommendation algorithms. However, the traditional collaborative filtering recommendation algorithm does not fully consider the impact of popular items and user characteristics on the recommendation results. To solve these problems, an improved collaborative filtering algorithm is proposed, which is based on the Term Frequency-Inverse Document Frequency (TF-IDF) method and user characteristics. In the proposed algorithm, an improved TF-IDF method is used to calculate the user similarity on the basis of rating data first. Secondly, the multi-dimensional characteristics information of users is used to calculate the user similarity by a fuzzy membership method. Then, the above two user similarities are fused based on an adaptive weighted algorithm. Finally, some experiments are conducted on the movie public data set, and the experimental results show that the proposed method has better performance than that of the state of the art.

18

Ahmed, Abdullahi Burhanuddeen, and Khalid Haruna. "ENHANCED SMS SPAM DETECTION USING BERNOULLI NAIVE BAYES WITH TF-IDF." FUDMA JOURNAL OF SCIENCES 9, no. 1 (2025): 393–99. https://doi.org/10.33003/fjs-2025-0901-3226.

Abstract:

The use of mobile text messaging for communication is increasingly widespread, with Short Message Service (SMS) experiencing significant growth over the last decade. Consequently, the increase in SMS usage has led to a concerning rise in SMS spam, presenting substantial challenges for users and service providers. This study proposes a novel method for detecting SMS spam by combining Term Frequency-Inverse Document Frequency (TF-IDF) with Bernoulli Naïve Bayes (BNB) algorithm. The approach employs the use of TF-IDF for comprehensive feature extraction and the classification capabilities of the Bernoulli Naïve Bayes Algorithm. Through experimental validation employing TF-IDF for feature extraction and the BNB algorithm for classification, the results demonstrate high accuracy (98.36%), precision (99.19%), and a notable Matthews Correlation Coefficient (MCC) of 0.93, showcasing superior model performance compared to existing benchmarks. Likewise, the proposed model shows efficient processing time (0.22 seconds). By combining strengths of TF-IDF and BNB, the approach offers effective SMS spam detection, surpassing the performance of traditional and deep learning classifiers. This research contributes valuable insights towards enhancing SMS security, thereby increasing trust between users and service providers.
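The Matthews Correlation Coefficient cited in this abstract is computed from the four cells of a binary confusion matrix. The sketch below shows the standard formula. The confusion-matrix counts are hypothetical and are not taken from the paper, so the resulting value is unrelated to the reported 0.93.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC for binary classification; returns 0.0 when the denominator is zero."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


# Hypothetical counts for a spam filter (spam is the positive class).
mcc = matthews_corrcoef(tp=90, tn=880, fp=5, fn=25)
perfect = matthews_corrcoef(tp=10, tn=10, fp=0, fn=0)
```

MCC ranges from -1 to 1 and uses all four cells, which makes it more informative than accuracy on imbalanced data such as SMS collections where most messages are legitimate.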

19

Lan, Fei. "Research on Text Similarity Measurement Hybrid Algorithm with Term Semantic Information and TF-IDF Method." Advances in Multimedia 2022 (April 23, 2022): 1–11. http://dx.doi.org/10.1155/2022/7923262.

Abstract:

TF-IDF (term frequency-inverse document frequency) is one of the traditional text similarity calculation methods based on statistics. Because TF-IDF does not consider the semantic information of words, it cannot accurately reflect the similarity between texts, and semantic information enhanced methods distinguish between text documents poorly because extended vectors with semantic similar terms aggravate the curse of dimensionality. Aiming at this problem, this paper advances a hybrid with the semantic understanding and TF-IDF to calculate the similarity of texts. Based on term similarity weighting tree (TSWT) data structure and the definition of semantic similarity information from the HowNet, the paper firstly discusses text preprocess and filter process and then utilizes the semantic information of those key terms to calculate similarities of text documents according to the weight of the features whose weight is greater than the given threshold. The experimental results show that the hybrid method is better than the pure TF-IDF and the method of semantic understanding at the aspect of accuracy, recall, and F1-metric by different K-means clustering methods.

20

小林, 王. "改进的TF-IDF关键词提取方法 (Improved TF-IDF Keyword Extraction Algorithm)." Computer Science and Application 03, no. 01 (2013): 64–68. http://dx.doi.org/10.12677/csa.2013.31012.

21

Xiang, Lin. "Application of an Improved TF-IDF Method in Literary Text Classification." Advances in Multimedia 2022 (May 9, 2022): 1–10. http://dx.doi.org/10.1155/2022/9285324.

Abstract:

Literature is extremely important in the advancement of human civilization. Every day, many literary texts of various genres are produced, dating back to ancient times. An urgent concern for managers in the current literary activity is how to classify and save the expanding mass of literary text data for easy access by readers. In the realm of text classification, the TF-IDF algorithm is a widely used classification algorithm. However, there are significant issues with utilizing this approach, including a lack of distribution information inside categories, a lack of distribution information between categories, and an inability to adjust to skewed datasets. It is possible to improve classification accuracy by using the TF-IDF algorithm in this paper’s application situation by exploiting the association between feature words and the quantity of texts in which they appear, while ignoring the variation in feature word distribution across categories. With the purpose of classifying the literary texts in this study, this work proposes an improved IDF method for the problem of feature words appearing several times and having diverse meanings in different fields. The meanings of feature words in distinct domains are separated to increase the trust in the TF-IDF algorithm’s output. Using the improved TF-IDF method suggested in this research with the random forest (RF) classifier, the experimental results show that the classifier has a good classification impact, which can meet the actual work needs, based on comparative experiments on feature dimension selection, feature selection algorithm, feature weight algorithm, and classifier. It has a fair amount of historical significance.

22

Lin, Wei Ran, Zhi Hui Wu, Li Chao Feng, and Wai Bin Huang. "Chinese Text Classification with a KNN Classifier Using an Adjusted Feature Weighting Method." Applied Mechanics and Materials 50-51 (February 2011): 700–703. http://dx.doi.org/10.4028/www.scientific.net/amm.50-51.700.

Abstract:

The KNN algorithm is used for Chinese text classification in this paper. First, TF-IDF is chosen as the feature weighting method. TF-IDF is then adjusted to suit the characteristics of the corpus used in this paper. Finally, experimental results show that the accuracy of the KNN text classifier can be improved with the adjusted feature weighting method.

23

Gothankar, Ajinkya, Lavish Gupta, Niharika Bisht, Samiksha Nehe, and Prof Monali Bansode. "Extractive Text and Video Summarization using TF-IDF Algorithm." International Journal for Research in Applied Science and Engineering Technology 10, no. 3 (2022): 927–32. http://dx.doi.org/10.22214/ijraset.2022.40775.

Abstract:

Text summarization is a technique for extracting concise summaries from a large text without sacrificing any important information. It's a good way to extract crucial information from documents. The rapid rise of the internet has resulted in a substantial surge in data all across the world. It has become difficult for humans to manually summarise big documents. Automatic Text Summarization is an NLP technique that lowers the time and effort required by a human to create a summary. Text summarising techniques are divided into two categories: extractive and abstractive. In the extractive approach, text summarising techniques choose sentences from documents based on a set of criteria. In the abstractive approach, text summarising techniques strive to improve sentence coherence by reducing redundancies and explaining the context of sentences. The extractive summarization approach is the subject of this paper. There are several methods for summarising data, including TF-IDF, Text Rank, PageRank, and Latent Dirichlet Allocation (LDA). This work examines Text Summarization using the TF-IDF Algorithm, a numerical measure that ranks the value of a word in a document based on how frequently it appears in that document and in a set of documents. The application of the TF-IDF Algorithm for text, document, article, and video summarization is described in this study. There are no repetitions in the results, and for some searches, they are nearly identical to the summary results provided by humans. This algorithm offers a sentence extraction technique that selects the most diverse top-ranked sentences. Keywords: Extractive Summarization, Term Frequency-Inverse Document Frequency (TF-IDF), Natural Language Processing (NLP), Text Summarization, Video Summarization
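As a rough illustration of the extractive idea described in this abstract, the following sketch scores each sentence by the length-normalized sum of its TF-IDF term weights and keeps the top-ranked ones. It is not the authors' implementation: the function names `tokenize` and `summarize`, the regex-based sentence splitting, and the choice to treat each sentence as a "document" for the IDF statistics are all assumptions made for this example, and the diversity-based selection mentioned in the abstract is not reproduced.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep alphabetic runs only (a simplifying assumption).
    return re.findall(r"[a-z]+", text.lower())

def summarize(text, k=1):
    # Split on sentence-final punctuation; each sentence acts as one "document".
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tokens = [tokenize(s) for s in sentences]
    n = len(sentences)
    # Document frequency: number of sentences containing each term.
    df = Counter(term for toks in tokens for term in set(toks))
    scores = []
    for toks in tokens:
        tf = Counter(toks)
        # Sum of (relative term frequency * idf) over the sentence's terms.
        score = sum((c / len(toks)) * math.log(n / df[t]) for t, c in tf.items()) if toks else 0.0
        scores.append(score)
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
    # Return the selected sentences in their original order.
    return [sentences[i] for i in sorted(top)]
```

Sentences built from terms that are rare across the text rank highest under this scheme; a production system would also need redundancy control and better sentence segmentation.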

24

Kageyama, Akinori, and Hiroshi Tsuji. "Characteristic Analysis for Research Institutes by TF/IDF Algorithm." IEEJ Transactions on Electronics, Information and Systems 125, no. 5 (2005): 713–19. http://dx.doi.org/10.1541/ieejeiss.125.713.

25

Susanto, Muhammad Riza Radyaka, Husni Thamrin, and Naufal Azmi Verdikha. "PERFORMANCE OF TEXT SIMILARITY ALGORITHMS FOR ESSAY ANSWER SCORING IN ONLINE EXAMINATIONS." Jurnal Teknik Informatika (Jutif) 4, no. 6 (2023): 1515–21. http://dx.doi.org/10.52436/1.jutif.2023.4.6.1025.

Abstract:

The purpose of assessment is to determine learning success. Exams with essay questions have several advantages, including ease of preparation and the ability to reveal student comprehension and originality. The problem with essay answers is that they take time to grade. Therefore, it is important to develop algorithms and software that evaluate essay answers automatically. Such algorithms and software can solve some exam and assessment problems. This study aims to investigate similarity algorithms that approximate human patterns in evaluating ambiguous answers. It examines five similarity algorithms, including TF-IDF and LSA. The data was a collection of correct answers with a total of 371 texts. The similarity algorithms' performance was compared with human correction results. Evaluation was performed using Root Mean Square Error (RMSE). This study shows that the TF-IDF algorithm, like Jaccard, has the lowest RMSE compared to human judgement. However, the LSA algorithm tended to follow human rating patterns better for descriptive tests.
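To make the evaluation setup concrete, here is a minimal sketch of two of the ingredients named in the abstract: a Jaccard similarity between two answers and the RMSE between machine scores and human scores. The function names, the whitespace tokenization, and the idea of comparing raw similarity values directly with human scores are assumptions for illustration; the abstract does not specify how scores were scaled.

```python
import math

def jaccard(a, b):
    # Jaccard similarity of the two answers' word sets: |A ∩ B| / |A ∪ B|.
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def rmse(predicted, actual):
    # Root mean square error between two equally long score lists.
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))
```

A lower RMSE means the automatic scores sit closer to the human scores on average, which is the comparison criterion the abstract reports.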

26

Li, Qian, Aiqiu Jiang, Zhenhua Zhang, Huoliang Wang, and Lin Wang. "The Application of TF‐IDF in the Research of Flood and Drought Disaster Defense System." ce/papers 8, no. 2 (2025): 1910–19. https://doi.org/10.1002/cepa.3274.

Abstract:

This study explores the application potential of the TF‐IDF (Term Frequency‐Inverse Document Frequency) technique in the research of grassroots flood and drought disaster defense systems, aiming to optimize disaster management and prevention strategies through advanced text analytics. The paper begins by outlining issues prevalent in traditional grassroots defense systems, such as delayed warnings, misallocated resources, and insufficient public participation. It then proceeds to meticulously introduce the TF‐IDF method and its application in the standardization project of the Yuhang District's flood and drought defense system. By applying the TF‐IDF algorithm to data processing, the study extracts keywords and topics closely associated with disaster warning, risk management, and rescue operations. Through TF‐IDF analysis, key systems, critical matters, and essential content within grassroots defense frameworks are identified, leading to the proposition of a standardized achievement system for flood and drought defense: “one work checklist, three operational manuals, and a comprehensive emergency response plan.” As water and drought disaster scenarios become increasingly complex in the future, this TF‐IDF‐based analytical approach is poised to be a pivotal instrument in facilitating the modern transformation of disaster defense systems.

27

Xu, Jingjing. "A natural language processing based technique for sentiment analysis of college english corpus." PeerJ Computer Science 9 (February 17, 2023): e1235. http://dx.doi.org/10.7717/peerj-cs.1235.

Abstract:

The college English corpus can help us better master English, but how to obtain the desired information from a large number of English corpus has become the focus of information technology. Based on the natural language processing (NLP) technology, a sentiment analysis model is built in this article. An improved term frequency–inverse document frequency (TF-IDF) algorithm is proposed in this article, where the weighted average method is used to determine the emotional value of each emotional word. The inspirational words are used to obtain the English corpus’s emotional tendency and emotional value. The results show that the model has high classification accuracy and operation efficiency when selecting feature words. Compared with the TF-IDF, the improved TF-IDF algorithm added the necessary information weight processing and word density weight processing to two new processing links, which can significantly improve the efficiency of college English learning.

28

Kurniawanda, Muhamad Riza, and Fenina Adline Twince Tobing. "Analysis Sentiment Cyberbullying In Instagram Comments with XGBoost Method." IJNMT (International Journal of New Media Technology) 9, no. 1 (2022): 28–34. http://dx.doi.org/10.31937/ijnmt.v9i1.2670.

Abstract:

Technological developments have made social media widely used by the general public, which has negative impacts, one of which is cyberbullying. Cyberbullying is the act of insulting or humiliating another person on social media. A system that can detect cyberbullying is needed because the large amount of information circulating on social media is impossible for humans to review manually. One suitable method for this problem is Extreme Gradient Boosting (XGBoost). XGBoost was chosen because it can run 10 times faster than other Gradient Boosting methods. Sentences are converted into vectors using the TF-IDF method, which is known as a simple but relevant algorithm for weighting the words in a document. XGBoost accepts the vectors produced by the TF-IDF process as input. In this research, 1452 comments are split into training data and testing data. Using the XGBoost and TF-IDF methods, the accuracy is 75.20%, precision is 71%, recall is 87%, and the F1-score is 78%.

29

Wang, Yanhua, Zhiyuan Zhang, and Weigang Huo. "Research on aviation unsafe incidents classification with improved TF-IDF algorithm." Modern Physics Letters B 30, no. 12 (2016): 1650184. http://dx.doi.org/10.1142/s0217984916501840.

Abstract:

The text content of Aviation Safety Confidential Reports contains a large amount of valuable information. The term frequency-inverse document frequency (TF-IDF) algorithm is commonly used in text analysis, but it does not take into account the sequential relationship of the words in the text or their role in semantic expression. To address these problems, and working from the seven category labels of civil aviation unsafe incidents, this paper improves the TF-IDF algorithm based on a co-occurrence network and establishes feature word extraction and word sequential relations for classified incidents. An aviation domain lexicon was used to improve the classification accuracy. A feature word network model was designed for multi-document unsafe incident classification and was used in the experiment. Finally, the classification accuracy of the improved algorithm was verified by experiments.

30

Setiawan, Gede Herdian, and I. Made Budi Adnyana. "Improving Helpdesk Chatbot Performance with Term Frequency-Inverse Document Frequency (TF-IDF) and Cosine Similarity Models." Journal of Applied Informatics and Computing 7, no. 2 (2023): 252–57. http://dx.doi.org/10.30871/jaic.v7i2.6527.

Abstract:

Helpdesk chatbots are growing in popularity due to their ability to provide help and answers to user questions quickly and effectively. Chatbot development poses several challenges, including enhancing accuracy in understanding user queries and providing relevant responses while improving problem-solving efficiency. In this research, we aim to enhance the accuracy and efficiency of the Helpdesk Chatbot by implementing the Term Frequency-Inverse Document Frequency (TF-IDF) model and the Cosine Similarity algorithm. The TF-IDF model is a method used to measure the frequency of words in a document and their occurrence in the entire document collection, while the Cosine Similarity algorithm is used to measure the similarity between two documents. After implementing and testing the TF-IDF and Cosine Similarity models in the Helpdesk Chatbot, we achieved a 75% question recognition rate. To increase accuracy and precision, it is necessary to enlarge the knowledge dataset and improve pre-processing, especially in recognizing and correcting inaccurate spelling.
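The combination described in this abstract, TF-IDF vectors compared by cosine similarity, can be sketched as follows. This is an illustrative reconstruction, not the chatbot's code: the helper names `tfidf_vectors`, `cosine`, and `best_match`, the whitespace tokenization, and the smoothed IDF formula `log((1 + n) / (1 + df)) + 1` are assumptions chosen for the example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(t for toks in tokenized for t in set(toks))
    # Smoothed IDF (an assumption) so ubiquitous terms keep a small positive weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    vectors = [{t: c * idf[t] for t, c in Counter(toks).items()} for toks in tokenized]
    return vectors, idf

def cosine(a, b):
    # Cosine of the angle between two sparse vectors stored as dicts.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def best_match(query, questions):
    # Return the index of the stored question most similar to the query, plus all scores.
    vectors, idf = tfidf_vectors(questions)
    q = {t: c * idf[t] for t, c in Counter(query.lower().split()).items() if t in idf}
    scores = [cosine(q, v) for v in vectors]
    return max(range(len(questions)), key=lambda i: scores[i]), scores
```

In a helpdesk setting, the index returned by `best_match` would select the stored answer paired with the closest known question; a similarity threshold for "no match" would be a natural addition.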

31

Yoan Maria Vianny and Erwin Budi Setiawan. "Implementation of Rumor Detection on Twitter Using J48 Algorithm." Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi) 4, no. 5 (2020): 775–81. http://dx.doi.org/10.29207/resti.v4i5.2059.

Abstract:

The existence of rumors on Twitter has caused a lot of unrest among Indonesians. Information of unverified validity confuses users. In this study, an Indonesian rumor detection system is built using the J48 algorithm in combination with the Term Frequency-Inverse Document Frequency (TF-IDF) weighting method. The dataset contains 47,449 tweets that have been manually labeled. This study offers new features, namely the number of emoticons in the display name, the number of digits in the display name, and the number of digits in the username. These three new features are used to maximize information about information sources. The highest accuracy obtained is 75.76%, using 90% training data and 1,000 TF-IDF features in 1-gram to 3-gram combinations.
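The "1-gram to 3-gram combinations" with a capped feature count can be pictured with the sketch below. The function names are hypothetical, and ranking n-grams by raw corpus frequency to pick the top features is an assumption: the abstract does not say how the 1,000 features were selected.

```python
from collections import Counter

def ngrams(tokens, n_min=1, n_max=3):
    # Join each window of n consecutive tokens into one feature string.
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def build_vocabulary(docs, max_features=1000):
    # Keep the most frequent n-grams across the corpus (selection rule is an assumption).
    counts = Counter(g for d in docs for g in ngrams(d.lower().split()))
    return [g for g, _ in counts.most_common(max_features)]
```

Each retained n-gram would then become one dimension of the TF-IDF feature vector passed to the classifier.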

32

Fathirachman Mahing, Naufal, Alifi Lazuardi Gunawan, Ahmad Foresta Azhar Zen, Fitra Abdurrachman Bachtiar, and Satrio Agung Wicaksono. "Klasifikasi Tingkat Stress dari Data Berbentuk Teks dengan Menggunakan Algoritma Support Vector Machine (SVM) dan Random Forest." Jurnal Teknologi Informasi dan Ilmu Komputer 10, no. 7 (2023): 1527–36. http://dx.doi.org/10.25126/jtiik.1078010.

Abstract:

Stress is a condition in which a person feels excessive pressure. Monitoring stress levels is important because high levels of stress can have a negative impact on human health, which makes early detection valuable. One way to find out someone's stress level is through text analysis. This research classifies stress levels based on text data using the Support Vector Machine (SVM) and Random Forest algorithms, and compares several transformation methods: TF-IDF, CountVectorizer, NRCLex, and Word Affect Intensities. The data are English-language texts taken from the social media platform Twitter, 8439 items in total. Model training for both Support Vector Machine and Random Forest uses 6751 items, and testing uses 1688 items. The results show that the SVM algorithm with TF-IDF weighting had the best performance compared to the Random Forest algorithm and the other transformation methods used in the study. The SVM model with the TF-IDF transformation achieved an accuracy of 84%, higher than the best Random Forest model, which achieved an accuracy of 80% using the CountVectorizer transformation.

33

Fathirachman Mahing, Naufal, Alifi Lazuardi Gunawan, Ahmad Foresta Azhar Zen, Fitra Abdurrachman Bachtiar, and Satrio Agung Wicaksono. "Klasifikasi Tingkat Stress dari Data Berbentuk Teks dengan Menggunakan Algoritma Support Vector Machine (SVM) dan Random Forest." Jurnal Teknologi Informasi dan Ilmu Komputer 11, no. 5 (2024): 1067–76. https://doi.org/10.25126/jtiik.2024118010.

34

Yang, Hulin, and Ai’nan Zhu. "Analysis of Telecommunication Fraud Cases Based on TF-IDF Algorithm." IOP Conference Series: Earth and Environmental Science 742, no. 1 (2021): 012011. http://dx.doi.org/10.1088/1755-1315/742/1/012011.

35

Prabowo, Dhimas Anjar, Muhammad Fhadli, Mochammad Ainun Najib, Handika Agus Fauzi, and Imam Cholissodin. "TF-IDF-Enhanced Genetic Algorithm Untuk Extractive Automatic Text Summarization." Jurnal Teknologi Informasi dan Ilmu Komputer 3, no. 3 (2016): 208. http://dx.doi.org/10.25126/jtiik.201633217.

36

Zhu, Zhiliang, Jie Liang, Deyang Li, Hai Yu, and Guoqi Liu. "Hot Topic Detection Based on a Refined TF-IDF Algorithm." IEEE Access 7 (2019): 26996–7007. http://dx.doi.org/10.1109/access.2019.2893980.

37

Harmandini, Keisha Priya, and Kemas Muslim L. "Analysis of TF-IDF and TF-RF Feature Extraction on Product Review Sentiment." Sinkron 8, no. 2 (2024): 929–37. http://dx.doi.org/10.33395/sinkron.v8i2.13376.

Abstract:

Sentiment analysis of product reviews is critical in understanding customer views and satisfaction, especially in the context of e-commerce applications. A marketplace provides channels where users can submit reviews of the products they purchase. However, due to the large number of reviews in a marketplace, analyzing them is no longer feasible to be performed manually. This research proposes a machine learning implementation to perform sentiment analysis on product reviews. In this research, the product review dataset on Shopee marketplace is used for sentiment analysis by comparing TF-IDF and TF-RF feature extraction using the SVM algorithm with stages of dataset, labeling, feature extraction and accuracy results. The importance of the comparison between TF-IDF and TF-RF feature extraction in this research is related to the need to evaluate and determine which feature extraction method is most effective in increasing the accuracy of sentiment analysis. TF-IDF and TF-RF are two methods commonly used in text analysis, and a comparison of their performance can provide deep insight into the effectiveness of each in the context of product sentiment analysis. Thus, through this comparison, this research aims to determine the best approach that can provide the highest accuracy results, so that the results can serve as a guide for further research. Based on the evaluation, the highest accuracy value is achieved at 92.87% by using TF-IDF and SVM classifiers which outperformed previous research.

38

Amalia, Amalia, Maya Silvi Lydia, Siti Dara Fadilla, and Miftahul Huda. "Perbandingan Metode Klaster dan Preprocessing Untuk Dokumen Berbahasa Indonesia." Jurnal Rekayasa Elektrika 14, no. 1 (2018): 35–42. http://dx.doi.org/10.17529/jre.v14i1.9027.

Abstract:

Clustering is an unsupervised method that automatically groups objects based on their similarity. The quality of clustering accuracy is determined by the number of similar objects in a correct cluster group. A robust preprocessing process and the choice of cluster algorithm can increase the efficiency of clustering. The objective of this study is to find the most suitable method for clustering documents in Bahasa Indonesia. We performed tests on several cluster algorithms, namely K-Means, K-Means++ and Agglomerative, with various preprocessing stages and collected the accuracy of each algorithm. Clustering experiments were conducted on a corpus containing 100 documents in Bahasa Indonesia with a commonly used preprocessing scenario. Additionally, we added our own preprocessing stages: an LSA function, a TF-IDF function, and a combined LSA / TF-IDF function. We tested various LSA dimension reduction values from 10% to 90%, and the results show that the best reduction rates lie between 50% and 80%. The results also indicate that the K-Means++ algorithm produces better purity values than the other algorithms.

39

Chen, Rong, Feng Chen, and Yi Sun. "Research on Automatic Text Classification Algorithm Based on ITF-IDF and KNN." Applied Mechanics and Materials 713-715 (January 2015): 1830–34. http://dx.doi.org/10.4028/www.scientific.net/amm.713-715.1830.

Abstract:

We consider how to efficiently perform text classification on all pairs of documents. This information can be used for information retrieval, digital libraries, information filtering, and search engines, among others. This paper describes a text classification model based on the KNN algorithm. Because the text feature extraction algorithm TF-IDF can lose related information between text features, an improved ITF-IDF algorithm is presented to overcome this. Our experiments show that our algorithm is better than others.

40

Fikri, Muhammad, and Zaenal Abidin. "Analysis Of The Use Of Nazief-Adriani Stemming And Porter Stemming In Covid-19 Twitter Sentiment Analysis With Term Frequency-Inverse Document Frequency Weighting Based On K-Nearest Neighbor Algorithm." Recursive Journal of Informatics 2, no. 2 (2024): 80–87. http://dx.doi.org/10.15294/rji.v2i2.74267.

Abstract:

This system was developed to determine the accuracy of sentiment analysis on Twitter regarding the COVID-19 issue using the Nazief-Adriani and Porter stemmers with TF-IDF weighting, along with a classification process using K-Nearest Neighbor (KNN), which resulted in accuracies of 50.98% for Nazief-Adriani and 48.24% for Porter. Purpose: This research aims to determine the accuracy of the Nazief-Adriani and Porter stemmer algorithms in performing text preprocessing using a dataset from Indonesian-language Twitter. This research involves word weighting using TF-IDF and classification using the K-Nearest Neighbor (KNN) algorithm. Methods/Study design/approach: The experiment was conducted by applying the Nazief-Adriani and Porter stemmer algorithms to data sourced from Twitter related to COVID-19. The data then underwent text preprocessing, stemming, TF-IDF weighting, and accuracy testing of training and testing data using the K-Nearest Neighbor (KNN) algorithm, and the accuracy of both stemmers was calculated using a confusion matrix table. Result/Findings: The Nazief-Adriani stemmer achieved an accuracy of 50.98% when applied to sentiment analysis of Indonesian-language COVID-19-related Twitter data. The Porter stemmer achieved an accuracy of 48.24%. Novelty/Originality/Value: Feature selection is crucial in stemmer accuracy testing. Therefore, in this study, feature selection is carried out using the Nazief-Adriani and Porter stemmers for testing purposes, and the accuracy data classification is conducted using the K-Nearest Neighbor (KNN) algorithm.

41

Yunxiang, Liu Qi Xu Zhang Tang: SIT Shanghai. "Research on Text Classification Method based on PTF-IDF and Cosine Similarity." Journal of Information and Communication Engineering Volume 6, Issue 1 (2020): 335–38. https://doi.org/10.5281/zenodo.4261313.

Abstract:

Text classification is a foundational task in many NLP applications. The text classification task in the era of big data faces new challenges. We propose a Promoted TF-IDF (Promoted-TF-IDF) and cosine similarity method for text classification. In our model, with the pre-trained word segmentation tool, we apply PTF-IDF method to judge which words play key roles in text classification to capture the key components in category. We also apply Cosine Similarity algorithm to judge similarity between text and category. We conduct experiments on commonly used datasets. The experimental results show that the proposed method outperforms the state-of-the-art methods on several datasets.

42

Bozkurt, Muhammed Oğuzhan, Yağız Yaman, and Fahrettin Horasan. "Sentiment analysis with machine learning for drug reviews." Journal of Computer & Electrical and Electronics Engineering Sciences 2, no. 2 (2024): 35–45. http://dx.doi.org/10.51271/jceees-0016.

Abstract:

In the treatment of disease, individuals who use drugs independently, without appropriate consultation with doctors, can make their health status worse than normal. This article aims to conduct a sentiment analysis of individuals' comments about drugs in cases where they use drugs without consultation. Within the scope of this study, patients' comments about drugs were vectorized using the BoW and TF-IDF algorithms, sentiment analysis was performed, and the predicted sentiments were evaluated with precision, recall, F1 score, accuracy and AUC score. As a result of the evaluations, the most successful result was obtained with the TF-IDF method, using the Linear Support Vector Classifier algorithm, with an accuracy value of 93%.

43

Navitha Abhinaya S, Neha H, Papireddigari Renusree, and Sowmya Lakshmi B. S. "Keyphrase Extraction from Scientific Articles." International Journal of Scientific Research in Computer Science, Engineering and Information Technology 10, no. 3 (2024): 601–11. http://dx.doi.org/10.32628/cseit24103210.

Abstract:

Keyphrase extraction is a crucial task in natural language processing (NLP) that involves identifying important terms and phrases in a text. This paper presents a methodology for extracting keyphrases from scientific articles using a combination of preprocessing techniques and the term frequency-inverse document frequency (TF-IDF) algorithm. The approach includes tokenization, stopword removal, and punctuation elimination, followed by the application of the TF-IDF vectorizer to identify and score keyphrases. The results demonstrate the effectiveness of the method in highlighting significant terms in scientific texts.
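The pipeline this abstract lists (tokenization, stopword removal, punctuation elimination, then TF-IDF scoring) can be sketched in a few lines. This is a simplified stand-in rather than the paper's method: the tiny `STOPWORDS` set, the function names, and the unsmoothed `log(n / df)` IDF are assumptions, and the sketch ranks single terms only, not multi-word phrases.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "is", "to", "for", "on", "with"}

def preprocess(text):
    # Regex tokenization drops punctuation; the filter drops stopwords.
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]

def top_terms(docs, doc_index, k=3):
    tokenized = [preprocess(d) for d in docs]
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(t for toks in tokenized for t in set(toks))
    tf = Counter(tokenized[doc_index])
    total = sum(tf.values())
    # Relative term frequency times unsmoothed IDF.
    scores = {t: (c / total) * math.log(n / df[t]) for t, c in tf.items()}
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]
```

Terms that are frequent in the chosen document but rare in the rest of the collection rise to the top, which is the behaviour the abstract relies on for highlighting significant terms.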

44

Lubis, Arif Ridho, Mahyuddin Khairuddin Matyuso Nasution, Opim Salim Sitompul, and Elviawaty Muisa Zamzami. "The feature extraction for classifying words on social media with the Naïve Bayes algorithm." IAES International Journal of Artificial Intelligence (IJ-AI) 11, no. 3 (2022): 1041. http://dx.doi.org/10.11591/ijai.v11.i3.pp1041-1048.

Abstract:

Naïve Bayes classification (NBC) requires prior pre-processing and feature extraction. Generally, pre-processing eliminates unnecessary words, while feature extraction processes the remaining words. This paper focuses on feature extraction, applying word2vec for calculations and searches and term frequency-inverse document frequency (TF-IDF) for frequency weighting. Words on Twitter are classified using 1734 tweets, each treated as a document when computing TF-IDF weights: for words that appear often across tweets, the TF-IDF value decreases, and vice versa. Once the weight of each word in the tweets has been obtained, classification is carried out using Naïve Bayes with 1734 test data, yielding an accuracy of 88.8% for tweets in the slack word category and 78.79% for tweets in the verb category. It can be concluded that words available on Twitter can be classified as slack words or verbs with a fairly good level of accuracy, reflecting the habits of Twitter users.

45

Lubis, Arif Ridho, Mahyuddin Khairuddin Matyuso Nasution, Opim Salim Sitompul, and Elviawaty Muisa Zamzami. "The feature extraction for classifying words on social media with the Naïve Bayes algorithm." International Journal of Artificial Intelligence (IJ-AI) 11, no. 3 (2022): 1041–48. https://doi.org/10.11591/ijai.v11.i3.pp1041-1048.

Full text

Abstract:

To classify Naïve Bayes classification (NBC), however, it is necessary to have a previous pre-processing and feature extraction. Generally, pre-processing eliminates unnecessary words while feature extraction processes these words. This paper focuses on feature extraction in which calculations and searches are used by applying word2vec while in frequency using term frequency-Inverse document frequency (TF-IDF). The process of classifying words on Twitter with 1734 tweets which are defined as a document to weight the calculation of frequency with TF-IDF with words that often come out in tweet, the value of TF-IDF decreases and vice versa. Following the achievement of the weight value of the word in the tweet, the classification is carried out using Naïve Bayes with 1734 test data, yielding an accuracy of 88.8% in the Slack word category tweet and while in the tweet category of verb 78.79%. It can be concluded that the data in the form of words available on twitter can be classified and those that refer to slack words and verbs with a fairly good level of accuracy. so that it manifests from the habit of twitter social media user.

46

Kardkovács, Zsolt T., and Gábor Kovács. "Finding sequential patterns with TF-IDF metrics in health-care databases." Acta Universitatis Sapientiae, Informatica 6, no. 2 (2014): 287–310. http://dx.doi.org/10.1515/ausi-2015-0008.

Abstract:

Finding frequent sequential patterns has been defined as finding ordered lists of items that occur more times in a database than a user-defined threshold. For big and dense databases that contain really long sequences and large itemsets, such as medical case histories, algorithms based on this idea of counting occurrences output an enormous number of highly redundant frequent sequences and are therefore simply impractical. Therefore, there is a need for algorithms that perform frequent pattern search and prefiltering simultaneously. In this paper, we propose an algorithm that reinterprets the term support on a text mining basis. Experiments show that our method not only eliminates redundancy among the output sequences but also scales much better with huge input data sizes. We apply our algorithm to mining medical databases: what diagnoses are likely to lead to a certain future health condition.

47

Deviyanto, Akhmad, and Muhammad Didik Rohmad Wahyudi. "PENERAPAN ANALISIS SENTIMEN PADA PENGGUNA TWITTER MENGGUNAKAN METODE K-NEAREST NEIGHBOR." JISKA (Jurnal Informatika Sunan Kalijaga) 3, no. 1 (2018): 1. http://dx.doi.org/10.14421/jiska.2018.31-01.

Abstract:

This research implements the K-Nearest Neighbor (KNN) algorithm for sentiment analysis of Twitter users on the topic of the 2017 Jakarta gubernatorial election. The data set consists of 2000 Indonesian-language tweets collected from Twitter during January 2017 using a Python package called Twitterscraper. The sentiment analysis system uses KNN with TF-IDF term weighting and the cosine similarity measure to classify sentiment into two classes: positive and negative. In testing, the highest accuracy is 67.2% at k=5, the highest precision is 56.94% at k=5, and the highest recall is 78.24% at k=15. Keywords: K-Nearest Neighbor, Twitterscraper, TF-IDF, Cosine Similarity
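As an illustration only, the combination named in this abstract (TF-IDF weighting, cosine similarity, and a k-nearest-neighbour vote) can be sketched in a few lines of standard-library Python. This is not the authors' implementation: the function names `fit_idf`, `vectorize`, `cosine`, and `knn_predict`, the unsmoothed IDF variant `log(N / df)`, and the toy tokens are all assumptions made for the example.

```python
import math
from collections import Counter


def fit_idf(docs):
    """Compute IDF per term from tokenized training documents.

    Assumption: plain log(N / df); the abstract does not state a variant.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n / count) for term, count in df.items()}


def vectorize(tokens, idf):
    """Turn a token list into a sparse TF-IDF vector (dict term -> weight).

    Terms never seen in training are dropped.
    """
    tf = Counter(t for t in tokens if t in idf)
    total = sum(tf.values())
    if total == 0:
        return {}
    return {t: (c / total) * idf[t] for t, c in tf.items()}


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_predict(train_docs, train_labels, query_tokens, k=3):
    """Label a query by majority vote among its k most similar training docs."""
    idf = fit_idf(train_docs)
    train_vecs = [vectorize(doc, idf) for doc in train_docs]
    query_vec = vectorize(query_tokens, idf)
    ranked = sorted(
        zip(train_vecs, train_labels),
        key=lambda pair: cosine(query_vec, pair[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data standing in for tokenized, labelled tweets.
TRAIN_DOCS = [
    ["good", "great", "candidate"],
    ["great", "debate", "good"],
    ["bad", "policy"],
    ["bad", "terrible", "debate"],
]
TRAIN_LABELS = ["positive", "positive", "negative", "negative"]

if __name__ == "__main__":
    print(knn_predict(TRAIN_DOCS, TRAIN_LABELS, ["good", "debate"], k=3))
```

In practice the choice of k matters, as the abstract's results at k=5 and k=15 suggest; an odd k avoids tied votes in a two-class setting.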

48

Cahyani, Salsabila Nida, and Galuh Wilujeng Saraswati. "IMPLEMENTATION OF SUPPORT VECTOR MACHINE METHOD IN CLASSIFYING SCHOOL LIBRARY BOOKS WITH COMBINATION OF TF-IDF AND WORD2VEC." Jurnal Teknik Informatika (Jutif) 4, no. 6 (2023): 1555–66. http://dx.doi.org/10.52436/1.jutif.2023.4.6.1536.

Abstract:

The development of technology in education is integral to enhancing its quality, such as implementing information technology in school libraries. Searching for books in school libraries is time-consuming due to conventional book classification, lacking organization based on classifications. Therefore, implementing information technology in school libraries is crucial to improve library management effectiveness. An innovative solution optimizing library management involves leveraging artificial intelligence, particularly machine learning. In applying machine learning to library book classification, Support Vector Machine acts as an algorithm understanding patterns and characteristics of book titles, categorizing them into Dewey Decimal Classification (DDC). The dataset comprises 10 classes aligned with DDC. Random data collection follows an 80:20 scale for training and testing data. Data preprocessing is an initial research stage, addressing imbalanced data through oversampling. Testing the SVM algorithm with a linear kernel and C = 1 parameter is conducted three times using different feature extraction methods: TF-IDF alone, Word2Vec alone, and a combination of TF-IDF and Word2Vec. Model performance evaluation employs K-Fold Cross-Validation. After the three objective tests, the most accurate book classification results were obtained using a combination of TF-IDF and Word2Vec feature extraction. It's concluded that SVM's book classification method can be applied, yielding the highest accuracy of 73% with the TF-IDF and Word2Vec feature extraction combination. This outperforms other feature extraction methods, with precision at 83%, recall at 72%, and an F1-Score of 76%.

49

Rahim, Robbi, Nuning Kurniasih, Muhammad Dedi Irawan, et al. "Latent Semantic Indexing for Indonesian Text Similarity." International Journal of Engineering & Technology 7, no. 2.3 (2018): 73. http://dx.doi.org/10.14419/ijet.v7i2.3.12619.

Abstract:

A document is a piece of writing that can be used as evidence of information. Plagiarism is a deliberate or unintentional act of obtaining or attempting to obtain credit or value for a scientific work by citing some or all of another party's scientific work as one's own without stating the source properly and adequately. The Latent Semantic Indexing method serves to find text in a document that matches other text. The algorithm used is the TF/IDF algorithm, in which the weight of a term in a document is the product of its TF value and its IDF value, while the Vector Space Model (VSM) is a method for measuring the closeness or similarity of words by means of term weighting.
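For orientation, the weighting and similarity described in this abstract are commonly written as follows. This is the standard textbook formulation, given here as an assumption; the abstract does not state which TF or IDF variant the paper uses. Here $\mathrm{tf}_{t,d}$ is the frequency of term $t$ in document $d$, $N$ is the number of documents, $\mathrm{df}_t$ is the number of documents containing $t$, and $\mathbf{w}_d$ is the vector of weights $w_{t,d}$ over all terms.

```latex
\begin{align}
  \mathrm{idf}_t &= \log\frac{N}{\mathrm{df}_t}, \\
  w_{t,d} &= \mathrm{tf}_{t,d} \times \mathrm{idf}_t, \\
  \mathrm{sim}(d_1, d_2) &= \frac{\mathbf{w}_{d_1} \cdot \mathbf{w}_{d_2}}
                                {\lVert \mathbf{w}_{d_1} \rVert \, \lVert \mathbf{w}_{d_2} \rVert}
\end{align}
```

Under this form, a term that occurs in every document has $\mathrm{idf}_t = 0$ and contributes nothing to the similarity, which is why very common words carry little weight in the Vector Space Model.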

50

Alfauzan, Muhammad Fikri, Yuliant Sibaroni, and Fitriyani Fitriyani. "Sentiment Classification of Fuel Price Rise in Economic Aspects Using Lexicon and SVM Method." sinkron 8, no. 4 (2023): 2526–36. http://dx.doi.org/10.33395/sinkron.v8i4.12851.

Abstract:

After COVID-19 had hit the world for a long time, paralyzing countries and causing their economies in particular to drop dramatically, the world was shocked again by the conflict between Russia and Ukraine, which drove up world oil prices, including in Indonesia. Many people complained about and opposed the government's policy of increasing fuel prices because fuel affects various aspects of life, including the economy. Based on these problems, the researchers use sentiment analysis to find out people's opinions on an issue being discussed throughout Indonesia. This research focuses on comparing the SVM algorithm with TF-IDF feature extraction, evaluated using K-Fold Cross Validation, against the Lexicon Inset dictionary, which assigns a weight to each word. The model using the SVM algorithm with TF-IDF feature extraction and K-Fold Cross Validation obtained an average accuracy of 0.85, while the model using a dataset labeled automatically with the Indonesian sentiment lexicon (Lexicon Inset) obtained an average accuracy of 0.68. Classification using SVM with TF-IDF feature extraction is superior to using Lexicon Inset.
