Journal articles on the topic 'Term Frequency (TF)'

Consult the top 50 journal articles for your research on the topic 'Term Frequency (TF).'

1

Mohammed, Mohannad T., and Omar Fitian Rashid. "Document retrieval using term frequency inverse sentence frequency weighting scheme." Indonesian Journal of Electrical Engineering and Computer Science 31, no. 3 (2023): 1478. http://dx.doi.org/10.11591/ijeecs.v31.i3.pp1478-1485.

Abstract:
The need for an efficient method to find the most appropriate document for a particular search query has become crucial due to the exponential growth in the number of papers readily available on the web. The vector space model (VSM), a model widely used in information retrieval, represents words as vectors in space and weights them via a popular weighting method known as term frequency-inverse document frequency (TF-IDF). This research proposes retrieving the most relevant document by representing documents and queries as vectors of average term frequency-inverse sentence frequency (TF-ISF) weights instead of TF-IDF weights, and uses two basic and effective similarity measures, Cosine and Jaccard. Using the MS MARCO dataset, this article analyzes and assesses the retrieval effectiveness of the TF-ISF weighting scheme. The results show that the TF-ISF model with the Cosine similarity measure retrieves more relevant documents. The model was evaluated against the conventional TF-IDF technique and performs significantly better on MS MARCO data (Microsoft-curated data of Bing queries).
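
The abstract above does not give its formulas, so the following is only a minimal sketch of how TF-ISF weighting with Cosine and Jaccard similarity might look. It assumes ISF(t) = log(N / sf(t)), where N is the number of sentences in the collection and sf(t) is the number of sentences containing t, and it averages per-sentence TF-ISF weights over each document. The function names (`isf_table`, `avg_tf_isf`) and the toy data are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; an assumption made for illustration only.
    return text.lower().split()


def isf_table(documents):
    """Inverse sentence frequency over all sentences in the collection."""
    sentences = [set(tokenize(s)) for doc in documents for s in doc]
    n = len(sentences)
    sf = Counter(term for sentence in sentences for term in sentence)
    return {term: math.log(n / count) for term, count in sf.items()}


def avg_tf_isf(sentences, isf):
    """Average the per-sentence TF-ISF weight of each term over the sentences."""
    totals = Counter()
    for sentence in sentences:
        for term, count in Counter(tokenize(sentence)).items():
            totals[term] += count * isf.get(term, 0.0)
    return {term: total / len(sentences) for term, total in totals.items()}


def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def jaccard(u, v):
    # Jaccard over the sets of terms that carry a positive weight.
    a = {t for t, w in u.items() if w > 0}
    b = {t for t, w in v.items() if w > 0}
    return len(a & b) / len(a | b) if a | b else 0.0


# Toy collection: each document is a list of sentences.
docs = [
    ["the cat sat", "the cat purred"],
    ["the stock fell", "the market fell"],
]
isf = isf_table(docs)
vectors = [avg_tf_isf(doc, isf) for doc in docs]
query = avg_tf_isf(["cat purred"], isf)

# Rank documents by Cosine similarity to the query, best match first.
ranking = sorted(range(len(docs)), key=lambda i: cosine(query, vectors[i]), reverse=True)
print(ranking)
```

In this toy collection the word "the" occurs in every sentence, so its ISF is zero and it contributes nothing to either similarity measure.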
2

Mohammed, Mohannad T., and Omar Fitian Rashid. "Document retrieval using term frequency inverse sentence frequency weighting scheme." Indonesian Journal of Electrical Engineering and Computer Science 31, no. 3 (2023): 1478–85. https://doi.org/10.11591/ijeecs.v31.i3.pp1478-1485.

3

Ariyanti, Meiga Ayu, Aji Prasetya Wibawa, and Utomo Pujianto. "Metode term frequency - invers document frequency pada mekanisme pencarian judul skripsi." TEKNO 28, no. 2 (2019): 177. http://dx.doi.org/10.17977/um034v28i2p177-190.

Abstract:
The aims of this research and development are (1) to design and build a search mechanism using the TF-IDF method as one of the features of SISINTA, (2) to test the accuracy, precision, and sensitivity of the TF-IDF method, and (3) to test the functionality of the TF-IDF search mechanism. The result is a thesis-title search feature based on term frequency and inverse document frequency (TF-IDF). The feature displays thesis titles that are relevant to the user's search keywords. White-box testing of accuracy, precision, and sensitivity gave the same percentage for all three measures, 92%, which falls in the "perfect" category. Black-box testing with users gave a functionality success rate of 100%. Averaging the white-box and black-box results gives 96%, so it can be concluded that the TF-IDF method in the SISINTA thesis-title search mechanism is highly valid and highly suitable for use.
4

Shehzad, Farhan, Abdur Rehman, Kashif Javed, Khalid A. Alnowibet, Haroon A. Babri, and Hafiz Tayyab Rauf. "Binned Term Count: An Alternative to Term Frequency for Text Categorization." Mathematics 10, no. 21 (2022): 4124. http://dx.doi.org/10.3390/math10214124.

Abstract:
In text categorization, a well-known problem related to document length is that larger term counts in longer documents cause classification algorithms to become biased. The effect of document length can be eliminated by normalizing term counts, thus reducing the bias towards longer documents. This gives us term frequency (TF), which in conjunction with inverse document frequency (IDF) became the most commonly used term weighting scheme to capture the importance of a term in a document and corpus. However, normalization may cause the term frequency of a term in a related document to become equal to or smaller than its term frequency in an unrelated document, thus perturbing a term's strength from its true worth. In this paper, we solve this problem by introducing a non-linear mapping of term frequency. This alternative to TF is called binned term count (BTC). The newly proposed term frequency factor trims large term counts before normalization, thus moderating the normalization effect on large documents. To investigate the effectiveness of BTC, we compare it against the original TF and its more recently proposed alternative named modified term frequency (MTF). In our experiments, each of these term frequency factors (BTC, TF, and MTF) is combined with four well-known collection frequency factors (IDF, RF, IGM, and MONO), and the performance of each of the resulting term weighting schemes is evaluated on three standard datasets (Reuters (R8-21578), 20-Newsgroups, and WebKB) using support vector machines and K-nearest neighbor classifiers. To determine whether BTC is statistically better than TF and MTF, we have applied the paired two-sided t-test on the macro F1 results. Overall, BTC is found to be 52% statistically significant than TF and MTF. Furthermore, the highest macro F1 value on the three datasets was achieved by BTC-based term weighting schemes.
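
The normalization problem described in this abstract can be illustrated with a small arithmetic sketch. The counts and document lengths below are invented for illustration, and the helper `length_normalized_tf` is hypothetical. The BTC mapping itself is not specified in the abstract, so it is not reproduced here.

```python
def length_normalized_tf(term_count, doc_length):
    """Term count divided by document length (one common TF normalization)."""
    return term_count / doc_length


# A long, on-topic document that mentions the term 10 times in 1000 tokens.
related_tf = length_normalized_tf(10, 1000)

# A short, off-topic document that mentions the term once in 50 tokens.
unrelated_tf = length_normalized_tf(1, 50)

# After normalization the unrelated document carries the larger weight.
print(related_tf < unrelated_tf)
```

Here the on-topic document ends up with half the normalized weight of the off-topic one, which is the kind of distortion the abstract attributes to normalization.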
5

Lopes, Lucelene, Paulo Fernandes, and Renata Vieira. "Estimating term domain relevance through term frequency, disjoint corpora frequency - tf-dcf." Knowledge-Based Systems 97 (April 2016): 237–49. http://dx.doi.org/10.1016/j.knosys.2015.12.015.

6

Vichianchai, Vuttichai, and Sumonta Kasemvilas. "A New Term Frequency with Gaussian Technique for Text Classification and Sentiment Analysis." Journal of ICT Research and Applications 15, no. 2 (2021): 152–68. http://dx.doi.org/10.5614/itbj.ict.res.appl.2021.15.2.4.

Abstract:
This paper proposes a new term frequency with a Gaussian technique (TF-G) to classify the risk of suicide from Thai clinical notes and to perform sentiment analysis based on Thai customer reviews and English tweets of travelers that use US airline services. This research compared TF-G with term weighting techniques based on Thai text classification methods from previous research, including the bag-of-words (BoW), term frequency (TF), term frequency-inverse document frequency (TF-IDF), and term frequency-inverse corpus document frequency (TF-ICF) techniques. Suicide risk classification and sentiment analysis were performed with the decision tree (DT), naïve Bayes (NB), support vector machine (SVM), random forest (RF), and multilayer perceptron (MLP) techniques. The experimental results showed that TF-G is appropriate for feature extraction to classify the risk of suicide and to analyze the sentiments of customer reviews and tweets of travelers. The TF-G technique was more accurate than BoW, TF, TF-IDF and TF-ICF for term weighting in Thai suicide risk classification, for term weighting in sentiment analysis of Thai customer reviews for Burger King, Pizza Hut, and Sizzler restaurants, and for the sentiment analysis of English tweets of travelers using US airline services.
7

Tamardina, Fadhilla Atansa, Hasbi Yasin, and Dwi Ispriyanti. "ANALISIS SENTIMEN REVIEW APLIKASI CRYPTOCURRENCY MENGGUNAKAN ALGORITMA MAXIMUM ENTROPY DENGAN METODE PEMBOBOTAN TF, TF-IDF DAN BINARY." Jurnal Gaussian 11, no. 1 (2022): 1–10. http://dx.doi.org/10.14710/j.gauss.v11i1.34004.

Abstract:
The ongoing COVID-19 pandemic has worsened Indonesia's economic condition. People affected by pandemic-related wage cuts have had to find ways to earn passive income, and one way to do so is to invest. Cryptocurrency is an application-based investment instrument with high returns. Pintu is the first application to provide mobile app facilities to its users. Released in 2020, it has already received many user reviews, which need to be examined to determine whether they are positive or negative. Sentiment analysis of the Pintu application was chosen to examine user sentiment, divided into two classes, positive and negative. Classification is performed with the Maximum Entropy algorithm, comparing the term weighting methods Term Frequency (TF), Term Frequency-Inverse Document Frequency (TF-IDF), and Binary. The best classification model is chosen by accuracy, evaluated with 5-fold cross-validation. The Maximum Entropy model with Binary weighting reaches an accuracy of 83.21%, while the model with Term Frequency reaches only 83.01% and the model with Term Frequency-Inverse Document Frequency only 83.20%. This shows that there is no significant difference between the Maximum Entropy models using the TF, TF-IDF, and Binary term weighting methods. Keywords: Cryptocurrency, Binary, Term Frequency, Term Frequency-Inverse Document Frequency, Maximum Entropy
8

Hendra Suputra, I. Putu Gede, Kiki Dwi Prebiana, and Frisca Olivia Gorianto. "Perbandingan Jenis TF terhadap Hasil Evaluasi Information Retrieval." JELIKU (Jurnal Elektronik Ilmu Komputer Udayana) 8, no. 2 (2021): 207. http://dx.doi.org/10.24843/jlk.2019.v08.i02.p13.

Abstract:
In an information retrieval system, one way to measure the similarity between a query and documents is Term Frequency-Inverse Document Frequency (TF-IDF). The TF most commonly used is the raw term frequency count, although many other TF variants can be combined with IDF. This study combines four TF variants, namely Natural TF, Normalization/max TF, Logarithmic TF, and Boolean TF, with IDF in order to find which variant performs better. The results show that Logarithmic TF is the best, with an F-measure of 0.00662.
Keywords: TF-IDF, Natural TF, Normalization TF, Logarithmic TF, Boolean TF
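
The four TF variants named in this abstract can be sketched as follows. The abstract does not state the exact formulas, so this sketch assumes the common textbook definitions: the raw count, the count divided by the largest count in the document, 1 + log10(count), and a 0/1 presence flag. The function name `tf_variants` is hypothetical.

```python
import math
from collections import Counter


def tf_variants(tokens):
    """Return the four TF variants for every term that occurs in `tokens`."""
    counts = Counter(tokens)
    if not counts:
        return {}
    max_count = max(counts.values())
    return {
        term: {
            "natural": count,                # raw term count
            "max_norm": count / max_count,   # count scaled by the largest count (assumed form)
            "log": 1 + math.log10(count),    # sublinear scaling (base 10 is an assumption)
            "boolean": 1,                    # the term is present
        }
        for term, count in counts.items()
    }


variants = tf_variants("tf idf tf tf weighting".split())
for term, values in variants.items():
    print(term, values)
```

Any of these values can then be multiplied by an IDF factor to obtain the corresponding TF-IDF weight.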
9

Yulita, Winda, Meida Cahyo Untoro, Mugi Praseptiawan, Ilham Firman Ashari, Aidil Afriansyah, and Ahmad Naim Bin Che Pee. "Automatic Scoring Using Term Frequency Inverse Document Frequency Document Frequency and Cosine Similarity." Scientific Journal of Informatics 10, no. 2 (2023): 93–104. http://dx.doi.org/10.15294/sji.v10i2.42209.

Abstract:
Purpose: In the learning process, most tests of learning achievement use short-answer or essay questions. The variety of answers given by students requires the teacher to read each one closely, and the quality of this scoring process is difficult to guarantee when it is done manually. In addition, each class is taught by a different teacher, which can lead to unequal grades because of differences in teacher experience. The purpose of this study is therefore to develop an automated assessment of the answers. Automated short answer scoring is designed to automatically grade and evaluate students' answers based on a series of trained answer documents. Methods: The method used is TF-IDF-DF with similarity and scoring computation. The word weighting used is the Term Frequency-Inverse Document Frequency-Document Frequency (TF-IDF-DF) method. The data consist of 5 questions, each answered by 30 students; the students' answers are assessed by teachers/experts to determine the real score. The study was evaluated by Mean Absolute Error (MAE). Result: The evaluation gave a Mean Absolute Error (MAE) of 0.123. Value: The word weighting method used, TF-IDF-DF, is an improvement over the Term Frequency-Inverse Document Frequency (TF-IDF) method. It is applied before calculating the similarity between teacher and student sentences.
10

Tama, Fauzaan Rakan, and Yuliant Sibaroni. "Fake News (Hoaxes) Detection on Twitter Social Media Content through Convolutional Neural Network (CNN) Method." JINAV: Journal of Information and Visualization 4, no. 1 (2023): 70–78. http://dx.doi.org/10.35877/454ri.jinav1525.

Abstract:
Social media use has a strong influence on the community. Users can easily post various activities in the form of text, photos, and videos on social media. Information on social media can include fake news and hoaxes that have an impact on society. One of the most widely used social media platforms is Twitter. This study aims to detect fake news in tweets using the Convolutional Neural Network (CNN) method by comparing two feature weighting schemes, Term Frequency-Inverse Document Frequency (TF-IDF) and Term Frequency-Relevance Frequency (TF-RF). The highest accuracy, 84.11%, was obtained with TF-RF weighting, while TF-IDF weighting reached 80.29%.
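
For readers unfamiliar with TF-RF, the sketch below uses the commonly cited definition of relevance frequency, rf = log2(2 + a / max(1, c)), where a is the number of positive-class documents containing the term and c the number of negative-class documents containing it. Whether the study above used exactly this form is an assumption, and the toy documents and function names are hypothetical.

```python
import math


def relevance_frequency(term, positive_docs, negative_docs):
    """rf = log2(2 + a / max(1, c)) with documents given as sets of terms."""
    a = sum(term in doc for doc in positive_docs)  # positive docs containing the term
    c = sum(term in doc for doc in negative_docs)  # negative docs containing the term
    return math.log2(2 + a / max(1, c))


def tf_rf(tokens, term, positive_docs, negative_docs):
    """Raw term count in one document multiplied by the term's relevance frequency."""
    return tokens.count(term) * relevance_frequency(term, positive_docs, negative_docs)


# Toy labelled collection (invented): hoax tweets are the positive class.
hoax_docs = [{"breaking", "miracle", "cure"}, {"miracle", "secret"}]
real_docs = [{"official", "report"}, {"report", "miracle"}]

tweet = ["miracle", "miracle", "cure"]
print(tf_rf(tweet, "miracle", hoax_docs, real_docs))
```

Unlike IDF, this factor uses the class labels, so a term concentrated in one class receives a larger weight than a term spread evenly across both.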
11

Ni'mah, Ana Tsalitsatun, and Agus Zainal Arifin. "Perbandingan Metode Term Weighting terhadap Hasil Klasifikasi Teks pada Dataset Terjemahan Kitab Hadis." Rekayasa 13, no. 2 (2020): 172–80. http://dx.doi.org/10.21107/rekayasa.v13i2.6412.

Abstract:
Hadith is the second source of reference for Islam after the Qur'an. Hadith texts are currently studied in the field of technology so that the values they contain can be captured computationally. Retrieving information from the Books of Hadith requires representing text as vectors in order to optimize automatic classification. Classification is needed to group the contents of the hadith into several categories. Some categories in one Book of Hadith are the same as those in other books, which shows that certain hadith documents share topics with documents in other Books of Hadith. Therefore, a term weighting method is needed that can choose which words should have high or low weights in the Hadith Book space to optimize the classification results across the Books of Hadith. This study compares several term weighting methods, namely: Term Frequency Inverse Document Frequency (TF-IDF), Term Frequency Inverse Document Frequency Inverse Class Frequency (TF-IDF-ICF), Term Frequency Inverse Document Frequency Inverse Class Space Density Frequency (TF-IDF-ICSδF), and Term Frequency Inverse Document Frequency Inverse Class Space Density Frequency Inverse Hadith Space Density Frequency (TF-IDF-ICSδF-IHSδF). The comparison uses a dataset of translations of 9 Books of Hadith with Naive Bayes and SVM classifiers. The 9 Books of Hadith used are: Sahih Bukhari, Sahih Muslim, Abu Dawud, at-Turmudzi, an-Nasa'i, Ibnu Majah, Ahmad, Malik, and Darimi.
The trial results show that classification using the TF-IDF-ICSδF-IHSδF term weighting method outperformed the other term weighting methods, achieving a Precision of 90%, Recall of 93%, F1-Score of 92%, and Accuracy of 83%.
12

Widianto, Adi, Eka Pebriyanto, Fitriyanti Fitriyanti, and Marna Marna. "Document Similarity Using Term Frequency-Inverse Document Frequency Representation and Cosine Similarity." Journal of Dinda : Data Science, Information Technology, and Data Analytics 4, no. 2 (2024): 149–53. http://dx.doi.org/10.20895/dinda.v4i2.1589.

Abstract:
Document similarity is a fundamental task in natural language processing and information retrieval, with applications ranging from plagiarism detection to recommendation systems. In this study, we leverage the term frequency-inverse document frequency (TF-IDF) to represent documents in a high-dimensional vector space, capturing their unique content while mitigating the influence of common terms. Subsequently, we employ the cosine similarity metric to measure the similarity between pairs of documents, which assesses the angle between their respective TF-IDF vectors. To evaluate the effectiveness of our approach, we conducted experiments on the Document Similarity Triplets Dataset, a benchmark dataset specifically designed for assessing document similarity techniques. Our experimental results demonstrate a significant performance with an accuracy score of 93.6% using bigram-only representation. However, we observed instances where false predictions occurred due to paired documents having similar terms but differing semantics, revealing a weakness in the TF-IDF approach. To address this limitation, future research could focus on augmenting document representations with semantic features. Incorporating semantic information, such as word embeddings or contextual embeddings, could enhance the model's ability to capture nuanced semantic relationships between documents, thereby improving accuracy in scenarios where term overlap does not adequately signify similarity.
13

Priyanka, Mesariya, and Madia Nidhi. "Document Ranking using Customizes Vector Method." International Journal of Trend in Scientific Research and Development 1, no. 4 (2017): 278–83. https://doi.org/10.31142/ijtsrd125.

Abstract:
An information retrieval (IR) system ranks documents against a user's query and retrieves the relevant records from a large dataset. Document ranking is essentially the search for relevant documents according to their rank. The vector space model is a traditional and widely applied information retrieval model that ranks web pages based on similarity values. Term weighting schemes are a significant part of an information retrieval system and are applied to the query in document ranking. TF-IDF ranking calculates term weights from the user's query, based on the terms that occur in the documents. When a user enters a query, the system finds the documents that contain the query terms, counts the terms, calculates TF-IDF, and returns the documents ranked from the highest weight. Priyanka Mesariya | Nidhi Madia "Document Ranking using Customizes Vector Method" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-1 | Issue-4, June 2017, URL: https://www.ijtsrd.com/papers/ijtsrd125.pdf
14

Nugroho, Kuncahyo Setyo, Fitra A. Bachtiar, and Wayan Firdaus Mahmudy. "Detecting Emotion in Indonesian Tweets: A Term-Weighting Scheme Study." Journal of Information Systems Engineering and Business Intelligence 8, no. 1 (2022): 61–70. http://dx.doi.org/10.20473/jisebi.8.1.61-70.

Abstract:
Background: Term-weighting plays a key role in detecting emotion in texts. Studies in term-weighting schemes aim to improve short text classification by distinguishing terms accurately. Objective: This study aims to formulate the best term-weighting schemes and discover the relationship between n-gram combinations and different classification algorithms in detecting emotion in Twitter texts. Methods: The data used was the Indonesian Twitter Emotion Dataset, with features generated through different n-gram combinations. Two approaches assign weights to the features. Tests were carried out using ten-fold cross-validation on three classification algorithms. The performance of the model was measured using accuracy and F1 score. Results: The term-weighting schemes with the highest performance are Term Frequency-Inverse Category Frequency (TF-ICF) and Term Frequency-Relevance Frequency (TF-RF). The scheme with a supervised approach performed better than the unsupervised one. However, we did not find a consistent advantage as some of the experiments found that Term Frequency-Inverse Document Frequency (TF-IDF) also performed exceptionally well. The traditional TF-IDF method remains worth considering as a term-weighting scheme. Conclusion: This study provides recommendations for emotion detection in texts. Future studies can benefit from dealing with imbalances in the dataset to provide better performance. Keywords: Emotion Detection, Feature Engineering, Term-Weighting, Text Mining
15

Muhammad Kiko Aulia Reiki, Yuliant Sibaroni, and Erwin Budi Setiawan. "Comparison of Term Weighting Methods in Sentiment Analysis of the New State Capital of Indonesia with the SVM Method." International Journal on Information and Communication Technology (IJoICT) 8, no. 2 (2023): 53–65. http://dx.doi.org/10.21108/ijoict.v8i2.681.

Abstract:
The relocation of the State Capital to "Nusantara", which was inaugurated with the enactment of UU No. 3 of 2022, is a significant project that has sparked polemics among Indonesian citizens. Many people expressed their opinions and thoughts regarding the relocation of the State Capital on Twitter. This tendency of public opinion needs to be identified with sentiment analysis. In sentiment analysis, term weighting is an essential component for obtaining optimal accuracy. Researchers have tried modifying existing term weighting schemes to increase the performance and accuracy of the model. One such scheme is the icf-based tf-bin.icf, which combines inverse category frequency (ICF) and relevance frequency (RF). This study compares the tf-idf, tf-rf, and tf-bin.icf term weighting schemes with the SVM classification method on the topic of the new State Capital of Indonesia. The tf-idf weighting still gives the best results compared to tf-bin.icf and tf-rf, with an accuracy score of 88.0%, a 1.3% difference from tf-bin.icf term weighting.
16

Alshehri, Arwa, and Abdulmohsen Algarni. "TF-TDA: A Novel Supervised Term Weighting Scheme for Sentiment Analysis." Electronics 12, no. 7 (2023): 1632. http://dx.doi.org/10.3390/electronics12071632.

Abstract:
In text classification tasks, such as sentiment analysis (SA), feature representation and weighting schemes play a crucial role in classification performance. Traditional term weighting schemes depend on the term frequency within the entire document collection; therefore, they are called unsupervised term weighting (UTW) schemes. One of the most popular UTW schemes is term frequency–inverse document frequency (TF-IDF); however, this is not sufficient for SA tasks. Newer weighting schemes have been developed to take advantage of the membership of documents in their categories. These are called supervised term weighting (STW) schemes; however, most of them weigh the extracted features without considering the characteristics of some noisy features and data imbalances. Therefore, in this study, a novel STW approach was proposed, known as term frequency–term discrimination ability (TF-TDA). TF-TDA mainly presents the extracted features with different degrees of discrimination by categorizing them into several groups. Subsequently, each group is weighted based on its contribution. The proposed method was examined over four SA datasets using naive Bayes (NB) and support vector machine (SVM) models. The experimental results proved the superiority of TF-TDA over two baseline term weighting approaches, with improvements ranging from 0.52% to 3.99% in the F1 score. The statistical test results verified the significant improvement obtained by TF-TDA in most cases, where the p-value ranged from 0.0000597 to 0.0455.
17

You, Zi-Hung, Ya-Han Hu, Chih-Fong Tsai, and Yen-Ming Kuo. "Integrating Feature and Instance Selection Techniques in Opinion Mining." International Journal of Data Warehousing and Mining 16, no. 3 (2020): 168–82. http://dx.doi.org/10.4018/ijdwm.2020070109.

Abstract:
Opinion mining focuses on extracting polarity information from texts. For textual term representation, different feature selection methods, e.g. term frequency (TF) or term frequency–inverse document frequency (TF–IDF), can yield diverse numbers of text features. In text classification, however, a selected training set may contain noisy documents (or outliers), which can degrade the classification performance. To solve this problem, instance selection can be adopted to filter out unrepresentative training documents. Therefore, this article investigates the opinion mining performance associated with feature and instance selection steps simultaneously. Two combination processes based on performing feature selection and instance selection in different orders, were compared. Specifically, two feature selection methods, namely TF and TF–IDF, and two instance selection methods, namely DROP3 and IB3, were employed for comparison. The experimental results by using three Twitter datasets to develop sentiment classifiers showed that TF–IDF followed by DROP3 performs the best.
18

Sintia, Sintia, Sarjon Defit, and Gunadi Widi Nurcahyo. "Product Codefication Accuracy With Cosine Similarity And Weighted Term Frequency And Inverse Document Frequency (TF-IDF)." Journal of Applied Engineering and Technological Science (JAETS) 2, no. 2 (2021): 62–69. http://dx.doi.org/10.37385/jaets.v2i2.210.

Abstract:
In the SiPaGa application, the codification search process is still inaccurate, so OPD often make mistakes in choosing goods codes. Cosine Similarity and TF-IDF methods are therefore needed to improve the accuracy of the search. Cosine Similarity is a method for calculating similarity using keywords from the code of goods. Term Frequency-Inverse Document Frequency (TF-IDF) is a way of assigning a weight to a word (term). The purpose of this research is to improve the accuracy of the search for goods codification. The study processed 14,417 goods codification records sourced from the Goods and Price Planning Information System (SiPaGa) application database. The search keywords were processed using the Cosine Similarity method to measure similarity and TF-IDF to calculate the weighting. This research produces the calculation of cosine similarity and TF-IDF weighting, which is expected to be applied to the SiPaGa application so that its search process becomes more accurate than before and OPD can choose the product code they want.
19

Christian, Hans, Mikhael Pramodana Agus, and Derwin Suhartono. "Single Document Automatic Text Summarization using Term Frequency-Inverse Document Frequency (TF-IDF)." ComTech: Computer, Mathematics and Engineering Applications 7, no. 4 (2016): 285. http://dx.doi.org/10.21512/comtech.v7i4.3746.

Abstract:
The increasing availability of online information has triggered intensive research in the area of automatic text summarization within Natural Language Processing (NLP). Text summarization reduces a text by removing less useful information, which helps the reader find the required information quickly. Many kinds of algorithms can be used to summarize text. One of them is TF-IDF (Term Frequency-Inverse Document Frequency). This research aimed to produce an automatic text summarizer implemented with the TF-IDF algorithm and to compare it with various other online automatic text summarizers. To evaluate the summary produced by each summarizer, the F-Measure was used as the standard comparison value. The research achieved 67% accuracy on three data samples, which is higher than the other online summarizers.
20

Hla, Sann Sint, and Khine Oo Khine. "Comparison of two methods on vector space model for trust in social commerce." TELKOMNIKA (Telecommunication, Computing, Electronics and Control) 19, no. 3 (2021): 809–16. https://doi.org/10.12928/telkomnika.v19i3.18150.

Abstract:
Information retrieval (IR) is the study of searching for information in documents within web pages. The user needs to describe information with comments or reviews that consist of a number of words. Determining the weight of a query term helps to decide the significance of a query. Estimating term significance is a basic part of most information retrieval approaches, and it is commonly done through term frequency-inverse document frequency (TF-IDF). An improved TF-IDF method is also used to retrieve information in web documents. This paper presents a comparison of the TF-IDF method and the improved TF-IDF method for information retrieval. Cosine similarity was calculated for both methods, and the results were compared against the desired threshold value. The TF-IDF method extracted more relevant documents than the improved TF-IDF method.
21

Arif, Ridho Lubis, Khairuddin Matyuso Nasution Mahyuddin, Salim Sitompul Opim, and Muisa Zamzami Elviawaty. "The feature extraction for classifying words on social media with the Naïve Bayes algorithm." International Journal of Artificial Intelligence (IJ-AI) 11, no. 3 (2022): 1041–48. https://doi.org/10.11591/ijai.v11.i3.pp1041-1048.

Abstract:
Classification with Naïve Bayes classification (NBC), however, requires prior pre-processing and feature extraction. Generally, pre-processing eliminates unnecessary words while feature extraction processes these words. This paper focuses on feature extraction, applying word2vec for calculations and searches and term frequency-inverse document frequency (TF-IDF) for frequency. Words on Twitter were classified using 1734 tweets, each treated as a document for TF-IDF frequency weighting: the more often a word appears across tweets, the lower its TF-IDF value, and vice versa. Once the word weights in the tweets were obtained, classification was carried out using Naïve Bayes with 1734 test data, yielding an accuracy of 88.8% for tweets in the slack word category and 78.79% for tweets in the verb category. It can be concluded that word data available on Twitter, including words that refer to slack words and verbs, can be classified with a fairly good level of accuracy, reflecting the habits of Twitter users.
22

Gou, Zhinan, Zheng Huo, Yuanzhen Liu, and Yi Yang. "A Method for Constructing Supervised Topic Model Based on Term Frequency-Inverse Topic Frequency." Symmetry 11, no. 12 (2019): 1486. http://dx.doi.org/10.3390/sym11121486.

Abstract:
Supervised topic modeling has been successfully applied in the fields of document classification and tag recommendation in recent years. However, most existing models neglect the fact that topic terms have the ability to distinguish topics. In this paper, we propose a term frequency-inverse topic frequency (TF-ITF) method for constructing a supervised topic model, in which the weight of each topic term indicates the ability to distinguish topics. We conduct a series of experiments with not only the symmetric Dirichlet prior parameters but also the asymmetric Dirichlet prior parameters. Experimental results demonstrate that the result of introducing TF-ITF into a supervised topic model outperforms several state-of-the-art supervised topic models.
23

Setiawan, Gede Herdian, and I. Made Budi Adnyana. "Improving Helpdesk Chatbot Performance with Term Frequency-Inverse Document Frequency (TF-IDF) and Cosine Similarity Models." Journal of Applied Informatics and Computing 7, no. 2 (2023): 252–57. http://dx.doi.org/10.30871/jaic.v7i2.6527.

Abstract:
Helpdesk chatbots are growing in popularity due to their ability to provide help and answers to user questions quickly and effectively. Chatbot development poses several challenges, including enhancing accuracy in understanding user queries and providing relevant responses while improving problem-solving efficiency. In this research, we aim to enhance the accuracy and efficiency of the Helpdesk Chatbot by implementing the Term Frequency-Inverse Document Frequency (TF-IDF) model and the Cosine Similarity algorithm. The TF-IDF model is a method used to measure the frequency of words in a document and their occurrence in the entire document collection, while the Cosine Similarity algorithm is used to measure the similarity between two documents. After implementing and testing TF-IDF and Cosine Similarity models in the Helpdesk Chatbot, we achieved a 75% question recognition rate. To increase accuracy and precision, it is necessary to enlarge the knowledge dataset and improve pre-processing, especially in recognizing and correcting inaccurate spelling.
24

Adelia, Dila, Widi Astuti, and Kemas Muslim Lhaksmana. "Election Hoax Detection on X using CNN with TF-RF and TF-IDF Weighting Features." Journal of Computer System and Informatics (JoSYC) 5, no. 4 (2024): 912–20. https://doi.org/10.47065/josyc.v5i4.5778.

Abstract:
X social media is a microblogging platform for sharing brief thoughts and trends. It has become a focal point for expressing political views. The increased political engagement on X social media has facilitated the swift and extensive sharing of ideas. Still, it also brings the risk of spreading false information and hoaxes that can manipulate public opinion. Preventing fake news on social media is crucial because it can influence election outcomes and social stability. For example, X social media has been used during elections to spread hoaxes, such as false claims of vote tampering or misleading information about candidate qualifications. This study implements a Convolutional Neural Network (CNN) due to its advantages in recognizing complex patterns and achieving high performance in tasks like classification. The dataset used in this study consists of 2,670 tweets. The dataset is divided into three subsets: 60% for training, 20% for testing, and 20% for validation. It also uses Term Frequency Relevance Frequency (TF-RF) and Term Frequency Inverse Document Frequency (TF-IDF) weighting features to improve accuracy in detecting fake news. This study compares the TF-RF and TF-IDF weighting features using the CNN classification method on the topic of the 2024 election. The testing results indicate that both TF-RF and TF-IDF achieved similar overall performance, with TF-RF slightly excelling in recall and F1-score. At the same time, TF-IDF showed a marginally higher precision.
25

Santoti, Jennifer Velensia, Jennifer Jocelyn, and Hafiz Irsyad. "Implementasi Term Frequency - Inverse Document Frequency dan Cosine Similarity untuk Analisis Kemiripan Deskripsi Produk Halal." Jurnal Software Engineering and Computational Intelligence 3, no. 01 (2025): 44–52. https://doi.org/10.36982/jseci.v3i01.5421.

Abstract:
In today's digital era, clarity of product information has become an important factor in supporting consumers' purchasing decisions. This research focuses on implementing feature extraction from product descriptions using the TF-IDF (Term Frequency - Inverse Document Frequency) and Cosine Similarity methods to predict confusing product descriptions. The research methodology comprises several preprocessing stages, including tokenizing, stopword removal, filtering, removal of null and NaN data, and text feature extraction using the TF-IDF and Cosine Similarity methods. The evaluation results show that the system successfully recognized halal products with a precision of 96%, a recall of 98%, and an F1-score of 97%, indicating a good balance between precision and recall. For haram products, it achieved a precision of 98%, a recall of 95%, and an F1-score of 97%. Overall, the system obtained an accuracy of 97%. The evaluation results show that the model is better at recognizing halal products, with a recall of 98%, compared with a recall of 95% for haram products. This indicates that the method used is highly effective in predicting the clarity of product descriptions. The conclusion of this research affirms that the combination of TF-IDF and Cosine Similarity is effective in identifying ambiguity in product descriptions, and can thereby improve information transparency for consumers.
26

Jiang, Zhiying, Bo Gao, Yanlin He, Yongming Han, Paul Doyle, and Qunxiong Zhu. "Text Classification Using Novel Term Weighting Scheme-Based Improved TF-IDF for Internet Media Reports." Mathematical Problems in Engineering 2021 (March 5, 2021): 1–30. http://dx.doi.org/10.1155/2021/6619088.

Abstract:
With the rapid development of internet technology, a large amount of internet text data can be obtained. Text classification (TC) technology plays a very important role in processing massive text data, but the accuracy of classification is directly affected by the performance of term weighting in TC. Because it was originally designed for information retrieval (IR), term frequency-inverse document frequency (TF-IDF) is not effective enough for TC, especially for processing text data with unbalanced distributions in internet media reports. Therefore, the variance between the DF value of a particular term and the average of all DFs, $\overline{DF}$, namely the document frequency variance (ADF), is proposed to enhance the ability to process text data with unbalanced distributions. Then, the normal TF-IDF is modified by the proposed ADF for processing unbalanced text collections in four different ways, namely, TF-IADF, TF-IADF+, TF-IADFnorm, and TF-IADF+norm. As a result, an effective model can be established for the TC task of internet media reports. A series of simulations have been carried out to evaluate the performance of the proposed methods. Compared with TF-IDF on state-of-the-art classification algorithms, the effectiveness and feasibility of the proposed methods are confirmed by simulation results.
27

Deo, Tula Kanta, Rajesh Keshavrao Deshmukh, and Gajendra Sharma. "Comparative Study among Term Frequency-Inverse Document Frequency and Count Vectorizer towards K Nearest Neighbor and Decision Tree Classifiers for Text Dataset." Nepal Journal of Multidisciplinary Research 7, no. 2 (2024): 1–11. http://dx.doi.org/10.3126/njmr.v7i2.68189.

Abstract:
Background: Text classification techniques are increasingly important with the exponential growth of textual data on the internet. Term Frequency-Inverse Document Frequency (TF-IDF) and Count Vectorizer (CV) are commonly used methods for feature extraction. TF-IDF assigns weights to terms based on their frequency. CV simply counts the occurrences of terms. The performance of CV and TF-IDF is evaluated and compared with KNN and DT classifiers across text datasets. Methodology: The investigation begins with preprocessing. The feature vectors are created using both TF-IDF and CV. Feature vectors are passed into the KNN and DT classifiers in the training stage. Experiments are carried out using Kaggle's public Ukraine 10K tweets sentiment_analysis dataset and the Womens ecommerce clothing reviews dataset. Findings: The average precision, recall, f1 score and accuracy of KNN with TF-IDF were 84.5%, 87%, 83%, 87% respectively, and of KNN with CV were 83.5%, 87%, 83.5%, 87% respectively. Similarly, the average precision, recall, f1 score and accuracy of DT with TF-IDF were 89%, 89%, 89%, 89% respectively, and of DT with CV were 89%, 89.5%, 89.5%, 89.5% respectively. The results obtained in this research are consistent with previous similar research results. Conclusions: The performance of TF-IDF is almost the same as that of CV for a particular dataset and a particular classifier in this study. Novelty: The experiment performed using these classifiers and feature extraction methods on the datasets is a novelty and contribution of this research.
28

Alfarizi, Muhammad Ibnu, Lailis Syafaah, and Merinda Lestandy. "Emotional Text Classification Using TF-IDF (Term Frequency-Inverse Document Frequency) And LSTM (Long Short-Term Memory)." JUITA : Jurnal Informatika 10, no. 2 (2022): 225. http://dx.doi.org/10.30595/juita.v10i2.13262.

Abstract:
In carrying out communication activities, humans can express their feelings either verbally or non-verbally. Verbal communication can be in the form of oral or written communication. A person's feelings or emotions can usually be seen in their behavior, tone of voice, and expression. Not everyone can see emotion through writing alone, whether in the form of words, sentences, or paragraphs. Therefore, a classification system is needed to help someone determine the emotions contained in a piece of writing. The novelty of this study is that it develops previous research using a similar method, namely LSTM, but improves the word weighting process by using the TF-IDF method as a further step in LSTM classification. The method proposed in this research is called Natural Language Processing (NLP). The purpose of this study was to compare the classification method with the LSTM (Long Short-Term Memory) model by adding the word weighting TF-IDF (Term Frequency–Inverse Document Frequency) and the LinearSVC model, as well as to increase accuracy in determining an emotion (sadness, anger, fear, love, joy, and surprise) contained in the text. The dataset contains 18000 samples, divided into 16000 training data and 2000 test data, with 6 emotion classes, namely sadness, anger, fear, love, joy, and surprise. Emotion classification using the LSTM method yielded an accuracy of 97.50%, while the LinearSVC method yielded an accuracy of 89%.
29

Sarkar, Kamal, and Santanu Dam. "Exploiting Semantic Term Relations in Text Summarization." International Journal of Information Retrieval Research 12, no. 1 (2022): 1–18. http://dx.doi.org/10.4018/ijirr.289607.

Abstract:
The traditional frequency-based approach to creating a multi-document extractive summary ranks sentences based on scores computed by summing up the TF*IDF weights of the words contained in the sentences. In this approach, TF or term frequency is calculated based on how frequently a term (word) occurs in the input, and TF calculated in this way does not take into account the semantic relations among terms. In this paper, we propose methods that exploit semantic term relations to improve the sentence ranking and redundancy removal steps of a summarization system. Our proposed summarization system has been tested on the DUC 2003 and DUC 2004 benchmark multi-document summarization datasets. The experimental results reveal that the performance of our multi-document text summarizer is significantly improved when the distributional term similarity measure is used for finding semantic term relations. Our multi-document text summarizer also outperforms some well-known summarization baselines to which it is compared.
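The traditional baseline this abstract starts from, scoring each sentence by the sum of the TF*IDF weights of its words, can be sketched as below. This is only an illustration of that baseline, not of the authors' semantic method. It assumes whitespace tokenization and computes IDF over the input documents themselves; a real system would more likely take IDF from a larger background corpus. The example sentences are invented.

```python
import math
from collections import Counter

def rank_sentences(documents):
    """documents: list of documents, each a list of sentence strings.

    Returns (score, sentence) pairs sorted from highest to lowest score.
    """
    tokenized = [[s.lower().split() for s in doc] for doc in documents]
    n_docs = len(tokenized)
    # TF: how often each term occurs in the whole input.
    tf = Counter(t for doc in tokenized for sent in doc for t in sent)
    # DF: number of input documents that contain each term.
    df = Counter()
    for doc in tokenized:
        df.update({t for sent in doc for t in sent})
    idf = {t: math.log(n_docs / df[t]) for t in df}
    scored = []
    for doc, doc_tokens in zip(documents, tokenized):
        for sentence, sent_tokens in zip(doc, doc_tokens):
            # Sentence score = sum of TF*IDF over the words it contains.
            score = sum(tf[t] * idf[t] for t in sent_tokens)
            scored.append((score, sentence))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

docs = [
    ["the storm hit the coast", "the storm flooded roads"],
    ["officials met today", "the coast was calm"],
]
ranked = rank_sentences(docs)
print(ranked[0][1])
```

Terms that occur in every input document (here "the" and "coast") receive an IDF of zero and contribute nothing, which is the purely frequency-based behaviour the abstract contrasts with semantic term relations.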
30

Alshuraiqi, Hamza Sulimansallam. "Improved Term Frequency Inverse Document Frequency (TF-IDF) Method for Arabic Text Classification." International Journal of Advanced Trends in Computer Science and Engineering 9, no. 5 (2020): 6939–46. http://dx.doi.org/10.30534/ijatcse/2020/11952020.

31

Riadi, Imam, Sunardi Sunardi, and Panggah Widiandana. "Mobile Forensics for Cyberbullying Detection using Term Frequency - Inverse Document Frequency (TF-IDF)." Jurnal Ilmiah Teknik Elektro Komputer dan Informatika 5, no. 2 (2020): 68. http://dx.doi.org/10.26555/jiteki.v5i2.14510.

32

Sai, Ch Pranay. "Webpage Metadata Extration Using Machine Learning Techiniques." International Journal for Research in Applied Science and Engineering Technology 11, no. 12 (2023): 851–54. http://dx.doi.org/10.22214/ijraset.2023.57449.

Abstract:
This Python script defines a Flask web application enabling users to input a URL. The application fetches the webpage content and utilizes TF-IDF (Term Frequency-Inverse Document Frequency) analysis to extract information like the title, description, and top keywords. The / route renders an HTML template (index.html) for user input, while the /extract route handles a POST request, fetching the webpage content, extracting relevant information using TF-IDF analysis, and rendering the results in another HTML template (result.html). The TF-IDF process involves tokenizing the text, eliminating stopwords, and calculating TF-IDF scores for each term. The top 10 keywords are then extracted based on their TF-IDF scores. The script also incorporates error handling for cases where the webpage cannot be fetched or an exception occurs during the process.
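The keyword step described here (tokenize, drop stopwords, score terms by TF-IDF, keep the top k) might look roughly like the sketch below. It deliberately leaves out Flask and page fetching. The stopword list, the `background_texts` parameter used to compute document frequencies, and the sample strings are all assumptions made for illustration; the abstract does not say which reference collection the script uses.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not the script's list).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on"}

def tokenize(text):
    """Lowercase, split on non-alphanumerics, and drop stopwords."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS]

def top_keywords(page_text, background_texts, k=10):
    """Return the k terms of page_text with the highest TF-IDF scores."""
    page = tokenize(page_text)
    # The page itself is part of the collection, so every term has df >= 1.
    corpus = [set(page)] + [set(tokenize(t)) for t in background_texts]
    n = len(corpus)
    tf = Counter(page)
    scores = {
        term: (count / len(page)) * math.log(n / sum(term in doc for doc in corpus))
        for term, count in tf.items()
    }
    # Sort by descending score, breaking ties alphabetically for stable output.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]

page = "Flask is a web framework. Flask routes map a URL to a Python function."
background = ["Python is a programming language.", "A web page has a URL."]
keywords = top_keywords(page, background, k=3)
print(keywords)
```

With a single page and no background collection, every term would have the same document frequency and the ranking would reduce to plain term frequency, which is why the sketch takes a reference collection as input.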
33

Pujianto, Utomo, and Arya Yudhi Wijaya. "Pemilihan Korpus Statis Bersesuaian dengan Cosine Similarity dan Penggunaan IDF Global Pada Penambahan Dokumen Baru." E-Link : Jurnal Teknik Elektro dan Informatika 14, no. 2 (2020): 1. http://dx.doi.org/10.30587/e-link.v14i2.1215.

Abstract:
A problem that arises when weighting with "term frequency–inverse document frequency" (tf-idf) values is the need to recalculate the inverse document frequency (idf) every time a new document is added to the database. This raises the computational complexity to O(N^2). To address this problem, this paper proposes a method that uses cosine similarity and a number of predefined static corpora. Cosine similarity is used to measure how similar the term frequency (tf) values of the new document are to the mean tf values of each static corpus in the database. The idf values of the static corpus with the highest similarity to the new document are then chosen as the idf values for the new document. Experimental results show no significant difference between the tf-idf values computed with the existing method and those computed with the method proposed in this paper. In other words, this method can be considered as an alternative for determining idf values, especially since its complexity is only O(N).
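The selection step in this abstract can be sketched as follows: compare the new document's term frequencies with the mean term frequencies of each static corpus, then reuse the idf table of the closest corpus instead of recomputing idf. This is a hedged reading of the abstract, not the authors' code; the corpus names, the `mean_tf` and `idf` tables, and their values are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse {term: value} vectors."""
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def weight_with_static_idf(new_doc_tokens, static_corpora):
    """Pick the static corpus most similar to the new document and
    weight the document with that corpus's precomputed idf values."""
    tf = Counter(new_doc_tokens)
    best = max(static_corpora, key=lambda c: cosine(tf, c["mean_tf"]))
    # Terms unknown to the chosen corpus get weight 0 (a simplifying assumption).
    weights = {t: tf[t] * best["idf"].get(t, 0.0) for t in tf}
    return best["name"], weights

# Hypothetical precomputed static corpora.
static_corpora = [
    {"name": "sports", "mean_tf": {"goal": 2.0, "match": 1.5},
     "idf": {"goal": 0.3, "match": 0.5}},
    {"name": "finance", "mean_tf": {"stock": 2.0, "market": 1.0},
     "idf": {"stock": 0.4, "market": 0.6}},
]
name, weights = weight_with_static_idf(["goal", "goal", "match", "stock"], static_corpora)
print(name, weights)
```

Because the idf tables are fixed in advance, adding a document costs one similarity comparison per static corpus and does not touch the weights of documents already stored.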
34

Chang, Hsien-Tsung, Shu-Wei Liu, and Nilamadhab Mishra. "A tracking and summarization system for online Chinese news topics." Aslib Journal of Information Management 67, no. 6 (2015): 687–99. http://dx.doi.org/10.1108/ajim-10-2014-0147.

Abstract:
Purpose – The purpose of this paper is to design and implement new tracking and summarization algorithms for Chinese news content. Based on the proposed methods and algorithms, the authors extract the important sentences that are contained in topic stories and list those sentences according to timestamp order to ensure ease of understanding and to visualize multiple news stories on a single screen. Design/methodology/approach – This paper encompasses an investigational approach that implements a new Dynamic Centroid Summarization algorithm in addition to a Term Frequency (TF)-Density algorithm to empirically compute three target parameters, i.e., recall, precision, and F-measure. Findings – The proposed TF-Density algorithm is implemented and compared with the well-known algorithms Term Frequency-Inverse Word Frequency (TF-IWF) and Term Frequency-Inverse Document Frequency (TF-IDF). Three test data sets are configured from Chinese news web sites for use during the investigation, and two important findings are obtained that help the authors provide more precision and efficiency when recognizing the important words in the text. First, the authors evaluate three topic tracking algorithms, i.e., TF-Density, TF-IDF, and TF-IWF, with the said target parameters and find that the recall, precision, and F-measure of the proposed TF-Density algorithm is better than those of the TF-IWF and TF-IDF algorithms. In the context of the second finding, the authors implement a blind test approach to obtain the results of topic summarizations and find that the proposed Dynamic Centroid Summarization process can more accurately select topic sentences than the LexRank process. Research limitations/implications – The results show that the tracking and summarization algorithms for news topics can provide more precise and convenient results for users tracking the news. 
The analysis and implications are limited to Chinese news content from Chinese news web sites such as Apple Library, UDN, and well-known portals like Yahoo and Google. Originality/value – The research provides an empirical analysis of Chinese news content through the proposed TF-Density and Dynamic Centroid Summarization algorithms. It focusses on improving the means of summarizing a set of news stories to appear for browsing on a single screen and carries implications for innovative word measurements in practice.
35

Xu, Dong Dong, and Shao Bo Wu. "An Improved TFIDF Algorithm in Text Classification." Applied Mechanics and Materials 651-653 (September 2014): 2258–61. http://dx.doi.org/10.4028/www.scientific.net/amm.651-653.2258.

Abstract:
Term frequency/inverse document frequency (TF-IDF), which is borrowed from Information Retrieval, is currently widely used in text classification. Based on the conventional TF-IDF formula, we present a new TF-IDF weighting scheme named CTF-IDF. The experiment shows that the improved method is feasible and effective. Furthermore, subsequent evaluations using 10-fold cross-validation show that CTF-IDF greatly improves the accuracy of text classification.
36

Sulaksono, Juli, Risky Aswi Ramadhani, and Ratih Kumalasari Niswatin. "Automatic Article Summary with the Term Frequency-Inverse Document Frequency Algorithm for Information on Elderly Health." Journal of Computational and Theoretical Nanoscience 17, no. 2 (2020): 1511–13. http://dx.doi.org/10.1166/jctn.2020.8833.

Abstract:
An elderly person is someone aged 60–74 years. At that age, one's health tends to decline. Various programs have been provided by the Indonesian government, such as providing information, giving out brochures, and posting announcements on the health service website. However, this counselling is not optimal because the elderly tend to be reluctant to read such material, as they have started to become farsighted. So that the health information provided by dina health can be optimal, we try to build a model that summarizes an article so that it is easily understood by the elderly. To summarize the article, this study uses the TF-IDF algorithm. By using the TF-IDF algorithm, it is expected that health articles will be easier for the elderly to understand.
37

Yunitarini, Rika, Jhon Filius Gultom, and Evy Maya Stefany. "Klasifikasi Jamu Tradisional Madura Menggunakan Metode K-Nearest Neighbors (KNN) dan Term Frequency-Inverse Document Frequency (TF-IDF) Sebagai Representasi Teks." Jurnal Informatika Polinema 11, no. 1 (2024): 99–106. https://doi.org/10.33795/jip.v11i1.6456.

Abstract:
Madurese jamu is a traditional herbal medicine used as an alternative for medical treatment and body care, by both men and women. This research aims to develop an automatic system for classifying Madurese jamu using K-Nearest Neighbors (KNN) modelling supported by the TF-IDF (Term Frequency-Inverse Document Frequency) text representation. K-Nearest Neighbors is a machine learning algorithm used for classification and regression, while TF-IDF is a technique commonly used in natural language processing (NLP) and information retrieval. The descriptions of Madurese jamu are converted into vector representations using the TF-IDF model, which enables contextual understanding of the words in the text. Model development involves training with the KNN method on labelled Madurese jamu data; the labels in this research comprise 3 classes, namely 1) Jamu Kesehatan (health jamu), 2) Jamu Perawatan kewanitaan (feminine care jamu), and 3) Pasutri (married couples). Classification is followed by an evaluation of model performance using metrics such as accuracy, precision, and recall. The results show that KNN with TF-IDF can achieve a high level of accuracy; the best result was obtained with k = 9 and a split of 90% training data and 10% test data, giving an accuracy of 85.71%, a precision of 88.92%, and a recall of 85.71%, which indicates good accuracy.
38

Qhabib, Fiqih Ainul, Abd Charis Fauzan, and Harliana Harliana. "Implementasi Algoritma Term Frequency Inverse Document Frequency (TF-IDF) dalam Menganalisis Sentimen Masyarakat Terhadap Covid-19 Varian Omicron." JTIM : Jurnal Teknologi Informasi dan Multimedia 4, no. 4 (2023): 308–18. http://dx.doi.org/10.35746/jtim.v4i4.233.

Abstract:
The latest variant, Omicron, was detected on November 24, 2021. WHO said Omicron was one of the Covid-19 variants that had mutated, with a very fast rate of spread. The Government of the Republic of Indonesia has officially banned all foreigners who have travelled to or come from countries exposed to the Omicron variant from entering Indonesia. This study uses data that has been processed using the Netlytic online website. Netlytic analyzes text and visualizes public online conversations on social media sites. Text preprocessing has several stages, namely case folding, tokenizing, stopword removal, and stemming. Data analysis is the stage that classifies words into positive, negative, or neutral sentiment classes. The last step is calculating the weights using the TF-IDF method. This is shown by the DF value, which reaches 628 words in one document, a D/DF value of 0.39, and a log D/DF of -0.41. In outline, the TF-IDF method makes it easy to calculate the frequency and relevance of word occurrences in a document. The TF-IDF method produces output according to user specifications, but it takes a long time for large amounts of data.
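The reported figures can be checked with a one-line calculation. The abstract does not state the base of the logarithm; a base-10 logarithm is assumed here because it reproduces the reported value.

```latex
\[
\mathrm{idf} = \log_{10}\!\left(\frac{D}{DF}\right) = \log_{10}(0.39) \approx -0.41
\]
```

The result is negative because the ratio D/DF is below 1, that is, the reported DF of 628 exceeds the document count D that the ratio implies.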
39

Ramadhan, Fikri Alwan, Sampe Hotlan Sitorus, and Tedy Rismawan. "Penerapan Metode Multinomial Naïve Bayes untuk Klasifikasi Judul Berita Clickbait dengan Term Frequency - Inverse Document Frequency." Jurnal Sistem dan Teknologi Informasi (JustIN) 11, no. 1 (2023): 70. http://dx.doi.org/10.26418/justin.v11i1.57452.

Abstract:
Clickbait is a bombastic news headline that gives incomplete information, making readers curious enough to click the news link. Clickbait headlines can be misleading because the title of the article is incomplete. As a result, the conclusions drawn from the headline and from the news content sometimes do not match. Research is therefore needed to classify whether a news headline is clickbait or not. This research uses the Multinomial Naïve Bayes method and TF-IDF (Term Frequency - Inverse Document Frequency), with data obtained from online news headlines taken from several websites. In this research, TF-IDF is used to assign word weights in the training, testing, and classification processes. The data comprise 1000 items, with 800 training items and 200 test items. Testing was carried out using a confusion matrix, yielding an accuracy of 65%, a recall of 65.69%, and a precision of 65.69%.
40

Rofiqi, Moh Afif, Abd Charis Fauzan, Afivatu Pratama Agustin, and Ahmad Agung Saputra. "Implementasi Term-Frequency Inverse Document Frequency (TF-IDF) Untuk Mencari Relevansi Dokumen Berdasarkan Query." ILKOMNIKA: Journal of Computer Science and Applied Informatics 1, no. 2 (2019): 58–64. http://dx.doi.org/10.28926/ilkomnika.v1i2.18.

Abstract:
The purpose of this research is to find the relevance among several documents in the form of news articles from several sources. The method used is Term-Frequency Inverse Document Frequency, because it is relevant to the accuracy of a document. Term-Frequency Inverse Document Frequency is a word calculation or weighting carried out through tokenization, stopword removal, and stemming; the frequency with which a word appears in a given document indicates the importance of that word within the document. Using data from news articles, this method weights the words in a document by multiplying the TF and IDF values based on the query results. Of the three articles, document one obtained a rank score of 3.90847, from which it can be concluded that the news article in document one is the most relevant compared with the other two articles.
41

Sutrisno, Rahmat Yanu, Abd Charis Fauzan, Fadhila Nur Hanifah, Ahmad Gufron, and Fatra Nonggala Putra. "Relevansi Artikel Berita Politik Berdasarkan Query Menggunakan Term Frequency Invers Document Frequency (TF-IDF)." ILKOMNIKA: Journal of Computer Science and Applied Informatics 2, no. 1 (2020): 54–58. http://dx.doi.org/10.28926/ilkomnika.v2i1.25.

Abstract:
Politics is inseparable from human life today, whether directly or indirectly. Its influence on human life is also very large, from government to personal life. Aristotle argued that politics is the most important part of human life because it has a strong influence. The purpose of this research is to find the relevance of political news articles based on a query. The data we use come from online media such as Detik and Cnn. The results show that the higher the accuracy of the query, the more often the query appears in the previously determined documents.
42

AlShammari, Ahmad Farhan. "Implementation of Keyword Extraction using Term Frequency-Inverse Document Frequency (TF-IDF) in Python." International Journal of Computer Applications 185, no. 35 (2023): 9–14. http://dx.doi.org/10.5120/ijca2023923137.

43

Rustam, Furqan, Imran Ashraf, Arif Mehmood, Saleem Ullah, and Gyu Choi. "Tweets Classification on the Base of Sentiments for US Airline Companies." Entropy 21, no. 11 (2019): 1078. http://dx.doi.org/10.3390/e21111078.

Abstract:
The use of data from social networks such as Twitter has increased during the last few years to improve political campaigns, quality of products and services, sentiment analysis, etc. Tweet classification based on user sentiments is a collaborative and important task for many organizations. This paper proposes a voting classifier (VC) to help sentiment analysis for such organizations. The VC is based on logistic regression (LR) and a stochastic gradient descent classifier (SGDC) and uses a soft voting mechanism to make the final prediction. Tweets were classified into positive, negative, and neutral classes based on the sentiments they contain. In addition, a variety of machine learning classifiers were evaluated using accuracy, precision, recall, and F1 score as the performance metrics. The impact of feature extraction techniques, including term frequency (TF), term frequency-inverse document frequency (TF-IDF), and word2vec, on classification accuracy was investigated as well. Moreover, the performance of a deep long short-term memory (LSTM) network was analyzed on the selected dataset. The results show that the proposed VC performs better than the other classifiers. The VC achieves an accuracy of 0.789 and 0.791 with TF and TF-IDF feature extraction, respectively. The results demonstrate that ensemble classifiers achieve higher accuracy than non-ensemble classifiers. Experiments further showed that the performance of machine learning classifiers is better when TF-IDF is used as the feature extraction method. Word2vec feature extraction performs worse than TF and TF-IDF feature extraction. The LSTM achieves a lower accuracy than the machine learning classifiers.
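The soft voting mechanism mentioned in this abstract can be illustrated with a small sketch: each classifier outputs class probabilities, the probabilities are averaged, and the class with the highest average wins. The helper `soft_vote` and the probability values below are hypothetical and are not taken from the paper.

```python
def soft_vote(prob_lists, labels):
    """Average per-class probabilities across classifiers and pick the top class."""
    n_models = len(prob_lists)
    avg = [sum(probs[i] for probs in prob_lists) / n_models
           for i in range(len(labels))]
    best = max(range(len(labels)), key=avg.__getitem__)
    return labels[best], avg

labels = ["negative", "neutral", "positive"]
lr_probs = [0.50, 0.30, 0.20]   # hypothetical logistic regression output
sgd_probs = [0.10, 0.30, 0.60]  # hypothetical SGD classifier output

label, avg = soft_vote([lr_probs, sgd_probs], labels)
print(label, avg)
```

In this made-up case the two models disagree on the top class, so a majority (hard) vote would tie, while averaging the probabilities resolves the disagreement in favour of the more confident model.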
44

Nugroho, Satyawan Agung, Fitra A. Bachtiar, and Randy Cahya Wihandika. "ASPECT EXTRACTION IN E-COMMERCE USING LATENT DIRICHLET ALLOCATION (LDA) WITH TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY (TF-IDF)." Jurnal Ilmiah Kursor 11, no. 2 (2022): 53. http://dx.doi.org/10.21107/kursor.v11i2.247.

Abstract:
Social media is widely used. Posts and comments on social media describe people's feelings and opinions, so they contain important topics that can be extracted. In the e-commerce field, topics are of interest because they can describe people's opinions of a product. However, the large number of social media users makes finding topics manually difficult, so computer assistance is needed. One method that can be used is Latent Dirichlet Allocation (LDA). LDA is a good method for extracting topics, but its drawback is that the topics are sometimes incomprehensible. To compensate, a TF-IDF feature selection method is used so that less important words are skipped and LDA can generate better topics. The best hyperparameter values obtained were 10 iterations, 10 topics, and α and β values of 0.1 and 0.01, respectively. The best feature selection percentile value is 90. This value is used to find the threshold that serves as the lower limit on the TF-IDF value of each word, so that words with greater TF-IDF values are used as features.
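The percentile-based feature selection described here could look roughly like the sketch below: compute a percentile of the per-word TF-IDF values, use it as a lower limit, and keep only the words at or above it. The nearest-rank percentile definition, the inclusive comparison, the function names, and the example weights are assumptions made for illustration.

```python
import math

def percentile_threshold(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of values at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def select_features(weights, pct):
    """Keep words whose TF-IDF weight is at or above the percentile threshold."""
    threshold = percentile_threshold(weights.values(), pct)
    return {word for word, value in weights.items() if value >= threshold}

# Hypothetical per-word TF-IDF weights from a set of product reviews.
weights = {
    "battery": 0.42, "camera": 0.38, "screen": 0.35, "price": 0.30,
    "delivery": 0.25, "is": 0.03, "the": 0.02, "it": 0.02,
    "and": 0.01, "a": 0.01,
}
selected = select_features(weights, 90)
print(selected)
```

The surviving words would then form the vocabulary passed to the topic model; lowering the percentile keeps more words at the cost of admitting less informative ones.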
45

Sharma, Saurabh, Zohaib Hasan, and Vishal Paranjape. "Applying Naive Bayes Techniques for Accurate Sentiment Analysis in Movie Reviews." International Journal of Innovative Research in Computer and Communication Engineering 10, no. 10 (2023): 8205–12. http://dx.doi.org/10.15680/ijircce.2022.1010019.

Abstract:
This study examines the effectiveness of Naive Bayes and Logistic Regression classifiers in analyzing the sentiment of movie reviews. Two feature extraction approaches, namely Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF), are utilized. We employed a dataset comprising 50,000 IMDB reviews that underwent preprocessing techniques such as denoising, stop word removal, and stemming. The reviews were transformed into vectors using the BoW and TF-IDF approaches. Our investigation demonstrates that Logistic Regression surpasses Naive Bayes in terms of accuracy. Logistic Regression achieves 89.52% accuracy for BoW and 89.23% accuracy for TF-IDF, while Naive Bayes achieves 85.01% accuracy for BoW and 85.74% accuracy for TF-IDF. Naive Bayes performs consistently, with minimal disparity between training and testing accuracies, indicating strong generalization despite its slightly lower accuracy. Naive Bayes therefore remains a strong contender because of its simplicity and consistent performance across feature extraction methods. This comparison offers significant insights for choosing suitable classifiers and feature extraction techniques for text classification problems in sentiment analysis.
46

Mujilahwati, Siti. "Kombinasi Algoritma Data Reduksi untuk Optimalisasi Dokumen Cluster." Jurnal Eksplora Informatika 12, no. 2 (2023): 113–19. http://dx.doi.org/10.30864/eksplora.v12i2.819.

Abstract:
Clustering is a grouping process without training (unsupervised learning); one algorithm that can be applied to clustering is K-Means. This algorithm works by computing the nearest distance to a cluster. This study aims to optimize the results of clustering thesis abstract data with the K-Means algorithm. The clustering results are optimized with a model that combines the Latent Semantic Analysis (LSA), Term Frequency-Inverse Document Frequency (TF-IDF), and Hashing algorithms. As is usual when handling text data, preprocessing for data cleaning and normalization was carried out before clustering. After preprocessing, the data were extracted into vector form using the TF-IDF and Hashing methods. The vectors produced by the extraction step were then combined with the LSA algorithm in order to reduce the data. Tests on 229 thesis records and 4 clusters show that combining LSA with TF-IDF extraction gives a more efficient execution time, whereas the LSA-Hashing combination gives a better F-measure.
47

P., Velavan, S. Guru Balan, A. Mohamed Sadham Hussain, L. Muthu Krishna, Selvam N., and M. Mohamed Sameer Ali. "Patient Data Analytics by Term Frequency Modulation Diagnosis." Bonfring International Journal of Man Machine Interface 15, no. 1 (2025): 4–14. https://doi.org/10.9756/bijmmi/v15i1/bij25004.

Abstract:
This paper argues that patient data analysis is vital to healthcare machine learning, delivering insights that help enhance diagnosis, treatment, and patient care. Healthcare systems use electronic health records, medical imaging data, and real-time physiological measurements from wearable devices. The paper recognises the complexity and diversity of these data sources and uses advanced machine learning to find patterns and information. Machine learning can also use patient-specific data to make personalised therapy recommendations, improving outcomes. TF-IDF and Blowfish were employed. Term frequency is the number of times a term appears in a document divided by the total number of terms in that document. Frequent terms in a document may be more important. The analysis suggests better diagnostics, personalised therapy, illness prevention, and resource allocation. Machine learning and patient data analysis help healthcare providers customise treatment plans, anticipate illness development, and deliver more effective and focused interventions. TF-IDF helps distinguish significant document terms from common words with little meaning. It uses local term frequency and global corpus statistics to capture term specificity and relevance in document collections. Raw patient data needs preparation to handle missing values, outliers, and inconsistent formats. Blowfish has been extensively analysed since its conception and found to have no obvious design flaws. Blowfish is flexible and adaptable to diverse security needs because it provides key lengths from 32 to 448 bits. The encryption is more secure with longer keys. Data cleansing, normalisation, and standardisation are preprocessing steps. Data quality checks find and fix data anomalies.
48

S., Sai Manasa Bala, and Kumari Santoshi. "Comprehensive Analysis of Variants of TF-IDF Applied on LDA and LSA Topic Modelling." International Journal of Engineering and Advanced Technology (IJEAT) 9, no. 6 (2020): 531–35. https://doi.org/10.35940/ijeat.D7669.089620.

Abstract:
The present generation is fully connected virtually through many social media sources. On social media, people express their opinions on any post, news item, or product through comments or emoticons designed to convey satisfaction. Market standards improve on this basis. Online markets such as Amazon, Flipkart, and Myntra improve their businesses using these reviews. Analyzing large-scale opinions or feedback from individuals helps to identify hidden insights and work towards customer satisfaction. This paper proposes applying different weighting schemes of TF-IDF (Term Frequency-Inverse Document Frequency) to the topic modeling methods LSA and LDA in order to cluster the topics of discussion from large-scale reviews related to the booming online market 'Amazon'. The main focus of the paper is to observe the changes in topic modeling when different TF-IDF weighting schemes are applied. In this work, topic-based models such as LDA (Latent Dirichlet Allocation) and LSA (Latent Semantic Analysis) are applied to various TF-IDF weighting schemes, and it is observed that changes in the weights lead to variation in the term frequency of different topics with respect to their documents. The results also show that variation in term weights changes the topic modeling output. Visualization results of topic modeling clusters with different TF-IDF weighting schemes are presented.
49

Harieby, Edo, Hoiriyah Hoiriyah, and Miftahul Walid. "TWITTER TEXT MINING MENGENAI ISU VAKSINASI COVID-19 MENGGUNAKAN METODE TERM FREQUENCY, INVERSE DOCUMENT FREQUENCY (TF-IDF)." JATI (Jurnal Mahasiswa Teknik Informatika) 6, no. 2 (2022): 532–37. http://dx.doi.org/10.36040/jati.v6i2.5129.

Abstract:
The spread of information about the COVID-19 vaccine has attracted public attention. Various issues have emerged concerning whether COVID-19 vaccination is halal. Twitter is one of the social media platforms that gives the public space to ask questions and comment about the COVID-19 vaccine through tweets or retweets. Using the TF-IDF method, this study analyzes the text (sentiment analysis) of a collection of tweets, so that the most frequently occurring words can be identified as keywords in the Twitter conversation; the analysis indicates that many people agree with mandatory COVID-19 vaccination. The results show the 5 most frequent words, namely: vaksin (831.431911 words), vaksinasi (748.304896 words), covid (709.626652 words), sehat (435.356173 words), dukung (417.387094 words), and indonesia (404.432113 words). The TF-IDF weighting results are: mui (0.6436902527847653), vaksin (0.132185733888140), covid (0.1566272932497384), sinovac (0.4762729721904365), suci (0.8634345960912986), halal (0.5720637913580648), and ncovid (0.543713657254659). These results still require n-gram weighting with L1 or L2 normalization before they can be used as training and test data in subsequent analysis.
50

Chhipa, Shubham, Vishal Berwal, Tushar Hirapure, and Soumi Banerjee. "Recipe Recommendation System Using TF-IDF." ITM Web of Conferences 44 (2022): 02006. http://dx.doi.org/10.1051/itmconf/20224402006.

Abstract:
This paper proposes a Recipe Recommendation System. Food recommendation is a new area, with few systems that focus on analysing user preferences and constraints, such as the ingredients users have on hand, being deployed in real settings as web or mobile applications [4]. The proposed model is a mobile application that allows users to search for recipes using the ingredients available to them, including vegetables. For this work we found a dataset of Indian cuisine recipes and applied content-based recommendation using Term Frequency-Inverse Document Frequency (TF-IDF) and Cosine Similarity [1]. The application recommends Indian recipes based on the ingredients available to users and allows them to filter recipes by course type, diet type, etc.
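As a rough sketch of content-based recommendation with TF-IDF and cosine similarity, the example below weights each recipe's ingredient list, weights the user's available ingredients the same way, and ranks recipes by the cosine of the angle between the two vectors. The recipe data, the helper names, and the smoothed IDF formula are assumptions for illustration, not details from the paper.

```python
import math
from collections import Counter

def tfidf_vectors(token_lists):
    """Return sparse TF-IDF vectors (dicts) and the IDF table for a collection."""
    n = len(token_lists)
    df = Counter(t for tokens in token_lists for t in set(tokens))
    # Smoothed IDF (an assumed variant) keeps every weight strictly positive.
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
    vectors = [{t: c * idf[t] for t, c in Counter(tokens).items()}
               for tokens in token_lists]
    return vectors, idf

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical recipes and their ingredient lists.
recipes = {
    "aloo gobi": ["potato", "cauliflower", "turmeric"],
    "palak paneer": ["spinach", "paneer", "garlic"],
    "jeera aloo": ["potato", "cumin", "turmeric"],
}
names = list(recipes)
vectors, idf = tfidf_vectors(list(recipes.values()))

pantry = ["potato", "cauliflower"]  # ingredients the user has available
query = {t: c * idf[t] for t, c in Counter(pantry).items() if t in idf}
similarity = {name: cosine(query, vec) for name, vec in zip(names, vectors)}
ranked = sorted(names, key=similarity.get, reverse=True)
print(ranked)
```

Filters such as course type or diet type could be applied to the candidate list before or after ranking; that step is omitted here.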