
Journal articles on the topic 'Term frequency weighting'

Consult the top 50 journal articles for your research on the topic 'Term frequency weighting.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Press it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse journal articles on a wide variety of disciplines and organise your bibliography correctly.

1

Mohammed, Mohannad T., and Omar Fitian Rashid. "Document retrieval using term frequency inverse sentence frequency weighting scheme." Indonesian Journal of Electrical Engineering and Computer Science 31, no. 3 (2023): 1478–85. http://dx.doi.org/10.11591/ijeecs.v31.i3.pp1478-1485.

Abstract:
The need for an efficient method to find the most appropriate document for a particular search query has become crucial due to the exponential growth in the number of papers readily available on the web. The vector space model (VSM), a standard model in information retrieval, represents words as vectors in space and assigns them weights via a popular weighting method known as term frequency inverse document frequency (TF-IDF). This research proposes retrieving the most relevant documents by representing documents and queries as vectors of average term frequency inverse sentence frequency (TF-ISF) weights instead of vectors of TF-IDF weights; two basic and effective similarity measures, Cosine and Jaccard, were used. Using the MS MARCO dataset, this article analyzes and assesses the retrieval effectiveness of the TF-ISF weighting scheme. The results show that the TF-ISF model with the Cosine similarity measure retrieves more relevant documents. The model was evaluated against the conventional TF-IDF technique and performs significantly better on MS MARCO data (Microsoft-curated data of Bing queries).
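As a concrete illustration of the scheme described above, the following sketch computes average TF-ISF weights and Cosine similarity, assuming ISF mirrors IDF with sentences in place of documents; the paper's exact smoothing and averaging may differ, and all function names here are illustrative:

```python
import math

def tf_isf_weights(sentences, vocab):
    """Average TF-ISF weight per term over a document's sentences.

    Hypothetical reading of the scheme: inverse *sentence* frequency
    mirrors IDF, but counts sentences instead of documents.
    """
    n_sent = len(sentences)
    weights = {}
    for term in vocab:
        sf = sum(1 for s in sentences if term in s)   # sentence frequency
        isf = math.log(n_sent / (1 + sf)) + 1.0       # smoothed, IDF-style
        avg_tf = sum(s.count(term) for s in sentences) / n_sent
        weights[term] = avg_tf * isf
    return weights

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(u[t] * v[t] for t in set(u) & set(v))
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Tiny example: "apple" appears in both sentences, "banana" in one.
doc = [["apple", "banana"], ["apple"]]
w = tf_isf_weights(doc, ["apple", "banana"])
```

A query would be weighted the same way and ranked by `cosine` against each document vector.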
3

Nugroho, Kuncahyo Setyo, Fitra A. Bachtiar, and Wayan Firdaus Mahmudy. "Detecting Emotion in Indonesian Tweets: A Term-Weighting Scheme Study." Journal of Information Systems Engineering and Business Intelligence 8, no. 1 (2022): 61–70. http://dx.doi.org/10.20473/jisebi.8.1.61-70.

Abstract:
Background: Term-weighting plays a key role in detecting emotion in texts. Studies in term-weighting schemes aim to improve short text classification by distinguishing terms accurately. Objective: This study aims to formulate the best term-weighting schemes and discover the relationship between n-gram combinations and different classification algorithms in detecting emotion in Twitter texts. Methods: The data used was the Indonesian Twitter Emotion Dataset, with features generated through different n-gram combinations. Two approaches assign weights to the features. Tests were carried out using ten-fold cross-validation on three classification algorithms. The performance of the model was measured using accuracy and F1 score. Results: The term-weighting schemes with the highest performance are Term Frequency-Inverse Category Frequency (TF-ICF) and Term Frequency-Relevance Frequency (TF-RF). The scheme with a supervised approach performed better than the unsupervised one. However, we did not find a consistent advantage as some of the experiments found that Term Frequency-Inverse Document Frequency (TF-IDF) also performed exceptionally well. The traditional TF-IDF method remains worth considering as a term-weighting scheme. Conclusion: This study provides recommendations for emotion detection in texts. Future studies can benefit from dealing with imbalances in the dataset to provide better performance. Keywords: Emotion Detection, Feature Engineering, Term-Weighting, Text Mining
4

Ni'mah, Ana Tsalitsatun, and Agus Zainal Arifin. "Perbandingan Metode Term Weighting terhadap Hasil Klasifikasi Teks pada Dataset Terjemahan Kitab Hadis." Rekayasa 13, no. 2 (2020): 172–80. http://dx.doi.org/10.21107/rekayasa.v13i2.6412.

Abstract:
Hadith is the second source of reference in Islam after the Qur'an. Hadith texts are now studied computationally so that the values they contain can be captured, and retrieving information from the Books of Hadith requires representing text as vectors to optimize automatic classification. Classification is needed to group the contents of the Hadith into categories, and some categories in one Book of Hadith also appear in others, showing that certain documents share topics across Books. Therefore, a term weighting method is needed that can decide which words should receive high or low weights in the Hadith Book space to optimize classification results. This study compares several term weighting methods: Term Frequency Inverse Document Frequency (TF-IDF), Term Frequency Inverse Document Frequency Inverse Class Frequency (TF-IDF-ICF), Term Frequency Inverse Document Frequency Inverse Class Space Density Frequency (TF-IDF-ICSδF), and Term Frequency Inverse Document Frequency Inverse Class Space Density Frequency Inverse Hadith Space Density Frequency (TF-IDF-ICSδF-IHSδF). The term weighting results are compared on the Translation of the 9 Books of Hadith dataset using Naive Bayes and SVM classifiers. The nine Books of Hadith used are: Sahih Bukhari, Sahih Muslim, Abu Dawud, at-Turmudzi, an-Nasa'i, Ibn Majah, Ahmad, Malik, and Darimi. The experimental results show that classification using the TF-IDF-ICSδF-IHSδF term weighting method outperformed the other term weightings, achieving a Precision of 90%, Recall of 93%, F1-Score of 92%, and Accuracy of 83%.
5

Tama, Fauzaan Rakan, and Yuliant Sibaroni. "Fake News (Hoaxes) Detection on Twitter Social Media Content through Convolutional Neural Network (CNN) Method." JINAV: Journal of Information and Visualization 4, no. 1 (2023): 70–78. http://dx.doi.org/10.35877/454ri.jinav1525.

Abstract:
The use of social media is highly influential in society. Users can easily post various activities in the form of text, photos, and videos, and information on social media includes fake news and hoaxes that affect society. Twitter is one of the most widely used social media platforms. This study aims to detect fake news in tweets using the Convolutional Neural Network (CNN) method by comparing two weighting features: Term Frequency Inverse Document Frequency (TF-IDF) and Term Frequency-Relevance Frequency (TF-RF). The highest accuracy, 84.11%, was obtained with the TF-RF weighting feature, while the TF-IDF weighting feature achieved an accuracy of 80.29%.
6

Chen, Long, Liangxiao Jiang, and Chaoqun Li. "Using modified term frequency to improve term weighting for text classification." Engineering Applications of Artificial Intelligence 101 (May 2021): 104215. http://dx.doi.org/10.1016/j.engappai.2021.104215.

7

Alshehri, Arwa, and Abdulmohsen Algarni. "TF-TDA: A Novel Supervised Term Weighting Scheme for Sentiment Analysis." Electronics 12, no. 7 (2023): 1632. http://dx.doi.org/10.3390/electronics12071632.

Abstract:
In text classification tasks, such as sentiment analysis (SA), feature representation and weighting schemes play a crucial role in classification performance. Traditional term weighting schemes depend on the term frequency within the entire document collection; therefore, they are called unsupervised term weighting (UTW) schemes. One of the most popular UTW schemes is term frequency–inverse document frequency (TF-IDF); however, this is not sufficient for SA tasks. Newer weighting schemes have been developed to take advantage of the membership of documents in their categories. These are called supervised term weighting (STW) schemes; however, most of them weigh the extracted features without considering the characteristics of some noisy features and data imbalances. Therefore, in this study, a novel STW approach was proposed, known as term frequency–term discrimination ability (TF-TDA). TF-TDA mainly presents the extracted features with different degrees of discrimination by categorizing them into several groups. Subsequently, each group is weighted based on its contribution. The proposed method was examined over four SA datasets using naive Bayes (NB) and support vector machine (SVM) models. The experimental results proved the superiority of TF-TDA over two baseline term weighting approaches, with improvements ranging from 0.52% to 3.99% in the F1 score. The statistical test results verified the significant improvement obtained by TF-TDA in most cases, where the p-value ranged from 0.0000597 to 0.0455.
8

Shehzad, Farhan, Abdur Rehman, Kashif Javed, Khalid A. Alnowibet, Haroon A. Babri, and Hafiz Tayyab Rauf. "Binned Term Count: An Alternative to Term Frequency for Text Categorization." Mathematics 10, no. 21 (2022): 4124. http://dx.doi.org/10.3390/math10214124.

Abstract:
In text categorization, a well-known problem related to document length is that larger term counts in longer documents cause classification algorithms to become biased. The effect of document length can be eliminated by normalizing term counts, thus reducing the bias towards longer documents. This gives us term frequency (TF), which in conjunction with inverse document frequency (IDF) became the most commonly used term weighting scheme to capture the importance of a term in a document and corpus. However, normalization may cause term frequency of a term in a related document to become equal or smaller than its term frequency in an unrelated document, thus perturbing a term’s strength from its true worth. In this paper, we solve this problem by introducing a non-linear mapping of term frequency. This alternative to TF is called binned term count (BTC). The newly proposed term frequency factor trims large term counts before normalization, thus moderating the normalization effect on large documents. To investigate the effectiveness of BTC, we compare it against the original TF and its more recently proposed alternative named modified term frequency (MTF). In our experiments, each of these term frequency factors (BTC, TF, and MTF) is combined with four well-known collection frequency factors (IDF), RF, IGM, and MONO and the performance of each of the resulting term weighting schemes is evaluated on three standard datasets (Reuters (R8-21578), 20-Newsgroups, and WebKB) using support vector machines and K-nearest neighbor classifiers. To determine whether BTC is statistically better than TF and MTF, we have applied the paired two-sided t-test on the macro F1 results. Overall, BTC is found to be 52% statistically significant than TF and MTF. Furthermore, the highest macro F1 value on the three datasets was achieved by BTC-based term weighting schemes.
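The trimming idea behind BTC can be illustrated with a hedged sketch: raw counts are mapped into a small number of bins before length normalization, so very large counts in long documents stop dominating. The bin edges below are illustrative, not the paper's:

```python
import bisect

# Illustrative bin edges -- the paper's actual edges are not reproduced here.
BIN_EDGES = [1, 2, 4, 8, 16]

def binned_term_count(raw_count):
    """Map a raw term count to a small bin index, trimming large counts."""
    if raw_count <= 0:
        return 0
    # Counts above the last edge all fall into the same top bin.
    return bisect.bisect_left(BIN_EDGES, raw_count) + 1

def btc_weight(raw_count, doc_len):
    """Length-normalize the trimmed count, as plain TF normalizes the raw
    count; an IDF-style collection factor could then multiply this."""
    return binned_term_count(raw_count) / doc_len
```

Because nearby counts share a bin (e.g. 3 and 4 above), the normalization effect on long documents is moderated, which is the behavior the abstract describes.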
9

Santhanakumar, M., C. Christopher Columbus, and K. Jayapriya. "Multi term based co-term frequency method for term weighting in information retrieval." International Journal of Business Information Systems 28, no. 1 (2018): 79. http://dx.doi.org/10.1504/ijbis.2018.091164.

11

Zhang, Hui, Deqing Wang, Wenjun Wu, and Hongping Hu. "Term frequency – function of document frequency: a new term weighting scheme for enterprise information retrieval." Enterprise Information Systems 6, no. 4 (2012): 433–44. http://dx.doi.org/10.1080/17517575.2012.665945.

12

Sabbah, Thabit, Ali Selamat, Md Hafiz Selamat, et al. "Modified frequency-based term weighting schemes for text classification." Applied Soft Computing 58 (September 2017): 193–206. http://dx.doi.org/10.1016/j.asoc.2017.04.069.

13

Sugino, Takaaki, Toshihiro Kawase, Shinya Onogi, Taichi Kin, Nobuhito Saito, and Yoshikazu Nakajima. "Loss Weightings for Improving Imbalanced Brain Structure Segmentation Using Fully Convolutional Networks." Healthcare 9, no. 8 (2021): 938. http://dx.doi.org/10.3390/healthcare9080938.

Abstract:
Brain structure segmentation on magnetic resonance (MR) images is important for various clinical applications. It has been automatically performed by using fully convolutional networks. However, it suffers from the class imbalance problem. To address this problem, we investigated how loss weighting strategies work for brain structure segmentation tasks with different class imbalance situations on MR images. In this study, we adopted segmentation tasks of the cerebrum, cerebellum, brainstem, and blood vessels from MR cisternography and angiography images as the target segmentation tasks. We used a U-net architecture with cross-entropy and Dice loss functions as a baseline and evaluated the effect of the following loss weighting strategies: inverse frequency weighting, median inverse frequency weighting, focal weighting, distance map-based weighting, and distance penalty term-based weighting. In the experiments, the Dice loss function with focal weighting showed the best performance and had a high average Dice score of 92.8% in the binary-class segmentation tasks, while the cross-entropy loss functions with distance map-based weighting achieved the Dice score of up to 93.1% in the multi-class segmentation tasks. The results suggested that the distance map-based and the focal weightings could boost the performance of cross-entropy and Dice loss functions in class imbalanced segmentation tasks, respectively.
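Of the strategies compared above, inverse frequency weighting is the simplest to state: each class receives a loss weight inversely proportional to its frequency in the label map. A minimal sketch, assuming a mean-normalized convention (the paper may normalize differently):

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Per-class loss weights inversely proportional to class frequency,
    normalized so the mean weight is 1 (one common convention; the paper
    may normalize differently)."""
    counts = Counter(labels)
    total = sum(counts.values())
    raw = {c: total / n for c, n in counts.items()}   # 1 / class frequency
    mean = sum(raw.values()) / len(raw)
    return {c: w / mean for c, w in raw.items()}

# Toy 1-D "label map": background (0) is three times as frequent as vessel (1).
weights = inverse_frequency_weights([0, 0, 0, 1])
```

These weights would multiply the per-class terms of a cross-entropy or Dice loss so that rare structures such as blood vessels contribute more to the gradient.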
14

Peng, Tao, Lu Liu, and Wanli Zuo. "PU text classification enhanced by term frequency-inverse document frequency-improved weighting." Concurrency and Computation: Practice and Experience 26, no. 3 (2013): 728–41. http://dx.doi.org/10.1002/cpe.3040.

15

Dogan, Turgut, and Alper Kursat Uysal. "On Term Frequency Factor in Supervised Term Weighting Schemes for Text Classification." Arabian Journal for Science and Engineering 44, no. 11 (2019): 9545–60. http://dx.doi.org/10.1007/s13369-019-03920-9.

16

Osman, Mohamad Amin, Shahrul Azman Mohd Noah, and Saidah Saad. "Identifying Terms to Represent Concepts of a Work Process Ontology." Indian Journal of Science and Technology 17, no. 24 (2024): 2519–28. http://dx.doi.org/10.17485/ijst/v17i24.1597.

Abstract:
Objectives: To identify ontology concepts from text documents for the construction of work process ontology. Methods: This study proposes a methodology to identify terms representing a work process ontology concept from a document. The methodology encompasses several key steps: document collecting, text pre-processing, term weighting and analysis, terms mapping, and domain expert relevance judgments. A comparison between the results of three different term weighting schemes, namely the Term Frequency (TF), Term Frequency-Inverse Document Frequency (TFIDF), and Mutual Information (MI) is made with the ontology concept that the domain expert has judged. Findings: The approaches adopted in this study have managed to extract ontological concepts from the targeted domain knowledge source. The findings of the comparison analysis suggest that the TFIDF term weighting scheme exhibits better results compared to the TF and MI weighting schemes. Novelty: A work process ontology is a structured knowledge describing daily operations in the government sector. However, there has been little to no effort in building the work process ontology. This study presents an integrated approach for identifying ontology concepts from documents within the domain of the work process. To the utmost extent of our understanding, this research initiative is the initial attempt to introduce a structured methodology for the semi-automatic extraction and evaluation of concepts and relationships within this domain. The findings can be utilised as a foundation for developing an ontology in the specific field. Keywords: Ontology, Work process, Text extraction, Natural language processing, Term weighting
17

Pan, Shou Hui, Li Wang, Ying Cheng Xu, and Guo Ping Xia. "Improved Web Text Classification Method for Classifying Quality Safety Accidents." Advanced Materials Research 121-122 (June 2010): 996–1001. http://dx.doi.org/10.4028/www.scientific.net/amr.121-122.996.

Abstract:
Web text classification, one of the fundamental techniques of web mining, plays an important role in web mining systems. An improved term weighting method is proposed in this paper: besides term frequency, the location of a term is also considered when calculating its weight. Web pages are divided into four text blocks, each with its own location weight. Experimental results show that the precision of the improved term weighting method is higher than that of the traditional method.
19

Reiki, Muhammad Kiko Aulia, Yuliant Sibaroni, and Erwin Budi Setiawan. "Comparison of Term Weighting Methods in Sentiment Analysis of the New State Capital of Indonesia with the SVM Method." International Journal on Information and Communication Technology (IJoICT) 8, no. 2 (2023): 53–65. http://dx.doi.org/10.21108/ijoict.v8i2.681.

Abstract:
The relocation of the State Capital to "Nusantara", inaugurated with the enactment of UU No. 3 of 2022, is a significant project that has sparked polemics among Indonesian citizens. Many people expressed their opinions about the relocation on Twitter, and this tendency of public opinion needs to be identified with sentiment analysis. In sentiment analysis, term weighting is an essential component for obtaining optimal accuracy, and various modifications of existing term weightings aim to increase model performance. One of them is the icf-based scheme tf-bin.icf, which combines inverse category frequency (ICF) and relevance frequency (RF). This study compares the tf-idf, tf-rf, and tf-bin.icf term weightings with the SVM classification method on the new State Capital of Indonesia topic. The tf-idf weighting results are still the best compared to tf-bin.icf and tf-rf, with an accuracy score of 88.0%, a 1.3% difference from tf-bin.icf weighting.
20

Vichianchai, Vuttichai, and Sumonta Kasemvilas. "A New Term Frequency with Gaussian Technique for Text Classification and Sentiment Analysis." Journal of ICT Research and Applications 15, no. 2 (2021): 152–68. http://dx.doi.org/10.5614/itbj.ict.res.appl.2021.15.2.4.

Abstract:
This paper proposes a new term frequency with a Gaussian technique (TF-G) to classify the risk of suicide from Thai clinical notes and to perform sentiment analysis on Thai customer reviews and English tweets of travelers using US airline services. This research compared TF-G with term weighting techniques from previous studies on Thai text classification, including the bag-of-words (BoW), term frequency (TF), term frequency-inverse document frequency (TF-IDF), and term frequency-inverse corpus document frequency (TF-ICF) techniques. Suicide risk classification and sentiment analysis were performed with the decision tree (DT), naïve Bayes (NB), support vector machine (SVM), random forest (RF), and multilayer perceptron (MLP) techniques. The experimental results showed that TF-G is appropriate for feature extraction in classifying suicide risk and in analyzing the sentiments of customer reviews and traveler tweets. The TF-G technique was more accurate than BoW, TF, TF-IDF, and TF-ICF for term weighting in Thai suicide risk classification, in sentiment analysis of Thai customer reviews for Burger King, Pizza Hut, and Sizzler restaurants, and in sentiment analysis of English tweets of travelers using US airline services.
21

Alfarizi, Muhammad Ibnu, Lailis Syafaah, and Merinda Lestandy. "Emotional Text Classification Using TF-IDF (Term Frequency-Inverse Document Frequency) And LSTM (Long Short-Term Memory)." JUITA : Jurnal Informatika 10, no. 2 (2022): 225. http://dx.doi.org/10.30595/juita.v10i2.13262.

Abstract:
Humans can express their feelings verbally or non-verbally, and verbal communication can be oral or written. A person's emotions can usually be read from their behavior, tone of voice, and expression, but not everyone can perceive emotion through writing alone, whether in words, sentences, or paragraphs. A classification system is therefore needed to help determine the emotions contained in a piece of writing. The novelty of this study is its development of previous research using a similar method, LSTM, improved by adding TF-IDF word weighting ahead of the LSTM classification. The approach belongs to Natural Language Processing (NLP). The purpose of this study was to compare the LSTM (Long Short-Term Memory) classification model with TF-IDF (Term Frequency-Inverse Document Frequency) word weighting against the LinearSVC model, as well as to increase accuracy in determining the emotions (sadness, anger, fear, love, joy, and surprise) contained in text. The dataset comprises 18,000 samples, divided into 16,000 training and 2,000 test samples across the six emotion classes. Classification with the LSTM method yielded 97.50% accuracy, while the LinearSVC method yielded an accuracy of 89%.
22

Yulita, Winda, Meida Cahyo Untoro, Mugi Praseptiawan, Ilham Firman Ashari, Aidil Afriansyah, and Ahmad Naim Bin Che Pee. "Automatic Scoring Using Term Frequency Inverse Document Frequency Document Frequency and Cosine Similarity." Scientific Journal of Informatics 10, no. 2 (2023): 93–104. http://dx.doi.org/10.15294/sji.v10i2.42209.

Abstract:
Purpose: In the learning process, most tests of learning achievement are carried out using short-answer or essay questions. The variety of answers given by students means a teacher has to read each one closely, and the quality of manual scoring is difficult to guarantee. In addition, each class is taught by a different teacher, which can lead to unequal grades due to differences in teacher experience. The purpose of this study is therefore to develop automated scoring of answers: automatic short answer scoring is designed to grade and evaluate students' answers based on a set of trained answer documents. Methods: The method used is TF-IDF-DF with similarity and scoring computation. The word weighting used is the Term Frequency-Inverse Document Frequency-Document Frequency (TF-IDF-DF) method. The data consist of 5 questions, each answered by 30 students, with the students' answers assessed by teachers/experts to determine the real score. The study was evaluated using Mean Absolute Error (MAE). Result: The evaluation obtained a Mean Absolute Error (MAE) of 0.123. Value: The word weighting method used, TF-IDF-DF, is an improvement over the Term Frequency Inverse Document Frequency (TF-IDF) method and is applied before calculating the similarity of sentences between teachers and students.
23

Fitriansyah, Reza, Ellya Sestri, and Vany Terisia. "Mengiplementasikan Vector Space Model Similarity Euclidean Distance Menggunakan TFIDF Pada Klasifikasi Teks Bahasa Indonesia." Jurnal Teknologi Informasi (JUTECH) 3, no. 2 (2022): 158–63. http://dx.doi.org/10.32546/jutech.v3i2.2034.

Abstract:
Term weighting is based on terms processed with stemming techniques to obtain the base form of each term. This work applies an Indonesian-language text classification machine using the K-Nearest Neighbor algorithm and the Vector Space Model, with TF-IDF word frequency weighting and the Euclidean Distance function used to compare test documents against the training collection. Using news documents as learning documents, ten (10) documents in three (3) categories, the method produces a Precision and Recall of 90.00% for k = 5 using word frequency weighting with the Euclidean Distance function.
APA, Harvard, Vancouver, ISO, and other styles
24

Schohl, G. A. "Improved Approximate Method for Simulating Frequency-Dependent Friction in Transient Laminar Flow." Journal of Fluids Engineering 115, no. 3 (1993): 420–24. http://dx.doi.org/10.1115/1.2910155.

Full text
Abstract:
A new approximation to the weighting function in Zielke’s (1967) equation is used in an improved implementation of Trikha’s (1975) method for including frequency-dependent friction in transient laminar flow calculations. The new, five-term approximation was fitted to the weighting function using a nonlinear least squares approach. Transient results obtained using the new approximating function are nearly indistinguishable from results obtained using the exact expression for the weighting function.
APA, Harvard, Vancouver, ISO, and other styles
25

HASSAN, SAMER, RADA MIHALCEA, and CARMEN BANEA. "RANDOM WALK TERM WEIGHTING FOR IMPROVED TEXT CLASSIFICATION." International Journal of Semantic Computing 01, no. 04 (2007): 421–39. http://dx.doi.org/10.1142/s1793351x07000263.

Full text
Abstract:
This paper describes a new approach for estimating term weights in a document, and shows how the new weighting scheme can be used to improve the accuracy of a text classifier. The method uses term co-occurrence as a measure of dependency between word features. A random walk model is applied on a graph encoding words and co-occurrence dependencies, resulting in scores that represent a quantification of how a particular word feature contributes to a given context. Experiments performed on three standard classification datasets show that the new random walk based approach outperforms the traditional term frequency approach of feature weighting.
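As a generic illustration of the random-walk idea described above (not the authors' exact implementation), a TextRank-style power iteration over a word co-occurrence graph can be sketched as follows; the window size, damping factor, and iteration count are assumptions:

```python
def random_walk_weights(tokens, window=2, damping=0.85, iters=50):
    """TextRank-style term weights from a co-occurrence graph.

    Words co-occurring within `window` positions are linked; repeated
    power iteration spreads score along the links, so a term embedded
    in many contexts can outrank one that is merely frequent.
    """
    # Build an undirected co-occurrence graph over the token stream.
    neighbors = {t: set() for t in tokens}
    for i, t in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[j] != t:
                neighbors[t].add(tokens[j])
                neighbors[tokens[j]].add(t)
    # Power iteration with the usual PageRank damping term.
    score = {t: 1.0 for t in neighbors}
    for _ in range(iters):
        score = {
            t: (1 - damping) + damping * sum(
                score[u] / len(neighbors[u]) for u in neighbors[t])
            for t in neighbors
        }
    return score
```

The resulting scores can then replace raw term frequency when building document vectors for a classifier, which is the substitution the paper evaluates.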
APA, Harvard, Vancouver, ISO, and other styles
26

Sucipto, Didik Dwi Prasetya, and Triyanna Widiyaningtyas. "A Supervised Hybrid Weighting Scheme for Bloom's Taxonomy Questions using Category Space Density-based Weighting." Engineering, Technology & Applied Science Research 15, no. 2 (2025): 22102–8. https://doi.org/10.48084/etasr.10226.

Full text
Abstract:
Question documents organized based on Bloom's taxonomy have different characteristics than typical text documents. Bloom's taxonomy is a framework that classifies learning objectives into six cognitive domains, each having distinct characteristics. In the cognitive domain, different keywords and levels are used to classify questions. Using existing category-based term weighting methods is less relevant because it is only based on word types and not on the main characteristics of Bloom's taxonomy. This study aimed to develop a more relevant term weighting method for Bloom's taxonomy by considering the term density in each category and the specific keywords in each domain. The proposed method, called Hybrid Inverse Bloom Space Density Frequency, is designed to capture the unique characteristics of Bloom's taxonomy. Experimental results show that the proposed method can be applied to all question datasets, considering term density in each category and keywords in each cognitive domain. Furthermore, the accuracy of the proposed method was superior on all datasets using machine learning model evaluation.
APA, Harvard, Vancouver, ISO, and other styles
27

Priyanka, Mesariya, and Madia Nidhi. "Document Ranking using Customizes Vector Method." International Journal of Trend in Scientific Research and Development 1, no. 4 (2017): 278–83. https://doi.org/10.31142/ijtsrd125.

Full text
Abstract:
An information retrieval (IR) system ranks documents against a user's query and retrieves the relevant documents from a large dataset. Document ranking essentially searches for relevant documents and orders them by rank. The vector space model is a traditional and widely applied information retrieval model that ranks web pages based on similarity values. Term weighting schemes are a significant component of an information retrieval system and are applied to the query used in document ranking. TF-IDF ranking calculates term weights from the user's query based on the terms included in the documents: when the user enters a query, the system finds the documents containing the query terms, counts the terms, computes TF-IDF, and returns the documents ranked by highest weight. Priyanka Mesariya | Nidhi Madia, "Document Ranking using Customizes Vector Method", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-1, Issue-4, June 2017, URL: https://www.ijtsrd.com/papers/ijtsrd125.pdf
APA, Harvard, Vancouver, ISO, and other styles
28

Yazdani, Sepideh Foroozan, Masrah Azrifah Azmi Murad, Nurfadhlina Mohd Sharef, Yashwant Prasad Singh, and Ahmed Razman Abdul Latiff. "Sentiment Classification of Financial News Using Statistical Features." International Journal of Pattern Recognition and Artificial Intelligence 31, no. 03 (2017): 1750006. http://dx.doi.org/10.1142/s0218001417500069.

Full text
Abstract:
Sentiment classification of financial news deals with the identification of positive and negative news so that they can be applied in decision support systems for stock trend predictions. This paper explores several types of feature spaces as different data spaces for sentiment classification of news articles. Experiments are conducted using n-gram models (unigram, bigram, and the combination of unigram and bigram) for feature extraction with traditional feature weighting methods (binary, term frequency (TF), and term frequency-inverse document frequency (TF-IDF)), while document frequency (DF) was used to generate feature spaces of different dimensions to evaluate the n-gram models and traditional feature weighting methods. We performed experiments to measure the classification accuracy of a support vector machine (SVM) with two kernels, Linear and Gaussian radial basis function (RBF). We concluded that feature selection and feature weighting methods can play a substantial role in sentiment classification. Furthermore, the results showed that the proposed work, which combined unigram and bigram along with the TF-IDF feature weighting method and an optimized RBF-kernel SVM, produced high classification accuracy in financial news classification.
APA, Harvard, Vancouver, ISO, and other styles
29

Assaf, Kamel. "Testing Different Log Bases for Vector Model Weighting Technique." International Journal on Natural Language Computing 12, no. 03 (2023): 1–15. http://dx.doi.org/10.5121/ijnlc.2023.12301.

Full text
Abstract:
Information retrieval systems retrieve relevant documents based on a query submitted by the user. The documents are initially indexed, and the words in the documents are assigned weights using a weighting technique called TFIDF, which is the product of Term Frequency (TF) and Inverse Document Frequency (IDF). TF represents the number of occurrences of a term in a document. IDF measures whether the term is common or rare across all documents. It is computed by dividing the total number of documents in the system by the number of documents containing the term and then computing the logarithm of the quotient. By default, base 10 is used to calculate the logarithm. In this paper, we test this weighting technique using a range of log bases from 0.1 to 100.0 to calculate the IDF. The purpose of testing different log bases for the vector model weighting technique is to highlight the importance of understanding the performance of the system at different weighting values. We use the documents of the MED, CRAN, NPL, LISA, and CISI test collections that scientists assembled explicitly for experiments in information retrieval systems.
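The IDF computation spelled out in this abstract, with the logarithm base left as a parameter, can be sketched as follows (the function name and the closing remark are illustrative, not taken from the paper):

```python
import math

def tf_idf(tf, docs_with_term, total_docs, log_base=10.0):
    """TF-IDF weight with a configurable logarithm base for the IDF part.

    tf: occurrences of the term in the document
    docs_with_term: number of documents containing the term
    total_docs: total number of documents in the collection
    """
    idf = math.log(total_docs / docs_with_term, log_base)
    return tf * idf

# Note: changing the base rescales every IDF by the constant factor
# 1 / ln(base), since log_b(x) = ln(x) / ln(base); a base below 1 flips
# the sign of all weights, which is one reason a 0.1..100.0 sweep is
# interesting to test.
```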
APA, Harvard, Vancouver, ISO, and other styles
30

Kamel, Assaf. "Testing Different Log Bases for Vector Model Weighting Technique." International Journal on Natural Language Computing (IJNLC) 12, no. 3 (2023): 15. https://doi.org/10.5281/zenodo.8138320.

Full text
Abstract:
Information retrieval systems retrieve relevant documents based on a query submitted by the user. The documents are initially indexed, and the words in the documents are assigned weights using a weighting technique called TFIDF, which is the product of Term Frequency (TF) and Inverse Document Frequency (IDF). TF represents the number of occurrences of a term in a document. IDF measures whether the term is common or rare across all documents. It is computed by dividing the total number of documents in the system by the number of documents containing the term and then computing the logarithm of the quotient. By default, base 10 is used to calculate the logarithm. In this paper, we test this weighting technique using a range of log bases from 0.1 to 100.0 to calculate the IDF. The purpose of testing different log bases for the vector model weighting technique is to highlight the importance of understanding the performance of the system at different weighting values. We use the documents of the MED, CRAN, NPL, LISA, and CISI test collections that scientists assembled explicitly for experiments in information retrieval systems.
APA, Harvard, Vancouver, ISO, and other styles
31

S., Sai Manasa Bala, and Kumari Santoshi. "Comprehensive Analysis of Variants of TF-IDF Applied on LDA and LSA Topic Modelling." International Journal of Engineering and Advanced Technology (IJEAT) 9, no. 6 (2020): 531–35. https://doi.org/10.35940/ijeat.D7669.089620.

Full text
Abstract:
The present generation is fully connected virtually through many sources of social media. On social media, people express opinions on any post, news item, or product through comments or emoticons designed to express satisfaction, and market standards improve on this basis. Online markets such as Amazon, Flipkart, and Myntra improve their businesses using these reviews. Analyzing large-scale opinions or feedback from individuals helps to identify hidden insights and work towards customer satisfaction. This paper proposes applying different weighting schemes of TF-IDF (Term Frequency-Inverse Document Frequency) to the topic modeling methods LSA and LDA to cluster the topics of discussion from large-scale reviews related to the booming online market 'Amazon'. The main focus of the paper is to observe the changes in topic modeling produced by different TF-IDF weighting schemes. In this work, the topic-based models LDA (Latent Dirichlet Allocation) and LSA (Latent Semantic Analysis) are applied with various TF-IDF weighting schemes, and it is observed that changes of weights lead to variation in the term frequency of different topics with respect to their documents. Results also show that varying term weights changes the topic models. Visualization results of topic-modeling clusters with different TF-IDF weighting schemes are presented.
APA, Harvard, Vancouver, ISO, and other styles
32

ZAKOS, JOHN, and BRIJESH VERMA. "CONCEPT-BASED TERM WEIGHTING FOR WEB INFORMATION RETRIEVAL." International Journal of Computational Intelligence and Applications 06, no. 02 (2006): 193–207. http://dx.doi.org/10.1142/s1469026806001915.

Full text
Abstract:
In this paper we present a novel technique for determining term importance by exploiting concept-based information found in ontologies. Calculating term importance is a significant and fundamental aspect of most information retrieval approaches, and it is traditionally determined through inverse document frequency (IDF). We propose concept-based term weighting (CBW), a technique that is fundamentally different to IDF in that it calculates term importance by intuitively interpreting the conceptual information in ontologies. We show that when CBW is used in an approach for web information retrieval on benchmark data, it performs comparatively to IDF, with only a 3.5% degradation in retrieval accuracy. While this small degradation has been observed, the significance of this technique is that (1) unlike IDF, CBW is independent of document collection statistics, (2) it presents a new way of interpreting ontologies for retrieval, and (3) it introduces an additional source of term importance information that can be used for term weighting.
APA, Harvard, Vancouver, ISO, and other styles
33

Sintia, Sintia, Sarjon Defit, and Gunadi Widi Nurcahyo. "Product Codefication Accuracy With Cosine Similarity And Weighted Term Frequency And Inverse Document Frequency (TF-IDF)." Journal of Applied Engineering and Technological Science (JAETS) 2, no. 2 (2021): 62–69. http://dx.doi.org/10.37385/jaets.v2i2.210.

Full text
Abstract:
In the SiPaGa application, the codification search process is still inaccurate, so OPD often make mistakes in choosing goods codes. Cosine Similarity and TF-IDF methods are therefore needed to improve the accuracy of the search. Cosine Similarity is a method for calculating similarity using keywords from the goods code. Term Frequency-Inverse Document Frequency (TF-IDF) is a way to give weight to a single-word relationship (term). The purpose of this research is to improve the accuracy of the search for goods codification. The goods codification processed in this study comprised 14,417 records sourced from the Goods and Price Planning Information System (SiPaGa) application database. The search keywords were processed using the Cosine Similarity method to measure similarity and TF-IDF to calculate the weighting. This research produces the cosine similarity calculation and TF-IDF weighting and is expected to be applied to the SiPaGa application so that its search process is more accurate than before, allowing OPD to choose the product code as desired.
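A minimal pure-Python sketch of the pipeline this abstract describes (TF-IDF weighting followed by cosine similarity over sparse term vectors); base-10 logarithms and whitespace tokenization are assumptions:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Weight each document's terms by TF * IDF (base-10 log)."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))          # document frequency per term
    return [{t: c * math.log10(n / df[t]) for t, c in Counter(toks).items()}
            for toks in tokenized]

def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```

One caveat of the plain formulation: a term that appears in every document gets IDF 0 and drops out of the similarity entirely.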
APA, Harvard, Vancouver, ISO, and other styles
34

Fauzi, M. Ali, Agus Zainal Arifin, and Anny Yuniarti. "Arabic Book Retrieval using Class and Book Index Based Term Weighting." International Journal of Electrical and Computer Engineering (IJECE) 7, no. 6 (2017): 3705–11. https://doi.org/10.11591/ijece.v7i6.pp3705-3711.

Full text
Abstract:
One of the most common issues in information retrieval is document ranking. A document ranking system collects search terms from the user and retrieves documents ordered by relevance. Vector space models based on TF.IDF term weighting are the most common method for this task. In this study, we are concerned with the automatic retrieval of an Islamic Fiqh (Law) book collection. This collection contains many books, each of which has tens to hundreds of pages. Each page of a book is treated as a document that will be ranked based on the user query. We developed a class-based indexing method called inverse class frequency (ICF) and a book-based indexing method, inverse book frequency (IBF), for this Arabic information retrieval task. These methods were then incorporated into the previous method to form TF.IDF.ICF.IBF. The term weighting method is also used for feature selection due to the high dimensionality of the feature space. This novel method was tested using a dataset of 13 Arabic Fiqh e-books. The experimental results showed that the proposed method achieves higher precision, recall, and F-measure than the other three methods across variations of feature selection. The best performance of this method was obtained when using the best 1000 features, with a precision of 76%, recall of 74%, and F-measure of 75%.
APA, Harvard, Vancouver, ISO, and other styles
35

Feng, Guozhong, Han Wang, Tieli Sun, and Libiao Zhang. "A Term Frequency Based Weighting Scheme Using Naïve Bayes for Text Classification." Journal of Computational and Theoretical Nanoscience 13, no. 1 (2016): 319–26. http://dx.doi.org/10.1166/jctn.2016.4807.

Full text
APA, Harvard, Vancouver, ISO, and other styles
36

Podder, Amit Kumer, Md Habibullah, Md Tariquzzaman, Eklas Hossain, and Sanjeevikumar Padmanaban. "Power Loss Analysis of Solar Photovoltaic Integrated Model Predictive Control Based On-Grid Inverter." Energies 13, no. 18 (2020): 4669. http://dx.doi.org/10.3390/en13184669.

Full text
Abstract:
This paper presents a finite control-set model predictive control (FCS-MPC) based technique to reduce the switching loss and frequency of the on-grid PV inverter by incorporating a switching frequency term in the cost function of the model predictive control (MPC). In the proposed MPC, the control objectives (current and switching frequency) select an optimal switching state for the inverter by minimizing a predefined cost function. The two control objectives are combined with a weighting factor. A trade-off between the switching frequency (average) and total harmonic distortion (THD) of the current was utilized to determine the value of the weighting factor. The switching, conduction, and harmonic losses were determined at the selected value of the weighting factor for both the proposed and conventional FCS-MPC and compared. The system was simulated in MATLAB/Simulink, and a small-scale hardware prototype was built to realize the system and verify the proposal. Considering only 0.25% more current THD, the switching frequency and loss per phase were reduced by 20.62% and 19.78%, respectively. The instantaneous overall power loss was also reduced by 2% due to the addition of a switching frequency term in the cost function which ensures a satisfactory empirical result for an on-grid PV inverter.
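The weighted cost function described above can be illustrated with a toy scalar sketch; the single-phase formulation, the names, and the candidate values below are assumptions for illustration, not the paper's converter model:

```python
def mpc_cost(i_ref, i_pred, switch_transitions, weight):
    """FCS-MPC cost: current tracking error plus weighted switching effort.

    `weight` is the weighting factor trading current THD against
    switching frequency: larger values penalize switch transitions more.
    """
    return abs(i_ref - i_pred) + weight * switch_transitions

def best_state(i_ref, candidates, weight):
    """Index of the candidate (i_pred, switch_transitions) with minimal cost."""
    return min(range(len(candidates)),
               key=lambda k: mpc_cost(i_ref, *candidates[k], weight))

# Two candidate switching states: one tracks perfectly but needs three
# transitions, the other tracks slightly worse with no transitions.
candidates = [(1.0, 3), (0.9, 0)]
```

With the switching term active, the controller accepts a small tracking (THD) penalty in exchange for fewer transitions, which is exactly the trade-off the paper tunes via the weighting factor.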
APA, Harvard, Vancouver, ISO, and other styles
37

Ghag, Kranti Vithal, and Ketan Shah. "ARTFSC Average Relative Term Frequency Sentiment Classification." INTERNATIONAL JOURNAL OF COMPUTERS & TECHNOLOGY 12, no. 6 (2014): 3591–601. http://dx.doi.org/10.24297/ijct.v12i6.3141.

Full text
Abstract:
Sentiment Classification refers to the computational techniques for classifying whether the sentiments of a text are positive or negative. Statistical techniques based on Term Presence and Term Frequency, using a Support Vector Machine, are popularly used for Sentiment Classification. This paper presents an approach for classifying a term as positive or negative based on its average frequency in positively tagged documents compared with negatively tagged documents. Our approach is based on term weighting techniques that are used for information retrieval and sentiment classification. It differs significantly from these traditional methods due to our model of logarithmic differential average term distribution for sentiment classification. Terms with nearly equal distribution in positively tagged documents and negatively tagged documents were classified as Senti-stop-words and discarded. The proportional distribution for a term to be classified as a Senti-stop-word was determined experimentally. Our model was evaluated by comparing it with state-of-the-art techniques for sentiment classification using the movie review dataset.
APA, Harvard, Vancouver, ISO, and other styles
38

Falasari, Anisa, and Much Aziz Muslim. "Optimize Naïve Bayes Classifier Using Chi Square and Term Frequency Inverse Document Frequency For Amazon Review Sentiment Analysis." Journal of Soft Computing Exploration 3, no. 1 (2022): 31–36. http://dx.doi.org/10.52465/joscex.v3i1.68.

Full text
Abstract:
The rapid development of the internet has made information flow rapidly, which has an impact on the world of commerce. Some people who have bought a product will write their opinion on social media or other online sites. Long-text buyer reviews need a machine to recognize opinions. Sentiment analysis applies the text mining method, and one of the methods applied in sentiment analysis is classification. One of the classification algorithms is the naïve bayes classifier, a classification method with good efficiency and performance. However, it is very sensitive to too many features, which makes the accuracy low. The accuracy of the naïve bayes classifier algorithm can be improved by selecting features; one feature selection method is chi square. Features are selected by chi square calculation based on a predetermined top-K value, namely 450. In addition, feature weighting can also improve the accuracy of the naïve bayes classifier algorithm; one feature weighting technique is term frequency inverse document frequency (TF-IDF). This study uses the sentiment labelled dataset (amazon_labelled field) obtained from UCI Machine Learning. This dataset has 500 positive reviews and 500 negative reviews. The accuracy of the naïve bayes classifier on the amazon review sentiment analysis was 82%, while the accuracy of the naïve bayes classifier applying chi square and TF-IDF is 83%.
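The chi-square feature selection this abstract describes scores each term from the standard 2x2 document contingency table; a minimal sketch (the helper names, the top-K interface, and the toy data are illustrative):

```python
def chi_square(a, b, c, d):
    """Chi-square score of a term from its 2x2 contingency table.

    a: positive docs containing the term   b: negative docs containing it
    c: positive docs missing the term      d: negative docs missing it
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0

def top_k_terms(pos_docs, neg_docs, k):
    """Keep the k terms whose presence is most dependent on the class."""
    vocab = set().union(*pos_docs, *neg_docs)
    def score(t):
        a = sum(t in d for d in pos_docs)
        b = sum(t in d for d in neg_docs)
        return chi_square(a, b, len(pos_docs) - a, len(neg_docs) - b)
    return sorted(vocab, key=score, reverse=True)[:k]
```

With documents represented as token sets, the surviving terms would then be TF-IDF weighted before training the classifier, matching the pipeline in the paper.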
APA, Harvard, Vancouver, ISO, and other styles
39

Pande Made Risky Cahya Dinatha and Nur Aini Rakhmawati. "Komparasi Term Weighting dan Word Embedding pada Klasifikasi Tweet Pemerintah Daerah." Jurnal Nasional Teknik Elektro dan Teknologi Informasi 9, no. 2 (2020): 155–61. http://dx.doi.org/10.22146/jnteti.v9i2.90.

Full text
Abstract:
The emergence of social media has encouraged governments to use it as a means of disseminating information. The information provided should benefit the public in order to improve the government-to-citizen relationship. Classifying local government social media posts makes it possible to identify the types of information uploaded. Research on classifying social media posts for the case of local governments in Indonesia has been carried out successfully, but the text processing used to build the classification models can still be explored. The text processing methods discussed in this paper are term weighting and word embedding. The aim of this paper is to compare term frequency-inverse document frequency (TF-IDF) term weighting, Okapi BM25, and doc2vec word embedding in generating features to address the short-text classification problem. The paper represents text as features for classification, evaluates the performance of classification models applying these techniques, and compares the performance of each model to determine the best method for classifying local government social media posts in Indonesia. Six classes are used to classify 1,000 short texts from 91 local government accounts. Precision, recall, F1, macro-average, micro-average, and AUC are measured for each model. The results show that the TF-IDF model with a linear SVM outperforms logistic regression, with scores of 0.572 and 0.766 on macro-average recall and micro-average recall, respectively.
APA, Harvard, Vancouver, ISO, and other styles
40

Wulandari, Dyah Ayu, Fitra A. Bachtiar, and Indriati Indriati. "Aspect Based Sentiment Analysis on Shopee Application Reviews Using Support Vector Machine." Lontar Komputer : Jurnal Ilmiah Teknologi Informasi 15, no. 02 (2025): 99. https://doi.org/10.24843/lkjiti.2024.v15.i02.p03.

Full text
Abstract:
One of the e-commerce in Indonesia is Shopee. Feedback from users is needed to improve the quality of e-commerce services and user satisfaction. This research process includes data scraping, labeling, text pre-processing, TF-IDF, aspect, and sentiment classification. The novelty of this research is using the SVM method with SGD to classify Indonesian language application reviews based on aspect categories consisting of 7 dimensions of service quality and sentiment so that the website created in this research can display the aspects and sentiments of the input reviews. This research also builds an Indonesian normalization dictionary to optimize the terms used to increase model accuracy. The test in aspect classification resulted in a precision value of 90%, recall of 88.73%, accuracy of 88.57%, and f1-score of 89%. Meanwhile, the sentiment classification resulted in a precision value of 96.15%, recall of 91.91%, accuracy of 94.28%, and f1-score of 93.98%. In addition, the test results (accuracy, f1-score, precision, recall) show that the lemmatization process is better than stemming and term weighting using the TF-IDF method is better than other methods (raw-term frequency, log-frequency weighting, binary-term weighting).
APA, Harvard, Vancouver, ISO, and other styles
41

Yusof, Yuhanis, Taha Alhersh, Massudi Mahmuddin, and Aniza Mohamed Din. "Source Code Classification using Latent Semantic Indexing with Structural and Frequency Term Weighting." Research Journal of Applied Sciences 7, no. 5 (2012): 266–71. http://dx.doi.org/10.3923/rjasci.2012.266.271.

Full text
APA, Harvard, Vancouver, ISO, and other styles
42

Fauzi, M. Ali, Agus Zainal Arifin, and Anny Yuniarti. "Arabic Book Retrieval using Class and Book Index Based Term Weighting." International Journal of Electrical and Computer Engineering (IJECE) 7, no. 6 (2017): 3705. http://dx.doi.org/10.11591/ijece.v7i6.pp3705-3710.

Full text
Abstract:
One of the most common issues in information retrieval is document ranking. A document ranking system collects search terms from the user and retrieves documents ordered by relevance. Vector space models based on TF.IDF term weighting are the most common method for this task. In this study, we are concerned with the automatic retrieval of an Islamic Fiqh (Law) book collection. This collection contains many books, each of which has tens to hundreds of pages. Each page of a book is treated as a document that will be ranked based on the user query. We developed a class-based indexing method called inverse class frequency (ICF) and a book-based indexing method, inverse book frequency (IBF), for this Arabic information retrieval task. These methods were then incorporated into the previous method to form TF.IDF.ICF.IBF. The term weighting method is also used for feature selection due to the high dimensionality of the feature space. This novel method was tested using a dataset of 13 Arabic Fiqh e-books. The experimental results showed that the proposed method achieves higher precision, recall, and F-measure than the other three methods across variations of feature selection. The best performance of this method was obtained when using the best 1000 features, with a precision of 76%, recall of 74%, and F-measure of 75%.
APA, Harvard, Vancouver, ISO, and other styles
43

Li, Yong Fei. "A Feature Weight Algorithm for Text Classification Based on Class Information." Advanced Materials Research 756-759 (September 2013): 3419–22. http://dx.doi.org/10.4028/www.scientific.net/amr.756-759.3419.

Full text
Abstract:
The TFIDF algorithm is used for feature weighting in text classification, but classification results were not very good because class information was absent from the feature weighting. The known class information in the training set was used to improve the traditional TFIDF feature weight algorithm. Class distinction ability and class description ability were introduced, expressed respectively by inverse class frequency and by term frequency in class together with document frequency in class. A new feature weight algorithm based on class information, TF_IDT, was proposed. A Naïve Bayes classifier was used to test the algorithm. The precision, recall, and F1 measure were significantly increased; the macro F1 measure rose by 6.46%. This proved that using class information in feature weighting is useful for improving text classification. In addition, the computational complexity of the proposed algorithm is lower, making it more suitable for use in fields of limited computing capability.
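The inverse class frequency factor mentioned here is conventionally computed like IDF with classes in place of documents; a sketch under that assumption (the data layout is illustrative, not the paper's):

```python
import math
from collections import defaultdict

def inverse_class_frequency(class_docs):
    """ICF per term: log(#classes / #classes whose documents contain it).

    class_docs maps a class label to its list of tokenized documents.
    A term confined to few classes gets a high ICF (it distinguishes
    classes); a term present in every class gets ICF 0.
    """
    n_classes = len(class_docs)
    cf = defaultdict(int)          # class frequency per term
    for docs in class_docs.values():
        for t in set().union(*docs):
            cf[t] += 1
    return {t: math.log(n_classes / k) for t, k in cf.items()}
```

Multiplying such a factor into TF-IDF is what lets class-discriminating terms outweigh terms that are merely frequent overall.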
APA, Harvard, Vancouver, ISO, and other styles
44

Taufik, Ichsan, Agra Agra, and Yana Aditia Gerhana. "Vector space model, term frequency-inverse document frequency with linear search, and object-relational mapping Django on hadith data search." Computer Science and Information Technologies 5, no. 3 (2024): 306–14. http://dx.doi.org/10.11591/csit.v5i3.p306-314.

Full text
Abstract:
For Muslims, the Hadith ranks as the secondary legal authority following the Quran. This research leverages hadith data to streamline the search process within the nine imams’ compendium using the vector space model (VSM) approach. The primary objective of this research is to enhance the efficiency and effectiveness of the search process within Hadith collections by implementing pre-filtering techniques. This study aims to demonstrate the potential of linear search and Django object-relational mapping (ORM) filters in reducing search times and improving retrieval performance, thereby facilitating quicker and more accurate access to relevant Hadiths. Prior studies have indicated that VSM is efficient for large data sets because it assigns weights to every term across all documents, regardless of whether they include the search keywords. Consequently, the more documents there are, the more protracted the weighting phase becomes. To address this, the current research pre-filters documents prior to weighting, utilizing linear search and Django ORM as filters. Testing on 62,169 hadiths with 20 keywords revealed that the average VSM search duration was 51 seconds. However, with the implementation of linear and Django ORM filters, the times were reduced to 7.93 and 8.41 seconds, respectively. The recall@10 rates were 79% and 78.5%, with MAP scores of 0.819 and 0.814, respectively.
APA, Harvard, Vancouver, ISO, and other styles
45

Taufik, Ichsan, Agra Agra, and Yana Aditia Gerhana. "Vector space model, term frequency-inverse document frequency with linear search, and object-relational mapping Django on hadith data search." Computer Science and Information Technologies 5, no. 3 (2024): 306–14. https://doi.org/10.11591/csit.v5i3.pp306-314.

Full text
Abstract:
For Muslims, the Hadith ranks as the secondary legal authority following the Quran. This research leverages hadith data to streamline the search process within the nine imams’ compendium using the vector space model (VSM) approach. The primary objective of this research is to enhance the efficiency and effectiveness of the search process within Hadith collections by implementing pre-filtering techniques. This study aims to demonstrate the potential of linear search and Django object-relational mapping (ORM) filters in reducing search times and improving retrieval performance, thereby facilitating quicker and more accurate access to relevant Hadiths. Prior studies have indicated that VSM is efficient for large data sets because it assigns weights to every term across all documents, regardless of whether they include the search keywords. Consequently, the more documents there are, the more protracted the weighting phase becomes. To address this, the current research pre-filters documents prior to weighting, utilizing linear search and Django ORM as filters. Testing on 62,169 hadiths with 20 keywords revealed that the average VSM search duration was 51 seconds. However, with the implementation of linear and Django ORM filters, the times were reduced to 7.93 and 8.41 seconds, respectively. The recall@10 rates were 79% and 78.5%, with MAP scores of 0.819 and 0.814, respectively.
APA, Harvard, Vancouver, ISO, and other styles
46

Larasati, Ukhti Ikhsani, Much Aziz Muslim, Riza Arifudin, and Alamsyah Alamsyah. "Improve the Accuracy of Support Vector Machine Using Chi Square Statistic and Term Frequency Inverse Document Frequency on Movie Review Sentiment Analysis." Scientific Journal of Informatics 6, no. 1 (2019): 138–49. http://dx.doi.org/10.15294/sji.v6i1.14244.

Full text
Abstract:
Data processing can be done with text mining techniques. Processing large text data requires a machine to explore opinions, including positive or negative opinions. Sentiment analysis is a process that applies text mining methods and aims to determine whether the content of a text dataset is positive or negative. The support vector machine is one of the classification algorithms that can be used for sentiment analysis; however, it works less well on large data. In addition, one constraint in the text mining process is the number of attributes used: with many attributes, the performance of the classifier degrades, giving a low level of accuracy. The purpose of this research is to increase support vector machine accuracy through feature selection and feature weighting. Feature selection reduces the large number of irrelevant attributes; in this study, features are selected based on the top value of K = 500. Once the relevant attributes are selected, feature weighting is performed to calculate the weight of each selected attribute. The feature selection method used is the chi square statistic, and feature weighting uses Term Frequency Inverse Document Frequency (TFIDF). Experimental results using Matlab R2017b show that integrating the support vector machine with the chi square statistic and TFIDF under 10-fold cross validation increases accuracy by 11.5%: the support vector machine without chi square statistic and TFIDF achieved an accuracy of 68.7%, while applying chi square statistic and TFIDF achieved 80.2%.
APA, Harvard, Vancouver, ISO, and other styles
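The pipeline the abstract describes (chi-square feature selection, TFIDF weighting, then an SVM with cross-validation) can be sketched roughly as follows. The paper used Matlab R2017b; this is a Python/scikit-learn approximation, and the tiny review corpus, the k=5 cutoff, and the 2-fold split are illustrative stand-ins for the paper's real data, K = 500, and 10 folds.

```python
# Sketch of the described pipeline: term counts -> chi-square selection of
# the top-k terms -> TF-IDF weighting -> linear SVM, scored by cross-validation.
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

docs = ["great acting and a moving story", "dull plot and wooden acting",
        "a moving great film", "wooden dialogue dull film"]
labels = [1, 0, 1, 0]  # 1 = positive review, 0 = negative

pipeline = Pipeline([
    ("counts", CountVectorizer()),
    ("chi2", SelectKBest(chi2, k=5)),   # the paper selects the top K = 500
    ("tfidf", TfidfTransformer()),
    ("svm", LinearSVC()),
])
scores = cross_val_score(pipeline, docs, labels, cv=2)  # paper: 10-fold CV
print(scores.mean())
```

Putting the selection step inside the pipeline (rather than selecting once on the whole corpus) keeps the chi-square statistics fold-local, so the cross-validation estimate is not leaked by test-fold counts.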
47

A. Nicholas, Danie, and Devi Jayanthila. "Data retrieval in cancer documents using various weighting schemes." i-manager's Journal on Information Technology 12, no. 4 (2023): 28. http://dx.doi.org/10.26634/jit.12.4.20365.

Full text
Abstract:
In the realm of data retrieval, sparse vectors serve as a pivotal representation for both documents and queries, where each element in the vector denotes a word or phrase from a predefined lexicon. In this study, multiple scoring mechanisms are introduced, aimed at discerning the significance of specific terms within the context of a document extracted from an extensive textual dataset. Among these techniques, the most widely employed revolves around inverse document frequency (IDF) or Term Frequency-Inverse Document Frequency (TF-IDF), which emphasizes terms unique to a given context. BM25 complements TF-IDF and remains in widespread use alongside it. However, a notable limitation of these approaches lies in their reliance on near-exact matches for document retrieval. To address this issue, researchers have devised latent semantic analysis (LSA), in which documents are densely represented as low-dimensional vectors. Rigorous testing within a simulated environment indicates a superior level of accuracy compared to preceding methodologies.
APA, Harvard, Vancouver, ISO, and other styles
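The two sparse scoring schemes the abstract contrasts with LSA can be written down compactly. A minimal sketch, assuming standard formulations: plain TF-IDF and Okapi BM25 with the conventional k1 = 1.5, b = 0.75 defaults (the article does not state its parameter choices), over a toy corpus of tokenized documents.

```python
# Per-term scores under TF-IDF and BM25 for documents given as token lists.
import math

def tf_idf(term, doc, docs):
    # raw term frequency times log inverse document frequency
    df = sum(1 for d in docs if term in d)
    return doc.count(term) * math.log(len(docs) / df) if df else 0.0

def bm25(term, doc, docs, k1=1.5, b=0.75):
    # Okapi BM25: saturating TF component with document-length normalization
    tf = doc.count(term)
    df = sum(1 for d in docs if term in d)
    n = len(docs)
    idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
    avgdl = sum(len(d) for d in docs) / n
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))

docs = [["tumor", "marker", "screening"],
        ["tumor", "biopsy"],
        ["screening", "protocol", "review"]]
print(tf_idf("tumor", docs[0], docs))
print(bm25("tumor", docs[0], docs))
```

Both scores reward terms that are frequent in the document but rare in the collection; BM25 additionally caps the contribution of repeated occurrences, which is why it remains the stronger lexical baseline.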
48

Sholikah, Rizka, Dhian Kartika, Agus Zainal Arifin, and Diana Purwitasari. "TERM WEIGHTING BASED ON POSITIVE IMPACT FACTOR QUERY FOR ARABIC FIQH DOCUMENT RANKING." Jurnal Ilmu Komputer dan Informasi 10, no. 1 (2017): 29. http://dx.doi.org/10.21609/jiki.v10i1.408.

Full text
Abstract:
The query is one of the most decisive factors in document searching. A query contains several words, one of which is the key term: the word that carries more information and value than the others in the query. Key terms can be used with any kind of text document, including Arabic Fiqh documents, and using them in the term weighting process can improve the relevancy of results. In Arabic Fiqh document searching, an improper term weighting method loses the important value of the key term. In this paper, we propose a new term weighting method based on the Positive Impact Factor Query (PIFQ) for ranking Arabic Fiqh documents. PIFQ is calculated from the key term's frequency in each category (mazhab) of Fiqh: a key term that appears frequently in a certain mazhab gets a higher score for that mazhab, and vice versa. After the PIFQ values are acquired, TF.IDF is calculated for each word, and the PIFQ weight is combined with the TF.IDF result to produce the new weight for each word. Experimental results on a number of queries over 143 Arabic Fiqh documents show that the proposed method is better than traditional TF.IDF, with 77.9% precision, 83.1% recall, and 80.1% F-measure.
APA, Harvard, Vancouver, ISO, and other styles
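The shape of the PIFQ idea (a category-conditioned key-term weight folded into TF.IDF) can be illustrated as below. This is a hypothetical sketch only: the abstract does not give the PIFQ equation or the combination rule, so the document-fraction form of `pifq` and the product combination are assumptions for illustration, and the toy topic words stand in for real Fiqh terms.

```python
# Hypothetical PIFQ-style weighting: a key term scores higher in a category
# (mazhab) where it occurs often, and that score modulates plain TF.IDF.
import math

def tf_idf(term, doc, docs):
    df = sum(1 for d in docs if term in d)
    return doc.count(term) * math.log(len(docs) / df) if df else 0.0

def pifq(key_term, category_docs):
    # assumed form: fraction of the category's documents containing the term
    return sum(1 for d in category_docs if key_term in d) / len(category_docs)

def combined_weight(term, doc, docs, category_docs):
    # assumed combination: category weight times the global TF.IDF weight
    return pifq(term, category_docs) * tf_idf(term, doc, docs)

category = [["prayer", "ablution"], ["prayer", "fasting"]]   # one mazhab
corpus = category + [["inheritance", "contract"]]
print(combined_weight("prayer", corpus[0], corpus, category))
```

The effect is the one the abstract describes: a term common within the queried mazhab keeps (or gains) weight, while a term rare there is damped even if its global TF.IDF is high.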
49

Grön, Leonie, and Ann Bertels. "Clinical sublanguages." Computational terminology and filtering of terminological information 24, no. 1 (2018): 41–65. http://dx.doi.org/10.1075/term.00013.gro.

Full text
Abstract:
Abstract Due to its specific linguistic properties, the language found in clinical records has been characterized as a distinct sublanguage. Even within the clinical domain, though, there are major differences in language use, which has led to more fine-grained distinctions based on medical fields and document types. However, previous work has mostly neglected the influence of term variation. By contrast, we propose to integrate the potential for term variation in the characterization of clinical sublanguages. By analyzing a corpus of clinical records, we show that the different sections of these records vary systematically with regard to their lexical, terminological and semantic composition, as well as their potential for term variation. These properties have implications for automatic term recognition, as they influence the performance of frequency-based term weighting.
APA, Harvard, Vancouver, ISO, and other styles
50

Bounabi, Mariem, Karim Elmoutaouakil, and Khalid Satori. "A new neutrosophic TF-IDF term weighting for text mining tasks: text classification use case." International Journal of Web Information Systems 17, no. 3 (2021): 229–49. http://dx.doi.org/10.1108/ijwis-11-2020-0067.

Full text
Abstract:
Purpose This paper aims to present a new term weighting approach for text classification as a text mining task. The original method, neutrosophic term frequency – inverse term frequency (NTF-IDF), is an extended version of the popular fuzzy TF-IDF (FTF-IDF) and uses neutrosophic reasoning to analyze and generate weights for terms in natural languages. The paper also proposes a comparative study of FTF-IDF and NTF-IDF and their impact on different machine learning (ML) classifiers for document categorization. Design/methodology/approach After preprocessing the textual data, the original neutrosophic TF-IDF applies a neutrosophic inference system (NIS) to produce weights for the terms representing a document. Using the local frequency TF, the global frequency IDF, and the text length N as NIS inputs, this study generates two neutrosophic weights for a given term: the first measures a word's degree of relevance, and the second represents its degree of ambiguity. Next, the Zhang combination function is applied to combine the neutrosophic weight outputs into the final term weight, which is inserted into the document's representative vector. To analyze the impact of NTF-IDF on the classification phase, this study uses a set of ML algorithms. Findings By exploiting the characteristics of neutrosophic logic (NL), the authors were able to study the ambiguity of terms and their degree of relevance in representing a document. The choice of NL has proven effective in defining significant text vectorization weights, especially for text classification tasks. The experiments demonstrate that the new method positively impacts categorization: the adopted system's recognition rate exceeds 91%, an accuracy score not attained with FTF-IDF. Also, on benchmarked data sets from different text mining fields and with many ML classifiers, i.e. SVM and feed-forward networks, applying the proposed NTF-IDF term scores improves accuracy by 10%. Originality/value The novelty of this paper lies in two aspects: first, a new term weighting method that uses term frequencies to define the relevance and ambiguity of a term; second, the application of NL to infer weights, presented as an original model that also aims to correct the shortcomings of FTF-IDF, which uses fuzzy logic. The proposed technique was combined with different ML models to improve the accuracy and relevance of the feature vectors fed to the classification mechanism.
APA, Harvard, Vancouver, ISO, and other styles
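The data flow the abstract describes (TF, IDF, and document length go into an inference step that emits a relevance degree and an ambiguity degree, which are then combined into one term weight) can be sketched structurally. The three functions below are simple placeholders chosen only to illustrate that flow; the paper's actual neutrosophic inference system and the Zhang combination function are not specified in the abstract.

```python
# Structural stand-in for the NTF-IDF flow: (tf, idf, doc_len) -> two degrees
# in [0, 1] -> one combined term weight. All formulas here are placeholders.
import math

def relevance_degree(tf, idf, doc_len):
    # placeholder: TF-IDF signal scaled by length, squashed into [0, 1]
    return 1 - math.exp(-tf * idf / max(doc_len, 1))

def ambiguity_degree(tf, idf, doc_len):
    # placeholder: common terms (low IDF) are treated as more ambiguous
    return math.exp(-idf)

def ntf_idf(tf, idf, doc_len):
    # placeholder combination: relevance discounted by ambiguity
    return relevance_degree(tf, idf, doc_len) * (1 - ambiguity_degree(tf, idf, doc_len))

print(ntf_idf(3, 2.0, 50))   # frequent, discriminative term
print(ntf_idf(1, 0.1, 50))   # rare occurrence of a very common term
```

Whatever the concrete inference rules, the point of the second output channel is the one the abstract makes: a term can be frequent yet ambiguous, and the combined weight should discount it accordingly.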