Academic literature on the topic 'Term Frequency-Inverse Document Frequency Vectors'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles

Consult the lists of relevant articles, books, theses, conference reports, and other scholarly sources on the topic 'Term Frequency-Inverse Document Frequency Vectors.'

Next to every source in the list of references there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Journal articles on the topic "Term Frequency-Inverse Document Frequency Vectors"

1

Mohammed, Mohannad T., and Omar Fitian Rashid. "Document retrieval using term frequency inverse sentence frequency weighting scheme." Indonesian Journal of Electrical Engineering and Computer Science 31, no. 3 (2023): 1478–85. http://dx.doi.org/10.11591/ijeecs.v31.i3.pp1478-1485.

Full text
Abstract:
The need for an efficient method to find the most appropriate document for a particular search query has become crucial due to the exponential growth in the number of papers readily available on the web. The vector space model (VSM), a classic model used in information retrieval, represents documents as vectors in space and assigns term weights via a popular weighting method known as term frequency-inverse document frequency (TF-IDF). This research proposes retrieving the most relevant documents by representing documents and queries as vectors of average term frequency-inverse sentence frequency (TF-ISF) weights instead of vectors of TF-IDF weights; two basic and effective similarity measures, Cosine and Jaccard, were used. Using the MS MARCO dataset, the article analyzes and assesses the retrieval effectiveness of the TF-ISF weighting scheme. The results show that the TF-ISF model with the Cosine similarity measure retrieves more relevant documents. The model was evaluated against the conventional TF-IDF technique and performs significantly better on MS MARCO data (Microsoft-curated data of Bing queries).
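As a rough illustration of the weighting scheme this abstract describes, the sketch below computes TF-ISF weights over a toy tokenized corpus and compares documents with the Cosine and Jaccard measures. It is a minimal reading of the method under stated assumptions, not the authors' code; the data layout and helper names are invented.

```python
import math

def tf_isf_vectors(documents):
    """TF-ISF: term frequency weighted by inverse *sentence* frequency.

    documents: list of documents; each document is a list of sentences;
    each sentence is a list of tokens.
    """
    sentences = [s for doc in documents for s in doc]
    n_sentences = len(sentences)
    sent_freq = {}
    for sent in sentences:
        for term in set(sent):
            sent_freq[term] = sent_freq.get(term, 0) + 1

    vectors = []
    for doc in documents:
        tokens = [t for sent in doc for t in sent]
        vec = {}
        for term in set(tokens):
            tf = tokens.count(term) / len(tokens)
            isf = math.log(n_sentences / sent_freq[term])
            vec[term] = tf * isf
        vectors.append(vec)
    return vectors

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def jaccard(u, v):
    a, b = set(u), set(v)
    return len(a & b) / len(a | b) if a | b else 0.0
```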
APA, Harvard, Vancouver, ISO, and other styles
2

Widianto, Adi, Eka Pebriyanto, Fitriyanti Fitriyanti, and Marna Marna. "Document Similarity Using Term Frequency-Inverse Document Frequency Representation and Cosine Similarity." Journal of Dinda : Data Science, Information Technology, and Data Analytics 4, no. 2 (2024): 149–53. http://dx.doi.org/10.20895/dinda.v4i2.1589.

Full text
Abstract:
Document similarity is a fundamental task in natural language processing and information retrieval, with applications ranging from plagiarism detection to recommendation systems. In this study, we leverage the term frequency-inverse document frequency (TF-IDF) to represent documents in a high-dimensional vector space, capturing their unique content while mitigating the influence of common terms. Subsequently, we employ the cosine similarity metric to measure the similarity between pairs of documents, which assesses the angle between their respective TF-IDF vectors. To evaluate the effectiveness of our approach, we conducted experiments on the Document Similarity Triplets Dataset, a benchmark dataset specifically designed for assessing document similarity techniques. Our experimental results demonstrate a significant performance with an accuracy score of 93.6% using bigram-only representation. However, we observed instances where false predictions occurred due to paired documents having similar terms but differing semantics, revealing a weakness in the TF-IDF approach. To address this limitation, future research could focus on augmenting document representations with semantic features. Incorporating semantic information, such as word embeddings or contextual embeddings, could enhance the model's ability to capture nuanced semantic relationships between documents, thereby improving accuracy in scenarios where term overlap does not adequately signify similarity.
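The pipeline in this abstract maps directly onto standard scikit-learn components. A minimal sketch, assuming scikit-learn is available; the sample sentences are placeholders, and ngram_range=(2, 2) mirrors the bigram-only representation reported above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [  # placeholder documents, not the paper's dataset
    "the cat sat on the mat",
    "a cat sat on a mat",
    "stock markets fell sharply today",
]

# ngram_range=(2, 2) reproduces the bigram-only representation reported above
vectorizer = TfidfVectorizer(ngram_range=(2, 2))
X = vectorizer.fit_transform(docs)

# Cosine similarity between every pair of TF-IDF document vectors
print(cosine_similarity(X))
```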
APA, Harvard, Vancouver, ISO, and other styles
3

Hu, Zheng, Hua Dai, Geng Yang, Xun Yi, and Wenjie Sheng. "Semantic-Based Multi-Keyword Ranked Search Schemes over Encrypted Cloud Data." Security and Communication Networks 2022 (April 29, 2022): 1–15. http://dx.doi.org/10.1155/2022/4478618.

Full text
Abstract:
Traditional searchable encryption schemes construct document vectors based on the term frequency-inverse document frequency (TF-IDF) model. Such vectors are not only high-dimensional and sparse but also ignore the semantic information of the documents. The Sentence Bidirectional Encoder Representations from Transformers (SBERT) model can be used to train vectors containing document semantic information to realize semantic-aware multi-keyword search. In this paper, we propose a privacy-preserving searchable encryption scheme based on the SBERT model. The SBERT model is used to train vectors containing the semantic information of documents, and these document vectors are then used as input to the Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) clustering algorithm. The HDBSCAN algorithm generates a soft cluster membership vector for each document. We treat each cluster as a topic, and the vector represents the probability that the document belongs to each topic. Based on this clustering process, the topic-term frequency-inverse topic frequency (TTF-ITF) model is proposed to generate keyword topic vectors. Through the SBERT model, the searchable encryption scheme can achieve more precise semantic-aware keyword search, while a special index tree is used to improve search efficiency. Experimental results on real datasets prove the effectiveness of our scheme.
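For readers who want to reproduce the front half of this pipeline, the sketch below pairs a sentence-embedding model with HDBSCAN's soft clustering to get per-document topic-membership vectors. The model name, corpus, and parameters are illustrative assumptions, not the paper's configuration:

```python
from sentence_transformers import SentenceTransformer
import hdbscan

# Placeholder corpus; any document collection works here
docs = [f"heart disease symptoms and treatment {i}" for i in range(10)] + \
       [f"stock market trading strategies {i}" for i in range(10)]

# Model choice is an assumption; the paper only requires SBERT-style embeddings
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(docs)

# prediction_data=True is required for soft (probabilistic) memberships
clusterer = hdbscan.HDBSCAN(min_cluster_size=5, prediction_data=True).fit(embeddings)

# One vector per document: the probability of belonging to each cluster ("topic")
membership = hdbscan.all_points_membership_vectors(clusterer)
print(membership.shape)
```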
APA, Harvard, Vancouver, ISO, and other styles
4

Mesariya, Priyanka, and Nidhi Madia. "Document Ranking using Customizes Vector Method." International Journal of Trend in Scientific Research and Development 1, no. 4 (2017): 278–83. https://doi.org/10.31142/ijtsrd125.

Full text
Abstract:
An information retrieval (IR) system ranks documents against a user's query and retrieves the relevant records from a large dataset. Document ranking is essentially searching for the relevant documents and ordering them by rank. The vector space model is a traditional and widely applied information retrieval model that ranks web pages based on similarity values. Term weighting schemes are a significant part of an information retrieval system and are applied to the query used in document ranking. TF-IDF ranking calculates term weights according to the user's query, based on the terms included in the documents. When a user enters a query, the system finds the documents in which the query terms appear, counts those terms, calculates their TF-IDF values, and returns the documents ranked by highest weight.
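The ranking loop this abstract describes, queries and documents as TF-IDF vectors ordered by similarity, can be sketched with scikit-learn as follows; the documents and query are invented placeholders:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["information retrieval ranks documents by relevance",
        "the vector space model represents web pages as vectors",
        "cooking recipes and kitchen tips"]
query = "vector space retrieval"

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)
query_vector = vectorizer.transform([query])

# Score every document against the query and sort by descending weight
scores = cosine_similarity(query_vector, doc_vectors).ravel()
for i in scores.argsort()[::-1]:
    print(round(scores[i], 3), docs[i])
```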
APA, Harvard, Vancouver, ISO, and other styles
5

Nicholas, Danie A., and Jayanthila Devi. "Data retrieval in cancer documents using various weighting schemes." i-manager's Journal on Information Technology 12, no. 4 (2023): 28. http://dx.doi.org/10.26634/jit.12.4.20365.

Full text
Abstract:
In the realm of data retrieval, sparse vectors serve as a pivotal representation for both documents and queries, where each element in the vector denotes a word or phrase from a predefined lexicon. In this study, multiple scoring mechanisms are introduced aimed at discerning the significance of specific terms within the context of a document extracted from an extensive textual dataset. Among these techniques, the widely employed method revolves around inverse document frequency (IDF) or Term Frequency-Inverse Document Frequency (TF-IDF), which emphasizes terms unique to a given context. Additionally, the integration of BM25 complements TF-IDF, sustaining its prevalent usage. However, a notable limitation of these approaches lies in their reliance on near-perfect matches for document retrieval. To address this issue, researchers have devised latent semantic analysis (LSA), wherein documents are densely represented as low-dimensional vectors. Through rigorous testing within a simulated environment, findings indicate a superior level of accuracy compared to preceding methodologies.
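As a hedged sketch of the contrast this abstract draws, the snippet below builds sparse TF-IDF vectors and then applies truncated SVD, the usual way to realize LSA's dense low-dimensional representation; the toy corpus and component count are assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "tumor cells respond to the drug",
    "cancer cells respond to treatment",
    "the stock market closed higher",
    "investors traded shares on the market",
]

X = TfidfVectorizer().fit_transform(docs)   # sparse, high-dimensional TF-IDF
lsa = TruncatedSVD(n_components=2)          # LSA: dense low-dimensional vectors
X_dense = lsa.fit_transform(X)
print(X_dense)  # related documents now align even without exact term matches
```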
APA, Harvard, Vancouver, ISO, and other styles
6

Murata, Hiroshi, Takashi Onoda, and Seiji Yamada. "Comparative Analysis of Relevance for SVM-Based Interactive Document Retrieval." Journal of Advanced Computational Intelligence and Intelligent Informatics 17, no. 2 (2013): 149–56. http://dx.doi.org/10.20965/jaciii.2013.p0149.

Full text
Abstract:
Support Vector Machines (SVMs) were applied to interactive document retrieval that uses active learning. In such a retrieval system, the degree of relevance is evaluated by using a signed distance from the optimal hyperplane. It is not clear, however, how the signed distance in SVMs has characteristics of vector space model. We therefore formulated the degree of relevance by using the signed distance in SVMs and comparatively analyzed it with a conventional Rocchio-based method. Although vector normalization has been utilized as preprocessing for document retrieval, few studies explained why vector normalization was effective. Based on our comparative analysis, we theoretically show the effectiveness of normalizing document vectors in SVM-based interactive document retrieval. We then propose a cosine kernel that is suitable for SVM-based interactive document retrieval. The effectiveness of the method was compared experimentally with conventional relevance feedback for Boolean, Term Frequency and Term Frequency-Inverse Document Frequency representations of document vectors. Experimental results for a Text REtrieval Conference data set showed that the cosine kernel is effective for all document representations, especially Term Frequency representation.
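The paper's central observation, that L2-normalizing document vectors makes an SVM's linear kernel behave as a cosine kernel, can be sketched as follows (toy vectors and labels, not the TREC setup):

```python
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.svm import SVC

# Toy TF-IDF-like document vectors with relevance labels (placeholders)
X = np.array([[2.0, 0.0, 1.0], [0.0, 3.0, 1.0], [1.0, 1.0, 0.0], [0.0, 2.0, 2.0]])
y = np.array([1, 0, 1, 0])

# After L2 normalization the linear kernel equals the cosine kernel:
# <u/||u||, v/||v||> = cos(u, v)
clf = SVC(kernel="linear").fit(normalize(X), y)

# The signed distance from the hyperplane serves as the degree of relevance
print(clf.decision_function(normalize(X)))
```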
APA, Harvard, Vancouver, ISO, and other styles
7

Ni'mah, Ana Tsalitsatun, and Agus Zainal Arifin. "Perbandingan Metode Term Weighting terhadap Hasil Klasifikasi Teks pada Dataset Terjemahan Kitab Hadis." Rekayasa 13, no. 2 (2020): 172–80. http://dx.doi.org/10.21107/rekayasa.v13i2.6412.

Full text
Abstract:
Hadith is the second source of reference for Islam after the Qur'an. Hadith texts are now studied in the field of technology so that the values they contain can be captured computationally. Retrieving information from the Books of Hadith requires representing the text as vectors to optimize automatic classification, and classification is needed to group the contents of the Hadith into several categories. Some categories in a given Book of Hadith coincide with those of other Books, meaning that certain documents share topics across Books. Therefore, a term weighting method is needed that can decide which words should receive high or low weights in the Hadith Book space to optimize the classification results. This study compares several term weighting methods: Term Frequency Inverse Document Frequency (TF-IDF), Term Frequency Inverse Document Frequency Inverse Class Frequency (TF-IDF-ICF), Term Frequency Inverse Document Frequency Inverse Class Space Density Frequency (TF-IDF-ICSδF), and Term Frequency Inverse Document Frequency Inverse Class Space Density Frequency Inverse Hadith Space Density Frequency (TF-IDF-ICSδF-IHSδF). The term weighting results are compared on the dataset of translations of the nine Books of Hadith, applied to Naive Bayes and SVM classifiers. The nine Books used are: Sahih Bukhari, Sahih Muslim, Abu Dawud, at-Turmudzi, an-Nasa'i, Ibnu Majah, Ahmad, Malik, and Darimi. The trial results show that classification using the TF-IDF-ICSδF-IHSδF term weighting method outperformed the other term weightings, achieving a Precision of 90%, Recall of 93%, F1-Score of 92%, and Accuracy of 83%.
APA, Harvard, Vancouver, ISO, and other styles
8

I Wayan Alston Argodi, Eva Yulia Puspaningrum, and Muhammad Muharrom Al Haromainy. "IMPLEMENTASI METODE TF-IDF DAN ALGORITMA NAIVE BAYES DALAM APLIKASI DIABETIC BERBASIS ANDROID." Jurnal Teknik Mesin, Elektro dan Ilmu Komputer 3, no. 2 (2023): 23–33. http://dx.doi.org/10.55606/teknik.v3i2.2009.

Full text
Abstract:
Diabetes is a serious disease that occurs when the pancreas does not produce enough insulin, the hormone that regulates blood sugar in the body, and it has a broad impact on health. This research builds an Android-based application called Diabetic to help classify and provide information related to diabetes, and analyzes the performance of the Term Frequency-Inverse Document Frequency method and the Naive Bayes algorithm. The Term Frequency-Inverse Document Frequency method is a technique for weighting the occurrence of words in a collection of documents by creating document vectors. The Naive Bayes algorithm uses probability to solve a classification case and is efficient and fast to compute. Based on this research, the Naive Bayes algorithm produces an accuracy of 66%, with a computation time of 39 seconds and memory consumption of 80 to 351 MB.
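A minimal sketch of the described combination, TF-IDF features feeding a Naive Bayes classifier, using scikit-learn; the training texts and labels are invented stand-ins for the app's Indonesian-language data:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented question/label pairs standing in for the app's training data
texts = ["what is a normal blood sugar level",
         "how much insulin should be injected",
         "early symptoms of type 2 diabetes"]
labels = ["monitoring", "treatment", "symptoms"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["insulin dosage after meals"]))
```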
APA, Harvard, Vancouver, ISO, and other styles
9

Suhartono, Didit, and Khodirun Khodirun. "System of Information Feedback on Archive Using Term Frequency-Inverse Document Frequency and Vector Space Model Methods." IJIIS: International Journal of Informatics and Information Systems 3, no. 1 (2020): 36–42. http://dx.doi.org/10.47738/ijiis.v3i1.6.

Full text
Abstract:
Archives are an example of important documents. They are stored systematically in order to simplify their storage and retrieval. Information retrieval is the process of retrieving relevant documents while avoiding the retrieval of irrelevant ones, and a method is needed to retrieve the relevant documents. Using the Term Frequency-Inverse Document Frequency and Vector Space Model methods, relevant documents can be found according to their level of closeness or similarity; in addition, applying the Nazief-Adriani stemming algorithm can improve retrieval performance by transforming words in a document or text into their base forms. The system then indexes the documents to simplify and speed up the search process. Relevance is determined by calculating similarity values between the existing documents and the query, each represented in a suitable form, and the retrieved documents are sorted by their level of relevance to the query.
APA, Harvard, Vancouver, ISO, and other styles

Dissertations / Theses on the topic "Term Frequency-Inverse Document Frequency Vectors"

1

Sullivan, Daniel Edward. "Evaluation of Word and Paragraph Embeddings and Analogical Reasoning as an Alternative to Term Frequency-Inverse Document Frequency-based Classification in Support of Biocuration." Diss., Virginia Tech, 2016. http://hdl.handle.net/10919/80572.

Full text
Abstract:
This research addresses the problem: can unsupervised learning generate a representation that improves on the commonly used term frequency-inverse document frequency (TF-IDF) representation by capturing semantic relations? The analysis measures the quality of sentence classification using TF-IDF representations, and finds a practical upper limit to precision and recall in a biomedical text classification task (F1-score of 0.85). Arguably, one could use ontologies to supplement TF-IDF, but ontologies are sparse in coverage and costly to create. This prompts a correlated question: can unsupervised learning capture semantic relations at least as well as existing ontologies, and thus supplement existing sparse ontologies? A shallow neural network implementing the Skip-Gram algorithm is used to generate semantic vectors from a corpus of approximately 2.4 billion words. The ability to capture meaning is assessed by comparing the semantic vectors with MeSH. Results indicate that semantic vectors trained by unsupervised methods capture comparable levels of semantic features in some cases, such as amino acid (92% of similarity represented in MeSH), but perform substantially more poorly on more expansive topics, such as pathogenic bacteria (37.8% of similarity represented in MeSH). Possible explanations for this difference in performance are proposed, along with a method to combine manually curated ontologies with semantic vector spaces to produce a more comprehensive representation than either alone. Semantic vectors are also used as representations for paragraphs, which, when used for classification, achieve an F1-score of 0.92. The results of the classification and analogical reasoning tasks are promising, but a formal model of semantic vectors, subject to the constraints of known linguistic phenomena, is needed. This research includes initial steps toward developing such a formal model based on a combination of linear algebra and fuzzy set theory, subject to the semantic molecularism linguistic model. The research is novel in its analysis of semantic vectors applied to the biomedical domain, its analysis of performance characteristics in biomedical analogical reasoning tasks, its comparison of semantic relations captured by vectors and by MeSH, and its initial development of a formal model of semantic vectors. Ph.D.
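The Skip-Gram training step described here is commonly run with gensim; the sketch below is an assumption-laden miniature (tiny invented corpus, near-default hyperparameters), not the thesis's 2.4-billion-word setup:

```python
from gensim.models import Word2Vec

# Tokenized sentences standing in for the ~2.4-billion-word corpus
sentences = [["the", "protein", "binds", "the", "receptor"],
             ["pathogenic", "bacteria", "cause", "infection"],
             ["the", "amino", "acid", "sequence", "folds"]]

# sg=1 selects the Skip-Gram algorithm used in the thesis
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

vector = model.wv["protein"]              # dense semantic vector for a term
print(model.wv.most_similar("protein"))   # nearest neighbours in the space
```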
APA, Harvard, Vancouver, ISO, and other styles
2

Regard, Viktor. "Studying the effectiveness of dynamic analysis for fingerprinting Android malware behavior." Thesis, Linköpings universitet, Databas och informationsteknik, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-163090.

Full text
Abstract:
Android is the second most targeted operating system for malware authors, and to counter the development of Android malware, more knowledge about its behavior is needed. There are mainly two approaches to analyzing Android malware, namely static and dynamic analysis. In 2017, a study and well-labeled dataset named AMD (Android Malware Dataset), consisting of over 24,000 malware samples, was released. It is divided into 135 varieties based on similar malicious behavior, retrieved through static analysis of the file classes.dex in the APK of each malware sample, with the labeled features determined by manual inspection of three samples in each variety. However, static analysis is known to be weak against obfuscation techniques, such as repackaging or dynamic loading, which can be exploited to avoid the analysis. In this study the second approach is utilized, and all malware in the dataset are analyzed at run-time in order to monitor their dynamic behavior. Analyzing malware at run-time has known weaknesses as well, as it can be avoided through, for instance, anti-emulator techniques. The study therefore aimed to explore the available sandbox environments for dynamic analysis, study the effectiveness of fingerprinting Android malware using one of the tools, and investigate whether static features from AMD and the dynamic analysis correlate, for instance by attempting to classify the samples based on similar dynamic features and calculating the Pearson Correlation Coefficient (r) for all combinations of features from AMD and the dynamic analysis. The comparison of tools for dynamic analysis showed a need for development, as the most popular tools were released long ago and share a lack of continuous maintenance. As a result, the sandbox environment chosen for this study was Droidbox, because of aspects like ease of use and installation and easy adaptability for large-scale analysis. Based on the dynamic features extracted with Droidbox, it could be shown that Android malware are most similar to the varieties to which they belong. The best of the four investigated metrics for classifying samples into varieties turned out to be Cosine Similarity, which achieved an accuracy of 83.6% for the entire dataset. The high accuracy indicated a correlation between the dynamic features and the static features on which the varieties are based. Furthermore, the Pearson Correlation Coefficient confirmed that the manually extracted features used to describe the varieties and the dynamic features are correlated to some extent, which could be partially confirmed by a manual inspection at the end of the study.
APA, Harvard, Vancouver, ISO, and other styles
3

Fan, Fang-Syuan (范芳瑄). "Classified Term Frequency-Inverse Document Frequency technique applied to school regulations." Thesis, 2019. http://ndltd.ncl.edu.tw/handle/6hb936.

Full text
Abstract:
Master's thesis, National Central University, Department of Computer Science and Information Engineering (in-service program), academic year 107 (2019). This study combines the Term Frequency-Inverse Document Frequency technique with a compatibility measure and applies it to the "Regulations of National Central University and Extensions of Off-campus Regulations," establishing them on a cloud platform for classification. The Term Frequency-Inverse Document Frequency technique can only present one type of measurement and quantification and cannot by itself offer a diverse selection. Therefore, by combining compatibility with Cosine Similarity, Hierarchical Clustering, and other techniques, a regulation can produce different results at different compatibility levels. A wide range of selections can be produced through classification, helping users find the related regulations they need. Keywords: text mining, TF-IDF, Cosine Similarity, Hierarchical Clustering.
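A plausible reconstruction of the pipeline, TF-IDF vectors, cosine distance, and hierarchical clustering cut at different thresholds to mimic varying compatibility levels, is sketched below with invented regulation texts:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.cluster.hierarchy import linkage, fcluster

regulations = [  # invented stand-ins for regulation texts
    "students must register for courses each semester",
    "course registration deadlines for students",
    "campus parking permits and fees",
    "fees for parking on campus grounds",
]

X = TfidfVectorizer().fit_transform(regulations).toarray()

# Average-linkage hierarchical clustering over cosine distances
Z = linkage(X, method="average", metric="cosine")

# Cutting the dendrogram at different thresholds yields different groupings,
# echoing the thesis's idea of varying the compatibility level
print(fcluster(Z, t=0.7, criterion="distance"))
```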
APA, Harvard, Vancouver, ISO, and other styles
4

Lin, Jun-liang (林俊良). "A New Auto Document Category System by Using Google N-gram and Probability based Term Frequency and Inverse Category Frequency." Thesis, 2012. http://ndltd.ncl.edu.tw/handle/20409545542311421955.

Full text
Abstract:
Master's thesis, National Kaohsiung First University of Science and Technology, Graduate Institute of Information Management, academic year 100 (2012). The number of electronic documents exchanged between companies and organizations is growing fast, making automatic classification an important issue for information services and knowledge management. Keywords are the smallest units that represent a document, so almost every document automation process, such as knowledge mining, automatic filtering, automatic summarization, event tracking, or concept retrieval, must first extract keywords from documents and then proceed with analytical processing. We propose the N-gram Segmentation Algorithm (NSA) in this study to improve on static keyword extraction. The NSA method combines stopword removal, stemming, N-gram selection, and the Google N-gram Corpus, and extracts meaningful N-gram keywords. In addition to keyword extraction, this research also proposes a new keyword weighting method that uses the Google N-gram frequency as a weight for term frequency, which enhances the weighting mechanism for keyword extraction within a particular group. Probability-based Term Frequency and Inverse Category Frequency (PTFICF) is used to weight the keywords in documents, and finally an SVM classifies the test documents. This study set up three experiments: Experiment 1 used Classic4 as a balanced dataset, with an F1 value of 96.4%. Experiment 2 used Reuters-21578 as an imbalanced dataset, with an F1 value of 78.7%. Experiment 3 used the Google frequency as a weighting method, and the result demonstrated that the higher the Google frequency, the more accurate the classification. Overall, the proposed methods are more accurate than traditional methods and also reduce training time by 90%.
APA, Harvard, Vancouver, ISO, and other styles
5

Costa, Joao Mario Goncalves da. "Classificação automática de páginas web usando features visuais." Master's thesis, 2014. http://hdl.handle.net/10316/40401.

Full text
Abstract:
Master's dissertation (Integrated Master's in Electrical and Computer Engineering) presented to the Faculty of Sciences and Technology of the University of Coimbra. The world of the Internet grows every day: a large number of web pages are active at this moment and more are released daily, so it is impossible to classify web pages manually. Several approaches have already been developed in this area, most of which use only the text information contained in the web pages and ignore their visual content. This work shows that visual content can improve the accuracy of classifications that use only text. Text features of the web pages were extracted using the term frequency-inverse document frequency method, along with two different types of visual features: low-level features and local SIFT features. Since the number of SIFT features is extremely high, a dictionary was created using the "Bag-of-Words" method. After extraction, the features were merged using all possible combinations of the three types. The Chi-Square method, which selects the best features of a vector, was also used. Four different classifiers were used in the classification. A multi-label classification was implemented, in which the classifiers were given unknown web pages and predicted the main topic of each page. A binary classification was also implemented, using only visual features to verify whether a web page is a blog. Good results were obtained, showing that adding visual content to the text improves accuracy. The best classification was obtained using only four different categories, achieving 98% accuracy. A web application was later developed in which a user can find the main topic of a web page simply by inserting its URL; it can be accessed at "http://scrat.isr.uc.pt/uniprojection/wpc.html".
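One step of this pipeline, merging TF-IDF text features with visual feature vectors and selecting the best ones with chi-square, can be sketched as follows; the visual features are random non-negative placeholders rather than real low-level or SIFT descriptors:

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

pages = ["sports news and match results", "a blog about daily life",
         "reviews of new gadgets and phones"]
labels = [0, 1, 2]

text_features = TfidfVectorizer().fit_transform(pages)
# Random non-negative placeholders for low-level / SIFT bag-of-words features
visual_features = np.random.rand(len(pages), 16)

# Merge text and visual features, then keep the k best by chi-squared score
merged = hstack([text_features, visual_features])
selected = SelectKBest(chi2, k=10).fit_transform(merged, labels)
print(selected.shape)
```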
APA, Harvard, Vancouver, ISO, and other styles

Books on the topic "Term Frequency-Inverse Document Frequency Vectors"

1

Shi, Feng. Learn About Term Frequency–Inverse Document Frequency in Text Analysis in R With Data From How ISIS Uses Twitter Dataset (2016). SAGE Publications, Ltd., 2019. http://dx.doi.org/10.4135/9781526489012.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

Shi, Feng. Learn About Term Frequency–Inverse Document Frequency in Text Analysis in Python With Data From How ISIS Uses Twitter Dataset (2016). SAGE Publications, Ltd., 2019. http://dx.doi.org/10.4135/9781526498038.

Full text
APA, Harvard, Vancouver, ISO, and other styles

Book chapters on the topic "Term Frequency-Inverse Document Frequency Vectors"

1

Kumar, Mukesh, and Renu Vig. "Term-Frequency Inverse-Document Frequency Definition Semantic (TIDS) Based Focused Web Crawler." In Communications in Computer and Information Science. Springer Berlin Heidelberg, 2012. http://dx.doi.org/10.1007/978-3-642-29216-3_5.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

Quan, Do Viet, and Phan Duy Hung. "Application of Customized Term Frequency-Inverse Document Frequency for Vietnamese Document Classification in Place of Lemmatization." In Advances in Intelligent Systems and Computing. Springer International Publishing, 2021. http://dx.doi.org/10.1007/978-3-030-68154-8_37.

Full text
APA, Harvard, Vancouver, ISO, and other styles
3

Rajagukguk, Novita, I. Putu Eka Nila Kencana, and I. G. N. Lanang Wijaya Kusuma. "Application of Term Frequency - Inverse Document Frequency in The Naive Bayes Algorithm For ChatGPT User Sentiment Analysis." In Advances in Computer Science Research. Atlantis Press International BV, 2024. http://dx.doi.org/10.2991/978-94-6463-413-6_4.

Full text
APA, Harvard, Vancouver, ISO, and other styles
4

Muppala, Gurusai, and T. Devi. "Comparison of accuracy for rapid automatic keyword extraction algorithm with term frequency inverse document frequency to recast giant text into charts." In Applications of Mathematics in Science and Technology. CRC Press, 2025. https://doi.org/10.1201/9781003606659-100.

Full text
APA, Harvard, Vancouver, ISO, and other styles
5

Nagaraj, Nagendra, and Chandra J. "Sentence Classification using Machine Learning with Term Frequency–Inverse Document Frequency with N-Gram." In New Frontiers in Communication and Intelligent Systems. Soft Computing Research Society, 2021. http://dx.doi.org/10.52458/978-81-95502-00-4-35.

Full text
Abstract:
Automatic text classification has proven to be a vital method for managing and processing the ever-spreading, daily-growing volume of digital text. In general, text classification plays an important role in extracting and summarizing information, searching text, and answering questions. This paper demonstrates how machine learning techniques are used for the text classification process. With the rapid growth of text analysis in all areas, the demand for automatic text classification has risen steadily, and text classification has been the subject of much recent research and development work in natural language processing. This paper presents a text classification technique using term frequency-inverse document frequency and N-grams, and compares the performance of different models. The recommended approach is evaluated with four different algorithms and the generated results are compared. The linear support vector machine is the most relevant to this work with our proposed model, and the final result shows significant accuracy compared with earlier methods.
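A minimal sketch of the strongest configuration named here, TF-IDF with N-grams feeding a linear support vector machine, using scikit-learn and invented example sentences:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

sentences = ["the service was excellent", "terrible battery life",
             "average performance overall", "the screen is gorgeous"]
labels = ["positive", "negative", "neutral", "positive"]

# ngram_range=(1, 2) adds bigram features on top of unigram TF-IDF
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(sentences, labels)
print(model.predict(["battery life is excellent"]))
```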
APA, Harvard, Vancouver, ISO, and other styles
6

Tardelli, Adalberto O., Meide S. Anção, Abel L. Packer, and Daniel Sigulem. "An implementation of the Trigram Phrase Matching method for text similarity problems." In Studies in Health Technology and Informatics. IOS Press, 2004. https://doi.org/10.3233/978-1-60750-946-2-43.

Full text
Abstract:
The representation of texts by term vectors, with element values calculated by a TF-IDF method, yields significant results in text similarity problems, such as finding related documents in bibliographic or full-text databases, identifying MeSH concepts from medical texts by a lexical approach, harmonizing journal citations in ISI/SciELO references, and normalizing author affiliations in MEDLINE. Our work considered "trigrams" as the terms (elements) of a term vector representing a text, following the Trigram Phrase Matching method published by the NLM's Indexing Initiative and its logarithmic Term Frequency-Inverse Document Frequency method for term weighting. Trigrams are overlapping 3-char strings extracted from a text by a couple of rules, and a trigram matching method may improve the probability of identifying synonym phrases or similar texts. The matching process was implemented as a simple algorithm and requires a certain amount of computer resources, so an efficiency-focused C implementation was adopted. In addition, some heuristic rules improved the efficiency of the method and made a regular "find your scientific production in the SciELO collection" information service feasible. We describe an implementation of the Trigram Matching method, the software tool we developed, and a set of experimental parameters for the above results.
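The trigram representation is easy to sketch: overlapping 3-character strings weighted with a logarithmic TF-IDF. The extraction rule below is a simplification of the NLM method (its padding and cleanup rules are more elaborate), and the citation strings are invented examples:

```python
import math
from collections import Counter

def trigrams(text):
    """Overlapping 3-char strings; the NLM rules add padding and cleanup steps."""
    t = " ".join(text.lower().split())
    return [t[i:i + 3] for i in range(len(t) - 2)]

# Invented citation strings of the kind the service harmonizes
corpus = ["Rev Saude Publica", "Revista de Saude Publica", "Sao Paulo Med J"]
counts = [Counter(trigrams(d)) for d in corpus]

# Logarithmic TF-IDF weighting over trigrams instead of whole words
n_docs = len(corpus)
df = Counter(g for doc in counts for g in doc)
vectors = [{g: (1 + math.log(tf)) * math.log(n_docs / df[g])
            for g, tf in doc.items()} for doc in counts]
print(sorted(vectors[0].items(), key=lambda kv: -kv[1])[:5])
```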
APA, Harvard, Vancouver, ISO, and other styles
7

Anilkumar, Munikrishnappa, Manasa Nagabhushanam, and Mallieswari R. "Topic Modelling of India’s Digital Healthcare Research Trend." In Applied Intelligence and Computing. Soft Computing Research Society, 2024. http://dx.doi.org/10.56155/978-81-955020-9-7-18.

Full text
Abstract:
New advancements in information technology have had a tremendous impact on digital healthcare applications in the medical area. The study themes connected to digital healthcare technology and its interventions must be discovered and studied systematically. As a research gap, digital healthcare research in India has yet to be investigated thematically using topic modeling; in this context, the study employs topic modeling's Non-Negative Matrix Factorization algorithm to systematically generate digital health research themes in India. After preprocessing, the raw texts were transformed into Term Frequency-Inverse Document Frequency vectors. The Non-Negative Matrix Factorization approach from topic modeling was used for text classification, and the k parameter was used for feature selection, yielding a set number of topics for semantic interpretation. Analysis of the research articles revealed considerable growth in digital healthcare research in India since 2017; the majority of publications occurred in 2020 and 2021, with fewer prior to 2017. Topic modeling of 97 published articles yielded the top three research themes: evaluation, public policy, and communities. The findings provide a thematic understanding of digital healthcare research in India and will aid future studies through text analysis, topic modeling, and decision-making in digital healthcare treatments.
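The core steps, TF-IDF vectorization followed by Non-Negative Matrix Factorization with k topics, look roughly like this in scikit-learn; the four mini-abstracts and k=3 are placeholders for the study's 97 articles and its chosen k:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

abstracts = ["telemedicine policy evaluation in rural communities",
             "mobile health apps for community outreach programs",
             "evaluation frameworks for digital health interventions",
             "public policy for healthcare data privacy in india"]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(abstracts)

# k (n_components) fixes the number of topics extracted for interpretation
nmf = NMF(n_components=3, random_state=0)
doc_topics = nmf.fit_transform(X)           # document-topic weights

terms = tfidf.get_feature_names_out()
for i, comp in enumerate(nmf.components_):  # topic-term weights
    top = comp.argsort()[-3:][::-1]
    print(f"topic {i}:", [terms[j] for j in top])
```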
APA, Harvard, Vancouver, ISO, and other styles
8

Souza de Oliveira, Raphael, and Erick Giovani Sperandio Nascimento. "Clustering by Similarity of Brazilian Legal Documents Using Natural Language Processing Approaches." In Artificial Intelligence. IntechOpen, 2021. http://dx.doi.org/10.5772/intechopen.99875.

Full text
Abstract:
The Brazilian legal system postulates the expeditious resolution of judicial proceedings. However, legal courts are working under budgetary constraints and with reduced staff. As a way to face these restrictions, artificial intelligence (AI) has been tackling many complex problems in natural language processing (NLP). This work aims to detect the degree of similarity between judicial documents that can be achieved in the inference group using unsupervised learning, by applying three NLP techniques, namely term frequency-inverse document frequency (TF-IDF), Word2Vec CBoW, and Word2Vec Skip-gram, the last two specialized with a Brazilian-language corpus. We developed a template for grouping lawsuits, calculated from the cosine distance between the elements of the group and its centroid. The Ordinary Appeal was chosen as the reference filing, since it moves legal proceedings on to the higher court and because a relevant contingent of lawsuits awaits judgment. After the data-processing steps, the documents had their content transformed into a vector representation using the three NLP techniques. We note that specialized word-embedding models, like Word2Vec, present better performance, making it possible to advance the current state of the art in NLP applied to the legal sector.
APA, Harvard, Vancouver, ISO, and other styles
9

Hadi, Setiawan, and Paquita Putri Ramadhani. "Text Classification on the Instagram Caption Using Support Vector Machine." In Artificial Intelligence. IntechOpen, 2022. http://dx.doi.org/10.5772/intechopen.99684.

Full text
Abstract:
Instagram is one of the world's top ten most popular social networks. It is the most popular social networking platform in the United States, India, and Brazil, with over 1 billion monthly active users; each of these countries has more than 91 million Instagram users. The number of Instagram users reflects the various reasons and goals for using the platform, and social media marketing, which benefits from a place to market products, is one of them. By using text classification to categorize Instagram captions into organized groups, namely fashion, food & beverage, technology, health & beauty, and lifestyle & travel, this paper is expected to help people know the current trends on Instagram. The Support Vector Machine algorithm in this research is applied to 66,171 post captions to classify trends on Instagram. The TF-IDF (Term Frequency times Inverse Document Frequency) method and percentage variations were used for data separation in this study. The results indicate that using SVM with 70% of the dataset for training and 30% for testing produces a higher level of accuracy than the other splits.
APA, Harvard, Vancouver, ISO, and other styles
10

"Term Frequency by Inverse Document Frequency." In Encyclopedia of Database Systems. Springer US, 2009. http://dx.doi.org/10.1007/978-0-387-39940-9_3784.

Full text
APA, Harvard, Vancouver, ISO, and other styles

Conference papers on the topic "Term Frequency-Inverse Document Frequency Vectors"

1

Liu, Fengjuan. "Japanese Dependency Analysis using Multi-Kernel Support Vector Machine based on Term Frequency and Inverse Document Frequency." In 2024 International Conference on Integrated Intelligence and Communication Systems (ICIICS). IEEE, 2024. https://doi.org/10.1109/iciics63763.2024.10860205.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

Islavath, Srinivas, and C. Rohith Bhat. "Uniform Resource Locator Phishing in Real Time Scenario Predicted Using Novel Term Frequency-Inverse Document Frequency +N Gram in Comparison with Support Vector Machine Algorithm." In 2024 15th International Conference on Computing Communication and Networking Technologies (ICCCNT). IEEE, 2024. http://dx.doi.org/10.1109/icccnt61001.2024.10725919.

Full text
APA, Harvard, Vancouver, ISO, and other styles
3

Lubis, Hasby Sahendri, Mahyuddin K. M. Nasution, and Amalia Amalia. "Performance of Term Frequency - Inverse Document Frequency and K-Means in Government Service Identification." In 2024 4th International Conference of Science and Information Technology in Smart Administration (ICSINTESA). IEEE, 2024. http://dx.doi.org/10.1109/icsintesa62455.2024.10748106.

Full text
APA, Harvard, Vancouver, ISO, and other styles
4

Wen, Xiaojiao. "Student Grade Prediction and Classification based on Term Frequency-Inverse Document Frequency with Random Forest." In 2024 First International Conference on Software, Systems and Information Technology (SSITCON). IEEE, 2024. https://doi.org/10.1109/ssitcon62437.2024.10796287.

Full text
APA, Harvard, Vancouver, ISO, and other styles
5

Bharattej R, Rana Veer Samara Sihman, Prashanth V, Haideer Alabdeli, Sunaina Sangeet Thottan, and S. Ananthi. "Modified Term Frequency and Inverse Document Frequency with Optimized Deep Learning Algorithm based Fake News Detection." In 2025 International Conference on Intelligent Systems and Computational Networks (ICISCN). IEEE, 2025. https://doi.org/10.1109/iciscn64258.2025.10934578.

Full text
APA, Harvard, Vancouver, ISO, and other styles
6

Zen, Bita Parga, Irwan Susanto, Khofifah Putriyani, and Sintiya. "Automatic document classification for tempo news articles about covid 19 based on term frequency, inverse document frequency (TF-IDF), and Vector Space Model (VSM)." In THE 8TH INTERNATIONAL CONFERENCE ON TECHNOLOGY AND VOCATIONAL TEACHERS 2022. AIP Publishing, 2024. http://dx.doi.org/10.1063/5.0212036.

Full text
APA, Harvard, Vancouver, ISO, and other styles
7

Patel, Ankitkumar, and Kevin Meehan. "Fake News Detection on Reddit Utilising CountVectorizer and Term Frequency-Inverse Document Frequency with Logistic Regression, MultinominalNB and Support Vector Machine." In 2021 32nd Irish Signals and Systems Conference (ISSC). IEEE, 2021. http://dx.doi.org/10.1109/issc52156.2021.9467842.

Full text
APA, Harvard, Vancouver, ISO, and other styles
8

Schofield, Matthew, Gulsum Alicioglu, Russell Binaco, et al. "Convolutional Neural Network for Malware Classification Based on API Call Sequence." In 8th International Conference on Artificial Intelligence and Applications (AIAP 2021). AIRCC Publishing Corporation, 2021. http://dx.doi.org/10.5121/csit.2021.110106.

Full text
Abstract:
Malicious software is constantly being developed and improved, so the detection and classification of malicious applications is an ever-evolving problem. Since traditional malware detection techniques fail to detect new or unknown malware, machine learning algorithms have been used to overcome this disadvantage. We present a Convolutional Neural Network (CNN) for malware type classification based on Windows system API (Application Program Interface) calls. This research uses a database of 5385 instances of API call streams, each labeled with one of eight malware types of the source malicious application. We use a 1-dimensional CNN, mapping API call streams to categorical and term frequency-inverse document frequency (TF-IDF) vectors respectively. We achieved accuracy scores of 98.17% using the TF-IDF vectors and 95.40% using the categorical vectors. The proposed 1-D CNN outperformed other traditional classification techniques, which had an overall accuracy score of 91.0%.
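A skeletal version of a 1-D CNN over fixed-length TF-IDF vectors, written in PyTorch as an assumption (the paper does not specify its framework or layer sizes):

```python
import torch
import torch.nn as nn

# Each API-call stream becomes a fixed-length TF-IDF vector (sizes assumed)
n_features, n_classes = 512, 8
x = torch.rand(32, 1, n_features)  # a batch of 32 vectors, one input channel

model = nn.Sequential(
    nn.Conv1d(1, 16, kernel_size=5),  # slide filters along the TF-IDF vector
    nn.ReLU(),
    nn.MaxPool1d(2),
    nn.Flatten(),
    nn.Linear(16 * ((n_features - 4) // 2), n_classes),  # 8 malware types
)
print(model(x).shape)  # torch.Size([32, 8])
```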
APA, Harvard, Vancouver, ISO, and other styles
9

Putri Ratna, Anak Agung, Aaliyah Kaltsum, Lea Santiar, Hanifah Khairunissa, Ihsan Ibrahim, and Prima Dewi Purnamasari. "Term Frequency-Inverse Document Frequency Answer Categorization with Support Vector Machine on Automatic Short Essay Grading System with Latent Semantic Analysis for Japanese Language." In 2019 International Conference on Electrical Engineering and Computer Science (ICECOS). IEEE, 2019. http://dx.doi.org/10.1109/icecos47637.2019.8984530.

Full text
APA, Harvard, Vancouver, ISO, and other styles
10

Carvalho, Hevelyn Sthefany Lima de, and Vinícius R. P. Borges. "A Comparative Study of Text Document Representation Approaches Using Point Placement-based Visualizations." In Anais Estendidos da Conference on Graphics, Patterns and Images. Sociedade Brasileira de Computação - SBC, 2021. http://dx.doi.org/10.5753/sibgrapi.est.2021.20035.

Full text
Abstract:
In natural language processing, text representation plays an important role and can affect the performance of language models and machine learning algorithms. Basic vector space models, such as term frequency-inverse document frequency, became popular approaches to represent text documents. In recent years, approaches based on word embeddings have been proposed to preserve the meaning and semantic relations of words, phrases, and texts. In this paper, we focus on studying how different text representations influence the quality of the 2D visual spaces (layouts) generated by state-of-the-art visualizations based on point placement. For that purpose, a visualization-assisted approach is proposed to support users when exploring such representations in classification tasks. Experiments using two public labeled corpora were conducted to assess the quality of the layouts and to discuss possible relations to classification performance. The results are promising, indicating that the proposed approach can guide users to understand the relevant patterns of a corpus in each representation.
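A minimal sketch of the kind of point-placement layout discussed here: project TF-IDF document vectors onto a 2-D plane with t-SNE. The corpus and perplexity are toy-scale assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE

docs = ["sports match results", "football season highlights",
        "election campaign news", "parliament vote coverage"]

X = TfidfVectorizer().fit_transform(docs)

# Point placement: project each TF-IDF document vector onto a 2-D layout
layout = TSNE(n_components=2, perplexity=2.0, init="random").fit_transform(X.toarray())
print(layout)
```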
APA, Harvard, Vancouver, ISO, and other styles