Journal articles on the topic "Text document classification"

Author: Grafiati

Published: 19 June 2022

Updated: 22 June 2025

Consult the top 50 journal articles for research on the topic "Text document classification".

1

Y Baravkar, B. "Automated Text Document Classification Using Predictive Network." International Journal of Scientific Engineering and Research 12, no. 1 (2024): 17–19. https://doi.org/10.70729/se24120142420.

2

Mr. D Krishna, Erukulla Laasya, A Sowmya Sri, T Ravinder Reddy, and Akhil Sanjoy. "BIOMEDICAL TEXT DOCUMENT CLASSIFICATION." International Journal of Engineering Technology and Management Sciences 7, no. 3 (2023): 788–92. http://dx.doi.org/10.46647/ijetms.2023.v07i03.121.

Abstract:

Information extraction, retrieval, and text categorization are only a few of the significant research fields covered by biomedical text classification. This study examines many text categorization techniques used in practice, along with their strengths and weaknesses, in order to improve knowledge of the information extraction opportunities in data mining. We compiled a dataset focused on three categories: "Thyroid Cancer," "Lung Cancer," and "Colon Cancer." This paper presents an empirical study of a classifier. The investigation was carried out using biomedical literature benchmarks. Several metaheuristic algorithms are investigated, including genetic algorithms, particle swarm optimisation, firefly, cuckoo, and bat algorithms. In addition, the proposed multiple classifier system outperforms ensemble learning, ensemble pruning, and traditional classification methods. Based on the data, we predict whether a document concerns Thyroid Cancer, Lung Cancer, or Colon Cancer using basic exploratory data analysis (EDA), text preprocessing, and several models such as Logistic Regression, Decision Tree Classification, and Random Forest Classification.

3

Mukherjee, Indrajit, Prabhat Kumar Mahanti, Vandana Bhattacharya, and Samudra Banerjee. "Text classification using document-document semantic similarity." International Journal of Web Science 2, no. 1/2 (2013): 1. http://dx.doi.org/10.1504/ijws.2013.056572.

4

Wu, Tiandeng, Qijiong Liu, Yi Cao, Yao Huang, Xiao-Ming Wu, and Jiandong Ding. "Continual Graph Convolutional Network for Text Classification." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 11 (2023): 13754–62. http://dx.doi.org/10.1609/aaai.v37i11.26611.

Abstract:

Graph convolutional networks (GCNs) have been successfully applied to capture global non-consecutive and long-distance semantic information for text classification. However, while GCN-based methods have shown promising results in offline evaluations, they commonly follow a seen-token-seen-document paradigm by constructing a fixed document-token graph and cannot make inferences on new documents. It is a challenge to deploy them in online systems to infer streaming text data. In this work, we present a continual GCN model (ContGCN) to generalize inferences from observed documents to unobserved documents. Concretely, we propose a new all-token-any-document paradigm to dynamically update the document-token graph in every batch during both the training and testing phases of an online system. Moreover, we design an occurrence memory module and a self-supervised contrastive learning objective to update ContGCN in a label-free manner. A 3-month A/B test on the Huawei public opinion analysis system shows ContGCN achieves an 8.86% performance gain compared with state-of-the-art methods. Offline experiments on five public datasets also show ContGCN can improve inference quality. The source code will be released at https://github.com/Jyonn/ContGCN.

5

Yao, Liang, Chengsheng Mao, and Yuan Luo. "Graph Convolutional Networks for Text Classification." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 7370–77. http://dx.doi.org/10.1609/aaai.v33i01.33017370.

Abstract:

Text classification is an important and classical problem in natural language processing. There have been a number of studies that applied convolutional neural networks (convolution on regular grid, e.g., sequence) to classification. However, only a limited number of studies have explored the more flexible graph convolutional neural networks (convolution on non-grid, e.g., arbitrary graph) for the task. In this work, we propose to use graph convolutional networks for text classification. We build a single text graph for a corpus based on word co-occurrence and document-word relations, then learn a Text Graph Convolutional Network (Text GCN) for the corpus. Our Text GCN is initialized with one-hot representations for words and documents; it then jointly learns the embeddings for both words and documents, as supervised by the known class labels for documents. Our experimental results on multiple benchmark datasets demonstrate that a vanilla Text GCN without any external word embeddings or knowledge outperforms state-of-the-art methods for text classification. On the other hand, Text GCN also learns predictive word and document embeddings. In addition, experimental results show that the improvement of Text GCN over state-of-the-art comparison methods becomes more prominent as we lower the percentage of training data, suggesting the robustness of Text GCN to less training data in text classification.
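
As a rough illustration of the single corpus-level graph this abstract describes, the sketch below builds word-word edges from sliding-window co-occurrence and document-word edges from term weights. The abstract does not state the edge weighting; positive pointwise mutual information (PMI) for word pairs, TF-IDF for document-word pairs, the window size, and the helper name `build_text_graph` are all assumptions made here for illustration. Training the GCN itself is out of scope.

```python
import math
from collections import Counter
from itertools import combinations


def build_text_graph(docs, window=3):
    """Return {(node_a, node_b): weight} for a toy corpus-level text graph.

    Nodes are words (plain strings) and documents (("doc", index) tuples).
    The weighting choices (PMI, TF-IDF, window size) are assumptions.
    """
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)

    # Slide a fixed-size window over every document.
    windows = []
    for tokens in tokenized:
        if len(tokens) <= window:
            windows.append(tokens)
        else:
            windows.extend(tokens[i:i + window]
                           for i in range(len(tokens) - window + 1))

    # Count in how many windows each word and each word pair occurs.
    word_count, pair_count = Counter(), Counter()
    for w in windows:
        unique = sorted(set(w))
        word_count.update(unique)
        pair_count.update(combinations(unique, 2))

    edges = {}
    n_win = len(windows)
    # Word-word edges: keep only pairs with positive PMI.
    for (a, b), c in pair_count.items():
        pmi = math.log(c * n_win / (word_count[a] * word_count[b]))
        if pmi > 0:
            edges[(a, b)] = pmi

    # Document-word edges: TF-IDF of the word in that document.
    df = Counter(w for tokens in tokenized for w in set(tokens))
    for i, tokens in enumerate(tokenized):
        for w, f in Counter(tokens).items():
            idf = math.log(n_docs / df[w])
            if idf > 0:
                edges[(("doc", i), w)] = (f / len(tokens)) * idf
    return edges


docs = ["graph networks classify text", "text graph of words", "cats chase mice"]
graph = build_text_graph(docs)
```

In this toy corpus the pair ("cats", "chase") gets an edge because the two words only ever appear together, while ("graph", "text") is dropped because its PMI is negative.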

6

Mohammed, Ali Sura I., Marwah Nihad, Sharaf Hussien Mohamed, and Haitham Farouk. "Machine learning for text document classification-efficient classification approach." IAES International Journal of Artificial Intelligence (IJ-AI) 13, no. 1 (2024): 703–10. https://doi.org/10.11591/ijai.v13.i1.pp703-710.

Abstract:

Numerous alternative methods for text classification have been created because of the increase in the amount of online text information available. The cosine similarity classifier is the most extensively utilized simple and efficient approach. It improves text classification performance when combined with the estimated values provided by conventional classifiers such as Multinomial Naive Bayesian (MNB). Consequently, combining the similarity between a test document and a category with the estimated value for the category enhances the performance of the classifier. This approach provides a text document categorization method that is both efficient and effective. In addition, methods for determining the proper relationship between a set of words in a document and its document categorization are also obtained.
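
The idea of combining a category similarity with a classifier estimate can be sketched as follows. The abstract does not say how the two quantities are combined; multiplying the MNB posterior by the cosine similarity to a summed category term vector is an assumption, as are the helper names and the toy training data.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def train(labelled_docs):
    """Sum term counts per category and count documents per category."""
    term_counts, doc_counts = {}, Counter()
    for text, label in labelled_docs:
        term_counts.setdefault(label, Counter()).update(text.lower().split())
        doc_counts[label] += 1
    return term_counts, doc_counts


def mnb_posteriors(tokens, term_counts, doc_counts, alpha=1.0):
    """Multinomial Naive Bayes posteriors with additive smoothing."""
    vocab = {t for counts in term_counts.values() for t in counts}
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for label, counts in term_counts.items():
        denom = sum(counts.values()) + alpha * len(vocab)
        log_p = math.log(doc_counts[label] / total_docs)
        for t in tokens:
            if t in vocab:  # ignore terms never seen in training
                log_p += math.log((counts[t] + alpha) / denom)
        log_scores[label] = log_p
    # Normalise in a numerically stable way.
    top = max(log_scores.values())
    exp = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(exp.values())
    return {c: v / z for c, v in exp.items()}


def classify(text, term_counts, doc_counts):
    """Combine the MNB estimate with cosine similarity (assumed rule: product)."""
    tokens = text.lower().split()
    doc_vec = Counter(tokens)
    post = mnb_posteriors(tokens, term_counts, doc_counts)
    combined = {c: post[c] * cosine(doc_vec, term_counts[c]) for c in term_counts}
    return max(combined, key=combined.get), combined


train_data = [
    ("cheap pills buy now", "spam"),
    ("buy cheap watches", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project meeting notes", "ham"),
]
label, scores = classify("cheap watches now", *train(train_data))
```

A weighted sum of the two scores would be an equally plausible combination rule; the product is used here only because it needs no extra parameter.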

7

K, Dinesh Balaji. "SMART DOCUMENT COMPANION - TEXT DATA CLASSIFICATION IN DOCUMENTS USING AI." International Research Journal of Education and Technology 6, no. 11 (2024): 2041–46. https://doi.org/10.70127/irjedt.vol.7.issue03.2046.

8

Mohammed Ali, Sura I., Marwah Nihad, Hussien Mohamed Sharaf, and Haitham Farouk. "Machine learning for text document classification-efficient classification approach." IAES International Journal of Artificial Intelligence (IJ-AI) 13, no. 1 (2024): 703. http://dx.doi.org/10.11591/ijai.v13.i1.pp703-710.

Abstract:

Numerous alternative methods for text classification have been created because of the increase in the amount of online text information available. The cosine similarity classifier is the most extensively utilized simple and efficient approach. It improves text classification performance when combined with the estimated values provided by conventional classifiers such as Multinomial Naive Bayesian (MNB). Consequently, combining the similarity between a test document and a category with the estimated value for the category enhances the performance of the classifier. This approach provides a text document categorization method that is both efficient and effective. In addition, methods for determining the proper relationship between a set of words in a document and its document categorization are also obtained.

9

Cheng, Betty Yee Man, Jaime G. Carbonell, and Judith Klein-Seetharaman. "Protein classification based on text document classification techniques." Proteins: Structure, Function, and Bioinformatics 58, no. 4 (2005): 955–70. http://dx.doi.org/10.1002/prot.20373.

10

Anna, Fay E. Naïve, and B. Barbosa Jocelyn. "Efficient Accreditation Document Classification Using Naïve Bayes Classifier." Indian Journal of Science and Technology 15, no. 1 (2022): 9–18. https://doi.org/10.17485/IJST/v15i1.1761.

Abstract:

Objectives: To develop a desktop application that automatically classifies a document according to the area of accreditation documents it belongs to. Specifically, the study aims to: a) create a predictive model that addresses document classification tasks; b) design and develop an application that classifies documents according to their document class; c) evaluate the performance of the automatic document classification. Methods: We introduce an approach for the automatic classification of accreditation documents. Specifically, the approach includes scanned or captured documents in the classification task using Optical Character Recognition (OCR); uses TF-IDF (term frequency-inverse document frequency) with stop-word removal and n-grams of length 1-2 to preprocess the text documents; and uses the Naive Bayes algorithm with additive (Laplace/Lidstone) smoothing as the classifier. Results: Accuracy, precision, recall, and F1 score were computed to evaluate the approach. The results showed 82% accuracy, 84% precision, 82% recall, and an 82% F1 score. With OCR for text extraction, TF-IDF for text preprocessing, and a Naive Bayes classifier, the results indicate that the proposed approach is efficient. Conclusions: Input documents in any form, whether captured images, scanned documents, or simple text documents, were classified using OCR, TF-IDF, and a Naive Bayes classifier. The approach provides an efficient way to classify accreditation documents automatically and addresses limiting factors of previous work, i.e., opinion-based and time-consuming classification.

Keywords: Accreditation Document Classification; Document Classification Objective Evaluation; TF-IDF; Term frequency-inverse document frequency; Multinomial Naive Bayes; OCR; Optical Character Recognition
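
The preprocessing step named in this abstract (stop-word removal, n-grams of length 1-2, TF-IDF weighting) might look roughly like the sketch below. The OCR and Naive Bayes stages are left out. The stop-word list, the TF-IDF variant, the function names, and the sample document titles are assumptions; the abstract does not specify any of them.

```python
import math
from collections import Counter

# Tiny illustrative stop-word list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "and", "of", "the", "to", "for", "in", "is"}


def ngrams_1_2(text):
    """Lowercase, drop stop words, and return unigrams plus bigrams.

    Bigrams are formed after stop-word removal, so they can bridge
    over a removed word.
    """
    tokens = [t for t in text.lower().split() if t not in STOPWORDS]
    bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document.

    Uses tf = count / number of terms and idf = ln(N / df) + 1.
    Other TF-IDF variants exist; this choice is an assumption.
    """
    term_lists = [ngrams_1_2(d) for d in docs]
    n_docs = len(docs)
    df = Counter(t for terms in term_lists for t in set(terms))
    vectors = []
    for terms in term_lists:
        counts = Counter(terms)
        vectors.append({
            t: (c / len(terms)) * (math.log(n_docs / df[t]) + 1.0)
            for t, c in counts.items()
        })
    return vectors


docs = [
    "Minutes of the faculty meeting",
    "Faculty research output report",
    "Report of the library holdings",
]
vectors = tfidf_vectors(docs)
```

Terms that occur in fewer documents receive a higher weight, so in the first document "minutes" outweighs "faculty", which also appears in the second document.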

11

Kim, Jiyun, and Han-joon Kim. "Multidimensional Text Warehousing for Automated Text Classification." Journal of Information Technology Research 11, no. 2 (2018): 168–83. http://dx.doi.org/10.4018/jitr.2018040110.

Abstract:

This article describes how, in the era of big data, a data warehouse is an integrated multidimensional database that provides the basis for the decision making required to establish crucial business strategies. Efficient, effective analysis requires a data organization system that integrates and manages data of various dimensions. However, conventional data warehousing techniques do not consider the various data manipulation operations required for data-mining activities. With the current explosion of text data, much research has examined text (or document) repositories to support text mining and document retrieval. Therefore, this article presents a method of developing a text warehouse that provides a machine-learning-based text classification service. The document is represented as a term-by-concept matrix using a 3rd-order tensor-based textual representation model, which emphasizes the meaning of words occurring in the document. As a result, the proposed text warehouse makes it possible to develop a semantic Naïve Bayes text classifier only by executing appropriate SQL statements.

12

Muhaimin, Amri, Tresna Maulana Fahrudin, Trimono, Prismahardi Aji Riyantoko, and Kartika Maulida Hindrayani. "Metric Comparison For Text Classification." Internasional Journal of Data Science, Engineering, and Anaylitics 2, no. 1 (2022): 86–90. http://dx.doi.org/10.33005/ijdasea.v2i1.34.

Abstract:

Text classification has been popular in recent years. To classify text, the first step is to convert the text into numeric values. Values that can be used include term frequency, inverse document frequency, term frequency-inverse document frequency, and the raw frequency of the word itself. This study aims to determine which of these metrics works best for text classification. The methods used are Naïve Bayes, Logistic Regression, and Random Forest. The evaluation scores are accuracy and the area under the curve (AUC). Some metrics produce similar evaluation scores. Random Forest is the best of the three methods, and the best metric for text classification is term frequency-inverse document frequency.
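
The two evaluation scores named in this abstract can be computed as in the minimal sketch below. It assumes a binary task with labels 0 and 1 and uses made-up classifier scores; how the study handled its own label set is not stated in the abstract.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative.

    A tie between a positive and a negative score counts as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [1, 0, 1, 0]
scores = [0.9, 0.4, 0.35, 0.2]            # hypothetical classifier scores
y_pred = [int(s >= 0.5) for s in scores]  # threshold the scores at 0.5

acc = accuracy(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

Accuracy depends on the chosen threshold, whereas AUC depends only on how the scores rank positives against negatives, which is why the two can disagree about which text representation is best.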

13

P, Ashokkumar, Siva Shankar G, Gautam Srivastava, Praveen Kumar Reddy Maddikunta, and Thippa Reddy Gadekallu. "A Two-stage Text Feature Selection Algorithm for Improving Text Classification." ACM Transactions on Asian and Low-Resource Language Information Processing 20, no. 3 (2021): 1–19. http://dx.doi.org/10.1145/3425781.

Abstract:

As the number of digital text documents increases on a daily basis, the classification of text is becoming a challenging task. Each text document consists of a large number of words (or features) that drive down the efficiency of a classification algorithm. This article presents an optimized feature selection algorithm designed to reduce a large number of features to improve the accuracy of the text classification algorithm. The proposed algorithm uses noun-based filtering, a word ranking that enhances the performance of the text classification algorithm. Experiments are carried out on three benchmark datasets, and the results show that the proposed classification algorithm has achieved the maximum accuracy when compared to the existing algorithms. The proposed algorithm is compared to Term Frequency-Inverse Document Frequency, Balanced Accuracy Measure, GINI Index, Information Gain, and Chi-Square. The experimental results clearly show the strength of the proposed algorithm.

14

Zheng, Jianming, Yupu Guo, Chong Feng, and Honghui Chen. "A Hierarchical Neural-Network-Based Document Representation Approach for Text Classification." Mathematical Problems in Engineering 2018 (2018): 1–10. http://dx.doi.org/10.1155/2018/7987691.

Abstract:

Document representation is widely used in practical applications, for example, sentiment classification, text retrieval, and text classification. Previous work is mainly based on statistics and neural networks, which suffer from data sparsity and limited model interpretability, respectively. In this paper, we propose a general framework for document representation with a hierarchical architecture. In particular, we incorporate the hierarchical architecture into three traditional neural-network models for document representation, resulting in three hierarchical neural representation models for document classification, that is, TextHFT, TextHRNN, and TextHCNN. Our comprehensive experimental results on two public datasets, that is, Yelp 2016 and Amazon Reviews (Electronics), show that our proposals with hierarchical architecture outperform the corresponding neural-network models for document classification, resulting in a significant improvement ranging from 4.65% to 35.08% in terms of accuracy with a comparable (or substantially lower) time cost. In addition, we find that long documents benefit more from the hierarchical architecture than short ones, as the improvement in accuracy on long documents is greater than that on short documents.

15

Gaidhane, Kiran V., L. H. Patil, and C. U. Chouhan. "AN EFFICIENT APPROACH FOR TEXT MINING USING SIDE INFORMATION." International Journal of Engineering Sciences & Research Technology 5, no. 7 (2016): 1137–41. https://doi.org/10.5281/zenodo.58632.

Abstract:

Nowadays, in many text mining applications, information is present in the form of text documents. Text documents contain various types of information, such as side information or metadata. Document provenance, the title of the document, links in the document, user-access behavior from web logs, and other non-textual attributes contained in the text document are treated as side information. Such attributes contain a large amount of information for clustering purposes. It is difficult to estimate the importance of this side information when some of it is noisy. In such cases, to avoid a low-quality mining process, we need a principled way to perform text mining that maximizes the advantages of using this side information. Accordingly, this paper presents a solution that uses side information for clustering with a hierarchical algorithm, which is then extended to the classification problem on real data sets.

16

Lee, Kangwook, Sanggyu Han, and Sung-Hyon Myaeng. "A discourse-aware neural network-based text model for document-level text classification." Journal of Information Science 44, no. 6 (2017): 715–35. http://dx.doi.org/10.1177/0165551517743644.

Abstract:

Capturing semantics scattered across entire text is one of the important issues for Natural Language Processing (NLP) tasks. It would be particularly critical with long text embodying a flow of themes. This article proposes a new text modelling method that can handle thematic flows of text with Deep Neural Networks (DNNs) in such a way that discourse information and distributed representations of text are incorporated. Unlike previous DNN-based document models, the proposed model enables discourse-aware analysis of text and composition of sentence-level distributed representations guided by the discourse structure. More specifically, our method identifies Elementary Discourse Units (EDUs) and their discourse relations in a given document by applying Rhetorical Structure Theory (RST)-based discourse analysis. The result is fed into a tree-structured neural network that reflects the discourse information including the structure of the document and the discourse roles and relation types. We evaluate the document model for two document-level text classification tasks, sentiment analysis and sarcasm detection, with comparisons against the reference systems that also utilise discourse information. In addition, we conduct additional experiments to evaluate the impact of neural network types and adopted discourse factors on modelling documents vis-à-vis the two classification tasks. Furthermore, we investigate the effects of various learning methods and input units on the quality of the proposed discourse-aware document model.

17

Kumari, Lalitha, and Ch Satyanarayana. "An novel cluster based feature selection and document classification model on high dimension trec data." International Journal of Engineering & Technology 7, no. 1.1 (2017): 466. http://dx.doi.org/10.14419/ijet.v7i1.1.10146.

Abstract:

It is complex to analyze the features of TREC text documents and their relevant similar documents using traditional document similarity measures. As the size of the TREC repository increases, finding relevant clustered documents in a large collection of unstructured documents is a challenging task. Traditional document similarity and classification models are implemented on homogeneous TREC data to find essential features for document entities that are similar to the TREC documents. Also, most traditional models are applicable only to limited text document sets. The main issues with traditional text mining models on the TREC repository are: 1) each document is represented as a highly sparse vector; 2) they fail to find the semantic similarity of documents within and between clusters; 3) they have a high mean squared error rate. In this paper, a novel feature-selection-based clustering and classification model is proposed for a large number of different TREC repositories. Traditional latent semantic indexing and document clustering models fail to find topic relevance on large TREC clinical text document sets because of their computational memory and time requirements. The proposed document feature selection and cluster-based classification model is applied to TREC clinical benchmark datasets. The experimental results show that the proposed model is more efficient than existing models in terms of computational memory, accuracy, and error rate.

18

Rahamat, Basha S., Rani J. Keziya, and Yadav J. J. C. Prasad. "A Novel Summarization-based Approach for Feature Reduction Enhancing Text Classification Accuracy." Engineering, Technology & Applied Science Research 9, no. 6 (2019): 5001–5. https://doi.org/10.5281/zenodo.3566535.

Abstract:

Automatic summarization is the process of shortening one (in single document summarization) or multiple documents (in multi-document summarization). In this paper, a new feature selection method for the nearest neighbor classifier by summarizing the original training documents based on sentence importance measure is proposed. Our approach for single document summarization uses two measures for sentence similarity: the frequency of the terms in one sentence and the similarity of that sentence to other sentences. All sentences were ranked accordingly and the sentences with top ranks (with a threshold constraint) were selected for summarization. The summary of every document in the corpus is taken into a new document used for the summarization evaluation process.

19

Seifert, Christin, Eva Ulbrich, Roman Kern, and Michael Granitzer. "Text Representation for Efficient Document Annotation." JUCS - Journal of Universal Computer Science 19, no. 3 (2013): 383–405. https://doi.org/10.3217/jucs-019-03-0383.

Abstract:

In text classification the amount and quality of training data is crucial for the performance of the classifier. The generation of training data is done by human labellers, a tedious and time-consuming task. To reduce the labelling time for single documents we propose to use condensed representations of text documents instead of the full-text document. These condensed representations are key sentences and key phrases and can be generated in a fully unsupervised way. We extended and evaluated the TextRank algorithm to automatically extract key sentences and key phrases. For representing key phrases we propose a layout similar to a tag cloud. In a user study with 37 participants we evaluated whether document labelling with these condensed representations can be done faster and equally accurately by the human labellers. Our evaluation shows that the users labelled tag clouds twice as fast and as accurately as full-text documents. While further investigations for different classification tasks are necessary, this insight could potentially reduce costs for the labelling process of text documents.
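
For readers unfamiliar with TextRank, the sketch below ranks sentences with a weighted PageRank over a sentence-similarity graph. It shows the baseline algorithm only, not the extension evaluated in the paper; the overlap-based similarity, the damping factor of 0.85, the iteration count, and the sample sentences are assumptions chosen for illustration.

```python
import math


def sentence_similarity(a, b):
    """Word overlap normalised by the log lengths of both sentences."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    denom = math.log(len(wa)) + math.log(len(wb))
    return len(wa & wb) / denom if denom > 0 else 0.0


def textrank(sentences, damping=0.85, iterations=50):
    """Score sentences with weighted PageRank over the similarity graph."""
    n = len(sentences)
    weights = [
        [sentence_similarity(sentences[i], sentences[j]) if i != j else 0.0
         for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        # Each sentence receives a share of its neighbours' scores,
        # proportional to the edge weight between them.
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0)
            for i in range(n)
        ]
    return scores


sentences = [
    "Training data quality drives classifier performance",
    "Labelling training data is slow for human labellers",
    "Key sentences let labellers judge documents faster",
    "The weather was pleasant yesterday",
]
scores = textrank(sentences)
best = max(range(len(sentences)), key=scores.__getitem__)
key_sentence = sentences[best]
```

The second sentence shares words with both the first and the third, so it collects the highest score, while the unrelated fourth sentence keeps only the baseline score of one minus the damping factor.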

20

Nakajima, Hiromu, and Minoru Sasaki. "Text Classification Based on the Heterogeneous Graph Considering the Relationships between Documents." Big Data and Cognitive Computing 7, no. 4 (2023): 181. http://dx.doi.org/10.3390/bdcc7040181.

Abstract:

Text classification is the task of estimating the genre of a document based on information such as word co-occurrence and frequency of occurrence. Text classification has been studied by various approaches. In this study, we focused on text classification using graph structure data. Conventional graph-based methods express relationships between words and relationships between words and documents as weights between nodes. Then, a graph neural network is used for learning. However, conventional methods are not able to represent the relationships between documents on the graph. In this paper, we propose a graph structure that considers the relationships between documents. In the proposed method, the cosine similarity of document vectors is set as the weight between document nodes. This completes a graph that considers the relationships between documents. The graph is then input into a graph convolutional neural network for training. The aim of this study is therefore to improve the text classification performance of conventional methods by using this graph that considers the relationships between document nodes. We conducted evaluation experiments using five different corpora of English documents. The results showed that the proposed method outperformed the conventional method by up to 1.19%, indicating that the use of relationships between documents is effective. In addition, the proposed method was shown to be particularly effective in classifying long documents.

21

Uddin, Farid, Yibo Chen, Zuping Zhang, and Xin Huang. "Corpus Statistics Empowered Document Classification." Electronics 11, no. 14 (2022): 2168. http://dx.doi.org/10.3390/electronics11142168.

Abstract:

In natural language processing (NLP), document classification is an important task that relies on the proper thematic representation of the documents. Gaussian mixture-based clustering is widespread for capturing rich thematic semantics but ignores emphasizing potential terms in the corpus. Moreover, the soft clustering approach causes long-tail noise by putting every word into every cluster, which affects the natural thematic representation of documents and their proper classification. It is more challenging to capture semantic insights when dealing with short-length documents where word co-occurrence information is limited. In this context, for long texts, we proposed Weighted Sparse Document Vector (WSDV), which performs clustering on the weighted data that emphasizes vital terms and moderates the soft clustering by removing outliers from the converged clusters. Besides the removal of outliers, WSDV utilizes corpus statistics in different steps for the vectorial representation of the document. For short texts, we proposed Weighted Compact Document Vector (WCDV), which captures better semantic insights in building document vectors by emphasizing potential terms and capturing uncertainty information while measuring the affinity between distributions of words. Using available corpus statistics, WCDV sufficiently handles the data sparsity of short texts without depending on external knowledge sources. To evaluate the proposed models, we performed multiclass document classification using standard performance measures (precision, recall, F1 score, and accuracy) on three long-text and two short-text benchmark datasets, where the proposed models outperform some state-of-the-art models. The experimental results demonstrate that in the long-text classification, WSDV reached 97.83% accuracy on the AgNews dataset, 86.05% accuracy on the 20Newsgroup dataset, and 98.67% accuracy on the R8 dataset. In the short-text classification, WCDV reached 72.7% accuracy on the SearchSnippets dataset and 89.4% accuracy on the Twitter dataset.

22

Wang, Bohan, Rui Qi, Jinhua Gao, Jianwei Zhang, Xiaoguang Yuan, and Wenjun Ke. "Mining the Frequent Patterns of Named Entities for Long Document Classification." Applied Sciences 12, no. 5 (2022): 2544. http://dx.doi.org/10.3390/app12052544.

Abstract:

Nowadays, a large amount of information is stored as text, and numerous text mining techniques have been developed for various applications, such as event detection, news topic classification, public opinion detection, and sentiment analysis. Although significant progress has been achieved for short text classification, document-level text classification requires further exploration. Long documents always contain irrelevant noisy information that obscures the indicative features, limiting the interpretability of classification results. To alleviate this problem, a model called MIPELD (mining the frequent pattern of a named entity for long document classification) is presented, which mines the frequent patterns of named entities as features. Discovered patterns allow semantic generalization among documents and provide clues for verifying the results. Experiments on several datasets resulted in good accuracy and macro-F1 values, meeting the requirements for practical application. Further analysis validated the effectiveness of MIPELD in mining interpretable information in text classification.

23

Rahamat Basha, S., J. Keziya Rani, and J. J. C. Prasad Yadav. "A Novel Summarization-based Approach for Feature Reduction Enhancing Text Classification Accuracy." Engineering, Technology & Applied Science Research 9, no. 6 (2019): 5001–5. http://dx.doi.org/10.48084/etasr.3173.

Abstract:

Automatic summarization is the process of shortening one (in single document summarization) or multiple documents (in multi-document summarization). In this paper, a new feature selection method for the nearest neighbor classifier by summarizing the original training documents based on sentence importance measure is proposed. Our approach for single document summarization uses two measures for sentence similarity: the frequency of the terms in one sentence and the similarity of that sentence to other sentences. All sentences were ranked accordingly and the sentences with top ranks (with a threshold constraint) were selected for summarization. The summary of every document in the corpus is taken into a new document used for the summarization evaluation process.

24

Chouhan, Khushi Udaysingh, Nikita Pradeep Kumar Jha, Roshni Sanjay Jha, Shaikh Insha Kamaluddin, and Dr Sujata Khedkar. "Legal Document Analysis." International Journal for Research in Applied Science and Engineering Technology 11, no. 4 (2023): 548–57. http://dx.doi.org/10.22214/ijraset.2023.50123.

Abstract:

Text preprocessing is an essential first step for any machine learning model. Raw data needs to be cleaned and preprocessed to obtain better performance; preprocessing cleans the data and makes it ready to feed to the model. Text classification is at the heart of many software systems that process text documents. Its purpose is to automatically classify text documents into two or more defined categories. In this paper, various preprocessing and classification approaches, such as NLP and machine learning, are applied to patent documents.

25

Ranjan, Nihar M., and Rajesh S. Prasad. "A Brief Survey of Text Document Classification Algorithms and Processes." Journal of Data Mining and Management 8, no. 1 (2023): 6–11. http://dx.doi.org/10.46610/jodmm.2023.v08i01.002.

Abstract:

The exponential growth of unstructured data is one of the most critical challenges in data mining, text analytics, or data analytics. Around 80% of the world's data are available in unstructured format and most are left unattended due to the complexity of its analysis. It is a great challenge to guarantee the quality of the text document classifier that classifies documents based on user preferences because of large-scale terms and data patterns. The World Wide Web is growing rapidly and the availability of electronic documents is also increasing. Therefore, the automatic categorization of documents is the key factor for the systematic organization of information and knowledge discovery. Most existing widespread text mining and classification strategies have adopted term-based approaches. However, the problems of polysemy and synonymy in such approaches are of great concern. To classify documents based on their context, the context-based approach is needed to be followed. Semantic analysis of the text overcomes the limitations of the term-based approach and it also enhances the accuracy of the classifiers. This paper aims to highlight the important algorithms, techniques, and methodologies that can be used for text document classification. Furthermore, the paper also provides a review of the different stages of Text Document Classification.

26

Youngseok, Lee, and Cho Jungwon. "Web document classification using topic modeling based document ranking." International Journal of Electrical and Computer Engineering (IJECE) 11, no. 3 (2021): 2386–92. https://doi.org/10.11591/ijece.v11i3.pp2386-2392.

Abstract:

In this paper, we propose a web document ranking method using topic modeling for effective information collection and classification. The proposed method is applied to the document ranking technique to avoid duplicated crawling when crawling at high speed. Through the proposed document ranking technique, it is feasible to remove redundant documents, classify the documents efficiently, and confirm that the crawler service is running. The proposed method enables rapid collection of many web documents; the user can search the web pages with constant data update efficiently. In addition, the efficiency of data retrieval can be improved because new information can be automatically classified and transmitted. By expanding the scope of the method to big data based web pages and improving it for application to various websites, it is expected that more effective information retrieval will be possible.

27

Srikanth, Bethu, and B. Sankara Babu. "DATA MINING AND TEXT MINING: EFFICIENT TEXT CLASSIFICATION USING SVMS FOR LARGE DATASETS." Global Journal of Engineering Science and Research Management 3, no. 8 (2016): 47–56. https://doi.org/10.5281/zenodo.60657.

Abstract:

Text mining and data mining support different kinds of algorithms for classifying large data sets. Text categorization is traditionally done using term frequency and inverse document frequency, but this method does not eliminate unimportant words in the document. To reduce the error of classifying documents into the wrong category, efficient classification algorithms are needed. Support vector machines (SVMs) are large-margin classification algorithms that give good generalization, compactness, and performance. However, SVMs provide low accuracy, and to handle large data sets they typically need a large number of support vectors. We introduce a new learning algorithm, suited to the dual problem, that adds the support vectors incrementally. It mainly involves a classification algorithm that solves the primal problem instead of the dual problem. In this way, we reduce the complexity of the resulting classifier compared with existing work. Experimental results show classification accuracy comparable to existing work.

28

Hong, Jiwon, Dongho Jeong, and Sang-Wook Kim. "Classifying Malicious Documents on the Basis of Plain-Text Features: Problem, Solution, and Experiences." Applied Sciences 12, no. 8 (2022): 4088. http://dx.doi.org/10.3390/app12084088.

Abstract:

Cyberattacks widely occur by using malicious documents. A malicious document is an electronic document containing malicious codes along with some plain-text data that is human-readable. In this paper, we propose a novel framework that takes advantage of such plaintext data to determine whether a given document is malicious. We extracted plaintext features from the corpus of electronic documents and utilized them to train a classification model for detecting malicious documents. Our extensive experimental results with different combinations of three well-known vectorization strategies and three popular classification methods on five types of electronic documents demonstrate that our framework provides high prediction accuracy in detecting malicious documents.

29

D Krishna, Erukulla Laasya, A Sowmya Sri, T Ravinder Reddy, and Akhil Sanjoy. "A SURVEY ON BIOMEDICAL TEXT DOCUMENT CLASSIFICATION." international journal of engineering technology and management sciences 6, no. 6 (2022): 503–8. http://dx.doi.org/10.46647/ijetms.2022.v06i06.086.

Abstract:

Information extraction, information retrieval, and text classification are only a few of the important study areas that fall under the heading of "bio medical text classification." To increase understanding of various information extraction opportunities in the field of data mining, this study analyses several text categorization approaches used in practise, along with their strengths and shortcomings. We have gathered a dataset with a strong emphasis on three categories: "Thyroid Cancer," "Lung Cancer," and "Colon Cancer." This paper offers an empirical investigation of a classifier. Benchmarks for biomedical text were used to conduct the experiment. We study many metaheuristic algorithms, including genetic algorithms, particle swarm optimization, firefly, cuckoo, and bat algorithms. The suggested multiple classifier system also outperforms ensemble learning, ensemble pruning, and conventional classification algorithms. Based on the data, we predict whether a biomedical text document concerns Thyroid Cancer, Lung Cancer, or Colon Cancer, using basic EDA, text preprocessing, and several models such as Logistic Regression, Decision Tree Classification, and Random Forest Classification.

30

Rahamat Basha, S., and J. K. Rani. "A Comparative Approach of Dimensionality Reduction Techniques in Text Classification." Engineering, Technology & Applied Science Research 9, no. 6 (2019): 4974–79. http://dx.doi.org/10.48084/etasr.3146.

Abstract:

This work deals with document classification. It is a supervised learning method (it needs a labeled document set for training and a test set of documents to be classified). The procedure of document categorization includes a sequence of steps consisting of text preprocessing, feature extraction, and classification. In this work, a self-made data set was used to train the classifiers in every experiment. This work compares the accuracy, average precision, precision, and recall with or without combinations of some feature selection techniques and two classifiers (KNN and Naive Bayes). The results concluded that the Naive Bayes classifier performed better in many situations.

31

Rahamat, Basha S., and J. K. Rani. "A Comparative Approach of Dimensionality Reduction Techniques in Text Classification." Engineering, Technology & Applied Science Research 9, no. 6 (2019): 4974–79. https://doi.org/10.5281/zenodo.3566201.

Abstract:

This work deals with document classification. It is a supervised learning method (it needs a labeled document set for training and a test set of documents to be classified). The procedure of document categorization includes a sequence of steps consisting of text preprocessing, feature extraction, and classification. In this work, a self-made data set was used to train the classifiers in every experiment. This work compares the accuracy, average precision, precision, and recall with or without combinations of some feature selection techniques and two classifiers (KNN and Naive Bayes). The results concluded that the Naive Bayes classifier performed better in many situations.

32

Anne, Chaitanya, Avdesh Mishra, Md Tamjidul Hoque, and Shengru Tu. "Multiclass patent document classification." Artificial Intelligence Research 7, no. 1 (2017): 1. http://dx.doi.org/10.5430/air.v7n1p1.

Abstract:

Text classification is used in information extraction and retrieval from a given text, and it is considered an important step in managing the vast and growing number of records held in digital form. This article addresses the problem of classifying patent documents into fifteen categories or classes, where some classes overlap with each other for practical reasons. For the development of the classification model using machine learning techniques, useful features have been extracted from the given documents. The features are used to classify patent documents as well as to generate useful tag-words. The overall objective of this work is to systematize NASA’s patent management by developing a set of automated tools that can assist NASA in managing and marketing its portfolio of intellectual properties (IP) and enable easier discovery of relevant IP by users. We have identified an array of applicable methods, such as k-Nearest Neighbors (kNN), two variations of the Support Vector Machine (SVM) algorithm, and two tree-based classification algorithms: Random Forest and J48. The major research steps in this paper consist of filtering techniques for variable selection, information gain and feature correlation analysis, and training and testing potential models using effective classifiers. Further, the obstacles associated with imbalanced data were mitigated by adding pseudo-synthetic data wherever appropriate, which resulted in a superior SVM-classifier-based model.

33

Golub, Koraljka. "Automatic Subject Indexing of Text." KNOWLEDGE ORGANIZATION 46, no. 2 (2019): 104–21. http://dx.doi.org/10.5771/0943-7444-2019-2-104.

Abstract:

Automatic subject indexing addresses problems of scale and sustainability and can be at the same time used to enrich existing metadata records, establish more connections across and between resources from various metadata and resource collections, and enhance consistency of the metadata. In this work, automatic subject indexing focuses on assigning index terms or classes from established knowledge organization systems (KOSs) for subject indexing like thesauri, subject headings systems and classification systems. The following major approaches are discussed, in terms of their similarities and differences, advantages and disadvantages for automatic assigned indexing from KOSs: “text categorization,” “document clustering,” and “document classification.” Text categorization is perhaps the most widespread, machine-learning approach with what seems generally good reported performance. Document clustering automatically both creates groups of related documents and extracts names of subjects depicting the group at hand. Document classification re-uses the intellectual effort invested into creating a KOS for subject indexing and even simple string-matching algorithms have been reported to achieve good results, because one concept can be described using a number of different terms, including equivalent, related, narrower and broader terms. Finally, applicability of automatic subject indexing to operative information systems and challenges of evaluation are outlined, suggesting the need for more research.

34

Smith, Dan, and Richard Harvey. "Document Retrieval Using SIFT Image Features." JUCS - Journal of Universal Computer Science 17, no. 1 (2011): 3–15. https://doi.org/10.3217/jucs-017-01-0003.

Abstract:

This paper describes a new approach to document classification based on visual features alone. Text-based retrieval systems perform poorly on noisy text. We have conducted a series of experiments using cosine distance as our similarity measure, selecting varying numbers of local interest points per page and varying numbers of nearest-neighbour points in the similarity calculations. We have found that a distance-based measure of similarity outperforms a rank-based measure except when there are few interest points. We show that using visual features substantially outperforms text-based approaches for noisy text, giving average precision in the range 0.4-0.43 in several experiments retrieving scientific papers.

35

M.Shaikh, Mustafa, Ashwini A. Pawar, and Vibha B. Lahane. "Pattern Discovery Text Mining for Document Classification." International Journal of Computer Applications 117, no. 1 (2015): 6–12. http://dx.doi.org/10.5120/20516-2101.

36

Elhadad, Mohamed, Khaled Badran, and Gouda Salama. "Towards Ontology-Based web text Document Classification." Journal of Engineering Science and Military Technologies 17, no. 17 (2017): 1–8. http://dx.doi.org/10.21608/ejmtc.2017.21564.

37

Sinoara, Roberta A., Jose Camacho-Collados, Rafael G. Rossi, Roberto Navigli, and Solange O. Rezende. "Knowledge-enhanced document embeddings for text classification." Knowledge-Based Systems 163 (January 2019): 955–71. http://dx.doi.org/10.1016/j.knosys.2018.10.026.

38

Elhadad, Mohamed, Khaled Badran, and Gouda Salama. "Towards Ontology-Based web text Document Classification." International Conference on Aerospace Sciences and Aviation Technology 17, AEROSPACE SCIENCES (2017): 1–8. http://dx.doi.org/10.21608/asat.2017.22749.

39

Endalie, Demeke, Getamesay Haile, and Wondmagegn Taye Abebe. "Feature selection by integrating document frequency with genetic algorithm for Amharic news document classification." PeerJ Computer Science 8 (April 25, 2022): e961. http://dx.doi.org/10.7717/peerj-cs.961.

Abstract:

Text classification is the process of categorizing documents based on their content into a predefined set of categories. Text classification algorithms typically represent documents as collections of words and therefore deal with a large number of features. The selection of appropriate features becomes important when the initial feature set is quite large. In this paper, we present a hybrid document frequency (DF) and genetic algorithm (GA)-based feature selection method for Amharic text classification. We evaluate this feature selection method on Amharic news documents obtained from the Ethiopian News Agency (ENA). The number of categories used in this study is 13. Our experimental results showed that the proposed feature selection method outperformed other feature selection methods utilized for Amharic news document classification. Combining the proposed feature selection method with the Extra Tree Classifier (ETC) improves classification accuracy. It improves classification accuracy by up to 1% over the hybrid of DF, information gain (IG), chi-square (CHI), and principal component analysis (PCA), by 2.47% over GA, and by 3.86% over a hybrid of DF, IG, and CHI.

40

Ernawati, Iin. "NAIVE BAYES CLASSIFIER DAN SUPPORT VECTOR MACHINE SEBAGAI ALTERNATIF SOLUSI UNTUK TEXT MINING." Jurnal Teknologi Informasi dan Pendidikan 12, no. 2 (2019): 32–38. http://dx.doi.org/10.24036/tip.v12i2.219.

Abstract:

This study addresses text-based data mining, often called text mining, using two commonly used classification methods: the Naïve Bayes classifier (NBC) and the support vector machine (SVM). The classification targets Indonesian-language documents, and the relationship between documents is measured by probability, which can be checked against other classification algorithms. This is evident from the conclusion that, in the NBC probability results, the word “party” appears at least in the economic and political documents, while in the SVM results the words “price” and “kpk” appear in both the economic and political documents.

41

Chiraratanasopha, Boonthida, Thanaruk Theeramunkong, and Salin Boonbrahm. "Hierarchical text classification using Relative Inverse Document Frequency." ECTI Transactions on Computer and Information Technology (ECTI-CIT) 15, no. 2 (2021): 166–76. http://dx.doi.org/10.37936/ecti-cit.2021152.240515.

Abstract:

Automatic hierarchical text classification has become a challenging and much-needed task as hierarchical taxonomies proliferate with the boom in knowledge organization. The hierarchical structure identifies the dependence relationships between categories, in which generalized and specific concepts can overlap within the tree. This paper presents the use of the frequency of a term occurring in related categories across the hierarchical tree to help in document classification. Four extended term weightings of Relative Inverse Document Frequency (IDFr), covering a term's own category, its parent category, its sibling categories, and its child categories, are exploited to generate a classifier model using a centroid-based technique. In experiments on hierarchical text classification of Thai documents, IDFr achieved the best accuracy and F-measure, 53.65% and 50.80% respectively, on the Top-n feature set under family-based evaluation, which are higher than TF-IDF by 2.35% and 1.15% in the same settings.

42

Aubaid, Asmaa M., and Alok Mishra. "A Rule-Based Approach to Embedding Techniques for Text Document Classification." Applied Sciences 10, no. 11 (2020): 4009. http://dx.doi.org/10.3390/app10114009.

Abstract:

With the growth of online information and the sudden expansion in the number of electronic documents provided on websites and in electronic libraries, categorizing text documents has become difficult. A rule-based approach is one solution to this problem; the purpose of this study is to classify documents by using a rule-based approach. This paper combines the rule-based approach with the document-to-vector (doc2vec) embedding technique. An experiment was performed on two data sets, Reuters-21578 and 20 Newsgroups, to classify the top ten categories of these data sets by using a document-to-vector rule-based method (D2vecRule). This method provided a good classification result according to the F-measure and implementation time metrics. In conclusion, our D2vecRule algorithm performed well when compared with other algorithms such as JRip, OneR, and ZeroR applied to the same Reuters-21578 dataset.

43

Ni'mah, Ana Tsalitsatun, and Fahmi Syuhada. "Term Weighting Based Indexing Class and Indexing Short Document for Indonesian Thesis Title Classification." Journal of Computer Science and Informatics Engineering (J-Cosine) 6, no. 2 (2022): 167–75. http://dx.doi.org/10.29303/jcosine.v6i2.471.

Abstract:

Document classification is now relatively easy because recent methods achieve strong results. Document classification using the TF-IDF-ICF term weighting method has been widely studied, but the documents used in that research are generally long. If TF-IDF term weighting is applied to a short text document such as a thesis title, the classification results will not be optimal: IDF assigns a low weight to words that appear in many documents, and ICF assigns a low weight to words that appear in many classes, even though such words should carry a large weight as the core of a short text document. Therefore, this study investigates word weighting based on class indexing and short document indexing, namely TF-IDF-ICF-IDSF. The study compares Naïve Bayes and SVM classifiers. The dataset consists of thesis titles of Informatics Education students at Trunojoyo Madura University. The test results show that classification using the TF-IDF-ICF-IDSF term weighting method outperforms other term weighting methods, achieving 91% precision, 93% recall, 86% F1-score, and 84% accuracy with SVM.

44

N J, Avinash, Krishnaraj Rao, Rama Moorthy H, et al. "A Novel Automatic Text Document Classification Using Learning based Text Classification(LbTC) Approach." Procedia Computer Science 258 (2025): 4279–90. https://doi.org/10.1016/j.procs.2025.04.677.

45

Fragoso, Rogerio C. P., Roberto H. W. Pinheiro, and George Cavalcanti. "Data-driven Feature Selection Methods for Text Classification: an Empirical Evaluation." JUCS - Journal of Universal Computer Science 25, no. 4 (2019): 334–60. https://doi.org/10.3217/jucs-025-04-0334.

Abstract:

Dimensionality reduction is a crucial task in text classification. The most adopted strategy is feature selection using filter methods. This approach presents a difficulty in determining the best size for the final feature vector. At Least One FeaTure (ALOFT), Maximum f Features per Document (MFD), Maximum f Features per Document-Reduced (MFDR) and Class-dependent Maximum f Features per Document-Reduced (cMFDR) are feature selection methods that define automatically the number of features per Corpus. However, MFD, MFDR, and cMFDR require a parameter that defines the number of features to be selected per document. Automatic Feature Subsets Analyzer (AFSA) is an auxiliary method that automates such configuration. In this paper, we evaluate dimensionality reduction, classification performance and execution time of this family of methods: ALOFT, MFD, MFDR, cMFDR and AFSA. The experiments are conducted using three feature evaluation functions and twenty databases. MFD obtained the best results among the feature selection methods. In addition, the experiments showed that the use of AFSA does not significantly affect the classification performances or the dimensionality reduction rates of the feature selection methods, but considerably reduces their execution times.

46

Idrush, G. Mahammad. "Offensive Language and Image Identification on Social Media Based on Text and Image Classification." International Journal for Research in Applied Science and Engineering Technology 13, no. 5 (2025): 2148–52. https://doi.org/10.22214/ijraset.2025.70351.

Abstract:

A digital signature is like a digital version of a handwritten signature but much more secure. It ensures that digital documents are authentic, unaltered, and genuinely from the sender. Our project, Digital Signature Tool, focuses on creating an easy-to-use application for securely signing and verifying documents. Using advanced cryptographic methods like RSA or ECDSA, the tool allows users to generate and manage private and public keys securely. To sign a document, the sender uses their private key to create a unique digital signature, while the receiver uses the sender’s public key to verify the signature. This process confirms the document's authenticity and ensures it has not been tampered with. The application will integrate essential features, such as secure key management, document signing, and signature verification, all within a user-friendly interface. This project aims to provide individuals and organizations with a reliable solution for protecting their documents and communications, ensuring trust, data integrity, and security in the digital space.

47

Parsafard, Pouyan, Hadi Veisi, Niloofar Aflaki, and Siamak Mirzaei. "Text Classification based on Discriminative-Semantic Features and Variance of Fuzzy Similarity." International Journal of Intelligent Systems and Applications 14, no. 2 (2022): 26–39. http://dx.doi.org/10.5815/ijisa.2022.02.03.

Abstract:

Due to the rapid growth of the Internet, large amounts of unlabelled textual data are produced daily. Clearly, finding the subject of a text document is a primary source of information in text processing applications. In this paper, a text classification method is presented and evaluated for Persian and English. The proposed technique utilizes the variance of fuzzy similarity alongside discriminative and semantic feature selection methods. Discriminative features are those that distinguish categories with higher power, and the concept of semantic features takes into account the similarity between features and documents by using only the available documents. The proposed method incorporates fuzzy weighting as a measure of similarity. The fuzzy weights are derived from the concept of fuzzy similarity, which is defined as the variance of the membership values of a document to all categories, where a document may belong to several categories at the same time with some membership value and these membership values must sum to 1. The proposed document classification method is evaluated on three datasets (one Persian and two English datasets), and two classification methods, support vector machine (SVM) and artificial neural network (ANN), are used. Comparing the results with other text classification methods demonstrates the consistent superiority of the proposed technique in all cases. The weighted average F-measure of our method is 82% and 97.8% in the classification of Persian and English documents, respectively.
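The variance-of-membership idea can be written out as a short worked formula. This is only a sketch under assumptions not stated in the abstract: the symbols $\mu_k(d)$ and $K$ are introduced here for illustration, and the population variance is assumed.

```latex
% Assumed notation: \mu_k(d) is the membership of document d in category k, k = 1..K.
\sum_{k=1}^{K} \mu_k(d) = 1
\quad\Longrightarrow\quad
\bar{\mu}(d) = \frac{1}{K},
\qquad
\operatorname{Var}(d) = \frac{1}{K} \sum_{k=1}^{K} \left( \mu_k(d) - \frac{1}{K} \right)^{2}
```

Under these assumptions the variance is 0 when the document belongs equally to all categories and reaches its maximum, $(K-1)/K^{2}$, when a single category has membership 1.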

48

Jia, Longjia, and Bangzuo Zhang. "A new document representation based on global policy for supervised term weighting schemes in text categorization." Mathematical Biosciences and Engineering 19, no. 5 (2022): 5223–40. http://dx.doi.org/10.3934/mbe.2022245.

Abstract:

Two main factors are involved in document classification: the document representation method and the classification algorithm. In this study, we focus on the document representation method and demonstrate that the choice of representation method affects the quality of classification results. We propose a document representation strategy for supervised text classification named document representation based on global policy (DRGP), which can obtain an appropriate document representation according to the distribution of terms. The main idea of DRGP is to construct the optimization function through the importance of terms to different categories. In the experiments, we investigate the effects of DRGP on the 20 Newsgroups and Reuters-21578 datasets, using SVM as the classifier. The results show that DRGP outperforms other text representation strategies, such as Document Max, Document Two Max, and global policy.

49

Ayogu, I. I. "Exploring multinomial naïve Bayes for Yorùbá text document classification." Nigerian Journal of Technology 39, no. 2 (2020): 528–35. http://dx.doi.org/10.4314/njt.v39i2.23.

Abstract:

The recent increase in the emergence of Nigerian language text online motivates this paper in which the problem of classifying text documents written in Yorùbá language into one of a few pre-designated classes is considered. Text document classification/categorization research is well established for English language and many other languages; this is not so for Nigerian languages. This paper evaluated the performance of a multinomial Naive Bayes model learned on a research dataset consisting of 100 samples of text each from business, sporting, entertainment, technology and political domains, separately on unigram, bigram and trigram features obtained using the bag of words representation approach. Results show that the performance of the model over unigram and bigram features is comparable but significantly better than a model learned on trigram features. The results generally indicate a possibility for the practical application of NB algorithm to the classification of text documents written in Yorùbá language. Keywords: Supervised learning, text classification, Yorùbá language, text mining, BoW Representation
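The setup described in this abstract, a multinomial Naive Bayes model trained separately on unigram, bigram, or trigram bag-of-words features, can be sketched as follows. This is an illustrative sketch rather than the paper's code: whitespace tokenization, add-one smoothing, and skipping n-grams unseen in training are assumptions, and the class and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def ngrams(text, n):
    """Return the word n-grams of a text as a list of tuples."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


class MultinomialNB:
    """Multinomial Naive Bayes over n-gram counts with add-one smoothing."""

    def __init__(self, n=1):
        self.n = n  # 1 = unigram, 2 = bigram, 3 = trigram features

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(texts, labels):
            grams = ngrams(text, self.n)
            self.feature_counts[label].update(grams)
            self.vocab.update(grams)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + vocab_size
            score = math.log(n_docs / total_docs)  # log prior
            for gram in ngrams(text, self.n):
                if gram in self.vocab:  # ignore n-grams unseen in training
                    score += math.log((counts[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Training one model per value of `n` on the same labelled documents gives the kind of unigram, bigram, and trigram comparison the abstract reports.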

50

Choo, Wou Onn, Lam Hong Lee, Yen Pei Tay, Khang Wen Goh, Dino Isa, and Suliman Mohamed Fati. "Automatic Folder Allocation System for Electronic Text Document Repositories Using Enhanced Bayesian Classification Approach." International Journal of Intelligent Information Technologies 15, no. 2 (2019): 1–19. http://dx.doi.org/10.4018/ijiit.2019040101.

Abstract:

This article proposes a system equipped with the enhanced Bayesian classification techniques to automatically assign folders to store electronic text documents. Despite computer technology advancements in the information age where electronic text files are so pervasive in information exchange, almost every single document created or downloaded from the Internet requires manual classification by the users before being deposited into a folder in a computer. Not only does such a tedious task cause inconvenience to users, the time taken to repeatedly classify and allocate a folder for each text document impedes productivity, especially when dealing with a huge number of files and deep layers of folders. In order to overcome this, a prototype system is built to evaluate the performance of the enhanced Bayesian text classifier for automatic folder allocation, by categorizing text documents based on the existing types of text documents and folders present in user's hard drive. In this article, the authors deploy a High Relevance Keyword Extraction (HRKE) technique and an Automatic Computed Document Dependent (ACDD) Weighting Factor technique to a Bayesian classifier in order to obtain better classification accuracy, while maintaining the low training cost and simple classifying processes using the conventional Bayesian approach.
