Journal articles on the topic "Text document classification"

To see other types of publications on this topic, follow the link: Text document classification.

Format your source in APA, MLA, Chicago, Harvard, and other citation styles

Below are the top 50 journal articles for research on the topic "Text document classification".

Next to every work in the list of references there is an "Add to bibliography" button. Click on it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the publication in .pdf format and read its abstract online, whenever these are available in the metadata.

Browse journal articles across a wide range of disciplines and compile your bibliography correctly.

1

Mukherjee, Indrajit, Prabhat Kumar Mahanti, Vandana Bhattacharya, and Samudra Banerjee. "Text classification using document-document semantic similarity." International Journal of Web Science 2, no. 1/2 (2013): 1. http://dx.doi.org/10.1504/ijws.2013.056572.

2

Cheng, Betty Yee Man, Jaime G. Carbonell, and Judith Klein-Seetharaman. "Protein classification based on text document classification techniques." Proteins: Structure, Function, and Bioinformatics 58, no. 4 (January 11, 2005): 955–70. http://dx.doi.org/10.1002/prot.20373.

3

Yao, Liang, Chengsheng Mao, and Yuan Luo. "Graph Convolutional Networks for Text Classification." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 7370–77. http://dx.doi.org/10.1609/aaai.v33i01.33017370.

Abstract:
Text classification is an important and classical problem in natural language processing. There have been a number of studies that applied convolutional neural networks (convolution on regular grid, e.g., sequence) to classification. However, only a limited number of studies have explored the more flexible graph convolutional neural networks (convolution on non-grid, e.g., arbitrary graph) for the task. In this work, we propose to use graph convolutional networks for text classification. We build a single text graph for a corpus based on word co-occurrence and document-word relations, then learn a Text Graph Convolutional Network (Text GCN) for the corpus. Our Text GCN is initialized with one-hot representations for words and documents; it then jointly learns the embeddings for both words and documents, as supervised by the known class labels for documents. Our experimental results on multiple benchmark datasets demonstrate that a vanilla Text GCN without any external word embeddings or knowledge outperforms state-of-the-art methods for text classification. On the other hand, Text GCN also learns predictive word and document embeddings. In addition, experimental results show that the improvement of Text GCN over state-of-the-art comparison methods becomes more prominent as we lower the percentage of training data, suggesting the robustness of Text GCN to limited training data in text classification.
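
The graph construction this abstract describes, with TF-IDF weights on document-word edges and positive pointwise mutual information (PMI) on word-word co-occurrence edges, can be sketched as follows. This is only an illustration of the graph-building step; the toy corpus, the window size of 3 and the dense adjacency matrix are illustrative assumptions, and the GCN training itself is omitted.

```python
# Sketch: build the heterogeneous doc-word / word-word graph used by Text GCN-style models.
# Assumptions: a tiny in-memory corpus, sliding-window PMI, TF-IDF doc-word edge weights.
from collections import Counter
from itertools import combinations
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["graph networks for text",
          "text classification with graphs",
          "word embeddings for documents"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus).toarray()          # document-word edge weights
vocab = vectorizer.get_feature_names_out()
n_docs, n_words = tfidf.shape

# Count word and word-pair occurrences in sliding windows for PMI.
window = 3
word_cnt, pair_cnt, n_windows = Counter(), Counter(), 0
for doc in corpus:
    tokens = doc.split()
    for i in range(max(1, len(tokens) - window + 1)):
        win = set(tokens[i:i + window]) & set(vocab)
        n_windows += 1
        word_cnt.update(win)
        pair_cnt.update(frozenset(p) for p in combinations(sorted(win), 2))

word_idx = {w: i for i, w in enumerate(vocab)}
adj = np.eye(n_docs + n_words)                               # self-loops; documents first, then words
adj[:n_docs, n_docs:] = tfidf                                # document -> word edges
adj[n_docs:, :n_docs] = tfidf.T                              # word -> document edges
for pair, c in pair_cnt.items():
    w1, w2 = tuple(pair)
    pmi = np.log((c / n_windows) /
                 ((word_cnt[w1] / n_windows) * (word_cnt[w2] / n_windows)))
    if pmi > 0:                                              # keep only positive PMI, as in Text GCN
        i, j = n_docs + word_idx[w1], n_docs + word_idx[w2]
        adj[i, j] = adj[j, i] = pmi

print(adj.shape)  # (n_docs + n_words, n_docs + n_words)
```
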
4

Kim, Jiyun, and Han-joon Kim. "Multidimensional Text Warehousing for Automated Text Classification." Journal of Information Technology Research 11, no. 2 (April 2018): 168–83. http://dx.doi.org/10.4018/jitr.2018040110.

Abstract:
This article describes how, in the era of big data, a data warehouse is an integrated multidimensional database that provides the basis for the decision making required to establish crucial business strategies. Efficient, effective analysis requires a data organization system that integrates and manages data of various dimensions. However, conventional data warehousing techniques do not consider the various data manipulation operations required for data-mining activities. With the current explosion of text data, much research has examined text (or document) repositories to support text mining and document retrieval. Therefore, this article presents a method of developing a text warehouse that provides a machine-learning-based text classification service. The document is represented as a term-by-concept matrix using a 3rd-order tensor-based textual representation model, which emphasizes the meaning of words occurring in the document. As a result, the proposed text warehouse makes it possible to develop a semantic Naïve Bayes text classifier only by executing appropriate SQL statements.
5

P, Ashokkumar, Siva Shankar G, Gautam Srivastava, Praveen Kumar Reddy Maddikunta, and Thippa Reddy Gadekallu. "A Two-stage Text Feature Selection Algorithm for Improving Text Classification." ACM Transactions on Asian and Low-Resource Language Information Processing 20, no. 3 (May 2021): 1–19. http://dx.doi.org/10.1145/3425781.

Abstract:
As the number of digital text documents increases on a daily basis, the classification of text is becoming a challenging task. Each text document consists of a large number of words (or features) that drive down the efficiency of a classification algorithm. This article presents an optimized feature selection algorithm designed to reduce a large number of features to improve the accuracy of the text classification algorithm. The proposed algorithm uses noun-based filtering, a word ranking that enhances the performance of the text classification algorithm. Experiments are carried out on three benchmark datasets, and the results show that the proposed classification algorithm has achieved the maximum accuracy when compared to the existing algorithms. The proposed algorithm is compared to Term Frequency-Inverse Document Frequency, Balanced Accuracy Measure, GINI Index, Information Gain, and Chi-Square. The experimental results clearly show the strength of the proposed algorithm.
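
The two-stage selector itself (noun-based filtering followed by word ranking) is specific to this paper, but the filter-then-classify baselines it is compared against, such as chi-square and information gain, follow a standard pattern. A minimal scikit-learn sketch of one such baseline is shown below; the 20 Newsgroups data and the value k=2000 are assumptions, not the paper's setup.

```python
# Sketch: a generic filter-based feature selection baseline (chi-square),
# of the kind the proposed two-stage algorithm is compared against.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

data = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("select", SelectKBest(chi2, k=2000)),   # keep the 2000 highest-scoring features
    ("clf", MultinomialNB()),
])

scores = cross_val_score(pipeline, data.data, data.target, cv=3)
print("mean accuracy: %.3f" % scores.mean())
```
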
6

Zheng, Jianming, Yupu Guo, Chong Feng, and Honghui Chen. "A Hierarchical Neural-Network-Based Document Representation Approach for Text Classification." Mathematical Problems in Engineering 2018 (2018): 1–10. http://dx.doi.org/10.1155/2018/7987691.

Abstract:
Document representation is widely used in practical application, for example, sentiment classification, text retrieval, and text classification. Previous work is mainly based on the statistics and the neural networks, which suffer from data sparsity and model interpretability, respectively. In this paper, we propose a general framework for document representation with a hierarchical architecture. In particular, we incorporate the hierarchical architecture into three traditional neural-network models for document representation, resulting in three hierarchical neural representation models for document classification, that is, TextHFT, TextHRNN, and TextHCNN. Our comprehensive experimental results on two public datasets, that is, Yelp 2016 and Amazon Reviews (Electronics), show that our proposals with hierarchical architecture outperform the corresponding neural-network models for document classification, resulting in a significant improvement ranging from 4.65% to 35.08% in terms of accuracy with a comparable (or substantially less) expense of time consumption. In addition, we find that the long documents benefit more from the hierarchical architecture than the short ones as the improvement in terms of accuracy on long documents is greater than that on short documents.
7

Lee, Kangwook, Sanggyu Han, and Sung-Hyon Myaeng. "A discourse-aware neural network-based text model for document-level text classification." Journal of Information Science 44, no. 6 (December 4, 2017): 715–35. http://dx.doi.org/10.1177/0165551517743644.

Abstract:
Capturing semantics scattered across entire text is one of the important issues for Natural Language Processing (NLP) tasks. It would be particularly critical with long text embodying a flow of themes. This article proposes a new text modelling method that can handle thematic flows of text with Deep Neural Networks (DNNs) in such a way that discourse information and distributed representations of text are incorporated. Unlike previous DNN-based document models, the proposed model enables discourse-aware analysis of text and composition of sentence-level distributed representations guided by the discourse structure. More specifically, our method identifies Elementary Discourse Units (EDUs) and their discourse relations in a given document by applying Rhetorical Structure Theory (RST)-based discourse analysis. The result is fed into a tree-structured neural network that reflects the discourse information including the structure of the document and the discourse roles and relation types. We evaluate the document model for two document-level text classification tasks, sentiment analysis and sarcasm detection, with comparisons against the reference systems that also utilise discourse information. In addition, we conduct additional experiments to evaluate the impact of neural network types and adopted discourse factors on modelling documents vis-à-vis the two classification tasks. Furthermore, we investigate the effects of various learning methods and input units on the quality of the proposed discourse-aware document model.
8

Rahamat Basha, S., J. Keziya Rani, and J. J. C. Prasad Yadav. "A Novel Summarization-based Approach for Feature Reduction Enhancing Text Classification Accuracy." Engineering, Technology & Applied Science Research 9, no. 6 (December 1, 2019): 5001–5. http://dx.doi.org/10.48084/etasr.3173.

Abstract:
Automatic summarization is the process of shortening one (in single document summarization) or multiple documents (in multi-document summarization). In this paper, a new feature selection method for the nearest neighbor classifier by summarizing the original training documents based on sentence importance measure is proposed. Our approach for single document summarization uses two measures for sentence similarity: the frequency of the terms in one sentence and the similarity of that sentence to other sentences. All sentences were ranked accordingly and the sentences with top ranks (with a threshold constraint) were selected for summarization. The summary of every document in the corpus is taken into a new document used for the summarization evaluation process.
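
As a rough sketch of the sentence-scoring idea above, and not the authors' exact importance measure, the snippet below ranks sentences by their term frequencies plus their mean cosine similarity to the other sentences and keeps the top-ranked ones as the summary; the example text and the top-k cut-off are assumptions.

```python
# Sketch: rank sentences by term frequency and inter-sentence similarity,
# then keep the top-ranked sentences as a summary (a stand-in for the paper's measure).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

document = (
    "Text classification assigns labels to documents. "
    "Feature reduction removes redundant terms before classification. "
    "Summaries can act as reduced feature sets. "
    "The weather was pleasant that day."
)
sentences = [s.strip() for s in document.split(".") if s.strip()]

counts = CountVectorizer().fit_transform(sentences).toarray()
term_score = counts.sum(axis=1) / counts.sum()     # share of the corpus terms in each sentence
sim = cosine_similarity(counts)
np.fill_diagonal(sim, 0.0)
sim_score = sim.mean(axis=1)                       # similarity of a sentence to the others

rank = np.argsort(-(term_score + sim_score))
top_k = 2                                          # threshold constraint (assumed)
summary = ". ".join(sentences[i] for i in sorted(rank[:top_k])) + "."
print(summary)
```
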
9

M.Shaikh, Mustafa, Ashwini A. Pawar, and Vibha B. Lahane. "Pattern Discovery Text Mining for Document Classification." International Journal of Computer Applications 117, no. 1 (May 20, 2015): 6–12. http://dx.doi.org/10.5120/20516-2101.

10

Elhadad, Mohamed, Khaled Badran, and Gouda Salama. "Towards Ontology-Based web text Document Classification." Journal of Engineering Science and Military Technologies 17, no. 17 (April 1, 2017): 1–8. http://dx.doi.org/10.21608/ejmtc.2017.21564.

11

Sinoara, Roberta A., Jose Camacho-Collados, Rafael G. Rossi, Roberto Navigli, and Solange O. Rezende. "Knowledge-enhanced document embeddings for text classification." Knowledge-Based Systems 163 (January 2019): 955–71. http://dx.doi.org/10.1016/j.knosys.2018.10.026.

12

Elhadad, Mohamed, Khaled Badran, and Gouda Salama. "Towards Ontology-Based web text Document Classification." International Conference on Aerospace Sciences and Aviation Technology 17, AEROSPACE SCIENCES (April 1, 2017): 1–8. http://dx.doi.org/10.21608/asat.2017.22749.

13

Anne, Chaitanya, Avdesh Mishra, Md Tamjidul Hoque, and Shengru Tu. "Multiclass patent document classification." Artificial Intelligence Research 7, no. 1 (December 15, 2017): 1. http://dx.doi.org/10.5430/air.v7n1p1.

Abstract:
Text classification is used in information extraction and retrieval from a given text, and text classification has been considered as an important step to manage a vast number of records given in digital form that is far-reaching and expanding. This article addresses patent document classification problem into fifteen different categories or classes, where some classes overlap with each other for practical reasons. For the development of the classification model using machine learning techniques, useful features have been extracted from the given documents. The features are used to classify patent document as well as to generate useful tag-words. The overall objective of this work is to systematize NASA’s patent management, by developing a set of automated tools that can assist NASA to manage and market its portfolio of intellectual properties (IP), and to enable easier discovery of relevant IP by users. We have identified an array of methods that can be applied such as k-Nearest Neighbors (kNN), two variations of the Support Vector Machine (SVM) algorithms, and two tree based classification algorithms: Random Forest and J48. The major research steps in this paper consist of filtering techniques for variable selection, information gain and feature correlation analysis, and training and testing potential models using effective classifiers. Further, the obstacles associated with the imbalanced data were mitigated by adding pseudo-synthetic data wherever appropriate, which resulted in a superior SVM classifier based model.
14

Chiraratanasopha, Boonthida, Thanaruk Theeramunkong, and Salin Boonbrahm. "Hierarchical text classification using Relative Inverse Document Frequency." ECTI Transactions on Computer and Information Technology (ECTI-CIT) 15, no. 2 (April 21, 2021): 166–76. http://dx.doi.org/10.37936/ecti-cit.2021152.240515.

Abstract:
Automatic hierarchical text classification has been a challenging and much-needed task, as hierarchical taxonomies have multiplied with the boom in knowledge organization. The hierarchical structure identifies relationships of dependence between different categories, in which generalized and specific concepts can overlap within the tree. This paper presents the use of the frequency of a term occurring in related categories of the hierarchical tree to help in document classification. Four extended term weightings of Relative Inverse Document Frequency (IDFr), covering a term's own category, its parent category, its sibling categories, and its child categories, are exploited to generate a classifier model using a centroid-based technique. In experiments on hierarchical text classification of Thai documents, IDFr achieved the best accuracy and F-measure of 53.65% and 50.80% on the Top-n feature set under family-based evaluation, which are 2.35% and 1.15% higher than TF-IDF in the same settings, respectively.
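
The IDFr weighting over a category hierarchy is specific to this paper, so the sketch below only illustrates the centroid-based classification step over plain TF-IDF features, roughly the flat TF-IDF baseline the authors compare against; the toy documents and labels are placeholders.

```python
# Sketch: a flat centroid-based text classifier over TF-IDF features
# (the paper's IDFr weighting over a category hierarchy is not reproduced here).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestCentroid
from sklearn.pipeline import make_pipeline

train_texts = ["stock markets fell sharply", "the team won the championship",
               "parliament passed the budget", "the striker scored twice"]
train_labels = ["economy", "sport", "politics", "sport"]

model = make_pipeline(TfidfVectorizer(), NearestCentroid())
model.fit(train_texts, train_labels)
print(model.predict(["the league match ended in a draw"]))
```
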
15

Kumari, Lalitha, and Ch Satyanarayana. "An novel cluster based feature selection and document classification model on high dimension trec data." International Journal of Engineering & Technology 7, no. 1.1 (December 21, 2017): 466. http://dx.doi.org/10.14419/ijet.v7i1.1.10146.

Abstract:
TREC text documents are complex to analyse when looking for the features of relevant, similar documents with the traditional document similarity measures. As the size of the TREC repository increases, finding relevant clustered documents in a large collection of unstructured documents is a challenging task. Traditional document similarity and classification models are implemented on homogeneous TREC data to find essential features for document entities that are similar to the TREC documents. Also, most of the traditional models are applicable only to limited text document sets. The main issues of the traditional text mining models on the TREC repository include: 1) each document is represented in vector form with many sparse values; 2) failure to find the semantic similarity of documents within and between clusters; 3) a high mean squared error rate. In this paper, a novel feature-selection-based clustering and classification model is proposed for a large number of different TREC repositories. Traditional Latent Semantic Indexing and document clustering models fail to find topic relevance on large TREC clinical text document sets due to computational memory and time. The proposed document feature selection and cluster-based classification model is applied to TREC clinical benchmark datasets. The experimental results show that the proposed model is more efficient than the existing models as far as computational memory, accuracy, and error rate are concerned.
16

Rahamat Basha, S., and J. K. Rani. "A Comparative Approach of Dimensionality Reduction Techniques in Text Classification." Engineering, Technology & Applied Science Research 9, no. 6 (December 1, 2019): 4974–79. http://dx.doi.org/10.48084/etasr.3146.

Abstract:
This work deals with document classification. It is a supervised learning method (it needs a labeled document set for training and a test set of documents to be classified). The procedure of document categorization includes a sequence of steps consisting of text preprocessing, feature extraction, and classification. In this work, a self-made data set was used to train the classifiers in every experiment. This work compares the accuracy, average precision, precision, and recall with or without combinations of some feature selection techniques and two classifiers (KNN and Naive Bayes). The results concluded that the Naive Bayes classifier performed better in many situations.
17

Wang, Bohan, Rui Qi, Jinhua Gao, Jianwei Zhang, Xiaoguang Yuan, and Wenjun Ke. "Mining the Frequent Patterns of Named Entities for Long Document Classification." Applied Sciences 12, no. 5 (February 28, 2022): 2544. http://dx.doi.org/10.3390/app12052544.

Abstract:
Nowadays, a large amount of information is stored as text, and numerous text mining techniques have been developed for various applications, such as event detection, news topic classification, public opinion detection, and sentiment analysis. Although significant progress has been achieved for short text classification, document-level text classification requires further exploration. Long documents always contain irrelevant noisy information that shelters the prominence of indicative features, limiting the interpretability of classification results. To alleviate this problem, a model called MIPELD (mining the frequent pattern of a named entity for long document classification) for long document classification is demonstrated, which mines the frequent patterns of named entities as features. Discovered patterns allow semantic generalization among documents and provide clues for verifying the results. Experiments on several datasets resulted in good accuracy and macro-F1 values, meeting the requirements for practical application. Further analysis validated the effectiveness of MIPELD in mining interpretable information in text classification.
18

Golub, Koraljka. "Automatic Subject Indexing of Text." KNOWLEDGE ORGANIZATION 46, no. 2 (2019): 104–21. http://dx.doi.org/10.5771/0943-7444-2019-2-104.

Abstract:
Automatic subject indexing addresses problems of scale and sustainability and can be at the same time used to enrich existing metadata records, establish more connections across and between resources from various metadata and resource collections, and enhance consistency of the metadata. In this work, automatic subject indexing focuses on assigning index terms or classes from established knowledge organization systems (KOSs) for subject indexing like thesauri, subject headings systems and classification systems. The following major approaches are discussed, in terms of their similarities and differences, advantages and disadvantages for automatic assigned indexing from KOSs: “text categorization,” “document clustering,” and “document classification.” Text categorization is perhaps the most widespread, machine-learning approach with what seems generally good reported performance. Document clustering automatically both creates groups of related documents and extracts names of subjects depicting the group at hand. Document classification re-uses the intellectual effort invested into creating a KOS for subject indexing and even simple string-matching algorithms have been reported to achieve good results, because one concept can be described using a number of different terms, including equivalent, related, narrower and broader terms. Finally, applicability of automatic subject indexing to operative information systems and challenges of evaluation are outlined, suggesting the need for more research.
19

Aubaid, Asmaa M., and Alok Mishra. "A Rule-Based Approach to Embedding Techniques for Text Document Classification." Applied Sciences 10, no. 11 (June 10, 2020): 4009. http://dx.doi.org/10.3390/app10114009.

Abstract:
With the growth of online information and the sudden expansion in the number of electronic documents provided on websites and in electronic libraries, there is difficulty in categorizing text documents. Therefore, a rule-based approach is a solution to this problem; the purpose of this study is to classify documents by using a rule-based approach. This paper deals with the rule-based approach combined with an embedding technique that maps a document to a vector (doc2vec). An experiment was performed on two data sets, Reuters-21578 and the 20 Newsgroups, to classify the top ten categories of these data sets by using a document-to-vector rule-based method (D2vecRule). Finally, this method provided a good classification result according to the F-measure and implementation time metrics. In conclusion, it was observed that our algorithm, document-to-vector rule-based (D2vecRule), was good when compared with other algorithms such as JRip, OneR, and ZeroR applied to the same Reuters-21578 dataset.
20

Hong, Jiwon, Dongho Jeong, and Sang-Wook Kim. "Classifying Malicious Documents on the Basis of Plain-Text Features: Problem, Solution, and Experiences." Applied Sciences 12, no. 8 (April 18, 2022): 4088. http://dx.doi.org/10.3390/app12084088.

Abstract:
Cyberattacks widely occur by using malicious documents. A malicious document is an electronic document containing malicious codes along with some plain-text data that is human-readable. In this paper, we propose a novel framework that takes advantage of such plaintext data to determine whether a given document is malicious. We extracted plaintext features from the corpus of electronic documents and utilized them to train a classification model for detecting malicious documents. Our extensive experimental results with different combinations of three well-known vectorization strategies and three popular classification methods on five types of electronic documents demonstrate that our framework provides high prediction accuracy in detecting malicious documents.
21

Endalie, Demeke, Getamesay Haile, and Wondmagegn Taye Abebe. "Feature selection by integrating document frequency with genetic algorithm for Amharic news document classification." PeerJ Computer Science 8 (April 25, 2022): e961. http://dx.doi.org/10.7717/peerj-cs.961.

Abstract:
Text classification is the process of categorizing documents based on their content into a predefined set of categories. Text classification algorithms typically represent documents as collections of words and it deals with a large number of features. The selection of appropriate features becomes important when the initial feature set is quite large. In this paper, we present a hybrid of document frequency (DF) and genetic algorithm (GA)-based feature selection method for Amharic text classification. We evaluate this feature selection method on Amharic news documents obtained from the Ethiopian News Agency (ENA). The number of categories used in this study is 13. Our experimental results showed that the proposed feature selection method outperformed other feature selection methods utilized for Amharic news document classification. Combining the proposed feature selection method with Extra Tree Classifier (ETC) improves classification accuracy. It improves classification accuracy up to 1% higher than the hybrid of DF, information gain (IG), chi-square (CHI), and principal component analysis (PCA), 2.47% greater than GA and 3.86% greater than a hybrid of DF, IG, and CHI.
22

Puri, Shalini, and Satya Prakash Singh. "Hindi Text Document Classification System Using SVM and Fuzzy." International Journal of Rough Sets and Data Analysis 5, no. 4 (October 2018): 1–31. http://dx.doi.org/10.4018/ijrsda.2018100101.

Abstract:
In recent years, many information retrieval, character recognition, and feature extraction methodologies in Devanagari and especially in Hindi have been proposed for different domain areas. Due to enormous scanned data availability and to provide an advanced improvement of existing Hindi automated systems beyond optical character recognition, a new idea of Hindi printed and handwritten document classification system using support vector machine and fuzzy logic is introduced. This first pre-processes and then classifies textual imaged documents into predefined categories. With this concept, this article depicts a feasibility study of such systems with the relevance of Hindi, a survey report of statistical measurements of Hindi keywords obtained from different sources, and the inherent challenges found in printed and handwritten documents. The technical reviews are provided and graphically represented to compare many parameters and estimate contents, forms and classifiers used in various existing techniques.
23

Ayogu, I. I. "Exploring multinomial naïve Bayes for Yorùbá text document classification." Nigerian Journal of Technology 39, no. 2 (July 16, 2020): 528–35. http://dx.doi.org/10.4314/njt.v39i2.23.

Abstract:
The recent increase in the emergence of Nigerian language text online motivates this paper in which the problem of classifying text documents written in Yorùbá language into one of a few pre-designated classes is considered. Text document classification/categorization research is well established for English language and many other languages; this is not so for Nigerian languages. This paper evaluated the performance of a multinomial Naive Bayes model learned on a research dataset consisting of 100 samples of text each from business, sporting, entertainment, technology and political domains, separately on unigram, bigram and trigram features obtained using the bag of words representation approach. Results show that the performance of the model over unigram and bigram features is comparable but significantly better than a model learned on trigram features. The results generally indicate a possibility for the practical application of NB algorithm to the classification of text documents written in Yorùbá language. Keywords: Supervised learning, text classification, Yorùbá language, text mining, BoW Representation
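
A minimal sketch of the experimental setup described above, a multinomial Naive Bayes classifier trained separately on unigram, bigram and trigram bag-of-words features, is given below; the documents and labels are placeholders rather than the Yorùbá corpus used in the paper.

```python
# Sketch: multinomial Naive Bayes over unigram, bigram and trigram bag-of-words features.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["placeholder business story about markets and trade",
        "placeholder sports story about a league match",
        "placeholder politics story about an election campaign",
        "another placeholder sports story about the cup final"]
labels = ["business", "sport", "politics", "sport"]

for n in (1, 2, 3):                                   # unigram, bigram, trigram features
    model = make_pipeline(CountVectorizer(ngram_range=(n, n)), MultinomialNB())
    model.fit(docs, labels)
    print(f"{n}-gram prediction:",
          model.predict(["a placeholder story about a sports match"]))
```
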
24

Ernawati, Iin. "NAIVE BAYES CLASSIFIER DAN SUPPORT VECTOR MACHINE SEBAGAI ALTERNATIF SOLUSI UNTUK TEXT MINING." Jurnal Teknologi Informasi dan Pendidikan 12, no. 2 (December 10, 2019): 32–38. http://dx.doi.org/10.24036/tip.v12i2.219.

Abstract:
This study addresses text-based data mining, often called text mining, using the commonly applied classification methods Naïve Bayes classifier (NBC) and support vector machine (SVM). The classification focuses on Indonesian-language documents, while the relationship between documents is measured by probabilities that can be verified with other classification algorithms. This is evident from the conclusion that the Naïve Bayes classifier (NBC) probability places the word "party" at least in the economic and political documents, while the support vector machine (SVM) results show that the words "price" and "kpk" are contained in both the economic and political documents.
25

Girgis, M. R., and A. A. Aly. "A Feature Selection and Classification Technique for Text Categorization." International Journal of Cooperative Information Systems 12, no. 04 (December 2003): 441–54. http://dx.doi.org/10.1142/s0218843003000826.

Abstract:
Text categorization is the automated assigning of documents to predefined categories based on their contents. It involves two main tasks — feature selection and document classification. This paper discusses the weak points of the text categorization technique developed by Maron and modified by Lewis. Then, it introduces a technique for text categorization that uses new formulas for feature selection and document classification. These formulas have been formulated to overcome the weak points of Maron's and Lewis' techniques. Also, the paper describes the design of an experimental text categorization system that is composed of the same set of processes as the MAXCAT system developed by Lewis. The paper presents and analyses the results of applying the system on a set of training and test documents by using Lewis' and the proposed formulas. In addition, a method for separately evaluating the effectiveness of feature selection is given. Finally, the impact of the feature set size on the effectiveness of the classification system is investigated, using the system and applying one of the proposed classification formulas with different feature set sizes.
26

Reddy, S. Sai Satyanarayana. "A Novel Document Weighted Approach for Text Classification." Journal of Computers 15, no. 3 (2020): 105–13. http://dx.doi.org/10.17706/jcp.15.3.105-113.

27

Harish, B. S. "Text Document Classification: An Approach Based on Indexing." International Journal of Data Mining & Knowledge Management Process 2, no. 1 (January 31, 2012): 43–62. http://dx.doi.org/10.5121/ijdkp.2012.2104.

28

Subba, Sanjeev, Nawaraj Paudel, and Tej Bahadur Shahi. "Nepali Text Document Classification Using Deep Neural Network." Tribhuvan University Journal 33, no. 1 (June 30, 2019): 11–22. http://dx.doi.org/10.3126/tuj.v33i1.28677.

Abstract:
Automated text classification is a well-studied problem in text mining which generally demands the automatic assignment of a label or class to a particular text document on the basis of its content. To design a computer program that learns a model from training data in order to assign the specific label to an unseen text document, many researchers have applied deep learning technologies. For the Nepali language, this is the first attempt to use deep learning, especially the Recurrent Neural Network (RNN), and compare its performance to a traditional Multilayer Neural Network (MNN). In this study, Nepali texts were collected from online news portals, and their pre-processing and vectorization were carried out. Finally, a deep learning classification framework was designed and evaluated in ten experiments: five for the Recurrent Neural Network and five for the Multilayer Neural Network. On comparing the results of the MNN and RNN, it can be concluded that the RNN outperformed the MNN, as the highest accuracy achieved by the MNN is 48% and the highest accuracy achieved by the RNN is 63%.
29

Wen, Jiahui, Guangda Zhang, Hongyun Zhang, Wei Yin, and Jingwei Ma. "Speculative text mining for document-level sentiment classification." Neurocomputing 412 (October 2020): 52–62. http://dx.doi.org/10.1016/j.neucom.2020.06.024.

30

Ke, Weimao. "Least information document representation for automated text classification." Proceedings of the American Society for Information Science and Technology 49, no. 1 (2012): 1–10. http://dx.doi.org/10.1002/meet.14504901118.

31

Wijewickrema, PKCM, and RCG Gamage. "An enhanced text classifier for automatic document classification." Journal of the University Librarians Association of Sri Lanka 16, no. 2 (February 4, 2013): 138. http://dx.doi.org/10.4038/jula.v16i2.5205.

32

Yatsko, V. A. "A New Method of Automatic Text Document Classification." Automatic Documentation and Mathematical Linguistics 55, no. 3 (May 2021): 122–33. http://dx.doi.org/10.3103/s0005105521030080.

33

Jalal, Ahmed Adeeb, and Basheer Husham Ali. "Text documents clustering using data mining techniques." International Journal of Electrical and Computer Engineering (IJECE) 11, no. 1 (February 1, 2021): 664. http://dx.doi.org/10.11591/ijece.v11i1.pp664-670.

Abstract:
Increasing progress in numerous research fields and information technologies has led to an increase in the publication of research papers. Therefore, researchers take a lot of time to find interesting research papers that are close to their field of specialization. Consequently, in this paper we propose a document classification approach that can cluster the text documents of research papers into meaningful categories covering a similar scientific field. The presented approach is based on the essential focus and scopes of the target categories, where each of these categories includes many topics. Accordingly, we extract the word tokens from the topics that relate to a specific category, separately. The frequency of word tokens in documents affects the weight of a document, which is calculated by using the numerical statistic term frequency-inverse document frequency (TF-IDF). The proposed approach uses the title, abstract, and keywords of the paper, in addition to the category topics, to perform the classification process. Subsequently, documents are classified and clustered into the primary categories based on the highest cosine similarity between category weights and document weights.
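
A minimal sketch of the matching step described above: category keyword profiles and documents are vectorised with TF-IDF in a shared space, and each document is assigned to the category with the highest cosine similarity. The category keyword lists and documents are invented placeholders.

```python
# Sketch: assign documents to categories by cosine similarity between
# TF-IDF document vectors and TF-IDF vectors of category keyword profiles.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

categories = {                                   # placeholder keyword profiles
    "machine_learning": "classification neural network training model features",
    "databases": "query index transaction storage sql schema",
}
documents = [
    "we train a neural network model for text classification",
    "an index structure that speeds up sql query processing",
]

vectorizer = TfidfVectorizer()
vectorizer.fit(list(categories.values()) + documents)        # shared vocabulary
cat_vecs = vectorizer.transform(categories.values())
doc_vecs = vectorizer.transform(documents)

sims = cosine_similarity(doc_vecs, cat_vecs)
names = list(categories)
for doc, row in zip(documents, sims):
    print(names[int(np.argmax(row))], "<-", doc)
```
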
34

CS, Vijayashree, Shobha Rani, and Vasudev T. "An Unsupervised Classification Technique for Detection of Flipped Orientations in Document Images." International Journal of Electrical and Computer Engineering (IJECE) 6, no. 5 (October 1, 2016): 2140. http://dx.doi.org/10.11591/ijece.v6i5.10785.

Abstract:
Detection of text orientation in document images is of preliminary concern prior to processing of documents by Optical Character Reader. The text direction in document images should exist generally in a specific orientation, i.e., text direction for any automated document reading system. The flipped text orientation leads to an unambiguous result in such fully automated systems. In this paper, we focus on development of a text orientation direction detection module which can be incorporated as the perquisite process in an automatic reading system. Orientation direction detection of text is performed through employing directional gradient features of the document image and adapts an unsupervised learning approach for detection of the flipped text orientation at which the document has been originally fed into the scanning device. The unsupervised learning is built on the directional gradient features of the text of the document based on four possible different orientations. The algorithm is experimented on document samples of printed plain English text as well as filled-in pre-printed forms of Telugu script. The outcome attained by the algorithm proves to be consistent and adequate, with an average accuracy around 94%.
35

CS, Vijayashree, Shobha Rani, and Vasudev T. "An Unsupervised Classification Technique for Detection of Flipped Orientations in Document Images." International Journal of Electrical and Computer Engineering (IJECE) 6, no. 5 (October 1, 2016): 2140. http://dx.doi.org/10.11591/ijece.v6i5.pp2140-2149.

Abstract:
Detection of text orientation in document images is of preliminary concern prior to processing of documents by Optical Character Reader. The text direction in document images should exist generally in a specific orientation, i.e., text direction for any automated document reading system. The flipped text orientation leads to an unambiguous result in such fully automated systems. In this paper, we focus on development of a text orientation direction detection module which can be incorporated as the perquisite process in an automatic reading system. Orientation direction detection of text is performed through employing directional gradient features of the document image and adapts an unsupervised learning approach for detection of the flipped text orientation at which the document has been originally fed into the scanning device. The unsupervised learning is built on the directional gradient features of the text of the document based on four possible different orientations. The algorithm is experimented on document samples of printed plain English text as well as filled-in pre-printed forms of Telugu script. The outcome attained by the algorithm proves to be consistent and adequate, with an average accuracy around 94%.
36

Jia, Longjia, and Bangzuo Zhang. "A new document representation based on global policy for supervised term weighting schemes in text categorization." Mathematical Biosciences and Engineering 19, no. 5 (2022): 5223–40. http://dx.doi.org/10.3934/mbe.2022245.

Abstract:
There are two main factors involved in document classification: the document representation method and the classification algorithm. In this study, we focus on the document representation method and demonstrate that the choice of representation methods has impacts on the quality of classification results. We propose a document representation strategy for supervised text classification named document representation based on global policy (DRGP), which can obtain an appropriate document representation according to the distribution of terms. The main idea of DRGP is to construct the optimization function through the importance of terms to different categories. In the experiments, we investigate the effects of DRGP on the 20 Newsgroups and Reuters21578 datasets, using SVM as the classifier. The results show that DRGP outperforms other text representation strategy schemes, such as Document Max, Document Two Max and global policy.
37

Martinčić-Ipšić, Sanda, Tanja Miličić, and Todorovski. "The Influence of Feature Representation of Text on the Performance of Document Classification." Applied Sciences 9, no. 4 (February 20, 2019): 743. http://dx.doi.org/10.3390/app9040743.

Abstract:
In this paper we perform a comparative analysis of three models for a feature representation of text documents in the context of document classification. In particular, we consider the most often used family of bag-of-words models, the recently proposed continuous space models word2vec and doc2vec, and the model based on the representation of text documents as language networks. While the bag-of-word models have been extensively used for the document classification task, the performance of the other two models for the same task have not been well understood. This is especially true for the network-based models that have been rarely considered for the representation of text documents for classification. In this study, we measure the performance of the document classifiers trained using the method of random forests for features generated with the three models and their variants. Multi-objective rankings are proposed as the framework for multi-criteria comparative analysis of the results. Finally, the results of the empirical comparison show that the commonly used bag-of-words model has a performance comparable to the one obtained by the emerging continuous-space model of doc2vec. In particular, the low-dimensional variants of doc2vec generating up to 75 features are among the top-performing document representation models. The results finally point out that doc2vec shows a superior performance in the tasks of classifying large documents.
38

Oliveira Gonçalves, Carlos Adriano, Rui Camacho, Célia Talma Gonçalves, Adrián Seara Vieira, Lourdes Borrajo Diz, and Eva Lorenzo Iglesias. "Classification of Full Text Biomedical Documents: Sections Importance Assessment." Applied Sciences 11, no. 6 (March 17, 2021): 2674. http://dx.doi.org/10.3390/app11062674.

Abstract:
The exponential growth of documents in the web makes it very hard for researchers to be aware of the relevant work being done within the scientific community. The task of efficiently retrieving information has therefore become an important research topic. The objective of this study is to test how the efficiency of the text classification changes if different weights are previously assigned to the sections that compose the documents. The proposal takes into account the place (section) where terms are located in the document, and each section has a weight that can be modified depending on the corpus. To carry out the study, an extended version of the OHSUMED corpus with full documents have been created. Through the use of WEKA, we compared the use of abstracts only with that of full texts, as well as the use of section weighing combinations to assess their significance in the scientific article classification process using the SMO (Sequential Minimal Optimization), the WEKA Support Vector Machine (SVM) algorithm implementation. The experimental results show that the proposed combinations of the preprocessing techniques and feature selection achieve promising results for the task of full text scientific document classification. We also have evidence to conclude that enriched datasets with text from certain sections achieve better results than using only titles and abstracts.
39

Choo, Wou Onn, Lam Hong Lee, Yen Pei Tay, Khang Wen Goh, Dino Isa, and Suliman Mohamed Fati. "Automatic Folder Allocation System for Electronic Text Document Repositories Using Enhanced Bayesian Classification Approach." International Journal of Intelligent Information Technologies 15, no. 2 (April 2019): 1–19. http://dx.doi.org/10.4018/ijiit.2019040101.

Abstract:
This article proposes a system equipped with the enhanced Bayesian classification techniques to automatically assign folders to store electronic text documents. Despite computer technology advancements in the information age where electronic text files are so pervasive in information exchange, almost every single document created or downloaded from the Internet requires manual classification by the users before being deposited into a folder in a computer. Not only does such a tedious task cause inconvenience to users, the time taken to repeatedly classify and allocate a folder for each text document impedes productivity, especially when dealing with a huge number of files and deep layers of folders. In order to overcome this, a prototype system is built to evaluate the performance of the enhanced Bayesian text classifier for automatic folder allocation, by categorizing text documents based on the existing types of text documents and folders present in user's hard drive. In this article, the authors deploy a High Relevance Keyword Extraction (HRKE) technique and an Automatic Computed Document Dependent (ACDD) Weighting Factor technique to a Bayesian classifier in order to obtain better classification accuracy, while maintaining the low training cost and simple classifying processes using the conventional Bayesian approach.
40

Kim, Byoungwook, Yeongwook Yang, Ji Su Park, and Hong-Jun Jang. "A Convolution Neural Network-Based Representative Spatio-Temporal Documents Classification for Big Text Data." Applied Sciences 12, no. 8 (April 11, 2022): 3843. http://dx.doi.org/10.3390/app12083843.

Abstract:
With the proliferation of mobile devices, the amount of social media users and online news articles are rapidly increasing, and text information online is accumulating as big data. As spatio-temporal information becomes more important, research on extracting spatiotemporal information from online text data and utilizing it for event analysis is being actively conducted. However, if spatiotemporal information that does not describe the core subject of a document is extracted, it is rather difficult to guarantee the accuracy of core event analysis. Therefore, it is important to extract spatiotemporal information that describes the core topic of a document. In this study, spatio-temporal information describing the core topic of a document is defined as ‘representative spatio-temporal information’, and documents containing representative spatiotemporal information are defined as ‘representative spatio-temporal documents’. We proposed a character-level Convolution Neuron Network (CNN)-based document classifier to classify representative spatio-temporal documents. To train the proposed CNN model, 7400 training data were constructed for representative spatio-temporal documents. The experimental results show that the proposed CNN model outperforms traditional machine learning classifiers and existing CNN-based classifiers.
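
A character-level CNN document classifier of the general kind described above can be sketched in a few lines of PyTorch; the alphabet size, sequence length and layer sizes below are arbitrary assumptions, not the paper's configuration.

```python
# Sketch: a small character-level CNN text classifier (layer sizes are assumptions).
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    def __init__(self, n_chars=70, emb_dim=16, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(n_chars, emb_dim)
        self.conv = nn.Sequential(
            nn.Conv1d(emb_dim, 64, kernel_size=7), nn.ReLU(), nn.MaxPool1d(3),
            nn.Conv1d(64, 64, kernel_size=3), nn.ReLU(), nn.AdaptiveMaxPool1d(1),
        )
        self.fc = nn.Linear(64, n_classes)

    def forward(self, x):                    # x: (batch, seq_len) of character ids
        e = self.embed(x).transpose(1, 2)    # -> (batch, emb_dim, seq_len)
        h = self.conv(e).squeeze(-1)         # -> (batch, 64)
        return self.fc(h)                    # class logits

model = CharCNN()
dummy = torch.randint(0, 70, (8, 256))       # a batch of 8 character-encoded documents
print(model(dummy).shape)                    # torch.Size([8, 2])
```
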
41

Yan, Yi-Fan, Sheng-Jun Huang, Shaoyi Chen, Meng Liao, and Jin Xu. "Active Learning with Query Generation for Cost-Effective Text Classification." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 04 (April 3, 2020): 6583–90. http://dx.doi.org/10.1609/aaai.v34i04.6133.

Abstract:
Labeling a text document is usually time consuming because it requires the annotator to read the whole document and check its relevance with each possible class label. It thus becomes rather expensive to train an effective model for text classification when it involves a large dataset of long documents. In this paper, we propose an active learning approach for text classification with lower annotation cost. Instead of scanning all the examples in the unlabeled data pool to select the best one for query, the proposed method automatically generates the most informative examples based on the classification model, and thus can be applied to tasks with large scale or even infinite unlabeled data. Furthermore, we propose to approximate the generated example with a few summary words by sparse reconstruction, which allows the annotators to easily assign the class label by reading a few words rather than the long document. Experiments on different datasets demonstrate that the proposed approach can effectively improve the classification performance while significantly reduce the annotation cost.
42

Ali, Zuhair. "TEXT CLASSIFICATION BASED ON FUZZY RADIAL BASIS FUNCTION." Iraqi Journal for Computers and Informatics 45, no. 1 (May 1, 2019): 11–14. http://dx.doi.org/10.25195/ijci.v45i1.40.

Abstract:
Automated classification of text into predefined categories has always been considered a vital method in the natural language processing field. In this paper, new methods based on the Radial Basis Function (RBF) and the Fuzzy Radial Basis Function (FRBF) are used to solve the problem of text classification: a set of features is extracted for each sentence in the document collection, and these features are introduced to the FRBF and RBF to classify documents. The Reuters 21578 dataset is utilized for the purpose of text classification. The results showed that the effectiveness of the FRBF is better than that of the RBF.
43

Tawdar, A. P., M. S. Bewoor, and S. H. Patil. "Incremental Approach of Neural Network in Back Propagation Algorithms for Web Data Mining." IAES International Journal of Artificial Intelligence (IJ-AI) 6, no. 2 (June 1, 2017): 74. http://dx.doi.org/10.11591/ijai.v6.i2.pp74-78.

Abstract:
Text Classification, also called Text Categorization (TC), is the task of classifying a set of text documents automatically into different categories from a predefined set. If a text document relates to exactly one of the categories, it is called a single-label classification task; otherwise, it is called a multi-label classification task. TC uses several tools from Information Retrieval (IR) and Machine Learning (ML) and has received much attention in the last decades. This paper first classifies text documents using an MLP-based machine learning approach (BPP) and then returns the most relevant documents. It also describes a proposed back propagation neural network classifier that performs cross-validation for the original neural network in order to optimize classification accuracy and training time. The proposed web content mining methodology is explored with the aid of BPP. The main objective of this investigation is web document extraction utilizing different grouping algorithms. This work extracts the data from the web URL.
44

Saad, Yaqeen, and Khaled Shaker. "Support Vector Machine and Back Propagation Neural Network Approach for Text Classification." Journal of University of Human Development 3, no. 2 (June 30, 2017): 869. http://dx.doi.org/10.21928/juhd.v3n2y2017.pp869-876.

Abstract:
Text classification is the process of assigning text to one or more categories. Text categorization has many significant applications, mostly in the field of organization and for browsing within large groups of documents, and it is often accomplished by means of machine learning, since the system is built based on a wide range of document features. Feature selection is an important approach within this process, since there are typically several thousand possible feature terms. Within text categorization, the goal of feature selection is to improve the efficiency of procedures and the reliability of classification by deleting irrelevant and non-essential terms, while keeping terms that hold enough information to facilitate the classification task. The goal of this work is to build efficient text categorization models. Within text mining algorithms, a document is represented as a vector whose dimension is the number of distinct keywords in it, which can be very large, so classic document categorization may be computationally costly. Therefore, feature extraction through singular value decomposition is employed to reduce the dimensionality of the documents, and we apply classification algorithms based on back propagation and the Support Vector Machine methodology. Before classification we applied the Principal Component Analysis technique in order to improve result accuracy. We then compared the performance of these two algorithms by computing standard precision and recall for the documents collection.
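
The SVD-based reduction described above corresponds, in scikit-learn terms, to the standard latent semantic analysis pipeline sketched below; the dataset, the number of components and the use of LinearSVC in place of the paper's back propagation network are assumptions.

```python
# Sketch: TF-IDF vectors compressed with truncated SVD (LSA), then an SVM classifier.
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.svm import LinearSVC

data = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"],
                          remove=("headers", "footers", "quotes"))

model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    TruncatedSVD(n_components=100),   # SVD-based feature reduction
    Normalizer(copy=False),
    LinearSVC(),
)
print(cross_val_score(model, data.data, data.target, cv=3).mean())
```
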
45

Alsmadi, Issa, and Keng Hoon Gan. "Review of short-text classification." International Journal of Web Information Systems 15, no. 2 (June 17, 2019): 155–82. http://dx.doi.org/10.1108/ijwis-12-2017-0083.

Abstract:
Purpose: Rapid developments in social networks and their usage in everyday life have caused an explosion in the amount of short electronic documents. Thus, the need to classify this type of document based on its content has a significant implication in many applications, and classifying these documents into relevant classes according to their text content is of interest for many practical reasons. Short-text classification is an essential step in many applications, such as spam filtering, sentiment analysis, Twitter personalization, customer review and many other applications related to social networks. Reviews on short text and its applications are limited. Thus, this paper aims to discuss the characteristics of short text and its challenges and difficulties in classification. The paper attempts to introduce all the stages of a principled classification, the techniques used in each stage and the possible development trends in each stage. Design/methodology/approach: The paper is a review of the main aspects of short-text classification and is structured based on the stages of the classification task. Findings: This paper discusses related issues and approaches to these problems. Further research could be conducted to address the challenges in short texts and avoid poor accuracy in classification. Problems of low performance can be solved by using optimized solutions, such as genetic algorithms that are powerful in enhancing the quality of selected features. Soft computing solutions have a fuzzy logic that makes short-text problems a promising area of research. Originality/value: Using a powerful short-text classification method significantly affects many applications in terms of efficiency enhancement. Current solutions still have low performance, implying the need for improvement. This paper discusses related issues and approaches to these problems.
46

Nohuddin, et al. "Content analytics based on random forest classification technique: An empirical evaluation using online news dataset." International Journal of ADVANCED AND APPLIED SCIENCES 8, no. 2 (February 2021): 77–84. http://dx.doi.org/10.21833/ijaas.2021.02.011.

Abstract:
In this paper, a study is established for exploiting a document classification technique for categorizing a set of random online documents. The technique aims to assign one or more classes or categories to a document, making it easier to manage and sort. This paper describes an experiment on the proposed method for classifying documents effectively using the decision tree technique. The proposed research framework is Document Analysis based on the Random Forest Algorithm (DARFA). The proposed framework consists of 5 components, which are (i) Document dataset, (ii) Data Preprocessing, (iii) Document Term Matrix, (iv) Random Forest classification, and (v) Visualization. The proposed classification method can analyze the content of document datasets and classify documents according to their text content. The proposed framework uses algorithms that include TF-IDF and the Random Forest algorithm. The outcome of this study serves as an enhancement to document management procedures such as managing documents in daily business operations, consolidating inventory systems, organizing files in databases, and categorizing document folders.
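
In scikit-learn terms, the core of the framework above reduces to a TF-IDF document-term matrix fed to a random forest classifier. The sketch below shows only that generic pipeline with invented news snippets, not the DARFA framework itself.

```python
# Sketch: a TF-IDF document-term matrix classified with a random forest.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

news = ["the central bank raised interest rates", "the home side won the derby",
        "inflation figures surprised the markets", "the coach praised his midfielders"]
topics = ["economy", "sport", "economy", "sport"]

model = make_pipeline(TfidfVectorizer(stop_words="english"),
                      RandomForestClassifier(n_estimators=100, random_state=0))
model.fit(news, topics)
print(model.predict(["rates and inflation worry the markets"]))
```
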
47

Onan, Aytug, Hasan Bulut, and Serdar Korukoglu. "An improved ant algorithm with LDA-based representation for text document clustering." Journal of Information Science 43, no. 2 (March 1, 2016): 275–92. http://dx.doi.org/10.1177/0165551516638784.

Abstract:
Document clustering can be applied in document organisation and browsing, document summarisation and classification. The identification of an appropriate representation for textual documents is extremely important for the performance of clustering or classification algorithms. Textual documents suffer from the high dimensionality and irrelevancy of text features. Besides, conventional clustering algorithms suffer from several shortcomings, such as slow convergence and sensitivity to the initial value. To tackle the problems of conventional clustering algorithms, metaheuristic algorithms are frequently applied to clustering. In this paper, an improved ant clustering algorithm is presented, where two novel heuristic methods are proposed to enhance the clustering quality of ant-based clustering. In addition, the latent Dirichlet allocation (LDA) is used to represent textual documents in a compact and efficient way. The clustering quality of the proposed ant clustering algorithm is compared to the conventional clustering algorithms using 25 text benchmarks in terms of F-measure values. The experimental results indicate that the proposed clustering scheme outperforms the compared conventional and metaheuristic clustering methods for textual documents.
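
Only the representation side is illustrated here: documents are mapped to compact LDA topic distributions and then clustered, with plain k-means standing in for the paper's improved ant-based algorithm; the corpus and the topic and cluster counts are placeholders.

```python
# Sketch: LDA topic distributions as a compact document representation,
# clustered with k-means (a stand-in for the paper's ant-based algorithm).
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

docs = ["stocks and markets and trading", "football match and league results",
        "bond markets and interest rates", "the cup final and the league table"]

pipeline = make_pipeline(
    CountVectorizer(stop_words="english"),
    LatentDirichletAllocation(n_components=2, random_state=0),  # topic proportions per document
    KMeans(n_clusters=2, n_init=10, random_state=0),
)
labels = pipeline.fit_predict(docs)
print(labels)
```
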
48

Puri, Shalini, and Satya Prakash Singh. "A Fuzzy Matching based Image Classification System for Printed and Handwritten Text Documents." Journal of Information Technology Research 13, no. 2 (April 2020): 155–94. http://dx.doi.org/10.4018/jitr.2020040110.

Abstract:
This article proposes a bi-leveled image classification system to classify printed and handwritten English documents into mutually exclusive predefined categories. The proposed system follows the steps of preprocessing, segmentation, feature extraction, and SVM based character classification at level 1, and word association and fuzzy matching based document classification at level 2. The system architecture and its modular structure discuss various task stages and their functionalities. Further, a case study on document classification is discussed to show the internal score computations of words and keywords with fuzzy matching. The experiments on proposed system illustrate that the system achieves promising results in the time-efficient manner and achieves better accuracy with less computation time for printed documents than handwritten ones. Finally, the performance of the proposed system is compared with the existing systems and it is observed that proposed system performs better than many other systems.
49

Dau, Hoan Manh, and Ning Xu. "Text Document Classification Using Support Vector Machine with Feature Selection Using Singular Value Decomposition." Advanced Materials Research 905 (April 2014): 528–32. http://dx.doi.org/10.4028/www.scientific.net/amr.905.528.

Abstract:
Text document classification is the task of analyzing the content of a text document and then deciding (or predicting) to which group, among the given ones, the document belongs. There are many classification techniques, such as methods based on Naive Bayes, decision trees, k-Nearest Neighbors (KNN), neural networks, and the Support Vector Machine (SVM). Among these techniques, SVM is considered a popular and powerful one; in particular, it is well suited to classifying huge, multidimensional data. Text documents are characterized by very high dimensionality, and selecting features before classifying affects the classification results. The Support Vector Machine is a very effective method in this field. This article studies the Support Vector Machine and applies it to the problem of text document classification. The study shows that the Support Vector Machine with features chosen by the singular value decomposition (SVD) method performs better than the other methods, including the decision tree.
50

BASILI, ROBERTO, and ALESSANDRO MOSCHITTI. "INTELLIGENT NLP-DRIVEN TEXT CLASSIFICATION." International Journal on Artificial Intelligence Tools 11, no. 03 (September 2002): 389–423. http://dx.doi.org/10.1142/s0218213002000952.

Abstract:
Information Retrieval (IR) and NLP-driven Information Extraction (IE) are complementary activities. IR helps in locating specific documents within a huge search space (localization) while IE supports the localization of specific information within a document (extraction or explanation). In application scenarios both capabilities are usually needed. IE is important here, as it can enrich the IR inferences with motivating information. Works on Web-based IR suggest that embedding linguistic information (e.g. sense distinctions) at a suitable level within traditional quantitative approaches (e.g. query expansion as in [26]) is a promising approach. "Which linguistic level is best suited to which IR mechanism" is the interesting representational problem posed by the current research stage. This is also the central concern of this paper. A traditional method for efficient text categorization is here presented. Original features of the proposed model are a self-adapting parameterized weighting model and the use of linguistic information. The key idea is the integration of NLP methods within a robust and efficient TC framework. This allows to combine benefits of large scale and efficient IR with the richer expressivity closer to IE. In this paper we capitalize the systematic benchmarking resources available in TC to extensively derive empirical evidence about the above representational problem. The positive experimental results confirm that the proposed TC framework characterizes as a viable approach to intelligent text categorization on a large scale.
