Добірка наукової літератури з теми "Text document classification"

Оформте джерело за APA, MLA, Chicago, Harvard та іншими стилями

Оберіть тип джерела:

Ознайомтеся зі списками актуальних статей, книг, дисертацій, тез та інших наукових джерел на тему "Text document classification".

Біля кожної праці в переліку літератури доступна кнопка «Додати до бібліографії». Скористайтеся нею – і ми автоматично оформимо бібліографічне посилання на обрану працю в потрібному вам стилі цитування: APA, MLA, «Гарвард», «Чикаго», «Ванкувер» тощо.

Також ви можете завантажити повний текст наукової публікації у форматі «.pdf» та прочитати онлайн анотацію до роботи, якщо відповідні параметри наявні в метаданих.

Статті в журналах з теми "Text document classification":

1

Mukherjee, Indrajit, Prabhat Kumar Mahanti, Vandana Bhattacharya, and Samudra Banerjee. "Text classification using document-document semantic similarity." International Journal of Web Science 2, no. 1/2 (2013): 1. http://dx.doi.org/10.1504/ijws.2013.056572.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
2

Cheng, Betty Yee Man, Jaime G. Carbonell, and Judith Klein-Seetharaman. "Protein classification based on text document classification techniques." Proteins: Structure, Function, and Bioinformatics 58, no. 4 (January 11, 2005): 955–70. http://dx.doi.org/10.1002/prot.20373.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
3

Yao, Liang, Chengsheng Mao, and Yuan Luo. "Graph Convolutional Networks for Text Classification." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 7370–77. http://dx.doi.org/10.1609/aaai.v33i01.33017370.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
Анотація:
Text classification is an important and classical problem in natural language processing. There have been a number of studies that applied convolutional neural networks (convolution on regular grid, e.g., sequence) to classification. However, only a limited number of studies have explored the more flexible graph convolutional neural networks (convolution on non-grid, e.g., arbitrary graph) for the task. In this work, we propose to use graph convolutional networks for text classification. We build a single text graph for a corpus based on word co-occurrence and document word relations, then learn a Text Graph Convolutional Network (Text GCN) for the corpus. Our Text GCN is initialized with one-hot representation for word and document, it then jointly learns the embeddings for both words and documents, as supervised by the known class labels for documents. Our experimental results on multiple benchmark datasets demonstrate that a vanilla Text GCN without any external word embeddings or knowledge outperforms state-of-the-art methods for text classification. On the other hand, Text GCN also learns predictive word and document embeddings. In addition, experimental results show that the improvement of Text GCN over state-of-the-art comparison methods become more prominent as we lower the percentage of training data, suggesting the robustness of Text GCN to less training data in text classification.
4

Kim, Jiyun, and Han-joon Kim. "Multidimensional Text Warehousing for Automated Text Classification." Journal of Information Technology Research 11, no. 2 (April 2018): 168–83. http://dx.doi.org/10.4018/jitr.2018040110.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
Анотація:
This article describes how, in the era of big data, a data warehouse is an integrated multidimensional database that provides the basis for the decision making required to establish crucial business strategies. Efficient, effective analysis requires a data organization system that integrates and manages data of various dimensions. However, conventional data warehousing techniques do not consider the various data manipulation operations required for data-mining activities. With the current explosion of text data, much research has examined text (or document) repositories to support text mining and document retrieval. Therefore, this article presents a method of developing a text warehouse that provides a machine-learning-based text classification service. The document is represented as a term-by-concept matrix using a 3rd-order tensor-based textual representation model, which emphasizes the meaning of words occurring in the document. As a result, the proposed text warehouse makes it possible to develop a semantic Naïve Bayes text classifier only by executing appropriate SQL statements.
5

P, Ashokkumar, Siva Shankar G, Gautam Srivastava, Praveen Kumar Reddy Maddikunta, and Thippa Reddy Gadekallu. "A Two-stage Text Feature Selection Algorithm for Improving Text Classification." ACM Transactions on Asian and Low-Resource Language Information Processing 20, no. 3 (May 2021): 1–19. http://dx.doi.org/10.1145/3425781.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
Анотація:
As the number of digital text documents increases on a daily basis, the classification of text is becoming a challenging task. Each text document consists of a large number of words (or features) that drive down the efficiency of a classification algorithm. This article presents an optimized feature selection algorithm designed to reduce a large number of features to improve the accuracy of the text classification algorithm. The proposed algorithm uses noun-based filtering, a word ranking that enhances the performance of the text classification algorithm. Experiments are carried out on three benchmark datasets, and the results show that the proposed classification algorithm has achieved the maximum accuracy when compared to the existing algorithms. The proposed algorithm is compared to Term Frequency-Inverse Document Frequency, Balanced Accuracy Measure, GINI Index, Information Gain, and Chi-Square. The experimental results clearly show the strength of the proposed algorithm.
6

Zheng, Jianming, Yupu Guo, Chong Feng, and Honghui Chen. "A Hierarchical Neural-Network-Based Document Representation Approach for Text Classification." Mathematical Problems in Engineering 2018 (2018): 1–10. http://dx.doi.org/10.1155/2018/7987691.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
Анотація:
Document representation is widely used in practical application, for example, sentiment classification, text retrieval, and text classification. Previous work is mainly based on the statistics and the neural networks, which suffer from data sparsity and model interpretability, respectively. In this paper, we propose a general framework for document representation with a hierarchical architecture. In particular, we incorporate the hierarchical architecture into three traditional neural-network models for document representation, resulting in three hierarchical neural representation models for document classification, that is, TextHFT, TextHRNN, and TextHCNN. Our comprehensive experimental results on two public datasets, that is, Yelp 2016 and Amazon Reviews (Electronics), show that our proposals with hierarchical architecture outperform the corresponding neural-network models for document classification, resulting in a significant improvement ranging from 4.65% to 35.08% in terms of accuracy with a comparable (or substantially less) expense of time consumption. In addition, we find that the long documents benefit more from the hierarchical architecture than the short ones as the improvement in terms of accuracy on long documents is greater than that on short documents.
7

Lee, Kangwook, Sanggyu Han, and Sung-Hyon Myaeng. "A discourse-aware neural network-based text model for document-level text classification." Journal of Information Science 44, no. 6 (December 4, 2017): 715–35. http://dx.doi.org/10.1177/0165551517743644.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
Анотація:
Capturing semantics scattered across entire text is one of the important issues for Natural Language Processing (NLP) tasks. It would be particularly critical with long text embodying a flow of themes. This article proposes a new text modelling method that can handle thematic flows of text with Deep Neural Networks (DNNs) in such a way that discourse information and distributed representations of text are incorporate. Unlike previous DNN-based document models, the proposed model enables discourse-aware analysis of text and composition of sentence-level distributed representations guided by the discourse structure. More specifically, our method identifies Elementary Discourse Units (EDUs) and their discourse relations in a given document by applying Rhetorical Structure Theory (RST)-based discourse analysis. The result is fed into a tree-structured neural network that reflects the discourse information including the structure of the document and the discourse roles and relation types. We evaluate the document model for two document-level text classification tasks, sentiment analysis and sarcasm detection, with comparisons against the reference systems that also utilise discourse information. In addition, we conduct additional experiments to evaluate the impact of neural network types and adopted discourse factors on modelling documents vis-à-vis the two classification tasks. Furthermore, we investigate the effects of various learning methods, input units on the quality of the proposed discourse-aware document model.
8

Rahamat Basha, S., J. Keziya Rani, and J. J. C. Prasad Yadav. "A Novel Summarization-based Approach for Feature Reduction Enhancing Text Classification Accuracy." Engineering, Technology & Applied Science Research 9, no. 6 (December 1, 2019): 5001–5. http://dx.doi.org/10.48084/etasr.3173.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
Анотація:
Automatic summarization is the process of shortening one (in single document summarization) or multiple documents (in multi-document summarization). In this paper, a new feature selection method for the nearest neighbor classifier by summarizing the original training documents based on sentence importance measure is proposed. Our approach for single document summarization uses two measures for sentence similarity: the frequency of the terms in one sentence and the similarity of that sentence to other sentences. All sentences were ranked accordingly and the sentences with top ranks (with a threshold constraint) were selected for summarization. The summary of every document in the corpus is taken into a new document used for the summarization evaluation process.
9

M.Shaikh, Mustafa, Ashwini A. Pawar, and Vibha B. Lahane. "Pattern Discovery Text Mining for Document Classification." International Journal of Computer Applications 117, no. 1 (May 20, 2015): 6–12. http://dx.doi.org/10.5120/20516-2101.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
10

Elhadad, Mohamed, Khaled Badran, and Gouda Salama. "Towards Ontology-Based web text Document Classification." Journal of Engineering Science and Military Technologies 17, no. 17 (April 1, 2017): 1–8. http://dx.doi.org/10.21608/ejmtc.2017.21564.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.

Дисертації з теми "Text document classification":

1

Mondal, Abhro Jyoti. "Document Classification using Characteristic Signatures." University of Cincinnati / OhioLINK, 2017. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1511793852923472.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
2

Sendur, Zeynel. "Text Document Categorization by Machine Learning." Scholarly Repository, 2008. http://scholarlyrepository.miami.edu/oa_theses/209.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
Анотація:
Because of the explosion of digital and online text information, automatic organization of documents has become a very important research area. There are mainly two machine learning approaches to enhance the organization task of the digital documents. One of them is the supervised approach, where pre-defined category labels are assigned to documents based on the likelihood suggested by a training set of labeled documents; and the other one is the unsupervised approach, where there is no need for human intervention or labeled documents at any point in the whole process. In this thesis, we concentrate on the supervised learning task which deals with document classification. One of the most important tasks of information retrieval is to induce classifiers capable of categorizing text documents. The same document can belong to two or more categories and this situation is referred by the term multi-label classification. Multi-label classification domains have been encountered in diverse fields. Most of the existing machine learning techniques which are in multi-label classification domains are extremely expensive since the documents are characterized by an extremely large number of features. In this thesis, we are trying to reduce these computational costs by applying different types of algorithms to the documents which are characterized by large number of features. Another important thing that we deal in this thesis is to have the highest possible accuracy when we have the high computational performance on text document categorization.
3

Blein, Florent. "Automatic Document Classification Applied to Swedish News." Thesis, Linköping University, Department of Computer and Information Science, 2005. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-3065.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
Анотація:

The first part of this paper presents briefly the ELIN[1] system, an electronic newspaper project. ELIN is a framework that stores news and displays them to the end-user. Such news are formatted using the xml[2] format. The project partner Corren[3] provided ELIN with xml articles, however the format used was not the same. My first task has been to develop a software that converts the news from one xml format (Corren) to another (ELIN).

The second and main part addresses the problem of automatic document classification and tries to find a solution for a specific issue. The goal is to automatically classify news articles from a Swedish newspaper company (Corren) into the IPTC[4] news categories.

This work has been carried out by implementing several classification algorithms, testing them and comparing their accuracy with existing software. The training and test documents were 3 weeks of the Corren newspaper that had to be classified into 2 categories.

The last tests were run with only one algorithm (Naïve Bayes) over a larger amount of data (7, then 10 weeks) and categories (12) to simulate a more real environment.

The results show that the Naïve Bayes algorithm, although the oldest, was the most accurate in this particular case. An issue raised by the results is that feature selection improves speed but can seldom reduce accuracy by removing too many features.

4

Anne, Chaitanya. "Advanced Text Analytics and Machine Learning Approach for Document Classification." ScholarWorks@UNO, 2017. http://scholarworks.uno.edu/td/2292.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
Анотація:
Text classification is used in information extraction and retrieval from a given text, and text classification has been considered as an important step to manage a vast number of records given in digital form that is far-reaching and expanding. This thesis addresses patent document classification problem into fifteen different categories or classes, where some classes overlap with other classes for practical reasons. For the development of the classification model using machine learning techniques, useful features have been extracted from the given documents. The features are used to classify patent document as well as to generate useful tag-words. The overall objective of this work is to systematize NASA’s patent management, by developing a set of automated tools that can assist NASA to manage and market its portfolio of intellectual properties (IP), and to enable easier discovery of relevant IP by users. We have identified an array of methods that can be applied such as k-Nearest Neighbors (kNN), two variations of the Support Vector Machine (SVM) algorithms, and two tree based classification algorithms: Random Forest and J48. The major research steps in this work consist of filtering techniques for variable selection, information gain and feature correlation analysis, and training and testing potential models using effective classifiers. Further, the obstacles associated with the imbalanced data were mitigated by adding synthetic data wherever appropriate, which resulted in a superior SVM classifier based model.
5

Alsaad, Amal. "Enhanced root extraction and document classification algorithm for Arabic text." Thesis, Brunel University, 2016. http://bura.brunel.ac.uk/handle/2438/13510.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
Анотація:
Many text extraction and classification systems have been developed for English and other international languages; most of the languages are based on Roman letters. However, Arabic language is one of the difficult languages which have special rules and morphology. Not many systems have been developed for Arabic text categorization. Arabic language is one of the Semitic languages with morphology that is more complicated than English. Due to its complex morphology, there is a need for pre-processing routines to extract the roots of the words then classify them according to the group of acts or meaning. In this thesis, a system has been developed and tested for text classification. The system is based on two stages, the first is to extract the roots from text and the second is to classify the text according to predefined categories. The linguistic root extraction stage is composed of two main phases. The first phase is to handle removal of affixes including prefixes, suffixes and infixes. Prefixes and suffixes are removed depending on the length of the word, while checking its morphological pattern after each deduction to remove infixes. In the second phase, the root extraction algorithm is formulated to handle weak, defined, eliminated-long-vowel and two-letter geminated words, as there is a substantial great amount of irregular Arabic words in texts. Once the roots are extracted, they are checked against a predefined list of 3800 triliteral and 900 quad literal roots. Series of experiments has been conducted to improve and test the performance of the proposed algorithm. The obtained results revealed that the developed algorithm has better accuracy than the existing stemming algorithm. The second stage is the document classification stage. In this stage two non-parametric classifiers are tested, namely Artificial Neural Networks (ANN) and Support Vector Machine (SVM). The system is trained on 6 categories: culture, economy, international, local, religion and sports. The system is trained on 80% of the available data. From each category, the 10 top frequent terms are selected as features. Testing the classification algorithms has been done on the remaining 20% of the documents. The results of ANN and SVM are compared to the standard method used for text classification, the terms frequency-based method. Results show that ANN and SVM have better accuracy (80-90%) compared to the standard method (60-70%). The proposed method proves the ability to categorize the Arabic text documents into the appropriate categories with a high precision rate.
6

McElroy, Jonathan David. "Automatic Document Classification in Small Environments." DigitalCommons@CalPoly, 2012. https://digitalcommons.calpoly.edu/theses/682.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
Анотація:
Document classification is used to sort and label documents. This gives users quicker access to relevant data. Users that work with large inflow of documents spend time filing and categorizing them to allow for easier procurement. The Automatic Classification and Document Filing (ACDF) system proposed here is designed to allow users working with files or documents to rely on the system to classify and store them with little manual attention. By using a system built on Hidden Markov Models, the documents in a smaller desktop environment are categorized with better results than the traditional Naive Bayes implementation of classification.
7

Felhi, Mehdi. "Document image segmentation : content categorization." Thesis, Université de Lorraine, 2014. http://www.theses.fr/2014LORR0109/document.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
Анотація:
Dans cette thèse, nous abordons le problème de la segmentation des images de documents en proposant de nouvelles approches pour la détection et la classification de leurs contenus. Dans un premier lieu, nous étudions le problème de l'estimation d'inclinaison des documents numérisées. Le but de ce travail étant de développer une approche automatique en mesure d'estimer l'angle d'inclinaison du texte dans les images de document. Notre méthode est basée sur la méthode Maximum Gradient Difference (MGD), la R-signature et la transformée de Ridgelets. Nous proposons ensuite une approche hybride pour la segmentation des documents. Nous décrivons notre descripteur de trait qui permet de détecter les composantes de texte en se basant sur la squeletisation. La méthode est appliquée pour la segmentation des images de documents numérisés (journaux et magazines) qui contiennent du texte, des lignes et des régions de photos. Le dernier volet de la thèse est consacré à la détection du texte dans les photos et posters. Pour cela, nous proposons un ensemble de descripteurs de texte basés sur les caractéristiques du trait. Notre approche commence par l'extraction et la sélection des candidats de caractères de texte. Deux méthodes ont été établies pour regrouper les caractères d'une même ligne de texte (mot ou phrase) ; l'une consiste à parcourir en profondeur un graphe, l'autre consiste à établir un critère de stabilité d'une région de texte. Enfin, les résultats sont affinés en classant les candidats de texte en régions « texte » et « non-texte » en utilisant une version à noyau du classifieur Support Vector Machine (K-SVM)
In this thesis I discuss the document image segmentation problem and I describe our new approaches for detecting and classifying document contents. First, I discuss our skew angle estimation approach. The aim of this approach is to develop an automatic approach able to estimate, with precision, the skew angle of text in document images. Our method is based on Maximum Gradient Difference (MGD) and R-signature. Then, I describe our second method based on Ridgelet transform.Our second contribution consists in a new hybrid page segmentation approach. I first describe our stroke-based descriptor that allows detecting text and line candidates using the skeleton of the binarized document image. Then, an active contour model is applied to segment the rest of the image into photo and background regions. Finally, text candidates are clustered using mean-shift analysis technique according to their corresponding sizes. The method is applied for segmenting scanned document images (newspapers and magazines) that contain text, lines and photo regions. Finally, I describe our stroke-based text extraction method. Our approach begins by extracting connected components and selecting text character candidates over the CIE LCH color space using the Histogram of Oriented Gradients (HOG) correlation coefficients in order to detect low contrasted regions. The text region candidates are clustered using two different approaches ; a depth first search approach over a graph, and a stable text line criterion. Finally, the resulted regions are refined by classifying the text line candidates into « text» and « non-text » regions using a Kernel Support Vector Machine K-SVM classifier
8

Wang, Yanbo Justin. "Language-independent pre-processing of large document bases for text classification." Thesis, University of Liverpool, 2008. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.445960.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
Анотація:
Text classification is a well-known topic in the research of knowledge discovery in databases. Algorithms for text classification generally involve two stages. The first is concerned with identification of textual features (i.e. words andlor phrases) that may be relevant to the classification process. The second is concerned with classification rule mining and categorisation of "unseen" textual data. The first stage is the subject of this thesis and often involves an analysis of text that is both language-specific (and possibly domain-specific), and that may also be computationally costly especially when dealing with large datasets. Existing approaches to this stage are not, therefore, generally applicable to all languages. In this thesis, we examine a number of alternative keyword selection methods and phrase generation strategies, coupled with two potential significant word list construction mechanisms and two final significant word selection mechanisms, to identify such words andlor phrases in a given textual dataset that are expected to serve to distinguish between classes, by simple, language-independent statistical properties. We present experimental results, using common (large) textual datasets presented in two distinct languages, to show that the proposed approaches can produce good performance with respect to both classification accuracy and processing efficiency. In other words, the study presented in this thesis demonstrates the possibility of efficiently solving the traditional text classification problem in a language-independent (also domain-independent) manner.
9

Wang, Yalin. "Document analysis : table structure understanding and zone content classification /." Thesis, Connect to this title online; UW restricted, 2002. http://hdl.handle.net/1773/6079.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
10

Wei, Zhihua. "The research on chinese text multi-label classification." Thesis, Lyon 2, 2010. http://www.theses.fr/2010LYO20025/document.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
Анотація:
Text Classification (TC) which is an important field in information technology has many valuable applications. When facing the sea of information resources, the objects of TC are more complicated and diversity. The researches in pursuit of effective and practical TC technology are fairly challenging. More and more researchers regard that multi-label TC is more suited for many applications. This thesis analyses the difficulties and problems in multi-label TC and Chinese text representation based on a mass of algorithms for single-label TC and multi-label TC. Aiming at high dimensionality in feature space, sparse distribution in text representation and poor performance of multi-label classifier, this thesis will bring forward corresponding algorithms from different angles.Focusing on the problem of dimensionality “disaster” when Chinese texts are represented by using n-grams, two-step feature selection algorithm is constructed. The method combines filtering rare features within class and selecting discriminative features across classes. Moreover, the proper value of “n”, the strategy of feature weight and the correlation among features are discussed based on variety of experiments. Some useful conclusions are contributed to the research of n-gram representation in Chinese texts.In a view of the disadvantage in Latent Dirichlet Allocation (LDA) model, that is, arbitrarily revising the variable in smooth process, a new strategy for smoothing based on Tolerance Rough Set (TRS) is put forward. It constructs tolerant class in global vocabulary database firstly and then assigns value for out-of-vocabulary (oov) word in each class according to tolerant class.In order to improve performance of multi-label classifier and degrade computing complexity, a new TC method based on LDA model is applied for Chinese text representation. It extracts topics statistically from texts and then texts are represented by using the topic vector. It shows competitive performance both in English and in Chinese corpus.To enhance the performance of classifiers in multi-label TC, a compound classification framework is raised. It partitions the text space by computing the upper approximation and lower approximation. This algorithm decomposes a multi-label TC problem into several single-label TCs and several multi-label TCs which have less labels than original problem. That is, an unknown text should be classified by single-label classifier when it is partitioned into lower approximation space of some class. Otherwise, it should be classified by corresponding multi-label classifier.An application system TJ-MLWC (Tongji Multi-label Web Classifier) was designed. It could call the result from Search Engines directly and classify these results real-time using improved Naïve Bayes classifier. This makes the browse process more conveniently for users. Users could locate the texts interested immediately according to the class information given by TJ-MLWC
La thèse est centrée sur la Classification de texte, domaine en pleine expansion, avec de nombreuses applications actuelles et potentielles. Les apports principaux de la thèse portent sur deux points : Les spécificités du codage et du traitement automatique de la langue chinoise : mots pouvant être composés de un, deux ou trois caractères ; absence de séparation typographique entre les mots ; grand nombre d’ordres possibles entre les mots d’une phrase ; tout ceci aboutissant à des problèmes difficiles d’ambiguïté. La solution du codage en «n-grams »(suite de n=1, ou 2 ou 3 caractères) est particulièrement adaptée à la langue chinoise, car elle est rapide et ne nécessite pas les étapes préalables de reconnaissance des mots à l’aide d’un dictionnaire, ni leur séparation. La classification multi-labels, c'est-à-dire quand chaque individus peut être affecté à une ou plusieurs classes. Dans le cas des textes, on cherche des classes qui correspondent à des thèmes (topics) ; un même texte pouvant être rattaché à un ou plusieurs thème. Cette approche multilabel est plus générale : un même patient peut être atteint de plusieurs pathologies ; une même entreprise peut être active dans plusieurs secteurs industriels ou de services. La thèse analyse ces problèmes et tente de leur apporter des solutions, d’abord pour les classifieurs unilabels, puis multi-labels. Parmi les difficultés, la définition des variables caractérisant les textes, leur grand nombre, le traitement des tableaux creux (beaucoup de zéros dans la matrice croisant les textes et les descripteurs), et les performances relativement mauvaises des classifieurs multi-classes habituels
文本分类是信息科学中一个重要而且富有实际应用价值的研究领域。随着文本分类处理内容日趋复杂化和多元化,分类目标也逐渐多样化,研究有效的、切合实际应用需求的文本分类技术成为一个很有挑战性的任务,对多标签分类的研究应运而生。本文在对大量的单标签和多标签文本分类算法进行分析和研究的基础上,针对文本表示中特征高维问题、数据稀疏问题和多标签分类中分类复杂度高而精度低的问题,从不同的角度尝试运用粗糙集理论加以解决,提出了相应的算法,主要包括:针对n-gram作为中文文本特征时带来的维数灾难问题,提出了两步特征选择的方法,即去除类内稀有特征和类间特征选择相结合的方法,并就n-gram作为特征时的n值选取、特征权重的选择和特征相关性等问题在大规模中文语料库上进行了大量的实验,得出一些有用的结论。针对文本分类中运用高维特征表示文本带来的分类效率低,开销大等问题,提出了基于LDA模型的多标签文本分类算法,利用LDA模型提取的主题作为文本特征,构建高效的分类器。在PT3多标签分类转换方法下,该分类算法在中英文数据集上都表现出很好的效果,与目前公认最好的多标签分类方法效果相当。针对LDA模型现有平滑策略的随意性和武断性的缺点,提出了基于容差粗糙集的LDA语言模型平滑策略。该平滑策略首先在全局词表上构造词的容差类,再根据容差类中词的频率为每类文档的未登录词赋予平滑值。在中英文、平衡和不平衡语料库上的大量实验都表明该平滑方法显著提高了LDA模型的分类性能,在不平衡语料库上的提高尤其明显。针对多标签分类中分类复杂度高而精度低的问题,提出了一种基于可变精度粗糙集的复合多标签文本分类框架,该框架通过可变精度粗糙集方法划分文本特征空间,进而将多标签分类问题分解为若干个两类单标签分类问题和若干个标签数减少了的多标签分类问题。即,当一篇未知文本被划分到某一类文本的下近似区域时,可以直接用简单的单标签文本分类器判断其类别;当未知文本被划分在边界域时,则采用相应区域的多标签分类器进行分类。实验表明,这种分类框架下,分类的精确度和算法效率都有较大的提高。本文还设计和实现了一个基于多标签分类的网页搜索结果可视化系统(MLWC),该系统能够直接调用搜索引擎返回的搜索结果,并采用改进的Naïve Bayes多标签分类算法实现实时的搜索结果分类,使用户可以快速地定位搜索结果中感兴趣的文本。

Книги з теми "Text document classification":

1

Moens, Marie-Francine. Automatic indexing and abstracting of document texts. Boston: Kluwer Academic Publishers, 2000.

Знайти повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
2

Meister, Burkhardt W. The German Limited Liability Company: An introduction to the Act on limited liability companies with German/English text, synoptically arranged, of the act, a sample of arcticles of association, samples of the other formation documents of the company, the classification of the balance sheet and the profit and loss statement of a company and an extract from the commercial register = Die deutsche Gesellschaft mit beschränkter Haftung : eine Einführung zum Gesetz betreffend die Gesellschaften mit beschränkter Haftung mit synoptisch angeordnetem deutsch/englischem Text des Gesetzes, eines Gesellschaftsvertrages, der sonstigen Gründungsdokumente einer Bilanz und der Gewinn- und Verlustrechnung einer Gesellschaft und eines Auszugs aus dem Handelsregister. 7th ed. München: Beck, 2010.

Знайти повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
3

Intelligent Text Categorization And Clustering. Springer, 2008.

Знайти повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.

Частини книг з теми "Text document classification":

1

Guthrie, Louise, Joe Guthrie, and James Leistensnider. "Document Classification and Routing." In Text, Speech and Language Technology, 289–310. Dordrecht: Springer Netherlands, 1999. http://dx.doi.org/10.1007/978-94-017-2388-6_12.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
2

Huang, Chaochao, Xipeng Qiu, and Xuanjing Huang. "Text Classification with Document Embeddings." In Lecture Notes in Computer Science, 131–40. Cham: Springer International Publishing, 2014. http://dx.doi.org/10.1007/978-3-319-12277-9_12.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
3

Hrala, Michal, and Pavel Král. "Multi-label Document Classification in Czech." In Text, Speech, and Dialogue, 343–51. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013. http://dx.doi.org/10.1007/978-3-642-40585-3_44.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
4

Kralicek, Jiri, and Jiri Matas. "Fast Text vs. Non-text Classification of Images." In Document Analysis and Recognition – ICDAR 2021, 18–32. Cham: Springer International Publishing, 2021. http://dx.doi.org/10.1007/978-3-030-86337-1_2.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
5

Král, Pavel, and Ladislav Lenc. "Confidence Measure for Czech Document Classification." In Computational Linguistics and Intelligent Text Processing, 525–34. Cham: Springer International Publishing, 2015. http://dx.doi.org/10.1007/978-3-319-18117-2_39.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
6

Penha, Gustavo, Raphael Campos, Sérgio Canuto, Marcos André Gonçalves, and Rodrygo L. T. Santos. "Document Performance Prediction for Automatic Text Classification." In Lecture Notes in Computer Science, 132–39. Cham: Springer International Publishing, 2019. http://dx.doi.org/10.1007/978-3-030-15719-7_17.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
7

Howland, Peg, and Haesun Park. "Cluster-Preserving Dimension Reduction Methods for Document Classification." In Survey of Text Mining II, 3–23. London: Springer London, 2008. http://dx.doi.org/10.1007/978-1-84800-046-9_1.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
8

Xia, Zhonghang, Guangming Xing, Houduo Qi, and Qi Li. "Applications of Semidefinite Programming in XML Document Classification." In Survey of Text Mining II, 129–44. London: Springer London, 2008. http://dx.doi.org/10.1007/978-1-84800-046-9_7.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
9

Gelbukh, Alexander, Grigori Sidorov, and Adolfo Guzman-Arénas. "Use of a Weighted Topic Hierarchy for Document Classification." In Text, Speech and Dialogue, 133–38. Berlin, Heidelberg: Springer Berlin Heidelberg, 1999. http://dx.doi.org/10.1007/3-540-48239-3_24.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
10

Lehečka, Jan, and Jan Švec. "Improving Multi-label Document Classification of Czech News Articles." In Text, Speech, and Dialogue, 307–15. Cham: Springer International Publishing, 2015. http://dx.doi.org/10.1007/978-3-319-24033-6_35.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.

Тези доповідей конференцій з теми "Text document classification":

1

Alshamari, Fatimah, and Abdou Youssef. "A Study into Math Document Classification using Deep Learning." In 8th International Conference on Computational Science and Engineering (CSE 2020). AIRCC Publishing Corporation, 2020. http://dx.doi.org/10.5121/csit.2020.101702.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
Анотація:
Document classification is a fundamental task for many applications, including document annotation, document understanding, and knowledge discovery. This is especially true in STEM fields where the growth rate of scientific publications is exponential, and where the need for document processing and understanding is essential to technological advancement. Classifying a new publication into a specific domain based on the content of the document is an expensive process in terms of cost and time. Therefore, there is a high demand for a reliable document classification system. In this paper, we focus on classification of mathematics documents, which consist of English text and mathematics formulas and symbols. The paper addresses two key questions. The first question is whether math-document classification performance is impacted by math expressions and symbols, either alone or in conjunction with the text contents of documents. Our investigations show that Text-Only embedding produces better classification results. The second question we address is the optimization of a deep learning (DL) model, the LSTM combined with one dimension CNN, for math document classification. We examine the model with several input representations, key design parameters and decision choices, and choices of the best input representation for math documents classification.
2

Wu, Qin, Eddie Fuller, and Cun-Quan Zhang. "Text Document Classification and Pattern Recognition." In 2009 International Conference on Advances in Social Network Analysis and Mining (ASONAM). IEEE, 2009. http://dx.doi.org/10.1109/asonam.2009.21.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
3

Chen, ZhiHang, Liping Huang, and Yi L. Murphey. "Incremental Learning for Text Document Classification." In 2007 International Joint Conference on Neural Networks. IEEE, 2007. http://dx.doi.org/10.1109/ijcnn.2007.4371367.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
4

Zhang, Haopeng, and Jiawei Zhang. "Text Graph Transformer for Document Classification." In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Stroudsburg, PA, USA: Association for Computational Linguistics, 2020. http://dx.doi.org/10.18653/v1/2020.emnlp-main.668.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
5

Simske, Steven J., and Rafael Lins. "Automatic Text Summarization and Classification." In DocEng '18: ACM Symposium on Document Engineering 2018. New York, NY, USA: ACM, 2018. http://dx.doi.org/10.1145/3209280.3232791.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
6

Lima, João Marcos Carvalho, and José Everardo Bessa Maia. "A Topical Word Embeddings for Text Classification." In XV Encontro Nacional de Inteligência Artificial e Computacional. Sociedade Brasileira de Computação - SBC, 2018. http://dx.doi.org/10.5753/eniac.2018.4401.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
Анотація:
This paper presents an approach that uses topic models based on LDA to represent documents in text categorization problems. The document representation is achieved through the cosine similarity between document embeddings and embeddings of topic words, creating a Bag-of-Topics (BoT) variant. The performance of this approach is compared against those of two other representations: BoW (Bag-of-Words) and Topic Model, both based on standard tf-idf. Also, to reveal the effect of the classifier, we compared the performance of the nonlinear classifier SVM against that of the linear classifier Naive Bayes, taken as baseline. To evaluate the approach we use two bases, one multi-label (RCV-1) and another single-label (20 Newsgroup). The model presents significant results with low dimensionality when compared to the state of the art.
7

Thi Xuan Lam, Thanh, Anh Duc Le, and Masaki Nakagawa. "User Interface for Text and Non-Text Classification." In 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW). IEEE, 2019. http://dx.doi.org/10.1109/icdarw.2019.20044.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
8

"Contextual Latent Semantic Networks used for Document Classification." In Special Session on Text Mining. SciTePress - Science and and Technology Publications, 2012. http://dx.doi.org/10.5220/0004109304250430.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
9

Zhao, Miao, Rui-Qi Wang, Fei Yin, Xu-Yao Zhang, Lin-Lin Huang, and Jean-Marc Ogier. "Fast Text/non-Text Image Classification with Knowledge Distillation." In 2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2019. http://dx.doi.org/10.1109/icdar.2019.00234.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
10

Jiang, Suqi, Jason Lewris, Michael Voltmer, and Hongning Wang. "Integrating rich document representations for text classification." In 2016 Systems and Information Engineering Design Symposium (SIEDS). IEEE, 2016. http://dx.doi.org/10.1109/sieds.2016.7489319.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.

Звіти організацій з теми "Text document classification":

1

Idakwo, Gabriel, Sundar Thangapandian, Joseph Luttrell, Zhaoxian Zhou, Chaoyang Zhang, and Ping Gong. Deep learning-based structure-activity relationship modeling for multi-category toxicity classification : a case study of 10K Tox21 chemicals with high-throughput cell-based androgen receptor bioassay data. Engineer Research and Development Center (U.S.), July 2021. http://dx.doi.org/10.21079/11681/41302.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
Анотація:
Deep learning (DL) has attracted the attention of computational toxicologists as it offers a potentially greater power for in silico predictive toxicology than existing shallow learning algorithms. However, contradicting reports have been documented. To further explore the advantages of DL over shallow learning, we conducted this case study using two cell-based androgen receptor (AR) activity datasets with 10K chemicals generated from the Tox21 program. A nested double-loop cross-validation approach was adopted along with a stratified sampling strategy for partitioning chemicals of multiple AR activity classes (i.e., agonist, antagonist, inactive, and inconclusive) at the same distribution rates amongst the training, validation and test subsets. Deep neural networks (DNN) and random forest (RF), representing deep and shallow learning algorithms, respectively, were chosen to carry out structure-activity relationship-based chemical toxicity prediction. Results suggest that DNN significantly outperformed RF (p < 0.001, ANOVA) by 22–27% for four metrics (precision, recall, F-measure, and AUPRC) and by 11% for another (AUROC). Further in-depth analyses of chemical scaffolding shed insights on structural alerts for AR agonists/antagonists and inactive/inconclusive compounds, which may aid in future drug discovery and improvement of toxicity prediction modeling.

До бібліографії