
Dissertations / Theses on the topic 'Classification de document'


Consult the top 50 dissertations / theses for your research on the topic 'Classification de document.'


1

Lovegrove, Will. "Advanced document analysis and automatic classification of PDF documents." Thesis, University of Nottingham, 1996. http://eprints.nottingham.ac.uk/13967/.

Abstract:
This thesis explores the domain of document analysis and document classification within the PDF document environment. The main focus is the creation of a document classification technique which can identify the logical class of a PDF document and so provide necessary information to document-class-specific algorithms (such as document understanding techniques). The thesis describes a page decomposition technique which is tailored to render the information contained in an unstructured PDF file into a set of blocks. The new technique is based on published research but contains many modifications which enable it to competently analyse the internal document model of PDF documents. A new level of document processing is presented: advanced document analysis. The aim of advanced document analysis is to extract information from the PDF file which can be used to help identify the logical class of that PDF file. A blackboard framework is used in a process of block labelling in which the blocks created from earlier segmentation techniques are classified into one of eight basic categories. The blackboard's knowledge sources are programmed to find recurring patterns amongst the document's blocks and formulate document-specific heuristics which can be used to tag those blocks. Meaningful document features are found from three information sources: a statistical evaluation of the document's aesthetic components; a logic-based evaluation of the labelled document blocks; and an appearance-based evaluation of the labelled document blocks. The features are used to train and test a neural net classification system which identifies the recurring patterns amongst these features for four basic document classes: newspapers, brochures, forms and academic documents. In summary, this thesis shows that it is possible to classify a PDF document (which is logically unstructured) into a basic logical document class. This has important ramifications for document processing systems which have traditionally relied upon a priori knowledge of the logical class of the document they are processing.
2

Augereau, Olivier. "Reconnaissance et classification d’images de documents." Thesis, Bordeaux 1, 2013. http://www.theses.fr/2013BOR14764/document.

Abstract:
Ces travaux de recherche ont pour ambition de contribuer à la problématique de la classification d’images de documents. Plus précisément, ces travaux tendent à répondre aux problèmes rencontrés par des sociétés de numérisation dont l’objectif est de mettre à disposition de leurs clients une version numérique des documents papiers accompagnés d’informations qui leur sont relatives. Face à la diversité des documents à numériser, l’extraction d’informations peut s’avérer parfois complexe. C’est pourquoi la classification et l’indexation des documents sont très souvent réalisées manuellement. Ces travaux de recherche ont permis de fournir différentes solutions en fonction des connaissances relatives aux images que possède l’utilisateur ayant en charge l’annotation des documents. Le premier apport de cette thèse est la mise en place d’une méthode permettant, de manière interactive, à un utilisateur de classer des images de documents dont la nature est inconnue. Le second apport de ces travaux est la proposition d’une technique de recherche d’images de documents par l’exemple basée sur l’extraction et la mise en correspondance de points d’intérêts. Le dernier apport de cette thèse est l’élaboration d’une méthode de classification d’images de documents utilisant les techniques de sacs de mots visuels.

The aim of this research is to contribute to the document image classification problem. More specifically, these studies address the issues faced by digitizing companies whose objective is to provide the digital version of paper documents together with information relating to them. Given the diversity of documents, information extraction can be complex. This is why the classification and the indexing of documents are often performed manually. This research provides several solutions based on the knowledge of the images available to the user in charge of annotating the documents. The first contribution of this thesis is a method for interactively classifying document images where the content of the documents and the classes are unknown. The second contribution of this work is a new technique for document image retrieval from a single example of the searched document. This technique is based on the extraction and matching of interest points. The last contribution of this thesis is a method for classifying document images using bag-of-visual-words techniques.
3

Mondal, Abhro Jyoti. "Document Classification using Characteristic Signatures." University of Cincinnati / OhioLINK, 2017. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1511793852923472.

4

Sandsmark, Håkon. "Spoken Document Classification of Broadcast News." Thesis, Norges teknisk-naturvitenskapelige universitet, Institutt for elektronikk og telekommunikasjon, 2012. http://urn.kb.se/resolve?urn=urn:nbn:no:ntnu:diva-19226.

Abstract:
Two systems for spoken document classification are implemented by combining an automatic speech recognizer with the two classification algorithms naive Bayes and logistic regression. The focus is on how to handle the inherent uncertainty in the output of the speech recognizer. Feature extraction is performed by computing expected word counts from speech recognition lattices, and subsequently removing words that are found to carry little or noisy information about the topic label, as determined by the information gain metric. The systems are evaluated by performing cross-validation on broadcast news stories, and the classification accuracy is measured with different configurations and on recognition output with different word error rates. The results show that a relatively high classification accuracy can be obtained with word error rates around 50%, and that the benefit of extracting features from lattices instead of 1-best transcripts increases with increasing word error rates.
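As a rough illustration of the feature-selection step described in this abstract, the sketch below ranks words by information gain computed from expected word counts and topic labels. It is a minimal, self-contained example; the toy count matrix simply stands in for lattice-derived recognizer output.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def information_gain(X, y):
    # X: (n_docs, n_words) expected word counts; y: integer topic labels.
    # Score each word by the drop in label entropy when conditioning on
    # whether the word occurs (expected count > 0) or not.
    _, counts = np.unique(y, return_counts=True)
    h_y = entropy(counts / len(y))
    gains = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        mask = X[:, j] > 0
        gains[j] = h_y
        for part in (mask, ~mask):
            if part.sum() == 0:
                continue
            _, c = np.unique(y[part], return_counts=True)
            gains[j] -= (part.sum() / len(y)) * entropy(c / part.sum())
    return gains

# Toy stand-in for lattice-derived expected counts: 4 stories, 5 words, 2 topics.
X = np.array([[1.2, 0.0, 0.3, 0.0, 2.0],
              [0.9, 0.1, 0.0, 0.0, 1.5],
              [0.0, 1.1, 0.0, 0.8, 0.1],
              [0.0, 0.7, 0.2, 1.3, 0.0]])
y = np.array([0, 0, 1, 1])
keep = np.argsort(information_gain(X, y))[::-1][:3]   # indices of the 3 best words
print(keep)
```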
5

Chantar, Hamouda Khalifa Hamouda. "New techniques for Arabic document classification." Thesis, Heriot-Watt University, 2013. http://hdl.handle.net/10399/2669.

Abstract:
Text classification (TC) concerns automatically assigning a class (category) label to a text document, and has increasingly many applications, particularly in the domain of organizing, for browsing in large document collections. It is typically achieved via machine learning, where a model is built on the basis of a typically large collection of document features. Feature selection is critical in this process, since there are typically several thousand potential features (distinct words or terms). In text classification, feature selection aims to improve the computational efficiency and classification accuracy by removing irrelevant and redundant terms (features), while retaining features (words) that contain sufficient information that help with the classification task. This thesis proposes binary particle swarm optimization (BPSO) hybridized with either K Nearest Neighbour (KNN) or Support Vector Machines (SVM) for feature selection in Arabic text classification tasks. Comparison between feature selection approaches is done on the basis of using the selected features in conjunction with SVM, Decision Trees (C4.5), and Naive Bayes (NB), to classify a hold-out test set. Using publicly available Arabic datasets, results show that BPSO/KNN and BPSO/SVM techniques are promising in this domain. The sets of selected features (words) are also analyzed to consider the differences between the types of features that BPSO/KNN and BPSO/SVM tend to choose. This leads to speculation concerning the appropriate feature selection strategy, based on the relationship between the classes in the document categorization task at hand. The thesis also investigates the use of statistically extracted phrases of length two as terms in Arabic text classification. In comparison with Bag of Words text representation, results show that using phrases alone as terms in the Arabic TC task decreases the classification accuracy of Arabic TC classifiers significantly, while combining bag of words and phrase-based representations may increase the classification accuracy of the SVM classifier slightly.
6

Calabrese, Stephen. "Nonnegative Matrix Factorization and Document Classification." DigitalCommons@CalPoly, 2015. https://digitalcommons.calpoly.edu/theses/1462.

Abstract:
Applications of Non-negative Matrix Factorization are ubiquitous, and there are several well known algorithms available. This paper is concerned with the preprocessing of the documents and how the preprocessing affects document classification. The classification is run on a variety of inner dimensions to see how my initialization compares to random initialization across an assortment of inner dimensions. The document classification is accomplished by using Non-negative Matrix Factorization and a Support Vector Machine. Several of the well known algorithms call for a random initialization of matrices before starting an iterative process towards a locally optimal solution. Not only is the initialization often random, but choosing the size of the inner dimension also remains a difficult and mysterious task. This paper explores the possible gains in categorization accuracy given a more intelligently chosen initialization as opposed to a random initialization, through the use of the Reuters-21578 document collection. This paper presents two new and different approaches for initialization of the data matrix. The first approach uses the most important words for a given document that are least important to all the other documents. The second approach incorporates the words that appear in the title and header of the documents that are not stop words. The motivation for this is that the title usually tells the reader what the document is about; as a result, the words should be relevant to the category of the document. This paper also presents an entire framework for testing and comparing different Non-negative Matrix Factorization initialization methods. A thorough overview of the implementation and results is presented to ease the interfacing with future work.
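To make the pipeline concrete, here is a minimal sketch of an NMF-plus-SVM document classification scheme like the one described above, using scikit-learn. The tiny corpus, the inner dimension k, and the class-mean seeding passed through init='custom' are illustrative assumptions, not the initialization schemes proposed in the thesis.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.svm import LinearSVC

docs = ["grain prices rose sharply", "wheat and grain exports fell",
        "central bank cut interest rates", "interest rates and inflation rose"]
labels = np.array([0, 0, 1, 1])

X = TfidfVectorizer().fit_transform(docs).toarray()   # docs x terms, non-negative
k = 2                                                 # inner dimension under test

# Illustrative (not the thesis's) non-random initialization: seed each of the
# k topic rows of H with the mean term profile of one class, and W with small
# constant values; scikit-learn accepts both through init='custom'.
H0 = np.vstack([X[labels == c].mean(axis=0) for c in range(k)]) + 1e-6
W0 = np.full((X.shape[0], k), 0.1)

nmf = NMF(n_components=k, init='custom', max_iter=500)
W = nmf.fit_transform(X, W=W0, H=H0)                  # document factors
clf = LinearSVC().fit(W, labels)                      # classify in factor space
print(clf.predict(nmf.transform(X)))
```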
7

McElroy, Jonathan David. "Automatic Document Classification in Small Environments." DigitalCommons@CalPoly, 2012. https://digitalcommons.calpoly.edu/theses/682.

Abstract:
Document classification is used to sort and label documents. This gives users quicker access to relevant data. Users that work with a large inflow of documents spend time filing and categorizing them to allow for easier procurement. The Automatic Classification and Document Filing (ACDF) system proposed here is designed to allow users working with files or documents to rely on the system to classify and store them with little manual attention. By using a system built on Hidden Markov Models, the documents in a smaller desktop environment are categorized with better results than a traditional Naive Bayes classifier.
8

Blein, Florent. "Automatic Document Classification Applied to Swedish News." Thesis, Linköping University, Department of Computer and Information Science, 2005. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-3065.

Abstract:
The first part of this paper briefly presents the ELIN [1] system, an electronic newspaper project. ELIN is a framework that stores news and displays them to the end user. Such news items are formatted using the XML [2] format. The project partner Corren [3] provided ELIN with XML articles; however, the format used was not the same. My first task was to develop software that converts the news from one XML format (Corren) to another (ELIN).

The second and main part addresses the problem of automatic document classification and tries to find a solution for a specific issue. The goal is to automatically classify news articles from a Swedish newspaper company (Corren) into the IPTC [4] news categories.

This work has been carried out by implementing several classification algorithms, testing them and comparing their accuracy with existing software. The training and test documents were 3 weeks of the Corren newspaper that had to be classified into 2 categories.

The last tests were run with only one algorithm (Naïve Bayes) over a larger amount of data (7, then 10 weeks) and categories (12) to simulate a more realistic environment.

The results show that the Naïve Bayes algorithm, although the oldest, was the most accurate in this particular case. An issue raised by the results is that feature selection improves speed but can seldom reduce accuracy by removing too many features.
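For readers unfamiliar with this kind of setup, the sketch below shows a comparable bag-of-words pipeline with feature selection and a Naïve Bayes classifier in scikit-learn. The Swedish snippets, category names and the chi-square selection criterion are illustrative assumptions rather than the exact configuration used in the thesis.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training snippets and IPTC-style category labels.
train_texts = ["valet till riksdagen narmar sig", "laget vann matchen med tre mal",
               "borsen steg efter rapporten", "nytt rekord i friidrott"]
train_labels = ["politics", "sport", "economy", "sport"]

# Bag of words + chi-square feature selection + Naive Bayes; k controls how
# aggressively features are pruned (the speed/accuracy trade-off noted above).
model = make_pipeline(CountVectorizer(), SelectKBest(chi2, k=8), MultinomialNB())
model.fit(train_texts, train_labels)
print(model.predict(["matchen slutade oavgjort"]))
```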
9

SHEN, TONG. "Document and Image Classification withTopic Ngram Model." Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2014. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-155771.

Abstract:
Latent Dirichlet Allocation (LDA) is a popular probabilistic model for information retrieval. Many extended models based on LDA have been introduced during the past 10 years. In LDA, a data point is represented as a bag (multiset) of words. In the text case, a word is a regular text word, but other types of data can also be represented as words (e.g. visual words). Due to the bag-of-words assumption, the original LDA neglects the structure of the data, i.e., all the relationships between words, which leads to information loss. As a matter of fact, the spatial relationship is important and useful. In order to explore the importance of the relationship, we focus on an extension of LDA called the Topic Ngram Model, which models the relationship among adjacent words. In this thesis, we first implement the model and use it for text classification. Furthermore, we propose a 2D extension, which enables us to model spatial relationships of features in images.
10

Gupta, Anjum. "New framework for cross-domain document classification." Monterey, California. Naval Postgraduate School, 2011. http://hdl.handle.net/10945/10786.

Abstract:
Automatic text document classification is a fundamental problem in machine learning. Given the dynamic nature and the exponential growth of the World Wide Web, one needs the ability to classify not only a massive number of documents, but also documents that belong to a wide variety of domains. Some examples of the domains are e-mails, blogs, Wikipedia articles, news articles, newsgroups, online chats, etc. It is the difference in the writing style that differentiates these domains. Text documents are usually classified using supervised learning algorithms that require a large set of pre-labeled data. This requirement of labeled data poses a challenge in classifying documents that belong to different domains. Our goal is to classify text documents in the testing domain without requiring any labeled documents from the same domain. Our research develops specialized cross-domain learning algorithms based on the distributions over words obtained from a collection of text documents by topic models such as Latent Dirichlet Allocation (LDA). Our major contributions include (1) empirically showing that conventional supervised learning algorithms fail to generalize their learned models across different domains and (2) the development of novel and specialized cross-domain classification algorithms that show an appreciable and consistent improvement across different datasets over conventional methods used for cross-domain classification. Our research addresses many real-world needs. Since a massive number of new types of text documents are generated daily, it is crucial to have the ability to transfer learned information from one domain to another domain. Cross-domain classification lets us leverage information learned from one domain for use in the classification of documents in a new domain.
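The core idea, representing documents by topic proportions so that a classifier trained on one domain can be applied to another, can be sketched as follows. The toy news and blog snippets, the number of topics and the use of scikit-learn's LDA with logistic regression are illustrative assumptions, not the specialized algorithms developed in the thesis.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression

# Toy stand-ins: formal news articles (source domain) and casual blog posts
# (target domain) that share the same two classes.
source = ["the federal budget deficit widened", "the team secured a league title",
          "parliament debated the new tax bill", "the striker scored twice tonight"]
source_y = ["politics", "sport", "politics", "sport"]
target = ["lol that tax plan is a mess", "what a goal, best match ever"]

vec = CountVectorizer().fit(source + target)
lda = LatentDirichletAllocation(n_components=4, random_state=0)
lda.fit(vec.transform(source + target))        # topics learned over both domains

# Train on source-domain topic mixtures only, then predict on the target domain.
clf = LogisticRegression().fit(lda.transform(vec.transform(source)), source_y)
print(clf.predict(lda.transform(vec.transform(target))))
```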
11

Hadish, Mulugeta. "Extended Multidimensional Conceptual Spaces in Document Classification." University of Cincinnati / OhioLINK, 2008. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1227158181.

12

Zhou, Shun. "Incremental document classification in a knowledge management environment." Thesis, National Library of Canada = Bibliothèque nationale du Canada, 2001. http://www.collectionscanada.ca/obj/s4/f2/dsk3/ftp04/MQ62977.pdf.

13

Dubien, Stephen, and University of Lethbridge Faculty of Arts and Science. "Question answering using document tagging and question classification." Thesis, Lethbridge, Alta. : University of Lethbridge, Faculty of Arts and Science, 2005, 2005. http://hdl.handle.net/10133/248.

Abstract:
Question answering (QA) is a relatively new area of research. QA is retrieving answers to questions, rather than retrieving documents as information retrieval systems (search engines) do. This means that question answering systems will possibly be the next generation of search engines. What is left to be done to allow QA to be the next generation of search engines? The answer is higher accuracy, which can be achieved by investigating methods of question answering. I took the approach of designing a question answering system that is based on document tagging and question classification. Question classification extracts useful information from the question about how to answer the question. Document tagging extracts useful information from the documents, which will be used in finding the answer to the question. We used different available systems to tag the documents. Our system classifies the questions using manually developed rules. I also investigated different ways in which both these methods can be used to answer questions and found that our methods had a comparable accuracy to some systems that use deeper processing techniques. This thesis includes investigations into modules of a question answering system and gives insights into how to go about developing a question answering system based on document tagging and question classification. I also evaluated our current system with the questions from the TREC 2004 question answering track.

viii, 139 leaves ; 29 cm.
14

Voerman, Joris. "Classification automatique à partir d’un flux de documents." Electronic Thesis or Diss., La Rochelle, 2022. http://www.theses.fr/2022LAROS025.

Abstract:
Les documents administratifs sont aujourd’hui omniprésents dans notre quotidien. Nombreux et diversifiés, ils sont utilisés sous deux formes distinctes : physique ou numérique. La nécessité de passer du physique au numérique selon les situations entraîne des besoins dont le développement de solutions constitue un domaine de recherche actif, notamment d’un point de vue industriel. Une fois un document scanné, l’un des premiers éléments à déterminer est le type, la classe ou la catégorie, permettant de faciliter toutes les opérations ultérieures. Si la classification automatique est une opération disposant de nombreuses solutions dans l’état de l’art, la classification de documents, le fort déséquilibre au sein des données d’apprentissage et les contraintes industrielles restent trois difficultés majeures. Ce manuscrit se concentre sur la classification automatique par apprentissage de documents à partir de flux industriels en tentant de solutionner ces trois problèmes. Pour cela, il contient une évaluation de l’adaptation au contexte des méthodes préexistantes, suivie d’une évaluation des solutions existantes permettant de renforcer les méthodes, ainsi que des combinaisons possibles. Il se termine par la proposition d’une méthode de combinaison de modèles sous la forme d’une cascade offrant une réponse progressive. Les solutions mises en avant sont, d’un côté, un réseau multimodal renforcé par un système d’attention assurant la classification d’une grande variété de documents et, de l’autre, une cascade de trois réseaux complémentaires : un pour les images, un pour le texte et un pour les classes faiblement représentées. Ces deux options offrent des résultats solides autant dans un contexte idéal que dans un contexte déséquilibré. Dans le premier cas, elles égalent voire dépassent l’état de l’art. Dans le second, elles montrent une augmentation d’environ +6 % de F0,5-mesure par rapport à l’état de l’art.

Administrative documents can be found everywhere today. They are numerous and diverse, and come in two forms: physical and digital. The need to switch between these two forms has required the development of new solutions. After document digitization (mainly with a scanner), one of the first problems is to determine the type of the document, which will simplify all future processes. Automatic classification is a complex process that has multiple solutions in the state of the art. However, document classification, the imbalanced context and industrial constraints heavily challenge these solutions. This thesis focuses on the automatic classification of document streams and investigates solutions to the three major problems introduced above. To this end, we first propose an evaluation of how existing methods adapt to the document stream context. In addition, this work proposes an evaluation of state-of-the-art solutions to the contextual constraints and of possible combinations between them. Finally, we propose a new combination method that uses a cascade of systems to offer a gradual solution. The most effective solutions are, first, a multimodal neural network reinforced by an attention model that is able to classify a great variety of documents and, second, a cascade of three complementary networks: one for text classification, one for image classification and one for weakly represented classes. These two options provide good results both in an ideal context and in an imbalanced context. In the first case, they match or exceed the state of the art; in the second case, they show an improvement of about +6% F0.5-measure in comparison to the state of the art.
15

Evans, Ieuan. "Semi-supervised topic models applied to mathematical document classification." Thesis, University of Bath, 2017. https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.715299.

Abstract:
Our objective is to build a mathematical document classifier: a machine which, for a given mathematical document $\mathbf{x}$, determines the mathematical subject area $c$. In particular, we wish to construct the function $f$ such that $f(\mathbf{x}, \Theta) = c$, where $f$ requires the possibly unknown parameters $\Theta$ which may be estimated using an existing corpus of labelled documents. The novelty here is that our proposed classifiers will observe a mathematical document over dual vocabularies, in particular as a collection of both words and mathematical symbols. In this thesis, we predominantly review the claims made in [Watt]: mathematical document classification is possible via symbol frequency analysis. In particular, we investigate whether this claim is justified: [Watt] contains no experimental evidence which supports this. Furthermore, we extend this research and investigate whether the inclusion of mathematical notational information improves classification accuracy over the existing single-vocabulary approaches. To do so, we review a selection of machine learning methods for document classification and refine and extend these models to incorporate mathematical notational information, and we investigate whether these models yield higher classification performance than existing word-only versions. In this research, we develop the novel mathematical document models "Dual Latent Dirichlet Allocation" and "Dual Pachinko Allocation", which are extensions of the existing topic models "Latent Dirichlet Allocation" and "Pachinko Allocation" respectively. Our proposed models observe mathematical documents over two separate vocabularies (words and mathematical symbols). Furthermore, we present Online Variational Bayes for Pachinko Allocation and for our proposed models, to allow for fast parameter estimation over a single pass of the data. We perform systematic analysis on these models and verify the claims made in [Watt]; furthermore, we observe that only the inclusion of symbol data via Dual Pachinko Allocation yields an increase in classification performance over the single-vocabulary variants and the prior art in this field.
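As a rough illustration of what a dual-vocabulary topic model of this kind can look like (our own sketch under simplifying assumptions, not necessarily the exact generative process defined in the thesis), a document d with topic proportions shared between its word and symbol tokens could be generated as:

```latex
\begin{align*}
&\theta_d \sim \mathrm{Dirichlet}(\alpha)
  && \text{shared topic proportions for document } d \\
&\phi_k \sim \mathrm{Dirichlet}(\beta_w), \;\;
 \psi_k \sim \mathrm{Dirichlet}(\beta_s)
  && \text{per-topic word and symbol distributions} \\
&z_{di} \sim \mathrm{Mult}(\theta_d), \;\;
 w_{di} \sim \mathrm{Mult}(\phi_{z_{di}})
  && \text{for each word token } i \\
&z'_{dj} \sim \mathrm{Mult}(\theta_d), \;\;
 s_{dj} \sim \mathrm{Mult}(\psi_{z'_{dj}})
  && \text{for each symbol token } j
\end{align*}
```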
16

Namburi, Sruthi. "Logistic regression with conjugate gradient descent for document classification." Kansas State University, 2016. http://hdl.handle.net/2097/32658.

Abstract:
Master of Science, Department of Computing and Information Sciences, William H. Hsu.

Logistic regression is a model for function estimation that measures the relationship between independent variables and a categorical dependent variable by approximating a conditional probability density function using a logistic function, also known as a sigmoid function. Multinomial logistic regression is used to predict categorical variables where there can be more than two categories or classes. The most common type of algorithm for optimizing the cost function for this model is gradient descent. In this project, I implemented logistic regression using conjugate gradient descent (CGD). I used the 20 Newsgroups data set collected by Ken Lang. I compared the results with those for existing implementations of gradient descent. The conjugate gradient optimization methodology outperforms existing implementations.
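A minimal sketch of the idea, fitting binary logistic regression with SciPy's conjugate-gradient optimizer on synthetic data; the project itself applies multinomial logistic regression to 20 Newsgroups features, so this stripped-down binary example only shows where the conjugate-gradient step plugs in.

```python
import numpy as np
from scipy.optimize import minimize

def nll_and_grad(w, X, y):
    # Negative log-likelihood of binary logistic regression and its gradient.
    p = 1.0 / (1.0 + np.exp(-X @ w))                     # sigmoid
    nll = -np.sum(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    return nll, X.T @ (p - y)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)); X[:, 0] = 1.0             # bias column
true_w = np.array([0.5, 2.0, -1.0, 0.0, 1.5])
y = (1 / (1 + np.exp(-X @ true_w)) > rng.random(200)).astype(float)

# Conjugate-gradient optimization of the logistic loss (method="CG").
res = minimize(nll_and_grad, np.zeros(5), args=(X, y), jac=True, method="CG")
print(res.x.round(2))
```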
17

Wang, Yalin. "Document analysis : table structure understanding and zone content classification /." Thesis, Connect to this title online; UW restricted, 2002. http://hdl.handle.net/1773/6079.

18

Chagheri, Samaneh. "An XML document representation method based on structure and content : application in technical document classification." Thesis, Lyon, INSA, 2012. http://www.theses.fr/2012ISAL0085.

Abstract:
L’augmentation rapide du nombre de documents stockés électroniquement représente un défi pour la classification automatique de documents. Les systèmes de classification traditionnels traitent les documents en tant que texte plat, mais les documents sont de plus en plus structurés. Par exemple, XML est la norme la plus connue et la plus utilisée pour la représentation de documents structurés. Ce type de documents comprend des informations complémentaires sur l’organisation du contenu, représentées par différents éléments comme les titres, les sections, les légendes, etc. Pour tenir compte des informations stockées dans la structure logique, nous proposons une approche de représentation des documents structurés basée à la fois sur la structure logique du document et sur son contenu textuel. Notre approche étend le modèle traditionnel de représentation du document appelé modèle vectoriel. Nous avons essayé d’utiliser l’information structurelle dans toutes les phases de la représentation du document : l’extraction des caractéristiques, la sélection des caractéristiques et la pondération des caractéristiques. Notre deuxième contribution concerne l’application de notre approche générique à un domaine réel : la classification des documents techniques. Nous désirons mettre en œuvre notre proposition sur une collection de documents techniques sauvegardés électroniquement dans la société CONTINEW, spécialisée dans l’audit de documents techniques. Ces documents sont dans des formats où la structure logique n’est pas accessible. Nous proposons une solution d’interprétation de documents pour détecter la structure logique des documents à partir de leur présentation physique. Ainsi, une collection hétérogène en différents formats de stockage est transformée en une collection homogène de documents XML contenant le même schéma logique. Cette contribution est basée sur un apprentissage supervisé. En conclusion, notre proposition prend en charge l’ensemble du flux de traitement des documents, du format original jusqu’à la détermination de leur classe. Dans notre système, l’algorithme de classification utilisé est SVM.

The rapid growth in the number of documents stored electronically presents a challenge for the automatic classification of documents. Traditional classification systems consider documents as plain text; however, documents are becoming more and more structured. For example, XML is the best known and most used standard for structured document representation. These documents include supplementary information on content organization, represented by different elements such as titles, sections, captions, etc. We propose an approach to structured document classification based on both the document's logical structure and its content, in order to take into account the information present in the logical structure. Our approach extends the traditional document representation model called the Vector Space Model (VSM). We have tried to integrate structural information in all phases of document representation construction: feature extraction, feature selection and feature weighting. Our second contribution concerns applying our generic approach to a real domain: technical documentation. We use our proposal for classifying technical documents electronically stored at CONTINEW, a company specialized in technical document auditing. These documents are in a legacy format in which the logical structure is inaccessible. We then propose an approach for document understanding in order to extract the documents' logical structure from their presentation layout. Thus a collection of heterogeneous documents in different physical presentations and formats is transformed into a homogeneous XML collection sharing the same logical structure. Our contribution is based on a learning approach in which each logical element is described by its physical characteristics. Therefore, our proposal supports the whole document transformation workflow, from the document's original format to its final class. In our system, SVM is used as the classification algorithm.
19

Gordo, Albert. "Document Image Representation, Classification and Retrieval in Large-Scale Domains." Doctoral thesis, Universitat Autònoma de Barcelona, 2013. http://hdl.handle.net/10803/117445.

Abstract:
A pesar del ideal de “oficina sin papeles” nacida en la década de los setenta, la mayoría de empresas siguen todavía luchando contra una ingente cantidad de documentación en papel. Aunque muchas empresas están haciendo un esfuerzo en la transformación de parte de su documentación interna a un formato digital sin necesidad de pasar por el papel, la comunicación con otras empresas y clientes en un formato puramente digital es un problema mucho más complejo debido a la escasa adopción de estándares. Las empresas reciben una gran cantidad de documentación en papel que necesita ser analizada y procesada, en su mayoría de forma manual. Una solución para esta tarea consiste en, en primer lugar, el escaneo automático de los documentos entrantes. A continuación, las imágenes de los documentos pueden ser analizadas y la información puede ser extraída a partir de los datos. Los documentos también pueden ser automáticamente enviados a los flujos de trabajo adecuados, usados para buscar documentos similares en bases de datos para transferir información, etc. Debido a la naturaleza de esta “sala de correo” digital, es necesario que los métodos de representación de documentos sean generales, es decir, adecuados para representar correctamente tipos muy diferentes de documentos. Es necesario que los métodos sean robustos, es decir, capaces de representar nuevos tipos de documentos, imágenes con ruido, etc. Y, por último, es necesario que los métodos sean escalables, es decir, capaces de funcionar cuando miles o millones de documentos necesitan ser tratados, almacenados y consultados. Desafortunadamente, las técnicas actuales de representación, clasificación y búsqueda de documentos no son aptas para esta sala de correo digital, ya que no cumplen con algunos o ninguno de estos requisitos. En esta tesis nos centramos en el problema de la representación de documentos enfocada a la clasificación y búsqueda en el marco de la sala de correo digital. En particular, en la primera parte de esta tesis primero presentamos un descriptor de documentos basado en un histograma de “runlengths” a múltiples escalas. Este descriptor supera en resultados a otros métodos del estado-del-arte en bases de datos públicas y propias de diferente naturaleza y condición en tareas de clasificación y búsqueda de documentos. Más tarde modificamos esta representación para hacer frente a documentos más complejos, tales como documentos de varias páginas o documentos que contienen más fuentes de información como texto extraído por OCR. En la segunda parte de esta tesis nos centramos en el requisito de escalabilidad, sobre todo para las tareas de búsqueda, en el que todos los documentos deben estar disponibles en la memoria RAM para que la búsqueda pueda ser eficiente. Proponemos un nuevo método de binarización que llamamos PCAE, así como dos distancias asimétricas generales para descriptores binarios que pueden mejorar significativamente los resultados de la búsqueda con un mínimo coste computacional adicional. Por último, señalamos la importancia del aprendizaje supervisado cuando se realizan búsquedas en grandes bases de datos y estudiamos varios enfoques que pueden aumentar significativamente la precisión de los resultados sin coste adicional en tiempo de consulta.

Despite the “paperless office” ideal that started in the decade of the seventies, businesses still struggle with an increasing amount of paper documentation. Although many businesses are making an effort in transforming some of the internal documentation into a digital form with no intrinsic need for paper, the communication with other businesses and clients in a pure digital form is a much more complex problem due to the lack of adopted standards. Companies receive huge amounts of paper documentation that need to be analyzed and processed, mostly in a manual way. A solution for this task consists in, first, automatically scanning the incoming documents. Then, document images can be analyzed and information can be extracted from the data. Documents can also be automatically dispatched to the appropriate workflows, used to retrieve similar documents in the dataset to transfer information, etc. Due to the nature of this “digital mailroom”, we need document representation methods to be general, i.e., able to cope with very different types of documents. We need the methods to be sound, i.e., able to cope with unexpected types of documents, noise, etc. And we need the methods to be scalable, i.e., able to cope with thousands or millions of documents that need to be processed, stored, and consulted. Unfortunately, current techniques of document representation, classification and retrieval are not apt for this digital mailroom framework, since they do not fulfill some or all of these requirements. Through this thesis we focus on the problem of document representation aimed at classification and retrieval tasks under this digital mailroom framework. Specifically, in the first part of this thesis, we first present a novel document representation based on runlength histograms that achieves state-of-the-art results on public and in-house datasets of different nature and quality on classification and retrieval tasks. This representation is later modified to cope with more complex documents such as multiple-page documents, or documents that contain more sources of information such as extracted OCR text. Then, in the second part of this thesis, we focus on the scalability requirements, particularly for retrieval tasks, where all the documents need to be available in RAM memory for the retrieval to be efficient. We propose a novel binarization method which we dubbed PCAE, as well as two general asymmetric distances between binary embeddings that can significantly improve the retrieval results at a minimal extra computational cost. Finally, we note the importance of supervised learning when performing large-scale retrieval, and study several approaches that can significantly boost the results at no extra cost at query time.
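The runlength-histogram idea mentioned in this abstract can be illustrated with a very small sketch. The version below is single-scale and horizontal-only over a toy binarized page, whereas the descriptor in the thesis is multi-scale, so treat it as a simplified reading rather than the actual method.

```python
import numpy as np

def runlength_histogram(img, max_len=32):
    # img: 2-D binary array (1 = ink). Histogram of horizontal run lengths of
    # ink and background pixels, normalized to sum to one.
    hist = np.zeros((2, max_len))
    for row in img:
        run, val = 1, row[0]
        for px in row[1:]:
            if px == val:
                run += 1
            else:
                hist[val, min(run, max_len) - 1] += 1
                run, val = 1, px
        hist[val, min(run, max_len) - 1] += 1
    return (hist / hist.sum()).ravel()

page = (np.random.rand(64, 64) > 0.8).astype(int)  # toy stand-in for a binarized page
print(runlength_histogram(page).shape)             # 64-dimensional feature vector
```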
20

Alsaad, Amal. "Enhanced root extraction and document classification algorithm for Arabic text." Thesis, Brunel University, 2016. http://bura.brunel.ac.uk/handle/2438/13510.

Abstract:
Many text extraction and classification systems have been developed for English and other international languages; most of these languages are based on Roman letters. However, Arabic is one of the difficult languages, with special rules and morphology, and not many systems have been developed for Arabic text categorization. Arabic is one of the Semitic languages, with a morphology that is more complicated than that of English. Due to this complex morphology, there is a need for pre-processing routines to extract the roots of the words and then classify them according to the group of acts or meaning. In this thesis, a system has been developed and tested for text classification. The system is based on two stages: the first is to extract the roots from text, and the second is to classify the text according to predefined categories. The linguistic root extraction stage is composed of two main phases. The first phase handles removal of affixes, including prefixes, suffixes and infixes. Prefixes and suffixes are removed depending on the length of the word, while checking its morphological pattern after each deduction to remove infixes. In the second phase, the root extraction algorithm is formulated to handle weak, defined, eliminated-long-vowel and two-letter geminated words, as there is a substantial amount of irregular Arabic words in texts. Once the roots are extracted, they are checked against a predefined list of 3800 triliteral and 900 quadriliteral roots. A series of experiments has been conducted to improve and test the performance of the proposed algorithm. The obtained results revealed that the developed algorithm has better accuracy than the existing stemming algorithm. The second stage is the document classification stage. In this stage, two non-parametric classifiers are tested, namely Artificial Neural Networks (ANN) and Support Vector Machine (SVM). The system is trained on 6 categories: culture, economy, international, local, religion and sports. The system is trained on 80% of the available data. From each category, the 10 most frequent terms are selected as features. Testing the classification algorithms has been done on the remaining 20% of the documents. The results of ANN and SVM are compared to the standard method used for text classification, the term frequency-based method. Results show that ANN and SVM have better accuracy (80-90%) compared to the standard method (60-70%). The proposed method proves the ability to categorize Arabic text documents into the appropriate categories with a high precision rate.
21

Soltan-Zadeh, Yasaman. "Improved rule-based document representation and classification using genetic programming." Thesis, Royal Holloway, University of London, 2011. http://repository.royalholloway.ac.uk/items/479a1773-779b-8b24-b334-7ed485311abe/8/.

22

Borodavkina, Lyudmila 1977. "Investigation of machine learning tools for document clustering and classification." Thesis, Massachusetts Institute of Technology, 2000. http://hdl.handle.net/1721.1/8932.

Abstract:
Thesis (M.Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2000. Includes bibliographical references (leaves 57-59).

Data clustering is a problem of discovering the underlying data structure without any prior information about the data. The focus of this thesis is to evaluate a few of the modern clustering algorithms in order to determine their performance in adverse conditions. Synthetic Data Generation software is presented as a useful tool both for generating test data and for investigating results of the data clustering. Several theoretical models and their behavior are discussed, and, as the result of analysis of a large number of quantitative tests, we come up with a set of heuristics that describe the quality of clustering output in different adverse conditions.

by Lyudmila Borodavkina. M.Eng.
23

Moritz, Hugo. "A comparative study of machine learning algorithms for Document Classification." Thesis, Uppsala universitet, Informationssystem, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-414709.

Abstract:
In a more digitalized world, companies with e-archive solutions want to use modern methods to develop their business. One such method is to automatically classify the content of documents; a common approach is to apply machine learning, also known as document classification. There is a lack of updated research comparing different machine learning algorithms, in particular on whether more modern methods such as neural networks outperform more traditional statistical machine learning methods. The document classification process goes through pre-processing, feature selection, document representation, and training and testing of the classifiers. Five different machine learning methods are implemented, with different stemming and feature selection settings, and results are presented in terms of various classification metrics and time consumption. The results show that the neural network classifier has accuracy as high as one of the traditional statistical classifiers, SVM, but at a higher computational cost. Further studies in the document classification area with other programming languages and libraries could shed more light on whether the differences can be characterized even more precisely.
24

Anne, Chaitanya. "Advanced Text Analytics and Machine Learning Approach for Document Classification." ScholarWorks@UNO, 2017. http://scholarworks.uno.edu/td/2292.

Abstract:
Text classification is used in information extraction and retrieval from a given text, and has been considered an important step in managing the vast and expanding number of records available in digital form. This thesis addresses the problem of classifying patent documents into fifteen different categories or classes, where some classes overlap with others for practical reasons. For the development of the classification model using machine learning techniques, useful features have been extracted from the given documents. The features are used to classify patent documents as well as to generate useful tag-words. The overall objective of this work is to systematize NASA’s patent management by developing a set of automated tools that can assist NASA to manage and market its portfolio of intellectual properties (IP), and to enable easier discovery of relevant IP by users. We have identified an array of methods that can be applied, such as k-Nearest Neighbors (kNN), two variations of the Support Vector Machine (SVM) algorithm, and two tree-based classification algorithms: Random Forest and J48. The major research steps in this work consist of filtering techniques for variable selection, information gain and feature correlation analysis, and training and testing potential models using effective classifiers. Further, the obstacles associated with the imbalanced data were mitigated by adding synthetic data wherever appropriate, which resulted in a superior SVM-based classification model.
25

Rida, Imad. "Temporal signals classification." Thesis, Normandie, 2017. http://www.theses.fr/2017NORMIR01/document.

Abstract:
De nos jours, il existe de nombreuses applications liées à la vision et à l’audition visant à reproduire par des machines les capacités humaines. Notre intérêt pour ce sujet vient du fait que ces problèmes sont principalement modélisés par la classification de signaux temporels. En fait, nous nous sommes intéressés à deux cas distincts : la reconnaissance de la démarche humaine et la reconnaissance de signaux audio (notamment environnementaux et musicaux). Dans le cadre de la reconnaissance de la démarche, nous avons proposé une nouvelle méthode qui apprend et sélectionne automatiquement les parties dynamiques du corps humain. Ceci permet de résoudre le problème des variations intra-classe de façon dynamique, les méthodes à l’état de l’art se basant au contraire sur des connaissances a priori. Dans le cadre de la reconnaissance audio, aucune représentation de caractéristiques conventionnelle n’a montré sa capacité à s’attaquer indifféremment à des problèmes de reconnaissance d’environnement ou de musique : diverses caractéristiques ont été introduites pour résoudre chaque tâche spécifiquement. Nous proposons ici un cadre général qui effectue la classification des signaux audio grâce à un problème d’apprentissage de dictionnaire supervisé visant à minimiser et maximiser les variations intra-classe et inter-classe respectivement.

Nowadays, there are a lot of applications related to machine vision and hearing which try to reproduce human capabilities on machines. These problems are mainly amenable to temporal signal classification, hence our interest in this subject. In fact, we were interested in two distinct problems: human gait recognition and audio signal recognition, including both environmental and music signals. In the former, we have proposed a novel method to automatically learn and select the dynamic human body parts to tackle the problem of intra-class variations, contrary to state-of-the-art methods which rely on predefined knowledge. To achieve this, a group fused lasso algorithm is applied to segment the human body into parts with coherent motion values across the subjects. In the latter, since no conventional feature representation has shown its ability to tackle both environmental and music problems, we propose to model audio classification as a supervised dictionary learning problem. This is done by learning a dictionary per class and encouraging dissimilarity between the dictionaries by penalizing their pairwise similarities. In addition, the coefficients of a signal's representation over these dictionaries are sought to be as sparse as possible. The experimental evaluations provide strong and encouraging results.
26

Cisse, Mouhamadou Moustapha. "Efficient extreme classification." Thesis, Paris 6, 2014. http://www.theses.fr/2014PA066594/document.

Abstract:
Dans cette thèse, nous proposons des méthodes à faible complexité pour la classification en présence d’un très grand nombre de catégories. Ces méthodes permettent d’accélérer la prédiction des classifieurs afin de les rendre utilisables dans les applications courantes. Nous proposons deux méthodes destinées respectivement à la classification monolabel et à la classification multilabel. La première méthode utilise l’information hiérarchique existante entre les catégories afin de créer une représentation binaire compacte de celles-ci. La seconde approche, destinée aux problèmes multilabel, adapte le framework des filtres de Bloom à la représentation de sous-ensembles de labels sous forme de vecteurs binaires sparses. Dans chacun des cas, des classifieurs binaires sont appris afin de prédire les représentations des catégories/labels, et un algorithme permettant de retrouver l’ensemble des catégories pertinentes à partir de la représentation prédite est proposé. Les méthodes proposées sont validées par des expériences sur des données à grande échelle et donnent des performances supérieures aux méthodes classiquement utilisées pour la classification extrême.

We propose in this thesis new methods to tackle classification problems with a large number of labels, also called extreme classification. The proposed approaches aim at reducing the inference complexity in comparison with classical methods such as one-versus-rest, in order to make learning machines usable in real-life scenarios. We propose two types of methods, respectively for single-label and multilabel classification. The first proposed approach uses existing hierarchical information among the categories in order to learn low-dimensional binary representations of the categories. The second type of approach, dedicated to multilabel problems, adapts the framework of Bloom filters to represent subsets of labels with sparse low-dimensional binary vectors. In both approaches, binary classifiers are learned to predict the new low-dimensional representation of the categories, and several algorithms are also proposed to recover the set of relevant labels. Large-scale experiments validate the methods.
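To give a feel for the second approach, here is a toy sketch of a Bloom-filter-style label encoding: each label hashes to a few bit positions of a short binary code, a label set is encoded as the union of its labels' bits, and decoding keeps any candidate label whose bits are all on. The hashing scheme, code length and naive decoding below are illustrative assumptions and omit the collision-aware recovery algorithms proposed in the thesis.

```python
import hashlib

B = 64   # length of the binary code (much smaller than the number of labels)
K = 2    # bits set per label

def label_bits(label):
    # K hash functions simulated by salting one hash (illustrative only).
    return {int(hashlib.md5(f"{label}:{i}".encode()).hexdigest(), 16) % B
            for i in range(K)}

def encode(labels):
    # A label subset is represented by the union (OR) of its labels' bits.
    bits = set().union(*[label_bits(l) for l in labels])
    return [1 if i in bits else 0 for i in range(B)]

def decode(code, candidates):
    # Naive decoding: keep any label whose bits are all switched on; collisions
    # can add spurious labels, which the thesis's decoding algorithms address.
    on = {i for i, b in enumerate(code) if b}
    return [l for l in candidates if label_bits(l) <= on]

labels = [f"label_{i}" for i in range(1000)]   # large label set
code = encode(["label_3", "label_42"])         # sparse, low-dimensional binary code
print(decode(code, labels))
```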
27

Sendur, Zeynel. "Text Document Categorization by Machine Learning." Scholarly Repository, 2008. http://scholarlyrepository.miami.edu/oa_theses/209.

Abstract:
Because of the explosion of digital and online text information, automatic organization of documents has become a very important research area. There are mainly two machine learning approaches to enhance the organization task of digital documents. One of them is the supervised approach, where pre-defined category labels are assigned to documents based on the likelihood suggested by a training set of labeled documents; the other is the unsupervised approach, where there is no need for human intervention or labeled documents at any point in the whole process. In this thesis, we concentrate on the supervised learning task which deals with document classification. One of the most important tasks of information retrieval is to induce classifiers capable of categorizing text documents. The same document can belong to two or more categories, and this situation is referred to by the term multi-label classification. Multi-label classification domains have been encountered in diverse fields. Most of the existing machine learning techniques used in multi-label classification domains are extremely expensive since the documents are characterized by an extremely large number of features. In this thesis, we try to reduce these computational costs by applying different types of algorithms to documents which are characterized by a large number of features. Another important goal of this thesis is to achieve the highest possible accuracy while maintaining high computational performance on text document categorization.
28

Wang, Yanbo Justin. "Language-independent pre-processing of large document bases for text classification." Thesis, University of Liverpool, 2008. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.445960.

Abstract:
Text classification is a well-known topic in the research of knowledge discovery in databases. Algorithms for text classification generally involve two stages. The first is concerned with identification of textual features (i.e. words and/or phrases) that may be relevant to the classification process. The second is concerned with classification rule mining and categorisation of "unseen" textual data. The first stage is the subject of this thesis and often involves an analysis of text that is both language-specific (and possibly domain-specific), and that may also be computationally costly especially when dealing with large datasets. Existing approaches to this stage are not, therefore, generally applicable to all languages. In this thesis, we examine a number of alternative keyword selection methods and phrase generation strategies, coupled with two potential significant word list construction mechanisms and two final significant word selection mechanisms, to identify such words and/or phrases in a given textual dataset that are expected to serve to distinguish between classes, by simple, language-independent statistical properties. We present experimental results, using common (large) textual datasets presented in two distinct languages, to show that the proposed approaches can produce good performance with respect to both classification accuracy and processing efficiency. In other words, the study presented in this thesis demonstrates the possibility of efficiently solving the traditional text classification problem in a language-independent (also domain-independent) manner.
29

Stewart, Seth Andrew. "Fully Convolutional Neural Networks for Pixel Classification in Historical Document Images." BYU ScholarsArchive, 2018. https://scholarsarchive.byu.edu/etd/7064.

Abstract:
We use a Fully Convolutional Neural Network (FCNN) to classify pixels in historical document images, enabling the extraction of high-quality, pixel-precise and semantically consistent layers of masked content. We also analyze a dataset of hand-labeled historical form images of unprecedented detail and complexity. The semantic categories we consider in this new dataset include handwriting, machine-printed text, dotted and solid lines, and stamps. Segmentation of document images into distinct layers allows handwriting, machine print, and other content to be processed and recognized discriminatively, and therefore more intelligently than might be possible with content-unaware methods. We show that an efficient FCNN with relatively few parameters can accurately segment documents having similar textural content when trained on a single representative pixel-labeled document image, even when layouts differ significantly. In contrast to the overwhelming majority of existing semantic segmentation approaches, we allow multiple labels to be predicted per pixel location, which allows for direct prediction and reconstruction of overlapped content. We perform an analysis of prevalent pixel-wise performance measures, and show that several popular performance measures can be manipulated adversarially, yielding arbitrarily high measures based on the type of bias used to generate the ground-truth. We propose a solution to the gaming problem by comparing absolute performance to an estimated human level of performance. We also present results on a recent international competition requiring the automatic annotation of billions of pixels, in which our method took first place.
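The multiple-labels-per-pixel idea can be made concrete with a deliberately tiny PyTorch sketch: a fully convolutional stack with per-class sigmoid outputs and binary cross-entropy, so that overlapping handwriting, print and line masks can be active at the same pixel. The layer sizes and random tensors below are placeholders, not the network used in the thesis.

```python
import torch
import torch.nn as nn

N_CLASSES = 4  # e.g. handwriting, machine print, lines, stamps

# A deliberately tiny fully convolutional stack: every layer is convolutional,
# so the output keeps the spatial layout and each pixel gets N_CLASSES scores.
fcn = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, N_CLASSES, 1),
)

images = torch.rand(2, 1, 64, 64)                               # grayscale crops
targets = torch.randint(0, 2, (2, N_CLASSES, 64, 64)).float()   # overlapping masks

# Per-class sigmoid + binary cross-entropy lets several labels be active at the
# same pixel, unlike a softmax over mutually exclusive classes.
loss = nn.BCEWithLogitsLoss()(fcn(images), targets)
loss.backward()
print(loss.item())
```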
30

Felhi, Mehdi. "Document image segmentation : content categorization." Thesis, Université de Lorraine, 2014. http://www.theses.fr/2014LORR0109/document.

Abstract:
In this thesis I discuss the document image segmentation problem and describe our new approaches for detecting and classifying document contents. First, I discuss our skew angle estimation approach, whose aim is to estimate, with precision, the skew angle of text in document images. Our method is based on the Maximum Gradient Difference (MGD) and the R-signature; I then describe our second method, based on the Ridgelet transform. Our second contribution is a new hybrid page segmentation approach. I first describe our stroke-based descriptor, which detects text and line candidates using the skeleton of the binarized document image. Then, an active contour model is applied to segment the rest of the image into photo and background regions. Finally, text candidates are clustered with the mean-shift analysis technique according to their sizes. The method is applied to segmenting scanned document images (newspapers and magazines) that contain text, lines and photo regions. Finally, I describe our stroke-based text extraction method. Our approach begins by extracting connected components and selecting text character candidates over the CIE LCH color space, using Histogram of Oriented Gradients (HOG) correlation coefficients in order to detect low-contrast regions. The text region candidates are clustered using two different approaches: a depth-first search over a graph, and a stable text line criterion. Finally, the resulting regions are refined by classifying the text line candidates into "text" and "non-text" regions using a Kernel Support Vector Machine (K-SVM) classifier.
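To make the skew-estimation task concrete, here is a generic projection-profile baseline: rotate the binarized page over candidate angles and keep the angle that maximizes the variance of the horizontal projection. This is a common textbook method, not the MGD/R-signature/Ridgelet approach of the thesis, and the angle range, step and toy page are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import rotate

def estimate_skew(binary_page, angle_range=5.0, step=0.25):
    """Brute-force projection-profile skew estimator: text lines align with rows
    when the page is deskewed, so row-projection variance peaks at the right angle."""
    best_angle, best_score = 0.0, -np.inf
    for angle in np.arange(-angle_range, angle_range + step, step):
        rotated = rotate(binary_page, angle, reshape=False, order=0)
        profile = rotated.sum(axis=1)          # ink count per row
        score = np.var(profile)
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle

page = np.zeros((200, 200))
page[50:55, 20:180] = 1                         # one synthetic text line
skewed = rotate(page, 2.0, reshape=False, order=0)
print(estimate_skew(skewed))                    # about -2 degrees, i.e. the deskewing angle
```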
APA, Harvard, Vancouver, ISO, and other styles
31

Lourme, Alexandre. "Contribution à la classification par modèles de mélange et classification simultanée d’échantillons d’origines multiples." Thesis, Lille 1, 2011. http://www.theses.fr/2011LIL10073/document.

Full text
Abstract:
In the first part of this work we review mixture model-based clustering. In particular, we describe a family of commonly used Gaussian mixtures whose parsimony bears on parameters with a geometrical interpretation. As these models suffer from major drawbacks, we introduce a new family of mixtures whose parsimony bears on statistical parameters. These new models have many stability properties that make them mathematically consistent and facilitate their interpretation. In the second part of this work we present the so-called simultaneous clustering method. We highlight that the classification of a single sample can often be seen as a multiple-sample clustering problem; we then propose to establish a link between the populations of origin of the diverse samples. This link varies with the context, but it always aims to formalize, in a realistic way, some information common to the samples to be classified. When the samples are described by variables with identical meaning and the same number of groups is sought within each of them, we establish a stochastic link between the conditional populations. When the variables differ but are semantically close across the samples, their discriminant power may nevertheless be similar and the nesting of the conditional data comparable; we consider mixtures dedicated to this context, linked by a homogeneous overlap of their components.
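The geometrically parsimonious Gaussian-mixture family reviewed in the first part corresponds loosely to the covariance constraints available in scikit-learn; the sketch below only illustrates that standard family and its varying number of free parameters, not the statistically parsimonious models proposed in the thesis, and the data are synthetic.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 0.5, (100, 2))])

# Each covariance_type is one point on the usual geometric parsimony scale:
# 'full' > 'tied' > 'diag' > 'spherical' in number of free covariance parameters.
for cov in ["full", "tied", "diag", "spherical"]:
    gm = GaussianMixture(n_components=2, covariance_type=cov, random_state=0).fit(X)
    print(cov, "BIC:", round(gm.bic(X), 1))
```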
APA, Harvard, Vancouver, ISO, and other styles
32

Neggaz, Mohammed Yessin. "Automatic classification of dynamic graphs." Thesis, Bordeaux, 2016. http://www.theses.fr/2016BORD0169/document.

Full text
Abstract:
Dynamic networks consist of entities making contact with one another over time. A major challenge in dynamic networks is to predict mobility patterns and to decide whether the evolution of the topology satisfies the requirements for the success of a given algorithm. The types of dynamics arising from these networks vary in scale and nature. For instance, some of these networks remain connected at all times; others are always disconnected but still offer some kind of connectivity over time and space (temporal connectivity); others are recurrently connected, periodic, and so on. All of these contexts can be represented as dynamic graph classes corresponding to necessary or sufficient conditions for given distributed problems or algorithms. Given a dynamic graph, a natural question is to which of these classes the graph belongs. In this work we contribute to the automation of dynamic graph classification. We provide strategies for testing membership of a dynamic graph in a given class, and a generic framework for testing properties in dynamic graphs. We also explore what can still be done when no property on the graph is guaranteed, through the distributed problem of maintaining a spanning forest in highly dynamic graphs.
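A toy illustration of membership testing for two example dynamic-graph classes, with the dynamic graph given as a sequence of edge snapshots: "connected at every instant" versus "temporally connected" via strict journeys (at most one hop per snapshot). The snapshot representation and both tests are simplifying assumptions, not the thesis's framework.

```python
def is_connected(nodes, edges):
    """Static (per-snapshot) connectivity check by depth-first search."""
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        for w in adj[stack.pop()]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen == set(nodes)

def always_connected(nodes, snapshots):
    """Class 'connected at every instant'."""
    return all(is_connected(nodes, edges) for edges in snapshots)

def temporally_connected(nodes, snapshots):
    """Class 'temporally connected': every node reaches every other by a strict
    journey (one hop per snapshot, hops taken in chronological order)."""
    reach = {v: {v} for v in nodes}
    for edges in snapshots:
        new = {v: set(r) for v, r in reach.items()}
        for u, w in edges:
            for x in nodes:
                if u in reach[x]:
                    new[x].add(w)
                if w in reach[x]:
                    new[x].add(u)
        reach = new
    return all(reach[x] == set(nodes) for x in nodes)

nodes = {1, 2, 3}
snaps = [{(1, 2)}, {(2, 3)}, {(1, 2)}]
print(always_connected(nodes, snaps))      # False: some node is isolated in each snapshot
print(temporally_connected(nodes, snaps))  # True: journeys exist between all pairs over time
```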
APA, Harvard, Vancouver, ISO, and other styles
33

Lu, Ying. "Transfer Learning for Image Classification." Thesis, Lyon, 2017. http://www.theses.fr/2017LYSEC045/document.

Full text
Abstract:
When learning a classification model for a new target domain with only a small number of training samples, brute-force application of machine learning algorithms generally leads to over-fitted classifiers with poor generalization. On the other hand, collecting a sufficient number of manually labeled training samples may prove very expensive. Transfer learning methods aim to solve this kind of problem by transferring knowledge from a related source domain, which has much more data, to help classification in the target domain. Depending on the assumptions made about the target and source domains, transfer learning can be further categorized into three families: inductive transfer learning, transductive transfer learning (domain adaptation) and unsupervised transfer learning. We focus on the first, which assumes that the target task and the source task are different but related. More specifically, we assume that both are classification tasks, while the target categories and source categories are different but related. We propose two methods to approach this inductive transfer learning (ITL) problem. In the first work we propose a new discriminative transfer learning method, namely DTL, combining a series of hypotheses made both by the model learned with target training samples and by additional models learned with source category samples. Specifically, we use the sparse reconstruction residual as a basic discriminant and enhance its discriminative power by comparing two residuals from a positive and a negative dictionary. On this basis, we make use of similarities and dissimilarities by choosing both positively and negatively correlated source categories to form additional dictionaries. A new cost function based on the Wilcoxon-Mann-Whitney statistic is proposed to choose the additional dictionaries with unbalanced training data. Two parallel boosting processes are also applied to the positive and negative data distributions to further improve classifier performance. On two different image classification databases, the proposed DTL consistently outperforms other state-of-the-art transfer learning methods while maintaining a very efficient runtime. In the second work we combine the power of Optimal Transport (OT) and deep neural networks to tackle the ITL problem. Specifically, we propose a novel method to jointly fine-tune a deep neural network with source data and target data. By adding an Optimal Transport loss (OT loss) between source and target classifier predictions as a constraint on the source classifier, the proposed Joint Transfer Learning Network (JTLN) can effectively learn knowledge useful for target classification from the source data.
Furthermore, by using different kinds of metrics as the cost matrix for the OT loss, JTLN can incorporate different prior knowledge about the relatedness between target and source categories. We carried out experiments with an AlexNet-based JTLN on image classification datasets, and the results verify the effectiveness of the proposed JTLN in comparison with standard consecutive fine-tuning. To the best of our knowledge, JTLN is the first work to tackle ITL with deep neural networks while incorporating prior knowledge on the relatedness between target and source categories. This joint transfer learning with an OT loss is general and can also be applied to other kinds of neural networks.
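A small sketch of an entropic OT loss computed with Sinkhorn iterations between pooled source and target prediction distributions, with a user-chosen cost matrix encoding category relatedness. The pooling, dimensions and hyper-parameters are assumptions for illustration and do not reproduce the JTLN architecture or training procedure.

```python
import torch

def sinkhorn_ot_loss(p, q, cost, eps=0.1, n_iter=50):
    """Entropic-regularised OT between two discrete distributions p (over source
    categories) and q (over target categories); `cost` has shape (len(p), len(q))
    and encodes how unrelated each source/target category pair is."""
    K = torch.exp(-cost / eps)                  # Gibbs kernel
    u = torch.ones_like(p)
    v = torch.ones_like(q)
    for _ in range(n_iter):                     # Sinkhorn fixed-point iterations
        u = p / (K @ v)
        v = q / (K.t() @ u)
    plan = u.unsqueeze(1) * K * v.unsqueeze(0)  # transport plan
    return (plan * cost).sum()

# Toy example: 3 source categories, 2 target categories.
p = torch.tensor([0.5, 0.3, 0.2])               # averaged source predictions
q = torch.tensor([0.6, 0.4])                    # averaged target predictions
cost = torch.tensor([[0.1, 1.0],                # low cost = strongly related categories
                     [1.0, 0.2],
                     [0.8, 0.7]])
print(sinkhorn_ot_loss(p, q, cost))
```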
APA, Harvard, Vancouver, ISO, and other styles
34

Chzhen, Evgenii. "Plug-in methods in classification." Thesis, Paris Est, 2019. http://www.theses.fr/2019PESC2027/document.

Full text
Abstract:
This manuscript studies several problems of constrained classification. In this framework, our goal is to construct an algorithm that performs as well as the best classifier that obeys some desired property. Plug-in type classifiers are well suited to this goal. Interestingly, in several setups these classifiers can leverage unlabeled data, that is, they are constructed in a semi-supervised manner. Chapter 2 describes two particular settings of binary classification: classification with the F-score and classification under equal opportunity. For both problems, semi-supervised procedures are proposed and their theoretical properties are established. In the case of the F-score, the proposed procedure is shown to be minimax optimal over a standard non-parametric class of distributions. In the case of classification under equal opportunity, the proposed algorithm is shown to be consistent in terms of the misclassification risk and its asymptotic fairness is established; moreover, the proposed procedure outperforms state-of-the-art algorithms in the field. Chapter 3 describes the setup of confidence-set multi-class classification.
Again, a semi-supervised procedure is proposed and its near minimax optimality is established. It is additionally shown that no supervised algorithm can achieve a so-called fast rate of convergence; in contrast, the proposed semi-supervised procedure can achieve fast rates provided that the amount of unlabeled data is sufficiently large. Chapter 4 describes a setup of multi-label classification where one aims at minimizing the false negative error subject to almost-sure type constraints. Two specific constraints are considered: sparse predictions and predictions with control over false negative errors. For the former, a supervised algorithm is provided and shown to achieve fast rates of convergence. For the latter, it is shown that extra assumptions are necessary in order to obtain theoretical guarantees.
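For orientation, the generic form of a plug-in rule that the abstract's procedures instantiate is sketched below; the notation is generic and not the manuscript's own.

```latex
% A plug-in classifier first estimates the regression function
% \eta(x) = P(Y = 1 \mid X = x), then thresholds the estimate:
\[
  \hat g_{\theta}(x) \;=\; \mathbb{1}\{\hat\eta(x) \ge \theta\},
\]
% where the threshold \theta is tuned to the constraint at hand
% (e.g. chosen to maximise the F-score, or calibrated -- possibly on
% unlabeled data -- so that a fairness constraint holds asymptotically).
```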
APA, Harvard, Vancouver, ISO, and other styles
35

"Parameter free document stream classification." Thesis, 2006. http://library.cuhk.edu.hk/record=b6074286.

Full text
Abstract:
In this century of information overload, information becomes ever more pervasive. A new class of data-intensive applications arises in which data is best modelled as an open-ended stream; we call such data a data stream. A document stream is a variation of a data stream consisting of a sequence of chronologically ordered documents. A fundamental problem in mining document streams is to extract meaningful structure from them, so as to help organize their contents systematically. This dissertation studies two such problems: identifying the bursty topics in a document stream, and constructing classifiers for those bursty topics. A bursty topic is a topic in the document stream to which a large number of documents relate during a bounded time interval.

Two heuristics, PFreeBT and PNLH, are proposed to tackle these problems. PFreeBT identifies the bursty topics in a document stream, whereas PNLH constructs a reliable classifier for a given bursty topic. Both heuristics are parameter free: users do not need to provide any parameter explicitly, as all required variables can be computed automatically from the given document stream. For bursty topic identification, PFreeBT adopts what we term a feature-pivot clustering approach: given a document stream, it first identifies a set of bursty features based on computed probability distributions, and then extracts a set of bursty topics according to the patterns of the bursty features and two newly defined concepts (equivalent and map-to). For constructing a reliable classifier, the task is formulated as a partially supervised classification problem in which only a few training examples are labeled as positive (P) and all other training examples (U) remain unlabeled; U mixes negative examples (N) with some further positive examples (P'). Existing techniques for this problem focus on finding N from U, and none attempts to extract P' from U, which is difficult because the topics in U are diverse and the features there are sparse. PNLH is proposed for extracting high-quality P' and N from U.

Extensive experiments evaluate the effectiveness of PFreeBT and PNLH using a two-year stream of news stories and three benchmarks. The patterns of the bursty features and bursty topics identified by PFreeBT match our expectations, while PNLH demonstrates significant improvements over all existing heuristics; these favorable results indicate that both heuristics are highly effective and feasible.

Fung Pui Cheong Gabriel. "August 2006." Adviser: Jeffrey Xu Yu. Source: Dissertation Abstracts International, Volume 68-03, Section B, page 1720. Thesis (Ph.D.)--Chinese University of Hong Kong, 2006. Includes bibliographical references (p. 122-130). Abstracts in English and Chinese.
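A toy version of bursty-feature detection over a document stream: a feature is flagged in the windows where its document frequency greatly exceeds its own average (a simple two-standard-deviation rule). PFreeBT's parameter-free probability modelling is more involved, so treat the names and thresholds here as illustrative assumptions.

```python
from collections import Counter
import math

def bursty_features(windows, min_windows=3):
    """`windows` is a list of Counters: window index -> feature -> document frequency.
    A feature is flagged as bursty in a window where its frequency exceeds its own
    mean across windows by more than two standard deviations."""
    features = set().union(*windows)
    bursts = {}
    for f in features:
        series = [w.get(f, 0) for w in windows]
        if sum(1 for x in series if x > 0) < min_windows:
            continue                                  # too rare to assess
        mean = sum(series) / len(series)
        std = math.sqrt(sum((x - mean) ** 2 for x in series) / len(series))
        hot = [i for i, x in enumerate(series) if std > 0 and x > mean + 2 * std]
        if hot:
            bursts[f] = hot
    return bursts

# Toy stream: weekly document frequencies of two features.
weeks = [Counter({"election": 2, "storm": 1}) for _ in range(10)]
weeks[6] = Counter({"election": 2, "storm": 40})      # "storm" bursts in week 6
print(bursty_features(weeks))                         # {'storm': [6]}
```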
APA, Harvard, Vancouver, ISO, and other styles
36

Jiang, Zhao Ren, and 江昭仁. "Document classification and character isolation." Thesis, 1994. http://ndltd.ncl.edu.tw/handle/57829476991764310838.

Full text
APA, Harvard, Vancouver, ISO, and other styles
37

Yang, Yun Yan, and 楊允言. "Document Automatic Classification and Ranking." Thesis, 1993. http://ndltd.ncl.edu.tw/handle/08002726935326280969.

Full text
APA, Harvard, Vancouver, ISO, and other styles
38

Wu, Chun-Yi, and 吳俊儀. "Performance Evaluation of Various Document Content Sources on Document Classification." Thesis, 2008. http://ndltd.ncl.edu.tw/handle/26716323694509244674.

Full text
Abstract:
碩士<br>華梵大學<br>資訊管理學系碩士班<br>96<br>Nowadays, keyword extracting which mostly depends on the judgement of professional researchers is a waste of time and manpower. Therefore, it is important to employ automatic keyword extraction methods in text categorization. The principal procedures of automatic document classification are (1) data retrieval and text process, (2) keyword retrieval, (3) feature selection, and (4) classifier selection. The principal intention of data retrieval and text process is to extraction the classified document source, determine the text content to be classified, and then process the text according to document contents. There are four major types of document contents which classification is based on: (1) document topic, (2) document abstract, (3) the full text, and (4) automatic document summarization. I use the digital full text in Electronic Theses and Dissertation System as experiment data set, classify topic, abstract, and the full text, and adopt K-Folds to analyze. CKIP is used as Chinese word-segmented, and using the rule of syntax and Stop Word deletion to extract keywords. Those keywords are weighted by tf-idf and then preceded classification with Support Vector Machine. The result of this research from the maximum to the minimum is the full text, topic, and abstract according to their accuracy.
APA, Harvard, Vancouver, ISO, and other styles
39

Fu, Liang Ching, and 梁清福. "Automatic Document Classification Using Multiple Classifiers." Thesis, 2009. http://ndltd.ncl.edu.tw/handle/47087767700942760032.

Full text
Abstract:
碩士<br>中國文化大學<br>資訊管理研究所<br>98<br>The development for automatic document classification technology not only can assist massive and repetitive efforts needed in manual classification, but also with the standardization of automatic classification principle and the employment of the repetitive experimentation verification for the algorithms, it can save both the manpower and time cost factors as well. In the advent of internet access and the priority placed on both the knowledge and digital contents either from the personal or enterprise perspective, it results with rather large accumulation of document quantities. Therefore, rapid and effective utilization of information and having them converted into systematic knowledge become increasingly important. This research employs the digital contents processing technology and extends the research findings from other scholars. we combines different classifiers (Bayes classifier,KNN classifier and SVM classifier) to establish a multiple classifiers system for document classification with the aim to obtain better performance. First,a single classifier prototype for each classification algorithm is produced and evaluated by its integral classification performance and individual performance in each category. Then the multiple classifier system combines the results from each single classifier using the voting and the maximum precision schemes. Experimental results show that the multiple classifier system is superior to single classifier in either Macro-F or Macro-F measure.
APA, Harvard, Vancouver, ISO, and other styles
40

Chen, Jr-Wei, and 陳智偉. "Methodologies and Analysis for Document Classification." Thesis, 1998. http://ndltd.ncl.edu.tw/handle/42726615497946198427.

Full text
Abstract:
碩士<br>國立清華大學<br>資訊工程學系<br>86<br>Document classification is closely related to pattern recognition. Inthis study, we apply several conventional pattern recognition methodsfor document classification. We propose an efficient method for findingthe nearest neighbor in the KNNR method. Moreover, we derive an incre-mental formula to find the leave-one-out error measure for a multi-dimensional Gaussian classifier. To verify our results, we invite somepeople to do manual classification. We examine the results from oursimulation and manual classification, and discuss the factors that maycause the performance discrepancies between machine and manual classifi-cation. Other approaches for improving the classification performanceare also suggested in the thesis.
APA, Harvard, Vancouver, ISO, and other styles
41

Juang, Hwey-Mei, and 莊慧美. "An Intelligent Approach for Document Classification." Thesis, 2000. http://ndltd.ncl.edu.tw/handle/15433773092783549017.

Full text
Abstract:
碩士<br>國立屏東科技大學<br>資訊管理系<br>88<br>Abstract Following the on-going advance of Internet technology, we can easily provide information to and retrieve information from the Internet. However, the problem of information overload has to be overcome. One of the central issue to be addressed for the infromation overloasd problem is document classification. In this paper, we present an evolutionary approach to automatically categorize documents into appropriate categories. Our approach deals with different categories of documents separately: it evolves a numerical list that consists of the corresponding weights of the feature words for each class of documents. The experimental results show that our approach can easily evolve the classifiers of numerical lists, and the evolved classifiers perform better than the one constructed by the traditional approach.
APA, Harvard, Vancouver, ISO, and other styles
42

Fang, Shih-Yuan, and 方士元. "Document Classification based on Fuzzy AdaBoost.MH." Thesis, 2011. http://ndltd.ncl.edu.tw/handle/76405968329402191950.

Full text
Abstract:
碩士<br>國立交通大學<br>資訊科學與工程研究所<br>99<br>In this paper, we propose a fuzzy AdaBoost.MH algorithm and apply fuzzy AdaBoost.MH to document classification domain. The main idea of boosting is to generate many, relatively weak hypotheses and to combine these weak hypotheses into a single highly accurate classifier. In rule design, we employ decision stump rule as the basic discriminative function and each rule is correspondent to a weak hypothesis. In system design, we employ term frequency as filtering criterion to construct a rule pool. On each round, the best fuzzy rule can be selected from the pool using AdaBoost framework. Meanwhile, we propose a fuzzy number representation to represent each rule’s confidence. These fuzzy rules with confidence information are the bases of classification inference. When the training phase is completed, the final fuzzy classification result can be obtained from the inference result with a degree transformation process. The experimental results show that fuzzy AdaBoost.MH works very well in three data corpora.
APA, Harvard, Vancouver, ISO, and other styles
43

Kao, Chih-Chiang, and 高志強. "A Study of Combining Automatic Document Classification-Example on Patent Document." Thesis, 2004. http://ndltd.ncl.edu.tw/handle/73102024458255889991.

Full text
Abstract:
碩士<br>中原大學<br>資訊管理研究所<br>92<br>Because of the importance of intellectual property rights, the patent causes keen competiton among enterprises. For enterprises to use the patent to get the advantage of competition has become more importment. In the past, using Automatic Document Categorization can help the patent engineers to classify patent document more effectively. Howerer, there are many kinds of document classifiers, and each of them has its characteristics. Every classifier has different performance in different situations. The performace of classification is unstable. There are many classifiers, and every one has its characteristics. However, when should we use which classifier doesn’t have a final conclusion. In this article, we try to use Na��ve Bayes, KNN, and Rocchio classifiers with voting measure and the Sampling method to solve the unstable performance of classification. In the experiment of this article, the result shows that using voting and the Sampling method are effective. Voting measure and the Sampling method can make the performance of classification more stable. But voting measure and the Sampling method suit different cases. When the performance of each classifier is closer, the improvement of voting measure will be greater. On the other hand, the Sampling method is good choice To improve the unstable performance of classification can make the Automatic Document Categorization technology more useful for the patent engineers. The engineers can have more time to do more advanced analysis.
APA, Harvard, Vancouver, ISO, and other styles
44

Wang, Shsy-Ching, and 王世卿. "Using Text Mining Technology to Construct the Automatic Document Classification System of Electronic Documents." Thesis, 2017. http://ndltd.ncl.edu.tw/handle/qkmd34.

Full text
Abstract:
碩士<br>國立屏東科技大學<br>資訊管理系所<br>105<br>The electronic documents processing in the government is a critical issue. The Electronic Document Exchange system is used to exchange electronic documents among different departments, however, the assignments including dispatching, digitizing, proofreading, issuing and archiving documents, are accomplished by human. In this research, we applied text mining technologies to develop an automatic document classification system. There were several related technologies are used in the system, including Chinese Word Segmentation, Keyword Analysis, Document Exploration, Machine Learning and Vector Space Model. The historical data was used to create a dictionary of Documents Domain and an Automatic Categorization System (ACS). We also conducted the experiments to evaluate its performance of classification. The experiment results show that the Text Mining techniques can be applied to ACS effectively. In the other, the ACS system is developed based on four machine learning algorithms, including Naïve Bayes classifiers, Support Vector Machine, Latent Dirichlet Allocation, and FastText, and we also compared the performance among different algorithm. The results show that fastText, proposed by facebook company, has the rapidest learning rate and best classification results.
APA, Harvard, Vancouver, ISO, and other styles
45

Li-Chun, Sung. "Progressive Analysis Scheme for Web Document Classification." 2007. http://www.cetd.com.tw/ec/thesisdetail.aspx?etdun=U0002-1501200716224000.

Full text
APA, Harvard, Vancouver, ISO, and other styles
46

Cheng, Pei-Chi, and 鄭佩琪. "Domain-space Weighting Scheme for Document Classification." Thesis, 2004. http://ndltd.ncl.edu.tw/handle/32268569573852256250.

Full text
Abstract:
碩士<br>國立交通大學<br>資訊科學系所<br>92<br>As evolving and available of digital documents, automatic document classification (a.k.a. document categorization) has become more and more important for managing and discovering useful information for users. Many typical classification approaches, such as C4.5, SVM, Naïve Bayesian and so on, have been applied to develop a classifier. However, most of them are batch-based mining approaches, which cannot resolve the category adaptation problem; and referring to the document representation problem, the representations are usually in term-space, which may result in lots of less representative dimensions such that the efficiency and effectiveness are decreased. In this thesis, we propose a domain-space weighting scheme to represent documents in domain-space and incrementally construct a classifier to resolve both document representation and category adaptation problems. The proposed scheme consists of three major phases: Training Phase, Discrimination Phase and Tuning Phase. In the Training Phase, the scheme first incrementally extracts and weights features from each individual category, and then integrates the results into the feature-domain association weighting table which is used to maintain the association weight between each feature and all involved categories. Then in the Discrimination Phase, it diminishes feature weights with lower discriminating powers. A classifier can be therefore constructed according to the feature-domain association weighting table. Finally, the Tuning Phase is optional to strengthen the classifier by the feedback information of tuning documents. Experiments over the standard Reuters-21578 benchmark based on the “ModApte” split version are carried out and the experimental results show that with enough training documents the classifier constructed by our proposed scheme is rather effective and it is getting stronger by the Tuning Phase.
APA, Harvard, Vancouver, ISO, and other styles
47

"Incremental document clustering for web page classification." 2000. http://library.cuhk.edu.hk/record=b5890417.

Full text
Abstract:
by Wong, Wai-Chiu. Thesis (M.Phil.)--Chinese University of Hong Kong, 2000. Includes bibliographical references (leaves 89-94). Abstracts in English and Chinese. The record contains no abstract, only a table of contents; the chapters are: 1 Introduction (Document Clustering; DC-tree; Feature Extraction), 2 Related Work (Clustering Algorithms; Document Classification by Examples; Document Clustering; Projections for Efficient Document Clustering), 3 Background (Document Preprocessing; Problem Modeling; Feature Selection Scheme; Similarity Model; Evaluation Techniques), 4 Feature Extraction and Weighting, 5 Web Document Clustering Using DC-tree, and 6 Conclusion.
APA, Harvard, Vancouver, ISO, and other styles
48

Sung, Li-Chun, and 宋立群. "Progressive Analysis Scheme for Web Document Classification." Thesis, 2007. http://ndltd.ncl.edu.tw/handle/31463661649405385997.

Full text
Abstract:
博士<br>淡江大學<br>資訊工程學系博士班<br>95<br>In this thesis, we propose a web document classification scheme, called the Progressive Analysis Scheme (PAS), whose classification performance is improved by just analyzing few key parts sufficient for category confirmation. Based on the DOM tag-tree structure, a web document can be segmented into small tag-regions. Each tag-region is visualized by a visual type which corresponds to a specific nested combination of tag-pairs. Under observation, the profitabilities of tag-regions for classification will vary among visual types caused by the web authoring convention. In addition, in a document, the profitabilities of tag-regions of a visual type may also vary caused by the document writing knacks. In the thesis, for each visual type, we model the two kinds of profitability variations as the profitability tendencies and the tendency transition patterns based on the Expectation Maximization scheme and the Hidden Markov Model scheme. For classification, we integrate them into a profitability forecasting strategy further. Based on the forecasting strategies, we will forecast the potential profitabilities of unanalyzed tag-regions and extract continuously the most profitable unanalyzed tag-regions for classification until category confirmation. Dynamically, the forecasting strategies will be optimized for the document by feeding back the actual profitabilities of analyzed tag-regions to them. Thus, the profitabilities of next tag-regions can be forecasted more accurately. In addition, for each unreliable model generated by a sparse set of training samples, we propose a solution which is to support its forecasting process by the strategies of other similar visual types. Through simulations, the results will show that PAS has better classification performance than the previous approaches, such as the full-text (e.g. SVM) and sequential classifiers.
APA, Harvard, Vancouver, ISO, and other styles
49

Ho, Chun-Yuan, and 何俊元. "Multi-Label Classification for Emotion Document Retrieval." Thesis, 2011. http://ndltd.ncl.edu.tw/handle/63598608180061148660.

Full text
Abstract:
碩士<br>元智大學<br>資訊管理學系<br>99<br>The aim of Information Retrieval (IR) is to retrieve a set of documents relevant to users’ queries from a database. This thesis builds a retrieval model using the query-by-example scheme. The document database used herein is a mental health website, PsychPark. Since each document in PsychPark has been annotated with emotion labels (topics), we use the independent component analysis (ICA) for multi-label classification. The identified labels are then combined with the BM25 retrieval model to calculate the similarity between users’ queries and documents. The experimental results show that the use of ICA can identify the features of different labels to improve the performance of multi-label document classification. Additionally, incorporating the label information can further improve the precision of information retrieval.
APA, Harvard, Vancouver, ISO, and other styles
50

Lin, Hun-Ching, and 林紘靖. "Automatic Document Classification UsingFuzzy Formal Concept Analysis." Thesis, 2009. http://ndltd.ncl.edu.tw/handle/12015487957874690408.

Full text
Abstract:
碩士<br>國立成功大學<br>資訊管理研究所<br>97<br>As computer becomes popular, the internet developes and the coming of the age of knowledge, the numerious of digital documents increases faster. There are always a huge deal of search resoult when we use search engine on the internet, and it becomes more and more difficult to find specified document from databases. Hense, people starts to find the way to find required documents from a huge database. Thus, automatical categorization of documennts becomes an important issue in managing document datas. In recent years, more and more research uses formal concept analysis(FCA) on information retrieval. However, classical formal concept analysis present the fuzzy information of document categorization (Tho et al., 2006), some research thus combines fuzzy theory with FCA to fuzzy FCA (Burusco and Fuentes-Gonzales, 1994). The researches of FCA then become more and more. This proposed research is trying to analysis documents with information retrieval technology to find the most important keywords of the specified dataset, then give fuzzy membership degree and then categorize the documents with fuzzy FCA. In this research, the categorization is computed with the concept lattice produced from the FCA process to find an application of the concept lattice besides presenting the domain knowledge. We hope this to be helpful to the researches of document categorization using FCA. The result shows that the categorization using concept lattice combining with fuzzy logic is precise. And the result is steady for all categories.
APA, Harvard, Vancouver, ISO, and other styles