Dissertations / Theses on the topic 'Automated Text Categorization'
Wirantono, Marcel. "Automated text categorization with collaboratively tagged data." Thesis, University of Ottawa (Canada), 2009. http://hdl.handle.net/10393/28116.
Eramo, Mark D., and Christopher M. Sutter. "Automated psychological categorization via linguistic processing system." Monterey, Calif. : Springfield, Va. : Naval Postgraduate School ; Available from National Technical Information Service, 2004. http://library.nps.navy.mil/uhtbin/hyperion/04Sep%5FEramo.pdf.
Thesis advisor(s): Raymond Buettner, Magdi Kamel. Includes bibliographical references (p. 115-122). Also available online.
Sutter, Christopher M., and Mark D. Eramo. "Automated psychological categorization via linguistic processing system." Thesis, Monterey, California. Naval Postgraduate School, 2004. http://hdl.handle.net/10945/1439.
Influencing one's adversary has always been an objective in warfare. However, to date the majority of influence operations have been geared toward the masses or to very small numbers of individuals. Although marginally effective, this approach is inadequate with respect to larger numbers of high value targets and to specific subsets of the population. Limited human resources have prevented a more tailored approach, which would focus on segmentation, because individual targeting demands significant time from psychological analysts. This research examined whether or not Information Technology (IT) tools, specializing in text mining, are robust enough to automate the categorization/segmentation of individual profiles for the purpose of psychological operations (PSYOP). Research indicated that only a handful of software applications claimed to provide adequate functionality to perform these tasks. Text mining via neural networks was determined to be the best approach given the constraints of the profile data and the desired output. Five software applications were tested and evaluated for their ability to reproduce the results of a social psychologist. Through statistical analysis, it was concluded that the tested applications are not currently mature enough to produce accurate results that would enable automated segmentation of individual profiles based on supervised linguistic processing.
Captain, United States Marine Corps
Lieutenant, United States Navy
Soares, Fabio de Azevedo. "Automatic Text Categorization Based on Text Mining." Pontifícia Universidade Católica do Rio de Janeiro, 2013. http://www.maxwell.vrac.puc-rio.br/Busca_etds.php?strSecao=resultado&nrSeq=23213@1.
Conselho Nacional de Desenvolvimento Científico e Tecnológico
Text Categorization, one of the tasks performed in Text Mining, can be described as obtaining a function that assigns a document to the previously defined category to which it belongs. The main goal of building a taxonomy of documents is to make relevant information easier to obtain. However, implementing and running a Text Categorization process is not a trivial task: Text Mining tools are still maturing and require considerable technical expertise to use. In addition, the language in which the documents are written plays a major role in a Text Mining process and must be handled according to its particularities, yet there is a great lack of tools that handle Brazilian Portuguese adequately. The main aims of this work are therefore to research, propose, implement, and evaluate a Text Mining framework for Automatic Text Categorization that assists the knowledge discovery process and provides linguistic processing for Brazilian Portuguese.
Hall, Scott R. "Automatic text categorization applied to E-mail." Thesis, Monterey, Calif. : Springfield, Va. : Naval Postgraduate School ; Available from National Technical Information Service, 2002. http://library.nps.navy.mil/uhtbin/hyperion-image/02sep%5FHall.pdf.
Demirtas, Kezban. "Automatic Video Categorization And Summarization." Master's thesis, METU, 2009. http://etd.lib.metu.edu.tr/upload/3/12611113/index.pdf.
Eklund, Johan. "With or without context : Automatic text categorization using semantic kernels." Doctoral thesis, Högskolan i Borås, Akademin för bibliotek, information, pedagogik och IT, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-8949.
Borggren, Lukas. "Automatic Categorization of News Articles With Contextualized Language Models." Thesis, Linköpings universitet, Artificiell intelligens och integrerade datorsystem, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-177004.
Zhang, Xueying. "Rough set theory based automatic text categorization and the handling of semantic heterogeneity." Bonn Informationszentrum Sozialwiss, 2006. http://deposit.ddb.de/cgi-bin/dokserv?id=2704442&prov=M&dok_var=1&dok_ext=htm.
Pereira, Dennis V. "Automatic Lexicon Generation for Unsupervised Part-of-Speech Tagging Using Only Unannotated Text." Thesis, Virginia Tech, 1999. http://hdl.handle.net/10919/10094.
Master of Science
Chung, EunKyung. "A Framework of Automatic Subject Term Assignment: An Indexing Conception-Based Approach." Thesis, University of North Texas, 2006. https://digital.library.unt.edu/ark:/67531/metadc5473/.
Maguluri, Naga Sai Nikhil. "Multi-Class Classification of Textual Data: Detection and Mitigation of Cheating in Massively Multiplayer Online Role Playing Games." Wright State University / OhioLINK, 2017. http://rave.ohiolink.edu/etdc/view?acc_num=wright1494248022049882.
Faulstich, Lukas C., Peter F. Stadler, Caroline Thurner, and Christina Witwer. "litsift: Automated Text Categorization in Bibliographic Search." 2003. https://ul.qucosa.de/id/qucosa%3A32597.
Wei, Yuan-Gu, and 魏源谷. "A Study of Multiple Classifier Systems in Automated Text Categorization." Thesis, 2002. http://ndltd.ncl.edu.tw/handle/58157330409643777309.
國立中正大學
資訊工程研究所
90
Automatic text categorization, defined as the task of assigning predefined class (category) labels to text documents, is one of the main techniques for organizing and locating information in huge text collections such as those on the Internet. Many approaches, such as linear classifiers, decision trees, Bayesian methods, neural networks, and support vector machines, have been extensively studied and used to implement classifier systems for text categorization and web page classification. Although much effort has been spent on each of these methods, we are reaching the limit of further performance improvement. Multiple classifier systems, which aim to combine the strengths of individual classifiers to improve overall performance, have been widely studied recently. In this thesis, we study the development of multiple classifier systems for automated text categorization. We investigate and propose various approaches to fundamental issues such as classifier combination, classifier subset selection, and static and dynamic classifier selection. We use these ideas to develop efficient combination-based and selection-based multiple classifier systems. Experiments show that our approaches significantly improve on the classification accuracy of individual classifiers for web page collections from web portals. In addition, we propose a cascaded class reduction method in which a sequence of classifiers is cascaded to successively reduce the set of possible classes. We show that by cascading Naive Bayes and SVMs, we can improve the classification accuracy of SVMs while reducing their running time.
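The idea of combining several classifiers can be illustrated with plain majority voting. This is only a minimal sketch of one well-known combination rule, not the combination or selection methods developed in the thesis; the function `majority_vote` and the three toy classifiers are hypothetical names introduced here for illustration.

```python
from collections import Counter


def majority_vote(classifiers, document):
    """Return the label predicted by the largest number of classifiers.

    Ties are broken in favor of the label that was voted for first.
    """
    votes = [classify(document) for classify in classifiers]
    return Counter(votes).most_common(1)[0][0]


# Hypothetical toy classifiers, each looking at a different cue.
def by_keyword(document):
    return "sports" if "match" in document else "finance"


def by_length(document):
    return "sports" if len(document.split()) < 6 else "finance"


def by_digits(document):
    return "finance" if any(ch.isdigit() for ch in document) else "sports"


ensemble = [by_keyword, by_length, by_digits]
label = majority_vote(ensemble, "the match drew 50000 fans")
```

In the last call two of the three toy classifiers vote for the same label, so the ensemble overrides the single dissenting vote.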
Silva, Sara Alexandra Teixeira da. "Automatization of incident categorization." Master's thesis, 2018. http://hdl.handle.net/10071/17585.
To keep pace with the growing number of incidents created in an organization's day-to-day operations, more resources have been needed to ensure that all incidents are managed. Incident management comprises several activities, one of which is incident categorization. By combining Natural Language and Text Processing techniques with Machine Learning algorithms, we propose to improve this activity, and specifically the Incident Management Process. To that end, we propose replacing the manual Categorization subprocess of the Incident Management Process with an automated subprocess that requires no human interaction. The dissertation aims to propose a solution that categorizes incidents correctly and automatically. For this purpose we have real data from an organization which, for privacy reasons, is not named in the dissertation. The datasets consist of correctly categorized incidents, which leads us to apply supervised learning algorithms. The intended final result is a method developed by combining the different Natural Language techniques with the best-performing algorithms for classifying the data. Finally, the proposed method is evaluated against the categorization currently performed, to determine whether our proposal actually improves the Incident Management Process and what advantages automation brings.
Hsu, Ya-Fen, and 許雅芬. "Automatic Text Categorization on News." Thesis, 2002. http://ndltd.ncl.edu.tw/handle/91208800987400778267.
東吳大學
資訊科學系
90
Nowadays, people are eager to get new information, but they cannot easily and efficiently find what they want among such huge amounts of data. Documents therefore have to be classified so that users can search efficiently within the categories the documents belong to. Traditionally, experts read a document and assign specific categories to it. However, this costs a lot of resources and is not economical, so an automatic text classifier is needed to help the classification process. Automatic text categorization is the task of assigning predefined categories to free-text documents. Text classification always involves two important steps: the first is feature selection, and the second is relevance function selection. Here we propose two techniques to improve the precision of classification: using co-occurrence terms, and considering the positions at which bigrams occur. This research also evaluates several other feature selection methods for comparison in the experiments, including single-term features, bigram features, segmentation features, and the positions at which segments occur. The experimental results show that the strategy using co-occurrences as features performed relatively well: compared with using pure bigrams, performance improves by about 15% on average. The experiments also confirm our observation of the texts that bigrams are more representative than single terms. Furthermore, the positions of keywords are positively related to their importance.
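Bigram features and position-sensitive weighting can be sketched as follows. This is a rough illustration under stated assumptions, not the feature scheme of the thesis: the functions `char_bigrams` and `bigram_features` are hypothetical, and the `lead_fraction` and `lead_weight` parameters (extra weight for bigrams near the start of the text) are invented here to show one way position could enter the features.

```python
from collections import Counter


def char_bigrams(text):
    """Return overlapping character bigrams, skipping any that contain whitespace."""
    return [
        text[i:i + 2]
        for i in range(len(text) - 1)
        if not text[i].isspace() and not text[i + 1].isspace()
    ]


def bigram_features(text, lead_fraction=0.2, lead_weight=2.0):
    """Count character bigrams, weighting those that start early in the text.

    lead_fraction and lead_weight are illustrative assumptions: a bigram
    starting within the first lead_fraction of the text counts lead_weight
    times, every other bigram counts once.
    """
    cutoff = len(text) * lead_fraction
    features = Counter()
    for i in range(len(text) - 1):
        pair = text[i:i + 2]
        if pair[0].isspace() or pair[1].isspace():
            continue
        features[pair] += lead_weight if i < cutoff else 1.0
    return features


bigrams = char_bigrams("自動分類")
weighted = bigram_features("abab", lead_fraction=0.5, lead_weight=2.0)
```

Character bigrams are a common choice for Chinese text because they avoid an explicit word segmentation step.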
"Automatic text categorization for information filtering." 1998. http://library.cuhk.edu.hk/record=b5889734.
Thesis (M.Phil.)--Chinese University of Hong Kong, 1998.
Includes bibliographical references (leaves 157-163).
Abstract also in Chinese.
Abstract
Acknowledgment
List of Figures
List of Tables
Chapter 1 --- Introduction
Chapter 1.1 --- Automatic Document Categorization
Chapter 1.2 --- Information Filtering
Chapter 1.3 --- Contributions
Chapter 1.4 --- Organization of the Thesis
Chapter 2 --- Related Work
Chapter 2.1 --- Existing Automatic Document Categorization Approaches
Chapter 2.1.1 --- Rule-Based Approach
Chapter 2.1.2 --- Similarity-Based Approach
Chapter 2.2 --- Existing Information Filtering Approaches
Chapter 2.2.1 --- Information Filtering Systems
Chapter 2.2.2 --- Filtering in TREC
Chapter 3 --- Document Pre-Processing
Chapter 3.1 --- Document Representation
Chapter 3.2 --- Classification Scheme Learning Strategy
Chapter 4 --- A New Approach - IBRI
Chapter 4.1 --- Overview of Our New IBRI Approach
Chapter 4.2 --- The IBRI Representation and Definitions
Chapter 4.3 --- The IBRI Learning Algorithm
Chapter 5 --- IBRI Experiments
Chapter 5.1 --- Experimental Setup
Chapter 5.2 --- Evaluation Metric
Chapter 5.3 --- Results
Chapter 6 --- A New Approach - GIS
Chapter 6.1 --- Motivation of GIS
Chapter 6.2 --- Similarity-Based Learning
Chapter 6.3 --- The Generalized Instance Set Algorithm (GIS)
Chapter 6.4 --- Using GIS Classifiers for Classification
Chapter 6.5 --- Time Complexity
Chapter 7 --- GIS Experiments
Chapter 7.1 --- Experimental Setup
Chapter 7.2 --- Results
Chapter 8 --- A New Information Filtering Approach Based on GIS
Chapter 8.1 --- Information Filtering Systems
Chapter 8.2 --- GIS-Based Information Filtering
Chapter 9 --- Experiments on GIS-based Information Filtering
Chapter 9.1 --- Experimental Setup
Chapter 9.2 --- Results
Chapter 10 --- Conclusions and Future Work
Chapter 10.1 --- Conclusions
Chapter 10.2 --- Future Work
Chapter A --- Sample Documents in the corpora
Chapter B --- Details of Experimental Results of GIS
Chapter C --- Computational Time of Reuters-21578 Experiments
"Training example adaptation for text categorization." 2005. http://library.cuhk.edu.hk/record=b5892711.
Thesis (M.Phil.)--Chinese University of Hong Kong, 2005.
Includes bibliographical references (leaves 68-72).
Abstracts in English and Chinese.
Chapter 1 --- Introduction
Chapter 1.1 --- Background and Motivation
Chapter 1.2 --- Thesis Organization
Chapter 2 --- Related Work
Chapter 2.1 --- Semi-supervised learning
Chapter 2.2 --- Hierarchical Categorization
Chapter 3 --- Framework Overview
Chapter 4 --- Inherent Concept Detection
Chapter 4.1 --- Data Preprocessing
Chapter 4.2 --- Concept Detection Algorithm
Chapter 4.3 --- Kernel-based Distance Measure
Chapter 5 --- Training Example Discovery from Unlabeled Documents
Chapter 5.1 --- Training Document Discovery
Chapter 5.2 --- Automatically determining the number of extracted positive examples
Chapter 5.3 --- Classification Model
Chapter 6 --- Experimental Evaluation
Chapter 6.1 --- Corpus Description
Chapter 6.2 --- Evaluation Metric
Chapter 6.3 --- Result Analysis
Chapter 7 --- Conclusions and Future Work
Bibliography
Chapter A --- Detailed result on the inherent concept detection process for the TDT and RCV1 corpora
"New learning strategies for automatic text categorization." 2001. http://library.cuhk.edu.hk/record=b5890838.
Thesis (M.Phil.)--Chinese University of Hong Kong, 2001.
Includes bibliographical references (leaves 125-130).
Abstracts in English and Chinese.
Chapter 1 --- Introduction
Chapter 1.1 --- Automatic Textual Document Categorization
Chapter 1.2 --- Meta-Learning Approach For Text Categorization
Chapter 1.3 --- Contributions
Chapter 1.4 --- Organization of the Thesis
Chapter 2 --- Related Work
Chapter 2.1 --- Existing Automatic Document Categorization Approaches
Chapter 2.2 --- Existing Meta-Learning Approaches For Information Retrieval
Chapter 2.3 --- Our Meta-Learning Approaches
Chapter 3 --- Document Pre-Processing
Chapter 3.1 --- Document Representation
Chapter 3.2 --- Classification Scheme Learning Strategy
Chapter 4 --- Linear Combination Approach
Chapter 4.1 --- Overview
Chapter 4.2 --- Linear Combination Approach - The Algorithm
Chapter 4.2.1 --- Equal Weighting Strategy
Chapter 4.2.2 --- Weighting Strategy Based On Utility Measure
Chapter 4.2.3 --- Weighting Strategy Based On Document Rank
Chapter 4.3 --- Comparisons of Linear Combination Approach and Existing Meta-Learning Methods
Chapter 4.3.1 --- LC versus Simple Majority Voting
Chapter 4.3.2 --- LC versus BORG
Chapter 4.3.3 --- LC versus Restricted Linear Combination Method
Chapter 5 --- The New Meta-Learning Model - MUDOF
Chapter 5.1 --- Overview
Chapter 5.2 --- Document Feature Characteristics
Chapter 5.3 --- Classification Errors
Chapter 5.4 --- Linear Regression Model
Chapter 5.5 --- The MUDOF Algorithm
Chapter 6 --- Incorporating MUDOF into Linear Combination approach
Chapter 6.1 --- Background
Chapter 6.2 --- Overview of MUDOF2
Chapter 6.3 --- Major Components of the MUDOF2
Chapter 6.4 --- The MUDOF2 Algorithm
Chapter 7 --- Experimental Setup
Chapter 7.1 --- Document Collection
Chapter 7.2 --- Evaluation Metric
Chapter 7.3 --- Component Classification Algorithms
Chapter 7.4 --- Categorical Document Feature Characteristics for MUDOF and MUDOF2
Chapter 8 --- Experimental Results and Analysis
Chapter 8.1 --- Performance of Linear Combination Approach
Chapter 8.2 --- Performance of the MUDOF Approach
Chapter 8.3 --- Performance of MUDOF2 Approach
Chapter 9 --- Conclusions and Future Work
Chapter 9.1 --- Conclusions
Chapter 9.2 --- Future Work
Chapter A --- Details of Experimental Results for Reuters-21578 corpus
Chapter B --- Details of Experimental Results for OHSUMED corpus
Bibliography
Lin, Ching-Han, and 林京翰. "Cascaded Class Reduction for Automatic Text Categorization." Thesis, 2003. http://ndltd.ncl.edu.tw/handle/09069339792107773438.
國立中正大學
資訊工程研究所
91
The task of text categorization is the classification of natural text (or hypertext) documents into a fixed number of predefined categories. This problem arises in a number of different areas, including email filtering, web searching, office automation, sorting documents by topic, and classification of news agency stories. Some approaches, such as k-nearest neighbor and support vector machines, achieve outstanding performance but suffer from long classification times when the number of predefined categories is very large. In this thesis, we investigate and propose a cascaded class reduction method in which a sequence of classifiers is cascaded to successively reduce the set of possible classes. We show that by cascading simple classifiers with SVM or kNN, we can improve the classification accuracy while reducing the classification time.
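The cascade described above can be sketched in two stages: a cheap scorer shortlists a few candidate classes, and a costlier scorer chooses among only those. This is a minimal sketch under assumptions, not the thesis's method: `cascade_classify`, the `keep` parameter, the `KEYWORDS` table, and both scoring functions are hypothetical stand-ins for the simple classifier and the SVM or kNN stage.

```python
def cascade_classify(document, classes, cheap_score, costly_score, keep=3):
    """Two-stage cascade.

    Stage 1 keeps the `keep` highest-scoring classes under cheap_score.
    Stage 2 returns the best of those classes under costly_score.
    """
    shortlist = sorted(
        classes, key=lambda label: cheap_score(document, label), reverse=True
    )[:keep]
    return max(shortlist, key=lambda label: costly_score(document, label))


# Hypothetical keyword lists standing in for trained models.
KEYWORDS = {
    "sports": {"match", "team", "goal"},
    "finance": {"stock", "bank", "market"},
    "politics": {"vote", "election", "party"},
    "travel": {"flight", "hotel", "team"},
}


def overlap(document, label):
    """Cheap stand-in: number of distinct class keywords present in the document."""
    return len(KEYWORDS[label] & set(document.lower().split()))


def overlap_ratio(document, label):
    """Costlier stand-in: keyword occurrences normalized by document length."""
    words = document.lower().split()
    return sum(word in KEYWORDS[label] for word in words) / max(len(words), 1)


doc = "the team scored a late goal in the match"
predicted = cascade_classify(doc, list(KEYWORDS), overlap, overlap_ratio, keep=2)
```

The second stage only evaluates `keep` classes instead of all of them, which is where the time saving would come from when the costly scorer dominates the running time.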
Yang, Cheng-Han, and 楊承翰. "Automatic Text Categorization Model Based on Genetic Algorithm." Thesis, 2013. http://ndltd.ncl.edu.tw/handle/z9j2cq.
國立中正大學
資訊管理學系暨研究所
101
The rapid accumulation of large amounts of digital information makes searching for information more difficult, so managing documents effectively has become an important task, and Text Categorization (TC) research is growing in importance. The majority of TC studies focus on finding the single best classifier, the one with the highest accuracy among different classifiers, to serve as the TC model. However, an individual classifier often gives good results only on data that suits it. Our research therefore attempts to integrate various individual classifiers into an ensemble to improve classification performance, combining the opinions of different experts (classifiers) to make a decision. This addresses the problem that an individual classifier fits only particular document datasets. TC is also likely to be confronted with an excessive number of document feature dimensions. We therefore use a Genetic Algorithm (GA) to optimize classifier training, so that the classifiers have diverse features, are mutually independent, and predict well, which further enhances overall classification performance. We propose two GA encoding methods: (1) Selection of Disjoint Feature Subsets (SDFS), in which each feature can be used to train only one kind of classifier; and (2) Selection of Possibly Overlapping Feature Subsets (SPOFS), in which each feature can be used to train more than one kind of classifier. In the experimental evaluation, we use a real-world data set from the Reuters-21578 news article collection with the Modified Apte Split. The experimental results show that our method improves document classification accuracy for both individual classifiers and the ensemble, and that the ensemble document classification model gives good and stable classification results.
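The difference between disjoint and possibly overlapping feature subsets can be made concrete by showing how a chromosome might be decoded into per-classifier feature lists. The abstract does not give the actual encodings, so both layouts below are assumptions introduced for illustration, as are the function names `decode_sdfs` and `decode_spofs`.

```python
def decode_sdfs(chromosome, n_classifiers):
    """Decode an assumed SDFS-style chromosome with one integer gene per feature.

    Gene value 0 means the feature is unused; value j (1..n_classifiers) assigns
    the feature to classifier j-1 only, so the subsets are disjoint by construction.
    """
    subsets = [[] for _ in range(n_classifiers)]
    for feature, gene in enumerate(chromosome):
        if gene > 0:
            subsets[gene - 1].append(feature)
    return subsets


def decode_spofs(chromosome, n_classifiers):
    """Decode an assumed SPOFS-style chromosome with one bit per (feature, classifier).

    Bit f * n_classifiers + j says whether feature f is given to classifier j,
    so one feature may appear in several subsets.
    """
    n_features = len(chromosome) // n_classifiers
    return [
        [f for f in range(n_features) if chromosome[f * n_classifiers + j]]
        for j in range(n_classifiers)
    ]


disjoint = decode_sdfs([1, 0, 2, 1], n_classifiers=2)
overlapping = decode_spofs([1, 1, 0, 1, 0, 0], n_classifiers=2)
```

A genetic algorithm would then evaluate each decoded assignment by training the classifiers on their subsets and scoring the resulting ensemble; that fitness step is omitted here.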
Ying, Jia-Ching, and 英家慶. "Automatic Chinese Text Categorization Using N-gram Model." Thesis, 2007. http://ndltd.ncl.edu.tw/handle/p5k937.
銘傳大學
資訊傳播工程學系碩士班
95
Chinese text classification is an important and well-known technique in the field of machine learning. However, most applications avoid the problem of word segmentation and ignore the relationships between words. It is important to model a suitable classifier for Chinese text classification. In this paper, we propose an N-gram-based language model for Chinese text categorization that considers the relationships between words. To handle out-of-vocabulary terms, we also propose a novel smoothing approach based on logistic regression to improve accuracy. The experimental results show that our approach outperforms the former N-gram-based classification model by more than 11% in micro-averaged F-measure.
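An N-gram language model classifier can be sketched by training one character-bigram model per class and choosing the class whose model gives the text the highest probability. The sketch below uses add-one (Laplace) smoothing as a simple stand-in; it is not the logistic-regression smoothing proposed in the thesis, and the class name `BigramLMClassifier` is hypothetical. Class priors are assumed uniform.

```python
import math
from collections import Counter


class BigramLMClassifier:
    """One character-bigram language model per class, with add-one smoothing."""

    def __init__(self):
        self.bigrams = {}   # label -> Counter of (previous, current) pairs
        self.contexts = {}  # label -> Counter of previous characters
        self.vocab = set()  # all characters seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            pair_counts = self.bigrams.setdefault(label, Counter())
            context_counts = self.contexts.setdefault(label, Counter())
            self.vocab.update(text)
            for previous, current in zip(text, text[1:]):
                pair_counts[(previous, current)] += 1
                context_counts[previous] += 1
        return self

    def log_prob(self, text, label):
        """Sum of smoothed log P(current | previous) over the text's bigrams."""
        size = len(self.vocab) + 1  # one extra slot for unseen characters
        pair_counts = self.bigrams[label]
        context_counts = self.contexts[label]
        return sum(
            math.log((pair_counts[(p, c)] + 1) / (context_counts[p] + size))
            for p, c in zip(text, text[1:])
        )

    def predict(self, text):
        # Uniform class priors are assumed, so only the likelihood is compared.
        return max(self.bigrams, key=lambda label: self.log_prob(text, label))


model = BigramLMClassifier().fit(["aaaa aaab", "bbbb bbba"], ["A", "B"])
```

Smoothing matters because any bigram unseen in a class's training text would otherwise get probability zero and rule that class out entirely.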
Ko-Li, Kan, and 甘可立. "Effectiveness Issues in Keyword Extraction and Automatic Text Categorization." Thesis, 2006. http://ndltd.ncl.edu.tw/handle/06898275923006333937.
中華大學
資訊工程學系碩士班
94
Presently, automatic text categorization is primarily based on extracting keywords from documents. Keyword extraction is also a basic and core technology for document analysis. Today, keyword extraction mostly depends on the judgement of professional researchers, which wastes time and manpower. It is therefore important to employ automatic keyword extraction methods in text categorization. In this thesis, we propose four keyword extraction methods to improve the efficiency and accuracy of automatic text categorization: (1) Content-reduction text categorization: the abstract of an article is retrieved before keyword extraction; (2) Hierarchical text categorization: keywords are extracted according to the taxonomy hierarchy; (3) PN pruning: redundant keywords are pruned to retain the important keywords; (4) TFxR keyword weighting: categorization accuracy is increased by calculating keyword weights. We evaluated the new methods by the efficiency of both keyword extraction and text categorization, as well as by the accuracy of text categorization. The experimental results show that our new approaches improve both the efficiency of keyword extraction and the accuracy of text categorization. Furthermore, our new methods demonstrate large savings in manpower and time, especially when applied to the knowledge management systems of some industries.
Wang, Jing-Doo, and 王經篤. "Design and Evaluation of Approaches for Automatic Chinese Text Categorization." Thesis, 2002. http://ndltd.ncl.edu.tw/handle/16980764430249360032.
國立中正大學
資訊工程研究所
90
In recent years, we have seen a tremendous growth in the number of online text documents available on the Internet, in digital libraries, and in news sources. Locating information effectively in these huge resources is difficult without good indexing and organization of text collections. Automatic text categorization, defined as the task of assigning predefined class (category) labels to free-text documents, is one of the main techniques for organizing and locating information in these huge collections. Many approaches to text categorization and web page classification have been proposed. Most of them have been evaluated using English texts; evaluation using texts in Chinese and other oriental languages has been limited. This dissertation proposes and evaluates approaches for categorizing Chinese texts, consisting of term extraction, term selection, term clustering, and text classification. For term extraction, we propose an I/O-efficient approach that uses frequency counts to identify the left and right boundaries of possibly significant terms. We then perform term selection and term clustering to reduce the dimension of the term space to a practical level without losing classification accuracy. We study and compare the performance of three well-known classifiers, namely the linear classifier, the naive Bayes probabilistic classifier, and the k-Nearest Neighbors (kNN) classifier, when they are applied to categorize Chinese texts. Overall, kNN achieves the best accuracy but requires a large amount of computation time and memory to classify new texts. The linear classifier is very time and memory efficient in practical implementation, but its accuracy is slightly worse than that of kNN. To compensate for the potential weakness of the linear classifier, which computes one representative for each class, we increase the number of representatives for each class.
Experimental results show that this approach improved the linear classifier and achieved micro-averaged accuracy similar to that of kNN, with much less classification time. Furthermore, we suggest a way to reorganize the structure of classes when identifying new representatives for the linear classifier. Because our term extraction approach scales to large text collections derived from chronologically ordered Chinese news, we can mine for periodic events via the term frequency distribution of significant terms over time. Note that chronologically ordered news articles about regular events such as annual festivals, ceremonies, games, and customs are appealing to a foreigner who wants a deep understanding of an unfamiliar country, and are useful to an observer who wants to review news after a long period.
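A linear classifier that keeps one representative per class can be sketched as a centroid classifier: each class is represented by the average term-frequency vector of its training documents, and a new document goes to the class with the most similar centroid. This is a minimal sketch, assuming whitespace tokenization and cosine similarity; the function names and the toy English data are hypothetical and not taken from the dissertation, which works on Chinese text and extends the scheme to several representatives per class.

```python
import math
from collections import Counter, defaultdict


def tf_vector(text):
    """Term-frequency vector from whitespace tokens (an illustrative simplification)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dict-like objects."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def train_centroids(texts, labels):
    """Average the term-frequency vectors of each class into one representative."""
    sums = defaultdict(Counter)
    counts = Counter(labels)
    for text, label in zip(texts, labels):
        sums[label].update(tf_vector(text))
    return {
        label: {term: weight / counts[label] for term, weight in vector.items()}
        for label, vector in sums.items()
    }


def classify(text, centroids):
    """Assign the class whose centroid is most similar to the document."""
    vector = tf_vector(text)
    return max(centroids, key=lambda label: cosine(vector, centroids[label]))


train_texts = ["stock market bank", "bank stock rises", "team wins match", "match goal team"]
train_labels = ["finance", "finance", "sports", "sports"]
centroids = train_centroids(train_texts, train_labels)
```

Classification costs one similarity computation per class rather than one per training document, which is why this kind of classifier is much cheaper than kNN at prediction time.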
Lin, Jeng-Wei, and 林政緯. "A Study on Automatic Text Categorization And Its Performance Evaluation." Thesis, 2001. http://ndltd.ncl.edu.tw/handle/13076597873081176063.
輔仁大學
圖書資訊學系
90
This study uses computers to classify documents automatically and evaluates the efficiency of each methodology and of the study itself. From the results produced by these classification systems, we learn which factors may affect efficiency. The system in this thesis trains the classification module on documents that have already been classified into many categories in order to improve the rate of correct classification. We use the ApteMod version of Reuters-21578, which was obtained by eliminating unlabelled documents and selecting the categories that have at least one document in both the training set and the test set. This process resulted in 90 categories in both the training and test sets. After eliminating documents that do not belong to any of these 90 categories, we obtained a training set of 7769 documents and a test set of 3019 documents. In the thesis, we discuss linear functions in IR, the Rocchio algorithm, and k-Nearest Neighbor (kNN), and we also investigate methodologies including the Vector Space Model and the kNN classifier. Based on these concepts, we run several experiments. Finally, we compare the results with data from the references and evaluate the efficiency of the study.
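A kNN classifier in the vector space model can be sketched as follows: documents become term-frequency vectors, the k training documents most similar to the query are retrieved, and their labels vote with weights equal to their similarity. This is a minimal sketch assuming whitespace tokenization, raw term frequencies, and cosine similarity; `knn_classify` and the toy training set are hypothetical and do not reproduce the thesis's Reuters-21578 setup.

```python
import math
from collections import Counter


def tf_vector(text):
    """Term-frequency vector from whitespace tokens (an illustrative simplification)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_classify(text, training, k=3):
    """Similarity-weighted vote among the k nearest training documents.

    training is a list of (document_text, label) pairs.
    """
    query = tf_vector(text)
    scored = [(cosine(query, tf_vector(doc)), label) for doc, label in training]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = Counter()
    for similarity, label in scored[:k]:
        votes[label] += similarity
    return votes.most_common(1)[0][0]


training = [
    ("stock market bank", "finance"),
    ("bank stock rises", "finance"),
    ("team wins match", "sports"),
    ("match goal team", "sports"),
]
prediction = knn_classify("team match today", training, k=3)
```

Every prediction compares the query against the whole training set, which is the source of the high classification cost usually reported for kNN.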
Li, Po-Yi, and 李柏毅. "Automatic Text Categorization of Chinese Document Using Support Vector Machine Techniques." Thesis, 2004. http://ndltd.ncl.edu.tw/handle/65318851471594812159.
國立高雄應用科技大學
電機工程系碩士班
92
“Automatic text categorization” uses machine learning techniques to classify heterogeneous texts through an implemented classification system. The theory of the Support Vector Machine (SVM) was built on statistical learning, neural networks, and optimization techniques. The major features of SVM are (1) the capacity to deal with both linear and non-linear problems, and (2) no limit on the total number of data items tested (data size). As a result, the SVM algorithm offers an effective solution to the difficulties of text categorization at large data scale. This research is based mainly on the SVM learning algorithm and proposes a feature selection strategy for classifying Chinese documents. Across several experimental settings, we discuss the differences among several feature selection strategies and verify their impact on the performance of SVM-based classification tasks. Based on this analysis, we chose one strategy for our implementation of the classification system and combined different kernel functions with various parameters in the SVM algorithm to set up the document categorization experiments. Our experimental results indicate that the SVM algorithm can produce satisfactory document classification performance with the chosen feature selection strategy. We also demonstrate that, with only 500 dimensions, our system achieves outstanding categorization accuracy. Finally, we conducted several experiments comparing neural network and kNN classifiers with our SVM classifier for document categorization; the SVM classifier outperforms the others.
Wu, Chia-Chuan, and 吳家銓. "A Study of Automatic Text Categorization based on Directional Term Structure." Thesis, 2010. http://ndltd.ncl.edu.tw/handle/25870222133144392967.
國立中興大學
資訊管理學系所
98
In mainstream research on automatic text categorization, rule-based classifiers offer an interesting advantage: interpretability. Because the rules in the classifier are composed of terms, they can be easily understood, modified and maintained by humans. Their accuracy is also competitive with other accurate classifiers such as SVM and Bayes nets, so rule-based classification techniques are very popular. Today's rule-based techniques identify valuable patterns in training documents to construct classification rules, but they do not consider the relationship between terms and paragraphs in documents. Since a dominant topic must run through a document's content, it generates certain term structures across paragraphs. Therefore, this study presents a new concept, the Meaningful Inner Link Object (MILO), which finds underlying directional term links across the paragraphs of a document for text categorization. In this study, the MILO process for text categorization consists of four main procedures. First, feature selection finds representative terms from which to compose MILOs, since training documents contain a great many noise terms. Second, MILO filtering: because the number of MILOs can exceed ten thousand, measuring MILO quality is an important issue, and filtering out useless MILOs can improve accuracy. Third, the design of a scoring model: an effective model is needed to accurately assign a category to an unlabeled document. Finally, classification structures: traditional techniques use only one classifier, while this study presents a hierarchical classification structure to improve accuracy. In summary, first, a novel method is presented that observes the distribution of terms across document paragraphs to extract MILOs for text categorization. Second, an improved method is presented that eliminates noise MILOs and uses a hierarchical classification structure. The experimental results of the two methods show competitive performance on well-known benchmarks such as Reuters, WebKB and Ohsumed.
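The abstract does not define how a directional term link is built, so the following is only one possible reading, offered as a rough sketch: a link is an ordered pair of selected terms in which the first term appears in an earlier paragraph and the second in a later one. The function name `directional_term_links`, the whitespace tokenization and the `vocabulary` argument are all hypothetical and are not the thesis's actual MILO definition.

```python
from itertools import product

def directional_term_links(paragraphs, vocabulary):
    """Collect (earlier_term, later_term) pairs across paragraphs.

    Hypothetical reading of a "directional term link"; the thesis's
    real MILO construction is not specified in the abstract.
    """
    # Keep only the pre-selected feature terms found in each paragraph.
    term_sets = [set(p.lower().split()) & vocabulary for p in paragraphs]
    links = set()
    for i in range(len(term_sets)):
        for j in range(i + 1, len(term_sets)):
            # Direction runs from the earlier paragraph to the later one.
            for earlier, later in product(term_sets[i], term_sets[j]):
                if earlier != later:
                    links.add((earlier, later))
    return links
```

Under this reading, the filtering and scoring steps named in the abstract would operate on the resulting set of pairs, for example by discarding links that occur in too few training documents.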
Chen, Chao-long, and 陳朝龍. "A Text Categorization Method Based on Term Distributional Clustering and Automatic Summarization." Thesis, 2008. http://ndltd.ncl.edu.tw/handle/73204557664150148160.
National Pingtung Institute of Commerce (國立屏東商業技術學院)
Department of Information Management (資訊管理系)
96
Owing to the exponential growth of electronic documents, research on automatic summarization has flourished in the last decade. This is evident from the fact that Computational Linguistics (Vol. 28, No. 4, 2002) and Information Processing & Management (Vol. 43, No. 6, 2007) have both published special issues on automatic summarization. However, the summaries generated by these sophisticated methods have little use beyond giving document searchers a glimpse of what a document is about. Therefore, this study proposes using automatic summarization for term selection as a means of dimension reduction. Furthermore, to address the problem that similar concepts with different term representations can degrade classification, we also investigate the effect of term expansion on classification accuracy. In contrast to using term distributional clustering for feature extraction, we propose using it to expand the feature terms. Finally, we compare four data sets with different attributes (including Chinese and English news stories, longer articles such as academic research papers, and short articles such as medical abstracts) and different classification algorithms (KNN, Naive Bayes, and SVM) to assess the feasibility of the proposed method. The results show that text summarization is an effective means of dimension reduction. Classification using summarization-based term selection is more accurate than with the traditional TFIDF and Information Gain term weighting schemes. Term distributional clustering can also be applied to term expansion and further improves classification accuracy, especially when the number of feature terms is small. The proposed method not only reduces the dimensionality of the term vector and selects more representative terms; it also saves computational resources. That is, one need not redo the feature selection process to cope with the task of text categorization. Finally, a by-product of our proposed method is that it can generate indicative summaries of the documents, so readers can easily grasp their concepts when browsing the classification results.
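The term expansion idea in this abstract can be illustrated with a small sketch. It assumes the term clusters have already been computed by some distributional clustering step, which the abstract does not detail; the function name `expand_terms` and the example clusters are hypothetical.

```python
def expand_terms(doc_terms, clusters):
    """Add every member of any cluster that shares a term with the document.

    `clusters` is assumed to be an iterable of sets of related terms,
    produced beforehand by a clustering step not shown here.
    """
    original = set(doc_terms)
    expanded = set(original)
    for cluster in clusters:
        # Match against the original terms only, so one expansion
        # cannot trigger a chain of further expansions.
        if original & cluster:
            expanded |= cluster
    return expanded
```

With few feature terms, a document containing "car" and a training category described by "automobile" would share no features without expansion; after expansion both carry the whole cluster, which is consistent with the abstract's observation that expansion helps most when the feature set is small.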
Su, Jong-Ming, and 蘇中明. "Use the Automatic Text Categorization Technology to Support the Management of the Discussion Portfolio Process." Thesis, 2002. http://ndltd.ncl.edu.tw/handle/11562412456994381992.
Da-Yeh University (大葉大學)
Graduate Institute of Information Management (資訊管理研究所)
90
The subject of this thesis is clustering technology in information retrieval. The simplest approach, free of jargon, is used to find the similarities between articles on this subject, and the technology acceptance model is applied with the aid of automatic text categorization. A web-based education database provides discussion groups for students and links to other sites useful to all students. Moreover, teachers with a heavy workload are assisted: before the school year, the teacher can reorganize the database so that its contents can be easily retrieved, saving time and effort. Furthermore, the analysis of the learning portfolio should focus on the acquisition of knowledge. By recording the progress of each student, we may influence each student's study techniques and, through data mining, increase our knowledge of the learning process itself. The focus of this thesis is therefore using Automatic Text Categorization to provide a richer learning environment for students. Finally, Davis's technology acceptance model is used to validate the approach, with the results shown in Chapter 4; Chapter 5 discusses the differences from Davis's model and looks into the reason, namely the strategy of adding points to students' scores. Key Words: Information Retrieval, Clustering Technology, Technology Acceptance Model, Automatic Text Categorization
Alberts, Inge. "Exploitation des genres de textes pour assister les pratiques textuelles dans les environnements numériques de travail : le cas du courriel chez des cadres et des secrétaires dans une municipalité et une administration fédérale canadiennes." Thèse, 2009. http://hdl.handle.net/1866/2839.
This research reveals how textual genres can be exploited in digital work environments to improve the textual practices of managers and secretaries in the context of a municipality and the Canadian federal government. The first objective of this research assesses the suitability of digital work environments to support the textual practices of managers and secretaries in their reading, writing and manipulation of texts. The second objective describes the various roles of textual genre during the managerial and secretarial textual practices. Using email as a focal point, the third objective examines how genre can be exploited to advance the benefits of textual practices in the digital work environments. This qualitative research entails a two-phase methodology. Through the study of 17 secretaries and 17 managers, the first phase consists of a thorough examination of the current textual practices in the Canadian federal government and municipal contexts and the difficulties encountered during these practices. This phase also considers the various roles of genre in the digital work environments along with the salient clues sought during email management. This study deployed three data collection techniques: semi-structured interviews, diary journals and cognitive inquiries. The results are examined using several qualitative content analysis techniques. The second phase of this research consists of developing an email processing sequence to further expand our understanding of textual genre and its exploitation in the design of digital work environments. The data for this phase come from a corpus of 1703 messages developed from a sample of two governmental managers. The results provide an encompassing overview of practices relating to the reading, writing and manipulation of texts that are both common and specific to managers and secretaries. With over 40% of events recorded in the diary journal relating to email, the importance of this type of system in digital work environments is clearly emphasized. The difficulties encountered in the digital work environments are also described. The role of genre during textual practices is examined according to a matrix illustrating both the individual and collective dimensions of genre in addition to its three main facets: the form, the content and the purpose. We next present an analytic framework of the prominent cues affecting email management to summarize the process of interpreting messages by the recipient. A typology of the categorization patterns of managers is also developed and used in a statistical experiment aiming to automatically describe and categorize email. Resulting from this experiment, we observe specific linguistic behaviours that characterize each email category. It is also revealed that automatic categorization based on message lexicon is more efficient than non-lexical categorization. At the conclusion of this research, we suggest enriching the traditional human-computer interaction paradigm with a semiotics of genre in the digital work environments. The study also offers a reflection regarding email membership to a specific genre using the theoretical concepts of hypergenre, genre and sub-genre. The success of the automatic categorization of email according to genre-related facets (the content, the form and the purpose) uncovers valuable insights and perspectives in designing digital work environments with the objective of facilitating the vital performance of textual practices by employees.
Conseil de recherches en sciences humaines du Canada (CRSH), Faculté des études supérieures de l'Université de Montréal