Dissertations / Theses on the topic 'Clustering de documents'
Consult the top 50 dissertations / theses for your research on the topic 'Clustering de documents.'
Khy, Sophoin, Yoshiharu Ishikawa, and Hiroyuki Kitagawa. "Novelty-based Incremental Document Clustering for On-line Documents." IEEE, 2006. http://hdl.handle.net/2237/7520.
Khy, Sophoin, Yoshiharu Ishikawa, and Hiroyuki Kitagawa. "A Novelty-based Clustering Method for On-line Documents." Springer, 2007. http://hdl.handle.net/2237/7739.
Sinka, Mark P. "Issues in the unsupervised clustering of web documents." Thesis, University of Reading, 2006. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.430847.
Hossain, Mahmud Shahriar. "Apriori approach to graph-based clustering of text documents." Thesis, Montana State University, 2008. http://etd.lib.montana.edu/etd/2008/hossain/HossainM0508.pdf.
Aračić, Damir. "Exploring potential improvements to term-based clustering of web documents." Online access for everyone, 2007. http://www.dissertations.wsu.edu/Thesis/Fall2007/D_Aracic_112807.pdf.
Caubet, Marc, and Mònica Cifuentes. "Extracting metadata from textual documents and utilizing metadata for adding textual documents to an ontology." Thesis, Växjö universitet, Matematiska och systemtekniska institutionen, 2006. http://urn.kb.se/resolve?urn=urn:nbn:se:vxu:diva-534.
Tensmeyer, Christopher Alan. "CONFIRM: Clustering of Noisy Form Images using Robust Matching." BYU ScholarsArchive, 2016. https://scholarsarchive.byu.edu/etd/6055.
Tombros, Anastasios. "The effectiveness of query based hierarchic clustering of documents for information retrieval." Thesis, University of Glasgow, 2002. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.248257.
Zamir, Oren Eli. "Clustering web documents : a phrase-based method for grouping search engine results /." Thesis, Connect to this title online; UW restricted, 1999. http://hdl.handle.net/1773/6884.
Poudyal, Prakash. "Automatic extraction and structure of arguments in legal documents." Doctoral thesis, Universidade de Évora, 2018. http://hdl.handle.net/10174/24848.
Espinosa, Javier. "Clustering of Image Search Results to Support Historical Document Recognition." Thesis, Blekinge Tekniska Högskola, Institutionen för datalogi och datorsystemteknik, 2014. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-5577.
Ali, Klaib Alhadi. "Clustering-based labelling scheme : a hybrid approach for efficient querying and updating XML documents." Thesis, University of Huddersfield, 2018. http://eprints.hud.ac.uk/id/eprint/34580/.
Ebadat, Ali-Reza. "Toward Robust Information Extraction Models for Multimedia Documents." PhD thesis, INSA de Rennes, 2012. http://tel.archives-ouvertes.fr/tel-00760383.
Taylor, William P. "A comparative study on ontology generation and text clustering using VSM, LSI, and document ontology models." Connect to this title online, 2007. http://etd.lib.clemson.edu/documents/1193080300/.
Pundlik, Shrinivas J. "Motion segmentation from clustering of sparse point features using spatially constrained mixture models." Connect to this title online, 2009. http://etd.lib.clemson.edu/documents/1252937182/.
Rios, Tatiane Nogueira. "Organização flexível de documentos." Universidade de São Paulo, 2013. http://www.teses.usp.br/teses/disponiveis/55/55134/tde-03052013-101143/.
Full textSeveral methods have been developed to organize the growing number of textual documents. Such methods frequently use clustering algorithms to organize documents with similar topics into clusters. However, there are situations when documents of dffierent clusters can also have similar characteristics. In order to overcome this drawback, it is necessary to develop methods that permit a soft document organization, i.e., clustering documents into different clusters according to different compatibility degrees. Among the techniques that we can use to develop methods in this sense, we highlight fuzzy clustering algorithms (FCA). By using FCA, one of the most important steps is the evaluation of the yield organization, which is performed considering that all analyzed topics are adequately identified by cluster descriptors. In general, cluster descriptors are extracted using some heuristic over a small number of documents. The adequate extraction and evaluation of cluster descriptors is important because they are terms that represent the collection and identify the topics of the documents. Therefore, an adequate description of the obtained clusters is as important as a good clustering, since the same descriptor might identify one or more clusters. Hence, the development of methods to extract descriptors from fuzzy clusters obtained for soft organization of documents motivated this thesis. 
Aiming at investigating such methods, we developed: i) the SoftO-FDCL (Soft Organization - Fuzzy Description Comes Last) method, in which descriptors of fuzzy clusters are extracted after clustering the documents, identifying topics regardless of the adopted fuzzy clustering algorithm; ii) the SoftO-wFDCL (Soft Organization - weighted Fuzzy Description Comes Last) method, in which descriptors are also extracted after the fuzzy clustering process, but using the membership degrees of the documents as a weighting factor for the candidate descriptors; iii) the HSoftO-FDCL (Hierarchical Soft Organization - Fuzzy Description Comes Last) method, in which descriptors of hierarchical fuzzy clusters are extracted after the hierarchical fuzzy clustering process, identifying topics by means of a soft hierarchical organization of documents. Besides presenting these new methods, this thesis also discusses the application of the SoftO-FDCL method to documents produced by the Canadian continuing medical education program, demonstrating the utility and applicability of the soft organization of documents in a real-world scenario.
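The membership-weighted descriptor idea behind SoftO-wFDCL can be illustrated with a minimal sketch. The function name, the whitespace tokenization and the sum-of-memberships score below are illustrative assumptions for exposition, not the thesis's actual algorithm:

```python
from collections import defaultdict

def weighted_descriptors(docs, memberships, top_k=2):
    """Score candidate descriptor terms per fuzzy cluster by summing, for each
    document containing the term, that document's membership degree in the
    cluster.  Top-scoring terms become the cluster's descriptors."""
    n_clusters = len(memberships[0])
    scores = [defaultdict(float) for _ in range(n_clusters)]
    for doc, degrees in zip(docs, memberships):
        for term in set(doc.split()):          # each term counted once per doc
            for c, mu in enumerate(degrees):
                scores[c][term] += mu
    # Sort by descending score, then alphabetically to break ties.
    return [
        [t for t, _ in sorted(s.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]]
        for s in scores
    ]

docs = [
    "fuzzy clustering of text",
    "fuzzy membership degrees",
    "suffix tree clustering",
]
# Membership of each document in two fuzzy clusters (rows sum to 1).
memberships = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
print(weighted_descriptors(docs, memberships))
```

Because each document contributes to every cluster's scores in proportion to its membership degree, a document sitting between two topics influences the descriptors of both clusters, which is exactly what a crisp clustering cannot express.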
Dunkel, Christopher T. "Person detection and tracking using binocular Lucas-Kanade feature tracking and k-means clustering." Connect to this title online, 2008. http://etd.lib.clemson.edu/documents/1219850371/.
Full textAu, Émilie. "Intégration de la sémantique dans la représentation de documents par les arbres de dépendances syntaxiques." Mémoire, Université de Sherbrooke, 2011. http://savoirs.usherbrooke.ca/handle/11143/4938.
Full textLecerf, Loïc. "L' apprentissage machine pour assister l'annotation de documents : clustering visuel interactif, apprentissage actif et extraction automatique des descripteurs." Paris 6, 2009. http://www.theses.fr/2009PA066186.
Full textTarafdar, Arundhati. "Wordspotting from multilingual and stylistic documents." Thesis, Tours, 2017. http://www.theses.fr/2017TOUR4022/document.
Full textWord spotting in graphical documents is a very challenging task. To address such scenarios this thesis deals with developing a word spotting system dedicated to geographical documents with Bangla and English (Roman) scripts. In the proposed system, at first, text-graphics layers are separated using filtering, clustering and self-reinforcement through classifier. Additionally, instead of using binary decision we have used probabilistic measurement to represent the text components. Subsequently, in the text layer, character segmentation approach is applied using water-reservoir based method to extract individual character from the document. Then recognition of these isolated characters is done using rotation invariant feature, coupled with SVM classifier. Well recognized characters are then grouped based on their sizes. Initial spotting is started to find a query word among those groups of characters. In case if the system could spot a word partially due to any noise, SIFT is applied to identify missing portion of that partial spotting. Experimental results on Roman and Bangla scripts document images show that the method is feasible to spot a location in text labeled graphical documents. Experiments are done on an annotated dataset which was developed for this work. We have made this annotated dataset available publicly for other researchers
Sellah, Smail. "Approche automatisée d'assistance à la structuration des connaissances." Thesis, Bourgogne Franche-Comté, 2019. http://www.theses.fr/2019UBFCA026.
In a globalized context, companies must be innovative to increase their productivity and survive in an increasingly competitive market. Innovations, potential sources of profit for a company, can occur at the level of a process, a new product, a service, etc. An innovative company is a company that capitalizes on its knowledge. Knowledge management (KM) is a set of approaches that can address a range of knowledge-related issues, including the capitalization of knowledge. However, despite the benefits and the positive impact that such practices can have on an organization, they are rarely implemented. In the thesis defended in this manuscript, we are interested in improving the capitalization of knowledge, and in particular the structuring of information, in order to propose candidate knowledge. Our goal is to make access to knowledge more effective for the business. To do this, we must reduce the number of irrelevant results and identify the knowledge that can help the business with its daily problems. Through this approach, we can help an organization optimize its feedback and the time spent in the different processes put in place. In order to meet these challenges, we set up a collection of elementary components, each with a specific role, organized as an interactive cycle. Each component interacts with the others, the underlying idea being that a component improves its results by learning from the results of the other components. Users interact directly with these components in a transparent way. To search for knowledge, the cycle scrutinizes and analyzes the behavior of users to better understand their expectations. Thus, the cycle is able to learn and improve, so as to better capture and seek out the knowledge of the company. The first component, named "identification and representation of knowledge", has the role of exploiting a set of documents in order to extract the knowledge within this corpus.
The second component aims to organize this set of documents using the knowledge extracted by the first component. The last component builds on the results provided by the previous ones: its role is to let users perform a semantic search by exploiting the knowledge model built by the first component and the document organization offered by the second. This last component aims to share knowledge; it is not restricted to search alone, but also includes a suggestion mechanism that assists users in their search by offering similar documents, etc. The global approach is tested and validated on a set of documents from Reuters newspaper articles. The results of the automatic analysis are compared to the tags produced by human readers.
Ouji, Asma. "Segmentation et classification dans les images de documents numérisés." Phd thesis, INSA de Lyon, 2012. http://tel.archives-ouvertes.fr/tel-00749933.
Full textBui, Quang Vu. "Pretopology and Topic Modeling for Complex Systems Analysis : Application on Document Classification and Complex Network Analysis." Thesis, Paris Sciences et Lettres (ComUE), 2018. http://www.theses.fr/2018PSLEP034/document.
The work of this thesis presents the development of algorithms for document classification on the one hand, and complex network analysis on the other, based on pretopology, a theory that models the concept of proximity. The first work develops a framework for document clustering by combining topic modeling and pretopology. Our contribution proposes using the topic distributions extracted by topic modeling as input for classification methods. In this approach, we investigated two aspects: determining an appropriate distance between documents by studying the relevance of probabilistic-based and vector-based measurements, and performing groupings according to several criteria using a pseudo-distance defined from pretopology. The second work introduces a general framework for modeling complex networks by developing a reformulation of stochastic pretopology, and proposes the Pretopology Cascade Model as a general model of information diffusion. In addition, we propose an agent-based model, Textual-ABM, to analyze complex dynamic networks associated with textual information using the author-topic model, and introduce Textual-Homo-IC, an independent cascade model based on resemblance, in which homophily is measured from textual content obtained by means of topic modeling.
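One standard probabilistic-based measurement for comparing per-document topic distributions, of the kind the abstract above says the thesis studies, is the Hellinger distance. A small sketch, used here only as a plausible example of such a measure rather than the one the thesis necessarily retains:

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions
    (e.g. per-document topic proportions from a topic model).
    Ranges from 0 (identical) to 1 (disjoint support)."""
    assert abs(sum(p) - 1) < 1e-6 and abs(sum(q) - 1) < 1e-6
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2
                               for a, b in zip(p, q)))

doc_a = [0.7, 0.2, 0.1]   # topic proportions of three hypothetical documents
doc_b = [0.6, 0.3, 0.1]
doc_c = [0.1, 0.1, 0.8]
print(hellinger(doc_a, doc_b) < hellinger(doc_a, doc_c))  # True: A is closer to B
```

Unlike Euclidean distance on raw proportions, the Hellinger distance is bounded and designed for probability vectors, which is why it is a common choice when topic distributions feed a clustering method.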
Fiorini, Nicolas. "Semantic similarities at the core of generic indexing and clustering approaches." Thesis, Montpellier, 2015. http://www.theses.fr/2015MONTS178/document.
In order to improve the exploitation of the ever-growing number of electronic documents, Artificial Intelligence has dedicated a lot of effort to the creation and use of systems grounded on knowledge bases. In the information retrieval field in particular, such semantic approaches have proved their efficiency. Indexing documents is therefore a necessary task. It consists of associating them with sets of terms that describe their content. These terms can be keywords but also concepts from an ontology, in which case the annotation is said to be semantic and benefits from the inherent properties of ontologies, notably the absence of ambiguity. Most approaches designed to annotate documents have to parse them and extract concepts from this parsing. This underlines the dependence of such approaches on the type of document, since parsing requires dedicated algorithms. On the other hand, approaches that rely solely on semantic annotations can ignore the document type, enabling the creation of generic processes. This thesis capitalizes on genericity to build novel systems and compare them to state-of-the-art approaches. To this end, we rely on semantic annotations coupled with semantic similarity measures. Of course, such generic approaches can then be enriched with type-specific ones, which would further increase the quality of the results. First of all, this work explores the relevance of this paradigm for indexing documents. The idea is to rely on already-annotated close documents to annotate a target document. We define a heuristic algorithm for this purpose that uses the semantic annotations of these close documents, together with semantic similarities, to provide a generic indexing method. This results in USI (User-oriented Semantic Indexer), which we show to perform as well as the best current systems while being faster. Second, this idea is extended to another task: clustering.
Clustering is a very common and long-standing process that is very useful for finding documents or understanding a set of documents. We propose a hierarchical clustering algorithm that reuses the same components as classical methods to provide a novel one applicable to any kind of document. Another benefit of this approach is that when documents are grouped together, the group can be annotated using our indexing algorithm. The result is therefore not only a hierarchy of clusters containing documents: the clusters themselves are described by concepts as well, which greatly helps in understanding the results of the clustering. This thesis shows that apart from enhancing classical approaches, building conceptual approaches allows us to abstract them and provide a generic framework. While bringing easy-to-set-up methods (as long as documents are semantically annotated), genericity does not prevent us from mixing these methods with type-specific ones, in other words creating hybrid methods.
Dupuy, Grégor. "Les collections volumineuses de documents audiovisuels : segmentation et regroupement en locuteurs." Thesis, Le Mans, 2015. http://www.theses.fr/2015LEMA1006/document.
The task of speaker diarization, as defined by NIST, considers the recordings of a corpus as independent processes. The recordings are processed separately, and the overall error rate is a weighted average. In this context, detected speakers are identified by anonymous labels specific to each recording; a speaker appearing in several recordings will therefore be identified by a different label in each of them. Yet this situation is very common in broadcast news data: hosts, journalists and other guests may appear recurrently. Consequently, speaker diarization has recently been considered in a broader context, where recurring speakers must be uniquely identified in every recording that composes a corpus. This generalization of the speaker partitioning problem goes hand in hand with the emergence of the concept of collections, which refers, in the context of speaker diarization, to a set of recordings sharing one or more common characteristics. The work proposed in this thesis concerns speaker clustering in large audiovisual collections (several tens of hours of recordings). The main objective is to propose (or adapt) clustering approaches in order to efficiently process large volumes of data while detecting recurrent speakers. The effectiveness of the proposed approaches is discussed from two points of view: first, the quality of the produced clustering (in terms of error rate), and second, the time required to perform the process. For this purpose, we propose two architectures designed to perform cross-show speaker diarization on collections of recordings. We propose a simplifying approach that decomposes a large clustering problem into several independent sub-problems. These sub-problems are solved with either of two clustering approaches that take advantage of recent advances in speaker modeling.
Felhi, Mehdi. "Document image segmentation : content categorization." Thesis, Université de Lorraine, 2014. http://www.theses.fr/2014LORR0109/document.
In this thesis I discuss the document image segmentation problem and describe our new approaches for detecting and classifying document contents. First, I discuss our skew angle estimation approach, whose aim is to automatically estimate, with precision, the skew angle of text in document images. Our method is based on Maximum Gradient Difference (MGD) and the R-signature. I then describe our second method, based on the Ridgelet transform. Our second contribution consists of a new hybrid page segmentation approach. I first describe our stroke-based descriptor, which detects text and line candidates using the skeleton of the binarized document image. Then, an active contour model is applied to segment the rest of the image into photo and background regions. Finally, text candidates are clustered according to their sizes using the mean-shift analysis technique. The method is applied to segmenting scanned document images (newspapers and magazines) that contain text, lines and photo regions. Finally, I describe our stroke-based text extraction method. Our approach begins by extracting connected components and selecting text character candidates over the CIE LCH color space, using Histogram of Oriented Gradients (HOG) correlation coefficients to detect low-contrast regions. The text region candidates are clustered using two different approaches: a depth-first search over a graph, and a stable text line criterion. Finally, the resulting regions are refined by classifying the text line candidates into "text" and "non-text" regions using a kernel support vector machine (K-SVM) classifier.
Johnson, Samuel. "Document Clustering Interface." Thesis, Linköpings universitet, Institutionen för datavetenskap, 2014. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-112878.
Lai, Hien Phuong. "Vers un système interactif de structuration des index pour une recherche par le contenu dans des grandes bases d'images." PhD thesis, Université de La Rochelle, 2013. http://tel.archives-ouvertes.fr/tel-00934842.
Galåen, Magnus. "Dokument-klynging (document clustering)." Thesis, Norwegian University of Science and Technology, Department of Computer and Information Science, 2008. http://urn.kb.se/resolve?urn=urn:nbn:no:ntnu:diva-8868.
As document searching becomes more and more important with the rapid growth of document bases today, document clustering also becomes more important. Some of the most commonly used document clustering algorithms today are purely statistical in nature. Other algorithms have emerged that address some of the issues with numerical algorithms and claim to be better. This thesis compares two well-known algorithms, Elliptic K-Means and Suffix Tree Clustering, in terms of speed and quality. It is shown that Elliptic K-Means performs better in speed, while Suffix Tree Clustering (STC) performs better in quality. It is further shown that on real web data, STC performs better using small portions of relevant text (snippets) than using the full documents. It is also shown that a threshold value for base cluster merging is unnecessary. As STC is shown to perform adequately in speed when running on snippets only, it is concluded that STC is the better algorithm for the purpose of search results clustering.
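The base-cluster phase that distinguishes STC from k-means can be sketched without an actual suffix tree: any phrase shared by enough snippets defines a base cluster. The simplified n-gram version below (function name, toy snippets and thresholds are illustrative assumptions) captures that first phase:

```python
from collections import defaultdict

def base_clusters(snippets, max_len=3, min_docs=2):
    """Simplified first phase of Suffix Tree Clustering: find every phrase
    (word n-gram up to max_len) shared by at least `min_docs` snippets.
    A real STC builds a generalized suffix tree for efficiency; a dictionary
    of n-grams yields the same base clusters on short snippets."""
    phrase_docs = defaultdict(set)
    for i, text in enumerate(snippets):
        words = text.lower().split()
        for n in range(1, max_len + 1):
            for j in range(len(words) - n + 1):
                phrase_docs[" ".join(words[j:j + n])].add(i)
    return {p: docs for p, docs in phrase_docs.items() if len(docs) >= min_docs}

snips = [
    "suffix tree clustering of snippets",
    "clustering of snippets is fast",
    "elliptic k-means clustering",
]
clusters = base_clusters(snips)
print(clusters["clustering of snippets"])  # documents 0 and 1 share this phrase
```

In full STC, these base clusters are then scored and merged when their document sets overlap enough; the thesis's finding is that the merge threshold can be dispensed with.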
Stankov, Ivan. "Semantically enhanced document clustering." Thesis, Cardiff University, 2013. http://orca.cf.ac.uk/47585/.
Full textClaude, Grégory. "Modélisation de documents et recherches de points communs : propositions d'un framework de gestion de fiches d'anomalie pour faciliter les maintenances corrective et préventive." Toulouse 3, 2012. http://thesesups.ups-tlse.fr/1575/.
The daily practice of an activity generates a body of knowledge that results in the know-how, mastery and skill a person gains over time. In order to take advantage of this experience, the capitalization of knowledge has become an essential activity for companies. Our research work aims to model and implement a system that extracts and formalizes knowledge from the defects that occur in the context of industrial production, and to integrate it into a framework that facilitates corrective and preventive maintenance. This framework organizes the knowledge in the form of groups of defects. These groups can be compared to patterns: they represent a problem to which one or more solutions are related. They are not defined a priori; the analysis of past defects generates the relevant groups, which may change with the addition of new defects. To identify these patterns, a complete knowledge extraction and formalization process is adopted: Knowledge Discovery in Databases, well known in the domain of knowledge management. This process has been applied in very diverse fields. In this work, we give it a new dimension: the processing of defects, especially those that occur during industrial production processes. The generic steps that compose it, from simple data selection to the interpretation of the patterns that carry knowledge, are all considered, and a specific treatment, relevant to our applicative context, is assigned to each of these steps.
Li, Yanjun. "High Performance Text Document Clustering." Wright State University / OhioLINK, 2007. http://rave.ohiolink.edu/etdc/view?acc_num=wright1181005422.
Full textClaude, Grégory. "Modélisation de documents et recherche de points communs - Proposition d'un framework de gestion de fiches d'anomalie pour faciliter les maintenances corrective et préventive." Phd thesis, Université Paul Sabatier - Toulouse III, 2012. http://tel.archives-ouvertes.fr/tel-00701752.
Akbar, Monika. "FP-growth approach for document clustering." Thesis, Montana State University, 2008. http://etd.lib.montana.edu/etd/2008/akbar/AkbarM0508.pdf.
Wang, Yong. "Incorporating semantic and syntactic information into document representation for document clustering." Diss., Mississippi State : Mississippi State University, 2005. http://library.msstate.edu/etd/show.asp?etd=etd-07072005-105806.
Davis, Aaron Samuel. "Bisecting Document Clustering Using Model-Based Methods /." Diss., CLICK HERE for online access, 2010. http://contentdm.lib.byu.edu/ETD/image/etd3332.pdf.
Davis, Aaron Samuel. "Bisecting Document Clustering Using Model-Based Methods." BYU ScholarsArchive, 2009. https://scholarsarchive.byu.edu/etd/1938.
Kim, Young-Min. "Document clustering in a learned concept space." Paris 6, 2010. http://www.theses.fr/2010PA066459.
Latif, Seemab. "Automatic summarisation as pre-processing for document clustering." Thesis, University of Manchester, 2010. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.521783.
Geiss, Johanna. "Latent semantic sentence clustering for multi-document summarization." Thesis, University of Cambridge, 2011. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.609761.
Rosell, Magnus. "Clustering in Swedish : The Impact of some Properties of the Swedish Language on Document Clustering and an Evaluation Method." Licentiate thesis, Stockholm, 2005. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-438.
He, Binlai. "A Document Recommender Based on Word Embedding." Thesis, KTH, Skolan för elektro- och systemteknik (EES), 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-183502.
Leixner, Petr. "Shlukování textových dat." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2010. http://www.nusl.cz/ntk/nusl-237188.
Alise, Dario Fioravante. "Algoritmo di "Label Propagation" per il clustering di documenti testuali." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2017. http://amslaurea.unibo.it/14388/.
Jarolím, Jordán. "Analýza a získávání informací ze souboru dokumentů spojených do jednoho celku." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2018. http://www.nusl.cz/ntk/nusl-385929.
Walker, Daniel David. "Bayesian Test Analytics for Document Collections." BYU ScholarsArchive, 2012. https://scholarsarchive.byu.edu/etd/3530.
Full textLOU, YI-SHENG, and 劉易昇. "Document Clustering and Visualization of Documents Based on PageRank." Thesis, 2014. http://ndltd.ncl.edu.tw/handle/06143626443030804077.
Full text國立臺灣科技大學
資訊管理系
102
In this paper, we propose a document clustering and visualization scheme based on PageRank-driven agglomerative clustering. This approach can be used to analyze document sets so that people may quickly grasp the main topics or issues within a document set. In addition, two metrics, compactness and connectivity, are defined to measure the quality of document clusters. Experimental results show that the PageRank-based approach outperforms a k-means-based approach on both metrics, by aggregating data strictly and eliminating outliers effectively. The scheme has been tested on several document sets, with satisfactory analysis results. A visualization of 1,000 sports news articles based on this scheme is further given to show its applicability.
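The abstract does not give the thesis's exact graph construction, but the PageRank core of such a scheme can be sketched as power iteration over a document-similarity graph. The adjacency-list format, toy graph and damping value below are conventional assumptions, not details from the thesis:

```python
def pagerank(adj, damping=0.85, iters=100):
    """Power-iteration PageRank over a graph given as an adjacency list
    {node: [neighbours]}.  In a document-similarity graph, documents linked
    to many well-connected neighbours score highly and can seed the centres
    of an agglomerative clustering, while low-rank nodes flag outliers."""
    n = len(adj)
    ranks = {v: 1.0 / n for v in adj}
    for _ in range(iters):
        nxt = {v: (1 - damping) / n for v in adj}
        for v, nbrs in adj.items():
            if nbrs:
                share = damping * ranks[v] / len(nbrs)
                for u in nbrs:
                    nxt[u] += share
        # Dangling nodes: spread their rank uniformly so mass is conserved.
        dangling = damping * sum(ranks[v] for v, nbrs in adj.items() if not nbrs)
        for v in nxt:
            nxt[v] += dangling / n
        ranks = nxt
    return ranks

# Tiny similarity graph: d0 and d1 are mutually similar; d2 links only to d1.
graph = {"d0": ["d1"], "d1": ["d0", "d2"], "d2": ["d1"]}
r = pagerank(graph)
print(max(r, key=r.get))  # d1, the best-connected document
```

Outlier elimination falls out naturally: a document weakly similar to everything else receives little rank and can be dropped before or during agglomeration.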
Liu, Shih-Chi (劉世琪). "The Automatic Clustering of Domain-Specific Chinese Documents." Thesis, 2007. http://ndltd.ncl.edu.tw/handle/12287595355701437385.
Yuan Ze University
Department of Information Management
Academic year 95 (2006-2007)
In the domain of knowledge management, enterprises are at an early stage of building document management systems, and the documents their authors contribute are not classified very effectively, so users cannot search and use them efficiently once the collection grows large. A lot of research shows that keywords can help users decide whether a document is useful, and that grouping large numbers of documents according to their similarity offers users a more efficient way of searching. For this reason, our experiments use the Electronic Theses and Dissertations System to retrieve photonics documents on color filters and Liquid Crystal Displays (LCD). We improve Kea, an algorithm for automatically extracting keyphrases, to handle Chinese texts. In addition, by analyzing the results of hierarchical clustering algorithms, we can assist administrators in assessing suitable ways to categorize the documents.
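Kea's central statistic is a TF-IDF-style score for candidate phrases (the full algorithm also uses a first-occurrence feature and a Naive Bayes model, omitted here). A minimal single-word version of that scoring, on an invented toy corpus, might look like:

```python
import math
from collections import Counter

def tfidf_keywords(docs, doc_index, top_k=3):
    """Rank the terms of one document by TF-IDF against the collection --
    the core statistic behind Kea-style keyphrase scoring.  TF is the term's
    frequency in the document; IDF penalizes terms common across documents."""
    df = Counter()                      # document frequency of each term
    for d in docs:
        df.update(set(d.split()))
    n = len(docs)
    words = docs[doc_index].split()
    tf = Counter(words)
    scores = {w: (tf[w] / len(words)) * math.log(n / df[w]) for w in tf}
    # Sort by descending score, then alphabetically to break ties.
    return [w for w, _ in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]]

corpus = [
    "liquid crystal display color filter",
    "color filter manufacturing process",
    "thesis search in electronic theses system",
]
print(tfidf_keywords(corpus, 0))
```

Terms such as "color" and "filter", which recur across the toy collection, are down-weighted relative to terms distinctive of the first document, which is the behavior that makes TF-IDF useful for keyphrase candidates.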
Lei, Ying-Chieh (雷穎傑). "A Level-wise Clustering Algorithm on Structured Documents." Thesis, 2003. http://ndltd.ncl.edu.tw/handle/92797998848353514145.
National Chiao Tung University
Department of Computer and Information Science
Academic year 91 (2002-2003)
Document clustering is the process of applying clustering techniques to document management. Similar documents can be grouped together by clustering, so that both managing and searching the documents become efficient. However, most existing document clustering algorithms do not take the structural information of the documents into consideration, so the clustering results cannot fully reflect the characteristics of the documents. We therefore represent each document as a tree structure and propose a level-wise clustering algorithm to solve this problem. The clustering process exploits the level property of the tree and runs level by level through a concept generation operation. In order to store the clustering results and search for similar clusters efficiently, a multistage graph is proposed. Based on the multistage graph, three search strategies are provided to meet the needs of different uses. Finally, our experimental results show that the similarity search is efficient and its accuracy is acceptable.
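The level-by-level flavor of such an algorithm can be illustrated with a toy similarity that compares the label sets of two document trees one depth at a time. The tuple-based tree encoding and the per-level Jaccard average below are illustrative choices for exposition, not the thesis's actual concept generation operation:

```python
def levels(tree):
    """Collect the set of labels found at each depth of a document tree
    encoded as nested (label, [children]) tuples."""
    out, frontier = [], [tree]
    while frontier:
        out.append({label for label, _ in frontier})
        frontier = [child for _, kids in frontier for child in kids]
    return out

def levelwise_similarity(t1, t2):
    """Average per-level Jaccard similarity: trees are compared depth by
    depth, so two documents with the same top-level structure but different
    leaves still score reasonably high."""
    l1, l2 = levels(t1), levels(t2)
    depth = max(len(l1), len(l2))
    total = 0.0
    for d in range(depth):
        a = l1[d] if d < len(l1) else set()
        b = l2[d] if d < len(l2) else set()
        total += len(a & b) / len(a | b) if a | b else 1.0
    return total / depth

doc1 = ("article", [("title", []), ("body", [("para", [])])])
doc2 = ("article", [("title", []), ("refs", [])])
print(levelwise_similarity(doc1, doc2))
```

Processing one level at a time is also what makes such an algorithm amenable to the multistage graph the abstract mentions: each stage of the graph can hold the clusters formed up to a given depth.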
Liou, Po-Lun (劉博倫). "Retrieving Representative Structures from XML Documents Using Clustering Techniques." Thesis, 2010. http://ndltd.ncl.edu.tw/handle/89488738673347465186.
National Yunlin University of Science and Technology
Graduate School of Electronic and Information Engineering
Academic year 98 (2009-2010)
In the paper, we addressed the problem of finding the common structures in a collection of XML documents. Since an XML document can be represented as a tree structure, the problem how to cluster a collection of XML documents can be considered as how to cluster a collection of tree-structured documents. First, we used SOM (Self-Organizing Map) with the Jaccard coefficient to cluster XML documents. Then, an efficient sequential mining method called GST was applied to find maximum frequent sequences. Finally, we merged the maximum frequent sequences to produce the common structures in a cluster.