To see the other types of publications on this topic, follow the link: Clustering de documents.

Journal articles on the topic 'Clustering de documents'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles

Select a source type:

Consult the top 50 journal articles for your research on the topic 'Clustering de documents.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Browse journal articles from a wide variety of disciplines and organise your bibliography correctly.

1

Ma, Shutian, and Chengzhi Zhang. "Document representation and clustering models for bilingual documents clustering." Proceedings of the Association for Information Science and Technology 54, no. 1 (January 2017): 499–502. http://dx.doi.org/10.1002/pra2.2017.14505401056.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

Ma, Shutian, Chengzhi Zhang, and Daqing He. "Document representation methods for clustering bilingual documents." Proceedings of the Association for Information Science and Technology 53, no. 1 (2016): 1–10. http://dx.doi.org/10.1002/pra2.2016.14505301065.

Full text
APA, Harvard, Vancouver, ISO, and other styles
3

Tarczynski, Tomasz. "Document Clustering - Concepts, Metrics and Algorithms." International Journal of Electronics and Telecommunications 57, no. 3 (September 1, 2011): 271–77. http://dx.doi.org/10.2478/v10177-011-0036-5.

Full text
Abstract:
Document clustering, also referred to as text clustering, is a technique of unsupervised document organisation. Text clustering is used to group documents into subsets consisting of texts that are similar to each other. These subsets are called clusters. Document clustering algorithms are widely used in web search engines to produce results relevant to a query. An example of practical use of those techniques is Yahoo!'s hierarchies of documents [1]. Another application of document clustering is browsing, which is defined as a searching session without a well-specified goal. Browsing techniques rely heavily on document clustering. In this article we examine the most important concepts related to document clustering. Besides the algorithms, we present a comprehensive discussion of the representation of documents, the calculation of similarity between documents, and the evaluation of cluster quality.
APA, Harvard, Vancouver, ISO, and other styles
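The similarity calculation this abstract refers to is most commonly cosine similarity over term-frequency vectors. A minimal sketch (toy documents and plain term counts; real systems would typically apply a weighting scheme such as TF-IDF on top):

```python
from collections import Counter
from math import sqrt

def bow(text):
    """Represent a document as a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

d1 = bow("clustering groups similar documents into clusters")
d2 = bow("document clustering groups similar texts")
d3 = bow("grey wolf optimizer for numeric problems")
print(cosine(d1, d2) > cosine(d1, d3))  # → True
```

Documents on the same topic share more terms and thus score higher, which is exactly the property clustering algorithms exploit.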
4

Rege, Manjeet, Josan Koruthu, and Reynold Bailey. "On Knowledge-Enhanced Document Clustering." International Journal of Information Retrieval Research 2, no. 3 (July 2012): 72–82. http://dx.doi.org/10.4018/ijirr.2012070105.

Full text
Abstract:
Document clustering plays an important role in text analytics by finding natural groupings of documents based on their similarity, determined by the words appearing in them. Many of the clustering algorithms accessible through various text analytics tools are completely unsupervised in nature. That is, they are unable to incorporate any domain knowledge that might be available about the documents to improve the clustering accuracy and relevance. The authors present a graph-partitioning-based semi-supervised document clustering algorithm. The user provides knowledge about a few of the documents in the form of “must-link” and “cannot-link” constraints between pairs of documents. A “must-link” constraint between two documents expresses the fact that the user feels the two documents must be clustered together irrespective of their dissimilarity. Similarly, a “cannot-link” constraint signifies that the two documents should never be clustered together no matter how similar they might happen to be. These constraints are then incorporated into a computationally efficient graph-partitioning-based document clustering algorithm. Through experiments performed on publicly available text datasets, the proposed framework is validated.
APA, Harvard, Vancouver, ISO, and other styles
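The must-link/cannot-link idea can be sketched by injecting the constraints into a pairwise similarity matrix before clustering. This toy version uses single-link merging via union-find rather than the paper's graph-partitioning algorithm, and note one caveat: chains of merges can still connect a cannot-link pair indirectly, which the paper's formulation handles more strictly.

```python
import numpy as np

def constrained_clusters(sim, must_link, cannot_link, threshold=0.5):
    """Single-link clustering over a similarity matrix after injecting
    pairwise constraints: must-link pairs get maximal similarity,
    cannot-link pairs get zero similarity."""
    S = sim.copy()
    for i, j in must_link:
        S[i, j] = S[j, i] = 1.0
    for i, j in cannot_link:
        S[i, j] = S[j, i] = 0.0
    parent = list(range(len(S)))
    def find(x):                      # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i in range(len(S)):
        for j in range(i + 1, len(S)):
            if S[i, j] >= threshold:  # merge sufficiently similar pairs
                parent[find(i)] = find(j)
    return [find(i) for i in range(len(S))]

sim = np.array([[1.0, 0.9, 0.1, 0.1],
                [0.9, 1.0, 0.1, 0.1],
                [0.1, 0.1, 1.0, 0.9],
                [0.1, 0.1, 0.9, 1.0]])
# a cannot-link constraint overrides the high similarity of docs 0 and 1
print(constrained_clusters(sim, must_link=[], cannot_link=[(0, 1)]))  # → [0, 1, 3, 3]
```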
5

Musa, Saiful Bahri, Andi Baso Kaswar, Supria Supria, and Susiana Sari. "DOCUMENT CLUSTERING BY DYNAMIC HIERARCHICAL ALGORITHM BASED ON FUZZY SET TYPE-II FROM FREQUENT ITEMSET." Jurnal Ilmu Komputer dan Informasi 9, no. 2 (June 25, 2016): 88. http://dx.doi.org/10.21609/jiki.v9i2.383.

Full text
Abstract:
One way to facilitate the process of information retrieval is to perform clustering on the collection of existing documents. Existing text documents are often unstructured; their forms are varied and their groupings are ambiguous. These characteristics make the information retrieval process difficult. Moreover, new documents emerge every second and need to be clustered. Generally, static document clustering methods perform clustering after the whole document collection has been gathered. However, re-clustering the whole collection whenever a new document arrives makes the clustering process inefficient. In this paper, we propose a new method for document clustering with a dynamic hierarchical algorithm based on type-II fuzzy sets over frequent itemsets. To achieve this goal, there are three main phases: determination of key terms, extraction of candidate clusters, and construction of the cluster hierarchy. In the experiments, the method achieved F-measure values of 0.40 for Newsgroup, 0.62 for Classic, and 0.38 for Reuters. Meanwhile, the computation time when a new document is added is lower than that of the previous static method. The results show that this method effectively and efficiently produces hierarchical clustering solutions in a dynamic environment, and it also gives accurate clustering results.
APA, Harvard, Vancouver, ISO, and other styles
6

Onan, Aytug, Hasan Bulut, and Serdar Korukoglu. "An improved ant algorithm with LDA-based representation for text document clustering." Journal of Information Science 43, no. 2 (March 1, 2016): 275–92. http://dx.doi.org/10.1177/0165551516638784.

Full text
Abstract:
Document clustering can be applied in document organisation and browsing, document summarisation and classification. The identification of an appropriate representation for textual documents is extremely important for the performance of clustering or classification algorithms. Textual documents suffer from the high dimensionality and irrelevancy of text features. Besides, conventional clustering algorithms suffer from several shortcomings, such as slow convergence and sensitivity to the initial value. To tackle the problems of conventional clustering algorithms, metaheuristic algorithms are frequently applied to clustering. In this paper, an improved ant clustering algorithm is presented, where two novel heuristic methods are proposed to enhance the clustering quality of ant-based clustering. In addition, the latent Dirichlet allocation (LDA) is used to represent textual documents in a compact and efficient way. The clustering quality of the proposed ant clustering algorithm is compared to the conventional clustering algorithms using 25 text benchmarks in terms of F-measure values. The experimental results indicate that the proposed clustering scheme outperforms the compared conventional and metaheuristic clustering methods for textual documents.
APA, Harvard, Vancouver, ISO, and other styles
7

IYER, SWAMI, and DAN A. SIMOVICI. "STRUCTURAL CLASSIFICATION OF XML DOCUMENTS USING MULTISETS." International Journal on Artificial Intelligence Tools 17, no. 05 (October 2008): 1003–22. http://dx.doi.org/10.1142/s0218213008004266.

Full text
Abstract:
In this paper, we investigate the problem of clustering XML documents based on their structure. We represent the paths in an XML document as a multiset and use the symmetric difference operation on multisets to define certain metrics. These metrics are then used to obtain a measure of similarity between any two documents in a collection. Our technique was successfully applied to real and synthesized XML documents yielding high-quality clusterings.
APA, Harvard, Vancouver, ISO, and other styles
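One plausible reading of the multiset metric described above: represent each XML document as a multiset of its root-to-leaf paths and take the cardinality of the multiset symmetric difference as the distance (the path strings below are invented for illustration).

```python
from collections import Counter

def multiset_symmetric_difference(a: Counter, b: Counter) -> int:
    """|A Δ B| for multisets: sum over elements of |count_A - count_B|."""
    keys = set(a) | set(b)
    return sum(abs(a[k] - b[k]) for k in keys)

def path_distance(doc_a_paths, doc_b_paths):
    """Structural distance between two XML documents, each given as
    the list of its root-to-leaf paths (duplicates preserved)."""
    return multiset_symmetric_difference(Counter(doc_a_paths), Counter(doc_b_paths))

d1 = ["book/title", "book/author", "book/author"]
d2 = ["book/title", "book/author"]
print(path_distance(d1, d2))  # → 1
```

Since this is an L1 distance on count vectors, it is symmetric, non-negative, and zero only for identical path multisets, so it genuinely is a metric.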
8

Thanh, Nguyen Chi, Koichi Yamada, and Muneyuki Unehara. "A Similarity Rough Set Model for Document Representation and Document Clustering." Journal of Advanced Computational Intelligence and Intelligent Informatics 15, no. 2 (March 20, 2011): 125–33. http://dx.doi.org/10.20965/jaciii.2011.p0125.

Full text
Abstract:
Document clustering is a text-mining technique for unsupervised document organization. It helps users browse and navigate large sets of documents. Ho et al. proposed a Tolerance Rough Set Model (TRSM) [1] for improving the vector space model, which represents documents by vectors of terms, and applied it to document clustering. In this paper we analyze their model and propose a new model for efficient clustering of documents. We introduce the Similarity Rough Set Model (SRSM) as another model for representing documents in document clustering. The model is evaluated by experiments on test collections. The results show that the SRSM document clustering method outperforms the one with TRSM, and that the results of SRSM are less affected by the parameter value than those of TRSM.
APA, Harvard, Vancouver, ISO, and other styles
9

Guillaume, Damien, and Fionn Murtagh. "Clustering of XML documents." Computer Physics Communications 127, no. 2-3 (May 2000): 215–27. http://dx.doi.org/10.1016/s0010-4655(99)00511-1.

Full text
APA, Harvard, Vancouver, ISO, and other styles
10

Sinari, Nilam A. "DYNAMIC CLUSTERING OF DOCUMENTS." International Journal of Research in Engineering and Technology 04, no. 18 (May 25, 2015): 6–10. http://dx.doi.org/10.15623/ijret.2015.0418002.

Full text
APA, Harvard, Vancouver, ISO, and other styles
11

Thangamani, M., and P. Thangaraj. "Effective Fuzzy Ontology Based Distributed Document Using Non-Dominated Ranked Genetic Algorithm." International Journal of Intelligent Information Technologies 7, no. 4 (October 2011): 26–46. http://dx.doi.org/10.4018/jiit.2011100102.

Full text
Abstract:
The increase in the number of documents has aggravated the difficulty of classifying those documents according to specific needs. Clustering analysis in a distributed environment is a thrust area in artificial intelligence and data mining. Its fundamental task is to use document features to compute the degree of relationship between objects and to accomplish automatic classification without prior knowledge. Document clustering uses clustering techniques to gather documents of high resemblance together by computing document resemblance. Recent studies have shown that ontologies are useful in improving the performance of document clustering. Ontology is concerned with the conceptualization of a domain into an individually identifiable, machine-readable format containing entities, attributes, relationships, and axioms. By analyzing types of techniques for document clustering, a better clustering technique based on the Genetic Algorithm (GA) is determined. A Non-Dominated Ranked Genetic Algorithm (NRGA) is used in this paper for clustering, which has the capability of providing a better classification result. The experiment is conducted on the 20 Newsgroups data set to evaluate the proposed technique. The result shows that the proposed approach is very effective in clustering documents in the distributed environment.
APA, Harvard, Vancouver, ISO, and other styles
12

Nadubeediramesh, Rashmi, and Aryya Gangopadhyay. "Dynamic Document Clustering Using Singular Value Decomposition." International Journal of Computational Models and Algorithms in Medicine 3, no. 3 (July 2012): 27–55. http://dx.doi.org/10.4018/jcmam.2012070103.

Full text
Abstract:
Incremental document clustering is important in many applications, but particularly so in healthcare contexts where text data is found in abundance, ranging from published research in journals to day-to-day healthcare data such as discharge summaries and nursing notes. In such dynamic environments new documents are constantly added to the set of documents that have been used in the initial cluster formation. Hence it is important to be able to incrementally update the clusters at a low computational cost as new documents are added. In this paper the authors describe a novel, low cost approach for incremental document clustering. Their method is based on conducting singular value decomposition (SVD) incrementally. They dynamically fold in new documents into the existing term-document space and dynamically assign these new documents into pre-defined clusters based on intra-cluster similarity. This saves the cost of re-computing SVD on the entire document set every time updates occur. The authors also provide a way to retrieve documents based on different window sizes with high scalability and good clustering accuracy. They have tested their proposed method experimentally with 960 medical abstracts retrieved from the PubMed medical library. The authors’ incremental method is compared with the default situation where complete re-computation of SVD is done when new documents are added to the initial set of documents. The results show minor decreases in the quality of the cluster formation but much larger gains in computational throughput.
APA, Harvard, Vancouver, ISO, and other styles
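The fold-in step the authors describe is the standard latent-semantic-indexing update: a new document vector d is projected into the existing k-dimensional space as S_k^{-1} U_k^T d, avoiding a full SVD recomputation. A sketch with a random toy term-document matrix (the cluster-assignment step here compares against existing documents rather than the paper's pre-defined cluster centroids):

```python
import numpy as np

# Term-document matrix (terms x docs) for the initial corpus.
rng = np.random.default_rng(0)
A = rng.random((6, 4))

# Truncated SVD: A ≈ U_k S_k V_k^T
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, s_k = U[:, :k], s[:k]

def fold_in(d):
    """Project a new document vector d (length = #terms) into the
    existing k-dimensional latent space without recomputing the SVD:
    d_hat = S_k^{-1} U_k^T d."""
    return (U_k.T @ d) / s_k

docs_k = Vt[:k].T            # existing documents in latent space
new_doc = rng.random(6)
q = fold_in(new_doc)

# assign the folded-in document to the nearest existing document
# by cosine similarity in the latent space
sims = docs_k @ q / (np.linalg.norm(docs_k, axis=1) * np.linalg.norm(q))
print(int(np.argmax(sims)))
```

Folding in one of the original columns of A reproduces its latent coordinates exactly, which is a quick sanity check on the formula.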
13

Tae, J., and D. Shin. "Keyword Clustering for Comparing Documents in Different Languages." International Journal of Machine Learning and Computing 5, no. 4 (August 2015): 277–82. http://dx.doi.org/10.7763/ijmlc.2015.v5.520.

Full text
APA, Harvard, Vancouver, ISO, and other styles
14

Manukonda, Sumathi Rani, and Nomula Divya. "Efficient Document Clustering for Web Search Result." International Journal of Engineering & Technology 7, no. 3.3 (June 21, 2018): 90. http://dx.doi.org/10.14419/ijet.v7i3.3.14494.

Full text
Abstract:
Clustering documents in data mining is one of the traditional approaches in which documents that are more relevant to each other are grouped together. Document clustering helps achieve accuracy when retrieving information for systems that identify the nearest neighbours of a document. Day by day a massive quantity of data is generated and must be clustered, and although different clustering methods have been introduced to improve cluster quality, many challenges still exist for the improvement of document clustering. For web search purposes, documents in a group are efficiently arranged for result retrieval, so users can search with a query in an organised way. Hierarchical clustering is attained by document clustering. Most grouping algorithms do not concentrate on the semantic approach, hence resulting in unsatisfactory clustering output. The automatic approach of organising web documents, as used by Google and Yahoo, is often considered a reference. A distinct method identifies existing groups of similar items in previously organised documents and derives an effective document classifier for new documents. In this paper the main concentration is on hierarchical clustering and k-means algorithms; we show that k-means and its variants are more efficient than hierarchical clustering, and we also implement a greedy fast k-means algorithm (GFA) for clustering documents efficiently.
APA, Harvard, Vancouver, ISO, and other styles
15

Avanija, J., and K. Ramar. "Semantic Clustering of Web Documents." International Journal of Information Technology and Web Engineering 7, no. 4 (October 2012): 20–33. http://dx.doi.org/10.4018/jitwe.2012100102.

Full text
Abstract:
With the massive growth and large volume of the web it is very difficult to recover results based on the user preferences. The next generation web architecture, semantic web reduces the burden of the user by performing search based on semantics instead of keywords. Even in the context of semantic technologies optimization problem occurs but rarely considered. In this paper document clustering is applied to recover relevant documents. The authors propose an ontology based clustering algorithm using semantic similarity measure and Particle Swarm Optimization (PSO), which is applied to the annotated documents for optimizing the result. The proposed method uses Jena API and GATE tool API and the documents can be recovered based on their annotation features and relations. A preliminary experiment comparing the proposed method with K-Means shows that the proposed method is feasible and performs better than K-Means.
APA, Harvard, Vancouver, ISO, and other styles
16

Kumar, R. Lakshmana, N. Kannammal, Sujatha Krishnamoorthy, and Seifedine Kadry. "Semantics Based Clustering through Cover-Kmeans with OntoVsm for Information Retrieval." Information Technology And Control 49, no. 3 (September 23, 2020): 370–80. http://dx.doi.org/10.5755/j01.itc.49.3.25988.

Full text
Abstract:
Document clustering plays a significant role in the retrieval of information, seeking to divide documents into groups automatically depending on their content similarity. A cluster consists of documents related within the group (high intra-cluster similarity) and dissimilar to the documents of other groups (low inter-cluster similarity). Document clustering should be considered an unsupervised process that aims to organize documents by identifying underlying structures, i.e. the learning process is unsupervised, so there is no need to specify the correct output for an input. Previous clustering methods do not know the semantic associations between words, so the context of documents cannot be correctly interpreted. To address this problem, semantic ontology information such as WordNet has been widely used to enhance text clustering consistency. This paper first proposes an OntoVSM model to reduce the dimension of documents efficiently. The cover K-means clustering algorithm is proposed for semantic document clustering. The proposed algorithm is a hybrid of K-means and the cover coefficient-based clustering methodology (C3M), improved semantically using the WordNet ontology. The dimensionality reduction based on the semantic knowledge of each term preserves the information without loss. The performance of the proposed work is analysed through experimental results, which show that it gives improved results compared to other standard methods.
APA, Harvard, Vancouver, ISO, and other styles
17

Liu, Chien-Liang, Wen-Hoar Hsaio, Chia-Hoang Lee, and Chun-Hsien Chen. "Clustering tagged documents with labeled and unlabeled documents." Information Processing & Management 49, no. 3 (May 2013): 596–606. http://dx.doi.org/10.1016/j.ipm.2012.12.004.

Full text
APA, Harvard, Vancouver, ISO, and other styles
18

Santoso, Ibnu, and Lya Hulliyyatus Suadaa. "PENGUKURAN TINGKAT KEMIRIPAN DOKUMEN BERBASIS CLUSTER." KLIK - KUMPULAN JURNAL ILMU KOMPUTER 6, no. 1 (February 28, 2019): 71. http://dx.doi.org/10.20527/klik.v6i1.181.

Full text
Abstract:
Document similarity can be measured and used to discover other similar documents in a document collection (corpus). In a small corpus, measuring document similarity is not a problem. In a bigger corpus, comparing similarity rates between documents can be time consuming. A clustering method can be used to minimize the number of documents that have to be compared to a given document, to save time. This research aims to discover the effect of clustering techniques on measuring document similarity and to evaluate the performance. The corpus used consisted of 2,049 undergraduate theses of Politeknik Statistika STIS students from the years 2007-2016. These documents were represented as bag-of-words models and clustered using the k-means clustering method. The similarity measure used is cosine similarity. In the simulation, clustering into 3 clusters needs longer preparation time (17.32%) but results in faster query processing (77.88%) with an accuracy of 0.98. Clustering into 5 clusters needs longer preparation time (31.10%) but results in faster query processing (83.79%) with an accuracy of 0.86. Clustering into 7 clusters needs longer preparation time (45.10%) but results in faster query processing (85.30%) with an accuracy of 0.98.
APA, Harvard, Vancouver, ISO, and other styles
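The speed-up described above comes from comparing a query only against the documents of its nearest cluster instead of the whole corpus. A minimal sketch with plain k-means (deterministic farthest-first initialisation, Euclidean assignment, cosine scoring inside the chosen cluster; the toy 2-D vectors below stand in for the study's bag-of-words representations):

```python
import numpy as np

def init_centers(X, k):
    """Farthest-first traversal: deterministic initial centers."""
    idx = [0]
    for _ in range(k - 1):
        d = ((X[:, None] - X[idx][None]) ** 2).sum(-1).min(axis=1)
        idx.append(int(np.argmax(d)))
    return X[idx].astype(float)

def kmeans(X, k, iters=20):
    """Minimal k-means on row vectors (the clustering stage)."""
    C = init_centers(X, k)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - C[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                C[j] = X[labels == j].mean(axis=0)
    return labels, C

def query_nearest(X, labels, C, q):
    """Score the query only against the nearest cluster's documents,
    trading a little accuracy for a much smaller comparison set."""
    j = int(np.argmin(((C - q) ** 2).sum(-1)))
    idx = np.where(labels == j)[0]
    sims = X[idx] @ q / (np.linalg.norm(X[idx], axis=1) * np.linalg.norm(q))
    return int(idx[np.argmax(sims)])

# toy "document vectors": two well-separated topics
X = np.array([[10., 1], [9, 2], [10, 2], [1, 10], [2, 9], [1, 9]])
labels, C = kmeans(X, k=2)
print(query_nearest(X, labels, C, np.array([9., 2])))  # → 1
```

The query touches only half the corpus here; with more, larger clusters the saving grows, at the cost of the extra clustering (preparation) time the abstract reports.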
19

Rao, Bapuji, and Brojo Kishore Mishra. "An Approach to Clustering of Text Documents Using Graph Mining Techniques." International Journal of Rough Sets and Data Analysis 4, no. 1 (January 2017): 38–55. http://dx.doi.org/10.4018/ijrsda.2017010103.

Full text
Abstract:
This paper introduces a new approach to clustering text documents based on a set of words, using graph mining techniques. The proposed approach clusters (groups) those text documents, from a given set of text documents, in which the given set of words is successfully found. The document-word relation can be represented as a bipartite graph, and each cluster of text documents is represented as a sub-graph. Further, the paper proposes an algorithm for clustering text documents for a given set of words. It is an automated system and requires minimal human interaction. The algorithm has been implemented in the C++ programming language with satisfactory observed results.
APA, Harvard, Vancouver, ISO, and other styles
20

Jalal, Ahmed Adeeb, and Basheer Husham Ali. "Text documents clustering using data mining techniques." International Journal of Electrical and Computer Engineering (IJECE) 11, no. 1 (February 1, 2021): 664. http://dx.doi.org/10.11591/ijece.v11i1.pp664-670.

Full text
Abstract:
Increasing progress in numerous research fields and information technologies has led to an increase in the publication of research papers. Researchers therefore spend a lot of time finding interesting research papers close to their field of specialization. Consequently, in this paper we propose a document classification approach that can cluster the text documents of research papers into meaningful categories, each containing a similar scientific field. The presented approach is based on the essential focus and scope of the target categories, where each category includes many topics. Accordingly, we extract word tokens from the topics that relate to each specific category separately. The frequency of word tokens in documents affects the weight of a document, which is calculated using the numerical statistic term frequency-inverse document frequency (TF-IDF). The proposed approach uses the title, abstract, and keywords of the paper, in addition to the category topics, to perform the classification. Subsequently, documents are classified and clustered into the primary categories based on the highest cosine similarity between category weights and document weights.
APA, Harvard, Vancouver, ISO, and other styles
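The weighting and assignment steps can be sketched directly: compute TF-IDF weights per document, then assign a paper to the category whose topic-word profile has the highest cosine similarity. The category word lists below are invented for illustration, not taken from the paper.

```python
import math
from collections import Counter

def tf_idf(docs):
    """TF-IDF weights per document: tf(t, d) * log(N / df(t))."""
    N = len(docs)
    df = Counter(t for d in docs for t in set(d))
    return [{t: c * math.log(N / df[t]) for t, c in Counter(d).items()} for d in docs]

def cosine(a, b):
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# hypothetical category topic-word profiles, plus one paper's tokens
categories = {
    "clustering": "cluster clustering kmeans unsupervised groups".split(),
    "retrieval":  "query search ranking index retrieval".split(),
}
paper = "document clustering groups similar papers into cluster".split()

corpus = list(categories.values()) + [paper]
weights = tf_idf(corpus)
cat_w = dict(zip(categories, weights))
paper_w = weights[-1]
best = max(categories, key=lambda c: cosine(cat_w[c], paper_w))
print(best)  # → clustering
```

Terms appearing in every document get an IDF of zero, so only discriminative words contribute to the category decision.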
21

Ashokkumar, P., and S. Don. "Link-Based Clustering Algorithm for Clustering Web Documents." Journal of Testing and Evaluation 47, no. 6 (February 28, 2019): 20180497. http://dx.doi.org/10.1520/jte20180497.

Full text
APA, Harvard, Vancouver, ISO, and other styles
22

Pourvali, Mohsen, and Salvatore Orlando. "Enriching Documents by Linking Salient Entities and Lexical-Semantic Expansion." Journal of Intelligent Systems 29, no. 1 (December 4, 2018): 1109–21. http://dx.doi.org/10.1515/jisys-2018-0098.

Full text
Abstract:
This paper explores a multi-strategy technique that aims at enriching text documents for improving clustering quality. We use a combination of entity linking and document summarization in order to determine the identity of the most salient entities mentioned in texts. To effectively enrich documents without introducing noise, we limit ourselves to the text fragments mentioning the salient entities, in turn, belonging to a knowledge base like Wikipedia, while the actual enrichment of text fragments is carried out using WordNet. To feed clustering algorithms, we investigate different document representations obtained using several combinations of document enrichment and feature extraction. This allows us to exploit ensemble clustering, by combining multiple clustering results obtained using different document representations. Our experiments indicate that our novel enriching strategies, combined with ensemble clustering, can improve the quality of classical text clustering when applied to text corpora like The British Broadcasting Corporation (BBC) NEWS.
APA, Harvard, Vancouver, ISO, and other styles
23

Lomakina, L. S., V. B. Rodionov, and A. S. Surkova. "Hierarchical clustering of text documents." Automation and Remote Control 75, no. 7 (July 2014): 1309–15. http://dx.doi.org/10.1134/s000511791407011x.

Full text
APA, Harvard, Vancouver, ISO, and other styles
24

Tagarelli, Andrea, and Sergio Greco. "Semantic clustering of XML documents." ACM Transactions on Information Systems 28, no. 1 (January 2010): 1–56. http://dx.doi.org/10.1145/1658377.1658380.

Full text
APA, Harvard, Vancouver, ISO, and other styles
25

Piernik, Maciej, Dariusz Brzezinski, and Tadeusz Morzy. "Clustering XML documents by patterns." Knowledge and Information Systems 46, no. 1 (January 23, 2015): 185–212. http://dx.doi.org/10.1007/s10115-015-0820-0.

Full text
APA, Harvard, Vancouver, ISO, and other styles
26

Johnson, Andrew, and Farshad Fotouhi. "Adaptive clustering of hypermedia documents." Information Systems 21, no. 6 (September 1996): 459–73. http://dx.doi.org/10.1016/0306-4379(96)00023-3.

Full text
APA, Harvard, Vancouver, ISO, and other styles
27

Greco, Sergio, Francesco Gullo, Giovanni Ponti, and Andrea Tagarelli. "Collaborative clustering of XML documents." Journal of Computer and System Sciences 77, no. 6 (November 2011): 988–1008. http://dx.doi.org/10.1016/j.jcss.2011.02.005.

Full text
APA, Harvard, Vancouver, ISO, and other styles
28

Rashaideh, Hasan, Ahmad Sawaie, Mohammed Azmi Al-Betar, Laith Mohammad Abualigah, Mohammed M. Al-laham, Ra’ed M. Al-Khatib, and Malik Braik. "A Grey Wolf Optimizer for Text Document Clustering." Journal of Intelligent Systems 29, no. 1 (July 21, 2018): 814–30. http://dx.doi.org/10.1515/jisys-2018-0194.

Full text
Abstract:
The text clustering problem (TCP) is a leading process in many key areas such as information retrieval, text mining, and natural language processing. This presents the need for a potent document clustering algorithm that can be used effectively to navigate, summarize, and arrange information to congregate large data sets. This paper encompasses an adaptation of the grey wolf optimizer (GWO) for the TCP, referred to as TCP-GWO. The TCP demands a degree of accuracy beyond that which is possible with metaheuristic swarm-based algorithms. The main issue to be addressed is how to split text documents, on the basis of GWO, into homogeneous clusters that are sufficiently precise and functional. Specifically, TCP-GWO, the proposed document clustering algorithm, uses the average distance of documents to the cluster centroid (ADDC) as an objective function to repeatedly optimize the distance between the clusters of documents. The accuracy and efficiency of the proposed TCP-GWO were demonstrated on a sufficiently large number of documents of variable sizes, randomly selected from a set of six publicly available data sets. Documents of high complexity were also included in the evaluation process to assess the recall detection rate of the document clustering algorithm. The experimental results for a test set of over 1300 documents showed that failure to correctly cluster a document occurred in less than 20% of cases, with a recall rate of more than 65% for a highly complex data set. The high F-measure rate and the ability to cluster documents effectively are important advances resulting from this research. The proposed TCP-GWO method was compared to other well-established text clustering methods using randomly selected data sets. Interestingly, TCP-GWO outperforms the comparative methods in terms of precision, recall, and F-measure rates. In a nutshell, the results illustrate that the proposed TCP-GWO excels compared to the other clustering methods in terms of the measurement criteria, whereby more than 55% of the documents were correctly clustered with a high level of accuracy.
APA, Harvard, Vancouver, ISO, and other styles
29

HU, TIANMING, CHEW LIM TAN, YONG TANG, SAM YUAN SUNG, HUI XIONG, and CHAO QU. "CO-CLUSTERING BIPARTITE WITH PATTERN PRESERVATION FOR TOPIC EXTRACTION." International Journal on Artificial Intelligence Tools 17, no. 01 (February 2008): 87–107. http://dx.doi.org/10.1142/s0218213008003790.

Full text
Abstract:
The duality between document and word clustering naturally leads to the consideration of storing the document dataset in a bipartite. With documents and words modeled as vertices on two sides respectively, partitioning such a graph yields a co-clustering of words and documents. The topic of each cluster can then be represented by the top words and documents that have highest within-cluster degrees. However, such claims may fail if top words and documents are selected simply because they are very general and frequent. In addition, for those words and documents across several topics, it may not be proper to assign them to a single cluster. In other words, to precisely capture the cluster topic, we need to identify those micro-sets of words/documents that are similar among themselves and as a whole, representative of their respective topics. Along this line, in this paper, we use hyperclique patterns, strongly affiliated words/documents, to define such micro-sets. We introduce a new bipartite formulation that incorporates both word hypercliques and document hypercliques as super vertices. By co-preserving hyperclique patterns during the clustering process, our experiments on real-world data sets show that better clustering results can be obtained in terms of various external clustering validation measures and the cluster topic can be more precisely identified. Also, the partitioned bipartite with co-preserved patterns naturally lends itself to different clustering-related functions in search engines. To that end, we illustrate such an application, returning clustered search results for keyword queries. We show that the topic of each cluster with respect to the current query can be identified more accurately with the words and documents from the patterns than with those top ones from the standard bipartite formulation.
APA, Harvard, Vancouver, ISO, and other styles
30

Rafi, Muhammad, Muhammad Waqar, Hareem Ajaz, Umar Ayub, and Muhammad Danish. "Document Clustering using Self-Organizing Maps." MENDEL 23, no. 1 (June 1, 2017): 111–18. http://dx.doi.org/10.13164/mendel.2017.1.111.

Full text
Abstract:
Cluster analysis of textual documents is a common technique for better filtering, navigation, understanding and comprehension of a large document collection. Document clustering is an autonomous method that separates a large heterogeneous document collection into smaller, more homogeneous sub-collections called clusters. A self-organizing map (SOM) is a type of artificial neural network (ANN) that can be used to perform autonomous self-organization of a high-dimensional feature space into low-dimensional projections called maps. It is considered a good method for clustering, as both require unsupervised processing. In this paper, we propose a multi-layer, multi-feature SOM to cluster documents. The paper implements a SOM with four layers, containing lexical terms, phrases and sequences in the bottom layers respectively and combining them all at the top layer. The documents are processed to extract these features to feed the SOM. The internal weights and interconnections between these layers' features (neurons) automatically settle through iterations with a small learning rate to discover the actual clusters. We have performed an extensive set of experiments on standard text-mining datasets such as NEWS20, Reuters and WebKB, with the evaluation measures F-measure and purity. The evaluation gives encouraging results and outperforms some of the existing approaches. We conclude that a SOM with multiple features (lexical terms, phrases and sequences) and multiple layers can be very effective in producing high-quality clusters on large document collections.
APA, Harvard, Vancouver, ISO, and other styles
31

LAYTON, ROBERT, PAUL WATTERS, and RICHARD DAZELEY. "Automated unsupervised authorship analysis using evidence accumulation clustering." Natural Language Engineering 19, no. 1 (November 21, 2011): 95–120. http://dx.doi.org/10.1017/s1351324911000313.

Full text
Abstract:
Authorship Analysis aims to extract information about the authorship of documents from features within those documents. Typically, this is performed as a classification task with the aim of identifying the author of a document, given a set of documents of known authorship. Alternatively, unsupervised methods have been developed primarily as visualisation tools to assist the manual discovery of clusters of authorship within a corpus by analysts. However, there is a need in many fields for more sophisticated unsupervised methods to automate the discovery, profiling and organisation of related information through clustering of documents by authorship. An automated and unsupervised methodology for clustering documents by authorship is proposed in this paper. The methodology is named NUANCE, for n-gram Unsupervised Automated Natural Cluster Ensemble. Testing indicates that the derived clusters have a strong correlation to the true authorship of unseen documents.
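Character n-gram features of the kind NUANCE builds on can be sketched as follows; the function names and toy texts here are illustrative, not taken from the paper:

```python
from collections import Counter

def ngram_profile(text, n=3):
    """Character n-gram frequency profile of a text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(p, q):
    """Cosine similarity between two n-gram frequency profiles."""
    shared = set(p) & set(q)
    dot = sum(p[g] * q[g] for g in shared)
    norm = lambda counts: sum(v * v for v in counts.values()) ** 0.5
    return dot / (norm(p) * norm(q)) if p and q else 0.0

# toy texts: two in a similar "style", one very different
a1 = "the quick brown fox jumps over the lazy dog"
a2 = "the quick brown fox sleeps near the lazy dog"
b1 = "import numpy as np; x = np.zeros((3, 3))"
sim_same = cosine(ngram_profile(a1), ngram_profile(a2))
sim_diff = cosine(ngram_profile(a1), ngram_profile(b1))
```

Clustering the resulting profile vectors (e.g. with an ensemble of clusterers, as in the paper) then groups documents whose character-level statistics agree.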
APA, Harvard, Vancouver, ISO, and other styles
32

AbinCherian. "CLUSTERING OF MEDLINE DOCUMENTS USING SEMI-SUPERVISED SPECTRAL CLUSTERING." International Journal of Research in Engineering and Technology 03, no. 03 (March 25, 2014): 145–47. http://dx.doi.org/10.15623/ijret.2014.0303026.

Full text
APA, Harvard, Vancouver, ISO, and other styles
33

Sridevi, U. K., and N. Nagaveni. "An Ontology Based Model for Document Clustering." International Journal of Intelligent Information Technologies 7, no. 3 (July 2011): 54–69. http://dx.doi.org/10.4018/jiit.2011070105.

Full text
Abstract:
Clustering is an important technique for finding relevant content in a document collection, and it also reduces the search space. Current clustering research emphasizes the development of more efficient clustering methods without considering domain knowledge and users' needs. In recent years the semantics of documents have been utilized in document clustering. The discussed work focuses on a clustering model in which an ontology approach is applied. The major challenge is to use the background knowledge in the similarity measure. This paper presents an ontology-based annotation of documents and a clustering system. A semi-automatic document annotation and concept weighting scheme is used to create an ontology-based knowledge base. The Particle Swarm Optimization (PSO) clustering algorithm is applied to obtain the clustering solution. The accuracy of clustering has been computed before and after combining ontology with the Vector Space Model (VSM). The proposed ontology-based framework gives improved performance and better clustering compared to the traditional vector space model. The results using ontology were significant and promising.
APA, Harvard, Vancouver, ISO, and other styles
34

LIU, YONGLI, YUANXIN OUYANG, and ZHANG XIONG. "INCREMENTAL CLUSTERING USING INFORMATION BOTTLENECK THEORY." International Journal of Pattern Recognition and Artificial Intelligence 25, no. 05 (August 2011): 695–712. http://dx.doi.org/10.1142/s0218001411008622.

Full text
Abstract:
Document clustering is one of the most effective techniques to organize documents in an unsupervised manner. In this paper, an Incremental method for document Clustering based on Information Bottleneck theory (ICIB) is presented. The ICIB is designed to improve the accuracy and efficiency of document clustering, and resolve the issue that an arbitrary choice of document similarity measure could produce an inaccurate clustering result. In our approach, document similarity is calculated using information bottleneck theory and documents are grouped incrementally. A first document is selected randomly and classified as one cluster, then each remaining document is processed incrementally according to the mutual information loss introduced by the merger of the document and each existing cluster. If the minimum value of mutual information loss is below a certain threshold, the document will be added to its closest cluster; otherwise it will be classified as a new cluster. The incremental clustering process is low-precision and order-dependent, which cannot guarantee accurate clustering results. Therefore, an improved sequential clustering algorithm (SIB) is proposed to adjust the intermediate clustering results. In order to test the effectiveness of ICIB method, ten independent document subsets are constructed based on the 20NewsGroup and Reuters-21578 corpora. Experimental results show that our ICIB method achieves higher accuracy and time performance than K-Means, AIB and SIB algorithms.
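The incremental decision rule described above — merge a document into its closest cluster when the merge cost is below a threshold, otherwise open a new cluster — can be sketched as follows. Note that plain Euclidean distance to the cluster centroid stands in here for the mutual-information loss of the actual ICIB method; the names and data are illustrative:

```python
def centroid(cluster):
    """Mean vector of a cluster's members."""
    return [sum(xs) / len(cluster) for xs in zip(*cluster)]

def incremental_cluster(docs, threshold, dist):
    """Incremental clustering skeleton in the spirit of ICIB: each
    document joins its cheapest existing cluster if the cost is below
    the threshold; otherwise it starts a new cluster."""
    clusters, labels = [], []
    for doc in docs:
        if clusters:
            costs = [dist(doc, centroid(c)) for c in clusters]
            best = min(range(len(clusters)), key=costs.__getitem__)
            if costs[best] <= threshold:
                clusters[best].append(doc)
                labels.append(best)
                continue
        clusters.append([doc])          # open a new cluster
        labels.append(len(clusters) - 1)
    return labels

euclid = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
docs = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 4.9], [0.05, 0.1]]
labels = incremental_cluster(docs, threshold=1.0, dist=euclid)
```

As the abstract notes, such one-pass results are order-dependent, which is why the paper follows up with a sequential refinement step (SIB).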
APA, Harvard, Vancouver, ISO, and other styles
35

Fadllullah, Arif, Dasrit Debora Kamudi, Muhamad Nasir, Agus Zainal Arifin, and Diana Purwitasari. "WEB NEWS DOCUMENTS CLUSTERING IN INDONESIAN LANGUAGE USING SINGULAR VALUE DECOMPOSITION-PRINCIPAL COMPONENT ANALYSIS (SVDPCA) AND ANT ALGORITHMS." Jurnal Ilmu Komputer dan Informasi 9, no. 1 (February 15, 2016): 17. http://dx.doi.org/10.21609/jiki.v9i1.362.

Full text
Abstract:
Ant-based document clustering is a clustering method that measures text document similarity based on the shortest path between nodes (trial phase) and determines the optimal clusters from the sequence of document similarities (dividing phase). The processing time of the trial phase of Ant algorithms to build document vectors is very long because of the high-dimensional Document-Term Matrix (DTM). In this paper, we propose a document clustering method that optimizes dimension reduction using Singular Value Decomposition-Principal Component Analysis (SVDPCA) and Ant algorithms. SVDPCA reduces the size of the DTM dimensions by converting the term frequencies of the conventional DTM to principal-component scores in a Document-PC Matrix (DPCM). The Ant algorithm then clusters documents using the vector space model based on the dimension reduction result in the DPCM. The experimental results on 506 news documents in Indonesian language demonstrate that the proposed method works well, optimizing dimension reduction by up to 99.7%. We could speed up the execution time of the trial phase efficiently, while the best F-measure achieved in the experiments was 0.88 (88%).
APA, Harvard, Vancouver, ISO, and other styles
36

Umamaheswari, E., and T. V. Geetha. "Event Mining Through Clustering." Journal of Intelligent Systems 23, no. 1 (January 1, 2014): 59–73. http://dx.doi.org/10.1515/jisys-2013-0025.

Full text
Abstract:
Traditional document clustering algorithms consider text-based features such as unique word count, concept count, etc. to cluster documents. Meanwhile, event mining is the extraction of specific events, their related sub-events, and the associated semantic relations from documents. This work discusses an approach to event mining through clustering. The Universal Networking Language (UNL)-based subgraph, a semantic representation of the document, is used as the input for clustering. Our research focuses on exploring the use of three different feature sets for event clustering and comparing the approaches used for specific event mining. In our previous work, the clustering algorithm used UNL-based event semantics to represent event context for clustering. However, this approach resulted in different events with similar semantics being clustered together. Hence, instead of considering only UNL event semantics, we considered assigning additional weights to similarity between event contexts with event-related attributes such as time, place, and persons. Although we get specific events in a single cluster, sub-events related to the specific events are not necessarily in a single cluster. Therefore, to improve our cluster efficiency, connective terms between two sentences and their representation as UNL subgraphs were also considered for similarity determination. By combining UNL semantics, event-specific argument similarity, and connective term concepts between sentences, we were able to obtain clusters for specific events and their sub-events. We have used 112 000 Tamil documents from the Forum for Information Retrieval Evaluation data corpus and achieved good results. We have also compared our approach with the previous state-of-the-art approach on the Reuters-RCV1 corpus and achieved 30% improvements in precision.
APA, Harvard, Vancouver, ISO, and other styles
37

Sonawane, Vijay R., and D. Rajeswara Rao. "An Optimistic Approach for Clustering Multi-version XML Documents Using Compressed Delta." International Journal of Electrical and Computer Engineering (IJECE) 5, no. 6 (December 1, 2015): 1472. http://dx.doi.org/10.11591/ijece.v5i6.pp1472-1479.

Full text
Abstract:
Today, with the standardization of XML for information exchange over the web, huge amounts of information are formatted as XML documents. XML documents are huge in size: the amount of information that has to be transmitted, processed, stored, and queried is often larger than that of other data formats. In real-world applications XML documents are also dynamic in nature. The versatile applicability of XML documents in different fields of information maintenance and management is increasing the demand to store different versions of XML documents over time. However, storage of all versions of an XML document may introduce redundancy, and the self-describing nature of XML creates the problem of verbosity, resulting in documents of huge size. This paper proposes an optimistic approach to re-cluster multi-version XML documents which change over time by reassessing the distance between them, using knowledge from the initial clustering solution and the changes stored in a compressed delta. The growing size of an XML document is reduced by applying homomorphic compression before clustering, which retains its original structure. The compressed delta stores the changes responsible for document versions without decompressing them. Test results show that our approach performs much better than full pair-wise document comparison.
APA, Harvard, Vancouver, ISO, and other styles
38

Nefti, S., M. Oussalah, and Y. Rezgui. "A modified fuzzy clustering for documents retrieval: application to document categorization." Journal of the Operational Research Society 60, no. 3 (March 2009): 384–94. http://dx.doi.org/10.1057/palgrave.jors.2602555.

Full text
APA, Harvard, Vancouver, ISO, and other styles
39

Chawla, Suruchi. "Application of Fuzzy C-Means Clustering and Semantic Ontology in Web Query Session Mining for Intelligent Information Retrieval." International Journal of Fuzzy System Applications 10, no. 1 (January 2021): 1–19. http://dx.doi.org/10.4018/ijfsa.2021010101.

Full text
Abstract:
Information retrieval based on keyword search retrieves irrelevant documents because of the vocabulary gap between document content and search queries. The keyword vector representation of web documents is very high dimensional, and keyword terms are unable to capture the semantics of document content. Ontologies have been built in various domains for representing the semantics of documents based on concepts relevant to the document subject. Web documents often contain multiple topics; therefore, fuzzy c-means document clustering has been used for discovering clusters with overlapping boundaries. In this paper, a method is proposed for intelligent information retrieval using a hybrid of fuzzy c-means clustering and ontology in query session mining. The use of fuzzy clusters of web query session concept vectors improves the quality of clusters for effective web search. The proposed method was evaluated experimentally, and the results show an improvement in the precision of search results.
APA, Harvard, Vancouver, ISO, and other styles
40

BRZEMINSKI, PAWEL, and WITOLD PEDRYCZ. "TEXTUAL-BASED CLUSTERING OF WEB DOCUMENTS." International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 12, no. 06 (December 2004): 715–43. http://dx.doi.org/10.1142/s021848850400317x.

Full text
Abstract:
In our study we presented an effective method for clustering of Web pages. From flat HTML files we extracted keywords, formed feature vectors as representation of Web pages and applied them to a clustering method. We took advantage of the Fuzzy C-Means clustering algorithm (FCM). We demonstrated an organized and schematic manner of data collection. Various categories of Web pages were retrieved from ODP (Open Directory Project) in order to create our datasets. The results of clustering proved that the method performs well for all datasets. Finally, we presented a comprehensive experimental study examining: the behavior of the algorithm for different input parameters, internal structure of datasets and classification experiments.
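A minimal fuzzy c-means loop, alternating membership and centroid updates, might look like this. It is a generic FCM sketch under the standard formulation, not the authors' implementation; `m` is the fuzzifier and the toy data are illustrative:

```python
import random

def fcm(data, c=2, m=2.0, iters=100, seed=1):
    """Minimal fuzzy c-means: alternate centroid and membership updates.
    Returns the membership matrix u (n x c) and the c centroids."""
    rng = random.Random(seed)
    n, dim = len(data), len(data[0])
    # random initial memberships, each row normalised to sum to 1
    u = [[rng.random() for _ in range(c)] for _ in range(n)]
    u = [[v / sum(row) for v in row] for row in u]
    for _ in range(iters):
        # centroids: membership^m weighted means of the data
        centers = []
        for j in range(c):
            w = [u[i][j] ** m for i in range(n)]
            centers.append([sum(w[i] * data[i][d] for i in range(n)) / sum(w)
                            for d in range(dim)])
        # membership update: inverse squared-distance ratios
        for i in range(n):
            d2 = [max(sum((x - y) ** 2 for x, y in zip(data[i], ctr)), 1e-12)
                  for ctr in centers]
            for j in range(c):
                u[i][j] = 1.0 / sum((d2[j] / d2[k]) ** (1.0 / (m - 1))
                                    for k in range(c))
    return u, centers

data = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
u, centers = fcm(data)
# harden the fuzzy memberships into a crisp assignment
hard = [max(range(2), key=lambda j: row[j]) for row in u]
```

Unlike hard k-means, every document keeps a graded membership in every cluster, which is why FCM suits the overlapping categories the study works with.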
APA, Harvard, Vancouver, ISO, and other styles
41

Kolte, Shilpa G., and Jagdish W. Bakal. "Big Data Summarization Using Novel Clustering Algorithm and Semantic Feature Approach." International Journal of Rough Sets and Data Analysis 4, no. 3 (July 2017): 108–17. http://dx.doi.org/10.4018/ijrsda.2017070108.

Full text
Abstract:
This paper proposes a big data (i.e., documents, texts) summarization method using a proposed clustering algorithm and semantic features. The proposed novel clustering algorithm is used for big data summarization. The proposed system works in four phases and provides a modular implementation of multiple-document summarization. Experimental results using the Iris dataset show that the proposed clustering algorithm performs better than the K-means and K-medoids algorithms. The performance of big data (i.e., documents, texts) summarization is evaluated using Australian legal cases from the Federal Court of Australia (FCA) database. The experimental results demonstrate that the proposed method can summarize big data documents better than existing systems.
APA, Harvard, Vancouver, ISO, and other styles
42

Yau, Chyi-Kwei, Alan Porter, Nils Newman, and Arho Suominen. "Clustering scientific documents with topic modeling." Scientometrics 100, no. 3 (May 6, 2014): 767–86. http://dx.doi.org/10.1007/s11192-014-1321-8.

Full text
APA, Harvard, Vancouver, ISO, and other styles
43

HUANG, S., H. KE, and W. YANG. "Structure clustering for Chinese patent documents." Expert Systems with Applications 34, no. 4 (May 2008): 2290–97. http://dx.doi.org/10.1016/j.eswa.2007.03.012.

Full text
APA, Harvard, Vancouver, ISO, and other styles
44

Ji, Jie, and Qiangfu Zhao. "Applying Naive Bayes Classifier to Document Clustering." Journal of Advanced Computational Intelligence and Intelligent Informatics 14, no. 6 (September 20, 2010): 624–30. http://dx.doi.org/10.20965/jaciii.2010.p0624.

Full text
Abstract:
Document clustering partitions sets of unlabeled documents so that documents in clusters share common concepts. A Naive Bayes Classifier (BC) is a simple probabilistic classifier based on applying Bayes’ theorem with strong (naive) independence assumptions. BC requires a small amount of training data to estimate parameters required for classification. Since training data must be labeled, we propose an Iterative Bayes Clustering (IBC) algorithm. To improve IBC performance, we propose combining IBC with Comparative Advantage-based (CA) initialization method. Experimental results show that our proposal improves performance significantly over classical clustering methods.
APA, Harvard, Vancouver, ISO, and other styles
45

Abasi, Ammar Kamal, Ahamad Tajudin Khader, Mohammed Azmi Al-Betar, Syibrah Naim, Mohammed A. Awadallah, and Osama Ahmad Alomari. "Text documents clustering using modified multi-verse optimizer." International Journal of Electrical and Computer Engineering (IJECE) 10, no. 6 (December 1, 2020): 6361. http://dx.doi.org/10.11591/ijece.v10i6.pp6361-6369.

Full text
Abstract:
In this study, a multi-verse optimizer (MVO) is utilised for the text document clustering (TDC) problem. TDC is treated as a discrete optimization problem, and an objective function based on the Euclidean distance is applied as the similarity measure. TDC is tackled by the division of the documents into clusters; documents belonging to the same cluster are similar, whereas those belonging to different clusters are dissimilar. MVO, which is a recent metaheuristic optimization algorithm established for continuous optimization problems, can intelligently navigate different areas in the search space and search deeply in each area using a particular learning mechanism. The proposed algorithm is called MVOTDC, and it adapts the convergence behaviour of MVO operators to deal with discrete, rather than continuous, optimization problems. For evaluating MVOTDC, a comprehensive comparative study is conducted on six text document datasets with various numbers of documents and clusters. The quality of the final results is assessed using precision, recall, F-measure, entropy, accuracy, and purity measures. Experimental results reveal that the proposed method performs competitively in comparison with state-of-the-art algorithms. Statistical analysis is also conducted and shows that MVOTDC can produce significant results in comparison with three well-established methods.
APA, Harvard, Vancouver, ISO, and other styles
46

Li, Xin Ye. "XML Document Clustering Based on Spectral Analysis Method." Advanced Materials Research 219-220 (March 2011): 304–7. http://dx.doi.org/10.4028/www.scientific.net/amr.219-220.304.

Full text
Abstract:
While the K-Means algorithm usually finds only a locally optimal solution, spectral clustering methods can obtain satisfying clustering results by embedding the data points into a new space in which clusters are tighter. Since traditional spectral clustering methods use a Gaussian kernel function to compute the similarity between two points, the selection of the scale parameter σ usually depends on domain knowledge. This paper uses a spectral method to cluster XML documents. To consider both the elements and the structure of XML documents, this paper proposes to use path features to represent XML documents; to avoid the selection of the scale parameter σ, it also proposes to use the Jaccard coefficient to compute the similarity between two XML documents. Experiments show that using the Jaccard coefficient to compute the similarity is effective and the clustering results are correct.
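The path-feature representation and Jaccard similarity described above can be sketched as follows, with a nested dict standing in for a parsed XML tree (an assumption for illustration; the paper's exact feature extraction is not reproduced):

```python
def paths(doc):
    """Flatten a nested dict (a stand-in for a parsed XML tree) into a
    set of root-to-node path features."""
    out = set()
    def walk(node, prefix):
        for key, value in node.items():
            path = prefix + "/" + key
            out.add(path)
            if isinstance(value, dict):
                walk(value, path)
    walk(doc, "")
    return out

def jaccard(a, b):
    """Jaccard coefficient of two feature sets: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b) if a | b else 1.0

x1 = {"book": {"title": 1, "author": 1, "year": 1}}
x2 = {"book": {"title": 1, "author": 1, "price": 1}}
x3 = {"movie": {"director": 1}}
s12 = jaccard(paths(x1), paths(x2))  # similar structure
s13 = jaccard(paths(x1), paths(x3))  # disjoint structure
```

The resulting similarity matrix needs no scale parameter σ, which is exactly the point the paper makes against the Gaussian kernel.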
APA, Harvard, Vancouver, ISO, and other styles
47

Stanchev, Lubomir. "Fine-Tuning an Algorithm for Semantic Document Clustering Using a Similarity Graph." International Journal of Semantic Computing 10, no. 04 (December 2016): 527–55. http://dx.doi.org/10.1142/s1793351x16400195.

Full text
Abstract:
In this article, we examine an algorithm for document clustering using a similarity graph. The graph stores words and common phrases from the English language as nodes and it can be used to compute the degree of semantic similarity between any two phrases. One application of the similarity graph is semantic document clustering, that is, grouping documents based on the meaning of the words in them. Since our algorithm for semantic document clustering relies on multiple parameters, we examine how fine-tuning these values affects the quality of the result. Specifically, we use the Reuters-21578 benchmark, which contains [Formula: see text] newswire stories that are grouped in 82 categories using human judgment. We apply the k-means clustering algorithm to group the documents using a similarity metric that is based on keywords matching and one that uses the similarity graph. We evaluate the results of the clustering algorithms using multiple metrics, such as precision, recall, f-score, entropy, and purity.
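Among the evaluation metrics mentioned, purity is simple to state in code: each cluster votes for its majority gold label, and purity is the fraction of documents covered by those majorities. A sketch with toy data (the labels are illustrative, not from Reuters-21578):

```python
from collections import Counter

def purity(clusters, labels):
    """Purity of a clustering: each cluster is credited with its most
    frequent true label; purity = correctly credited / total."""
    by_cluster = {}
    for c, l in zip(clusters, labels):
        by_cluster.setdefault(c, []).append(l)
    majority = sum(Counter(ls).most_common(1)[0][1]
                   for ls in by_cluster.values())
    return majority / len(labels)

clusters = [0, 0, 0, 1, 1, 1]              # algorithm output
labels = ["a", "a", "b", "b", "b", "b"]    # gold categories
p = purity(clusters, labels)
```

Precision, recall, F-score and entropy are computed in the same spirit, by comparing the induced clusters against the human-judged categories.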
APA, Harvard, Vancouver, ISO, and other styles
48

Ishak Boushaki, Saida, Nadjet Kamel, and Omar Bendjeghaba. "High-Dimensional Text Datasets Clustering Algorithm Based on Cuckoo Search and Latent Semantic Indexing." Journal of Information & Knowledge Management 17, no. 03 (September 2018): 1850033. http://dx.doi.org/10.1142/s0219649218500338.

Full text
Abstract:
Clustering is an important data analysis technique. However, clustering high-dimensional data such as documents needs more effort in order to extract the rich, relevant information hidden in the multidimensional space. Recently, document clustering algorithms based on metaheuristics have demonstrated their efficiency in exploring the search area and achieving the global best solution rather than a local one. However, most of these algorithms are not practical and suffer from some limitations, including the requirement to know the number of clusters in advance, being neither incremental nor extensible, and indexing documents by a high-dimensional and sparse matrix. In order to overcome these limitations, we propose in this paper a new dynamic and incremental approach (CS_LSI) for document clustering based on the recent cuckoo search (CS) optimization and latent semantic indexing (LSI). Experiments conducted on four well-known high-dimensional text datasets show the efficiency of the LSI model in reducing the dimensionality of the space with more precision and less computational time. Also, the proposed CS_LSI determines the number of clusters automatically by employing a newly proposed index based on a significant distance measure. The latter is also used in the incremental mode and to detect outlier documents, thereby maintaining more coherent clusters. Furthermore, comparison with conventional document clustering algorithms shows the superiority of CS_LSI in achieving a high quality of clustering.
APA, Harvard, Vancouver, ISO, and other styles
49

Shahana Bano, B. Divyanjali, A. K. M. L. R. V. Virajitha, and M. Tejaswi. "Document Summarization Using Clustering and Text Analysis." International Journal of Engineering & Technology 7, no. 2.32 (May 31, 2018): 456. http://dx.doi.org/10.14419/ijet.v7i2.32.15740.

Full text
Abstract:
Document summarization is a procedure for shortening a text document with software, so as to produce a summary containing the significant parts of the original document. Nowadays, users do not have much time to spend reading large amounts of information; they want the maximum amount of accurate information, describing everything while occupying minimum space. This paper discusses an important approach to document summarization using clustering and text analysis. In this paper, we apply clustering and text-analytic techniques to reduce data redundancy and to identify similar sentences in the text of documents, grouping them into clusters based on the term-frequency values of their words. These techniques mainly help to reduce the data, and summaries are generated with high efficiency.
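Term-frequency-based sentence scoring of the kind described can be sketched as follows; the sentence splitting on "." and the scoring function are simplifications for illustration, not the authors' exact procedure:

```python
from collections import Counter

def summarize(text, k=1):
    """Score each sentence by the mean term frequency of its words and
    keep the top-k sentences as the extractive summary."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    tf = Counter(w.strip(".,") for w in text.lower().split())
    def score(sentence):
        tokens = sentence.lower().split()
        return sum(tf[t.strip(".,")] for t in tokens) / len(tokens)
    ranked = sorted(sentences, key=score, reverse=True)
    return ranked[:k]

text = ("Clustering groups similar documents. "
        "Clustering documents helps search. "
        "The weather was pleasant yesterday.")
top = summarize(text, k=1)
```

Sentences built from frequent terms outrank off-topic ones, so the off-topic weather sentence is dropped from the one-sentence summary.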
APA, Harvard, Vancouver, ISO, and other styles
50

M. Mohammed, Shapol, Karwan Jacksi, and Subhi R. M. Zeebaree. "A state-of-the-art survey on semantic similarity for document clustering using GloVe and density-based algorithms." Indonesian Journal of Electrical Engineering and Computer Science 22, no. 1 (April 1, 2021): 552. http://dx.doi.org/10.11591/ijeecs.v22.i1.pp552-562.

Full text
Abstract:
Semantic similarity is the process of identifying semantically relevant data. The traditional way of identifying document similarity is by using synonymous keywords and syntax, whereas semantic similarity finds similar data using the meaning of words and their semantics. Clustering is the concept of grouping objects that have the same features and properties into a cluster, separate from objects that have different features and properties. In semantic document clustering, documents are clustered using semantic similarity techniques with similarity measurements. One of the common techniques to cluster documents is the family of density-based clustering algorithms, which use the density of data points as the main strategy to measure the similarity between them. In this paper, a state-of-the-art survey is presented that analyzes density-based algorithms for clustering documents. Furthermore, the similarity and evaluation measures are investigated with the selected algorithms to identify the common ones. The review revealed that the most used density-based algorithms in document clustering are DBSCAN and DPC. The most effective similarity measurement used with density-based algorithms, specifically DBSCAN and DPC, is cosine similarity, with F-measure for performance and accuracy evaluation.
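Cosine similarity combined with a density-based scan — the pairing the survey finds most common — can be sketched with a tiny DBSCAN over term-frequency vectors. This is a generic illustration of the two ideas together, not code from any surveyed system; `eps` and `min_pts` are the usual DBSCAN parameters:

```python
def cosine_dist(a, b):
    """1 - cosine similarity between two term-frequency vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return 1.0 - dot / (na * nb)

def dbscan(points, eps, min_pts, dist):
    """Tiny DBSCAN: labels[i] is a cluster id, or -1 for noise."""
    n = len(points)
    labels = [None] * n
    cid = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        neigh = [j for j in range(n) if dist(points[i], points[j]) <= eps]
        if len(neigh) < min_pts:
            labels[i] = -1          # noise (may later become a border point)
            continue
        labels[i] = cid
        queue = [j for j in neigh if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cid     # noise reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cid
            nj = [k for k in range(n) if dist(points[j], points[k]) <= eps]
            if len(nj) >= min_pts:  # j is itself a core point: expand
                queue.extend(k for k in nj if labels[k] is None)
        cid += 1
    return labels

# two topic directions in a toy 3-term vocabulary
points = [[1, 1, 0], [2, 2.1, 0], [1, 1.2, 0],
          [0, 0, 1], [0, 0.1, 2], [0, 0, 3]]
labels = dbscan(points, eps=0.05, min_pts=2, dist=cosine_dist)
```

Because cosine distance ignores vector length, documents of very different lengths but the same term proportions fall in the same dense region, which is one reason the survey finds it paired with DBSCAN and DPC so often.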
APA, Harvard, Vancouver, ISO, and other styles
