Dissertations / Theses on the topic 'Allocation de Dirichlet latente (LDA)'

Consult the top 27 dissertations / theses for your research on the topic 'Allocation de Dirichlet latente (LDA).'

1

Ponweiser, Martin. "Latent Dirichlet Allocation in R." WU Vienna University of Economics and Business, 2012. http://epub.wu.ac.at/3558/1/main.pdf.

Abstract:
Topic models are a young research field within computer science, at the intersection of information retrieval and text mining. They are generative probabilistic models of text corpora inferred by machine learning, and they can be used for retrieval and text mining tasks. The most prominent topic model is latent Dirichlet allocation (LDA), which was introduced in 2003 by Blei et al. and has since sparked the development of other topic models for domain-specific purposes. This thesis focuses on LDA's practical application. Its main goal is the replication of the data analyses from the 2004 LDA paper "Finding scientific topics" by Thomas Griffiths and Mark Steyvers within the framework of the R statistical programming language and the R package topicmodels by Bettina Grün and Kurt Hornik. The complete process, including extraction of a text corpus from the PNAS journal's website, data preprocessing, transformation into a document-term matrix, model selection, model estimation, and presentation of the results, is fully documented and commented. The outcome closely matches the analyses of the original paper, so the research by Griffiths and Steyvers can be reproduced. Furthermore, this thesis demonstrates the suitability of the R environment for text mining with LDA. (author's abstract)
Series: Theses / Institute for Statistics and Mathematics
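As background for this and several entries below, the generative process of LDA introduced by Blei et al. (2003), which this thesis replicates in R, can be summarized as follows (a standard textbook formulation, not taken from the thesis itself):

```latex
% LDA generative process: K topics, D documents, N_d tokens in document d
\begin{align*}
\phi_k   &\sim \operatorname{Dirichlet}(\beta),  & k &= 1,\dots,K   \\ % topic-word distributions
\theta_d &\sim \operatorname{Dirichlet}(\alpha), & d &= 1,\dots,D   \\ % document-topic proportions
z_{d,n}  &\sim \operatorname{Categorical}(\theta_d), & n &= 1,\dots,N_d \\ % topic of token n
w_{d,n}  &\sim \operatorname{Categorical}(\phi_{z_{d,n}})              % observed word
\end{align*}
```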
2

Lindgren, Jennifer. "Evaluating Hierarchical LDA Topic Models for Article Categorization." Thesis, Linköpings universitet, Institutionen för datavetenskap, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-167080.

Abstract:
With the vast amount of information available on the Internet today, helping users find relevant content has become a prioritized task in many software products that recommend news articles. One such product is Opera for Android, which has a news feed containing articles the user may be interested in. To determine which articles to recommend, they can be categorized by the topics they contain. One approach to categorizing articles uses machine learning and natural language processing (NLP). A commonly used model is latent Dirichlet allocation (LDA), which finds latent topics within large datasets such as collections of text articles. An extension of LDA is hierarchical latent Dirichlet allocation (hLDA), in which the latent topics found among a set of articles are structured hierarchically in a tree: each node represents a topic, and the levels represent different levels of abstraction in the topics. A further extension is constrained hLDA, where a set of predefined, constrained topics is added to the tree. The constrained topics are extracted from the dataset by grouping highly correlated words; the idea is to improve the topic structure derived by an hLDA model by making the process semi-supervised. The aim of this thesis is to create an hLDA and a constrained hLDA model from a dataset of articles provided by Opera, and to evaluate the models using the novel metric word frequency similarity, a measure of the similarity between the words representing the parent and child topics in a hierarchical topic model. The results show that word frequency similarity can be used to evaluate whether the topics in a parent-child topic pair are too similar, so that the child does not specify a subtopic of the parent, and whether they are too dissimilar, so that the topics seem unrelated and perhaps should not be connected in the hierarchy. The two topic models created had comparable word frequency similarity scores; neither seemed to significantly outperform the other with regard to the metric.
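The abstract does not spell out how word frequency similarity is computed; the following is a minimal sketch of the general idea, assuming it compares the word distributions of a parent and a child topic. The cosine similarity used here is an illustrative stand-in, not the thesis's exact formula.

```python
import numpy as np

def word_frequency_similarity(parent_probs, child_probs):
    """Similarity between the word distributions of a parent and a
    child topic in a hierarchical topic model.
    NOTE: the exact definition from the thesis is not given in the
    abstract; cosine similarity over word-probability vectors is an
    assumed, illustrative stand-in."""
    p, c = np.asarray(parent_probs), np.asarray(child_probs)
    return float(p @ c / (np.linalg.norm(p) * np.linalg.norm(c)))

# A score near 1 suggests the child merely repeats the parent;
# a score near 0 suggests the pair may be unrelated.
parent = [0.4, 0.3, 0.2, 0.1]   # hypothetical topic-word probabilities
child = [0.35, 0.30, 0.25, 0.10]
print(word_frequency_similarity(parent, child))
```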
3

Jaradat, Shatha. "OLLDA: Dynamic and Scalable Topic Modelling for Twitter : AN ONLINE SUPERVISED LATENT DIRICHLET ALLOCATION ALGORITHM." Thesis, KTH, Skolan för informations- och kommunikationsteknik (ICT), 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-177535.

Abstract:
Providing high-quality topic inference in today's large and dynamic corpora, such as Twitter, is a challenging task. It is especially challenging given that content in this environment consists of short texts with many abbreviations. This project proposes an improvement of a popular online topic modelling algorithm for latent Dirichlet allocation (LDA), incorporating supervision to make it suitable for the Twitter context. The improvement is motivated by the need for a single algorithm that achieves two objectives: analyzing huge amounts of documents, including new documents arriving in a stream, while at the same time achieving high-quality topic detection in special environments such as Twitter. The proposed algorithm combines an online algorithm for LDA with a supervised variant of LDA, labeled LDA, and its performance and quality are compared with those of both. The results show that the proposed algorithm achieves better performance and quality than the supervised variant of LDA, and better quality than the online algorithm. These improvements make it an attractive option for dynamic environments such as Twitter. An environment for analyzing and labelling data was designed to prepare the dataset before running the experiments. Possible application areas for the proposed algorithm are tweet recommendation and trend detection.
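A minimal sketch of the unsupervised online-LDA half of this design, using gensim as an assumed library choice (the thesis's OLLDA additionally folds in Labeled-LDA-style supervision, which plain gensim does not provide and which is not shown here):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Hypothetical mini-batches of tokenized tweets (illustration only).
batch1 = [["topic", "model", "twitter"], ["lda", "online", "stream"]]
batch2 = [["new", "tweets", "arrive"], ["update", "model", "online"]]

# In a real stream the vocabulary would be fixed or hashed up front;
# here it is built over both batches for brevity.
dictionary = Dictionary(batch1 + batch2)
corpus1 = [dictionary.doc2bow(doc) for doc in batch1]
corpus2 = [dictionary.doc2bow(doc) for doc in batch2]

# Online LDA: fit on the first batch, then fold in later batches
# without retraining from scratch.
lda = LdaModel(corpus1, id2word=dictionary, num_topics=2)
lda.update(corpus2)
print(lda.print_topics())
```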
4

Mungre, Surbhi. "LDA-based dimensionality reduction and domain adaptation with application to DNA sequence classification." Thesis, Kansas State University, 2011. http://hdl.handle.net/2097/8846.

Abstract:
Master of Science, Department of Computing and Information Sciences, Doina Caragea (advisor)
Several computational biology and bioinformatics problems involve DNA sequence classification using supervised machine learning algorithms. The performance of these algorithms largely depends on the availability of labeled data and on the approach used to represent DNA sequences as feature vectors. For many organisms, labeled DNA data is scarce, while unlabeled data is easily available. However, for a small number of well-studied model organisms, large amounts of labeled data are available. This calls for domain adaptation approaches, which can transfer knowledge from a source domain, for which labeled data is available, to a target domain, for which large amounts of unlabeled data are available. Intuitively, one approach to domain adaptation can be obtained by extracting and representing the features that the source domain and the target domain sequences share. Latent Dirichlet Allocation (LDA) is an unsupervised dimensionality reduction technique that has been successfully used to generate features for sequence data such as text. In this work, we explore the use of LDA for generating predictive DNA sequence features that can be used in both supervised and domain adaptation frameworks. More precisely, we propose two dimensionality reduction approaches for DNA sequences, LDA Words (LDAW) and LDA Distribution (LDAD). LDA is a generative probabilistic model used to model collections of discrete data such as document collections. For our problem, a sequence is considered to be a "document" and the k-mers obtained from a sequence are the "document words". We use LDA to model our sequence collection. Given the LDA model, each document can be represented as a distribution over topics (where a topic can be seen as a distribution over k-mers). In the LDAW method, we use the top k-mers in each topic (i.e., the k-mers with the highest probability) as our features, while in the LDAD method, we use the topic distribution to represent a document as a feature vector. We study LDA-based dimensionality reduction for both supervised DNA sequence classification and domain adaptation. We apply the proposed approaches to the splice site prediction problem, an important DNA sequence classification problem in the context of genome annotation. In the supervised learning framework, we study the effectiveness of the LDAW and LDAD methods by comparing them with a traditional dimensionality reduction technique based on the information gain criterion. In the domain adaptation framework, we study the effect of increasing the evolutionary distance between the source and target organisms, and the effect of using different weights when combining labeled data from the source domain with labeled data from the target domain. Experimental results show that LDA-based features can be successfully used for dimensionality reduction and domain adaptation in DNA sequence classification problems.
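A minimal sketch of the k-mer-as-word idea behind LDAW and LDAD, using gensim as an assumed implementation (the thesis does not prescribe a particular library, and the sequences below are invented toy data):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

def kmers(seq, k=3):
    """Slide a window of length k over a DNA sequence; each k-mer
    plays the role of a 'word' in the LDA vocabulary."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

# Hypothetical toy sequences (illustration only).
seqs = ["ACGTACGTGA", "TTGACGTACC", "GGCATCGTAA"]
docs = [kmers(s) for s in seqs]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

lda = LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10)

# LDAD-style feature vector: the topic distribution of a sequence.
print(lda.get_document_topics(corpus[0], minimum_probability=0.0))
# LDAW-style features: the highest-probability k-mers per topic.
print(lda.show_topic(0, topn=5))
```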
5

Harrysson, Mattias. "Neural probabilistic topic modeling of short and messy text." Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-189532.

Abstract:
Exploring massive amounts of user-generated data through topics offers a new way to find useful information. The topics are assumed to be "hidden" and must be "uncovered" by statistical methods such as topic modeling. However, user-generated data is typically short and messy, e.g. informal chat conversations, heavy use of slang, and "noise" such as URLs or other forms of pseudo-text. This type of data is difficult to process for most natural language processing methods, including topic modeling. This thesis attempts to find, in a comparative study, the approach that objectively gives the better topics from short and messy text. The compared approaches are latent Dirichlet allocation (LDA), Re-organized LDA (RO-LDA), a Gaussian Mixture Model (GMM) with distributed representations of words, and a new approach based on previous work named Neural Probabilistic Topic Modeling (NPTM). It could only be concluded that NPTM has a tendency to achieve better topics on short and messy text than LDA and RO-LDA. GMM, on the other hand, could not produce any meaningful results at all. The results are less conclusive since NPTM suffers from long running times, which prevented enough samples from being obtained for a statistical test.
6

Chen, Yuxin. "Apprentissage interactif de mots et d'objets pour un robot humanoïde." Thesis, Université Paris-Saclay (ComUE), 2017. http://www.theses.fr/2017SACLY003/document.

Abstract:
Future applications of robotics, especially personal service robots, will require continuous adaptability to the environment, and particularly the ability to recognize new objects and learn new words through interaction with humans. Though they have made tremendous progress by using machine learning, current computational models for object detection and representation still rely heavily on good training data and ideal learning supervision. In contrast, two-year-old children have an impressive ability to learn to recognize new objects and, at the same time, to learn the object names during interaction with adults and without precise supervision. Therefore, following the developmental robotics approach, we develop in this thesis learning approaches for objects, associating their names and corresponding features, inspired by infants' capabilities and in particular by the ambiguous interaction that occurs between children and parents. The general idea is to use cross-situational learning (finding the common points between different presentations of an object or a feature) and to implement multi-modal concept discovery based on two latent topic discovery approaches: Non-negative Matrix Factorization (NMF) and Latent Dirichlet Allocation (LDA). Based on vision descriptors and sound/voice inputs, the proposed approaches find the underlying regularities in the raw data flow to produce sets of words and their associated visual meanings (e.g. the name of an object and its shape, or a color adjective and its correspondence in images). We developed a complete approach based on these algorithms and compared their behavior when facing two sources of uncertainty: referential ambiguities, in situations where multiple words are given that describe multiple object features; and linguistic ambiguities, in situations where the keywords we intend to learn are embedded in complete sentences. This thesis highlights the algorithmic solutions required to perform efficient learning of these word-referent associations from data acquired in a simplified but realistic acquisition setup, which made it possible to perform extensive simulations and preliminary experiments in real human-robot interactions. We also give solutions for the automatic estimation of the number of topics for both NMF and LDA. We finally propose two active learning strategies, Maximum Reconstruction Error Based Selection (MRES) and Confidence Based Exploration (CBE), to improve the quality and speed of incremental learning by letting the algorithms choose the next learning samples. We compared the behaviors produced by these algorithms and show their common points and differences with those of humans in similar learning situations.
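Both latent topic discovery methods used in the thesis factor a count matrix into additive parts. A purely illustrative sketch with scikit-learn, on toy text standing in for the multimodal vision/speech features of the thesis:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation

# Hypothetical toy "descriptions" (illustration only, not thesis data).
docs = ["red ball round shape", "blue cube square shape",
        "red cube square shape", "blue ball round shape"]
X = CountVectorizer().fit_transform(docs)

# Both methods factor the count matrix into (document x topic) and
# (topic x word) parts; NMF by least-squares approximation, LDA via
# a probabilistic generative model.
nmf = NMF(n_components=2, init="nndsvda", random_state=0).fit(X)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
print(nmf.transform(X))   # NMF activations per "document"
print(lda.transform(X))   # LDA topic proportions per "document"
```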
7

Johansson, Richard, and Heino Otto Engström. "Topic propagation over time in internet security conferences : Topic modeling as a tool to investigate trends for future research." Thesis, Linköpings universitet, Institutionen för datavetenskap, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-177748.

Abstract:
When conducting research, it is valuable to find highly ranked papers closely related to the specific research area without spending too much time reading insignificant papers. To make this process more effective, an automated way of extracting topics from documents would be useful, and this is possible using topic modeling. Topic modeling can also be used to trace topic trends: where a topic is first mentioned and who the original author was. In this paper, over 5000 articles are scraped from four top-ranked internet security conferences using a web scraper built in Python. From the articles, fourteen topics are extracted using the topic modeling library Gensim and LDA Mallet, and the topics are visualized in graphs to show which topics are emerging and which are fading away over twenty years. The research finds that topic modeling is a powerful tool for extracting topics and that, when put into a time perspective, it is possible to identify topic trends which can be explained when put into a bigger context.
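A minimal sketch of the trend-extraction step, assuming gensim's plain LdaModel stands in for the LDA Mallet wrapper used in the paper, with invented (year, tokens) pairs in place of the scraped corpus:

```python
from collections import defaultdict
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Hypothetical (year, tokens) pairs standing in for scraped papers.
papers = [(2001, ["worm", "virus", "propagation"]),
          (2011, ["botnet", "malware", "detection"]),
          (2021, ["phishing", "machine", "learning"])]

docs = [tokens for _, tokens in papers]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]
lda = LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10)

# Average topic weight per year: the raw material of a trend plot.
trend = defaultdict(lambda: defaultdict(list))
for (year, _), bow in zip(papers, corpus):
    for topic, weight in lda.get_document_topics(bow, minimum_probability=0.0):
        trend[topic][year].append(weight)
for topic, years in sorted(trend.items()):
    print(topic, {y: sum(w) / len(w) for y, w in sorted(years.items())})
```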
8

Chen, Yuxin. "Apprentissage interactif de mots et d'objets pour un robot humanoïde." Electronic Thesis or Diss., Université Paris-Saclay (ComUE), 2017. http://www.theses.fr/2017SACLY003.

9

Ficapal, Vila Joan. "Anemone: a Visual Semantic Graph." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-252810.

Abstract:
Semantic graphs have been used for optimizing various natural language processing tasks as well as for augmenting search and information retrieval tasks. In most cases these semantic graphs have been constructed through supervised machine learning methodologies that depend on manually curated ontologies such as Wikipedia or similar. In this thesis, which consists of two parts, we explore in the first part the possibility of automatically populating a semantic graph from an ad hoc dataset of 50,000 newspaper articles in a completely unsupervised manner. The utility of the visual representation of the resulting graph is tested on 14 human subjects performing basic information retrieval tasks on a subset of the articles. Our study shows that, for entity finding and document similarity, our feature engineering is viable and the visual map produced by our artifact is useful. In the second part, we explore the possibility of identifying entity relationships in an unsupervised fashion by employing abstractive deep learning methods for sentence reformulation. The reformulated sentence structures are qualitatively assessed with respect to grammatical correctness and meaningfulness as perceived by 14 test subjects. We evaluate the outcomes of this second part negatively: they have not been good enough to support any definitive conclusion, but they have opened new doors to explore.
10

Schneider, Bruno. "Visualização em multirresolução do fluxo de tópicos em coleções de texto." Repositório Institucional do FGV, 2014. http://hdl.handle.net/10438/11745.

Abstract:
The combined use of algorithms for topic discovery in document collections with topic flow visualization techniques allows the exploration of thematic patterns in large corpora, where those patterns can be revealed through compact visual representations. This research investigated the requirements for viewing data on the thematic composition of documents obtained through topic modeling, which is sparse and multi-attribute, at different levels of detail, by developing our own visualization technique and, comparatively, by using an open-source data visualization library. For the topic flow visualization problem under study, we observed conflicting display requirements at different data resolutions, which led to a detailed investigation of ways of manipulating and displaying this data. The hypothesis put forward was that the integrated use of more than one visualization technique, chosen according to the resolution of the data, expands the possibilities for exploring the object under study relative to what a single method would provide. Delineating the limits of these techniques according to the resolution of data exploration is the main contribution of this work, intended to provide groundwork for the development of new applications.
11

Uys, J. W. "A framework for exploiting electronic documentation in support of innovation processes." Thesis, Stellenbosch : University of Stellenbosch, 2010. http://hdl.handle.net/10019.1/1449.

Abstract:
Thesis (PhD (Industrial Engineering))--University of Stellenbosch, 2010.
The crucial role of innovation in creating sustainable competitive advantage is widely recognised in industry today. Likewise, the importance of having the required information accessible to the right employees at the right time is well appreciated. More specifically, the dependency of effective, efficient innovation processes on the availability of information has been pointed out in the literature. A great challenge is countering the effects of the information overload phenomenon in organisations, so that employees can find the information appropriate to their needs without having to wade through excessively large quantities of information. The initial stages of the innovation process, which are characterised by free association, semi-formal activities, conceptualisation, and experimentation, have already been identified as a key focus area for improving the effectiveness of the entire innovation process; the dependency on information during these early stages is especially high. Any organisation requires a strategy for innovation, a number of well-defined, implemented processes, and measures to be able to innovate in an effective and efficient manner and to drive its innovation endeavours. In addition, the organisation requires certain enablers to support its innovation efforts, which include certain core competencies, technologies and knowledge. Most importantly for this research, enablers are required to more effectively manage and utilise innovation-related information. Information residing inside and outside the boundaries of the organisation is required to feed the innovation process. The specific sources of such information are numerous, and the information may be structured or unstructured in nature. However, an ever-increasing proportion of available innovation-related information is of the unstructured type; examples include the textual content of reports, books, e-mail messages and web pages. This research explores the innovation landscape and typical sources of innovation-related information. In addition, it explores the landscape of text-analytical approaches and techniques in search of ways to deal more effectively and efficiently with unstructured, textual information. A framework is presented that can be used to provide a unified, dynamic view of an organisation's innovation-related information, both structured and unstructured. Once implemented, this framework will constitute an innovation-focused knowledge base that organises such innovation-related information and makes it accessible to the stakeholders of the innovation process. Two novel, complementary text-analytical techniques, Latent Dirichlet Allocation and the Concept-Topic Model, were identified for application with the framework, and their potential value as part of the information systems that would embody the framework is illustrated. The resulting knowledge base would cause a quantum leap in the accessibility of information and may significantly improve the way innovation is done and managed in the target organisation.
12

Morchid, Mohamed. "Représentations robustes de documents bruités dans des espaces homogènes." Thesis, Avignon, 2014. http://www.theses.fr/2014AVIG0202/document.

Abstract:
In the information retrieval field, documents are usually considered as "bags-of-words". This model does not take into account the temporal structure of the document and is sensitive to noises which can alter its lexical form. These noises can be produced by different sources: the uncontrolled form of messages on micro-blogging platforms, automatic transcription of speech documents, which is error-prone, and lexical and grammatical variability in Web forums. The work presented in this thesis addresses issues related to representing documents from noisy sources. The thesis consists of three parts, in which different representations of content are proposed. The first compares a classical representation based on term frequency with a higher-level representation based on a topic space. This abstraction of the document content limits the alteration of the noisy document's surface form by representing its content with a set of high-level features. Our experiments confirm that mapping a noisy document into a topic space improves the results obtained on various information retrieval tasks compared to a classical term-frequency approach. The major problem with such a high-level representation is that it is based on a topic space whose parameters are chosen empirically. The second part presents a novel representation based on multiple topic spaces that addresses three main problems: the closeness of the subjects discussed in the document, the tricky choice of the "right" values of the topic space parameters, and the robustness of the topic-based representation. Based on the idea that a single representation of the contents cannot capture all the relevant information, we propose to increase the number of views on a single document. This multiplication of views generates "artificial" observations that contain fragments of the useful information. A first experiment validated this multi-view approach to representing noisy texts. It has, however, the disadvantage of being very large and redundant, and of containing additional variability associated with the diversity of views. In a second step, we propose a method based on factor analysis to compact the different views and to obtain a new robust representation of low dimension which contains only the informative part of the document while the noisy variabilities are compensated. On a dialogue classification task, this compression process confirmed that the compact representation improves the robustness of the noisy document representation. Nonetheless, during the learning of topic spaces, the document is still considered as a "bag-of-words", while many studies have shown that the position of a word within the document is important. A representation which takes this temporal structure of the document into account is proposed in the third part. It is based on four-dimensional hyper-complex numbers called quaternions. Our experiments on a classification task have shown the effectiveness of the proposed approach compared with a conventional "bag-of-words" representation.
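A minimal sketch of the first part's idea, mapping noisy documents into a topic space and classifying them there rather than on raw term counts, with scikit-learn as an assumed stack and invented toy snippets:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical noisy snippets with labels (illustration only).
texts = ["gr8 movie luv it", "terrible film wast of time",
         "luv this movie", "awful wast terrible"]
labels = [1, 0, 1, 0]

# Each document is projected into a low-dimensional topic space;
# the classifier then operates on topic proportions, which are less
# sensitive to surface-level noise than raw term counts.
clf = make_pipeline(
    CountVectorizer(),
    LatentDirichletAllocation(n_components=2, random_state=0),
    LogisticRegression(),
)
clf.fit(texts, labels)
print(clf.predict(["luv it gr8"]))
```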
13

Bui, Quang Vu. "Pretopology and Topic Modeling for Complex Systems Analysis : Application on Document Classification and Complex Network Analysis." Thesis, Paris Sciences et Lettres (ComUE), 2018. http://www.theses.fr/2018PSLEP034/document.

Abstract:
The work of this thesis presents the development of algorithms for document classification on the one hand, and complex network analysis on the other, based on pretopology, a theory that models the concept of proximity. The first work develops a framework for document clustering by combining topic modeling and pretopology. Our contribution proposes using topic distributions extracted by topic modeling as input for classification methods. In this approach, we investigated two aspects: determining an appropriate distance between documents by studying the relevance of probabilistic and vector-based measures, and performing groupings according to several criteria using a pseudo-distance defined from pretopology. The second work introduces a general framework for modeling complex networks by developing a reformulation of stochastic pretopology, and proposes the Pretopology Cascade Model as a general model of information diffusion. In addition, we proposed an agent-based model, Textual-ABM, to analyze complex dynamic networks associated with textual information using an author-topic model, and introduced Textual-Homo-IC, an independent cascade model based on resemblance, in which homophily is measured from textual content obtained by topic modeling.
14

Dupuy, Christophe. "Inference and applications for topic models." Thesis, Paris Sciences et Lettres (ComUE), 2017. http://www.theses.fr/2017PSLEE055/document.

Abstract:
Most current recommendation systems are based on ratings (i.e. numbers between 0 and 5) and try to suggest a content (movie, restaurant...) to a user. These systems usually allow users to provide a text review of this content in addition to the rating. It is hard to extract useful information from raw text, while a rating does not contain much information about the content and the user. In this thesis, we tackle the problem of suggesting personalized readable text to users to help them make a quick decision about a content. More specifically, we first build a topic model that predicts personalized movie descriptions from text reviews. Our model extracts distinct qualitative (i.e., opinion-conveying) and descriptive topics by combining text reviews and movie ratings in a joint probabilistic model. We evaluate our model on an IMDB dataset and illustrate its performance through a comparison of topics. We then study parameter inference in large-scale latent variable models, which include most topic models. We propose a unified treatment of online inference for latent variable models from a non-canonical exponential family, and draw explicit links between several previously proposed frequentist or Bayesian methods. We also propose a novel inference method for the frequentist estimation of parameters that adapts MCMC methods to online inference of latent variable models, with the proper use of local Gibbs sampling. For the specific latent Dirichlet allocation topic model, we provide an extensive set of experiments and comparisons with existing work, in which our new approach outperforms all previously proposed methods. Finally, we propose a new class of determinantal point processes (DPPs) which can be manipulated for inference and parameter learning in potentially sublinear time in the number of items. This class, based on a specific low-rank factorization of the marginal kernel, is particularly suited to a subclass of continuous DPPs and to DPPs defined on exponentially many items. We apply this new class to modelling text documents as samples of a DPP over sentences, and propose a conditional maximum likelihood formulation to model topic proportions, which is made possible with no approximation for our class of DPPs. We present an application to document summarization with a DPP on 2^500 items, where the summaries are composed of readable sentences.
APA, Harvard, Vancouver, ISO, and other styles
15

Wei, Zhihua. "The research on chinese text multi-label classification." Thesis, Lyon 2, 2010. http://www.theses.fr/2010LYO20025/document.

Full text
Abstract:
Text Classification (TC), an important field in information technology, has many valuable applications. Faced with a sea of information resources, the objects of TC have become more complicated and diverse, and research into effective, practical TC techniques is challenging. More and more researchers consider multi-label TC better suited to many applications. Building on a large body of algorithms for single-label and multi-label TC, this thesis analyses the difficulties in multi-label TC and in Chinese text representation. Aiming at the high dimensionality of the feature space, the sparsity of text representations, and the poor performance of multi-label classifiers, it puts forward corresponding algorithms from different angles. Focusing on the "curse of dimensionality" that arises when Chinese texts are represented with n-grams, a two-step feature selection algorithm is constructed, combining the filtering of rare features within a class with the selection of discriminative features across classes. The proper value of n, the feature weighting strategy, and the correlation among features are examined through a variety of experiments, contributing several useful conclusions to research on n-gram representations of Chinese text. In view of a disadvantage of the Latent Dirichlet Allocation (LDA) model, namely its arbitrary revision of variables during smoothing, a new smoothing strategy based on Tolerance Rough Sets (TRS) is put forward: it first constructs tolerance classes over the global vocabulary and then assigns values to out-of-vocabulary (OOV) words in each class according to those tolerance classes. To improve the performance of multi-label classifiers and reduce computational complexity, a new TC method based on the LDA model is applied to Chinese text representation: topics are extracted statistically from texts, which are then represented by their topic vectors. It shows competitive performance on both English and Chinese corpora. To further enhance multi-label TC, a compound classification framework is raised that partitions the text space by computing upper and lower approximations. This algorithm decomposes a multi-label TC problem into several single-label TCs and several multi-label TCs with fewer labels than the original problem: an unknown text is classified by a single-label classifier when it falls into the lower approximation space of some class, and by the corresponding multi-label classifier otherwise. An application system, TJ-MLWC (Tongji Multi-label Web Classifier), was designed; it calls results from search engines directly and classifies them in real time using an improved Naïve Bayes classifier, making browsing more convenient, since users can immediately locate the texts of interest according to the class information given by TJ-MLWC.
The thesis centres on text classification, a rapidly expanding field with many current and potential applications. Its main contributions concern two points. First, the specificities of encoding and automatically processing the Chinese language: words may be composed of one, two or three characters; there is no typographic separation between words; and many word orders are possible within a sentence, all of which leads to difficult ambiguity problems. Encoding with n-grams (sequences of n = 1, 2 or 3 characters) is particularly well suited to Chinese, as it is fast and requires neither the prior recognition of words with a dictionary nor their segmentation. Second, multi-label classification, that is, when each individual may be assigned to one or several classes. In the case of texts, one looks for classes that correspond to topics, and a single text may be attached to one or several topics. This multi-label approach is more general: a patient may suffer from several pathologies; a company may be active in several industrial or service sectors. The thesis analyses these problems and proposes solutions, first for single-label classifiers and then for multi-label ones. Among the difficulties addressed are the definition of the variables characterizing the texts, their large number, the treatment of sparse matrices (many zeros in the matrix crossing texts and descriptors), and the relatively poor performance of the usual multi-class classifiers.
Text classification is an important research field in information science, rich in practical applications. As the content to be classified grows more complex and diverse and classification goals multiply, developing effective classification techniques that meet practical needs becomes very challenging, and research on multi-label classification has emerged in response. Building on an analysis of a large number of single-label and multi-label text classification algorithms, this thesis applies rough set theory from several angles to the problems of high-dimensional features, data sparsity, and the high complexity yet low accuracy of multi-label classification, and proposes corresponding algorithms. For the curse of dimensionality that arises when n-grams serve as features of Chinese text, a two-step feature selection method is proposed that combines the removal of rare within-class features with between-class feature selection; extensive experiments on a large-scale Chinese corpus examine the choice of n, the weighting of features, and the correlations among features, yielding several useful conclusions. To address the low efficiency and high cost of classification with high-dimensional text representations, a multi-label classification algorithm based on the LDA model is proposed, which uses the topics extracted by LDA as text features to build an efficient classifier; under the PT3 multi-label transformation, the algorithm performs well on both Chinese and English datasets, matching the best currently recognized multi-label methods. To remedy the arbitrariness of existing smoothing strategies for the LDA model, a smoothing strategy based on tolerance rough sets is proposed: tolerance classes of words are first constructed over the global vocabulary, and smoothing values are then assigned to the out-of-vocabulary words of each document class according to word frequencies within the tolerance classes; extensive experiments on Chinese and English, balanced and imbalanced corpora show that this smoothing significantly improves the classification performance of the LDA model, most markedly on imbalanced corpora. For the high complexity and low accuracy of multi-label classification, a compound multi-label classification framework based on variable-precision rough sets is proposed: the text feature space is partitioned with variable-precision rough sets, decomposing the multi-label problem into several binary single-label problems and several multi-label problems with fewer labels; when an unknown text falls into the lower approximation region of some class, a simple single-label classifier decides its category directly, and when it falls into a boundary region, the multi-label classifier of that region is used. Experiments show that this framework considerably improves both accuracy and efficiency. The thesis also designs and implements a web search result visualization system based on multi-label classification (MLWC), which directly consumes the results returned by a search engine and classifies them in real time with an improved Naïve Bayes multi-label algorithm, enabling users to quickly locate the texts of interest among search results.
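As a rough illustration of the character n-gram representation and the within-class rare-feature filtering discussed in this entry (not the thesis's exact pipeline; the threshold and names are invented for the example):

```python
from collections import Counter

def char_ngrams(text, n=2):
    """Character n-grams; no word segmentation is needed for Chinese text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def filter_rare_within_class(docs_by_class, n=2, min_count=3):
    """Keep only the n-grams seen at least min_count times inside some class."""
    kept = set()
    for docs in docs_by_class.values():
        counts = Counter(g for doc in docs for g in char_ngrams(doc, n))
        kept.update(g for g, c in counts.items() if c >= min_count)
    return kept
```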
APA, Harvard, Vancouver, ISO, and other styles
16

Atrevi, Dieudonne Fabrice. "Détection et analyse des évènements rares par vision, dans un contexte urbain ou péri-urbain." Thesis, Orléans, 2019. http://www.theses.fr/2019ORLE2008.

Full text
Abstract:
The main objective of this thesis is the development of complete methods for rare event detection. The work can be summarized in two parts. The first part is devoted to the study of state-of-the-art shape descriptors. On the one hand, the robustness of some descriptors to varying lighting conditions was studied. On the other hand, the ability of geometric moments to describe the human shape was studied through a 3D human pose estimation application based on 2D images. From this study, we show that, through a shape retrieval application, geometric moments can be used to estimate a human pose by an exhaustive search in a database of known poses. This kind of application could be used in a human action recognition system, a possible final step of an event analysis pipeline. In the second part, three main contributions to rare event detection are presented. The first contribution is a global scene analysis method for crowd event detection. Here, global scene modeling is based on spatiotemporal interest points filtered from the saliency map of the scene; the features used are the histogram of optical flow orientations and a set of shape descriptors studied in the first part. The Latent Dirichlet Allocation algorithm is used to create event models from a visual-document representation of image sequences (video clips). The second contribution is a method for detecting salient motions in video. This method is totally unsupervised and relies on the properties of the discrete cosine transform to explore the optical flow information of the scene. Local modeling for event detection and localization is at the core of the last contribution of this thesis: the method builds the event model from motion saliency scores and a one-class SVM. The methods have been evaluated on different public databases and the results obtained are promising.
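A minimal sketch of the histogram-of-optical-flow-orientations feature mentioned above, assuming the flow components (u, v) are already available from some optical flow estimator; the bin count is arbitrary:

```python
import numpy as np

def hof(u, v, bins=8):
    """Histogram of optical flow orientations, weighted by flow magnitude."""
    angles = np.arctan2(v, u)               # orientation of each flow vector
    magnitudes = np.hypot(u, v)
    hist, _ = np.histogram(angles, bins=bins, range=(-np.pi, np.pi),
                           weights=magnitudes)
    return hist / (hist.sum() + 1e-12)      # normalize to a distribution
```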
APA, Harvard, Vancouver, ISO, and other styles
17

Patel, Virashree Hrushikesh. "Topic modeling using latent dirichlet allocation on disaster tweets." 2018. http://hdl.handle.net/2097/39337.

Full text
Abstract:
Master of Science
Department of Computer Science
Cornelia Caragea
Doina Caragea
Social media has changed the way people communicate information. It has been noted that social media platforms like Twitter are increasingly used by people and authorities in the wake of natural disasters. The year 2017 was a historic year for the USA in terms of natural calamities and associated costs. According to NOAA (National Oceanic and Atmospheric Administration), during 2017 the USA experienced 16 separate billion-dollar disaster events, including three tropical cyclones, eight severe storms, two inland floods, a crop freeze, drought, and wildfire. During natural disasters, due to the collapse of infrastructure and telecommunication, it is often hard to reach people in need or to determine which areas are affected. In such situations, Twitter can be a lifesaving tool for local government and search and rescue agencies. Using the Twitter streaming API service, disaster-related tweets can be collected and analyzed in real time. Although tweets can be sparse, noisy and ambiguous, some contain information useful for situational awareness. For example, some tweets express emotions, such as grief or anguish, or call for help; other tweets provide information specific to a region, place or person; while others simply help spread information from news or environmental agencies. To extract information useful to disaster response teams, disaster tweets need to be cleaned and classified into various categories. Topic modeling can help identify topics in a collection of such tweets, so that a topic (or a set of topics) can be associated with each tweet. Thus, in this report, we use Latent Dirichlet Allocation (LDA) to accomplish topic modeling for a disaster tweets dataset.
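A pipeline of the kind described can be sketched with scikit-learn's LatentDirichletAllocation; the tweets, topic count and preprocessing below are placeholders, not the report's actual settings.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

tweets = ["flood waters rising near the bridge",      # placeholder tweets
          "send help to the shelters in the county",
          "power lines down after the storm"]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(tweets)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# print the top words of each topic
terms = vec.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = weights.argsort()[-5:][::-1]
    print(f"topic {k}:", [terms[i] for i in top])
```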
APA, Harvard, Vancouver, ISO, and other styles
18

Karlsson, Kalle. "News media attention in Climate Action: Latent topics and open access." Thesis, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-23413.

Full text
Abstract:
The purpose of the thesis is i) to discover the latent topics of SDG13 and their coverage in news media, ii) to investigate the share of OA and non-OA articles and reviews in each topic, and iii) to compare the share of different OA types (Green, Gold, Hybrid and Bronze) in each topic. It adopts a heuristic perspective and an explorative approach in reviewing the three concepts of open access, altmetrics and climate action (SDG13). Data is collected from SciVal, Unpaywall, Altmetric.com and Scopus, yielding a dataset of 70,206 articles and reviews published between 2014 and 2018. The documents retrieved are analyzed with descriptive statistics and topic modeling using Sklearn's LDA (Latent Dirichlet Allocation) package in Python. The findings show an altmetric advantage for OA in the case of news media and SDG13, which fluctuates across topics. News media is shown to focus on subjects with "visible" effects, in concordance with previous research on media coverage; examples were topics concerning emissions of greenhouse gases and melting glaciers. Gold OA is the most common type mentioned in news outlets; it also generates the highest number of news mentions, while the average number of news mentions was highest for documents published as Bronze. Moreover, the thesis is largely driven by the methods used, most notably the programming language Python. As such, it outlines future paths for research into the three concepts reviewed as well as the methods used for topic modeling and programming.
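Once each article has been assigned a dominant topic, the per-topic OA shares investigated in the thesis can be tabulated along these lines; the column names and toy rows are hypothetical.

```python
import pandas as pd

# one row per article: its dominant LDA topic and its OA status
df = pd.DataFrame({
    "topic":   [0, 0, 1, 1, 1, 2],
    "oa_type": ["Gold", "Closed", "Green", "Gold", "Closed", "Bronze"],
})

# share of each OA type within each topic
share = (df.groupby("topic")["oa_type"]
           .value_counts(normalize=True)
           .rename("share")
           .reset_index())
print(share)
```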
APA, Harvard, Vancouver, ISO, and other styles
19

Moon, Brenda. "Scanning the Science-Society Horizon." Phd thesis, 2016. http://hdl.handle.net/1885/101191.

Full text
Abstract:
Science communication approaches have evolved over time, gradually placing more importance on understanding the context of the communication and the audience. The increase in people participating in social media on the Internet offers a new resource for monitoring what people are discussing. People self-publish their views on social media, which provides a rich source of everyday, every-person thinking. This introduces the possibility of passively monitoring this public discussion to find information useful to science communicators, allowing them to better target their communications about different topics. This research study focuses on understanding what open source intelligence, in the form of public tweets on Twitter, reveals about the contexts in which the word 'science' is used by the English-speaking public. By conducting a series of studies based on simpler questions, I gradually build up a view of who is contributing on Twitter, how often, and what topics are being discussed that include the keyword 'science'. An open-source data-gathering tool for Twitter was developed and used to collect a dataset of tweets containing the keyword 'science' during 2011. After collection was completed, the data was prepared for analysis by removing unwanted tweets. The size of the dataset (12.2 million tweets by 3.6 million users (authors)) required mainly quantitative approaches, even though this represents only a very small proportion, about 0.02%, of the total tweets per day on Twitter. Fourier analysis was used to create a model of the underlying temporal pattern of tweets per day and revealed a weekly pattern. The number of users per day followed a similar pattern, and most of these users did not use the word 'science' often on Twitter. An investigation of tweet types suggests that people using the word 'science' engaged in more sharing, of both links and other people's tweets, than is usual on Twitter. Consideration of word frequencies and bigrams in the text of the tweets found that while word frequencies were not particularly effective for understanding such a large dataset, bigrams gave insight into the contexts in which 'science' was used in up to 19.19% of the tweets. The final study used Latent Dirichlet Allocation (LDA) topic modelling to identify the contexts in which 'science' was used, giving a much richer view of the whole corpus than the bigram analysis. Although the thesis has focused on the single keyword 'science', the techniques developed should be applicable to other keywords and thus provide science communicators with a near real-time source of information about what issues the public is concerned about, what they are saying about those issues, and how that is changing over time.
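The weekly pattern can be recovered from a daily count series with a discrete Fourier transform, as in this sketch on synthetic stand-in data:

```python
import numpy as np

days = 365
t = np.arange(days)
# synthetic daily tweet counts with a weekly cycle plus noise (stand-in data)
rng = np.random.default_rng(0)
counts = 30000 + 3000 * np.sin(2 * np.pi * t / 7) + rng.normal(0, 500, days)

spectrum = np.abs(np.fft.rfft(counts - counts.mean()))
freqs = np.fft.rfftfreq(days, d=1.0)             # in cycles per day
peak = freqs[spectrum.argmax()]
print(f"dominant period: {1 / peak:.1f} days")   # about 7 for a weekly pattern
```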
APA, Harvard, Vancouver, ISO, and other styles
20

Patel, Vishal. "Near-Duplicate Detection Using Instance Level Constraints." Thesis, 2009. https://etd.iisc.ac.in/handle/2005/1346.

Full text
Abstract:
For the task of near-duplicate document detection, comparison approaches based on bag-of-words, as used in the information retrieval community, are not sufficiently accurate. This work presents a novel approach for the setting where instance-level constraints are given for documents and near duplicates must be retrieved for a new query document. The framework incorporates the instance-level constraints and clusters documents into groups using a novel clustering approach, Grouped Latent Dirichlet Allocation (gLDA). A distance metric is then learned for each cluster using the large margin nearest neighbor algorithm, and documents are finally ranked for a given new, unknown document using the learnt distance metrics. Experimental results on various datasets demonstrate that our clustering method (gLDA with side constraints) performs better than other clustering methods and that the overall approach outperforms other near-duplicate detection algorithms.
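The cluster-then-rank structure of the approach can be pictured with the sketch below, which substitutes standard LDA plus k-means for gLDA and plain Euclidean distance for the per-cluster Mahalanobis metric that the thesis learns with the large margin nearest neighbor algorithm:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans

docs = ["the cat sat on the mat", "a cat sat on a mat",
        "stock markets fell sharply", "markets fell on rate fears"]

X = CountVectorizer().fit_transform(docs)
theta = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(theta)

# rank the other documents of the query's cluster by distance to the query
query_id = 1
same = [i for i in range(len(docs)) if labels[i] == labels[query_id] and i != query_id]
ranked = sorted(same, key=lambda i: np.linalg.norm(theta[i] - theta[query_id]))
print("nearest candidates first:", ranked)
```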
APA, Harvard, Vancouver, ISO, and other styles
21

Correia, Acácio Filipe Pereira Pinto. "Towards Preemptive Text Edition using Topic Matching on Corpora." Master's thesis, 2016. http://hdl.handle.net/10400.6/6368.

Full text
Abstract:
Nowadays, the results of scientific research are only recognized when published in papers in international journals or magazines of the respective area of knowledge. This perspective reflects the importance of having the work reviewed by peers. The revision encompasses a thorough analysis of the work performed, including the quality of the writing and whether the study advances the state of the art, among other details. For these reasons, with the publishing of the document, other researchers have an assurance of the high quality of the study presented and can therefore make direct use of the findings in their own work. The publishing of documents creates a cycle of information exchange that speeds up the development of new techniques, theories and technologies, resulting in added value for the entire society. Nonetheless, a detailed revision of the content sent for publication requires additional effort and dedication from its authors: they must make sure that the manuscript is of high quality, since sending a document with mistakes conveys an unprofessional image of the authors and may result in rejection by the journal or magazine. The objective of this work is to develop an algorithm capable of assisting in the writing of this type of document by proposing suggestions for possible improvements or corrections according to its specific context. The general idea is for the algorithm to derive suggestions by comparing the content of the document being written with that of similar published documents in the field. In this context, a study of the Natural Language Processing (NLP) techniques used to create models representing a document and its subjects was performed; NLP provides the tools for building such models and identifying topics, with n-grams and topic modeling as the main concepts. The study also included an analysis of work in the field of academic writing: the structure and contents of this type of document, characteristics common to high-quality articles, and tools developed to help in their writing. The developed algorithm combines several tools, backed by a collection of documents, with the logic connecting all components implemented in the scope of this Master's. The collection consists of the full text of articles from different areas, including Computer Science, Physics and Mathematics, among others; the topics of these documents were extracted and stored to be fed to the algorithm. By comparing the topics extracted from the document under analysis with those of the documents in the collection, it is possible to select the closest documents and use them to create suggestions. Through a set of tools for syntactic analysis, synonym search and morphological realization, the algorithm can propose replacements by words more commonly used in a given field of knowledge. Both objective and subjective tests were conducted; they demonstrate that, in some cases, the algorithm proposes suggestions that bring the terms used in the document closer to the terms most used in the state of the art of a given scientific field.
This points towards the idea that using the algorithm should improve the quality of the documents, as they become more similar to those already published. Even though the improvements to the documents are minimal, they should be understood as a lower bound on the real utility of the algorithm, partly because of the many parsing errors, in both the training and test sets, resulting from the conversion of the original articles' pdf files, which could be reduced in a production system. The main contributions of this work include the study of the state of the art, the design and implementation of the algorithm, and the text editor developed as a proof of concept. The analysis of context specificity, resulting from tests performed on different areas of knowledge, and the large collection of documents gathered during this Master's program are also important contributions.
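The document-matching step at the heart of the algorithm can be sketched as a nearest-neighbour search over topic distributions, here with the Hellinger distance and toy vectors:

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two topic distributions."""
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

# toy topic vectors: rows are published documents, draft is the text being written
collection = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.8, 0.1],
                       [0.3, 0.3, 0.4]])
draft = np.array([0.6, 0.3, 0.1])

order = np.argsort([hellinger(draft, doc) for doc in collection])
print("closest documents first:", order)
```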
APA, Harvard, Vancouver, ISO, and other styles
22

Silva, Martín Gastón. "Predicción de tendencias en redes sociales basada en características sociales y contenido." Bachelor's thesis, 2018. http://hdl.handle.net/11086/6245.

Full text
Abstract:
Thesis (Licentiate in Computer Science)--Universidad Nacional de Córdoba, Facultad de Matemática, Astronomía, Física y Computación, 2018.
In the framework of social network analysis, this work seeks to capture the behavior of influential users around a given publication. With this information, the intention is to build a machine learning model capable of predicting whether a certain tweet will be "popular" or not. The dataset was built through the public Twitter API, with a final volume of more than 5,000 users and 5,000,000 publications. With this information, different machine learning models with multiple configurations were trained and evaluated in order to find the best performance. In a first experiment, a binary classification model based on SVM (Support Vector Machines) was inferred using only social information, which achieved 77% on the F1 metric when predicting whether a publication would be considered "popular". In a second stage, Natural Language Processing techniques were applied to the content of the publications, achieving significant improvements in the cases where the previous model fell short. This analysis of the tweets was performed with topic detection, using LDA (Latent Dirichlet Allocation) algorithms.
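A sketch of the first experiment's setup, with synthetic stand-in features; the real social features, data and tuning are those of the thesis:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
# hypothetical social features per tweet (e.g., counts and scores describing
# the influential users who interacted with it)
X = rng.normal(size=(1000, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.5, 1000) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = SVC(kernel="rbf").fit(X_tr, y_tr)
print("F1:", f1_score(y_te, clf.predict(X_te)))
```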
APA, Harvard, Vancouver, ISO, and other styles
23

Arun, R. "Cluster Identification : Topic Models, Matrix Factorization And Concept Association Networks." Thesis, 2010. https://etd.iisc.ac.in/handle/2005/2247.

Full text
Abstract:
The problem of identifying clusters arising in the context of topic models and related approaches is important in the area of machine learning. The problem concerning traversals on Concept Association Networks is of great interest in the area of cognitive modelling. Cluster identification is the problem of finding the right number of clusters in a given set of points (or a dataset) in different settings, including topic models and matrix factorization algorithms. Traversals in Concept Association Networks provide useful insights into cognitive modelling and performance. First, we consider the problem of authorship attribution in stylometry and the problem of cluster identification for topic models. For authorship attribution, we show empirically that, using stop-words as stylistic features of an author, vectors obtained from Latent Dirichlet Allocation (LDA) outperform other classifiers. Topics obtained by this method are generally abstract, and it may not be possible to judge the cohesiveness of the words falling in the same topic by mere manual inspection; hence it is difficult to determine whether the chosen number of topics is optimal. We next address this issue. We propose a new measure for topics arising out of LDA, based on the divergence between the singular value distribution of the topic-word matrix and the L1 norm distribution of the document-topic matrix. It is shown that under certain assumptions this measure can be used to find the right number of topics. Next we consider the Non-negative Matrix Factorization (NMF) approach for clustering documents. We propose entropy-based regularization for a variant of NMF with row-stochastic constraints on the component matrices. It is shown that when topic-splitting occurs (i.e., when an extra topic is required), an existing topic vector splits into two; the divergence term in the cost function decreases whereas the entropy term increases, leading to a regularization. Next we consider the problem of clustering in Concept Association Networks (CAN), which are generic graph models of relationships between abstract concepts. We propose a simple clustering algorithm that takes into account the complex network properties of CAN, and compare its performance with that of the graph-cut based spectral clustering algorithm. In addition, we study the properties of traversals by human participants on CAN, obtaining experimental results that contrast these traversals with those from (i) random walk simulations and (ii) shortest path algorithms.
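A reading of the proposed topic-count measure in code (my own interpretation of the abstract; the normalization and ordering details are assumptions):

```python
import numpy as np

def topic_count_measure(topic_word, doc_topic, doc_lengths):
    """Symmetric KL divergence between the singular-value distribution of the
    topic-word matrix and the length-weighted document-topic distribution;
    lower values suggest a better number of topics."""
    cm1 = np.linalg.svd(topic_word, compute_uv=False)   # sorted descending
    cm1 = cm1 / cm1.sum()
    cm2 = doc_lengths @ doc_topic        # aggregate topic mass over documents
    cm2 = np.sort(cm2)[::-1]
    cm2 = cm2 / cm2.sum()
    eps = 1e-12
    kl = lambda p, q: np.sum(p * np.log((p + eps) / (q + eps)))
    return kl(cm1, cm2) + kl(cm2, cm1)
```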
APA, Harvard, Vancouver, ISO, and other styles
24

Sharma, Govind. "Sentiment-Driven Topic Analysis Of Song Lyrics." Thesis, 2012. https://etd.iisc.ac.in/handle/2005/2472.

Full text
Abstract:
Sentiment Analysis is an area of Computer Science that deals with the impact a document makes on a user. The field is further sub-divided into Opinion Mining and Emotion Analysis, the latter of which is the basis for the present work. Work on songs is aimed at building affective interactive applications such as music recommendation engines. Using song lyrics, we are interested in both supervised and unsupervised analyses, each of which has its own pros and cons. For an unsupervised analysis (clustering), we use a standard probabilistic topic model called Latent Dirichlet Allocation (LDA). It mines topics from songs, which are simply probability distributions over the vocabulary of words. Some of the topics seem sentiment-based, motivating us to continue with this approach. We evaluate our clusters using a gold dataset collected from an apt website and get positive results. This approach is useful in the absence of a labelled dataset. In another part of our work, we argue that some supervision is inescapable, in that the topics returned must be analysed manually. Further, we also use explicit supervision, in the form of a training dataset, for a classifier to learn sentiment-specific classes. This analysis helps reduce dimensionality and improve classification accuracy. We obtain excellent dimensionality reduction using Support Vector Machines (SVM) for feature selection. For re-classification, we use the Naive Bayes Classifier (NBC) and SVM, both of which perform well. We also use Non-negative Matrix Factorization (NMF) for classification, but observe that its results coincide with those of NBC, with no exceptions; this drives us towards establishing a theoretical equivalence between the two.
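The SVM-for-feature-selection-then-NBC pipeline can be sketched as follows, with toy lyrics and an arbitrary cut-off on the number of features kept:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB

lyrics = ["love joy happy bright sun", "dark tears cry lonely pain",
          "happy love sunshine smile", "pain sorrow dark night cry"]
moods = [1, 0, 1, 0]                    # toy sentiment labels

X = CountVectorizer().fit_transform(lyrics)
svm = LinearSVC().fit(X, moods)

# keep the features with the largest absolute SVM weights
k = 4
top = np.argsort(np.abs(svm.coef_[0]))[-k:]
X_reduced = X[:, top]

nbc = MultinomialNB().fit(X_reduced, moods)
print("training accuracy:", nbc.score(X_reduced, moods))
```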
APA, Harvard, Vancouver, ISO, and other styles
