Theses on the topic "Heterogeneous Textual Data Mining"
Create an accurate citation in APA, MLA, Chicago, Harvard, and other styles
Consult the top 50 theses for your research on the topic "Heterogeneous Textual Data Mining".
Next to every source in the list of references there is an "Add to bibliography" button. Press it, and we will automatically generate the bibliographic reference for the chosen work in the citation style you need: APA, MLA, Harvard, Vancouver, Chicago, etc.
You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.
Explore theses on a wide variety of disciplines and organize your bibliography correctly.
Saneifar, Hassan. "Locating Information in Heterogeneous log files". Thesis, Montpellier 2, 2011. http://www.theses.fr/2011MON20092/document.
In this thesis, we present contributions to the challenging issues encountered in question answering and locating information in complex textual data, such as log files. Question answering systems (QAS) aim to find a relevant fragment of a document that can be regarded as the best possible concise answer to a question given by a user. In this work, we seek to propose a complete solution for locating information in a special kind of textual data, i.e., log files generated by EDA design tools. Nowadays, in many application areas, modern computing systems are instrumented to generate huge reports about occurring events in the format of log files. Log files are generated in every computing field to report the status of systems, products, or even causes of problems that can occur. Log files may also include data about critical parameters, sensor outputs, or a combination of those. Analyzing log files, as an attractive approach to automatic system management and monitoring, has been enjoying a growing amount of attention [Li et al., 2005]. Although the process of generating log files is quite simple and straightforward, log file analysis can be a tremendous task that requires enormous computational resources, considerable time, and sophisticated procedures [Valdman, 2004]. Indeed, many kinds of log files generated in some application domains are not systematically exploited in an efficient way because of their special characteristics. In this thesis, we are mainly interested in log files generated by Electronic Design Automation (EDA) systems. Electronic design automation is a category of software tools for designing electronic systems such as printed circuit boards and Integrated Circuits (IC). In this domain, to ensure design quality, there are quality check rules that must be verified. Verification of these rules is principally performed by analyzing the generated log files.
In the case of large designs, where the design tools may generate megabytes or gigabytes of log files each day, the problem is to wade through all of this data to locate the critical information needed to verify the quality check rules. These log files typically include a substantial amount of data, so manually locating information is a tedious and cumbersome process. Furthermore, the particular characteristics of log files, especially those generated by EDA design tools, raise significant challenges for retrieving information from them. The specific features of log files limit the usefulness of manual analysis techniques and static methods. Automated analysis of such logs is complex due to their heterogeneous and evolving structures and their large, non-fixed vocabulary. In this thesis, each contribution answers questions raised in this work by the data specificities or domain requirements. Throughout this work, we investigate the main concern: how do the specificities of log files influence information extraction and natural language processing methods? In this context, a key challenge is to provide approaches that take the log file specificities into account while considering the issues specific to QA in restricted domains. We present the following contributions:
> A novel method to recognize and identify the logical units in log files in order to segment them according to their structure. We propose a method to characterize complex logical units found in log files according to their syntactic characteristics. Within this approach, we propose an original type of descriptor to model the textual structure and layout of text documents.
> An approach to locate the requested information in log files based on passage retrieval.
To improve the performance of passage retrieval, we propose a novel query expansion approach that adapts an initial query to all types of corresponding log files and overcomes difficulties such as vocabulary mismatch. Our query expansion approach relies on two relevance feedback steps. In the first, we determine explicit relevance feedback by identifying the context of questions. The second phase consists of a novel type of pseudo relevance feedback. Our method is based on a new term weighting function introduced in this work, called TRQ (Term Relatedness to Query), which scores corpus terms according to their relatedness to the query. We also investigate how to apply our query expansion approach to documents from general domains.
> A study of the use of morpho-syntactic knowledge in our approaches. For this purpose, we are interested in terminology extraction from log files, and we introduce our approach, named Exterlog (EXtraction of TERminology from LOGs), to extract the terminology of log files. To evaluate the extracted terms and choose the most relevant ones, we propose a candidate term evaluation method using a Web-based measure, combined with statistical measures, that takes the context of log files into account.
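The abstract does not give the TRQ formula itself; as a purely hypothetical sketch of a relatedness-to-query score, one might weight each corpus term by the fraction of its occurrences that fall in passages containing a query term (the function name and scoring rule below are illustrative assumptions, not the thesis's actual definition):

```python
from collections import Counter

def trq_scores(passages, query_terms):
    """Hypothetical TRQ-style score: fraction of a term's occurrences
    that fall in passages containing at least one query term."""
    total = Counter()
    with_query = Counter()
    qset = set(query_terms)
    for passage in passages:
        tokens = passage.lower().split()
        total.update(tokens)
        if qset & set(tokens):
            with_query.update(tokens)
    # Query terms themselves are excluded from the expansion candidates.
    return {t: with_query[t] / total[t] for t in total if t not in qset}

passages = [
    "error timing violation on clock path",
    "timing slack reported for clock domain",
    "info tool version and license check",
]
scores = trq_scores(passages, ["timing"])
```

Terms with a high score (here, "clock", which always co-occurs with the query term) would be the candidates added to the expanded query.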
Zhou, Wubai. "Data Mining Techniques to Understand Textual Data". FIU Digital Commons, 2017. https://digitalcommons.fiu.edu/etd/3493.
Al-Mutairy, Badr. "Data mining and integration of heterogeneous bioinformatics data sources". Thesis, Cardiff University, 2008. http://orca.cf.ac.uk/54178/.
Ur-Rahman, Nadeem. "Textual data mining applications for industrial knowledge management solutions". Thesis, Loughborough University, 2010. https://dspace.lboro.ac.uk/2134/6373.
ATTANASIO, ANTONIO. "Mining Heterogeneous Urban Data at Multiple Granularity Layers". Doctoral thesis, Politecnico di Torino, 2018. http://hdl.handle.net/11583/2709888.
Kubalík, Jakub. "Mining of Textual Data from the Web for Speech Recognition". Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2010. http://www.nusl.cz/ntk/nusl-237170.
Nimmagadda, Shastri Lakshman. "Ontology based data warehousing for mining of heterogeneous and multidimensional data sources". Thesis, Curtin University, 2015. http://hdl.handle.net/20.500.11937/2322.
Preti, Giulia. "On the discovery of relevant structures in dynamic and heterogeneous data". Doctoral thesis, Università degli studi di Trento, 2019. http://hdl.handle.net/11572/242978.
Fang, Chunsheng. "Novel Frameworks for Mining Heterogeneous and Dynamic Networks". University of Cincinnati / OhioLINK, 2011. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1321369978.
SIMONETTI, Andrea. "Development of statistical methods for the analysis of textual data". Doctoral thesis, Università degli Studi di Palermo, 2022. https://hdl.handle.net/10447/574870.
Texto completoMalherbe, Emmanuel. "Standardization of textual data for comprehensive job market analysis". Thesis, Université Paris-Saclay (ComUE), 2016. http://www.theses.fr/2016SACLC058/document.
With so many job adverts and candidate profiles available online, e-recruitment constitutes a rich object of study. All this information is, however, textual data, which from a computational point of view is unstructured. The large number and heterogeneity of recruitment websites also mean that there are many vocabularies and nomenclatures. One of the difficulties when dealing with this type of raw textual data is grasping the concepts it contains, which is the problem of standardization tackled in this thesis. The aim of standardization is to create a unified process providing values in a nomenclature. A nomenclature is by definition a finite set of meaningful concepts, which means that the attributes resulting from standardization are a structured representation of the information. Several questions are raised, however: Are the websites' structured data usable for a unified standardization? What structure of nomenclature is best suited for standardization, and how can it be leveraged? Is it possible to automatically build such a nomenclature from scratch, or to manage the standardization process without one? To illustrate the various obstacles of standardization, the examples we study include the inference of the skills or the category of a job advert, and the level of training of a candidate profile. One of the challenges of e-recruitment is that the concepts are continuously evolving, which means that standardization must keep up with job market trends. In light of this, we propose a set of machine learning models that require minimal supervision and can easily adapt to the evolution of the nomenclatures. The questions raised found partial answers using Case-Based Reasoning, semi-supervised Learning-to-Rank, latent variable models, and the evolving sources of the semantic web and social media.
The different models proposed have been tested on real-world data before being implemented in an industrial environment. The resulting standardization is at the core of SmartSearch, a project which provides a comprehensive analysis of the job market.
Koeller, Andreas. "Integration of Heterogeneous Databases: Discovery of Meta-Information and Maintenance of Schema-Restructuring Views". Digital WPI, 2002. https://digitalcommons.wpi.edu/etd-dissertations/116.
Kalledat, Tobias. "Tracking domain knowledge based on segmented textual sources". Doctoral thesis, Humboldt-Universität zu Berlin, Wirtschaftswissenschaftliche Fakultät, 2009. http://dx.doi.org/10.18452/15925.
The research work presented here analyses the influence of pre-processing on the results of knowledge generation and gives concrete recommendations for suitable pre-processing of text corpora in TDM. The research focuses on extracting and tracking concepts within certain knowledge domains using an approach that segments corpora horizontally (along the timeline) and vertically (by persistence of terms). The result is a set of corpus segments along the timeline. Within each timeline segment, clusters of concepts can be built according to their persistence relative to each single time-based corpus segment and to the whole corpus. Based on a simple frequency measure, it can be shown that the statistical quality of a single corpus alone suffices to measure the pre-processing quality; comparison corpora are not necessary. In an optimally pre-processed corpus, the time series of the frequency measure show significant negative correlations between the cluster of concepts that occur permanently and the cluster of those that vary; the opposite was found in every test set pre-processed with lower quality. The most frequent terms were grouped into concepts using domain-specific taxonomies. For corpora with a high quality of pre-processing, a significant negative correlation was found between the time series of different terms per yearly corpus segment and the terms assigned to the taxonomy. A semantic analysis based on a simple TDM method with significant frequency thresholds extracted significantly different knowledge from corpora with different pre-processing qualities. With the measures introduced in this research, it is possible to measure the quality of an applied taxonomy. Rules for measuring corpus and taxonomy quality were derived from these results, and advice is given on the appropriate level of pre-processing.
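The abstract does not name the exact correlation measure; as a minimal illustration of the kind of time-series comparison described (assuming a plain Pearson correlation between the yearly frequencies of a persistent and a varying concept cluster), a sketch could look like this:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical yearly frequencies of two concept clusters.
persistent = [120, 118, 121, 119, 120]
varying = [10, 35, 5, 40, 8]
r = pearson(persistent, varying)
```

A strongly negative `r` between the two clusters would, following the abstract, indicate a well pre-processed corpus.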
Franzke, Maximilian [Verfasser] and Matthias [Akademischer Betreuer] Renz. "Querying and mining heterogeneous spatial, social, and temporal data / Maximilian Franzke ; Betreuer: Matthias Renz". München : Universitätsbibliothek der Ludwig-Maximilians-Universität, 2019. http://d-nb.info/1190563630/34.
Wu, Chao. "Intelligent Data Mining on Large-scale Heterogeneous Datasets and its Application in Computational Biology". University of Cincinnati / OhioLINK, 2014. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1406880774.
Koeller, Andreas. "Integration of heterogeneous databases : discovery of meta-information and maintenance of schema-restructuring views". Link to electronic version, 2001. http://www.wpi.edu/Pubs/ETD/Available/etd-0415102-133008/.
UMI no. 30-30945. Keywords: schema restructuring; schema changes; meta-data discovery; data mining; data integration. Includes bibliographical references (leaves 256-274).
Ait, Saada Mira. "Unsupervised learning from textual data with neural text representations". Electronic Thesis or Diss., Université Paris Cité, 2023. http://www.theses.fr/2023UNIP7122.
The digital era generates enormous amounts of unstructured data such as images and documents, requiring specific processing methods to extract value from them. Textual data presents an additional challenge as it does not contain numerical values. Word embeddings are techniques that transform text into numerical data, enabling machine learning algorithms to process it. Unsupervised tasks are a major challenge in industry as they allow value creation from large amounts of data without requiring costly manual labeling. In this thesis, we explore the use of Transformer models for unsupervised tasks such as clustering, anomaly detection, and data visualization. We also propose methodologies to better exploit multi-layer Transformer models in an unsupervised context, to improve the quality and robustness of document clustering while avoiding the choice of which layer to use and of the number of classes. Additionally, we investigate Transformer language models more deeply and their application to clustering, examining in particular transfer learning methods that fine-tune pre-trained models on a different task to improve their quality for future tasks. We demonstrate through an empirical study that post-processing methods based on dimensionality reduction are more advantageous than the fine-tuning strategies proposed in the literature. Finally, we propose a framework for detecting text anomalies in French, adapted to two cases: one where the data concerns a specific topic and one where the data has multiple sub-topics. In both cases, we obtain results superior to the state of the art with significantly lower computation time.
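The thesis's exact aggregation scheme is not given in this summary; one common way to sidestep the layer-choice problem (assumed here purely for illustration) is to average a document's embedding across all Transformer hidden layers before clustering:

```python
def aggregate_layers(layer_embeddings):
    """Average a document's embedding across Transformer layers.
    layer_embeddings: list of per-layer vectors (lists of floats)."""
    n_layers = len(layer_embeddings)
    dim = len(layer_embeddings[0])
    # Component-wise mean over layers yields a single document vector.
    return [sum(layer[i] for layer in layer_embeddings) / n_layers
            for i in range(dim)]

# Hypothetical 3-layer, 4-dimensional embeddings of one document.
layers = [[1.0, 0.0, 2.0, 4.0],
          [3.0, 0.0, 2.0, 0.0],
          [2.0, 3.0, 2.0, 2.0]]
doc_vector = aggregate_layers(layers)  # [2.0, 1.0, 2.0, 2.0]
```

The aggregated vectors would then be fed to any clustering algorithm, avoiding a per-task choice of layer.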
元吉, 忠寛 and Tadahiro MOTOYOSHI. "災害のイマジネーション力に関する探索的研究 - 大学生の想像力と阪神淡路大震災の事例との比較 - [An exploratory study of disaster imagination: comparing university students' imagination with cases from the Great Hanshin-Awaji Earthquake]". 名古屋大学大学院教育発達科学研究科 (Graduate School of Education and Human Development, Nagoya University), 2006. http://hdl.handle.net/2237/9454.
Alencar, Medeiros Gabriel Henrique. "PreDiViD: Towards the Prediction of the Dissemination of Viral Disease contagion in a pandemic setting". Electronic Thesis or Diss., Normandie, 2025. http://www.theses.fr/2025NORMR005.
Event-Based Surveillance (EBS) systems are essential for detecting and tracking emerging health phenomena such as epidemics and public health crises. However, they face limitations, including a strong dependence on human expertise, challenges in processing heterogeneous textual data, and insufficient consideration of spatiotemporal dynamics. To overcome these issues, we propose a hybrid approach combining knowledge-driven and data-driven methodologies, anchored in the Propagation Phenomena Ontology (PropaPhen) and the Description-Detection-Prediction Framework (DDPF), to enhance the description, detection, and prediction of propagation phenomena. PropaPhen is a FAIR ontology designed to model the spatiotemporal spread of phenomena. It has been specialized to the biomedical domain through the integration of UMLS and World-KG, leading to the creation of the BioPropaPhenKG knowledge graph. The DDPF framework consists of three modules: description, which generates domain-specific ontologies; detection, which applies relation extraction techniques to heterogeneous textual sources; and prediction, which uses advanced clustering methods. Tested on COVID-19 and Monkeypox datasets and validated against WHO data, DDPF demonstrated its effectiveness in detecting and predicting spatiotemporal clusters. Its modular architecture ensures scalability and adaptability to various domains, opening perspectives in public health, environmental monitoring, and the study of social phenomena.
Spiegler, Sebastian R. "Comparative study of clustering algorithms on textual databases : clustering of curricula vitae into competency-based groups to support knowledge management /". Saarbrücken : VDM Verl. Müller, 2007. http://deposit.d-nb.de/cgi-bin/dokserv?id=3035354&prov=M&dok_var=1&dok_ext=htm.
Fabbri, Renato. "Topological stability and textual differentiation in human interaction networks: statistical analysis, visualization and linked data". Universidade de São Paulo, 2017. http://www.teses.usp.br/teses/disponiveis/76/76132/tde-11092017-154706/.
This work reports stable (invariant) topological properties and textual differentiation in human interaction networks, with benchmarks derived from public email lists. Activity over time and topology were observed in snapshots along a timeline and at different scales. The analysis shows that activity is practically the same for all networks at temporal scales from seconds to months. The principal components of the participants in the space of topological metrics remain practically unchanged when different sets of messages are considered. Participant activity follows the expected scale-free profile, yielding the vertex classes of hubs, intermediaries, and peripheral vertices in comparison with the Erdös-Rényi model. The relative sizes of these three sectors are essentially the same for all email lists and over time. Typically, 3-12% of vertices are hubs, 15-45% are intermediaries, and 44-81% are peripheral vertices. The texts from each of these sectors are found to be very different, via an adaptation of Kolmogorov-Smirnov tests. These properties are consistent with the literature and may be general for human interaction networks, which has important implications for establishing a typology of participants based on quantitative criteria. To guide and support this research, we also developed a visualization method for dynamic networks through animations. To facilitate verification and further steps in the analyses, we provide a linked data representation of the data related to our results.
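The thesis's adaptation of the Kolmogorov-Smirnov test is not detailed in this summary; as a generic sketch, the plain two-sample KS statistic (maximum distance between empirical CDFs) could be computed over, say, word-length samples drawn from two sectors' texts (the samples below are invented for illustration):

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic:
    maximum distance between the two empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    d = 0.0
    for v in sorted(set(a) | set(b)):
        cdf_a = sum(1 for x in a if x <= v) / len(a)
        cdf_b = sum(1 for x in b if x <= v) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

# Hypothetical word-length samples from messages of two vertex classes.
hubs = [3, 4, 4, 5, 6, 7]
peripheral = [2, 2, 3, 3, 4, 4]
d = ks_statistic(hubs, peripheral)
```

A large statistic (relative to the usual critical values) would support the claim that the sectors' texts differ.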
Nieto, Erick Mauricio Gómez. "Projeção multidimensional aplicada a visualização de resultados de busca textual". Universidade de São Paulo, 2012. http://www.teses.usp.br/teses/disponiveis/55/55134/tde-05122012-105730/.
Internet users are very familiar with the results of a search query displayed as a ranked list of snippets. Each textual snippet shows a content summary of the referred document (or web page) and a link to it. This display has many advantages, e.g., it affords easy navigation and is straightforward to interpret. Nonetheless, any user of search engines could probably report some experience of disappointment with this metaphor. Indeed, it has limitations in particular situations, as it fails to provide an overview of the retrieved document collection. Moreover, depending on the nature of the query - e.g., it may be too general, ambiguous, or ill expressed - the desired information may be poorly ranked, or results may contemplate varied topics. Several search tasks would be easier if users were shown an overview of the returned documents, organized so as to reflect how related they are, content-wise. We propose a visualization technique to display the results of web queries aimed at overcoming such limitations. It combines the neighborhood preservation capability of multidimensional projections with the familiar snippet-based representation, employing a multidimensional projection to derive two-dimensional layouts of the query search results that preserve text similarity relations, or neighborhoods. Similarity is computed by applying the cosine similarity over a bag-of-words vector representation of the collection built from the snippets. If the snippets are displayed directly according to the derived layout they will overlap considerably, producing a poor visualization. We overcome this problem by defining an energy functional that considers both the overlap amongst snippets and the preservation of the neighborhood structure as given in the projected layout. Minimizing this energy functional provides a neighborhood-preserving two-dimensional arrangement of the textual snippets with minimum overlap.
The resulting visualization conveys both a global view of the query results and visual groupings that reflect related results, as illustrated in several examples.
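The snippet similarity described above can be sketched as cosine similarity over simple bag-of-words vectors (a minimal stand-in, ignoring the weighting and preprocessing the thesis may actually use):

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words vectors of two snippets."""
    va = Counter(text_a.lower().split())
    vb = Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

s1 = "multidimensional projection of search results"
s2 = "projection of query search results"
sim = cosine_similarity(s1, s2)
```

These pairwise similarities form the input to the multidimensional projection that places related snippets near each other in the 2D layout.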
Driscoll, Timothy. "Host-Microbe Relations: A Phylogenomics-Driven Bioinformatic Approach to the Characterization of Microbial DNA from Heterogeneous Sequence Data". Diss., Virginia Tech, 2013. http://hdl.handle.net/10919/50921.
Ph. D.
Alonso, Gonzalez Kevin [Verfasser], Gerhard [Akademischer Betreuer] [Gutachter] Rigoll and Mihai [Gutachter] Datcu. "Heterogeneous Data Mining of Earth Observation Archives: Integration and Fusion of Images, Maps, and In-situ Data / Kevin Alonso Gonzalez ; Gutachter: Gerhard Rigoll, Mihai Datcu ; Betreuer: Gerhard Rigoll". München : Universitätsbibliothek der TU München, 2017. http://d-nb.info/1132774195/34.
Mrowca, Artur [Verfasser], Stephan [Akademischer Betreuer] Günnemann, Stephan [Gutachter] Günnemann and Sebastian [Gutachter] Steinhorst. "Specification Mining in High dimensional Heterogeneous Data Sets of Large-Scale Distributed Systems / Artur Mrowca ; Gutachter: Stephan Günnemann, Sebastian Steinhorst ; Betreuer: Stephan Günnemann". München : Universitätsbibliothek der TU München, 2021. http://d-nb.info/1234149125/34.
Kamenieva, Iryna. "Research Ontology Data Models for Data and Metadata Exchange Repository". Thesis, Växjö University, School of Mathematics and Systems Engineering, 2009. http://urn.kb.se/resolve?urn=urn:nbn:se:vxu:diva-6351.
For research in the fields of data mining and machine learning, a necessary condition is the availability of various input data sets. Researchers now create databases of such sets; examples include the UCI Machine Learning Repository, the Data Envelopment Analysis Dataset Repository, the XMLData Repository, and the Frequent Itemset Mining Dataset Repository. Alongside these statistical repositories, a whole range of storage systems, from simple file stores to specialized repositories, can be used by researchers when solving applied tasks and investigating their own algorithms and scientific problems. At first glance, the only difficulty for the user would seem to be finding and understanding the structure of such scattered information stores. However, a detailed study of these repositories reveals deeper problems in data usage: in particular, a complete mismatch between the rigid structure of data files and the SDMX (Statistical Data and Metadata Exchange) standard and structures used by many European organizations, the impossibility of pre-adapting data to a concrete applied task, and the lack of a history of data usage for particular scientific and applied tasks.
There are now many data mining (DM) methods, as well as large quantities of data stored in various repositories. The repositories, however, contain no DM methods, and the methods are not linked to application areas. An essential problem is linking the subject (problem) domain, the DM methods, and the datasets appropriate for each method. In this work we therefore consider the problem of building ontological models of DM methods, describing the interaction of methods with the corresponding data from repositories, and designing intelligent agents that allow the user of a statistical repository to choose the appropriate method and data for the task being solved. We propose a system structure and implement an intelligent search agent over the ontological model of DM methods that takes the user's personal queries into account.
For the implementation of an intelligent data and metadata exchange repository, an agent-oriented approach was selected. The model uses a service-oriented architecture. The implementation uses the cross-platform programming language Java, the multi-agent platform Jadex, the database server Oracle Spatial 10g, and the ontology development environment Protégé version 3.4.
Mendes, Marília Soares. "MALTU - model for evaluation of interaction in social systems from the Users Textual Language". Universidade Federal do Ceará, 2015. http://www.teses.ufc.br/tde_busca/arquivo.php?codArquivo=14296.
The field of Human-Computer Interaction (HCI) has suggested many ways to evaluate systems in order to improve their usability and the User eXperience (UX). The emergence of web 2.0 enabled the development of applications marked by collaboration, communication, and interactivity among their users in a way, and on a scale, never observed before. Social Systems (SS) (e.g., Twitter, Facebook, MySpace, LinkedIn) are examples of such applications, with characteristics such as frequent message exchange and spontaneous expression of feelings. The opportunities and challenges brought by these kinds of applications require traditional evaluation methods to be rethought in light of these new characteristics. For instance, users' postings on SS reveal their opinions on various subjects, including what they think of the system in use. This thesis tests the hypothesis that users' postings on SS provide relevant data for the evaluation of Usability and UX (UUX) in SS. In the literature, we did not identify any evaluation model focused on collecting and analyzing users' postings in order to evaluate the UUX of a system in use. This study therefore proposes MALTU, a Model for the evaluation of interaction in social systems from the Users' Textual Language. To provide a basis for the development of the proposed model, we studied how users express their opinions about a system in natural language. We extracted user postings from four SS of distinct contexts. Those postings were classified by HCI experts, studied, processed using Natural Language Processing (NLP) and data mining techniques, and analyzed in order to obtain a generic model. MALTU was applied to two SS: an entertainment system and an educational SS.
The results show that it is possible to evaluate a system from users' postings on SS. Such evaluations are aided by extraction patterns related to the use, the types of postings, and the HCI goals used in the evaluation of the system.
Egho, Elias. "Extraction de motifs séquentiels dans des données séquentielles multidimensionnelles et hétérogènes : une application à l'analyse de trajectoires de patients". Thesis, Université de Lorraine, 2014. http://www.theses.fr/2014LORR0066/document.
All domains of science and technology produce large and heterogeneous data. Although a lot of work has been done in this area, mining such data is still a challenge. No previous research work targets the mining of heterogeneous multidimensional sequential data. This thesis proposes a contribution to knowledge discovery in heterogeneous sequential data. We study three different research directions: (i) extraction of sequential patterns, (ii) classification, and (iii) clustering of sequential data. Firstly, we generalize the notion of a multidimensional sequence by considering complex and heterogeneous sequential structures. We present a new approach called MMISP to extract sequential patterns from heterogeneous sequential data. MMISP generates a large number of sequential patterns, as is usually the case for pattern enumeration algorithms. To overcome this problem, we propose a novel way of handling heterogeneous multidimensional sequences by mapping them into pattern structures, and we develop a framework for enumerating only patterns satisfying given constraints. The second research direction concerns the classification of heterogeneous multidimensional sequences. We use Formal Concept Analysis (FCA) as a classification method, and we show interesting properties of concept lattices and of the stability index for classifying sequences into a concept lattice and selecting interesting groups of sequences. The third research direction concerns the clustering of heterogeneous multidimensional sequential data. We focus on the notion of common subsequences to define similarity between a pair of sequences composed of lists of itemsets, and we use this similarity measure to build a similarity matrix between sequences and to separate them into different groups.
In this work, we present theoretical results and an efficient dynamic programming algorithm to count the number of common subsequences between two sequences without enumerating all subsequences. The system resulting from this research work was applied to analyze and mine patient healthcare trajectories in oncology. Data are taken from a medico-administrative database including all information about the hospitalizations of patients in the Lorraine Region (France). The system allows identifying and characterizing episodes of care for specific sets of patients. Results were discussed and validated with domain experts.
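The counting algorithm itself is not reproduced in this summary; for plain symbol sequences (setting aside the thesis's itemset generalization), the standard dynamic program for counting distinct common subsequences looks like this:

```python
def count_common_subsequences(a, b):
    """Count distinct common subsequences of sequences a and b,
    including the empty subsequence, via dynamic programming."""
    m, n = len(a), len(b)
    # dp[i][j] = number of distinct common subsequences of a[:i] and b[:j];
    # row/column 0 count only the empty subsequence.
    dp = [[1] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Inclusion-exclusion over dropping the last element of a or b.
            dp[i][j] = dp[i - 1][j] + dp[i][j - 1] - dp[i - 1][j - 1]
            if a[i - 1] == b[j - 1]:
                # Matching last elements extend every common subsequence
                # of the two prefixes.
                dp[i][j] += dp[i - 1][j - 1]
    return dp[m][n]

count = count_common_subsequences("abc", "abc")  # 8: all subsequences of "abc"
```

The count runs in O(mn) time, avoiding the exponential enumeration of all subsequences, which is the key point made in the abstract.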
Ammari, Ahmad N. "Transforming user data into user value by novel mining techniques for extraction of web content, structure and usage patterns : the development and evaluation of new Web mining methods that enhance information retrieval and improve the understanding of users' Web behavior in websites and social blogs". Thesis, University of Bradford, 2010. http://hdl.handle.net/10454/5269.
Mazoyer, Béatrice. "Social Media Stories. Event detection in heterogeneous streams of documents applied to the study of information spreading across social and news media". Thesis, université Paris-Saclay, 2020. http://www.theses.fr/2020UPASC009.
Social media, and Twitter in particular, has become a privileged source of information for journalists in recent years. Most of them monitor Twitter in the search for newsworthy stories. This thesis aims to investigate and quantify the effect of this technological change on editorial decisions. Does the popularity of a story affect the way it is covered by traditional news media, regardless of its intrinsic interest? To highlight this relationship, we take a multidisciplinary approach at the crossroads of computer science and economics. First, we design a novel approach to collect a representative sample of 70% of all French tweets emitted during an entire year. Second, we study different types of algorithms to automatically discover tweets that relate to the same stories, testing several vector representations of tweets based on both text and text-image representations. Third, we design a new method to group together Twitter events and media events. Finally, we design an econometric instrument to identify a causal effect of the popularity of an event on Twitter on its coverage by traditional media. We show that the popularity of a story on Twitter does have an effect on the number of articles devoted to it by traditional media, with an increase of about one article per 1000 additional tweets.
Egho, Elias. "Extraction de motifs séquentiels dans des données séquentielles multidimensionnelles et hétérogènes : une application à l'analyse de trajectoires de patients". Electronic Thesis or Diss., Université de Lorraine, 2014. http://www.theses.fr/2014LORR0066.
All domains of science and technology produce large and heterogeneous data. Although a lot of work has been done in this area, mining such data is still a challenge. No previous research work targets the mining of heterogeneous multidimensional sequential data. This thesis proposes a contribution to knowledge discovery in heterogeneous sequential data. We study three different research directions: (i) extraction of sequential patterns, (ii) classification and (iii) clustering of sequential data. First, we generalize the notion of a multidimensional sequence by considering complex and heterogeneous sequential structures. We present a new approach called MMISP to extract sequential patterns from heterogeneous sequential data. MMISP generates a large number of sequential patterns, as is usually the case for pattern enumeration algorithms. To overcome this problem, we propose a novel way of considering heterogeneous multidimensional sequences by mapping them into pattern structures. We develop a framework for enumerating only the patterns satisfying given constraints. The second research direction concerns the classification of heterogeneous multidimensional sequences. We use Formal Concept Analysis (FCA) as a classification method. We show interesting properties of concept lattices and of the stability index to classify sequences into a concept lattice and to select interesting groups of sequences. The third research direction concerns the clustering of heterogeneous multidimensional sequential data. We focus on the notion of common subsequences to define the similarity between a pair of sequences composed of lists of itemsets. We use this similarity measure to build a similarity matrix between sequences and to separate them into different groups.
In this work, we present theoretical results and an efficient dynamic programming algorithm to count the number of common subsequences between two sequences without enumerating all subsequences. The system resulting from this research work was applied to analyze and mine patient healthcare trajectories in oncology. Data are taken from a medico-administrative database including all information about the hospitalizations of patients in the Lorraine Region (France). The system makes it possible to identify and characterize episodes of care for specific sets of patients. Results were discussed and validated with domain experts.
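The idea of counting common subsequences without enumerating them can be sketched for plain symbol sequences (the thesis itself handles sequences of itemsets, which need a more elaborate recurrence). The trick below counts each distinct subsequence exactly once by always extending through the first occurrence of each symbol, giving a polynomial-time dynamic program.

```python
from functools import lru_cache

def count_common_subsequences(s, t):
    """Count distinct non-empty subsequences common to s and t,
    without enumerating them."""
    def next_occurrence(u):
        # nxt[i][c] = smallest index >= i where symbol c occurs in u
        nxt = [dict() for _ in range(len(u) + 1)]
        for i in range(len(u) - 1, -1, -1):
            nxt[i] = dict(nxt[i + 1])
            nxt[i][u[i]] = i
        return nxt

    ns, nt = next_occurrence(s), next_occurrence(t)
    alphabet = set(s) & set(t)

    @lru_cache(maxsize=None)
    def count(i, j):
        # distinct common subsequences (incl. empty) of s[i:] and t[j:];
        # extending via the *first* occurrence of each symbol ensures
        # every subsequence is counted exactly once
        total = 1
        for c in alphabet:
            if c in ns[i] and c in nt[j]:
                total += count(ns[i][c] + 1, nt[j][c] + 1)
        return total

    return count(0, 0) - 1  # drop the empty subsequence
```

The state space is O(|s|·|t|) with O(|alphabet|) work per state, so no exponential enumeration is needed.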
Ataky, Steve Tsham Mpinda. "Análise de dados sequenciais heterogêneos baseada em árvore de decisão e modelos de Markov : aplicação na logística de transporte". Universidade Federal de São Carlos, 2015. https://repositorio.ufscar.br/handle/ufscar/7242.
Made available in DSpace on 2016-09-16T19:59:41Z (GMT). No. of bitstreams: 1 DissSATM.pdf: 3079104 bytes, checksum: 51b46ffeb4387370e30fb92e31771606 (MD5) Previous issue date: 2015-10-16
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES)
In recent years, data mining techniques have been developed in many application fields with the aim of analyzing large volumes of data, which may be simple and/or complex. The logistics of transport, and the railway sector in particular, is an area with such characteristics, in that the available data are of varied natures (classic variables such as top speed or type of train, symbolic variables such as the set of routes traveled by a train, degree of adherence, etc.). As part of this dissertation, we address the problem of classifying and predicting heterogeneous data through two main approaches. First, an automatic classification approach was implemented based on the classification tree technique, which also allows new data to be efficiently integrated into previously initialized partitions. The second contribution of this work concerns the analysis of sequential data. We propose to combine the above classification method with Markov models to obtain a partition of temporal sequences into homogeneous and significant groups based on probabilities. The resulting model offers a good interpretation of the classes built and allows us to estimate the evolution of the sequences of a particular vehicle. Both approaches were then applied to real data from a Brazilian railway information system company, in the spirit of supporting strategic management with coherent planning and prediction. This work first provides a finer typology of planning to solve the problems associated with the existing classification into homogeneous circulation groups. Second, it seeks to define a typology of train paths (successive runs of the same train) in order to provide or predict the statistical characteristics of the next run of a train following the same route. The general methodology provides a decision-support environment to monitor and control the planning organization.
Accordingly, a formula with two variants was proposed to calculate the degree of adherence between the route actually carried out, or being carried out, and the planned one.
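The second approach, combining per-group Markov models with likelihood-based assignment of trajectories, can be sketched minimally as follows. The group labels and state symbols here are hypothetical illustrations; the actual system couples this with the classification-tree partition described above.

```python
import math
from collections import defaultdict

def fit_markov(sequences, smoothing=1.0):
    """Estimate first-order Markov transition probabilities from
    sequences of discrete states (e.g. route segments), with
    Laplace smoothing to avoid zero probabilities."""
    counts = defaultdict(lambda: defaultdict(float))
    states = set()
    for seq in sequences:
        states.update(seq)
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1.0
    states = sorted(states)
    trans = {}
    for a in states:
        total = sum(counts[a].values()) + smoothing * len(states)
        trans[a] = {b: (counts[a][b] + smoothing) / total for b in states}
    return trans

def log_likelihood(seq, trans, floor=1e-9):
    """Log-probability of a sequence under a fitted transition model;
    transitions unseen by the model get a small floor probability."""
    ll = 0.0
    for a, b in zip(seq, seq[1:]):
        ll += math.log(trans.get(a, {}).get(b, floor))
    return ll

def assign_group(seq, models):
    """Assign a trajectory to the group whose Markov model scores it highest."""
    return max(models, key=lambda g: log_likelihood(seq, models[g]))
```

Fitting one model per homogeneous group and scoring a new trajectory against each gives both a group assignment and a probabilistic estimate of the most likely next run.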
Valentin, Sarah. "Extraction et combinaison d’informations épidémiologiques à partir de sources informelles pour la veille des maladies infectieuses animales". Thesis, Montpellier, 2020. http://www.theses.fr/2020MONTS067.
Epidemic intelligence aims to detect, investigate and monitor potential health threats while relying on formal (e.g. official health authorities) and informal (e.g. media) information sources. Monitoring of unofficial sources, so-called event-based surveillance (EBS), requires the development of systems designed to retrieve and process unstructured textual data published online. This manuscript focuses on the extraction and combination of epidemiological information from informal sources (i.e. online news), in the context of the international surveillance of animal infectious diseases. The first objective of this thesis is to propose and compare approaches to enhance the identification and extraction of relevant epidemiological information from the content of online news. The second objective is to study the use of epidemiological entities extracted from news articles (i.e. diseases, hosts, locations and dates) for event extraction and for the retrieval of related online news. This manuscript proposes new textual representation approaches that select, expand, and combine relevant epidemiological features. We show that adapting and extending text mining and classification methods improves the added value of online news sources for event-based surveillance. We stress the role of domain expert knowledge in the relevance and interpretability of the methods proposed in this thesis. While our research is conducted in the context of animal disease surveillance, we discuss the generic aspects of our approaches with regard to unknown threats and One Health surveillance.
Richard, Jérémy. "De la capture de trajectoires de visiteurs vers l’analyse interactive de comportement après enrichissement sémantique". Electronic Thesis or Diss., La Rochelle, 2023. http://www.theses.fr/2023LAROS012.
This thesis focuses on the behavioral study of tourist activity using a generic and interactive analysis approach. The analytical process developed concerns tourist trajectories in cities and museums as the field of study. Experiments were conducted to collect movement data in a tourist city using GPS signals, thus enabling the acquisition of movement trajectories. However, the study primarily focuses on reconstructing a visitor's trajectory in museums using indoor positioning equipment, i.e., in a constrained environment. A generic multi-aspect semantic enrichment model is then developed to supplement an individual's trajectory with multiple kinds of context data, such as the names of the neighborhoods the individual passed through in the city, museum rooms, outdoor weather, and indoor mobile application data. The enriched trajectories, called semantic trajectories, are then analyzed using formal concept analysis and the GALACTIC platform, which enables complex and heterogeneous data structures to be analyzed as a hierarchy of subgroups of individuals sharing common behaviors. Finally, attention is paid to the "ReducedContextCompletion" algorithm, which enables interactive navigation in a concept lattice, allowing the data analyst to focus on the aspects of the data they wish to explore.
Jiang, Xinxin. "Mining heterogeneous enterprise data". Thesis, 2018. http://hdl.handle.net/10453/129377.
Heterogeneity is becoming one of the key characteristics of enterprise data, because globalization and competition stress the importance of leveraging huge amounts of accumulated enterprise data, across various organizational processes, resources and standards. Effectively deriving meaningful insights from complex, large-scale heterogeneous enterprise data poses an interesting but critical challenge. The aim of this thesis is to investigate the theoretical foundations of mining heterogeneous enterprise data in light of the above challenges and to develop new algorithms and frameworks that are able to effectively and efficiently handle heterogeneity in four elements of the data: objects, events, context, and domains. Objects describe a variety of business roles and instruments involved in business systems. Object heterogeneity means that object information at both the data and structural level is heterogeneous. The proposed cost-sensitive hybrid neural network (Cs-HNN) leverages parallel network architectures and an algorithm specifically designed for minority classification to generate a robust model for learning heterogeneous objects. Events trace an object's behaviours or activities. Event heterogeneity reflects the level of variety in business events and is normally expressed in the type and format of features. The approach proposed in this thesis focuses on fleet tracking as a practical example of an application with a high degree of event heterogeneity. Context describes the environment and circumstances surrounding objects and events. Context heterogeneity reflects the degree of diversity in contextual features. The coupled collaborative filtering (CCF) approach proposed in this thesis provides context-aware recommendations by measuring the non-independent and identically distributed (non-IID) relationships across diverse contexts.
Domains are the sources of information and reflect the nature of the business or function that has generated the data. The cross-domain deep learning approach (Cd-DLA) proposed in this thesis provides a potential avenue to overcome the complexity and nonlinearity of heterogeneous domains. Each of the approaches, algorithms, and frameworks for heterogeneous enterprise data mining presented in this thesis outperforms the state-of-the-art methods in a range of backgrounds and scenarios, as evidenced by a theoretical analysis, an empirical study, or both. All outcomes derived from this research have been published or accepted for publication, and the follow-up work has also been recognised, which demonstrates scholarly interest in mining heterogeneous enterprise data as a research topic. However, despite this interest, heterogeneous data mining still offers attractive opportunities for further exploration and development in both academia and industry.
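Cost-sensitive learning, the core idea behind the Cs-HNN, can be illustrated with a deliberately simple stand-in: logistic regression whose gradient is weighted by per-class misclassification costs, so that errors on the minority class count more. This is a generic sketch of the technique, not the parallel neural architecture proposed in the thesis, and the cost values are illustrative assumptions.

```python
import math

def train_cost_sensitive(xs, ys, costs, lr=0.1, epochs=200):
    """Logistic regression trained by SGD with per-class costs:
    `costs` maps each class label (0 or 1) to a weight applied to
    that class's gradient, shifting the decision boundary toward
    the cheaper class."""
    w = [0.0] * len(xs[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            grad = costs[y] * (p - y)  # cost-weighted gradient
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b
```

With, say, `costs={0: 1.0, 1: 10.0}`, missing a minority-class (label 1) example is penalized ten times more than a false alarm, which is the usual remedy for imbalanced data.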
Bharti, Santosh Kumar. "Sarcasm Detection in Textual Data: A Supervised Approach". Thesis, 2019. http://ethesis.nitrkl.ac.in/10002/1/2019_PhD_SKBharti_513CS1037_Sarcasm.pdf.
Hasan, Maryam. "Extracting Structured Knowledge from Textual Data in Software Repositories". Master's thesis, 2011. http://hdl.handle.net/10048/1776.
Sarkas, Nikolaos. "Querying, Exploring and Mining the Extended Document". Thesis, 2011. http://hdl.handle.net/1807/29857.
Dev, Kapil. "Spatio-Textual Similarity Joins Using Variable Prefix Filtering and MBR Filtering". Thesis, 2016. http://ethesis.nitrkl.ac.in/9110/1/2016_MT_KDev.pdf.
Texto completoChen, Meng-Peng y 陳夢芃. "Using Data Mining Techniques for Studying the Consumer Heterogeneous Needs". Thesis, 2014. http://ndltd.ncl.edu.tw/handle/70574846603697298093.
National Chin-Yi University of Technology
Department of Industrial Engineering and Management
102
Bookstores face fierce horizontal competition. Customer buying behavior is often recorded only as book sales data, and studies of customer-base marketing strategy often ignore that people with different lifestyles show different reading tendencies and preferences. The basic conditions for improving bookstore sales are to provide a variety of products or books, to set reasonable prices that meet the needs of different customers, and to offer good service. But the reading customer's original intention is to obtain information, not the books themselves. If bookstores can identify the real needs of consumers during the buying process, make proposals tailored to different consumers, and introduce marketing activities that are more attractive to them, this becomes an important topic for influencing buying behavior. In this study, we use association analysis from data mining to identify heterogeneous customer needs, capture customers' reading preferences, and develop marketing programs and displays. Providing a customized product portfolio in a cross-selling mode improves customer satisfaction and purchase loyalty. The resulting research report will help the industry regain intimacy with readers and understand new relationships with customers.
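The association analysis mentioned above can be sketched as a minimal market-basket routine that extracts "A -> B" rules between pairs of items with support and confidence thresholds. The item names below are invented for illustration; the study applies this to actual book sales transactions.

```python
from collections import Counter
from itertools import combinations

def association_rules(transactions, min_support=0.2, min_confidence=0.5):
    """Find single-item 'A -> B' rules from market-basket transactions.
    support(A, B) = fraction of baskets containing both items;
    confidence(A -> B) = support(A, B) / support(A)."""
    n = len(transactions)
    item_count = Counter()
    pair_count = Counter()
    for basket in transactions:
        items = set(basket)
        item_count.update(items)
        pair_count.update(combinations(sorted(items), 2))
    rules = []
    for (a, b), c in pair_count.items():
        support = c / n
        if support < min_support:
            continue
        for x, y in ((a, b), (b, a)):  # rules in both directions
            confidence = c / item_count[x]
            if confidence >= min_confidence:
                rules.append((x, y, support, confidence))
    return rules
```

A full Apriori implementation would extend this to larger itemsets by iteratively pruning infrequent candidates; pairs are enough to show the support/confidence mechanics.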
Chopra, Pankaj. "Data mining techniques to enable large-scale exploratory analysis of heterogeneous scientific data". 2009. http://www.lib.ncsu.edu/theses/available/etd-04092009-161454/unrestricted/etd.pdf.
Dlamini, Phezulu and 佩祖露. "Mining Textual Relationships from Social Media Data for Users’ E-Learning Experiences". Thesis, 2017. http://ndltd.ncl.edu.tw/handle/r4v6xc.
"All Purpose Textual Data Information Extraction, Visualization and Querying". Master's thesis, 2018. http://hdl.handle.net/2286/R.I.50530.
Texto completoDissertation/Thesis
Masters Thesis Software Engineering 2018
"Learning from the Data Heterogeneity for Data Imputation". Doctoral diss., 2021. http://hdl.handle.net/2286/R.I.64299.
Dissertation/Thesis
Doctoral Dissertation Computer Engineering 2021
Louis, Anita Lily. "Unsupervised discovery of relations for analysis of textual data in digital forensics". Diss., 2010. http://hdl.handle.net/2263/27479.
Dissertation (MSc)--University of Pretoria, 2010.
Computer Science
Hung, Cheng-chih and 鄭志宏. "An Attribute-based Approach for Data Mining in the Environment of Heterogeneous Information Sources". Thesis, 1998. http://ndltd.ncl.edu.tw/handle/23514594121163647287.
National Taiwan University of Science and Technology
Graduate Institute of Electronic Engineering Technology
86
Concerning the puzzles faced today in the process of KDD and data mining, we propose an integrated approach for dealing with the related problems in a heterogeneous information environment. Our approach employs a mediator-based mechanism to construct an integrated and reconciled data warehouse according to the results of the mediator's operation. The data warehouse can therefore serve as a target database during the data mining process. The data mining system itself is centered on an attribute-based reasoning mechanism. First, we translate the initial data into conceptual information according to the attribute data via a generalization process. During generalization, we simultaneously analyze statistics on data quantity, which are available for quantitative analysis and noise control. Finally, we represent the resulting information in a logical formalism. In this work, data are mined with four kinds of rules: characterization, discrimination, classification, and association rules. Each type of rule can be used in different subject-oriented applications. In addition, we define a simplified data mining language based on the characteristics of the data mining system and the data mining process. The language can easily be used for data mining and for manipulating database systems.
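The generalization step of attribute-based reasoning can be sketched as follows: each attribute value is climbed to a higher-level concept via a concept hierarchy, and identical generalized tuples are merged while keeping counts for quantitative analysis. The hierarchies and values below are hypothetical examples, not the thesis's actual data.

```python
from collections import Counter

def generalize(tuples, hierarchies):
    """Attribute-oriented induction sketch: hierarchies[i] maps a
    low-level value of attribute i to its higher-level concept
    (values without a mapping are kept as-is). Generalized tuples
    that become identical are merged, and their counts retained so
    that quantitative (e.g. characterization) rules can be derived."""
    generalized = Counter()
    for row in tuples:
        key = tuple(hierarchies[i].get(v, v) for i, v in enumerate(row))
        generalized[key] += 1
    return generalized
```

In a full system the climbing would repeat until the number of distinct generalized tuples falls below an attribute threshold; one level is enough to show the merge-and-count mechanics.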
Morgado, João Pedro Barreiro Gomes e. "Knowledge elicitation by merging heterogeneous data sources in a die-casting process". Master's thesis, 2015. http://hdl.handle.net/10316/39072.
In order to establish adaptive control of a manufacturing process, knowledge must be acquired about both the process and its environment. This knowledge can be obtained by mining large amounts of data collected through the monitoring of the manufacturing process. This enables the study of process parameters and of the correlations among process parameters and between process and environment parameters. Through this, knowledge about the process and its relation to the environment can be established and, in turn, used for adaptive process control. The aim of this thesis is to study real manufacturing data obtained through the monitoring of a die-casting process. First, in order to better understand the problem at hand, a literature review on the use of big data and on merging data from heterogeneous sources is given. Second, because the real data are incomplete and noisy, a robust algorithm to assess the quality rate was developed. The process and environment data were then merged. In this way it is possible to visualize the influence of various parameters on the quality rate and to make suggestions for improving the die-casting process.
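A robust quality-rate computation of the kind described, tolerating incomplete and noisy records, might look like the following sketch. The thesis does not spell out its filtering rules, so the sanity checks here (missing values, negative counts, good exceeding total) are assumptions made for illustration.

```python
def robust_quality_rate(good, total):
    """Aggregate quality rate over production records, tolerating
    missing or noisy counts: records with missing values or
    impossible counts (good > total, negatives, zero totals) are
    skipped rather than allowed to corrupt the estimate."""
    g_sum = t_sum = 0
    for g, t in zip(good, total):
        if g is None or t is None or t <= 0 or g < 0 or g > t:
            continue  # skip incomplete or implausible records
        g_sum += g
        t_sum += t
    return g_sum / t_sum if t_sum else None
```

Returning `None` when no record survives filtering lets the caller distinguish "no usable data" from a genuine 0% quality rate.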
"Stock market forecasting by integrating time-series and textual information". 2003. http://library.cuhk.edu.hk/record=b5896089.
Thesis (M.Phil.)--Chinese University of Hong Kong, 2003.
Includes bibliographical references (leaves 88-93).
Abstracts in English and Chinese.
Abstract (English) --- p.i
Abstract (Chinese) --- p.ii
Acknowledgement --- p.iii
Contents --- p.v
List of Figures --- p.ix
List of Tables --- p.x
Chapter Part I --- The Very Beginning --- p.1
Chapter 1 --- Introduction --- p.2
Chapter 1.1 --- Contributions --- p.3
Chapter 1.2 --- Dissertation Organization --- p.4
Chapter 2 --- Problem Formulation --- p.6
Chapter 2.1 --- Defining the Prediction Task --- p.6
Chapter 2.2 --- Overview of the System Architecture --- p.8
Chapter Part II --- Literature Review --- p.11
Chapter 3 --- The Social Dynamics of Financial Markets --- p.12
Chapter 3.1 --- The Collective Behavior of Groups --- p.13
Chapter 3.2 --- Prediction Based on Publicity Information --- p.16
Chapter 4 --- Time Series Representation --- p.20
Chapter 4.1 --- Technical Analysis --- p.20
Chapter 4.2 --- Piecewise Linear Approximation --- p.23
Chapter 5 --- Text Classification --- p.27
Chapter 5.1 --- Document Representation --- p.28
Chapter 5.2 --- Document Pre-processing --- p.30
Chapter 5.3 --- Classifier Construction --- p.31
Chapter 5.3.1 --- Naive Bayes (NB) --- p.31
Chapter 5.3.2 --- Support Vectors Machine (SVM) --- p.33
Chapter Part III --- Mining Financial Time Series and Textual Documents Concurrently --- p.36
Chapter 6 --- Time Series Representation --- p.37
Chapter 6.1 --- Discovering Trends on the Time Series --- p.37
Chapter 6.2 --- t-test Based Split and Merge Segmentation Algorithm - Splitting Phase --- p.39
Chapter 6.3 --- t-test Based Split and Merge Segmentation Algorithm - Merging Phase --- p.41
Chapter 7 --- Article Alignment and Pre-processing --- p.43
Chapter 7.1 --- Aligning News Articles to the Stock Trends --- p.44
Chapter 7.2 --- Selecting Positive Training Examples --- p.46
Chapter 7.3 --- Selecting Negative Training Examples --- p.48
Chapter 8 --- System Learning --- p.52
Chapter 8.1 --- Similarity Based Classification Approach --- p.53
Chapter 8.2 --- Category Sketch Generation --- p.55
Chapter 8.2.1 --- Within-Category Coefficient --- p.55
Chapter 8.2.2 --- Cross-Category Coefficient --- p.56
Chapter 8.2.3 --- Average-Importance Coefficient --- p.57
Chapter 8.3 --- Document Sketch Generation --- p.58
Chapter 9 --- System Operation --- p.60
Chapter 9.1 --- System Operation --- p.60
Chapter Part IV --- Results and Discussions --- p.62
Chapter 10 --- Evaluations --- p.63
Chapter 10.1 --- Time Series Evaluations --- p.64
Chapter 10.2 --- Classifier Evaluations --- p.64
Chapter 10.2.1 --- Batch Classification Evaluation --- p.69
Chapter 10.2.2 --- Online Classification Evaluation --- p.71
Chapter 10.2.3 --- Components Analysis --- p.74
Chapter 10.2.4 --- Document Sketch Analysis --- p.75
Chapter 10.3 --- Prediction Evaluations --- p.75
Chapter 10.3.1 --- Simulation Results --- p.77
Chapter 10.3.2 --- Hit Rate Analysis --- p.78
Chapter Part V --- The Final Words --- p.80
Chapter 11 --- Conclusion and Future Work --- p.81
Appendix --- p.84
Chapter A --- Hong Kong Stocks Categorization Powered by Reuters --- p.84
Chapter B --- Morgan Stanley Capital International (MSCI) Classification --- p.85
Chapter C --- "Precision, Recall and F1 measure" --- p.86
Bibliography --- p.88