
Dissertations / Theses on the topic 'Heterogeneous Textual Data Mining'

Consult the top 50 dissertations / theses for your research on the topic 'Heterogeneous Textual Data Mining.'


1

Saneifar, Hassan. "Locating Information in Heterogeneous log files." Thesis, Montpellier 2, 2011. http://www.theses.fr/2011MON20092/document.

Full text
Abstract:
In this thesis, we present contributions to the challenging issues encountered in question answering and in locating information in complex textual data such as log files. Question answering systems (QAS) aim to find a relevant fragment of a document that can be regarded as the best possible concise answer to a question given by a user. In this work, we propose a complete solution for locating information in a special kind of textual data: log files generated by EDA design tools. Nowadays, in many application areas, modern computing systems are instrumented to generate huge reports about occurring events in the form of log files. Log files are generated in every computing field to report the status of systems and products, or even the causes of problems that can occur; they may also include data about critical parameters, sensor outputs, or a combination of those. Analyzing log files, as an attractive approach to automatic system management and monitoring, has been enjoying a growing amount of attention [Li et al., 2005]. Although the process of generating log files is quite simple and straightforward, log file analysis can be a tremendous task that requires enormous computational resources, long running times and sophisticated procedures [Valdman, 2004]. Indeed, many kinds of log files generated in some application domains are not systematically exploited in an efficient way because of their special characteristics. In this thesis, we are mainly interested in log files generated by Electronic Design Automation (EDA) systems. Electronic design automation is a category of software tools for designing electronic systems such as printed circuit boards and Integrated Circuits (IC). In this domain, design quality is ensured by verifying a set of quality check rules, which is principally done by analyzing the generated log files. For large designs, the design tools may generate megabytes or gigabytes of log files each day, and the problem is to wade through all of this data to locate the critical information needed to verify the quality check rules. Since these log files typically contain a substantial amount of data, manually locating information is a tedious and cumbersome process. Furthermore, the particular characteristics of log files, especially those generated by EDA design tools, raise significant challenges for retrieving information from them: their heterogeneous and evolving structures and their large, non-fixed vocabulary limit the usefulness of manual analysis techniques and static methods, and make automated analysis complex. Throughout this work we investigate the main concern "how do the specificities of log files influence information extraction and natural language processing methods?". In this context, a key challenge is to provide approaches that take the specificities of log files into account while considering the issues specific to QA in restricted domains. Our contributions are the following:
- A novel method to recognize and identify the logical units in log files in order to perform a segmentation according to their structure. We characterize the complex logical units found in log files according to their syntactic characteristics and, within this approach, propose an original type of descriptor that models the textual structure and layout of text documents.
- An approach for locating the requested information in log files based on passage retrieval. To improve retrieval performance, we propose a novel query expansion approach that adapts an initial query to all types of corresponding log files and overcomes difficulties such as vocabulary mismatch. It relies on two relevance feedback steps: in the first, we determine explicit relevance feedback by identifying the context of the question; the second consists of a novel type of pseudo relevance feedback based on a new term weighting function, called TRQ (Term Relatedness to Query), which scores corpus terms according to their relatedness to the query (a rough sketch of this idea follows the contribution list). We also investigate how to apply this query expansion approach to documents from general domains.
- A study of the use of morpho-syntactic knowledge in our approaches. For this purpose, we extract the terminology of log files with Exterlog (EXtraction of TERminology from LOGs), an approach adapted to the specificities of logs that extracts terms according to syntactic patterns. To evaluate the extracted terms and choose the most relevant ones, we propose an automatic validation protocol that combines a Web-based measure with statistical measures, while taking the specialized context of logs into account.
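
A rough sketch of the query-expansion idea, for illustration only: candidate terms from relevance-feedback passages are scored by how strongly they co-occur with the query, and the top-scoring terms are appended to the query. The scoring formula below is a hypothetical stand-in, not the thesis's actual TRQ definition.

```python
def trq_like_score(term, query_terms, passages):
    # Hypothetical stand-in for TRQ: the fraction of feedback passages
    # containing `term` in which at least one query term also appears.
    containing = [p for p in passages if term in p]
    if not containing:
        return 0.0
    cooc = sum(1 for p in containing if any(q in p for q in query_terms))
    return cooc / len(containing)

def expand_query(query_terms, passages, k=5):
    # Rank all non-query terms seen in the feedback passages and keep
    # the k terms most related to the query.
    vocab = {t for p in passages for t in p} - set(query_terms)
    ranked = sorted(vocab, reverse=True,
                    key=lambda t: trq_like_score(t, query_terms, passages))
    return list(query_terms) + ranked[:k]

# Toy usage: passages are token lists extracted from EDA log files.
passages = [["simulation", "error", "timing", "violation"],
            ["timing", "check", "passed"],
            ["error", "setup", "timing", "violation"]]
print(expand_query(["error"], passages, k=2))
```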
2

Zhou, Wubai. "Data Mining Techniques to Understand Textual Data." FIU Digital Commons, 2017. https://digitalcommons.fiu.edu/etd/3493.

Full text
Abstract:
More than ever, information delivery and storage online rely heavily on text. Billions of texts are produced every day in the form of documents, news, logs, search queries, ad keywords, tags, tweets, messenger conversations, social network posts, etc. Text understanding is a fundamental and essential task involving broad research topics, and contributes to many applications in areas such as text summarization, search engines, recommendation systems, online advertising, and conversational bots. However, understanding text is never a trivial task for computers, especially for noisy and ambiguous text such as logs and search queries. This dissertation focuses on textual understanding tasks in two domains, disaster management and IT service management, that mainly use textual data as an information carrier. Improving situation awareness in disaster management and reducing the human effort involved in IT service management demand more intelligent and efficient solutions for understanding this textual data. From the perspective of data mining, four directions are identified: (1) intelligently generating a storyline that summarizes the evolution of a hurricane from a relevant online corpus; (2) automatically recommending resolutions according to the textual symptom description in a ticket; (3) gradually adapting the resolution recommendation system to time-correlated features derived from text; (4) efficiently learning distributed representations for short and noisy ticket symptom descriptions and resolutions. Concretely, this dissertation designs and develops data mining methodologies to better understand textual information: (1) a storyline generation method for efficient summarization of natural hurricanes based on a crawled online corpus; (2) a recommendation framework for automated ticket resolution in IT service management; (3) an adaptive recommendation system for time-varying, temporally correlated features derived from text; (4) a deep neural ranking model that not only recommends resolutions but also efficiently outputs distributed representations for ticket descriptions and resolutions.
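
As an illustration of direction (2), a minimal baseline recommender can simply retrieve the resolutions attached to the most textually similar historical tickets. The sketch below uses TF-IDF and cosine similarity on invented toy tickets; it is a generic baseline, not the dissertation's recommendation framework or its deep neural ranking model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Historical tickets: (symptom description, resolution) pairs -- toy data.
history = [
    ("database connection timeout on server", "restart the database service"),
    ("disk space full on /var partition", "clean up old log files"),
    ("user cannot log in, password expired", "reset the user password"),
]

vectorizer = TfidfVectorizer()
symptom_matrix = vectorizer.fit_transform(d for d, _ in history)

def recommend_resolution(new_symptom, top_n=1):
    # Rank past tickets by cosine similarity of their symptom text
    # and return the resolutions of the top_n most similar ones.
    sims = cosine_similarity(vectorizer.transform([new_symptom]), symptom_matrix)[0]
    best = sims.argsort()[::-1][:top_n]
    return [history[i][1] for i in best]

print(recommend_resolution("application cannot connect to database"))
```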
3

Al-Mutairy, Badr. "Data mining and integration of heterogeneous bioinformatics data sources." Thesis, Cardiff University, 2008. http://orca.cf.ac.uk/54178/.

Full text
Abstract:
In this thesis, we present a novel approach to interoperability based on biological relationships, using relationship-based integration to link bioinformatics data sources. This refers to the use of different relationship types, with different relationship closeness values, to link gene expression datasets with other information available in public bioinformatics data sources. These relationships provide flexible linkage for biologists to discover linked data across the biological universe. Relationship closeness is a variable that measures how close the biological entities in a relationship are, and is a characteristic of the relationship. The novelty of this approach is that it allows a user to link a gene expression dataset with heterogeneous data sources dynamically and flexibly, to facilitate comparative genomics investigations. Our research has demonstrated that using different relationships allows biologists to analyze experimental datasets in different ways, shortens the time needed to analyze the datasets, and makes this analysis easier to undertake. It thus gives biologists more power to experiment with changing threshold values and linkage types. This is achieved in our framework by introducing the Soft Link Model (SLM) and a Relationship Knowledge Base (RKB), which is built and used by the SLM. The Integration and Data Mining of Bioinformatics Data sources system (IDMBD) is implemented as a proof-of-concept prototype to demonstrate the linkage technique described in the thesis.
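
A minimal sketch of how relationship types and closeness values might drive such linkage, with an invented three-entry knowledge base; the entity names, relationship types and 0-to-1 closeness scale are all hypothetical illustrations, not the thesis's actual RKB schema.

```python
from dataclasses import dataclass

@dataclass
class Relationship:
    # One entry of a hypothetical Relationship Knowledge Base (RKB):
    # a typed link between two biological entities with a closeness value.
    source: str
    target: str
    rel_type: str
    closeness: float  # higher = biologically closer (assumed 0..1 scale)

RKB = [
    Relationship("geneA", "pathwayX", "participates_in", 0.9),
    Relationship("geneA", "geneB", "homolog_of", 0.6),
    Relationship("geneB", "diseaseY", "associated_with", 0.3),
]

def soft_link(entity, rel_types, min_closeness):
    # The 'soft' part: the user chooses relationship types and a
    # closeness threshold per experiment, and the linkage adapts.
    return [r.target for r in RKB
            if r.source == entity
            and r.rel_type in rel_types
            and r.closeness >= min_closeness]

print(soft_link("geneA", {"participates_in", "homolog_of"}, 0.5))
```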
4

Ur-Rahman, Nadeem. "Textual data mining applications for industrial knowledge management solutions." Thesis, Loughborough University, 2010. https://dspace.lboro.ac.uk/2134/6373.

Full text
Abstract:
In recent years, knowledge has become an important resource for enhancing business, and many activities are required to manage these knowledge resources well and help companies remain competitive in industrial environments. The data available in most industrial setups is complex in nature, and multiple data formats may be generated to track the progress of different projects, whether related to developing new products or to providing better services to customers. Knowledge discovery from different databases requires considerable effort, and data mining techniques serve this purpose for structured data formats. If, however, the data is semi-structured or unstructured, the combined efforts of data and text mining technologies may be needed to produce useful results. This thesis focuses on issues related to the discovery of knowledge from semi-structured or unstructured data formats through the application of textual data mining techniques, automating the classification of textual information into two categories or classes which can then be used to help manage the knowledge available in multiple data formats. Applications of different data mining techniques for discovering valuable information and knowledge in the manufacturing and construction industries are explored as part of a literature review. The application of text mining techniques to handle semi-structured or unstructured data is discussed in detail. A novel integration of different data and text mining tools is proposed in the form of a framework in which knowledge discovery and its refinement are performed through the application of clustering and Apriori association rule mining algorithms. Finally, the hypothesis of achieving better classification accuracies is examined by applying the methodology to case study data available in the form of Post Project Review (PPR) reports. The process of discovering useful knowledge, its interpretation and its utilisation is automated to classify the textual data into two classes.
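
As a minimal illustration of the final classification step, the sketch below trains a two-class text classifier on invented PPR-style sentences; the real framework additionally applies clustering and Apriori association rule mining before this step, which this sketch omits.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins for Post Project Review sentences with two labels
# (the label names here are invented for illustration).
docs = ["supplier delay caused project overrun",
        "delivery completed on schedule",
        "design rework needed after client review",
        "handover accepted without defects"]
labels = ["issue", "no_issue", "issue", "no_issue"]

# TF-IDF features feeding a naive Bayes classifier: a simple,
# standard two-class text classification pipeline.
clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(docs, labels)
print(clf.predict(["schedule overrun due to rework"]))
```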
5

Attanasio, Antonio. "Mining Heterogeneous Urban Data at Multiple Granularity Layers." Doctoral thesis, Politecnico di Torino, 2018. http://hdl.handle.net/11583/2709888.

Full text
Abstract:
The recent development of urban areas and of new advanced services supported by digital technologies has generated big challenges for people and city administrators, such as air pollution, high energy consumption, traffic congestion, and the management of public events. Moreover, understanding how citizens perceive the provided services and other relevant topics can help devise targeted management actions. With the wide diffusion of sensing technologies and user devices, the capability to generate data of public interest within the urban area has grown rapidly. For instance, sensor networks deployed in the urban area allow a variety of data to be collected that characterizes several aspects of the urban environment. The huge amount of data produced by different types of devices and applications carries rich knowledge about the urban context, and mining big urban data can provide decision makers with knowledge useful for tackling the aforementioned challenges for a smart and sustainable administration of urban spaces. However, the high volume and heterogeneity of the data increase the complexity of the analysis, and different sources provide data with different spatial and temporal references. Extracting significant information from such diverse kinds of data also depends on how they are integrated, hence alternative data representations and efficient processing technologies are required. The PhD research activity presented in this thesis tackles these issues: it deals with the analysis of big heterogeneous data in smart city scenarios, by means of new data mining techniques and algorithms, to study the nature of urban-related processes. The problem is addressed at both the infrastructural and the algorithmic layer. At the first layer, the thesis proposes enhancements of the current leading techniques for the storage and processing of Big Data, and considers integration with novel computing platforms to support the parallelization of tasks and the automatic scaling of resources. At the algorithmic layer, the research innovates current data mining algorithms by adapting them to novel Big Data architectures and to Cloud computing environments; these algorithms are applied to various classes of urban data in order to discover hidden but important information that supports the optimization of the related processes. This activity centred on the development of a distributed framework that automatically aggregates heterogeneous data at multiple temporal and spatial granularities and applies different data mining techniques. Parallel computations are performed according to the MapReduce paradigm and exploit in-memory computing to reach near-linear computational scalability. By exploring manifold data resolutions in a relatively short time, several additional patterns can be discovered, further enriching the description of urban processes. The framework is applied to different use cases in which many types of data provide insightful descriptive and predictive analyses. In particular, the PhD activity addressed two main issues in urban data mining: evaluating the energy efficiency of buildings from different energy-related data, and characterizing people's perception of and interest in different topics from user-generated content on social networks. For each use case, a specific architectural solution was designed to obtain meaningful and actionable results and to optimize the computational performance and scalability of the algorithms, which were extensively validated through experimental tests.
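
A single-machine sketch of the core aggregation idea, averaging toy sensor readings per district at two temporal granularities; the thesis's framework performs this in a distributed fashion with MapReduce and in-memory computing, and over spatial granularities as well.

```python
from collections import defaultdict
from datetime import datetime

# Toy urban sensor readings: (timestamp, district, value).
readings = [
    (datetime(2017, 5, 1, 9, 15), "centre", 12.0),
    (datetime(2017, 5, 1, 9, 40), "centre", 14.0),
    (datetime(2017, 5, 1, 10, 5), "north", 9.0),
]

def aggregate(readings, time_key):
    # Average readings per (time bucket, district); time_key maps a
    # timestamp to the desired granularity (hour, day, month, ...).
    sums = defaultdict(lambda: [0.0, 0])
    for ts, place, value in readings:
        acc = sums[(time_key(ts), place)]
        acc[0] += value
        acc[1] += 1
    return {k: s / n for k, (s, n) in sums.items()}

# The same data explored at two granularity layers:
print(aggregate(readings, lambda ts: ts.strftime("%Y-%m-%d %H:00")))
print(aggregate(readings, lambda ts: ts.strftime("%Y-%m-%d")))
```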
6

Kubalík, Jakub. "Mining of Textual Data from the Web for Speech Recognition." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2010. http://www.nusl.cz/ntk/nusl-237170.

Full text
Abstract:
The primary goal of this project was to study language modelling for speech recognition and techniques for obtaining textual data from the Web. The text introduces basic speech recognition techniques and describes in more detail language models based on statistical methods; in particular, it deals with criteria for evaluating the quality of language models and speech recognition systems. The text further describes data mining models and techniques, especially information retrieval. Problems connected with obtaining data from the Web are then presented, and the Google search engine is introduced in that context. Part of the project was the design and implementation of a system for retrieving text from the Web, which is described in detail. The main goal of the work, however, was to verify whether data obtained from the Web can benefit speech recognition. The described techniques therefore look for the optimal way to use Web-derived data to improve both demonstration language models and models deployed in real recognition systems.
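
A common way to combine Web-derived text with an existing language model is linear interpolation of their probability estimates, sketched below with toy unigram models; this is a standard technique shown for illustration, as the abstract does not specify the exact combination method used in the thesis.

```python
def interpolate(p_base, p_web, lam=0.7):
    # Linear interpolation of two language-model probabilities;
    # lam weights the in-domain baseline against the Web model and
    # would in practice be tuned to minimise held-out perplexity.
    return lam * p_base + (1 - lam) * p_web

# Toy unigram models estimated from an in-domain corpus and Web text.
base = {"hello": 0.02, "world": 0.01}
web = {"hello": 0.05, "world": 0.03}
for w in ["hello", "world"]:
    print(w, interpolate(base.get(w, 1e-6), web.get(w, 1e-6)))
```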
7

Nimmagadda, Shastri Lakshman. "Ontology based data warehousing for mining of heterogeneous and multidimensional data sources." Thesis, Curtin University, 2015. http://hdl.handle.net/20.500.11937/2322.

Full text
Abstract:
Heterogeneous and multidimensional big-data sources are prevalent in virtually all business environments, yet system and data analysts are often unable to access and fast-track them. A robust and versatile data warehousing system is developed that integrates domain ontologies from multidimensional data sources. For example, petroleum digital ecosystems and digital oil field solutions, derived from big-data petroleum (information) systems, are in increasing demand in multibillion-dollar resource businesses worldwide. This work has been recognized by the IEEE Industrial Electronics Society and has appeared in more than 50 international conference proceedings and journals.
8

Preti, Giulia. "On the discovery of relevant structures in dynamic and heterogeneous data." Doctoral thesis, Università degli studi di Trento, 2019. http://hdl.handle.net/11572/242978.

Full text
Abstract:
We are witnessing an explosion of available data coming from a huge number of sources and domains, leading to the creation of ever larger and richer datasets. Understanding, processing, and extracting useful information from those datasets requires specialized algorithms that take into consideration both the dynamism and the heterogeneity of the data they contain. Although several pattern mining techniques have been proposed in the literature, most of them fall short in providing interesting structures when the data can be interpreted differently from user to user, when it can change from time to time, and when it has different representations. In this thesis, we propose novel approaches that go beyond the traditional pattern mining algorithms and can effectively and efficiently discover relevant structures in dynamic and heterogeneous settings. In particular, we address pattern mining in multi-weighted graphs, pattern mining in dynamic graphs, and pattern mining in heterogeneous temporal databases. For multi-weighted graphs, we consider the problem of mining patterns in a new category of graphs in which nodes and edges can carry multiple weights that represent, for example, the preferences of different users or applications, and that are used to assess the relevance of the patterns. We introduce a novel family of scoring functions that assign a score to each pattern based on both the weights of its appearances and their number, and that respect the anti-monotone property, pivotal for efficient implementations. We then propose a centralized and a distributed algorithm that solve the problem both exactly and approximately. The approximate solution scales better with the number of edge weighting functions while achieving good accuracy in the results found. An extensive experimental study shows the advantages and disadvantages of our strategies and proves their effectiveness. For dynamic graphs, we focus on discovering structures that are both well connected and correlated over time, in graphs where nodes and edges can change over time. These structures represent edges that are topologically close and exhibit a similar behavior of appearance and disappearance across the snapshots of the graph. To this aim, we introduce two measures for computing the density of a subgraph whose edges change in time, and a measure to compute their correlation. The density measures can detect subgraphs that are silent in some periods of time but highly connected in others, and can thus detect events or anomalies that happened in the network. The correlation measure can identify groups of edges that tend to co-appear, as well as edges characterized by similar levels of activity. For both variants of the density measure, we provide an effective solution that enumerates all the maximal subgraphs whose density and correlation exceed given minimum thresholds, but can also return a more compact subset of representative subgraphs that exhibit high pairwise dissimilarity. Furthermore, we propose an approximate algorithm that scales well with the size of the network while achieving high accuracy. We evaluate our framework with an extensive set of experiments on both real and synthetic datasets, and compare its performance with the main competitor algorithm. The results confirm the correctness of the exact solution, the high accuracy of the approximate one, and the superiority of our framework over existing solutions; they also demonstrate the scalability of the framework and its applicability to networks of different natures. Finally, we address the problem of entity resolution in heterogeneous temporal databases, which contain records that describe the status of real-world entities at different periods of time and are thus characterized by different sets of attributes that can change over time. Detecting records that refer to the same entity in such a scenario requires a record similarity measure that takes the temporal information into account and is aware of the absence of a common fixed schema between the records. However, existing record matching approaches either ignore the dynamism in the attribute values of the records or assume that all the records share the same set of attributes throughout time. In this thesis, we propose a novel time-aware, schema-agnostic similarity measure for temporal records to find pairs of matching records, and integrate it into an exact and an approximate algorithm. The exact algorithm can find all the maximal groups of pairwise similar records in the database. The approximate algorithm achieves higher scalability with the size of the dataset and the number of attributes by relying on a technique called meta-blocking, and finds a good-quality approximation of the actual groups of similar records by adopting an effective and efficient clustering algorithm.
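
A minimal sketch of the correlation idea for dynamic graphs: each edge is represented by its presence/absence vector across snapshots, and strongly correlated pairs are candidates for the same correlated dense subgraph. The toy data and the use of plain Pearson correlation (statistics.correlation, Python 3.10+) are illustrative stand-ins for the thesis's actual measures.

```python
from itertools import combinations
from statistics import correlation  # requires Python 3.10+

# Edge activity across 6 snapshots: 1 = edge present, 0 = absent.
activity = {
    ("a", "b"): [1, 1, 0, 0, 1, 1],
    ("b", "c"): [1, 1, 0, 0, 1, 0],
    ("c", "d"): [0, 0, 1, 1, 0, 0],
}

# Pairs of edges whose appearance patterns are strongly correlated
# are candidates for belonging to the same correlated dense subgraph.
for (e1, s1), (e2, s2) in combinations(activity.items(), 2):
    print(e1, e2, round(correlation(s1, s2), 2))
```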
9

Fang, Chunsheng. "Novel Frameworks for Mining Heterogeneous and Dynamic Networks." University of Cincinnati / OhioLINK, 2011. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1321369978.

Full text
10

Simonetti, Andrea. "Development of statistical methods for the analysis of textual data." Doctoral thesis, Università degli Studi di Palermo, 2022. https://hdl.handle.net/10447/574870.

Full text
11

Malherbe, Emmanuel. "Standardization of textual data for comprehensive job market analysis." Thesis, Université Paris-Saclay (ComUE), 2016. http://www.theses.fr/2016SACLC058/document.

Full text
Abstract:
With so many job adverts and candidate profiles available online, e-recruitment constitutes a rich object of study. All this information is, however, textual data, which from a computational point of view is unstructured, and the large number and heterogeneity of recruitment websites mean that there are many vocabularies and nomenclatures. One of the difficulties when dealing with this type of raw textual data is grasping the concepts it contains, which is the standardization problem tackled in this thesis. The aim of standardization is to create a unified process providing values in a nomenclature. A nomenclature is by definition a finite set of meaningful concepts, which means that the attributes resulting from standardization are a structured representation of the information. Several questions arise: is the structured data of the websites usable for a unified standardization? What structure of nomenclature is best suited for standardization, and how can it be leveraged? Is it possible to automatically build such a nomenclature from scratch, or to manage the standardization process without one? To illustrate the various obstacles of standardization, the examples we study include inferring the skills or the professional category of a job advert, or the level of training of a candidate profile. One challenge of e-recruitment is that the concepts evolve continuously, so the standardization must keep up with job market trends. In light of this, we propose a set of machine learning models that require minimal supervision and can easily adapt to the evolution of the nomenclatures. The questions raised found partial answers in case-based reasoning, semi-supervised learning-to-rank, and latent variable models, and by leveraging the evolving sources of the semantic web and social media. The models proposed were tested on real-world data before being implemented in an industrial environment. The resulting standardization is at the core of SmartSearch, a project which provides a comprehensive analysis of the job market.
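
The essence of standardization, mapping raw text onto a finite nomenclature, can be sketched with a character n-gram TF-IDF matcher as below; the three-concept nomenclature is invented, and this nearest-neighbour baseline stands in for, rather than reproduces, the thesis's learning-to-rank and latent variable models.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# A toy job-category nomenclature; real nomenclatures contain
# thousands of concepts.
nomenclature = ["software engineer", "data scientist", "sales manager"]

# Character n-grams make the matcher robust to spelling variants.
vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
nom_matrix = vec.fit_transform(nomenclature)

def standardize(raw_title):
    # Map a raw job title to the closest nomenclature concept.
    sims = cosine_similarity(vec.transform([raw_title]), nom_matrix)[0]
    return nomenclature[sims.argmax()]

print(standardize("senior sofware engeneer (backend)"))  # typos on purpose
```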
12

Koeller, Andreas. "Integration of Heterogeneous Databases: Discovery of Meta-Information and Maintenance of Schema-Restructuring Views." Digital WPI, 2002. https://digitalcommons.wpi.edu/etd-dissertations/116.

Full text
Abstract:
In today's networked world, information is widely distributed across many independent databases in heterogeneous formats. Integrating such information is a difficult task and has been addressed by several projects. However, previous integration solutions, such as the EVE project, have several shortcomings. Database contents and structure change frequently, and users often have incomplete information about the data content and structure of the databases they use. When information from several such insufficiently described sources is to be extracted and integrated, two problems have to be solved: how can we discover the structure and contents of, and interrelationships among, unknown databases, and how can we provide durable integration views over several such databases? In this dissertation, we develop solutions for those key problems in information integration. The first part of the dissertation addresses the fact that knowledge about the interrelationships between databases is essential for any attempt at solving the information integration problem. We present an algorithm called FIND2, based on the clique-finding problem in graphs and k-uniform hypergraphs, to discover redundancy relationships between two relations. The algorithm is further enhanced by heuristics that significantly reduce the search space when necessary. Extensive experimental studies of the algorithm, both with and without heuristics, illustrate its effectiveness on a variety of real-world data sets. The second part of the dissertation addresses the durable view problem and presents the first algorithm for incremental view maintenance in schema-restructuring views. Such views are essential for the integration of heterogeneous databases. They are typically defined in schema-restructuring query languages like SchemaSQL, which can transform schema into data and vice versa, making traditional view maintenance based on differential queries impossible. Based on an existing algebra for SchemaSQL, we present an update propagation algorithm that propagates updates along the query algebra tree, and we prove its correctness. We also propose optimizations of our algorithm and present experimental results showing its benefits over view recomputation.
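
The building block of such discovery is testing unary inclusion dependencies between columns of two relations, as sketched below on toy relations; FIND2's actual contribution, combining these unary inclusions into higher-arity ones via clique finding in graphs and hypergraphs, is only hinted at here.

```python
from itertools import product

def unary_inclusions(rel_a, rel_b):
    # Return all pairs (i, j) such that column i of rel_a is contained
    # in column j of rel_b -- the unary inclusion dependencies that
    # FIND2 combines, via clique finding, into higher-arity ones.
    cols_a = list(zip(*rel_a))
    cols_b = list(zip(*rel_b))
    return [(i, j) for (i, a), (j, b)
            in product(enumerate(cols_a), enumerate(cols_b))
            if set(a) <= set(b)]

# Toy relations as lists of tuples (rows).
r1 = [(1, "x"), (2, "y")]
r2 = [(1, "x", 10), (2, "y", 20), (3, "z", 30)]
print(unary_inclusions(r1, r2))  # -> [(0, 0), (1, 1)]
```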
13

Kalledat, Tobias. "Tracking domain knowledge based on segmented textual sources." Doctoral thesis, Humboldt-Universität zu Berlin, Wirtschaftswissenschaftliche Fakultät, 2009. http://dx.doi.org/10.18452/15925.

Full text
Abstract:
This research analyses the influence of pre-processing on the results of knowledge generation and gives concrete recommendations for the suitable pre-processing of text corpora in text data mining (TDM). It focuses on the extraction and tracking of concepts within certain knowledge domains using an approach that segments corpora horizontally (along the timeline) and vertically (by persistence of terms). The result is a set of corpus segments along the timeline. Within each timeline segment, clusters of terms can be built according to their persistence with respect to each single time-based corpus segment and to the whole corpus. Based on a simple frequency measure, it can be shown that the statistical quality of a single corpus alone allows the pre-processing quality to be measured; comparison corpora are not necessary. In an optimally pre-processed corpus, the time series of the frequency measure show significant negative correlations between the cluster of terms that occur permanently and the cluster of terms that vary; the opposite was found in every test set that was pre-processed with lower quality. The most frequent terms were grouped into concepts using domain-specific taxonomies. For corpora with a high level of pre-processing quality, a significant negative correlation was found between the time series of the number of distinct terms per yearly corpus segment and that of the terms assigned to a taxonomy. A semantic analysis based on a simple TDM method with frequency threshold measures extracted significantly different knowledge from corpora with different qualities of pre-processing. With the measures introduced in this research, the quality of both the source corpora and the applied taxonomies can be measured. From these results, indicators for measuring and assessing corpora and taxonomies are derived, and recommendations are given for a level of pre-processing adequate to the goal of the subsequent analysis process.
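
A minimal sketch of the horizontal/vertical segmentation on an invented three-year corpus: terms present in every time segment form the persistent cluster, the remainder the non-persistent one, after which the frequency time series of the two clusters can be correlated as described above.

```python
# Toy corpus already segmented by year (horizontal segmentation);
# each segment is the set of terms occurring in it.
segments = {
    2001: {"internet", "stocks", "bank"},
    2002: {"internet", "bank", "merger"},
    2003: {"internet", "bank", "rates"},
}

# Vertical segmentation: terms present in every time segment form the
# persistent cluster, the rest the non-persistent one.
all_terms = set().union(*segments.values())
persistent = set.intersection(*segments.values())
non_persistent = all_terms - persistent

print(persistent)       # {'internet', 'bank'}
print(non_persistent)   # {'stocks', 'merger', 'rates'}
```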
14

Franzke, Maximilian. "Querying and mining heterogeneous spatial, social, and temporal data." Doctoral thesis, supervised by Matthias Renz. München: Universitätsbibliothek der Ludwig-Maximilians-Universität, 2019. http://d-nb.info/1190563630/34.

Full text
15

Wu, Chao. "Intelligent Data Mining on Large-scale Heterogeneous Datasets and its Application in Computational Biology." University of Cincinnati / OhioLINK, 2014. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1406880774.

Full text
16

Koeller, Andreas. "Integration of heterogeneous databases : discovery of meta-information and maintenance of schema-restructuring views." Link to electronic version, 2001. http://www.wpi.edu/Pubs/ETD/Available/etd-0415102-133008/.

Full text
Abstract:
Thesis (Ph. D.)--Worcester Polytechnic Institute.
UMI no. 30-30945. Keywords: schema restructuring; schema changes; meta-data discovery; data mining; data integration. Includes bibliographical references (leaves 256-274).
17

Ait, Saada Mira. "Unsupervised learning from textual data with neural text representations." Electronic Thesis or Diss., Université Paris Cité, 2023. http://www.theses.fr/2023UNIP7122.

Full text
Abstract:
The digital era generates enormous amounts of unstructured data such as images and documents, requiring specific processing methods to extract value from them. Textual data presents an additional challenge as it does not contain numerical values. Word embeddings are techniques that transform text into numerical data, enabling machine learning algorithms to process it. Unsupervised tasks are a major challenge in industry as they allow value creation from large amounts of data without requiring costly manual labeling. In this thesis we explore the use of Transformer models for unsupervised tasks such as clustering, anomaly detection, and data visualization. We also propose methodologies to better exploit multi-layer Transformer models in an unsupervised context, improving the quality and robustness of document clustering while avoiding the choice of which layer to use and of the number of clusters. Additionally, we investigate Transformer language models and their application to clustering more deeply, examining in particular transfer learning methods that fine-tune pre-trained models on a different task to improve their quality for future tasks. We demonstrate through an empirical study that post-processing methods based on dimensionality reduction are more advantageous for clustering than the fine-tuning strategies proposed in the literature. Finally, we propose a framework for detecting text anomalies in French adapted to two cases: one where the data concerns a specific topic and one where the data has multiple sub-topics. In both cases, we obtain results superior to the state of the art with significantly lower computation time.
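
A minimal sketch of the post-processing pipeline found advantageous in the thesis, on random stand-in data: document embeddings taken from one Transformer layer are reduced with PCA and then clustered with k-means. The dimensionalities and cluster count below are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Stand-in for 100 document embeddings taken from one Transformer
# layer (e.g., 768-dimensional BERT outputs).
embeddings = rng.normal(size=(100, 768))

# Post-processing: reduce dimensionality, then cluster.
reduced = PCA(n_components=10).fit_transform(embeddings)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(reduced)
print(labels[:10])
```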
18

Motoyoshi, Tadahiro (元吉忠寛). "An exploratory study of disaster imagination: comparing university students' imagination with cases from the Great Hanshin-Awaji Earthquake" [災害のイマジネーション力に関する探索的研究 - 大学生の想像力と阪神淡路大震災の事例との比較 -]. Graduate School of Education and Human Development, Nagoya University, 2006. http://hdl.handle.net/2237/9454.

Full text
19

Alencar Medeiros, Gabriel Henrique. "PreDiViD: Towards the Prediction of the Dissemination of Viral Disease contagion in a pandemic setting." Electronic Thesis or Diss., Normandie, 2025. http://www.theses.fr/2025NORMR005.

Full text
Abstract:
Event-Based Surveillance (EBS) systems are essential for detecting and tracking emerging health phenomena such as epidemics and public health crises. However, they face limitations, including a strong dependence on human expertise, challenges in processing heterogeneous textual data, and insufficient consideration of spatiotemporal dynamics. To overcome these issues, we propose a hybrid approach combining knowledge-driven and data-driven methodologies, anchored in the Propagation Phenomena Ontology (PropaPhen) and the Description-Detection-Prediction Framework (DDPF), to enhance the description, detection, and prediction of propagation phenomena. PropaPhen is a FAIR ontology designed to model the spatiotemporal spread of phenomena; it has been specialized to the biomedical domain through the integration of UMLS and World-KG, leading to the creation of the BioPropaPhenKG knowledge graph. The DDPF framework consists of three modules: description, which generates domain-specific ontologies; detection, which applies relation extraction techniques to heterogeneous textual sources; and prediction, which uses advanced clustering methods. Tested on COVID-19 and Monkeypox datasets and validated against WHO data, DDPF demonstrated its effectiveness in detecting and predicting spatiotemporal clusters. Its modular architecture ensures scalability and adaptability to various domains, opening perspectives in public health, environmental monitoring, and the study of social phenomena.
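
As a rough illustration of the prediction module's goal, the sketch below detects spatiotemporal clusters in invented case reports with DBSCAN over scaled (latitude, longitude, time) features; the scaling and parameters are arbitrary, and the thesis's own clustering operates over the BioPropaPhenKG knowledge graph rather than raw coordinates.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy case reports: (latitude, longitude, day index), scaled so one
# degree and ten days count comparably -- a crude space-time metric.
cases = np.array([
    [48.85, 2.35, 0], [48.86, 2.36, 1], [48.85, 2.34, 2],   # one outbreak
    [40.71, -74.0, 30], [40.72, -74.01, 31],                 # another
])
features = cases / np.array([1.0, 1.0, 10.0])  # compress the time axis

labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(features)
print(labels)  # e.g., [0 0 0 1 1]
```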
APA, Harvard, Vancouver, ISO, and other styles
21

Spiegler, Sebastian R. "Comparative study of clustering algorithms on textual databases : clustering of curricula vitae into competency-based groups to support knowledge management /." Saarbrücken : VDM Verl. Müller, 2007. http://deposit.d-nb.de/cgi-bin/dokserv?id=3035354&prov=M&dok_var=1&dok_ext=htm.

Full text
APA, Harvard, Vancouver, ISO, and other styles
22

Fabbri, Renato. "Topological stability and textual differentiation in human interaction networks: statistical analysis, visualization and linked data." Universidade de São Paulo, 2017. http://www.teses.usp.br/teses/disponiveis/76/76132/tde-11092017-154706/.

Full text
Abstract:
This work reports on stable (or invariant) topological properties and textual differentiation in human interaction networks, with benchmarks derived from public email lists. Activity along time and topology were observed in snapshots over a timeline, and at different scales. Our analysis shows that activity is practically the same for all networks across timescales ranging from seconds to months. The principal components of the participants in the topological metrics space remain practically unchanged as different sets of messages are considered. The activity of participants follows the expected scale-free outline, thus yielding the hub, intermediary and peripheral classes of vertices by comparison against the Erdös-Rényi model. The relative sizes of these three sectors are essentially the same for all email lists and the same along time. Typically, 3-12% of the vertices are hubs, 15-45% are intermediary and 44-81% are peripheral vertices. Texts from each of these sectors are shown to be very different through direct measurements and through an adaptation of the Kolmogorov-Smirnov test. These properties are consistent with the literature and may be general for human interaction networks, which has important implications for establishing a typology of participants based on quantitative criteria. For guiding and supporting this research, we also developed a visualization method for dynamic networks through animations. To facilitate verification and further steps in the analyses, we supply a linked data representation of the data related to our results.
Este trabalho relata propriedades topológicas estáveis (ou invariantes) e diferenciação textual em redes de interação humana, com referências derivadas de listas públicas de e-mail. A atividade ao longo do tempo e a topologia foram observadas em instantâneos ao longo de uma linha do tempo e em diferentes escalas. A análise mostra que a atividade é praticamente a mesma para todas as redes em escalas temporais de segundos a meses. As componentes principais dos participantes no espaço das métricas topológicas mantêm-se praticamente inalteradas quando diferentes conjuntos de mensagens são considerados. A atividade dos participantes segue o esperado perfil livre de escala, produzindo, assim, as classes de vértices dos hubs, dos intermediários e dos periféricos em comparação com o modelo Erdös-Rényi. Os tamanhos relativos destes três setores são essencialmente os mesmos para todas as listas de e-mail e ao longo do tempo. Normalmente, 3-12% dos vértices são hubs, 15-45% são intermediários e 44-81% são vértices periféricos. Os textos de cada um destes setores são considerados muito diferentes através de uma adaptação dos testes de Kolmogorov-Smirnov. Estas propriedades são consistentes com a literatura e podem ser gerais para redes de interação humana, o que tem implicações importantes para o estabelecimento de uma tipologia dos participantes com base em critérios quantitativos. De modo a guiar e apoiar esta pesquisa, também desenvolvemos um método de visualização para redes dinâmicas através de animações. Para facilitar a verificação e passos seguintes nas análises, fornecemos uma representação em dados ligados dos dados relacionados aos nossos resultados.
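
The hub/intermediary/peripheral sectioning described above can be illustrated with a small sketch. The cutoffs below are hypothetical (the thesis's exact comparison criteria are not reproduced); degrees are compared with the Poisson degree distribution of a matched Erdös-Rényi graph:

    # Hedged sketch: sector vertices by degree against a matched ER model.
    import networkx as nx
    from scipy.stats import poisson

    def sector_vertices(G):
        n, m = G.number_of_nodes(), G.number_of_edges()
        lam = 2.0 * m / n  # mean degree of an ER graph with the same n and m
        lo, hi = poisson.ppf(0.05, lam), poisson.ppf(0.95, lam)  # toy cutoffs
        return {v: ("peripheral" if d < lo else "hub" if d > hi else "intermediary")
                for v, d in G.degree()}

    G = nx.barabasi_albert_graph(200, 2, seed=42)  # scale-free toy network
    sectors = sector_vertices(G)
    print(sum(1 for s in sectors.values() if s == "hub"), "hubs")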
APA, Harvard, Vancouver, ISO, and other styles
23

Nieto, Erick Mauricio Gómez. "Projeção multidimensional aplicada a visualização de resultados de busca textual." Universidade de São Paulo, 2012. http://www.teses.usp.br/teses/disponiveis/55/55134/tde-05122012-105730/.

Full text
Abstract:
Usuários da Internet estão muito familiarizados com o fato de que os resultados de uma consulta são exibidos como uma lista ordenada de snippets. Cada snippet possui conteúdo textual que mostra um resumo do documento referido (ou página web) e um link para o mesmo. Esta representação tem muitas vantagens como, por exemplo, proporcionar uma navegação fácil e simples de interpretar. No entanto, qualquer usuário que usa motores de busca poderia possivelmente reportar alguma experiência de decepção com este modelo. De fato, ela tem limitações em situações particulares, como o não fornecimento de uma visão geral da coleção de documentos recuperados. Além disso, dependendo da natureza da consulta - por exemplo, pode ser muito geral, ou ambígua, ou mal expressa - a informação desejada pode ser mal classificada, ou os resultados podem contemplar temas variados. Várias tarefas de busca seriam mais fáceis se fosse devolvida aos usuários uma visão geral dos documentos, organizados de modo a refletir a forma como são relacionados quanto ao conteúdo. Propomos uma técnica de visualização para exibir os resultados de consultas web que visa superar tais limitações. Ela combina a capacidade de preservação de vizinhança das projeções multidimensionais com a conhecida representação baseada em snippets. Essa visualização emprega uma projeção multidimensional para derivar layouts bidimensionais dos resultados da pesquisa, que preservam as relações de similaridade de texto, ou vizinhança. A similaridade é calculada mediante a aplicação da similaridade do cosseno sobre uma representação vetorial bag-of-words de coleções construídas a partir dos snippets. Se os snippets são exibidos diretamente de acordo com o layout derivado, eles se sobrepõem consideravelmente, produzindo uma visualização pobre. Nós superamos esse problema definindo uma energia funcional que considera tanto a sobreposição entre os snippets quanto a preservação da estrutura de vizinhanças dada no layout da projeção. Minimizando esta energia funcional, é fornecida uma representação bidimensional com preservação das vizinhanças dos snippets textuais com sobreposição mínima. A visualização transmite tanto uma visão global dos resultados da consulta como os agrupamentos visuais que refletem documentos relacionados, como é ilustrado em vários dos exemplos apresentados.
Internet users are very familiar with the results of a search query displayed as a ranked list of snippets. Each textual snippet shows a content summary of the referred document (or web page) and a link to it. This display has many advantages, e.g., it affords easy navigation and is straightforward to interpret. Nonetheless, any user of search engines could possibly report some experience of disappointment with this metaphor. Indeed, it has limitations in particular situations, as it fails to provide an overview of the document collection retrieved. Moreover, depending on the nature of the query - e.g., it may be too general, or ambiguous, or ill expressed - the desired information may be poorly ranked, or results may contemplate varied topics. Several search tasks would be easier if users were shown an overview of the returned documents, organized so as to reflect how related they are, content-wise. We propose a visualization technique to display the results of web queries aimed at overcoming such limitations. It combines the neighborhood preservation capability of multidimensional projections with the familiar snippet-based representation by employing a multidimensional projection to derive two-dimensional layouts of the query search results that preserve text similarity relations, or neighborhoods. Similarity is computed by applying the cosine similarity over a bag-of-words vector representation of the collection built from the snippets. If the snippets are displayed directly according to the derived layout they will overlap considerably, producing a poor visualization. We overcome this problem by defining an energy functional that considers both the overlapping amongst snippets and the preservation of the neighborhood structure as given in the projected layout. Minimizing this energy functional provides a neighborhood-preserving two-dimensional arrangement of the textual snippets with minimum overlap. The resulting visualization conveys both a global view of the query results and visual groupings that reflect related results, as illustrated in several examples shown.
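
The similarity computation described in the abstract can be sketched in a few lines, assuming scikit-learn. The thesis's specific projection technique and its overlap-removal energy functional are not reproduced; metric MDS over cosine distances stands in for the projection:

    # Hedged sketch: cosine similarity between snippets and a 2-D layout.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.manifold import MDS
    from sklearn.metrics.pairwise import cosine_similarity

    snippets = [
        "python data mining tutorial",
        "mining big data with python",
        "best pizza recipes at home",
        "easy homemade pizza dough recipe",
    ]
    bow = CountVectorizer().fit_transform(snippets)  # bag-of-words vectors
    sim = cosine_similarity(bow)                     # similarities in [0, 1]
    layout = MDS(n_components=2, dissimilarity="precomputed",
                 random_state=0).fit_transform(1.0 - sim)
    print(layout)  # nearby rows correspond to textually similar snippets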
APA, Harvard, Vancouver, ISO, and other styles
24

Driscoll, Timothy. "Host-Microbe Relations: A Phylogenomics-Driven Bioinformatic Approach to the Characterization of Microbial DNA from Heterogeneous Sequence Data." Diss., Virginia Tech, 2013. http://hdl.handle.net/10919/50921.

Full text
Abstract:
Plants and animals are characterized by intimate, enduring, often indispensable, and always complex associations with microbes. Therefore, it should come as no surprise that when the genome of a eukaryote is sequenced, a medley of bacterial sequences is produced as well. These sequences can be highly informative about the interactions between the eukaryote and its bacterial cohorts; unfortunately, they often comprise a vanishingly small constituent within a heterogeneous mixture of microbial and host sequences. Genomic analyses typically avoid the bacterial sequences in order to obtain a genome sequence for the host. Metagenomic analyses typically avoid the host sequences in order to analyze community composition and functional diversity of the bacterial component. This dissertation describes the development of a novel approach at the intersection of genomics and metagenomics, aimed at the extraction and characterization of bacterial sequences from heterogeneous sequence data using phylogenomic and bioinformatic tools. To achieve this objective, three interoperable workflows were constructed as modular computational pipelines, with built-in checkpoints for periodic interpretation and refinement. The MetaMiner workflow uses 16S small subunit rDNA analysis to enable the systematic discovery and classification of bacteria associated with a host genome sequencing project. Using this information, the ReadMiner workflow comprehensively extracts, assembles, and characterizes sequences that belong to a target microbe. Finally, AssemblySifter examines the genes and scaffolds of the eukaryotic genome for sequences associated with the target microbe. The combined information from these three workflows is used to systematically characterize a bacterial target of interest, including robust estimation of its phylogeny, assessment of its signature profile, and determination of its relationship to the associated eukaryote. This dissertation presents the development of the described methodology and its application to three eukaryotic genome projects. In the first study, the genomic sequences of a single, known endosymbiont were extracted from the genome sequencing data of its host. In the second study, a highly divergent endosymbiont was characterized from the assembled genome of its host. In the third study, genome sequences from a novel bacterium were extracted from both the raw sequencing data and assembled genome of a eukaryote that contained significant amounts of sequence from multiple competing bacteria. Taken together, these results demonstrate the usefulness of the described approach in singularly disparate situations, and strongly argue for a sophisticated, multifaceted, supervised approach to the characterization of host-associated microbes and their interactions.
Ph. D.
APA, Harvard, Vancouver, ISO, and other styles
25

Alonso Gonzalez, Kevin [author], Gerhard [academic supervisor] [reviewer] Rigoll, and Mihai [reviewer] Datcu. "Heterogeneous Data Mining of Earth Observation Archives: Integration and Fusion of Images, Maps, and In-situ Data / Kevin Alonso Gonzalez ; reviewers: Gerhard Rigoll, Mihai Datcu ; supervisor: Gerhard Rigoll." München : Universitätsbibliothek der TU München, 2017. http://d-nb.info/1132774195/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
26

Mrowca, Artur [author], Stephan [academic supervisor] Günnemann, Stephan [reviewer] Günnemann, and Sebastian [reviewer] Steinhorst. "Specification Mining in High dimensional Heterogeneous Data Sets of Large-Scale Distributed Systems / Artur Mrowca ; reviewers: Stephan Günnemann, Sebastian Steinhorst ; supervisor: Stephan Günnemann." München : Universitätsbibliothek der TU München, 2021. http://d-nb.info/1234149125/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
27

Kamenieva, Iryna. "Research Ontology Data Models for Data and Metadata Exchange Repository." Thesis, Växjö University, School of Mathematics and Systems Engineering, 2009. http://urn.kb.se/resolve?urn=urn:nbn:se:vxu:diva-6351.

Full text
Abstract:
For research in the fields of data mining and machine learning, a necessary condition is the availability of varied input data sets. Researchers now build databases of such sets; examples include the UCI Machine Learning Repository, the Data Envelopment Analysis Dataset Repository, the XMLData Repository, and the Frequent Itemset Mining Dataset Repository. Along with the statistical repositories listed above, a whole range of stores, from simple file stores to specialized repositories, can be used by researchers to solve applied tasks and to investigate their own algorithms and scientific problems. At first sight, the only difficulty for the user would seem to be searching such scattered information stores and understanding their structure. A closer study of these repositories, however, reveals deeper problems in how the data can be used: in particular, a complete mismatch between the rigid structure of the data files and the SDMX (Statistical Data and Metadata Exchange) standard and structure used by many European organizations, the impossibility of preparing the data in advance for a concrete applied task, and the lack of a history of how the data have been used in previous scientific and applied tasks.

There are now many data mining (DM) methods, as well as large quantities of data stored in various repositories. The repositories, however, offer no DM methods, and the methods are not linked to application areas. An essential problem is therefore linking the subject (problem) domain, the DM methods, and the data sets appropriate to each method. In this work we consider the problem of building ontological models of DM methods, describing the interaction between the methods and the corresponding data from the repositories, and providing intelligent agents that allow the user of a statistical repository to choose the method and data appropriate to the task being solved. The structure of such a system is proposed, and an intelligent search agent operating on the ontological model of DM methods, which takes the user's personal queries into account, is implemented.

An agent-oriented approach was selected for the implementation of the intelligent data and metadata exchange repository. The model uses a service-oriented architecture and is built on the cross-platform programming language Java, the multi-agent platform Jadex, the database server Oracle Spatial 10g, and Protégé version 3.4 as the development environment for ontological models.
APA, Harvard, Vancouver, ISO, and other styles
28

Mendes, Marília Soares. "MALTU - model for evaluation of interaction in social systems from the Users Textual Language." Universidade Federal do Ceará, 2015. http://www.teses.ufc.br/tde_busca/arquivo.php?codArquivo=14296.

Full text
Abstract:
The field of Human-Computer Interaction (HCI) has suggested various methods for evaluating systems in order to improve their usability and User eXperience (UX). The advent of Web 2.0 has allowed the development of applications marked by collaboration, communication and interaction among their users in a way and on a scale never seen before. Social Systems (SS) (e.g. Twitter, Facebook, MySpace, LinkedIn, etc.) are examples of such applications and have features such as frequent exchange of messages, spontaneity and expression of feelings. The opportunities and challenges posed by these types of applications require the traditional evaluation methods to be reassessed, taking these new characteristics into consideration. For instance, the postings of users on SS reveal their opinions on various issues, including what they think of the system. This work aims to test the hypothesis that the postings of users in SS provide relevant data for the evaluation of usability and UX in SS. In our survey of the literature, we did not identify any evaluation model designed to collect and interpret texts from users in order to assess user experience and system usability. Thus, this thesis proposes MALTU - Model for evaluation of interaction in social systems from the Users' Textual Language. In order to provide a basis for the development of the proposed model, we conducted a study of how users express their opinions on a system in natural language. We extracted postings of users from four SS of different contexts. HCI experts classified, studied and processed such postings by using Natural Language Processing (NLP) techniques and data mining, and then analyzed them in order to obtain a generic model. MALTU was applied to two SS: an entertainment SS and an educational SS. The results show that it is possible to evaluate a system from the postings of users in SS. Such evaluations are aided by extraction patterns related to usage, to the types of postings, and to the HCI factors used in the system.
A área de Interação Humano-Computador (IHC) tem sugerido muitas formas para avaliar sistemas a fim de melhorar sua usabilidade e a eXperiência do Usuário (UX). O surgimento da web 2.0 permitiu o desenvolvimento de aplicações marcadas pela colaboração, comunicação e interatividade entre seus usuários de uma forma e em uma escala nunca antes observadas. Sistemas Sociais (SS) (e.g., Twitter, Facebook, MySpace, LinkedIn etc.) são exemplos dessas aplicações e possuem características como: frequente troca de mensagens e expressão de sentimentos de forma espontânea. As oportunidades e os desafios trazidos por esses tipos de aplicações exigem que os métodos tradicionais de avaliação sejam repensados, considerando essas novas características. Por exemplo, as postagens dos usuários em SS revelam suas opiniões sobre diversos assuntos, inclusive sobre o que eles pensam do sistema em uso. Esta tese procura testar a hipótese de que as postagens dos usuários em SS fornecem dados relevantes para avaliação da Usabilidade e da UX (UUX) em SS. Durante as pesquisas realizadas na literatura, não foi identificado nenhum modelo de avaliação que tenha direcionado seu foco na coleta e análise das postagens dos usuários a fim de avaliar a UUX de um sistema em uso. Sendo assim, este estudo propõe o MALTU – Modelo para Avaliação da interação em sistemas sociais a partir da Linguagem Textual do Usuário. A fim de fornecer bases para o desenvolvimento do modelo proposto, foram realizados estudos de como os usuários expressam suas opiniões sobre o sistema em língua natural. Foram extraídas postagens de usuários de quatro SS de contextos distintos. Tais postagens foram classificadas por especialistas de IHC, estudadas e processadas utilizando técnicas de Processamento da Linguagem Natural (PLN) e mineração de dados, e analisadas a fim da obtenção de um modelo genérico. O MALTU foi aplicado em dois SS: um de entretenimento e um SS educativo. Os resultados mostram que é possível avaliar um sistema a partir das postagens dos usuários em SS. Tais avaliações são auxiliadas por padrões de extração relacionados ao uso, aos tipos de postagens e às metas de IHC utilizadas na avaliação do sistema.
APA, Harvard, Vancouver, ISO, and other styles
29

Egho, Elias. "Extraction de motifs séquentiels dans des données séquentielles multidimensionnelles et hétérogènes : une application à l'analyse de trajectoires de patients." Thesis, Université de Lorraine, 2014. http://www.theses.fr/2014LORR0066/document.

Full text
Abstract:
Tous les domaines de la science et de la technologie produisent de gros volumes de données hétérogènes. L'exploration de tels volumes de données reste toujours un défi. Peu de travaux ciblent l'exploration et l'analyse de données séquentielles multidimensionnelles et hétérogènes. Dans ce travail, nous proposons une contribution à la découverte de connaissances dans les données séquentielles hétérogènes. Nous étudions trois axes de recherche différents : (i) l'extraction de motifs séquentiels, (ii) la classification et (iii) le clustering des données séquentielles. Tout d'abord, nous généralisons la notion de séquence multidimensionnelle en considérant une structure complexe et hétérogène. Nous présentons une nouvelle approche, MMISP, pour extraire des motifs séquentiels à partir de données séquentielles multidimensionnelles et hétérogènes. MMISP génère un grand nombre de motifs séquentiels, comme c'est généralement le cas pour tous les algorithmes d'énumération de motifs. Pour surmonter ce problème, nous proposons une nouvelle façon de considérer les séquences multidimensionnelles hétérogènes en les associant à des structures de patrons. Nous développons une méthode pour énumérer seulement les motifs qui respectent certaines contraintes. La deuxième direction de recherche est la classification de séquences multidimensionnelles et hétérogènes. Nous utilisons l'analyse formelle de concepts (AFC) comme méthode de classification. Nous montrons l'intérêt des treillis de concepts et de l'indice de stabilité pour classer les séquences et pour choisir quelques groupes intéressants de séquences. La troisième direction de recherche de cette thèse concerne le regroupement des données séquentielles multidimensionnelles et hétérogènes. Nous nous basons sur la notion de sous-séquences communes pour définir une mesure de similarité permettant d'évaluer la proximité entre deux séquences formées d'une liste d'ensembles d'items. Nous utilisons cette mesure de similarité pour construire une matrice de similarité entre les séquences et pour les segmenter en plusieurs groupes. Dans ce travail, nous présentons des résultats théoriques et un algorithme de programmation dynamique permettant de compter efficacement toutes les sous-séquences communes à deux séquences sans énumérer toutes les sous-séquences. Le système résultant de cette recherche a été appliqué pour analyser et extraire les trajectoires de soins de santé des patients en cancérologie. Les données sont issues d'une base de données médico-administrative incluant des informations sur des patients hospitalisés en France. Le système permet d'identifier et de caractériser des épisodes de soins pour des ensembles spécifiques de patients. Les résultats ont été discutés et interprétés avec les experts du domaine.
All domains of science and technology produce large and heterogeneous data. Although a lot of work has been done in this area, mining such data is still a challenge. No previous research work targets the mining of heterogeneous multidimensional sequential data. This thesis proposes a contribution to knowledge discovery in heterogeneous sequential data. We study three different research directions: (i) extraction of sequential patterns, (ii) classification and (iii) clustering of sequential data. Firstly, we generalize the notion of a multidimensional sequence by considering a complex and heterogeneous sequential structure. We present a new approach called MMISP to extract sequential patterns from heterogeneous sequential data. MMISP generates a large number of sequential patterns, as is usually the case for pattern enumeration algorithms. To overcome this problem, we propose a novel way of considering heterogeneous multidimensional sequences by mapping them into pattern structures. We develop a framework for enumerating only patterns satisfying given constraints. The second research direction concerns the classification of heterogeneous multidimensional sequences. We use Formal Concept Analysis (FCA) as a classification method. We show interesting properties of concept lattices and of the stability index to classify sequences into a concept lattice and to select some interesting groups of sequences. The third research direction concerns the clustering of heterogeneous multidimensional sequential data. We focus on the notion of common subsequences to define similarity between a pair of sequences composed of lists of itemsets. We use this similarity measure to build a similarity matrix between sequences and to separate them into different groups. In this work, we present theoretical results and an efficient dynamic programming algorithm to count the number of common subsequences between two sequences without enumerating all subsequences. The system resulting from this research work was applied to analyze and mine patient healthcare trajectories in oncology. Data are taken from a medico-administrative database including all information about the hospitalizations of patients in the Lorraine Region (France). The system allows us to identify and characterize episodes of care for specific sets of patients. Results were discussed and validated with domain experts.
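
To make the notion of common subsequences between sequences of itemsets concrete, here is a brute-force sketch (exponential, toy inputs only; the example sequences are hypothetical care trajectories). The efficient dynamic-programming count is precisely the thesis's contribution and is not reproduced:

    # Hedged sketch: enumerate the distinct common subsequences of two
    # sequences of itemsets; usable only on very small inputs.
    from itertools import combinations

    def nonempty_subsets(itemset):
        items = sorted(itemset)
        return [frozenset(c) for r in range(1, len(items) + 1)
                for c in combinations(items, r)]

    def subsequences(seq):
        subs = {()}  # start from the empty subsequence
        for itemset in seq:
            subs |= {s + (part,) for s in subs
                     for part in nonempty_subsets(itemset)}
        return subs

    def common_subsequence_count(s, t):
        return len(subsequences(s) & subsequences(t)) - 1  # drop the empty one

    s = [{"surgery"}, {"chemo", "radio"}, {"follow-up"}]
    t = [{"surgery"}, {"radio"}, {"follow-up"}]
    print(common_subsequence_count(s, t))  # 7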
APA, Harvard, Vancouver, ISO, and other styles
30

Ammari, Ahmad N. "Transforming user data into user value by novel mining techniques for extraction of web content, structure and usage patterns : the development and evaluation of new Web mining methods that enhance information retrieval and improve the understanding of users' Web behavior in websites and social blogs." Thesis, University of Bradford, 2010. http://hdl.handle.net/10454/5269.

Full text
Abstract:
The rapid growth of the World Wide Web in the last decade has made it the largest publicly accessible data source in the world and one of the most significant and influential information revolutions of modern times. The Web has impacted almost every aspect of human life and activity, causing paradigm shifts and transformational changes in business, governance, and education. Moreover, the rapid evolution of Web 2.0 and the Social Web in the past few years, such as social blogs and friendship networking sites, has dramatically transformed the Web from a raw environment for information consumption to a dynamic and rich platform for information production and sharing worldwide. However, this growth and transformation of the Web has resulted in an uncontrollable explosion and abundance of textual content, creating a serious challenge for any user trying to find and retrieve the relevant information that he truly seeks on the Web. Finding a relevant Web page in a website easily and efficiently has become very difficult. This has created many challenges for researchers to develop new mining techniques in order to improve the user experience on the Web, as well as for organizations to understand the true informational interests and needs of their customers in order to improve their targeted services accordingly by providing the products, services and information that truly match the requirements of every online customer. With these challenges in mind, Web mining aims to extract hidden patterns and discover useful knowledge from Web page contents, Web hyperlinks, and Web usage logs. Based on the primary kinds of Web data used in the mining process, Web mining tasks can be categorized into three main types: Web content mining, which extracts knowledge from Web page contents using text mining techniques; Web structure mining, which extracts patterns from the hyperlinks that represent the structure of the website; and Web usage mining, which mines users' Web navigational patterns from Web server logs that record the Web page accesses made by every user, representing the interactional activities between the users and the Web pages in a website. The main goal of this thesis is to contribute toward addressing the challenges that have resulted from the information explosion and overload on the Web, by proposing and developing novel Web mining-based approaches. Toward achieving this goal, the thesis presents, analyzes, and evaluates three major contributions. First, the development of an integrated Web structure and usage mining approach that recommends a collection of hyperlinks, to be placed on the homepage of a website, for the surfers of that website. Second, the development of an integrated Web content and usage mining approach to improve the understanding of the user's Web behavior and discover the user group interests in a website. Third, the development of a supervised classification model based on recent Social Web concepts, such as Tag Clouds, in order to improve the retrieval of relevant articles and posts from Web social blogs.
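
As a toy illustration of the usage-mining side of the first contribution (a deliberately simplistic stand-in: the integrated structure-and-usage approach is not reproduced, and the log format is hypothetical), ranking pages by access frequency already yields candidate links for the homepage:

    # Hedged sketch: recommend homepage hyperlinks from a toy usage log.
    from collections import Counter

    access_log = [  # hypothetical (user, page) pairs parsed from server logs
        ("u1", "/products"), ("u2", "/products"), ("u3", "/blog"),
        ("u1", "/support"), ("u2", "/products"), ("u3", "/support"),
    ]
    counts = Counter(page for _, page in access_log)
    top_links = [page for page, _ in counts.most_common(2)]
    print(top_links)  # ['/products', '/support'] -> homepage candidates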
APA, Harvard, Vancouver, ISO, and other styles
31

Mazoyer, Béatrice. "Social Media Stories. Event detection in heterogeneous streams of documents applied to the study of information spreading across social and news media." Thesis, université Paris-Saclay, 2020. http://www.theses.fr/2020UPASC009.

Full text
Abstract:
Les réseaux sociaux, et Twitter en particulier, sont devenus une source d'information privilégiée pour les journalistes ces dernières années. Beaucoup effectuent une veille sur Twitter, à la recherche de sujets qui puissent être repris dans les médias. Cette thèse vise à étudier et à quantifier l'effet de ce changement technologique sur les décisions prises par les rédactions. La popularité d'un événement sur les réseaux sociaux affecte-t-elle sa couverture par les médias traditionnels, indépendamment de son intérêt intrinsèque ? Pour mettre en évidence cette relation, nous adoptons une approche pluridisciplinaire, à la rencontre de l'informatique et de l'économie : tout d'abord, nous concevons une approche inédite pour collecter un échantillon représentatif de 70% de tous les tweets en français émis pendant un an. Par la suite, nous étudions différents types d'algorithmes pour découvrir automatiquement les tweets qui se rapportent aux mêmes événements. Nous testons différentes représentations vectorielles de tweets, en nous intéressant aux représentations vectorielles de texte et aux représentations texte-image. Troisièmement, nous concevons une nouvelle méthode pour regrouper les événements Twitter et les événements médiatiques. Enfin, nous concevons un instrument économétrique pour identifier un effet causal de la popularité d'un événement sur Twitter sur sa couverture par les médias traditionnels. Nous montrons que la popularité d'un événement sur Twitter a un effet sur le nombre d'articles qui lui sont consacrés dans les médias traditionnels, avec une augmentation d'environ 1 article pour 1000 tweets supplémentaires.
Social Media, and Twitter in particular, has become a privileged source of information for journalists in recent years. Most of them monitor Twitter in search of newsworthy stories. This thesis aims to investigate and quantify the effect of this technological change on editorial decisions. Does the popularity of a story affect the way it is covered by traditional news media, regardless of its intrinsic interest? To highlight this relationship, we take a multidisciplinary approach at the crossroads of computer science and economics: first, we design a novel approach to collect a representative sample of 70% of all French tweets emitted during an entire year. Second, we study different types of algorithms to automatically discover tweets that relate to the same stories, testing several vector representations of tweets, both text-only and text-image. Third, we design a new method to group together Twitter events and media events. Finally, we design an econometric instrument to identify a causal effect of the popularity of an event on Twitter on its coverage by traditional media. We show that the popularity of a story on Twitter does have an effect on the number of articles devoted to it by traditional media, with an increase of about 1 article per 1000 additional tweets.
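
The event-grouping idea can be sketched as first-story detection: each incoming tweet either joins the most similar existing event or starts a new one. The TF-IDF representation and the fixed cosine threshold below are illustrative assumptions; the thesis evaluates richer text and text-image representations:

    # Hedged sketch: grouping tweets into events by first-story detection.
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    tweets = ["flood in Paris subway tonight",
              "Paris subway flood worsens tonight",
              "new smartphone released today"]
    X = TfidfVectorizer().fit_transform(tweets).toarray()

    events, labels = [], []  # each event keeps the vectors of its tweets
    for x in X:
        sims = [cosine_similarity([x], [np.mean(vecs, axis=0)])[0, 0]
                for vecs in events]
        if sims and max(sims) > 0.2:  # hypothetical merge threshold
            k = int(np.argmax(sims))
            events[k].append(x)
        else:
            events.append([x])
            k = len(events) - 1
        labels.append(k)
    print(labels)  # [0, 0, 1]: the two flood tweets form one event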
APA, Harvard, Vancouver, ISO, and other styles
32

Ataky, Steve Tsham Mpinda. "Análise de dados sequenciais heterogêneos baseada em árvore de decisão e modelos de Markov : aplicação na logística de transporte." Universidade Federal de São Carlos, 2015. https://repositorio.ufscar.br/handle/ufscar/7242.

Full text
Abstract:
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES)
In recent years, data mining techniques have been developed in many application fields with the aim of analyzing large volumes of data, which may be simple and/or complex. Transport logistics, and the railway sector in particular, is a field with such characteristics, in that the available data are of varied natures (classic variables such as top speed or type of train, symbolic variables such as the set of routes traveled by a train, degree of adherence, etc.). In this dissertation, we address the problem of classifying and predicting heterogeneous data through two main approaches. First, an automatic classification approach was implemented based on the classification-tree technique, which also allows new data to be efficiently integrated into previously initialized partitions. The second contribution of this work concerns the analysis of sequential data. We propose to combine the above classification method with Markov models in order to partition time series (temporal sequences) into homogeneous and meaningful groups based on probabilities. The resulting model offers a good interpretation of the classes built and allows us to estimate the evolution of the sequences of a particular vehicle. Both approaches were then applied to real data from a Brazilian railway information system company, with the aim of supporting strategic planning management and coherent prediction. This work first provides a finer typology of planning, to solve the problems associated with the existing classification into homogeneous circulation groups. Second, it seeks to define a typology of train paths (the succession of runs of the same train) in order to provide or predict the statistical characteristics of the most likely next run of a train on the same route. The overall methodology provides a decision-support environment for monitoring and controlling the planning organization. To this end, a formula with two variants was proposed to calculate the degree of adherence between the path actually carried out, or being carried out, and the planned one.
Nos últimos anos aflorou o desenvolvimento de técnicas de mineração de dados em muitos domínios de aplicação com a finalidade de analisar grandes volumes de dados, os quais podem ser simples e/ou complexos. A logística de transporte, o setor ferroviário em particular, é uma área com tal característica, em que os dados disponíveis são muitos e de variadas naturezas (variáveis clássicas como velocidade máxima ou tipo de trem, variáveis simbólicas como o conjunto de vias percorridas pelo trem, etc.). Como parte desta dissertação, aborda-se o problema de classificação e previsão de dados heterogêneos, que se propõe estudar através de duas abordagens principais. Primeiramente, foi utilizada uma abordagem de classificação automática com base na técnica por árvore de classificação, a qual também permite que novos dados sejam eficientemente integrados nas partições iniciais. A segunda contribuição deste trabalho diz respeito à análise de dados sequenciais. Propôs-se combinar o método de classificação anterior com modelos de Markov para obter uma partição de sequências temporais em grupos homogêneos e significativos com base nas probabilidades. O modelo resultante oferece uma boa interpretação das classes construídas e permite estimar a evolução das sequências de um determinado veículo. Ambas as abordagens foram então aplicadas nos dados do sistema de informação ferroviário, no espírito de dar apoio à gestão estratégica de planejamentos e previsões aderentes. Este trabalho consiste em fornecer inicialmente uma tipologia mais fina de planejamento para resolver os problemas associados com a classificação existente em grupos de circulações homogêneos. Em segundo lugar, buscou-se definir uma tipologia de trajetórias de trens (sucessão de circulações de um mesmo trem) para assim fornecer ou prever características estatísticas da próxima circulação mais provável de um trem realizando o mesmo percurso. A metodologia geral proporciona um ambiente de apoio à decisão para o monitoramento e controle da organização de planejamento. Desse modo, uma fórmula com duas variantes foi proposta para calcular o grau de aderência entre a trajetória efetivamente realizada, ou em curso de realização, e a planejada.
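
The decision-tree/Markov combination can be pictured with a compact sketch: each group of circulations gets its own transition matrix, and a new trajectory is assigned to the group under which it is most likely. States, groups and data below are hypothetical, and the classification-tree step that forms the groups is assumed to have already run:

    # Hedged sketch: per-group Markov chains over symbolic run sequences.
    import numpy as np

    STATES = ["on_time", "late", "cancelled"]
    IDX = {s: i for i, s in enumerate(STATES)}

    def fit_markov(sequences):
        counts = np.ones((len(STATES), len(STATES)))  # Laplace smoothing
        for seq in sequences:
            for a, b in zip(seq, seq[1:]):
                counts[IDX[a], IDX[b]] += 1
        return counts / counts.sum(axis=1, keepdims=True)

    def log_likelihood(seq, P):
        return sum(np.log(P[IDX[a], IDX[b]]) for a, b in zip(seq, seq[1:]))

    group_models = {
        "regular": fit_markov([["on_time", "on_time", "on_time", "late"]]),
        "unstable": fit_markov([["late", "cancelled", "late", "late"]]),
    }
    new_run = ["on_time", "on_time", "late"]
    best = max(group_models, key=lambda g: log_likelihood(new_run, group_models[g]))
    print(best)  # group whose transitions best explain the new trajectory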
APA, Harvard, Vancouver, ISO, and other styles
33

Valentin, Sarah. "Extraction et combinaison d’informations épidémiologiques à partir de sources informelles pour la veille des maladies infectieuses animales." Thesis, Montpellier, 2020. http://www.theses.fr/2020MONTS067.

Full text
Abstract:
L’intelligence épidémiologique a pour but de détecter, d’analyser et de surveiller au cours du temps les potentielles menaces sanitaires. Ce processus de surveillance repose sur des sources dites formelles, telles que les organismes de santé officiels, et des sources dites informelles, comme les médias. La veille des sources informelles est réalisée au travers de la surveillance basée sur les événements (event-based surveillance en anglais). Ce type de veille requiert le développement d’outils dédiés à la collecte et au traitement de données textuelles non structurées publiées sur le Web. Cette thèse se concentre sur l’extraction et la combinaison d’informations épidémiologiques extraites d’articles de presse en ligne, dans le cadre de la veille des maladies infectieuses animales. Le premier objectif de cette thèse est de proposer et de comparer des approches pour améliorer l’identification et l’extraction d’informations épidémiologiques pertinentes à partir du contenu d’articles. Le second objectif est d’étudier l’utilisation de descripteurs épidémiologiques (i.e. maladies, hôtes, localisations et dates) dans le contexte de l’extraction d’événements et de la mise en relation d’articles similaires au regard de leur contenu épidémiologique. Dans ce manuscrit, nous proposons de nouvelles représentations textuelles fondées sur la sélection, l’expansion et la combinaison de descripteurs épidémiologiques. Nous montrons que l’adaptation et l’extension de méthodes de fouille de texte et de classification permettent d’améliorer l’utilisation des articles en ligne en tant que source de données sanitaires. Nous mettons en évidence le rôle de l’expertise quant à la pertinence et l’interprétabilité de certaines des approches proposées. Bien que nos travaux soient menés dans le contexte de la surveillance de maladies en santé animale, nous discutons des aspects génériques des méthodes proposées, vis-à-vis de maladies inconnues et dans un contexte One Health (« une seule santé »).
Epidemic intelligence aims to detect, investigate and monitor potential health threats while relying on formal (e.g. official health authorities) and informal (e.g. media) information sources. Monitoring of unofficial sources, or so-called event-based surveillance (EBS), requires the development of systems designed to retrieve and process unstructured textual data published online. This manuscript focuses on the extraction and combination of epidemiological information from informal sources (i.e. online news), in the context of the international surveillance of animal infectious diseases. The first objective of this thesis is to propose and compare approaches to enhance the identification and extraction of relevant epidemiological information from the content of online news. The second objective is to study the use of epidemiological entities extracted from the news articles (i.e. diseases, hosts, locations and dates) in the context of event extraction and retrieval of related online news. This manuscript proposes new textual representation approaches by selecting, expanding, and combining relevant epidemiological features. We show that adapting and extending text mining and classification methods improves the added value of online news sources for event-based surveillance. We stress the role of domain expert knowledge regarding the relevance and the interpretability of the methods proposed in this thesis. While our research is conducted in the context of animal disease surveillance, we discuss the generic aspects of our approaches regarding unknown threats and One Health surveillance.
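
One simple way to picture combining raw text with epidemiological descriptors is the sketch below, under strong assumptions: toy lexicons, hypothetical labels, and a plain logistic regression instead of the representations actually studied in the thesis:

    # Hedged sketch: news relevance from text plus descriptor counts.
    import numpy as np
    from scipy.sparse import hstack
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    docs = ["avian influenza outbreak reported in poultry farm",
            "football match postponed after heavy rain",
            "african swine fever spreads among wild boars",
            "new phone model released this week"]
    labels = [1, 0, 1, 0]              # 1 = epidemiologically relevant (toy)

    DISEASES = {"influenza", "fever"}  # toy descriptor lexicons
    HOSTS = {"poultry", "boars"}

    tfidf = TfidfVectorizer()
    X_text = tfidf.fit_transform(docs)
    X_desc = np.array([[sum(w in DISEASES for w in d.split()),
                        sum(w in HOSTS for w in d.split())] for d in docs])
    X = hstack([X_text, X_desc])

    clf = LogisticRegression().fit(X, labels)
    news = ["swine fever detected in wild boars"]
    X_new = hstack([tfidf.transform(news), np.array([[1, 1]])])
    print(clf.predict(X_new))          # expected: [1], i.e. relevant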
APA, Harvard, Vancouver, ISO, and other styles
34

Richard, Jérémy. "De la capture de trajectoires de visiteurs vers l’analyse interactive de comportement après enrichissement sémantique." Electronic Thesis or Diss., La Rochelle, 2023. http://www.theses.fr/2023LAROS012.

Full text
Abstract:
Cette thèse porte sur l’étude comportementale de l’activité touristique en utilisant une approche d’analyse générique et interactive. Le processus d’analyse développé concerne la trajectoire touristique dans la ville et dans les musées en tant que terrain d’étude. Des expérimentations ont été menées pour collecter les données de déplacement dans la ville touristique en utilisant des signaux GPS, permettant ainsi l’obtention d’une trajectoire de déplacement. Toutefois, l’étude se focalise en premier lieu sur la reconstruction de la trajectoire d’un visiteur dans les musées à l’aide d’un équipement de positionnement intérieur, c’est-à-dire dans un environnement contraint. Ensuite, un modèle d’enrichissement sémantique multi-aspects générique est développé pour compléter la trajectoire d’un individu en utilisant plusieurs données de contexte telles que les noms des quartiers traversés par l’individu dans la ville, les salles des musées, la météo à l’extérieur et des données d’application mobile à l’intérieur. Les trajectoires enrichies, appelées trajectoires sémantiques, sont ensuite analysées à l’aide de l’analyse formelle de concept et de la plateforme GALACTIC, qui permet l’analyse de structures de données complexes et hétérogènes sous la forme d’une hiérarchie de sous-groupes d’individus partageant des comportements communs. Enfin, l’attention est portée sur l’algorithme "ReducedContextCompletion" qui permet la navigation interactive dans un treillis de concepts, ce qui permet à l’analyste de données de se concentrer sur les aspects de la donnée qu’il souhaite explorer
This thesis focuses on the behavioral study of tourist activity using a generic and interactive analysis approach. The developed analytical process concerns the tourist trajectory in the city and in museums as the field of study. Experiments were conducted to collect movement data in the tourist city using GPS signals, thus enabling the acquisition of a movement trajectory. However, the study primarily focuses on reconstructing a visitor's trajectory in museums using indoor positioning equipment, i.e., in a constrained environment. Then, a generic multi-aspect semantic enrichment model is developed to supplement an individual's trajectory using multiple context data, such as the names of neighborhoods the individual passed through in the city, museum rooms, the weather outside, and indoor mobile application data. The enriched trajectories, called semantic trajectories, are then analyzed using formal concept analysis and the GALACTIC platform, which enables the analysis of complex and heterogeneous data structures as a hierarchy of subgroups of individuals sharing common behaviors. Finally, attention is paid to the "ReducedContextCompletion" algorithm, which allows interactive navigation in a lattice of concepts, letting the data analyst focus on the aspects of the data they wish to explore.
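
What the concept-lattice navigation operates on can be shown with a brute-force Formal Concept Analysis sketch over a toy binary context (the visitors and behaviours are hypothetical; GALACTIC itself handles far richer heterogeneous descriptions and is not reproduced here):

    # Hedged sketch: enumerate all formal concepts of a tiny context.
    from itertools import combinations

    context = {                  # hypothetical visitor -> observed behaviours
        "v1": {"fast", "guided"},
        "v2": {"fast", "audio"},
        "v3": {"fast", "guided", "audio"},
    }
    objects = list(context)

    def common_attrs(objs):
        sets = [context[o] for o in objs]
        return set.intersection(*sets) if sets else set.union(*context.values())

    def objects_with(attrs):
        return {o for o in objects if attrs <= context[o]}

    concepts = set()
    for r in range(len(objects) + 1):
        for objs in combinations(objects, r):
            intent = common_attrs(list(objs))
            extent = objects_with(intent)
            concepts.add((frozenset(extent), frozenset(intent)))
    for extent, intent in sorted(concepts, key=lambda c: len(c[0])):
        print(sorted(extent), sorted(intent))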
APA, Harvard, Vancouver, ISO, and other styles
35

Jiang, Xinxin. "Mining heterogeneous enterprise data." Thesis, 2018. http://hdl.handle.net/10453/129377.

Full text
Abstract:
University of Technology Sydney. Faculty of Engineering and Information Technology.
Heterogeneity is becoming one of the key characteristics of enterprise data, because globalization and competition stress the importance of leveraging the huge amounts of data that enterprises accumulate across various organizational processes, resources and standards. Effectively deriving meaningful insights from complex, large-scale heterogeneous enterprise data poses an interesting but critical challenge. The aim of this thesis is to investigate the theoretical foundations of mining heterogeneous enterprise data in light of the above challenges and to develop new algorithms and frameworks that are able to effectively and efficiently consider heterogeneity in four elements of the data: objects, events, context, and domains. Objects describe a variety of business roles and instruments involved in business systems. Object heterogeneity means that object information at both the data and structural levels is heterogeneous. The proposed cost-sensitive hybrid neural network (Cs-HNN) leverages parallel network architectures and an algorithm specifically designed for minority classification to generate a robust model for learning heterogeneous objects. Events trace an object's behaviours or activities. Event heterogeneity reflects the level of variety in business events and is normally expressed in the type and format of features. The approach proposed in this thesis focuses on fleet tracking as a practical example of an application with a high degree of event heterogeneity. Context describes the environment and circumstances surrounding objects and events. Context heterogeneity reflects the degree of diversity in contextual features. The coupled collaborative filtering (CCF) approach proposed in this thesis is able to provide context-aware recommendations by measuring the non-independent and identically distributed (non-IID) relationships across diverse contexts. Domains are the sources of information and reflect the nature of the business or function that has generated the data. The cross-domain deep learning approach (Cd-DLA) proposed in this thesis provides a potential avenue to overcome the complexity and nonlinearity of heterogeneous domains. Each of the approaches, algorithms, and frameworks for heterogeneous enterprise data mining presented in this thesis outperforms the state-of-the-art methods in a range of backgrounds and scenarios, as evidenced by a theoretical analysis, an empirical study, or both. All outcomes derived from this research have been published or accepted for publication, and the follow-up work has also been recognised, which demonstrates scholarly interest in mining heterogeneous enterprise data as a research topic. However, despite this interest, heterogeneous data mining still holds increasingly attractive opportunities for further exploration and development in both academia and industry.
APA, Harvard, Vancouver, ISO, and other styles
38

Bharti, Santosh Kumar. "Sarcasm Detection in Textual Data: A Supervised Approach." Thesis, 2019. http://ethesis.nitrkl.ac.in/10002/1/2019_PhD_SKBharti_513CS1037_Sarcasm.pdf.

Full text
Abstract:
Sentiment analysis is a technique to identify people's opinion, attitude, sentiment, and emotion towards any specific target such as individuals, events, topics, products, organizations, services, etc. Sarcasm is a special type of sentiment that comprises words which are opposite in meaning to what is really being said (especially in a sense of insult, wit, irritation, or humor). People often express it verbally through heavy tonal stress and certain gestural clues such as eye rolling or hand movements. These tonal and gestural clues are obviously missing when sarcasm is expressed in text, making its detection reliant upon other factors such as capitalization of words, punctuation marks, exclamation marks, etc. To express sarcasm in text, one often uses positive or intensified positive words to convey negative feelings about a particular target. Nowadays, posting sarcastic messages on social media like Twitter, Facebook, WhatsApp, etc., has become a new trend to avoid direct negativity. Detecting this indirect negativity, i.e., sarcasm, in social media text has become an important task, as it influences every business organization. In the presence of sarcasm, sentiment analysis of such social media text becomes a most challenging task. The property of sarcasm that makes it difficult to analyze and detect is the gap between its literal and intended meaning. Therefore, an automated system is required for sarcasm detection in textual data that is capable of identifying the actual sentiment of a given text in the presence of sarcasm. In this thesis, we propose an automated system for sarcasm detection in tweets scripted in English as well as Hindi (transliterated in English). It also detects sarcasm in Telugu conversation sentences (transliterated in English). Sarcasm detection methods for text can be categorized as rule-based, pattern-based, machine learning-based and context-based. The rule-based approach is the most basic method used for sarcasm detection in text. In this approach, we mainly focus on hyperbolic and syntactic features of the text. Interjections, intensifiers and punctuation symbols are the most frequent hyperbole features used in text to infer sarcastic messages. Extreme adjectives and extreme adverbs act as intensifiers; some examples are "thoroughly enjoyed", "fantastic weather", "so beautiful", etc. The rule-based approach is simple to implement and often attains good accuracy for text classification. Three rule-based classification methods are proposed, one each for English, Hindi and Telugu. The pattern-based approach is the most effective classifier for sarcasm detection in text. Here, a corpus of sarcastic tweets and conversation sentences was analyzed, and six unique patterns of sarcastic text were obtained. The patterns are: sarcasm as a contradiction between a tweet's sentiment and its situation phrases, sarcasm as a contradiction between a user's likes and dislikes in Twitter data, sarcasm as a contradiction between a tweet and the universal truth, sarcasm as a contradiction between a tweet and its time-dependent facts, sarcasm as a contradiction between a tweet's sentiment and the context on which it is posted, and a positive tweet with antonym pairs of either verbs, adverbs or adjectives. These approaches attain high accuracy for sarcasm detection in text. The machine learning-based approach is the most common technique used for classification.
The performance of machine learning classifiers often depends on the quality of the dataset and feature set. In this thesis, lexical, syntactic, hyperbolic and sentiment features are used in various machine learning algorithms. The classifiers evaluated are Naive Bayes (NB), Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF) and AdaBoost. Among these classifiers, NB outperformed the others because of its independence assumption over text features; in text classification, especially when the training set is small, NB performs better than other classifiers. The context-based approach is the most important method for text classification. Sarcasm can be detected by considering lexical, pragmatic, hyperbolic or other such features of the text. Some features can also be developed using patterns such as unigrams, bigrams, trigrams, etc. There can be features based on verbal or gestural clues, such as emoticons, onomatopoeic expressions of laughter, positive interjections, quotation marks, and the use of punctuation, which can help in detecting sarcasm. But all these features are not enough to identify sarcasm in text unless the context of the text is known. The machine, like a human, should be aware of the context of the text and relate it to general world knowledge to identify sarcasm more accurately. In this approach, we mainly focus on the situational, topical, temporal, and historical context of the text.
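As a rough illustration of the machine-learning approach described above, the sketch below trains a Naive Bayes classifier on a handful of made-up labelled tweets; the tiny dataset and plain bag-of-words features are assumptions for illustration, not the thesis's actual corpus or feature set:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical labelled tweets: 1 = sarcastic, 0 = not sarcastic.
tweets = [
    "I just love being stuck in traffic for three hours!",
    "Fantastic weather, my flight got cancelled again!",
    "Had a great dinner with friends tonight.",
    "The new library downtown is really helpful.",
]
labels = [1, 1, 0, 0]

# Unigram and bigram counts feed a Naive Bayes classifier, which the
# abstract reports performing well when the training set is small.
model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(tweets, labels)

print(model.predict(["Wow, thoroughly enjoyed waiting an hour for coffee!"]))
```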
APA, Harvard, Vancouver, ISO, and other styles
39

Hasan, Maryam. "Extracting Structured Knowledge from Textual Data in Software Repositories." Master's thesis, 2011. http://hdl.handle.net/10048/1776.

Full text
Abstract:
Software team members, as they communicate and coordinate their work with others throughout the life-cycle of their projects, generate different kinds of textual artifacts. Despite the variety of work in the area of mining software artifacts, relatively little research has focused on communication artifacts. Software communication artifacts, in addition to source code artifacts, contain useful semantic information that is not fully explored by existing approaches. This thesis presents the development of a text analysis method and tool to extract and represent useful pieces of information from a wide range of textual data sources associated with software projects. Our text analysis system integrates Natural Language Processing techniques and statistical text analysis methods with software domain knowledge. The extracted information is represented as RDF-style triples which constitute interesting relations between developers and software products. We applied the developed system to analyze five kinds of textual data: source code commits, bug reports, email messages, chat logs, and wiki pages. In the evaluation of our system, we found its precision to be 82%, its recall 58%, and its F-measure 68%.
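A minimal sketch of turning a commit message into an RDF-style triple; the log format, the regular expression, and the (developer, action, artifact) schema are simplifications assumed for illustration, not the thesis's actual pipeline:

```python
import re

# Hypothetical commit log lines in the form "<developer>: <action> <artifact>".
commits = [
    "alice: fixed parser.py",
    "bob: refactored scheduler.c",
]

triples = []
for line in commits:
    m = re.match(r"(\w+): (\w+) (\S+)", line)
    if m:
        developer, action, artifact = m.groups()
        # RDF-style triple: subject, predicate, object.
        triples.append((developer, action, artifact))

for t in triples:
    print(t)   # e.g. ('alice', 'fixed', 'parser.py')
```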
APA, Harvard, Vancouver, ISO, and other styles
40

Sarkas, Nikolaos. "Querying, Exploring and Mining the Extended Document." Thesis, 2011. http://hdl.handle.net/1807/29857.

Full text
Abstract:
The evolution of the Web into an interactive medium that encourages active user engagement has ignited a huge increase in the amount, complexity and diversity of available textual data. This evolution forces us to re-evaluate our view of documents as simple pieces of text and of document collections as immutable and isolated. Extended documents published in the context of blogs, micro-blogs, on-line social networks and customer feedback portals can be associated with a wealth of meta-data in addition to their textual component: tags, links, sentiment, entities mentioned in text, etc. Collections of user-generated documents grow, evolve, co-exist and interact: they are dynamic and integrated. These unique characteristics of modern documents and document collections present us with exciting opportunities for improving the way we interact with them. At the same time, this additional complexity, combined with the vast amounts of available textual data, presents us with formidable computational challenges. In this context, we introduce, study and extensively evaluate an array of effective and efficient solutions for querying, exploring and mining extended documents and dynamic, integrated document collections. For collections of socially annotated extended documents, we present an improved probabilistic search and ranking approach based on our growing understanding of the dynamics of the social annotation process. For extended documents, such as blog posts, associated with entities extracted from text and categorical attributes, we enable interactive exploration through the efficient computation of strong entity associations; associated entities are computed for all possible attribute-value restrictions of the document collection. For extended documents, such as user reviews, annotated with a numerical rating, we introduce a keyword-query refinement approach that enables the interactive navigation and exploration of large result sets. We extend the skyline query to document streams, such as news articles, associated with categorical attributes and partially ordered domains; the technique incrementally maintains a small set of recent, uniquely interesting extended documents from the stream. Finally, we introduce a solution for the scalable integration of structured data sources into Web search, where queries are analysed to determine what structured data, if any, should be used to augment Web search results.
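One piece of the above, the skyline over extended documents, can be sketched directly: a document survives if no other document dominates it on every attribute. The two-attribute documents below (recency and relevance, higher is better) are assumed for illustration; the thesis additionally handles streams and partially ordered domains:

```python
def dominates(a, b):
    """a dominates b if it is at least as good everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def skyline(docs):
    """Keep only documents not dominated by any other document."""
    return [d for d in docs if not any(dominates(o, d) for o in docs if o is not d)]

# (recency, relevance) scores for hypothetical news articles.
docs = [(5, 1), (3, 4), (4, 4), (1, 5), (2, 2)]
print(skyline(docs))   # (3, 4) and (2, 2) are dominated by (4, 4)
```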
APA, Harvard, Vancouver, ISO, and other styles
41

Dev, Kapil. "Spatio-Textual Similarity Joins Using Variable Prefix Filtering and MBR Filtering." Thesis, 2016. http://ethesis.nitrkl.ac.in/9110/1/2016_MT_KDev.pdf.

Full text
Abstract:
Given a set of objects that carry textual and spatial information, a spatio-textual similarity join computes the pairs of objects that are textually similar and spatially near. A huge amount of work has considered the spatial dimension alone, but much less has been done on spatio-textual joins. Nowadays, due to the availability of GPS-enabled devices, a huge amount of spatio-textual information is being generated, which calls for new methods to operate on this type of data. Here we study join operations for spatio-textual data and use various optimization techniques, such as efficient grid partitioning for spatial information, the MBR-prefix technique for the R-tree data structure, variable prefix lengths for textual information, and ordering of textual vectors by their TF-IDF values. We show the improvement from these optimizations in terms of running time as well as pruning of non-candidates.
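The prefix-filtering idea can be sketched for the textual side alone: order each record's tokens by global rarity (a stand-in for the TF-IDF ordering mentioned above), and only pairs sharing a token in their prefixes need full verification. The records and Jaccard threshold below are assumptions for illustration:

```python
import math
from collections import defaultdict

records = {
    1: {"cafe", "pizza", "rome"},
    2: {"pizza", "rome", "pasta"},
    3: {"sushi", "tokyo", "ramen"},
}
t = 0.5  # Jaccard similarity threshold (assumed)

# Global ordering: rarest tokens first, mimicking an IDF-style ordering.
freq = defaultdict(int)
for toks in records.values():
    for tok in toks:
        freq[tok] += 1
order = lambda toks: sorted(toks, key=lambda x: (freq[x], x))

# Prefix filter: a pair with Jaccard >= t must share a token among the
# first |x| - ceil(t * |x|) + 1 tokens of each record.
index = defaultdict(set)
candidates = set()
for rid, toks in records.items():
    prefix_len = len(toks) - math.ceil(t * len(toks)) + 1
    for tok in order(toks)[:prefix_len]:
        candidates.update((other, rid) for other in index[tok])
        index[tok].add(rid)

def jaccard(a, b):
    return len(a & b) / len(a | b)

# Verify only the surviving candidates; record 3 is never even compared.
print([p for p in candidates if jaccard(records[p[0]], records[p[1]]) >= t])
```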
APA, Harvard, Vancouver, ISO, and other styles
42

Chen, Meng-Peng, and 陳夢芃. "Using Data Mining Techniques for Studying the Consumer Heterogeneous Needs." Thesis, 2014. http://ndltd.ncl.edu.tw/handle/70574846603697298093.

Full text
Abstract:
Master's thesis
National Chin-Yi University of Technology
Department of Industrial Engineering and Management
Academic year 102 (R.O.C. calendar, i.e., 2013-2014)
Bookstores face fierce horizontal competition. Customer buying behavior is often recorded only as book sales data, and studies that segment customers for marketing strategy often ignore that people with different lifestyles show different reading tendencies and preferences. The basic conditions for improving bookstore sales are to provide a variety of products and books, to set reasonable prices, and to meet the needs of different customers with good service. But the original intention of the reading customer is to obtain information, not the books themselves. If bookstores can discover the real needs of consumers during the buying process, they can make proposals tailored to different consumers and design marketing activities that are more attractive to them; this is currently an important topic. In this study, association analysis from data mining is used to identify the heterogeneous needs of customers and to understand their reading preferences, in order to develop marketing programs and displays. Providing a customized product portfolio in a cross-selling mode improves customer satisfaction and purchase loyalty. The resulting research report will help the industry regain intimacy with readers and understand new relationships with customers.
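The association analysis mentioned above can be sketched as a tiny Apriori-style pass over hypothetical purchase baskets; the baskets and the support and confidence thresholds are assumptions for illustration:

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase baskets from bookstore transactions.
baskets = [
    {"travel guide", "city map"},
    {"travel guide", "city map", "novel"},
    {"novel", "poetry"},
    {"travel guide", "city map"},
]
min_support, min_confidence = 0.5, 0.8

n = len(baskets)
item_counts, pair_counts = Counter(), Counter()
for b in baskets:
    item_counts.update(b)
    pair_counts.update(combinations(sorted(b), 2))

# Rule X -> Y: support = P(X and Y), confidence = P(Y | X).
for (x, y), c in pair_counts.items():
    if c / n >= min_support:
        for lhs, rhs in ((x, y), (y, x)):
            conf = c / item_counts[lhs]
            if conf >= min_confidence:
                print(f"{lhs} -> {rhs} (support={c/n:.2f}, confidence={conf:.2f})")
```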
APA, Harvard, Vancouver, ISO, and other styles
43

Chopra, Pankaj. "Data mining techniques to enable large-scale exploratory analysis of heterogeneous scientific data." 2009. http://www.lib.ncsu.edu/theses/available/etd-04092009-161454/unrestricted/etd.pdf.

Full text
APA, Harvard, Vancouver, ISO, and other styles
44

Dlamini, Phezulu, and 佩祖露. "Mining Textual Relationships from Social Media Data for Users’ E-Learning Experiences." Thesis, 2017. http://ndltd.ncl.edu.tw/handle/r4v6xc.

Full text
APA, Harvard, Vancouver, ISO, and other styles
45

"All Purpose Textual Data Information Extraction, Visualization and Querying." Master's thesis, 2018. http://hdl.handle.net/2286/R.I.50530.

Full text
Abstract:
Since the advent of the internet, and even more so since the rise of social media platforms, the explosive growth of textual data and its availability has made analysis a tedious task. Information extraction systems are available but are generally too specific, often extracting only the kinds of information they deem necessary and worth extracting. With data visualization theory and fast, interactive querying methods, leaving out information might not really be necessary. This thesis explores textual data visualization techniques, intuitive querying, and a novel approach to all-purpose textual information extraction that encodes a large text corpus to improve human understanding of the information present in textual data. This thesis presents a modified traversal algorithm over the dependency parse output of text that extracts all subject-predicate-object pairs from the text while ensuring that no information is missed. To support full-scale, all-purpose information extraction from large text corpora, a data preprocessing pipeline is recommended before the extraction is run. The output format is designed specifically to fit a node-edge-node model and to form the building blocks of a network, which makes understanding the text and querying information from the corpus quick and intuitive. The approach attempts to reduce reading time and enhance understanding of the text using an interactive graph and timeline.
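A minimal sketch of subject-predicate-object extraction from a dependency parse, using spaCy and assuming the en_core_web_sm model is installed; the thesis's modified traversal algorithm covers many more cases than this naive verb-centered walk:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the model has been downloaded

def spo_pairs(text):
    """Extract naive (subject, predicate, object) triples from a parse."""
    doc = nlp(text)
    triples = []
    for token in doc:
        if token.pos_ == "VERB":
            subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
            objects = [c for c in token.children if c.dep_ in ("dobj", "obj")]
            for s in subjects:
                for o in objects:
                    # Each triple maps onto the node-edge-node model.
                    triples.append((s.text, token.lemma_, o.text))
    return triples

print(spo_pairs("The committee approved the budget."))
```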
Dissertation/Thesis
Masters Thesis Software Engineering 2018
APA, Harvard, Vancouver, ISO, and other styles
46

"Learning from the Data Heterogeneity for Data Imputation." Doctoral diss., 2021. http://hdl.handle.net/2286/R.I.64299.

Full text
Abstract:
Data mining, also known as big data analysis, has been identified as a critical and challenging process for a variety of applications in real-world problems. Numerous datasets are collected and generated every day to store information. The rise in data volume and data modality has resulted in an increased demand for data mining methods and strategies for finding anomalies, patterns, and correlations within large data sets to predict outcomes. Effective machine learning methods are widely adopted to build the data mining pipeline for various purposes, such as business understanding, data understanding, data preparation, modeling, evaluation, and deployment. The major challenges to effectively and efficiently mining big data are (1) data heterogeneity and (2) missing data. Heterogeneity is a natural characteristic of big data, as the data is typically collected from different sources with diverse formats. Missing values are the most common issue faced in heterogeneous data analysis, resulting from a variety of factors including the data collection process, user initiative, erroneous data entries, and so on. In response to these challenges, this thesis investigates three main research directions with application scenarios: (1) mining and formulating heterogeneous data, (2) missing-value imputation strategies for various application scenarios in both offline and online settings, and (3) missing-value imputation for multi-modality data. Multiple strategies with theoretical analysis are presented, and the effectiveness of the proposed algorithms is evaluated against state-of-the-art methods.
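As a minimal illustration of the imputation problem discussed above, the sketch below fills missing entries per column with the column mean using scikit-learn; the data is made up, and the thesis's strategies for heterogeneous and multi-modality data are far richer than mean imputation:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical feature matrix with missing entries (np.nan).
X = np.array([
    [1.0, 7.0, np.nan],
    [2.0, np.nan, 9.0],
    [np.nan, 5.0, 12.0],
])

# Mean imputation: each nan is replaced by its column's observed mean.
imputer = SimpleImputer(strategy="mean")
print(imputer.fit_transform(X))
```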
Dissertation/Thesis
Doctoral Dissertation Computer Engineering 2021
APA, Harvard, Vancouver, ISO, and other styles
47

Louis, Anita Lily. "Unsupervised discovery of relations for analysis of textual data in digital forensics." Diss., 2010. http://hdl.handle.net/2263/27479.

Full text
Abstract:
This dissertation addresses the problem of analysing digital data in digital forensics. It is shown that text mining methods can be adapted and applied to digital forensics to help analysts analyse data more quickly, efficiently and accurately, revealing truly useful information. Investigators who wish to utilise digital evidence must examine and organise the data to piece together the events and facts of a crime. The difficulty with finding relevant information quickly using current tools and methods is that these tools rely very heavily on background knowledge for query terms and do not fully utilise the content of the data. A novel framework in which to perform evidence discovery is proposed in order to reduce the quantity of data to be analysed, aid the analysts' exploration of the data and enhance the intelligibility of the presentation of the data. The framework combines information extraction techniques with visual exploration techniques to provide a novel approach to performing evidence discovery, in the form of an evidence discovery system. By utilising unrestricted, unsupervised information extraction techniques, the investigator does not need input queries or keywords for searching, and can thus analyse portions of the data that may not have been identified by keyword searches. The evidence discovery system produces text graphs of the most important concepts and associations extracted from the full text to establish ties between the concepts and provide an overview and general representation of the text. Through an interactive visual interface the investigator can explore the data to identify suspects, events and the relations between suspects. Two models are proposed for performing the relation extraction process of the evidence discovery framework. The first model takes a statistical approach to discovering relations based on co-occurrences of complex concepts. The second model takes a linguistic approach, using named entity extraction and information extraction patterns. A preliminary study was performed to assess the usefulness of a text mining approach to digital forensics as against the traditional information retrieval approach. It was concluded that the novel approach to text analysis for evidence discovery presented in this dissertation is viable and promising. The preliminary experiment showed that the results obtained from the evidence discovery system, using either of the relation extraction models, are sensible and useful.
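The first, statistical relation-extraction model can be sketched as a sentence-level co-occurrence count; the sentences, pre-identified concepts, and threshold below are assumptions for illustration only:

```python
from collections import Counter
from itertools import combinations

# Hypothetical forensic text split into sentences, with concepts already
# identified (in practice they would be extracted automatically).
sentences = [
    {"alice", "warehouse", "payment"},
    {"alice", "bob", "payment"},
    {"bob", "warehouse"},
    {"alice", "payment"},
]

co_occurrence = Counter()
for concepts in sentences:
    co_occurrence.update(combinations(sorted(concepts), 2))

# Keep pairs that co-occur often enough to suggest a relation; these
# become edges in the text graph an investigator explores visually.
min_count = 2
for pair, count in co_occurrence.most_common():
    if count >= min_count:
        print(pair, count)   # e.g. ('alice', 'payment') 3
```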
Dissertation (MSc)--University of Pretoria, 2010.
Computer Science
APA, Harvard, Vancouver, ISO, and other styles
48

hung, Cheng chih, and 鄭志宏. "An Attribute-based Approach for Data Mining in the Environment of Heterogeneous Information Sources." Thesis, 1998. http://ndltd.ncl.edu.tw/handle/23514594121163647287.

Full text
Abstract:
Master's thesis
National Taiwan University of Science and Technology
Graduate Institute of Electronic Engineering Technology
Academic year 86 (R.O.C. calendar, i.e., 1997-1998)
Concerning the puzzles faced today in the process of KDD and data mining, we propose an integrated approach for dealing with the related problems in a heterogeneous information environment. Our approach employs a mediator-based mechanism to construct an integrated and reconciled data warehouse according to the results of the mediator's operation. The data warehouse can therefore serve as the target database during the data mining process. The data mining system itself is centered on a mechanism of attribute-based reasoning. First, we translate the initial data into conceptual information according to the attribute data via a process of generalization. During generalization, we simultaneously analyze statistics on the data quantities available for quantitative analysis and noise control. Finally, we represent the resulting information in a logical formalism. In this work, data is mined using four kinds of rules: characterization, discrimination, classification, and association rules. Each type of rule can be used in different subject-oriented applications. In addition, we define a simplified data mining language based on the characteristics of the data mining system and process. The language can easily be used for data mining and for manipulating database systems.
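The attribute-based generalization step can be sketched as climbing a concept hierarchy: each raw value is replaced by a higher-level concept, and identical tuples are merged with counts that then support the quantitative analysis the abstract mentions. The hierarchy and tuples below are assumptions for illustration:

```python
from collections import Counter

# Hypothetical concept hierarchy mapping raw attribute values upward.
hierarchy = {
    "Taipei": "North", "Hsinchu": "North", "Tainan": "South",
    "22": "young", "24": "young", "47": "middle-aged",
}

tuples = [("Taipei", "22"), ("Hsinchu", "24"), ("Tainan", "47"), ("Taipei", "24")]

# Generalize each tuple, then merge duplicates while keeping a count;
# the counts feed quantitative rules such as characterizations.
generalized = Counter(tuple(hierarchy.get(v, v) for v in t) for t in tuples)
for concept_tuple, count in generalized.items():
    print(concept_tuple, "count =", count)   # e.g. ('North', 'young') count = 3
```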
APA, Harvard, Vancouver, ISO, and other styles
49

Morgado, João Pedro Barreiro Gomes e. "Knowledge elicitation by merging heterogeneous data sources in a die-casting process." Master's thesis, 2015. http://hdl.handle.net/10316/39072.

Full text
Abstract:
Master's dissertation in Industrial Engineering and Management, presented to the Faculdade de Ciências e Tecnologia, Universidade de Coimbra.
In order to establish adaptive control of a manufacturing process, knowledge must be acquired about both the process and its environment. This knowledge can be obtained by mining large amounts of data collected through monitoring of the manufacturing process, which enables the study of process parameters and of the correlations among the process parameters and with the parameters of the environment. Through this, knowledge about the process and its relation to the environment can be established and, in turn, used for adaptive process control. The aim of this thesis is to study real manufacturing data obtained by monitoring a die casting process. First, to better understand the problem at hand, a literature review on the use of Big Data and on merging data from heterogeneous sources is given. Second, because the data is incomplete and noisy, a robust algorithm to assess the quality rate was developed from the real data. The process and environment data were then merged, making it possible to visualize the influence of various parameters on the quality rate and to make suggestions for improving the die casting process.
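A minimal sketch of the merging step, assuming both sources carry timestamps: process records and environment readings are aligned by nearest time with pandas, after which parameter/quality correlations can be inspected. The column names and data are hypothetical, not the thesis's actual schema:

```python
import pandas as pd

# Hypothetical die-casting process log and ambient-environment log.
process = pd.DataFrame({
    "time": pd.to_datetime(["2015-03-01 08:00", "2015-03-01 08:05",
                            "2015-03-01 08:10"]),
    "injection_pressure": [710, 695, 730],
    "quality_ok": [1, 0, 1],
})
environment = pd.DataFrame({
    "time": pd.to_datetime(["2015-03-01 07:58", "2015-03-01 08:06"]),
    "hall_temperature": [21.5, 23.0],
})

# Align each process record with the closest environment reading in time.
merged = pd.merge_asof(process.sort_values("time"),
                       environment.sort_values("time"),
                       on="time", direction="nearest")
print(merged)
print(merged[["injection_pressure", "hall_temperature", "quality_ok"]].corr())
```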
APA, Harvard, Vancouver, ISO, and other styles
50

"Stock market forecasting by integrating time-series and textual information." 2003. http://library.cuhk.edu.hk/record=b5896089.

Full text
Abstract:
Fung Pui Cheong Gabriel.
Thesis (M.Phil.)--Chinese University of Hong Kong, 2003.
Includes bibliographical references (leaves 88-93).
Abstracts in English and Chinese.
Front matter: Abstract (English); Abstract (Chinese); Acknowledgement; Contents; List of Figures; List of Tables
Part I: The Very Beginning
Chapter 1: Introduction (1.1 Contributions; 1.2 Dissertation Organization)
Chapter 2: Problem Formulation (2.1 Defining the Prediction Task; 2.2 Overview of the System Architecture)
Part II: Literature Review
Chapter 3: The Social Dynamics of Financial Markets (3.1 The Collective Behavior of Groups; 3.2 Prediction Based on Publicity Information)
Chapter 4: Time Series Representation (4.1 Technical Analysis; 4.2 Piecewise Linear Approximation)
Chapter 5: Text Classification (5.1 Document Representation; 5.2 Document Pre-processing; 5.3 Classifier Construction: 5.3.1 Naive Bayes (NB), 5.3.2 Support Vector Machine (SVM))
Part III: Mining Financial Time Series and Textual Documents Concurrently
Chapter 6: Time Series Representation (6.1 Discovering Trends on the Time Series; 6.2 t-test Based Split and Merge Segmentation Algorithm - Splitting Phase; 6.3 t-test Based Split and Merge Segmentation Algorithm - Merging Phase)
Chapter 7: Article Alignment and Pre-processing (7.1 Aligning News Articles to the Stock Trends; 7.2 Selecting Positive Training Examples; 7.3 Selecting Negative Training Examples)
Chapter 8: System Learning (8.1 Similarity Based Classification Approach; 8.2 Category Sketch Generation: 8.2.1 Within-Category Coefficient, 8.2.2 Cross-Category Coefficient, 8.2.3 Average-Importance Coefficient; 8.3 Document Sketch Generation)
Chapter 9: System Operation (9.1 System Operation)
Part IV: Results and Discussions
Chapter 10: Evaluations (10.1 Time Series Evaluations; 10.2 Classifier Evaluations: 10.2.1 Batch Classification Evaluation, 10.2.2 Online Classification Evaluation, 10.2.3 Components Analysis, 10.2.4 Document Sketch Analysis; 10.3 Prediction Evaluations: 10.3.1 Simulation Results, 10.3.2 Hit Rate Analysis)
Part V: The Final Words
Chapter 11: Conclusion and Future Work
Appendix: A. Hong Kong Stocks Categorization Powered by Reuters; B. Morgan Stanley Capital International (MSCI) Classification; C. "Precision, Recall and F1 measure"
Bibliography
APA, Harvard, Vancouver, ISO, and other styles