Dissertations / Theses on the topic 'History of data mining'

Consult the top 50 dissertations / theses for your research on the topic 'History of data mining.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online, whenever these are available in the metadata.

Browse dissertations / theses on a wide variety of disciplines and organise your bibliography correctly.

1

Egas, Carlos A. "Methodology for Data Mining Customer Order History for Storage Assignment." Ohio University / OhioLINK, 2012. http://rave.ohiolink.edu/etdc/view?acc_num=ohiou1345223808.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

Rosswog, James. "Improving classification of spatiotemporal data using adaptive history filtering." Diss., Online access via UMI:, 2007.

Find full text
3

Virkkala, Linda, and Johanna Haglund. "Modelling of patterns between operational data, diagnostic trouble codes and workshop history using big data and machine learning." Thesis, Uppsala universitet, Datalogi, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-279823.

Full text
Abstract:
The work presented in this thesis is part of a large research and development project on condition-based maintenance for heavy trucks and buses at Scania. The aim of this thesis was to predict the status of a component (the starter motor) using data mining methods and to create models that can predict the failure of that component. Based on workshop history data, error codes and operational data, three sets of classification models were built and evaluated. The first model aims to find patterns in a set of error codes, to see which codes are related to a starter motor failure. The second model aims to see if there are patterns in operational data that lead to the occurrence of an error code. Finally, the two data sets were merged and a classifier was trained and evaluated on this larger data set. Two machine learning algorithms were used and compared throughout the model building: AdaBoost and random forest. There is no statistically significant difference in their performance, and both algorithms had error rates of around 13%, 5% and 13% for the three classification models respectively. However, random forest is much faster and is therefore the preferable option for an industrial implementation. Variable analysis was conducted for the error codes and operational data, resulting in rankings of informative variables. From the evaluation metric precision, it can be derived that if our random forest model predicts a starter motor failure, there is an 85.7% chance that it actually has failed. This model finds 32% (the model's recall) of the failed starter motors. It is also shown that four error codes, 2481, 2639, 2657 and 2597, have the highest predictive power for starter motor failure classification. For the operational data, variables that concern the starter motor lifetime and battery health are generally ranked as important by the models. The random forest model finds 81.9% of the cases where the 2481 error code occurs. If the random forest model predicts that the error code 2481 will occur, there is an 88.2% chance that it will. The classification performance was not increased when the two data sets were merged, indicating that the patterns detected by the first two classification models do not add value to one another.
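The precision and recall figures quoted in this abstract follow from standard confusion-matrix arithmetic. A minimal sketch, with made-up counts chosen only to roughly reproduce the reported 85.7% precision and 32% recall (these are not the thesis's actual data):

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple:
    """Precision: share of predicted failures that are real.
    Recall: share of real failures that the model finds."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Illustrative counts: 96 correctly predicted failures, 16 false alarms,
# 204 failures the model missed
p, r = precision_recall(tp=96, fp=16, fn=204)
```

With these counts, p is about 0.857 and r is 0.32, matching the abstract's interpretation: a positive prediction is right about 86% of the time, but most failures go undetected.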
4

Jiang, Tianyu. "“Frankenstein Complex” in the Realm of Digital Humanities : Data Mining Classic Horror Cinema via Media History Digital Library (MHDL)." Thesis, Stockholms universitet, Filmvetenskap, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:su:diva-169638.

Full text
Abstract:
This thesis addresses the complexity of digitalization and humanities research practices, with a specific focus on digital archives and film history research. I propose the term "Frankenstein Complex" to highlight and contextualize the epistemological collision and empirical challenges humanities scholars encounter when utilizing digital resources with digital methods. A particular aim of this thesis is to scrutinize digital archiving practices, using the Media History Digital Library (MHDL) as a case for a themed meta-inquiry on the preservation of and access to classic horror cinema in this particular digital venue. The project found conventional research methods, such as the close reading of classical cinema history, to be limiting. Instead, the project tried out a distant reading technique throughout the meta-inquiry to better interrogate the massive volume of data generated by MHDL. Besides a general reassessment of debates in the digital humanities and themes relating to horror film culture, this thesis strives for a reflection on classic horror spectatorship through the lens of sexual identity, inspired by Sara Ahmed's perspective on queer phenomenology. This original reading of horror history is facilitated by an empirical study of the digital corpus at hand, which in turn gives insights into the entangled relation between subjective identities and the appointed research contexts.
5

Magnusson, John. "Finding time-based listening habits in users music listening history to lower entropy in data." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-300043.

Full text
Abstract:
In a world where information, entertainment and e-commerce are growing rapidly in terms of volume and options, it can be challenging for individuals to find what they want. Search engines and recommendation systems have emerged as solutions, guiding the users. A typical example of this is Spotify, a music streaming company that utilises users' listening data and other derived metrics to provide personalised music recommendations. Spotify has a hypothesis that external factors affect users' listening preferences and that some of these external factors routinely affect the users, such as workout routines and commuting to work. This work aims to find time-based listening habits in users' music listening history to decrease the entropy in the data, resulting in a better understanding of the users. While this work primarily targets listening habits, the method can, in theory, be applied to any time-series dataset. Listening histories were split into hour vectors, vectors where each element represents the distribution of a label/genre played during an hour. The hour vectors allowed for a good representation of the data independent of the volume. In addition, they allowed for clustering, making it possible to find hours during which similar music was played. Hour slots that routinely appeared in the same cluster became a profile, highlighting a habit. In the final implementation, a user is represented by a profile vector allowing different profiles for each hour of a week. Several users were profiled with the proposed approach and evaluated in terms of the decrease in Shannon entropy when profiled compared to when not profiled. On average, user entropy dropped by 9%, with highs around 50%, and a small portion of users not experiencing any decrease. In addition, the profiling was evaluated by measuring cosine similarity across users' listening histories, resulting in a correlation between the gain in cosine similarity and the decrease in entropy.
In conclusion, users become more predictable and interpretable when profiled. This knowledge can be used to understand users better or as a feature for recommender systems and other analyses.
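The entropy-reduction idea described in this abstract can be sketched in a few lines; the toy play history below is invented for illustration and is not the thesis's data:

```python
import math
from collections import Counter

def shannon_entropy(labels):
    """Shannon entropy (in bits) of the empirical label distribution."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Pooled history: two genres evenly mixed -> maximal entropy of 1 bit
pooled = shannon_entropy(["rock"] * 6 + ["ambient"] * 6)

# Profiled: the same 12 plays split into two hour-slot profiles,
# each dominated by one genre
morning = shannon_entropy(["rock"] * 5 + ["ambient"])
evening = shannon_entropy(["ambient"] * 5 + ["rock"])
profiled = 0.5 * morning + 0.5 * evening

# Profiling lowers average entropy: the user looks more predictable
```

Here the pooled entropy is 1.0 bit while the profile-weighted average drops to about 0.65 bits, which is the sense in which profiled users "become more predictable".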
6

Hagward, Anders. "Using Git Commit History for Change Prediction : An empirical study on the predictive potential of file-level logical coupling". Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-172998.

Full text
Abstract:
In recent years, a new generation of distributed version control systems has taken the place of the aging centralized ones, with Git arguably being the most popular distributed system today. We investigate the potential of using Git commit history to predict files that are often changed together. Specifically, we look at the rename tracking heuristic found in Git, and the impact it has on prediction performance. By applying a data mining algorithm to five popular GitHub repositories we extract logical coupling – inter-file dependencies not necessarily detectable by static analysis – on which we base our change prediction. In addition, we examine if certain commits are better suited for change prediction than others; we define a bug fix commit as a commit that resolves one or more issues in the associated issue tracking system, and compare their prediction performance. While our findings do not reveal any notable differences in prediction performance when disregarding rename information, they suggest that extracting coupling from, and predicting on, bug fix commits in particular could lead to predictions that are both more accurate and numerous.
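Extracting file-level logical coupling, as described above, amounts to counting how often pairs of files change in the same commit. A minimal sketch with toy commit file lists (not Git's actual data model or the thesis's algorithm):

```python
from collections import Counter
from itertools import combinations

def logical_coupling(commits):
    """Count, for each file pair, how many commits changed both files."""
    pairs = Counter()
    for files in commits:
        for pair in combinations(sorted(set(files)), 2):
            pairs[pair] += 1
    return pairs

# Toy history: parser.c and parser.h habitually change together
commits = [
    ["parser.c", "parser.h"],
    ["parser.c", "parser.h", "main.c"],
    ["main.c", "README.md"],
]
coupling = logical_coupling(commits)
# A high count for a pair suggests: when one file changes,
# predict that the other will change too
```

The pair ("parser.c", "parser.h") co-changed in two commits here, so editing one would trigger a prediction for the other.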
7

Matys, Filip. "Předpověď nových chyb pomocí dolování dat v historii výsledků testů." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2016. http://www.nusl.cz/ntk/nusl-255448.

Full text
Abstract:
Software projects go through a maintenance phase and, in the case of open source projects, through an intensive development process. Both phases are prone to regressions, meaning previously working parts of the system stop working. To avoid this, systems are tested with long test suites, which can be time-consuming. For this reason, prediction models are developed that use historical testing data and code changes to detect the changes most likely to cause a regression, and to focus testing on those parts of the code. However, these predictors rely on static code analysis without a deeper semantic understanding of the code. The purpose of this master's thesis is to create a predictor that relies not only on static code analysis, but also makes decisions based on code semantics.
8

Cressey, Michael. "The identification of early lead mining : environmental, archaeological and historical perspectives from Islay, Inner Hebrides, Scotland." Thesis, University of Edinburgh, 1996. http://hdl.handle.net/1842/33319.

Full text
Abstract:
This thesis investigates whether lead mining can be detected using palaeoenvironmental data recovered from freshwater loch and marsh sediment. Using radiometric time-frames and geochemical analyses, the environmental impact of 18th and 19th century mining on Islay, Inner Hebrides, Scotland, has been investigated. The model of known mining events thus produced has been used to assess previously unrecorded (early) lead mining activity. Previous mining in the area is suggested by 18th century accounts that record the presence of 1,000 "early" workings scattered over the north-east limestone region. While there is little to support the often repeated assertion that lead mining dates back to the Norse Period (circa 10th-11th centuries), it is clear that it may well have been an established industry prior to the time of the first historical records in the 16th century. In order to use a palaeoenvironmental approach to the question of mining history and its impact, the strategy has been to use integrated loch and catchment units of study. The areas considered are Loch Finlaggan, Loch Lossit, Loch Bharradail and a control site at Loch Leathann. Soil and sediment geochemical mapping has been used to assess the distribution of lead, zinc and copper within the catchments. Environmental pathways have been identified, and the influx of lead, zinc and copper to the loch sediment has been determined through the analyses of cores from each loch basin. Archaeological field survey and the re-examination of the results from mineral prospecting data across the study region provide new evidence on the geographical extent and contaminatory effects of lead mining in this area. This study shows how the effect of lead mining can be identified in the palaeoenvironmental record from circa 1367 AD onwards, so mining in Islay does indeed predate the earliest known archaeological and historical records.
9

Mrázek, Michal. "Data mining." Master's thesis, Vysoké učení technické v Brně. Fakulta strojního inženýrství, 2019. http://www.nusl.cz/ntk/nusl-400441.

Full text
Abstract:
The aim of this master's thesis is the analysis of multidimensional data. Three dimensionality reduction algorithms are introduced. It is shown how to manipulate text documents using basic methods of natural language processing. The goal of the practical part of the thesis is to process real-world data from an internet forum. Posted messages are transformed to a numerical representation, then to two-dimensional space, and visualized. Later on, topics of the messages are discovered. In the last part, a few selected algorithms are compared.
10

Payyappillil, Hemambika. "Data mining framework." Morgantown, W. Va. : [West Virginia University Libraries], 2005. https://etd.wvu.edu/etd/controller.jsp?moduleName=documentdata&jsp%5FetdId=3807.

Full text
Abstract:
Thesis (M.S.), West Virginia University, 2005. Title from document title page. Document formatted into pages; contains vi, 65 p. : ill. (some col.). Includes abstract. Includes bibliographical references (p. 64-65).
11

Abedjan, Ziawasch. "Improving RDF data with data mining." Phd thesis, Universität Potsdam, 2014. http://opus.kobv.de/ubp/volltexte/2014/7133/.

Full text
Abstract:
Linked Open Data (LOD) comprises very many and often large public data sets and knowledge bases. Those datasets are mostly presented in the RDF triple structure of subject, predicate, and object, where each triple represents a statement or fact. Unfortunately, the heterogeneity of available open data requires significant integration steps before it can be used in applications. Meta information, such as ontological definitions and exact range definitions of predicates, is desirable and ideally provided by an ontology. However, in the context of LOD, ontologies are often incomplete or simply not available. Thus, it is useful to automatically generate meta information, such as ontological dependencies, range definitions, and topical classifications. Association rule mining, which was originally applied for sales analysis on transactional databases, is a promising and novel technique to explore such data. We designed an adaptation of this technique for mining RDF data and introduce the concept of "mining configurations", which allows us to mine RDF data sets in various ways. Different configurations enable us to identify schema and value dependencies that in combination result in interesting use cases. To this end, we present rule-based approaches for auto-completion, data enrichment, ontology improvement, and query relaxation. Auto-completion remedies the problem of inconsistent ontology usage, providing an editing user with a sorted list of commonly used predicates. A combination of different configurations extends this approach to create completely new facts for a knowledge base. We present two approaches for fact generation: a user-based approach, where a user selects the entity to be amended with new facts, and a data-driven approach, where an algorithm discovers entities that have to be amended with missing facts. As knowledge bases constantly grow and evolve, another approach to improve the usage of RDF data is to improve existing ontologies.
Here, we present an association rule based approach to reconcile ontology and data. Interlacing different mining configurations, we infer an algorithm to discover synonymously used predicates. Those predicates can be used to expand query results and to support users during query formulation. We provide a wide range of experiments on real world datasets for each use case. The experiments and evaluations show the added value of association rule mining for the integration and usability of RDF data and confirm the appropriateness of our mining configuration methodology.
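One of the mining configurations described above treats each subject's predicate set as a transaction, so ordinary support/confidence rules fall out directly. A minimal sketch with invented triples (not the thesis's datasets or its actual algorithm):

```python
from collections import defaultdict

# Invented example triples in (subject, predicate, object) form
triples = [
    ("Berlin", "locatedIn", "Germany"),
    ("Berlin", "population", "3.7M"),
    ("Paris", "locatedIn", "France"),
    ("Paris", "population", "2.1M"),
    ("Paris", "mayor", "Anne Hidalgo"),
]

# Subject-based configuration: group predicates by subject into transactions
transactions = defaultdict(set)
for s, p, o in triples:
    transactions[s].add(p)

def confidence(transactions, antecedent, consequent):
    """Confidence of the rule {antecedent} -> {consequent}."""
    having = [t for t in transactions.values() if antecedent in t]
    return sum(consequent in t for t in having) / len(having)

# Every subject with locatedIn also has population, so an editor typing
# locatedIn could be offered population as an auto-completion suggestion
c = confidence(transactions, "locatedIn", "population")
```

A rule with high confidence like this one is exactly the kind of pattern the auto-completion use case ranks for the editing user.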
12

Liu, Tantan. "Data Mining over Hidden Data Sources." The Ohio State University, 2012. http://rave.ohiolink.edu/etdc/view?acc_num=osu1343313341.

Full text
13

Taylor, Phillip. "Data mining of vehicle telemetry data." Thesis, University of Warwick, 2015. http://wrap.warwick.ac.uk/77645/.

Full text
Abstract:
Driving is a safety-critical task that requires a high level of attention and workload from the driver. Despite this, people often perform secondary tasks such as eating or using a mobile phone, which increase workload levels and divert cognitive and physical attention from the primary task of driving. As well as these distractions, the driver may also be overloaded for other reasons, such as dealing with an incident on the road or holding conversations in the car. One solution to this distraction problem is to limit the functionality of in-car devices while the driver is overloaded. This can take the form of withholding an incoming phone call or delaying the display of a non-urgent piece of information about the vehicle. In order to design and build these adaptations in the car, we must first have an understanding of the driver's current level of workload. Traditionally, driver workload has been monitored using physiological sensors or camera systems in the vehicle. However, physiological systems are often intrusive, and camera systems can be expensive and unreliable in poor light conditions. It is important, therefore, to use methods that are non-intrusive, inexpensive and robust, such as sensors already installed on the car and accessible via the Controller Area Network (CAN)-bus. This thesis presents a data mining methodology for this problem, as well as for others in domains with similar types of data, such as human activity monitoring. It focuses on the variable selection stage of the data mining process, where inputs are chosen for models to learn from and make inferences. Selecting inputs from vehicle telemetry data is challenging because there are many irrelevant variables with a high level of redundancy. Furthermore, data in this domain often contains biases because only relatively small amounts can be collected and processed, leading to some variables appearing more relevant to the classification task than they really are.
Over the course of this thesis, a detailed variable selection framework that addresses these issues for telemetry data is developed. A novel blocked permutation method is developed and applied to mitigate biases when selecting variables from potentially biased temporal data. This approach is computationally infeasible when variable redundancies are also considered, and so a novel permutation redundancy measure with similar properties is proposed. Finally, a known redundancy structure between features in telemetry data is used to enhance the feature selection process in two ways. First, the benefits of performing raw signal selection, feature extraction, and feature selection in different orders are investigated. Second, a two-stage variable selection framework is proposed and the two permutation based methods are combined. Throughout the thesis, it is shown through classification evaluations and inspection of the features that these permutation based selection methods are appropriate for use in selecting features from CAN-bus data.
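Permutation-based variable selection, the family of methods this abstract builds on, scores a variable by how much shuffling its values degrades a model's accuracy. A generic sketch with a trivial threshold "model" and toy data (not the thesis's blocked variant, which additionally shuffles within temporal blocks):

```python
import random

def accuracy(predict, xs, ys):
    return sum(predict(x) == y for x, y in zip(xs, ys)) / len(ys)

def permutation_importance(predict, xs, ys, var_index, rng):
    """Drop in accuracy when one variable's column is shuffled across samples."""
    baseline = accuracy(predict, xs, ys)
    column = [x[var_index] for x in xs]
    rng.shuffle(column)
    shuffled = [x[:var_index] + (v,) + x[var_index + 1:]
                for x, v in zip(xs, column)]
    return baseline - accuracy(predict, shuffled, ys)

# Toy data: the label depends only on variable 0; variable 1 is noise
data_rng = random.Random(0)
xs = [(i % 2, data_rng.random()) for i in range(200)]
ys = [x[0] for x in xs]
predict = lambda x: x[0]  # a "model" that already knows the rule

relevant = permutation_importance(predict, xs, ys, 0, random.Random(1))
irrelevant = permutation_importance(predict, xs, ys, 1, random.Random(1))
# Shuffling the relevant variable hurts accuracy; shuffling noise does not
```

Variables whose importance stays near zero are candidates for removal, which is the selection criterion the framework refines for biased, redundant telemetry data.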
14

Sherikar, Vishnu Vardhan Reddy. "I2MAPREDUCE: DATA MINING FOR BIG DATA." CSUSB ScholarWorks, 2017. https://scholarworks.lib.csusb.edu/etd/437.

Full text
Abstract:
This project is an extension of 'i2MapReduce: Incremental MapReduce for Mining Evolving Big Data'. i2MapReduce is used for incremental big data processing; it uses a fine-grained incremental engine and a general-purpose iterative model that supports iterative algorithms such as PageRank, Fuzzy C-Means (FCM), Generalized Iterated Matrix-Vector Multiplication (GIM-V) and Single Source Shortest Path (SSSP). The main purpose of this project is to reduce input/output overhead, to avoid incurring the cost of re-computation, and to avoid stale data mining results. Finally, the performance of i2MapReduce is analyzed by comparing the resultant graphs.
15

Zhang, Nan. "Privacy-preserving data mining." [College Station, Tex.]: Texas A&M University, 2006. http://hdl.handle.net/1969.1/ETD-TAMU-1080.

Full text
16

Hulten, Geoffrey. "Mining massive data streams /." Thesis, Connect to this title online; UW restricted, 2005. http://hdl.handle.net/1773/6937.

Full text
17

Büchel, Nina. "Faktorenvorselektion im Data Mining /." Berlin : Logos, 2009. http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&doc_number=019006997&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA.

Full text
18

Shao, Junming. "Synchronization Inspired Data Mining." Diss., lmu, 2011. http://nbn-resolving.de/urn:nbn:de:bvb:19-137356.

Full text
19

Wang, Xiaohong. "Data mining with bilattices." Thesis, National Library of Canada = Bibliothèque nationale du Canada, 2001. http://www.collectionscanada.ca/obj/s4/f2/dsk3/ftp04/MQ59344.pdf.

Full text
20

Knobbe, Arno J. "Multi-relational data mining /." Amsterdam [u.a.] : IOS Press, 2007. http://www.loc.gov/catdir/toc/fy0709/2006931539.html.

Full text
21

丁嘉慧 and Ka-wai Ting. "Time sequences: data mining." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2001. http://hub.hku.hk/bib/B31226760.

Full text
22

Wan, Chang, and 萬暢. "Mining multi-faceted data." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2013. http://hdl.handle.net/10722/197527.

Full text
Abstract:
Multi-faceted data contains different types of objects and relationships between them. With rapid growth of web-based services, multi-faceted data are increasing (e.g. Flickr, Yago, IMDB), which offers us richer information to infer users’ preferences and provide them better services. In this study, we look at two types of multi-faceted data: social tagging system and heterogeneous information network and how to improve service such as resources retrieving and classification on them. In social tagging systems, resources such as images and videos are annotated with descriptive words called tags. It has been shown that tag-based resource searching and retrieval is much more effective than content-based retrieval. With the advances in mobile technology, many resources are also geo-tagged with location information. We observe that a traditional tag (word) can carry different semantics at different locations. We study how location information can be used to help distinguish the different semantics of a resource’s tags and thus to improve retrieval accuracy. Given a search query, we propose a location-partitioning method that partitions all locations into regions such that the user query carries distinguishing semantics in each region. Based on the identified regions, we utilize location information in estimating the ranking scores of resources for the given query. These ranking scores are learned using the Bayesian Personalized Ranking (BPR) framework. Two algorithms, namely, LTD and LPITF, which apply Tucker Decomposition and Pairwise Interaction Tensor Factorization, respectively for modeling the ranking score tensor are proposed. Through experiments on real datasets, we show that LTD and LPITF outperform other tag-based resource retrieval methods. A heterogeneous information network (HIN) is used to model objects of different types and their relationships. Meta-paths are sequences of object types. 
They are used to represent complex relationships between objects beyond what links in a homogeneous network capture. We study the problem of classifying objects in an HIN. We propose class-level meta-paths and study how they can be used to (1) build more accurate classifiers and (2) improve active learning in identifying objects for which training labels should be obtained. We show that class-level meta-paths and object classification exhibit interesting synergy. Our experimental results show that the use of class-level meta-paths results in very effective active learning and good classification performance in HINs.
APA, Harvard, Vancouver, ISO, and other styles
23

García-Osorio, César. "Data mining and visualization." Thesis, University of Exeter, 2005. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.414266.

Full text
APA, Harvard, Vancouver, ISO, and other styles
24

Wang, Grant J. (Grant Jenhorn) 1979. "Algorithms for data mining." Thesis, Massachusetts Institute of Technology, 2006. http://hdl.handle.net/1721.1/38315.

Full text
Abstract:
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2006.<br>Includes bibliographical references (p. 81-89).<br>Data of massive size are now available in a wide variety of fields and come with great promise. In theory, these massive data sets allow data mining and exploration on a scale previously unimaginable. However, in practice, it can be difficult to apply classic data mining techniques to such massive data sets due to their sheer size. In this thesis, we study three algorithmic problems in data mining with consideration to the analysis of massive data sets. Our work is both theoretical and experimental - we design algorithms and prove guarantees for their performance and also give experimental results on real data sets. The three problems we study are: 1) finding a matrix of low rank that approximates a given matrix, 2) clustering high-dimensional points into subsets whose points lie in the same subspace, and 3) clustering objects by pairwise similarities/distances.<br>by Grant J. Wang.<br>Ph.D.
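The first problem above, finding a low-rank approximation of a matrix, has a classic closed-form answer via the truncated singular value decomposition (the Eckart-Young theorem); a minimal sketch, not the thesis's own algorithm:

```python
import numpy as np

def low_rank(A, k):
    """Best rank-k approximation of A in the Frobenius norm,
    computed from the truncated singular value decomposition."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # keep the k largest singular values/vectors and recombine
    return (U[:, :k] * s[:k]) @ Vt[:k]
```

The approximation error is the norm of the discarded singular values, so it can only decrease as k grows.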
APA, Harvard, Vancouver, ISO, and other styles
25

Anwar, Muhammad Naveed. "Data mining of audiology." Thesis, University of Sunderland, 2012. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.573120.

Full text
Abstract:
This thesis describes the data mining of a large set of patient records from the hearing aid clinic at James Cook University Hospital in Middlesbrough, UK. As is typical of medical data in general, these audiology records are heterogeneous, containing the following three different types of data: audiograms (graphs of hearing ability at different frequencies); structured tabular data (such as gender, date of birth and diagnosis); and unstructured text (specific observations made about each patient in a free-text or comment field). This audiology data set is unique, as it contains records of patients prescribed with both ITE and BTE hearing aids. ITE hearing aids are not generally available on the British National Health Service in England, as they are more expensive than BTE hearing aids. However, both types of aids are prescribed at James Cook University Hospital in Middlesbrough, UK, which is also an important feature of this data. There are two research questions for this research: Which factors influence the choice of ITE (in the ear) as opposed to BTE (behind the ear) hearing aids? For patients diagnosed with tinnitus (ringing in the ear), which factors influence the decision whether to fit a tinnitus masker (a gentle sound source, worn like a hearing aid, designed to drown out tinnitus)? A number of data mining techniques, such as clustering of audiograms, association analysis of variables (such as age, gender, diagnosis, masker, mould and free-text keywords) using contingency tables, and principal component analysis on audiograms, were used to find candidate variables to be combined into a decision support system (DSS), where unseen patient records are presented to the system, and the relative likelihood that a patient should be fitted with an ITE as opposed to a BTE aid, or a tinnitus masker as opposed to no masker, is returned. 
The DSS was created using the techniques of logistic regression, Naïve Bayes analysis and Bayesian networks, and these systems were tested using 5-fold cross-validation to see which of the techniques produced the better results. The advantage of these techniques for the combination of evidence is that it is easy to see which variables contributed to the final decision. The constructed models and the data behind them were validated by presenting them to the principal audiologist, Dr. Robertshaw, at James Cook University Hospital in Middlesbrough for comments and suggestions for improvements. The techniques developed in this thesis for the construction of prediction models were also used successfully on a different audiology data set from Malaysia. These decisions are typically made by audiology technicians working in the out-patient clinics, on the basis of audiogram results and in consultation with the patients. In many cases, the choice is clear cut, but at other times the technicians might benefit from a second opinion given by an automatic system with an explanation of how that second opinion was arrived at.
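The 5-fold cross-validation protocol used to compare the models can be sketched generically, independent of the classifier; everything below (the nearest-centroid example classifier included) is illustrative, not the thesis's code:

```python
import numpy as np

def kfold_indices(n, k=5, seed=0):
    """Shuffle n sample indices and split them into k disjoint folds."""
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, k)

def cross_validate(fit, predict, X, y, k=5):
    """Train on k-1 folds, test on the held-out fold; return per-fold accuracy."""
    accs = []
    for fold in kfold_indices(len(y), k):
        train = np.setdiff1d(np.arange(len(y)), fold)
        model = fit(X[train], y[train])
        accs.append(np.mean(predict(model, X[fold]) == y[fold]))
    return accs

# Toy classifier to plug in: nearest class centroid.
def fit(X, y):
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict(model, X):
    classes = sorted(model)
    dists = np.stack([np.linalg.norm(X - model[c], axis=1) for c in classes])
    return np.array(classes)[dists.argmin(axis=0)]
```

Any pair of `fit`/`predict` functions (logistic regression, Naïve Bayes, a Bayesian network) can be swapped in and compared on the same folds.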
APA, Harvard, Vancouver, ISO, and other styles
26

Santos, José Carlos Almeida. "Mining protein structure data." Master's thesis, FCT - UNL, 2006. http://hdl.handle.net/10362/1130.

Full text
Abstract:
The principal topic of this work is the application of data mining techniques, in particular of machine learning, to the discovery of knowledge in a protein database. In the first chapter a general background is presented. Namely, in section 1.1 we overview the methodology of a Data Mining project and its main algorithms. In section 1.2 an introduction to proteins and their supporting file formats is outlined. This chapter is concluded with section 1.3, which defines the main problem we intend to address with this work: determining whether an amino acid is exposed or buried in a protein, in a discrete way (i.e., not continuous), for five exposure levels: 2%, 10%, 20%, 25% and 30%. In the second chapter, following closely the CRISP-DM methodology, the whole process of constructing the database that supported this work is presented. Namely, the process of loading data from the Protein Data Bank, DSSP and SCOP is described. Then an initial data exploration is performed and a simple prediction model (baseline) of the relative solvent accessibility of an amino acid is introduced. The Data Mining Table Creator, a program developed to produce the data mining tables required for this problem, is also introduced. In the third chapter the results obtained are analyzed with statistical significance tests. Initially the several classifiers used (Neural Networks, C5.0, CART and CHAID) are compared, and it is concluded that C5.0 is the most suitable for the problem at hand. The influence of parameters such as the amino acid information level, the amino acid window size and the SCOP class type on the accuracy of the predictive models is also compared. The fourth chapter starts with a brief revision of the literature about amino acid relative solvent accessibility. Then, we overview the main results achieved and finally discuss possible future work. The fifth and last chapter consists of appendices. Appendix A has the schema of the database that supported this thesis. 
Appendix B has a set of tables with additional information. Appendix C describes the software provided in the DVD accompanying this thesis that allows the reconstruction of the present work.
APA, Harvard, Vancouver, ISO, and other styles
27

Garda-Osorio, Cesar. "Data mining and visualisation." Thesis, University of the West of Scotland, 2005. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.742763.

Full text
APA, Harvard, Vancouver, ISO, and other styles
28

Rawles, Simon Alan. "Object-oriented data mining." Thesis, University of Bristol, 2007. http://hdl.handle.net/1983/c13bda2c-75c9-4bfa-b86b-04ac06ba0278.

Full text
Abstract:
Attempts to overcome limitations in the attribute-value representation for machine learning have led to much interest in learning from structured data, concentrated in the research areas of inductive logic programming (ILP) and multi-relational data mining (MRDM). The expressiveness and encapsulation of the object-oriented data model have led to its widespread adoption in software and database design. The considerable congruence between this model and individual-centred models in inductive logic programming presents new opportunities for mining object data specific to its domain. This thesis investigates the use of object-orientation in knowledge representation for multi-relational data mining. We propose a language for expressing object model metaknowledge and use it to extend the reasoning mechanisms of an object-oriented logic. A refinement operator is then defined and used for feature search in an object-oriented propositionalisation-based ILP classifier. An algorithm is proposed for reducing the large number of redundant features typical in propositionalisation. A data mining system based on the refinement operator is implemented and demonstrated on a real-world computational linguistics task and compared with a conventional ILP system. Keywords: object orientation; data mining; inductive logic programming; propositionalisation; refinement operators; feature reduction
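Propositionalisation, the core operation above, flattens structured individuals into ordinary attribute-value rows by evaluating first-order features against each object. A toy illustration of the idea (the data and the feature set are invented for the example, not taken from the thesis):

```python
# Each individual is a structured object: here, a molecule with a list of atoms.
molecules = [
    {"atoms": [{"elem": "C", "charge": -0.1}, {"elem": "O", "charge": -0.4}]},
    {"atoms": [{"elem": "N", "charge": 0.2}]},
]

# Existential first-order features ("has an atom such that ...") become
# boolean columns of an ordinary attribute-value table.
features = {
    "has_oxygen":     lambda m: any(a["elem"] == "O" for a in m["atoms"]),
    "has_neg_charge": lambda m: any(a["charge"] < 0 for a in m["atoms"]),
}

table = [[int(f(m)) for f in features.values()] for m in molecules]
```

The resulting table can then be handed to any standard attribute-value learner; the research problem is choosing and pruning the (potentially huge, redundant) feature set.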
APA, Harvard, Vancouver, ISO, and other styles
29

Mao, Shihong. "Comparative Microarray Data Mining." Wright State University / OhioLINK, 2007. http://rave.ohiolink.edu/etdc/view?acc_num=wright1198695415.

Full text
APA, Harvard, Vancouver, ISO, and other styles
30

Novák, Petr. "Data mining časových řad." Master's thesis, Vysoká škola ekonomická v Praze, 2009. http://www.nusl.cz/ntk/nusl-72068.

Full text
APA, Harvard, Vancouver, ISO, and other styles
31

Blunt, Gordon. "Mining credit card data." Thesis, n.p, 2002. http://ethos.bl.uk/.

Full text
APA, Harvard, Vancouver, ISO, and other styles
32

DUPONT, Daniel Ambrosi. "CHSPAM: um modelo multi-domínio para acompanhamento de padrões em históricos de contextos." Universidade do Vale do Rio dos Sinos, 2017. http://www.repositorio.jesuita.org.br/handle/UNISINOS/6272.

Full text
Abstract:
Ubiquitous computing studies techniques that seamlessly integrate information technology into people's daily lives, so that technological resources assist them proactively in the real world while they carry out everyday activities. One of the fundamental aspects for developing this type of application is Context Awareness, which allows an application to adapt its behaviour to the context in which the user finds themselves. With the development of systems that use previously stored context information, databases emerged that accumulate the Context Histories captured over time, and many researchers have studied different ways of analysing these data. This work addresses a specific type of analysis over context histories: the discovery and monitoring of sequential patterns. To this end, a model called CHSPAM (Context History Pattern Monitoring) is proposed, which discovers sequential patterns in Context History databases using existing data mining techniques. The main contributions of this work are the use of a generic representation for storing contexts, which allows the model to be applied in multiple domains, and the monitoring of the discovered patterns over time, storing a history of each pattern's evolution. A prototype was implemented and three experiments were carried out with it. The first, based on synthetic data, evaluated the functionality and services offered by CHSPAM. In the second, the model was used in a prediction application, where the use of monitored sequential patterns improved prediction accuracy compared to the use of unmonitored patterns. Finally, in the third experiment, CHSPAM was used as a component of a learning object recommendation application, which was able to recommend objects related to students’ interests based on monitored sequential patterns extracted from users’ session histories.
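Monitoring a sequential pattern over context histories reduces to counting, per history, whether the pattern occurs as an ordered (not necessarily contiguous) subsequence. A minimal sketch of that support computation (function names are illustrative, not CHSPAM's actual API):

```python
def is_subsequence(pattern, sequence):
    """True if the pattern's events occur in the sequence in the same
    order, not necessarily contiguously."""
    it = iter(sequence)
    # each `in` consumes the iterator, so order is enforced
    return all(event in it for event in pattern)

def support(pattern, histories):
    """Fraction of context histories that contain the pattern."""
    return sum(is_subsequence(pattern, h) for h in histories) / len(histories)
```

Tracking `support` over successive snapshots of the history database is what lets a model record how each discovered pattern evolves over time.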
APA, Harvard, Vancouver, ISO, and other styles
33

Niggemann, Oliver. "Visual data mining of graph based data." [S.l. : s.n.], 2001. http://deposit.ddb.de/cgi-bin/dokserv?idn=962400505.

Full text
APA, Harvard, Vancouver, ISO, and other styles
34

Li, Liangchun. "Web-based data visualization for data mining." Thesis, National Library of Canada = Bibliothèque nationale du Canada, 1998. http://www.collectionscanada.ca/obj/s4/f2/dsk2/ftp03/MQ35845.pdf.

Full text
APA, Harvard, Vancouver, ISO, and other styles
35

Al-Hashemi, Idrees Yousef. "Applying data mining techniques over big data." Thesis, Boston University, 2013. https://hdl.handle.net/2144/21119.

Full text
Abstract:
Thesis (M.S.C.S.) PLEASE NOTE: Boston University Libraries did not receive an Authorization To Manage form for this thesis or dissertation. It is therefore not openly accessible, though it may be available by request. If you are the author or principal advisor of this work and would like to request open access for it, please contact us at open-help@bu.edu. Thank you.<br>The rapid development of information technology in recent decades means that data appear in a wide variety of formats — sensor data, tweets, photographs, raw data, and unstructured data. Statistics show that there were 800,000 petabytes of data stored in the world in 2000. Today’s internet holds about 0.1 zettabytes of data (1 ZB is about 10²¹ bytes), and this number will reach 35 ZB by 2020. With such an overwhelming flood of information, present data management systems are not able to scale to this huge amount of raw, unstructured data—in today’s parlance, Big Data. In the present study, we show the basic concepts and design of Big Data tools, algorithms, and techniques. We compare the classical data mining algorithms to the Big Data algorithms by using Hadoop/MapReduce as a core implementation of Big Data for scalable algorithms. We implemented the K-means algorithm and the Apriori algorithm with Hadoop/MapReduce on a 5-node Hadoop cluster. We explore NoSQL databases for semi-structured data at massive scale by using MongoDB as an example. Finally, we show the performance between HDFS (Hadoop Distributed File System) and MongoDB data storage for these two algorithms.
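A K-means iteration decomposes naturally into a map phase (assign each point to its nearest centroid) and a reduce phase (average the points per centroid), which is what makes it a good fit for Hadoop/MapReduce. A single-process sketch of that decomposition (plain Python standing in for actual mapper/reducer jobs):

```python
from collections import defaultdict

def kmeans_map(points, centroids):
    """Map phase: emit (nearest-centroid index, point) pairs."""
    for p in points:
        j = min(range(len(centroids)),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
        yield j, p

def kmeans_reduce(pairs, k):
    """Reduce phase: new centroid = mean of the points assigned to it."""
    groups = defaultdict(list)
    for j, p in pairs:
        groups[j].append(p)
    new_centroids = []
    for j in range(k):
        g = groups[j]
        # coordinate-wise mean; None for a centroid that attracted no points
        new_centroids.append(tuple(sum(c) / len(g) for c in zip(*g)) if g else None)
    return new_centroids
```

In a real cluster, `kmeans_map` runs in parallel over data splits and the framework groups the emitted pairs by key before `kmeans_reduce`; iterating map+reduce to convergence gives the full algorithm.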
APA, Harvard, Vancouver, ISO, and other styles
36

Zhou, Wubai. "Data Mining Techniques to Understand Textual Data." FIU Digital Commons, 2017. https://digitalcommons.fiu.edu/etd/3493.

Full text
Abstract:
More than ever, information delivery and storage online rely heavily on text. Billions of texts are produced every day in the form of documents, news, logs, search queries, ad keywords, tags, tweets, messenger conversations, social network posts, etc. Text understanding is a fundamental and essential task involving broad research topics, and contributes to many applications in the areas of text summarization, search engines, recommendation systems, online advertising, conversational bots and so on. However, understanding text is never a trivial task for computers, especially for noisy and ambiguous text such as logs and search queries. This dissertation mainly focuses on textual understanding tasks derived from two domains, i.e., disaster management and IT service management, which mainly utilize textual data as an information carrier. Improving situation awareness in disaster management and alleviating the human effort involved in IT service management dictate more intelligent and efficient solutions to understand the textual data acting as the main information carrier in the two domains. From the perspective of data mining, four directions are identified: (1) intelligently generate a storyline summarizing the evolution of a hurricane from a relevant online corpus; (2) automatically recommend resolutions according to the textual symptom description in a ticket; (3) gradually adapt the resolution recommendation system for time-correlated features derived from text; (4) efficiently learn distributed representations for short and lousy ticket symptom descriptions and resolutions. Provided with different types of textual data, the data mining techniques proposed in those four research directions successfully address our tasks to understand and extract valuable knowledge from those textual data. My dissertation will address the research topics outlined above. 
Concretely, I will focus on designing and developing data mining methodologies to better understand textual information, including (1) a storyline generation method for efficient summarization of natural hurricanes based on a crawled online corpus; (2) a recommendation framework for automated ticket resolution in IT service management; (3) an adaptive recommendation system for time-varying temporally correlated features derived from text; (4) a deep neural ranking model that not only successfully recommends resolutions but also efficiently outputs distributed representations for ticket descriptions and resolutions.
APA, Harvard, Vancouver, ISO, and other styles
37

KAVOOSIFAR, MOHAMMAD REZA. "Data Mining and Indexing Big Multimedia Data." Doctoral thesis, Politecnico di Torino, 2019. http://hdl.handle.net/11583/2742526.

Full text
APA, Harvard, Vancouver, ISO, and other styles
38

Adderly, Darryl M. "Data mining meets e-commerce using data mining to improve customer relationship management /." [Gainesville, Fla.]: University of Florida, 2002. http://purl.fcla.edu/fcla/etd/UFE0000500.

Full text
APA, Harvard, Vancouver, ISO, and other styles
39

Vithal, Kadam Omkar. "Novel applications of Association Rule Mining- Data Stream Mining." AUT University, 2009. http://hdl.handle.net/10292/826.

Full text
Abstract:
Since its advent, association rule mining has become one of the most researched areas of data exploration. In recent years, applying association rule mining methods to extract rules from a continuous flow of voluminous data, known as a data stream, has generated immense interest due to emerging applications such as network-traffic analysis and sensor-network data analysis. For such application domains, the ability to process such an enormous amount of stream data in a single pass is critical.
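A standard way to meet the single-pass requirement mentioned above is Manku and Motwani's Lossy Counting algorithm; a compact sketch for item (1-itemset) frequencies, with the bucket width derived from the error parameter ε (this is the classic algorithm, not necessarily the one used in the thesis):

```python
import math

def lossy_count(stream, epsilon):
    """One-pass approximate item frequencies: every item whose true count
    exceeds epsilon*N survives, and kept counts undershoot the true count
    by at most epsilon*N."""
    width = math.ceil(1 / epsilon)            # bucket width
    counts, deltas, bucket = {}, {}, 1
    for n, item in enumerate(stream, start=1):
        if item in counts:
            counts[item] += 1
        else:
            counts[item], deltas[item] = 1, bucket - 1
        if n % width == 0:                    # end of bucket: prune rare items
            for it in [it for it in counts if counts[it] + deltas[it] <= bucket]:
                del counts[it], deltas[it]
            bucket += 1
    return counts
```

Memory stays bounded by roughly (1/ε)·log(εN) entries regardless of stream length, which is what makes the approach viable for unbounded streams.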
APA, Harvard, Vancouver, ISO, and other styles
40

Patel, Akash. "Data Mining of Process Data in Multivariable Systems." Thesis, KTH, Skolan för elektro- och systemteknik (EES), 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-201087.

Full text
Abstract:
Performing system identification experiments in order to model control plants in industry processes can be costly and time consuming. Therefore, with increasingly more computational power available and abundant access to logged historical data from plants, data mining algorithms have become more appealing. This thesis focuses on evaluating a data mining algorithm for multivariable processes, where the mined data can potentially be used for system identification. The first part of the thesis explores the effect many of the necessary user-chosen parameters have on the algorithm's performance. To support this, a GUI designed to assist in parameter selection is developed. The second part of the thesis evaluates the proposed algorithm's performance by modelling a simulated process based on intervals found by the algorithm. The results show that the algorithm is particularly sensitive to the choice of cut-off frequencies in the bandpass filter, the threshold of the reciprocal condition number and the Laguerre filter order. It is also shown that with the GUI it is possible to select parameters such that the algorithm performs satisfactorily and mines data relevant for system identification. Finally, the results show that it is possible to use the mined data to model a simulated process using system identification techniques with good accuracy.
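Two of the sensitive parameters above are the bandpass cut-off frequencies and the reciprocal-condition-number threshold used to reject uninformative data segments. A numpy-only sketch of both checks (a crude FFT-mask filter stands in for the thesis's bandpass filter; names and thresholds are illustrative):

```python
import numpy as np

def rcond(segment):
    """Reciprocal condition number of a data matrix: smallest over largest
    singular value. Segments below a chosen threshold are too
    ill-conditioned to be useful for identification."""
    s = np.linalg.svd(segment, compute_uv=False)
    return s[-1] / s[0]

def bandpass(x, lo, hi, fs):
    """Crude zero-phase bandpass: zero out FFT bins outside [lo, hi] Hz."""
    f = np.fft.rfftfreq(len(x), d=1 / fs)
    X = np.fft.rfft(x)
    X[(f < lo) | (f > hi)] = 0
    return np.fft.irfft(X, n=len(x))
```

A mining pass would filter each candidate segment, compute its reciprocal condition number, and keep only segments above the threshold for identification.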
APA, Harvard, Vancouver, ISO, and other styles
41

Cordeiro, Robson Leonardo Ferreira. "Data mining in large sets of complex data." Universidade de São Paulo, 2011. http://www.teses.usp.br/teses/disponiveis/55/55134/tde-22112011-083653/.

Full text
Abstract:
Due to the increasing amount and complexity of the data stored in enterprises' databases, the task of knowledge discovery is nowadays vital to support strategic decisions. However, the mining techniques used in the process usually have high computational costs that come from the need to explore several alternative solutions, in different combinations, to obtain the desired knowledge. The most common mining tasks include data classification, labeling and clustering, outlier detection and missing data prediction. Traditionally, the data are represented by numerical or categorical attributes in a table that describes one element in each tuple. Although the same tasks applied to traditional data are also necessary for more complex data, such as images, graphs, audio and long texts, the complexity and the computational costs associated with handling large amounts of these complex data increase considerably, making most of the existing techniques impractical. Therefore, special data mining techniques for this kind of data need to be developed. This Ph.D. work focuses on the development of new data mining techniques for large sets of complex data, especially for the task of clustering, tightly associated with other data mining tasks that are performed together. Specifically, this Doctoral dissertation presents three novel, fast and scalable data mining algorithms well-suited to analyze large sets of complex data: the method Halite for correlation clustering; the method BoW for clustering Terabyte-scale datasets; and the method QMAS for labeling and summarization. Our algorithms were evaluated on real, very large datasets with up to billions of complex elements, and they always presented highly accurate results, being at least one order of magnitude faster than the fastest related works in almost all cases. 
The real data used come from the following applications: automatic breast cancer diagnosis, satellite imagery analysis, and graph mining on a large web graph crawled by Yahoo! and also on the graph with all users and their connections from the Twitter social network. Such results indicate that our algorithms enable the development of real-time applications that, potentially, could not be developed without this Ph.D. work, such as software to aid the diagnosis process on the fly in a worldwide Healthcare Information System, or a system to look for deforestation within the Amazon Rainforest in real time.
APA, Harvard, Vancouver, ISO, and other styles
42

XIAO, XIN. "Data Mining Techniques for Complex User-Generated Data." Doctoral thesis, Politecnico di Torino, 2016. http://hdl.handle.net/11583/2644046.

Full text
Abstract:
Nowadays, the amount of collected information is continuously growing in a variety of different domains. Data mining techniques are powerful instruments to effectively analyze these large data collections and extract hidden and useful knowledge. Vast amount of User-Generated Data (UGD) is being created every day, such as user behavior, user-generated content, user exploitation of available services and user mobility in different domains. Some common critical issues arise for the UGD analysis process such as the large dataset cardinality and dimensionality, the variable data distribution and inherent sparseness, and the heterogeneous data to model the different facets of the targeted domain. Consequently, the extraction of useful knowledge from such data collections is a challenging task, and proper data mining solutions should be devised for the problem under analysis. In this thesis work, we focus on the design and development of innovative solutions to support data mining activities over User-Generated Data characterised by different critical issues, via the integration of different data mining techniques in a unified framework. Real datasets coming from three example domains characterized by the above critical issues are considered as reference cases, i.e., health care, social network, and urban environment domains. Experimental results show the effectiveness of the proposed approaches to discover useful knowledge from different domains.
APA, Harvard, Vancouver, ISO, and other styles
43

Tong, Suk-man Ivy. "Techniques in data stream mining." Click to view the E-thesis via HKUTO, 2005. http://sunzi.lib.hku.hk/hkuto/record/B34737376.

Full text
APA, Harvard, Vancouver, ISO, and other styles
44

Borgelt, Christian. "Data mining with graphical models." [S.l. : s.n.], 2000. http://deposit.ddb.de/cgi-bin/dokserv?idn=962912107.

Full text
APA, Harvard, Vancouver, ISO, and other styles
45

Weber, Irene. "Suchraumbeschränkung für relationales Data Mining." [S.l. : s.n.], 2004. http://www.bsz-bw.de/cgi-bin/xvms.cgi?SWB11380447.

Full text
APA, Harvard, Vancouver, ISO, and other styles
46

Maden, Engin. "Data Mining On Architecture Simulation." Master's thesis, METU, 2010. http://etd.lib.metu.edu.tr/upload/2/12611635/index.pdf.

Full text
Abstract:
Data mining is the process of extracting patterns from huge volumes of data. One branch of data mining is the mining of sequence data, where the data can be viewed as a sequence of events, each with an associated time of occurrence. Sequence data is modelled using episodes, and events are grouped into episodes. The aim of this thesis work is to analyse architecture simulation output data by applying episode mining techniques, to reveal the previously known relationships between events in an architecture, and to provide an environment for predicting the performance of a program on an architecture before executing its code. One of the most important points here is the application area of episode mining techniques: architecture simulation data is a new domain for these techniques, and using their results to predict the performance of programs on an architecture before execution can be considered a new approach. For this purpose, a data mining tool has been developed that implements three episode mining techniques: the WINEPI approach, the non-overlapping occurrence-based approach, and the MINEPI approach. This tool has three main components: a data pre-processor, an episode miner, and an output analyser.
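The windowed frequency count that WINEPI is built around can be illustrated with a minimal sketch: slide a fixed-width window over a timed event sequence and count the fraction of windows containing a serial episode in order. All names and data here are illustrative, not taken from the thesis tool.

```python
# Minimal sketch of a WINEPI-style frequency count for a serial
# episode (e.g. "A occurs before B") over a sequence of
# (time, event_type) pairs; names and data are illustrative.

def winepi_frequency(events, episode, window):
    """Fraction of sliding windows that contain the episode in order."""
    if not events:
        return 0.0
    times = [t for t, _ in events]
    start, end = times[0], times[-1]
    hits = total = 0
    # Slide a window of fixed width one time unit at a time, covering
    # every position where the window overlaps the sequence.
    t = start - window + 1
    while t <= end:
        inside = [e for time, e in events if t <= time < t + window]
        total += 1
        # Check that the episode's events occur in the required order.
        idx = 0
        for e in inside:
            if e == episode[idx]:
                idx += 1
                if idx == len(episode):
                    break
        if idx == len(episode):
            hits += 1
        t += 1
    return hits / total

seq = [(1, "A"), (2, "B"), (5, "A"), (6, "C"), (7, "B")]
freq = winepi_frequency(seq, ("A", "B"), window=3)  # 3 of 9 windows
```

An episode is deemed frequent when this ratio exceeds a user-chosen threshold, analogous to the support measure in association rule mining.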
APA, Harvard, Vancouver, ISO, and other styles
47

Drwal, Maciej. "Data mining in distributed computer systems." Thesis, Blekinge Tekniska Högskola, Sektionen för datavetenskap och kommunikation, 2009. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-5709.

Full text
APA, Harvard, Vancouver, ISO, and other styles
48

Thun, Julia, and Rebin Kadouri. "Automating debugging through data mining." Thesis, KTH, Data- och elektroteknik, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-203244.

Full text
Abstract:
Contemporary technological systems generate massive quantities of log messages. These messages can be stored, searched, and visualized efficiently using log management and analysis tools. The analysis of log messages offers insight into system behavior such as performance, server status, and execution faults in web applications. iStone AB wants to explore the possibility of automating its debugging process. Since iStone performs most of its debugging manually, it takes time to find errors within the system. The aim was therefore to find solutions that reduce the time it takes to debug. An analysis of log messages within access and console logs was carried out to select the most appropriate data mining techniques for iStone's system. Data mining algorithms and log management and analysis tools were compared. The comparison showed that the ELK Stack, as well as a mixture of Eclat and a hybrid algorithm (Eclat combined with Apriori), were the most appropriate choices. To demonstrate their feasibility, the ELK Stack and Eclat were implemented. The results show that data mining and the use of a log analysis platform can facilitate debugging and reduce the time it takes.
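The core idea of Eclat, the frequent-itemset algorithm selected above, is to represent each item by the set of transaction ids containing it and to grow itemsets by intersecting those tid-sets. A minimal sketch under that idea, with illustrative transaction data and threshold:

```python
# Minimal sketch of Eclat's tid-set intersection idea; the
# transactions and the min_support threshold are illustrative.
from itertools import combinations

def eclat(transactions, min_support):
    """Return frequent itemsets as {frozenset: support count}."""
    # Map each item to the set of transaction ids containing it.
    tidsets = {}
    for tid, items in enumerate(transactions):
        for item in items:
            tidsets.setdefault(item, set()).add(tid)
    current = {frozenset([i]): tids
               for i, tids in tidsets.items() if len(tids) >= min_support}
    result = {k: len(v) for k, v in current.items()}
    # Grow itemsets one item at a time by intersecting tid-sets.
    while current:
        nxt = {}
        for (a, ta), (b, tb) in combinations(current.items(), 2):
            union = a | b
            if len(union) == len(a) + 1:  # join only same-size prefixes
                tids = ta & tb
                if len(tids) >= min_support and union not in nxt:
                    nxt[union] = tids
        result.update({k: len(v) for k, v in nxt.items()})
        current = nxt
    return result

txns = [{"x", "y"}, {"x", "y", "z"}, {"x", "z"}]
freq = eclat(txns, min_support=2)  # five frequent itemsets
```

Because support is computed via set intersection rather than repeated database scans, the vertical layout is what distinguishes Eclat from the level-wise Apriori approach mentioned in the hybrid.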
APA, Harvard, Vancouver, ISO, and other styles
49

Rahman, Sardar Muhammad Monzurur, and mrahman99@yahoo com. "Data Mining Using Neural Networks." RMIT University. Electrical & Computer Engineering, 2006. http://adt.lib.rmit.edu.au/adt/public/adt-VIT20080813.094814.

Full text
Abstract:
Data mining is the search for relationships and global patterns in large databases that are ever increasing in size. It is beneficial for anyone who holds large amounts of data, for example customer, business, transaction, marketing, financial, manufacturing, and web data. The results of data mining are also referred to as knowledge in the form of rules, regularities, and constraints. Rule mining is one of the popular data mining methods, since rules provide concise statements of potentially important information that are easily understood by end users, as well as actionable patterns. Rule mining has received a good deal of attention and enthusiasm from data mining researchers because it is capable of solving many data mining problems such as classification, association, customer profiling, summarization, and segmentation, among others. This thesis makes several contributions by proposing rule mining methods using genetic algorithms and neural networks. It first proposes rule mining methods using a genetic algorithm. These methods are based on an integrated framework but are capable of mining three major classes of rules. Moreover, the rule mining process in these methods is controlled by tuning two data mining measures, support and confidence. The thesis shows how to build predictive data mining models from the resultant rules of the proposed methods. Another key contribution of the thesis is the proposal of rule mining methods using supervised neural networks. The thesis mathematically analyses the Widrow-Hoff learning algorithm of a single-layered neural network, which provides a foundation for rule mining algorithms using single-layered neural networks. Three rule mining algorithms using single-layered neural networks are proposed for the three major classes of rules on the basis of the proposed theorems. The thesis also looks at the problem of rule mining where user guidance is absent, and proposes a guided rule mining system to overcome it. The thesis extends this work further by comparing the performance of the algorithm used in the proposed guided rule mining system with the Apriori data mining algorithm. Finally, the thesis studies the Kohonen self-organization map as an unsupervised neural network for rule mining algorithms. Two approaches are adopted, based on the way self-organization maps are applied in rule mining models. In the first approach, the self-organization map is used for clustering, which provides class information to the rule mining process. In the second approach, automated rule mining takes the place of trained neurons as the map grows in a hierarchical structure.
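The support and confidence measures that tune the rule mining process above are standard and can be sketched in a few lines; the transaction data here is illustrative sample data, not from the thesis.

```python
# Minimal sketch of the support and confidence measures used to
# control rule mining; the transactions below are illustrative.

def support(transactions, itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) from the transactions."""
    return (support(transactions, antecedent | consequent)
            / support(transactions, antecedent))

txns = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b", "c"}]
s = support(txns, {"a", "b"})       # 2 of 4 transactions -> 0.5
c = confidence(txns, {"a"}, {"b"})  # 0.5 / 0.75 -> 2/3
```

A rule such as {a} -> {b} is reported only when both measures exceed user-chosen thresholds, which is the tuning knob the abstract refers to.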
APA, Harvard, Vancouver, ISO, and other styles
50

Guo, Shishan. "Data mining in crystallographic databases." Thesis, National Library of Canada = Bibliothèque nationale du Canada, 2000. http://www.collectionscanada.ca/obj/s4/f2/dsk1/tape3/PQDD_0012/NQ52854.pdf.

Full text
APA, Harvard, Vancouver, ISO, and other styles