Dissertations / Theses on the topic 'Data patterns'
Consult the top 50 dissertations / theses for your research on the topic 'Data patterns.'
Voß, Jakob. "Describing data patterns." Doctoral thesis, Humboldt-Universität zu Berlin, Philosophische Fakultät I, 2013. http://dx.doi.org/10.18452/16794.
Many methods, technologies, standards, and languages exist to structure and describe data. The aim of this thesis is to find common features in these methods to determine how data is actually structured and described. Existing studies are limited to notions of data as recorded observations and facts, or they require given structures to build on, such as the concept of a record or the concept of a schema. In this thesis, these presumed concepts are deconstructed from a semiotic point of view by analysing data as signs, communicated in the form of digital documents. The study was conducted using a phenomenological research method. Conceptual properties of data structuring and description were first collected and experienced critically. Examples of such properties include encodings, identifiers, formats, schemas, and models. The analysis resulted in six prototypes to categorize data methods by their primary purpose. The study further revealed five basic paradigms that deeply shape how data is structured and described in practice. The third result consists of a pattern language of data structuring. The patterns show problems and solutions which occur over and over again in data, independent of particular technologies. Twenty general patterns were identified and described, each with its benefits, consequences, pitfalls, and relations to other patterns. The results can help to better understand data and its actual forms, both for consumption and creation of data. Particular domains of application include data archaeology and data literacy.
Jones, Mary Elizabeth Song Il-Yeol. "Dimensional modeling : identifying patterns, classifying patterns, and evaluating pattern impact on the design process /." Philadelphia, Pa. : Drexel University, 2006. http://dspace.library.drexel.edu/handle/1860/743.
Tronicke, Jens. "Patterns in geophysical data and models." Universität Potsdam, 2006. http://www.uni-potsdam.de/imaf/events/ge_work0602.html.
Muzammal, Muhammad. "Mining sequential patterns from probabilistic data." Thesis, University of Leicester, 2012. http://hdl.handle.net/2381/27638.
陳志昌 and Chee-cheong Chan. "Compositional data analysis of voting patterns." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 1993. http://hub.hku.hk/bib/B31977236.
McDermott, Philip. "Patterns of data management in bioinformatics." Thesis, University of Manchester, 2010. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.705544.
Momsen, Eric. "Vector-Vector Patterns for Agricultural Data." Thesis, North Dakota State University, 2013. https://hdl.handle.net/10365/27040.
National Science Foundation Partnerships for Innovation program Grant No. 1114363
Chan, Chee-cheong. "Compositional data analysis of voting patterns." [Hong Kong : University of Hong Kong], 1993. http://sunzi.lib.hku.hk/hkuto/record.jsp?B13787160.
Tiddi, Ilaria. "Explaining data patterns using knowledge from the Web of Data." Thesis, Open University, 2016. http://oro.open.ac.uk/47827/.
Kamra, Varun. "Mining discriminating patterns in data with confidence." Thesis, California State University, Long Beach, 2016. http://pqdtopen.proquest.com/#viewpdf?dispub=10196147.
There are many pattern mining algorithms available for classifying data. The main drawback of most of these algorithms is that they focus on mining frequent patterns, which may not always be discriminative enough for classification. There may exist patterns that are not frequent but are efficient discriminators, and in such cases these algorithms might not perform well. This project proposes the MDP algorithm, which searches for patterns that are good at discriminating between classes rather than for frequent patterns. The algorithm ensures that there is at least one most discriminative pattern (MDP) per record. The purpose of the project is to investigate how a structural approach to classification compares to a functional approach. The project has been developed in the Java programming language.
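The contrast between frequent and discriminative patterns can be illustrated with a small sketch. This is not the MDP algorithm from the project (the abstract does not specify it, and the project itself was written in Java); it is a simplified stand-in that scores each itemset by the absolute difference of its support in two classes. The names `support`, `rank_patterns`, and the toy records are assumptions made for illustration.

```python
from itertools import combinations


def support(pattern, records):
    """Fraction of records (sets of items) containing every item of the pattern."""
    if not records:
        return 0.0
    return sum(pattern <= r for r in records) / len(records)


def rank_patterns(pos, neg, max_len=2):
    """Score itemsets up to max_len by |support in pos - support in neg|.

    Returns (score, overall_support, pattern) tuples, best score first.
    This difference-of-support score is an illustrative assumption.
    """
    items = sorted(set().union(*pos, *neg))
    scored = []
    for k in range(1, max_len + 1):
        for combo in combinations(items, k):
            p = frozenset(combo)
            score = abs(support(p, pos) - support(p, neg))
            scored.append((score, support(p, pos + neg), tuple(sorted(p))))
    return sorted(scored, key=lambda t: (-t[0], t[2]))


# Toy data: "a" is in every record (frequent, useless); "x" is rare but class-specific.
pos = [{"a", "b", "x"}, {"a", "c", "x"}, {"a", "d"}, {"a"}]
neg = [{"a", "b"}, {"a", "c"}, {"a", "d"}, {"a"}]
ranking = rank_patterns(pos, neg, max_len=1)
print(ranking[0])  # the infrequent item "x" ranks first
```

In this toy data the item with the highest overall support scores zero, while an item present in only a quarter of all records separates the classes best, which is the situation the abstract describes.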
Light, Adam. "Design patterns for cartography and data graphics /." view abstract or download file of text, 2004. http://wwwlib.umi.com/cr/uoregon/fullcit?p3153792.
Typescript. Includes vita and abstract. Includes bibliographical references (leaves 93-97). Also available for download via the World Wide Web; free to University of Oregon users.
Sommeria-Klein, Guilhem. "From models to data : understanding biodiversity patterns from environmental DNA data." Thesis, Toulouse 3, 2017. http://www.theses.fr/2017TOU30390/document.
Integrative patterns of biodiversity, such as the distribution of taxa abundances and the spatial turnover of taxonomic composition, have been under scrutiny from ecologists for a long time, as they offer insight into the general rules governing the assembly of organisms into ecological communities. Thanks to recent progress in high-throughput DNA sequencing, these patterns can now be measured in a fast and standardized fashion through the sequencing of DNA sampled from the environment (e.g. soil or water), instead of relying on tedious fieldwork and rare naturalist expertise. They can also be measured for the whole tree of life, including the vast and previously unexplored diversity of microorganisms. Taking full advantage of this new type of data is challenging, however: DNA-based surveys are indirect, and suffer as such from many potential biases; they also produce large and complex datasets compared to classical censuses. The first goal of this thesis is to investigate how statistical tools and models classically used in ecology or coming from other fields can be adapted to DNA-based data so as to better understand the assembly of ecological communities. The second goal is to apply these approaches to soil DNA data from the Amazonian forest, the Earth's most diverse land ecosystem. Two broad types of mechanisms are classically invoked to explain the assembly of ecological communities: 'neutral' processes, i.e. the random birth, death and dispersal of organisms, and 'niche' processes, i.e. the interaction of the organisms with their environment and with each other according to their phenotype. Disentangling the relative importance of these two types of mechanisms in shaping taxonomic composition is a key ecological question, with many implications from estimating global diversity to conservation issues. 
In the first chapter, this question is addressed across the tree of life by applying the classical analytic tools of community ecology to soil DNA samples collected from various forest plots in French Guiana. The second chapter focuses on the neutral aspect of community assembly.[...]
Zhang, Xin Iris, and 張欣. "Fast mining of spatial co-location patterns." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2004. http://hub.hku.hk/bib/B30462708.
Merah, Amar Farouk. "Vehicular Movement Patterns: A Sequential Patterns Data Mining Approach Towards Vehicular Route Prediction." Thèse, Université d'Ottawa / University of Ottawa, 2012. http://hdl.handle.net/10393/22851.
Salazar, Llano Lorena. "Portraying urban diversity patterns through exploratory data analysis." Doctoral thesis, Universitat Politècnica de Catalunya, 2019. http://hdl.handle.net/10803/668423.
This thesis analyzes the complexity of the urban system, described with multiple variables that represent the environmental, economic, and social characteristics of the city. The fundamental motivation for undertaking this study is to describe the diversity of the city and its relationship with a better response to disturbances and threats, and therefore with its sustainability. The thesis aims to contribute theoretical knowledge by applying statistical and computational methodologies that are developed progressively across its chapters. The introduction presents the abstraction of the city as an urban system and reviews the concepts and measures of diversity within the theoretical frameworks of sustainability, urban ecology, and complex systems theory. Next, the urban system of the city of Barcelona is introduced, made up of a set of districts and represented by an information system containing temporal measurements of multiple environmental, economic, and social variables. A first approach to the sustainability of the city is made using information entropy as a measure of the diversity of the urban system. The fundamental contribution of the thesis, however, centers on the application of Exploratory Multivariate Analysis (EMA) to the urban system: Principal Component Analysis (PCA), Multiple Factor Analysis (MFA), and Hierarchical Cluster Analysis (HCA). From this perspective, diversity is analyzed by identifying the similarity, or dissimilarity, between the different parts that make up the urban system. Some techniques from computer science and physics are also proposed to evaluate the temporal transformation of the urban system, understood as a cloud of three-dimensional data that gradually deforms. The analysis of the case study identifies differentiated characteristics and distinctive functions of the districts. 
In addition, the temporal dependence of the data set reveals information about the districts' trends toward differentiation or homogenization. Finally, the conclusions drawn from the most relevant results are presented and some future lines of research are outlined.
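Information entropy as a diversity measure, mentioned in the abstract above, can be sketched as follows. This is the standard Shannon entropy over category shares; the abstract does not say how the thesis normalizes or aggregates it, so the evenness normalization, the function names, and the district counts below are assumptions for illustration only.

```python
import math


def shannon_entropy(counts):
    """Shannon entropy (natural log) of the shares implied by the counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    shares = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in shares)


def evenness(counts):
    """Entropy divided by its maximum, log(k), for k non-empty categories."""
    k = sum(1 for c in counts if c > 0)
    if k <= 1:
        return 0.0
    return shannon_entropy(counts) / math.log(k)


# Hypothetical land-use counts for two districts (not data from the thesis).
mixed_district = [25, 25, 25, 25]   # four uses, equally represented
single_use_district = [100, 0, 0, 0]

print(evenness(mixed_district))       # maximal diversity
print(evenness(single_use_district))  # no diversity
```

A district whose activity is spread evenly across categories reaches the maximum value, while a single-use district scores zero.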
Hönel, Sebastian. "Temporal data analysis facilitating recognition of enhanced patterns." Thesis, Linnéuniversitetet, Institutionen för datavetenskap (DV), 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-51864.
Gu, Zhuoer. "Mining previously unknown patterns in time series data." Thesis, University of Warwick, 2017. http://wrap.warwick.ac.uk/99207/.
Breyer, Nils. "Analysis of Travel Patterns from Cellular Network Data." Licentiate thesis, Linköpings universitet, Kommunikations- och transportsystem, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-157139.
Kabra, Amit. "Clustering of Driver Data based on Driving Patterns." Thesis, Blekinge Tekniska Högskola, Institutionen för datavetenskap, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-18466.
Smirnov, Sergey, Matthias Weidlich, Jan Mendling, and Mathias Weske. "Action patterns in business process models." Universität Potsdam, 2009. http://opus.kobv.de/ubp/volltexte/2009/3358/.
The growing importance of business process management means that an increasing number of a company's employees are tasked with creating process models. To ensure the quality and homogeneity of the process models despite this trend, appropriate modeling support is indispensable. In this report we present an approach that supports process modeling through recommendations. These are based on so-called action patterns, which represent typical blocks of work. In addition to defining these action patterns, we present a method for identifying them. Using association analysis techniques, the patterns can be extracted automatically from a collection of process models. The applicability of our approach is illustrated by a case study based on the SAP Reference Model.
Hilton, Ross P. "Model-based data mining methods for identifying patterns in biomedical and health data." Diss., Georgia Institute of Technology, 2015. http://hdl.handle.net/1853/54387.
Ding, Guoxiang. "DERIVING ACTIVITY PATTERNS FROM INDIVIDUAL TRAVEL DIARY DATA: A SPATIOTEMPORAL DATA MINING APPROACH." The Ohio State University, 2009. http://rave.ohiolink.edu/etdc/view?acc_num=osu1236777859.
Yang, Di. "Mining and Managing Neighbor-Based Patterns in Data Streams." Digital WPI, 2012. https://digitalcommons.wpi.edu/etd-dissertations/16.
Lee, Ho Young. "Diagnosing spatial variation patterns in manufacturing processes." Diss., Texas A&M University, 2003. http://hdl.handle.net/1969/122.
Chambers, Connie. "Development of a physician profiling data mart." [Denver, Colo.] : Regis University, 2008. http://165.236.235.140/lib/CChambers2008partI.pdf.
Padhye, Manoday D. "Use of data mining for investigation of crime patterns." Morgantown, W. Va. : [West Virginia University Libraries], 2006. https://eidr.wvu.edu/etd/documentdata.eTD?documentid=4836.
Title from document title page. Document formatted into pages; contains viii, 108 p. : ill. (some col.). Includes abstract. Includes bibliographical references (p. 80-81).
Tillander, Annika. "Classification models for high-dimensional data with sparsity patterns." Doctoral thesis, Stockholms universitet, Statistiska institutionen, 2013. http://urn.kb.se/resolve?urn=urn:nbn:se:su:diva-95664.
With today's technology, for example spectrometers and gene chips, data are generated in large quantities. This abundance of data is not only an advantage but also causes certain problems: typically the number of variables (p) is considerably larger than the number of observations (n). This yields so-called high-dimensional data, which require new statistical methods, since the traditional methods were developed for the opposite situation (p<n). Moreover, usually very few of all these variables are relevant to any given project, and the strength of the information in the relevant variables is often weak. This type of data is therefore usually described as sparse and weak. Identifying the relevant variables is commonly likened to finding a needle in a haystack. This thesis addresses three different ways of classifying in this type of high-dimensional data. Here, classifying means using access to a data set with both explanatory variables and an outcome variable to teach a function or algorithm how to predict the outcome variable based on the explanatory variables alone. The real data used in the thesis are microarrays, cell samples that show the activity of the genes in the cell. The goal of the classification is to use the variation in activity of the thousands of genes (the explanatory variables) to determine whether the cell sample comes from cancer tissue or normal tissue (the outcome variable). There are classification methods that can handle high-dimensional data, but these are often computationally intensive, and therefore they often work better for discrete data. By transforming continuous variables into discrete ones (discretizing), computation time can be reduced and classification made more efficient. The thesis studies how discretization affects the prediction accuracy of the classification, and a very efficient discretization method for high-dimensional data is proposed. Linear classification methods have the advantage of being stable. 
The drawback is that they require an invertible covariance matrix, which the covariance matrix is not for high-dimensional data. The thesis proposes a way to estimate the inverse of sparse covariance matrices using a block-diagonal matrix. This matrix also has the advantage of leading to additive classification, which makes it possible to select entire blocks of relevant variables. The thesis also presents a method for identifying and selecting the blocks. There are also probabilistic classification methods, which have the advantage of giving the probability of belonging to each of the possible outcomes for an observation, unlike most other classification methods, which only predict the outcome. The thesis proposes such a Bayesian method, given the block-diagonal matrix and normally distributed outcome classes. The relevance and advantages of the methods proposed in the thesis are demonstrated by applying them to simulated and real high-dimensional data.
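The discretization step described in the abstract above, turning a continuous variable into a small number of discrete levels, can be sketched with plain equal-width binning. The thesis proposes its own discretization method, which the abstract does not specify; the function `discretize` and the sample expression values are hypothetical and only show the general idea.

```python
def discretize(values, n_bins=3):
    """Map each continuous value to an equal-width bin index in 0..n_bins-1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant variable carries no information; put everything in bin 0.
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical expression levels of one gene across seven samples.
expression = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
levels = discretize(expression, n_bins=3)
print(levels)
```

After this step each variable takes only `n_bins` distinct values, which is what lets classifiers designed for discrete data operate on it.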
Sun, Feng-Tso. "Nonparametric Discovery of Human Behavior Patterns from Multimodal Data." Research Showcase @ CMU, 2014. http://repository.cmu.edu/dissertations/359.
Sithole, Jabulani S. "Longitudinal data models for evaluating change in prescribing patterns." Thesis, Keele University, 2000. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.327702.
Abnaof, Khalid [Verfasser]. "Finding Common Patterns In Heterogeneous Perturbation Data / Khalid Abnaof." Bonn : Universitäts- und Landesbibliothek Bonn, 2016. http://d-nb.info/1103024337/34.
Wilson, Saul Kriger. "Exploring urban activity patterns using electric smart meter data." Thesis, Massachusetts Institute of Technology, 2016. http://hdl.handle.net/1721.1/107028.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references (pages 109-111).
This thesis uses electricity consumption data from household and enterprise-level smart meters in County B, Country A, and Turin, Italy, to explore temporal and geographic variations in urban energy consumption and thus urban activity. A central question is whether electricity consumption patterns vary between different economic sectors, across space, and between different days of the week and times of year. This data shows clearly that Country A activity patterns are roughly similar across all seven days of the week, whereas Italian electricity consumption declines markedly on weekends, particularly Sundays. In general, and particularly in Italy, this thesis shows strong seasonality to electricity consumption, with clearly identifiable seasons and high correlation in consumption patterns within each season. This thesis focuses on user type variation in Country A, where although certain patterns are more widespread in some sectors than others, there is significant overlap between pairs of sectors. Hence this thesis is able only to classify land use between residential and industrial sectors, and is unable to classify land use to a meaningful degree of accuracy by analyzing electricity consumption. It is, however, possible to detect geographic variation: urban and industrial centers consume a higher percentage of their electricity on weekdays and during regular work hours than rural areas. In addition, the impact of various special occurrences on urban behavior is probed. This thesis provides measurement of the impact of various holidays on economic activity, using electricity consumption as a proxy. Large (industrial) consumers are generally much more sensitive to holidays than small (residential) consumers are, except during the summer months in Italy. In general, consumption declines on a single holiday are highly correlated with consumption declines on other holidays. 
Furthermore, using observations at 15-minute intervals, I attempt to measure the short-term behavior shifts caused by daylight savings time's start and finish.
by Saul Kriger Wilson.
S.M.
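The weekday versus weekend comparison described in the Wilson abstract can be sketched as a ratio of mean consumption per reading. The thesis's actual methodology and data are not given in the abstract; the function name, the synthetic 15-minute readings, and the kWh values below are assumptions for illustration.

```python
from datetime import datetime, timedelta


def weekend_to_weekday_ratio(readings):
    """Mean weekend reading divided by mean weekday reading.

    readings: iterable of (timestamp, kwh) pairs, e.g. 15-minute intervals.
    """
    weekday, weekend = [], []
    for ts, kwh in readings:
        # datetime.weekday(): Monday is 0, Saturday is 5, Sunday is 6.
        (weekend if ts.weekday() >= 5 else weekday).append(kwh)
    if not weekday or not weekend:
        raise ValueError("need at least one weekday and one weekend reading")
    return (sum(weekend) / len(weekend)) / (sum(weekday) / len(weekday))


# Two synthetic weeks of 15-minute readings starting on Monday 2024-01-01;
# weekend consumption is set to half the weekday level.
start = datetime(2024, 1, 1)
readings = []
for i in range(14 * 96):
    ts = start + timedelta(minutes=15 * i)
    readings.append((ts, 1.0 if ts.weekday() >= 5 else 2.0))

ratio = weekend_to_weekday_ratio(readings)
print(ratio)
```

A ratio near 1 would correspond to the flat weekly profile the abstract reports for Country A, and a ratio well below 1 to the weekend decline it reports for Italy.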
Alhusain, Sultan. "Intelligent data-driven reverse engineering of software design patterns." Thesis, De Montfort University, 2016. http://hdl.handle.net/2086/14341.
Patchala, Jagadeesh. "Data Mining Algorithms for Discovering Patterns in Text Collections." University of Cincinnati / OhioLINK, 2016. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1458299372.
Awodokun, Olugbenga. "Classification of Patterns in Streaming Data Using Clustering Signatures." University of Cincinnati / OhioLINK, 2017. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1504880155623189.
Seyfi, Majid. "Mining discriminative itemsets in data streams using different window models." Thesis, Queensland University of Technology, 2018. https://eprints.qut.edu.au/120850/1/Majid_Seyfi_Thesis.pdf.
Kerr, David. "Extraction of displacement data from Electronic Speckle Pattern Interferometric fringe patterns using digital image processing techniques." Thesis, Loughborough University, 1992. https://dspace.lboro.ac.uk/2134/28205.
Guo, Zhenyu. "Visually Mining Interesting Patterns in Multivariate Datasets." Digital WPI, 2013. https://digitalcommons.wpi.edu/etd-dissertations/9.
Full textYou, Chang Hun. "Learning patterns in dynamic graphs with application to biological networks." Pullman, Wash. : Washington State University, 2009. http://www.dissertations.wsu.edu/Dissertations/Summer2009/c_you_072309.pdf.
Title from PDF title page (viewed on Aug. 19, 2009). "School of Electrical Engineering and Computer Science." Includes bibliographical references (p. 114-117).
Henning, Johan, and Nicolai Hellesnes. "Detecting Plagiarism Patterns in student code." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-255049.
Plagiarism has become a major problem both in education and in the software development industry. Although much time and effort has been spent on improving plagiarism detection, it has mostly focused on ordinary text. While the methods for detecting plagiarism have improved, the methods for plagiarizing have evolved as well. This thesis focuses on plagiarism in programming courses for first-year students in the computer engineering programme at KTH, to see how widespread plagiarism is and whether plagiarism detection algorithms combined with metadata from GitHub can be used to improve the detection of plagiarism. More specifically, the number-of-commits metadata from GitHub is used to see whether interesting patterns can be found for students who plagiarize. The data set used in this report consists of GitHub repositories from the programming courses DD1337 and DD1338 from 2015. The data set comprises 17 programming assignments with approximately 200 submissions for each assignment. The plagiarism detection tool used is MOSS; for each week, the 10 most suspicious submissions were added to a list of suspected submissions, which was then used to find patterns for students who plagiarize. The results show that the suspected students had on average 5.27 commits per submission, while the non-suspected students had an average of 6.49 commits per submission. This means that the suspected students had, on average, fewer commits than the non-suspected students. Future studies include testing with larger data sets and with other metadata to see whether other interesting patterns can be found for students who plagiarize.
Wong, Ka-yan, and 王嘉欣. "Positioning patterns from multidimensional data and its applications in meteorology." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2008. http://hub.hku.hk/bib/B39558630.
De, Luca Silvia. "Studies of CMS data access patterns with machine learning techniques." Bachelor's thesis, Alma Mater Studiorum - Università di Bologna, 2016. http://amslaurea.unibo.it/12021/.
Bifet, Albert. "Adaptive Learning and Mining for Data Streams and Frequent Patterns." Doctoral thesis, Universitat Politècnica de Catalunya, 2009. http://hdl.handle.net/10803/22738.
This thesis is devoted to the design of data mining algorithms for evolving data streams and for the extraction of closed frequent trees. First, we deal with each of these tasks separately, and then we deal with them together, developing classification methods for data streams containing items that are trees. In the data stream model, data arrive at high speed, and the algorithms that must process them have very strict constraints of space and time. In the first part of this thesis we propose and illustrate a framework for developing algorithms that can adaptively learn from data streams that change over time. Our methods are based on using change detectors and estimator modules at the right places. We propose an adaptive sliding window algorithm ADWIN for detecting change and keeping updated statistics from a data stream, and use it as a black-box in place of counters or accumulators in algorithms initially not designed for drifting data. Since ADWIN has rigorous performance guarantees, this opens the possibility of extending such guarantees to learning and mining algorithms. We test our methodology with several learning methods such as Naïve Bayes, clustering, decision trees and ensemble methods. We build an experimental framework for data stream mining with concept drift, based on the MOA framework, similar to WEKA, so that it will be easy for researchers to run experimental data stream benchmarks. Trees are connected acyclic graphs and they are studied as link-based structures in many cases. In the second part of this thesis, we describe a rather formal study of trees from the point of view of closure-based mining. Moreover, we present efficient algorithms for subtree testing and for mining ordered and unordered frequent closed trees. 
We include an analysis of the extraction of association rules of full confidence out of the closed sets of trees, and we have found there an interesting phenomenon: rules whose propositional counterpart is nontrivial are, however, always implicitly true in trees due to the peculiar combinatorics of the structures. And finally, using these results on evolving data streams mining and closed frequent tree mining, we present high performance algorithms for mining closed unlabeled rooted trees adaptively from data streams that change over time. We introduce a general methodology to identify closed patterns in a data stream, using Galois Lattice Theory. Using this methodology, we then develop an incremental one, a sliding-window based one, and finally one that mines closed trees adaptively from data streams. We use these methods to develop classification methods for tree data streams.
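The adaptive sliding window idea behind ADWIN can be sketched in a few lines. This is a deliberately naive version, not the algorithm from the thesis: it stores every value and tests every split point, whereas ADWIN proper compresses the window into buckets to save memory and time. The class name, the default `delta`, and the assumption that values lie in [0, 1] are illustrative choices, and the threshold is a Hoeffding-style bound of the kind used for ADWIN.

```python
import math


class SimpleAdaptiveWindow:
    """Naive adaptive window: drop old data when two sub-windows disagree."""

    def __init__(self, delta=0.002):
        self.delta = delta   # confidence parameter (assumed default)
        self.window = []     # values are assumed to lie in [0, 1]

    @property
    def mean(self):
        return sum(self.window) / len(self.window) if self.window else 0.0

    def _cut_needed(self):
        n = len(self.window)
        total = sum(self.window)
        head = 0.0
        for i in range(1, n):
            head += self.window[i - 1]
            n0, n1 = i, n - i
            m = 1.0 / (1.0 / n0 + 1.0 / n1)  # harmonic-style size term
            eps = math.sqrt(math.log(4.0 * n / self.delta) / (2.0 * m))
            if abs(head / n0 - (total - head) / n1) > eps:
                return True
        return False

    def add(self, x):
        """Insert x; return True if old data was dropped (change detected)."""
        self.window.append(x)
        changed = False
        while len(self.window) > 1 and self._cut_needed():
            self.window.pop(0)
            changed = True
        return changed


detector = SimpleAdaptiveWindow()
stream = [0.0] * 200 + [1.0] * 200   # abrupt change at position 200
flags = [detector.add(x) for x in stream]
print(any(flags[:200]), any(flags[200:]), round(detector.mean, 2))
```

While the stream is stationary the window keeps growing; shortly after the shift the oldest values are discarded, so the window's mean tracks the new distribution. That mean is the kind of statistic that could stand in for a plain counter or accumulator, as the abstract describes.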
Lodolini, Lucia. "The representation of symmetric patterns using the Quadtree data structure /." Online version of thesis, 1988. http://hdl.handle.net/1850/8402.
Wong, Ka-yan. "Positioning patterns from multidimensional data and its applications in meteorology." Click to view the E-thesis via HKUTO, 2008. http://sunzi.lib.hku.hk/HKUTO/record/B39558630.
Bifet, Figuerol Albert Carles. "Adaptive Learning and Mining for Data Streams and Frequent Patterns." Doctoral thesis, Universitat Politècnica de Catalunya, 2009. http://hdl.handle.net/10803/22738.
Tatu, Andrada [Verfasser]. "Visual Analytics of Patterns in High-Dimensional Data / Andrada Tatu." Konstanz : Bibliothek der Universität Konstanz, 2013. http://d-nb.info/1041224680/34.
Nassopoulos, Georges. "Deducing Basic Graph Patterns from Logs of Linked Data Providers." Thesis, Nantes, 2017. http://www.theses.fr/2017NANT4110/document.
Following the principles of Linked Data, data providers have published billions of facts as RDF data. Executing SPARQL queries over SPARQL endpoints or Triple Pattern Fragments (TPF) servers makes it easy to consume Linked Data. However, federated SPARQL query processing and TPF query processing decompose the initial query into subqueries. Consequently, the data providers only see subqueries, and the initial query is known only to end users. Knowing which SPARQL queries were executed is fundamental for data providers: to ensure usage control, to optimize the cost of query answering, to justify return on investment, to improve the user experience, or to create business models of usage trends. In this thesis, we focus on analyzing execution logs of TPF servers and SPARQL endpoints to extract Basic Graph Patterns (BGP) of executed SPARQL queries. The main challenge in extracting BGPs is the concurrent execution of SPARQL queries. We propose two algorithms: LIFT and FETA. LIFT extracts BGPs of executed queries from a single TPF server log. FETA extracts BGPs of federated queries from a log of a set of SPARQL endpoints. For the experiments, we run LIFT and FETA on synthetic logs and real logs. LIFT and FETA are able to extract BGPs with good precision and recall under certain conditions.
Oliveira, Alexandre (Alexandre S.). "Finding patterns in timed data with spike timing dependent plasticity." Thesis, Massachusetts Institute of Technology, 2012. http://hdl.handle.net/1721.1/77031.
Cataloged from PDF version of thesis.
My research focuses on finding patterns in events, that is, in sequences of data that happen over time. It takes inspiration from a neuroscience phenomenon believed to be deeply involved in learning. I propose a machine learning algorithm that finds patterns in timed data and is highly robust to noise and missing data. It can find both coincident relationships, where two events tend to happen together, and causal relationships, where one event appears to be caused by another. I analyze stock price information using this algorithm, and strong relationships are found between companies within the same industry. In particular, I worked with 12 stocks taken from the banking, information technology, healthcare, and oil industries. The relationships are almost exclusively coincidental rather than causal.
by Alexandre Oliveira.
M.Eng.
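The Oliveira abstract does not spell out its algorithm, so the following is only a sketch of the textbook pair-based spike timing dependent plasticity (STDP) window that the title refers to, not the thesis's method. The parameter values, the function names `stdp_delta` and `accumulate`, and the event times are assumptions chosen for illustration; treating exactly simultaneous events as producing no change is one possible convention.

```python
import math
from typing import Sequence


def stdp_delta(
    dt: float,
    a_plus: float = 0.1,
    a_minus: float = 0.12,
    tau_plus: float = 20.0,
    tau_minus: float = 20.0,
) -> float:
    """Weight change for one event pair, with dt = t_post - t_pre."""
    if dt > 0:
        # "pre" happened shortly before "post": strengthen the link.
        return a_plus * math.exp(-dt / tau_plus)
    if dt < 0:
        # "post" happened before "pre": weaken the link.
        return -a_minus * math.exp(dt / tau_minus)
    return 0.0  # assumed convention for exactly simultaneous events


def accumulate(pre_times: Sequence[float], post_times: Sequence[float]) -> float:
    """Sum the pairwise STDP contributions of two event streams."""
    return sum(stdp_delta(tp - tq) for tq in pre_times for tp in post_times)


# Hypothetical event streams: B tends to follow A by 5 time units.
a_events = [10.0, 110.0, 210.0]
b_events = [15.0, 115.0, 215.0]

forward = accumulate(a_events, b_events)   # A as "pre", B as "post"
backward = accumulate(b_events, a_events)  # B as "pre", A as "post"
print(forward > 0, backward < 0)
```

Under these assumptions the accumulated change is positive in the A-to-B direction and negative in the reverse direction. Such an asymmetry is one way a timing-based rule can separate a directed (apparently causal) relationship from a merely coincident one.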
Vimieiro, Renato. "Mining disjunctive patterns in biomedical data sets." Thesis, 2012. http://hdl.handle.net/1959.13/936341.
Frequent itemset mining is one of the most studied problems in data mining. Since Agrawal et al. (1993) introduced the problem, several advances, both theoretical and practical, have been achieved. In spite of that, there are still many unresolved issues to be tackled before frequent pattern mining can be claimed a cornerstone approach in data mining (Han et al., 2007). Here, we investigate issues related to: (1) the (un)suitability of frequent itemset mining algorithms to identify patterns in biomedical data sets; and (2) the limited expressiveness of such patterns, since the vast majority of frequent itemsets are exclusively conjunctions. Our ultimate goal in this thesis is to improve methods for frequent pattern mining in such a way that they provide alternative, insightful solutions for mining biomedical data sets. Specifically, we provide efficient tools for mining disjunctive patterns in biomedical data sets. We tackle the problem of mining disjunctive patterns on three different fronts: (1) disjunctive minimal generators; (2) disjunctive closed patterns; and (3) quasi-CNF emerging patterns. We then propose three different algorithms, one for each task above: TitanicOR, Disclosed, and QCEP. While the first two aim for more descriptive patterns, the third is more predictive. These algorithms are proposed as an attempt to cover different sources of data sets coming from biomedical research. TitanicOR is more suitable for identifying patterns in data sets containing physiological, biochemical, or medical record information. Disclosed was designed to exploit the characteristics of microarray gene expression data sets, which usually contain many features but only a few samples. Finally, QCEP is the only algorithm to consider data sets with class label information. We conducted experiments with both synthetic and real-world data sets to assess the performance of our algorithms.
Our experiments show that our algorithms outperformed the state-of-the-art algorithms in each of those categories of patterns.
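The difference between conjunctive and disjunctive patterns mentioned in the Vimieiro abstract can be made concrete with a small support count. This is a sketch of the basic definitions only, not of TitanicOR, Disclosed, or QCEP; the transactions and item names are made up for illustration.

```python
from typing import List, Set

# Hypothetical transactions, e.g. genes found active in each sample.
transactions: List[Set[str]] = [
    {"geneA", "geneB"},
    {"geneA"},
    {"geneB", "geneC"},
    {"geneC"},
    {"geneA", "geneB", "geneC"},
]


def conjunctive_support(items: Set[str], db: List[Set[str]]) -> int:
    """Number of transactions containing ALL of the items (A and B)."""
    return sum(1 for t in db if items <= t)


def disjunctive_support(items: Set[str], db: List[Set[str]]) -> int:
    """Number of transactions containing AT LEAST ONE of the items (A or B)."""
    return sum(1 for t in db if items & t)


pattern = {"geneA", "geneB"}
print(conjunctive_support(pattern, transactions))  # 2
print(disjunctive_support(pattern, transactions))  # 4
```

Here the same two items cover two transactions as a conjunction and four as a disjunction, which shows how a disjunctive pattern can describe samples that no single conjunction captures.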
Liu, Chunyang. "Summarizing data with representative patterns." Thesis, 2016. http://hdl.handle.net/10453/52923.
Advances in technology have made data acquisition and storage unprecedentedly convenient. This has contributed to the rapid growth of not only the volume but also the veracity and variety of data in recent years, which poses new challenges to the data mining area. For example, uncertain data mining has emerged due to its capability to model the inherent veracity of data; spatial data mining attracts much research attention with the spread of location-based services and wearable devices. As a fundamental topic of data mining, how to effectively and efficiently summarize data in this situation remains to be explored. This thesis studies the problem of summarizing data with representative patterns. The objective is to find a set of patterns that is much more concise but still retains rich information from the original data, and that may provide valuable insights for further analysis. In light of this idea, we formally formulate the problem and provide effective and efficient solutions in various scenarios. We study the problem of summarizing probabilistic frequent patterns over uncertain data. Probabilistic frequent pattern mining over uncertain data has received much research attention due to the wide applicability of uncertain data. It suffers from the problem of generating an exponential number of result patterns, which hinders the analysis of patterns and calls for finding a small number of representative patterns to approximate all other patterns. We formally formulate the problem of probabilistic representative frequent pattern (P-RFP) mining, which aims to find the minimal set of patterns with sufficiently high probability to represent all other patterns. The bottleneck turns out to be checking whether a pattern can probabilistically represent another, which involves the computation of a joint probability of the supports of two patterns.
We propose a novel dynamic programming-based approach to address the problem and devise effective optimization strategies to improve computational efficiency. To enhance the practicability of P-RFP mining, we introduce a novel approximation of the joint probability with both theoretical and empirical proofs. Based on the approximation, we propose an Approximate P-RFP Mining (APM) algorithm, which effectively and efficiently compresses the probabilistic frequent pattern set. The error rate of APM is guaranteed to be very small when the database contains hundreds of transactions, which further affirms that APM is a practical solution for summarizing probabilistic frequent patterns. We address the problem of directly summarizing an uncertain transaction database by formulating the problem as Minimal Probabilistic Tile Cover Mining, which aims to find a high-quality probabilistic tile set covering an uncertain database with minimal cost. We define the concepts of Probabilistic Price and Probabilistic Price Order to evaluate and compare the quality of tiles, and propose a framework to discover the minimal probabilistic tile cover. The bottleneck is to check whether a tile is better than another according to the Probabilistic Price Order, which involves the computation of a joint probability. We prove that it can be decomposed into independent terms and calculated efficiently. Several optimization techniques are devised to further improve the performance. We analyze the problem of summarizing co-locations mined from spatial databases. Co-location pattern mining finds patterns of spatial features whose instances tend to locate together in geographic space. However, the traditional framework of co-location pattern mining produces an exponential number of patterns because of the downward closure property, which makes it difficult for users to understand, assess, or apply the huge number of resulting patterns.
To address this issue, we study the problem of mining representative co-location patterns (RCP). We first define a covering relationship between two co-location patterns and then formally formulate the problem of Representative Co-location Pattern mining. To solve the problem of RCP mining, we propose the RCPFast algorithm, which adopts the post-mining framework, and the RCPMS algorithm, which pushes pattern summarization into the co-location mining process.
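As background for the probabilistic frequent patterns discussed in the Liu abstract, the sketch below computes the probability that a single pattern is frequent in an uncertain database using a standard dynamic program over the support distribution. It is not the thesis's joint-probability computation for two patterns and not the APM algorithm. It assumes that each transaction contains the pattern independently with a known probability; the function names and the example probabilities are hypothetical.

```python
from typing import List, Sequence


def support_distribution(probs: Sequence[float]) -> List[float]:
    """Return dist where dist[k] = P(support == k).

    probs[i] is the assumed probability that transaction i contains
    the pattern; transactions are assumed to be independent.
    """
    dist = [1.0]  # before any transaction, the support is 0 with certainty
    for p in probs:
        new = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            new[k] += mass * (1.0 - p)  # transaction lacks the pattern
            new[k + 1] += mass * p      # transaction contains the pattern
        dist = new
    return dist


def frequentness_probability(probs: Sequence[float], minsup: int) -> float:
    """P(support >= minsup) under the independence assumption above."""
    return sum(support_distribution(probs)[minsup:])


# Hypothetical existence probabilities for one pattern in four transactions.
probs = [0.9, 0.8, 0.5, 0.1]
print(frequentness_probability(probs, minsup=2))
```

A pattern would then count as probabilistically frequent when this value reaches a user-chosen threshold. The dynamic program costs time quadratic in the number of transactions, which gives some intuition for why checks involving two patterns at once can become a bottleneck.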