Dissertations / Theses on the topic 'Data patterns'

Consult the top 50 dissertations / theses for your research on the topic 'Data patterns.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Press it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse dissertations / theses on a wide variety of disciplines and organise your bibliography correctly.

1

Voß, Jakob. "Describing data patterns." Doctoral thesis, Humboldt-Universität zu Berlin, Philosophische Fakultät I, 2013. http://dx.doi.org/10.18452/16794.

Abstract:
This thesis addresses the question of how data are fundamentally structured and described. In contrast to existing treatments of data as stored observations or facts, data are here understood semiotically as signs. These signs are communicated in the form of digital documents and are structured and described by means of numerous standards, formats, languages, encodings, schemas, techniques, etc. This variety of means is analysed for the first time in its entirety, using the phenomenological research method. The aim is to arrive at the general essence of data structuring and description through close examination and description of the means used to structure and describe data. The results of this thesis consist of three parts. First, six prototypes emerge that categorize the described means by their primary purpose. Second, there are five paradigms that fundamentally shape the understanding and application of means for structuring and describing data. Third, this thesis presents a pattern language of data structuring: twenty patterns document typical problems and solutions that recur in the structuring and description of data, independent of specific techniques. The results of this work can contribute to a better understanding of data, that is, of digital documents and their metadata in all their forms. Particular areas of application include data archaeology and data literacy.
Many methods, technologies, standards, and languages exist to structure and describe data. The aim of this thesis is to find common features in these methods to determine how data is actually structured and described. Existing studies are limited to notions of data as recorded observations and facts, or they require given structures to build on, such as the concept of a record or the concept of a schema. These presumed concepts are deconstructed in this thesis from a semiotic point of view, by analysing data as signs communicated in the form of digital documents. The study was conducted using a phenomenological research method: conceptual properties of data structuring and description, such as encodings, identifiers, formats, schemas, and models, were first collected and examined critically. The analysis resulted in six prototypes that categorize data methods by their primary purpose. The study further revealed five basic paradigms that deeply shape how data is structured and described in practice. The third result consists of a pattern language of data structuring. The patterns show problems and solutions which occur over and over again in data, independent of particular technologies. Twenty general patterns were identified and described, each with its benefits, consequences, pitfalls, and relations to other patterns. The results can help to better understand data and its actual forms, both for consumption and creation of data. Particular domains of application include data archaeology and data literacy.
2

Jones, Mary Elizabeth Song Il-Yeol. "Dimensional modeling : identifying patterns, classifying patterns, and evaluating pattern impact on the design process /." Philadelphia, Pa. : Drexel University, 2006. http://dspace.library.drexel.edu/handle/1860/743.

3

Tronicke, Jens. "Patterns in geophysical data and models." Universität Potsdam, 2006. http://www.uni-potsdam.de/imaf/events/ge_work0602.html.

4

Muzammal, Muhammad. "Mining sequential patterns from probabilistic data." Thesis, University of Leicester, 2012. http://hdl.handle.net/2381/27638.

Abstract:
Sequential Pattern Mining (SPM) is an important data mining problem. Although classical SPM assumes that the data to be mined is deterministic, it is now recognized that data obtained from a wide variety of sources is inherently noisy or uncertain, such as data from sensors or data collected from the web from different (potentially conflicting) sources. Probabilistic databases are a popular framework for modelling uncertainty, and several data mining and ranking problems have recently been studied in them. To the best of our knowledge, this is the first systematic study of mining sequential patterns from probabilistic databases. In this work, we consider the kinds of uncertainty that can arise in SPM. We propose four novel uncertainty models for SPM, namely tuple-level uncertainty, event-level uncertainty, source-level uncertainty and source-level uncertainty in deduplication, all of which fit into the probabilistic databases framework, and motivate them using potential real-life scenarios. We then define the interestingness predicate for two measures of interestingness, namely expected support and probabilistic frequentness. Next, we consider the computational complexity of evaluating the interestingness predicate for various combinations of uncertainty models and interestingness measures, and show that different combinations have very different outcomes from a complexity-theoretic viewpoint: some cases are computationally tractable, while others are shown to be computationally intractable. We give a dynamic programming algorithm to compute the source support probability, and hence the expected support, of a sequence in a source-level uncertain database, and propose optimizations to speed up the support computation task. Next, we propose probabilistic SPM algorithms based on the candidate generation and pattern growth frameworks for the source-level uncertainty model and the expected support measure. We implement these algorithms and give an empirical evaluation showing their scalability under different parameter settings, using both real and synthetic datasets. Finally, we demonstrate the effectiveness of the probabilistic SPM framework at extracting meaningful patterns in the presence of noise.
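
The support probabilities mentioned in this abstract lend themselves to a compact dynamic program. The sketch below is not taken from the thesis; it assumes the simplest event-level reading, in which every event of a source sequence exists independently with a known probability, and all names are illustrative.

```python
from typing import FrozenSet, List, Tuple

def source_support_prob(pattern: List[FrozenSet[str]],
                        events: List[Tuple[FrozenSet[str], float]]) -> float:
    """P(pattern occurs as a subsequence of one uncertain source sequence),
    where each (itemset, p) event exists independently with probability p.
    O(n*m) dynamic program over events x pattern positions."""
    m = len(pattern)
    # prob[j] = P(pattern[:j] is contained in the events seen so far)
    prob = [1.0] + [0.0] * m
    for items, p in events:
        # scan right-to-left so each event extends previous values only once
        for j in range(m, 0, -1):
            if pattern[j - 1] <= items:       # event can match pattern[j-1]
                prob[j] = p * prob[j - 1] + (1.0 - p) * prob[j]
    return prob[m]

def expected_support(pattern, database):
    # linearity of expectation: sum the per-source occurrence probabilities
    return sum(source_support_prob(pattern, source) for source in database)

# toy database: two sources, each a list of (itemset, existence probability)
db = [[(frozenset("a"), 0.9), (frozenset("ab"), 0.5)],
      [(frozenset("b"), 0.8), (frozenset("a"), 0.4)]]
print(expected_support([frozenset("a"), frozenset("b")], db))  # 0.45
```
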
5

陳志昌 and Chee-cheong Chan. "Compositional data analysis of voting patterns." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 1993. http://hub.hku.hk/bib/B31977236.

6

McDermott, Philip. "Patterns of data management in bioinformatics." Thesis, University of Manchester, 2010. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.705544.

7

Momsen, Eric. "Vector-Vector Patterns for Agricultural Data." Thesis, North Dakota State University, 2013. https://hdl.handle.net/10365/27040.

Abstract:
Agriculture is increasingly driven by massive data, and some challenges are not covered by existing statistics, machine learning, or data mining techniques. Many crops are characterized not only by yield but also by quality measures, such as sugar content and sugar lost to molasses for sugarbeets. The set of features furthermore contains time series data, such as rainfall and periodic satellite imagery. This study examines the problem of identifying relationships in a complex data set, in which there are vectors (multiple attributes) for both the explanatory and response conditions. This problem can be characterized as a vector-vector pattern mining problem. The proposed algorithm uses one of the vector representations to determine the neighbors of a randomly picked instance, and then tests the randomness of that subset within the other vector representation. Compared to conventional approaches, the vector-vector algorithm shows better performance for distinguishing existing relationships.
National Science Foundation Partnerships for Innovation program Grant No. 1114363
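
To make the neighbor-then-randomness idea concrete, here is a minimal sketch under assumptions of ours, not the dissertation's: neighbors are taken in Euclidean distance on the explanatory vectors, and "randomness" of that subset in the response vectors is judged by a permutation test on mean pairwise distance.

```python
import numpy as np

rng = np.random.default_rng(0)

def neighbor_randomness_test(X, Y, k=10, trials=999):
    """Pick a random instance, take its k nearest neighbors in the
    explanatory space X, then test whether that subset looks random in
    the response space Y by comparing its mean pairwise distance against
    random subsets. Small p-values suggest X-neighborhoods are also
    cohesive in Y, i.e. a relationship between the two vectors."""
    n = len(X)
    i = rng.integers(n)
    d_x = np.linalg.norm(X - X[i], axis=1)
    nbrs = np.argsort(d_x)[1:k + 1]            # k nearest, excluding i itself

    def mean_pairwise(idx):
        pts = Y[idx]
        diff = pts[:, None, :] - pts[None, :, :]
        return np.linalg.norm(diff, axis=-1).mean()

    observed = mean_pairwise(nbrs)
    null = [mean_pairwise(rng.choice(n, size=k, replace=False))
            for _ in range(trials)]
    return (1 + sum(s <= observed for s in null)) / (trials + 1)

# toy data: the first response coordinate depends on the first explanatory one
X = rng.normal(size=(300, 4))
Y = np.c_[X[:, 0] + 0.1 * rng.normal(size=300), rng.normal(size=300)]
print(neighbor_randomness_test(X, Y))
```
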
8

Chan, Chee-cheong. "Compositional data analysis of voting patterns." [Hong Kong : University of Hong Kong], 1993. http://sunzi.lib.hku.hk/hkuto/record.jsp?B13787160.

9

Tiddi, Ilaria. "Explaining data patterns using knowledge from the Web of Data." Thesis, Open University, 2016. http://oro.open.ac.uk/47827/.

Abstract:
Knowledge Discovery (KD) is a long-established field aiming to develop methodologies to detect hidden patterns and regularities in large datasets, using techniques from a wide range of domains, such as statistics, machine learning, pattern recognition or data visualisation. In most real-world contexts, the interpretation and explanation of the discovered patterns is left to human experts, who use their background knowledge to analyse, refine and make the patterns understandable for the intended purpose. Explaining patterns is therefore an intensive and time-consuming process, in which parts of the knowledge can remain unrevealed, especially when the experts lack some of the required background knowledge. In this thesis, we investigate the hypothesis that this interpretation process can be facilitated by introducing background knowledge from the Web of (Linked) Data. In the last decade, many fields have started publishing and sharing their domain-specific knowledge in the form of structured data, with the objective of encouraging information sharing, reuse and discovery. With a constantly increasing amount of shared and connected knowledge, we thus assume that the process of explaining patterns can become easier, faster, and more automated. To demonstrate this, we developed Dedalo, a framework that automatically provides explanations for patterns of data using background knowledge extracted from the Web of Data. We studied the elements required for a piece of information to be considered an explanation, identified the best strategies to automatically find the right piece of information in the Web of Data, and designed a process able to produce explanations for a given pattern using the background knowledge autonomously collected from the Web of Data. The final evaluation of Dedalo involved users in an empirical study based on a real-world scenario. We demonstrated that the explanation process is complex when one is not familiar with the domain of usage, but also that it can be considerably simplified by using the Web of Data as a source of background knowledge.
10

Kamra, Varun. "Mining discriminating patterns in data with confidence." Thesis, California State University, Long Beach, 2016. http://pqdtopen.proquest.com/#viewpdf?dispub=10196147.

Abstract:

There are many pattern mining algorithms available for classifying data. The main drawback of most of these algorithms is that they focus on mining frequent patterns, which may not always be discriminative enough for classification. There can exist patterns that are not frequent but are efficient discriminators, and in such cases these algorithms might not perform well. This project proposes the MDP algorithm, which searches for patterns that are good at discriminating between classes rather than for frequent patterns. MDP ensures that there is at least one most discriminative pattern (MDP) per record. The purpose of the project is to investigate how a structural approach to classification compares to a functional approach. The project has been developed in the Java programming language.

11

Light, Adam. "Design patterns for cartography and data graphics /." view abstract or download file of text, 2004. http://wwwlib.umi.com/cr/uoregon/fullcit?p3153792.

Abstract:
Thesis (Ph. D.)--University of Oregon, 2004.
Typescript. Includes vita and abstract. Includes bibliographical references (leaves 93-97). Also available for download via the World Wide Web; free to University of Oregon users.
12

Sommeria-Klein, Guilhem. "From models to data : understanding biodiversity patterns from environmental DNA data." Thesis, Toulouse 3, 2017. http://www.theses.fr/2017TOU30390/document.

Abstract:
The distribution of species abundances at a site, and the similarity of taxonomic composition from one site to another, are two measures of biodiversity that have long served as the empirical basis for ecologists seeking to establish the general rules governing the assembly of communities of organisms. For such integrative measures, high-throughput sequencing of DNA sampled from the environment ("environmental DNA") is a recent and promising alternative to traditional naturalist observations. This approach has the advantage of being fast and standardized, and gives access to a wide range of microbial taxa that were previously undetectable. However, these large datasets with complex structure are difficult to analyse, and the indirect nature of the observations complicates their interpretation. The first objective of this thesis is to identify statistical models that can exploit this new type of data to better understand community assembly. The second objective is to test the selected approaches on soil biodiversity data from the Amazonian forest, collected in French Guiana. Two broad types of processes are invoked to explain the assembly of communities of organisms: "neutral" processes, independent of the species considered, namely the birth, death and dispersal of organisms; and processes related to the ecological niche occupied by the organisms, that is, interactions with the environment and among organisms. Disentangling the relative importance of these two types of processes in community assembly is a fundamental question in ecology, with many implications, notably for biodiversity estimation and conservation. The first chapter addresses this question by comparing environmental DNA samples taken from the soil of various forest plots in French Guiana, using the classical statistical tools of community ecology. The second chapter focuses on the neutral processes of community assembly.[...]
Integrative patterns of biodiversity, such as the distribution of taxa abundances and the spatial turnover of taxonomic composition, have long been under scrutiny from ecologists, as they offer insight into the general rules governing the assembly of organisms into ecological communities. Thanks to recent progress in high-throughput DNA sequencing, these patterns can now be measured in a fast and standardized fashion through the sequencing of DNA sampled from the environment (e.g. soil or water), instead of relying on tedious fieldwork and rare naturalist expertise. They can also be measured for the whole tree of life, including the vast and previously unexplored diversity of microorganisms. Taking full advantage of this new type of data is challenging, however: DNA-based surveys are indirect, and as such suffer from many potential biases; they also produce large and complex datasets compared to classical censuses. The first goal of this thesis is to investigate how statistical tools and models classically used in ecology, or coming from other fields, can be adapted to DNA-based data so as to better understand the assembly of ecological communities. The second goal is to apply these approaches to soil DNA data from the Amazonian forest, the Earth's most diverse land ecosystem. Two broad types of mechanisms are classically invoked to explain the assembly of ecological communities: "neutral" processes, i.e. the random birth, death and dispersal of organisms, and "niche" processes, i.e. the interaction of the organisms with their environment and with each other according to their phenotype. Disentangling the relative importance of these two types of mechanisms in shaping taxonomic composition is a key ecological question, with many implications from estimating global diversity to conservation issues. In the first chapter, this question is addressed across the tree of life by applying the classical analytic tools of community ecology to soil DNA samples collected from various forest plots in French Guiana. The second chapter focuses on the neutral aspect of community assembly.[...]
13

Zhang, Xin Iris, and 張欣. "Fast mining of spatial co-location patterns." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2004. http://hub.hku.hk/bib/B30462708.

14

Merah, Amar Farouk. "Vehicular Movement Patterns: A Sequential Patterns Data Mining Approach Towards Vehicular Route Prediction." Thèse, Université d'Ottawa / University of Ottawa, 2012. http://hdl.handle.net/10393/22851.

Abstract:
Behavioral pattern prediction in the context of Vehicular Ad hoc Networks (VANETs) has been receiving increasing attention, as it enables on-demand, intelligent traffic analysis and response to real-time traffic issues. Sequential patterns are a type of behavioral pattern that describe the occurrence of events in a timely-ordered fashion. In the context of VANETs, these events are defined as an ordered list of road segments traversed by vehicles during their trips from a starting point to their final intended destination, forming a vehicular path. Due to their predictable nature, undertaken vehicular paths can be exploited to extract the paths that are considered frequent. From the frequent paths extracted through data mining, the probability that a vehicular path will take a certain direction is obtained. To achieve this, however, samples of vehicular paths first need to be collected over periods of time so that they can be data-mined accordingly. In this thesis, a new set of formal definitions depicting vehicular paths as sequential patterns is described. Five novel communication schemes have also been designed and implemented under a simulated environment to collect vehicular paths; these schemes fall under two categories: Road Side Unit-Triggered (RSU-Triggered) and Vehicle-Triggered. After collection, frequent paths are extracted through data mining, and the probability of these frequent paths is measured. In order to evaluate the efficiency and effectiveness of the proposed schemes, extensive experimental analysis has been carried out. From the results, two of the Vehicle-Triggered schemes, VTB-FP and VTRD-FP, improved the vehicular path collection operation in terms of communication cost and latency over the others. In terms of reliability, the Vehicle-Triggered schemes achieved a higher success rate than the RSU-Triggered scheme. Finally, frequent vehicular movement patterns have been effectively extracted from the collected vehicular paths according to a user-defined threshold, and the confidence of the generated movement rules has been measured. From the analysis, it was clear that the user-defined threshold needs to be set appropriately in order not to discard important vehicular movement patterns.
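
As a toy illustration of the mining step described above (not the thesis's algorithms, which mine general sequential patterns), the sketch below counts contiguous subpaths over road-segment sequences, applies a user-defined support threshold, and derives the confidence of prefix-to-next-segment movement rules. Segment ids and the threshold are invented.

```python
from collections import Counter

# each collected vehicular path is an ordered list of road-segment ids
paths = [
    ["s1", "s2", "s3"],
    ["s1", "s2", "s4"],
    ["s1", "s2", "s3"],
    ["s5", "s2", "s3"],
]

min_support = 2  # user-defined threshold

# count every contiguous subpath of every collected path
subpath_counts = Counter()
for p in paths:
    for i in range(len(p)):
        for j in range(i + 1, len(p) + 1):
            subpath_counts[tuple(p[i:j])] += 1

frequent = {s: c for s, c in subpath_counts.items() if c >= min_support}

# movement rule prefix -> next segment, confidence = supp(prefix+next)/supp(prefix)
for sub, c in frequent.items():
    if len(sub) >= 2:
        prefix = sub[:-1]
        conf = c / subpath_counts[prefix]
        print(f"{prefix} -> {sub[-1]}  support={c}  confidence={conf:.2f}")
```
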
15

Salazar, Llano Lorena. "Portraying urban diversity patterns through exploratory data analysis." Doctoral thesis, Universitat Politècnica de Catalunya, 2019. http://hdl.handle.net/10803/668423.

Abstract:
This thesis analyzes the complexity of the urban system, described with multiple variables that represent the environmental, economic, and social characters of the city. The portrayal of urban diversity and its relationship with a better response of the city to disturbances, and hence with its sustainability, is the main motivation of the study. The thesis aims to provide theoretical knowledge through the application of statistical and computational methodologies that are developed progressively in its chapters. The introduction draws the city as an abstract urban system and reviews the concepts and measures of diversity within the theoretical frameworks of sustainability, urban ecology, and complex systems theory. Afterward, the city of Barcelona is introduced as the case study: it is constituted by a set of districts and represented by an information system that contains temporal measurements of multiple environmental, economic, and social variables. A first approach to the sustainability of the city is made with the entropy of information as a measure of the urban system's diversity. The fundamental contribution of the thesis, however, focuses on the application of Exploratory Multivariate Analysis (EMA) to the urban system: Principal Component Analysis (PCA), Multiple Factor Analysis (MFA), and Hierarchical Cluster Analysis (HCA). From this EMA approach, diversity is analyzed by identifying the similarity, or dissimilarity, between the different parts that make up the urban system. Some further techniques based on computer science and physics are proposed to evaluate the temporal transformation of the urban system, understood as a three-dimensional data cloud that gradually deforms. Differentiated characters and distinctive functions of the districts are identifiable in the EMA application to the case study. Moreover, the temporal dependency of the dataset reveals information about the districts' differentiation or homogenization trends. Finally, conclusions on the most relevant results are presented and some future lines of research are proposed.
This thesis analyzes the complexity of the urban system, described with multiple variables representing the environmental, economic and social characteristics of the city. The fundamental motivation of the study is to describe the diversity of the city and its relationship with a better response to disturbances and threats, and therefore with its sustainability. The thesis aims to contribute theoretical knowledge through the application of statistical and computational methodologies that are developed progressively across its chapters. The introduction presents the abstraction of the city as an urban system and reviews the concepts and measures of diversity within the theoretical frameworks of sustainability, urban ecology and complex systems theory. The urban system of the city of Barcelona is then introduced, made up of a set of districts and represented by an information system containing temporal measurements of multiple environmental, economic and social variables. A first approximation to the sustainability of the city is made using the entropy of information as a measure of the diversity of the urban system. The fundamental contribution of the thesis, however, centres on the application of Exploratory Multivariate Analysis (EMA) to the urban system: Principal Component Analysis (PCA), Multiple Factor Analysis (MFA) and Hierarchical Cluster Analysis (HCA). From this approach, diversity is analyzed by identifying the similarity, or dissimilarity, between the different parts that make up the urban system. Techniques from computer science and physics are also proposed to evaluate the temporal transformation of the urban system, understood as a three-dimensional data cloud that gradually deforms. In the analysis of the case study, differentiated characteristics and distinctive functions of the districts are identified. Moreover, the temporal dependence of the dataset reveals information about trends of differentiation or homogenization among the districts. Finally, conclusions on the most relevant results are presented and some future lines of research are proposed.
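
A minimal sketch of the entropy-of-information measure mentioned in this abstract, applied to a hypothetical composition of an urban variable over five districts; all numbers are made up for illustration.

```python
import numpy as np

def shannon_entropy(counts):
    """Shannon entropy H = -sum(p_i * log p_i) of a composition, a common
    diversity index: higher H means the quantity is spread more evenly
    across categories (here, hypothetical districts)."""
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return -(p * np.log(p)).sum()

print(shannon_entropy([40, 25, 15, 12, 8]))   # fairly even, higher entropy
print(shannon_entropy([96, 1, 1, 1, 1]))      # concentrated, low entropy
```
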
16

Hönel, Sebastian. "Temporal data analysis facilitating recognition of enhanced patterns." Thesis, Linnéuniversitetet, Institutionen för datavetenskap (DV), 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-51864.

Abstract:
Objectively assessing the source code quality of software requires a well-defined model. Due to the distinct nature of each and every project, the definition of such a model is specific to the underlying type of paradigms used. A model definer can pick metrics from standard norms to define measurements for qualitative assessment. Software projects develop over time and a wide variety of refactorings is applied to the code, which makes the process temporal. In this thesis, the temporal model was enhanced using methods known from financial markets and further evaluated using artificial neural networks, with the goal of improving prediction precision by learning from more detailed patterns. It was also investigated whether the combination of technical analysis and machine learning is viable and how to blend them. An in-depth selection of applicable instruments and algorithms was made, and extensive experiments were run to approximate answers. It was found that enhanced patterns are of value for further processing by neural networks. Technical analysis, however, was not able to improve the results, although it is assumed that it can for an appropriately sized problem set.
17

Gu, Zhuoer. "Mining previously unknown patterns in time series data." Thesis, University of Warwick, 2017. http://wrap.warwick.ac.uk/99207/.

Abstract:
The emerging importance of distributed computing systems raises the need to gain a better understanding of system performance. As a major indicator of system performance, analysing CPU host load helps evaluate system performance in many ways. Discovering similar patterns in CPU host load is very useful, since many applications rely on the patterns mined from it, such as pattern-based prediction, classification and relative rule mining. Essentially, mining patterns in CPU host load is a time series mining problem, and due to its complexity many traditional mining techniques for time series data are no longer suitable. Compared to mining known patterns in time series, mining unknown patterns is a much more challenging task. In this thesis, we investigate the major difficulties of the problem and develop techniques for mining unknown patterns by extending the traditional techniques for mining known patterns. We develop two different CPU host load discovery methods: the segment-based method, which works by extracting segment features, and the reduction-based method, which works by reducing the size of the raw data. The segment-based pattern discovery method maps the CPU host load segments to a 5-dimensional space and then applies the DBSCAN clustering method to discover similar segments. The reduction-based method reduces the dimensionality and numerosity of the CPU host load data to shrink the search space. A cascade method is proposed to support accurate pattern mining while maintaining efficiency. The investigations into the CPU host load data inspired us to further develop a pattern mining algorithm for general time series data. The method filters out unlikely starting positions for reoccurring patterns at an early stage and then iteratively locates all best-matching patterns. The results obtained by our method do not contain any meaningless patterns, which has long been a problematic issue for other approaches. Compared to state-of-the-art techniques, our method is more efficient and effective in most scenarios.
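
The segment-based method is concrete enough to sketch. The code below assumes an illustrative choice of five segment features (the thesis specifies a 5-dimensional segment space, but the exact features here are a guess: mean, standard deviation, minimum, maximum, slope) and uses scikit-learn's DBSCAN with invented parameters.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def segment_features(load, seg_len=64):
    """Map each fixed-length host-load segment to a 5-dimensional
    feature vector (mean, std, min, max, linear slope)."""
    feats = []
    for start in range(0, len(load) - seg_len + 1, seg_len):
        seg = load[start:start + seg_len]
        slope = np.polyfit(np.arange(seg_len), seg, 1)[0]
        feats.append([seg.mean(), seg.std(), seg.min(), seg.max(), slope])
    return np.array(feats)

rng = np.random.default_rng(1)
host_load = np.abs(np.sin(np.arange(4096) / 50) + 0.1 * rng.normal(size=4096))

F = segment_features(host_load)
labels = DBSCAN(eps=0.15, min_samples=3).fit_predict(F)
print(labels)  # segments sharing a label are candidate recurring patterns
```
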
18

Breyer, Nils. "Analysis of Travel Patterns from Cellular Network Data." Licentiate thesis, Linköpings universitet, Kommunikations- och transportsystem, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-157139.

Abstract:
Traffic planners face a big challenge: an increasing demand for mobility and, at the same time, a need to drastically reduce the environmental impacts of the transportation system. The transportation system therefore needs to become more efficient, which requires a good understanding of actual travel patterns. Data from travel surveys and traffic counts is expensive to collect and gives only limited insights into travel patterns. Cellular network data collected in the mobile operators' infrastructure is a promising data source which can provide new ways of obtaining information relevant for traffic analysis. It can provide large-scale observations of travel patterns independent of the travel mode used, and can be updated more easily than other data sources. In order to use cellular network data for traffic analysis, it needs to be filtered and processed in a way that preserves the privacy of individuals and takes the low resolution of the data in space and time into account. The search for appropriate algorithms is ongoing, and while substantial progress has been achieved, there is still large potential for better algorithms and for ways to evaluate them. The aim of this thesis is to analyse the potential and limitations of using cellular network data for traffic analysis. In the three papers included in the thesis, contributions are made to the trip extraction, travel demand and route inference steps of a data-driven traffic analysis processing chain. To analyse the performance of the proposed algorithms, a number of datasets from different cellular network operators are used. The results obtained using different algorithms are compared to each other as well as to other available data sources. A main finding presented in this thesis is that large-scale cellular network data can be used in particular to infer travel demand. In a study of data for the municipality of Norrköping, the results from cellular network data resemble the travel demand model currently used by the municipality, while adding details such as time profiles which are currently not available to traffic planners. However, all later traffic analysis results from cellular network data can differ to a large extent depending on the choice of algorithm used for the first steps of data filtering and trip extraction. Particular difficulties occur with the detection of short trips (less than 2 km), whose possible under-representation affects the subsequent traffic analysis.
19

Kabra, Amit. "Clustering of Driver Data based on Driving Patterns." Thesis, Blekinge Tekniska Högskola, Institutionen för datavetenskap, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-18466.

Abstract:
Data analysis methods are important for analyzing the ever-growing, enormous quantities of high-dimensional data. Cluster analysis separates or partitions data into disjoint groups such that data in the same group are similar while data between groups are dissimilar. The focus of this thesis is to identify natural groups or clusters of drivers from data based on driving style. In finding such groups of drivers, combinations of dimensionality reduction and clustering algorithms are evaluated. The dimensionality reduction algorithms used in this thesis are Principal Component Analysis (PCA) and t-distributed stochastic neighbour embedding (t-SNE). The clustering algorithms, K-means and Hierarchical Clustering, were selected after a literature review. The evaluated combinations are PCA with K-means, PCA with Hierarchical Clustering, t-SNE with K-means, and t-SNE with Hierarchical Clustering. The evaluation was done on Volvo Cars' driver dataset, based on the drivers' driving styles. The dataset is first normalized and a Markov chain of driving styles is calculated. This Markov chain dataset is of very high dimensionality, so the dimensionality reduction algorithms are applied to reduce it, and the reduced dataset is used as input to the selected clustering algorithms. The combinations of algorithms are evaluated using performance metrics such as the Silhouette Coefficient, the Calinski-Harabasz Index and the Davies-Bouldin Index. Based on the experiments and analysis, the combination of t-SNE and K-means is found to be the best of the evaluated combinations on all performance metrics, and is chosen to cluster the drivers based on their driving styles.
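
A compressed sketch of the winning pipeline (t-SNE followed by K-means, scored with the silhouette coefficient) on synthetic stand-in data. The feature construction and every parameter value below are illustrative assumptions, not Volvo Cars' actual setup.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(2)

# stand-in for per-driver Markov-chain features: each row is one driver's
# flattened transition matrix over a hypothetical 10 driving-style states
drivers = rng.dirichlet(np.ones(10), size=(200, 10)).reshape(200, 100)

# reduce dimensionality, then cluster, as in the evaluated t-SNE + K-means combo
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(drivers)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(embedding)

print("silhouette:", silhouette_score(embedding, labels))
```
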
20

Smirnov, Sergey, Matthias Weidlich, Jan Mendling, and Mathias Weske. "Action patterns in business process models." Universität Potsdam, 2009. http://opus.kobv.de/ubp/volltexte/2009/3358/.

Abstract:
Business process management is experiencing a large uptake by industry, and process models play an important role in the analysis and improvement of processes. As an increasing number of staff become involved in actual modeling practice, it is crucial to assure model quality and homogeneity, and to provide suitable aids for creating models. In this paper we consider the problem of offering recommendations to the user during the act of modeling. Our key contribution is a concept for defining and identifying so-called action patterns: chunks of actions that often appear together in business processes. In particular, we specify action patterns and demonstrate how they can be identified from existing process model repositories using association rule mining techniques. Action patterns can then be used to suggest additional actions for a process model. We put our approach to the test by applying it to the collection of process models from the SAP Reference Model.
The growing importance of business process management means that an increasing number of a company's employees are involved in creating process models. To ensure the quality and homogeneity of these process models despite this trend, appropriate modeling aids are indispensable. In this report we present an approach that supports process modeling through recommendations. These are based on so-called action patterns, which represent typical blocks of work. In addition to defining these action patterns, we present a method for identifying them: using techniques of association analysis, the patterns can be extracted automatically from a collection of process models. The applicability of our approach is illustrated by a case study based on the SAP Reference Model.
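
A toy version of the identification step: if each process model is reduced to its set of actions, simple co-occurrence counting yields association rules that can drive recommendations. This is a much-simplified reading of the paper's association rule mining, with invented example actions and thresholds.

```python
from itertools import combinations
from collections import Counter

# each process model reduced to the set of actions appearing in its tasks
models = [
    {"check invoice", "approve invoice", "pay invoice"},
    {"check invoice", "approve invoice", "archive invoice"},
    {"check order", "approve order", "pay invoice"},
    {"check invoice", "approve invoice", "pay invoice"},
]

min_support, min_conf = 2, 0.7

pair_counts = Counter()
item_counts = Counter()
for m in models:
    item_counts.update(m)
    pair_counts.update(combinations(sorted(m), 2))

# association rules a -> b over action co-occurrence; a model containing `a`
# but not `b` would receive `b` as a modeling recommendation
for (a, b), c in pair_counts.items():
    if c >= min_support:
        for head, tail in ((a, b), (b, a)):
            conf = c / item_counts[head]
            if conf >= min_conf:
                print(f"{head!r} -> {tail!r} (support={c}, confidence={conf:.2f})")
```
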
21

Hilton, Ross P. "Model-based data mining methods for identifying patterns in biomedical and health data." Diss., Georgia Institute of Technology, 2015. http://hdl.handle.net/1853/54387.

Abstract:
In this thesis we provide statistical and model-based data mining methods for pattern detection with applications to biomedical and healthcare data sets. In particular, we examine applications in costly acute or chronic disease management. In Chapter II, we consider nuclear magnetic resonance experiments in which we seek to locate and demix smooth, yet highly localized components in a noisy two-dimensional signal. By using wavelet-based methods we are able to separate components from the noisy background, as well as from other neighboring components. In Chapter III, we pilot methods for identifying profiles of patient utilization of the healthcare system from large, highly-sensitive, patient-level data. We combine model-based data mining methods with clustering analysis in order to extract longitudinal utilization profiles. We transform these profiles into simple visual displays that can inform policy decisions and quantify the potential cost savings of interventions that improve adherence to recommended care guidelines. In Chapter IV, we propose new methods integrating survival analysis models and clustering analysis to profile patient-level utilization behaviors while controlling for variations in the population’s demographic and healthcare characteristics and explaining variations in utilization due to different state-based Medicaid programs, as well as access and urbanicity measures.
22

Ding, Guoxiang. "Deriving Activity Patterns from Individual Travel Diary Data: A Spatiotemporal Data Mining Approach." The Ohio State University, 2009. http://rave.ohiolink.edu/etdc/view?acc_num=osu1236777859.

23

Yang, Di. "Mining and Managing Neighbor-Based Patterns in Data Streams." Digital WPI, 2012. https://digitalcommons.wpi.edu/etd-dissertations/16.

Abstract:
The current data-intensive world continuously produces huge volumes of live streaming data through various kinds of electronic devices, such as sensor networks, smart phones, GPS and RFID systems. To understand these data sources and thus better leverage them to serve human society, the demand for mining complex patterns from these high-speed data streams has significantly increased in a broad range of application domains, such as financial analysis, social network analysis, credit fraud detection, and moving object monitoring. In this dissertation, we present a framework to tackle the mining and management problem for the family of neighbor-based patterns in data streams, which covers a broad range of popular pattern types, including clusters, outliers, k-nearest neighbors and others. First, we study the problem of efficiently executing single neighbor-based pattern mining queries. We propose a general optimization principle for incremental pattern maintenance in data streams, called "Predicted Views". This principle exploits the "predictability" of sliding window semantics to eliminate both the computational and storage effort needed for handling the expiration of stream objects, which usually constitutes the most expensive operation in incremental pattern maintenance. Second, we analyze the problem of multiple query optimization for neighbor-based pattern mining queries, which aims to efficiently execute a heavy workload of such queries using shared execution strategies. We present an integrated pattern maintenance strategy to represent and incrementally maintain the patterns identified by queries with different query parameters within a single compact structure. Our solution realizes fully shared execution of multiple queries with arbitrary parameter settings. Third, we examine the problem of summarization and matching for neighbor-based patterns. To solve this problem, we first propose a summarization format for each pattern type, and then present computation strategies which efficiently summarize the neighbor-based patterns either during or after the online pattern extraction process. Lastly, to compare patterns extracted on different time horizons of the stream, we design an efficient matching mechanism to identify similar patterns in the stream history for any given pattern of interest to an analyst. Our comprehensive experimental studies, using both synthetic and real data from the domains of stock trades and moving object monitoring, demonstrate the superiority of our proposed strategies over alternative methods in both effectiveness and efficiency.
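
The "Predicted Views" principle can be illustrated with a small count-based sliding-window example: because expirations are deterministic, a point's neighbor count for every future window position can be settled once, at insertion time. The sketch below is our own much-simplified rendering of that idea, not the dissertation's data structure; the window size, radius and outlier threshold are invented.

```python
import numpy as np
from collections import deque

W, R = 100, 0.5   # count-based window size and neighbor radius (illustrative)

class PredictedViews:
    """Toy rendering of the idea: under sliding-window semantics expirations
    are fully predictable, so each point's neighbor count for every future
    window position is settled at insertion time, and expiring points need
    no recomputation at all."""

    def __init__(self):
        self.window = deque()   # entries: (arrival_time, point, per-view counts)
        self.now = -1

    def insert(self, p):
        self.now += 1
        if len(self.window) == W:
            self.window.popleft()            # deterministic expiration, no work
        counts = np.zeros(W, dtype=int)      # counts[i]: window ending at now+i
        for t_q, q, q_counts in self.window:
            if np.linalg.norm(p - q) <= R:
                shared = t_q + W - self.now  # number of windows both live in
                counts[:shared] += 1             # p's future views
                q_counts[self.now - t_q:] += 1   # q's remaining views
        self.window.append((self.now, p, counts))

    def outliers(self, k):
        """Points with fewer than k neighbors in the current window."""
        return [q for t_q, q, c in self.window if c[self.now - t_q] < k]

rng = np.random.default_rng(3)
pv = PredictedViews()
for x in rng.normal(size=(300, 2)):
    pv.insert(x)
print(len(pv.outliers(k=3)))
```
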
24

Lee, Ho Young. "Diagnosing spatial variation patterns in manufacturing processes." Diss., Texas A&M University, 2003. http://hdl.handle.net/1969/122.

25

Chambers, Connie. "Development of a physician profiling data mart." [Denver, Colo.] : Regis University, 2008. http://165.236.235.140/lib/CChambers2008partI.pdf.

26

Padhye, Manoday D. "Use of data mining for investigation of crime patterns." Morgantown, W. Va. : [West Virginia University Libraries], 2006. https://eidr.wvu.edu/etd/documentdata.eTD?documentid=4836.

Abstract:
Thesis (M.S.)--West Virginia University, 2006.
Title from document title page. Document formatted into pages; contains viii, 108 p. : ill. (some col.). Includes abstract. Includes bibliographical references (p. 80-81).
27

Tillander, Annika. "Classification models for high-dimensional data with sparsity patterns." Doctoral thesis, Stockholms universitet, Statistiska institutionen, 2013. http://urn.kb.se/resolve?urn=urn:nbn:se:su:diva-95664.

Abstract:
Today's high-throughput data collection devices, e.g. spectrometers and gene chips, create information in abundance. However, this poses serious statistical challenges, as the number of features is usually much larger than the number of observed units. Further, in this high-dimensional setting, only a small fraction of the features are likely to be informative for any specific project. In this thesis, three different approaches to two-class supervised classification in this high-dimensional, low-sample setting are considered. There are classifiers that are known to mitigate the issues of high dimensionality, e.g. distance-based classifiers such as Naive Bayes. However, these classifiers are often computationally intensive, though less so for discrete data; hence, continuous features are often transformed into discrete ones. In the first paper, a discretization algorithm suitable for high-dimensional data is suggested and compared with other discretization approaches, and the effect of discretization on misclassification probability in the high-dimensional setting is evaluated. Linear classifiers are more stable, which motivates adjusting the linear discriminant procedure to the high-dimensional setting. In the second paper, a two-stage estimation procedure for the inverse covariance matrix, applying Lasso-based regularization and Cuthill-McKee ordering, is suggested. The estimation gives a block-diagonal approximation of the covariance matrix, which in turn leads to an additive classifier. In the third paper, an asymptotic framework representing sparse and weak block models is derived, and a technique for block-wise feature selection is proposed. Probabilistic classifiers have the advantage of providing the probability of membership in each class for new observations, rather than simply assigning them to a class. In the fourth paper, a method is developed for constructing a Bayesian predictive classifier. Given the block-diagonal covariance matrix, the resulting Bayesian predictive and marginal classifier provides an efficient solution to the high-dimensional problem by splitting it into smaller tractable problems. The relevance and benefits of the proposed methods are illustrated using both simulated and real data.
With today's technology, for example spectrometers and gene chips, data are generated in large quantities. This abundance of data is not only an advantage but also causes certain problems: typically the number of variables (p) is considerably larger than the number of observations (n). This yields so-called high-dimensional data, which require new statistical methods, since the traditional methods were developed for the opposite situation (p<n). Moreover, usually very few of all these variables are relevant for any given project, and the strength of the information in the relevant variables is often weak. This type of data is therefore commonly described as sparse and weak, and identifying the relevant variables is often likened to finding a needle in a haystack. This thesis takes up three different ways of classifying this type of high-dimensional data, where classifying means using a dataset with both explanatory variables and an outcome variable to teach a function or algorithm to predict the outcome variable from the explanatory variables alone. The real data used in the thesis are microarrays: cell samples that show the activity of the genes in the cell. The goal of the classification is to use the variation in activity across thousands of genes (the explanatory variables) to determine whether a cell sample comes from cancer tissue or from normal tissue (the outcome variable). There are classification methods that can handle high-dimensional data, but these are often computationally intensive and therefore work better for discrete data. By transforming continuous variables into discrete ones (discretizing), computation time can be reduced and the classification made more efficient. The thesis studies how discretization affects the predictive accuracy of the classification, and a very efficient discretization method for high-dimensional data is proposed. Linear classification methods have the advantage of being stable. Their drawback is that they require an invertible covariance matrix, which the covariance matrix of high-dimensional data is not. The thesis proposes a way to estimate the inverse of sparse covariance matrices with a block-diagonal matrix. This matrix also has the advantage of leading to additive classification, which makes it possible to select whole blocks of relevant variables, and a method for identifying and selecting the blocks is also presented. There are also probabilistic classification methods, which have the advantage of giving the probability of belonging to each of the possible outcomes for an observation, unlike most other classification methods, which only predict the outcome. The thesis proposes such a Bayesian method, given the block-diagonal matrix and normally distributed outcome classes. The relevance and advantages of the proposed methods are demonstrated by applying them to simulated and real high-dimensional data.
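
The second paper's two-stage idea, a Lasso-regularized precision estimate followed by Cuthill-McKee reordering from which a block structure can be read off, can be sketched with standard tools. All parameter values and thresholds below are illustrative only, and this is not the paper's exact estimator.

```python
import numpy as np
from sklearn.datasets import make_sparse_spd_matrix
from sklearn.covariance import GraphicalLasso
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import reverse_cuthill_mckee

rng = np.random.default_rng(4)

# simulate data from a sparse precision (inverse covariance) matrix
prec = make_sparse_spd_matrix(30, alpha=0.95, random_state=0)
X = rng.multivariate_normal(np.zeros(30), np.linalg.inv(prec), size=60)

# stage 1: Lasso-regularized sparse estimate of the precision matrix
est = GraphicalLasso(alpha=0.1).fit(X)
support = np.abs(est.precision_) > 1e-4

# stage 2: Cuthill-McKee ordering pushes the nonzeros toward the diagonal,
# after which a block-diagonal approximation can be read off visually
order = reverse_cuthill_mckee(csr_matrix(support), symmetric_mode=True)
print(support[np.ix_(order, order)].astype(int))
```
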
28

Sun, Feng-Tso. "Nonparametric Discovery of Human Behavior Patterns from Multimodal Data." Research Showcase @ CMU, 2014. http://repository.cmu.edu/dissertations/359.

Abstract:
Recent advances in sensor technologies and the growing interest in context-aware applications, such as targeted advertising and location-based services, have led to a demand for understanding human behavior patterns from sensor data. People engage in routine behaviors. Automatic routine discovery goes beyond low-level activity recognition, such as sitting or standing, and analyzes human behaviors at a higher level (e.g., commuting to work). The goal of the research presented in this thesis is to automatically discover high-level semantic human routines from low-level sensor streams. One recent line of research mines human routines from sensor data using parametric topic models. The main shortcoming of parametric models is that they assume a fixed, pre-specified parameter regardless of the data. Choosing an appropriate parameter usually requires an inefficient trial-and-error model selection process, and it is even more difficult to find optimal parameter values in advance for personalized applications. The research presented in this thesis offers a novel nonparametric framework for human routine discovery that can infer high-level routines without knowing the number of latent low-level activities beforehand. More specifically, the framework automatically finds the size of the low-level feature vocabulary from sensor feature vectors in the vocabulary extraction phase. In the routine discovery phase, the framework then automatically selects the appropriate number of latent low-level activities and discovers latent routines. Moreover, we propose a new generative graphical model to incorporate multimodal sensor streams for the human activity discovery task. The hypotheses and approaches presented in this thesis are evaluated on public datasets in two routine domains: two daily-activity datasets and a transportation-mode dataset. Experimental results show that our nonparametric framework can automatically learn the appropriate model parameters from multimodal sensor data without any manual model selection procedure, and can outperform traditional parametric approaches for human routine discovery tasks.
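
For intuition about inferring the number of latent groups from the data rather than fixing it in advance, here is DP-means, a k-means-style algorithm derived from Dirichlet-process mixtures: a point farther than a threshold from every existing centroid opens a new cluster. It stands in for the nonparametric idea only and is not the thesis's graphical model; all data and parameters are invented.

```python
import numpy as np

def dp_means(X, lam, n_iter=20):
    """DP-means: the number of clusters is inferred from the data; a point
    farther than `lam` from every centroid spawns a new cluster."""
    centers = [X[0]]
    assign = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        for i, x in enumerate(X):
            d = [np.linalg.norm(x - c) for c in centers]
            j = int(np.argmin(d))
            if d[j] > lam:                    # open a new cluster at x
                centers.append(x)
                j = len(centers) - 1
            assign[i] = j
        # re-estimate centroids, dropping clusters that lost all members
        kept = [j for j in range(len(centers)) if np.any(assign == j)]
        centers = [X[assign == j].mean(axis=0) for j in kept]
        assign = np.array([kept.index(j) for j in assign])
    return np.array(centers), assign

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(m, 0.3, size=(50, 2)) for m in (0, 3, 6)])
centers, assign = dp_means(X, lam=1.5)
print(len(centers), "clusters found without specifying K")
```
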
29

Sithole, Jabulani S. "Longitudinal data models for evaluating change in prescribing patterns." Thesis, Keele University, 2000. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.327702.

30

Abnaof, Khalid [author]. "Finding Common Patterns In Heterogeneous Perturbation Data / Khalid Abnaof." Bonn : Universitäts- und Landesbibliothek Bonn, 2016. http://d-nb.info/1103024337/34.

31

Wilson, Saul Kriger. "Exploring urban activity patterns using electric smart meter data." Thesis, Massachusetts Institute of Technology, 2016. http://hdl.handle.net/1721.1/107028.

Abstract:
Thesis: S.M., Massachusetts Institute of Technology, Department of Urban Studies and Planning, 2016.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references (pages 109-111).
This thesis uses electricity consumption data from household and enterprise-level smart meters in County B, Country A, and Turin, Italy, to explore temporal and geographic variations in urban energy consumption and thus urban activity. A central question is whether electricity consumption patterns vary between different economic sectors, across space, and between different days of the week and times of year. The data show clearly that Country A activity patterns are roughly similar across all seven days of the week, whereas Italian electricity consumption declines markedly on weekends, particularly Sundays. In general, and particularly in Italy, this thesis shows strong seasonality in electricity consumption, with clearly identifiable seasons and high correlation in consumption patterns within each season. This thesis focuses on user-type variation in Country A, where, although certain patterns are more widespread in some sectors than others, there is significant overlap between pairs of sectors. Hence this thesis is able only to separate residential from industrial land use, and is unable to classify land use more finely to a meaningful degree of accuracy by analyzing electricity consumption. It is, however, possible to detect geographic variation: urban and industrial centers consume a higher percentage of their electricity on weekdays and during regular work hours than rural areas. In addition, the impact of various special occurrences on urban behavior is probed. This thesis measures the impact of various holidays on economic activity, using electricity consumption as a proxy. Large (industrial) consumers are generally much more sensitive to holidays than small (residential) consumers, except during the summer months in Italy. In general, consumption declines on a single holiday are highly correlated with consumption declines on other holidays. Furthermore, using observations at 15-minute intervals, I attempt to measure the short-term behavior shifts caused by the start and end of daylight saving time.
by Saul Kriger Wilson.
S.M.
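
Day-of-week and monthly profiles of the kind analyzed in this thesis reduce to simple groupings of a timestamped consumption series. A sketch on synthetic data follows; all values are invented and the meter series is a stand-in for the real smart-meter readings.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)

# hypothetical daily consumption (kWh) for one meter over a year
idx = pd.date_range("2015-01-01", "2015-12-31", freq="D")
kwh = pd.Series(rng.gamma(shape=2.0, scale=10.0, size=len(idx)), index=idx)

# weekday-vs-weekend profile (0=Mon ... 6=Sun): the Italian meters in the
# thesis show a marked Sunday drop that this grouping makes visible
print(kwh.groupby(kwh.index.dayofweek).mean())

# seasonality: mean daily consumption per calendar month
print(kwh.groupby(kwh.index.month).mean())
```
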
32

Alhusain, Sultan. "Intelligent data-driven reverse engineering of software design patterns." Thesis, De Montfort University, 2016. http://hdl.handle.net/2086/14341.

Abstract:
Recognising implemented instances of Design Patterns (DPs) in software design discloses and recovers a wealth of information about the intention of the original designers and the rationale of their design decisions. Because the documentation available for software systems, if any, is often poor and/or obsolete, recovering such information can be of great help and importance for maintenance tasks. Since DPs are abstractly and vaguely defined, a set of software classes with exactly the same relationships as expected for a DP instance may actually be only accidentally similar. On the other hand, a set of classes with relationships that differ to an extent from those typically expected can still be a true DP instance. The deciding factor is mainly whether or not the set of classes is actually intended to solve the design problem addressed by the DP, which makes the intent a fundamental and defining characteristic of DPs. Discerning the intent of potential instances requires building complex models that cannot be built using only the information known about DPs, so a paradigm shift in DP recognition towards fully machine-learning-based approaches is required. The problem is that no accurate and sufficiently large DP datasets exist, and it is difficult to construct one manually. There is also a lack of research on the feature set that should be used in DP recognition. The main aim of this thesis is to enable this paradigm shift by laying down an accurate, comprehensive and information-rich foundation of feature and data sets. To achieve this aim, a large set of features is developed to cover a wide range of design aspects, with a particular focus on design intent. This set serves as a global feature set from which different subsets can be objectively selected for different DPs. A new and feasible approach for DP dataset construction is designed and used to construct training datasets. The feature and data sets are then used experimentally to build and train DP classifiers. The results demonstrate the accuracy and utility of the sets introduced, and show that fully machine-learning-based approaches do provide appropriate and well-equipped solutions to the problem of DP recognition.
33

Patchala, Jagadeesh. "Data Mining Algorithms for Discovering Patterns in Text Collections." University of Cincinnati / OhioLINK, 2016. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1458299372.

34

Awodokun, Olugbenga. "Classification of Patterns in Streaming Data Using Clustering Signatures." University of Cincinnati / OhioLINK, 2017. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1504880155623189.

35

Seyfi, Majid. "Mining discriminative itemsets in data streams using different window models." Thesis, Queensland University of Technology, 2018. https://eprints.qut.edu.au/120850/1/Majid_Seyfi_Thesis.pdf.

Abstract:
Big data available in areas such as social networks, online marketing systems and stock markets is a good source for knowledge discovery. This thesis studies how discriminative itemsets can be discovered in data streams made of transactions from user profiles. Discriminative itemsets are frequent in one data stream, with much higher frequencies than the same itemsets have in the other data streams of the application domain. This research uses heuristics to manage large and complex datasets by decreasing the number of candidate patterns, giving researchers a better understanding of pattern mining in multiple data streams.
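
A batch toy version of the core definition: an itemset is discriminative when its relative frequency in the target stream sufficiently exceeds its relative frequency in another stream. The thesis works over streams with window models; this sketch, with invented transactions and thresholds, only illustrates the ratio test.

```python
from collections import Counter
from itertools import combinations

def itemset_supports(stream, max_len=2):
    """Count supports of all itemsets up to max_len in a list of transactions."""
    counts = Counter()
    for t in stream:
        for k in range(1, max_len + 1):
            counts.update(combinations(sorted(t), k))
    return counts

def discriminative_itemsets(target, background, min_ratio=3.0):
    """Itemsets whose relative frequency in the target stream is at least
    min_ratio times their relative frequency in the background stream."""
    s1, s2 = itemset_supports(target), itemset_supports(background)
    n1, n2 = len(target), len(background)
    out = {}
    for iset, c in s1.items():
        f1 = c / n1
        # floor the background frequency to avoid division by zero
        f2 = max(s2.get(iset, 0) / n2, 1.0 / n2)
        if f1 / f2 >= min_ratio:
            out[iset] = (f1, f2)
    return out

target = [{"a", "b"}, {"a", "b", "c"}, {"a"}, {"a", "b"}]
background = [{"b", "c"}, {"c"}, {"a", "c"}, {"b"}]
print(discriminative_itemsets(target, background))
```
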
36

Kerr, David. "Extraction of displacement data from Electronic Speckle Pattern Interferometric fringe patterns using digital image processing techniques." Thesis, Loughborough University, 1992. https://dspace.lboro.ac.uk/2134/28205.

Full text
Abstract:
The commercial exploitation of Electronic Speckle Pattern Interferometry (ESPI) is now gathering pace, with manufacturers marketing products in Europe and the USA. The power of the technique, in both research and industrial inspection roles, has brought pressure from the engineering community for an automated fringe analysis system.
APA, Harvard, Vancouver, ISO, and other styles
37

Guo, Zhenyu. "Visually Mining Interesting Patterns in Multivariate Datasets." Digital WPI, 2013. https://digitalcommons.wpi.edu/etd-dissertations/9.

Full text
Abstract:
Data mining for patterns and knowledge discovery in multivariate datasets are very important processes and tasks that help analysts understand a dataset, describe it, and predict unknown data values. However, conventional computer-supported data mining approaches often keep the user from getting involved in the mining process and from interacting during pattern discovery. Moreover, without a visual representation of the extracted knowledge, analysts can have difficulty explaining and understanding the patterns. Therefore, instead of directly applying automatic data mining techniques, it is necessary to develop techniques and visualization systems that allow users to interactively perform knowledge discovery, visually examine the patterns, adjust the parameters, and discover more interesting patterns based on their requirements. In this dissertation, I discuss several proposed visualization systems for assisting analysts in mining patterns and discovering knowledge in multivariate datasets, including their design, implementation, and evaluation. Three types of patterns are proposed and discussed: trends, clusters of subgroups, and local patterns. For trend discovery, the parameter space is visualized so that the user can visually examine the space and find where good linear patterns exist. For cluster discovery, the user can interactively set a query range on a target attribute and retrieve all the sub-regions that satisfy it; sub-regions that satisfy the same query and are near each other are grouped and aggregated into clusters. For local pattern discovery, the patterns for a local sub-region with a focal point and its neighbors are computationally extracted and visually represented; to discover interesting local neighbors, the extracted local patterns are integrated and shown visually to the analysts. Evaluations of the three visualization systems through formal user studies are also performed and discussed.
APA, Harvard, Vancouver, ISO, and other styles
38

You, Chang Hun. "Learning patterns in dynamic graphs with application to biological networks." Pullman, Wash. : Washington State University, 2009. http://www.dissertations.wsu.edu/Dissertations/Summer2009/c_you_072309.pdf.

Full text
Abstract:
Thesis (Ph. D.)--Washington State University, August 2009.
Title from PDF title page (viewed on Aug. 19, 2009). "School of Electrical Engineering and Computer Science." Includes bibliographical references (p. 114-117).
APA, Harvard, Vancouver, ISO, and other styles
39

Henning, Johan, and Nicolai Hellesnes. "Detecting Plagiarism Patterns in student code." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-255049.

Full text
Abstract:
Plagiarism has become a big concern in programming, both in education and in the software development industry. While a lot of effort has been put into detecting plagiarism, most of it has focused on plain text, and as plagiarism detection has improved, the methods for cheating have evolved as well. This thesis looks at plagiarism in entry-level programming courses to discover how widespread cheating is, and whether plagiarism detection algorithms in conjunction with metadata from GitHub can be used to detect cheating more effectively. More specifically, commit metadata from GitHub is used to see whether any interesting patterns can be found among students who plagiarize. The dataset used in this thesis consists of GitHub repositories for the entry-level programming courses DD1337 and DD1338 from 2015, covering 17 programming assignments with around 200 student submissions per assignment. The plagiarism detection tool used was MOSS; for each week, the 10 most suspicious submitted assignments were added to a suspicious list, which was later used to help find patterns among students who plagiarize. The results show that the suspicious students had on average 5.27 commits per assignment, while the non-suspicious students averaged 6.49 commits per assignment; that is, suspicious students on average made fewer commits than non-suspicious students. Future work includes testing with bigger datasets and testing other metadata to find further patterns in cases of plagiarism.
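A minimal Python sketch of the kind of metadata comparison reported here; the record fields and the way submissions are flagged (e.g. from a MOSS suspicion list) are assumptions for the example.

```python
# Sketch: average commits per assignment for flagged vs. unflagged
# submissions, mirroring the comparison described in the abstract.
from statistics import mean

submissions = [  # hypothetical records; "suspicious" would come from MOSS
    {"student": "s1", "commits": 4, "suspicious": True},
    {"student": "s2", "commits": 7, "suspicious": False},
    {"student": "s3", "commits": 6, "suspicious": True},
    {"student": "s4", "commits": 8, "suspicious": False},
]

suspicious = [s["commits"] for s in submissions if s["suspicious"]]
clean = [s["commits"] for s in submissions if not s["suspicious"]]
print(f"suspicious: {mean(suspicious):.2f} commits, others: {mean(clean):.2f}")
```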
APA, Harvard, Vancouver, ISO, and other styles
40

Wong, Ka-yan, and 王嘉欣. "Positioning patterns from multidimensional data and its applications in meteorology." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2008. http://hub.hku.hk/bib/B39558630.

Full text
APA, Harvard, Vancouver, ISO, and other styles
41

De, Luca Silvia. "Studies of CMS data access patterns with machine learning techniques." Bachelor's thesis, Alma Mater Studiorum - Università di Bologna, 2016. http://amslaurea.unibo.it/12021/.

Full text
Abstract:
This thesis presents a study of Grid data access patterns in distributed analysis in the CMS experiment at the LHC accelerator. The study ranges from a deep analysis of the historical patterns of access to the most relevant data types in CMS, to the exploitation of a supervised machine learning classification system to set up machinery able to predict future data access patterns, i.e. the so-called "popularity" of CMS datasets on the Grid, with a focus on specific data types. All CMS workflows run on the Worldwide LHC Computing Grid (WLCG) computing centres (Tiers), and the distributed analysis system in particular sustains hundreds of users and applications submitted every day. These applications (or "jobs") access different data types hosted on disk storage systems at a large set of WLCG Tiers. A detailed study of how this data is accessed, in terms of data types, hosting Tiers, and different time periods, gives valuable insight into storage occupancy over time and into different access patterns, and ultimately allows suggested actions to be extracted from this information (e.g. targeted disk clean-up and/or data replication). In this sense, applying machine learning techniques makes it possible to learn from past data and to gain predictive power over future CMS data access patterns. Chapter 1 provides an introduction to High Energy Physics at the LHC. Chapter 2 describes the CMS Computing Model, with special focus on the data management sector, and discusses the concept of dataset popularity. Chapter 3 describes the study of CMS data access patterns at different levels of depth. Chapter 4 gives a brief introduction to basic machine learning concepts, introduces their application in CMS, and discusses the results obtained with this approach in the context of this thesis.
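A hedged sketch of the feature-building step such a study implies: aggregating per-dataset access counts into a feature table and deriving a coarse popularity label. The column names and the popularity cut-off are illustrative assumptions, not the CMS pipeline.

```python
# Sketch: turn an access log into per-dataset, per-week features and a
# simple "popular" label for a downstream classifier.
import pandas as pd

log = pd.DataFrame({
    "dataset":  ["A", "A", "B", "A", "B", "C"],
    "week":     [1,   2,   1,   3,   2,   1],
    "accesses": [120, 80,  15,  200, 10,  3],
})

features = log.pivot_table(index="dataset", columns="week",
                           values="accesses", fill_value=0)
# Label a dataset "popular" if its total accesses exceed a chosen cut-off.
labels = (features.sum(axis=1) > 100).astype(int)
print(features)
print(labels)
```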
APA, Harvard, Vancouver, ISO, and other styles
42

Bifet, Albert. "Adaptive Learning and Mining for Data Streams and Frequent Patterns." Doctoral thesis, Universitat Politècnica de Catalunya, 2009. http://hdl.handle.net/10803/22738.

Full text
Abstract:
This thesis is devoted to the design of data mining algorithms for evolving data streams and for the extraction of closed frequent trees. First, we deal with each of these tasks separately, and then we deal with them together, developing classification methods for data streams containing items that are trees. In the data stream model, data arrive at high speed, and the algorithms that must process them have very strict constraints on space and time. In the first part of this thesis we propose and illustrate a framework for developing algorithms that can adaptively learn from data streams that change over time. Our methods are based on using change detectors and estimator modules in the right places. We propose ADWIN, an adaptive sliding window algorithm, for detecting change and keeping updated statistics from a data stream, and use it as a black box in place of counters or accumulators in algorithms not initially designed for drifting data. Since ADWIN has rigorous performance guarantees, this opens the possibility of extending such guarantees to learning and mining algorithms. We test our methodology with several learning methods such as Naïve Bayes, clustering, decision trees and ensemble methods. We build an experimental framework for data stream mining with concept drift, based on the MOA framework (similar to WEKA), so that researchers can easily run experimental data stream benchmarks. Trees are connected acyclic graphs and are studied as link-based structures in many cases. In the second part of this thesis, we present a rather formal study of trees from the point of view of closure-based mining. Moreover, we present efficient algorithms for subtree testing and for mining ordered and unordered frequent closed trees. We include an analysis of the extraction of association rules of full confidence out of the closed sets of trees, where we found an interesting phenomenon: rules whose propositional counterpart is nontrivial are nevertheless always implicitly true in trees, due to the peculiar combinatorics of the structures. Finally, using these results on evolving data stream mining and closed frequent tree mining, we present high-performance algorithms for mining closed unlabeled rooted trees adaptively from data streams that change over time. We introduce a general methodology for identifying closed patterns in a data stream, using Galois lattice theory. Using this methodology, we develop an incremental algorithm, a sliding-window based one, and finally one that mines closed trees adaptively from data streams. We use these methods to develop classification methods for tree data streams.
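To convey the idea behind ADWIN without reproducing it, here is a deliberately simplified Python sketch: two halves of a window are compared, and the older half is dropped when their means diverge. The real ADWIN derives its cut threshold from a rigorous statistical bound; the fixed threshold here is an assumption for illustration.

```python
# Greatly simplified, illustrative change detector in the spirit of ADWIN.
from collections import deque

class SimpleAdaptiveWindow:
    def __init__(self, max_size=200, cut=0.3):
        self.window = deque(maxlen=max_size)
        self.cut = cut

    def update(self, value):
        """Add a value; return True if a change in the mean is suspected."""
        self.window.append(value)
        n = len(self.window)
        if n < 10:
            return False
        half = n // 2
        left = list(self.window)[:half]
        right = list(self.window)[half:]
        if abs(sum(left) / len(left) - sum(right) / len(right)) > self.cut:
            # Drop the stale sub-window, as ADWIN shrinks on change.
            self.window = deque(right, maxlen=self.window.maxlen)
            return True
        return False

detector = SimpleAdaptiveWindow()
stream = [0.1] * 50 + [0.9] * 50   # abrupt drift halfway through
drifts = [i for i, x in enumerate(stream) if detector.update(x)]
print(drifts[:1])  # first detection point, shortly after index 50
```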
APA, Harvard, Vancouver, ISO, and other styles
43

Lodolini, Lucia. "The representation of symmetric patterns using the Quadtree data structure /." Online version of thesis, 1988. http://hdl.handle.net/1850/8402.

Full text
APA, Harvard, Vancouver, ISO, and other styles
44

Wong, Ka-yan. "Positioning patterns from multidimensional data and its applications in meteorology." Click to view the E-thesis via HKUTO, 2008. http://sunzi.lib.hku.hk/HKUTO/record/B39558630.

Full text
APA, Harvard, Vancouver, ISO, and other styles
45

Bifet, Figuerol Albert Carles. "Adaptive Learning and Mining for Data Streams and Frequent Patterns." Doctoral thesis, Universitat Politècnica de Catalunya, 2009. http://hdl.handle.net/10803/22738.

Full text
Abstract:
This thesis is devoted to the design of data mining algorithms for evolving data streams and for the extraction of closed frequent trees. First, we deal with each of these tasks separately, and then we deal with them together, developing classification methods for data streams containing items that are trees. In the data stream model, data arrive at high speed, and the algorithms that must process them have very strict constraints on space and time. In the first part of this thesis we propose and illustrate a framework for developing algorithms that can adaptively learn from data streams that change over time. Our methods are based on using change detectors and estimator modules in the right places. We propose ADWIN, an adaptive sliding window algorithm, for detecting change and keeping updated statistics from a data stream, and use it as a black box in place of counters or accumulators in algorithms not initially designed for drifting data. Since ADWIN has rigorous performance guarantees, this opens the possibility of extending such guarantees to learning and mining algorithms. We test our methodology with several learning methods such as Naïve Bayes, clustering, decision trees and ensemble methods. We build an experimental framework for data stream mining with concept drift, based on the MOA framework (similar to WEKA), so that researchers can easily run experimental data stream benchmarks. Trees are connected acyclic graphs and are studied as link-based structures in many cases. In the second part of this thesis, we present a rather formal study of trees from the point of view of closure-based mining. Moreover, we present efficient algorithms for subtree testing and for mining ordered and unordered frequent closed trees. We include an analysis of the extraction of association rules of full confidence out of the closed sets of trees, where we found an interesting phenomenon: rules whose propositional counterpart is nontrivial are nevertheless always implicitly true in trees, due to the peculiar combinatorics of the structures. Finally, using these results on evolving data stream mining and closed frequent tree mining, we present high-performance algorithms for mining closed unlabeled rooted trees adaptively from data streams that change over time. We introduce a general methodology for identifying closed patterns in a data stream, using Galois lattice theory. Using this methodology, we develop an incremental algorithm, a sliding-window based one, and finally one that mines closed trees adaptively from data streams. We use these methods to develop classification methods for tree data streams.
APA, Harvard, Vancouver, ISO, and other styles
46

Tatu, Andrada [Verfasser]. "Visual Analytics of Patterns in High-Dimensional Data / Andrada Tatu." Konstanz : Bibliothek der Universität Konstanz, 2013. http://d-nb.info/1041224680/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
47

Nassopoulos, Georges. "Deducing Basic Graph Patterns from Logs of Linked Data Providers." Thesis, Nantes, 2017. http://www.theses.fr/2017NANT4110/document.

Full text
Abstract:
Following the principles of Linked Data, data providers have published billions of facts as RDF data. Executing SPARQL queries over SPARQL endpoints or Triple Pattern Fragments (TPF) servers makes it easy to consume Linked Data. However, federated SPARQL query processing and TPF query processing decompose the initial query into subqueries. Consequently, the data providers only see subqueries, and the initial query is known only to end users. Knowing the executed SPARQL queries is fundamental for data providers: to ensure usage control, to optimize the costs of query answering, to justify return on investment, to improve the user experience, or to build business models from usage trends. In this thesis, we focus on analyzing the execution logs of TPF servers and SPARQL endpoints to extract the Basic Graph Patterns (BGPs) of executed SPARQL queries. The main challenge in extracting BGPs is the concurrent execution of SPARQL queries. We propose two algorithms: LIFT and FETA. LIFT extracts the BGPs of executed queries from a single TPF server log. FETA extracts the BGPs of federated queries from the logs of a set of SPARQL endpoints. In our experiments, we run LIFT and FETA on both synthetic and real logs, and find that under certain conditions they extract BGPs with good precision and recall.
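A hedged illustration of why log analysis can recover BGPs at all (this is not LIFT or FETA themselves): subqueries arriving close together from the same client are grouped as one candidate BGP. The log format, field names, and time gap are assumptions for the example.

```python
# Illustrative heuristic: group triple-pattern subqueries by client and by
# temporal proximity, assuming patterns arriving close together from one
# client likely belong to the same Basic Graph Pattern.
GAP = 2.0  # seconds; maximum gap inside one candidate BGP (assumed)

log = [  # (client ip, timestamp, triple pattern): made-up log entries
    ("10.0.0.1", 0.0, "?film dbo:director ?dir"),
    ("10.0.0.1", 0.4, "?dir dbo:birthPlace ?city"),
    ("10.0.0.2", 0.5, "?s rdf:type dbo:City"),
    ("10.0.0.1", 9.0, "?band dbo:genre ?g"),
]

candidates = {}
for ip, ts, pattern in sorted(log, key=lambda e: (e[0], e[1])):
    groups = candidates.setdefault(ip, [])
    if groups and ts - groups[-1]["last"] <= GAP:
        groups[-1]["patterns"].append(pattern)
        groups[-1]["last"] = ts
    else:
        groups.append({"last": ts, "patterns": [pattern]})

for ip, groups in candidates.items():
    for g in groups:
        print(ip, g["patterns"])
```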
APA, Harvard, Vancouver, ISO, and other styles
48

Oliveira, Alexandre (Alexandre S. ). "Finding patterns in timed data with spike timing dependent plasticity." Thesis, Massachusetts Institute of Technology, 2012. http://hdl.handle.net/1721.1/77031.

Full text
Abstract:
Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2012.
Cataloged from PDF version of thesis.
My research focuses on finding patterns in events: in sequences of data that happen over time. It takes inspiration from a neuroscience phenomenon believed to be deeply involved in learning. I propose a machine learning algorithm that finds patterns in timed data and is highly robust to noise and missing data. It can find both coincident relationships, where two events tend to happen together, and causal relationships, where one event appears to be caused by another. Analyzing stock price information with this algorithm reveals strong relationships between companies within the same industry. In particular, I worked with 12 stocks taken from the banking, information technology, healthcare, and oil industries. The relationships found are almost exclusively coincidental rather than causal.
by Alexandre Oliveira.
M.Eng.
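For readers unfamiliar with the neuroscience inspiration mentioned in the abstract, the classic spike-timing-dependent plasticity update can be sketched as follows; the constants are conventional illustrative values, not those used in the thesis.

```python
# Sketch of the classic STDP weight update: a connection is strengthened
# when the "pre" event precedes the "post" event, and weakened otherwise,
# with exponential decay in the time gap.
import math

A_PLUS, A_MINUS, TAU = 0.1, 0.12, 20.0  # illustrative constants (ms)

def stdp_delta(t_pre, t_post):
    dt = t_post - t_pre
    if dt >= 0:                                  # pre before post: potentiate
        return A_PLUS * math.exp(-dt / TAU)
    return -A_MINUS * math.exp(dt / TAU)         # post before pre: depress

print(stdp_delta(0.0, 5.0))    # positive: suggests a causal link
print(stdp_delta(5.0, 0.0))    # negative: order reversed
```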
APA, Harvard, Vancouver, ISO, and other styles
49

Vimieiro, Renato. "Mining disjunctive patterns in biomedical data sets." Thesis, 2012. http://hdl.handle.net/1959.13/936341.

Full text
Abstract:
Research Doctorate - Doctor of Philosophy (PhD)
Frequent itemset mining is one of the most studied problems in data mining. Since Agrawal et al. (1993) introduced the problem, several advances, both theoretical and practical, have been achieved. In spite of that, there are still many unresolved issues to be tackled before frequent pattern mining can be claimed a cornerstone approach in data mining (Han et al., 2007). Here, we investigate issues related to (1) the (un)suitability of frequent itemset mining algorithms for identifying patterns in biomedical data sets, and (2) the limited expressiveness of such patterns, since frequent itemsets are, in their vast majority, exclusively conjunctions. Our ultimate goal in this thesis is to improve methods for frequent pattern mining so that they provide alternative, insightful solutions for mining biomedical data sets. Specifically, we provide efficient tools for mining disjunctive patterns in biomedical data sets. We tackle the problem of mining disjunctive patterns on three different fronts: (1) disjunctive minimal generators; (2) disjunctive closed patterns; and (3) quasi-CNF emerging patterns. We then propose three algorithms, one for each task: TitanicOR, Disclosed, and QCEP. While the first two aim at more descriptive patterns, the third is more predictive. These algorithms attempt to cover the different sources of data sets arising in biomedical research. TitanicOR is most suitable for identifying patterns in data sets containing physiological, biochemical, or medical record information. Disclosed was designed to exploit the characteristics of microarray gene expression data sets, which usually contain many features but only few samples. Finally, QCEP is the only algorithm to consider data sets with class label information. We conducted experiments with both synthetic and real-world data sets to assess the performance of our algorithms. Our experiments show that our algorithms outperformed the state-of-the-art algorithms in each of these categories of patterns.
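The contrast between disjunctive and conjunctive patterns can be made concrete with a small sketch: the support of a disjunction counts transactions containing at least one of its items. This illustrates the notion only, not any of the three algorithms above.

```python
# Minimal sketch: support of a disjunctive pattern counts transactions
# containing at least one of the items, in contrast to conjunctive
# itemsets, which require all of them.
def disjunctive_support(items, transactions):
    items = set(items)
    return sum(bool(items & set(t)) for t in transactions)

db = [{"g1", "g2"}, {"g2", "g3"}, {"g4"}, {"g1", "g3"}]
print(disjunctive_support({"g1", "g2"}, db))  # 3 transactions match
```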
APA, Harvard, Vancouver, ISO, and other styles
50

Liu, Chunyang. "Summarizing data with representative patterns." Thesis, 2016. http://hdl.handle.net/10453/52923.

Full text
Abstract:
University of Technology Sydney. Faculty of Engineering and Information Technology.
The advance of technology has made data acquisition and storage unprecedentedly convenient. It has contributed to the rapid growth of not only the volume but also the veracity and variety of data in recent years, which poses new challenges to the data mining area. For example, uncertain data mining has emerged due to its capability to model the inherent veracity of data, and spatial data mining attracts much research attention with the spread of location-based services and wearable devices. As a fundamental topic of data mining, how to effectively and efficiently summarize data in this situation remains to be explored. This thesis studies the problem of summarizing data with representative patterns. The objective is to find a set of patterns that is much more concise but still contains rich information from the original data, and may provide valuable insights for further analysis. In the light of this idea, we formally formulate the problem and provide effective and efficient solutions in various scenarios.

We study the problem of summarizing probabilistic frequent patterns over uncertain data. Probabilistic frequent pattern mining over uncertain data has received much research attention due to the wide applicability of uncertain data. It suffers from the problem of generating an exponential number of result patterns, which hinders the analysis of patterns and calls for finding a small number of representative patterns to approximate all the others. We formally formulate the problem of probabilistic representative frequent pattern (P-RFP) mining, which aims to find the minimal set of patterns with sufficiently high probability to represent all other patterns. The bottleneck turns out to be checking whether one pattern can probabilistically represent another, which involves computing a joint probability of the supports of two patterns. We propose a novel dynamic programming-based approach to this problem and devise effective optimization strategies to improve computational efficiency. To enhance the practicability of P-RFP mining, we introduce a novel approximation of the joint probability with both theoretical and empirical support. Based on this approximation, we propose an Approximate P-RFP Mining (APM) algorithm, which effectively and efficiently compresses the probabilistic frequent pattern set. The error rate of APM is guaranteed to be very small when the database contains hundreds of transactions, which further affirms that APM is a practical solution for summarizing probabilistic frequent patterns.

We address the problem of directly summarizing an uncertain transaction database by formulating it as Minimal Probabilistic Tile Cover Mining, which aims to find a high-quality probabilistic tile set covering an uncertain database with minimal cost. We define the concepts of Probabilistic Price and Probabilistic Price Order to evaluate and compare the quality of tiles, and propose a framework to discover the minimal probabilistic tile cover. The bottleneck is checking whether one tile is better than another according to the Probabilistic Price Order, which involves the computation of a joint probability. We prove that it can be decomposed into independent terms and calculated efficiently. Several optimization techniques are devised to further improve performance.

Finally, we analyze the problem of summarizing co-locations mined from spatial databases. Co-location pattern mining finds patterns of spatial features whose instances tend to be located together in geographic space. However, the traditional framework of co-location pattern mining produces an exponential number of patterns because of the downward closure property, which makes it difficult for users to understand, assess or apply the huge number of resulting patterns. To address this issue, we study the problem of mining representative co-location patterns (RCPs). We first define a covering relationship between two co-location patterns and then formally formulate the problem of representative co-location pattern mining. To solve it, we propose the RCPFast algorithm, which adopts the post-mining framework, and the RCPMS algorithm, which pushes pattern summarization into the co-location mining process.
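As a toy illustration of summarization with representative patterns, the greedy sketch below repeatedly picks the pattern covering the most not-yet-covered patterns, with "covers" simplified to set containment; the thesis's probabilistic and spatial covering relations are substantially richer.

```python
# Toy greedy summarization: pick representatives until every mined pattern
# is covered, where "P covers Q" is simplified here to Q being a subset of P.
def greedy_representatives(patterns):
    patterns = [frozenset(p) for p in patterns]
    uncovered, reps = set(patterns), []
    while uncovered:
        best = max(patterns, key=lambda p: sum(q <= p for q in uncovered))
        reps.append(best)
        uncovered = {q for q in uncovered if not q <= best}
    return reps

mined = [{"a"}, {"b"}, {"a", "b"}, {"a", "b", "c"}, {"d"}]
print(greedy_representatives(mined))  # e.g. [{'a','b','c'}, {'d'}]
```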
APA, Harvard, Vancouver, ISO, and other styles