
Dissertations / Theses on the topic 'Very large data sets'



Consult the top 50 dissertations / theses for your research on the topic 'Very large data sets.'


You can also download the full text of each academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse dissertations / theses from a wide variety of disciplines and organise your bibliography correctly.

1

Quddus, Syed. "Accurate and efficient clustering algorithms for very large data sets." Thesis, Federation University Australia, 2017. http://researchonline.federation.edu.au/vital/access/HandleResolver/1959.17/162586.

Abstract:
The ability to mine and extract useful information from large data sets is a common concern for organizations. Data on the internet is growing rapidly, and the importance of developing new approaches to collect, store and mine large amounts of data is increasing accordingly. Clustering is one of the main tasks in data mining. Many clustering algorithms have been proposed, but there are still clustering problems that have not been addressed in depth, especially clustering in large data sets. Clustering in large data sets is important in many applications, including network intrusion detection, fraud detection in banking systems, air traffic control, web logs, sensor networks, social networks and bioinformatics. Data sets in these applications contain from hundreds of thousands to hundreds of millions of data points, and they may contain hundreds or thousands of attributes. Recent developments in computer hardware allow data sets with hundreds of thousands and even millions of data points to be stored in random access memory and read repeatedly, which makes it possible to apply existing clustering algorithms to such data sets. However, these algorithms require prohibitively large CPU time and fail to produce accurate solutions. Therefore, it is important to develop clustering algorithms that are accurate and can provide real-time clustering in such data sets, which is especially important in the big data era. The aim of this PhD study is to develop accurate, real-time algorithms for clustering very large data sets containing hundreds of thousands to millions of data points. Such algorithms are developed by combining heuristic algorithms with an incremental approach. They also involve a special procedure to identify dense areas in a data set and to compute a subset of the most informative representative data points in order to decrease the size of the data set. This study focuses on center-based clustering algorithms, whose success strongly depends on the choice of starting cluster centers. Different procedures are proposed to generate such centers, and special procedures are designed to identify the most promising starting cluster centers and to restrict their number. The new clustering algorithms are evaluated using large, publicly available data sets, and their results are compared with those obtained using several existing center-based clustering algorithms.
Doctor of Philosophy
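The incremental, density-guided choice of starting centers described in the abstract can be illustrated with a minimal sketch (a generic illustration of the idea, not the algorithms developed in the thesis; the density heuristic, the radius parameter and all function names are assumptions, and the quadratic-time distance computations are only suitable for small examples):

```python
import numpy as np

def lloyd(X, centers, iters=20):
    """Standard Lloyd (k-means) iterations from the given starting centers."""
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # assign each point to its nearest center
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        # recompute centers (keep the old center if a cluster becomes empty)
        for j in range(len(centers)):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

def incremental_kmeans(X, k, radius=1.0):
    """Add one cluster at a time, seeding each new center in a dense,
    poorly covered region of the data (illustrative heuristic only)."""
    centers = X.mean(axis=0, keepdims=True).copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(1, k):
        # crude density estimate: number of neighbours within `radius`
        d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        density = (d2 < radius ** 2).sum(axis=1)
        # distance of every point to its closest existing center
        to_centers = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).min(axis=1)
        # a promising new center is a dense point far from the current centers
        new_center = X[(density * to_centers).argmax()]
        centers = np.vstack([centers, new_center])
        centers, labels = lloyd(X, centers)
    return centers, labels

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(c, 0.3, size=(200, 2)) for c in ([0, 0], [3, 3], [0, 4])])
    centers, labels = incremental_kmeans(X, k=3)
    print(np.round(centers, 2))
```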
2

Harrington, Justin. "Extending linear grouping analysis and robust estimators for very large data sets." Thesis, University of British Columbia, 2008. http://hdl.handle.net/2429/845.

Abstract:
Cluster analysis is the study of how to partition data into homogeneous subsets so that the partitioned data share some common characteristic. In one to three dimensions, the human eye can distinguish well between clusters of data if clearly separated. However, when there are more than three dimensions and/or the data is not clearly separated, an algorithm is required which needs a metric of similarity that quantitatively measures the characteristic of interest. Linear Grouping Analysis (LGA, Van Aelst et al. 2006) is an algorithm for clustering data around hyperplanes, and is most appropriate when: 1) the variables are related/correlated, which results in clusters with an approximately linear structure; and 2) it is not natural to assume that one variable is a “response”, and the remainder the “explanatories”. LGA measures the compactness within each cluster via the sum of squared orthogonal distances to hyperplanes formed from the data. In this dissertation, we extend the scope of problems to which LGA can be applied. The first extension relates to the linearity requirement inherent within LGA, and proposes a new method of non-linearly transforming the data into a Feature Space, using the Kernel Trick, such that in this space the data might then form linear clusters. A possible side effect of this transformation is that the dimension of the transformed space is significantly larger than the number of observations in a given cluster, which causes problems with orthogonal regression. Therefore, we also introduce a new method for calculating the distance of an observation to a cluster when its covariance matrix is rank deficient. The second extension concerns the combinatorial problem for optimizing a LGA objective function, and adapts an existing algorithm, called BIRCH, for use in providing fast, approximate solutions, particularly for the case when data does not fit in memory. We also provide solutions based on BIRCH for two other challenging optimization problems in the field of robust statistics, and demonstrate, via simulation study as well as application on actual data sets, that the BIRCH solution compares favourably to the existing state-of-the-art alternatives, and in many cases finds a more optimal solution.
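The LGA compactness criterion mentioned above, the sum of squared orthogonal distances to hyperplanes fitted to each cluster, can be written down in a few lines (a generic total-least-squares illustration under assumed variable names, not code from the dissertation):

```python
import numpy as np

def squared_orthogonal_distances(X):
    """Squared orthogonal distances of the points in X to their best-fit hyperplane.

    The hyperplane passes through the centroid and its normal is the right
    singular vector with the smallest singular value (total least squares),
    so residuals are measured orthogonally rather than vertically.
    """
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    normal = vt[-1]                    # direction of least variance
    return (Xc @ normal) ** 2

def lga_compactness(X, labels):
    """LGA-style objective: total squared orthogonal distance within clusters."""
    return sum(squared_orthogonal_distances(X[labels == k]).sum()
               for k in np.unique(labels))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = rng.uniform(-1, 1, size=200)
    line1 = np.column_stack([t, 2 * t + 0.05 * rng.normal(size=200)])
    line2 = np.column_stack([t, -t + 1 + 0.05 * rng.normal(size=200)])
    X = np.vstack([line1, line2])
    labels = np.repeat([0, 1], 200)
    print(lga_compactness(X, labels))   # small: each cluster is nearly a line
```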
3

Sandhu, Jatinder Singh. "Combining exploratory data analysis and scientific visualization in the study of very large, space-time data sets /." The Ohio State University, 1990. http://rave.ohiolink.edu/etdc/view?acc_num=osu1487683401443166.

4

Geppert, Leo Nikolaus [Verfasser], Katja [Akademischer Betreuer] Ickstadt, and Andreas [Gutachter] Groll. "Bayesian and frequentist regression approaches for very large data sets / Leo Nikolaus Geppert ; Gutachter: Andreas Groll ; Betreuer: Katja Ickstadt." Dortmund : Universitätsbibliothek Dortmund, 2018. http://d-nb.info/1181427479/34.

5

McNeil, Vivienne Heather. "Assessment methodologies for very large, irregularly collected water quality data sets with special reference to the natural waters of Queensland." Thesis, Queensland University of Technology, 2001.

6

Cordeiro, Robson Leonardo Ferreira. "Data mining in large sets of complex data." Universidade de São Paulo, 2011. http://www.teses.usp.br/teses/disponiveis/55/55134/tde-22112011-083653/.

Abstract:
Due to the increasing amount and complexity of the data stored in enterprises' databases, the task of knowledge discovery is nowadays vital to support strategic decisions. However, the mining techniques used in the process usually have high computational costs that come from the need to explore several alternative solutions, in different combinations, to obtain the desired knowledge. The most common mining tasks include data classification, labeling and clustering, outlier detection and missing data prediction. Traditionally, the data are represented by numerical or categorical attributes in a table that describes one element in each tuple. Although the same tasks applied to traditional data are also necessary for more complex data, such as images, graphs, audio and long texts, the complexity and the computational costs associated with handling large amounts of these complex data increase considerably, making most of the existing techniques impractical. Therefore, special data mining techniques for this kind of data need to be developed. This Ph.D. work focuses on the development of new data mining techniques for large sets of complex data, especially for the task of clustering, tightly associated with other data mining tasks that are performed together. Specifically, this doctoral dissertation presents three novel, fast and scalable data mining algorithms well suited to analyze large sets of complex data: the method Halite for correlation clustering; the method BoW for clustering Terabyte-scale datasets; and the method QMAS for labeling and summarization. Our algorithms were evaluated on real, very large datasets with up to billions of complex elements, and they always presented highly accurate results, being at least one order of magnitude faster than the fastest related works in almost all cases. The real data used come from the following applications: automatic breast cancer diagnosis, satellite imagery analysis, and graph mining on a large web graph crawled by Yahoo! and also on the graph with all users and their connections from the Twitter social network. Such results indicate that our algorithms allow the development of real-time applications that, potentially, could not be developed without this Ph.D. work, such as software to aid the diagnosis process on the fly in a worldwide healthcare information system, or a system to look for deforestation within the Amazon Rainforest in real time.
7

Chaudhary, Amitabh. "Applied spatial data structures for large data sets." Available to US Hopkins community, 2002. http://wwwlib.umi.com/dissertations/dlnow/3068131.

8

Arvidsson, Johan. "Finding delta difference in large data sets." Thesis, Luleå tekniska universitet, Datavetenskap, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:ltu:diva-74943.

Abstract:
Finding out what differs between two versions of a file can be done with several different techniques and programs. These techniques and programs are often focused on finding differences in text files, in documents, or in class files for programming. An example is the popular git tool, which focuses on displaying the differences between versions of files in a project. A common way to find these differences is to use the Longest Common Subsequence algorithm, which finds the longest common subsequence of the two files as a measure of their similarity. By excluding everything the files have in common, the remaining text constitutes the differences between them. The Longest Common Subsequence is often used to find the differences in an acceptable time. When two lines in a file are compared to see whether they differ, hashing is used: the hash values of each corresponding line in both files are compared. Hashing a line gives the content of that line a practically unique value, so if as little as one character on a line differs between the versions, the hash values of those lines will differ as well. These techniques are very useful when comparing two versions of a file with text content. With data from a database some, but not all, of these techniques remain useful. A big difference between data in a database and text in a file is that content is not just added and deleted but also updated. This thesis studies how to make use of these techniques to find differences between large data sets in a reasonable time, rather than between documents and files. Three different methods are studied in theory, and their time and space complexities are given. Finally, one of these methods is studied further through implementation and testing. Only one of the three is implemented because of time constraints; the chosen method is easy to maintain and implement, and offers good execution time.
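To make the line-hashing and longest-common-subsequence approach described above concrete, here is a minimal sketch (an illustration of the general technique, not the implementation studied in the thesis):

```python
from hashlib import sha1

def line_hashes(lines):
    # hash every line so that comparisons work on short, fixed-size values
    return [sha1(line.encode("utf-8")).hexdigest() for line in lines]

def lcs_table(a, b):
    # dp[i][j] = length of the longest common subsequence of a[i:] and b[j:]
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            dp[i][j] = dp[i + 1][j + 1] + 1 if a[i] == b[j] else max(dp[i + 1][j], dp[i][j + 1])
    return dp

def diff(old_lines, new_lines):
    """Yield ('-', line) for deleted lines and ('+', line) for inserted lines."""
    a, b = line_hashes(old_lines), line_hashes(new_lines)
    dp = lcs_table(a, b)
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            i, j = i + 1, j + 1                 # unchanged line, part of the LCS
        elif dp[i + 1][j] >= dp[i][j + 1]:
            yield ("-", old_lines[i]); i += 1   # line only in the old version
        else:
            yield ("+", new_lines[j]); j += 1   # line only in the new version
    for line in old_lines[i:]:
        yield ("-", line)
    for line in new_lines[j:]:
        yield ("+", line)

if __name__ == "__main__":
    v1 = ["a", "b", "c", "d"]
    v2 = ["a", "x", "c", "d", "e"]
    for op, line in diff(v1, v2):
        print(op, line)          # prints: - b, + x, + e
```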
9

Tricker, Edward A. "Detecting anomalous aggregations of data points in large data sets." Thesis, Imperial College London, 2009. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.512050.

10

Romig, Phillip R. "Parallel task processing of very large datasets." [Lincoln, Neb. : University of Nebraska-Lincoln], 1999. http://international.unl.edu/Private/1999/romigab.pdf.

11

Bate, Steven Mark. "Generalized linear models for large dependent data sets." Thesis, University College London (University of London), 2004. http://discovery.ucl.ac.uk/1446542/.

Abstract:
Generalized linear models (GLMs) were originally used to build regression models for independent responses. In recent years, however, effort has focused on extending the original GLM theory to enable it to be applied to data which exhibit dependence in the responses. This thesis focuses on some specific extensions of the GLM theory for dependent responses. A new hypothesis testing technique is proposed for the application of GLMs to cluster dependent data. The test is based on an adjustment to the 'independence' likelihood ratio test, which allows for the within cluster dependence. The performance of the new test, in comparison to established techniques, is explored. The application of the generalized estimating equations (GEE) methodology to model space-time data is also investigated. The approach allows for the temporal dependence via the covariates and models the spatial dependence using techniques from geostatistics. The application area of climatology has been used to motivate much of the work undertaken. A key attribute of climate data sets, in addition to exhibiting dependence both spatially and temporally, is that they are typically large in size, often running into millions of observations. Therefore, throughout the thesis, particular attention has focused on computational issues, to enable analysis to be undertaken in a feasible time frame. For example, we investigate the use of the GEE one-step estimator in situations where the application of the full algorithm is impractical. The final chapter of this thesis presents a climate case study. This involves wind speeds over northwestern Europe, which we analyse using the techniques developed.
12

Hennessey, Anthony. "Statistical shape analysis of large molecular data sets." Thesis, University of Nottingham, 2018. http://eprints.nottingham.ac.uk/52088/.

Abstract:
Protein classification databases are widely used in the prediction of protein structure and function, and amongst these databases the manually-curated Structural Classification of Proteins database (SCOP) is considered to be a gold standard. In SCOP, functional relationships are described by hyperfamily and superfamily categories and structural relationships are described by family, species and protein categories. We present a method to calculate a difference measure between pairs of proteins that can be used to reproduce SCOP2 structural relationship classifications, and that can also be used to reproduce a subset of functional relationship classifications at the superfamily level. Calculating the difference measure requires first finding the best correspondence between atoms in two protein configurations. The problem of finding the best correspondence is known as the unlabelled, partial matching problem. We consider the unlabelled, partial matching problem through a detailed analysis of the approach presented in Green and Mardia (2006). Using this analysis, and applying domain-specific constraints, we develop a new algorithm called GProtA for protein structure alignment. The proposed difference measure is constructed from the root mean squared deviation of the aligned protein structures and a binary similarity measure, where the binary similarity measure takes into account the proportions of atoms matching from each configuration. The GProtA algorithm and difference measure are applied to protein structure data taken from the Protein Data Bank. The difference measure is shown to correctly classify 62 of a set of 72 proteins into the correct SCOP family categories when clustered. Of the remaining 9 proteins, 2 are assigned incorrectly and 7 are considered indeterminate. In addition, a method for deriving characteristic signatures for categories is proposed. The signatures offer a mechanism by which a single comparison can be made to judge similarity to a particular category. Comparison using characteristic signatures is shown to correctly delineate proteins at the family level, including the identification of both families for a subset of proteins described by two family level categories.
13

Dementiev, Roman. "Algorithm engineering for large data sets: hardware, software, algorithms." Saarbrücken VDM, Müller, 2006. http://d-nb.info/986494429/04.

14

Dementiev, Roman. "Algorithm engineering for large data sets : hardware, software, algorithms /." Saarbrücken : VDM-Verl. Dr. Müller, 2007. http://deposit.d-nb.de/cgi-bin/dokserv?id=3029033&prov=M&dok_var=1&dok_ext=htm.

15

Nair, Sumitra Sarada. "Function estimation using kernel methods for large data sets." Thesis, University of Sheffield, 2007. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.444581.

16

Rauschenberg, David Edward. "Computer-graphical exploration of large data sets from teletraffic." Diss., The University of Arizona, 1994. http://hdl.handle.net/10150/186645.

Abstract:
The availability of large data sets and powerful computing resources has made data analysis an increasingly viable approach to understanding random processes. Of particular interest are exploratory techniques which provide insight into the local path behavior of highly positively correlated processes. We focus on actual and simulated teletraffic data in the form of time series. Our foremost objective is to develop a methodology of identifying and classifying shape features which are essentially unrecognizable with standard statistical descriptors. Using basic aspects of human vision as a heuristic guide, we have developed an algorithm which "sketches" data sequences. Our approach to summarizing path behavior is based on exploiting the simple structure of a sketch. We have developed a procedure whereby all the "shapes" of a sketch are summarized in a visually comprehensible manner. We do so by placing the shapes in classes, then displaying, for each class, both a representative shape and the number of shapes in the class. These "shape histograms" can provide substantial insight into the behavior of sample paths. We have also used sketches to help model data sequences. The idea here is that a model based on a sketch of a data sequence may provide a better fit under some circumstances than a model based directly on the data. By considering various sketches, one could, for example, develop a Markov chain model whose autocorrelation function approximates that of the original data. We have generalized this use of sketches so that a data sequence can be modeled as the superposition of several sketches, each capturing a different level of detail. Because the concept of path shape is highly visual, it is important that our techniques exploit the strengths of and accommodate for the weaknesses of human vision. We have addressed this by using computer graphics in a variety of novel ways.
17

Farran, Bassam. "One-pass algorithms for large and shifting data sets." Thesis, University of Southampton, 2010. https://eprints.soton.ac.uk/159173/.

Abstract:
For many problem domains, practitioners are faced with the problem of ever-increasing amounts of data. Examples include the UniProt database of proteins which now contains ~6 million sequences, and the KDD ’99 data which consists of ~5 million points. At these scales, the state-of-the-art machine learning techniques are not applicable since the multiple passes they require through the data are prohibitively expensive, and a need for different approaches arises. Another issue arising in real-world tasks, which is only recently becoming a topic of interest in the machine learning community, is distribution shift, which occurs naturally in many problem domains such as intrusion detection and EEG signal mapping in the Brain-Computer Interface domain. This means that the i.i.d. assumption between the training and test data does not hold, causing classifiers to perform poorly on the unseen test set. We first present a novel, hierarchical, one-pass clustering technique that is capable of handling very large data. Our experiments show that the quality of the clusters generated by our method does not degrade, while making vast computational savings compared to algorithms that require multiple passes through the data. We then propose Voted Spheres, a novel, non-linear, one-pass, multi-class classification technique capable of handling millions of points in minutes. Our empirical study shows that it achieves state-of-the-art performance on real world data sets, in a fraction of the time required by other methods. We then adapt the VS to deal with covariate shift between the training and test phases using two different techniques: an importance weighting scheme and kernel mean matching. Our results on a toy problem and the real-world KDD ’99 data show an increase in performance to our VS framework. Our final contribution involves applying the one-pass VS algorithm, along with the adapted counterpart (for covariate shift), to the Brain-Computer Interface domain, in which linear batch algorithms are generally used. Our VS-based methods outperform the SVM, and perform very competitively with the submissions of a recent BCI competition, which further shows the robustness of our proposed techniques to different problem domains.
18

Mangalvedkar, Pallavi Ramachandra. "GPU-ASSISTED RENDERING OF LARGE TREE-SHAPED DATA SETS." Wright State University / OhioLINK, 2007. http://rave.ohiolink.edu/etdc/view?acc_num=wright1195491112.

19

Toulis, Panagiotis. "Implicit methods for iterative estimation with large data sets." Thesis, Harvard University, 2016. http://nrs.harvard.edu/urn-3:HUL.InstRepos:33493434.

Abstract:
The ideal estimation method needs to fulfill three requirements: (i) efficient computation, (ii) statistical efficiency, and (iii) numerical stability. The classical stochastic approximation of (Robbins, 1951) is an iterative estimation method, where the current iterate (parameter estimate) is updated according to some discrepancy between what is observed and what is expected assuming the current iterate has the true parameter value. Classical stochastic approximation undoubtedly meets the computation requirement, which explains its widespread popularity, for example, in modern applications of machine learning with large data sets, but cannot effectively combine it with efficiency and stability. Surprisingly, the stability issue can be improved substantially, if the aforementioned discrepancy is computed not using the current iterate, but using the conditional expectation of the next iterate given the current one. The computational overhead of the resulting implicit update is minimal for many statistical models, whereas statistical efficiency can be achieved through simple averaging of the iterates, as in classical stochastic approximation (Ruppert, 1988). Thus, implicit stochastic approximation is fast and principled, fulfills requirements (i-iii) for a number of popular statistical models including generalized linear models, M-estimation, and proportional hazards, and it is poised to become the workhorse of estimation with large data sets in statistical practice.
Statistics
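For a concrete sense of the implicit update, the least-squares case is convenient because the implicit equation can be solved in closed form. The sketch below is only an illustration of the general idea, not the author's code; the step-size schedule and the simple iterate averaging are assumptions:

```python
import numpy as np

def implicit_sgd_least_squares(X, y, lr0=1.0, alpha=1.0):
    """Implicit stochastic approximation for linear least squares.

    Explicit update:  theta_n = theta_{n-1} + a_n * (y_n - x_n' theta_{n-1}) * x_n
    Implicit update:  theta_n = theta_{n-1} + a_n * (y_n - x_n' theta_n)     * x_n
    For this model the implicit equation has a closed form: it is the explicit
    update with the step shrunk by 1 / (1 + a_n * ||x_n||^2), which is what
    makes the procedure numerically stable for a wide range of step sizes.
    """
    n, p = X.shape
    theta = np.zeros(p)
    theta_bar = np.zeros(p)              # running average of the iterates
    for t in range(n):
        x_t, y_t = X[t], y[t]
        a_t = lr0 / (alpha + t)          # assumed decaying step-size schedule
        resid = y_t - x_t @ theta
        theta = theta + (a_t / (1.0 + a_t * (x_t @ x_t))) * resid * x_t
        theta_bar += (theta - theta_bar) / (t + 1)
    return theta_bar

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    true_theta = np.array([2.0, -1.0, 0.5])
    X = rng.normal(size=(50_000, 3))
    y = X @ true_theta + 0.1 * rng.normal(size=50_000)
    print(implicit_sgd_least_squares(X, y))   # close to [2, -1, 0.5]
```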
20

Kışınbay, Turgut. "Predictive ability or data snooping? : essays on forecasting with large data sets." Thesis, McGill University, 2004. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=85018.

Abstract:
This thesis examines the predictive ability of models for forecasting inflation and financial market volatility. Emphasis is put on the evaluation of forecasts and the use of large data sets. A variety of models is used to forecast inflation, including diffusion indices, artificial neural networks, and traditional linear regressions. Financial market volatility is forecast using various GARCH-type and high-frequency based models. High-frequency data are also used to obtain ex-post estimates of volatility, which are then used to evaluate forecasts. All forecasts are evaluated using recently proposed techniques that can account for data-snooping bias, nested models, and nonlinear models.
21

Simonet, Anthony. "Active Data - Enabling Smart Data Life Cycle Management for Large Distributed Scientific Data Sets." Thesis, Lyon, École normale supérieure, 2015. http://www.theses.fr/2015ENSL1004/document.

Abstract:
In all domains, scientific progress relies more and more on our ability to exploit ever-growing volumes of data. However, as data volumes increase, their management becomes more difficult. A key point is to deal with the complexity of data life cycle management, i.e. all the operations that happen to data between their creation and their deletion: transfer, archiving, replication, disposal, etc. These formerly straightforward operations become intractable when data volume grows dramatically, because of the heterogeneity of data management software on the one hand, and the complexity of the infrastructures involved on the other. In this thesis, we introduce Active Data, a meta-model, an implementation and a programming model that allow the life cycle of data distributed in an assemblage of heterogeneous systems and infrastructures to be represented formally and graphically, naturally exposing replication, distribution and different data identifiers. Once connected to existing applications, Active Data exposes the progress of data through their life cycle at runtime to users and programs, while keeping track of data as they pass from one system to another. The Active Data programming model allows code to be executed at each step of the data life cycle. Programs developed with Active Data have access at any time to the complete state of the data in every system and infrastructure they are distributed to. We present micro-benchmarks and usage scenarios that demonstrate the expressivity of the programming model and the quality of the implementation. Finally, we describe the implementation of a Data Surveillance framework based on Active Data for the Advanced Photon Source experiment that allows scientists to monitor the progress of their data, automate most manual tasks, get relevant notifications from a huge number of events, and detect and recover from errors without human intervention. This work provides interesting perspectives in data provenance and open data in particular, while facilitating collaboration between scientists from different communities.
22

Schwartz, Jeremy (Jeremy D. ). "A modified experts algorithm : using correlation to speed convergence with very large sets of experts." Thesis, Massachusetts Institute of Technology, 2006. http://hdl.handle.net/1721.1/35642.

Abstract:
Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Mechanical Engineering, 2006.
Includes bibliographical references (p. 121).
This paper discusses a modification to the Exploration-Exploitation Experts algorithm - (EEE). The EEE is a generalization of the standard experts algorithm which is designed for use in reactive environments. In these problems, the algorithm is only able to learn about the expert that it follows at any given stage. As a result, the convergence rate of the algorithm is heavily dependent on the number of experts which it must consider. We adapt this algorithm for use with a very large set of experts. We do this by capitalizing on the fact that when a set of experts is large, many experts in the set tend to display similarities in behavior. We quantify this similarity with a concept called correlation, and use this correlation information to improve the convergence rate of the algorithm with respect to the number of experts. Experimental results show that given the proper conditions, the convergence rate of the modified algorithm can be independent of the size of the expert space.
by Jeremy Schwartz.
S.M.
23

Ljung, Patric. "Efficient Methods for Direct Volume Rendering of Large Data Sets." Doctoral thesis, Norrköping : Department of Science and Technology, Linköping University, 2006. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-7232.

24

Lam, Heidi Lap Mun. "Visual exploratory analysis of large data sets : evaluation and application." Thesis, University of British Columbia, 2008. http://hdl.handle.net/2429/839.

Abstract:
Large data sets are difficult to analyze. Visualization has been proposed to assist exploratory data analysis (EDA) as our visual systems can process signals in parallel to quickly detect patterns. Nonetheless, designing an effective visual analytic tool remains a challenge. This challenge is partly due to our incomplete understanding of how common visualization techniques are used by human operators during analyses, either in laboratory settings or in the workplace. This thesis aims to further understand how visualizations can be used to support EDA. More specifically, we studied techniques that display multiple levels of visual information resolutions (VIRs) for analyses using a range of methods. The first study is a summary synthesis conducted to obtain a snapshot of knowledge in multiple-VIR use and to identify research questions for the thesis: (1) low-VIR use and creation; (2) spatial arrangements of VIRs. The next two studies are laboratory studies to investigate the visual memory cost of image transformations frequently used to create low-VIR displays and overview use with single-level data displayed in multiple-VIR interfaces. For a more well-rounded evaluation, we needed to study these techniques in ecologically-valid settings. We therefore selected the application domain of web session log analysis and applied our knowledge from our first three evaluations to build a tool called Session Viewer. Taking the multiple coordinated view and overview + detail approaches, Session Viewer displays multiple levels of web session log data and multiple views of session populations to facilitate data analysis from the high-level statistical to the low-level detailed session analysis approaches. Our fourth and last study for this thesis is a field evaluation conducted at Google Inc. with seven session analysts using Session Viewer to analyze their own data with their own tasks. Study observations suggested that displaying web session logs at multiple levels using the overview + detail technique helped bridge between high-level statistical and low-level detailed session analyses, and the simultaneous display of multiple session populations at all data levels using multiple views allowed quick comparisons between session populations. We also identified design and deployment considerations to meet the needs of diverse data sources and analysis styles.
25

Uminsky, David. "Generalized Spectral Analysis for Large Sets of Approval Voting Data." Scholarship @ Claremont, 2003. https://scholarship.claremont.edu/hmc_theses/157.

Abstract:
Generalized Spectral analysis of approval voting data uses representation theory and the symmetry of the data to project the approval voting data into orthogonal and interpretable subspaces. Unfortunately, as the number of voters grows, the data space becomes prohibitively large to compute the decomposition of the data vector. To attack these large data sets we develop a method to partition the data set into equivalence classes, in order to drastically reduce the size of the space while retaining the necessary characteristics of the data set. We also make progress on the needed statistical tools to explain the results of the spectral analysis. The standard spectral analysis will be demonstrated, and our partitioning technique is applied to U.S. Senate roll call data.
26

Yagoubi, Djamel edine. "Indexing and analysis of very large masses of time series." Thesis, Montpellier, 2018. http://www.theses.fr/2018MONTS084/document.

Abstract:
Time series arise in many application domains such as finance, agronomy, health, earth monitoring and weather forecasting, to name a few. Because of advances in sensor technology, such applications may produce millions to trillions of time series per day, requiring fast analytical and summarization techniques. The processing of these massive volumes of data has opened up new challenges in time series data mining. In particular, indexing techniques have shown poor performance when processing large databases. In this thesis, we focus on the problem of parallel similarity search in such massive sets of time series. For this, we first need to develop efficient search operators that can query a very large distributed database of time series with low response times. The search operator can be implemented by using an index constructed before executing the queries. The objective of indices is to improve the speed of data retrieval operations. In databases, the index is a data structure which, based on search criteria, efficiently locates data entries satisfying the requirements. Indexes often make the response time of the lookup operation sublinear in the database size. After reviewing the state of the art, we propose three novel approaches for parallel indexing and querying of large time series datasets. First, we propose DPiSAX, a novel and efficient parallel solution that includes a parallel index construction algorithm that takes advantage of distributed environments to build iSAX-based indices over vast volumes of time series efficiently. Our solution also involves a parallel query processing algorithm that, given a similarity query, exploits the available processors of the distributed system to efficiently answer the query in parallel by using the constructed parallel index. Second, we propose RadiusSketch, a random projection-based approach that scales nearly linearly in parallel environments and provides high-quality answers. RadiusSketch includes a parallel index construction algorithm that takes advantage of distributed environments to efficiently build sketch-based indices over very large databases of time series, and then query the databases in parallel. Third, we propose ParCorr, an efficient parallel solution for detecting similar time series across distributed data streams. ParCorr uses the sketch principle for representing the time series. Our solution includes a parallel approach for incremental computation of the sketches in sliding windows and a partitioning approach that projects sketch vectors of time series into subvectors and builds a distributed grid structure. Our solutions have been evaluated using real and synthetic datasets, and the results confirm their high efficiency compared to the state of the art.
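The sketch principle mentioned in the abstract can be illustrated with a small random-projection example, where the inner product of two sketches approximates the correlation of the underlying (normalized) windows. This is a generic illustration only, not the DPiSAX, RadiusSketch or ParCorr code; the sketch size and normalization choices are assumptions:

```python
import numpy as np

def znormalize(window):
    # zero mean, unit norm, so that inner products equal Pearson correlation
    w = np.asarray(window, dtype=float)
    w = w - w.mean()
    norm = np.linalg.norm(w)
    return w / norm if norm > 0 else w

def make_sketcher(window_len, sketch_size=128, seed=0):
    # one shared +/-1 random projection matrix for every series;
    # larger sketches give tighter correlation estimates
    rng = np.random.default_rng(seed)
    R = rng.choice([-1.0, 1.0], size=(sketch_size, window_len))
    return lambda window: (R @ znormalize(window)) / np.sqrt(sketch_size)

if __name__ == "__main__":
    rng = np.random.default_rng(42)
    t = np.linspace(0, 10, 256)
    s1 = np.sin(t) + 0.1 * rng.normal(size=t.size)
    s2 = np.sin(t + 0.1) + 0.1 * rng.normal(size=t.size)   # similar to s1
    s3 = rng.normal(size=t.size)                           # unrelated to s1

    sketch = make_sketcher(window_len=t.size)
    a, b, c = sketch(s1), sketch(s2), sketch(s3)

    print("exact corr(s1, s2):", np.corrcoef(s1, s2)[0, 1])
    print("sketch estimate:   ", float(a @ b))
    print("exact corr(s1, s3):", np.corrcoef(s1, s3)[0, 1])
    print("sketch estimate:   ", float(a @ c))
```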
27

Lundell, Fredrik. "Out-of-Core Multi-Resolution Volume Rendering of Large Data Sets." Thesis, Linköpings universitet, Medie- och Informationsteknik, 2011. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-70162.

Abstract:
A modality device can today capture high-resolution volumetric data sets, and as data resolutions increase, so do the challenges of processing volumetric data through a visualization pipeline. Standard volume rendering pipelines often use a graphics processing unit (GPU) to accelerate rendering performance by taking advantage of the parallel architecture of such devices. Unfortunately, graphics cards have limited amounts of video memory (VRAM), causing a bottleneck in a standard pipeline. Multi-resolution techniques can be used to efficiently modify the rendering pipeline, allowing sub-domains within the volume to be represented at different resolutions. The active resolution distribution is temporarily stored in VRAM for rendering and the inactive parts are stored on secondary memory layers such as system RAM or disk. The active resolution set can be optimized to produce high-quality renders while minimizing the amount of storage required. This is done by using a dynamic compression scheme which optimizes the visual quality by evaluating user-input data. The optimized resolution of each sub-domain is then, on demand, streamed to VRAM from the secondary memory layers. Rendering a multi-resolution data set requires some extra care at the boundaries between sub-domains. To avoid artifacts, an intrablock interpolation (II) sampling scheme capable of creating smooth transitions between sub-domains at arbitrary resolutions can be used. The result is a highly optimized rendering pipeline, complemented by a preprocessing pipeline, together capable of rendering large volumetric data sets in real time.
28

Månsson, Per. "Database analysis and managing large data sets in a trading environment." Thesis, Linköpings universitet, Databas och informationsteknik, 2014. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-104193.

Abstract:
Start-up companies today tend to need to scale up quickly and smoothly to cover rapidly increasing demand for the services they create. It is also always necessary to save money and to find a cost-efficient solution that can meet the demands of the company. This report uses Amazon Web Services for infrastructure. It covers hosting databases on Elastic Compute Cloud; the Relational Database Service as well as Amazon DynamoDB for NoSQL storage are compared, benchmarked and evaluated.
29

Carter, Caleb. "High Resolution Visualization of Large Scientific Data Sets Using Tiled Display." Fogler Library, University of Maine, 2007. http://www.library.umaine.edu/theses/pdf/CarterC2007.pdf.

30

Memarsadeghi, Nargess. "Efficient algorithms for clustering and interpolation of large spatial data sets." College Park, Md. : University of Maryland, 2007. http://hdl.handle.net/1903/6839.

Abstract:
Thesis (Ph. D.) -- University of Maryland, College Park, 2007.
Thesis research directed by: Computer Science. Title from t.p. of PDF. Includes bibliographical references. Published by UMI Dissertation Services, Ann Arbor, Mich. Also available in paper.
31

Sips, Mike. "Pixel-based visual data mining in large geo-spatial point sets /." Konstanz : Hartung-Gorre, 2006. http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&doc_number=014881714&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA.

32

Coudret, Raphaël. "Stochastic modelling using large data sets : applications in ecology and genetics." Phd thesis, Université Sciences et Technologies - Bordeaux I, 2013. http://tel.archives-ouvertes.fr/tel-00865867.

Abstract:
There are two main parts in this thesis. The first one concerns valvometry, which is here the study of the distance between both parts of the shell of an oyster, over time. The health status of oysters can be characterized using valvometry in order to obtain insights about the quality of their environment. We consider that a renewal process with four states underlies the behaviour of the studied oysters. Such a hidden process can be retrieved from a valvometric signal by assuming that some probability density function linked with this signal is bimodal. We then compare several estimators which take this assumption into account, including kernel density estimators. In another chapter, we compare several regression approaches, aiming at analysing transcriptomic data. To understand which explanatory variables have an effect on gene expressions, we apply a multiple testing procedure on these data, through the linear model FAMT. The SIR method may find nonlinear relations in such a context. It is however more commonly used when the response variable is univariate. A multivariate version of SIR was then developed. Procedures to measure gene expressions can be expensive. The sample size n of the corresponding datasets is then often small. That is why we also studied SIR when n is less than the number of explanatory variables p.
33

Winter, Eitan E. "Evolutionary analyses of protein-coding genes using large biological data sets." Thesis, University of Oxford, 2006. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.427615.

34

Mostafa, Nour. "Intelligent dynamic caching for large data sets in a grid environment." Thesis, Queen's University Belfast, 2013. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.602689.

Abstract:
Present and future distributed applications need to deal with very large, petabyte-scale (PB) datasets and increasing numbers of associated users and resources. The emergence of Grid-based systems as a potential solution for large computational and data management problems has initiated significant research activity in the area. Grid research can be divided into at least two areas: Data Grids and Computational Grids. The aims of Data Grids are to provide services for accessing, sharing and modifying large databases, while the aims of Computational Grids are to provide services for sharing resources. The considerable increase in data production and data sharing within scientific communities has created the need for improvements in data access and data availability. It can be argued that the problems associated with the management of very large datasets are not well served by current approaches. This thesis concentrates on one of the areas concerned: access to very large distributed databases on Grid resources. To this end, it presents the design and implementation of a partial replication system and a Grid caching system that mediates access to distributed data. Artificial intelligence (AI) techniques such as neural networks (NNs) have been used as the prediction element of the model to determine user requirements by analysing the past history of the user. Hence, this thesis examines the problems surrounding the manipulation of very large data sets within a Grid-like environment. The goal is the development of a prototype system that will enable both effective and efficient access to very large datasets, based on the use of a caching model.
35

Kim, Hyeyoen. "Large data sets and nonlinearity : essays in international finance and macroeconomics." Thesis, University of Warwick, 2009. http://wrap.warwick.ac.uk/3747/.

Abstract:
This thesis investigates whether the information in large macroeconomic data sets is relevant for resolving some puzzling and questionable aspects of international finance and macroeconomics. In particular, we employ diffusion index (DI) analysis in order to summarize very large data sets in a small number of factors. Applying the factors in conventional model specifications addresses the following main issues. Using factor-augmented vector autoregressive (FAVAR) models, we measure the impact of UK and US monetary policy. This approach notably mitigates the ‘price puzzle’ for both economies, whereby a monetary tightening appears to have perverse effects on price movements. We also estimate structural FAVARs and examine the impact of aggregate demand and aggregate supply using a recursive long-run multiplier identification procedure. This method is applied to examine the evidence for increased UK macroeconomic flexibility following the UK labour market reforms of the 1980s. For forecasting purposes, factors are employed as ‘unobserved’ fundamentals which direct the movement of exchange rates. From the long-run relationship between factor-based fundamentals and the exchange rate, the deviation from the fundamental level of the exchange rate is exploited to improve the predictive performance of the fundamental model of exchange rates. Our empirical results suggest strong evidence that factors help predict exchange rates as the horizon becomes longer, outperforming both the random walk and the standard monetary fundamental models. Finally, we explore whether allowing for a wide range of influences on the real exchange rate in a nonlinear framework can help to resolve the ‘PPP puzzle’. Factors, as determinants of the time-varying equilibrium of real exchange rates, are incorporated into a nonlinear framework. Allowing for the effects of macroeconomic factors dramatically increases the measured speed of adjustment of the real exchange rate.
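The diffusion-index idea, extracting a few principal-component factors from a large panel and using them in an otherwise conventional forecasting regression, can be sketched in a few lines (a generic illustration under simplified assumptions, not the models estimated in the thesis):

```python
import numpy as np

def extract_factors(X, n_factors):
    """Principal-component factor estimates from a standardized T x N panel."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    eigvals, eigvecs = np.linalg.eigh(Z.T @ Z / len(Z))
    loadings = eigvecs[:, ::-1][:, :n_factors]   # loadings of the largest eigenvalues
    return Z @ loadings                          # T x n_factors

def factor_augmented_forecast(y, X, n_factors=2, horizon=1):
    """OLS regression of y_{t+h} on a constant, y_t and the factors F_t."""
    F = extract_factors(X, n_factors)
    T = len(y) - horizon
    design = np.column_stack([np.ones(T), y[:T], F[:T]])
    beta, *_ = np.linalg.lstsq(design, y[horizon:horizon + T], rcond=None)
    latest = np.concatenate([[1.0], [y[-1]], F[-1]])
    return latest @ beta                         # point forecast of y_{T+h}

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    T, N = 200, 100
    common = np.cumsum(rng.normal(size=T))                   # one common driver
    X = np.outer(common, rng.normal(size=N)) + rng.normal(size=(T, N))
    y = 0.5 * common + rng.normal(size=T)
    print(factor_augmented_forecast(y, X))
```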
36

Nguyen, Minh Quoc. "Toward accurate and efficient outlier detection in high dimensional and large data sets." Diss., Georgia Institute of Technology, 2010. http://hdl.handle.net/1853/34657.

Abstract:
An efficient method to compute local density-based outliers in high-dimensional data is proposed. In our work, we have shown that this type of outlier is present in any subset of the dataset. This property is used to partition the data set into random subsets so that the outliers can be computed locally; the outliers are then combined from the different subsets. Therefore, the local density-based outliers can be computed efficiently. Another challenge in outlier detection in high-dimensional data is that outliers are often suppressed when the majority of dimensions do not exhibit outlying behaviour. The contribution of this work is to introduce a filtering method whereby outlier scores are computed in sub-dimensions. The low sub-dimensional scores are filtered out and the high scores are aggregated into the final score. This aggregation with filtering eliminates the effect of accumulating small (delta) deviations over multiple dimensions. Therefore, the outliers are identified correctly. In some cases, sets of outliers that form micro-patterns are more interesting than individual outliers. These micro-patterns are considered anomalous with respect to the dominant patterns in the dataset. In the area of anomalous pattern detection, there are two challenges. The first challenge is that anomalous patterns are often masked by the dominant patterns when existing clustering techniques are used. A common approach is to cluster the dataset using the k-nearest neighbor algorithm. The contribution of this work is to introduce the adaptive nearest neighbor and the concept of dual-neighbor to detect micro-patterns more accurately. The next challenge is to compute the anomalous patterns very fast. Our contribution is to compute the patterns based on the correlation between the attributes. The correlation implies that the data can be partitioned into groups based on each attribute to learn the candidate patterns within the groups. Thus, a feature-based method is developed that can compute these patterns efficiently.
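The 'filter then aggregate' treatment of sub-dimensional outlier scores can be sketched as follows (a toy illustration that uses per-dimension z-scores and an assumed threshold as a stand-in for the dissertation's local density-based scores):

```python
import numpy as np

def subdimensional_outlier_scores(X, threshold=2.5):
    """Score each point per dimension, drop weak sub-scores, sum the rest.

    Filtering sub-scores below `threshold` prevents many small (delta)
    deviations accumulated over many dimensions from drowning out a point
    that is strongly outlying in only a few dimensions.
    """
    Z = np.abs((X - X.mean(axis=0)) / X.std(axis=0))   # n x d per-dimension scores
    filtered = np.where(Z >= threshold, Z, 0.0)        # keep only strong evidence
    return filtered.sum(axis=1)                        # aggregate into a final score

if __name__ == "__main__":
    rng = np.random.default_rng(7)
    X = rng.normal(size=(1000, 50))
    X[0, 3] = 8.0      # extreme in a single dimension -> should rank high
    X[1] += 0.4        # mildly shifted in every dimension -> mostly filtered out
    scores = subdimensional_outlier_scores(X)
    print("single-dimension outlier:", round(scores[0], 2))
    print("mild shift everywhere:   ", round(scores[1], 2))
```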
37

Towfeek, Ajden. "Multi-Resolution Volume Rendering of Large Medical Data Sets on the GPU." Thesis, Linköping University, Department of Science and Technology, 2008. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-10715.

Abstract:

Volume rendering techniques can be powerful tools when visualizing medical data sets. The ability to capture 3-D internal structures makes the technique attractive. Scanning equipment is producing medical images with rapidly increasing resolution, resulting in heavily increased data set sizes. Despite the great amount of processing power CPUs deliver, the required precision in image quality can be hard to obtain in real-time rendering. Therefore, it is highly desirable to optimize the rendering process.

Modern GPUs possess much more computational power and are available for general-purpose programming through high-level shading languages. Efficient representations of the data are crucial due to the limited memory provided by the GPU. This thesis describes the theoretical background and the implementation of an approach presented by Patric Ljung, Claes Lundström and Anders Ynnerman at Linköping University. The main objective is to implement a fully working multi-resolution framework with two separate pipelines for pre-processing and real-time rendering, which uses the GPU to visualize large medical data sets.

APA, Harvard, Vancouver, ISO, and other styles
38

González, David Muñoz. "Discovering unknown equations that describe large data sets using genetic programming techniques." Thesis, Linköping University, Department of Electrical Engineering, 2005. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-2639.

Full text
Abstract:

FIR filters are widely used nowadays, with applications ranging from MP3 players, Hi-Fi systems and digital TVs to wireless communication systems. They are implemented on DSPs, and several trade-offs make it important to estimate the required filter order as accurately as possible.

In order to find a better estimation of the filter order than the existing ones, gene expression programming (GEP) is used. GEP is a genetic algorithm that can be used for function finding. It is implemented in a commercial application which, once the appropriate input file and settings have been provided, evolves the individuals in the input file until a good solution is found. This thesis is the first in this new line of research.

The aim has been not only to reach the desired estimation but also to pave the way for further investigations.
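For context, the sketch below computes Kaiser's classic closed-form estimate of the required FIR order, the kind of empirical formula the thesis seeks to improve on with GEP; the formula is the standard textbook one and is not a result of the thesis, and the function name and example values are illustrative.

```python
import math

def kaiser_fir_order(atten_db, transition_width_norm):
    """Kaiser's classic estimate of the FIR order needed for a lowpass filter.

    atten_db              : desired stopband attenuation in dB
    transition_width_norm : transition bandwidth as a fraction of the sample rate

    This is the kind of closed-form estimate the thesis tries to improve on;
    it is not a result from the thesis itself.
    """
    delta_omega = 2 * math.pi * transition_width_norm   # rad/sample
    return math.ceil((atten_db - 7.95) / (2.285 * delta_omega))

# e.g. 60 dB attenuation with a transition band of 5 % of the sample rate
print(kaiser_fir_order(60, 0.05))   # roughly 73 taps
```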

APA, Harvard, Vancouver, ISO, and other styles
39

Bäckström, Daniel. "Managing and Exploring Large Data Sets Generated by Liquid Separation - Mass Spectrometry." Doctoral thesis, Uppsala University, Analytical Chemistry, 2007. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-8223.

Full text
Abstract:

A trend in natural science, and especially in analytical chemistry, is the increasing need for analysis of a large number of complex samples with low analyte concentrations. Biological samples (urine, blood, plasma, cerebrospinal fluid, tissue, etc.) are often suitable for analysis with liquid separation mass spectrometry (LS-MS), resulting in two-way data tables (time vs. m/z). Such biological 'fingerprints' taken for all samples in a study correspond to a large amount of data. Detailed characterization requires a high sampling rate in combination with high mass resolution and a wide mass range, which presents a challenge in data handling and exploration. This thesis describes methods for managing and exploring large data sets made up of such detailed 'fingerprints' (represented as data matrices).

The methods were implemented as scripts and functions in Matlab, a widespread environment for matrix manipulations. A single-file structure to hold the imported data facilitated both easy access and fast manipulation. Routines for baseline removal and noise reduction were intended to reduce the amount of data without losing relevant information. A tool for visualizing and exploring single runs was also included. When comparing two or more 'fingerprints', they usually have to be aligned due to unintended shifts in analyte positions in time and m/z. A PCA-like multivariate method proved to be less sensitive to such shifts, and an ANOVA implementation made it easier to find systematic differences within the data sets.

The above strategies and methods were applied to complex samples such as plasma, protein digests, and urine. The fields of application included urine profiling (paracetamol intake; beverage effects), peptide mapping (different digestion protocols) and the search for potential biomarkers (appendicitis diagnosis). The influence of the experimental factors was visualized by PCA score plots as well as clustering diagrams (dendrograms).
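A minimal sketch of the score-plot step, assuming each sample's LC-MS 'fingerprint' is available as a time x m/z matrix; the original work used Matlab with dedicated baseline-removal, noise-reduction and alignment routines, which are omitted here, and the function names and data sizes are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

def fingerprint_scores(fingerprints, n_components=2):
    """PCA score coordinates for a set of LC-MS 'fingerprints'.

    Each fingerprint is a 2-D (time x m/z) intensity matrix; flattening one
    matrix per sample and running PCA across samples gives the kind of score
    plot discussed in the abstract (preprocessing is omitted in this sketch).
    """
    X = np.vstack([f.ravel() for f in fingerprints])
    return PCA(n_components=n_components).fit_transform(X)

# Hypothetical study: 12 samples, each a 200 x 500 (time x m/z) matrix
rng = np.random.default_rng(0)
samples = [rng.random((200, 500)) for _ in range(12)]
scores = fingerprint_scores(samples)       # shape (12, 2), one point per sample
```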

APA, Harvard, Vancouver, ISO, and other styles
40

Cutchin, Andrew E. Donahoo Michael J. "Towards efficient and practical reliable bulk data transport for large receiver sets." Waco, Tex. : Baylor University, 2007. http://hdl.handle.net/2104/5140.

Full text
APA, Harvard, Vancouver, ISO, and other styles
41

Dutta, Soumya. "In Situ Summarization and Visual Exploration of Large-scale Simulation Data Sets." The Ohio State University, 2018. http://rave.ohiolink.edu/etdc/view?acc_num=osu1524070976058567.

Full text
APA, Harvard, Vancouver, ISO, and other styles
42

Blanc, Trevor Jon. "Analysis and Compression of Large CFD Data Sets Using Proper Orthogonal Decomposition." BYU ScholarsArchive, 2014. https://scholarsarchive.byu.edu/etd/5303.

Full text
Abstract:
Efficient analysis and storage of data is an integral but often challenging task when working with computation fluid dynamics mainly due to the amount of data it can output. Methods centered around the proper orthogonal decomposition were used to analyze, compress, and model various simulation cases. Two different high-fidelity, time-accurate turbomachinery simulations were investigated to show various applications of the analysis techniques. The first turbomachinery example was used to illustrate the extraction of turbulent coherent structures such as traversing shocks, vortex shedding, and wake variation from deswirler and rotor blade passages. Using only the most dominant modes, flow fields were reconstructed and analyzed for error. The reconstructions reproduced the general dynamics within the flow well, but failed to fully resolve shock fronts and smaller vortices. By decomposing the domain into smaller, independent pieces, reconstruction error was reduced by up to 63 percent. A new method of data compression that combined an image compression algorithm and the proper orthogonal decomposition was used to store the reconstructions of the flow field, increasing data compression ratios by a factor of 40.The second turbomachinery simulation studied was a three-stage fan with inlet total pressure distortion. Both the snapshot and repeating geometry methods were used to characterize structures of static pressure fluctuation within the blade passages of the third rotor blade row. Modal coefficients filtered by frequencies relating to the inlet distortion pattern were used to produce reconstructions of the pressure field solely dependent on the inlet boundary condition. A hybrid proper orthogonal decomposition method was proposed to limit burdens on computational resources while providing high temporal resolution analysis.Parametric reduced order models were created from large databases of transient and steady conjugate heat transfer and airfoil simulations. Performance of the models were found to depend heavily on the range of the parameters varied as well as the number of simulations used to traverse that range. The heat transfer models gave excellent predictions for temperature profiles in heated solids for ambitious parameter ranges. Model development for the airfoil case showed that accuracy was highly dependent on modal truncation. The flow fields were predicted very well, especially outside the boundary layer region of the flow.
APA, Harvard, Vancouver, ISO, and other styles
43

Deri, Joya A. "Graph Signal Processing: Structure and Scalability to Massive Data Sets." Research Showcase @ CMU, 2016. http://repository.cmu.edu/dissertations/725.

Full text
Abstract:
Large-scale networks are becoming more prevalent, with applications in healthcare systems, financial networks, social networks, and traffic systems. The detection of normal and abnormal behaviors (signals) in these systems presents a challenging problem. State-of-the-art approaches such as principal component analysis and graph signal processing address this problem using signal projections onto a space determined by an eigendecomposition or singular value decomposition. When a graph is directed, however, applying methods based on the graph Laplacian or singular value decomposition causes information from unidirectional edges to be lost. Here we present a novel formulation and graph signal processing framework that addresses this issue and that is well suited for application to extremely large, directed, sparse networks. In this thesis, we develop and demonstrate a graph Fourier transform for which the spectral components are the Jordan subspaces of the adjacency matrix. In addition to admitting a generalized Parseval’s identity, this transform yields graph equivalence classes that can simplify the computation of the graph Fourier transform over certain networks. Exploration of these equivalence classes provides the intuition for an inexact graph Fourier transform method that dramatically reduces computation time over real-world networks with nontrivial Jordan subspaces. We apply our inexact method to four years of New York City taxi trajectories (61 GB after preprocessing) over the NYC road network (6,400 nodes, 14,000 directed edges). We discuss optimization strategies that reduce the computation time of taxi trajectories from raw data by orders of magnitude: from 3,000 days to less than one day. Our method yields a fine-grained analysis that pinpoints the same locations as the original method while reducing computation time and decreasing energy dispersal among spectral components. This capability to rapidly reduce raw traffic data to meaningful features has important ramifications for city planning and emergency vehicle routing.
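For the diagonalizable case, the adjacency-based graph Fourier transform that this work builds on can be sketched in a few lines; the generalization to nontrivial Jordan subspaces and the inexact transform are beyond this illustration, and the tiny directed graph used here is invented for the example.

```python
import numpy as np

def adjacency_gft(A, signal):
    """Graph Fourier transform of a signal over a directed graph.

    Uses the eigendecomposition of the adjacency matrix A, which covers the
    diagonalizable case only; the dissertation generalizes this to Jordan
    subspaces when A is defective.
    """
    eigvals, V = np.linalg.eig(A)
    coeffs = np.linalg.solve(V, signal)   # V^{-1} s without forming the inverse
    return eigvals, coeffs

# Directed 3-cycle as a hypothetical example graph
A = np.array([[0, 1, 0],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)
s = np.array([1.0, 2.0, 3.0])
lam, s_hat = adjacency_gft(A, s)          # spectral components of the signal
```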
APA, Harvard, Vancouver, ISO, and other styles
44

Quiroz, Matias. "Bayesian Inference in Large Data Problems." Doctoral thesis, Stockholms universitet, Statistiska institutionen, 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:su:diva-118836.

Full text
Abstract:
In the last decade or so, there has been a dramatic increase in storage facilities and the possibility of processing huge amounts of data. This has made large high-quality data sets widely accessible for practitioners. This technology innovation seriously challenges traditional modeling and inference methodology. This thesis is devoted to developing inference and modeling tools to handle large data sets. Four included papers treat various important aspects of this topic, with a special emphasis on Bayesian inference by scalable Markov Chain Monte Carlo (MCMC) methods. In the first paper, we propose a novel mixture-of-experts model for longitudinal data. The model and inference methodology allows for manageable computations with a large number of subjects. The model dramatically improves the out-of-sample predictive density forecasts compared to existing models. The second paper aims at developing a scalable MCMC algorithm. Ideas from the survey sampling literature are used to estimate the likelihood on a random subset of data. The likelihood estimate is used within the pseudomarginal MCMC framework and we develop a theoretical framework for such algorithms based on subsets of the data. The third paper further develops the ideas introduced in the second paper. We introduce the difference estimator in this framework and modify the methods for estimating the likelihood on a random subset of data. This results in scalable inference for a wider class of models. Finally, the fourth paper brings the survey sampling tools for estimating the likelihood developed in the thesis into the delayed acceptance MCMC framework. We compare to an existing approach in the literature and document promising results for our algorithm.

At the time of the doctoral defense, the following papers were unpublished and had a status as follows: Paper 1: Submitted. Paper 2: Submitted. Paper 3: Manuscript. Paper 4: Manuscript.
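A minimal sketch of the underlying idea of estimating the full-data log-likelihood from a random subset, assuming simple random sampling without replacement; the papers' difference estimator, control variates and pseudo-marginal/delayed-acceptance machinery are not reproduced, and all function names and the example model are illustrative.

```python
import numpy as np

def subsampled_loglik(loglik_fn, data, theta, m, rng):
    """Unbiased estimate of the full-data log-likelihood from a random subset.

    loglik_fn(theta, x) returns the log-likelihood contribution of one record.
    This is the plain simple-random-sampling estimator; the thesis sharpens it
    with control variates (the difference estimator) before plugging it into
    subsampling MCMC.
    """
    n = len(data)
    idx = rng.choice(n, size=m, replace=False)
    contribs = np.array([loglik_fn(theta, data[i]) for i in idx])
    return n * contribs.mean()

# Hypothetical example: normal model with unknown mean, 100 000 observations
rng = np.random.default_rng(0)
data = rng.normal(loc=1.5, scale=1.0, size=100_000)
ll = lambda mu, x: -0.5 * (x - mu) ** 2 - 0.5 * np.log(2 * np.pi)
est = subsampled_loglik(ll, data, theta=1.5, m=1_000, rng=rng)
```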

APA, Harvard, Vancouver, ISO, and other styles
45

Boukorca, Ahcène. "Hypergraphs in the Service of Very Large Scale Query Optimization. Application : Data Warehousing." Thesis, Chasseneuil-du-Poitou, Ecole nationale supérieure de mécanique et d'aérotechnique, 2016. http://www.theses.fr/2016ESMA0026/document.

Full text
Abstract:
The emergence of the Big-Data phenomenon has introduced new and urgent needs to share data between users and communities, which generates a large number of queries that DBMSs must handle. This problem is compounded by additional needs for query recommendation and exploration. Since data processing still relies on solutions for query optimization, physical design and deployment architectures, and since these solutions are the results of combinatorial problems defined over the queries, traditional methods need to be revisited to meet the new scalability requirements. This thesis focuses on the problem of very large query workloads and proposes a scalable approach, implemented in a framework called Big-Queries, based on the hypergraph, a flexible data structure with great modeling power that allows accurate formulations of many problems in combinatorial scientific computing. This approach is the result of a collaboration with the company Mentor Graphics. It aims to capture query interaction in a unified query plan and to use partitioning algorithms to ensure scalability and to obtain optimal optimization structures (materialized views and data partitioning). The unified plan is also used in the deployment phase of parallel data warehouses, by partitioning the data into fragments and allocating these fragments to the corresponding processing nodes. An intensive experimental study showed the interest of the approach in terms of algorithm scalability and reduction of query response time.
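A minimal sketch of the modeling step, assuming each query is represented by the set of tables or attributes it touches; every query then acts as a hyperedge over those items. The workload, names and the simple "shared items" summary below are invented for illustration, and the actual hypergraph partitioning that the thesis applies to derive materialized views and fragments is not reproduced.

```python
from collections import defaultdict

def build_query_hypergraph(queries):
    """Represent query interaction as a hypergraph incidence structure.

    queries : dict mapping a query id to the set of tables/attributes it touches.
    Returns, for each item, the set of queries (hyperedges) containing it.
    """
    item_to_queries = defaultdict(set)
    for qid, items in queries.items():
        for item in items:
            item_to_queries[item].add(qid)
    return item_to_queries

# Hypothetical workload of three queries over a star schema
workload = {
    "q1": {"sales", "date", "store"},
    "q2": {"sales", "date", "product"},
    "q3": {"inventory", "product"},
}
incidence = build_query_hypergraph(workload)
# Items shared by several queries are the interaction points a partitioner exploits.
shared = {item: qs for item, qs in incidence.items() if len(qs) > 1}
```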
APA, Harvard, Vancouver, ISO, and other styles
46

Bresell, Anders. "Characterization of protein families, sequence patterns, and functional annotations in large data sets." Doctoral thesis, Linköping : Department of Physics, Chemistry and Biology, 2008. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-10565.

Full text
APA, Harvard, Vancouver, ISO, and other styles
47

Castro, Jose R. "MODIFICATIONS TO THE FUZZY-ARTMAP ALGORITHM FOR DISTRIBUTED LEARNING IN LARGE DATA SETS." Doctoral diss., University of Central Florida, 2004. http://digital.library.ucf.edu/cdm/ref/collection/ETD/id/4449.

Full text
Abstract:
The Fuzzy-ARTMAP (FAM) algorithm is one of the premier neural network architectures for classification problems. FAM can learn online and is usually faster than other neural network approaches. Nevertheless, the learning time of FAM can slow down considerably when the size of the training set increases into the hundreds of thousands. In this dissertation we apply data partitioning and network partitioning to the FAM algorithm in sequential and parallel settings to achieve better convergence time and to train efficiently with large databases (hundreds of thousands of patterns). We implement our parallelization on a Beowulf cluster of workstations. This choice of platform requires that the parallelization be coarse grained. Extensive testing of all the approaches is done on three large datasets (half a million data points). One of them is the Forest Covertype database from Blackard and the other two are artificially generated Gaussian data with different percentages of overlap between classes. Speedups in the data partitioning approach reached the order of hundreds without having to invest in parallel computation. Speedups in the network partitioning approach are close to linear on a cluster of workstations. Both methods allowed us to reduce the computation time of training the neural network on large databases from days to minutes. We prove formally that the workload balance of our network partitioning approaches will never be worse than an acceptable bound, and also demonstrate the correctness of these parallelization variants of FAM.
Ph.D.
School of Electrical and Computer Engineering
Engineering and Computer Science
Electrical and Computer Engineering
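The data-partitioning strategy can be sketched as training one learner per disjoint chunk and combining the learners by voting; since Fuzzy-ARTMAP has no standard scikit-learn implementation, a decision tree stands in for FAM here, and the chunking, pool size and voting rule are illustrative assumptions rather than the dissertation's Beowulf-cluster setup.

```python
import numpy as np
from multiprocessing import Pool
from sklearn.tree import DecisionTreeClassifier

def _train_on_chunk(chunk):
    X_chunk, y_chunk = chunk
    # Stand-in learner: the dissertation trains a Fuzzy-ARTMAP network per chunk;
    # FAM has no standard scikit-learn implementation, so a tree is used here.
    return DecisionTreeClassifier().fit(X_chunk, y_chunk)

def data_partitioned_fit(X, y, n_workers=4):
    """Coarse-grained data partitioning: one learner per disjoint chunk of the data."""
    chunks = list(zip(np.array_split(X, n_workers), np.array_split(y, n_workers)))
    with Pool(n_workers) as pool:
        return pool.map(_train_on_chunk, chunks)

def majority_vote(models, X):
    """Combine the per-partition learners by majority vote over their predictions."""
    preds = np.stack([m.predict(X) for m in models]).astype(int)
    counts = np.apply_along_axis(
        lambda c: np.bincount(c, minlength=preds.max() + 1), 0, preds)
    return counts.argmax(axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100_000, 10))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)      # hypothetical labels
    models = data_partitioned_fit(X, y)
    accuracy = (majority_vote(models, X[:1_000]) == y[:1_000]).mean()
```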
APA, Harvard, Vancouver, ISO, and other styles
48

Brind'Amour, Katherine. "Maternal and Child Health Home Visiting Evaluations Using Large, Pre-Existing Data Sets." The Ohio State University, 2016. http://rave.ohiolink.edu/etdc/view?acc_num=osu1468965739.

Full text
APA, Harvard, Vancouver, ISO, and other styles
49

Nyumbeka, Dumisani Joshua. "Using data analysis and Information visualization techniques to support the effective analysis of large financial data sets." Thesis, Nelson Mandela Metropolitan University, 2016. http://hdl.handle.net/10948/12983.

Full text
Abstract:
There have been a number of technological advances in the last ten years, which have resulted in the amount of data generated in organisations increasing by more than 200% during this period. This rapid increase in data means that if financial institutions are to derive significant value from this data, they need to identify new ways to analyse it effectively. Due to the considerable size of the data, financial institutions also need to consider how to visualise it effectively. Traditional tools such as relational database management systems have problems processing large amounts of data due to memory constraints, latency issues and the presence of both structured and unstructured data.

The aim of this research was to use data analysis and information visualisation (IV) techniques to support the effective analysis of large financial data sets. In order to analyse the data visually and effectively, the underlying data model must produce reliable results. A large financial data set was identified and used to demonstrate that IV techniques can support the effective analysis of large financial data sets. A review of the literature on large financial data sets, visual analytics, and existing data management and data visualisation tools identified the shortcomings of existing tools. This resulted in the determination of the requirements for the data management tool and the IV tool. The data management tool identified was a data warehouse and the IV toolkit identified was Tableau. The IV techniques identified included the Overview, Dashboards and Colour Blending. The IV tool was implemented and published online and can be accessed through a web browser interface.

The data warehouse and the IV tool were evaluated to determine their accuracy and effectiveness in supporting the effective analysis of the large financial data set. The experiment used to evaluate the data warehouse yielded positive results, showing that only about 4% of the records had incorrect data. The results of the user study were positive and no major usability issues were identified. The participants found the IV techniques effective for analysing the large financial data set.
APA, Harvard, Vancouver, ISO, and other styles
50

Li, Yanrong. "Techniques for improving clustering and association rules mining from very large transactional databases." Thesis, Curtin University, 2009. http://hdl.handle.net/20.500.11937/907.

Full text
Abstract:
Clustering and association rules mining are two core data mining tasks that have been actively studied by the data mining community for nearly two decades. Though many clustering and association rules mining algorithms have been developed, no algorithm is better than the others on all aspects, such as accuracy, efficiency, scalability, adaptability and memory usage. While more efficient and effective algorithms need to be developed for handling large-scale and complex stored datasets, emerging applications where data takes the form of streams pose new challenges for the data mining community. The existing techniques and algorithms for static stored databases cannot be applied to data streams directly; they need to be extended or modified, or new methods need to be developed to process the data streams.

In this thesis, algorithms have been developed for improving the efficiency and accuracy of clustering and association rules mining on very large, high dimensional, high cardinality, sparse transactional databases and data streams.

A new similarity measure suitable for clustering transactional data is defined and an incremental clustering algorithm, INCLUS, is proposed using this similarity measure. The algorithm scans the database only once and produces clusters based on the user's expectations of similarities between transactions in a cluster, which is controlled by the user input parameters, a similarity threshold and a support threshold. Intensive testing has been performed to evaluate the effectiveness, efficiency, scalability and order insensitiveness of the algorithm.

To extend INCLUS to transactional data streams, an equal-width time window model and an elastic time window model are proposed that allow mining of clustering changes in evolving data streams. The minimal width of the window is determined by the minimum clustering granularity for a particular application. Two algorithms, CluStream_EQ and CluStream_EL, based on the equal-width window model and the elastic window model respectively, are developed by incorporating these models into INCLUS. Each algorithm consists of an online micro-clustering component and an offline macro-clustering component. The online component writes summary statistics of a data stream to disk, and the offline component uses those summaries and other user input to discover changes in a data stream. The effectiveness and scalability of the algorithms are evaluated by experiments.

This thesis also looks into sampling techniques that can improve the efficiency of mining association rules in a very large transactional database. The sample size is derived based on the binomial distribution and the central limit theorem. The sample size used is smaller than that based on Chernoff bounds, but still provides the same approximation guarantees. The accuracy of the proposed sampling approach is theoretically analyzed and its effectiveness is experimentally evaluated on both dense and sparse datasets.

Applications of stratified sampling for association rules mining are also explored in this thesis. The database is first partitioned into strata based on the length of transactions, and simple random sampling is then performed on each stratum. The total sample size is determined by a formula derived in this thesis and the sample size for each stratum is proportionate to the size of the stratum. The accuracy of transaction-size-based stratified sampling is experimentally compared with that of random sampling.

The thesis concludes with a summary of significant contributions and some pointers for further work.
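The two kinds of sample-size bounds being compared can be sketched as follows, assuming the goal is to estimate an itemset's support within an absolute error at a given confidence; the bounds shown are the standard normal-approximation and Hoeffding/Chernoff-style ones, not the exact formula derived in the thesis, and the function names and example values are illustrative.

```python
from math import ceil, log
from statistics import NormalDist

def clt_sample_size(error, confidence, p=0.5):
    """Sample size for estimating an itemset's support within +/- error.

    Based on the normal (CLT) approximation to the binomial; p = 0.5 is the
    worst case.  A sketch of the kind of bound derived in the thesis, not the
    thesis' exact formula.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil(z * z * p * (1 - p) / (error * error))

def hoeffding_sample_size(error, confidence):
    """Additive Chernoff/Hoeffding-style bound giving the same guarantee."""
    return ceil(log(2 / (1 - confidence)) / (2 * error * error))

# For a 1 % error at 95 % confidence, the CLT-based size is roughly half
print(clt_sample_size(0.01, 0.95))        # about 9 604
print(hoeffding_sample_size(0.01, 0.95))  # about 18 445
```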
APA, Harvard, Vancouver, ISO, and other styles