
Dissertations / Theses on the topic 'High Throughput Data Storage'



Consult the top 50 dissertations / theses for your research on the topic 'High Throughput Data Storage.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse dissertations / theses from a wide variety of disciplines and organise your bibliography correctly.

1

Roguski, Łukasz 1987. "High-throughput sequencing data compression." Doctoral thesis, Universitat Pompeu Fabra, 2017. http://hdl.handle.net/10803/565775.

Full text
Abstract:
Thanks to advances in sequencing technologies, biomedical research has experienced a revolution over recent years, resulting in an explosion in the amount of genomic data being generated worldwide. The typical space requirement for storing sequencing data produced by a medium-scale experiment lies in the range of tens to hundreds of gigabytes, with multiple files in different formats being produced by each experiment. The current de facto standard file formats used to represent genomic data are text-based. For practical reasons, these are stored in compressed form. In most cases, such storage methods rely on general-purpose text compressors, such as gzip. Unfortunately, however, these methods are unable to exploit the information models specific to sequencing data, and as a result they usually provide limited functionality and insufficient savings in storage space. This explains why relatively basic operations such as processing, storage, and transfer of genomic data have become a typical bottleneck of current analysis setups. Therefore, this thesis focuses on methods to efficiently store and compress the data generated from sequencing experiments. First, we propose a novel general-purpose FASTQ file compressor. Compared to gzip, it achieves a significant reduction in the size of the resulting archive, while also offering high data processing speed. Next, we present compression methods that exploit the high sequence redundancy present in sequencing data. These methods achieve the best compression ratio among current state-of-the-art FASTQ compressors, without using any external reference sequence. We also demonstrate different lossy compression approaches to store auxiliary sequencing data, which allow for further reductions in size. Finally, we propose a flexible framework and data format, which allows one to semi-automatically generate compression solutions which are not tied to any specific genomic file format. To facilitate data management needed by complex pipelines, multiple genomic datasets having heterogeneous formats can be stored together in configurable containers, with an option to perform custom queries over the stored data. Moreover, we show that simple solutions based on our framework can achieve results comparable to those of state-of-the-art format-specific compressors. Overall, the solutions developed and described in this thesis can easily be incorporated into current pipelines for the analysis of genomic data. Taken together, they provide grounds for the development of integrated approaches towards efficient storage and management of such data.
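The point about general-purpose compressors ignoring FASTQ structure can be made concrete with a small Python sketch (purely illustrative, not the compressor developed in the thesis): splitting records into separate identifier, sequence and quality streams before applying gzip often improves the ratio, because each stream is internally more homogeneous. The file name is hypothetical.

```python
import gzip
from pathlib import Path

def split_and_compress(fastq_path: str) -> dict:
    """Separate FASTQ records into ID/sequence/quality streams and gzip each.

    A minimal illustration of structure-aware compression; dedicated FASTQ
    compressors use specialised models per stream rather than plain gzip.
    """
    ids, seqs, quals = [], [], []
    with open(fastq_path) as fh:
        while True:
            header = fh.readline()
            if not header:
                break
            ids.append(header.rstrip("\n"))
            seqs.append(fh.readline().rstrip("\n"))
            fh.readline()                      # '+' separator line
            quals.append(fh.readline().rstrip("\n"))

    sizes = {"gzip_whole_file": len(gzip.compress(Path(fastq_path).read_bytes()))}
    for name, stream in [("ids", ids), ("seqs", seqs), ("quals", quals)]:
        sizes[f"gzip_{name}"] = len(gzip.compress("\n".join(stream).encode()))
    sizes["gzip_split_total"] = sizes["gzip_ids"] + sizes["gzip_seqs"] + sizes["gzip_quals"]
    return sizes

# Usage (hypothetical file): print(split_and_compress("reads.fastq"))
```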
APA, Harvard, Vancouver, ISO, and other styles
2

Kalathur, Ravi Kiran Reddy. "An integrated systematic approach for storage, analysis and visualization of gene expression data from neuronal tissues acquired through high-throughput techniques." Université Louis Pasteur (Strasbourg) (1971-2008), 2008. https://publication-theses.unistra.fr/public/theses_doctorat/2008/KALATHUR_Ravi_Kiran_Reddy_2008.pdf.

Full text
Abstract:
The work presented in this manuscript concerns different aspects of gene expression data analysis, encompassing statistical methods and storage and visualization systems used to exploit and mine pertinent information from large volumes of data. Overall, I had the opportunity during my thesis to work on these various aspects: firstly, by contributing to the tests, through the design of biological applications, of new clustering and meta-analysis approaches developed in our laboratory; and secondly, by the development of RETINOBASE (http://alnitak.u-strasbg.fr/RetinoBase/), a relational database for the storage and efficient querying of transcriptomic data, which represents my major project.
APA, Harvard, Vancouver, ISO, and other styles
3

Nicolae, Bogdan. "BlobSeer : towards efficient data storage management for large-scale, distributed systems." Phd thesis, Université Rennes 1, 2010. http://tel.archives-ouvertes.fr/tel-00552271.

Full text
Abstract:
With data volumes increasing at a high rate and the emergence of highly scalable infrastructures (cloud computing, petascale computing), distributed management of data becomes a crucial issue that faces many challenges. This thesis brings several contributions in order to address such challenges. First, it proposes a set of principles for designing highly scalable distributed storage systems that are optimized for heavy data access concurrency. In particular, it highlights the potentially large benefits of using versioning in this context. Second, based on these principles, it introduces a series of distributed data and metadata management algorithms that enable a high throughput under concurrency. Third, it shows how to efficiently implement these algorithms in practice, dealing with key issues such as high-performance parallel transfers, efficient maintenance of distributed data structures, fault tolerance, etc. These results are used to build BlobSeer, an experimental prototype that is used to demonstrate both the theoretical benefits of the approach in synthetic benchmarks, as well as the practical benefits in real-life, applicative scenarios: as a storage backend for MapReduce applications, as a storage backend for deployment and snapshotting of virtual machine images in clouds, and as a quality-of-service-enabled data storage service for cloud applications. Extensive experiments on the Grid'5000 testbed show that BlobSeer remains scalable and sustains a high throughput even under heavy access concurrency, outperforming several state-of-the-art approaches by a large margin.
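The versioning principle highlighted in this abstract can be illustrated with a toy in-memory sketch (not BlobSeer's distributed implementation): each write publishes a new immutable snapshot, so concurrent readers never wait for writers.

```python
class VersionedBlob:
    """Minimal copy-on-write blob: each write publishes a new immutable
    version, so readers of older snapshots never block on writers."""

    def __init__(self):
        self.versions = [tuple()]          # version 0: empty blob (tuple of chunks)

    def write(self, base_version: int, offset_chunk: int, chunks: list) -> int:
        """Create a new version by patching chunks onto an existing snapshot."""
        base = list(self.versions[base_version])
        end = offset_chunk + len(chunks)
        base.extend([b""] * max(0, end - len(base)))
        base[offset_chunk:end] = chunks
        self.versions.append(tuple(base))  # publish atomically as a new snapshot
        return len(self.versions) - 1      # new version number

    def read(self, version: int) -> bytes:
        return b"".join(self.versions[version])

blob = VersionedBlob()
v1 = blob.write(0, 0, [b"AAAA", b"BBBB"])
v2 = blob.write(v1, 1, [b"CCCC"])          # a writer publishes v2 ...
assert blob.read(v1) == b"AAAABBBB"        # ... while readers still see v1 unchanged
assert blob.read(v2) == b"AAAACCCC"
```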
APA, Harvard, Vancouver, ISO, and other styles
4

Ljung, Patric. "Visualization of Particle In Cell Simulations." Thesis, Linköping University, Department of Science and Technology, 2000. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-2340.

Full text
Abstract:

A numerical simulation case involving space plasma and the evolution of instabilities that generate very fast electrons, i.e. electrons at approximately half the speed of light, is used as a test bed for scientific visualisation techniques. A visualisation system was developed to provide interactive real-time animation and visualisation of the simulation results. The work focuses on two themes and their integration. The first theme is the storage and management of the large data sets produced. The second theme deals with how the Visualisation System and Visual Objects are tailored to efficiently visualise the data at hand.

The integration of the themes has resulted in an interactive real-time animation and visualisation system which constitutes a very powerful tool for analysis and understanding of the plasma physics processes. The visualisations contained in this work have spawned many new possible research projects and provided insight into previously not fully understood plasma physics phenomena.

APA, Harvard, Vancouver, ISO, and other styles
5

Carpen-Amarie, Alexandra. "BlobSeer as a data-storage facility for clouds : self-Adaptation, integration, evaluation." Thesis, Cachan, Ecole normale supérieure, 2011. http://www.theses.fr/2011DENS0066/document.

Full text
Abstract:
The emergence of Cloud computing brings forward many challenges that may limit the adoption rate of the Cloud paradigm. As data volumes processed by Cloud applications increase exponentially, designing efficient and secure solutions for data management emerges as a crucial requirement. The goal of this thesis is to enhance a distributed data-management system with self-management capabilities, so that it can meet the requirements of the Cloud storage services in terms of scalability, data availability, reliability and security. Furthermore, we aim at building a Cloud data service both compatible with state-of-the-art Cloud interfaces and able to deliver high-throughput data storage. To meet these goals, we proposed generic self-awareness, self-protection and self-configuration components targeted at distributed data-management systems. We validated them on top of BlobSeer, a large-scale data-management system designed to optimize highly-concurrent data accesses. Next, we devised and implemented a BlobSeer-based file system optimized to efficiently serve as a storage backend for Cloud services. We then integrated it within a real-world Cloud environment, the Nimbus platform. The benefits and drawbacks of using Cloud storage for real-life applications have been emphasized in evaluations that involved data-intensive MapReduce applications and tightly-coupled, high-performance computing applications
APA, Harvard, Vancouver, ISO, and other styles
6

Kalathur, Ravi Kiran Reddy Poch Olivier. "Approche systématique et intégrative pour le stockage, l'analyse et la visualisation des données d'expression génique acquises par des techniques à haut débit, dans des tissus neuronaux An integrated systematic approach for storage, analysis and visualization of gene expression data from neuronal tissues acquired through high-throughput techniques /." Strasbourg : Université Louis Pasteur, 2008. http://eprints-scd-ulp.u-strasbg.fr:8080/920/01/KALATHUR_R_2007.pdf.

Full text
APA, Harvard, Vancouver, ISO, and other styles
7

Jin, Shuangshuang. "Integrated data modeling in high-throughput proteomics." Online access for everyone, 2007. http://www.dissertations.wsu.edu/Dissertations/Fall2007/S_Jin_111907.pdf.

Full text
APA, Harvard, Vancouver, ISO, and other styles
8

Capparuccini, Maria. "Inferential Methods for High-Throughput Methylation Data." VCU Scholars Compass, 2010. http://scholarscompass.vcu.edu/etd/156.

Full text
Abstract:
The role of abnormal DNA methylation in the progression of disease is a growing area of research that relies upon the establishment of sound statistical methods. The common method for declaring there is differential methylation between two groups at a given CpG site, as summarized by the difference between proportions methylated Δβ = β1 - β2, has been through use of a Filtered Two Sample t-test, using the recommended filter of 0.17 (Bibikova et al., 2006b). In this dissertation, we performed a re-analysis of the data used in recommending the threshold by fitting a mixed-effects ANOVA model. It was determined that the 0.17 filter is not accurate and conjectured that application of a Filtered Two Sample t-test likely leads to loss of power. Further, the Two Sample t-test assumes that data arise from an underlying distribution encompassing the entire real number line, whereas β1 and β2 are constrained to the interval [0, 1]. Additionally, the imposition of a filter at a level signifying the minimum level of detectable difference to a Two Sample t-test likely reduces power for smaller but truly differentially methylated CpG sites. Therefore, we compared the Two Sample t-test and the Filtered Two Sample t-test, which are widely used but largely untested with respect to their performance, to three proposed methods. These three proposed methods are a Beta distribution test, a Likelihood ratio test, and a Bootstrap test, where each was designed to address distributional concerns present in the current testing methods. It was ultimately shown through simulations comparing Type I and Type II error rates that the (unfiltered) Two Sample t-test and the Beta distribution test performed comparatively well.
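For intuition, the following hedged sketch contrasts a two-sample t-test on β values with a Beta-distribution likelihood-ratio test that respects the [0, 1] support; the sample sizes and Beta parameters are invented, and this is only an approximation of the tests proposed in the dissertation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group1 = rng.beta(a=8, b=2, size=12)     # proportions methylated, constrained to [0, 1]
group2 = rng.beta(a=6, b=4, size=12)

# Standard two-sample t-test on the beta values.
t_stat, t_pval = stats.ttest_ind(group1, group2)

# Likelihood-ratio test using Beta fits: one common fit vs. separate fits per group.
def beta_loglik(x):
    a, b, _, _ = stats.beta.fit(x, floc=0, fscale=1)   # fix the support to [0, 1]
    return np.sum(stats.beta.logpdf(x, a, b))

lr = 2 * (beta_loglik(group1) + beta_loglik(group2)
          - beta_loglik(np.concatenate([group1, group2])))
lr_pval = stats.chi2.sf(lr, df=2)        # two extra parameters in the separate-fit model

print(f"t-test p = {t_pval:.3g}, Beta LRT p = {lr_pval:.3g}")
```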
APA, Harvard, Vancouver, ISO, and other styles
9

Durif, Ghislain. "Multivariate analysis of high-throughput sequencing data." Thesis, Lyon, 2016. http://www.theses.fr/2016LYSE1334/document.

Full text
Abstract:
The statistical analysis of Next-Generation Sequencing data raises many computational challenges regarding modeling and inference, especially because of the high dimensionality of genomic data. The research work in this manuscript concerns hybrid dimension reduction methods that rely on both compression (representation of the data into a lower dimensional space) and variable selection. Developments are made concerning: the sparse Partial Least Squares (PLS) regression framework for supervised classification, and the sparse matrix factorization framework for unsupervised exploration. In both situations, our main purpose will be to focus on the reconstruction and visualization of the data. First, we will present a new sparse PLS approach, based on an adaptive sparsity-inducing penalty, that is suitable for logistic regression to predict the label of a discrete outcome. For instance, such a method will be used for prediction (fate of patients or specific type of unidentified single cells) based on gene expression profiles. The main issue in such framework is to account for the response to discard irrelevant variables. We will highlight the direct link between the derivation of the algorithms and the reliability of the results. Then, motivated by questions regarding single-cell data analysis, we propose a flexible model-based approach for the factorization of count matrices, that accounts for over-dispersion as well as zero-inflation (both characteristic of single-cell data), for which we derive an estimation procedure based on variational inference. In this scheme, we consider probabilistic variable selection based on a spike-and-slab model suitable for count data. The interest of our procedure for data reconstruction, visualization and clustering will be illustrated by simulation experiments and by preliminary results on single-cell data analysis. All proposed methods were implemented into two R-packages "plsgenomics" and "CMF" based on high performance computing
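As a rough illustration of the "compression plus variable selection" idea (not the adaptive-penalty sparse PLS developed in the thesis), a first sparse PLS direction can be approximated by soft-thresholding the ordinary PLS weight vector, zeroing out genes weakly related to the response; the threshold below is arbitrary and the toy data are random.

```python
import numpy as np

def first_sparse_pls_direction(X, y, threshold=0.5):
    """Soft-threshold the first PLS weight vector w proportional to X^T y.

    Illustrative sketch of sparsity-inducing PLS; the threshold would
    normally be tuned, e.g. by cross-validation.
    """
    Xc = X - X.mean(axis=0)                 # centre predictors
    yc = y - y.mean()
    w = Xc.T @ yc                           # ordinary first PLS weight direction
    w = w / np.max(np.abs(w))               # scale to [-1, 1] before thresholding
    w_sparse = np.sign(w) * np.maximum(np.abs(w) - threshold, 0.0)   # soft-threshold
    if np.any(w_sparse):
        w_sparse /= np.linalg.norm(w_sparse)
    component = Xc @ w_sparse               # latent score used downstream
    return w_sparse, component

# Hypothetical toy data: 20 samples, 100 genes, only the first 5 informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 100))
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=20)
w, score = first_sparse_pls_direction(X, y)
print("selected genes:", np.flatnonzero(w))
```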
APA, Harvard, Vancouver, ISO, and other styles
10

Zhang, Xuekui. "Mixture models for analysing high throughput sequencing data." Thesis, University of British Columbia, 2011. http://hdl.handle.net/2429/35982.

Full text
Abstract:
The goal of my thesis is to develop methods and software for analysing high-throughput sequencing data, emphasizing sonicated ChIP-seq. For this goal, we developed a few variants of mixture models for genome-wide profiling of transcription factor binding sites and nucleosome positions. Our methods have been implemented into Bioconductor packages, which are freely available to other researchers. For profiling transcription factor binding sites, we developed a method, PICS, and implemented it into a Bioconductor package. We used a simulation study to confirm that PICS compares favourably to rival methods, such as MACS, QuEST, CisGenome, and USeq. Using published GABP and FOXA1 data from human cell lines, we then show that PICS predicted binding sites were more consistent with computationally predicted binding motifs than the alternative methods. For motif discovery using transcription binding sites, we combined PICS with two other existing packages to create the first complete set of Bioconductor tools for peak-calling and binding motif analysis of ChIP-Seq and ChIP-chip data. We demonstrate the effectiveness of our pipeline on published human ChIP-Seq datasets for FOXA1, ER, CTCF and STAT1, detecting co-occurring motifs that were consistent with the literature but not detected by other methods. For nucleosome positioning, we modified PICS into a method called PING. PING can handle MNase-Seq and MNase- or sonicated-ChIP-Seq data. It compares favourably to NPS and TemplateFilter in scalability, accuracy and robustness to low read density. To demonstrate that PING predictions from sonicated data can have sufficient spatial resolution to be biologically meaningful, we use H3K4me1 data to detect nucleosome shifts, discriminate functional and non-functional transcription factor binding sites, and confirm that Foxa2 associates with the accessible major groove of nucleosomal DNA. All of the above uses single-end sequencing data. At the end of the thesis, we briefly discuss the issue of processing paired-end data, which we are currently investigating.
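The mixture-model idea can be conveyed with a deliberately simplified sketch: read positions around two binding events are fitted with a two-component Gaussian mixture, and the fitted means serve as candidate binding-site locations. PICS itself models forward/reverse strand structure and uses t-distributed components and priors, none of which appear here; the simulated positions are invented.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)
# Simulated read start positions around two binding sites (near 1000 bp and 1600 bp).
reads = np.concatenate([
    rng.normal(1000, 40, size=300),
    rng.normal(1600, 40, size=200),
])

gm = GaussianMixture(n_components=2, random_state=0).fit(reads.reshape(-1, 1))
for mean, weight in sorted(zip(gm.means_.ravel(), gm.weights_.ravel())):
    print(f"candidate binding site at ~{mean:.0f} bp (weight {weight:.2f})")
```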
APA, Harvard, Vancouver, ISO, and other styles
11

Hoffmann, Steve. "Genome Informatics for High-Throughput Sequencing Data Analysis." Doctoral thesis, Universitätsbibliothek Leipzig, 2014. http://nbn-resolving.de/urn:nbn:de:bsz:15-qucosa-152643.

Full text
Abstract:
This thesis introduces three different algorithmic and statistical strategies for the analysis of high-throughput sequencing data. First, we introduce a heuristic method based on enhanced suffix arrays to map short sequences to larger reference genomes. The algorithm builds on the idea of an error-tolerant traversal of the suffix array for the reference genome in conjunction with the concept of matching statistics introduced by Chang and a bitvector based alignment algorithm proposed by Myers. The algorithm supports paired-end and mate-pair alignments and the implementation offers methods for primer detection, primer and poly-A trimming. In our own benchmarks as well as independent benchmarks this tool outcompetes other currently available tools with respect to sensitivity and specificity in simulated and real data sets for a large number of sequencing protocols. Second, we introduce a novel dynamic programming algorithm for the spliced alignment problem. The advantage of this algorithm is its capability to not only detect co-linear splice events, i.e. local splice events on the same genomic strand, but also circular and other non-collinear splice events. This succinct and simple algorithm handles all these cases at the same time with a high accuracy. While it is at par with other state-of-the-art methods for collinear splice events, it outcompetes other tools for many non-collinear splice events. The application of this method to publicly available sequencing data led to the identification of a novel isoform of the tumor suppressor gene p53. Since this gene is one of the best studied genes in the human genome, this finding is quite remarkable and suggests that the application of our algorithm could help to identify a plethora of novel isoforms and genes. Third, we present a data adaptive method to call single nucleotide variations (SNVs) from aligned high-throughput sequencing reads. We demonstrate that our method based on empirical log-likelihoods automatically adjusts to the quality of a sequencing experiment and thus renders a "decision" on when to call an SNV. In our simulations this method is at par with current state-of-the-art tools. Finally, we present biological results that have been obtained using the special features of the presented alignment algorithm.
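For intuition about the suffix-array search mentioned above, this minimal sketch builds a naive suffix array for a short reference and finds exact read matches by binary search; the error-tolerant traversal, matching statistics and bit-vector alignment used in the actual tool are omitted, and the reference string is invented.

```python
def build_suffix_array(text: str) -> list:
    """Naive O(n^2 log n) construction; fine for a short illustrative reference."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_exact(text: str, sa: list, read: str) -> list:
    """Return all positions where `read` occurs exactly, via binary search on the suffix array."""
    m = len(read)
    lo, hi = 0, len(sa)
    while lo < hi:                                   # lower bound: first suffix prefix >= read
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < read:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:                                   # upper bound: first suffix prefix > read
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= read:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])

reference = "ACGTACGGTACGTTACGT"
sa = build_suffix_array(reference)
print(find_exact(reference, sa, "ACGT"))   # exact hits of a short read in the toy reference
```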
APA, Harvard, Vancouver, ISO, and other styles
12

Stromberg, Michael Peter. "Enabling high-throughput sequencing data analysis with MOSAIK." Thesis, Boston College, 2010. http://hdl.handle.net/2345/1332.

Full text
Abstract:
Thesis advisor: Gabor T. Marth
During the last few years, numerous new sequencing technologies have emerged that require tools that can process large amounts of read data quickly and accurately. Regardless of the downstream methods used, reference-guided aligners are at the heart of all next-generation analysis studies. I have developed a general reference-guided aligner, MOSAIK, to support all current sequencing technologies (Roche 454, Illumina, Applied Biosystems SOLiD, Helicos, and Sanger capillary). The calibrated alignment qualities calculated by MOSAIK allow the user to fine-tune the alignment accuracy for a given study. MOSAIK is a highly configurable and easy-to-use suite of alignment tools that is used in hundreds of labs worldwide. MOSAIK is an integral part of our genetic variant discovery pipeline. From SNP and short-INDEL discovery to structural variation discovery, alignment accuracy is an essential requirement and enables our downstream analyses to provide accurate calls. In this thesis, I present three major studies that were formative during the development of MOSAIK and our analysis pipeline. In addition, I present a novel algorithm that identifies mobile element insertions (non-LTR retrotransposons) in the human genome using split-read alignments in MOSAIK. This algorithm has a low false discovery rate (4.4 %) and enabled our group to be the first to determine the number of mobile elements that differentially occur between any two individuals
Thesis (PhD) — Boston College, 2010
Submitted to: Boston College. Graduate School of Arts and Sciences
Discipline: Biology
APA, Harvard, Vancouver, ISO, and other styles
13

Xing, Zhengrong. "Poisson multiscale methods for high-throughput sequencing data." Thesis, The University of Chicago, 2016. http://pqdtopen.proquest.com/#viewpdf?dispub=10195268.

Full text
Abstract:

In this dissertation, we focus on the problem of analyzing data from high-throughput sequencing experiments. With the emergence of more capable hardware and more efficient software, these sequencing data provide information at an unprecedented resolution. However, statistical methods developed for such data rarely tackle the data at such high resolutions, and often make approximations that only hold under certain conditions.

We propose a model-based approach to dealing with such data, starting from a single sample. By taking into account the inherent structure present in such data, our model can accurately capture important genomic regions. We also present the model in such a way that makes it easily extensible to more complicated and biologically interesting scenarios.

Building upon the single-sample model, we then turn to the statistical question of detecting differences between multiple samples. Such questions often arise in the context of expression data, where much emphasis has been put on the problem of detecting differential expression between two groups. By extending the framework for a single sample to incorporate additional group covariates, our model provides a systematic approach to estimating and testing for such differences. We then apply our method to several empirical datasets, and discuss the potential for further applications to other biological tasks.

We also seek to address a different statistical question, where the goal here is to perform exploratory analysis to uncover hidden structure within the data. We incorporate the single-sample framework into a commonly used clustering scheme, and show that our enhanced clustering approach is superior to the original clustering approach in many ways. We then apply our clustering method to a few empirical datasets and discuss our findings.

Finally, we apply the shrinkage procedure used within the single-sample model to tackle a completely different statistical issue: nonparametric regression with heteroskedastic Gaussian noise. We propose an algorithm that accurately recovers both the mean and variance functions given a single set of observations, and demonstrate its advantages over state-of-the-art methods through extensive simulation studies.
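One compact way to picture the "multiscale" treatment of count data is the recursive binomial decomposition sketched below: the total count of each interval is split between its two halves, and the per-scale proportions become the quantities to smooth or compare between groups. This is a generic illustration with made-up counts, not the dissertation's estimator.

```python
def multiscale_split(counts):
    """Recursively record (total, left-half total) pairs for a count vector
    whose length is a power of two; a Poisson total then factorises into
    a binomial split at each scale."""
    splits = []

    def recurse(segment):
        if len(segment) == 1:
            return
        half = len(segment) // 2
        left, right = segment[:half], segment[half:]
        splits.append((sum(segment), sum(left)))   # left ~ Binomial(total, p) given the total
        recurse(left)
        recurse(right)

    recurse(list(counts))
    return splits

# Hypothetical read counts over 8 adjacent genomic bins.
counts = [3, 5, 12, 30, 28, 10, 4, 2]
for total, left in multiscale_split(counts):
    p = left / total if total else 0.5
    print(f"total={total:3d}  left={left:3d}  p={p:.2f}")
```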

APA, Harvard, Vancouver, ISO, and other styles
14

Wang, Yuanyuan (Marcia). "Statistical Methods for High Throughput Screening Drug Discovery Data." Thesis, University of Waterloo, 2005. http://hdl.handle.net/10012/1204.

Full text
Abstract:
High Throughput Screening (HTS) is used in drug discovery to screen large numbers of compounds against a biological target. Data on activity against the target are collected for a representative sample of compounds selected from a large library. The goal of drug discovery is to relate the activity of a compound to its chemical structure, which is quantified by various explanatory variables, and hence to identify further active compounds. Often, this application has a very unbalanced class distribution, with a rare active class.

Classification methods are commonly proposed as solutions to this problem. However, regarding drug discovery, researchers are more interested in ranking compounds by predicted activity than in the classification itself. This feature makes my approach distinct from common classification techniques.

In this thesis, two AIDS data sets from the National Cancer Institute (NCI) are mainly used. Local methods, namely K-nearest neighbours (KNN) and classification and regression trees (CART), perform very well on these data in comparison with linear/logistic regression, neural networks, and Multivariate Adaptive Regression Splines (MARS) models, which assume more smoothness. One reason for the superiority of local methods is the local behaviour of the data. Indeed, I argue that conventional classification criteria such as misclassification rate or deviance tend to select too small a tree or too large a value of k (the number of nearest neighbours). A more local model (bigger tree or smaller k) gives a better performance in terms of drug discovery.

Because off-the-shelf KNN works relatively well, this thesis takes this promising method and makes several novel modifications, which further improve its performance. The choice of k is optimized for each test point to be predicted. The empirically observed superiority of allowing k to vary is investigated. The nature of the problem, ranking of objects rather than estimating the probability of activity, enables the k-varying algorithm to stand out. Similarly, KNN combined with a kernel weight function (weighted KNN) is proposed and demonstrated to be superior to the regular KNN method.

High dimensionality of the explanatory variables is known to cause problems for KNN and many other classifiers. I propose a novel method (subset KNN) of averaging across multiple classifiers based on building classifiers on subspaces (subsets of variables). It improves the performance of KNN for HTS data. When applied to CART, it also performs as well as or even better than the popular methods of bagging and boosting. Part of this improvement is due to the discovery that classifiers based on irrelevant subspaces (unimportant explanatory variables) do little damage when averaged with good classifiers based on relevant subspaces (important variables). This result is particular to the ranking of objects rather than estimating the probability of activity. A theoretical justification is proposed. The thesis also suggests diagnostics for identifying important subsets of variables and hence further reducing the impact of the curse of dimensionality.

In order to have a broader evaluation of these methods, subset KNN and weighted KNN are applied to three other data sets: the NCI AIDS data with Constitutional descriptors, Mutagenicity data with BCUT descriptors and Mutagenicity data with Constitutional descriptors. The k-varying algorithm as a method for unbalanced data is also applied to NCI AIDS data with Constitutional descriptors. As a baseline, the performance of KNN on such data sets is reported. Although different methods are best for the different data sets, some of the proposed methods are always amongst the best.

Finally, methods are described for estimating activity rates and error rates in HTS data. By combining auxiliary information about repeat tests of the same compound, likelihood methods can extract interesting information about the magnitudes of the measurement errors made in the assay process. These estimates can be used to assess model performance, which sheds new light on how various models handle the large random or systematic assay errors often present in HTS data.
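To make the ranking-oriented KNN variants concrete, the sketch below scores compounds by a kernel-weighted average of their neighbours' activities and then averages those scores over random descriptor subsets; both are simplified stand-ins for the weighted-KNN and subset-KNN procedures described above, with every parameter value and the toy data chosen arbitrarily.

```python
import numpy as np

def weighted_knn_scores(X_train, y_train, X_test, k=25, bandwidth=5.0):
    """Rank test compounds by a Gaussian-kernel-weighted mean of the k nearest activities."""
    scores = []
    for x in X_test:
        d = np.linalg.norm(X_train - x, axis=1)
        idx = np.argsort(d)[:k]
        w = np.exp(-(d[idx] / bandwidth) ** 2)
        scores.append(np.sum(w * y_train[idx]) / np.sum(w))
    return np.array(scores)

def subset_knn_scores(X_train, y_train, X_test, n_subsets=20, subset_size=10, **kw):
    """Average weighted-KNN scores over random descriptor subsets (subset KNN)."""
    rng = np.random.default_rng(0)
    total = np.zeros(len(X_test))
    for _ in range(n_subsets):
        cols = rng.choice(X_train.shape[1], size=subset_size, replace=False)
        total += weighted_knn_scores(X_train[:, cols], y_train, X_test[:, cols], **kw)
    return total / n_subsets

# Hypothetical descriptor matrices: 500 training compounds, 50 descriptors, rare active class.
rng = np.random.default_rng(1)
X_tr, X_te = rng.normal(size=(500, 50)), rng.normal(size=(100, 50))
y_tr = (X_tr[:, 0] + rng.normal(scale=2, size=500) > 2.5).astype(float)
ranking = np.argsort(-subset_knn_scores(X_tr, y_tr, X_te))   # most promising compounds first
```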
APA, Harvard, Vancouver, ISO, and other styles
15

Yang, Yang. "Data mining support for high-throughput discovery of nanomaterials." Thesis, University of Leeds, 2011. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.577527.

Full text
Abstract:
Nanotechnology is becoming a promising technology due to its potential to dramatically improve the effectiveness of a number of existing consumer and industrial products, such as drug delivery systems, electronic circuits, catalysts and light-harvesting materials. However, the ability of industry and academia to accelerate the discovery of new nanomaterials is severely limited by the speed at which new compositions can be made and tested for suitable properties. A promising alternative approach for nanophotocatalyst discovery, currently under development at University College London (UCL) and University of Leeds, utilizes recent advances in nanomaterials synthesis and automation to implement a high-throughput (HT) experimental system to enable rapid exploration of materials space. The HT nanocatalyst discovery is an automated continuous process using hydrothermal synthesis which can synthesize a large number of nanoparticles in a short time. The nanoparticles formulated are characterized and tested on as many samples as possible for indicative properties rather than conducting comprehensive characterisation on each sample. This thesis describes the development of chemometric and inductive data-mining tools to support the HT nanomaterial discovery process. The work will be reported mainly in five parts, including a data management system structured to reflect the HT workflow, an information flow management system designed for correspondence between UCL and Leeds, a prototype data mining system tailored for HT experimental data processing and analysis as well as its application, a new Design of Experiments (DoE) approach using a genetic algorithm, which was proved to be able to handle variables effectively, and robust quantitative structure-activity relationship (QSAR) models with genetic parameter optimization for HT catalyst discovery. In contrast to the enormous benefits, nanoparticles also bring adverse effects to the biological environment and human health. Due to the diversity of nanoparticles, as well as the dependence of toxicity on the physico-chemical properties of nanoparticles, it is not possible to test every particle. A rational way without testing every single nanoparticle and its variants is to relate the physicochemical characteristics of nanoparticles with their toxicity in a QSAR model. This thesis also researches the measurement methods of structural and composition properties including size distribution, surface area, morphological parameters and assessment methods for comparative toxicity of the nanoparticles in a panel of 18 nanoparticles which were chosen by the University of Edinburgh, UBI Institute and University of Leeds. The major contribution of this work is in using data mining methods to analyze toxicity and measured structural and composition properties to find the relationships between the toxicity and physico-chemical properties of nanoparticles. Structure-activity relationship (SAR) analysis focused on identifying the possible structural and compositional properties that determine the cytotoxicity of nanoparticles.
APA, Harvard, Vancouver, ISO, and other styles
16

Birchall, Kristian. "Reduced graph approaches to analysing high-throughput screening data." Thesis, University of Sheffield, 2007. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.443869.

Full text
APA, Harvard, Vancouver, ISO, and other styles
17

Kannan, Anusha Aiyalu. "Detecting relevant changes in high throughput gene expression data /." Online version of thesis, 2008. http://hdl.handle.net/1850/10832.

Full text
APA, Harvard, Vancouver, ISO, and other styles
18

Fritz, Markus Hsi-Yang. "Exploiting high throughput DNA sequencing data for genomic analysis." Thesis, University of Cambridge, 2012. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.610819.

Full text
APA, Harvard, Vancouver, ISO, and other styles
19

Woolford, Julie Ruth. "Statistical analysis of small RNA high-throughput sequencing data." Thesis, University of Cambridge, 2012. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.610375.

Full text
APA, Harvard, Vancouver, ISO, and other styles
20

Chen, Li. "Integrative Modeling and Analysis of High-throughput Biological Data." Diss., Virginia Tech, 2010. http://hdl.handle.net/10919/30192.

Full text
Abstract:
Computational biology is an interdisciplinary field that focuses on developing mathematical models and algorithms to interpret biological data so as to understand biological problems. With current high-throughput technology development, different types of biological data can be measured on a large scale, which calls for more sophisticated computational methods to analyze and interpret the data. In this dissertation research work, we propose novel methods to integrate, model and analyze multiple biological data, including microarray gene expression data, protein-DNA interaction data and protein-protein interaction data. These methods will help improve our understanding of biological systems. First, we propose a knowledge-guided multi-scale independent component analysis (ICA) method for biomarker identification on time course microarray data. Guided by a knowledge gene pool related to a specific disease under study, the method can determine disease relevant biological components from ICA modes and then identify biologically meaningful markers related to the specific disease. We have applied the proposed method to yeast cell cycle microarray data and Rsf-1-induced ovarian cancer microarray data. The results show that our knowledge-guided ICA approach can extract biologically meaningful regulatory modes and outperform several baseline methods for biomarker identification. Second, we propose a novel method for transcriptional regulatory network identification by integrating gene expression data and protein-DNA binding data. The approach is built upon a multi-level analysis strategy designed for suppressing false positive predictions. With this strategy, a regulatory module becomes increasingly significant as more relevant gene sets are formed at finer levels. At each level, a two-stage support vector regression (SVR) method is utilized to reduce false positive predictions by integrating binding motif information and gene expression data; a significance analysis procedure is followed to assess the significance of each regulatory module. The resulting performance on simulation data and yeast cell cycle data shows that the multi-level SVR approach outperforms other existing methods in the identification of both regulators and their target genes. We have further applied the proposed method to breast cancer cell line data to identify condition-specific regulatory modules associated with estrogen treatment. Experimental results show that our method can identify biologically meaningful regulatory modules related to estrogen signaling and action in breast cancer. Third, we propose a bootstrapping Markov Random Field (MRF)-based method for subnetwork identification on microarray data by incorporating protein-protein interaction data. Methodologically, an MRF-based network score is first derived by considering the dependency among genes to increase the chance of selecting hub genes. A modified simulated annealing search algorithm is then utilized to find the optimal/suboptimal subnetworks with maximal network score. A bootstrapping scheme is finally implemented to generate confident subnetworks. Experimentally, we have compared the proposed method with other existing methods, and the resulting performance on simulation data shows that the bootstrapping MRF-based method outperforms other methods in identifying ground truth subnetwork and hub genes. We have then applied our method to breast cancer data to identify significant subnetworks associated with drug resistance.
The identified subnetworks not only show good reproducibility across different data sets, but indicate several pathways and biological functions potentially associated with the development of breast cancer and drug resistance. In addition, we propose to develop network-constrained support vector machines (SVM) for cancer classification and prediction, by taking into account the network structure to construct classification hyperplanes. The simulation study demonstrates the effectiveness of our proposed method. The study on the real microarray data sets shows that our network-constrained SVM, together with the bootstrapping MRF-based subnetwork identification approach, can achieve better classification performance compared with conventional biomarker selection approaches and SVMs. We believe that the research presented in this dissertation not only provides novel and effective methods to model and analyze different types of biological data, the extensive experiments on several real microarray data sets and results also show the potential to improve the understanding of biological mechanisms related to cancers by generating novel hypotheses for further study.
Ph. D.
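A toy version of the subnetwork search described above might look as follows: a simulated-annealing walk over connected node sets that maximises a simple size-adjusted score. The graph, the per-gene scores and the scoring function are all invented and stand in for the thesis's MRF-based network score and real protein-protein interaction data.

```python
import math
import random

# Toy interaction graph and per-gene differential-expression scores (all values invented).
edges = {("A", "B"), ("B", "C"), ("C", "D"), ("B", "E"), ("E", "F"), ("D", "G"), ("F", "G")}
score = {"A": 0.2, "B": 1.5, "C": 1.1, "D": 0.1, "E": 1.3, "F": 0.9, "G": 0.05}
neighbours = {n: {b for a, b in edges if a == n} | {a for a, b in edges if b == n} for n in score}

def connected(nodes):
    if not nodes:
        return False
    seen, stack = set(), [next(iter(nodes))]
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(neighbours[n] & nodes)
    return seen == nodes

def network_score(nodes):
    return sum(score[n] for n in nodes) / math.sqrt(len(nodes))   # simple size-adjusted score

random.seed(0)
current = {"B"}
best, best_score = set(current), network_score(current)
for step in range(2000):                       # simulated annealing over add/remove moves
    temp = 1.0 * (1 - step / 2000) + 1e-3
    candidate = set(current) ^ {random.choice(sorted(score))}   # toggle one gene in or out
    if not connected(candidate):
        continue
    delta = network_score(candidate) - network_score(current)
    if delta > 0 or random.random() < math.exp(delta / temp):
        current = candidate
        if network_score(current) > best_score:
            best, best_score = set(current), network_score(current)
print(sorted(best), round(best_score, 2))
```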
APA, Harvard, Vancouver, ISO, and other styles
21

Lu, Feng. "Big data scalability for high throughput processing and analysis of vehicle engineering data." Thesis, KTH, Skolan för informations- och kommunikationsteknik (ICT), 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-207084.

Full text
Abstract:
"Sympathy for Data" is a platform that is utilized for Big Data automation analytics. It is based on visual interface and workflow configurations. The main purpose of the platform is to reuse parts of code for structured analysis of vehicle engineering data. However, there are some performance issues on a single machine for processing a large amount of data in Sympathy for Data. There are also disk and CPU IO intensive issues when the data is oversized and the platform need fits comfortably in memory. In addition, for data over the TB or PB level, the Sympathy for data needs separate functionality for efficient processing simultaneously and scalable for distributed computation functionality. This paper focuses on exploring the possibilities and limitations in using the Sympathy for Data platform in various data analytic scenarios within the Volvo Cars vision and strategy. This project re-writes the CDE workflow for over 300 nodes into pure Python script code and make it executable on the Apache Spark and Dask infrastructure. We explore and compare both distributed computing frameworks implemented on Amazon Web Service EC2 used for 4 machine with a 4x type for distributed cluster measurement. However, the benchmark results show that Spark is superior to Dask from performance perspective. Apache Spark and Dask will combine with Sympathy for Data products for a Big Data processing engine to optimize the system disk and CPU IO utilization. There are several challenges when using Spark and Dask to analyze large-scale scientific data on systems. For instance, parallel file systems are shared among all computing machines, in contrast to shared-nothing architectures. Moreover, accessing data stored in commonly used scientific data formats, such as HDF5 is not tentatively supported in Spark. This report presents research carried out on the next generation of Big Data platforms in the automotive industry called "Sympathy for Data". The research questions focusing on improving the I/O performance and scalable distributed function to promote Big Data analytics. During this project, we used the Dask.Array parallelism features for interpretation the data sources as a raster shows in table format, and Apache Spark used as data processing engine for parallelism to load data sources to memory for improving the big data computation capacity. The experiments chapter will demonstrate 640GB of engineering data benchmark for single node and distributed computation mode to evaluate the Sympathy for Data Disk CPU and memory metrics. Finally, the outcome of this project improved the six times performance of the original Sympathy for data by developing a middleware SparkImporter. It is used in Sympathy for Data for distributed computation and connected to the Apache Spark for data processing through the maximum utilization of the system resources. This improves its throughput, scalability, and performance. It also increases the capacity of the Sympathy for data to process Big Data and avoids big data cluster infrastructures.
APA, Harvard, Vancouver, ISO, and other styles
22

Kircher, Martin. "Understanding and improving high-throughput sequencing data production and analysis." Doctoral thesis, Universitätsbibliothek Leipzig, 2011. http://nbn-resolving.de/urn:nbn:de:bsz:15-qucosa-71102.

Full text
Abstract:
Advances in DNA sequencing revolutionized the field of genomics over the last 5 years. New sequencing instruments make it possible to rapidly generate large amounts of sequence data at substantially lower cost. These high-throughput sequencing technologies (e.g. Roche 454 FLX, Life Technology SOLiD, Dover Polonator, Helicos HeliScope and Illumina Genome Analyzer) make whole genome sequencing and resequencing, transcript sequencing as well as quantification of gene expression, DNA-protein interactions and DNA methylation feasible at an unanticipated scale. In the field of evolutionary genomics, high-throughput sequencing permitted studies of whole genomes from ancient specimens of different hominin groups. Further, it allowed large-scale population genetics studies of present-day humans as well as different types of sequence-based comparative genomics studies in primates. Such comparisons of humans with closely related apes and hominins are important not only to better understand human origins and the biological background of what sets humans apart from other organisms, but also for understanding the molecular basis for diseases and disorders, particularly those that affect uniquely human traits, such as speech disorders, autism or schizophrenia. However, while the cost and time required to create comparative data sets have been greatly reduced, the error profiles and limitations of the new platforms differ significantly from those of previous approaches. This requires a specific experimental design in order to circumvent these issues, or to handle them during data analysis. During the course of my PhD, I analyzed and improved current protocols and algorithms for next generation sequencing data, taking into account the specific characteristics of these new sequencing technologies. The presented approaches and algorithms were applied in different projects and are widely used within the department of Evolutionary Genetics at the Max Planck Institute of Evolutionary Anthropology. In this thesis, I will present selected analyses from the whole genome shotgun sequencing of two ancient hominins and the quantification of gene expression from short-sequence tags in five tissues from three primates.
APA, Harvard, Vancouver, ISO, and other styles
23

Gustafsson, Mika. "Gene networks from high-throughput data : Reverse engineering and analysis." Doctoral thesis, Linköpings universitet, Kommunikations- och transportsystem, 2010. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-54089.

Full text
Abstract:
Experimental innovations starting in the 1990s, leading to the advent of high-throughput experiments in cellular biology, have made it possible to measure thousands of genes simultaneously at a modest cost. This enables the discovery of new, unexpected relationships between genes, in addition to the possibility of falsifying existing ones. To benefit as much as possible from these experiments, the new interdisciplinary research field of systems biology has materialized. Systems biology goes beyond the conventional reductionist approach and aims at learning the whole system under the assumption that the system is greater than the sum of its parts. One emerging enterprise in systems biology is to use the high-throughput data to reverse engineer the web of gene regulatory interactions governing the cellular dynamics. This relatively new endeavor goes further than clustering genes with similar expression patterns and requires separating the cause of gene expression from the effect. Despite the rapid data increase, we then face the problem of having too few experiments to determine which regulations are active, as the number of putative interactions has increased dramatically with the number of units in the system. One possibility to overcome this problem is to impose more biologically motivated constraints. However, what is a biological fact or not is often not obvious and may be condition dependent. Moreover, investigations have suggested several statistical facts about gene regulatory networks, which motivate the development of new reverse engineering algorithms relying on different model assumptions. As a result, numerous new reverse engineering algorithms for gene regulatory networks have been proposed. As a consequence, interest has grown in the community in assessing the performance of different attempts in fair trials on "real" biological problems. This resulted in the annually held DREAM conference, which contains computational challenges that can be solved by the participating researchers directly and are evaluated by the chairs of the conference after the submission deadline. This thesis describes the evolution of regularization schemes to reverse engineer gene networks from high-throughput data within the framework of ordinary differential equations. Furthermore, to understand gene networks, a substantial part of it also concerns statistical analysis of gene networks. First, we reverse engineer a genome-wide regulatory network based solely on microarray data, utilizing an extremely simple strategy assuming sparseness (LASSO). To validate and analyze this network we also develop some statistical tools. Then we present a refinement of the initial strategy, which is the algorithm with which we achieved best performer at the DREAM2 conference. This strategy is further refined into a reverse engineering scheme which can also include external high-throughput data, which we confirm to be of relevance as we achieved best performer in the DREAM3 conference as well. Finally, the tools we developed to analyze stability and flexibility in linearized ordinary differential equations representing gene regulatory networks are further discussed.
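The LASSO-based reverse-engineering strategy can be sketched very simply: each gene's expression change is regressed on all genes with an L1 penalty, and the non-zero coefficients are read off as putative regulatory edges. This bare skeleton ignores the penalty tuning and the integration of external data that the thesis actually contributes; the toy time course and the penalty value below are arbitrary.

```python
import numpy as np
from sklearn.linear_model import Lasso

def infer_network(expr, alpha=0.05):
    """expr: (time points x genes) matrix. Returns a (genes x genes) matrix of
    putative regulatory weights, with row = target gene, column = regulator."""
    n_genes = expr.shape[1]
    dX = np.diff(expr, axis=0)            # expression change between consecutive time points
    X = expr[:-1, :]                      # regulator levels at the earlier time point
    W = np.zeros((n_genes, n_genes))
    for target in range(n_genes):
        model = Lasso(alpha=alpha, max_iter=10_000).fit(X, dX[:, target])
        W[target, :] = model.coef_        # sparse: most entries shrink exactly to zero
    return W

# Hypothetical expression time course: 30 time points, 40 genes.
rng = np.random.default_rng(3)
expr = rng.normal(size=(30, 40)).cumsum(axis=0)
W = infer_network(expr)
print("number of inferred edges:", np.count_nonzero(W))
```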
APA, Harvard, Vancouver, ISO, and other styles
24

Zandegiacomo, Cella Alice. "Multiplex network analysis with application to biological high-throughput data." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2016. http://amslaurea.unibo.it/10495/.

Full text
Abstract:
This thesis studies several characteristics of multiplex networks; in particular, the analysis focuses on quantifying the differences between the layers of a multiplex. Dissimilarities are evaluated both by observing the connections of individual nodes in different layers and by estimating the different partitions of the layers. Several important measures for the characterization of multiplexes are then introduced and subsequently used to construct community detection methods. The quantification of the differences between the partitions of two layers is estimated using a mutual information measure. The use of the hypergeometric test for identifying over-represented nodes in a layer is also examined in depth, showing the effectiveness of the test as a function of the similarity of the layers. These methods for characterizing the properties of multiplex networks are applied to real biological data. The data were collected by the DILGOM study with the aim of determining the genetic, transcriptomic and metabolic implications of obesity and metabolic syndrome. These data are used by the Mimomics project to determine relationships between different omics. In the thesis, the metabolic data are analyzed using a multiplex network approach to verify the presence of differences between the relationships of blood compounds in obese and normal-weight individuals.
APA, Harvard, Vancouver, ISO, and other styles
25

Bleuler, Stefan. "Search heuristics for module identification from biological high-throughput data /." [S.l.] : [s.n.], 2008. http://e-collection.ethbib.ethz.ch/show?type=diss&nr=17386.

Full text
APA, Harvard, Vancouver, ISO, and other styles
26

Cunningham, Gordon John. "Application of cluster analysis to high-throughput multiple data types." Thesis, University of Glasgow, 2011. http://theses.gla.ac.uk/2715/.

Full text
Abstract:
PolySNAP is a program used for analysis of high-throughput powder diffraction data. The program matches diffraction patterns using Pearson and Spearman correlation coefficients to measure the similarity of the profile of each pattern with every other pattern, which creates a correlation matrix. This correlation matrix is then used to partition the patterns into groups using a variety of cluster analysis methods. The original version could not handle any data types other than powder X-ray diffraction. The aim of this project was to expand the methods used in PolySNAP to allow it to analyse other data types, in particular Raman spectroscopy, differential scanning calorimetry and infrared spectroscopy data. This involves the preparation of suitable compounds which can be analysed using these techniques. The main compounds studied are sulfathiazole, carbamazepine and piroxicam. Some additional studies have been carried out on other datasets, including analysis of an unseen dataset to test the efficacy of the methods. The optimal method for clustering any unknown dataset has also been determined.
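To make the matching-and-clustering idea concrete, the following sketch (Python with NumPy/SciPy on simulated profiles, not PolySNAP itself) averages Pearson and Spearman correlations into a similarity matrix, converts it to a distance, and partitions the patterns with hierarchical clustering; the profile data and the choice of three clusters are assumptions made purely for illustration.

# Illustrative correlation-based clustering of 1-D profiles (e.g. diffraction
# patterns or spectra); simulated data, not the PolySNAP implementation.
import numpy as np
from scipy.stats import pearsonr, spearmanr
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(1)
profiles = rng.random((10, 500))                      # 10 simulated patterns, 500 points each

n = len(profiles)
similarity = np.eye(n)
for i in range(n):
    for j in range(i + 1, n):
        r_pearson, _ = pearsonr(profiles[i], profiles[j])    # linear similarity
        r_spearman, _ = spearmanr(profiles[i], profiles[j])  # rank-based similarity
        similarity[i, j] = similarity[j, i] = 0.5 * (r_pearson + r_spearman)

distance = 1.0 - similarity                           # turn the similarity matrix into distances
np.fill_diagonal(distance, 0.0)
labels = fcluster(linkage(squareform(distance), method="average"), t=3, criterion="maxclust")
print(labels)                                         # cluster label assigned to each pattern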
APA, Harvard, Vancouver, ISO, and other styles
27

Ainsworth, David. "Computational approaches for metagenomic analysis of high-throughput sequencing data." Thesis, Imperial College London, 2016. http://hdl.handle.net/10044/1/44070.

Full text
Abstract:
High-throughput DNA sequencing has revolutionised microbiology and is the foundation on which the nascent field of metagenomics has been built. This ability to cheaply sample billions of DNA reads directly from environments has democratised sequencing and allowed researchers to gain unprecedented insights into diverse microbial communities. These technologies, however, are not without their limitations: the short length of the reads requires the production of vast amounts of data to ensure all information is captured. This 'data deluge' has been a major bottleneck and has necessitated the development of new algorithms for analysis. Sequence alignment methods provide the most information about the composition of a sample as they allow both taxonomic and functional classification, but the algorithms are prohibitively slow. This inefficiency has led to the reliance on faster algorithms which only produce simple taxonomic classification or abundance estimation, losing the valuable information given by full alignments against annotated genomes. This thesis will describe k-SLAM, a novel ultra-fast method for the alignment and taxonomic classification of metagenomic data. Using a k-mer based method, k-SLAM achieves speeds three orders of magnitude faster than current alignment-based approaches, allowing full taxonomic classification and gene identification to be tractable on modern large datasets. The alignments found by k-SLAM can also be used to find variants and identify genes, along with their nearest taxonomic origins. A novel pseudo-assembly method produces more specific taxonomic classifications on species which have high sequence identity within their genus. This provides a significant (up to 40%) increase in accuracy on these species. Also described is a re-analysis of a Shiga-toxin-producing E. coli O104:H4 isolate via alignment against bacterial and viral species to find antibiotic resistance and toxin-producing genes. k-SLAM has been used by a range of research projects including FLORINASH and is currently being used by a number of groups.
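The k-mer idea at the heart of such classifiers can be sketched in a few lines: index the k-mers of annotated reference sequences, then assign each read to the taxon whose k-mers it shares most often. The toy Python example below uses made-up sequences and an unrealistically small k; it is not the k-SLAM algorithm, which additionally produces alignments and resolves ambiguous assignments.

# Toy k-mer classifier: index reference k-mers, then let each read vote.
# Illustrative only; real tools also handle reverse complements, ambiguity
# and lowest-common-ancestor resolution.
from collections import Counter, defaultdict

K = 5  # unrealistically small k, chosen only to keep the toy example short

references = {
    "taxon_A": "ACGTACGTGGCTAGCTAGGATCCA",
    "taxon_B": "TTGACCAGTACGGATCGATCGTTA",
}

def kmers(seq, k=K):
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

index = defaultdict(set)
for taxon, seq in references.items():
    for km in kmers(seq):
        index[km].add(taxon)

def classify(read):
    votes = Counter()
    for km in kmers(read):
        for taxon in index.get(km, ()):
            votes[taxon] += 1                # one vote per shared k-mer
    return votes.most_common(1)[0][0] if votes else "unclassified"

print(classify("ACGTACGTGGCTA"))             # expected: taxon_A
print(classify("CCCCCCCCCCCCC"))             # expected: unclassified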
APA, Harvard, Vancouver, ISO, and other styles
28

Yu, Haipeng. "Designing and modeling high-throughput phenotyping data in quantitative genetics." Diss., Virginia Tech, 2020. http://hdl.handle.net/10919/97579.

Full text
Abstract:
Quantitative genetics aims to bridge the genome-to-phenome gap. The advent of high-throughput genotyping technologies has accelerated the progress of genome-to-phenome mapping, but a challenge remains in phenotyping. Various high-throughput phenotyping (HTP) platforms have been developed recently to obtain economically important phenotypes in an automated fashion with less human labor and reduced costs. However, effective ways of designing HTP experiments have not been investigated thoroughly. In addition, high-dimensional HTP data pose a major challenge for statistical analysis by increasing computational demands. A new strategy for modeling high-dimensional HTP data and elucidating the interrelationships among these phenotypes is needed. Previous studies used pedigree-based connectedness statistics to study the design of phenotyping. The availability of genetic markers provides a new opportunity to evaluate connectedness based on genomic data, which can serve as a means to design HTP. This dissertation first discusses the utility of connectedness across three studies. In the first study, I introduced genomic connectedness and compared it with traditional pedigree-based connectedness. The relationship between genomic connectedness and prediction accuracy based on cross-validation was investigated in the second study. The third study introduced a user-friendly connectedness R package, which provides a suite of functions to evaluate the extent of connectedness. In the last study, I proposed a new statistical approach to model high-dimensional HTP data by leveraging the combination of confirmatory factor analysis and Bayesian networks. Collectively, the results from the first three studies suggested the potential usefulness of applying genomic connectedness to design HTP. The statistical approach I introduced in the last study provides a new avenue to model high-dimensional HTP data holistically and to further help us understand the interrelationships among phenotypes derived from HTP.
Doctor of Philosophy
Quantitative genetics aims to bridge the genome to phenome gap. With the advent of genotyping technologies, the genomic information of individuals can be included in a quantitative genetic model. A new challenge is to obtain sufficient and accurate phenotypes in an automated fashion with less human labor and reduced costs. The high-throughput phenotyping (HTP) technologies have emerged recently, opening a new opportunity to address this challenge. However, there is a paucity of research in phenotyping design and modeling high-dimensional HTP data. The main themes of this dissertation are 1) genomic connectedness that could potentially be used as a means to design a phenotyping experiment and 2) a novel statistical approach that aims to handle high-dimensional HTP data. In the first three studies, I first compared genomic connectedness with pedigree-based connectedness. This was followed by investigating the relationship between genomic connectedness and prediction accuracy derived from cross-validation. Additionally, I developed a connectedness R package that implements a variety of connectedness measures. The fourth study investigated a novel statistical approach by leveraging the combination of dimension reduction and graphical models to understand the interrelationships among high-dimensional HTP data.
APA, Harvard, Vancouver, ISO, and other styles
29

Lasher, Christopher Donald. "Discovering contextual connections between biological processes using high-throughput data." Diss., Virginia Tech, 2011. http://hdl.handle.net/10919/77217.

Full text
Abstract:
Hearkening to calls from life scientists for aid in interpreting rapidly-growing repositories of data, the fields of bioinformatics and computational systems biology continue to bear increasingly sophisticated methods capable of summarizing and distilling pertinent phenomena captured by high-throughput experiments. Techniques in analysis of genome-wide gene expression (e.g., microarray) data, for example, have moved beyond simply detecting individual genes perturbed in treatment-control experiments to reporting the collective perturbation of biologically-related collections of genes, or "processes". Recent expression analysis methods have focused on improving comprehensibility of results by reporting concise, non-redundant sets of processes, leveraging statistical modeling techniques such as Bayesian networks. Simultaneously, integrating gene expression measurements with gene interaction networks has led to computation of response networks--subgraphs of interaction networks in which genes exhibit strong collective perturbation or co-expression. Methods that integrate process annotations of genes with interaction networks identify high-level connections between biological processes themselves. To identify context-specific changes in these inter-process connections, however, new techniques proved necessary: process-based expression analysis reports only perturbed processes and not their relationships; response networks are composed of interactions between genes rather than processes; and existing techniques for detecting inter-process connections do not incorporate specific biological context. We present two novel methods which take inspiration from the latest techniques in process-based gene expression analysis, computation of response networks, and computation of inter-process connections. We motivate the need for detecting inter-process connections by identifying a collection of processes exhibiting significant differences in collective expression in two liver tissue culture systems widely used in toxicological and pharmaceutical assays. Next, we identify perturbed connections between these processes via a novel method that integrates gene expression, interaction, and annotation data. Finally, we present another novel method that computes non-redundant sets of perturbed inter-process connections, and apply it to several additional liver-related data sets. These applications demonstrate the ability of our methods to capture and report biologically relevant high-level trends.
Ph. D.
APA, Harvard, Vancouver, ISO, and other styles
30

Mohamadi, Hamid. "Parallel algorithms and software tools for high-throughput sequencing data." Thesis, University of British Columbia, 2017. http://hdl.handle.net/2429/62072.

Full text
Abstract:
With the growing throughput and dropping cost of High-Throughput Sequencing (HTS) technologies, there is a continued need to develop faster and more cost-effective bioinformatics solutions. However, the algorithms and computational power required to efficiently analyze HTS data have lagged considerably. In health and life sciences research organizations, de novo assembly and sequence alignment have become two key steps in everyday research and analysis. The de novo assembly process is a fundamental step in analyzing previously uncharacterized organisms and is one of the most computationally demanding problems in bioinformatics. Sequence alignment is a fundamental operation in a broad spectrum of genomics projects. In genome resequencing projects, alignments are often used prior to variant calling. In transcriptome resequencing, they provide information on gene expression. They are even used in de novo sequencing projects to help contiguate assembled sequences. As such, designing efficient, scalable, and accurate solutions for the de novo assembly and sequence alignment problems would have a broad impact on the field. In this thesis, I present a collection of novel algorithms and software tools for the analysis of high-throughput sequencing data using efficient data structures. I also utilize the latest advances in parallel and distributed computing to design and develop scalable and cost-effective algorithms on High-Performance Computing (HPC) infrastructures, especially for the de novo assembly and sequence alignment problems. The algorithms and software solutions I develop are publicly available free of charge for academic use, to facilitate research at health and life sciences laboratories and other organizations worldwide.
Science, Faculty of
Graduate
APA, Harvard, Vancouver, ISO, and other styles
31

Mammana, Alessandro [Verfasser]. "Patterns and algorithms in high-throughput sequencing count data / Alessandro Mammana." Berlin : Freie Universität Berlin, 2016. http://d-nb.info/1108270956/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
32

Love, Michael I. [Verfasser]. "Statistical analysis of high-throughput sequencing count data / Michael I. Love." Berlin : Freie Universität Berlin, 2013. http://d-nb.info/1043197842/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
33

Kostadinov, Ivaylo [Verfasser]. "Marine Metagenomics: From high-throughput data to ecogenomic interpretation / Ivaylo Kostadinov." Bremen : IRC-Library, Information Resource Center der Jacobs University Bremen, 2012. http://d-nb.info/1035211564/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
34

Hänzelmann, Sonja 1981. "Pathway-centric approaches to the analysis of high-throughput genomics data." Doctoral thesis, Universitat Pompeu Fabra, 2012. http://hdl.handle.net/10803/108337.

Full text
Abstract:
In the last decade, molecular biology has expanded from a reductionist view to a systems-wide view that tries to unravel the complex interactions of cellular components. Owing to the emergence of high-throughput technology, it is now possible to interrogate entire genomes at an unprecedented resolution. The dimension and unstructured nature of these data made it evident that new methodologies and tools are needed to turn data into biological knowledge. To contribute to this challenge, we exploited the wealth of publicly available high-throughput genomics data and developed bioinformatics methodologies focused on extracting information at the pathway rather than the single-gene level. First, we developed Gene Set Variation Analysis (GSVA), a method that facilitates the organization and condensation of gene expression profiles into gene sets. GSVA enables pathway-centric downstream analyses of microarray and RNA-seq gene expression data. The method estimates sample-wise pathway variation over a population and allows for the integration of heterogeneous biological data sources with pathway-level expression measurements. To illustrate the features of GSVA, we applied it to several use cases employing different data types and addressing biological questions. GSVA is made available as an R package within the Bioconductor project. Secondly, we developed a pathway-centric genome-based strategy to reposition drugs in type 2 diabetes (T2D). This strategy consists of two steps: first, a regulatory network is constructed and used to identify disease-driving modules; then these modules are searched for compounds that might target them. Our strategy is motivated by the observation that disease genes tend to group together in the same neighborhood, forming disease modules, and that multiple genes might have to be targeted simultaneously to attain an effect on the pathophenotype. To find potential compounds, we used compound-exposed genomics data deposited in public databases. We collected about 20,000 samples that have been exposed to about 1,800 compounds. Gene expression can be seen as an intermediate phenotype reflecting underlying dysregulated pathways in a disease. Hence, genes contained in the disease modules that elicit similar transcriptional responses upon compound exposure are assumed to have a potential therapeutic effect. We applied the strategy to gene expression data of human islets from diabetic and healthy individuals and identified four potential compounds, methimazole, pantoprazole, bitter orange extract and torcetrapib, that might have a positive effect on insulin secretion. This is the first time a regulatory network of human islets has been used to reposition compounds for T2D. In conclusion, this thesis contributes two pathway-centric approaches to important bioinformatic problems, namely the assessment of biological function and in silico drug repositioning. These contributions demonstrate the central role of pathway-based analyses in interpreting high-throughput genomics data.
APA, Harvard, Vancouver, ISO, and other styles
35

Swamy, Sajani. "The automation of glycopeptide discovery in high throughput MS/MS data." Thesis, Waterloo, Ont. : University of Waterloo, 2004. http://etd.uwaterloo.ca/etd/sswamy2004.pdf.

Full text
Abstract:
Thesis (MMath)--University of Waterloo, 2004.
"A thesis presented to the University of Waterloo in fulfilment of the thesis requirement for the degree of Master of Mathematics in Computer Science." Includes bibliographical references.
APA, Harvard, Vancouver, ISO, and other styles
36

Ballinger, Tracy J. "Analysis of genomic rearrangements in cancer from high throughput sequencing data." Thesis, University of California, Santa Cruz, 2015. http://pqdtopen.proquest.com/#viewpdf?dispub=3729995.

Full text
Abstract:

In the last century cancer has become increasingly prevalent and is the second largest killer in the United States, estimated to afflict 1 in 4 people during their life. Despite our long history with cancer and our herculean efforts to thwart the disease, in many cases we still do not understand the underlying causes or have successful treatments. In my graduate work, I’ve developed two approaches to the study of cancer genomics and applied them to the whole genome sequencing data of cancer patients from The Cancer Genome Atlas (TCGA). In collaboration with Dr. Ewing, I built a pipeline to detect retrotransposon insertions from paired-end high-throughput sequencing data and found somatic retrotransposon insertions in a fifth of cancer patients.

My second novel contribution to the study of cancer genomics is the development of the CN-AVG pipeline, a method for reconstructing the evolutionary history of a single tumor by predicting the order of structural mutations such as deletions, duplications, and inversions. The CN-AVG theory was developed by Drs. Haussler, Zerbino, and Paten; the method samples potential evolutionary histories for a tumor using Markov Chain Monte Carlo sampling. I contributed to the development of this method by testing its accuracy and limitations on simulated evolutionary histories. I found that the ability to reconstruct a history decays exponentially with increased breakpoint reuse, but that we can estimate how accurately we reconstruct a mutation event using the likelihood scores of the events. I further designed novel techniques for the application of CN-AVG to whole-genome sequencing data from actual patients and applied these techniques to search for evolutionary patterns in glioblastoma multiforme using sequencing data from TCGA. My results show patterns of two-hit deletions, as we would expect, and amplifications occurring over several mutational events. I also find that the CN-AVG method frequently makes use of whole-chromosome copy number changes followed by localized deletions, a bias that could be mitigated by modifying the cost function for an evolutionary history.

APA, Harvard, Vancouver, ISO, and other styles
37

Bao, Suying, and 鲍素莹. "Deciphering the mechanisms of genetic disorders by high throughput genomic data." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2013. http://hdl.handle.net/10722/196471.

Full text
Abstract:
A new generation of non-Sanger-based sequencing technologies, so-called “next-generation” sequencing (NGS), has been changing the landscape of genetics at unprecedented speed. In particular, our capacity for deciphering the genotypes underlying phenotypes, such as diseases, has never been greater. However, before fully applying NGS in medical genetics, researchers have to bridge the widening gap between the generation of massively parallel sequencing output and the capacity to analyze the resulting data. In addition, even when a list of candidate genes with potential causal variants can be obtained from an effective NGS analysis, pinpointing disease genes from the long list remains a challenge. The issue becomes especially difficult when the molecular basis of the disease is not fully elucidated. New NGS users are often bewildered by a plethora of options in mapping, assembly, variant calling and filtering programs and may have no idea how to compare these tools and choose the “right” ones. To get an overview of various bioinformatics attempts at mapping and assembly, a series of performance evaluations was conducted using both real and simulated NGS short reads. For NGS variant detection, the performance of the two most widely used toolkits was assessed, namely SAMtools and GATK. Based on the results of systematic evaluation, an NGS data processing and analysis pipeline was constructed. This pipeline proved a success with the identification of a mutation (a frameshift deletion on Hnrnpa1, p.Leu181Valfs*6) related to congenital heart defect (CHD) in procollagen type IIA deficient mice. In order to prioritize risk genes for diseases, especially those with limited prior knowledge, a network-based gene prioritization model was constructed. It consists of two parts: network analysis on known disease genes (seed-based network strategy) and network analysis on differential expression (DE-based network strategy). Case studies of various complex diseases/traits demonstrated that the DE-based network strategy can greatly outperform traditional gene expression analysis in predicting disease-causing genes. A series of simulations indicated that the DE-based strategy is especially meaningful for diseases with limited prior knowledge, and the model’s performance can be further advanced by integrating it with the seed-based network strategy. Moreover, a successful application of the network-based gene prioritization model in an influenza host genetics study further demonstrated the capacity of the model to identify promising candidates and to mine new risk genes and pathways not biased toward our current knowledge. In conclusion, an efficient NGS analysis framework, from the steps of quality control and variant detection to those of result analysis and gene prioritization, has been constructed for medical genetics. The novelty in this framework is an encouraging attempt to prioritize risk genes for poorly characterized diseases by network analysis on known disease genes and differential expression data. The successful applications in detecting genetic factors associated with CHD and influenza host resistance demonstrated the efficacy of this framework. This may further stimulate more applications of high-throughput genomic data in dissecting the genetic components of human disorders in the near future.
published_or_final_version
Biochemistry
Doctoral
Doctor of Philosophy
APA, Harvard, Vancouver, ISO, and other styles
38

Wang, Dao Sen. "Conditional Differential Expression for Biomarker Discovery In High-throughput Cancer Data." Thesis, Université d'Ottawa / University of Ottawa, 2019. http://hdl.handle.net/10393/38819.

Full text
Abstract:
Biomarkers have important clinical uses as diagnostic, prognostic, and predictive tools for cancer therapy. However, translation from biomarkers claimed in literature to clinical use has been traditionally poor. Importantly, clinical covariates have been shown to be important factors in biomarker discovery in small-scale studies. Yet, traditional differential gene expression analysis for expression biomarkers ignores covariates, which are only accounted for later, if at all. We conjecture that covariate-sensitive biomarker identification should lead to the discovery of more robust and true biomarkers as confounding effects are considered. Here we examine gene expression in more than 750 breast invasive ductal carcinoma cases from The Cancer Genome Atlas (TCGA-BRCA) in the form of RNA-Seq data. Specifically, we focus on differential gene expression with respect to understanding HER2, ER, and PR biology – the three key receptors in breast cancer. We explore methods of differential expression analysis, including non-parametric Mann-Whitney-Wilcoxon analysis, generalized linear models with covariates, and a novel categorical method for covariates. We tested the influence of common patient characteristics, such as age and race, and clinical covariates such as HER2, ER, and PR receptor statuses. More importantly, we show that inclusion of a correlated covariate (e.g. PR status as a covariate in ER analysis) substantially changes the list of differentially expressed genes, removing many likely false positives and revealing genes obscured by the covariate. Incorporation of relevant covariates in differential gene expression analysis holds strong biological importance with respect to biomarker discovery and may be the next step towards better translation of biomarkers to clinical use.
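A minimal sketch of the covariate argument, assuming simulated data and ordinary least squares on log-scale expression rather than the count models used for RNA-Seq: when a gene is driven by PR status and ER is correlated with PR, a naive ER-only test can flag the gene, while conditioning on PR removes the spurious signal. The variable names and effect sizes below are invented for illustration.

# Hedged sketch: covariate-sensitive differential expression for a single gene.
# Simulated data and ordinary least squares; the thesis analyses RNA-Seq counts
# with more appropriate models.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 400
pr = rng.integers(0, 2, n)                              # PR status (0/1)
er = np.where(rng.random(n) < 0.2 + 0.6 * pr, 1, 0)     # ER status, correlated with PR
log_expr = 5 + 1.5 * pr + rng.normal(scale=1.0, size=n) # gene truly driven by PR only
df = pd.DataFrame({"expr": log_expr, "ER": er, "PR": pr})

naive = smf.ols("expr ~ ER", data=df).fit()             # ignores the PR covariate
adjusted = smf.ols("expr ~ ER + PR", data=df).fit()     # conditions on PR status

print("ER p-value, naive:   ", naive.pvalues["ER"])     # likely small (confounded)
print("ER p-value, adjusted:", adjusted.pvalues["ER"])  # likely large (signal explained by PR)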
APA, Harvard, Vancouver, ISO, and other styles
39

Cao, Hongfei. "High-throughput Visual Knowledge Analysis and Retrieval in Big Data Ecosystems." Thesis, University of Missouri - Columbia, 2019. http://pqdtopen.proquest.com/#viewpdf?dispub=13877134.

Full text
Abstract:

Visual knowledge plays an important role in many highly skilled applications, such as medical diagnosis, geospatial image analysis and pathology diagnosis. Medical practitioners are able to interpret and reason about diagnostic images based on not only primitive-level image features such as color, texture, and spatial distribution but also their experience and tacit knowledge which are seldom articulated explicitly. This reasoning process is dynamic and closely related to real-time human cognition. Due to a lack of visual knowledge management and sharing tools, it is difficult to capture and transfer such tacit and hard-won expertise to novices. Moreover, many mission-critical applications require the ability to process such tacit visual knowledge in real time. Precisely how to index this visual knowledge computationally and systematically still poses a challenge to the computing community.

My dissertation research results in novel computational approaches for high-throughput visual knowledge analysis and retrieval from large-scale databases using the latest technologies in big data ecosystems. To provide a better understanding of visual reasoning, human gaze patterns are qualitatively measured spatially and temporally to model observers’ cognitive processes. These gaze patterns are then indexed in a NoSQL distributed database as a visual knowledge repository, which is accessed using various unique retrieval methods developed through this dissertation work. To provide meaningful retrievals in real time, deep-learning methods for automatic annotation of visual activities and streaming similarity comparisons are developed under a gaze-streaming framework using Apache Spark.

This research has several potential applications that offer a broader impact among the scientific community and in the practical world. First, the proposed framework can be adapted for different domains, such as fine arts and the life sciences, with minimal effort to capture human reasoning processes. Second, with its real-time visual knowledge search function, this framework can be used for training novices in the interpretation of domain images, by helping them learn experts’ reasoning processes. Third, by helping researchers to understand human visual reasoning, it may shed light on the modeling of human semantics. Finally, by integrating the reasoning process with multimedia data, future media retrieval could embed human perceptual reasoning in database searches that go beyond traditional content-based retrieval.

APA, Harvard, Vancouver, ISO, and other styles
40

Paicu, Claudia. "miRNA detection and analysis from high-throughput small RNA sequencing data." Thesis, University of East Anglia, 2016. https://ueaeprints.uea.ac.uk/63738/.

Full text
Abstract:
Small RNAs (sRNAs) are a broad class of short regulatory non-coding RNAs. microRNAs (miRNAs) are a special class of ~21–22-nucleotide sRNAs which are derived from a stable hairpin-like secondary structure. miRNAs have critical gene regulatory functions and are involved in many pathways including developmental timing, organogenesis and development in both plants and animals. Next-generation sequencing (NGS) technologies, which are often used for identifying miRNAs, are continuously evolving, generating datasets containing millions of sRNAs, which has led to new challenges for the tools used to predict miRNAs from such data. There are several tools for miRNA detection from NGS datasets, which we review in this thesis, identifying a number of potential shortcomings in their algorithms. In this thesis, we present a novel miRNA prediction algorithm, miRCat2. Our algorithm is more robust to variations in sequencing depth because it compares aligned sRNA reads to a random uniform distribution to detect peaks in the input dataset, using a new entropy-based approach. It then applies filters based on miRNA biogenesis to the read alignment and to the computed secondary structure. Results show that miRCat2 has a better specificity-sensitivity trade-off than similar tools, and its predictions also contain a larger percentage of sequences that are downregulated in mutants in the miRNA biogenesis pathway. This confirms the validity of novel predictions, which may lead to new miRNA annotations, expanding and contributing to the field of sRNA research.
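The entropy idea can be illustrated simply: within a genomic window, aligned sRNA reads that pile up at one or two start positions (as miRNA reads tend to do) give a low-entropy start-position distribution compared with uniformly scattered background reads. The Python sketch below uses simulated read starts and an arbitrary threshold; it is not the miRCat2 implementation.

# Illustrative entropy calculation over aligned sRNA read start positions.
# Simulated data and an arbitrary threshold; not the miRCat2 implementation.
import numpy as np

def window_entropy(starts, lo, hi):
    """Shannon entropy (bits) of the read-start distribution inside [lo, hi)."""
    counts = np.bincount(starts[(starts >= lo) & (starts < hi)] - lo, minlength=hi - lo)
    total = counts.sum()
    if total == 0:
        return None
    p = counts[counts > 0] / total
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(5)
background = rng.integers(0, 1000, 200)                      # uniformly scattered read starts
mirna_locus = np.repeat(500, 150) + rng.integers(0, 2, 150)  # sharp pile-up near position 500
starts = np.concatenate([background, mirna_locus])

for lo in range(0, 1000, 100):
    h = window_entropy(starts, lo, lo + 100)
    if h is not None and h < 3.0:                            # arbitrary threshold for the sketch
        print(f"low-entropy window {lo}-{lo + 100}: {h:.2f} bits (candidate peak)")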
APA, Harvard, Vancouver, ISO, and other styles
41

Yu, Guoqiang. "Machine Learning to Interrogate High-throughput Genomic Data: Theory and Applications." Diss., Virginia Tech, 2011. http://hdl.handle.net/10919/28980.

Full text
Abstract:
The missing heritability in genome-wide association studies (GWAS) is an intriguing open scientific problem which has attracted great recent interest. The interaction effects among risk factors, both genetic and environmental, are hypothesized to be one of the main missing heritability sources. Moreover, detection of multilocus interaction effect may also have great implications for revealing disease/biological mechanisms, for accurate risk prediction, personalized clinical management, and targeted drug design. However, current analysis of GWAS largely ignores interaction effects, partly due to the lack of tools that meet the statistical and computational challenges posed by taking into account interaction effects. Here, we propose a novel statistically-based framework (Significant Conditional Association) for systematically exploring, assessing significance, and detecting interaction effect. Further, our SCA work has also revealed new theoretical results and insights on interaction detection, as well as theoretical performance bounds. Using in silico data, we show that the new approach has detection power significantly better than that of peer methods, while controlling the running time within a permissible range. More importantly, we applied our methods on several real data sets, confirming well-validated interactions with more convincing evidence (generating smaller p-values and requiring fewer samples) than those obtained through conventional methods, eliminating inconsistent results in the original reports, and observing novel discoveries that are otherwise undetectable. The proposed methods provide a useful tool to mine new knowledge from existing GWAS and generate new hypotheses for further research. Microarray gene expression studies provide new opportunities for the molecular characterization of heterogeneous diseases. Multiclass gene selection is an imperative task for identifying phenotype-associated mechanistic genes and achieving accurate diagnostic classification. Most existing multiclass gene selection methods heavily rely on the direct extension of two-class gene selection methods. However, simple extensions of binary discriminant analysis to multiclass gene selection are suboptimal and not well-matched to the unique characteristics of the multi-category classification problem. We report a simpler and yet more accurate strategy than previous works for multicategory classification of heterogeneous diseases. Our method selects the union of one-versus-everyone phenotypic up-regulated genes (OVEPUGs) and matches this gene selection with a one-versus-rest support vector machine. Our approach provides even-handed gene resources for discriminating both neighboring and well-separated classes, and intends to assure the statistical reproducibility and biological plausibility of the selected genes. We evaluated the fold changes of OVEPUGs and found that only a small number of high-ranked genes were required to achieve superior accuracy for multicategory classification. We tested the proposed OVEPUG method on six real microarray gene expression data sets (five public benchmarks and one in-house data set) and two simulation data sets, observing significantly improved performance with lower error rates, fewer marker genes, and higher performance sustainability, as compared to several widely-adopted gene selection and classification methods.
Ph. D.
APA, Harvard, Vancouver, ISO, and other styles
42

Zucker, Mark Raymond. "Inferring Clonal Heterogeneity in Chronic Lymphocytic Leukemia From High-Throughput Data." The Ohio State University, 2019. http://rave.ohiolink.edu/etdc/view?acc_num=osu1554049121307262.

Full text
APA, Harvard, Vancouver, ISO, and other styles
43

Guennel, Tobias. "Statistical Methods for Normalization and Analysis of High-Throughput Genomic Data." VCU Scholars Compass, 2012. http://scholarscompass.vcu.edu/etd/2647.

Full text
Abstract:
High-throughput genomic datasets obtained from microarray or sequencing studies have revolutionized the field of molecular biology over the last decade. The complexity of these new technologies also poses new challenges to statisticians to separate biologically relevant information from technical noise. Two methods are introduced that address important issues with normalization of array comparative genomic hybridization (aCGH) microarrays and the analysis of RNA sequencing (RNA-Seq) studies. Many cancer and genetic studies investigating copy number aberrations at the DNA level use comparative genomic hybridization (CGH) on oligo arrays. However, aCGH data often suffer from low signal-to-noise ratios, resulting in poor resolution of fine features. Bilke et al. showed that the commonly used running average noise reduction strategy performs poorly when errors are dominated by systematic components. A method called pcaCGH is proposed that significantly reduces noise using a non-parametric regression on technical covariates of probes to estimate systematic bias. Then a robust principal components analysis (PCA) estimates any remaining systematic bias not explained by the technical covariates used in the preceding regression. The proposed algorithm is demonstrated on two CGH datasets measuring the NCI-60 cell lines utilizing NimbleGen and Agilent microarrays. The method achieves a nominal error variance reduction of 60%-65% as well as a 2-fold increase in signal-to-noise ratio on average, resulting in more detailed copy number estimates. Furthermore, correlations of signal intensity ratios of NimbleGen and Agilent arrays are increased by 40% on average, indicating a significant improvement in agreement between the technologies. A second algorithm called gamSeq is introduced to test for differential gene expression in RNA sequencing studies. Limitations of existing methods are outlined and the proposed algorithm is compared to these existing algorithms. Simulation studies and real data are used to show that gamSeq improves upon existing methods with regard to type I error control while maintaining similar or better power for a range of sample sizes for RNA-Seq studies. Furthermore, the proposed method is applied to detect differential 3' UTR usage.
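The two-stage normalization described for pcaCGH can be sketched as follows, with simulated log-ratios, a single illustrative technical covariate (standing in for, e.g., probe GC content), LOWESS as the non-parametric regression, and ordinary rather than robust PCA; all of these are simplifying assumptions.

# Hedged sketch of a two-stage aCGH normalization in the spirit of pcaCGH:
# 1) non-parametric regression on a technical covariate, 2) PCA on the residuals
# across arrays to remove remaining shared systematic bias. Simulated data only;
# the published method uses a robust PCA variant.
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
n_probes, n_arrays = 2000, 8
gc = rng.uniform(0.3, 0.7, n_probes)                       # technical covariate per probe
artifact = np.sin(np.linspace(0, 3, n_probes))             # shared systematic artifact
log_ratio = (0.8 * (gc - 0.5)[:, None]                     # covariate-dependent bias
             + artifact[:, None] * rng.normal(1.0, 0.1, n_arrays)
             + rng.normal(scale=0.3, size=(n_probes, n_arrays)))

# Stage 1: remove the covariate-dependent trend from each array with LOWESS.
residuals = np.empty_like(log_ratio)
for j in range(n_arrays):
    trend = lowess(log_ratio[:, j], gc, frac=0.3, return_sorted=False)
    residuals[:, j] = log_ratio[:, j] - trend

# Stage 2: estimate the remaining shared bias with PCA over probes and subtract it.
pca = PCA(n_components=1)
scores = pca.fit_transform(residuals)                      # systematic component per probe
cleaned = residuals - scores @ pca.components_

print("residual variance before / after PCA stage:", residuals.var(), cleaned.var())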
APA, Harvard, Vancouver, ISO, and other styles
44

Ferber, Kyle L. "Methods for Predicting an Ordinal Response with High-Throughput Genomic Data." VCU Scholars Compass, 2016. http://scholarscompass.vcu.edu/etd/4585.

Full text
Abstract:
Multigenic diagnostic and prognostic tools can be derived for ordinal clinical outcomes using data from high-throughput genomic experiments. A challenge in this setting is that the number of predictors is much greater than the sample size, so traditional ordinal response modeling techniques must be exchanged for more specialized approaches. Existing methods perform well on some datasets, but there is room for improvement in terms of variable selection and predictive accuracy. Therefore, we extended an impressive binary response modeling technique, Feature Augmentation via Nonparametrics and Selection, to the ordinal response setting. Through simulation studies and analyses of high-throughput genomic datasets, we showed that our Ordinal FANS method is sensitive and specific when discriminating between important and unimportant features from the high-dimensional feature space and is highly competitive in terms of predictive accuracy. Discrete survival time is another example of an ordinal response. For many illnesses and chronic conditions, it is impossible to record the precise date and time of disease onset or relapse. Further, the HIPAA Privacy Rule prevents recording of protected health information which includes all elements of dates (except year), so in the absence of a “limited dataset,” date of diagnosis or date of death are not available for calculating overall survival. Thus, we developed a method that is suitable for modeling high-dimensional discrete survival time data and assessed its performance by conducting a simulation study and by predicting the discrete survival times of acute myeloid leukemia patients using a high-dimensional dataset.
APA, Harvard, Vancouver, ISO, and other styles
45

Glaus, Peter. "Bayesian methods for gene expression analysis from high-throughput sequencing data." Thesis, University of Manchester, 2014. https://www.research.manchester.ac.uk/portal/en/theses/bayesian-methods-for-gene-expression-analysis-from-highthroughput-sequencing-data(cf9680e0-a3f2-4090-8535-a39f3ef50cc4).html.

Full text
Abstract:
We study the tasks of transcript expression quantification and differential expression analysis based on data from high-throughput sequencing of the transcriptome (RNA-seq). In an RNA-seq experiment, subsequences of nucleotides are sampled from a transcriptome specimen, producing millions of short reads. The reads can be mapped to a reference to determine the set of transcripts from which they were sequenced. We can measure the expression of transcripts in the specimen by determining the number of reads that were sequenced from individual transcripts. In this thesis we propose a new probabilistic method for inferring the expression of transcripts from RNA-seq data. We use a generative model of the data that can account for read errors, fragment length distribution and non-uniform distribution of reads along transcripts. We apply the Bayesian inference approach, using the Gibbs sampling algorithm to sample from the posterior distribution of transcript expression. Producing the full distribution enables assessment of the uncertainty of the estimated expression levels. We also investigate the use of alternative inference techniques for transcript expression quantification. We apply a collapsed Variational Bayes algorithm which can provide accurate estimates of mean expression faster than the Gibbs sampling algorithm. Building on the results from transcript expression quantification, we present a new method for differential expression analysis. Our approach utilizes the full posterior distribution of expression from multiple replicates in order to detect significant changes in abundance between different conditions. The method can be applied to differential expression analysis of both genes and transcripts. We use the newly proposed methods to analyse real RNA-seq data and provide an evaluation of their accuracy using synthetic datasets. We demonstrate the advantages of our approach in comparisons with existing alternative approaches for expression quantification and differential expression analysis. The methods are implemented in the BitSeq package, which is freely distributed under an open-source license. Our methods can be accessed and used by other researchers for RNA-seq data analysis.
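The core of this kind of Bayesian quantification can be conveyed with a deliberately simplified Gibbs sampler: reads that map ambiguously to several transcripts are repeatedly re-assigned according to the current expression estimate, and expression is re-drawn from a Dirichlet posterior. The toy Python sketch below ignores read errors, fragment lengths and positional biases that the full model accounts for, and its data are invented.

# Toy Gibbs sampler for transcript expression from ambiguously mapped reads.
# Deliberately simplified illustration, not the BitSeq model.
import numpy as np

rng = np.random.default_rng(42)
# compatibility[r] lists the transcripts read r aligns to (invented toy data)
compatibility = [[0], [0, 1], [1], [1, 2], [2], [0, 2], [1], [2, 1]]
n_transcripts = 3
alpha = np.ones(n_transcripts)                        # flat Dirichlet prior

theta = np.full(n_transcripts, 1.0 / n_transcripts)   # relative expression estimate
samples = []
for it in range(2000):
    counts = np.zeros(n_transcripts)
    for hits in compatibility:                        # re-assign each ambiguous read
        p = theta[hits] / theta[hits].sum()
        counts[rng.choice(hits, p=p)] += 1
    theta = rng.dirichlet(alpha + counts)             # re-draw expression from the posterior
    if it >= 500:                                     # discard burn-in
        samples.append(theta)

posterior = np.array(samples)
print("posterior mean expression:", posterior.mean(axis=0))
print("posterior standard deviation:", posterior.std(axis=0))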
APA, Harvard, Vancouver, ISO, and other styles
46

Ramljak, Dusan. "Data Driven High Performance Data Access." Diss., Temple University Libraries, 2018. http://cdm16002.contentdm.oclc.org/cdm/ref/collection/p245801coll10/id/530207.

Full text
Abstract:
Computer and Information Science
Ph.D.
Low-latency, high-throughput mechanisms to retrieve data become increasingly crucial as cyber and cyber-physical systems pour out increasing amounts of data that often must be analyzed in an online manner. Generally, as the data volume increases, the marginal utility of an “average” data item tends to decline, which requires greater effort in identifying the most valuable data items and making them available with minimal overhead. We believe that data-analytics-driven mechanisms have a big role to play in solving this needle-in-the-haystack problem. We rely on the claim that efficient pattern discovery and description, coupled with the observed predictability of complex patterns within many applications, offers significant potential to enable many I/O optimizations. Our research covers exploitation of the storage hierarchy for data-driven caching and tiering, reduction of the distance between data and computations, removal of redundancy in data, use of sparse representations of data, the impact of data access mechanisms on resilience, energy consumption, and storage usage, and the enablement of new classes of data-driven applications. For caching and prefetching, we offer a powerful model that separates the process of access prediction from the data retrieval mechanism. Predictions are made on a data-entity basis and use the notions of “context” and its aspects, such as “belief”, to uncover and leverage future data needs. This approach allows truly opportunistic utilization of predictive information. We elaborate on which aspects of the context we use in areas other than caching and prefetching, and why they are appropriate in each situation. We present in more detail the methods we have developed: BeliefCache for data-driven caching and prefetching, and AVSC for pattern-mining-based compression of data. In BeliefCache, using a belief, an aspect of context representing an estimate of the probability that a storage element will be needed, we developed a modular framework to make unified, informed decisions about that element or a group of elements. For the workloads we examined, we were able to capture complex non-sequential access patterns better than a state-of-the-art framework for optimizing cloud storage gateways. Moreover, our framework is also able to adjust to variations in the workload faster. It also does not require a static workload to be effective, since the modular design allows for discovering and adapting to changes in the workload. In AVSC, using an aspect of context to gauge the similarity of events, we perform compression by keeping relevant events intact and approximating other events. We do that in two stages: we first generate a summarization of the data, then approximately match the remaining events with the existing patterns if possible, or add the patterns to the summary otherwise. We show gains over plain lossless compression for a specified amount of accuracy for the purpose of identifying the state of the system, and a clear tradeoff between compressibility and fidelity. In the other research areas mentioned, we present challenges and opportunities in the hope that they will spur researchers to further examine those issues in the space of rapidly emerging data-intensive applications. We also discuss how ideas from our research in other domains could be applied in our attempts to provide high-performance data access.
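A hedged sketch of the belief idea in Python: the cache keeps the items whose estimated probability of being needed again (the “belief”) is highest, so that access prediction is decoupled from the retrieval mechanism. The predictor below is a stand-in invented for illustration, not the BeliefCache predictor, and the eviction policy is reduced to a single min-belief rule.

# Illustrative belief-driven cache: eviction is based on a predicted probability
# of future access ("belief") supplied by an external predictor. Not the
# BeliefCache implementation; the predictor below is a stand-in.

class BeliefDrivenCache:
    def __init__(self, capacity, predictor):
        self.capacity = capacity
        self.predictor = predictor          # callable: key -> probability of reuse
        self.store = {}

    def get(self, key, fetch):
        if key in self.store:
            return self.store[key]          # cache hit
        value = fetch(key)                  # cache miss: retrieve from slow storage
        self.store[key] = value
        if len(self.store) > self.capacity:
            # Evict the entry the predictor believes is least likely to be reused.
            victim = min(self.store, key=self.predictor)
            del self.store[victim]
        return value


def belief(key):
    """Stand-in predictor: pretend even-numbered blocks are reused more often."""
    return 0.9 if key % 2 == 0 else 0.1

cache = BeliefDrivenCache(capacity=3, predictor=belief)
for block in [1, 2, 3, 4, 2, 5, 2]:
    cache.get(block, fetch=lambda k: f"data-{k}")
print(sorted(cache.store))                  # even-numbered blocks tend to survive eviction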
Temple University--Theses
APA, Harvard, Vancouver, ISO, and other styles
47

Chen, Tien-Fu. "Data prefetching for high-performance processors /." Thesis, Connect to this title online; UW restricted, 1993. http://hdl.handle.net/1773/6871.

Full text
APA, Harvard, Vancouver, ISO, and other styles
48

Bain, R. S. "DESIGNING A HIGH-SPEED DATA ARCHIVE SYSTEM." International Foundation for Telemetering, 1992. http://hdl.handle.net/10150/608904.

Full text
Abstract:
International Telemetering Conference Proceedings / October 26-29, 1992 / Town and Country Hotel and Convention Center, San Diego, California
Modern telemetry systems collect large quantities of data at high speed. To support near real-time analysis of this data, a sophisticated data archival system is required. If the data is to be analyzed during the test, it must be available via computer-accessible peripheral devices. The use of computer-compatible media permits powerful “instant-replay-type” functions to be implemented, allowing the user to search for events, blow up time segments, or alter playback rates. Using computer-compatible media also implies inexpensive COTS devices with an “industry standard” interface and direct media compatibility with host processing systems. This paper discusses the design and implementation of a board-level archive subsystem developed by Veda Systems, Incorporated.
APA, Harvard, Vancouver, ISO, and other styles
49

Boyd, Peter G. "Computational High Throughput Screening of Metal Organic Frameworks for Carbon Dioxide Capture and Storage Applications." Thesis, Université d'Ottawa / University of Ottawa, 2015. http://hdl.handle.net/10393/32532.

Full text
Abstract:
This work explores the use of computational methods to aid in the design of Metal Organic Frameworks (MOFs) for use as CO2 scrubbers in carbon capture and storage applications. One of the main challenges in this field is in identifying important MOF design characteristics which optimize the complex interactions governing surface adsorption. We approach this in a high-throughput manner, determining properties important to CO2 adsorption by generating and sampling a large materials search space. The utilization of MOFs as potential carbon scrubbing agents is a recent phenomenon; as such, many of the computational tools necessary to perform high-throughput screening of MOFs and subsequent analysis are either underdeveloped or non-existent. A large portion of this work therefore involved the development of novel tools designed specifically for this task. The chapters in this work are contiguous with the goal of designing MOFs for CO2 capture, and somewhat chronological in order and complexity, meaning that as time and expertise progressed, more advanced tools were developed and utilized for the purposes of computational MOF discovery. Initial work towards MOF design involved the detailed analysis of two experimental structures, CALF-15 and CALF-16, using classical molecular dynamics, grand canonical Monte Carlo simulations, and DFT to determine the structural features which promote CO2 adsorption. An unprecedented level of agreement was found between theory and experiment, as we are able to capture, with simulation, the X-ray-resolved binding sites of CO2 in the confined pores of CALF-15. Molecular simulation was then used to provide a detailed breakdown of the energy contributions from nearby functional groups in both CALF-15 and CALF-16. A large database of hypothetical MOF structures is constructed for the purposes of screening for CO2 adsorption. The database contains 1.3 million hypothetical structures, generated with an algorithm which snaps together rigid molecular building blocks extracted from existing MOF crystal structures. The algorithm for constructing the hypothetical MOFs and the building blocks themselves were all developed in-house to form the resulting database. The topological, chemical, and physical features of these MOFs are compared to recently developed materials databases to demonstrate the larger structural and chemical space sampled by our database. In order to rapidly and accurately describe the electrostatic interactions of CO2 in the hypothetical database of MOFs, parameters were developed for use with the charge equilibration method. This method assigns partial charges to the framework atoms based on a set of parameters assigned to each atom type. An evolutionary algorithm was used to optimize the charge equilibration parameters on a set of 543 hypothetical MOFs such that the partial charges generated would reproduce each MOF's DFT-derived electrostatic potential. Validation of these parameters was performed by comparing the CO2 adsorption from the charge equilibration method versus DFT-derived charges on a separate set of 693 MOFs. Our parameter set was found to reproduce DFT-derived CO2 adsorption extremely well using only a fraction of the time, making this method ideal for rapid and accurate high-throughput MOF screening. A database of 325,000 MOFs was then screened for CO2 capture and storage applications. From this study we identify important binding pockets for CO2 in MOFs using a binding site analysis tool.
This tool uses a pattern recognition method to compare the 3-D configurations of thousands of pore structures surrounding strong CO2 adsorption sites, and presents common features found amongst them. For the purposes of developing larger databases which sample a more diverse materials space, a novel MOF construction tool is developed which builds MOFs based on abstract graphs. The graph-theoretical foundations of this method are discussed and several examples of MOF construction are presented to demonstrate its use. Notably, not only can it build existing MOFs with complicated geometries, but it can sample a wide range of unique structures not yet discovered by experimental means.
APA, Harvard, Vancouver, ISO, and other styles
50

Wollmann, Philipp, Matthias Leistner, Ulrich Stoeck, Ronny Grünker, Kristina Gedrich, Nicole Klein, Oliver Throl, et al. "High-throughput screening: speeding up porous materials discovery." Saechsische Landesbibliothek- Staats- und Universitaetsbibliothek Dresden, 2014. http://nbn-resolving.de/urn:nbn:de:bsz:14-qucosa-138648.

Full text
Abstract:
A new tool (Infrasorb-12) for the screening of porosity is described, identifying high-surface-area materials in a very short time with high accuracy. Further, an example of the application of the tool in the discovery of new cobalt-based metal–organic frameworks is given.
This contribution is freely accessible with the consent of the rights holder on the basis of a (DFG-funded) Alliance or National Licence.
APA, Harvard, Vancouver, ISO, and other styles