Dissertations / Theses: 'High-throughput sequencing data'

1

Roguski, Łukasz 1987. "High-throughput sequencing data compression." Doctoral thesis, Universitat Pompeu Fabra, 2017. http://hdl.handle.net/10803/565775.

Abstract:

Thanks to advances in sequencing technologies, biomedical research has experienced a revolution over recent years, resulting in an explosion in the amount of genomic data being generated worldwide. The typical space requirement for storing sequencing data produced by a medium-scale experiment lies in the range of tens to hundreds of gigabytes, with multiple files in different formats being produced by each experiment. The current de facto standard file formats used to represent genomic data are text-based. For practical reasons, these are stored in compressed form. In most cases, such storage methods rely on general-purpose text compressors, such as gzip. However, these methods are unable to exploit the information models specific to sequencing data, and as a result they usually provide limited functionality and insufficient savings in storage space. This explains why relatively basic operations such as processing, storage, and transfer of genomic data have become a typical bottleneck of current analysis setups. Therefore, this thesis focuses on methods to efficiently store and compress the data generated from sequencing experiments. First, we propose a novel general-purpose FASTQ file compressor. Compared to gzip, it achieves a significant reduction in the size of the resulting archive, while also offering high data processing speed. Next, we present compression methods that exploit the high sequence redundancy present in sequencing data. These methods achieve the best compression ratio among current state-of-the-art FASTQ compressors, without using any external reference sequence. We also demonstrate different lossy compression approaches to store auxiliary sequencing data, which allow for further reductions in size. Finally, we propose a flexible framework and data format that allow one to semi-automatically generate compression solutions that are not tied to any specific genomic file format. 
To facilitate data management needed by complex pipelines, multiple genomic datasets having heterogeneous formats can be stored together in configurable containers, with an option to perform custom queries over the stored data. Moreover, we show that simple solutions based on our framework can achieve results comparable to those of state-of-the-art format-specific compressors. Overall, the solutions developed and described in this thesis can easily be incorporated into current pipelines for the analysis of genomic data. Taken together, they provide grounds for the development of integrated approaches towards efficient storage and management of such data.
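
To illustrate why a format-aware encoder can beat a general-purpose text compressor on the sequence stream, the sketch below packs nucleotides into two bits each. It is a toy illustration, not the compressor proposed in the thesis. The function names pack_bases and unpack_bases are hypothetical, and the sketch assumes reads contain only A, C, G and T (real FASTQ data also contains N symbols, read identifiers and quality strings, which need separate handling).

```python
# Toy 2-bit nucleotide packing (illustrative only; not the thesis's method).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack_bases(seq: str) -> bytes:
    """Pack an A/C/G/T string into 2 bits per base, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        # Left-align a short final chunk so unpacking reads bases in order.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack_bases(data: bytes, length: int) -> str:
    """Recover the first `length` bases from packed bytes."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])
```

With this encoding, eight bases occupy two bytes instead of eight, before any statistical modelling of the reads is applied.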

2

Durif, Ghislain. "Multivariate analysis of high-throughput sequencing data." Thesis, Lyon, 2016. http://www.theses.fr/2016LYSE1334/document.

Abstract:

The statistical analysis of Next-Generation Sequencing (NGS) data raises many computational challenges regarding modeling and inference, especially because of the high dimensionality of genomic data. The research work in this manuscript concerns hybrid dimension reduction methods that rely on both compression (representation of the data in a lower-dimensional space) and variable selection. Developments are made concerning the sparse Partial Least Squares (PLS) regression framework for supervised classification and the sparse matrix factorization framework for unsupervised exploration. In both situations, our main purpose will be to focus on the reconstruction and visualization of the data. First, we will present a new sparse PLS approach, based on an adaptive sparsity-inducing penalty, that is suitable for logistic regression to predict the label of a discrete outcome. For instance, such a method will be used for prediction (fate of patients or specific type of unidentified single cells) based on gene expression profiles. The main issue in such a framework is to account for the response in order to discard irrelevant variables. We will highlight the direct link between the derivation of the algorithms and the reliability of the results. Then, motivated by questions regarding single-cell data analysis, we propose a flexible model-based approach for the factorization of count matrices that accounts for over-dispersion as well as zero-inflation (both characteristic of single-cell data), for which we derive an estimation procedure based on variational inference. In this scheme, we consider probabilistic variable selection based on a spike-and-slab model suitable for count data. The interest of our procedure for data reconstruction, visualization and clustering will be illustrated by simulation experiments and by preliminary results on single-cell data analysis. All proposed methods were implemented in two R packages, "plsgenomics" and "CMF", based on high-performance computing.

3

Zhang, Xuekui. "Mixture models for analysing high throughput sequencing data." Thesis, University of British Columbia, 2011. http://hdl.handle.net/2429/35982.

Abstract:

The goal of my thesis is to develop methods and software for analysing high-throughput sequencing data, emphasizing sonicated ChIP-seq. For this goal, we developed a few variants of mixture models for genome-wide profiling of transcription factor binding sites and nucleosome positions. Our methods have been implemented in Bioconductor packages, which are freely available to other researchers. For profiling transcription factor binding sites, we developed a method, PICS, and implemented it as a Bioconductor package. We used a simulation study to confirm that PICS compares favourably to rival methods, such as MACS, QuEST, CisGenome, and USeq. Using published GABP and FOXA1 data from human cell lines, we then show that PICS-predicted binding sites were more consistent with computationally predicted binding motifs than those of the alternative methods. For motif discovery using transcription factor binding sites, we combined PICS with two other existing packages to create the first complete set of Bioconductor tools for peak-calling and binding motif analysis of ChIP-Seq and ChIP-chip data. We demonstrate the effectiveness of our pipeline on published human ChIP-Seq datasets for FOXA1, ER, CTCF and STAT1, detecting co-occurring motifs that were consistent with the literature but not detected by other methods. For nucleosome positioning, we modified PICS into a method called PING. PING can handle MNase-Seq and MNase- or sonicated-ChIP-Seq data. It compares favourably to NPS and TemplateFilter in scalability, accuracy and robustness to low read density. To demonstrate that PING predictions from sonicated data can have sufficient spatial resolution to be biologically meaningful, we use H3K4me1 data to detect nucleosome shifts, discriminate functional and non-functional transcription factor binding sites, and confirm that Foxa2 associates with the accessible major groove of nucleosomal DNA. All of the above uses single-end sequencing data. 
At the end of the thesis, we briefly discuss the issue of processing paired-end data, which we are currently investigating.

4

Hoffmann, Steve. "Genome Informatics for High-Throughput Sequencing Data Analysis." Doctoral thesis, Universitätsbibliothek Leipzig, 2014. http://nbn-resolving.de/urn:nbn:de:bsz:15-qucosa-152643.

Abstract:

This thesis introduces three different algorithmic and statistical strategies for the analysis of high-throughput sequencing data. First, we introduce a heuristic method based on enhanced suffix arrays to map short sequences to larger reference genomes. The algorithm builds on the idea of an error-tolerant traversal of the suffix array for the reference genome in conjunction with the concept of matching statistics introduced by Chang and a bitvector-based alignment algorithm proposed by Myers. The algorithm supports paired-end and mate-pair alignments, and the implementation offers methods for primer detection as well as primer and poly-A trimming. In our own benchmarks as well as independent benchmarks, this tool outcompetes other currently available tools with respect to sensitivity and specificity in simulated and real data sets for a large number of sequencing protocols. Second, we introduce a novel dynamic programming algorithm for the spliced alignment problem. The advantage of this algorithm is its capability to detect not only collinear splice events, i.e. local splice events on the same genomic strand, but also circular and other non-collinear splice events. This succinct and simple algorithm handles all these cases at the same time with high accuracy. While it is on par with other state-of-the-art methods for collinear splice events, it outcompetes other tools for many non-collinear splice events. The application of this method to publicly available sequencing data led to the identification of a novel isoform of the tumor suppressor gene p53. Since this gene is one of the best studied genes in the human genome, this finding is quite remarkable and suggests that the application of our algorithm could help to identify a plethora of novel isoforms and genes. Third, we present a data-adaptive method to call single nucleotide variations (SNVs) from aligned high-throughput sequencing reads. 
We demonstrate that our method, based on empirical log-likelihoods, automatically adjusts to the quality of a sequencing experiment and thus renders a "decision" on when to call an SNV. In our simulations this method is on par with current state-of-the-art tools. Finally, we present biological results that have been obtained using the special features of the presented alignment algorithm.

5

Stromberg, Michael Peter. "Enabling high-throughput sequencing data analysis with MOSAIK." Thesis, Boston College, 2010. http://hdl.handle.net/2345/1332.

Abstract:

Thesis advisor: Gabor T. Marth
During the last few years, numerous new sequencing technologies have emerged that require tools that can process large amounts of read data quickly and accurately. Regardless of the downstream methods used, reference-guided aligners are at the heart of all next-generation analysis studies. I have developed a general reference-guided aligner, MOSAIK, to support all current sequencing technologies (Roche 454, Illumina, Applied Biosystems SOLiD, Helicos, and Sanger capillary). The calibrated alignment qualities calculated by MOSAIK allow the user to fine-tune the alignment accuracy for a given study. MOSAIK is a highly configurable and easy-to-use suite of alignment tools that is used in hundreds of labs worldwide. MOSAIK is an integral part of our genetic variant discovery pipeline. From SNP and short-INDEL discovery to structural variation discovery, alignment accuracy is an essential requirement and enables our downstream analyses to provide accurate calls. In this thesis, I present three major studies that were formative during the development of MOSAIK and our analysis pipeline. In addition, I present a novel algorithm that identifies mobile element insertions (non-LTR retrotransposons) in the human genome using split-read alignments in MOSAIK. This algorithm has a low false discovery rate (4.4%) and enabled our group to be the first to determine the number of mobile elements that differentially occur between any two individuals.
Thesis (PhD) — Boston College, 2010
Submitted to: Boston College. Graduate School of Arts and Sciences
Discipline: Biology

6

Xing, Zhengrong. "Poisson multiscale methods for high-throughput sequencing data." Thesis, The University of Chicago, 2016. http://pqdtopen.proquest.com/#viewpdf?dispub=10195268.

Abstract:

In this dissertation, we focus on the problem of analyzing data from high-throughput sequencing experiments. With the emergence of more capable hardware and more efficient software, these sequencing data provide information at an unprecedented resolution. However, statistical methods developed for such data rarely tackle the data at such high resolutions, and often make approximations that only hold under certain conditions.

We propose a model-based approach to dealing with such data, starting from a single sample. By taking into account the inherent structure present in such data, our model can accurately capture important genomic regions. We also present the model in such a way that makes it easily extensible to more complicated and biologically interesting scenarios.

Building upon the single-sample model, we then turn to the statistical question of detecting differences between multiple samples. Such questions often arise in the context of expression data, where much emphasis has been put on the problem of detecting differential expression between two groups. By extending the framework for a single sample to incorporate additional group covariates, our model provides a systematic approach to estimating and testing for such differences. We then apply our method to several empirical datasets, and discuss the potential for further applications to other biological tasks.

We also seek to address a different statistical question, where the goal here is to perform exploratory analysis to uncover hidden structure within the data. We incorporate the single-sample framework into a commonly used clustering scheme, and show that our enhanced clustering approach is superior to the original clustering approach in many ways. We then apply our clustering method to a few empirical datasets and discuss our findings.

Finally, we apply the shrinkage procedure used within the single-sample model to tackle a completely different statistical issue: nonparametric regression with heteroskedastic Gaussian noise. We propose an algorithm that accurately recovers both the mean and variance functions given a single set of observations, and demonstrate its advantages over state-of-the-art methods through extensive simulation studies.

7

Fritz, Markus Hsi-Yang. "Exploiting high throughput DNA sequencing data for genomic analysis." Thesis, University of Cambridge, 2012. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.610819.

8

Woolford, Julie Ruth. "Statistical analysis of small RNA high-throughput sequencing data." Thesis, University of Cambridge, 2012. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.610375.

9

Kircher, Martin. "Understanding and improving high-throughput sequencing data production and analysis." Doctoral thesis, Universitätsbibliothek Leipzig, 2011. http://nbn-resolving.de/urn:nbn:de:bsz:15-qucosa-71102.

Abstract:

Advances in DNA sequencing revolutionized the field of genomics over the last 5 years. New sequencing instruments make it possible to rapidly generate large amounts of sequence data at substantially lower cost. These high-throughput sequencing technologies (e.g. Roche 454 FLX, Life Technology SOLiD, Dover Polonator, Helicos HeliScope and Illumina Genome Analyzer) make whole genome sequencing and resequencing, transcript sequencing as well as quantification of gene expression, DNA-protein interactions and DNA methylation feasible at an unanticipated scale. In the field of evolutionary genomics, high-throughput sequencing permitted studies of whole genomes from ancient specimens of different hominin groups. Further, it allowed large-scale population genetics studies of present-day humans as well as different types of sequence-based comparative genomics studies in primates. Such comparisons of humans with closely related apes and hominins are important not only to better understand human origins and the biological background of what sets humans apart from other organisms, but also for understanding the molecular basis for diseases and disorders, particularly those that affect uniquely human traits, such as speech disorders, autism or schizophrenia. However, while the cost and time required to create comparative data sets have been greatly reduced, the error profiles and limitations of the new platforms differ significantly from those of previous approaches. This requires a specific experimental design in order to circumvent these issues, or to handle them during data analysis. During the course of my PhD, I analyzed and improved current protocols and algorithms for next generation sequencing data, taking into account the specific characteristics of these new sequencing technologies. 
The presented approaches and algorithms were applied in different projects and are widely used within the Department of Evolutionary Genetics at the Max Planck Institute for Evolutionary Anthropology. In this thesis, I will present selected analyses from the whole genome shotgun sequencing of two ancient hominins and the quantification of gene expression from short-sequence tags in five tissues from three primates.

10

Ainsworth, David. "Computational approaches for metagenomic analysis of high-throughput sequencing data." Thesis, Imperial College London, 2016. http://hdl.handle.net/10044/1/44070.

Abstract:

High-throughput DNA sequencing has revolutionised microbiology and is the foundation on which the nascent field of metagenomics has been built. This ability to cheaply sample billions of DNA reads directly from environments has democratised sequencing and allowed researchers to gain unprecedented insights into diverse microbial communities. These technologies, however, are not without their limitations: the short length of the reads requires the production of vast amounts of data to ensure all information is captured. This 'data deluge' has been a major bottleneck and has necessitated the development of new algorithms for analysis. Sequence alignment methods provide the most information about the composition of a sample, as they allow both taxonomic and functional classification, but the algorithms are prohibitively slow. This inefficiency has led to a reliance on faster algorithms that only produce simple taxonomic classification or abundance estimation, losing the valuable information given by full alignments against annotated genomes. This thesis will describe k-SLAM, a novel ultra-fast method for the alignment and taxonomic classification of metagenomic data. Using a k-mer-based method, k-SLAM achieves speeds three orders of magnitude faster than current alignment-based approaches, making full taxonomic classification and gene identification tractable on modern large datasets. The alignments found by k-SLAM can also be used to find variants and identify genes, along with their nearest taxonomic origins. A novel pseudo-assembly method produces more specific taxonomic classifications on species which have high sequence identity within their genus. This provides a significant (up to 40%) increase in accuracy on these species. Also described is a re-analysis of a Shiga toxin-producing E. coli O104:H4 isolate via alignment against bacterial and viral species to find antibiotic resistance and toxin-producing genes. 
k-SLAM has been used by a range of research projects including FLORINASH and is currently being used by a number of groups.
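
The k-mer idea mentioned in the abstract can be pictured with a minimal sketch: index every length-k substring of a reference, then count how many k-mers of a read occur in that index. This is a generic illustration under assumed names (kmers, shared_kmer_count) and is not the k-SLAM algorithm itself, which the abstract does not describe in detail.

```python
def kmers(seq: str, k: int) -> list[str]:
    """Return all overlapping substrings of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def shared_kmer_count(read: str, reference: str, k: int) -> int:
    """Count k-mer positions in `read` whose k-mer also occurs in `reference`."""
    ref_index = set(kmers(reference, k))  # hash set for fast membership tests
    return sum(1 for kmer in kmers(read, k) if kmer in ref_index)
```

Because lookups in the index do not require a full alignment, a score of this kind is cheap to compute for each read, which is the general reason k-mer methods are fast.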

11

Mohamadi, Hamid. "Parallel algorithms and software tools for high-throughput sequencing data." Thesis, University of British Columbia, 2017. http://hdl.handle.net/2429/62072.

Abstract:

With the growing throughput and dropping cost of High-Throughput Sequencing (HTS) technologies, there is a continued need to develop faster and more cost-effective bioinformatics solutions. However, the algorithms and computational power required to efficiently analyze HTS data have lagged considerably. In health and life sciences research organizations, de novo assembly and sequence alignment have become two key steps in everyday research and analysis. The de novo assembly process is a fundamental step in analyzing previously uncharacterized organisms and is one of the most computationally demanding problems in bioinformatics. Sequence alignment is a fundamental operation in a broad spectrum of genomics projects. In genome resequencing projects, alignments are often used prior to variant calling. In transcriptome resequencing, they provide information on gene expression. They are even used in de novo sequencing projects to help contiguate assembled sequences. As such, designing efficient, scalable, and accurate solutions for de novo assembly and sequence alignment problems would have a wide effect in the field. In this thesis, I present a collection of novel algorithms and software tools for the analysis of high-throughput sequencing data using efficient data structures. I also utilize the latest advances in parallel and distributed computing to design and develop scalable and cost-effective algorithms on High-Performance Computing (HPC) infrastructures, especially for the de novo assembly and sequence alignment problems. The algorithms and software solutions I develop are publicly available for free for academic use, to facilitate research at health and life sciences laboratories and other organizations worldwide.
Science, Faculty of
Graduate

12

Mammana, Alessandro [Verfasser]. "Patterns and algorithms in high-throughput sequencing count data / Alessandro Mammana." Berlin : Freie Universität Berlin, 2016. http://d-nb.info/1108270956/34.

13

Love, Michael I. [Verfasser]. "Statistical analysis of high-throughput sequencing count data / Michael I. Love." Berlin : Freie Universität Berlin, 2013. http://d-nb.info/1043197842/34.

14

Ballinger, Tracy J. "Analysis of genomic rearrangements in cancer from high throughput sequencing data." Thesis, University of California, Santa Cruz, 2015. http://pqdtopen.proquest.com/#viewpdf?dispub=3729995.

Abstract:

In the last century cancer has become increasingly prevalent and is the second largest killer in the United States, estimated to afflict 1 in 4 people during their life. Despite our long history with cancer and our herculean efforts to thwart the disease, in many cases we still do not understand the underlying causes or have successful treatments. In my graduate work, I’ve developed two approaches to the study of cancer genomics and applied them to the whole genome sequencing data of cancer patients from The Cancer Genome Atlas (TCGA). In collaboration with Dr. Ewing, I built a pipeline to detect retrotransposon insertions from paired-end high-throughput sequencing data and found somatic retrotransposon insertions in a fifth of cancer patients.

My second novel contribution to the study of cancer genomics is the development of the CN-AVG pipeline, a method for reconstructing the evolutionary history of a single tumor by predicting the order of structural mutations such as deletions, duplications, and inversions. The CN-AVG theory was developed by Drs. Haussler, Zerbino, and Paten and samples potential evolutionary histories for a tumor using Markov Chain Monte Carlo sampling. I contributed to the development of this method by testing its accuracy and limitations on simulated evolutionary histories. I found that the ability to reconstruct a history decays exponentially with increased breakpoint reuse, but that we can estimate how accurately we reconstruct a mutation event using the likelihood scores of the events. I further designed novel techniques for the application of CN-AVG to whole genome sequencing data from actual patients and applied these techniques to search for evolutionary patterns in glioblastoma multiforme using sequencing data from TCGA. My results show patterns of two-hit deletions, as we would expect, and amplifications occurring over several mutational events. I also find that the CN-AVG method frequently makes use of whole chromosome copy number changes followed by localized deletions, a bias that could be mitigated by modifying the cost function for an evolutionary history.

15

Paicu, Claudia. "miRNA detection and analysis from high-throughput small RNA sequencing data." Thesis, University of East Anglia, 2016. https://ueaeprints.uea.ac.uk/63738/.

Abstract:

Small RNAs (sRNAs) are a broad class of short regulatory non-coding RNAs. microRNAs (miRNAs) are a special class of ~21-22 nucleotide sRNAs which are derived from a stable hairpin-like secondary structure. miRNAs have critical gene regulatory functions and are involved in many pathways including developmental timing, organogenesis and development in both plants and animals. Next generation sequencing (NGS) technologies, which are often used for identifying miRNAs, are continuously evolving, generating datasets containing millions of sRNAs, which has led to new challenges for the tools used to predict miRNAs from such data. There are several tools for miRNA detection from NGS datasets, which we review in this thesis, identifying a number of potential shortcomings in their algorithms. In this thesis, we present a novel miRNA prediction algorithm, miRCat2. Our algorithm is more robust to variations in sequencing depth because it compares aligned sRNA reads to a random uniform distribution to detect peaks in the input dataset, using a new entropy-based approach. It then applies filters based on miRNA biogenesis to the read alignment and to the computed secondary structure. Results show that miRCat2 has a better specificity-sensitivity trade-off than similar tools, and its predictions also contain a larger percentage of sequences that are downregulated in mutants in the miRNA biogenesis pathway. This confirms the validity of novel predictions, which may lead to new miRNA annotations, expanding and contributing to the field of sRNA research.

16

Glaus, Peter. "Bayesian methods for gene expression analysis from high-throughput sequencing data." Thesis, University of Manchester, 2014. https://www.research.manchester.ac.uk/portal/en/theses/bayesian-methods-for-gene-expression-analysis-from-highthroughput-sequencing-data(cf9680e0-a3f2-4090-8535-a39f3ef50cc4).html.

Abstract:

We study the tasks of transcript expression quantification and differential expression analysis based on data from high-throughput sequencing of the transcriptome (RNA-seq). In an RNA-seq experiment subsequences of nucleotides are sampled from a transcriptome specimen, producing millions of short reads. The reads can be mapped to a reference to determine the set of transcripts from which they were sequenced. We can measure the expression of transcripts in the specimen by determining the amount of reads that were sequenced from individual transcripts. In this thesis we propose a new probabilistic method for inferring the expression of transcripts from RNA-seq data. We use a generative model of the data that can account for read errors, fragment length distribution and non-uniform distribution of reads along transcripts. We apply the Bayesian inference approach, using the Gibbs sampling algorithm to sample from the posterior distribution of transcript expression. Producing the full distribution enables assessment of the uncertainty of the estimated expression levels. We also investigate the use of alternative inference techniques for the transcript expression quantification. We apply a collapsed Variational Bayes algorithm which can provide accurate estimates of mean expression faster than the Gibbs sampling algorithm. Building on the results from transcript expression quantification, we present a new method for the differential expression analysis. Our approach utilizes the full posterior distribution of expression from multiple replicates in order to detect significant changes in abundance between different conditions. The method can be applied to differential expression analysis of both genes and transcripts. We use the newly proposed methods to analyse real RNA-seq data and provide evaluation of their accuracy using synthetic datasets. 
We demonstrate the advantages of our approach in comparisons with existing alternative approaches for expression quantification and differential expression analysis. The methods are implemented in the BitSeq package, which is freely distributed under an open-source license. Our methods can be accessed and used by other researchers for RNA-seq data analysis.

17

Beckers, Matthew. "Quality checking and expression analysis of high-throughput small RNA sequencing data." Thesis, University of East Anglia, 2015. https://ueaeprints.uea.ac.uk/58581/.

Abstract:

The advent of high-throughput RNA sequencing (RNA-seq) methods has made it possible to sequence transcriptomes for the cell-wide identification of small non-coding RNAs (sRNAs) and to assess their regulation using differential expression analysis by comparing two or more different conditions. During an analysis of a typical set of sRNA sequencing (sRNA-seq) libraries, a large variety of tools and methods are used on the dataset in order to understand the data's quality and content, and to summarise the knowledge gained from the entire analysis. Many of the tools available to do this were created for mRNA sequencing (mRNA-seq) datasets. In this thesis, we present and implement a processing pipeline that can be used to assess the quality and the differential expression of sRNA-seq datasets over two or more different conditions. We then utilise aspects of this pipeline in various sRNA-seq experiments. Firstly, we combine our pipeline with current tools for miRNA identification to assess the regulation of miRNAs during larval caste differentiation in a novel genome: the European bumblebee (Bombus terrestris). Secondly, we explore the differential expression during cell stress of all classes of sRNAs using two cell lines in humans. We also find that a specific protein, Ro60, is required for the expression of mRNA-derived sRNAs during stress, similar to the way in which sRNAs derived from Y RNAs are regulated. Finally, we utilise our understanding of sRNA mapping patterns, alongside current tools for miRNA identification, to search for functional miRNAs and other sRNAs in the novel genomes of two diatoms. The lack of canonical miRNA predictions in this study has repercussions for the evolutionary theory behind miRNAs.
The implementation of our pipeline for sRNA-seq data provides an interactive and quality-controlled workflow that can be used to process a dataset from raw sequences to the results of several differential expression experiments for all identified sRNA classes within a sequenced transcriptome.

18

Gupta, Namita. "Computational Identification of B Cell Clones in High-Throughput Immunoglobulin Sequencing Data." Thesis, Yale University, 2017. http://pqdtopen.proquest.com/#viewpdf?dispub=10633249.

Abstract:

Humoral immunity is driven by the expansion, somatic hypermutation, and selection of B cell clones. Each clone is the progeny of a single B cell responding to antigen, with diversified Ig receptors. The advent of next-generation sequencing technologies enables deep profiling of the Ig repertoire. This large-scale characterization provides a window into the micro-evolutionary dynamics of the adaptive immune response and has a variety of applications in basic science and clinical studies. Clonal relationships are not directly measured, but must be computationally inferred from these sequencing data. In this dissertation, we use a combination of human experimental and simulated data to characterize the performance of hierarchical clustering-based methods for partitioning sequences into clones. Our results suggest that hierarchical clustering using single linkage with nucleotide Hamming distance identifies clones with high confidence and provides a fully automated method for clonal grouping. The performance estimates we develop provide important context to interpret clonal analysis of repertoire sequencing data and allow for rigorous testing of other clonal grouping algorithms. We present the clonal grouping tool as well as other tools for advanced analyses of large-scale Ig repertoire sequencing data through a suite of utilities, Change-O. All Change-O tools utilize a common data format, which enables the seamless integration of multiple analyses into a single workflow. We then apply the Change-O suite in concert with the nucleotide coding sequences for WNV-specific antibodies derived from single cells to identify expanded WNV-specific clones in the repertoires of recently infected subjects through quantitative Ig repertoire sequencing analysis.
The method proposed in this dissertation to computationally identify B cell clones in Ig repertoire sequencing data with high confidence is made available through the Change-O suite and can be applied to provide insight into the dynamics of the adaptive immune response.
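
As a rough illustration of the clustering idea named in this abstract, the sketch below groups equal-length nucleotide sequences by single linkage under a length-normalized Hamming distance cutoff. It is not the Change-O implementation: the function names, the normalization by sequence length, and the threshold value are assumptions made for this example. Cutting a single-linkage tree at a fixed distance is equivalent to taking connected components of the graph that links every pair within the cutoff, which is what the union-find structure computes.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def single_linkage_clones(seqs, threshold):
    """Group sequences into clusters by single linkage.

    Two sequences are linked when they have the same length and their
    Hamming distance is at most `threshold` times that length (an
    illustrative normalization, assumed here). Clusters are the connected
    components of the resulting graph.
    """
    parent = list(range(len(seqs)))

    def find(i):
        # Find the cluster representative, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(seqs)), 2):
        if len(seqs[i]) != len(seqs[j]):
            continue
        if hamming(seqs[i], seqs[j]) <= threshold * len(seqs[i]):
            parent[find(i)] = find(j)

    clusters = {}
    for i, seq in enumerate(seqs):
        clusters.setdefault(find(i), []).append(seq)
    return sorted(sorted(members) for members in clusters.values())


if __name__ == "__main__":
    reads = ["AAAAAAAAAA", "AAAAAAAAAT", "AAAAAAAATT", "CCCCCCCCCC"]
    for clone in single_linkage_clones(reads, threshold=0.15):
        print(clone)
```

Note the chaining behaviour typical of single linkage: the first and third reads differ at two positions, above the cutoff, yet they land in the same cluster because each is within the cutoff of the second read. The pairwise loop is quadratic in the number of sequences, so this form is only suitable for small inputs.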

19

González-Vallinas, Rostes Juan 1983. "Software development and analysis of high throughput sequencing data for genomic enhancer prediction." Doctoral thesis, Universitat Pompeu Fabra, 2013. http://hdl.handle.net/10803/283480.

Abstract:

High Throughput Sequencing technologies (HTS) are becoming the standard in genomic regulation analysis. During my thesis I developed software for the analysis of HTS data. Through collaborations with other research groups, I specialized in the analysis of ChIP-Seq short mapped reads. For instance, I collaborated in the analysis of the effect of the Hog1 stress-induced response in yeast and helped in the design of a multiple promoter-alignment method using ChIP-Seq data, among other collaborations. Making use of the expertise and the software developed during this time, I analyzed ENCODE datasets in order to detect active genomic enhancers. Genomic enhancers are regions in the genome known to regulate transcription levels of nearby or distant genes. The mechanisms of enhancer activation and silencing are still poorly understood. Epigenomic elements, like histone modifications and transcription factors, play a critical role in enhancer activity. Modeling epigenomic signals, I predicted active and silenced enhancers in two cell lines and studied their effect on splicing and transcription initiation.
Las tecnologías High Throughput Sequencing (HTS) se están convirtiendo en el método estándar de análisis de la regulación genómica. Durante mi tesis, he desarrollado software para el análisis de datos HTS. Mediante la colaboración con otros grupos de investigación, me he especializado en el análisis de datos de ChIP-Seq. Por ejemplo, colaboré en el análisis del efecto de Hog1 en células de levadura afectadas por estrés y en el diseño de un método para el alineamiento múltiple de promotores usando datos de ChIP-Seq, entre otras colaboraciones. Usando el conocimiento y el software desarrollados durante este tiempo, analicé datos producidos por el proyecto ENCODE para detectar enhancers genómicos activos. Los enhancers son áreas del genoma conocidas por regular la transcripción de genes cercanos y lejanos. Los mecanismos de activación y silenciamiento de enhancers son aún poco entendidos. Elementos epigenómicos, como las modificaciones de histonas y los factores de transcripción, juegan un papel crucial en la actividad de enhancers. Construyendo un modelo con estas señales epigenómicas, predije enhancers activos y silenciados en dos líneas celulares y estudié su efecto sobre splicing y sobre la iniciación de la transcripción.

20

Becher, Hannes. "Differentiation across the Podisma pedestris hybrid zone inferred from high-throughput sequencing data." Thesis, Queen Mary, University of London, 2018. http://qmro.qmul.ac.uk/xmlui/handle/123456789/39744.

Abstract:

Hybrid zones are regions where genetically differentiated forms come together and exchange genes through hybrid offspring. The study of characters gradually changing across such zones (clines) can give insight into evolutionary processes, providing exceptionally sensitive estimates of the intensity of selection, and allowing the detection of loci that might be involved in reproductive isolation and speciation. The Alpine grasshopper Podisma pedestris has a hybrid zone in Southern France where two populations meet. They differ in their sex chromosome system, and strong selection against hybrids is observed. These distinct populations have likely split and re-joined several times during the Quaternary glacial cycles. A model explaining the selection observed against hybrids postulates hundreds of loci of small effect spread over two differentiated genomes meeting in secondary contact. Yet, in over 50 years of study to date, none have been discovered. However, the study of P. pedestris has so far not made use of high-throughput sequencing data, which provide an unprecedented resolution of molecular markers. I aim to close this gap with this thesis. I assemble the grasshopper's mitochondrial genome sequence and infer what proportion of its genome is made up of mitochondrial inserts (Numts). Using transcriptome data from two individuals, I then go on to fit demographic models, finding that the populations split approximately 400 000 years ago and that the current-day population sizes are considerably smaller than the ancestral one. The final data chapter explores the genetic architecture of the hybrid zone using data from a targeted sequence capture of hundreds of loci covering some 10 000 polymorphic sites. Only two loci under selection are identified, which is surprising given the power of the analysis. Both loci are located on the X chromosome and are subject to weak selection (0.3% and 0.03%). This shows the power of hybrid zone analysis to infer targets of selection.
The results are discussed in light of a theoretical chapter on the 'inexorable spread' phenomenon and lead to the proposal for further research into the causes of the reproductive isolation observed between the grasshopper populations.

21

Makowski, Mateusz. "High-Throughput Data Analysis: Application to Micronuclei Frequency and T-cell Receptor Sequencing." VCU Scholars Compass, 2015. http://scholarscompass.vcu.edu/etd/3923.

Abstract:

The advent of high-throughput sequencing has brought about the creation of an unprecedented amount of research data. Analytical methodology has not been able to keep pace with the plethora of data being produced. Two assays, ImmunoSEQ and the cytokinesis-block micronucleus (CBMN) assay, are considered; both produce count data and have few methods available to analyze them. ImmunoSEQ is a sequencing assay that measures the beta T-cell receptor (TCR) repertoire. The ImmunoSEQ assay was used to describe the TCR repertoires of patients who have undergone hematopoietic stem cell transplantation (HSCT). Several different methods for spectratype analysis were extended to the TCR sequencing setting and then applied to these data to demonstrate different ways the data set can be analyzed. The different methods include CDR3 distribution perturbation, Oligoscores, Simpson's diversity, Shannon diversity, Kullback-Leibler divergence, a non-parametric method and a proportion logit transformation method. Herein we also demonstrate adapting compositional data analysis methods to the TCR sequencing setting. The various methods were compared when analyzing a set of 13 subjects who underwent hematopoietic stem cell transplantation. The eight subjects who developed graft versus host disease were compared to the five who did not. There was little overlap in the results of the different methods, showing that researchers must choose the appropriate method for their research question of interest. The CBMN assay measures the rate of micronuclei (MN) formation in a sample of cells and can be paired with gene expression or methylation assays to determine association between MN formation and other genetic markers. Herein we extended the generalized monotone incremental forward stagewise (GMIFS) method to the situation where the response is count data and there are more independent variables than there are samples.
Our Poisson GMIFS method was compared to a popular alternative, glmpath, by using simulations and applying both to real data. Simulations showed that both methods perform similarly in accurately choosing truly significant variables. However, glmpath appears to overfit compared to our GMIFS method. Finally, when both methods were applied to two data sets GMIFS appeared to be more stable than glmpath.
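
Three of the repertoire summaries listed in this abstract have simple closed forms when applied to clonotype counts. The sketch below is a minimal illustration, not the thesis code: the function names are hypothetical, natural logarithms are assumed, and Simpson's index is computed in its concentration form (the probability that two draws with replacement hit the same clonotype); other conventions, such as one minus that value, are also in common use.

```python
import math


def proportions(counts):
    """Convert non-negative clonotype counts to relative frequencies."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    return [c / total for c in counts]


def shannon_diversity(counts):
    """Shannon entropy H = -sum(p * ln p), skipping empty clonotypes."""
    return -sum(p * math.log(p) for p in proportions(counts) if p > 0)


def simpson_index(counts):
    """Simpson concentration D = sum(p^2); lower values mean more diverse."""
    return sum(p * p for p in proportions(counts))


def kl_divergence(p_counts, q_counts):
    """Kullback-Leibler divergence KL(P || Q) between two count vectors.

    Both vectors must list the same clonotypes in the same order, and Q
    must be non-zero wherever P is non-zero.
    """
    if len(p_counts) != len(q_counts):
        raise ValueError("count vectors must have equal length")
    total = 0.0
    for p, q in zip(proportions(p_counts), proportions(q_counts)):
        if p == 0:
            continue
        if q == 0:
            raise ValueError("KL divergence undefined: q is 0 where p > 0")
        total += p * math.log(p / q)
    return total


if __name__ == "__main__":
    even = [25, 25, 25, 25]   # hypothetical evenly distributed repertoire
    skewed = [85, 5, 5, 5]    # hypothetical repertoire with one expanded clone
    print(shannon_diversity(even), shannon_diversity(skewed))
    print(simpson_index(even), simpson_index(skewed))
    print(kl_divergence(skewed, even))
```

The example counts are invented for illustration. Because the measures respond differently to rare and dominant clonotypes, they need not rank samples the same way.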

22

Ye, Lin, and 叶林. "Exploring microbial community structures and functions of activated sludge by high-throughput sequencing." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2012. http://hub.hku.hk/bib/B48079649.

Abstract:

The major objectives of this study were to investigate the diversities and abundances of nitrifiers and to apply high-throughput sequencing technologies to analyze the overall microbial community structures and functions in wastewater treatment bioreactors. Specifically, this study was conducted: (1) to investigate the diversities and abundances of AOA, AOB and NOB in bioreactors, (2) to explore the bacterial communities in bioreactors using 454 pyrosequencing, and (3) to analyze the metagenomes of activated sludge using Illumina sequencing. A lab-scale nitrification bioreactor was operated for 342 days under low DO (0.15~0.5 mg/L) and high nitrogen loading (0.26~0.52 kg-N/(m³·d)). T-RFLP and cloning analysis showed there was only one dominant species each of AOA, AOB and NOB in the bioreactor. The amoA gene of the dominant AOA had a similarity of 89.3% with the isolated AOA species Nitrosopumilus maritimus SCM1. The AOB species detected in the bioreactor belonged to the genus Nitrosomonas. The abundance of AOB was more than 40 times larger than that of AOA. The percentage of NOB in total bacteria increased from not detectable to 30% when DO changed from 0.15 to 0.5 mg/L. Compared with traditional methods, pyrosequencing analysis of the bacteria in this bioreactor provided unprecedented information. 494 bacterial OTUs were obtained at a 3% distance cutoff. Furthermore, 454 pyrosequencing was applied to investigate the bacterial communities of activated sludge samples from 14 WWTPs in Asia (mainland China, Hong Kong, and Singapore) and North America (Canada and the United States). The results revealed large numbers of OTUs in activated sludge, i.e. 1183~3567 OTUs in one sludge sample at a 3% distance cutoff. Clear geographical differences among these samples were observed. The AOB amoA genes in different WWTPs were found to be quite diverse while the 16S rRNA genes were relatively conserved.
To explore microbial community structures and functions in the above-mentioned lab-scale bioreactor and a full-scale bioreactor, over six gigabases of metagenomic sequence data and 150,000 paired-end reads of PCR amplicons were generated from the activated sludge in the two bioreactors on the Illumina HiSeq2000 platform. Three kinds of sequences (16S rRNA amplicons, 16S rRNA gene tags and predicted genes) were used to conduct taxonomic assignment, and their applicability and reliability were compared. Specifically, based on 16S rRNA and amoA gene sequences, AOB were found to be more abundant than AOA in the two bioreactors. Furthermore, the analysis of the metabolic profiles and pathways indicated that the overall pathways in the two bioreactors were quite similar. However, the abundances of some specific genes in the two bioreactors were different. In addition, 454 pyrosequencing was also used to detect potentially pathogenic bacteria in environmental samples. The most abundant potentially pathogenic bacteria in the WWTPs were found to be affiliated with Aeromonas and Clostridium. Aeromonas veronii, Aeromonas hydrophila and Clostridium perfringens were the species most similar to the potentially pathogenic bacteria found in this study. Overall, the percentage of the sequences closely related to known pathogenic bacteria sequences was about 0.16% of the total sequences. Additionally, a Java application (BAND) was developed for graphical visualization of microbial abundance data.
Civil Engineering
Doctoral
Doctor of Philosophy

23

Althammer, Sonja Daniela. "Elucidating mechanisms of gene regulation. Integration of high-throughput sequencing data for studying the epigenome." Doctoral thesis, Universitat Pompeu Fabra, 2012. http://hdl.handle.net/10803/81355.

Abstract:

The recent advent of High-Throughput Sequencing (HTS) methods has triggered a revolution in gene regulation studies. Demand has never been higher to process the immense amount of emerging data to gain insight into the regulatory mechanisms of the cell. We address this issue by describing methods to analyze, integrate and interpret HTS data from different sources. In particular, we developed and benchmarked Pyicos, a powerful toolkit that offers flexibility, versatility and efficient memory usage. We applied it to data from ChIP-Seq on progesterone receptor in breast cancer cells to gain insight into regulatory mechanisms of hormones. Moreover, we embedded Pyicos into a pipeline to integrate HTS data from different sources. In order to do so, we used data sets from ENCODE to systematically calculate signal changes between two cell lines. We thus created a model that accurately predicts the regulatory outcome of gene expression, based on epigenetic changes in a gene locus. Finally, we provide the processed data in a Biomart database to the scientific community.
La llegada reciente de nuevos métodos de High-Throughput Sequencing (HTS) ha provocado una revolución en el estudio de la regulación génica. La necesidad de procesar la inmensa cantidad de datos generados, con el objetivo de estudiar los mecanismos regulatorios en la célula, nunca ha sido mayor. En esta tesis abordamos este tema presentando métodos para analizar, integrar e interpretar datos HTS de diferentes fuentes. En particular, hemos desarrollado Pyicos, un potente conjunto de herramientas que ofrece flexibilidad, versatilidad y un uso eficiente de la memoria. Lo hemos aplicado a datos de ChIP-Seq del receptor de progesterona en células de cáncer de mama con el fin de investigar los mecanismos de la regulación por hormonas. Además, hemos incorporado Pyicos en una pipeline para integrar los datos HTS de diferentes fuentes. Hemos usado los conjuntos de datos de ENCODE para calcular de forma sistemática los cambios de señal entre dos líneas celulares. De esta manera hemos logrado crear un modelo que predice con bastante precisión los cambios de la expresión génica, basándose en los cambios epigenéticos en el locus de un gen. Por último, hemos puesto los datos procesados a disposición de la comunidad científica en una base de datos Biomart.

24

Videm, Pavankumar [Verfasser], and Rolf [Akademischer Betreuer] Backofen. "Analysis of high-throughput sequencing data related to small non-coding RNAs biogenesis and function." Freiburg : Universität, 2021. http://d-nb.info/1238518087/34.

25

Bansal, Vikas [Verfasser]. "Computational Analysis of High-Throughput Sequencing Data in Cardiac Disease and Skeletal Muscle Development / Vikas Bansal." Berlin : Freie Universität Berlin, 2016. http://d-nb.info/1110884494/34.

26

Ott, Felix [Verfasser], and Detlef [Akademischer Betreuer] Weigel. "An Integrated Data Analysis Suite and Programming Framework for High-Throughput DNA Sequencing / Felix Ott ; Betreuer: Detlef Weigel." Tübingen : Universitätsbibliothek Tübingen, 2014. http://d-nb.info/1162897147/34.

27

Kawalia, Amit [Verfasser], Peter [Gutachter] Nürnberg, and Michael [Gutachter] Nothnagel. "Addressing NGS Data Challenges: Efficient High Throughput Processing and Sequencing Error Detection / Amit Kawalia ; Gutachter: Peter Nürnberg, Michael Nothnagel." Köln : Universitäts- und Stadtbibliothek Köln, 2016. http://d-nb.info/112370368X/34.

28

Hoffmann, Steve [Verfasser], Peter F. [Gutachter] Stadler, and Rolf [Gutachter] Backofen. "Genome Informatics for High-Throughput Sequencing Data Analysis : Methods and Applications / Steve Hoffmann ; Gutachter: Peter F. Stadler, Rolf Backofen." Leipzig : Universitätsbibliothek Leipzig, 2014. http://d-nb.info/1238789528/34.

29

Sheppard, Sarah E. "Application of a Naïve Bayes Classifier to Assign Polyadenylation Sites from 3' End Deep Sequencing Data: A Dissertation." eScholarship@UMMS, 2013. http://escholarship.umassmed.edu/gsbs_diss/653.

Abstract:

Cleavage and polyadenylation of a precursor mRNA is important for transcription termination, mRNA stability, and regulation of gene expression. This process is directed by a multitude of protein factors and cis elements in the pre-mRNA sequence surrounding the cleavage and polyadenylation site. Importantly, the location of the cleavage and polyadenylation site helps define the 3’ untranslated region of a transcript, which is important for regulation by microRNAs and RNA binding proteins. Additionally, these sites have generally been poorly annotated. To identify 3’ ends, many techniques utilize an oligo-dT primer to construct deep sequencing libraries. However, this approach can lead to identification of artifactual polyadenylation sites due to internal priming in homopolymeric stretches of adenines. Previously, simple heuristic filters relying on the number of adenines in the genomic sequence downstream of a putative polyadenylation site have been used to remove these sites of internal priming. However, these simple filters may not remove all sites of internal priming and may also exclude true polyadenylation sites. Therefore, I developed a naïve Bayes classifier to identify putative sites from oligo-dT primed 3’ end deep sequencing as true or false/internally primed. Notably, this algorithm uses a combination of sequence elements to distinguish between true and false sites. Finally, the resulting algorithm is highly accurate in multiple model systems and facilitates identification of novel polyadenylation sites.
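
The "simple heuristic filters" that this abstract contrasts with the naïve Bayes classifier can be sketched in a few lines. The example below is only an illustration of that older style of filter, not the classifier developed in the dissertation. The function names, the 10-nt window and the cutoff of 6 adenines are hypothetical values chosen for the example, and the site is assumed to lie on the plus strand of the supplied genomic sequence.

```python
def downstream_a_count(genome: str, site: int, window: int = 10) -> int:
    """Count adenines in the `window` bases immediately downstream of `site`.

    `site` is the 0-based index of the last transcribed base before the
    putative poly(A) tail; the window starts at the next genomic base.
    """
    if not 0 <= site < len(genome):
        raise IndexError("site lies outside the genomic sequence")
    return genome[site + 1 : site + 1 + window].upper().count("A")


def looks_internally_primed(genome: str, site: int,
                            window: int = 10, max_a: int = 6) -> bool:
    """Flag a site whose downstream genomic window is adenine-rich.

    An A-rich genomic stretch can anneal to an oligo-dT primer, so reads
    ending there may reflect internal priming rather than a true
    polyadenylation site. Window size and cutoff are illustrative only.
    """
    return downstream_a_count(genome, site, window) >= max_a


if __name__ == "__main__":
    a_rich = "GATTACAGG" + "A" * 10 + "CGT"
    mixed = "GATTACAGGCTGCATCGTAC"
    print(looks_internally_primed(a_rich, 8))  # downstream window is all A
    print(looks_internally_primed(mixed, 8))   # downstream window has few A
```

A single fixed cutoff of this kind is exactly what the abstract criticizes: it can miss internally primed sites just under the cutoff and discard genuine sites that happen to sit upstream of an A-rich region, which motivates combining several sequence features in a probabilistic classifier.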

30

Martin, Marcel [Verfasser], Sven [Akademischer Betreuer] Rahmann, and Jens [Akademischer Betreuer] Stoye. "Algorithms and tools for the analysis of high throughput DNA sequencing data / Marcel Martin. Betreuer: Sven Rahmann. Gutachter: Jens Stoye." Dortmund : Universitätsbibliothek Dortmund, 2014. http://d-nb.info/1095767682/34.

31

Martin, Marcel [Verfasser], Sven [Akademischer Betreuer] Rahmann, and Jens [Akademischer Betreuer] Stoye. "Algorithms and tools for the analysis of high throughput DNA sequencing data / Marcel Martin. Betreuer: Sven Rahmann. Gutachter: Jens Stoye." Dortmund : Universitätsbibliothek Dortmund, 2014. http://d-nb.info/1095767682/34.

32

Kircher, Martin [Verfasser], Janet [Akademischer Betreuer] Kelso, Anton [Gutachter] Nekrutenko, and Peter F. [Gutachter] Stadler. "Understanding and improving high-throughput sequencing data production and analysis / Martin Kircher ; Gutachter: Anton Nekrutenko, Peter F. Stadler ; Betreuer: Janet Kelso." Leipzig : Universitätsbibliothek Leipzig, 2011. http://d-nb.info/1237894654/34.

33

Pantano, Rubiño Lorena. "Full characterization of the small RNA transcriptome using novel computational methods for high-throughput sequencing data: study of miRNA variability in eukaryote organisms." Doctoral thesis, Universitat Pompeu Fabra, 2011. http://hdl.handle.net/10803/53576.

Abstract:

In this thesis we have developed a user-friendly tool, SeqBuster, for the analysis of small RNA (sRNA) data generated by next generation sequencing strategies, with special emphasis on deep characterization of miRNA variants (isomiRs). We tested the tool using public datasets, revealing an unexpected amount of isomiRs in the total miRNA profile in different species. In addition, we detected all known classes of non-miRNA sRNAs and new sRNAs with a still unassigned function. Furthermore, we studied the implication of miRNAs and isomiRs in human brain development and aging and in Huntington disease, concluding that miRNAs/isomiRs may contribute to central nervous system physiological and pathological conditions. Overall, our results have uncovered a new layer of complexity in miRNAs, with probable consequences in mRNA-mediated gene expression regulation underlying different biological functions. Furthermore, SeqBuster may be extremely useful to identify sRNA sequences with a putative regulatory role in selective biological processes.
En esta tesis hemos desarrollado una herramienta, SeqBuster, para el análisis de datos de RNA (sRNA) de pequeño tamaño generados por las nuevas tecnologías de secuenciación, con especial énfasis en la caracterización de variantes de los miRNAs. Aplicamos la herramienta a datos públicos de secuenciación, lo que reveló una inesperada abundancia de isomiRs en diferentes especies. Además, detectamos todas las clases conocidas de otros sRNAs y de nuevos sRNAs con funciones desconocidas. También estudiamos la implicación de los miRNAs e isomiRs en el desarrollo y envejecimiento del cerebro humano, y en la enfermedad de Huntington. Nuestros resultados resaltan una posible importancia de la plasticidad de secuencia de los miRNAs, con probables consecuencias en la regulación de la expresión génica, subyacente a varias funciones biológicas. Por último, SeqBuster podría ser extremadamente útil para identificar nuevos sRNAs con una posible función en determinados procesos biológicos.

34

Arora, Ankit [Verfasser], and Peter [Gutachter] Nürnberg. "Development of Web-Application for High-Throughput Sequencing Data and In Silico Dissection of LINE-1 Retrotransposons in Cellular Senescence / Ankit Arora ; Gutachter: Peter Nürnberg." Köln : Universitäts- und Stadtbibliothek Köln, 2018. http://d-nb.info/1171422660/34.

35

Wang, Wei. "Unveiling Molecular Mechanisms of piRNA Pathway from Small Signals in Big Data: A Dissertation." eScholarship@UMMS, 2010. http://escholarship.umassmed.edu/gsbs_diss/805.

Abstract:

PIWI-interacting RNAs (piRNAs) are a group of 23–35 nucleotide (nt) short RNAs that protect animal gonads from transposon activity. In the Drosophila germ line, piRNAs fall into two categories, primary and secondary, based on their origins. Primary piRNAs are generated from transcripts of specific genomic regions called piRNA clusters, which are enriched in transposon fragments that are unlikely to retain transposition activity. The transcription and maturation of primary piRNAs from those cluster transcripts are poorly understood. After being produced, a group of primary piRNAs associates with Piwi proteins and directs them to repress transposons at the transcriptional level in the nucleus. Beyond their direct role in repressing transposons, primary piRNAs can also initiate the production of secondary piRNAs. piRNAs with this function are loaded into a second PIWI protein, Aubergine (Aub). Like Piwi, Aub is guided by piRNAs to identify its targets through base-pairing. Unlike Piwi, Aub functions in the cytoplasm by cleaving transposon mRNAs. The 5' cleavage products are not degraded but are loaded into the third PIWI protein, Argonaute3 (Ago3). It is believed that an unidentified nuclease trims the 3' ends of those cleavage products to 23–29 nt, yielding mature piRNAs that remain in Ago3. Such piRNAs, whose 5' ends are generated by another PIWI protein, are named secondary piRNAs. Intriguingly, secondary piRNAs loaded into Ago3 also cleave transposon mRNAs or piRNA cluster transcripts and produce more secondary piRNAs that are loaded into Aub. This reciprocal feed-forward loop, named the “Ping-Pong cycle”, amplifies piRNA abundance. By dissecting and analyzing data from large-scale deep sequencing of piRNAs and transposon transcripts, my dissertation research elucidates the biogenesis of germline piRNAs in Drosophila.
How primary piRNAs are processed into mature piRNAs remains enigmatic. I discover that primary piRNA signals on the genome display a fixed periodicity of ~26 nt. Such phasing depends on Zucchini, Armitage, and some other primary piRNA pathway components. Further analysis suggests that secondary piRNAs bound to Ago3 can initiate phased primary piRNA production from cleaved transposon RNAs. The first ~26 nt becomes a secondary piRNA that binds Aub, while the subsequent piRNAs bind Piwi, allowing piRNAs to spread beyond the site of RNA cleavage. This discovery adds sequence diversity to the piRNA pool, allowing adaptation to changes in transposon sequence. We further find that most Piwi-associated piRNAs are generated from the cleavage products of Ago3, instead of being processed from piRNA cluster transcripts as the previous model suggests. The cardinal function of Ago3 is to produce antisense piRNAs that direct transcriptional silencing by Piwi, rather than to make piRNAs that guide post-transcriptional silencing by Aub. Although Ago3 slicing is required to efficiently trigger phased piRNA production, an alternative, slicing-independent pathway suffices to generate Piwi-bound piRNAs that repress transcription of a subset of transposon families. The alternative pathway may help flies silence newly acquired transposons for which they lack extensively complementary piRNAs.
The Ping-Pong model holds that the first ten nucleotides of Aub-bound piRNAs are complementary to the first ten nt of Ago3-bound piRNAs. Supporting this view, piRNAs bound to Aub typically begin with uridine (1U), while piRNAs bound to Ago3 often have adenine at position 10 (10A). Furthermore, the majority of Ping-Pong piRNAs form this 1U:10A pair. The Ping-Pong model proposes that the 10A is a consequence of 1U. By statistically quantifying those target piRNAs not paired to g1U, we discover that 10A is not directly caused by 1U. Instead, fly Aub, as well as its homologs Siwi in silkmoth and MILI in mice, has an intrinsic preference for adenine at the t1 position of its target RNAs. In turn, this t1A (g10A after loading) piRNA directly gives rise to a 1U piRNA in the next Ping-Pong cycle, maximizing the affinity between piRNAs and PIWI proteins.
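The 10-nt complementarity described in this abstract can be illustrated with a small check. This is only a minimal sketch, not the dissertation's analysis pipeline: the function names and the two example sequences are hypothetical, and real analyses work on genome-mapped read coordinates rather than raw strings.

```python
# Minimal sketch (hypothetical names and sequences): test whether two piRNAs
# show the Ping-Pong signature, i.e. their first 10 nt are reverse-complementary.
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna: str) -> str:
    """Return the reverse complement of an RNA string over A/C/G/U."""
    return rna.translate(COMPLEMENT)[::-1]


def is_ping_pong_pair(pirna_a: str, pirna_b: str, overlap: int = 10) -> bool:
    """True if the first `overlap` nt of the two piRNAs base-pair with each other."""
    if len(pirna_a) < overlap or len(pirna_b) < overlap:
        return False
    return reverse_complement(pirna_a[:overlap]) == pirna_b[:overlap]


# Made-up example sequences, chosen so that the 5' ends pair over 10 nt.
aub_guide = "UGACCGUAAGCCUUAGGCAUCGAU"
ago3_guide = "CUUACGGUCAGGAUCCGAUUAGCA"

print(is_ping_pong_pair(aub_guide, ago3_guide))  # True
print(aub_guide[0], ago3_guide[9])               # U A
```

In this toy pair, the uridine at position 1 of one piRNA sits opposite the adenine at position 10 of its partner, which is the 1U:10A relationship the abstract discusses.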

36

Wang, Wei. "Unveiling Molecular Mechanisms of piRNA Pathway from Small Signals in Big Data: A Dissertation." eScholarship@UMMS, 2015. https://escholarship.umassmed.edu/gsbs_diss/805.

37

Chen, Wei-Chu, and 陳薇筑. "Optimizing microalgae genome assembly of high throughput sequencing data." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/f5c43h.

38

Chen, Chien-Chih, and 陳建智. "Scalable Assembly of High-Throughput De Novo Sequencing Data." Thesis, 2013. http://ndltd.ncl.edu.tw/handle/30042954633923603047.

Abstract:

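Doctoral degree
National Taiwan University
Graduate Institute of Computer Science and Information Engineering
Academic year 101 (ROC calendar)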
DNA sequencing is one of the most important procedures in molecular biology research for determining the sequences of bases in specific DNA segments. With the development of next-generation sequencing technologies, studies in genomics and transcriptomics are moving into a new era. However, current DNA sequencing technologies cannot read an entire genome or transcript in one step; instead, small sequences of 20–1000 bases are read. Thus, sequence assembly continues to be one of the central problems in bioinformatics. The challenges facing sequence assembly include (1) sequencing errors, (2) repeat sequences, (3) nonuniform coverage, and (4) the computational complexity of processing large volumes of data. Given these challenges and the rapid growth of data throughput delivered by next-generation sequencing technologies, there is a pressing need for sequence assembly software that can efficiently handle massive sequencing data using scalable, on-demand computing resources. These requirements fit the model of cloud computing, in which computing resources can be allocated on demand over the Internet from several thousand computers offered by vendors for analyzing data in parallel. Such cloud-computing applications are constantly being developed for large datasets and run under the MapReduce framework. In this dissertation, we propose CloudBrush, a parallel pipeline that runs on the MapReduce framework for de novo assembly of high-throughput sequencing data. CloudBrush is based on bidirected string graphs, and its analysis consists of two main stages: graph construction and graph simplification. During graph construction, a node is defined for each nonredundant sequence read, and an edge is defined for each overlap between reads. We developed a prefix-and-extend algorithm for identifying overlaps between a pair of reads.
The graph is further simplified using conventional operations such as transitive reduction, path compression, tip removal, and bubble removal. We also introduce a new operation, edge adjustment, for removing erroneous topological structures in string graphs. This operation uses the sequence information of all graph neighbors of each read and eliminates the edges connecting to reads that contain rare bases. CloudBrush was evaluated on the Genome Assembly Gold-Standard Evaluation (GAGE) benchmarks to compare its assembly quality with that of other assemblers. The results showed that our assemblies have a moderate N50 and low rates of misjoins and indels. In addition, we introduce two measures, precision and recall, to assess how faithfully contigs align to the target genomes. Compared with the assembly tools used in the GAGE benchmarks, CloudBrush produced contigs with high precision and recall. We also introduce the T-CloudBrush pipeline for transcriptome data. T-CloudBrush uses a multiple-k concept to overcome the problem of nonuniform coverage in transcriptome data. This concept is based on the observed correlation between sequencing coverage and the overlap size used during assembly. The experimental results showed that T-CloudBrush improves the accuracy of de novo transcriptome assembly. In summary, this dissertation explores the challenges facing sequence assembly under a scalable computing framework and provides possible solutions to the problems of sequencing errors, nonuniform coverage, and the processing of large volumes of data.
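The graph-construction stage described above (one node per nonredundant read, one edge per overlap) can be sketched as follows. This is an illustrative toy under stated assumptions, not CloudBrush itself: it uses a naive all-pairs suffix-prefix comparison instead of the prefix-and-extend algorithm, runs on a single machine rather than MapReduce, and ignores reverse complements, so the graph is not bidirected. The function names, the `min_len` parameter, and the reads are hypothetical.

```python
# Toy overlap-graph construction (assumption: naive all-pairs comparison,
# forward strand only; not the MapReduce prefix-and-extend method).

def longest_overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` equal to a prefix of `b`, or 0 if < min_len."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def build_overlap_graph(reads, min_len=3):
    """Return (nodes, edges): nodes are nonredundant reads, edges map (a, b) -> overlap length."""
    nodes = list(dict.fromkeys(reads))  # drop duplicate reads, keep first-seen order
    edges = {}
    for a in nodes:
        for b in nodes:
            if a != b:
                k = longest_overlap(a, b, min_len)
                if k:
                    edges[(a, b)] = k
    return nodes, edges


reads = ["ATTAGACC", "GACCTGCC", "TGCCGGAA", "ATTAGACC"]
nodes, edges = build_overlap_graph(reads)
print(len(nodes))  # 3
print(edges)       # {('ATTAGACC', 'GACCTGCC'): 4, ('GACCTGCC', 'TGCCGGAA'): 4}
```

The all-pairs loop is quadratic in the number of reads, which is exactly the cost that a scalable overlap-finding strategy has to avoid on real datasets.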

39

"Analysis of nonsense-mediated decay targeted RNA (nt-RNA) in high-throughput sequencing data." 2015. http://repository.lib.cuhk.edu.hk/en/item/cuhk-1291523.

Abstract:

Nonsense-mediated mRNA decay (NMD) is an important protective mechanism that guards against erroneous transcripts, particularly mRNA transcripts containing premature termination codons (PTCs). In classical teaching, such erroneous transcripts (called nonsense-mediated decay targeted RNA, nt-RNA, here) are considered incidental, non-specific side products of the cellular transcription machinery; they are rapidly cleared by NMD and thus exist in scant quantity inside a cell (i.e., at a very low steady-state abundance). As side products of stochastic transcriptional error, they are also commonly considered to carry no biological function.
Analysis of a large collection of RNA-seq data from TCGA (over 4000 samples, occupying over 50 TB of disk storage) showed that nt-RNAs were produced in large amounts for some genes; sometimes they were even more abundant than the normal transcripts of the corresponding genes.
Based on the hypothesis that some nt-RNAs are specifically produced by a biological process (rather than by chance), the aims of this work are: 1) to quantify the expression of nt-RNAs (survey of the spectrum); 2) to examine the relationship between nt-RNA and protein expression (biological roles); 3) to detect nt-RNAs that affect cancer prognosis (biological roles); 4) to apply nt-RNAs as diagnostic biomarkers for cancer (application); 5) to identify nt-RNAs that classify tumors in cancer of unknown primary (CUP, application).
First, nt-RNAs were defined from gene databases, and all PTC-containing transcripts were compared with their corresponding normal transcripts to locate specific signature tags (both short sequence segments and splice junctions) for each nt-RNA. The presence and counts of these nt-RNA signature tags were then searched for in all reads of the RNA-seq datasets. This search produced the read count of each nt-RNA signature tag; all reads containing such tags are targets of NMD. The RNA-seq datasets used in this study included TCGA normal samples, TCGA tumor samples, and cancer cell lines for 13 cancer types.
In the example of KIRC, most differentially expressed nt-RNAs (tumor vs. control) were related to differential expression of the corresponding normal transcripts. However, nt-RNAs were produced in 900 genes independently of higher production of the normal transcripts. In KIRC, a collection of 12 genes in the proteasome ubiquitination pathway stood out among the highly produced nt-RNAs. This finding is notable because VHL-HIF1A is a key oncogenesis mechanism in KIRC and normal HIF1A degradation requires the proteasomal ubiquitination pathway. GO analysis was highly significant (p-value < 4.11E-05). The nt-RNA-producing genes were PSMB4, PSMD14, PSMC6, PSMD13, PSMB1, VCP, ANAPC5, PSMA4, PSMD3, ANAPC7, OS9, and GCLC.
Second, some nt-RNAs retarded translation of the normal transcripts. Using proteome data, the relationship between the quantity of nt-RNA unique tags and the normal protein product was analyzed by ANOVA comparison of linear models. A total of 422 nt-RNA unique tags influenced protein expression, which suggests a potential biological action of these nt-RNAs. PTEN also produced nt-RNA in KIRC, and tumor cells with higher PTEN nt-RNA had a lower PTEN protein level (p-value of ANOVA comparison of linear models: 0.017). Survival analysis showed that PTEN nt-RNA levels affected survival, which suggests that they can be used as a prognostic biomarker. Furthermore, survival analysis using clinical data was performed for the other nt-RNA unique tags that affected protein expression.
Third, the application of nt-RNAs as diagnostic markers and as markers to define tumor origin in CUP was examined. nt-RNAs were identified in different types of tumors. Only nt-RNAs that were independent of the normal gene transcripts in terms of differential expression were used as biomarkers. By comparing tumor samples with normal samples, nt-RNAs serving as diagnostic markers were detected. Unsupervised clustering was performed for these nt-RNAs, and heat maps showed a high degree of separation between tumor and normal samples. For studying tumor origin in CUP, in both a cross-validation study on the training dataset (N=541) and external validation on an independent sample set (N=2462), a highly discriminating set of nt-RNAs was defined for most cancers examined (400 nt-RNA sequence tags). Unsupervised clustering was performed for the 400 nt-RNA sequence tags, and heat maps showed their power to define tumor origin in CUP. The significance of the classifier formed by the 400 nt-RNA sequence tags was then measured by resampling the training set 100 times. Across the 100 resamplings, the correct classification rate was 96.4895% ± 0.75% (mean ± standard deviation) for the training set and 91.0239% ± 1.032611% for the validation set.
In conclusion, this study showed that nt-RNAs can have important biological functions and can be used for various applications. They are potential biomarkers for the diagnosis and prognosis of disease. They can also be used to determine the site of origin of tumors, which indicates that nt-RNAs will provide valuable information for potential applications in cancer diagnosis and in determining the origin of cancer of unknown primary site (CUP). [With diagram]
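Nonsense-mediated mRNA decay (NMD) is an important protective mechanism that guards against erroneous transcripts, especially those containing premature termination codons. In classical teaching, such erroneous transcripts (here called NMD-targeted transcripts, nt-RNA) are regarded as incidental, non-specific by-products of cellular transcription; they are rapidly cleared by NMD, so their expression in the cell is low (that is, their steady-state abundance is low). As by-products of random transcriptional error, they are generally considered to have no biological function.
By analyzing a large body of RNA-seq data from TCGA (more than 4000 samples, occupying over 50 TB of storage), we found that the nt-RNAs of some genes are highly expressed, in some cases exceeding the expression of the normal transcript of the same gene.
Our hypothesis is that some nt-RNAs are produced specifically by a biological process rather than by chance. Based on this hypothesis, the aims of this study are: (1) to quantify nt-RNA expression (a survey of the expression spectrum); (2) to explore the relationship between nt-RNA and protein expression (biological function); (3) to find nt-RNAs that affect cancer prognosis (biological function); (4) to use nt-RNAs as biomarkers for cancer diagnosis (application); (5) to identify nt-RNAs that can classify cancers of unknown primary (application).
First, nt-RNAs were defined from gene databases and compared with their corresponding normal transcripts to find tags specific to each nt-RNA (including sequence segments and splice junctions). These nt-RNA-specific tags were then searched for and counted across all reads of the RNA-seq data. This search and counting yielded the read count of each nt-RNA-specific tag, and the reads containing these tags are targets of NMD. The RNA-seq data used in this study comprise TCGA normal and tumor samples for 13 cancer types, together with cancer cell line samples.
In the kidney cancer example, most differentially expressed nt-RNAs (tumor versus normal) were associated with differential expression of their corresponding normal transcripts. However, the nt-RNAs produced by 900 genes were independent of high expression of the normal transcripts. We found that 12 genes related to the proteasome ubiquitination pathway highly express nt-RNA. This finding is interesting because VHL-HIF1A is an important oncogenic mechanism in KIRC, and normal HIF1A degradation requires the proteasome ubiquitination pathway. The proteasome ubiquitination pathway was significant in gene enrichment analysis (p-value < 4.11E-05). The 12 genes are PSMB4, PSMD14, PSMC6, PSMD13, PSMB1, VCP, ANAPC5, PSMA4, PSMD3, ANAPC7, OS9, and GCLC.
Second, some nt-RNAs can reduce translation of the normal transcripts. Using proteome data, we studied the relationship between nt-RNA-specific tags and normal protein products by ANOVA comparison of linear models. We found that 422 nt-RNA-specific tags affect protein expression, indicating that nt-RNAs have a potential biological role. PTEN also produces nt-RNA in KIRC; samples with higher PTEN nt-RNA expression contain less PTEN protein (p-value of the ANOVA comparison of linear models = 0.017). Survival analysis showed that PTEN nt-RNA affects survival, indicating that PTEN nt-RNA can serve as a prognostic biomarker for cancer. Survival analysis was also performed for the other nt-RNA-specific tags that affect protein expression.
Finally, I examined two applications of nt-RNAs: as diagnostic markers and as markers for defining the origin of cancer of unknown primary (CUP). Only nt-RNAs whose differential expression is independent of the normal transcripts were used as biomarkers. By comparing tumor and normal samples, nt-RNAs that can serve as diagnostic markers were identified. Unsupervised clustering and heat maps showed that these nt-RNAs clearly separate tumor from normal samples. In studying the origin of CUP, cross-validation on the training set (N=541) and an independent external validation set (N=2462) defined a set of nt-RNA tags that identifies most cancer samples (400 nt-RNA-specific sequence tags). Unsupervised clustering and heat maps showed the ability of these nt-RNAs to define the origin of CUP. The significance of the classifier formed by the 400 nt-RNA-specific sequence tags was then checked by randomly resampling the training set 100 times. Over the 100 resamplings, the mean and standard deviation of the correct classification rate were 96.4895% and 0.75% for the training set, and 91.0239% and 1.032611% for the validation set.
In summary, this study shows that nt-RNAs have important biological functions and multiple applications. They are potential biomarkers for cancer diagnosis and prognosis. They can also be used to determine the primary site of a cancer, which means that nt-RNAs will provide valuable information for potential applications in cancer diagnosis and in determining the primary site of cancer of unknown primary. [With figures]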
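The tag search-and-count step described in this abstract can be sketched as a substring scan over reads. This is a minimal illustration under stated assumptions rather than the thesis's implementation: the tag names, tag sequences, and reads below are invented, exact matching is assumed (no mismatches, no reverse-complement handling), and a real run over tens of terabytes would need indexing rather than a nested loop.

```python
# Minimal sketch (hypothetical tags and reads): count how many reads contain
# each nt-RNA signature tag, using exact substring matching.

def count_signature_tags(reads, tags):
    """Return {tag_name: number of reads containing that tag's sequence}."""
    counts = {name: 0 for name in tags}
    for read in reads:
        for name, sequence in tags.items():
            if sequence in read:
                counts[name] += 1
    return counts


# Invented signature tags (e.g. a sequence segment or a splice-junction span).
tags = {"ntRNA_A": "ACGTTGCA", "ntRNA_B": "GGATCCAA"}
reads = ["TTACGTTGCAGG", "CCGGATCCAATT", "ACGTTGCAACGT", "TTTTTTTT"]

print(count_signature_tags(reads, tags))  # {'ntRNA_A': 2, 'ntRNA_B': 1}
```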
Hu, Fuyan.
Thesis Ph.D. Chinese University of Hong Kong 2015.
Includes bibliographical references (leaves 173-211).
Abstracts also in Chinese.
Title from PDF title page (viewed on 12, October, 2016).
Detailed summary in vernacular field only.

40

Fang, Hao-Yu, and 方浩宇. "Using high-throughput sequencing data to identify the transcriptional start sites of mouse microRNAs." Thesis, 2013. http://ndltd.ncl.edu.tw/handle/32453885381096374336.

Abstract:

Master's degree
National Chung Hsing University
Institute of Genomics and Bioinformatics
Academic year 101 (ROC calendar)
MicroRNAs (miRNAs) are small non-coding RNAs that inhibit protein-coding gene expression by hybridizing with messenger RNAs (mRNAs). miRNAs are involved in many diverse biological processes and in various diseases. Identifying miRNA transcription start sites (TSSs) is important for studying the upstream regulatory networks of miRNAs. To date, studies of miRNA TSS identification have all focused on human miRNAs. We are interested in other species; our aim in this study is to identify mouse miRNA TSSs, and the results should contribute to understanding the evolution of the upstream regulatory networks of miRNAs. In this study, we integrated two types of high-throughput sequencing data, transcription start site sequencing (TSS-seq) and Cap Analysis of Gene Expression (CAGE), as evidence of miRNA TSSs. A machine-learning model based on the Support Vector Machine (SVM) was developed to identify mouse miRNA TSSs. In addition, we incorporated expressed sequence tags (ESTs) and sequence conservation information to provide further evidence for mouse miRNA TSSs.

41

Yao, Jianchao. "Integrative analysis of high-throughput biological data: shrinkage correlation coefficient and comparative expression analysis." Thesis, 2009. http://hdl.handle.net/2152/ETD-UT-2009-12-403.

Abstract:

The focus of this research is to develop and apply statistical methods to analyze and interpret high-throughput biological data. We developed a novel correlation coefficient, the shrinkage correlation coefficient (SCC), that fully exploits the similarity between replicated microarray experimental samples. The methodology considers both the number of replicates and the variance within each experimental group when clustering expression data, and provides a robust statistical estimate of the error in replicated microarray data. Applying SCC-based hierarchical clustering to replicated microarray data obtained from germinating spores of the fern Ceratopteris richardii, we discovered two clusters of genes with shared expression patterns during spore germination. This computational approach is applicable not only to DNA microarray analysis but also to proteomics data or any other high-throughput analysis methodology. The suppression of APY1 and APY2 in mutants expressing an inducible RNAi system resulted in plants with a dwarf phenotype and disrupted auxin distribution, and we used these mutants to discover which genes changed expression during growth suppression. Using the NimbleGen Arabidopsis thaliana 4-Plex microarray, we evaluated gene expression changes in apyrase-suppressed RNAi mutants grown in the light and in darkness. We compared the two sets of large-scale expression data and identified genes whose expression changed significantly after apyrase suppression in each condition. Our results allowed us to highlight some of the genes likely to play major roles in mediating the growth changes that occur when plants drastically reduce their production of APY1 and APY2, some more associated with growth promotion and others, such as stress-induced genes, more associated with growth inhibition. There is a strong rationale for ranking all these genes as prime candidates for mediating the inhibitory growth effects of suppressing apyrase expression; thus the NimbleGen data will serve as a catalyst and valuable guide for the subsequent physiological and molecular experiments needed to clarify the network of gene expression changes that accompany growth inhibition.

42

"Integrative analysis of large intergenic non-coding RNA and circular RNA using high-throughput RNA sequencing data." 2015. http://repository.lib.cuhk.edu.hk/en/item/cuhk-1291760.

Abstract:

Ji, Lu.
Thesis Ph.D. Chinese University of Hong Kong 2015.
Includes bibliographical references (leaves 154-170).
Abstracts also in Chinese.
Title from PDF title page (viewed on 10, November, 2016).

43

"Computational models for extracting structural signals from noisy high-throughput sequencing data: 通过计算模型来提取高通量测序数据中的分子结构信息." 2015. http://repository.lib.cuhk.edu.hk/en/item/cuhk-1291576.

Abstract:

Hu, Xihao.
Thesis Ph.D. Chinese University of Hong Kong 2015.
Includes bibliographical references (leaves 147-161).
Abstracts also in Chinese.
Title from PDF title page (viewed on 26, October, 2016).

44

Poirier-Morency, Guillaume. "Modélisation des réseaux de régulation de l’expression des gènes par les microARN." Thesis, 2020. http://hdl.handle.net/1866/25104.

Abstract:

MicroRNAs are small non-coding RNAs of about 22 nucleotides involved in the regulation of gene expression. They target complementary regions of the messenger RNA molecules that these genes encode and adjust their levels of translation into protein according to the needs of the cell. Because they bind their targets through partial sequence complementarity, these two groups of RNA molecules actively compete to form regulatory interactions. Consequently, quantitatively predicting the equilibrium concentrations of the duplexes formed is a task that must take several factors into account, including hybridization affinity, the capacity to catalyze the target, cooperativity, and the accessibility of the target RNA. In the model we propose, miRBooking 2.0, each possible interaction between a microRNA and a site on a target RNA to form a duplex is characterized by an enzymatic reaction. A reaction of this type operates in two phases: reversible formation of an enzyme-substrate complex, the microRNA-RNA duplex, followed by irreversible conversion of the substrate into product, a degraded target RNA, and release of the enzyme, which can then take part in a new reaction. We show that the steady state of this system, which can comprise up to 10 million equations in practice, is unique and that its Jacobian has very few nonzero values, allowing it to be solved efficiently with a sparse linear solver. This solution allows us to characterize this regulatory mechanism precisely and to study the role of microRNAs in a given cellular context. The predictions obtained on a HeLa cell model correlate significantly with an experimentally obtained data set and explain gene expression threshold effects remarkably well.
Using these predictions as the initial condition and a numerical integration method, we simulate in real time the response of the system to changes in experimental conditions. We apply this model to target elements involved in the epithelial-mesenchymal transition (EMT), a biological mechanism that allows cells to acquire mobility essential for proliferation. By identifying elements transcribed differentially between the epithelial and mesenchymal conditions, we design specific synthetic microRNAs to interfere with this transition. To do so, we propose a method based on a parallel greedy search to efficiently explore the sequence space of the microRNA, and present preliminary results on known EMT markers.
MicroRNAs are small non-coding RNAs approximately 22 nucleotides long that are involved in the regulation of gene expression. They target complementary regions of the RNA transcripts that these genes encode and adjust their concentration according to the needs of the cell. Because microRNAs and their RNA targets bind each other with imperfect complementarity, these two groups actively compete to form regulatory interactions. Consequently, quantitatively predicting their equilibrium concentrations is a task that must take several factors into account, including the affinity for hybridization, the ability to catalyze the target, cooperativity, and RNA accessibility. In the model we propose, miRBooking 2.0, each possible interaction between a microRNA and a binding site on a target RNA is characterized by an enzymatic reaction. A reaction of this type operates in two phases: reversible formation of an enzyme-substrate complex, the microRNA-RNA duplex, and irreversible conversion of the substrate into an RNA degradation product, which releases the enzyme so that it can participate in other reactions. We show that the stationary state of this system, which can include up to 10 million equations in practice, has a very sparse Jacobian, allowing it to be solved efficiently with a sparse linear solver. This solution allows us to characterize the mechanism of regulation precisely and to study the role of microRNAs in a given cellular context. Predictions obtained on a HeLa S3 cell model correlate significantly with a set of experimentally obtained data and explain the expression threshold effects of genes remarkably well. Using this solution as an initial condition and an explicit method of numerical integration, we simulate in real time the response of the system to changes in experimental conditions. We apply this model to target elements involved in the epithelial-mesenchymal transition (EMT), an important mechanism of tumor proliferation.
By identifying elements differentially expressed between the two conditions, we design synthetic microRNAs to interfere with the transition. To do so, we propose a method based on a parallel greedy best-first search to efficiently explore the sequence space of the microRNA, and present preliminary results on known EMT markers.
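The two-phase enzymatic picture in this abstract (reversible duplex formation followed by irreversible conversion) can be illustrated for a single microRNA-target pair. The sketch below is not miRBooking 2.0 and does not use its solver: it assumes one reaction in isolation, fixed total concentrations `e_total` (microRNA) and `s_total` (target), and a lumped constant `km = (k_r + k_cat) / k_f`; all of these names are hypothetical. Under those assumptions the steady-state condition `[ES] = [E][S] / km`, with `[E] = e_total - [ES]` and `[S] = s_total - [ES]`, reduces to a quadratic whose smaller root is the physical duplex concentration.

```python
import math

# Single-reaction sketch (assumed names; not the thesis's sparse solver):
# steady-state duplex concentration for E + S <-> ES -> E + P with totals held fixed.

def duplex_concentration(e_total: float, s_total: float, km: float) -> float:
    """Solve ES^2 - (e_total + s_total + km) * ES + e_total * s_total = 0 for the smaller root."""
    b = e_total + s_total + km
    return (b - math.sqrt(b * b - 4.0 * e_total * s_total)) / 2.0


es = duplex_concentration(e_total=1.0, s_total=1.0, km=0.5)
free_e = 1.0 - es
free_s = 1.0 - es
print(es)                     # 0.5
print(free_e * free_s / 0.5)  # 0.5, consistent with [ES] = [E][S] / km
```

Coupling many such reactions through shared microRNA and target pools is what produces the large system of equations mentioned in the abstract; this single-pair case only shows the form of each term.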
