Dissertations / Theses on the topic 'High-throughput sequencing data'
Consult the top 44 dissertations / theses for your research on the topic 'High-throughput sequencing data.'
Roguski, Łukasz 1987. "High-throughput sequencing data compression." Doctoral thesis, Universitat Pompeu Fabra, 2017. http://hdl.handle.net/10803/565775.
Thanks to advances in sequencing technologies, biomedical research has undergone a revolution in recent years, one result of which has been an explosion in the volume of genomic data generated worldwide. The sequencing data generated in medium-scale experiments typically range from ten to one hundred gigabytes, stored in several files in the different formats produced by each experiment. The current de facto standard formats for representing genomic data are textual. For practical reasons, the data need to be stored in compressed form. In most cases, these compression methods rely on general-purpose text compressors such as gzip. However, they cannot exploit the information models specific to sequencing data. As a result, they provide limited functionality and insufficient savings in storage space. This explains why relatively basic operations, such as processing, storing, and transferring genomic data, have become one of the main bottlenecks of current analysis pipelines. This thesis therefore focuses on methods for the efficient storage and compression of data generated in sequencing experiments. First, we propose a novel general-purpose compressor for FASTQ files. Compared with gzip, this compressor significantly reduces the size of the compressed file. In addition, the tool processes data at high speed. Next, we present compression methods that exploit the high sequence redundancy present in sequencing data. These methods achieve the best compression ratio among current state-of-the-art FASTQ compressors, without using any external reference.
We also present lossy compression approaches for storing auxiliary sequencing data, which reduce the data size even further. Finally, we contribute a flexible compression framework and a data format. This framework makes it possible to generate, semi-automatically, compression solutions that are not tied to any specific genomic data file format. To ease complex data management, multiple datasets in heterogeneous formats can be stored in configurable containers, with the option of running custom queries over the stored data. Moreover, we show that simple solutions based on our framework can achieve results comparable to state-of-the-art format-specific compressors. In summary, the solutions developed and described in this thesis can easily be incorporated into genomic data analysis pipelines. Taken together, these solutions provide a solid basis for developing complete approaches to the efficient storage and management of genomic data.
Durif, Ghislain. "Multivariate analysis of high-throughput sequencing data." Thesis, Lyon, 2016. http://www.theses.fr/2016LYSE1334/document.
The statistical analysis of Next-Generation Sequencing data raises many computational challenges regarding modeling and inference, especially because of the high dimensionality of genomic data. The research work in this manuscript concerns hybrid dimension reduction methods that rely on both compression (representation of the data in a lower-dimensional space) and variable selection. Developments are made concerning the sparse Partial Least Squares (PLS) regression framework for supervised classification and the sparse matrix factorization framework for unsupervised exploration. In both situations, our main purpose will be to focus on the reconstruction and visualization of the data. First, we will present a new sparse PLS approach, based on an adaptive sparsity-inducing penalty, that is suitable for logistic regression to predict the label of a discrete outcome. For instance, such a method will be used for prediction (fate of patients or specific type of unidentified single cells) based on gene expression profiles. The main issue in such a framework is to account for the response when discarding irrelevant variables. We will highlight the direct link between the derivation of the algorithms and the reliability of the results. Then, motivated by questions regarding single-cell data analysis, we propose a flexible model-based approach for the factorization of count matrices that accounts for over-dispersion as well as zero-inflation (both characteristic of single-cell data), for which we derive an estimation procedure based on variational inference. In this scheme, we consider probabilistic variable selection based on a spike-and-slab model suitable for count data. The value of our procedure for data reconstruction, visualization, and clustering will be illustrated by simulation experiments and by preliminary results on single-cell data analysis.
All proposed methods were implemented in two R packages, "plsgenomics" and "CMF", based on high-performance computing.
Zhang, Xuekui. "Mixture models for analysing high throughput sequencing data." Thesis, University of British Columbia, 2011. http://hdl.handle.net/2429/35982.
Hoffmann, Steve. "Genome Informatics for High-Throughput Sequencing Data Analysis." Doctoral thesis, Universitätsbibliothek Leipzig, 2014. http://nbn-resolving.de/urn:nbn:de:bsz:15-qucosa-152643.
This thesis presents three different algorithmic and statistical strategies for the analysis of high-throughput sequencing data. First, we introduce a heuristic method based on enhanced suffix arrays that aligns short sequences to large genomes. The method is based on the idea of an error-tolerant traversal of a suffix array for reference genomes, combined with Chang's concept of matching statistics and a bit-vector-based alignment algorithm by Myers. The method supports paired-end and mate-pair alignments and offers methods for detecting primer sequences and for trimming poly-A signals. In independent benchmarks as well, the method shows high sensitivity and specificity on simulated and real datasets. For a large number of sequencing protocols, it achieves better results than other well-known short-read alignment programs. Second, we present a dynamic-programming-based algorithm for the spliced alignment problem. The advantage of this algorithm is its ability to identify not only collinear splice events, i.e., splice events on the same genomic strand, but also circular and other non-collinear splice events. The method is highly accurate: while it achieves results comparable to other methods in detecting collinear splice variants, it outperforms its competitors in sensitivity and specificity when predicting non-collinear splice variants. Applying this algorithm led to the identification of new isoforms. In our publication, we report a new isoform of the tumor suppressor gene p53.
Since this gene is one of the best-studied genes in the human genome, applying our algorithm could help to identify many further isoforms in less prominent genes. Third, we present a data-adaptive model for identifying single nucleotide variations (SNVs). In our work, we show that our model, based on empirical log-likelihoods, automatically adapts to the quality of the sequencing experiments and makes a "decision" about which potential variations are to be classified as SNVs. In our simulations, this method is on par with currently used approaches. Finally, we present a selection of biological results related to the particular features of the presented alignment methods.
Stromberg, Michael Peter. "Enabling high-throughput sequencing data analysis with MOSAIK." Thesis, Boston College, 2010. http://hdl.handle.net/2345/1332.
During the last few years, numerous new sequencing technologies have emerged that require tools that can process large amounts of read data quickly and accurately. Regardless of the downstream methods used, reference-guided aligners are at the heart of all next-generation analysis studies. I have developed a general reference-guided aligner, MOSAIK, to support all current sequencing technologies (Roche 454, Illumina, Applied Biosystems SOLiD, Helicos, and Sanger capillary). The calibrated alignment qualities calculated by MOSAIK allow the user to fine-tune the alignment accuracy for a given study. MOSAIK is a highly configurable and easy-to-use suite of alignment tools that is used in hundreds of labs worldwide. MOSAIK is an integral part of our genetic variant discovery pipeline. From SNP and short-INDEL discovery to structural variation discovery, alignment accuracy is an essential requirement and enables our downstream analyses to provide accurate calls. In this thesis, I present three major studies that were formative during the development of MOSAIK and our analysis pipeline. In addition, I present a novel algorithm that identifies mobile element insertions (non-LTR retrotransposons) in the human genome using split-read alignments in MOSAIK. This algorithm has a low false discovery rate (4.4%) and enabled our group to be the first to determine the number of mobile elements that differentially occur between any two individuals.
Thesis (PhD) — Boston College, 2010
Submitted to: Boston College. Graduate School of Arts and Sciences
Discipline: Biology
Xing, Zhengrong. "Poisson multiscale methods for high-throughput sequencing data." Thesis, The University of Chicago, 2016. http://pqdtopen.proquest.com/#viewpdf?dispub=10195268.
In this dissertation, we focus on the problem of analyzing data from high-throughput sequencing experiments. With the emergence of more capable hardware and more efficient software, these sequencing data provide information at an unprecedented resolution. However, statistical methods developed for such data rarely tackle the data at such high resolutions, and often make approximations that only hold under certain conditions.
We propose a model-based approach to dealing with such data, starting from a single sample. By taking into account the inherent structure present in such data, our model can accurately capture important genomic regions. We also present the model in such a way that makes it easily extensible to more complicated and biologically interesting scenarios.
Building upon the single-sample model, we then turn to the statistical question of detecting differences between multiple samples. Such questions often arise in the context of expression data, where much emphasis has been put on the problem of detecting differential expression between two groups. By extending the framework for a single sample to incorporate additional group covariates, our model provides a systematic approach to estimating and testing for such differences. We then apply our method to several empirical datasets, and discuss the potential for further applications to other biological tasks.
We also seek to address a different statistical question, where the goal is to perform exploratory analysis to uncover hidden structure within the data. We incorporate the single-sample framework into a commonly used clustering scheme, and show that our enhanced clustering approach is superior to the original clustering approach in many ways. We then apply our clustering method to a few empirical datasets and discuss our findings.
Finally, we apply the shrinkage procedure used within the single-sample model to tackle a completely different statistical issue: nonparametric regression with heteroskedastic Gaussian noise. We propose an algorithm that accurately recovers both the mean and variance functions given a single set of observations, and demonstrate its advantages over state-of-the-art methods through extensive simulation studies.
Fritz, Markus Hsi-Yang. "Exploiting high throughput DNA sequencing data for genomic analysis." Thesis, University of Cambridge, 2012. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.610819.
Woolford, Julie Ruth. "Statistical analysis of small RNA high-throughput sequencing data." Thesis, University of Cambridge, 2012. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.610375.
Kircher, Martin. "Understanding and improving high-throughput sequencing data production and analysis." Doctoral thesis, Universitätsbibliothek Leipzig, 2011. http://nbn-resolving.de/urn:nbn:de:bsz:15-qucosa-71102.
Ainsworth, David. "Computational approaches for metagenomic analysis of high-throughput sequencing data." Thesis, Imperial College London, 2016. http://hdl.handle.net/10044/1/44070.
Mohamadi, Hamid. "Parallel algorithms and software tools for high-throughput sequencing data." Thesis, University of British Columbia, 2017. http://hdl.handle.net/2429/62072.
Science, Faculty of
Graduate
Mammana, Alessandro [Verfasser]. "Patterns and algorithms in high-throughput sequencing count data / Alessandro Mammana." Berlin : Freie Universität Berlin, 2016. http://d-nb.info/1108270956/34.
Love, Michael I. [Verfasser]. "Statistical analysis of high-throughput sequencing count data / Michael I. Love." Berlin : Freie Universität Berlin, 2013. http://d-nb.info/1043197842/34.
Ballinger, Tracy J. "Analysis of genomic rearrangements in cancer from high throughput sequencing data." Thesis, University of California, Santa Cruz, 2015. http://pqdtopen.proquest.com/#viewpdf?dispub=3729995.
In the last century cancer has become increasingly prevalent and is the second largest killer in the United States, estimated to afflict 1 in 4 people during their life. Despite our long history with cancer and our herculean efforts to thwart the disease, in many cases we still do not understand the underlying causes or have successful treatments. In my graduate work, I’ve developed two approaches to the study of cancer genomics and applied them to the whole genome sequencing data of cancer patients from The Cancer Genome Atlas (TCGA). In collaboration with Dr. Ewing, I built a pipeline to detect retrotransposon insertions from paired-end high-throughput sequencing data and found somatic retrotransposon insertions in a fifth of cancer patients.
My second novel contribution to the study of cancer genomics is the development of the CN-AVG pipeline, a method for reconstructing the evolutionary history of a single tumor by predicting the order of structural mutations such as deletions, duplications, and inversions. The CN-AVG theory was developed by Drs. Haussler, Zerbino, and Paten and samples potential evolutionary histories for a tumor using Markov Chain Monte Carlo sampling. I contributed to the development of this method by testing its accuracy and limitations on simulated evolutionary histories. I found that the ability to reconstruct a history decays exponentially with increased breakpoint reuse, but that we can estimate how accurately we reconstruct a mutation event using the likelihood scores of the events. I further designed novel techniques for the application of CN-AVG to whole genome sequencing data from actual patients and applied these techniques to search for evolutionary patterns in glioblastoma multiforme using sequencing data from TCGA. My results show patterns of two-hit deletions, as we would expect, and amplifications occurring over several mutational events. I also find that the CN-AVG method frequently makes use of whole chromosome copy number changes followed by localized deletions, a bias that could be mitigated by modifying the cost function for an evolutionary history.
Paicu, Claudia. "miRNA detection and analysis from high-throughput small RNA sequencing data." Thesis, University of East Anglia, 2016. https://ueaeprints.uea.ac.uk/63738/.
Glaus, Peter. "Bayesian methods for gene expression analysis from high-throughput sequencing data." Thesis, University of Manchester, 2014. https://www.research.manchester.ac.uk/portal/en/theses/bayesian-methods-for-gene-expression-analysis-from-highthroughput-sequencing-data(cf9680e0-a3f2-4090-8535-a39f3ef50cc4).html.
Beckers, Matthew. "Quality checking and expression analysis of high-throughput small RNA sequencing data." Thesis, University of East Anglia, 2015. https://ueaeprints.uea.ac.uk/58581/.
Gupta, Namita. "Computational Identification of B Cell Clones in High-Throughput Immunoglobulin Sequencing Data." Thesis, Yale University, 2017. http://pqdtopen.proquest.com/#viewpdf?dispub=10633249.
Humoral immunity is driven by the expansion, somatic hypermutation, and selection of B cell clones. Each clone is the progeny of a single B cell responding to antigen, with diversified Ig receptors. The advent of next-generation sequencing technologies enables deep profiling of the Ig repertoire. This large-scale characterization provides a window into the micro-evolutionary dynamics of the adaptive immune response and has a variety of applications in basic science and clinical studies. Clonal relationships are not directly measured, but must be computationally inferred from these sequencing data. In this dissertation, we use a combination of human experimental and simulated data to characterize the performance of hierarchical clustering-based methods for partitioning sequences into clones. Our results suggest that hierarchical clustering using single linkage with nucleotide Hamming distance identifies clones with high confidence and provides a fully automated method for clonal grouping. The performance estimates we develop provide important context to interpret clonal analysis of repertoire sequencing data and allow for rigorous testing of other clonal grouping algorithms. We present the clonal grouping tool as well as other tools for advanced analyses of large-scale Ig repertoire sequencing data through a suite of utilities, Change-O. All Change-O tools utilize a common data format, which enables the seamless integration of multiple analyses into a single workflow. We then apply the Change-O suite in concert with the nucleotide coding sequences for WNV-specific antibodies derived from single cells to identify expanded WNV-specific clones in the repertoires of recently infected subjects through quantitative Ig repertoire sequencing analysis.
The method proposed in this dissertation to computationally identify B cell clones in Ig repertoire sequencing data with high confidence is made available through the Change-O suite and can be applied to provide insight into the dynamics of the adaptive immune response.
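To make the clustering idea in the abstract above concrete, here is a minimal Python sketch of single-linkage grouping under a nucleotide Hamming distance cutoff. It is an illustration only, not the Change-O implementation: the function names, the length-normalized distance, and the assumption that all input sequences are non-empty and of equal length are choices made for this example. With a fixed cutoff, single linkage is equivalent to taking connected components of the graph that links any two sequences within the cutoff, which is what the union-find structure computes.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def single_linkage_clones(seqs, threshold):
    """Group sequences into clusters by single linkage.

    Two sequences end up in the same cluster when a chain of pairwise
    length-normalized Hamming distances <= threshold connects them.
    Assumes (for this sketch) non-empty sequences of equal length.
    """
    parent = list(range(len(seqs)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(seqs)), 2):
        if hamming(seqs[i], seqs[j]) / len(seqs[i]) <= threshold:
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[rj] = ri  # merge the two clusters

    clusters = {}
    for i, seq in enumerate(seqs):
        clusters.setdefault(find(i), []).append(seq)
    return sorted(clusters.values())


example = ["AAAA", "AAAT", "AATT", "GGGG"]
print(single_linkage_clones(example, threshold=0.25))
```

Note that "AAAA" and "AATT" differ at half of their positions yet land in the same cluster because "AAAT" bridges them; this chaining behavior is the defining property of single linkage. The all-pairs loop is quadratic in the number of sequences, so a real repertoire would need partitioning before clustering.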
González-Vallinas, Rostes Juan 1983. "Software development and analysis of high throughput sequencing data for genomic enhancer prediction." Doctoral thesis, Universitat Pompeu Fabra, 2013. http://hdl.handle.net/10803/283480.
High-throughput sequencing (HTS) technologies are becoming the standard method for analyzing genomic regulation. During my thesis, I developed software for the analysis of HTS data. Through collaboration with other research groups, I specialized in the analysis of ChIP-Seq data. For example, I collaborated on the analysis of the effect of Hog1 in yeast cells under stress and on the design of a method for the multiple alignment of promoters using ChIP-Seq data, among other collaborations. Using the knowledge and software developed during this time, I analyzed data produced by the ENCODE project to detect active genomic enhancers. Enhancers are regions of the genome known to regulate the transcription of nearby and distant genes. The mechanisms of enhancer activation and silencing are still poorly understood. Epigenomic elements, such as histone modifications and transcription factors, play a crucial role in enhancer activity. By building a model with these epigenomic signals, I predicted active and silenced enhancers in two cell lines and studied their effect on splicing and on transcription initiation.
Becher, Hannes. "Differentiation across the Podisma pedestris hybrid zone inferred from high-throughput sequencing data." Thesis, Queen Mary, University of London, 2018. http://qmro.qmul.ac.uk/xmlui/handle/123456789/39744.
Makowski, Mateusz. "High-Throughput Data Analysis: Application to Micronuclei Frequency and T-cell Receptor Sequencing." VCU Scholars Compass, 2015. http://scholarscompass.vcu.edu/etd/3923.
Ye, Lin, and 叶林. "Exploring microbial community structures and functions of activated sludge by high-throughput sequencing." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2012. http://hub.hku.hk/bib/B48079649.
Civil Engineering
Doctoral
Doctor of Philosophy
Althammer, Sonja Daniela. "Elucidating mechanisms of gene regulation. Integration of high-throughput sequencing data for studying the epigenome." Doctoral thesis, Universitat Pompeu Fabra, 2012. http://hdl.handle.net/10803/81355.
The recent arrival of new high-throughput sequencing (HTS) methods has sparked a revolution in the study of gene regulation. The need to process the immense amount of data generated, with the aim of studying regulatory mechanisms in the cell, has never been greater. In this thesis we address this issue by presenting methods to analyze, integrate, and interpret HTS data from different sources. In particular, we have developed Pyicos, a powerful toolkit that offers flexibility, versatility, and efficient memory usage. We applied it to ChIP-Seq data for the progesterone receptor in breast cancer cells in order to investigate the mechanisms of hormone regulation. In addition, we incorporated Pyicos into a pipeline to integrate HTS data from different sources. We used the ENCODE datasets to systematically compute signal changes between two cell lines. In this way we built a model that predicts changes in gene expression quite accurately, based on the epigenetic changes at a gene's locus. Finally, we made the processed data available to the scientific community in a Biomart database.
Videm, Pavankumar [Verfasser], and Rolf [Akademischer Betreuer] Backofen. "Analysis of high-throughput sequencing data related to small non-coding RNAs biogenesis and function." Freiburg : Universität, 2021. http://d-nb.info/1238518087/34.
Bansal, Vikas [Verfasser]. "Computational Analysis of High-Throughput Sequencing Data in Cardiac Disease and Skeletal Muscle Development / Vikas Bansal." Berlin : Freie Universität Berlin, 2016. http://d-nb.info/1110884494/34.
Ott, Felix [Verfasser], and Detlef [Akademischer Betreuer] Weigel. "An Integrated Data Analysis Suite and Programming Framework for High-Throughput DNA Sequencing / Felix Ott ; Betreuer: Detlef Weigel." Tübingen : Universitätsbibliothek Tübingen, 2014. http://d-nb.info/1162897147/34.
Kawalia, Amit [Verfasser], Peter [Gutachter] Nürnberg, and Michael [Gutachter] Nothnagel. "Addressing NGS Data Challenges: Efficient High Throughput Processing and Sequencing Error Detection / Amit Kawalia ; Gutachter: Peter Nürnberg, Michael Nothnagel." Köln : Universitäts- und Stadtbibliothek Köln, 2016. http://d-nb.info/112370368X/34.
Hoffmann, Steve [Verfasser], Peter F. [Gutachter] Stadler, and Rolf [Gutachter] Backofen. "Genome Informatics for High-Throughput Sequencing Data Analysis : Methods and Applications / Steve Hoffmann ; Gutachter: Peter F. Stadler, Rolf Backofen." Leipzig : Universitätsbibliothek Leipzig, 2014. http://d-nb.info/1238789528/34.
Sheppard, Sarah E. "Application of a Naïve Bayes Classifier to Assign Polyadenylation Sites from 3' End Deep Sequencing Data: A Dissertation." eScholarship@UMMS, 2013. http://escholarship.umassmed.edu/gsbs_diss/653.
Martin, Marcel [Verfasser], Sven [Akademischer Betreuer] Rahmann, and Jens [Akademischer Betreuer] Stoye. "Algorithms and tools for the analysis of high throughput DNA sequencing data / Marcel Martin. Betreuer: Sven Rahmann. Gutachter: Jens Stoye." Dortmund : Universitätsbibliothek Dortmund, 2014. http://d-nb.info/1095767682/34.
Kircher, Martin [Verfasser], Janet [Akademischer Betreuer] Kelso, Anton [Gutachter] Nekrutenko, and Peter F. [Gutachter] Stadler. "Understanding and improving high-throughput sequencing data production and analysis / Martin Kircher ; Gutachter: Anton Nekrutenko, Peter F. Stadler ; Betreuer: Janet Kelso." Leipzig : Universitätsbibliothek Leipzig, 2011. http://d-nb.info/1237894654/34.
Pantano, Rubiño Lorena. "Full characterization of the small RNA transcriptome using novel computational methods for high-throughput sequencing data: study of miRNA variability in eukaryote organisms." Doctoral thesis, Universitat Pompeu Fabra, 2011. http://hdl.handle.net/10803/53576.
In this thesis we developed a tool, SeqBuster, for the analysis of small RNA (sRNA) data generated by new sequencing technologies, with special emphasis on the characterization of miRNA variants. We applied the tool to public sequencing data, which revealed an unexpected abundance of isomiRs in different species. In addition, we detected all known classes of other sRNAs as well as new sRNAs of unknown function. We also studied the involvement of miRNAs and isomiRs in the development and aging of the human brain, and in Huntington's disease. Our results highlight the possible importance of miRNA sequence plasticity, with probable consequences for the regulation of gene expression underlying various biological functions. Finally, SeqBuster could be extremely useful for identifying new sRNAs with a possible role in specific biological processes.
Arora, Ankit [Verfasser], and Peter [Gutachter] Nürnberg. "Development of Web-Application for High-Throughput Sequencing Data and In Silico Dissection of LINE-1 Retrotransposons in Cellular Senescence / Ankit Arora ; Gutachter: Peter Nürnberg." Köln : Universitäts- und Stadtbibliothek Köln, 2018. http://d-nb.info/1171422660/34.
Full textWang, Wei. "Unveiling Molecular Mechanisms of piRNA Pathway from Small Signals in Big Data: A Dissertation." eScholarship@UMMS, 2010. http://escholarship.umassmed.edu/gsbs_diss/805.
Full textWang, Wei. "Unveiling Molecular Mechanisms of piRNA Pathway from Small Signals in Big Data: A Dissertation." eScholarship@UMMS, 2015. https://escholarship.umassmed.edu/gsbs_diss/805.
Wei-Chu Chen and 陳薇筑. "Optimizing microalgae genome assembly of high throughput sequencing data." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/f5c43h.
Chen, Chien-Chih, and 陳建智. "Scalable Assembly of High-Throughput De Novo Sequencing Data." Thesis, 2013. http://ndltd.ncl.edu.tw/handle/30042954633923603047.
National Taiwan University
Graduate Institute of Computer Science and Information Engineering
101
DNA sequencing is one of the most important procedures in molecular biology research for determining the sequences of bases in specific DNA segments. With the development of next-generation sequencing technologies, studies on genomics and transcriptomics are moving into a new era. However, current DNA sequencing technologies cannot read an entire genome or transcript in one step; instead, small sequences of 20–1000 bases are read. Thus, sequence assembly continues to be one of the central problems in bioinformatics. The challenges facing sequence assembly include the following: (1) sequencing errors, (2) repeat sequences, (3) nonuniform coverage, and (4) the computational complexity of processing large volumes of data. Given these challenges and the rapid growth of data throughput delivered by next-generation sequencing technologies, there is a pressing need for sequence assembly software that can efficiently handle massive sequencing data by using scalable and on-demand computing resources. These requirements fit the model of cloud computing, in which computing resources can be allocated on demand over the Internet from several thousand computers offered by vendors for analyzing data in parallel. Such cloud-computing applications are constantly being developed for large datasets and are run under the MapReduce framework. In this dissertation, we propose CloudBrush, a parallel pipeline that runs on the MapReduce framework for de novo assembly of high-throughput sequencing data. CloudBrush is based on bidirected string graphs, and its analysis consists of two main stages: graph construction and graph simplification. During graph construction, a node is defined for each nonredundant sequence read, and an edge is defined for each overlap between reads. We have developed a prefix-and-extend algorithm for identifying overlaps between a pair of reads.
The graph is further simplified by using conventional operations such as transitive reduction, path compression, tip removal, and bubble removal. We have also introduced a new operation, edge adjustment, for removing erroneous topological structures in string graphs. This operation uses the sequence information of all graph neighbors of each read and eliminates the edges connecting to reads containing rare bases. CloudBrush was evaluated against the Genome Assembly Gold-Standard Evaluation (GAGE) benchmarks to compare its assembly quality with that of other assemblers. The results showed that our assemblies have a moderate N50 and a low rate of misjoins and indels. In addition, we have introduced two measures, precision and recall, to assess how faithfully contigs align to the target genomes. Compared with the assembly tools used in the GAGE benchmarks, CloudBrush was found to produce contigs with high precision and recall. We have also introduced a T-CloudBrush pipeline for transcriptome data. T-CloudBrush uses the multiple-k concept to overcome the problem of nonuniform coverage of transcriptome data. This concept is based on the observed correlation between sequencing data coverage and the overlap size used during assembly. The experimental results showed that T-CloudBrush improves the accuracy of de novo transcriptome assembly. In summary, this dissertation explores the challenges facing sequence assembly under a scalable computing framework and provides possible solutions for the problems of sequencing errors, nonuniform coverage, and processing of large volumes of data.
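The graph construction stage described in the abstract above (one node per nonredundant read, one edge per overlap) can be illustrated with a small single-machine Python sketch. This is not CloudBrush's prefix-and-extend algorithm and does not run on MapReduce: it uses a naive suffix-prefix comparison, considers the forward strand only (a bidirected string graph would also handle reverse complements), and the names `overlap_length`, `build_overlap_graph`, and `min_overlap` are hypothetical choices for this example.

```python
def overlap_length(a: str, b: str, min_overlap: int) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when no overlap of at least `min_overlap` bases exists.
    """
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    # Try the longest candidate first so the first hit is the maximum.
    for k in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def build_overlap_graph(reads, min_overlap):
    """Return (nodes, edges) for a naive forward-strand overlap graph.

    nodes: the distinct reads, in first-seen order.
    edges: dict mapping (read_a, read_b) to the overlap length, for every
           ordered pair whose suffix-prefix overlap is >= min_overlap.
    """
    nodes = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    edges = {}
    for a in nodes:
        for b in nodes:
            if a == b:
                continue
            k = overlap_length(a, b, min_overlap)
            if k:
                edges[(a, b)] = k
    return nodes, edges


reads = ["ACGTAC", "GTACGG", "ACGTAC", "CGGTTT"]
nodes, edges = build_overlap_graph(reads, min_overlap=3)
print(nodes)
print(edges)
```

The all-pairs comparison is quadratic in the number of reads, which is exactly the cost that indexed or distributed overlap detection is meant to avoid; the sketch only shows what the resulting graph contains.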
"Analysis of nonsense-mediated decay targeted RNA (nt-RNA) in high-throughput sequencing data." 2015. http://repository.lib.cuhk.edu.hk/en/item/cuhk-1291523.
Analysis of a large collection of RNA-seq data from TCGA (over 4,000 samples, occupying more than 50 TB of disk storage) showed that nt-RNA were produced in large amounts for some genes; in some cases they were even more abundant than the normal transcripts of the corresponding genes.
Based on the hypothesis that some nt-RNA are produced specifically by a biological process (rather than by chance), the aims of this work are: 1) to quantify the expression of nt-RNA (survey of the spectrum); 2) to examine the relationship between nt-RNA and protein expression (biological roles); 3) to detect nt-RNA that affect the prognosis of cancer (biological roles); 4) to apply nt-RNA as diagnostic biomarkers for cancer (application); 5) to identify nt-RNA that classify cancers of unknown primary (CUP; application).
First, nt-RNA were defined from gene databases, and every transcript containing a premature termination codon (PTC) was compared with its corresponding normal transcript to locate signature tags specific to each nt-RNA (both short sequence segments and splice junctions). These signature tags were then searched for and counted in all reads of the RNA-seq datasets. The search produced a read count for each nt-RNA signature tag; every read containing such a tag derives from a target of NMD. The RNA-seq datasets used in this study comprised TCGA normal samples, TCGA tumor samples, and cancer cell lines for 13 cancer types.
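The tag search and counting step can be pictured with a minimal sketch. It assumes exact substring matching on the forward strand and counts each read at most once per tag; the tag names and sequences are invented for illustration and do not come from the thesis.

```python
def count_signature_tags(reads, tags):
    """Count, for each signature tag, the reads that contain it.

    `tags` maps a hypothetical tag name to its sequence. A read that
    contains the same tag several times is still counted once.
    """
    counts = {name: 0 for name in tags}
    for read in reads:
        for name, tag in tags.items():
            if tag in read:
                counts[name] += 1
    return counts


# Invented example data.
tags = {"geneA_nt": "ACGTAC", "geneB_nt": "TTGGCC"}
reads = ["GGACGTACGG", "ACGTACACGTAC", "TTTTTTTT", "AATTGGCCAA"]
print(count_signature_tags(reads, tags))  # {'geneA_nt': 2, 'geneB_nt': 1}
```

At the scale reported in the abstract (thousands of samples, tens of terabytes), a linear scan like this would be too slow; an indexed or hashed k-mer lookup would be the usual replacement.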
In kidney renal clear cell carcinoma (KIRC), for example, most differentially expressed nt-RNA (tumor vs. control) were associated with differential expression of the corresponding normal transcripts. However, 900 genes produced nt-RNA independently of higher production of the normal transcripts. A group of 12 genes in the proteasome ubiquitination pathway stood out among the highly produced nt-RNA. This finding is notable because VHL-HIF1A is a key oncogenic mechanism in KIRC, and normal HIF1A degradation requires the proteasomal ubiquitination pathway. The GO analysis was highly significant (p-value < 4.11E-05). The nt-RNA-producing genes were PSMB4, PSMD14, PSMC6, PSMD13, PSMB1, VCP, ANAPC5, PSMA4, PSMD3, ANAPC7, OS9, and GCLC.
Second, some nt-RNA reduced translation of the normal transcripts. Using proteome data, the relationship between the quantity of nt-RNA unique tags and the normal protein product was analyzed by ANOVA comparison of linear models. In total, 422 nt-RNA unique tags influenced protein expression, which suggests a potential biological action of these nt-RNA. PTEN also produced nt-RNA in KIRC, and tumor cells with higher PTEN nt-RNA had a lower PTEN protein level (p-value of ANOVA comparison of linear models: 0.017). Survival analysis showed that PTEN nt-RNA levels affected survival, which suggests that they can be used as a biomarker for prognosis. Survival analysis using clinical data was also performed for the other nt-RNA unique tags that affected protein expression.
Third, the application of nt-RNA as diagnostic markers and as markers to define tumor origin in CUP was examined. nt-RNA were identified in different types of tumors. Only nt-RNA that were independent of the normal gene transcripts in terms of differential expression were used as biomarkers. nt-RNA suitable as diagnostic markers were detected by comparing tumor samples with normal samples. Unsupervised clustering was performed for these nt-RNA, and heat maps showed a high degree of separation between tumor and normal samples. For studying tumor origin in CUP, a highly discriminating set of nt-RNA (400 nt-RNA sequence tags) was defined for most cancers examined, in both a cross-validation study on the training dataset (N=541) and an external validation on an independent sample set (N=2462). Unsupervised clustering was performed for the 400 nt-RNA sequence tags, and heat maps showed their power to define tumor origin in CUP. The significance of the classifier formed by the 400 nt-RNA sequence tags was then measured by resampling the training set 100 times. Across the 100 resamplings, the rate of correctly classified instances was 96.4895% ± 0.75% (mean ± standard deviation) for the training set and 91.0239% ± 1.032611% for the validation set.
In conclusion, this study showed that nt-RNA can have important biological functions and can be used in various applications. They are potential biomarkers for the diagnosis and prognosis of disease. They can also be used to determine the site of origin of tumors, which indicates that nt-RNA could provide valuable information for diagnosing cancer and for determining the origin of cancer of unknown primary site (CUP). [With diagram]
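Nonsense-mediated mRNA decay (NMD) is an important protective mechanism that guards against erroneous transcripts, particularly transcripts containing a premature termination codon. In the classical textbook view, such erroneous transcripts (here called NMD-targeted transcripts and denoted nt-RNA) are regarded as non-specific by-products generated by chance during cellular transcription; they are rapidly cleared by NMD, so their expression in the cell is low (that is, their steady-state level is low). As by-products of random transcription errors, they are generally assumed to have no biological function.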
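By analyzing a large volume of RNA-seq data from TCGA (over 4,000 samples, occupying more than 50 TB of storage), we found that the nt-RNA of some genes were highly expressed, in some cases exceeding the expression of the normal transcripts of the same gene.
Our hypothesis is that some nt-RNA are produced specifically by a biological process rather than by chance. Based on this hypothesis, the aims of this study are: (1) to quantify nt-RNA expression (survey of the expression spectrum); (2) to explore the relationship between nt-RNA and protein expression (biological function); (3) to find nt-RNA that affect cancer prognosis (biological function); (4) to use nt-RNA as biomarkers for cancer diagnosis (application); (5) to identify nt-RNA that can distinguish cancers of unknown primary (application).
First, nt-RNA were defined from gene databases and compared with their corresponding normal transcripts to find the tags unique to each nt-RNA (including sequence segments and splice junctions). These nt-RNA-specific tags were then searched for and counted in all reads of the RNA-seq data. The search and counting yielded the number of reads for each nt-RNA-specific tag, and the reads containing these tags are the targets of NMD. The RNA-seq data used in this study comprised TCGA normal and tumor samples for 13 cancer types, as well as cancer cell line samples.
In the kidney cancer example, most differentially expressed nt-RNA (cancer vs. normal) were associated with differential expression of the corresponding normal transcripts. However, the nt-RNA produced by 900 genes were independent of high expression of the normal transcripts. We found that 12 genes related to the proteasome ubiquitination pathway highly expressed nt-RNA. This finding is interesting because VHL-HIF1A is an important oncogenic mechanism in KIRC, and normal HIF1A degradation requires the proteasome ubiquitination pathway. The proteasome ubiquitination pathway was significant in the gene enrichment analysis (p-value < 4.11E-05). The 12 genes are PSMB4, PSMD14, PSMC6, PSMD13, PSMB1, VCP, ANAPC5, PSMA4, PSMD3, ANAPC7, OS9, and GCLC.
Second, some nt-RNA can reduce translation of the normal transcripts. Using proteome data, we studied the relationship between nt-RNA-specific tags and the normal protein products by ANOVA comparison of linear models. We found that 422 nt-RNA-specific tags affected protein expression, which indicates that nt-RNA have a potential biological role. PTEN also produced nt-RNA in KIRC, and samples with higher PTEN nt-RNA expression contained less PTEN protein product (p-value of ANOVA comparison of linear models = 0.017). Survival analysis showed that PTEN nt-RNA affected survival, which indicates that PTEN nt-RNA can serve as a biomarker for cancer prognosis. Survival analysis was also performed for the other nt-RNA-specific tags that affected protein expression.
Finally, I examined two applications of nt-RNA: as diagnostic markers and as markers for defining the origin of cancer of unknown primary (CUP). Only those nt-RNA that were independent of the normal transcripts in terms of differential expression were used as biomarkers. By comparing cancer and normal samples, I examined which nt-RNA could serve as diagnostic markers. Unsupervised clustering and heat maps showed that these nt-RNA clearly separated cancer samples from normal samples. In studying the origin of CUP, cross-validation on the training set (N=541) and validation on an independent external set (N=2462) defined a set of nt-RNA tags that could identify most cancer samples (400 nt-RNA-specific sequence tags). Unsupervised clustering and heat maps showed the ability of these nt-RNA to define the origin of CUP. The significance of the classifier formed by the 400 nt-RNA-specific sequence tags was then examined by randomly sampling the training set 100 times. Across the 100 random samplings, the mean and standard deviation of the correct classification rate were 96.4895% and 0.75% for the training set, and 91.0239% and 1.032611% for the validation set.
In summary, this study showed that nt-RNA have important biological functions and a variety of applications. They are potential biomarkers for cancer diagnosis and prognosis. They can also be used to determine the primary site of a cancer, which means that nt-RNA could provide valuable information for cancer diagnosis and for determining the primary site of cancer of unknown primary. [With diagram]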
Hu, Fuyan.
Thesis Ph.D. Chinese University of Hong Kong 2015.
Includes bibliographical references (leaves 173-211).
Abstracts also in Chinese.
Title from PDF title page (viewed on 12, October, 2016).
Detailed summary in vernacular field only.
Fang, Hao-Yu, and 方浩宇. "Using high-throughput sequencing data to identify the transcriptional start sites of mouse microRNAs." Thesis, 2013. http://ndltd.ncl.edu.tw/handle/32453885381096374336.
National Chung Hsing University
Institute of Genomics and Bioinformatics
101
MicroRNAs (miRNAs) are small non-coding RNAs that inhibit the expression of protein-coding genes by hybridizing with messenger RNAs (mRNAs). MiRNAs are involved in diverse biological processes and in various diseases. Identifying miRNA transcription start sites (TSSs) is important for studying the upstream regulatory networks of miRNAs. To date, studies of miRNA TSS identification have all focused on human miRNAs. We are interested in other species; our aim in this study is to identify mouse miRNA TSSs, and the results should contribute to understanding the evolution of the upstream regulatory networks of miRNAs. We integrated two types of high-throughput sequencing data, transcription start site sequencing (TSSseq) and Cap Analysis of Gene Expression (CAGE), as evidence of miRNA TSSs. A Support Vector Machine (SVM) classifier was developed to identify mouse miRNA TSSs. In addition, we incorporated expressed sequence tags (ESTs) and sequence conservation information to provide further evidence for mouse miRNA TSSs.
Yao, Jianchao. "Integrative analysis of high-throughput biological data: shrinkage correlation coefficient and comparative expression analysis." Thesis, 2009. http://hdl.handle.net/2152/ETD-UT-2009-12-403.
"Integrative analysis of large intergenic non-coding RNA and circular RNA using high-throughput RNA sequencing data." 2015. http://repository.lib.cuhk.edu.hk/en/item/cuhk-1291760.
Thesis Ph.D. Chinese University of Hong Kong 2015.
Includes bibliographical references (leaves 154-170).
Abstracts also in Chinese.
Title from PDF title page (viewed on 10, November, 2016).
"Computational models for extracting structural signals from noisy high-throughput sequencing data: 通过计算模型来提取高通量测序数据中的分子结构信息." 2015. http://repository.lib.cuhk.edu.hk/en/item/cuhk-1291576.
Thesis Ph.D. Chinese University of Hong Kong 2015.
Includes bibliographical references (leaves 147-161).
Abstracts also in Chinese.
Title from PDF title page (viewed on 26, October, 2016).
Hu, Xihao.
Poirier-Morency, Guillaume. "Modélisation des réseaux de régulation de l’expression des gènes par les microARN." Thesis, 2020. http://hdl.handle.net/1866/25104.
MicroRNAs are small non-coding RNAs, approximately 22 nucleotides long, involved in the regulation of gene expression. They target complementary regions on the RNA transcripts that these genes encode and adjust their concentration according to the needs of the cell. Because microRNAs and their RNA targets bind each other with imperfect complementarity, the two groups actively compete to form regulatory interactions. Consequently, quantitatively predicting their equilibrium concentrations is a task that must take several factors into account, including hybridization affinity, the ability to catalyze degradation of the target, cooperation, and RNA accessibility. In the model we propose, miRBooking 2.0, each possible interaction between a microRNA and a binding site on a target RNA is characterized as an enzymatic reaction. A reaction of this type operates in two phases: the reversible formation of an enzyme-substrate complex (the microRNA-RNA duplex), and the irreversible conversion of the substrate into an RNA degradation product, which restores the enzyme so that it can take part in other reactions. We show that the stationary state of this system, which can include up to 10 million equations in practice, has a very sparse Jacobian, allowing it to be solved efficiently with a sparse linear solver. This solution allows us to characterize the mechanism of regulation precisely and to study the role of microRNAs in a given cellular context. Predictions obtained on a HeLa S3 cell model correlate significantly with a set of experimental data and explain the expression threshold effects of genes remarkably well. Using this solution as an initial condition and an explicit method of numerical integration, we simulate in real time the response of the system to changes in experimental conditions. 
We apply this model to target elements involved in the epithelial-mesenchymal transition (EMT), an important mechanism of tumour proliferation. By identifying elements that are differentially expressed between the two conditions, we design synthetic microRNAs to interfere with the transition. To do so, we propose a method based on a parallel greedy best-first search to explore the sequence space of the microRNA efficiently, and we present preliminary results on known EMT markers.
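The two-phase enzymatic reaction described in this abstract can be sketched for a single microRNA-target pair. The code below is an illustration under stated assumptions, not miRBooking 2.0: it uses mass-action kinetics for E + S ⇌ ES → E + P, integrates with explicit (forward) Euler steps, and omits transcription, competition between many microRNAs and targets, and the sparse steady-state solve. The function name and the rate constants `kf`, `kr`, and `kcat` are hypothetical.

```python
def simulate_duplex(e0, s0, kf, kr, kcat, dt=1e-3, steps=10_000):
    """Explicit Euler integration of E + S <-> ES -> E + P.

    E is the free microRNA, S the free target RNA, ES the microRNA-RNA
    duplex and P the degradation product. All rate constants are
    illustrative assumptions.
    """
    e, s, es, p = float(e0), float(s0), 0.0, 0.0
    for _ in range(steps):
        bind = kf * e * s    # reversible duplex formation
        unbind = kr * es     # duplex dissociation
        cleave = kcat * es   # irreversible step; releases the microRNA
        e += dt * (-bind + unbind + cleave)
        s += dt * (-bind + unbind)
        es += dt * (bind - unbind - cleave)
        p += dt * cleave
    return e, s, es, p


e, s, es, p = simulate_duplex(e0=1.0, s0=10.0, kf=1.0, kr=0.5, kcat=0.2)
print(f"free miRNA={e:.3f} free target={s:.3f} duplex={es:.3f} product={p:.3f}")
```

Two conservation laws give a quick sanity check on any such simulation: free plus bound microRNA stays equal to `e0`, and free target plus duplex plus product stays equal to `s0`.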