


Dissertations / Theses on the topic 'RNA-seq Data Analysis'

Author: Grafiati

Published: 4 June 2021

Last updated: 1 February 2022


Consult the top 43 dissertations / theses for your research on the topic 'RNA-seq Data Analysis.'


1

Wang, Qi. "Integrative Data Analysis of Microarray and RNA-seq." Diss., North Dakota State University, 2018. https://hdl.handle.net/10365/29968.


Abstract:

Background: Microarray and RNA sequencing (RNA-seq) have been two commonly used high-throughput technologies for gene expression profiling over the past decades. For global gene expression studies, both techniques are expensive, and each has its unique advantages and limitations. Integrative analysis of these two types of data would provide increased statistical power, reduced cost, and complementary technical advantages. However, the completely different mechanisms of the two high-throughput techniques make the two types of data highly incompatible. Methods: Based on their degree of compatibility, the genes are grouped into clusters using a novel clustering algorithm called Boundary Shift Partition (BSP). For each cluster, a linear model is fitted to the data and the number of differentially expressed genes (DEGs) is calculated by running a two-sample t-test on the residuals. The optimal number of clusters can be determined using a selection criterion that is penalized on the number of parameters for model fitting. The method was evaluated using data simulated from various distributions and was compared with the conventional K-means clustering method, Hartigan-Wong’s algorithm. The BSP algorithm was applied to microarray and RNA-seq data obtained from embryonic heart tissues of wild type mice and Tbx5 mice. The raw data went through multiple preprocessing steps, including data transformation, quantile normalization, linear modelling, principal component analysis and probe alignment. The differentially expressed genes between wild type and Tbx5 mice are identified using the BSP algorithm. Results: For the simulated data, the accuracies of the BSP algorithm are higher than those of Hartigan-Wong’s algorithm for the cases with smaller standard deviations across the five underlying distributions. The BSP algorithm can find the correct number of clusters using the selection criterion. 
The BSP method identifies 584 differentially expressed genes between the wild type and Tbx5 mice. A core gene network developed from the differentially expressed genes showed a set of key genes that were known to be important for heart development. Conclusion: The BSP algorithm is an efficient and robust classification method to integrate the data obtained from microarray and RNA-seq.


2

Stupnikov, Aleksei. "Statistical models for RNA-seq data analysis of cancer." Thesis, Queen's University Belfast, 2017. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.728670.


Abstract:

In our research we addressed several major points related to RNA-seq-based models for cancer. The first chapter reviews various genomics technologies from the pre-NGS era and the most commonly used NGS platforms, as well as recently developed methods. From here the main concepts of differential expression for SAGE technology and RNA-seq are considered, going on to discuss several of the most widely used methods in the field. In the third chapter we formulated the biological problem, that is, the reproducibility and robustness of RNA-seq differential expression analysis, and made some general observations on the count distributions of cancer-related RNA-seq data as well as the impact of sequencing depth alterations on the data. In chapter five we employed this robustness approach to rank the performance of existing differential gene expression (DGE) models and studied the effects of subsampling, in terms of library size and number of samples, on the outcome of a DGE analysis. In addition, in this chapter we introduced samExploreR - an R package that allows one to implement the sequencing depth altering simulations quickly and efficiently. Building on this work we applied the concept of subsampling to Quadratic - a candidate compound discovery framework based on connectivity mapping - and explored its robustness and reproducibility for various datasets. Finally, in chapter seven we explored how integrating information from different RNA-seq based approaches may affect the resulting outcome of the analysis and studied the robustness of those methods. The approaches adopted in this body of work allowed us to introduce the procedure of subsampling as a quality control measure that can allow an inference of quality when applied to datasets in research and clinical procedures.
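The sequencing depth alteration described in this abstract can be illustrated with a minimal sketch. This is not samExploreR (which is an R package) and does not reproduce its interface; it is a hypothetical Python illustration that assumes each read is kept independently with probability `fraction`, so that every gene's count is thinned binomially. The function name `subsample_counts` and the example counts are invented for illustration.

```python
import random


def subsample_counts(counts, fraction, seed=None):
    """Thin per-gene read counts to simulate a lower sequencing depth.

    Each read is kept independently with probability `fraction`,
    so every gene's new count is a binomial draw from its original count.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)
    return {
        gene: sum(1 for _ in range(n) if rng.random() < fraction)
        for gene, n in counts.items()
    }


# Hypothetical counts for three genes (not taken from the thesis).
counts = {"geneA": 1000, "geneB": 50, "geneC": 0}
half_depth = subsample_counts(counts, fraction=0.5, seed=1)
```

Repeating the draw with different seeds and re-running a differential expression analysis on each thinned dataset is one way to probe how stable the results are at reduced depth.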


3

Huang, Yuanhua. "Structured Bayesian methods for splicing analysis in RNA-seq data." Thesis, University of Edinburgh, 2018. http://hdl.handle.net/1842/31328.


Abstract:

In most eukaryotes, alternative splicing is an important regulatory mechanism of gene expression that results in a single gene coding for multiple protein isoforms, thus greatly increasing the diversity of the proteome. RNA-seq is widely used for genome-wide splicing isoform quantification, and several effective and powerful methods have been developed for splicing analysis with RNA-seq data. However, quantification remains problematic for genes with low coverage or a large number of isoforms. These difficulties may in principle be ameliorated by exploiting correlations encoded in structured data sources. This thesis contributes to the development of Bayesian methods for splicing analysis by leveraging additional information in multiple datasets with structured prior distributions. First, we developed DICEseq, the first isoform quantification method tailored to time-series RNA-seq experiments. DICEseq explicitly models the correlations between experiments at different time points to aid the quantification of isoforms across experiments. Numerical experiments on both simulated and real datasets show that DICEseq yields more accurate results than state-of-the-art methods, an advantage that can become considerable at low coverage levels. Furthermore, DICEseq makes it possible to quantify the trade-off between temporal sampling of RNA and depth of sequencing, frequently an important choice when planning experiments. Second, we developed BRIE (Bayesian Regression for Isoform Estimation), a Bayesian hierarchical model which resolves the difficulties of splicing analysis in single-cell RNA-seq (scRNA-seq) data by learning an informative prior distribution from sequence features. This method combines quantification and imputation for splicing analysis in a Bayesian manner, which is particularly useful for scRNA-seq data because of its extremely low coverage and high technical noise. 
We validated BRIE on several scRNA-seq data sets, showing that BRIE yields reproducible estimates of exon inclusion ratios in single cells. Third, we provided an effective tool that uses the Bayes factor to sensitively detect differential splicing between different single cells. When applying BRIE to a few real datasets, we found interesting heterogeneity patterns in splicing events across cell populations, for example alternative exons in DNMT3B. In summary, this thesis proposes structured Bayesian methods to integrate multiple datasets to improve splicing analysis and study its biological functions.


4

Wang, Xiao. "Computational Modeling for Differential Analysis of RNA-seq and Methylation data." Diss., Virginia Tech, 2016. http://hdl.handle.net/10919/72271.


Abstract:

Computational systems biology is an inter-disciplinary field that aims to develop computational approaches for a system-level understanding of biological systems. Advances in high-throughput biotechnology offer broad scope and high resolution in multiple disciplines. However, it is still a major challenge to extract biologically meaningful information from the overwhelming amount of data generated from biological systems. Effective computational approaches are urgently needed to reveal the functional components. Thus, in this dissertation work, we aim to develop computational approaches for differential analysis of RNA-seq and methylation data to detect aberrant events associated with cancers. We develop a novel Bayesian approach, BayesIso, to identify differentially expressed isoforms from RNA-seq data. BayesIso features a joint model of the variability of RNA-seq data and the differential state of isoforms. BayesIso not only accounts for the variability of RNA-seq data but also incorporates the differential states of isoforms as hidden variables for differential analysis. The differential states of isoforms are estimated jointly with other model parameters through a sampling process, providing improved performance in detecting isoforms that are less differentially expressed. We also develop a novel probabilistic approach, DM-BLD, in a Bayesian framework to identify differentially methylated genes. The DM-BLD approach features a hierarchical model, built upon Markov random field models, to capture both the local dependency of measured loci and the dependency of methylation change. A Gibbs sampling procedure is designed to estimate the posterior distribution of the methylation change of CpG sites. Then, the differential methylation score of a gene is calculated from the estimated methylation changes of the involved CpG sites, and the significance of genes is assessed by permutation-based statistical tests. 
We have demonstrated the advantage of the proposed Bayesian approaches over conventional methods for differential analysis of RNA-seq data and methylation data. The joint estimation of the posterior distributions of the variables and model parameters using a sampling procedure has demonstrated an advantage in detecting isoforms or methylated genes that are less differential. The applications to breast cancer data shed light on the molecular mechanisms underlying breast cancer recurrence, with the aim of identifying new molecular targets for breast cancer treatment.
Ph. D.


5

Hu, Yin. "A NOVEL COMPUTATIONAL FRAMEWORK FOR TRANSCRIPTOME ANALYSIS WITH RNA-SEQ DATA." UKnowledge, 2013. http://uknowledge.uky.edu/cs_etds/17.


Abstract:

The advance of high-throughput sequencing technologies and their application to mRNA transcriptome sequencing (RNA-seq) have enabled comprehensive and unbiased profiling of the landscape of transcription in a cell. In order to address the current limitations in accuracy and scalability of transcriptome analysis, a novel computational framework has been developed for large-scale RNA-seq datasets with no dependence on transcript annotations. Directly from raw reads, a probabilistic approach is first applied to infer the best transcript fragment alignments from paired-end reads. Empowered by the identification of alternative splicing modules, this framework then performs precise and efficient differential analysis at automatically detected alternative splicing variants, which circumvents the need for full transcript reconstruction and quantification. Beyond the scope of classical group-wise analysis, a clustering scheme is further described for mining prominent consistency among samples in transcription, breaking the restriction of presumed grouping. The performance of the framework has been demonstrated by a series of simulation studies and real datasets, including The Cancer Genome Atlas (TCGA) breast cancer analysis. The successful applications suggest an unprecedented opportunity to use differential transcription analysis to reveal variations in the mRNA transcriptome in response to cellular differentiation or the effects of diseases.


6

Turro, Ernest. "Statistical methods for gene expression analysis using microarray and RNA-Seq data." Thesis, Imperial College London, 2011. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.534964.


7

Abuelqumsan, Mustafa. "Assessment of supervised classification methods for the analysis of RNA-seq data." Thesis, Aix-Marseille, 2018. http://www.theses.fr/2018AIXM0582/document.


Abstract:

Over the past decade, “Next Generation Sequencing” (NGS) technologies have made it possible to characterize genomic sequences at an unprecedented pace. Many studies have focused on human genetic diversity and on the transcriptome (the part of the genome transcribed into ribonucleic acid). Indeed, different tissues of our body express different genes at different moments, enabling cell differentiation and functional response to environmental changes. Since many diseases affect gene expression, transcriptome profiles can be used for medical purposes (diagnosis and prognosis). A wide variety of advanced statistical and machine learning methods have been proposed to address the general problem of classifying individuals according to multiple variables (e.g. the transcription level of thousands of genes in hundreds of samples). During my thesis, I led a comparative assessment of machine learning methods and their parameters, to optimize the accuracy of sample classification based on RNA-seq transcriptome profiles.


8

Wartmann, Hannes [Verfasser]. "Bias Invariant RNA-Seq Data Annotation and Liver Diseases Microbiome Analysis / Hannes Wartmann." Hamburg : Staats- und Universitätsbibliothek Hamburg Carl von Ossietzky, 2021. http://d-nb.info/1235244083/34.


9

Johnson, Kristen. "Software for Estimation of Human Transcriptome Isoform Expression Using RNA-Seq Data." ScholarWorks@UNO, 2012. http://scholarworks.uno.edu/td/1448.


Abstract:

The goal of this thesis research was to develop software for transcriptome quantification from RNA-Seq data that is capable of handling multireads and quantifying isoforms on a more global level. Current software available for these purposes uses various forms of parameter alteration in order to work with multireads. Many tools also still analyze isoforms per gene or per researcher-determined cluster. As a result, the effects of multireads are diminished or possibly misrepresented. To address this issue, two programs, GWIE and ChromIE, were developed based on a simple iterative EM-like algorithm with no parameter manipulation. These programs are used to produce accurate isoform expression levels.
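To make the idea of an iterative EM-like treatment of multireads concrete, here is a minimal sketch. It is not the GWIE or ChromIE implementation, whose details are not given in the abstract; it is a generic textbook-style EM that assumes each read comes with the set of isoforms it is compatible with and, for brevity, ignores isoform length. The function name `em_isoform_abundance` and the toy reads are hypothetical.

```python
def em_isoform_abundance(reads, isoforms, n_iter=200):
    """Estimate relative isoform abundances from reads by EM.

    `reads` is a list of non-empty sets; each set holds the isoforms a
    read is compatible with (a multiread has more than one). Isoform
    length is ignored to keep the sketch short.
    """
    theta = {iso: 1.0 / len(isoforms) for iso in isoforms}
    for _ in range(n_iter):
        expected = {iso: 0.0 for iso in isoforms}
        # E-step: split each read across its compatible isoforms
        # in proportion to the current abundance estimates.
        for compatible in reads:
            total = sum(theta[iso] for iso in compatible)
            for iso in compatible:
                expected[iso] += theta[iso] / total
        # M-step: re-normalise the expected read counts.
        theta = {iso: expected[iso] / len(reads) for iso in isoforms}
    return theta


# Toy data: three reads unique to A, one unique to B, two multireads.
reads = [{"A"}] * 3 + [{"B"}] + [{"A", "B"}] * 2
theta = em_isoform_abundance(reads, ["A", "B"])
```

In this toy case the multireads end up assigned mostly to isoform A, because the uniquely mapped reads already favour it; the estimates settle at 0.75 for A and 0.25 for B.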


10

Graf, Alexander. "Analysis of genome activation in early bovine embryos by bioinformatic evaluation of RNA-Seq data." Diss., Ludwig-Maximilians-Universität München, 2015. http://nbn-resolving.de/urn:nbn:de:bvb:19-179386.


Abstract:

During maternal-to-embryonic transition, control of embryonic development gradually switches from maternal RNAs and proteins stored in the oocyte to gene products generated after embryonic genome activation. Detailed insight into the onset of embryonic transcription is obscured by the presence of maternal transcripts, and to date there is no systematic study addressing the activation of specific genes during several stages of early bovine embryo development. Using the bovine model system, comparative analyses of RNA-seq data sets were performed. The sequencing libraries had been constructed starting with germinal vesicle (GV) and metaphase II (MII) oocytes and embryos at the four-cell, eight-cell, 16-cell and blastocyst stages. The embryos had been generated in vitro by fertilization of Bos taurus taurus oocytes with sperm of a Bos taurus indicus sire. In total, approximately 13,000 RNA species could be identified in oocytes and in each embryonic stage. The number of identified differentially abundant transcripts increased in the course of development from roughly 100 to several thousand, with a sharp rise at the eight-cell stage. A bioinformatic approach was developed to capture maternally delivered and de novo synthesized RNA species separately. It sensitively identified actively transcribed genes even though comparative analyses failed because of the huge amount of RNA provided by the oocyte. Actively transcribed RNA species could be identified for approximately 8,000 genes, the majority of them at the eight-cell stage. This finding indicated that the majority of all RNA species provided by oocytes were de novo transcribed during early embryonic development. Furthermore, it could be shown that the de novo transcription of larger genes was initiated later in embryonic development than that of smaller ones. A procedure was established to identify Bos t. indicus-specific SNPs in RNA-Seq datasets, which identified more than 60,000 SNPs occurring in 20% of all annotated genes. A major part of these SNPs could be detected at the eight-cell stage. This procedure provides a way to capture and study allele-specific transcription during early embryonic development. The described bioinformatic approaches were used to study major genome activation, an important step in the maternal-to-embryonic transition. More than 4,000 genes were de novo transcribed during major genome activation, which was found to occur at the eight-cell stage. These genes were functionally related to transcription, translation and their regulation. In summary, this thesis created and applied a powerful tool set for bioinformatic dissection of processes occurring during development of early bovine embryos and provided unprecedented insights into major genome activation.


11

Wan, Mohamad Nazarie Wan Fahmi Bin. "Network-based visualisation and analysis of next-generation sequencing (NGS) data." Thesis, University of Edinburgh, 2017. http://hdl.handle.net/1842/28923.


Abstract:

Next-generation sequencing (NGS) technologies have revolutionised research into the nature and diversity of genomes and transcriptomes. Since the initial description of these technology platforms over a decade ago, massively parallel RNA sequencing (RNA-seq) has driven many advances in the characterization and quantification of transcriptomes. RNA-seq is a powerful gene expression profiling technology that enables transcript discovery and provides a far more precise measure of the levels of transcripts and their isoforms than other methods, e.g. microarrays. However, the analysis of RNA-seq data remains a significant challenge for many biologists. The data generated is large and the tools for its assembly, analysis and visualisation are still under development. Assemblies of reads can be inspected using tools such as the Integrative Genomics Viewer (IGV), where visualisation of results involves ‘stacking’ the reads onto a reference genome. Whilst sufficient for many needs, when the underlying variance of the genome or transcript assemblies is complex, this visualisation method can be limiting; errors in assembly can be difficult to spot and visualisation of splicing events may be challenging. Data visualisation is increasingly recognised as an essential component of genomic and transcriptomic data analysis, enabling large and complex datasets to be better understood. An approach that has been gaining traction in biological research is based on the application of network visualisation and analysis methods. Networks consist of nodes connected by edges (lines), where nodes usually represent entities and edges the relationships between them. These are now widely used for plotting experimentally or computationally derived relationships between genes and proteins. The overall aim of this PhD project was to explore the use of network-based visualisation in the analysis and interpretation of RNA-seq data. 
In chapter 2, I describe the development of a data pipeline that has been designed to go from ‘raw’ RNA-seq data to a file format which supports data visualisation as a ‘DNA assembly graph’. In DNA assembly graphs, nodes represent sequence reads and edges denote a homology between reads above a defined threshold. Following the mapping of reads to a reference sequence and defining which reads map to a given locus, pairwise sequence alignments are performed between reads using MegaBLAST. This provides a weighted similarity score that is used to define edges between reads. Visualisation of the resulting networks is then carried out using BioLayout Express3D, which can render large networks in 3-D, thereby allowing a better appreciation of the often-complex network structure. This pipeline has formed the basis for my subsequent work on exploring and analysing alternative splicing in human RNA-seq data. In the second half of this chapter, I provide a series of tutorials aimed at different types of users, allowing them to perform such analyses. The first tutorial is aimed at computational novices who might want to generate networks using a web browser and pre-prepared data. Other tutorials are designed for use by more advanced users who can access the code for the pipeline through GitHub or via an Amazon Machine Image (AMI). In chapter 3, the utility of network-based visualisations of RNA-seq data is explored using data processed through the pipeline described in chapter 2. The aim of the work described in this chapter was to better understand the basic principles and challenges associated with network visualisation of RNA-seq data, in particular how it could be used to visualise transcript structure and splice variation. These analyses were performed on data generated from four samples of human fibroblasts taken at different time points during their entry into cell division. 
One of the first challenges encountered was the fact that the existing network layout algorithm (Fruchterman-Reingold) implemented within BioLayout Express3D did not result in an optimal layout of the unusual graph structures produced by these analyses. Following the implementation of the more advanced layout algorithm FMMM within the tool, network structure could be far better appreciated. Using this layout method, the majority of genes sequenced to an adequate depth assemble into networks with a linear ‘corkscrew’ appearance and, when representing single-isoform transcripts, add little to existing views of these data. However, in a small number of cases (~5%), the networks generated from transcripts expressed in human fibroblasts possess more complex structures, with ‘loops’, ‘knots’ and multiple ends being observed. In a majority of the cases examined, these loops were associated with alternative splicing events, a fact confirmed by RT-PCR analyses. Other DNA assembly networks representing the mRNAs for genes such as MKI67 showed knot-like structures, which were found to be due to the presence of repetitive sequence within an exon of the gene. In another case, CENPO, the unusual structure observed was due to reads derived from an overlapping gene, ADCY3, present on the opposite strand, with reads being wrongly mapped to CENPO. Finally, I explored the use of a network reduction strategy as an approach to visualising highly expressed genes such as GAPDH and TUBA1C. Having successfully demonstrated the utility of networks in analysing transcript isoforms in data derived from a single cell type, I set out to explore their utility in analysing transcript variation in tissue data, where multiple isoforms expressed by different cells within the tissue might be present in a given sample. In chapter 4, I explore the analysis of transcript variation in an RNA-seq dataset derived from human tissue. 
The first half of this chapter describes the quality control of these data, again using a network-based approach but this time based on the correlation in expression between genes and samples. Of the 95 samples derived from 27 human tissues, 77 passed the quality control. A network was constructed using a correlation threshold of r ≥ 0.9, which comprised 6,109 nodes (genes) and 1,091,477 edges (correlations), and was then clustered. Subsequently, the profile and gene content of each cluster were examined and the enrichment of GO terms analysed. In the second half of this chapter, the aim was to detect and analyse alternative splicing events between different tissues using the rMATS tool. Using a false-discovery rate (FDR) cut-off of < 0.01, I found that in comparisons of brain vs. heart, brain vs. liver and heart vs. liver, the program reported 4,992, 4,804 and 3,990 splicing events, respectively. Of these events, only 78 splicing events (52 genes) had an exon inclusion level of more than 50% and an expression level of more than 30 FPKM. To further explore the sometimes-complex structure of transcript diversity derived from tissue, RNA-seq assembly networks for KLC1, SORBS2, GUK1, and TPM1 were explored. Each of these networks showed different types of alternative splicing events, and it was sometimes difficult to determine the isoforms expressed between tissues using other approaches. For instance, there is an issue in visualising the read assembly of long genes such as KLC1 and SORBS2 using Sashimi plots or even Vials, simply because of the number of exons and the size of their genomic loci. In the case of GUK1, tissue-specific isoform expression was observed when a network of three tissues was combined. Arguably the most complex analysis is the network of TPM1, where the uniquification step was employed for this highly expressed gene. 
In chapter 5, I perform usability testing of the NGS Graph Generator web application and of visualising RNA-seq assemblies as networks using BioLayout Express3D. This test was important to ensure that the application is well received and utilised by users. Almost all participants in this usability test agreed that the application would encourage biologists to visualise and understand alternative splicing alongside existing tools. The participants agreed that Sashimi plots are rather difficult to view and interpret and may miss some interesting features. However, the reviews also identified areas needing improvement, such as the capability to analyse big networks in a short time and side-by-side analysis of a network with Sashimi plots and Ensembl. Additional information about the network would also be necessary to improve the understanding of alternative splicing. In conclusion, this work demonstrates the utility of network visualisation of RNA-seq data, where the unusual structure of these networks can be used to identify issues in assembly, repetitive sequences within transcripts and splice variation. As such, this approach has the potential to significantly improve our understanding of transcript complexity. Overall, this thesis demonstrates that network-based visualisation provides a new and complementary approach to characterising alternative splicing from RNA-seq data and has the potential to be useful for the analysis and interpretation of other kinds of sequencing data.
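The correlation network used for quality control in this abstract (edges between genes whose expression profiles correlate at r ≥ 0.9) can be sketched as follows. This is a generic illustration, not the BioLayout Express3D implementation; the function names and the three toy expression profiles are hypothetical, and the sketch assumes no profile is constant (a constant profile would make the correlation undefined).

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_network(profiles, threshold=0.9):
    """Return edges (gene1, gene2, r) for gene pairs with r >= threshold."""
    edges = []
    for g1, g2 in combinations(sorted(profiles), 2):
        r = pearson(profiles[g1], profiles[g2])
        if r >= threshold:
            edges.append((g1, g2, r))
    return edges


# Hypothetical expression profiles across four samples.
profiles = {
    "gene1": [1.0, 2.0, 3.0, 4.0],
    "gene2": [2.1, 3.9, 6.2, 8.0],
    "gene3": [4.0, 1.0, 3.0, 2.0],
}
edges = correlation_network(profiles)
```

Only the gene1/gene2 pair passes the threshold here. The all-pairs loop is quadratic in the number of genes, so a network of the size reported in the abstract would need a vectorised or otherwise optimised implementation.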


12

Sundaramurthy, Gopinath. "A Probabilistic Approach for Automated Discovery of Biomarkers using Expression Data from Microarray or RNA-Seq Datasets." University of Cincinnati / OhioLINK, 2016. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1459528594.


13

Innocenti, Nicolas. "Data Analysis and Next Generation Sequencing : Applications in Microbiology." Doctoral thesis, KTH, Beräkningsbiologi, CB, 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-173219.


Abstract:

Next Generation Sequencing (NGS) is a new technology that has revolutionized the way we study living organisms. Where previously only a few genes could be studied at a time through targeted direct probing, NGS offers the possibility to perform measurements for a whole genome at once. The drawback is that the amount of data generated in the process is large and extracting useful information from it requires new methods to process and analyze it. The main contribution of this thesis is the development of a novel experimental method coined tagRNA-seq, combining 5’tagRACE, a previously developed technique, with RNA-sequencing technology. Briefly, tagRNA-seq makes it possible to identify the 5’ ends of RNAs in bacteria and directly probe for their type, primary or processed, by ligating short RNA sequences, the tags, to the beginnings of RNA molecules. We used the method to directly probe for transcription start and processing sites in two bacterial species, Escherichia coli and Enterococcus faecalis. It was also used to study polyadenylation in E. coli, where the ability to identify processed RNA molecules proved useful for separating direct and indirect regulatory effects of this mechanism. We also demonstrate how data from tagRNA-seq experiments can be used to increase confidence in the discovery of anti-sense transcripts in bacteria. Analyses of RNA-seq data obtained in the context of these experiments revealed subtle artifacts in the coverage signal towards gene ends, which we were able to explain and quantify based on Kolmogorov’s broken stick model. We also discovered evidence for circularization of a few RNA transcripts, both in our own data sets and in publicly available data. Designing the tags used in tagRNA-seq led us to the problem of words absent from a text. We focus on a particular subset of these, the minimal absent words (MAWs), and develop a theory providing a complete description of their size distribution in random text. 
We also show that MAWs in genomes from viruses and living organisms almost always exhibit a behavior different from random texts in the tail of the distribution, and that MAWs from this tail are closely related to sequences present in the genome that preferentially appear in regions with important regulatory functions. Finally, and independently of tagRNA-seq, we propose a new approach to the problem of bacterial community reconstruction in metagenomics, based on techniques from compressed sensing. We provide a novel algorithm competing with state-of-the-art techniques in the field.
QC 20150930
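For readers unfamiliar with minimal absent words, a small brute-force sketch may help. It uses the standard definition (a word a+u+b is a MAW of a text if it does not occur in the text while both a+u and u+b do) and is not the method developed in the thesis. The function name is hypothetical, only words of length 2 or more are reported, and the approach is practical for short strings only.

```python
def minimal_absent_words(text, alphabet):
    """Return the minimal absent words of `text` with length >= 2.

    A word a+u+b is a minimal absent word if it does not occur in
    `text` while both a+u and u+b do. Brute force: suitable for
    short strings only.
    """
    n = len(text)
    # All substrings of the text, including the empty word.
    present = {text[i:j] for i in range(n + 1) for j in range(i, n + 1)}
    maws = set()
    for u in present:
        for a in alphabet:
            if a + u not in present:
                continue
            for b in alphabet:
                if u + b in present and a + u + b not in present:
                    maws.add(a + u + b)
    return maws


# Toy example over a two-letter alphabet.
maws = minimal_absent_words("ABA", "AB")
```

For the text "ABA" the result is the set {"AA", "BB", "BAB"}: each of these words is missing from the text although every shorter piece of it is present.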

14

Copeland, Nancy Giang. "Computational analysis of high-replicate RNA-seq data in Saccharomyces cerevisiae : searching for new genomic features." Thesis, University of Dundee, 2018. https://discovery.dundee.ac.uk/en/studentTheses/af2f83a4-3028-4925-9c99-81bd683067b0.

Abstract:

In this study, RNA-seq and proteomics, two orthogonal high-throughput technologies, were used to search the Saccharomyces cerevisiae genome for new genomic features. RNA-seq data were aligned to the genome with three successively more stringent sets of parameters for the STAR aligner (Dobin et al., 2013). The varying levels of stringency elucidated some complexities in the RNA-seq data, such as the presence of read alignments that mapped to multiple genomic locations. The RNA-seq alignments indicated the presence of RNA transcripts derived from regions of the genome without annotations (un-annotated regions) in the Saccharomyces Genome Database (SGD). To ensure that all of the high-quality curated annotations within SGD were accounted for appropriately, these datasets were categorised as either Primary or Secondary Annotations. Annotations of genomic regions where the primary sequence produced a molecule (e.g. snoRNA or peptide) were designated as Primary. Annotations of regions where other types of activity were present (e.g. histone binding sites, double-strand break hotspots) were classified as Secondary. Only the Primary Annotations were used as boundaries for determining locations of un-annotated regions. Open reading frames (ORFs) were present in these un-annotated regions. Therefore, the regions were translated in six frames to build a database of all theoretical peptides. Proteomics tandem mass spectra were then searched against this peptide database to find the presence of any expressed ORFs within the un-annotated regions. Two preliminary target ORFs have been found to contain RNA-seq alignments and were detected by the proteomics analysis, evidence that their transcripts may have been present in the original sample. The next step would be to verify these two preliminary target regions in the experimental laboratory to determine if they are in fact expressed as peptides, and if so, what possible functions the peptides may have. 
Throughout this study, the Un-Annotated Region Pipeline (UAR-Pipeline) software was constructed to facilitate the analysis of un-annotated regions given a genome sequence, a set of genomic annotations, and RNA-seq data. In addition, a Quickload Site within the Integrated Genome Browser (Nicol et al., 2009) was created to store and effectively visualise un-annotated regions against RNA-seq alignments, annotations, and other tracks of information such as conservation. The vast majority of annotations contained within the Quickload Site are also hosted by SGD; therefore, the Site would serve as a new resource for the research community through anticipated public access.
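
The six-frame translation step mentioned in this abstract can be illustrated with a minimal sketch. It assumes the standard genetic code and an input of plain A/C/G/T characters. The function names and the `+1`..`-3` frame labels are hypothetical and do not come from the UAR-Pipeline, whose actual implementation is not described here.

```python
# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq):
    """Translate a region in three forward and three reverse frames."""
    seq = seq.upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames
```

In this sketch a peptide database would be built by splitting each frame at the stop symbol `*` and keeping fragments above some length threshold. That threshold is a design choice the abstract does not specify.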

15

Glaus, Peter. "Bayesian methods for gene expression analysis from high-throughput sequencing data." Thesis, University of Manchester, 2014. https://www.research.manchester.ac.uk/portal/en/theses/bayesian-methods-for-gene-expression-analysis-from-highthroughput-sequencing-data(cf9680e0-a3f2-4090-8535-a39f3ef50cc4).html.

Abstract:

We study the tasks of transcript expression quantification and differential expression analysis based on data from high-throughput sequencing of the transcriptome (RNA-seq). In an RNA-seq experiment subsequences of nucleotides are sampled from a transcriptome specimen, producing millions of short reads. The reads can be mapped to a reference to determine the set of transcripts from which they were sequenced. We can measure the expression of transcripts in the specimen by determining the amount of reads that were sequenced from individual transcripts. In this thesis we propose a new probabilistic method for inferring the expression of transcripts from RNA-seq data. We use a generative model of the data that can account for read errors, fragment length distribution and non-uniform distribution of reads along transcripts. We apply the Bayesian inference approach, using the Gibbs sampling algorithm to sample from the posterior distribution of transcript expression. Producing the full distribution enables assessment of the uncertainty of the estimated expression levels. We also investigate the use of alternative inference techniques for the transcript expression quantification. We apply a collapsed Variational Bayes algorithm which can provide accurate estimates of mean expression faster than the Gibbs sampling algorithm. Building on the results from transcript expression quantification, we present a new method for the differential expression analysis. Our approach utilizes the full posterior distribution of expression from multiple replicates in order to detect significant changes in abundance between different conditions. The method can be applied to differential expression analysis of both genes and transcripts. We use the newly proposed methods to analyse real RNA-seq data and provide evaluation of their accuracy using synthetic datasets. 
We demonstrate the advantages of our approach in comparisons with existing alternative approaches for expression quantification and differential expression analysis. The methods are implemented in the BitSeq package, which is freely distributed under an open-source license. Our methods can be accessed and used by other researchers for RNA-seq data analysis.
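
To make the idea of Gibbs sampling for transcript expression more tangible, here is a deliberately simplified sketch. It assumes a plain mixture model in which each read has a known likelihood under every transcript it maps to, with a symmetric Dirichlet prior on the relative expression. The function name, the parameters and the model itself are illustrative assumptions. The generative model described in the abstract additionally accounts for read errors, fragment lengths and non-uniform read distribution, none of which appear here.

```python
import random


def gibbs_expression(read_likelihoods, n_transcripts, n_iter=2000,
                     burn_in=500, alpha=1.0, seed=0):
    """Posterior mean of relative transcript expression via Gibbs sampling.

    read_likelihoods: one dict per read, mapping transcript index to
    P(read | transcript) for the transcripts the read aligns to.
    """
    rng = random.Random(seed)
    theta = [1.0 / n_transcripts] * n_transcripts
    totals = [0.0] * n_transcripts
    kept = 0
    for iteration in range(n_iter):
        # Step 1: sample a transcript assignment for each read given theta.
        counts = [0] * n_transcripts
        for lik in read_likelihoods:
            transcripts = list(lik)
            weights = [theta[t] * lik[t] for t in transcripts]
            assigned = rng.choices(transcripts, weights=weights)[0]
            counts[assigned] += 1
        # Step 2: sample theta from Dirichlet(alpha + counts), drawn as
        # independent gamma variates that are then normalized.
        gammas = [rng.gammavariate(alpha + c, 1.0) for c in counts]
        total = sum(gammas)
        theta = [g / total for g in gammas]
        # Accumulate post-burn-in samples for the posterior mean.
        if iteration >= burn_in:
            totals = [s + t for s, t in zip(totals, theta)]
            kept += 1
    return [s / kept for s in totals]
```

Keeping the individual post-burn-in samples rather than only their mean would give access to the full posterior distribution, which is what allows the uncertainty of the expression estimates to be assessed.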

16

Reddy, Veena K. "Analysis of single cell RNA seq data to identify markers for subtyping of non-small cell lung cancer." Thesis, Högskolan i Skövde, Institutionen för biovetenskap, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:his:diva-18514.

Abstract:

Single cell RNA technology is a recent technical advancement used to understand cancer tumorigenicity at single cell resolution. In this study we have analyzed the scRNA data from the non-small cell lung cancer (NSCLC) dataset to facilitate the early identification of NSCLC subtypes, namely squamous cell carcinoma (SCC) and adenocarcinoma (AC). Non-immune cells have a major role in the tumorigenesis of malignant tumors in early stages. Therefore, we have analyzed the major non-immune cells, namely endothelial cells and fibroblast cells, from the GSE127465 dataset using the Seurat pipeline. Dimensionality reduction analysis and cluster analysis indicate that AC and SCC subtypes of NSCLC have different fibroblast compositions. Differential gene expression analysis indicates that AC tumours have shown elevated content of MGP/PTGDS and INMT/MFAP4 fibroblast cells, whereas squamous cell carcinoma showed an elevated content of COL6A1/COL6A2 and FNDC1/COL12A1 fibroblast cells. The statistical analysis shows that the clustering is statistically significant and not an artefact. Given that the tumour microenvironment is highly dynamic, in this study we have attempted to understand the tumour microenvironment by scRNA analysis of non-immune cells at single cell resolution.

17

Liu, Xinan. "NOVEL COMPUTATIONAL METHODS FOR SEQUENCING DATA ANALYSIS: MAPPING, QUERY, AND CLASSIFICATION." UKnowledge, 2018. https://uknowledge.uky.edu/cs_etds/63.

Abstract:

Over the past decade, the evolution of next-generation sequencing technology has considerably advanced genomics research. As a consequence, fast and accurate computational methods are needed for analyzing large datasets in different applications. The research presented in this dissertation focuses on three areas: RNA-seq read mapping, large-scale data query, and metagenomics sequence classification. A critical step of RNA-seq data analysis is to map the RNA-seq reads onto a reference genome. This dissertation presents a novel splice alignment tool, MapSplice3. It achieves high read alignment and base mapping yields and is able to detect splice junctions, gene fusions, and circular RNAs comprehensively at the same time. Based on MapSplice3, we further extend a novel lightweight approach called iMapSplice that enables personalized mRNA transcriptional profiling. As a huge amount of RNA-seq data has been shared through public datasets, it provides an invaluable resource for researchers to test hypotheses by reusing existing datasets. To meet the need for efficiently querying large-scale sequencing data, a novel method, called SeqOthello, has been developed. It is able to efficiently query sequence k-mers against large-scale datasets and ultimately determine the existence of the given sequence. Metagenomics studies often generate tens of millions of reads to capture the presence of microbial organisms. Thus, efficient and accurate algorithms are in high demand. In this dissertation, we introduce MetaOthello, a probabilistic hashing classifier for metagenomic sequences. It supports efficient query of a taxon using its k-mer signatures.

18

Shen, Shihao. "Statistical methods for deep sequencing data." Diss., University of Iowa, 2012. https://ir.uiowa.edu/etd/5059.

Abstract:

Ultra-deep RNA sequencing has become a powerful approach for genome-wide analysis of pre-mRNA alternative splicing. We develop MATS (Multivariate Analysis of Transcript Splicing), a Bayesian statistical framework for flexible hypothesis testing of differential alternative splicing patterns on RNA-Seq data. MATS uses a multivariate uniform prior to model the between-sample correlation in exon splicing patterns, and a Markov chain Monte Carlo (MCMC) method coupled with a simulation-based adaptive sampling procedure to calculate the P value and false discovery rate (FDR) of differential alternative splicing. Importantly, the MATS approach is applicable to almost any type of null hypotheses of interest, providing the flexibility to identify differential alternative splicing events that match a given user-defined pattern. We evaluated the performance of MATS using simulated and real RNA-Seq data sets. In the RNA-Seq analysis of alternative splicing events regulated by the epithelial-specific splicing factor ESRP1, we obtained a high RT-PCR validation rate of 86% for differential alternative splicing events with a MATS FDR of < 10%. Additionally, over the full list of RT-PCR tested exons, the MATS FDR estimates matched well with the experimental validation rate. Our results demonstrate that MATS is an effective and flexible approach for detecting differential alternative splicing from RNA-Seq data.

19

Graf, Alexander [Verfasser], and Eckhard [Akademischer Betreuer] Wolf. "Analysis of genome activation in early bovine embryos by bioinformatic evaluation of RNA-Seq data / Alexander Graf. Betreuer: Eckhard Wolf." München : Universitätsbibliothek der Ludwig-Maximilians-Universität, 2015. http://d-nb.info/1068460636/34.

20

Kahles, André [Verfasser], and Gunnar [Akademischer Betreuer] Rätsch. "Novel Methods for the Computational Analysis of RNA-Seq Data with Applications to Alternative Splicing / André Kahles ; Betreuer: Gunnar Rätsch." Tübingen : Universitätsbibliothek Tübingen, 2014. http://d-nb.info/1163282316/34.

21

González-Vallinas, Rostes Juan 1983. "Software development and analysis of high throughput sequencing data for genomic enhancer prediction." Doctoral thesis, Universitat Pompeu Fabra, 2013. http://hdl.handle.net/10803/283480.

Abstract:

High Throughput Sequencing technologies (HTS) are becoming the standard in genomic regulation analysis. During my thesis I developed software for the analysis of HTS data. Through collaborations with other research groups, I specialized in the analysis of ChIP-Seq short mapped reads. For instance, I collaborated in the analysis of the effect of the Hog1 stress-induced response in yeast and helped in the design of a multiple promoter-alignment method using ChIP-Seq data, among other collaborations. Making use of the expertise and the software developed during this time, I analyzed ENCODE datasets in order to detect active genomic enhancers. Genomic enhancers are regions in the genome known to regulate transcription levels of nearby or distant genes. The mechanisms of activation and silencing of enhancers are still poorly understood. Epigenomic elements, like histone modifications and transcription factors, play a critical role in enhancer activity. Modeling epigenomic signals, I predicted active and silenced enhancers in two cell lines and studied their effect on splicing and transcription initiation.

22

Guennel, Tobias. "Statistical Methods for Normalization and Analysis of High-Throughput Genomic Data." VCU Scholars Compass, 2012. http://scholarscompass.vcu.edu/etd/2647.

Abstract:

High-throughput genomic datasets obtained from microarray or sequencing studies have revolutionized the field of molecular biology over the last decade. The complexity of these new technologies also poses new challenges to statisticians to separate biologically relevant information from technical noise. Two methods are introduced that address important issues with normalization of array comparative genomic hybridization (aCGH) microarrays and the analysis of RNA sequencing (RNA-Seq) studies. Many studies investigating copy number aberrations at the DNA level for cancer and genetic studies use comparative genomic hybridization (CGH) on oligo arrays. However, aCGH data often suffer from low signal to noise ratios resulting in poor resolution of fine features. Bilke et al. showed that the commonly used running average noise reduction strategy performs poorly when errors are dominated by systematic components. A method called pcaCGH is proposed that significantly reduces noise using a non-parametric regression on technical covariates of probes to estimate systematic bias. Then a robust principal components analysis (PCA) estimates any remaining systematic bias not explained by technical covariates used in the preceding regression. The proposed algorithm is demonstrated on two CGH datasets measuring the NCI-60 cell lines utilizing NimbleGen and Agilent microarrays. The method achieves a nominal error variance reduction of 60%-65% as well as a 2-fold increase in signal to noise ratio on average, resulting in more detailed copy number estimates. Furthermore, correlations of signal intensity ratios of NimbleGen and Agilent arrays are increased by 40% on average, indicating a significant improvement in agreement between the technologies. A second algorithm called gamSeq is introduced to test for differential gene expression in RNA sequencing studies. Limitations of existing methods are outlined and the proposed algorithm is compared to these existing algorithms. 
Simulation studies and real data are used to show that gamSeq improves upon existing methods with regard to type I error control while maintaining similar or better power for a range of sample sizes for RNA-Seq studies. Furthermore, the proposed method is applied to detect differential 3' UTR usage.

23

Althammer, Sonja Daniela. "Elucidating mechanisms of gene regulation. Integration of high-throughput sequencing data for studying the epigenome." Doctoral thesis, Universitat Pompeu Fabra, 2012. http://hdl.handle.net/10803/81355.

Abstract:

The recent advent of High-Throughput Sequencing (HTS) methods has triggered a revolution in gene regulation studies. Demand has never been higher to process the immense amount of emerging data to gain insight into the regulatory mechanisms of the cell. We address this issue by describing methods to analyze, integrate and interpret HTS data from different sources. In particular, we developed and benchmarked Pyicos, a powerful toolkit that offers flexibility, versatility and efficient memory usage. We applied it to data from ChIP-Seq on progesterone receptor in breast cancer cells to gain insight into regulatory mechanisms of hormones. Moreover, we embedded Pyicos into a pipeline to integrate HTS data from different sources. In order to do so, we used data sets from ENCODE to systematically calculate signal changes between two cell lines. We thus created a model that accurately predicts the regulatory outcome of gene expression, based on epigenetic changes in a gene locus. Finally, we provide the processed data in a Biomart database to the scientific community.

24

Aghamirzaie, Delasa. "Isoform-Specific Expression During Embryo Development in Arabidopsis and Soybean." Diss., Virginia Tech, 2016. http://hdl.handle.net/10919/73054.

Abstract:

Almost every precursor mRNA (pre-mRNA) in a eukaryotic organism undergoes splicing, in some cases resulting in the formation of more than one splice variant, a process called alternative splicing. RNA-Seq provides a major opportunity to capture the state of the transcriptome, which includes the detection of alternative splicing events. Alternative splicing is a highly regulated process occurring in a complex machinery called the spliceosome. In this dissertation, I focus on identification of different splice variants and splicing factors that are produced during Arabidopsis and soybean embryo development. I developed several data analysis pipelines for the detection and the functional characterization of active splice variants and splicing factors that arise during embryo development. The main goal of this dissertation was to identify transcriptional changes associated with specific stages of embryo development and infer possible associations between known regulatory genes and their targets. We identified several instances of exon skipping and intron retention as products of alternative splicing. The coding potential of the splice variants was evaluated using CodeWise. I developed CodeWise, a weighted support vector machine classifier, to assess the coding potential of novel transcripts with respect to RNA secondary structure free energy, conserved domains, and sequence properties. We also examined the effect of alternative splicing on the domain composition of resulting protein isoforms. The majority of splice variant pairs encode proteins with identical domains or similar domains with truncation, and in less than 10% of the cases alternative splicing results in gain or loss of a conserved domain. I constructed several possible regulatory networks that occur at specific stages of embryo development. 
In addition, in order to gain a better understanding of splicing regulation, we developed the concept of co-splicing networks, as a group of transcripts containing common RNA-binding motifs, which are co-expressed with a specific splicing factor. For this purpose, I developed a multi-stage analysis pipeline to integrate the co-expression networks with de novo RNA binding motif discovery at inferred splice sites, resulting in the identification of specific splicing factors and the corresponding cis-regulatory sequences that cause the production of splice variants. This approach resulted in the development of several novel hypotheses about the regulation of minor and major splicing in developing Arabidopsis embryos. In summary, this dissertation provides a comprehensive view of splicing regulation in Arabidopsis and soybean embryo development using computational analysis.
Ph. D.

25

Shi, Xu. "Bayesian Modeling for Isoform Identification and Phenotype-specific Transcript Assembly." Diss., Virginia Tech, 2017. http://hdl.handle.net/10919/79772.

Abstract:

The rapid development of biotechnology has enabled researchers to collect high-throughput data for studying various biological processes at the genomic level, transcriptomic level, and proteomic level. Due to the large noise in the data and the high complexity of diseases (such as cancer), it is a challenging task for researchers to extract biologically meaningful information that can help reveal the underlying molecular mechanisms. The challenges call for more efforts in developing efficient and effective computational methods to analyze the data at different levels so as to understand the biological systems in different aspects. In this dissertation research, we have developed novel Bayesian approaches to infer alternative splicing mechanisms in biological systems using RNA sequencing data. Specifically, we focus on two research topics in this dissertation: isoform identification and phenotype-specific transcript assembly. For isoform identification, we develop a computational approach, SparseIso, to jointly model the existence and abundance of isoforms in a Bayesian framework. A spike-and-slab prior is incorporated into the model to enforce the sparsity of expressed isoforms. A Gibbs sampler is developed to sample the existence and abundance of isoforms iteratively. For transcript assembly, we develop a Bayesian approach, IntAPT, to assemble phenotype-specific transcripts from multiple RNA sequencing profiles. A two-layer Bayesian framework is used to model the existence of phenotype-specific transcripts and the transcript abundance in individual samples. Based on the hierarchical Bayesian model, a Gibbs sampling algorithm is developed to estimate the joint posterior distribution for phenotype-specific transcript assembly. The performances of our proposed methods are evaluated with simulation data, compared with existing methods and benchmarked with real cell line data. 
We then apply our methods to breast cancer data to identify biologically meaningful splicing mechanisms associated with breast cancer. In future work, we will extend our methods for de novo transcript assembly to identify novel isoforms in biological systems; we will also incorporate isoform-specific networks into our methods to better understand splicing mechanisms in biological systems.
Ph. D.

26

Hänzelmann, Sonja 1981. "Pathway-centric approaches to the analysis of high-throughput genomics data." Doctoral thesis, Universitat Pompeu Fabra, 2012. http://hdl.handle.net/10803/108337.

Abstract:

In the last decade, molecular biology has expanded from a reductionist view to a systems-wide view that tries to unravel the complex interactions of cellular components. Owing to the emergence of high-throughput technology it is now possible to interrogate entire genomes at an unprecedented resolution. The dimension and unstructured nature of these data made it evident that new methodologies and tools are needed to turn data into biological knowledge. To contribute to this challenge we exploited the wealth of publicly available high-throughput genomics data and developed bioinformatics methodologies focused on extracting information at the pathway rather than the single gene level. First, we developed Gene Set Variation Analysis (GSVA), a method that facilitates the organization and condensation of gene expression profiles into gene sets. GSVA enables pathway-centric downstream analyses of microarray and RNA-seq gene expression data. The method estimates sample-wise pathway variation over a population and allows for the integration of heterogeneous biological data sources with pathway-level expression measurements. To illustrate the features of GSVA, we applied it to several use-cases employing different data types and addressing biological questions. GSVA is made available as an R package within the Bioconductor project. Secondly, we developed a pathway-centric genome-based strategy to reposition drugs in type 2 diabetes (T2D). This strategy consists of two steps, first a regulatory network is constructed that is used to identify disease driving modules and then these modules are searched for compounds that might target them. Our strategy is motivated by the observation that disease genes tend to group together in the same neighborhood forming disease modules and that multiple genes might have to be targeted simultaneously to attain an effect on the pathophenotype. To find potential compounds, we used compound exposed genomics data deposited in public databases. 
We collected about 20,000 samples that have been exposed to about 1,800 compounds. Gene expression can be seen as an intermediate phenotype reflecting underlying dysregulatory pathways in a disease. Hence, genes contained in the disease modules that elicit similar transcriptional responses upon compound exposure are assumed to have a potential therapeutic effect. We applied the strategy to gene expression data of human islets from diabetic and healthy individuals and identified four potential compounds, methimazole, pantoprazole, bitter orange extract and torcetrapib, that might have a positive effect on insulin secretion. This is the first time a regulatory network of human islets has been used to reposition compounds for T2D. In conclusion, this thesis contributes two pathway-centric approaches to important bioinformatic problems, such as the assessment of biological function and in silico drug repositioning. These contributions demonstrate the central role of pathway-based analyses in interpreting high-throughput genomics data.

27

Jeena, Ganga [Verfasser], and Korbinian [Gutachter] Schneeberger. "A bioinformatics approach to quantify the effects of the underlying regulatory mechanisms on natural variation in gene expression by allele-specific expression analysis in Arabidopsis thaliana accessions using RNA-Seq Data / Ganga Jeena ; Gutachter: Korbinian Schneeberger." Köln : Universitäts- und Stadtbibliothek Köln, 2020. http://d-nb.info/1239811713/34.

28

Lajoie, Bryan R. "Computational Approaches for the Analysis of Chromosome Conformation Capture Data and Their Application to Study Long-Range Gene Regulation: A Dissertation." eScholarship@UMMS, 2016. http://escholarship.umassmed.edu/gsbs_diss/833.

Abstract:

Over the last decade, the development and application of a set of molecular genomic approaches based on the chromosome conformation capture method (3C), combined with increasingly powerful imaging approaches, have enabled high-resolution and genome-wide analysis of the spatial organization of chromosomes. The aim of this thesis is two-fold: 1) to provide guidelines for analyzing and interpreting data obtained from genome-wide 3C methods such as Hi-C and 3C-seq, and 2) to leverage the 3C technology to solve genome function, structure, assembly, development and dosage problems across a broad range of organisms and disease models. First, through the introduction of cWorld, a toolkit for manipulating genome structure data, I accelerate the pace at which *C experiments can be performed and analyzed and biological insights inferred. Next, I discuss a set of practical guidelines one should consider while planning an experiment to study the structure of the genome, a simple workflow for data processing unique to *C data, and a set of considerations one should be aware of while attempting to gain insights from the data. Next, I apply these guidelines and leverage the cWorld toolkit in the context of two dosage compensation systems. The first is a worm condensin mutant which shows a reduction in dosage compensation in the hermaphrodite X chromosomes. The second is an allele-specific study consisting of genome-wide Hi-C, RNA-Seq and ATAC-Seq, which can measure the state of the active (Xa) and inactive (Xi) X chromosome. Finally, I turn to studying specific gene-enhancer looping interactions across a panel of ENCODE cell lines. These studies, when taken together, further our understanding of how genome structure relates to genome function.

29

Tominaga, Sacomoto Gustavo Akio. "Efficient algorithms for de novo assembly of alternative splicing events from RNA-seq data." Phd thesis, Université Claude Bernard - Lyon I, 2014. http://tel.archives-ouvertes.fr/tel-01015506.

Abstract:

In this thesis, we address the problem of identifying and quantifying variants (alternative splicing and genomic polymorphism) in RNA-seq data when no reference genome is available, without assembling the full transcripts. Based on the idea that each variant corresponds to a recognizable pattern, a bubble, in a de Bruijn graph constructed from the RNA-seq reads, we propose a general model for all variants in such graphs. We then introduce an exact method, called KisSplice, to extract alternative splicing events and show that it outperforms general-purpose transcriptome assemblers. We put extra effort into making KisSplice as scalable as possible. To improve the running time, we propose a new polynomial-delay algorithm to enumerate bubbles. We show that it is several orders of magnitude faster than previous approaches. To reduce its memory consumption, we propose a new compact way to build and represent a de Bruijn graph. We show that our approach uses 30% to 40% less memory than the state of the art, with an insignificant impact on the construction time. Additionally, we apply the techniques developed to list bubbles to two classical problems: cycle enumeration and the K-shortest paths problem. We give the first optimal algorithm to list cycles in undirected graphs, improving over Johnson's algorithm. This is the first improvement to this problem in almost 40 years. We then consider a different parameterization of the K-shortest (simple) paths problem: instead of bounding the number of st-paths, we bound the weight of the st-paths. We present new algorithms using exponentially less memory than previous approaches.
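
The bubble pattern mentioned in the abstract above can be illustrated with a small sketch. This is not KisSplice and not the thesis's enumeration algorithm; it only builds a plain de Bruijn graph and reports where two paths split and rejoin. The function names, the value of k and the toy reads are all hypothetical.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branching_nodes(graph):
    """Nodes with more than one successor: candidate bubble openings."""
    return sorted(node for node, succ in graph.items() if len(succ) > 1)


def merging_nodes(graph):
    """Nodes with more than one predecessor: candidate bubble closings."""
    indegree = defaultdict(int)
    for successors in graph.values():
        for node in successors:
            indegree[node] += 1
    return sorted(node for node, count in indegree.items() if count > 1)


# Two toy reads that differ at a single position (hypothetical data).
reads = ["ACGTTGCA", "ACGATGCA"]
graph = de_bruijn_graph(reads, k=4)
openings = branching_nodes(graph)
closings = merging_nodes(graph)
```

With these two reads the paths diverge after the shared prefix and converge again before the shared suffix, which is the shape a single-nucleotide variant leaves in the graph; alternative splicing events leave longer bubbles of the same kind.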

30

Jardillier, Rémy. "Evaluation de différentes variantes du modèle de Cox pour le pronostic de patients atteints de cancer à partir de données publiques de séquençage et cliniques." Thesis, Université Grenoble Alpes, 2020. http://www.theses.fr/2020GRALS008.

Abstract:

Cancer has been the leading cause of premature mortality (death before the age of 65) in France since 2004. For the same organ, each cancer is unique, and personalized prognosis is therefore an important aspect of patient management and follow-up. The decrease in RNA sequencing costs has made it possible to measure the molecular profiles of many tumor samples on a large scale. Thus, the TCGA database provides RNA-seq data from tumors, clinical data (age, sex, grade, stage, etc.), and follow-up times of the associated patients over several years (including patient survival, possible recurrence, etc.). New discoveries are thus made possible in terms of biomarkers built from transcriptomic data, with individualized prognoses. These advances require the development of high-dimensional data analysis methods that take into account survival data (right-censored), clinical characteristics, and the molecular profiles of patients. In this context, the main goal of the thesis is to compare and adapt methodologies to construct prognostic risk scores for survival or recurrence of patients with cancer from sequencing and clinical data. The Cox model (semi-parametric) is widely used to model these survival data and allows them to be linked to explanatory variables. The RNA-seq data from TCGA contain more than 20,000 genes for only a few hundred patients. The number p of variables then exceeds the number n of patients, and parameter estimation is subject to the "curse of dimensionality". The two main strategies to overcome this issue are penalization methods and gene pre-filtering. Thus, the first objective of this thesis is to compare the classical penalization methods for the Cox model (i.e. ridge, lasso, elastic net, adaptive elastic net). To this end, we use real and simulated data to control the amount of information contained in the transcriptomic data. The second issue addressed concerns the univariate pre-filtering of genes before using a multivariate Cox model. We propose a methodology to increase the stability of the selected genes and to choose the filtering thresholds by optimizing the predictions. Finally, although the cost of sequencing (RNA-seq) has decreased drastically over the last decade, it remains too high for routine use in practice. In a final section, we show that the sequencing depth of miRNAs can be reduced without degrading the quality of predictions for some TCGA cancers, but not for others.
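
For orientation, the penalized Cox estimators compared in this abstract are usually written in the following textbook form. The notation is an assumption and is not taken from the thesis: x_i is the expression vector of patient i, delta_i the event indicator, R(t_i) the set of patients still at risk at time t_i, and ties are ignored.

```latex
\hat{\beta} = \arg\max_{\beta \in \mathbb{R}^p}
  \left\{ \ell(\beta)
  - \lambda \left( \alpha \lVert \beta \rVert_1
  + \frac{1-\alpha}{2} \lVert \beta \rVert_2^2 \right) \right\},
\qquad
\ell(\beta) = \sum_{i:\,\delta_i = 1}
  \left[ x_i^\top \beta
  - \log \sum_{j \in R(t_i)} \exp\!\left( x_j^\top \beta \right) \right]
```

Under this parameterization, alpha = 0 gives ridge, alpha = 1 gives lasso, and intermediate values give the elastic net; software packages may scale the log partial likelihood differently.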

31

Li, Pei-Hsun, and 李沛洵. "Gene Set Enrichment Analysis of RNA-Seq data." Thesis, 2016. http://ndltd.ncl.edu.tw/handle/37077367052846883613.

Abstract:

Master's thesis, National Taiwan University, Graduate Institute of Agronomy, academic year 104.
During the past few years, RNA-Seq technology has been widely employed for studying the transcriptome because it has clear advantages over other transcriptomic technologies. The most popular application of RNA-seq is to identify differentially expressed genes. In addition, gene set analysis (GSA) aims to determine whether a predefined gene set, in which the genes share a common biological function, is correlated with the phenotype. To date, many GSA approaches have been developed for identifying differentially expressed gene sets using microarray data. However, these methods are not directly applicable to RNA-seq data because of intrinsic differences between the two data structures. When testing the differential expression of gene sets, most GSA methods rely on the critical assumption that the members of each gene set are sampled independently. This implies that the genes within a gene set do not share a common biological function. To resolve this issue, we propose a GSA method based on the De-correlation (DECO) algorithm of Dougu Nam (2010) to remove the correlation bias in the expression of each gene set. We study the performance of our proposed method against other GSA methods through simulation studies under various scenarios combined with four different normalization methods. We found that our proposed method outperforms the others in terms of Type I error rate and empirical power.

32

Jenkins, David. "Pathway activity analysis of bulk and single-cell RNA-Seq data." Thesis, 2019. https://hdl.handle.net/2144/34809.

Abstract:

Gene expression profiling can produce effective biomarkers that can provide additional information beyond other approaches for characterizing disease. While these approaches are typically performed on standard bulk RNA sequencing data, new methods for RNA sequencing of individual cells have allowed these approaches to be applied at the resolution of a single cell. As these methods enter the mainstream, there is an increased need for user-friendly software that allows researchers without experience in bioinformatics to apply these techniques. In this thesis, I have developed new, user-friendly data resources and software tools to allow researchers to use gene expression signatures in their own datasets. Specifically, I created the Single Cell Toolkit, a user-friendly and interactive toolkit for analyzing single-cell RNA sequencing data and used this toolkit to analyze the pathway activity levels in breast cancer cells before and after cancer therapy. Next, I created and validated a set of activated oncogenic growth factor receptor signatures in breast cancer, which revealed additional heterogeneity within public breast cancer cell line and patient sample RNA sequencing datasets. Finally, I created an R package for rapidly profiling TB samples using a set of 30 existing tuberculosis gene signatures. I applied this tool to look at pathway differences in a dataset of tuberculosis treatment failure samples. Taken together, the results of these studies serve as a set of user-friendly software tools and data sets that allow researchers to rapidly and consistently apply pathway activity methods across RNA sequencing samples.

33

Fino, Joana Rita Vieira. "Analysis of RNA-seq data from the interaction of Coffea spp. - Colletotrichum kahawae." Master's thesis, 2014. http://hdl.handle.net/10451/11738.

Abstract:

Master's thesis in Bioinformatics and Computational Biology (Bioinformatics), presented to the Universidade de Lisboa, through the Faculdade de Ciências, 2014.
Coffee is one of the most traded products in the world, of great economic and social importance, affecting millions of people who depend directly or indirectly on this industry. However, coffee cultivation is severely affected by pathogens, namely fungi. Colletotrichum kahawae Waller and Bridge is one of these agents and is responsible for anthracnose of green coffee berries, known as "Coffee Berry Disease". This disease affects Coffea arabica L., the most important species on the market, with the largest production volumes. At present, coffee berry disease occurs mainly in high-altitude areas and is confined to the African continent. However, this does not mean that it cannot spread to other growing areas where conditions are favorable for the development of both the plant and the fungus. Several breeding strategies have been developed to combat the disease, leading to the development of some resistant varieties in Kenya. Although several genotypes with resistance to this disease are already known, its genetic and molecular bases are still unknown. To understand the bases underlying the resistance process, comparative transcriptome sequencing was performed on the Illumina platform for two coffee genotypes, one susceptible (Caturra) and one resistant (Catimor 88), during the first hours of interaction with C. kahawae. The analysis of these data aimed to identify differentially expressed genes involved in the plant's resistance to the disease. The sequencing data had previously been analyzed by the company ARK genomics (UK), although using standardized software and parameters that are normally applied to all analyses of this kind, from bacteria to plants. To improve and deepen the analysis, a new customized analysis was developed, which is presented here in comparison with the previous one. Several tools and approaches were applied in this new analysis, taking into account the absence of a reference genome. In this work it was possible to identify several problems and precautions to be taken, from the processing of the reads to the calculation of expression differences, as well as simple differences between software packages. This new expression study also took into account comparative analyses at different levels that had not been carried out in the previous analysis. The annotation of differentially expressed unigenes indicates a trend towards functional categories directly related to energy production, involved in plant growth and development, and to processes already identified as involved in the defense response to pathogens, such as sugar metabolism or the biosynthesis of phenylalanine and phenylpropanoids. Overall, the objectives of this work were met, and an analysis pipeline was developed that allowed a better and more appropriate exploration of the data generated by transcriptome sequencing. The results obtained are thus expected to contribute to increasing scientific knowledge of the plant's defense response, generating useful information for establishing breeding programs that support the sustainable production of a crop of such economic and social relevance.
In addition, this work is expected to show the need for careful analysis of next-generation sequencing data, especially data resulting from RNA sequencing, a still fairly recent technology without a universally accepted procedure for the correct analysis of the data generated.
Coffee is one of the most traded products in the world, with great social and economic importance for the millions of people who depend directly or indirectly on it. Coffee berry disease (CBD), caused by the fungus Colletotrichum kahawae Waller & Bridge, is considered the biggest threat to Arabica coffee production at high altitude in Africa. In Coffea arabica L. plantations, CBD can cause crop losses of 20-50%, reaching 80% in years of severe epidemics if chemical control is not applied. To control this disease, several coffee improvement strategies were developed, which led to the selection of a few resistant hybrid commercial varieties in Kenya. Breeding for coffee resistance therefore remains a powerful strategy to fight CBD in an economical and sustainable manner. To gain insight into the coffee resistance process, an Illumina RNA sequencing approach was used to characterize the transcriptional profiles of two coffee genotypes, one susceptible (Caturra) and one resistant (Catimor 88) to C. kahawae, during the early stages of the infection process. The differential expression analysis of these data aimed to identify genes putatively involved in the resistance process. Although a previous analysis was carried out by the sequencing company ARK genomics (UK), it was based only on non-specific methods generally applied to a wide range of organisms. To improve the analysis and consequently the results obtained, a new approach was taken, aiming to produce a more customized workflow.
Compared with the previous analysis, the present approach showed some improvement in transcriptome assembly quality and size and in the level of confidence of the differential expression results, despite CPU and RAM limitations. It was possible to include additional comparative analyses in the differential expression assessment and to identify the enriched functional categories represented by the differentially expressed unigenes. Regarding the biological results, the resistant genotype showed a highly effective response to the infection, while the susceptible genotype showed an early stress-driven response to the infection. The KOG and KEGG annotation of the differentially expressed unigenes identified two main domains: plant development and defense response. The results obtained here are expected to contribute to increasing scientific knowledge of the plant defense response, generating useful information to guide the establishment of breeding programs that support sustainable production. Moreover, this study is expected to show the need for careful analysis of next-generation sequencing data, especially when dealing with recent methods such as RNA-seq, for which there is no clear consensus on the best analysis practices.

34

Schulz, Marcel Holger [Verfasser]. "Data structures and algorithms for analysis of alternative splicing with RNA-seq data / Marcel H. Schulz." 2010. http://d-nb.info/1014037832/34.

35

Wolff, Alexander. "Analysis of expression profile and gene variation via development of methods for Next Generation Sequencing data." Thesis, 2018. http://hdl.handle.net/11858/00-1735-0000-002E-E517-9.

36

Shomroni, Orr. "Development of algorithms and next-generation sequencing data workflows for the analysis of gene regulatory networks." Doctoral thesis, 2017. http://hdl.handle.net/11858/00-1735-0000-0023-3E0C-8.

37

Robert, Bonnie-Jean. "A pipeline for differential expression analysis of RNA-seq data and the effect of filter cutoff on performance." Thesis, 2017. https://dspace.library.uvic.ca//handle/1828/8538.

Abstract:

RNA sequencing is a powerful new approach to analyzing differential expression of transcripts between treatments. Many statistical methods are now available to test for differential expression, and each one reports results differently. This thesis presents a workflow of five popular methods and discusses the results. A pipeline was built in the R language to analyze four of these packages using a real RNA-seq dataset. At present, researchers must prepare RNA-seq data prior to analysis to achieve reliable results. Filtering is a necessary preparatory step in which transcripts exhibiting low levels of expression are removed from further analysis. Yet little research is available to guide researchers on how best to choose this threshold. This thesis introduces a study designed to determine whether the choice of filter threshold has a significant effect on individual package performance. Increasing the filtering threshold was shown to decrease the sensitivity and increase the specificity of the four statistical methods studied.
Graduate
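
As a rough illustration of the filtering step described above, the sketch below drops transcripts whose total read count falls below a cutoff. The abstract does not state the actual filtering rule or thresholds, so the summed-count criterion, the function name and the toy counts are all assumptions made for illustration.

```python
def filter_low_expression(counts, min_total):
    """Keep transcripts whose read count summed over all samples is at least min_total."""
    return {name: values for name, values in counts.items() if sum(values) >= min_total}


# Hypothetical read counts for three transcripts across four samples.
counts = {
    "tx_a": [0, 1, 0, 2],
    "tx_b": [15, 22, 9, 30],
    "tx_c": [3, 4, 2, 5],
}

kept_lenient = filter_low_expression(counts, min_total=5)   # low cutoff
kept_strict = filter_low_expression(counts, min_total=20)   # high cutoff
```

Raising the cutoff leaves fewer transcripts to test, which is consistent with the trade-off the abstract reports: fewer true positives can be detected, but fewer low-count transcripts are available to produce false calls.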

38

Yeh, Tsun-Hao, and 葉存皓. "Differential expression analysis from RNA-seq data upon different osmotic treatments during submergence in rice seedlings (Oryza sativa L.)." Thesis, 2019. http://ndltd.ncl.edu.tw/cgi-bin/gs32/gsweb.cgi/login?o=dnclcdr&s=id=%22107NCHU5417015%22.&searchmode=basic.

Abstract:

Master's thesis, National Chung Hsing University, Department of Agronomy, academic year 107.
With global climate change, the frequency of heavy rain and typhoons has increased as a result of global warming. Heavy rain and typhoons can reduce crop yield. Submergence causes not only hypoxic stress but also osmotic stress. The physiological responses and molecular mechanisms of hypoxia have been well studied, but there are few studies of osmotic stress under submergence. To better understand the mechanisms of osmotic stress during submergence, we examined the responses of IR64-Sub1 under submergence (Sub) and under submergence combined with mannitol (Sub+Man) or sodium chloride (Sub+NaCl). Transcriptomic analysis was performed by RNA-seq. Compared with the gene expression profile under the control condition, the number of differentially expressed genes (DEGs) was lowest under the Sub treatment, followed by the Sub+Man and Sub+NaCl treatments. Under a 500-fold change selection criterion, 78 genes were co-expressed under all three treatments, while 78, 154 and 407 genes were specifically regulated under the Sub, Sub+Man and Sub+NaCl treatments, respectively. Comparing the expression profiles, the Sub and Sub+Man treatments were more similar to each other than to the Sub+NaCl treatment. Gene Ontology analysis showed that gene functions were related to biotic and abiotic stresses, transcription factors, protein synthesis and protein degradation. Genes involved in the glycolysis and fermentation pathways were highly enriched; these included Sucrose Synthase4 (SUS4), Hexokinase9 (HEX9), Glyceraldehyde 3-Phosphate Dehydrogenase (GAPDH), Pyruvate Decarboxylase2 (PDC2) and Alcohol Dehydrogenase1 (ADH1), which were differentially expressed under the different osmotic treatments during submergence. The expression levels of ADH1, PDC2 and GAPDH were higher in the Sub+Man and Sub+NaCl treatments than in the Sub treatment. The expression level of SUS4 was higher in the Sub+NaCl treatment than in the Sub treatment. The expression of HEX9 was induced in all treatments. The expression of malate dehydrogenase (MDH) was induced in the Sub+NaCl treatment but downregulated in the Sub and Sub+Man treatments. Based on these results, IR64-Sub1 has complex mechanisms for responding to different osmotic treatments during submergence. These results provide new insight into submergence responses.

39

Gerasimov, Ekaterina. "Analysis of NGS Data from Immune Response and Viral Samples." 2017. http://scholarworks.gsu.edu/cs_diss/127.

Abstract:

This thesis is devoted to designing and applying advanced algorithmic and statistical tools for the analysis of NGS data related to cancer and infectious diseases. The NGS data under investigation are obtained either from host samples or from viral variants. Recently, random peptide phage display libraries (RPPDL) have been applied to studies of the host's antibody response to different diseases. We study the human antibody response to breast cancer and the mouse antibody response to Lyme disease by sequencing whole antibody repertoire profiles, which are represented by RPPDL. Alternatively, instead of sequencing the immune response, NGS can be applied directly to a viral population within an infected host. Specifically, we analyze the following RNA viruses: the human immunodeficiency virus (HIV) and the infectious bronchitis virus (IBV). Sequencing of RNA viruses is challenging because the high mutation rate produces many variants within a population. Our results show that NGS helps to understand RNA viruses and to explore their interaction with infected hosts. NGS also helps to analyze the immune response to different diseases and to trace changes in the immune response at different disease stages.

40

Sousa, Eric Serafim Ramos de. "Bioinformatics analyses and approaches to RNA-Seq Data." Master's thesis, 2017. http://hdl.handle.net/10451/33897.

Abstract:

Master's thesis, Bioinformatics and Computational Biology (Bioinformatics), Universidade de Lisboa, Faculdade de Ciências, 2017.
Ageing is still a poorly understood process, and identifying the most important modulators of ageing remains a challenge. In this study, RNA-Seq data from two different experiments were put through bioinformatics pipelines to try to find genes involved in ageing and anti-ageing processes. One of the experiments compared the transcriptomes of naked mole-rat and mouse cells after they had been stressed, to try to understand the biological processes behind the naked mole-rat's resistance to DNA-damaging agents. The other experiment evaluated the alterations of the transcriptome after the silencing of gene Bc055324 in human cells. Data from the experiments were mapped to reference genomes using STAR, read counts per gene were obtained using ReadCounter, and differential expression was explored using edgeR. When some unusual results appeared, FastQ Screen was used to try to understand the source of the problem. In one of the experiments the batch effect was simply too great, even after adjusting for it mathematically, so few inferences could be made. In the other experiment, a major mistake during its execution meant that its initial concept was not achieved. Gene Unc79 was identified as an upregulated differentially expressed gene when comparing naked mole-rat cells treated with camptothecin, after 48 hours, with control cells. Even though there is a mouse homolog of this gene, it is not expressed in the mouse samples. Gene Unc79 may be an important player in the naked mole-rat's resistance to DNA-damaging compounds. This study highlights the importance of proper design and execution of experiments. Even though the cost of NGS sequencing is going down, it is still an expensive technique, and an experiment that is not properly executed may waste precious resources. This study also highlights that bioinformatics is a multidisciplinary field and that, without good data, even the best tools will not help to draw conclusions about a given situation.

41

Temate, Tiagueu Yvette Charly B., and Tiagueu Yvette C. B. Temate. "Methods for Differential Analysis of Gene Expression and Metabolic Pathway Activity." 2016. http://scholarworks.gsu.edu/cs_diss/102.

Abstract:

RNA-Seq is an increasingly popular approach to transcriptome profiling that uses the capabilities of next-generation sequencing technologies and provides better measurement of the levels of transcripts and their isoforms. In this thesis, we apply the RNA-Seq protocol and transcriptome quantification to estimate gene expression and pathway activity levels. We present a novel method, called IsoDE, for differential gene expression analysis based on bootstrapping. For the first version of IsoDE, we compared the tool against four existing methods (Fisher's exact test, GFOLD, edgeR and Cuffdiff) on RNA-Seq datasets generated using three different sequencing technologies, both with and without replicates. We also introduce the second version of IsoDE, which runs 10 times faster than the first implementation owing to in-memory processing applied to the underlying gene expression frequency estimation tool, together with further optimization of the analysis. The second part of this thesis presents a set of tools for the differential analysis of metabolic pathways from RNA-Seq data. Metabolic pathways are series of chemical reactions occurring within a cell. We focus on two main problems in the differential analysis of metabolic pathways, namely, differential analysis of their inferred activity level and of their estimated abundance. We validate our approaches through differential expression analysis at the transcript and gene levels and also through real-time quantitative PCR experiments. In Part Four, we present the different packages created or updated in the course of this study. We conclude with our future work plans for further improving IsoDE 2.0.

42

von der Heyde, Silvia. "Unravelling Drug Resistance Mechanisms in Breast Cancer." Thesis, 2015. http://hdl.handle.net/11858/00-1735-0000-0022-602E-E.

43

Althobaiti, Atheer. "Reconstruction of Cell and Tissue-specific Immune-protein Interactomes Using Single-cell RNA Sequencing Data." Thesis, 2021. http://hdl.handle.net/10754/669004.

Abstract:

Protein molecules and their interactions via protein-protein interactions (PPIs) are at the core of cellular functions. While global PPI networks have been useful for analyzing gene function and the effects of genetic variants, they do not resolve tissue- and cell-type-specific interactions. Here we leverage recent advances in single-cell RNA sequencing (scRNA-seq) to reconstruct cell-type-specific PPI networks across different tissues to enable a context-sensitive analysis of immune cells’ gene-protein pathways. Targeting B cells, T cells, and macrophage cells as a proof-of-principle, we used scRNA-seq data across different tissues from the Tabula Muris mouse consortium. We mapped the protein-coding DEGs to a protein-protein interaction network database (STRING v.11). Topological and global similarity analysis of the networks revealed distinct properties between tissues, highlighting tissue-specific behaviors for each cell type. For example, we found that degree and clustering coefficient distributions were tissue-specific. Different cell types and tissues displayed specific characteristics, and in particular, the splenic PPI networks were different compared to other analyzed tissues for all the immune cell types examined. For example, the pairwise comparison of the Jaccard index for node similarity and the Mantel test correlation analysis showed that the spleen’s node sets and PPI networks are more different than those of any other tissue for each cell type examined. The physiological and anatomical properties that distinguish the spleen from other examined tissues might explain why the splenic PPI networks tend to be less similar compared to other tissues. The cell-type-specific network analyses using different distance measures between the adjacency matrices on the hub nodes, such as Euclidean, Manhattan, Jaccard, and Hamming distances, showed a macrophage-specific behavior not observed in B cells and T cells, confirming their lineage differences.
Finally, we explored the rewiring of selected hub nodes and transcription factors in the PPI networks along with their biological enrichments to validate our observations. The suggested biological validity of our results confirms the relevance of data-driven reconstruction of these context-sensitive networks using more advanced network inference algorithms. In conclusion, scRNA-seq enables the reconstruction of global unspecific PPI networks into cell and tissue-specific networks, thereby providing an increased resolution of the biological context.
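The node-similarity and adjacency-matrix comparisons mentioned in this abstract can be sketched in a few lines. The example below is only a minimal illustration, not the thesis's pipeline: the helper names and the two toy edge lists (labelled "spleen" and "lung", with arbitrary gene symbols) are assumptions made here.

```python
import math


def jaccard_index(nodes_a, nodes_b):
    """Jaccard index |A ∩ B| / |A ∪ B| between two node sets."""
    a, b = set(nodes_a), set(nodes_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def adjacency_matrix(edges, nodes):
    """Symmetric 0/1 adjacency matrix over a fixed, shared node ordering."""
    index = {node: i for i, node in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for u, v in edges:
        i, j = index[u], index[v]
        matrix[i][j] = matrix[j][i] = 1
    return matrix


def hamming_distance(m1, m2):
    """Number of matrix entries that differ."""
    return sum(x != y for r1, r2 in zip(m1, m2) for x, y in zip(r1, r2))


def euclidean_distance(m1, m2):
    """Euclidean (Frobenius) distance between two matrices."""
    return math.sqrt(sum((x - y) ** 2
                         for r1, r2 in zip(m1, m2) for x, y in zip(r1, r2)))


# Hypothetical toy networks; the edges are illustrative, not real results.
spleen_edges = [("Cd19", "Cd79a"), ("Cd79a", "Cd79b")]
lung_edges = [("Cd19", "Cd79a"), ("Cd19", "Ptprc")]

spleen_nodes = {n for edge in spleen_edges for n in edge}
lung_nodes = {n for edge in lung_edges for n in edge}
shared_order = sorted(spleen_nodes | lung_nodes)

m_spleen = adjacency_matrix(spleen_edges, shared_order)
m_lung = adjacency_matrix(lung_edges, shared_order)

print(jaccard_index(spleen_nodes, lung_nodes))
print(hamming_distance(m_spleen, m_lung), euclidean_distance(m_spleen, m_lung))
```

Both matrices are built over the same sorted union of nodes so that entry-wise distances compare like with like.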
