Dissertations / Theses on the topic 'Transcriptomic data analysis'

1

Xu, Huan. "Controlling false positive rate in network analysis of transcriptomic data." University of Cincinnati / OhioLINK, 2019. http://rave.ohiolink.edu/etdc/view?acc_num=ucin156267322069819.

2

Kmetzsch, Virgilio. "Multimodal analysis of neuroimaging and transcriptomic data in genetic frontotemporal dementia." Electronic Thesis or Diss., Sorbonne université, 2022. https://accesdistant.sorbonne-universite.fr/login?url=https://theses-intra.sorbonne-universite.fr/2022SORUS279.pdf.

Abstract:
Frontotemporal dementia (FTD) is the second most common type of dementia in adults under the age of 65. There is currently no treatment that can cure this condition. It is therefore essential to identify biomarkers capable of assessing disease progression. This thesis has two objectives. The first is to analyze the expression patterns of circulating microRNAs taken from blood plasma samples of patients, asymptomatic individuals who carry genetic mutations causing FTD, and controls, to determine whether the expression of certain microRNAs correlates with mutation status and disease progression. The second is to propose methods for integrating cross-sectional microRNA and neuroimaging data to estimate disease progression. We conducted three studies. First, we analyzed plasma samples from C9orf72 expansion carriers and identified four microRNAs whose expression correlated with the clinical status of the participants. Next, we tested all microRNA signatures identified in the literature as potential biomarkers of FTD or amyotrophic lateral sclerosis (ALS) in two independent cohorts. Finally, in our third study, we proposed a new approach, using a supervised multimodal variational autoencoder, that estimates a disease progression score from cross-sectional microRNA expression and neuroimaging datasets with small sample sizes. The work conducted in this interdisciplinary thesis showed that non-invasive biomarkers, such as circulating microRNAs and magnetic resonance imaging, can be used to assess the progression of rare neurodegenerative diseases such as FTD and ALS.
3

Caterino, Cinzia. "The aging synapse: an integrated proteomic and transcriptomic analysis." Doctoral thesis, Scuola Normale Superiore, 2019. http://hdl.handle.net/11384/86004.

Abstract:
An important hallmark of aging is the loss of proteostasis, which can lead to the formation of protein aggregates and to mitochondrial dysfunction in neurons. Although it is well known that protein synthesis is finely regulated in the brain, especially at synapses, where mRNAs are locally translated in an activity-dependent manner, little is known about the changes in the synaptic proteome and transcriptome during aging. This work therefore aims to elucidate the relationship between transcriptome and proteome at the somatic and synaptic levels during aging. Cerebral cortices were isolated from 3-week-old, 5-month-old and 18-month-old mice, and the synaptosomal fraction was extracted by ultracentrifugation on a discontinuous sucrose gradient. The fraction was then analyzed by data-independent acquisition (DIA) mass spectrometry, and the resulting data were processed using Spectronaut software. RNA was also extracted and analyzed by ribo-zero RNA-seq. Data were analyzed and combined in R. Proteomic and transcriptomic data analysis revealed that, in young animals, proteins and transcripts are correlated and synaptic regulation is driven by changes in the soma. During aging, transcripts become decoupled from proteins, and the somatic compartment from the synaptic one. For example, there is an increase of ribosomal proteins at synapses that is not mirrored by a concomitant increase at the somatic level. Furthermore, the soma-synapse gradient of ribosomal genes changes with aging: ribosomal transcripts are less abundant and ribosomal proteins more abundant in the synaptic compartment of old mice than in that of young mice. Mass spectrometry analysis of synaptic protein aggregates revealed that they are particularly rich in ribosomal proteins and also in some components of lysosomes and the proteasome, suggesting that loss of proteostasis and inefficient degradation lead to aggregation of ribosomes in the synaptic compartment.
Strikingly, Desmoplakin, a structural constituent of desmosomes, was also highly abundant in synaptic aggregates. This study suggests that aging affects both the local translational machinery and the trafficking of transcripts and proteins.
4

Captier, Nicolas. "Multimodal analysis of radiological, pathological, and transcriptomic data for the prediction of immunotherapy outcome in Non-Small Cell Lung Cancer patients." Electronic Thesis or Diss., Université Paris sciences et lettres, 2024. http://www.theses.fr/2024UPSLS012.

Abstract:
Overall survival of patients with metastatic non-small cell lung cancer (NSCLC) has been increasing with the use of anti-PD-1/PD-L1 immunotherapies. However, the duration of response remains highly variable between patients, and only 20-30% of patients are alive at 2 years. New biomarkers for predicting response to treatment and patient outcomes are therefore still needed to guide therapeutic decisions. In my PhD, we investigated machine learning approaches to leverage radiological, transcriptomic, and pathological data, integrating them into multimodal models that might improve on the limited predictive power of routine clinical data.
My doctoral research stood at the heart of a multidisciplinary project funded by Fondation ARC, entitled "SIGN'IT 2020-Signatures in Immunotherapy". It brought together several research teams of Institut Curie alongside a team from Institut du thorax, led by Professor Nicolas Girard, in charge of patient management and data collection. We built a new multimodal cohort of 317 metastatic NSCLC patients treated with first-line immunotherapy, alone or combined with chemotherapy. At baseline, we collected clinical information from routine care, 18F-FDG PET/CT scans, digitized pathological slides from the initial diagnosis, and bulk RNA-seq profiles from solid biopsies. Immunotherapy outcome was assessed with Overall Survival (OS) and Progression-Free Survival (PFS).
Together with Irène Buvat and Emmanuel Barillot, whose teams hold significant expertise in the analysis of medical images and RNA-seq tumor profiles, respectively, we initially focused on designing computational tools to extract relevant and interpretable information from these two data modalities. We notably developed a Python tool to apply Independent Component Analysis (ICA) to omics data and stabilize the results across multiple runs. We then explored the potential of stabilized ICA to extract powerful and biologically relevant transcriptomic features for the prediction of patient outcome. For medical images, and in particular 18F-FDG PET scans, we investigated the potential of radiomic approaches to characterize the metastatic disease at the whole-body level and to design novel predictive features. We designed a Python explanation tool, based on Shapley values, to highlight the contribution of each individual metastasis to the prediction of radiomic models that take such whole-body features as input.
A substantial portion of my PhD was devoted to the integration of clinical, radiomic, and transcriptomic features, as well as pathomic features extracted from digitized pathological slides (with the assistance of Thomas Walter's team). We conducted a thorough comparison of the predictive capabilities of the different multimodal combinations using various state-of-the-art learning algorithms and integration methods. We devised strategies to overcome the many challenges associated with multimodal integration within our dataset, including handling missing modalities for numerous patients, dealing with a modest cohort size relative to the high dimensionality of the data, and ensuring a fair comparison of all the possible multimodal combinations. We especially focused on highlighting the potential of multimodal approaches to enhance patient risk stratification with respect to models using only clinical information collected during routine care.
5

Schmidt, Florian [Verfasser], and Marcel Holger [Akademischer Betreuer] Schulz. "Applications, challenges and new perspectives on the analysis of transcriptional regulation using epigenomic and transcriptomic data / Florian Schmidt ; Betreuer: Marcel Holger Schulz." Saarbrücken : Saarländische Universitäts- und Landesbibliothek, 2019. http://d-nb.info/1196090173/34.

6

Schmidt, Florian [Verfasser], and Marcel Holger [Akademischer Betreuer] Schulz. "Applications, challenges and new perspectives on the analysis of transcriptional regulation using epigenomic and transcriptomic data / Florian Schmidt ; Betreuer: Marcel Holger Schulz." Saarbrücken : Saarländische Universitäts- und Landesbibliothek, 2019. http://nbn-resolving.de/urn:nbn:de:bsz:291--ds-287773.

7

Czerwińska, Urszula. "Unsupervised deconvolution of bulk omics profiles : methodology and application to characterize the immune landscape in tumors Determining the optimal number of independent components for reproducible transcriptomic data analysis Application of independent component analysis to tumor transcriptomes reveals specific and reproducible immune-related signals A multiscale signalling network map of innate immune response in cancer reveals signatures of cell heterogeneity and functional polarization." Thesis, Sorbonne Paris Cité, 2018. http://www.theses.fr/2018USPCB075.

Abstract:
Tumors are embedded in a complex tumor microenvironment (TME) that includes tumor cells, fibroblasts, and a diversity of immune cells. A new generation of cancer therapies based on modulation of the immune response is currently in active clinical development, with promising first results. Understanding the composition of the TME in each tumor is therefore critically important for making a prognosis about tumor progression and response to treatment. However, we lack reliable and validated quantitative approaches to characterize the TME and thereby facilitate the choice of the best available therapy. One part of this challenge is to quantify the cellular composition of a tumor sample (called the deconvolution problem in this context) using its bulk omics profile (the global quantitative profile of certain types of molecules, such as mRNA or epigenetic markers). In recent years, there has been remarkable growth in the number of methods that approach this problem in different ways. Most of them use predefined molecular signatures of specific cell types and extrapolate this information to previously unseen contexts. This can bias TME quantification when the context under study differs significantly from the reference. In theory, under certain assumptions, complex signal mixtures can be separated using classical and advanced methods of source separation and dimension reduction, without pre-existing source definitions. If such an approach (unsupervised deconvolution) can be applied to bulk omics profiles of tumor samples, it would avoid the contextual biases mentioned above and provide insights into the context-specific signatures of cell types. In this work, I developed a new method called DeconICA (Deconvolution of bulk omics datasets through Immune Component Analysis), based on the blind source separation methodology.
DeconICA aims to decipher and quantify the biological signals that shape the omics profiles of tumor samples or normal tissues. A particular focus of my study was on immune system-related signals and on discovering new signatures of immune cell types. To make my work more accessible, I implemented the DeconICA method as an R package named "DeconICA". By applying this software to standard benchmark datasets, I demonstrated that DeconICA can quantify immune cells with accuracy comparable to published state-of-the-art methods, but without defining cell type-specific signature genes a priori. The implementation can work with existing deconvolution methods based on matrix factorization techniques such as Independent Component Analysis (ICA) or Non-Negative Matrix Factorization (NMF). Finally, I applied DeconICA to a large corpus of more than 100 transcriptomic datasets comprising, in total, over 28,000 samples of 40 tumor types, generated by different technologies and processed independently. This analysis demonstrated that ICA-based immune signals are reproducible between datasets and that three major immune cell types (T cells, B cells and myeloid cells) can be reliably identified and quantified. Additionally, I used the ICA-derived metagenes as context-specific signatures to study the characteristics of immune cells in different tumor types. The analysis revealed a large diversity and plasticity of immune cells, both dependent on and independent of tumor type. Some conclusions of the study may help identify new drug targets or biomarkers for cancer immunotherapy.
8

Owen, Anne M. "Widescale analysis of transcriptomics data using cloud computing methods." Thesis, University of Essex, 2016. http://repository.essex.ac.uk/16125/.

Abstract:
This study explores the handling and analysis of big data in bioinformatics. The focus is on improving the analysis of public-domain data for Affymetrix GeneChips, a widely used technology for measuring gene expression. Methods to determine the bias in gene expression due to G-stacks, which are associated with runs of guanine in probes, were explored using a grid and various types of cloud computing, with the aim of finding the best way to store and analyze the big data used in bioinformatics. The experience gained in using the grid and the different clouds is reported. In the case of Windows Azure, a public cloud was employed in a new way to demonstrate the use of the R statistical language for bioinformatics research. This work studied the G-stack bias in a broad range of GeneChip data from public repositories. A wide-scale survey was carried out to determine the extent of the G-stack bias in four different chips across three species: it began with the human GeneChip HG U133A, then examined a second human GeneChip, HG U133 Plus2, followed by a plant chip (Arabidopsis thaliana) and a bacterial chip (Pseudomonas aeruginosa). Comparisons were also made between the widely used RMA and PLIER algorithms for the normalization stage of extracting gene expression from GeneChip data.
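To make the idea of "runs of guanine in probes" concrete, the following minimal Python sketch flags probe sequences that contain a long run of consecutive G bases. It is only an illustration: the function names and the threshold of four consecutive guanines (`min_run=4`) are assumptions introduced here, not definitions taken from the thesis.

```python
import re


def longest_g_run(probe: str) -> int:
    """Return the length of the longest run of consecutive G bases in a probe."""
    runs = re.findall(r"G+", probe.upper())
    return max((len(run) for run in runs), default=0)


def has_g_run(probe: str, min_run: int = 4) -> bool:
    """Flag a probe whose longest G run reaches min_run.

    min_run=4 is an illustrative assumption, not a value from the abstract.
    """
    return longest_g_run(probe) >= min_run
```

Applied to every probe sequence on a chip, a filter of this kind would give the set of probes whose intensities could then be compared against the remaining probes.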
9

Hernandez-Ferrer, Carles 1987. "Bioinformatic tools for exposome data analysis : application to human molecular signatures of ultraviolet light effects." Doctoral thesis, Universitat Pompeu Fabra, 2017. http://hdl.handle.net/10803/572046.

Abstract:
Most common diseases are caused by a combination of genetic, environmental and lifestyle factors. These diseases are referred to as complex diseases; examples include obesity, asthma, hypertension and diabetes. Several lines of empirical evidence suggest that exposures are necessary determinants of complex disease, operating against a causal background of genetic diversity. Moreover, environmental factors have long been implicated as major contributors to the global disease burden. This leads to the formulation of the exposome, which comprises every exposure to which an individual is subjected from conception to death. The study of the underlying mechanisms that link the exposome with human health is an emerging research field with strong potential to provide new insights into disease etiology. The first part of this thesis focuses on ultraviolet radiation (UVR) exposure. UVR exposure comes from both natural and artificial sources. UVR includes three subtypes of radiation according to wavelength (UVA, 315-400 nm; UVB, 295-315 nm; UVC, 200-295 nm). While the main natural source of UVR is the Sun, UVC radiation does not reach Earth's surface because it is absorbed by the stratospheric ozone layer. Consequently, exposure to UVR typically consists of a mixture of UVA (95%) and UVB (5%). The effects of UVR on humans can be beneficial or detrimental, depending on the amount and form of UVR. Acute detrimental effects of UVR include erythema, pigment darkening, delayed tanning and thickening of the epidermis. Repeated UVR-induced injury to the skin may ultimately predispose one to the chronic effects of photoaging, immunosuppression, and photocarcinogenesis. The main beneficial effect of UVR is the cutaneous synthesis of vitamin D. Vitamin D is necessary to maintain physiologic calcium and phosphorus levels for normal bone mineralization and to prevent rickets, osteomalacia, and osteoporosis.
The exposome paradigm, however, is to work with multiple exposures at a time, and with one or more health outcomes, rather than focusing on a single-exposure analysis. This approach gives a more accurate picture of reality, since we live in complex environments. The second part of the thesis therefore focuses on tools to characterize and analyze the exposome and to test its effects on multiple intermediate biological layers, in order to provide insights into the underlying molecular mechanisms linking environmental exposures to health outcomes.
10

Daub, Carsten O. "Analysis of integrated transcriptomics and metabolomics data: a systems biology approach." [S.l. : s.n.], 2004. http://pub.ub.uni-potsdam.de/2004/0025/daub.pdf.

11

Daub, Carsten Oliver. "Analysis of integrated transcriptomics and metabolomics data : a systems biology approach." Phd thesis, Universität Potsdam, 2004. http://opus.kobv.de/ubp/volltexte/2005/138/.

Abstract:
Recent high-throughput technologies enable the acquisition of a variety of complementary data and imply the existence of regulatory networks at the systems biology level. A common approach to reconstructing such networks is cluster analysis, which is based on a similarity measure.
We use the information-theoretic concept of mutual information, which was originally defined for discrete data, as a measure of similarity and propose an extension to a commonly applied algorithm for calculating it from continuous biological data. We compare our approach to existing algorithms. We develop a performance-optimised software package for applying mutual information to large-scale datasets. Furthermore, we design and implement a web-based service for the analysis of integrated data measured with different technologies. Application to biological data reveals biologically relevant groupings, and reconstructed signalling networks agree with physiological findings.
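To make the idea of applying mutual information to continuous measurements concrete, here is a minimal sketch of the plain equal-width binning estimator. It is an illustration only: the function name `mutual_information` and the default of four bins are assumptions, and this is the basic binning approach, not the extension proposed in the thesis.

```python
import math
from collections import Counter


def mutual_information(x, y, bins=4):
    """Equal-width binning estimate of mutual information, in bits.

    Hypothetical helper for illustration; `bins` is an arbitrary choice.
    """
    if len(x) != len(y) or not x:
        raise ValueError("x and y must be non-empty and of equal length")

    def discretize(values):
        lo, hi = min(values), max(values)
        width = (hi - lo) / bins or 1.0  # constant input: fall back to one bin
        # Clamp so that the maximum value lands in the last bin.
        return [min(int((v - lo) / width), bins - 1) for v in values]

    bx, by = discretize(x), discretize(y)
    n = len(x)
    px, py, pxy = Counter(bx), Counter(by), Counter(zip(bx, by))

    # Sum p(i,j) * log2( p(i,j) / (p(i) * p(j)) ) over occupied cells.
    mi = 0.0
    for (i, j), c in pxy.items():
        mi += (c / n) * math.log2(c * n / (px[i] * py[j]))
    return mi
```

The estimate depends on the number of bins, which is one reason refinements of the binning step are of interest for expression data.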
12

Östman, Josephine. "The fertile ovary transcriptome and proteome." Thesis, Uppsala universitet, Institutionen för kvinnors och barns hälsa, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-447785.

Abstract:
The Human Protein Atlas is an open-source database containing information about protein expression and location in human cells, tissues and organs. The aim is to map all human proteins using various biotechnology techniques, such as antibody-based imaging and RNA sequencing. Based on previous transcriptome analysis, 173 genes were shown to have elevated expression in the ovary compared to all other major tissue types in the human body. There is, however, no information regarding expression in the ovary during the reproductive years versus the post-menopausal years. In this thesis, gene expression in the ovaries of women of reproductive age was compared with that of women of post-menopausal age. 509 genes were found to have at least 2-fold higher mean RNA expression in the reproductive age group. 14 of these genes were analyzed further with antibody staining and multiplex immunofluorescence staining to localize the corresponding proteins. The results show that these genes are expressed in a variety of structures in the ovarian tissue, such as the oocyte, the granulosa cells and the corpus luteum. This thesis has demonstrated how data analysis can be used to find genes important for the ovary of women of reproductive age; in the future, this could aid research in female fertility.
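The 2-fold mean-expression criterion mentioned in the abstract can be sketched as a simple filter. This is a toy illustration under stated assumptions: the function name `elevated_genes`, the dictionary layout and the gene labels are hypothetical, and the thesis may have applied additional thresholds or normalisation not described here.

```python
from statistics import mean


def elevated_genes(expr_a, expr_b, min_fold=2.0):
    """Return genes whose mean expression in group A is at least
    `min_fold` times the mean expression in group B.

    expr_a, expr_b: dicts mapping gene name -> list of expression values.
    """
    hits = []
    for gene, values_a in expr_a.items():
        values_b = expr_b.get(gene)
        if not values_a or not values_b:
            continue  # gene not measured in both groups
        mean_a, mean_b = mean(values_a), mean(values_b)
        # A zero mean in group B makes the ratio undefined; skip such genes.
        if mean_b > 0 and mean_a / mean_b >= min_fold:
            hits.append(gene)
    return sorted(hits)


# Toy data with made-up gene labels and values.
reproductive = {"G1": [10, 14], "G2": [5, 5], "G3": [8, 8]}
post_menopausal = {"G1": [4, 6], "G2": [5, 5], "G3": [4, 4]}
print(elevated_genes(reproductive, post_menopausal))  # ['G1', 'G3']
```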
13

Hu, Yin. "A NOVEL COMPUTATIONAL FRAMEWORK FOR TRANSCRIPTOME ANALYSIS WITH RNA-SEQ DATA." UKnowledge, 2013. http://uknowledge.uky.edu/cs_etds/17.

Abstract:
The advance of high-throughput sequencing technologies and their application to mRNA transcriptome sequencing (RNA-seq) have enabled comprehensive and unbiased profiling of the landscape of transcription in a cell. To address current limitations in the accuracy and scalability of transcriptome analysis, a novel computational framework has been developed for large-scale RNA-seq datasets with no dependence on transcript annotations. Starting directly from raw reads, a probabilistic approach is first applied to infer the best transcript fragment alignments from paired-end reads. Empowered by the identification of alternative splicing modules, the framework then performs precise and efficient differential analysis on automatically detected alternative splicing variants, which circumvents the need for full transcript reconstruction and quantification. Beyond the scope of classical group-wise analysis, a clustering scheme is further described for mining prominent consistency in transcription among samples, removing the restriction of presumed grouping. The performance of the framework has been demonstrated on a series of simulation studies and real datasets, including the Cancer Genome Atlas (TCGA) breast cancer data. These successful applications suggest an unprecedented opportunity to use differential transcription analysis to reveal variations in the mRNA transcriptome in response to cellular differentiation or the effects of disease.
14

Kelso, Janet. "The development and application of informatics-based systems for the analysis of the human transcriptome." Thesis, University of the Western Cape, 2003. http://etd.uwc.ac.za/index.php?module=etd&action=viewtitle&id=gen8Srv25Nme4_5101_1185442672.

Abstract:

Despite the fact that the sequence of the human genome is now complete, it has become clear that the elucidation of the transcriptome is more complicated than previously expected. There is mounting evidence for unexpected and previously underestimated phenomena such as alternative splicing in the transcriptome. As a result, the identification of novel transcripts arising from the genome continues. Furthermore, as the volume of transcript data grows, it is becoming increasingly difficult to integrate expression information that comes from different sources, is stored in disparate locations, and is described using differing terminologies. Determining the function of translated transcripts also remains a complex task. Information about the expression profile (the location and timing of transcript expression) provides evidence that can be used in understanding the role of the expressed transcript in the organ or tissue under study, or in the developmental pathways or disease phenotype observed.

In this dissertation I present novel computational approaches with direct biological applications to two distinct but increasingly important areas of gene expression research. The first addresses the detection and characterisation of alternatively spliced transcripts. The second is the construction of a hierarchical controlled vocabulary for gene expression data and the annotation of expression libraries with controlled terms from the hierarchies. In the final chapter, the biological questions that can be approached and the discoveries that can be made using these systems are illustrated, with a view to demonstrating how the application of informatics can both enable and accelerate biological insight into the human transcriptome.

15

Windhorst, Anita Cornelia. "Transcriptome analysis in preterm infants developing bronchopulmonary dysplasia: data processing and statistical analysis of microarray data." Gießen: Universitätsbibliothek, 2015. http://d-nb.info/1078220395/34.

16

Bécavin, Christophe. "Dimensionality reduction and pathway network analysis of transcriptome data: application to T-cell characterization." Paris, Ecole normale supérieure, 2010. http://www.theses.fr/2010ENSUBS02.

17

Cicek, A. Ercument. "METABOLIC NETWORK-BASED ANALYSES OF OMICS DATA." Case Western Reserve University School of Graduate Studies / OhioLINK, 2013. http://rave.ohiolink.edu/etdc/view?acc_num=case1372866879.

18

Calviello, Lorenzo. "Detecting and quantifying the translated transcriptome with Ribo-seq data." Doctoral thesis, Humboldt-Universität zu Berlin, 2018. http://dx.doi.org/10.18452/18974.

Abstract:
The study of post-transcriptional gene regulation requires in-depth knowledge of multiple molecular processes acting on RNA, from its nuclear processing to translation and decay in the cytoplasm. With the advent of RNA-seq technologies we can now follow each of these steps with high throughput and resolution. Ribosome profiling (Ribo-seq) is a popular RNA-seq technique that aims at monitoring the precise positions of millions of translating ribosomes, and it has proved to be an essential tool for studying gene regulation. However, the interpretation of Ribo-seq profiles over the transcriptome is challenging because of noisy data and our incomplete knowledge of the translated transcriptome. In this thesis, I present a strategy to detect translated regions from Ribo-seq data, using a spectral analysis approach aimed at detecting ribosomal translocation over the translated regions. The high sensitivity and specificity of our approach enabled us to draw a comprehensive map of translation over the human and Arabidopsis thaliana transcriptomes, uncovering the presence of known and novel translated regions. Evolutionary conservation analysis, together with large-scale proteomics evidence, provided insights into their functions, ranging from the synthesis of previously unknown proteins to possible regulatory roles. Moreover, quantification of the Ribo-seq signal over annotated transcript structures exposed translation of multiple transcripts per gene, revealing the link between translation and RNA-surveillance mechanisms. Together with a comparison of different Ribo-seq datasets in human cells and in Arabidopsis thaliana, this work comprises a set of analysis strategies for Ribo-seq data, as a window into the manifold functions of the expressed transcriptome.
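The intuition behind detecting ribosomal translocation spectrally is that translating ribosomes advance one codon at a time, so a footprint-position profile over a translated region tends to repeat every 3 nucleotides. The sketch below measures that with a plain discrete Fourier transform. It is an illustration under assumptions: the function name `periodicity_score` and the input format (a list of per-nucleotide counts) are hypothetical, and the abstract does not specify the spectral estimator used in the thesis.

```python
import cmath


def periodicity_score(profile):
    """Fraction of non-constant spectral power located at the 3-nt period.

    profile: per-nucleotide footprint counts (hypothetical input format).
    """
    # Trim to a multiple of 3 so that frequency 1/3 is an exact DFT bin.
    n = len(profile) - len(profile) % 3
    if n < 6:
        raise ValueError("profile must contain at least 6 positions")
    x = profile[:n]
    m = sum(x) / n
    centered = [v - m for v in x]  # remove the constant (DC) component

    def power(k):
        s = sum(v * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, v in enumerate(centered))
        return abs(s) ** 2

    # One-sided spectrum: frequencies 1/n up to 1/2 cycles per nucleotide.
    total = sum(power(k) for k in range(1, n // 2 + 1))
    if total == 0:
        return 0.0  # flat profile carries no periodic signal
    return power(n // 3) / total
```

A profile dominated by one reading frame scores close to 1, while a profile without 3-nt structure scores close to 0. Real data would additionally need a significance test, which this sketch omits.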
19

Enjalbert, Courrech Nicolas. "Inférence post-sélection pour l'analyse des données transcriptomiques." Electronic Thesis or Diss., Université de Toulouse (2023-....), 2024. http://www.theses.fr/2024TLSES199.

Abstract:
In the field of transcriptomics, technological advances such as microarrays and high-throughput sequencing have enabled large-scale quantification of gene expression. These advances have raised statistical challenges, particularly in differential expression analysis, which aims to identify genes that significantly differentiate two populations. However, traditional inference procedures lose their guarantees on the false positive rate when biologists select a subset of genes. Post hoc inference methods address this limitation by providing control over the number of false positives, even for arbitrarily selected gene sets. The first contribution of this manuscript demonstrates the effectiveness of these methods for the differential analysis of transcriptomic data between two biological conditions, notably through the introduction of a linear-time algorithm for computing post hoc bounds, adapted to the high dimensionality of the data. An interactive application was also developed to facilitate the selection of gene sets of interest and the simultaneous evaluation of their post hoc bounds. These contributions are presented in the first part of the manuscript. The technological evolution towards single-cell sequencing has raised new questions, particularly regarding the identification of genes whose expression distinguishes one cell group from one or more others. This issue is complex because cell groups must first be estimated using a clustering method before a comparative test is performed, which leads to a circular analysis. In the second part of this manuscript, we present a review of post-clustering inference methods that address this problem, as well as a numerical comparison of multivariate and marginal approaches for cluster comparison. Finally, we explore how the use of mixture models in the clustering step can be exploited in post-clustering tests, and we discuss perspectives for applying these tests to transcriptomic data.
20

Siatkowski, Marcin. "Approaches to the Analysis of Proteomics and Transcriptomics Data based on Statistical Methodology." Greifswald: Universitätsbibliothek Greifswald, 2014. http://d-nb.info/1050274954/34.

21

Johnson, Kristen. "Software for Estimation of Human Transcriptome Isoform Expression Using RNA-Seq Data." ScholarWorks@UNO, 2012. http://scholarworks.uno.edu/td/1448.

Abstract:
The goal of this thesis research was to develop software for transcriptome quantification from RNA-Seq data that is capable of handling multireads and quantifying isoforms on a more global level. Current software available for these purposes uses various forms of parameter alteration in order to work with multireads. Many tools also still analyze isoforms per gene or per researcher-determined cluster. As a result, the effects of multireads are diminished or possibly misrepresented. To address this issue, two programs, GWIE and ChromIE, were developed based on a simple iterative EM-like algorithm with no parameter manipulation. These programs are used to produce accurate isoform expression levels.
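The general idea of an iterative EM-like treatment of multireads can be sketched as follows: each read is shared among the isoforms it is compatible with in proportion to the current abundance estimates, and the abundances are then re-estimated from those fractional counts. This is a generic textbook-style sketch, not the GWIE or ChromIE implementation; the function name `em_isoform_abundance`, the input format, and the omission of isoform length correction are all simplifying assumptions.

```python
def em_isoform_abundance(reads, isoforms, n_iter=100):
    """Estimate relative isoform abundances from reads that may map to
    several isoforms (multireads).

    reads: list of collections, each holding the isoforms a read is
           compatible with (hypothetical input format).
    isoforms: list of all isoform identifiers.
    """
    # Start from a uniform abundance estimate.
    theta = {iso: 1.0 / len(isoforms) for iso in isoforms}
    for _ in range(n_iter):
        counts = {iso: 0.0 for iso in isoforms}
        # E-step: split every read among its compatible isoforms in
        # proportion to the current abundances.
        for compatible in reads:
            total = sum(theta[iso] for iso in compatible)
            if total == 0:
                continue
            for iso in compatible:
                counts[iso] += theta[iso] / total
        # M-step: renormalise the fractional counts into abundances.
        assigned = sum(counts.values())
        if assigned == 0:
            break
        theta = {iso: c / assigned for iso, c in counts.items()}
    return theta


# Toy example: 3 reads unique to isoform A, 1 unique to B, 4 multireads.
toy_reads = [{"A"}] * 3 + [{"B"}] + [{"A", "B"}] * 4
print(em_isoform_abundance(toy_reads, ["A", "B"]))
```

In the toy example the multireads end up allocated mostly to isoform A, because the uniquely mapping reads already favour it.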
22

Li, Mengbo. "Integration of Multi-Modal Data to Guide Classification in Studies of Complex Diseases." Thesis, The University of Sydney, 2020. https://hdl.handle.net/2123/22693.

Abstract:
In the big data era, the unprecedentedly fast-growing volume and variety of biological data have swiftly transformed the landscape of biomedical research. Meanwhile, classification methods, as a powerful bioinformatics tool, have greatly empowered researchers to uncover new aspects of complex biological systems. This thesis addresses the statistical and methodological challenges that often arise at different stages of biomedical multi-modal data integration, with a focus on the application of classification methods in studies of complex diseases. Data generated from mass spectrometry (MS) platforms are inherently susceptible to systematic biases. Widespread missing values, where certain compounds cannot be identified or quantified, pose a prominent challenge to MS data normalisation. We propose a novel normalisation approach for high-dimensional MS data, called ruvms. This method is a one-step procedure that is able to handle missing values in input data and does not require imputation. We also explore a challenging situation in multi-modal data integration where not all types of data of interest are available within the same cohort. In brain studies, brain tissue samples are generally inaccessible from the same brain for which fMRI data can be obtained. We propose a gene-expression-guided fMRI network classification method, called brainClass, that distinguishes patients with neurological diseases from healthy controls. brainClass links functional connectivity features to potentially involved biological pathways, to bridge the gap between functional biomarkers of neurological disorders and their underpinning molecular mechanisms. We also introduce a post-hoc interpretation framework to provide gene-expression-guided biological interpretations for predictive functional connectivity features identified by existing generic network classifiers applied to fMRI data.
23

Jeanmougin, Marine. "Statistical methods for robust analysis of transcriptome data by integration of biological prior knowledge." Thesis, Evry-Val d'Essonne, 2012. http://www.theses.fr/2012EVRY0029/document.

Abstract:
Recent advances in molecular biology have led biologists toward high-throughput genomic studies. In particular, the investigation of the human transcriptome offers unprecedented opportunities for understanding cellular and disease mechanisms. In this PhD thesis, we focus on providing robust statistical methods dedicated to the processing and analysis of high-throughput transcriptome data. We discuss the differential analysis approaches available in the literature for identifying genes associated with a phenotype of interest and propose a comparison study. We provide practical recommendations on the appropriate method to use, based on various simulation models and real datasets. With the eventual goal of overcoming the inherent instability of differential analysis strategies, we have developed an innovative approach called DiAMS, for DIsease Associated Modules Selection. This method selects significant modules of genes rather than individual genes and integrates both transcriptome and protein interaction data in a local-score strategy. We then focus on the development of a framework to infer gene regulatory networks with Gaussian graphical models, by integrating a biologically informative prior over network structures. This approach offers the possibility of exploring the molecular relationships between genes, leading to the identification of altered regulations potentially involved in disease processes. Finally, we apply our statistical developments to study the metastatic relapse of breast cancer.
24

Hindle, Matthew Morritt. "An integrated approach to enhancing functional annotation of sequences for data analysis of a transcriptome." Thesis, University of Nottingham, 2012. http://eprints.nottingham.ac.uk/12580/.

Abstract:
Given the ever-increasing quantity of sequence data, functional annotation of new gene sequences remains a significant challenge for bioinformatics. This is a particular problem for transcriptomics studies in crop plants, where large genomes and evolutionarily distant model organisms mean that identifying the function of a given gene used on a microarray is often a non-trivial task. Information pertinent to gene annotations is spread across technically and semantically heterogeneous biological databases. Combining and exploiting these data in a consistent way has the potential to improve our ability to assign functions to new or uncharacterised genes. Methods: The Ondex data integration framework was further developed to integrate databases pertinent to plant gene annotation and to provide data inference tools. The CoPSA annotation pipeline was created to provide automated annotation of novel plant genes using this knowledgebase. CoPSA was used to derive annotations for the Affymetrix GeneChips available for plant species. A conjoint approach was used to align GeneChip sequences to orthologous proteins and to identify protein domain regions. These proteins and domains were used together with multiple lines of evidence to predict functional annotations for sequences on the GeneChip. Quality was assessed with reference to other annotation pipelines. These improved gene annotations were used in the analysis of a time-series transcriptomics study of the differential responses of durum wheat varieties to water stress. Results and Conclusions: The integration of plant databases using Ondex showed that it was possible to increase the overall quantity and quality of information available, and thereby improve the resulting annotation. Direct data aggregation benefits were observed, as well as new information derived from inference across databases. The CoPSA pipeline was shown to improve coverage of the wheat microarray compared to the NetAffx and BLAST2GO pipelines. Leveraging these annotations during the analysis of data from a transcriptomics study of durum wheat water stress responses yielded new biological insights into water stress and highlighted potential candidate genes that could be used by breeders to improve drought response.
25

Schissler, Alfred Grant. "Contributions to Gene Set Analysis of Correlated, Paired-Sample Transcriptome Data to Enable Precision Medicine." Diss., The University of Arizona, 2017. http://hdl.handle.net/10150/624283.

Abstract:
This dissertation serves as a unifying document for three related articles developed during my dissertation research. The projects involve the development of single-subject transcriptome (i.e. gene expression data) methodology for precision medicine and related applications. Traditional statistical approaches are largely unavailable in this setting due to prohibitive sample size and lack of independent replication. This leads one to rely on informatic devices including knowledgebase integration (e.g., gene set annotations) and external data sources (e.g., estimation of inter-gene correlation). Common statistical themes include multivariate statistics (such as Mahalanobis distance and copulas) and large-scale significance testing. Briefly, the first work describes the development of clinically relevant single-subject metrics of gene set (pathway) differential expression, N-of-1-pathways Mahalanobis distance (MD) scores. Next, the second article describes a method which overcomes a major shortcoming of the MD framework by accounting for inter-gene correlation. Lastly, the statistics developed in the previous works are re-purposed to analyze single-cell RNA-sequencing data derived from rare cells. Importantly, these works represent an interdisciplinary effort and show that creative solutions for pressing issues become possible at the intersection of statistics, biology, medicine, and computer science.
26

Rubanova, Natalia. "MasterPATH : network analysis of functional genomics screening data." Thesis, Sorbonne Paris Cité, 2018. http://www.theses.fr/2018USPCC109/document.

Abstract:
Dans ce travail nous avons élaboré une nouvelle méthode de l'analyse de réseau à définir des membres possibles des voies moléculaires qui sont important pour ce phénotype en utilisant la « hit-liste » des expériences « omics » qui travaille dans le réseau intégré (le réseau comprend des interactions protéine-protéine, de transcription, l’acide ribonucléique micro-l’acide ribonucléique messager et celles métaboliques). La méthode tire des sous-réseaux qui sont construit des voies de quatre types les plus courtes (qui ne se composent des interactions protéine-protéine, ayant au minimum une interaction de transcription, ayant au minimum une interaction l’acide ribonucléique micro-l’acide ribonucléique messager, ayant au minimum une interaction métabolique) entre des hit –gènes et des soi-disant « exécuteurs terminaux » - les composants biologiques qui participent à la réalisation du phénotype finale (s’ils sont connus) ou entre les hit-gènes (si « des exécuteurs terminaux » sont inconnus). La méthode calcule la valeur de la centralité de chaque point culminant et de chaque voie dans le sous-réseau comme la quantité des voies les plus courtes trouvées sur la route précédente et passant à travers le point culminant et la voie. L'importance statistique des valeurs de la centralité est estimée en comparaison avec des valeurs de la centralité dans les sous-réseaux construit des voies les plus courtes pour les hit-listes choisi occasionnellement. Il est supposé que les points culminant et les voies avec les valeurs de la centralité statistiquement signifiantes peuvent être examinés comme les membres possibles des voies moléculaires menant à ce phénotype. 
S’il y a des valeurs expérimentales et la P-valeur pour un grand nombre des points culminant dans le réseau, la méthode fait possible de calculer les valeurs expérimentales pour les voies (comme le moyen des valeurs expérimentales des points culminant sur la route) et les P-valeurs expérimentales (en utilisant la méthode de Fischer et des transpositions multiples).A l'aide de la méthode masterPATH on a analysé les données de la perte de fonction criblage de l’acide ribonucléique micro et l'analyse de transcription de la différenciation terminal musculaire et les données de la perte de fonction criblage du procès de la réparation de l'ADN. On peut trouver le code initial de la méthode si l’on suit le lien https://github.com/daggoo/masterPATH
In this work we developed a new exploratory network analysis method that works on an integrated network (consisting of protein-protein, transcriptional, miRNA-mRNA, and metabolic interactions) and aims to uncover potential members of molecular pathways important for a given phenotype, using a hit-list dataset from “omics” experiments. The method extracts a subnetwork built from shortest paths of four types (with only protein-protein interactions, with at least one transcriptional interaction, with at least one miRNA-mRNA interaction, with at least one metabolic interaction) between hit genes and so-called “final implementers”, biological components involved in the molecular events responsible for the final phenotypic realization (if known), or between the hit genes themselves (if the “final implementers” are not known). The method calculates a centrality score for each node and each path in the subnetwork as the number of shortest paths found in the previous step that pass through that node or path. The statistical significance of each centrality score is then assessed by comparing it with the centrality scores in subnetworks built from the shortest paths for randomly sampled hit lists. It is hypothesized that nodes and paths with statistically significant centrality scores can be considered putative members of the molecular pathways leading to the studied phenotype. When experimental scores and p-values are available for a large number of nodes in the network, the method can also calculate experiment-based scores for paths (as the average of the experimental scores of the nodes in the path) and experiment-based p-values (by aggregating the p-values of the nodes in the path using Fisher’s combined probability test and a permutation approach). The method is illustrated by analyzing the results of miRNA loss-of-function screening and transcriptomic profiling of terminal muscle differentiation, and of ‘druggable’ loss-of-function screening of the DNA repair process.
The Java source code is available on GitHub at https://github.com/daggoo/masterPATH
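The centrality-and-permutation idea in this abstract can be illustrated with a minimal Python sketch. It is not the masterPATH implementation (which is written in Java and distinguishes four path types). It assumes an undirected toy network stored as an adjacency dictionary, follows a single breadth-first shortest path per hit/target pair, and ignores interaction types. All function names here are hypothetical.

```python
import math
import random
from collections import Counter, deque


def shortest_path(graph, source, target):
    """Return one shortest path (list of nodes) found by BFS, or None."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in graph.get(node, ()):
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return None


def node_centrality(graph, hits, targets):
    """Count how many hit-to-target shortest paths pass through each node."""
    counts = Counter()
    for hit in hits:
        for target in targets:
            if hit == target:
                continue
            path = shortest_path(graph, hit, target)
            if path:
                counts.update(path)
    return counts


def permutation_pvalues(graph, hits, targets, n_perm=200, seed=0):
    """Empirical p-value per node: how often a random hit list of the
    same size gives a centrality at least as large as the observed one."""
    rng = random.Random(seed)
    observed = node_centrality(graph, hits, targets)
    nodes = sorted(graph)  # assumes every node appears as a key
    exceed = Counter()
    for _ in range(n_perm):
        null = node_centrality(graph, rng.sample(nodes, len(hits)), targets)
        for node, score in observed.items():
            if null[node] >= score:
                exceed[node] += 1
    return {node: (exceed[node] + 1) / (n_perm + 1) for node in observed}


def fisher_combined(pvalues):
    """Fisher's combined probability test for p-values in (0, 1].
    The statistic -2*sum(ln p) is chi-square with 2k degrees of freedom,
    whose survival function has a closed form for even degrees."""
    k = len(pvalues)
    half = -sum(math.log(p) for p in pvalues)  # half of the statistic
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(k))
```

For example, `fisher_combined([0.5])` returns the single p-value unchanged, which is a quick sanity check on the closed form.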
27

Ghazanfar, Shila. "Statistical approaches to harness high throughput sequencing data in diverse biological systems." Thesis, The University of Sydney, 2017. http://hdl.handle.net/2123/17268.

Abstract:
The development of novel statistical approaches to questions specific to biological systems of interest is becoming more valuable as we tackle increasingly complex problems. This thesis explores three distinct biological systems in which high-throughput sequencing data are utilised, varying in research area, organism, number of sequencing platforms and datasets integrated, and structure such as matched samples, showcasing the variety of study designs and thus the need for tailored statistical approaches. First, we characterise allelic imbalance from RNA-Seq data, including stringent filtering criteria and a count-based likelihood ratio test. This work identified genes of particular importance in livestock genomics, such as those related to energy use. Second, we outline a novel methodology to identify highly expressed genes and cells for single-cell RNA-Seq data. We derive a gamma-normal mixture model to identify lowly and highly expressed components, and use this to identify novel markers for olfactory sensory neuron (OSN) maturity across publicly available mouse neuron datasets. In addition, we estimate single-cell networks and find that mature OSN single-cell networks are more centralised than immature OSN single-cell networks. Third, we develop two novel frameworks for relating information from Whole Exome DNA-Seq and RNA-Seq data when i) samples are matched and when ii) samples are not necessarily matched between platforms. In the latter case, we relate functional somatic mutation driver gene scores to transcriptional network correlation disturbance using a permutation testing framework, identifying potential candidate genes for targeted therapies. In the former case, we estimate directed mutation-expression networks for each cancer using linear models, providing a useful exploratory tool for identifying novel relationships among genes. This thesis demonstrates the importance of tailored statistical approaches to further understanding across many biological systems.
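As an illustration of what a count-based likelihood ratio test for allelic imbalance can look like, here is a minimal Python sketch. The abstract does not specify the thesis's exact test or filtering criteria, so this assumes the textbook case: reference and alternative allele read counts at one heterozygous site, tested against a 50:50 binomial null. The function name is hypothetical.

```python
import math


def allelic_imbalance_lrt(ref_count, alt_count):
    """Likelihood ratio (G) test of ref/alt read counts against a 50:50 null.

    Returns (statistic, p_value). The statistic is compared with a
    chi-square distribution with 1 degree of freedom.
    """
    n = ref_count + alt_count
    if n == 0:
        raise ValueError("no reads cover this site")
    expected = n / 2.0
    stat = 0.0
    for observed in (ref_count, alt_count):
        if observed > 0:  # the term 0 * log(0) is taken as 0
            stat += 2.0 * observed * math.log(observed / expected)
    stat = max(stat, 0.0)  # guard against tiny negative rounding error
    # Chi-square (1 df) survival function: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value
```

A balanced site such as `allelic_imbalance_lrt(50, 50)` gives a statistic of 0 and a p-value of 1, while a strongly skewed site such as 90 versus 10 reads gives a very small p-value.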
28

Jangerstad, August. "Transcription factor analysis of longitudinal mRNA expression data." Thesis, KTH, Skolan för kemi, bioteknologi och hälsa (CBH), 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-278693.

Abstract:
Transcription factors (TFs) are key regulatory proteins that regulate transcription through precise, but highly variable binding events to cis-regulatory elements. The complexity of their regulatory patterns makes it difficult to determine the roles of different TFs, a task which the field is still struggling with. Experimental procedures for this purpose, such as knockout experiments, are however costly and time consuming, and with the ever-increasing availability of sequencing data, computational methods for inferring the activity of TFs from such data have become of great interest. Current methods are however lacking in several regards, which necessitates further exploration of alternatives. A novel tool for estimating the activity of individual TFs over time from longitudinal mRNA expression data was in this project therefore put together and tested on data from Mus musculus liver and brain. The tool is based on principal component analysis, which is applied to data subsets containing the expression data of genes likely regulated by a specific TF to acquire an estimation of its activity. Though initial tests on 17 selected TFs showed issues with unspecific trends in the estimations, further testing is required for a statement on the potential of the estimator.
Transcriptionsfaktorer (TFer) är viktiga regulatoriska protein som reglerar transkription genom att binda till cis-regulatoriska element på precisa, men mycket varierande vis. Komplexiteten i deras regulatoriska mönster gör det svårt att avgöra vilka roller olika TFer har, vilket är en uppgift som fältet fortfarande brottas med. Experimentella procedurer i detta syfte, till exempel "knockout"-experiment, är dock kostsamma och tidskrävande, och med den evigt ökande tillgången på sekvenseringsdata har metoder för att beräkna TFers aktivitet från sådan data fått stort intresse. De beräkningsmetoder som finns idag brister dock på flera punkter, vilket erfordrar ett fortsatt sökande efter alternativ. Ett nytt verktyg för att uppskatta aktiviteten hos individuella TFer över tid med hjälp av longitudinell mRNA-uttrycksdata utvecklades därför i det här projektet och testades på data från Mus musculus lever och hjärna. Verktyget är baserat på principalkomponentsanalys, som applicerades på set med uttrycksdata från gener sannolikt reglerade av en specifik TF för att erhålla en uppskattning av dess aktivitet. Trots att de första testerna för 17 utvalda TFer påvisade problem med ospecifika trender i uppskattningarna krävs fortsatta tester för att kunna ge ett tydligt svar på vilken potential estimatorn har.
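The PCA-based estimate described in this abstract can be sketched in a few lines of Python. This is an illustrative reading, not the thesis's tool: it assumes the activity of a TF is approximated by the first principal component of the time-course expression of its presumed target genes, and it uses a plain power iteration instead of a numerical library. The function name and the input layout (a dictionary from gene name to a list of values over time points) are assumptions.

```python
import math


def tf_activity(expr, target_genes, n_iter=200):
    """First-principal-component scores over time for a TF's target genes.

    expr maps gene -> list of expression values, one per time point.
    Returns one score per time point; the overall sign is arbitrary.
    """
    rows = [expr[g] for g in target_genes if g in expr]
    n_time = len(rows[0])
    # Centre each gene over time so the component captures shared trends.
    centred = []
    for row in rows:
        mean = sum(row) / n_time
        centred.append([v - mean for v in row])
    # Gram matrix between time points (n_time x n_time).
    gram = [[sum(r[i] * r[j] for r in centred) for j in range(n_time)]
            for i in range(n_time)]
    # Power iteration; a non-constant start is needed because centred
    # data are orthogonal to the constant vector.
    vec = [1.0 + 0.1 * i for i in range(n_time)]
    for _ in range(n_iter):
        new = [sum(gram[i][j] * vec[j] for j in range(n_time))
               for i in range(n_time)]
        norm = math.sqrt(sum(v * v for v in new))
        if norm == 0.0:
            return [0.0] * n_time  # no variance in the selected genes
        vec = [v / norm for v in new]
    eigval = sum(vec[i] * sum(gram[i][j] * vec[j] for j in range(n_time))
                 for i in range(n_time))
    return [math.sqrt(max(eigval, 0.0)) * v for v in vec]
```

Because the sign of a principal component is arbitrary, an estimate like this needs an external convention (for example, orienting it to correlate positively with known activated targets) before it can be read as "up" or "down".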
29

Monraz, Gomez Luis Cristobal. "Application of systems biology resources to human diseases : combining transcriptomics data analysis and molecular networks to identify major players." Electronic Thesis or Diss., Université Paris sciences et lettres, 2023. http://www.theses.fr/2023UPSLS069.

Abstract:
Les systèmes biologiques sont des structures complexes avec des interactions complexes entre leurs composants. Grâce à la combinaison de différents domaines scientifiques, il est désormais possible d'étudier ces systèmes et de répondre à différentes questions qui ont des applications différentes. Dans cette thèse, j'ai exploré des outils et des approches utilisés en biologie des systèmes afin de trouver des acteurs moléculaires ainsi que des mécanismes importants dans les réseaux moléculaires des systèmes biologiques. J'ai intégré des techniques d'analyse de données transcriptomiques. J'ai également utilisé des approches de formalisation des connaissances afin de construire ou d'étendre des réseaux moléculaires descriptifs existants dans différentes maladies. J'ai principalement étudié le rôle du tissu adipeux dans le cancer du sein. Le tissu adipeux constitue une partie fondamentale et importante de l’anatomie du sein. Il a été suggéré que ce tissu adipeux, principalement composé d’adipocytes blancs, interagit avec les cellules cancéreuses au front invasif de la tumeur, favorisant ainsi la progression tumorale. Ces cellules ont été appelées “Adipocytes associés au cancer (CAA, en anglais)”. Il a été émis l’hypothèse selon laquelle l’interaction entre CAA et cellules tumorales serait amplifiée en cas d’obésité. Ainsi, une cohorte de patientes atteintes d’un carcinome canalaire mammaire et classées comme obèses ou normo-pondérales a été constituée. J'ai analysé des échantillons de tissu adipeux de ces patients, proches (proximaux) ou éloignés (distal) de la tumeur, au niveau du transcriptome. Les deux types de tissus présentaient des motifs d’expression génique similaires. Cependant, avec l’analyse d’enrichissement, les échantillons proximaux présentaient des voies de signalisation des œstrogènes enrichies et des voies liées à l’épithélium par rapport aux échantillons distaux.
Par rapport aux échantillons de tumeurs, les échantillons proximaux montraient principalement des voies menant à la fonction du tissu adipeux, telles que l'adipogenèse, le métabolisme des acides gras, la signalisation de PPAR entre autres. J’ai appliqué l'analyse ROMA pour déterminer l'activation des voies d'intérêt à partir des résultats d'enrichissement, et nous avons constaté que la thermogenèse et les métalloprotéinases matricielles étaient plus actives dans les tissus adipeux proximaux. Les gènes MMP7, MMP16, MMP3, SMARCC1, CREB3L4, MAPK13, RPS6KA6, SMARCA4, ZNF516, ACTG1, SLC25A9 sont apparus comme contributeurs majeurs. Les réseaux moléculaires peuvent être représentés sous forme de diagrammes. Les informations contenues dans ces réseaux peuvent servir à exploiter l'analyse des données transcriptomiques. Auparavant, l'Atlas du réseau de signalisation du cancer avait été constitué. Cette ressource est composée de processus biologiques pour le développement et la progression du cancer sous la forme de cartes. J'ai utilisé l'une des cartes, la sénescence cellulaire et la transition épithélio-mésenchymateuse (EMT, en anglais), pour explorer le rôle du prototype gène suppresseur de métastase, NME1 (appelé auparavant NM23-H1) dans ces processus. J'ai enrichi la carte avec les fonctions de la protéine NME1 et utilisé les informations pour compiler les acteurs impliqués dans la sénescence cellulaire et l'EMT. Certains acteurs intéressants liés aux deux processus ont été identifiés, comme NF-κB, montrant que la sénescence a une relation avec l'EMT. Ensuite, j'ai utilisé des données transcriptomiques provenant de patients atteints d'un cancer colorectal pour observer l'activité des différents modules du réseau afin d'observer la progression à travers les différents stades de la maladie. Finalement, en raison de l'épidémie de COVID-19, j’ai participé à un effort où nous avons construit une carte de l’interaction hôte-virus, la carte COVID-19.
Ma contribution s'est concentrée sur la construction du réseau représentant le stress du réticulum endoplasmique.
Biological systems are complex structures with multiple interactions between their components. Thanks to the combination of fields such as mathematics, computational science, biology, and physiology, it is now possible to study these systems and answer questions with a range of applications, for example in human health. In this thesis I have explored some tools and approaches used in systems biology in order to find molecular players as well as mechanisms that are important in the molecular networks of biological systems. For this thesis, I have applied data analysis techniques to transcriptomics data in different diseases. I have also used knowledge formalization approaches in order to construct or extend existing descriptive molecular networks in different diseases. I have studied the role of adipose tissue in breast cancer. The adipose tissue constitutes a fundamental and large part of the breast anatomy. Mammary adipocytes have been hypothesized to interact with cancer cells at the invasive front of the tumor, supporting the progression of the disease. These adipocytes have been termed “Cancer Associated Adipocytes (CAA)”. The interaction with these CAA and the progression of the disease have been suggested to be worse in obese patients. Therefore, to gain insight into the mechanism, a cohort of patients who had ductal breast carcinoma and who were classified as obese or normal-weight was created. I have analyzed adipose tissue samples from these patients, taken either close to (proximal) or far from (distal) the tumor, at the transcriptome level. Both tissue types showed similar gene expression patterns. However, in the enrichment analysis, proximal samples showed enriched estrogen signaling pathways and pathways related to epithelium when compared to distal samples. When compared to tumor samples, proximal samples showed mostly pathways related to adipose tissue function, such as adipogenesis, fatty acid metabolism, and PPAR signaling, among others.
We applied ROMA analysis to determine activation of pathways of interest from the enrichment results, and we found thermogenesis and matrix metalloproteinases to be more active in the proximal adipose tissues. The genes MMP7, MMP16, MMP3, SMARCC1, CREB3L4, MAPK13, RPS6KA6, SMARCA4, ZNF516, ACTG1, SLC25A9 appeared as major contributors. Molecular networks can be depicted as diagrams in order to facilitate their exploration and visualization. The information contained in these networks may serve to exploit the analysis of transcriptomics data using techniques such as gene-set enrichment analysis. Previously, the Atlas of Cancer Signaling Network was assembled. This resource is composed of known biological processes that are relevant for cancer development and progression, in the form of maps depicting molecular interactions. I have used one of the maps, cellular senescence and Epithelial to Mesenchymal Transition (EMT), to explore the role of the prototypic metastasis suppressor gene NME1 (previously called NM23-H1) in these processes. I enriched the map with functions of the protein and also used the information to compile the players that are involved in cellular senescence and EMT. Some interesting players related to both processes were identified, such as NF-κB, showing that senescence has a relationship with EMT. Then, I used transcriptomics data from colorectal cancer patients to observe the activity of the different modules in the network across the different stages of the disease. Lastly, due to the COVID-19 epidemic, I have participated in an effort involving multiple research groups in which we constructed a map of the host-virus interaction, the COVID-19 map. My contribution was focused on building the network representing the endoplasmic reticulum stress.
30

Isik, Zerrin, Tulin Ersahin, Volkan Atalay, Cevdet Aykanat, and Rengul Cetin-Atalay. "A signal transduction score flow algorithm for cyclic cellular pathway analysis, which combines transcriptome and ChIP-seq data." Saechsische Landesbibliothek- Staats- und Universitaetsbibliothek Dresden, 2014. http://nbn-resolving.de/urn:nbn:de:bsz:14-qucosa-138982.

Abstract:
Determination of cell signalling behaviour is crucial for understanding the physiological response to a specific stimulus or drug treatment. Current approaches for large-scale data analysis do not effectively incorporate critical topological information provided by the signalling network. We herein describe a novel model- and data-driven hybrid approach, or signal transduction score flow algorithm, which allows quantitative visualization of cyclic cell signalling pathways that lead to ultimate cell responses such as survival, migration or death. This score flow algorithm translates signalling pathways as a directed graph and maps experimental data, including negative and positive feedbacks, onto gene nodes as scores, which then computationally traverse the signalling pathway until a pre-defined biological target response is attained. Initially, experimental data-driven enrichment scores of the genes were computed in a pathway, then a heuristic approach was applied using the gene score partition as a solution for protein node stoichiometry during dynamic scoring of the pathway of interest. Incorporation of a score partition during the signal flow and cyclic feedback loops in the signalling pathway significantly improves the usefulness of this model, as compared to other approaches. Evaluation of the score flow algorithm using both transcriptome and ChIP-seq data-generated signalling pathways showed good correlation with expected cellular behaviour on both KEGG and manually generated pathways. Implementation of the algorithm as a Cytoscape plug-in allows interactive visualization and analysis of KEGG pathways as well as user-generated and curated Cytoscape pathways. Moreover, the algorithm accurately predicts gene-level and global impacts of single or multiple in silico gene knockouts.
This contribution is freely accessible with the consent of the rights holder under a (DFG-funded) Alliance or National Licence.
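To make the notion of scores flowing through a cyclic directed pathway more concrete, here is a toy Python sketch. It is not the authors' algorithm or their Cytoscape plug-in: it simply assumes that each node splits its current signal equally among its downstream nodes (a crude stand-in for the "score partition"), that signal reaching the target node is absorbed there, and that iteration stops once the circulating signal becomes negligible. The function name, the equal split, and the stopping rule are all assumptions made for illustration.

```python
def score_flow(edges, gene_scores, target, max_steps=100, tol=1e-9):
    """Propagate gene scores along a directed graph until they reach target.

    edges maps node -> list of downstream nodes (cycles are allowed).
    gene_scores maps node -> initial score from experimental data.
    Returns the total score accumulated at the target node.
    """
    # Signal currently travelling through the pathway, per node.
    signal = {n: s for n, s in gene_scores.items() if n != target}
    reached = gene_scores.get(target, 0.0)
    for _ in range(max_steps):
        incoming = {}
        for node, amount in signal.items():
            successors = edges.get(node, [])
            if not successors:
                continue  # dead end: this signal never reaches the target
            share = amount / len(successors)  # equal partition (assumption)
            for nxt in successors:
                incoming[nxt] = incoming.get(nxt, 0.0) + share
        reached += incoming.pop(target, 0.0)  # target absorbs what arrives
        signal = incoming
        if sum(abs(v) for v in signal.values()) < tol:
            break  # nothing meaningful is still circulating
    return reached
```

In a pathway with a feedback loop, part of the signal returns upstream at each pass, so the target total converges geometrically instead of arriving in a single step; the `max_steps` cap guards against loops that have no exit.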
31

Isik, Zerrin, Tulin Ersahin, Volkan Atalay, Cevdet Aykanat, and Rengul Cetin-Atalay. "A signal transduction score flow algorithm for cyclic cellular pathway analysis, which combines transcriptome and ChIP-seq data." Royal Society of Chemistry, 2012. https://tud.qucosa.de/id/qucosa%3A27799.

Abstract:
Determination of cell signalling behaviour is crucial for understanding the physiological response to a specific stimulus or drug treatment. Current approaches for large-scale data analysis do not effectively incorporate critical topological information provided by the signalling network. We herein describe a novel model- and data-driven hybrid approach, or signal transduction score flow algorithm, which allows quantitative visualization of cyclic cell signalling pathways that lead to ultimate cell responses such as survival, migration or death. This score flow algorithm translates signalling pathways as a directed graph and maps experimental data, including negative and positive feedbacks, onto gene nodes as scores, which then computationally traverse the signalling pathway until a pre-defined biological target response is attained. Initially, experimental data-driven enrichment scores of the genes were computed in a pathway, then a heuristic approach was applied using the gene score partition as a solution for protein node stoichiometry during dynamic scoring of the pathway of interest. Incorporation of a score partition during the signal flow and cyclic feedback loops in the signalling pathway significantly improves the usefulness of this model, as compared to other approaches. Evaluation of the score flow algorithm using both transcriptome and ChIP-seq data-generated signalling pathways showed good correlation with expected cellular behaviour on both KEGG and manually generated pathways. Implementation of the algorithm as a Cytoscape plug-in allows interactive visualization and analysis of KEGG pathways as well as user-generated and curated Cytoscape pathways. Moreover, the algorithm accurately predicts gene-level and global impacts of single or multiple in silico gene knockouts.
This contribution is freely accessible with the consent of the rights holder under a (DFG-funded) Alliance or National Licence.
32

Finotello, Francesca. "Computational methods for the analysis of gene expression from RNA sequencing data." Doctoral thesis, Università degli studi di Padova, 2014. http://hdl.handle.net/11577/3423789.

Abstract:
In every living organism, the entirety of its hereditary information is encoded, in the form of DNA, in the so-called genome. The genome consists of both genes and non-coding sequences and contains all the information needed to determine the properties and functions of every single cell. Cells can access and translate specific instructions of this code through gene expression, namely by selectively switching on and off a particular set of genes. Thanks to gene expression, the information encoded in the active genes is transcribed into RNAs. This set of RNAs reflects the current state of a cell and can reveal pathological mechanisms underlying diseases. In recent years, a novel methodology for RNA sequencing, called RNA-seq, has been replacing microarrays for the study of gene expression. The sequencing framework of the RNA-seq methodology makes it possible to investigate at high resolution all the RNA species present in a sample, characterizing their sequences and quantifying their abundances at the same time. In practice, millions of short sequences, called reads, are sequenced from random positions of the input RNAs. These reads can then be computationally mapped on a reference genome to reveal a transcriptional map, where the number of reads aligned on each gene, called counts, gives a measure of its level of expression. At first glance, this scheme may seem very simple, but the implementation of the whole analysis workflow is in fact complex and not well defined. So far, many computational methods have been proposed to perform the different steps of RNA-seq data analysis, but a unified processing pipeline is still lacking. The aim of my Ph.D. research project was the implementation of a robust computational pipeline for RNA-seq data analysis, from data pre-processing to differential expression detection. The definition of the different analysis modules was carried out through several steps.
First, we drafted a basic analysis framework through the study of RNA-seq data features and the dissection of data models and state-of-the-art algorithmic strategies. Then, we focused on count bias, which is one of the most challenging aspects of RNA-seq data analysis. We demonstrated that some biases affecting counts can be effectively corrected with current normalization methods, while others, like length bias, cannot be completely removed without introducing additional systematic errors. Thus, we defined a novel approach to compute RNA-seq counts, which strongly reduces length bias prior to normalization and is robust to the upstream processing steps. Finally, we defined the complete analysis pipeline considering the best-performing methods and optimized some specific processing steps to enable correct expression estimates even in the presence of high-similarity genomic sequences. The implemented analysis pipeline was applied to a real case study to identify the genes involved in the pathogenesis of spinal muscular atrophy (SMA) from RNA-seq data of patients and healthy controls. SMA is a degenerative neuromuscular disease that has no cure and represents one of the major genetic causes of infant mortality. We identified a set of genes related to skeletal muscle and connective tissue disorders whose patterns of differential expression correlate with phenotype and may underlie protective mechanisms against SMA progression. Some putative positive targets identified by this analysis are currently under biological validation since they might improve diagnostic screening and therapy. To lay the groundwork for future research, which will focus on the optimization of the processing pipeline and on its extension to the analysis of dynamic expression data, we designed two time-series RNA-seq data sets: a real one and a simulated one.
The experimental and sequencing design of the real data set, as well as the modelling of the synthetic data, have been an integral part of the Ph.D. activity. Overall, this thesis considers each step of RNA-seq data processing and provides some valuable guidelines in a research field whose fast evolution has, up to now, prevented the establishment of a stable and standardized analysis scheme.
Il patrimonio genetico di ogni organismo vivente è codificato, sotto forma di DNA, nel genoma. Il genoma è costituito da geni e da sequenze non codificanti e racchiude in sé tutte le informazioni necessarie al corretto funzionamento delle cellule dell'organismo. Le cellule possono accedere a specifiche istruzioni di questo codice tramite un processo chiamato espressione genica, ovvero attivando o disattivando un particolare set di geni e trascrivendo l'informazione necessaria in RNA. L'insieme degli RNA trascritti caratterizza quindi un preciso stato cellulare e può fornire importanti informazioni sui meccanismi coinvolti nella patogenesi di una malattia. Recentemente, una metodologia per il sequenziamento dell'RNA, chiamata RNA-seq, sta rapidamente sostituendo i microarray nello studio dell'espressione genica. Grazie alle proprietà delle tecnologie di sequenziamento su cui è basato, l'RNA-seq permette di misurare il numero di RNA presenti in un campione e al contempo di "leggerne" l'esatta sequenza. In realtà, il sequenziamento produce milioni di sequenze, chiamate "read", che rappresentano piccole stringhe lette da posizioni random degli RNA in input. Le read devono quindi essere mappate con un algoritmo su un genoma di riferimento, in modo da ricostruire una mappa trascrizionale, in cui il numero di read allineate su ciascun gene dà una misura digitale (chiamata "count") del suo livello di espressione. Sebbene a prima vista questa procedura possa sembrare molto semplice, lo schema di analisi integrale è in realtà molto complesso e non ben definito. In questi anni sono stati sviluppati diversi metodi per ciascuna delle fasi di elaborazione, ma non è stata tuttora definita una pipeline di analisi dei dati RNA-seq standardizzata. L'obiettivo principale del mio progetto di dottorato è stato lo sviluppo di una pipeline computazionale per l'analisi di dati RNA-seq, dal pre-processing alla misura dell'espressione genica differenziale. 
I diversi moduli di elaborazione sono stati definiti e implementati tramite una serie di passi successivi. Inizialmente, abbiamo considerato e ridefinito metodi e modelli per la descrizione e l'elaborazione dei dati, in modo da stabilire uno schema di analisi preliminare. In seguito, abbiamo considerato più attentamente uno degli aspetti più problematici dell'analisi dei dati RNA-seq: la correzione dei bias presenti nei count. Abbiamo dimostrato che alcuni di questi bias possono essere corretti in modo efficace tramite le tecniche di normalizzazione correnti, mentre altri, ad esempio il "length bias", non possono essere completamente rimossi senza introdurre ulteriori errori sistematici. Abbiamo quindi definito e testato un nuovo approccio per il calcolo dei count che minimizza i bias ancora prima di procedere con un'eventuale normalizzazione. Infine, abbiamo implementato la pipeline di analisi completa considerando gli algoritmi più robusti e accurati, selezionati nelle fasi precedenti, e ottimizzato alcuni step in modo da garantire stime dell'espressione genica accurate anche in presenza di geni ad alta similarità. La pipeline implementata è stata in seguito applicata ad un caso di studio reale, per identificare i geni coinvolti nella patogenesi dell'atrofia muscolare spinale (SMA). La SMA è una malattia neuromuscolare degenerativa che costituisce una delle principali cause genetiche di morte infantile e per la quale non sono ad oggi disponibili né una cura né un trattamento efficace. Con la nostra analisi abbiamo identificato un insieme di geni legati ad altre malattie del tessuto connettivo e muscoloscheletrico i cui pattern di espressione differenziale correlano con il fenotipo, e che quindi potrebbero rappresentare dei meccanismi protettivi in grado di combattere i sintomi della SMA.
Alcuni di questi target putativi sono in via di validazione poiché potrebbero portare allo sviluppo di strumenti efficaci per lo screening diagnostico e il trattamento di questa malattia. Gli obiettivi futuri riguardano l'ottimizzazione della pipeline definita in questa tesi e la sua estensione all'analisi di dati dinamici da "time-series RNA-seq". A questo scopo, abbiamo definito il design di due data set "time-series", uno reale e uno simulato. La progettazione del design sperimentale e del sequenziamento del data set reale, nonché la modellazione dei dati simulati, sono stati parte integrante dell'attività di ricerca svolta durante il dottorato. L'evoluzione rapida e costante che ha caratterizzato i metodi per l'analisi di dati RNA-seq ha impedito fino ad ora la definizione di uno schema di analisi standardizzato e la risoluzione di problematiche legate a diversi aspetti dell'elaborazione, quali ad esempio la normalizzazione. In questo contesto, la pipeline definita in questa tesi e, più ampiamente, i temi discussi in ciascun capitolo, toccano tutti i diversi aspetti dell'analisi dei dati RNA-seq e forniscono delle linee guida utili a definire un approccio computazionale efficace e robusto.
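The length bias mentioned in this abstract arises because, at equal expression, a longer gene yields more reads. The Python sketch below illustrates this with the widely used transcripts-per-million (TPM) normalization. It is not the novel count computation proposed in the thesis, whose details the abstract does not give. The function name and the toy numbers are hypothetical.

```python
def tpm(counts, lengths_bp):
    """Transcripts per million from raw read counts and gene lengths.

    counts maps gene -> number of aligned reads.
    lengths_bp maps gene -> gene (or transcript) length in base pairs.
    """
    # Reads per kilobase removes the first-order effect of gene length.
    rates = {g: counts[g] / (lengths_bp[g] / 1000.0) for g in counts}
    total = sum(rates.values())
    if total == 0:
        return {g: 0.0 for g in counts}
    # Rescale so that values sum to one million within the sample.
    return {g: rate / total * 1e6 for g, rate in rates.items()}


# Toy example: the long gene has four times the reads of the short one
# only because it is four times longer.
example = tpm({"short": 100, "long": 400}, {"short": 1000, "long": 4000})
```

In this toy case both genes receive the same TPM value even though their raw counts differ fourfold. As the abstract notes, length-related effects on downstream differential expression analysis are not necessarily removed by this kind of rescaling.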
33

Aghamirzaie, Delasa. "Isoform-Specific Expression During Embryo Development in Arabidopsis and Soybean." Diss., Virginia Tech, 2016. http://hdl.handle.net/10919/73054.

Abstract:
Almost every precursor mRNA (pre-mRNA) in a eukaryotic organism undergoes splicing, in some cases resulting in the formation of more than one splice variant, a process called alternative splicing. RNA-Seq provides a major opportunity to capture the state of the transcriptome, which includes the detection of alternative splicing events. Alternative splicing is a highly regulated process carried out by a complex machinery called the spliceosome. In this dissertation, I focus on the identification of different splice variants and splicing factors that are produced during Arabidopsis and soybean embryo development. I developed several data analysis pipelines for the detection and the functional characterization of active splice variants and splicing factors that arise during embryo development. The main goal of this dissertation was to identify transcriptional changes associated with specific stages of embryo development and infer possible associations between known regulatory genes and their targets. We identified several instances of exon skipping and intron retention as products of alternative splicing. The coding potential of the splice variants was evaluated using CodeWise. I developed CodeWise, a weighted support vector machine classifier, to assess the coding potential of novel transcripts with respect to RNA secondary structure free energy, conserved domains, and sequence properties. We also examined the effect of alternative splicing on the domain composition of the resulting protein isoforms. The majority of splice-variant pairs encode proteins with identical domains or similar domains with truncation, and in less than 10% of the cases alternative splicing results in the gain or loss of a conserved domain. I constructed several possible regulatory networks that occur at specific stages of embryo development.
In addition, in order to gain a better understanding of splicing regulation, we developed the concept of co-splicing networks, as a group of transcripts containing common RNA-binding motifs, which are co-expressed with a specific splicing factor. For this purpose, I developed a multi-stage analysis pipeline to integrate the co-expression networks with de novo RNA binding motif discovery at inferred splice sites, resulting in the identification of specific splicing factors and the corresponding cis-regulatory sequences that cause the production of splice variants. This approach resulted in the development of several novel hypotheses about the regulation of minor and major splicing in developing Arabidopsis embryos. In summary, this dissertation provides a comprehensive view of splicing regulation in Arabidopsis and soybean embryo development using computational analysis.
Ph. D.
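The co-splicing network concept in this abstract (transcripts that share an RNA-binding motif and are co-expressed with a splicing factor) can be illustrated with a small Python sketch. It is not the dissertation's pipeline: motifs are assumed to be already known per transcript, whereas the thesis discovers them de novo, and co-expression is reduced to a Pearson correlation threshold. The function names, the threshold, and the input layout are assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant profile carries no co-expression signal
    return cov / (sd_x * sd_y)


def co_splicing_group(factor_expr, transcript_expr, transcript_motifs,
                      motif, min_r=0.8):
    """Transcripts that carry `motif` and are co-expressed with the factor.

    factor_expr: expression of one splicing factor across samples.
    transcript_expr: transcript -> expression across the same samples.
    transcript_motifs: transcript -> set of motifs found near splice sites.
    """
    return sorted(
        name for name, expr in transcript_expr.items()
        if motif in transcript_motifs.get(name, set())
        and pearson(factor_expr, expr) >= min_r
    )
```

Requiring both conditions is the point of the construction: correlation alone would pick up any co-regulated transcript, and motif presence alone would ignore whether the factor is actually expressed alongside its candidate targets.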
34

Logotheti, Marianthi. "Integration of functional genomics and data mining methodologies in the study of bipolar disorder and schizophrenia." Doctoral thesis, Örebro universitet, Institutionen för medicinska vetenskaper, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:oru:diva-52644.

Abstract:
Bipolar disorder and schizophrenia are two severe psychiatric disorders characterized by a complex genetic basis, coupled to the influence of environmental factors. In this thesis, functional genomic analysis tools were used for the study of the underlying pathophysiology of these disorders, focusing on gene expression and function on a global scale with the application of high-throughput methods. Datasets from public databases regarding transcriptomic data of postmortem brain and skin fibroblast cells of patients with either schizophrenia or bipolar disorder were analyzed in order to identify differentially expressed genes. In addition, fibroblast cells of bipolar disorder patients obtained from the Biobank of the Neuropsychiatric Research Laboratory of Örebro University were cultured, RNA was extracted and used for microarray analysis. In order to gain deeper insight into the biological mechanisms related to the studied psychiatric disorders, the differentially expressed gene lists were subjected to pathway and target prioritization analysis, using proprietary tools developed by the group of Metabolic Engineering and Bioinformatics, of the National Hellenic Research Foundation, thus indicating various cellular processes as significantly altered. Many of the molecular processes derived from the analysis of the postmortem brain data of schizophrenia and bipolar disorder were also identified in the skin fibroblast cells. Additionally, through the use of machine learning methods, gene expression data from patients with schizophrenia were exploited for the identification of a subset of genes with discriminative ability between schizophrenia and healthy control subjects. Interestingly, a set of genes with high separating efficiency was derived from fibroblast gene expression profiling. 
This thesis suggests the suitability of skin fibroblasts as a reliable model for the diagnostic evaluation of psychiatric disorders and schizophrenia in particular, through the construction of promising machine-learning based classification models, exploiting gene expression data from peripheral tissues.
APA, Harvard, Vancouver, ISO, and other styles
35

Shi, Xu. "Bayesian Modeling for Isoform Identification and Phenotype-specific Transcript Assembly." Diss., Virginia Tech, 2017. http://hdl.handle.net/10919/79772.

Full text
Abstract:
The rapid development of biotechnology has enabled researchers to collect high-throughput data for studying various biological processes at the genomic level, transcriptomic level, and proteomic level. Due to the large noise in the data and the high complexity of diseases (such as cancer), it is a challenging task for researchers to extract biologically meaningful information that can help reveal the underlying molecular mechanisms. The challenges call for more efforts in developing efficient and effective computational methods to analyze the data at different levels so as to understand the biological systems in different aspects. In this dissertation research, we have developed novel Bayesian approaches to infer alternative splicing mechanisms in biological systems using RNA sequencing data. Specifically, we focus on two research topics in this dissertation: isoform identification and phenotype-specific transcript assembly. For isoform identification, we develop a computational approach, SparseIso, to jointly model the existence and abundance of isoforms in a Bayesian framework. A spike-and-slab prior is incorporated into the model to enforce the sparsity of expressed isoforms. A Gibbs sampler is developed to sample the existence and abundance of isoforms iteratively. For transcript assembly, we develop a Bayesian approach, IntAPT, to assemble phenotype-specific transcripts from multiple RNA sequencing profiles. A two-layer Bayesian framework is used to model the existence of phenotype-specific transcripts and the transcript abundance in individual samples. Based on the hierarchical Bayesian model, a Gibbs sampling algorithm is developed to estimate the joint posterior distribution for phenotype-specific transcript assembly. The performances of our proposed methods are evaluated with simulation data, compared with existing methods and benchmarked with real cell line data. 
We then apply our methods to breast cancer data to identify biologically meaningful splicing mechanisms associated with breast cancer. In future work, we will extend our methods to de novo transcript assembly to identify novel isoforms in biological systems, and we will incorporate isoform-specific networks into our methods to better understand splicing mechanisms in biological systems.
Ph. D.
APA, Harvard, Vancouver, ISO, and other styles
36

Hassan, Aamir Ul. "Integration of Genome Scale Data for Identifying New Biomarkers in Colon Cancer: Integrated Analysis of Transcriptomics and Epigenomics Data from High Throughput Technologies in Order to Identifying New Biomarkers Genes for Personalised Targeted Therapies for Patients Suffering from Colon Cancer." Thesis, University of Bradford, 2017. http://hdl.handle.net/10454/17419.

Full text
Abstract:
Colorectal cancer is the third most common cancer and the leading cause of cancer deaths in Western industrialised countries. Despite recent advances in the screening, diagnosis, and treatment of colorectal cancer, an estimated 608,000 people die every year due to colon cancer. Our current knowledge of colorectal carcinogenesis indicates a multifactorial and multi-step process that involves various genetic alterations and several biological pathways. Identifying molecular markers with early diagnostic value and precise clinical outcome in colon cancer is a challenging task because of tumour heterogeneity. This Ph.D. thesis presents the molecular and cellular mechanisms leading to colorectal cancer. A systematic review of the literature is conducted on microarray gene expression profiling, gene ontology enrichment analysis, microRNA, systems biology and various bioinformatics tools. The study aimed to stratify colon tumours into molecularly distinct subtypes, identify novel diagnostic targets and predict reliable prognostic signatures for clinical practice using microarray expression datasets. We performed an integrated analysis of gene expression data based on genetic, epigenetic and extensive clinical information using unsupervised learning, correlation and functional network analysis. As a result, we identified 267-gene and 124-gene signatures that can distinguish normal, primary and metastatic tissues and that are also involved in important regulatory functions such as immune response, lipid metabolism and peroxisome proliferator-activated receptor (PPAR) signalling pathways. For the first time, we also identify miRNAs that can differentiate primary colon tumours from metastatic ones, together with a prognostic signature of grade and stage levels, which can be a major contributor to complex transcriptional phenotypes in a colon tumour.
APA, Harvard, Vancouver, ISO, and other styles
37

Sadacca, Benjamin. "Pharmacogenomic and High-Throughput Data Analysis to Overcome Triple Negative Breast Cancers Drug Resistance." Thesis, Université Paris-Saclay (ComUE), 2017. http://www.theses.fr/2017SACLS538/document.

Full text
Abstract:
Given the large number of treatment-resistant triple-negative breast cancers, it is essential to understand the mechanisms of resistance and to find new effective molecules. First, we analyze two large-scale pharmacogenomic datasets. We propose a novel classification based on transcriptomic profiles of cell lines, using a gene selection process driven by biological networks. Our molecular classification shows greater homogeneity in drug response than grouping cell lines by their tissue of origin. It also helps identify similar patterns of treatment response. In a second analysis, we study a cohort of patients with triple-negative breast cancer that resisted neoadjuvant chemotherapy. We perform comprehensive molecular analyses based on RNA-seq and WES. We observe high molecular heterogeneity of tumors before and after treatment. Although we observe clonal evolution under treatment, no recurrent mechanism of resistance could be identified. Our results strongly suggest that each tumor has a unique molecular profile and that it is important to study large series of tumors. Finally, we improve a method for testing the overrepresentation of known RNA-binding protein motifs in a given set of regulated sequences. This tool uses an innovative approach to control the proportion of false positives, which the existing algorithm does not do. We show the effectiveness of our approach using two different datasets.
APA, Harvard, Vancouver, ISO, and other styles
38

Wu, Mei. "Detection of aberrant events in RNA for clinical diagnostics." Thesis, Uppsala universitet, Institutionen för biologisk grundutbildning, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-448361.

Full text
Abstract:
Rare diseases are estimated to affect 3.75% of the global population, which roughly translates to 300 million affected individuals. A large proportion of patients still do not have a diagnosis, and current approaches such as chromosomal microarray (CMA), whole exome sequencing (WES), and whole genome sequencing (WGS), which target DNA and the exome, aim to resolve that very first step. RNA-seq serves as a powerful approach complementing the aforementioned methods, which have reached a plateau in diagnostic yield. RNA-seq can facilitate the finding of aberrant events that appear during transcription, e.g., splicing, changes in gene expression and monoallelic expression. In this study, we aimed to establish RNA-seq analysis pipelines and evaluate whether RNA-seq could be utilized to enhance diagnostic yield. A total of 47 clinical samples were analysed along with the publicly controlled GEUVADIS dataset to evaluate the potential of RNA-seq in a clinical setting. The pilot pipeline, an RNA-seq analysis wrapper around the Detection of RNA Outlier Pipeline (DROP), detected a highly ranked splicing variant in a positive control sample that was hard to identify in a WGS analysis. The other two positive control samples, which had aberrant expression, were also detected by the pipeline. Additionally, the pipeline gave a manageable list of candidate genes per affected sample in the population, along with corroborating graphs that can support clinicians' decision-making. The pipeline proved successful for integrating RNA-seq; therefore, we anticipate an increase in diagnoses.
APA, Harvard, Vancouver, ISO, and other styles
39

Gogolewski, Krzysztof. "Matrix methods in transcriptomic and metabolomic data analysis." Doctoral thesis, 2019. https://depotuw.ceon.pl/handle/item/3341.

Full text
Abstract:
In this dissertation we walk through various approaches to the modelling and analysis of transcriptomic and metabolomic data. The dissertation opens with an introduction to the current state of the art in high-throughput data, along with a description of the genetic background. We review the technologies currently used to obtain transcriptomic data as well as computational methods and tools for their analysis, including various methods for decomposing the transcriptomic signal and integrating it with metabolomic knowledge. Throughout the three main chapters of this dissertation we discuss specific experimental settings and data for which adequate methods for transcriptomic and metabolomic data analysis are derived and applied. Each of these chapters presents a different computational method for the inference of biological knowledge and is supported by a case study based on real-life experimental data. Finally, the dissertation closes with results from joint work with Baylor College of Medicine concerning the role of the FOXF1 gene in lung disease.
APA, Harvard, Vancouver, ISO, and other styles
40

YEN, MING-YI, and 顏名儀. "Quantitative Analysis of ECI2 Isoforms from Cancer Transcriptomic Data." Thesis, 2019. http://ndltd.ncl.edu.tw/handle/6hv7zs.

Full text
Abstract:
Master's
Asia University
Department of Bioinformatics and Medical Engineering
107
Quantitative analysis of transcriptomes has received increasing attention in recent years, and a growing number of studies have confirmed that transcript isoforms may play key roles in cancer. Recent studies found that ECI2 (Homo sapiens enoyl-CoA delta isomerase 2) is a mammalian peroxisomal isomerase with 11 transcript isoforms, one of which is an important cancer antigen called Hepatocellular Carcinoma Antigen 64 (HCA64). Other studies have also identified it as a prognostic biomarker for other cancers such as breast cancer and prostate cancer. RNA-Seq high-throughput sequencing has become one of the most advanced methods for measuring gene expression. Therefore, we use the software Salmon to quantify transcript expression from RNA-seq data for breast cancer, small cell lung cancer, ovarian cancer and prostate cancer. We expect to find a relationship between HCA64 and specific cancers, so that it may serve as a biomarker that is useful for diagnosis and subsequent treatment and helps improve patient survival. According to the results of the study, the HCA64 transcript was expressed only in some cancer samples. DESeq2 analysis identified significant genes in metastatic breast cancer, prostate cancer and liver cancer, and many of these genes have mechanisms related to the development of these three cancers. It can be inferred that HCA64 has an important association with metastatic breast cancer, prostate cancer and liver cancer.
APA, Harvard, Vancouver, ISO, and other styles
41

Rogers, Gary L. "Transcriptomic Data Analysis Using Graph-Based Out-of-Core Methods." 2011. http://trace.tennessee.edu/utk_graddiss/1122.

Full text
Abstract:
Biological data derived from high-throughput microarrays can be transformed into finite, simple, undirected graphs and analyzed using tools first introduced by the Langston Lab at the University of Tennessee. Transforming raw data can be broken down into three main tasks: data normalization, generation of similarity metrics, and threshold selection. The choice of methods used in each of these steps affects the final graph with respect to size, density, and structure. A number of different algorithms are examined and analyzed to illustrate the magnitude of the effects. Graph-based tools are then used to extract putative gene networks. These tools are loosely based on the concept of clique, which generates clusters optimized for density. Innovative additions to the paraclique algorithm, developed at the Langston Lab, are introduced to generate results that have the highest average correlation or the highest density. A new suite of algorithms is then presented that exploits the use of a priori gene interactions. Aptly named the anchored analysis toolkit, these algorithms use known interactions as anchor points for generating subgraphs, which are then analyzed for their graph structure. This results in clusters that might otherwise have been lost in noise. A main product of this thesis is a novel collection of algorithms to generate exact solutions to the maximum clique problem for graphs that are too large to fit within core memory. No other algorithms are currently known that produce exact solutions to this problem for extremely large graphs. A combination of in-core and out-of-core techniques is used in conjunction with a distributed-memory programming model. These algorithms take into consideration such pitfalls as external disk I/O and hardware failure and recovery. Finally, a web-based tool is described that provides researchers access to the aforementioned algorithms.
The Graph Algorithms Pipeline for Pathway Analysis tool, GrAPPA, was previously developed by the Langston Lab and provides the software needed to take raw microarray data as input and preprocess, analyze, and post-process it in a single package. GrAPPA also provides access to high-performance computing resources, via the TeraGrid.
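The graph construction and clique extraction described in this abstract can be illustrated with a small in-memory sketch. This is not the thesis's software: the function names, the toy expression values and the 0.95 threshold are all assumptions made for illustration, Pearson correlation stands in for whichever similarity metric is chosen, and a textbook Bron-Kerbosch search with pivoting stands in for the out-of-core maximum clique algorithms.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def build_graph(expr, threshold):
    """Connect two genes when the absolute correlation of their profiles
    reaches the threshold. Returns an adjacency dict: gene -> set of genes."""
    adj = {gene: set() for gene in expr}
    for g, h in combinations(expr, 2):
        if abs(pearson(expr[g], expr[h])) >= threshold:
            adj[g].add(h)
            adj[h].add(g)
    return adj


def maximum_clique(adj):
    """Exact maximum clique via Bron-Kerbosch with pivoting and a size bound.
    Suitable only for small graphs that fit in memory."""
    best = []

    def expand(r, p, x):
        nonlocal best
        if not p and not x:
            if len(r) > len(best):
                best = sorted(r)
            return
        if len(r) + len(p) <= len(best):
            return  # this branch cannot beat the best clique found so far
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return best


# Hypothetical, already-normalized expression profiles (5 samples per gene).
expr = {
    "g1": [1, 2, 3, 4, 5],
    "g2": [2, 4, 6, 8, 10],
    "g3": [1, 2, 3, 4, 6],
    "g4": [3, 1, 4, 1, 5],
    "g5": [2, 7, 1, 8, 2],
}
graph = build_graph(expr, threshold=0.95)
clique = maximum_clique(graph)
print(clique)
```

Raising or lowering the threshold changes the size and density of the graph, which is the sensitivity the abstract refers to when it discusses threshold selection.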
APA, Harvard, Vancouver, ISO, and other styles
42

Gatto, Sole. "Integrated bioinformatics analysis of epigenomic and transcriptomic data from ICF syndrome patient's cells." Tesi di dottorato, 2013. http://www.fedoa.unina.it/9340/1/TESI_SG.pdf.

Full text
Abstract:
Immunodeficiency, Centromeric region instability, Facial anomalies (ICF) syndrome (OMIM 242860) is a human autosomal recessive disease caused by mutations in the DNMT3B gene and characterized by inheritance of aberrant patterns of DNA methylation and heterochromatin defects. How mutations in DNMT3B and the resulting deficiency in DNA methyltransferase activity lead mainly to immunodeficiency has not yet been clarified. It is already known that the expression of several genes and microRNAs is deregulated, both up and down, in ICF lymphoblastoid cell lines (LCLs). Subtle and sporadic changes were observed in the epigenetic profile of those genes. It is clear that DNMT3B mutations affect not only DNA methylation but also several other regulators of expression. Next Generation Sequencing (NGS) technologies played a very important role in assessing to what extent these mutations affect the epigenetic landscape of the whole genome. The global DNA methylation profile was generated (Heyn et al., 2011), and genome-wide maps of H3K4me3, H3K27me3 and H3K9me3 obtained by chromatin immunoprecipitation-sequencing (ChIP-seq) were correlated with the mRNA transcriptome (obtained by RNA-seq) and with microRNA expression (Gatto et al., 2010) in ICF and control LCLs. Reliable pipelines for the analysis and integration of these data were developed during this work. This thesis describes in detail the analyses performed and the biological results obtained.
APA, Harvard, Vancouver, ISO, and other styles
43

Bewerunge, Peter [Verfasser]. "Integrative data mining and meta analysis of disease-specific large-scale genomic, transcriptomic and proteomic data / presented by Peter Bewerunge." 2009. http://d-nb.info/997856645/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
44

Sitte, Maren. "Network Based Integration of Proteomic and Transcriptomic Data: Study of BCR and WNT11 Signaling Pathways in Cancer Cells." Doctoral thesis, 2020. http://hdl.handle.net/21.11130/00-1735-0000-0005-1397-B.

Full text
APA, Harvard, Vancouver, ISO, and other styles
45

Chu, An-Yuan, and 諸安元. "Application of Machine Learning in Analysis of Transcriptomic Data Derived from Next Generation Sequencing and Model Construction." Thesis, 2019. http://ndltd.ncl.edu.tw/handle/tszn3d.

Full text
Abstract:
Master's
National Chung Hsing University
Department of Management Information Systems
107
Tobacco mosaic virus, the most studied plant virus, can infect over 100 species of plants and over 550 species of flowering plants, causing enormous economic losses at home and abroad. Microarrays, an important analytic tool in genomics and genetics, enable researchers to analyze the expression of large numbers of genes simultaneously. To find the genes related to the replication of Tobacco mosaic virus, this research uses gene expression data, generated by next-generation sequencing at 5 time points (30 min, 4 hr, 6 hr, 18 hr and 24 hr), from Arabidopsis cells infected by Tobacco mosaic virus. In addition to drawing on the FCBF and Wrapper algorithms from papers that integrate machine learning and microarray analysis, this research redefines genes as samples and translates the original target variables into attributes of each gene, then proposes the DiSK algorithm to select genes. The selected genes are validated with the C4.5 algorithm and a multi-layer perceptron. The results show that genes selected by the DiSK algorithm, with an average accuracy of 81.25%, an average true positive rate (classification accuracy of the control group) of 90%, a true negative rate (classification accuracy of the experiment group) of 72.5%, an average F-measure of 80.3% and an average AUC of 0.849, all outperform genes selected by the other algorithms. Finally, this research explores the functions of the selected genes and searches for and constructs their genetic network with Pathway Studio Plant, which makes the proposed algorithm more persuasive and provides new targets for plant virus researchers.
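The evaluation measures quoted in this abstract follow the usual confusion-matrix definitions, which can be sketched as below. This is not the thesis's code: the function name is illustrative and the counts are hypothetical, chosen only so that the two class-wise rates equal the reported averages of 90% and 72.5%. Following the abstract's wording, the control group is treated as the positive class.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return common binary-classification metrics from confusion-matrix counts.

    Assumes every denominator is non-zero.
    """
    tpr = tp / (tp + fn)            # true positive rate (recall, sensitivity)
    tnr = tn / (tn + fp)            # true negative rate (specificity)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * tpr / (precision + tpr)
    return {
        "accuracy": accuracy,
        "tpr": tpr,
        "tnr": tnr,
        "precision": precision,
        "f_measure": f_measure,
    }


# Hypothetical counts: 40 control and 40 experiment samples (not from the thesis).
metrics = classification_metrics(tp=36, fn=4, tn=29, fp=11)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

When the two groups are the same size, accuracy is the mean of the two class-wise rates, and (90% + 72.5%) / 2 = 81.25% agrees with the reported average accuracy. The abstract does not state the group sizes, so equal sizes are an assumption here, and a single confusion matrix need not reproduce the averaged F-measure or AUC.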
APA, Harvard, Vancouver, ISO, and other styles
46

Santos, Diogo André Passagem dos 1987. "Comparative analysis of 454 pyrosequencing data from coffee transcriptomes." Master's thesis, 2011. http://hdl.handle.net/10451/4944.

Full text
Abstract:
Master's thesis. Biology (Bioinformatics and Computational Biology). Universidade de Lisboa, Faculdade de Ciências, 2011
Understanding the mechanisms behind the resistance of coffee plants (Coffea spp.) to leaf rust (caused by Hemileia vastatrix) is of vital importance for breeding coffee varieties with durable resistance. Resistance is being lost as new rust races appear, but some genotypes, such as HDT832/2, are still resistant to all known H. vastatrix races. Previous studies show that the resistance to H. vastatrix in this genotype shares common immunity components with nonhost resistance. 454 pyrosequencing transcriptomic data representing HDT832/2 host and nonhost resistance, along with a healthy plant control, were analyzed with the purpose of better understanding this resistance. Expressed sequence tags (ESTs) are a very common and interesting solution for transcriptomic studies because they lack the non-expressed part of the genome. The small number of reads generated for this project presents a limitation that has no established solution. To analyze this dataset, two different assembly strategies (individual assembly versus global assembly) and two different assemblers (Newbler versus MIRA) were used, and the results of all four assemblies are reported and analyzed. Assemblies were compared by assessing the number of transcripts shared by the three libraries, by BLAST searches against the NCBI nr protein and Coffea spp. EST databases, and by searching for previously studied genes. Overall, the global assembly strategy performed better than the individual strategy, and Newbler performed better than MIRA in most but not all parameters. Here we provide a good strategy for small-budget transcriptome projects to optimize their data, and we present an annotated transcriptome of the resistance response of coffee line HDT832/2 to rust in host and nonhost interactions.
Coffee is one of the most important products on the international market, and its production and export underpin the economies of more than 60 countries, most of them developing countries. Coffee growing is an expanding industry that struggles with the need to increase production without raising costs excessively. Coffee cultivation (particularly of Arabica coffee, Coffea arabica) is affected on a large scale by plant diseases that destroy or weaken the plants. Among these diseases, coffee leaf rust (orange rust), caused by the fungus Hemileia vastatrix Berkeley & Broome, is one of the most important; it affects coffee-growing countries worldwide and causes losses of 30% if no control measures are applied. H. vastatrix is a biotrophic fungus that depends on living host cells to feed and complete its life cycle. Although this disease can be controlled by applying plant protection products, the associated economic and environmental costs are high, so growing resistant varieties is a more sustainable option. The identification and characterization of populations of Híbrido de Timor (HDT, a natural hybrid between C. arabica and C. canephora) allowed the selection of plants with a broad spectrum of resistance, which were subsequently used as resistance donors in coffee breeding programmes in several countries. However, these resistances have been challenged by the emergence of new races of the fungus, and line HDT832/2, selected at the Centro de Investigação das Ferrugens do Cafeeiro (CIFC), is of interest because it remains resistant to all known races of H. vastatrix. Since one of the most durable forms of resistance in plants is the resistance of an entire plant species to all genetic variants of a pathogen (nonhost resistance), it was important to compare the resistance of coffee to H. vastatrix (host resistance) with nonhost resistance, in this case between HDT832/2 and Uromyces vignae, the fungus responsible for cowpea rust. A previous study of 8 genes involved in plant immunity mechanisms suggests that the resistance of HDT832/2 to these two pathogens has shared components. That study also clarified the chronology of infection, making it possible to identify the time points with the highest expression of plant response genes. Accordingly, HDT832/2 leaves were inoculated with each fungus separately and, together with a control sample, RNA samples were collected and sent for cDNA pyrosequencing with 454 technology. The analysis of Expressed Sequence Tags (ESTs) is an interesting alternative for studying non-model organisms such as coffee. Moreover, because only the expressed portion of the genome is sequenced, the amount of data to analyse is much smaller and differences in expression under distinct biological conditions can be detected and studied. Because this was a small-scale project, only one sequencing run was performed for the cDNA of the 3 conditions under study, so the number of sequences for each condition was low. It was therefore necessary to study the best way to assemble these sequences, and two assembly strategies and two different assemblers were examined. The two assembly strategies differed in whether or not the sequences were separated by condition. In the individual assembly strategy, each set of sequences from a given condition was assembled only with sequences from the same condition. By contrast, to obtain a set of sequences with greater coverage of the transcriptome, all the original sequences were pooled into a single assembly, termed the global assembly. The choice of assembly program also has a strong influence on the final result, so the results of Newbler v2.5 and MIRA v.2.3.0 were compared. In this way four different assemblies were obtained and then compared. To carry out the comparison, in the absence of a complete coffee genome, several forms of analysis were chosen. An important feature expected in this type of data is a large number of sequences shared by the 3 conditions, rather than sequences that appear in only one condition. In the global assemblies it was possible to trace the origin of the sequences used to build the final sequences, and both Newbler and MIRA produced assemblies in which most sequences derive from all three conditions. For the individual assemblies, to decide that a sequence was the same as one from another condition, we used the results of mapping them, via Blastx, against the NCBI protein database (nr-protein database). Here it could be seen that the lack of coverage in each condition's set of sequences led to a distribution of the data very different from the one expected. To compare the two methods more easily, the sequences from the individual assemblies with the same best hit in the blast against nr were assembled together so that, for each assembler, there was only one set of sequences for each method. Each of these sets of sequences was then mapped, via Blastn, against the coffee EST sequences available in the NCBI databases. The global assemblies performed better than the individual assemblies, and Newbler obtained a higher percentage of annotated sequences than MIRA, especially when only results with full homology are considered. The presence and number of homologues of 10 coffee genes previously characterized by RT-qPCR in these same samples were also studied. While the Newbler assemblies were able to reconstruct only 7 of the 10 genes, the global assembly with MIRA reconstructed all 10. However, Newbler reconstructs the genes completely: in only 3 cases is a gene split across different sequences, and even then these are grouped in the same isogroup. MIRA, on the other hand, has 6 of the 10 genes split across different sequences, and often the same gene is represented by numerous sequences. It was thus possible to conclude that the global assembly strategy is better than individual assembly of the sequences, and that Newbler is better than MIRA in most of the parameters evaluated. The sequences from the two programs were then mapped against the NCBI nr database using Blastx and subsequently annotated with GO terms using Blast2go. The Newbler assembly achieves a higher percentage of sequences with hits in the nr database and a larger number of annotated sequences. This work made it possible to develop an assembly strategy that allows low-budget projects to study the transcriptome of a non-model species, and it made available, for more detailed future analysis, the transcriptome expressed by coffee leaves in a situation of host resistance (resistance to H. vastatrix) and of nonhost resistance (resistance to U. vignae). A better way of mapping the sequences assembled by Newbler is needed. In addition, the combined use of the results of the two assemblers may lead to a better final result. An extensive analysis of the results reported here may lead to a better understanding of the resistance of line HDT832/2 to H. vastatrix and contribute to its future maintenance and manipulation.
APA, Harvard, Vancouver, ISO, and other styles
47

Howard, Brian Edward. "Methods for accurate analysis of high-throughput transcriptome data." 2009. http://www.lib.ncsu.edu/theses/available/etd-10132009-213553/unrestricted/etd.pdf.

Full text
APA, Harvard, Vancouver, ISO, and other styles
48

"Development of bioinformatics platforms for methylome and transcriptome data analysis." 2014. http://library.cuhk.edu.hk/record=b6115790.

Full text
Abstract:
High-throughput massively parallel sequencing technologies, also known as next-generation sequencing (NGS), have greatly accelerated biological and medical research. With the ever-growing throughput and complexity of NGS technologies, bioinformatics methods and tools are urgently needed to analyze the large amounts of data and to discover the meaningful information behind them. In this thesis, I mainly worked on developing bioinformatics algorithms for two research fields: DNA methylation data analysis and large intergenic noncoding RNA (lincRNA) discovery, where NGS technologies are extensively employed and novel bioinformatics algorithms are much needed.
DNA methylation is an important epigenetic modification that regulates gene transcription. Whole-genome bisulfite sequencing (BS-seq) is one of the most precise methods for studying DNA methylation, allowing whole-methylome research at single-base resolution. To analyze the large amount of data generated by BS-seq experiments, I co-developed and optimized Methy-Pipe, an integrated bioinformatics pipeline that performs both sequencing read alignment and methylation state decoding. Furthermore, I developed a novel algorithm for mining differentially methylated regions (DMRs), which can be used for large-scale methylation marker discovery. Methy-Pipe has been routinely used in our laboratory for methylomic studies, including non-invasive prenatal diagnosis (NIPD) and early cancer detection in human plasma.
Large intergenic noncoding RNAs (lincRNAs) are an important novel family of gene regulators involved in many biological processes, such as post-transcriptional regulation, splicing, and aging. Because lincRNA expression is highly tissue-specific, a large proportion of lincRNAs remains undiscovered. Whole-transcriptome shotgun sequencing, also known as RNA-seq, combined with de novo or ab initio assembly, promises large-scale discovery of novel lincRNAs and hence a complete transcriptome catalog. However, efficiently and accurately identifying novel lincRNAs in large transcriptome datasets remains a bioinformatics challenge. To fill this gap, I developed two bioinformatics tools: (I) iSeeRNA, for distinguishing lincRNAs from mRNAs, and (II) sebnif, for comprehensive filtering to obtain a high-quality lincRNA list. Both tools have been used in various biological systems and have shown satisfactory performance.
In summary, I have developed several bioinformatics algorithms that help researchers take advantage of the strengths of NGS technologies, particularly in methylome and transcriptome studies, and explore the biology behind large amounts of sequencing data.
Detailed summary in vernacular field only.
Sun, Kun.
Thesis (Ph.D.)--Chinese University of Hong Kong, 2014.
Includes bibliographical references (leaves 118-126).
Abstracts also in Chinese.
49

"Transcriptome analysis and applications based on next-generation RNA sequencing data." 2012. http://library.cuhk.edu.hk/record=b5549664.

Abstract:
The recent development of next-generation RNA sequencing, termed 'RNA-Seq', has offered an opportunity to explore RNA transcripts across the whole transcriptome. As a revolutionary method, RNA-Seq can not only measure transcript abundance precisely but also discover novel transcribed content and uncover unknown regulatory mechanisms. Meanwhile, combining different levels of next-generation sequencing, such as genome sequencing and methylome sequencing, provides a powerful tool for novel discovery in a biological context.
My PhD study focuses on the analysis of next-generation sequencing (NGS) data, especially RNA-Seq data. It comprises three parts: analysis tool development, data analysis, and mechanistic study.
The analysis of massive NGS data is a great challenge. Many existing general aligners (as opposed to splice-aware alignment tools) are capable of mapping millions of sequencing reads onto a reference genome. However, they are designed neither for reads that span splice junctions (spliced reads) nor for reads that match multiple locations in the reference genome (multireads). Hence, we have developed an ab initio mapping method, ABMapper, that uses a two-seed strategy. Benchmark results show that ABMapper achieves higher accuracy and recall than comparable tools, TopHat and SpliceMap. A further problem is selecting the most probable location for spliced reads and multireads. When expression levels are calculated, these reads are typically either assigned at random to one of the possible locations or discarded altogether, which biases downstream analyses such as differential expression analysis and alternative splicing analysis. To determine the location of spliced reads and multireads rationally, we have proposed a maximum likelihood estimation method based on a geometric-tail (GT) distribution of intron length. This probabilistic model handles splice junctions that fall between the two reads of a paired-end (PE) pair as well as those contained within one or both reads. Based on this model, multiple alignments of reads within a PE pair can be properly resolved.
The accumulation of NGS data provides rich resources for in-depth discovery of biological significance. We integrated RNA-Seq data and methylation sequencing data to build a model that predicts gene expression levels from DNA methylation patterns. We found that DNA methylation predicts gene expression fairly accurately, with accuracy reaching up to 78%. We also found that DNA methylation in the gene body is the most important feature in these models, even more informative than methylation near the promoter. Finally, a feature overlap network built from an optimal subset, chosen by feature selection from the combination of all methylation patterns and CpG patterns, indicates collaborative regulation of gene expression by DNA methylation patterns.
In addition to developing algorithms for RNA-Seq data analysis, we performed transcriptome analysis on zebrafish. The analysis of differentially expressed genes and pathways involved after calycosin treatment, combined with other experimental evidence such as fluorescence microscopy and quantitative real-time polymerase chain reaction (qPCR), demonstrated the proangiogenic effects of calycosin in vivo.
In summary, this thesis details my work on NGS data analysis, the discovery of biological significance using data-mining algorithms, and transcriptome analysis.
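The idea of choosing a read location by maximum likelihood over intron lengths can be illustrated with a minimal Python sketch. It assumes a plain geometric distribution as a stand-in for the thesis's geometric-tail (GT) model, whose exact form is not given in this abstract. The function names, training lengths, and candidate labels are all hypothetical.

```python
import math

def fit_geometric(intron_lengths):
    """MLE of p for a geometric distribution on lengths 1, 2, 3, ...: p = n / sum."""
    if not intron_lengths or min(intron_lengths) < 1:
        raise ValueError("need at least one intron length, all >= 1")
    return len(intron_lengths) / sum(intron_lengths)

def log_likelihood(length, p):
    """log P(L = length) = log p + (length - 1) * log(1 - p)."""
    if p >= 1.0:
        # Degenerate fit: every training intron had length 1.
        return 0.0 if length == 1 else float("-inf")
    return math.log(p) + (length - 1) * math.log1p(-p)

def pick_location(candidates, p):
    """Return the (label, implied_intron_length) pair with the highest log-likelihood."""
    return max(candidates, key=lambda c: log_likelihood(c[1], p))

# Hypothetical training introns and candidate alignments for one spliced read.
known_introns = [100, 200, 300]
p_hat = fit_geometric(known_introns)  # 3 / 600 = 0.005
candidates = [("chr1:1000", 150), ("chr5:2000", 50000)]
print(pick_location(candidates, p_hat))
```

With a single geometric distribution the likelihood decreases monotonically with length, so this sketch always prefers the candidate implying the shortest intron. A richer length model, such as the GT distribution the abstract names, need not behave that way.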
Detailed summary in vernacular field only.
Lou, Shaoke.
Thesis (Ph.D.)--Chinese University of Hong Kong, 2012.
Includes bibliographical references (leaves 135-146).
Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web.
Abstract also in Chinese.
Abstract (in Chinese)
Acknowledgement
Chapter 1 --- Introduction
Chapter 1.1 --- Bioinformatics
Chapter 1.2 --- Bioinformatics application
Chapter 1.3 --- Motivation
Chapter 1.4 --- Objectives
Chapter 1.5 --- Thesis outline
Chapter 2 --- Background
Chapter 2.1 --- Biological and biotechnology background
Chapter 2.1.1 --- Central dogma and biology ABC
Chapter 2.1.2 --- Transcription
Chapter 2.1.3 --- Splicing and Alternative Splicing
Chapter 2.1.4 --- Next-generation Sequencing
Chapter 2.1.5 --- RNA-Seq
Chapter 2.2 --- Computational background
Chapter 2.2.1 --- Approximate string matching and read mapping
Chapter 2.2.2 --- Read mapping algorithms and tools
Chapter 2.2.3 --- Spliced alignment tools
Chapter 3 --- ABMapper: a two-seed based spliced alignment tool
Chapter 3.1 --- Introduction
Chapter 3.2 --- State-of-the-art
Chapter 3.3 --- Problem formulation
Chapter 3.4 --- Methods
Chapter 3.5 --- Results
Chapter 3.5.1 --- Benchmark test
Chapter 3.5.2 --- Complexity analysis
Chapter 3.5.3 --- Comparison with other tools
Chapter 3.6 --- Discussion and conclusion
Chapter 4 --- Geometric-tail (GT) model for rational selection of RNA-Seq read location
Chapter 4.1 --- Introduction
Chapter 4.2 --- State-of-the-art
Chapter 4.3 --- Problem formulation
Chapter 4.4 --- Algorithms
Chapter 4.5 --- Results
Chapter 4.5.1 --- Workflow of GT MLE method
Chapter 4.5.2 --- GT distribution and insert-size distribution
Chapter 4.5.3 --- Multiread analysis
Chapter 4.5.4 --- Splice-site comparison
Chapter 4.6 --- Discussion and conclusion
Chapter 5 --- Explore relationship between methylation patterns and gene expression
Chapter 5.1 --- Introduction
Chapter 5.2 --- State-of-the-art
Chapter 5.3 --- Problem formulation
Chapter 5.4 --- Methods
Chapter 5.4.1 --- NGS sequencing and analysis
Chapter 5.4.2 --- Data preparation and transformation
Chapter 5.4.3 --- Random forest (RF) classification and regression
Chapter 5.5 --- Results
Chapter 5.5.1 --- Genome wide profiling of methylation
Chapter 5.5.2 --- Aggregation plot of methylation levels at different regions
Chapter 5.5.3 --- Scatterplot between methylation and gene expression
Chapter 5.5.4 --- Predictive model of gene expression using DNA methylation features
Chapter 5.5.5 --- Comb-model based on the full dataset
Chapter 5.6 --- Discussion and conclusion
Chapter 6 --- RNA-Seq data analysis and applications
Chapter 6.1 --- Transcriptional Profiling of Angiogenesis Activities of Calycosin in Zebrafish
Chapter 6.1.1 --- Introduction
Chapter 6.1.2 --- Background
Chapter 6.1.3 --- Materials and methods and ethics statement
Chapter 6.1.4 --- Results
Chapter 6.1.5 --- Conclusion
Chapter 6.2 --- An integrated web medicinal materials DNA database: MMDBD (Medicinal Materials DNA Barcode Database)
Chapter 6.2.1 --- Introduction
Chapter 6.2.2 --- Background
Chapter 6.2.3 --- Construction and content
Chapter 6.2.4 --- Utility and discussion
Chapter 6.2.5 --- Conclusion and future development
Chapter 7 --- Conclusion
Chapter 7.1 --- Conclusion
Chapter 7.2 --- Future work
Appendix
Chapter A1 --- Descriptive analysis of trio data
Chapter A2 --- Whole genome methylation level profiling
Chapter A3 --- Global sliding window correlation between individuals
Chapter A4 --- Features selected after second-run filtering
Bibliography
Chapter A --- Publications
Reference
50

Puthiyedth, Nisha. "A novel feature selection approach for data integration analysis: applications to transcriptomics study." Thesis, 2016. http://hdl.handle.net/1959.13/1322449.

Abstract:
Research Doctorate - Doctor of Philosophy (PhD)
Meta-analysis has become a popular method for identifying novel biomarkers in the field of medical research. Meta-analysis has been widely applied to genome-wide association and transcriptomic studies due to the availability of datasets in the public domain. Joint analysis of multiple datasets has become a common technique for increasing statistical power in detecting biomarkers reported in smaller studies. The approach generally followed relies on the fact that as the total number of samples increases, greater power to detect associations of interest is anticipated. Integrating available information from different datasets to generate a combined result seems reasonable and promising. Consequently, there is a need for computationally based integration methods that evaluate multiple independent datasets investigating a common theme or disorder. This raises a variety of issues in the analysis of such data and leads to more complications than are seen with standard meta-analysis, including diverse experimental platforms and complex data structures. I illustrate these ideas using microarray datasets from multiple studies and propose an integrative methodology to combine datasets generated using different platforms. Having combined the data, the main challenge is to choose a subset of features that represent the combined dataset in a particular aspect. While the approach is well established in biostatistics, the introduction of new combinatorial optimisation models to address this issue has not been explored in depth. In 2004, a new feature selection approach based on a combinatorial optimisation method was proposed, entitled the (α,β)-k Feature Set problem approach. The main advantage of this approach over ranking methods for selecting individual features is that the features are evaluated as groups instead of on the basis of their individual performance. 
The (α,β)-k Feature Set problem approach was originally defined with a single uniform dataset in mind and, conceived in this way, is not readily applicable to integrated datasets. An extended version of this approach handles integrated datasets in a consistent manner and selects features that differentiate sample pairs across datasets. Applying an (α,β)-k Feature Set problem-based approach to meta-analysis thus helps to identify the best set of features from a combined dataset, allowing researchers to reveal the genetic pathways that contribute to the development of a disease. I propose an extended version of the (α,β)-k Feature Set problem approach that aims to find a set of genes whose expression levels may be used to identify a joint core subset of genes that putatively play an important role in two conditions: prostate cancer and Alzheimer's disease. The results of the current study suggest that the proposed method is an efficient meta-analysis method capable of identifying biologically relevant genes that other methods fail to identify. As the amount of data increases, this novel method can be applied to find additional genes and pathways that are significant in these diseases, which may provide new insights into the disease mechanisms and contribute towards understanding, prevention and cures.
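To make concrete the idea of evaluating features as a group, here is a minimal Python sketch of the single-dataset (α,β)-k Feature Set condition as it is commonly stated: every pair of samples from different classes must differ on at least α of the chosen features, and every pair from the same class must agree on at least β of them. This reading of the definition, the brute-force search, and the toy binary data are assumptions for illustration. The sketch is not the extended multi-dataset method proposed in the thesis.

```python
from itertools import combinations

def is_alpha_beta_feature_set(samples, labels, features, alpha, beta):
    """Check the (alpha, beta) condition for one candidate group of feature indices."""
    for i, j in combinations(range(len(samples)), 2):
        differing = sum(samples[i][f] != samples[j][f] for f in features)
        agreeing = len(features) - differing
        if labels[i] != labels[j]:
            if differing < alpha:   # different classes: not separated enough
                return False
        elif agreeing < beta:       # same class: not similar enough
            return False
    return True

def smallest_feature_set(samples, labels, alpha, beta):
    """Brute-force search for a smallest valid feature group (tiny inputs only)."""
    n_features = len(samples[0])
    for k in range(1, n_features + 1):
        for subset in combinations(range(n_features), k):
            if is_alpha_beta_feature_set(samples, labels, subset, alpha, beta):
                return subset
    return None

# Hypothetical binarised expression profiles: rows are samples, columns are genes.
samples = [
    (1, 0, 1, 0),
    (1, 0, 0, 0),
    (0, 1, 1, 1),
    (0, 1, 0, 1),
]
labels = ["case", "case", "control", "control"]
print(smallest_feature_set(samples, labels, alpha=2, beta=1))
```

The exhaustive search is exponential in the number of features and is shown only to clarify the condition. Realistic gene-expression data would need the combinatorial optimisation methods the abstract refers to.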