Theses on the topic "Transcriptomic data analysis"
Consult the top 50 theses for your research on the topic "Transcriptomic data analysis".
Xu, Huan. "Controlling false positive rate in network analysis of transcriptomic data". University of Cincinnati / OhioLINK, 2019. http://rave.ohiolink.edu/etdc/view?acc_num=ucin156267322069819.
Kmetzsch, Virgilio. "Multimodal analysis of neuroimaging and transcriptomic data in genetic frontotemporal dementia". Electronic Thesis or Diss., Sorbonne université, 2022. https://accesdistant.sorbonne-universite.fr/login?url=https://theses-intra.sorbonne-universite.fr/2022SORUS279.pdf.
Frontotemporal dementia (FTD) represents the second most common type of dementia in adults under the age of 65. Currently, there are no treatments that can cure this condition. In this context, it is essential that biomarkers capable of assessing disease progression are identified. This thesis has two objectives. First, to analyze the expression patterns of microRNAs taken from blood samples of patients, asymptomatic individuals who have certain genetic mutations causing FTD, and controls, to identify whether the expressions of some microRNAs correlate with mutation status and disease progression. Second, this work aims at proposing methods for integrating cross-sectional data from microRNAs and neuroimaging to estimate disease progression. We conducted three studies. Initially, we focused on plasma samples from C9orf72 expansion carriers. We identified four microRNAs whose expressions correlated with the clinical status of the participants. Next, we tested all microRNA signatures identified in the literature as potential biomarkers of FTD or amyotrophic lateral sclerosis (ALS), in two groups of individuals. Finally, in our third work, we proposed a new approach, using a supervised multimodal variational autoencoder, that estimates a disease progression score from cross-sectional microRNA expression and neuroimaging datasets with small sample sizes. The work conducted in this interdisciplinary thesis showed that it is possible to use non-invasive biomarkers, such as circulating microRNAs and magnetic resonance imaging, to assess the progression of rare neurodegenerative diseases such as FTD and ALS.
Caterino, Cinzia. "The aging synapse: an integrated proteomic and transcriptomic analysis". Doctoral thesis, Scuola Normale Superiore, 2019. http://hdl.handle.net/11384/86004.
Captier, Nicolas. "Multimodal analysis of radiological, pathological, and transcriptomic data for the prediction of immunotherapy outcome in Non-Small Cell Lung Cancer patients". Electronic Thesis or Diss., Université Paris sciences et lettres, 2024. http://www.theses.fr/2024UPSLS012.
Overall survival of patients with metastatic non-small cell lung cancer (NSCLC) has been increasing with the use of anti-PD-1 immune checkpoint inhibitors. However, the duration of response remains highly variable between patients, and only 20-30% of patients are alive at 2 years. Thus, new biomarkers for predicting response to treatment and patient outcomes are still needed to guide therapeutic decisions. In my PhD, we investigated machine learning approaches to leverage radiological, transcriptomic, and pathological data, integrating them into powerful multimodal models that might improve the limited predictive power of routine clinical data. My doctoral research stood at the heart of a multidisciplinary project funded by the Fondation ARC call «SIGN’IT 2020—Signatures in Immunotherapy». It brought together several research teams of Institut Curie alongside a team from Institut du thorax, led by Professor Nicolas Girard, in charge of patient management and data collection. We built a new multimodal cohort of 317 metastatic NSCLC patients treated with first-line immunotherapy alone or combined with chemotherapy. At baseline, we collected clinical information from routine care, 18F-FDG PET/CT scans, digitized pathological slides from the initial diagnosis, and bulk RNA-seq profiles from solid biopsies. Immunotherapy outcome was monitored with Overall Survival (OS) and Progression-Free Survival (PFS). Together with Irène Buvat and Emmanuel Barillot, whose teams hold significant expertise in the analysis of medical images and RNAseq tumor profiles, respectively, we initially focused on designing computational tools to extract relevant and interpretable information from these two data modalities. We notably developed a Python tool to apply Independent Component Analysis (ICA) on omics data and stabilize the results through multiple runs.
We then explored the potential of stabilized ICA to extract powerful and biologically relevant transcriptomic features for the prediction of patient outcome. For medical images, and in particular 18F-FDG PET scans, we investigated the potential of radiomic approaches to characterize the metastatic disease at the whole-body level and design novel predictive features. We designed a Python explanation tool, based on Shapley values, to highlight the contribution of each individual metastasis to the prediction of radiomic models that take such whole-body features as input. A substantial portion of my PhD was devoted to the integration of clinical, radiomic, and transcriptomic features, as well as pathomic features extracted from digitized pathological slides (with the assistance of Thomas Walter’s team). We conducted a thorough comparison of the predictive capabilities of the different multimodal combinations using various state-of-the-art learning algorithms and integration methods. We devised strategies to overcome the many challenges associated with multimodal integration within our dataset, including handling missing modalities for numerous patients, dealing with a modest cohort size in comparison to the high dimensionality of the data, and ensuring a fair comparison of all the possible multimodal combinations. We especially focused on highlighting the potential of multimodal approaches to enhance patient risk stratification with respect to models using only clinical information collected during routine care.
Schmidt, Florian [Verfasser] and Marcel Holger [Akademischer Betreuer] Schulz. "Applications, challenges and new perspectives on the analysis of transcriptional regulation using epigenomic and transcriptomic data / Florian Schmidt ; Betreuer: Marcel Holger Schulz". Saarbrücken : Saarländische Universitäts- und Landesbibliothek, 2019. http://d-nb.info/1196090173/34.
Schmidt, Florian [Verfasser] and Marcel Holger [Akademischer Betreuer] Schulz. "Applications, challenges and new perspectives on the analysis of transcriptional regulation using epigenomic and transcriptomic data / Florian Schmidt ; Betreuer: Marcel Holger Schulz". Saarbrücken : Saarländische Universitäts- und Landesbibliothek, 2019. http://nbn-resolving.de/urn:nbn:de:bsz:291--ds-287773.
Czerwińska, Urszula. "Unsupervised deconvolution of bulk omics profiles : methodology and application to characterize the immune landscape in tumors Determining the optimal number of independent components for reproducible transcriptomic data analysis Application of independent component analysis to tumor transcriptomes reveals specific and reproducible immune-related signals A multiscale signalling network map of innate immune response in cancer reveals signatures of cell heterogeneity and functional polarization". Thesis, Sorbonne Paris Cité, 2018. http://www.theses.fr/2018USPCB075.
Tumors are engulfed in a complex microenvironment (TME) including tumor cells, fibroblasts, and a diversity of immune cells. Currently, a new generation of cancer therapies based on modulation of the immune system response is in active clinical development with first promising results. Therefore, understanding the composition of the TME in each tumor case is critically important to make a prognosis on the tumor progression and its response to treatment. However, we lack reliable and validated quantitative approaches to characterize the TME in order to facilitate the choice of the best existing therapy. One part of this challenge is to be able to quantify the cellular composition of a tumor sample (called the deconvolution problem in this context), using its bulk omics profile (global quantitative profiling of certain types of molecules, such as mRNA or epigenetic markers). In recent years, there has been a remarkable explosion in the number of methods approaching this problem in several different ways. Most of them use pre-defined molecular signatures of specific cell types and extrapolate this information to previously unseen contexts. This can bias the TME quantification in those situations where the context under study is significantly different from the reference. In theory, under certain assumptions, it is possible to separate complex signal mixtures, using classical and advanced methods of source separation and dimension reduction, without pre-existing source definitions. If such an approach (unsupervised deconvolution) can feasibly be applied to bulk omics profiles of tumor samples, then this would make it possible to avoid the above-mentioned contextual biases and provide insights into the context-specific signatures of cell types. In this work, I developed a new method called DeconICA (Deconvolution of bulk omics datasets through Immune Component Analysis), based on the blind source separation methodology.
DeconICA aims to decipher and quantify the biological signals shaping omics profiles of tumor samples or normal tissues. A particular focus of my study was on the immune system-related signals and discovering new signatures of immune cell types. In order to make my work more accessible, I implemented the DeconICA method as an R package named "DeconICA". By applying this software to the standard benchmark datasets, I demonstrated that DeconICA is able to quantify immune cells with accuracy comparable to published state-of-the-art methods but without defining cell type-specific signature genes a priori. The implementation can work with existing deconvolution methods based on matrix factorization techniques such as Independent Component Analysis (ICA) or Non-Negative Matrix Factorization (NMF). Finally, I applied DeconICA to a big corpus of data containing more than 100 transcriptomic datasets composed of, in total, over 28000 samples of 40 tumor types generated by different technologies and processed independently. This analysis demonstrated that ICA-based immune signals are reproducible between datasets and that three major immune cell types (T-cells, B-cells and Myeloid cells) can be reliably identified and quantified. Additionally, I used the ICA-derived metagenes as context-specific signatures in order to study the characteristics of immune cells in different tumor types. The analysis revealed a large diversity and plasticity of immune cells, both dependent on and independent of tumor type. Some conclusions of the study can be helpful in the identification of new drug targets or biomarkers for immunotherapy of cancer.
Owen, Anne M. "Widescale analysis of transcriptomics data using cloud computing methods". Thesis, University of Essex, 2016. http://repository.essex.ac.uk/16125/.
Hernandez-Ferrer, Carles 1987. "Bioinformatic tools for exposome data analysis : application to human molecular signatures of ultraviolet light effects". Doctoral thesis, Universitat Pompeu Fabra, 2017. http://hdl.handle.net/10803/572046.
Most common diseases are caused by a combination of genetic, environmental and lifestyle factors. These diseases are referred to as complex diseases. Examples of this type of disease are obesity, asthma, hypertension and diabetes. Several lines of empirical evidence suggest that exposures are necessary determinants of complex disease operating in a causal background of genetic diversity. Moreover, environmental factors have long been implicated as major contributors to the global disease burden. This leads to the formulation of the exposome, which contains any exposure to which an individual is subjected from conception to death. The study of the underlying mechanics that link the exposome with human health is an emerging research field with a strong potential to provide new insights into disease etiology. The first part of this thesis is focused on ultraviolet radiation (UVR) exposure. UVR exposure occurs from both natural and artificial sources. UVR includes three subtypes of radiation according to its wavelength (UVA 315-400 nm, UVB 315-295 nm, and UVC 295-200 nm). While the main natural source of UVR is the Sun, UVC radiation does not reach Earth's surface because of its absorption by the stratospheric ozone layer. Exposures to UVR therefore typically consist of a mixture of UVA (95%) and UVB (5%). Effects of UVR on humans can be both beneficial and detrimental, depending on the amount and form of UVR. Detrimental and acute effects of UVR include erythema, pigment darkening, delayed tanning and thickening of the epidermis. Repeated UVR-induced injury to the skin may ultimately predispose one to the chronic effects of photoaging, immunosuppression, and photocarcinogenesis. The beneficial effect of UVR is the cutaneous synthesis of vitamin D. Vitamin D is necessary to maintain physiologic calcium and phosphorus for normal bone mineralization and to prevent rickets, osteomalacia, and osteoporosis.
But the exposome paradigm is to work with multiple exposures at a time and with one or more health outcomes rather than to focus on single-exposure analysis. This approach tends to give a more accurate snapshot of the reality that we live in complex environments. The second part is therefore focused on tools to characterize and analyze the exposome and to test its effects in multiple intermediate biological layers, providing insights into the underlying molecular mechanisms linking environmental exposures to health outcomes.
Daub, Carsten O. "Analysis of integrated transcriptomics and metabolomics data a systems biology approach /". [S.l. : s.n.], 2004. http://pub.ub.uni-potsdam.de/2004/0025/daub.pdf.
Daub, Carsten Oliver. "Analysis of integrated transcriptomics and metabolomics data : a systems biology approach". Phd thesis, Universität Potsdam, 2004. http://opus.kobv.de/ubp/volltexte/2005/138/.
Recent high-throughput technologies enable the acquisition of a variety of complementary data and imply regulatory networks on the systems biology level. A common approach to the reconstruction of such networks is the cluster analysis which is based on a similarity measure.
We use the information theoretic concept of the mutual information, that has been originally defined for discrete data, as a measure of similarity and propose an extension to a commonly applied algorithm for its calculation from continuous biological data. We compare our approach to previously existing algorithms. We develop a performance optimised software package for the application of the mutual information to large-scale datasets. Furthermore, we design and implement a web-based service for the analysis of integrated data measured with different technologies. Application to biological data reveals biologically relevant groupings and reconstructed signalling networks show agreements with physiological findings.
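To illustrate the similarity measure this abstract describes, here is a minimal sketch that estimates mutual information between two continuous expression profiles by equal-width binning. It shows only the plain histogram estimator, not the extension proposed in the thesis; the function name `binned_mutual_information`, the bin count, and the simulated profiles are assumptions made for the example.

```python
import numpy as np

def binned_mutual_information(x, y, bins=8):
    """Estimate I(X;Y) in bits from two continuous vectors via equal-width binning."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()                 # joint probabilities per bin pair
    px = pxy.sum(axis=1, keepdims=True)       # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)       # marginal of y, shape (1, bins)
    nonzero = pxy > 0                         # skip empty cells (0 * log 0 = 0)
    return float(np.sum(pxy[nonzero] * np.log2(pxy[nonzero] / (px @ py)[nonzero])))

# Simulated profiles (hypothetical data, for illustration only)
rng = np.random.default_rng(0)
gene_a = rng.normal(size=2000)
gene_b = gene_a + 0.1 * rng.normal(size=2000)   # strongly dependent on gene_a
gene_c = rng.normal(size=2000)                  # independent of gene_a

mi_dependent = binned_mutual_information(gene_a, gene_b)
mi_independent = binned_mutual_information(gene_a, gene_c)
```

The binned estimate depends on the number of bins and is biased upward for small samples, which is one reason refined estimators for continuous data are of interest.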
Östman, Josephine. "The fertile ovary transcriptome and proteome". Thesis, Uppsala universitet, Institutionen för kvinnors och barns hälsa, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-447785.
Hu, Yin. "A NOVEL COMPUTATIONAL FRAMEWORK FOR TRANSCRIPTOME ANALYSIS WITH RNA-SEQ DATA". UKnowledge, 2013. http://uknowledge.uky.edu/cs_etds/17.
Kelso, Janet. "The development and application of informatics-based systems for the analysis of the human transcriptome". Thesis, University of the Western Cape, 2003. http://etd.uwc.ac.za/index.php?module=etd&action=viewtitle&id=gen8Srv25Nme4_5101_1185442672.
Despite the fact that the sequence of the human genome is now complete it has become clear that the elucidation of the transcriptome is more complicated than previously expected. There is mounting evidence for unexpected and previously underestimated phenomena such as alternative splicing in the transcriptome. As a result, the identification of novel transcripts arising from the genome continues. Furthermore, as the volume of transcript data grows it is becoming increasingly difficult to integrate expression information which is from different sources, is stored in disparate locations, and is described using differing terminologies. Determining the function of translated transcripts also remains a complex task. Information about the expression profile (the location and timing of transcript expression) provides evidence that can be used in understanding the role of the expressed transcript in the organ or tissue under study, or in developmental pathways or disease phenotype observed.
In this dissertation I present novel computational approaches with direct biological applications to two distinct but increasingly important areas of gene expression research. The first addresses detection and characterisation of alternatively spliced transcripts. The second is the construction of a hierarchical controlled vocabulary for gene expression data and the annotation of expression libraries with controlled terms from the hierarchies. In the final chapter the biological questions that can be approached, and the discoveries that can be made using these systems, are illustrated with a view to demonstrating how the application of informatics can both enable and accelerate biological insight into the human transcriptome.
Windhorst, Anita Cornelia [Verfasser]. "Transcriptome analysis in preterm infants developing bronchopulmonary dysplasia : data processing and statistical analysis of microarray data / Anita Cornelia Windhorst". Gießen : Universitätsbibliothek, 2015. http://d-nb.info/1078220395/34.
Bécavin, Christophe. "Dimensionaly reduction and pathway network analysis of transcriptome data : application to T-cell characterization". Paris, Ecole normale supérieure, 2010. http://www.theses.fr/2010ENSUBS02.
Cicek, A. Ercument. "METABOLIC NETWORK-BASED ANALYSES OF OMICS DATA". Case Western Reserve University School of Graduate Studies / OhioLINK, 2013. http://rave.ohiolink.edu/etdc/view?acc_num=case1372866879.
Calviello, Lorenzo. "Detecting and quantifying the translated transcriptome with Ribo-seq data". Doctoral thesis, Humboldt-Universität zu Berlin, 2018. http://dx.doi.org/10.18452/18974.
The study of post-transcriptional gene regulation requires in-depth knowledge of multiple molecular processes acting on RNA, from its nuclear processing to translation and decay in the cytoplasm. With the advent of RNA-seq technologies we can now follow each of these steps with high throughput and resolution. Ribosome profiling (Ribo-seq) is a popular RNA-seq technique, which aims at monitoring the precise positions of millions of translating ribosomes, proving to be an essential tool in studying gene regulation. However, the interpretation of Ribo-seq profiles over the transcriptome is challenging, due to noisy data and to our incomplete knowledge of the translated transcriptome. In this Thesis, I present a strategy to detect translated regions from Ribo-seq data, using a spectral analysis approach aimed at detecting ribosomal translocation over the translated regions. The high sensitivity and specificity of our approach enabled us to draw a comprehensive map of translation over the human and Arabidopsis thaliana transcriptomes, uncovering the presence of known and novel translated regions. Evolutionary conservation analysis, together with large-scale proteomics evidence, provided insights on their functions, ranging from the synthesis of previously unknown proteins to other possible regulatory roles. Moreover, quantification of Ribo-seq signal over annotated transcript structures exposed translation of multiple transcripts per gene, revealing the link between translation and RNA-surveillance mechanisms. Together with a comparison of different Ribo-seq datasets in human cells and in Arabidopsis thaliana, this work comprises a set of analysis strategies for Ribo-seq data, as a window into the manifold functions of the expressed transcriptome.
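Ribosomes translocate one codon at a time, so a translated region is expected to show a three-nucleotide periodicity in its Ribo-seq profile. The sketch below measures that periodicity with a plain discrete Fourier transform, which may differ from the spectral estimator used in the thesis; the function name `period3_power_fraction` and the toy profiles are assumptions for illustration.

```python
import cmath

def period3_power_fraction(profile):
    """Fraction of non-constant spectral power that sits at a period of 3 nt.

    `profile` holds per-nucleotide read counts; its length must be a
    multiple of 3 so that period 3 falls exactly on a Fourier frequency.
    """
    n = len(profile)
    if n == 0 or n % 3 != 0:
        raise ValueError("profile length must be a positive multiple of 3")
    mean = sum(profile) / n
    centered = [v - mean for v in profile]   # remove the constant component

    def power(k):
        coeff = sum(v * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, v in enumerate(centered))
        return abs(coeff) ** 2

    # For a real signal only frequencies 1..n//2 are distinct
    powers = {k: power(k) for k in range(1, n // 2 + 1)}
    total = sum(powers.values())
    return powers[n // 3] / total if total > 0 else 0.0

# Toy profiles (hypothetical counts)
codon_like = [9, 1, 0] * 30     # strong frame preference, period 3
alternating = [5, 1] * 45       # period 2, no codon periodicity

frac_codon_like = period3_power_fraction(codon_like)
frac_alternating = period3_power_fraction(alternating)
```

A real analysis would also need a significance test for the period-3 peak, since noisy, low-coverage profiles can show spurious periodicity.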
Enjalbert, Courrech Nicolas. "Inférence post-sélection pour l'analyse des données transcriptomiques". Electronic Thesis or Diss., Université de Toulouse (2023-....), 2024. http://www.theses.fr/2024TLSES199.
In the field of transcriptomics, technological advances, such as microarrays and high-throughput sequencing, have enabled large-scale quantification of gene expression. These advances have raised statistical challenges, particularly in differential expression analysis, which aims to identify genes that significantly differentiate between two populations. However, traditional inference procedures lose their ability to control the false positive rate when biologists select a subset of genes. Post-hoc inference methods address this limitation by providing control over the number of false positives, even for arbitrary gene sets. The first contribution of this manuscript demonstrates the effectiveness of these methods for the differential analysis of transcriptomic data between two biological conditions, notably through the introduction of a linear-time algorithm for computing post-hoc bounds, adapted to the high dimensionality of the data. An interactive application was also developed to facilitate the selection and simultaneous evaluation of post-hoc bounds for sets of genes of interest. These contributions are presented in the first part of the manuscript. The technological evolution towards single-cell sequencing has raised new questions, particularly regarding the identification of genes whose expression distinguishes one cellular group from another. This issue is complex because cell groups must first be estimated using a clustering method before performing a comparative test, leading to a circular analysis. In the second part of this manuscript, we present a review of post-clustering inference methods addressing this problem, as well as a numerical comparison of multivariate and marginal approaches for cluster comparison. Finally, we explore how the use of mixture models in the clustering step can be exploited in post-clustering tests, and discuss perspectives for applying these tests to transcriptomic data.
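A post-hoc bound of the kind described above can be sketched as follows. The example assumes the Simes reference family with thresholds αk/m, a common choice that the abstract does not itself specify, and evaluates the bound naively in quadratic time rather than with the linear-time algorithm of the thesis. The function name `simes_posthoc_bound` is hypothetical.

```python
def simes_posthoc_bound(p_values_in_set, m, alpha=0.05):
    """Upper bound on the number of false positives in a selected gene set.

    Evaluates min over k of ( #{i in S : p_i >= alpha * k / m} + k - 1 ),
    capped at |S|, where m is the total number of genes tested.
    Assumption: Simes thresholds alpha * k / m (not stated in the abstract).
    """
    size = len(p_values_in_set)
    best = size                       # trivial bound: every selected gene is a false positive
    for k in range(1, size + 1):
        threshold = alpha * k / m
        above = sum(1 for p in p_values_in_set if p >= threshold)
        best = min(best, above + k - 1)
    return best

# Hypothetical selections out of m = 10 tested genes
bound_mixed = simes_posthoc_bound([0.001, 0.002, 0.5], m=10)
bound_null = simes_posthoc_bound([0.5, 0.6], m=10)
```

Because the bound holds simultaneously for all subsets, the gene set can be chosen after looking at the data without invalidating the guarantee.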
Siatkowski, Marcin [Verfasser]. "Approaches to the Analysis of Proteomics and Transcriptomics Data based on Statistical Methodology / Marcin Siatkowski". Greifswald : Universitätsbibliothek Greifswald, 2014. http://d-nb.info/1050274954/34.
Johnson, Kristen. "Software for Estimation of Human Transcriptome Isoform Expression Using RNA-Seq Data". ScholarWorks@UNO, 2012. http://scholarworks.uno.edu/td/1448.
Li, Mengbo. "Integration of Multi-Modal Data to Guide Classification in Studies of Complex Diseases". Thesis, The University of Sydney, 2020. https://hdl.handle.net/2123/22693.
Jeanmougin, Marine. "Statistical methods for robust analysis of transcriptome data by integration of biological prior knowledge". Thesis, Evry-Val d'Essonne, 2012. http://www.theses.fr/2012EVRY0029/document.
Recent advances in Molecular Biology have led biologists toward high-throughput genomic studies. In particular, the investigation of the human transcriptome offers unprecedented opportunities for understanding cellular and disease mechanisms. In this PhD, we focus on providing robust statistical methods dedicated to the treatment and the analysis of high-throughput transcriptome data. We discuss the differential analysis approaches available in the literature for identifying genes associated with a phenotype of interest and propose a comparison study. We provide practical recommendations on the appropriate method to be used based on various simulation models and real datasets. With the eventual goal of overcoming the inherent instability of differential analysis strategies, we have developed an innovative approach called DiAMS, for DIsease Associated Modules Selection. This method was applied to select significant modules of genes rather than individual genes and involves the integration of both transcriptome and protein interaction data in a local-score strategy. We then focus on the development of a framework to infer gene regulatory networks by integrating a biologically informative prior over network structures using Gaussian graphical models. This approach offers the possibility of exploring the molecular relationships between genes, leading to the identification of altered regulations potentially involved in disease processes. Finally, we apply our statistical developments to study the metastatic relapse of breast cancer.
Hindle, Matthew Morritt. "An integrated approach to enhancing functional annotation of sequences for data analysis of a transcriptome". Thesis, University of Nottingham, 2012. http://eprints.nottingham.ac.uk/12580/.
Schissler, Alfred Grant. "Contributions to Gene Set Analysis of Correlated, Paired-Sample Transcriptome Data to Enable Precision Medicine". Diss., The University of Arizona, 2017. http://hdl.handle.net/10150/624283.
Rubanova, Natalia. "MasterPATH : network analysis of functional genomics screening data". Thesis, Sorbonne Paris Cité, 2018. http://www.theses.fr/2018USPCC109/document.
In this work we developed a new exploratory network analysis method that works on an integrated network (consisting of protein-protein, transcriptional, miRNA-mRNA, and metabolic interactions) and aims at uncovering potential members of molecular pathways important for a given phenotype using hit list datasets from “omics” experiments. The method extracts a subnetwork built from the shortest paths of 4 different types (with only protein-protein interactions, with at least one transcription interaction, with at least one miRNA-mRNA interaction, with at least one metabolic interaction) between hit genes and so-called “final implementers”, biological components that are involved in molecular events responsible for final phenotypical realization (if known), or between hit genes (if “final implementers” are not known). The method calculates a centrality score for each node and each path in the subnetwork as the number of shortest paths found in the previous step that pass through the node or the path. Then, the statistical significance of each centrality score is assessed by comparing it with centrality scores in subnetworks built from the shortest paths for randomly sampled hit lists. It is hypothesized that the nodes and the paths with a statistically significant centrality score can be considered putative members of molecular pathways leading to the studied phenotype. When experimental scores and p-values are available for a large number of nodes in the network, the method can also calculate experiment-based scores for paths (as the average of the experimental scores of the nodes in the path) and experiment-based p-values (by aggregating the p-values of the nodes in the path using Fisher’s combined probability test and a permutation approach).
The method is illustrated by analyzing the results of miRNA loss-of-function screening and transcriptomic profiling of terminal muscle differentiation and of ‘druggable’ loss-of-function screening of the DNA repair process. The Java source code is available on GitHub at https://github.com/daggoo/masterPATH.
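The p-value aggregation step mentioned in this abstract relies on Fisher's combined probability test, which can be sketched in a few lines. This is a generic textbook version under an independence assumption, not the masterPATH implementation (which is in Java and also uses a permutation approach); the function name `fisher_combined_p` is hypothetical.

```python
import math

def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null; for even degrees of freedom
    the survival function has the closed form
        exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!
    Each p-value must lie in (0, 1].
    """
    k = len(p_values)
    if k == 0:
        raise ValueError("at least one p-value is required")
    half_stat = -sum(math.log(p) for p in p_values)   # X / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i                         # (X/2)^i / i!
        total += term
    return math.exp(-half_stat) * total

# Hypothetical p-values of the nodes along one path
path_p = fisher_combined_p([0.05, 0.05])
```

With a single p-value the method returns that p-value unchanged, which is a quick sanity check on any implementation.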
Ghazanfar, Shila. "Statistical approaches to harness high throughput sequencing data in diverse biological systems". Thesis, The University of Sydney, 2017. http://hdl.handle.net/2123/17268.
Jangerstad, August. "Transcription factor analysis of longitudinal mRNA expression data". Thesis, KTH, Skolan för kemi, bioteknologi och hälsa (CBH), 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-278693.
Transcription factors (TFs) are important regulatory proteins that regulate transcription by binding to cis-regulatory elements in precise but highly variable ways. The complexity of their regulatory patterns makes it difficult to determine which roles different TFs have, a task the field is still struggling with. Experimental procedures for this purpose, such as knockout experiments, are costly and time-consuming, and with the ever-increasing availability of sequencing data, methods for computing TF activity from such data have attracted great interest. The computational methods available today nevertheless fall short in several respects, which calls for a continued search for alternatives. A new tool for estimating the activity of individual TFs over time using longitudinal mRNA expression data was therefore developed in this project and tested on data from Mus musculus liver and brain. The tool is based on principal component analysis, which was applied to sets of expression data from genes likely regulated by a specific TF to obtain an estimate of its activity. Although the first tests for 17 selected TFs revealed problems with non-specific trends in the estimates, further tests are required to give a clear answer on the estimator's potential.
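The PCA-based estimate described in this abstract can be sketched as taking the first principal component of the expression of a TF's putative target genes across time points. The preprocessing (per-gene standardization), the function name `tf_activity_pc1`, and the simulated data are assumptions; the abstract does not say how the thesis scales the data or orients the component.

```python
import numpy as np

def tf_activity_pc1(expression):
    """Estimate a TF activity profile from its putative targets.

    `expression` is a genes x time-points matrix. Returns the first
    principal component scores (one value per time point) and the
    fraction of variance that component explains. The sign of a
    principal component is arbitrary.
    """
    # Standardize each gene over time so highly expressed genes do not dominate
    centered = expression - expression.mean(axis=1, keepdims=True)
    scale = centered.std(axis=1, keepdims=True)
    scale[scale == 0] = 1.0                      # leave constant genes at zero
    z = centered / scale
    # PCA via SVD of the time x genes matrix (columns already centered)
    u, s, _ = np.linalg.svd(z.T, full_matrices=False)
    activity = u[:, 0] * s[0]
    explained = float(s[0] ** 2 / np.sum(s ** 2))
    return activity, explained

# Simulated targets that all follow one underlying trend (hypothetical data)
rng = np.random.default_rng(1)
time = np.linspace(0.0, 1.0, 12)
trend = np.sin(2 * np.pi * time)
weights = rng.uniform(0.5, 1.5, size=20)
targets = np.outer(weights, trend) + 0.1 * rng.normal(size=(20, 12))

activity, explained = tf_activity_pc1(targets)
trend_correlation = abs(np.corrcoef(activity, trend)[0, 1])
```

The first component captures any dominant shared trend, including ones unrelated to the TF, which is consistent with the non-specific trends the abstract reports.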
Monraz, Gomez Luis Cristobal. "Application of systems biology resources to human diseases : combining transcriptomics data analysis and molecular networks to identify major players". Electronic Thesis or Diss., Université Paris sciences et lettres, 2023. http://www.theses.fr/2023UPSLS069.
Texto completoBiological systems are complex structures with multiple interactions between their components. Thanks to the combination of fields such as mathematics, computational science, biology, physiology etc. it is now possible to study these systems and answer different questions that have different applications, like in human health. In this thesis I have explored some tools and approaches used in systems biology in order to find molecular players as well as mechanisms that are important in the molecular networks for the biological systems. For this thesis, I have integrated data analysis techniques to transcriptomics data in different diseases. Also, I have used knowledge formalization approaches in order to construct or extend existing descriptive molecular networks in different diseases.I have studied the role of adipose tissue in breast cancer. The adipose tissue constitutes a fundamental and large part of the breast anatomy. Mammary adipocytes have been hypothesized to interact with cancer cells at the invasive front of the tumor, supporting the progression of the disease. These adipocytes have been termed “Cancer Associated Adipocytes (CAA)”. The interaction of these CAA and the progression of the disease have been suggested to be worse in obese patients. Therefore, to have an insight on the mechanism , a cohort of patients that had ductal breast carcinoma and that are considered as obese or normal-weight was created. I have analyzed adipose tissue samples of these patients, that were either close (proximal) or far (distal) from the tumor, at the transcriptome level. Both tissue types showed similar gene expression patterns. However, with the enrichment analysis, proximal samples had enriched estrogen signaling pathways, and pathways related to epithelium when compared to distal samples. When compared to tumor samples, proximal showed mostly pathways to their adipose tissue function, as adipogenesis, fatty acid metabolism PPAR signaling among others. 
We applied ROMA analysis to determine the activation of pathways of interest from the enrichment results, and found thermogenesis and matrix metalloproteinases to be more active in the proximal adipose tissues. The genes MMP7, MMP16, MMP3, SMARCC1, CREB3L4, MAPK13, RPS6KA6, SMARCA4, ZNF516, ACTG1, SLC25A9 appeared as major contributors. Molecular networks can be depicted as diagrams to facilitate their exploration and visualization. The information contained in these networks can support the analysis of transcriptomics data using techniques such as gene-set enrichment analysis. Previously, the Atlas of Cancer Signaling Network was assembled. This resource is composed of known biological processes that are relevant for cancer development and progression, in the form of maps depicting molecular interactions. I used one of the maps, covering cellular senescence and Epithelial to Mesenchymal Transition (EMT), to explore the role of the prototypic metastasis suppressor gene NME1 (previously called NM23-H1) in these processes. I enriched the map with functions of the protein and used the information to compile the players involved in cellular senescence and EMT. Some interesting players related to both processes were identified, such as NF-κB, showing that senescence is related to EMT. Then, I used transcriptomics data from colorectal cancer patients to observe the activity of the different modules in the network across the stages of the disease. Lastly, due to the COVID-19 epidemic, I participated in a multi-group research effort in which we constructed a map of host-virus interactions, the COVID-19 map. My contribution was focused on building the network representing the endoplasmic reticulum stress
Isik, Zerrin, Tulin Ersahin, Volkan Atalay, Cevdet Aykanat and Rengul Cetin-Atalay. "A signal transduction score flow algorithm for cyclic cellular pathway analysis, which combines transcriptome and ChIP-seq data". Saechsische Landesbibliothek- Staats- und Universitaetsbibliothek Dresden, 2014. http://nbn-resolving.de/urn:nbn:de:bsz:14-qucosa-138982.
This publication is freely accessible with the consent of the rights holder under a (DFG-funded) Alliance or National Licence.
Isik, Zerrin, Tulin Ersahin, Volkan Atalay, Cevdet Aykanat and Rengul Cetin-Atalay. "A signal transduction score flow algorithm for cyclic cellular pathway analysis, which combines transcriptome and ChIP-seq data". Royal Society of Chemistry, 2012. https://tud.qucosa.de/id/qucosa%3A27799.
This publication is freely accessible with the consent of the rights holder under a (DFG-funded) Alliance or National Licence.
Finotello, Francesca. "Computational methods for the analysis of gene expression from RNA sequencing data". Doctoral thesis, Università degli studi di Padova, 2014. http://hdl.handle.net/11577/3423789.
The genetic heritage of every living organism is encoded, in the form of DNA, in the genome. The genome consists of genes and non-coding sequences and holds all the information needed for the organism's cells to function correctly. Cells can access specific instructions in this code through a process called gene expression, that is, by switching a particular set of genes on or off and transcribing the necessary information into RNA. The set of transcribed RNAs therefore characterizes a precise cellular state and can provide important information about the mechanisms involved in the pathogenesis of a disease. Recently, an RNA sequencing methodology called RNA-seq has been rapidly replacing microarrays in the study of gene expression. Thanks to the properties of the sequencing technologies on which it is based, RNA-seq makes it possible to measure the number of RNAs present in a sample and, at the same time, to "read" their exact sequence. In practice, sequencing produces millions of sequences, called "reads", which are short strings read from random positions of the input RNAs. The reads must then be mapped by an algorithm onto a reference genome to reconstruct a transcriptional map, in which the number of reads aligned to each gene gives a digital measure (called a "count") of its expression level. Although this procedure may seem very simple at first glance, the overall analysis scheme is in fact very complex and not well defined. In recent years several methods have been developed for each processing stage, but a standardized RNA-seq data analysis pipeline has not yet been defined. The main objective of my PhD project was the development of a computational pipeline for RNA-seq data analysis, from pre-processing to the measurement of differential gene expression.
The different processing modules were defined and implemented through a series of successive steps. Initially, we considered and redefined methods and models for describing and processing the data, so as to establish a preliminary analysis scheme. Next, we looked more closely at one of the most problematic aspects of RNA-seq data analysis: the correction of biases present in the counts. We showed that some of these biases can be corrected effectively with current normalization techniques, while others, for example "length bias", cannot be completely removed without introducing further systematic errors. We therefore defined and tested a new approach to computing counts that minimizes biases even before any normalization is applied. Finally, we implemented the complete analysis pipeline using the most robust and accurate algorithms selected in the previous phases, and optimized some steps to guarantee accurate gene expression estimates even in the presence of highly similar genes. The implemented pipeline was then applied to a real case study, to identify the genes involved in the pathogenesis of spinal muscular atrophy (SMA). SMA is a degenerative neuromuscular disease that is one of the leading genetic causes of infant death and for which neither a cure nor an effective treatment is currently available. With our analysis we identified a set of genes linked to other connective tissue and musculoskeletal diseases whose differential expression patterns correlate with the phenotype, and which could therefore represent protective mechanisms able to counteract the symptoms of SMA.
Some of these putative targets are being validated, since they could lead to the development of effective tools for diagnostic screening and treatment of this disease. Future objectives concern the optimization of the pipeline defined in this thesis and its extension to the analysis of dynamic data from "time-series RNA-seq". To this end, we defined the design of two "time-series" data sets, one real and one simulated. The planning of the experimental design and sequencing of the real data set, as well as the modelling of the simulated data, were an integral part of the research carried out during the PhD. The rapid and constant evolution of methods for RNA-seq data analysis has so far prevented the definition of a standardized analysis scheme and the resolution of problems related to various aspects of processing, such as normalization. In this context, the pipeline defined in this thesis and, more broadly, the topics discussed in each chapter touch on all the different aspects of RNA-seq data analysis and provide useful guidelines for defining an effective and robust computational approach.
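The notions of a per-gene "count" and of "length bias" in the abstract above can be illustrated with a minimal Python sketch. It is not the thesis's pipeline or its new counting approach: the gene names, read assignments and transcript lengths are invented, and RPKM is used only as a familiar example of a length-aware normalization.

```python
from collections import Counter

def count_reads(read_to_gene):
    """Count aligned reads per gene; unaligned reads (None) are ignored."""
    return Counter(gene for gene in read_to_gene if gene is not None)

def rpkm(counts, gene_lengths):
    """Reads per kilobase of transcript per million mapped reads."""
    total_mapped = sum(counts.values())
    return {
        gene: counts[gene] * 1e9 / (gene_lengths[gene] * total_mapped)
        for gene in counts
    }

# Hypothetical data: geneB is four times longer than geneA and collects
# four times as many reads, although both are equally abundant.
assignments = ["geneA"] * 100 + ["geneB"] * 400 + [None] * 50
lengths = {"geneA": 1000, "geneB": 4000}

counts = count_reads(assignments)
normalized = rpkm(counts, lengths)
print(counts["geneA"], counts["geneB"])          # raw counts differ
print(normalized["geneA"], normalized["geneB"])  # length-normalized values agree
```

In this toy case dividing by length equalizes the two genes. The abstract's point is that on real data such corrections do not fully remove length bias without introducing other systematic errors.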
Aghamirzaie, Delasa. "Isoform-Specific Expression During Embryo Development in Arabidopsis and Soybean". Diss., Virginia Tech, 2016. http://hdl.handle.net/10919/73054.
Ph. D.
Logotheti, Marianthi. "Integration of functional genomics and data mining methodologies in the study of bipolar disorder and schizophrenia". Doctoral thesis, Örebro universitet, Institutionen för medicinska vetenskaper, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:oru:diva-52644.
Shi, Xu. "Bayesian Modeling for Isoform Identification and Phenotype-specific Transcript Assembly". Diss., Virginia Tech, 2017. http://hdl.handle.net/10919/79772.
Ph. D.
Hassan, Aamir Ul. "Integration of Genome Scale Data for Identifying New Biomarkers in Colon Cancer: Integrated Analysis of Transcriptomics and Epigenomics Data from High Throughput Technologies in Order to Identifying New Biomarkers Genes for Personalised Targeted Therapies for Patients Suffering from Colon Cancer". Thesis, University of Bradford, 2017. http://hdl.handle.net/10454/17419.
Sadacca, Benjamin. "Pharmacogenomic and High-Throughput Data Analysis to Overcome Triple Negative Breast Cancers Drug Resistance". Thesis, Université Paris-Saclay (ComUE), 2017. http://www.theses.fr/2017SACLS538/document.
Given the large number of treatment-resistant triple-negative breast cancers, it is essential to understand the mechanisms of resistance and to find new effective molecules. First, we analyze two large-scale pharmacogenomic datasets. We propose a novel classification based on transcriptomic profiles of cell lines, according to a biological network-driven gene selection process. Our molecular classification shows greater homogeneity in drug response than when cell lines are grouped by tissue of origin. It also helps identify similar patterns of treatment response. In a second analysis, we study a cohort of patients with triple-negative breast cancer that resisted neoadjuvant chemotherapy. We perform complete molecular analyses based on RNAseq and WES. We observe high molecular heterogeneity of tumors before and after treatment. Although we highlight clonal evolution under treatment, no recurrent mechanism of resistance could be identified. Our results strongly suggest that each tumor has a unique molecular profile and that it is increasingly important to have large series of tumors. Finally, we improve a method for testing the overrepresentation of known RNA binding protein motifs in a given set of regulated sequences. This tool uses an innovative approach to control the proportion of false positives, which the existing algorithm does not control. We show the effectiveness of our approach using two different datasets
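The abstract above does not say which statistical test underlies its overrepresentation analysis. As a generic baseline only, and not the thesis's method, overrepresentation of a motif in a set of sequences is often assessed with a one-sided hypergeometric test. The sketch below uses only the Python standard library, and all numbers in it are invented.

```python
from math import comb

def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) when n sequences are drawn without replacement from N,
    of which K carry the motif (one-sided overrepresentation test)."""
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)

# Hypothetical example: 20 sequences in total, 5 carry the motif;
# the regulated set contains 5 sequences, 3 of which carry the motif.
p = hypergeom_pvalue(k=3, n=5, K=5, N=20)
print(round(p, 4))
```

When many motifs are tested at once, the resulting p-values would still need a multiple-testing correction to keep the proportion of false positives under control.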
Wu, Mei. "Detection of aberrant events in RNA for clinical diagnostics". Thesis, Uppsala universitet, Institutionen för biologisk grundutbildning, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-448361.
Gogolewski, Krzysztof. "Matrix methods in transcriptomic and metabolomic data analysis". Doctoral thesis, 2019. https://depotuw.ceon.pl/handle/item/3341.
This dissertation discusses several approaches to modelling and analysing transcriptomic data. The work opens with a short introduction to the current state of knowledge on high-throughput data, together with a general introduction to genetics. It reviews the technologies currently used to collect transcriptomic data, as well as computational methods for modelling and analysing them, in particular those concerning the decomposition of the transcriptomic signal and its integration with metabolomic knowledge. The three main chapters of the dissertation discuss specific scenarios and experimental data, for which appropriate methods of transcriptomic data analysis are developed and applied. Each chapter presents a computational method for extracting biological knowledge and its application in a concrete case study using experimental data. Finally, results from a collaboration with Baylor College of Medicine on the role of the FOXF1 gene in the development of lung diseases close the main part of the dissertation.
YEN, MING-YI and 顏名儀. "Quantitative Analysis of ECI2 Isoforms from Cancer Transcriptomic Data". Thesis, 2019. http://ndltd.ncl.edu.tw/handle/6hv7zs.
Asia University (亞洲大學)
Department of Bioinformatics and Medical Engineering (生物資訊與醫學工程學系)
107
Quantitative analysis of transcriptomes has received increasing attention in recent years, and a growing number of studies have confirmed that transcript isoforms may play key roles in cancer. A recent study found that ECI2 (Homo sapiens enoyl-CoA delta isomerase 2) is a mammalian peroxisomal isomerase with 11 transcript isoforms, one of which is an important cancer antigen called Hepatocellular Carcinoma Antigen 64 (HCA64). Other studies have also reported it as a prognostic biomarker for other cancers, such as breast cancer and prostate cancer. RNA-Seq high-throughput sequencing has become one of the most advanced methods for measuring gene expression. We therefore used the software Salmon to quantify transcript expression from RNA-seq data for breast cancer, small cell lung cancer, ovarian cancer and prostate cancer. The aim was to find a relationship between HCA64 and a specific cancer, so that HCA64 could serve as a biomarker for that cancer, aiding diagnosis and subsequent treatment and improving patient survival. According to the results of the study, the HCA64 transcript was expressed in only some cancer samples. DESeq2 analysis identified significant genes in metastatic breast cancer, prostate cancer and liver cancer, and many of these genes have a related mechanism in the development of these three cancers. It can be inferred that HCA64 has an important association with metastatic breast cancer, prostate cancer and liver cancer.
Rogers, Gary L. "Transcriptomic Data Analysis Using Graph-Based Out-of-Core Methods". 2011. http://trace.tennessee.edu/utk_graddiss/1122.
Gatto, Sole. "Integrated bioinformatics analysis of epigenomic and transcriptomic data from ICF syndrome patient's cells". Doctoral thesis, 2013. http://www.fedoa.unina.it/9340/1/TESI_SG.pdf.
Bewerunge, Peter [Verfasser]. "Integrative data mining and meta analysis of disease-specific large-scale genomic, transcriptomic and proteomic data / presented by Peter Bewerunge". 2009. http://d-nb.info/997856645/34.
Sitte, Maren. "Network Based Integration of Proteomic and Transcriptomic Data: Study of BCR and WNT11 Signaling Pathways in Cancer Cells". Doctoral thesis, 2020. http://hdl.handle.net/21.11130/00-1735-0000-0005-1397-B.
Chu, An-Yuan and 諸安元. "Application of Machine Learning in Analysis of Transcriptomic Data Derived from Next Generation Sequencing and Model Construction". Thesis, 2019. http://ndltd.ncl.edu.tw/handle/tszn3d.
National Chung Hsing University (國立中興大學)
Department of Information Management (資訊管理學系所)
107
Tobacco Mosaic Virus, the most studied plant virus, can infect over 100 species of plants and over 550 species of flowering plants, causing enormous economic losses at home and abroad. Microarrays, an important analytic tool in genomics and genetics, enable researchers to analyze the expression of large numbers of genes simultaneously. To find the genes related to replication of Tobacco Mosaic Virus, this research uses gene expression data at 5 time points (30 min, 4 hr, 6 hr, 18 hr and 24 hr), generated by Next Generation Sequencing from Arabidopsis cells infected with Tobacco Mosaic Virus. Drawing on the FCBF and Wrapper algorithms from papers that integrate machine learning and microarray analysis, this research treats genes as samples and converts the original target variables into attributes of each gene, then proposes the DiSK algorithm to select genes. The selected genes are validated with the C4.5 algorithm and a Multi-Layer Perceptron. The results show that genes selected by the DiSK algorithm, with average accuracy 81.25%, average true positive rate (classification accuracy of the control group) 90%, true negative rate (classification accuracy of the experiment group) 72.5%, average F-measure 80.3% and average AUC 0.849, outperform genes selected by the other algorithms on all of these measures. Finally, this research explores the functions of the selected genes and searches for and constructs their genetic network with Pathway Studio Plant, which makes the proposed algorithm more persuasive and provides new targets for plant virus researchers.
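For readers unfamiliar with the evaluation measures quoted in this abstract, the following Python sketch shows how accuracy, true positive rate, true negative rate and F-measure follow from a confusion matrix. The counts are hypothetical, chosen so that accuracy, TPR and TNR come out at the quoted 81.25%, 90% and 72.5%. They are not the thesis's data, and since the thesis reports averages, the F-measure computed here is not expected to match the quoted 80.3%.

```python
def classification_metrics(tp, fn, tn, fp):
    """Standard binary classification metrics from confusion-matrix counts.
    Following the abstract, the control group is the positive class."""
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    tpr = tp / (tp + fn)            # true positive rate (recall)
    tnr = tn / (tn + fp)            # true negative rate (specificity)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * tpr / (precision + tpr)
    return {"accuracy": accuracy, "tpr": tpr, "tnr": tnr, "f_measure": f_measure}

# Hypothetical counts: 40 control and 40 experiment samples.
metrics = classification_metrics(tp=36, fn=4, tn=29, fp=11)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With equally sized classes, accuracy is the mean of TPR and TNR, which is consistent with the three values quoted in the abstract.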
Santos, Diogo André Passagem dos 1987. "Comparative analysis of 454 pyrosequencing data from coffee transcriptomes". Master's thesis, 2011. http://hdl.handle.net/10451/4944.
Understanding the mechanisms behind the resistance of coffee plants (Coffea spp.) to leaf rust (caused by Hemileia vastatrix) is of vital importance for breeding coffee varieties with durable resistance. Loss of resistance due to the appearance of new rust races is occurring, but some genotypes, such as HDT832/2, are still resistant to all known H. vastatrix races. Previous studies show that the resistance to H. vastatrix in this genotype shares common immunity components with nonhost resistance. 454 pyrosequencing transcriptomic data representing HDT832/2 host and nonhost resistance, along with a healthy plant control, were analyzed to better understand this resistance. Expressed sequence tags (ESTs) are a very common and interesting solution for transcriptomic studies because they lack the non-expressed part of the genome. The small number of reads generated for this project is a limitation with no established solution. To analyze this dataset, two different assembly strategies (individual assembly versus global assembly) and two different assemblers (Newbler versus MIRA) were used, and the results of all four assemblies are reported and analyzed. Assemblies were compared by assessing the number of transcripts shared by the three libraries, by BLAST searches against the NCBI nr protein and Coffea spp. EST databases, and by searching for previously studied genes. Overall the global assembly strategy performed better than the individual strategy, and Newbler performed better than MIRA in most but not all parameters. Here we provide a good strategy for small-budget transcriptome projects to optimize their data, and we present an annotated transcriptome of the resistance response of coffee line HDT832/2 to rust in host and nonhost interactions.
Coffee is one of the most important products on the international market, and its production and export underpin the economies of more than 60 countries, most of them developing countries. Coffee growing is an expanding industry that struggles with the need to increase production without raising costs too much. Coffee cultivation (in particular of Arabica coffee, Coffea arabica) is affected on a large scale by plant diseases that destroy or weaken the plants. Among these diseases, coffee leaf rust (orange rust), caused by the fungus Hemileia vastatrix Berkeley & Broome, is one of the most important; it affects coffee-growing countries worldwide and causes losses of 30% if no control measure is applied. H. vastatrix is a biotrophic fungus that depends on living host cells to feed and complete its life cycle. Although the disease can be controlled by applying plant protection products, the associated economic and environmental costs are high, so growing resistant varieties is a more sustainable option. The identification and characterization of Híbrido de Timor populations (HDT, a natural hybrid between C. arabica and C. canephora) allowed the selection of plants with a broad spectrum of resistance, which were subsequently used as resistance donors in coffee breeding programmes in several countries. However, these resistances have been called into question by the emergence of new races of the fungus, and line HDT832/2, selected at the Centro de Investigação das Ferrugens do Cafeeiro (CIFC), is of interest because it remains resistant to all known races of H. vastatrix. Since one of the most durable forms of resistance in plants is the resistance of an entire plant species to all genetic variants of a pathogen (nonhost resistance), it was important to compare the resistance of coffee to H. vastatrix (host resistance) with nonhost resistance, in this case between HDT832/2 and Uromyces vignae, the fungus responsible for cowpea rust.
A previous study of 8 genes involved in plant immunity mechanisms suggests that the resistance of HDT832/2 to these two pathogens has shared components. That study also clarified the chronology of infection, making it possible to identify the time points with the highest expression of plant response genes. Accordingly, HDT832/2 leaves were inoculated with each fungus separately and, together with a control sample, RNA samples were collected and sent for cDNA pyrosequencing with 454 technology. The analysis of Expressed Sequence Tags (ESTs) is an attractive alternative for studying non-model organisms such as coffee. Moreover, because only the expressed portion of the genome is sequenced, the amount of data to analyse is much smaller and it becomes possible to detect and study differences in expression under distinct biological conditions. As this was a small-scale project, only one sequencing run was performed for the cDNA of the 3 conditions under study, so the number of sequences for each condition was low. It was therefore necessary to find the best way to assemble these sequences, and two assembly strategies and two different assemblers were studied. The difference between the two assembly strategies lay in whether or not the sequences were separated by condition. In the individual assembly strategy, each set of sequences from one condition was assembled only with sequences from the same condition. By contrast, to obtain a set of sequences with greater coverage of the transcriptome, all the original sequences were combined in a single assembly, called the global assembly.
The choice of assembly program also has a large influence on the final result, so the results of Newbler v2.5 and MIRA v.2.3.0 were compared. Four different assemblies were thus obtained and then compared. To make the comparison, in the absence of a complete coffee genome, several forms of analysis were chosen. An important feature expected in this type of data is a large number of sequences shared by the 3 conditions, rather than sequences that appear in only one condition. In the global assemblies it was possible to trace the origin of the sequences used to build the final sequences, and both Newbler and MIRA produced assemblies in which a large part of the sequences come from all three conditions. For the individual assemblies, to decide that a sequence was the same as one from another condition, we used the result of mapping them, through Blastx, against the NCBI protein database (nr-protein database). Here it could be seen that the lack of coverage of each condition's set of sequences led to a distribution of the data very different from the expected one. To compare the two methods more easily, the sequences from the individual assemblies with the same best hit in the blast against nr were assembled together so that, for each assembler, there was only one set of sequences for each method. Each of these sets of sequences was then mapped, through Blastn, against the coffee EST sequences in the NCBI databases. The global assemblies performed better than the individual assemblies, and Newbler obtained a higher percentage of annotated sequences than MIRA, especially when only the results with full homology are considered.
The presence and number of homologues of 10 coffee genes previously characterized by RT-qPCR in these same samples were also studied. While the assemblies made with Newbler were only able to reconstruct 7 of the 10 genes, the global assembly with MIRA reconstructed all 10 genes. However, Newbler reconstructs the genes completely: in only 3 cases is a gene split across different sequences, and even then these are grouped in the same isogroup. MIRA, on the other hand, has 6 of the 10 genes split across different sequences, and often the same gene is represented by numerous sequences. It was thus possible to see that the global assembly strategy is better than individual assembly of the sequences, with Newbler better than MIRA in most of the parameters evaluated. The sequences from the two programs were then mapped against the NCBI nr database using Blastx and subsequently annotated with GO terms through Blast2go. The assembly made with Newbler achieves a better percentage of sequences with a hit in the nr database and a larger number of annotated sequences. This work made it possible to develop an assembly strategy that lets low-budget projects study the transcriptome of a non-model species and made available, for more detailed future analysis, the transcriptome expressed by coffee leaves under host resistance (resistance to H. vastatrix) and nonhost resistance (resistance to U. vignae). A better way of mapping the sequences assembled by Newbler is needed. In addition, the combined use of the results of the two assemblers may lead to a better final result. An extensive analysis of the results reported here may lead to a better understanding of the resistance of line HDT832/2 to H. vastatrix and support its future maintenance and manipulation.
Howard, Brian Edward. "Methods for accurate analysis of high-throughput transcriptome data". 2009. http://www.lib.ncsu.edu/theses/available/etd-10132009-213553/unrestricted/etd.pdf.
"Development of bioinformatics platforms for methylome and transcriptome data analysis". 2014. http://library.cuhk.edu.hk/record=b6115790.
DNA甲基化是一種重要的表觀遺傳修飾,主要用來調控基因的表達。目前,全基因組重亞硫酸鹽測序(BS-seq)是最準確的研究DNA甲基化的實驗方法之一,該技術的一大特點就是可以精確到單個堿基的解析度。為了分析BS-seq產生的大量測序數據,我參與開發並深度優化了Methy-Pipe軟體。Methy-Pipe集成了測序序列比對和甲基化程度分析,是一個一體化的DNA甲基化資料分析工具。另外,在Methy-Pipe的基礎上,我又開發了一個新的用於檢測DNA甲基化差異區域(DMR)的演算法,可以用於大範圍的尋找DNA甲基化標記。Methy-Pipe在我們實驗室的DNA甲基化研究項目中得到廣泛的應用,其中包括基於血漿的無創產前診斷(NIPD)和癌症的檢測。
基因間區長鏈非編碼蛋白RNA(lincRNA)是一種重要的調節子,其在很多生物學過程中發揮作用,例如轉錄後調控,RNA的剪接,細胞老化等。lincRNA的表達具有很強的組織特異性,因此很大一部分lincRNA還沒有被發現。最近,全轉錄組測序技術(RNA-seq)結合基因從頭組裝,為新的lincRNA鑒定以及構建完整的轉錄組列表提供了最有力的方法。然而,有效並準確的從大量的RNA-seq測序數據中鑒定出真實的新的lincRNA仍然具有很大的挑戰性。為此,我開發了兩個生物訊息學工具:1)iSeeRNA,用於區分lincRNA和編碼蛋白RNA(mRNA);2)sebnif,用於深層次資料篩選以得到高品質的lincRNA列表。這兩個工具已經在多個生物學系統中使用並表現出很好的效果。
總的來說,我開發了一些生物訊息學方法,這些方法可以幫助研究人員更好的利用二代測序技術來挖掘大量的測序數據背後的生物學本質,尤其是DNA甲基化和轉錄組的研究。
High-throughput massively parallel sequencing technologies, or Next-Generation Sequencing (NGS) technologies, have greatly accelerated biological and medical research. With the ever-growing throughput and complexity of NGS technologies, bioinformatics methods and tools are urgently needed for analyzing the large amounts of data and discovering the meaningful information behind them. In this thesis, I mainly worked on developing bioinformatics algorithms for two research fields: DNA methylation data analysis and large intergenic noncoding RNA discovery, where NGS technologies are used extensively and novel bioinformatics algorithms are much needed.
DNA methylation is an important epigenetic modification that controls the transcriptional regulation of genes. Whole genome bisulfite sequencing (BS-seq) is one of the most precise methodologies for DNA methylation study, allowing whole-methylome research at single-base resolution. To analyze the large amount of data generated by BS-seq experiments, I have co-developed and optimized Methy-Pipe, an integrated bioinformatics pipeline which can perform both sequencing read alignment and methylation state decoding. Furthermore, I have developed a novel algorithm for Differentially Methylated Regions (DMR) mining, which can be used for large-scale methylation marker discovery. Methy-Pipe has been routinely used in our laboratory for methylomic studies, including non-invasive prenatal diagnosis and early cancer detection in human plasma.
Large intergenic noncoding RNAs, or lincRNAs, are a very important novel family of gene regulators in many biological processes, such as post-transcriptional regulation, splicing and aging. Because lincRNA expression is highly tissue-specific, a large proportion of lincRNAs remains undiscovered. The development of Whole Transcriptome Shotgun Sequencing, also known as RNA-seq, combined with de novo or ab initio assembly, promises large-scale discovery of novel lincRNAs and hence a complete transcriptome catalog. However, efficiently and accurately identifying novel lincRNAs from large transcriptome data remains a bioinformatics challenge. To fill this gap, I have developed two bioinformatics tools: I) iSeeRNA, for distinguishing lincRNAs from mRNAs, and II) sebnif, for comprehensive filtering towards high-quality lincRNA screening. Both have been used in various biological systems and showed satisfactory performance.
In summary, I have developed several bioinformatics algorithms that help researchers take advantage of the strengths of NGS technologies (in methylome and transcriptome studies) and explore the biology behind the large amounts of data.
Detailed summary in vernacular field only.
Sun, Kun.
Thesis (Ph.D.) Chinese University of Hong Kong, 2014.
Includes bibliographical references (leaves 118-126).
Abstracts also in Chinese.
"Transcriptome analysis and applications based on next-generation RNA sequencing data". 2012. http://library.cuhk.edu.hk/record=b5549664.
我的博士研究主要集中在二代测序(next-generation sequencing,NGS),特别是RNA-Seq数据的分析。它主要包含三部分:分析工具开发,数据分析和机理研究。
大量测序数据的分析对于二代测序技术来说是一个重大的挑战。目前,相对于剪接比对工具(splice-aware aligner),普通比对工具可以极速(ultrafast)的将数以千万记的短序列(Reads)比对到基因组,但是他们很难处理那些跨过剪接位点(splice junction)的短序列(spliced reads)或者匹配多个基因组位置的短序列(multireads)。我们开发了一个利用two-seed策略的全新的序列比对工具-ABMapper。基准测试(Benchmark test) 结果显示ABMapper比其他的同类工具:TopHat和SpliceMap有更高的accuracy和recall。另一方面,spliced reads和multireads在基因组上会有多个匹配的位置,选择最可能的位置也成为一个大问题。在计算基因表达值时,multireads和spliced reads常会被随机的选定其中之一,或者直接被排除。这种处理方式会引入偏差而直接影响下游(downstream)分析的准确性。为了解决multireads和spliced reads位置选择问题,我们提出了一个利用内含子(intron)长度的Geometric-tail (GT) 经验分布的最大似然估计 (maximum likelihood estimation) 的方法。这个概率模型可以适用于剪接位点位于短序列上或者位于成对短序列(Pair-ended, PE) 之间的情况。基于这个模型,我们可以更好的确定那些在基因组上存在多个匹配的成对短序列(pair-ended, PE reads)的最可能位置。
测序数据的积累为深入研究生物学意义提供了丰富的资源。利用RNA-Seq数据和甲基化测序数据,我们建立了一个基于DNA甲基化模式 (pattern) 的基因表达水平的预测模型。根据这个模型,我们发现DNA甲基化可以相当准确的预测基因表达水平,准确率达到78%。我们还发现基因主体上的DNA甲基化比启动子 (promoter) 附近的更重要。最后我们还从整合所有甲基化模式和CpG模式的组合数据集中,利用特征筛选(feature selection)选择了一个最优化子集。我们基于最优子集建立了特征重叠作用网络,进一步揭示了DNA甲基化模式对于基因表达的协作调控机理。
除了开发RNA-Seq数据分析的工具和数据挖掘,我们还分析斑马鱼(zebrafish)的转录组(transcriptome)。RNA-Seq数据分析结合荧光成像,定量PCR等生物学实验,揭示了Calycosin处理之后的相关作用通路(pathway)和差异表达基因,分析结果还证明了Calycosin在体内的血管生成活性。
综上所述,本论文将会详细阐述我在二代测序数据分析,基于数据挖掘的生物学意义的发现和转录组分析方面的工作。
The recent development of next-generation RNA sequencing, termed ‘RNA-Seq’, has made it possible to explore RNA transcripts across the whole transcriptome. As a revolutionary method, RNA-Seq can not only measure transcript abundances precisely but also discover novel transcribed content and uncover unknown regulatory mechanisms. Meanwhile, combining different levels of next-generation sequencing, such as genome sequencing and methylome sequencing, has provided a powerful tool for novel biological discovery.
My PhD study focuses on the analysis of next-generation sequencing data, especially RNA-Seq data. It comprises three parts: pipeline development, data analysis, and mechanistic study.
Analyzing the massive volume of data produced by next-generation sequencing (NGS) technology is a great challenge. Many existing general aligners (in contrast to splice-aware alignment tools) are capable of mapping millions of sequencing reads onto a reference genome. However, they are designed neither for reads that span splice junctions (spliced reads) nor for reads that could match multiple locations along the reference genome (multireads). Hence, we have developed an ab initio mapping method, ABMapper, which uses a two-seed strategy. The benchmark results show that ABMapper achieves higher accuracy and recall than the comparable tools TopHat and SpliceMap. In addition, selecting the most probable location for spliced reads and multireads is a major problem. When expression levels are calculated, these reads are often either assigned at random to one of the possible locations or discarded completely, which biases downstream analyses such as differential expression analysis and alternative splicing analysis. To determine the location of spliced reads and multireads rationally, we have proposed a maximum likelihood estimation method based on a geometric-tail (GT) distribution of intron length. This probabilistic model handles splice junctions that fall between the two reads of a paired-end (PE) pair as well as those contained within one or both of the reads. Based on this model, multiple alignments of reads within a PE pair can be properly resolved.
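The idea of choosing a read location by maximum likelihood under an intron-length distribution can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the thesis: the model shape (a smoothed histogram up to a cutoff, followed by a geometric tail), the function names `fit_gt`, `log_prob` and `pick_location`, the `cutoff` and `bin_width` parameters, and the toy data. It also scores a single implied intron length per candidate rather than full paired-end alignments.

```python
import math
from collections import Counter


def fit_gt(lengths, cutoff, bin_width=50):
    """Fit an assumed geometric-tail model to observed intron lengths.

    Body (1..cutoff): add-one smoothed histogram. Tail (> cutoff): geometric decay.
    """
    if cutoff % bin_width != 0:
        raise ValueError("cutoff must be a multiple of bin_width")
    body = [x for x in lengths if x <= cutoff]
    tail = [x for x in lengths if x > cutoff]
    tail_mass = len(tail) / len(lengths)
    # Geometric MLE on the excess over the cutoff: p = 1 / mean(excess).
    p = len(tail) / sum(x - cutoff for x in tail) if tail else 0.5
    p = min(p, 1 - 1e-9)  # keep (1 - p) strictly positive
    n_bins = cutoff // bin_width
    counts = Counter((x - 1) // bin_width for x in body)
    total = len(body) + n_bins  # add-one smoothing over the body bins
    bin_prob = [(counts[b] + 1) / total for b in range(n_bins)]
    return {"cutoff": cutoff, "bin_width": bin_width,
            "tail_mass": tail_mass, "p": p, "bin_prob": bin_prob}


def log_prob(model, length):
    """Log-probability of one intron length (>= 1) under the fitted model."""
    if length < 1:
        raise ValueError("intron length must be at least 1")
    c, w = model["cutoff"], model["bin_width"]
    if length <= c:
        # Spread each bin's probability evenly over the lengths it covers.
        mass = (1 - model["tail_mass"]) * model["bin_prob"][(length - 1) // w] / w
    else:
        mass = model["tail_mass"] * model["p"] * (1 - model["p"]) ** (length - c - 1)
    return math.log(mass) if mass > 0 else float("-inf")


def pick_location(model, candidates):
    """Return the (location, intron_length) candidate with the highest likelihood."""
    return max(candidates, key=lambda cand: log_prob(model, cand[1]))


# Toy data: most introns are short, a few are long.
observed = [80, 90, 100, 110, 120] * 20 + [1500, 2500, 4000]
model = fit_gt(observed, cutoff=1000)
candidates = [("chr1:100", 95), ("chr5:900", 30000)]
best = pick_location(model, candidates)
```

In this toy case the candidate implying a 95-base intron is preferred over the one implying a 30,000-base intron, which is the kind of decision that replaces random assignment or discarding of ambiguous reads.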
The accumulation of NGS data has provided rich resources for deep discovery of biological significance. We have integrated RNA-Seq data and methylation sequencing data to build a predictive model for the regulation of gene expression based on DNA methylation patterns. We found that DNA methylation can predict gene expression fairly accurately, with accuracy reaching up to 78%. We also found that DNA methylation in the gene body is the most important feature in these models, even more informative than methylation at the promoter. Finally, a feature overlap network built from an optimal subset of the combined methylation patterns and CpG patterns indicates that DNA methylation patterns regulate gene expression collaboratively.
In addition to developing new algorithms to facilitate RNA-Seq data analysis, we performed transcriptome analysis on zebrafish. The analysis of the differentially expressed genes and pathways involved after calycosin treatment, combined with other experimental evidence such as fluorescence microscopy and quantitative real-time polymerase chain reaction (qPCR), demonstrates the proangiogenic effects of calycosin in vivo.
In summary, this thesis details my work on NGS data analysis, the discovery of biological significance using data-mining algorithms, and transcriptome analysis.
Detailed summary in vernacular field only.
Lou, Shaoke.
Thesis (Ph.D.)--Chinese University of Hong Kong, 2012.
Includes bibliographical references (leaves 135-146).
Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web.
Abstract also in Chinese.
Abstract (Chinese)
Acknowledgement
Chapter 1 --- Introduction
Chapter 1.1 --- Bioinformatics
Chapter 1.2 --- Bioinformatics application
Chapter 1.3 --- Motivation
Chapter 1.4 --- Objectives
Chapter 1.5 --- Thesis outline
Chapter 2 --- Background
Chapter 2.1 --- Biological and biotechnology background
Chapter 2.1.1 --- Central dogma and biology ABC
Chapter 2.1.2 --- Transcription
Chapter 2.1.3 --- Splicing and Alternative Splicing
Chapter 2.1.4 --- Next-generation Sequencing
Chapter 2.1.5 --- RNA-Seq
Chapter 2.2 --- Computational background
Chapter 2.2.1 --- Approximate string matching and read mapping
Chapter 2.2.2 --- Read mapping algorithms and tools
Chapter 2.2.3 --- Spliced alignment tools
Chapter 3 --- ABMapper: a two-seed based spliced alignment tool
Chapter 3.1 --- Introduction
Chapter 3.2 --- State-of-the-art
Chapter 3.3 --- Problem formulation
Chapter 3.4 --- Methods
Chapter 3.5 --- Results
Chapter 3.5.1 --- Benchmark test
Chapter 3.5.2 --- Complexity analysis
Chapter 3.5.3 --- Comparison with other tools
Chapter 3.6 --- Discussion and conclusion
Chapter 4 --- Geometric-tail (GT) model for rational selection of RNA-Seq read location
Chapter 4.1 --- Introduction
Chapter 4.2 --- State-of-the-art
Chapter 4.3 --- Problem formulation
Chapter 4.4 --- Algorithms
Chapter 4.5 --- Results
Chapter 4.5.1 --- Workflow of GT MLE method
Chapter 4.5.2 --- GT distribution and insert-size distribution
Chapter 4.5.3 --- Multiread analysis
Chapter 4.5.4 --- Splice-site comparison
Chapter 4.6 --- Discussion and conclusion
Chapter 5 --- Explore relationship between methylation patterns and gene expression
Chapter 5.1 --- Introduction
Chapter 5.2 --- State-of-the-art
Chapter 5.3 --- Problem formulation
Chapter 5.4 --- Methods
Chapter 5.4.1 --- NGS sequencing and analysis
Chapter 5.4.2 --- Data preparation and transformation
Chapter 5.4.3 --- Random forest (RF) classification and regression
Chapter 5.5 --- Results
Chapter 5.5.1 --- Genome wide profiling of methylation
Chapter 5.5.2 --- Aggregation plot of methylation levels at different regions
Chapter 5.5.3 --- Scatterplot between methylation and gene expression
Chapter 5.5.4 --- Predictive model of gene expression using DNA methylation features
Chapter 5.5.5 --- Comb-model based on the full dataset
Chapter 5.6 --- Discussion and conclusion
Chapter 6 --- RNA-Seq data analysis and applications
Chapter 6.1 --- Transcriptional Profiling of Angiogenesis Activities of Calycosin in Zebrafish
Chapter 6.1.1 --- Introduction
Chapter 6.1.2 --- Background
Chapter 6.1.3 --- Materials and methods and ethics statement
Chapter 6.1.4 --- Results
Chapter 6.1.5 --- Conclusion
Chapter 6.2 --- An integrated web medicinal materials DNA database: MMDBD (Medicinal Materials DNA Barcode Database)
Chapter 6.2.1 --- Introduction
Chapter 6.2.2 --- Background
Chapter 6.2.3 --- Construction and content
Chapter 6.2.4 --- Utility and discussion
Chapter 6.2.5 --- Conclusion and future development
Chapter 7 --- Conclusion
Chapter 7.1 --- Conclusion
Chapter 7.2 --- Future work
Appendix
Chapter A1 --- Descriptive analysis of trio data
Chapter A2 --- Whole genome methylation level profiling
Chapter A3 --- Global sliding window correlation between individuals
Chapter A4 --- Features selected after second-run filtering
Bibliography
Chapter A --- Publications
Reference
Puthiyedth, Nisha. "A novel feature selection approach for data integration analysis: applications to transcriptomics study". Thesis, 2016. http://hdl.handle.net/1959.13/1322449.
Meta-analysis has become a popular method for identifying novel biomarkers in the field of medical research. Meta-analysis has been widely applied to genome-wide association and transcriptomic studies due to the availability of datasets in the public domain. Joint analysis of multiple datasets has become a common technique for increasing statistical power in detecting biomarkers reported in smaller studies. The approach generally followed relies on the fact that as the total number of samples increases, greater power to detect associations of interest is anticipated. Integrating available information from different datasets to generate a combined result seems reasonable and promising. Consequently, there is a need for computationally based integration methods that evaluate multiple independent datasets investigating a common theme or disorder. This raises a variety of issues in the analysis of such data and leads to more complications than are seen with standard meta-analysis, including diverse experimental platforms and complex data structures. I illustrate these ideas using microarray datasets from multiple studies and propose an integrative methodology to combine datasets generated using different platforms. Having combined the data, the main challenge is to choose a subset of features that represent the combined dataset in a particular aspect. While the approach is well established in biostatistics, the introduction of new combinatorial optimisation models to address this issue has not been explored in depth. In 2004, a new feature selection approach based on a combinatorial optimisation method was proposed, entitled the (α,β)-k Feature Set problem approach. The main advantage of this approach over ranking methods for selecting individual features is that the features are evaluated as groups instead of on the basis of their individual performance.
The (α,β)-k Feature Set problem approach was originally defined with a single uniform dataset in mind and, conceived in this way, is not readily applicable to integrated datasets. An extended version of this approach handles integrated datasets in a consistent manner and selects features that differentiate sample pairs across datasets. The application of an (α,β)-k Feature Set problem-based approach for meta-analysis thus helps to identify the best set of features from a combined dataset, allowing researchers to reveal the genetic pathways that contribute to the development of a disease. I propose an extended version of the (α,β)-k Feature Set problem approach that aims to find a set of genes whose expression levels may be used to identify a joint core subset of genes that putatively play an important role in two conditions: prostate cancer and Alzheimer's disease. The results of the current study suggest that the proposed method is an efficient meta-analysis method that is capable of identifying biologically relevant genes that other methods fail to identify. As the amount of data increases, this novel method can be applied to find additional genes and pathways that are significant in these diseases, which may provide new insights into the disease mechanism and contribute towards understanding, prevention and cures.
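The group-wise evaluation that distinguishes the (α,β)-k Feature Set problem from per-feature ranking can be sketched in a few lines of Python. The sketch assumes the commonly cited single-dataset formulation: a set of k features is feasible if every pair of samples from different classes differs on at least α of the chosen features, and every pair from the same class agrees on at least β of them. It does not reproduce the extended, cross-dataset version proposed in the thesis. The function names, the exhaustive search, and the toy binarised expression matrix are hypothetical, and the brute-force search is only practical for very small inputs.

```python
from itertools import combinations


def is_alpha_beta_k_feature_set(samples, labels, features, alpha, beta):
    """Check the assumed (alpha, beta) constraints for one feature subset."""
    for i, j in combinations(range(len(samples)), 2):
        agree = sum(samples[i][f] == samples[j][f] for f in features)
        differ = len(features) - agree
        if labels[i] != labels[j] and differ < alpha:
            return False  # different classes, not separated by enough features
        if labels[i] == labels[j] and agree < beta:
            return False  # same class, not consistent on enough features
    return True


def smallest_feature_set(samples, labels, alpha, beta):
    """Exhaustively search for a feasible subset with the smallest k."""
    n_features = len(samples[0])
    for k in range(1, n_features + 1):
        for subset in combinations(range(n_features), k):
            if is_alpha_beta_k_feature_set(samples, labels, subset, alpha, beta):
                return subset
    return None  # no subset satisfies the constraints


# Toy binarised expression matrix: rows are samples, columns are genes.
samples = [
    [1, 1, 0, 1],  # class A
    [1, 0, 0, 1],  # class A
    [0, 1, 1, 1],  # class B
    [0, 0, 1, 1],  # class B
]
labels = ["A", "A", "B", "B"]

weak = smallest_feature_set(samples, labels, alpha=1, beta=1)
robust = smallest_feature_set(samples, labels, alpha=2, beta=2)
```

Raising α and β forces the selected group to separate classes redundantly, which is why the features are judged together rather than one at a time.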