Dissertations / Theses on the topic 'Omic data'

Consult the top 50 dissertations / theses for your research on the topic 'Omic data.'

1

Guan, Xiaowei. "Bioinformatics Approaches to Heterogeneous Omic Data Integration." Case Western Reserve University School of Graduate Studies / OhioLINK, 2012. http://rave.ohiolink.edu/etdc/view?acc_num=case1340302883.

2

Xiao, Hui. "Network-based approaches for multi-omic data integration." Thesis, University of Cambridge, 2019. https://www.repository.cam.ac.uk/handle/1810/289716.

Abstract:
The advent of advanced high-throughput biological technologies provides opportunities to measure the whole genome at different molecular levels in biological systems, which produces different types of omic data such as genome, epigenome, transcriptome, translatome, proteome, metabolome and interactome. Biological systems are highly dynamic and complex mechanisms which involve not only the within-level functionality but also the between-level regulation. In order to uncover the complexity of biological systems, it is desirable to integrate multi-omic data to transform the multiple level data into biological knowledge about the underlying mechanisms. Due to the heterogeneity and high-dimension of multi-omic data, it is necessary to develop effective and efficient methods for multi-omic data integration. This thesis aims to develop efficient approaches for multi-omic data integration using machine learning methods and network theory. We assume that a biological system can be represented by a network with nodes denoting molecules and edges indicating functional links between molecules, in which multi-omic data can be integrated as attributes of nodes and edges. We propose four network-based approaches for multi-omic data integration using machine learning methods. Firstly, we propose an approach for gene module detection by integrating multi-condition transcriptome data and interactome data using network overlapping module detection method. We apply the approach to study the transcriptome data of human pre-implantation embryos across multiple development stages, and identify several stage-specific dynamic functional modules and genes which provide interesting biological insights. We evaluate the reproducibility of the modules by comparing with some other widely used methods and show that the intra-module genes are significantly overlapped between the different methods. 
Secondly, we propose an approach for gene module detection by integrating transcriptome, translatome, and interactome data using a multilayer network. We apply the approach to study the ribosome profiling data of mTOR-perturbed human prostate cancer cells and mine several translation-efficiency-regulated modules associated with mTOR perturbation. We develop an R package, TERM, implementing the proposed approach, which offers a useful tool for the research field. Next, we propose an approach for feature selection by integrating transcriptome and interactome data using network-constrained regression. We develop a more efficient network-constrained regression method, eGBL. We evaluate its performance in terms of variable selection and prediction, and show that eGBL outperforms the other related regression methods. Applying it to the transcriptome data of human blastocysts, we select several genes of interest associated with time-lapse parameters. Finally, we propose an approach for classification by integrating epigenome and transcriptome data using neural networks. We introduce a superlayer neural network (SNN) model which learns DNA methylation and gene expression data in parallel in superlayers, with cross-connections allowing crosstalk between them. We evaluate its performance on human breast cancer classification. The SNN outperforms several other common machine learning methods. The approaches proposed in this thesis offer effective and efficient solutions for the integration of heterogeneous high-dimensional datasets, and can easily be applied to other datasets with similar structures. They are therefore applicable to many fields including but not limited to Bioinformatics and Computer Science.
3

Zuo, Yiming. "Differential Network Analysis based on Omic Data for Cancer Biomarker Discovery." Diss., Virginia Tech, 2017. http://hdl.handle.net/10919/78217.

Abstract:
Recent advances in high-throughput techniques enable the generation of large amounts of omic data such as genomics, transcriptomics, proteomics, metabolomics, glycomics, etc. Typically, differential expression analysis (e.g., Student's t-test, ANOVA) is performed to identify biomolecules (e.g., genes, proteins, metabolites, glycans) with significant changes at the individual level between biologically disparate groups (disease cases vs. healthy controls) for cancer biomarker discovery. However, differential expression analyses in independent studies of the same clinical types of patients have often led to different sets of significant biomolecules with only a few in common. This may be attributed to the fact that biomolecules are members of strongly intertwined biological pathways and highly interactive with each other. Without considering these interactions, differential expression analysis could lead to biased results. Network-based methods provide a natural framework to study the interactions between biomolecules. Commonly used data-driven network models include relevance networks, Bayesian networks and Gaussian graphical models. In addition to data-driven network models, there are many publicly available databases such as STRING, KEGG, Reactome, and ConsensusPathDB, where one can extract various types of interactions to build knowledge-driven networks. While both data- and knowledge-driven networks have their pros and cons, an appropriate approach to incorporate the prior biological knowledge from publicly available databases into data-driven network models is desirable for more robust and biologically relevant network reconstruction. Recently, there has been a growing interest in differential network analysis, where a connection in the network represents a statistically significant change in the pairwise interaction between two biomolecules in different groups.
From the rewiring interactions shown in differential networks, biomolecules that have strongly altered connectivity between distinct biological groups can be identified. These biomolecules might play an important role in the disease under study. In fact, differential expression and differential network analyses investigate omic data from two complementary perspectives: the former focuses on the change at the individual biomolecule level between different groups, while the latter concentrates on the change at the level of biomolecule pairs. Therefore, an approach that can integrate differential expression and differential network analyses is likely to discover more reliable and powerful biomarkers. To achieve these goals, we start by proposing a novel data-driven network model (i.e., LOPC) to reconstruct sparse biological networks. The sparse networks contain only direct interactions between biomolecules, which can help researchers to focus on the more informative connections. Then we propose a novel method (i.e., dwgLASSO) to incorporate prior biological knowledge into the data-driven network model to build biologically relevant networks. Differential network analysis is applied based on the networks constructed for biologically disparate groups to identify cancer biomarker candidates. Finally, we propose a novel network-based approach (i.e., INDEED) to integrate differential expression and differential network analyses to identify more reliable and powerful cancer biomarker candidates. INDEED is further expanded as INDEED-M to utilize omic data at different levels of the human biological system (e.g., transcriptomics, proteomics, metabolomics), which we believe is promising to increase our understanding of cancer. Matlab and R packages for the proposed methods are developed and available on GitHub (https://github.com/Hurricaner1989) to share with the research community.
Ph. D.
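The idea of a differential network in the abstract above (an edge marks a pair of biomolecules whose interaction changes between groups) can be illustrated with a small sketch. This is not LOPC, dwgLASSO or INDEED: it swaps in a plain Pearson correlation difference, skips any significance testing, and the function names, the `min_change` cutoff and the toy data are all assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def differential_edges(cases, controls, min_change=0.5):
    """Return {(a, b): r_cases - r_controls} for every pair whose correlation
    changes by at least min_change (a hypothetical cutoff) between groups."""
    edges = {}
    for a, b in combinations(sorted(cases), 2):
        change = pearson(cases[a], cases[b]) - pearson(controls[a], controls[b])
        if abs(change) >= min_change:
            edges[(a, b)] = change
    return edges


# Toy data: each molecule maps to its measurements across the samples of one group.
cases = {"g1": [1, 2, 3, 4], "g2": [4, 3, 2, 1], "g3": [1, 3, 2, 4]}
controls = {"g1": [1, 2, 3, 4], "g2": [1, 2, 3, 4], "g3": [1, 3, 2, 4]}
edges = differential_edges(cases, controls)
```

In this toy example only the pairs involving `g2` are kept, because `g2` is the molecule whose relationship with the others flips between groups. A real analysis would replace the fixed cutoff with a statistical test, for example a permutation test.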
4

Tsai, Tsung-Heng. "Bayesian Alignment Model for Analysis of LC-MS-based Omic Data." Diss., Virginia Tech, 2014. http://hdl.handle.net/10919/64151.

Abstract:
Liquid chromatography coupled with mass spectrometry (LC-MS) has been widely used in various omic studies for biomarker discovery. Appropriate LC-MS data preprocessing steps are needed to detect true differences between biological groups. Retention time alignment is one of the most important yet challenging preprocessing steps, in order to ensure that ion intensity measurements among multiple LC-MS runs are comparable. In this dissertation, we propose a Bayesian alignment model (BAM) for analysis of LC-MS data. BAM uses Markov chain Monte Carlo (MCMC) methods to draw inference on the model parameters and provides estimates of the retention time variability along with uncertainty measures, enabling a natural framework to integrate information of various sources. From methodology development to practical application, we investigate the alignment problem through three research topics: 1) development of single-profile Bayesian alignment model, 2) development of multi-profile Bayesian alignment model, and 3) application to biomarker discovery research. Chapter 2 introduces the profile-based Bayesian alignment using a single chromatogram, e.g., base peak chromatogram from each LC-MS run. The single-profile alignment model improves on existing MCMC-based alignment methods through 1) the implementation of an efficient MCMC sampler using a block Metropolis-Hastings algorithm, and 2) an adaptive mechanism for knot specification using stochastic search variable selection (SSVS). Chapter 3 extends the model to integrate complementary information that better captures the variability in chromatographic separation. We use Gaussian process regression on the internal standards to derive a prior distribution for the mapping functions. In addition, a clustering approach is proposed to identify multiple representative chromatograms for each LC-MS run. 
With the Gaussian process prior, these chromatograms are simultaneously considered in the profile-based alignment, which greatly improves the model estimation and facilitates the subsequent peak matching process. Chapter 4 demonstrates the applicability of the proposed Bayesian alignment model to biomarker discovery research. We integrate the proposed Bayesian alignment model into a rigorous preprocessing pipeline for LC-MS data analysis. Through the developed analysis pipeline, candidate biomarkers for hepatocellular carcinoma (HCC) are identified and confirmed on a complementary platform.
Ph. D.
5

Ruffalo, Matthew M. "Algorithms for Constructing Features for Integrated Analysis of Disparate Omic Data." Case Western Reserve University School of Graduate Studies / OhioLINK, 2016. http://rave.ohiolink.edu/etdc/view?acc_num=case1449238712.

6

Elhezzani, Najla Saad R. "New statistical methodologies for improved analysis of genomic and omic data." Thesis, King's College London (University of London), 2018. https://kclpure.kcl.ac.uk/portal/en/theses/new-statistical-methodologies-for-improved-analysis-of-genomic-and-omic-data(eb8d95f4-e926-4c54-984f-94d86306525a).html.

Abstract:
We develop statistical tools for analyzing different types of phenotypic data in genome-wide settings. When the phenotype of interest is a binary case-control status, most genome-wide association studies (GWASs) use randomly selected samples from the population (hereafter bases) as the control set. This approach is successful when the trait of interest is very rare; otherwise, a loss in the statistical power to detect disease-associated variants is expected. To address this, we propose a joint analysis of the three types of samples: cases, bases and controls. This is done by modeling the bases as a mixture of multinomial logistic functions of cases and controls, according to disease prevalence. In a typical GWAS, where thousands of single-nucleotide polymorphisms (SNPs) are available for testing, score-based test statistics are ideal. Other tests of association, such as Wald’s and likelihood ratio tests, are known to be asymptotically equivalent to the score test; however, their performance under small sample sizes can vary significantly. In order to allow the test comparison to be performed under the proposed case-base-control (CBC) design, we provide an estimation procedure using the maximum likelihood (ML) method along with the expectation-maximization (EM) algorithm. Simulations show that combining the three samples can increase the power to detect disease-associated variants, though a very large base sample set can compensate for lack of controls. In the second part of the thesis, we consider a joint analysis of both genome-wide SNPs and multiple phenotypes, with a focus on the challenges they present in the estimation of SNP heritability. The current standard for performing this task is fitting a variance component model, despite its tendency to produce boundary estimates when small sample sizes are used.
We propose a Bayesian covariance component model (BCCM) that takes into account genetic correlation among phenotypes and genetic correlation among individuals. The use of Bayesian methods allows us to circumvent some issues related to small sample sizes, mainly overfitting and boundary estimates. Using gene expression pathways, we demonstrate a significant improvement in SNP heritability estimates over univariate and ML-based methods, thus explaining why recent progress in eQTL identification has been limited. I published this work as an article in the European Journal of Human Genetics. In the third part of the thesis, we study the prospects of using the proposed BCCM for phenotype prediction. Results from real data show consistency in accuracy between ML-based methods and the proposed Bayesian method, when effect sizes are estimated using their posterior mode. It is also noted that an initial imputation step relatively increases the predictive accuracy.
7

Elsheikh, Samar Salah Mohamedahmed. "Integration of multi-omic data and neuroimaging characteristics in studying brain related diseases." Doctoral thesis, Faculty of Health Sciences, 2020. http://hdl.handle.net/11427/32609.

Abstract:
Approaches to the identification of genetic variants associated with complex brain diseases have evolved in recent decades. This evolution was supported by advancements in medical imaging and genotyping technologies that result in rich data production in the field of imaging genetics and radiogenomics. Studies in these fields have taken different designs and directions, from genome-wide associations to studying the complex interplay between genetics and structural connectivity of a wide range of brain-related diseases. Nevertheless, such combinations of heterogeneous, high-dimensional and inter-related data have introduced new challenges which cannot be handled with traditional statistical methods. In this thesis, we proposed analysis pipelines and methodologies to study the causal relationship between neuroimaging features, including tumour characteristics and connectomics, genetics and clinical factors in brain-related diseases. In doing so, we adopted two longitudinal study designs and modelled the association between Alzheimer's disease progression and genetic factors, utilising local and global brain connectivity networks. In addition to that, we performed a multi-stage radiogenomic analysis in glioblastoma using non-parametric statistical methods. To address some limitations in the methods, we adopted the Structural Equation Model and developed a mathematical model to examine the inter-correlation between neuroimaging and multi-omic characteristics of brain-related diseases. Our analyses identified risk genes that were previously reported in the literature on Alzheimer's disease and glioblastoma, and discovered potential risk variants which associate with disease progression. More specifically, we found some loci in the genes CDH18, ANTXR2 and IGF1, located on chromosomes 5, 4 and 12, to have an effect on brain connectivity over time in Alzheimer's disease.
We also found that the expression of APP, HFE, PLAU and BLMH has significant effects on the structural connectivity of local areas in the brain: the left Heschl gyrus, right anterior cingulate gyrus, left fusiform gyrus and left Heschl gyrus, respectively. These potential association patterns could be useful for early disease diagnosis, treatment and neurodegeneration prediction. More importantly, we identified gaps in the imaging genetics methodologies, proposed a mathematical model accounting for these limitations, and evaluated the model, which produced promising results. Our proposed flexible model, BiGen, addresses the gaps in the existing tools by combining neuroimaging, genetics, environmental, and phenotype information in a single complex analysis, accounting for the heterogeneity, inter-correlation, and non-linearity of the variables. Moreover, BiGen adopts an important assumption which is rarely met in the imaging genetics literature: all four variables are assumed to be latent constructs, meaning that they cannot be observed directly from the data and are measured through observed indicators. This is an important assumption in neuroimaging, behavioural and genetic studies alike, and it is one of the reasons why BiGen is flexible and can easily be extended to include more indicators and latent constructs in the context of brain-related diseases.
8

Ehrenberger, Tobias. "Cancer systems biology : functional insights and therapeutic strategies for medulloblastoma from omic data integration." Thesis, Massachusetts Institute of Technology, 2019. https://hdl.handle.net/1721.1/123062.

Abstract:
Thesis: Ph. D., Massachusetts Institute of Technology, Department of Biological Engineering, 2019
Cataloged from PDF version of thesis.
Includes bibliographical references (pages 151-167).
Medulloblastoma (MB) is a chiefly pediatric cancer of the cerebellum that has been studied extensively using genomic, epigenomic, and transcriptomic data. It comprises at least four molecularly distinct subgroups: WNT, SHH, Group 3, and Group 4. Despite the detailed characterization of MB, many disease-driving events remain to be elucidated and therapeutic targets to be nominated. In this thesis, we describe three studies that contribute to a better understanding of this devastating disease: First, we describe a study that aims to fully describe the genomic landscape in the largest medulloblastoma cohort to date, using 491 sequenced MB tumors and 1,256 epigenetically analyzed cases. This work describes subgroup-specific driver alterations including previously unappreciated actionable targets; and, based on epigenetic data, identifies further heterogeneity within Group 3 and Group 4 tumors. Second, we focus on the proteomes and phospho-proteomes of 45 medulloblastoma samples.
We identified distinct pathways associated with two subsets of SHH tumors that showed robustly distinct proteomes, but similar transcriptomes, and found post-translational modifications of MYC that are associated with poor outcomes in Group 3 tumors. We also found kinases associated with subtypes and showed that inhibiting PRKDC sensitizes MYC-driven cells to radiation. This study shows that proteomics enables a more comprehensive, functional readout, providing a foundation for future therapeutic strategies. Third, we characterize the metabolomic space of MB on largely the same 45 tumors as used in the proteome-focused study. Here, we present preliminary insights derived from integrative network and other analyses. We find that MB consensus subgroups are preserved in metabolic space, and that certain classes of metabolites are elevated in MYC-activated MB.
We also show that, similar to other cancers, a previously described gain-of-function mutation in IDH1 may cause elevated 2-hydroxyglutarate levels in MB. The work described in this thesis significantly enhances previous knowledge of medulloblastoma and its subgroups, and provides insights that may aid in the development of medulloblastoma therapies in the near future.
9

Curti, Nico. "Implementazione e benchmarking dell'algoritmo QDANet PRO per l'analisi di big data genomici." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2016. http://amslaurea.unibo.it/12018/.

Abstract:
With the recent advent of NGS technologies, which can sequence entire human genomes at reduced time and cost, the ability to extract information from data plays a fundamental role in the progress of research. The computational problems connected to such analyses currently fall under the heading of Big Data, with databases containing many types of experimental data of ever-increasing size. This thesis deals with the implementation and benchmarking of the QDANet PRO algorithm, developed by the Biophysics group of the University of Bologna: the method processes high-dimensional data to extract a low-dimensional signature of features with high classification performance, through an analysis pipeline that includes dimensionality reduction algorithms. The method can also be generalised to the analysis of non-biological data that are nonetheless characterised by high volume and complexity, factors typical of Big Data. The QDANet PRO algorithm evaluates the performance of all possible pairs of features, estimating their discriminating power with a Naive Bayes Quadratic Classifier, and then determines their ranking. Once a performance threshold has been selected, a network of features is built, from which the connected components are determined. Each subgraph is analysed separately and reduced using methods based on network theory until the final signature is extracted. The method, previously tested with positive results on some datasets available to the research group, was compared against results obtained on omic databases available in the literature, which constitute a reference in the field, and against existing algorithms that perform similar tasks. To reduce computation time, the algorithm was implemented in C++ on HPC, with the most critical parts parallelised using OpenMP libraries.
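The thresholding step described in this abstract (keep the feature pairs whose performance passes a threshold, build a network from them, then split it into connected components) can be sketched as follows. This is only an illustration in Python, not the QDANet PRO implementation, which the abstract says is written in C++ with OpenMP. The pair scores are made-up numbers standing in for classifier performance, and the function name and threshold value are assumptions.

```python
from collections import defaultdict


def connected_components(pair_scores, threshold):
    """Keep feature pairs scoring at or above threshold, link them in an
    undirected graph, and return its connected components as sorted lists."""
    graph = defaultdict(set)
    for (a, b), score in pair_scores.items():
        if score >= threshold:
            graph[a].add(b)
            graph[b].add(a)

    seen, components = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        # Depth-first traversal collecting every feature reachable from start.
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(sorted(component))
    return components


# Hypothetical classification performance for pairs of features.
pair_scores = {
    ("f1", "f2"): 0.92,
    ("f2", "f3"): 0.88,
    ("f4", "f5"): 0.90,
    ("f1", "f5"): 0.60,
}
components = connected_components(pair_scores, threshold=0.85)
```

Here the weak `("f1", "f5")` pair is dropped, so the features fall into two separate subgraphs, each of which would then be reduced on its own in the pipeline the abstract describes.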
10

Arsenteva, Polina. "Statistical modeling and analysis of radio-induced adverse effects based on in vitro and in vivo data." Electronic Thesis or Diss., Bourgogne Franche-Comté, 2023. http://www.theses.fr/2023UBFCK074.

Abstract:
In this work we address the problem of adverse effects induced by radiotherapy on healthy tissues. The goal is to propose a mathematical framework to compare the effects of different irradiation modalities, in order ultimately to choose the treatments that produce the fewest adverse effects for potential use in the clinical setting. The adverse effects are studied in the context of two types of data: in terms of the in vitro omic response of human endothelial cells, and in terms of the adverse effects observed on mice in the framework of in vivo experiments. In the in vitro setting, we encounter the problem of extracting key information from complex temporal data that cannot be treated with the methods available in the literature. We model the radio-induced fold change, the object that encodes the difference in the effect of two experimental conditions, in a way that takes into account the uncertainties of measurements as well as the correlations between the observed entities. We construct a distance, with a further generalization to a dissimilarity measure, that allows the fold changes to be compared in terms of all their important statistical properties. Finally, we propose a computationally efficient algorithm performing clustering jointly with temporal alignment of the fold changes. The key features extracted from the latter are visualized using two types of network representations, for the purpose of facilitating biological interpretation. In the in vivo setting, the statistical challenge is to establish a predictive link between variables that, due to the specificities of the experimental design, can never be observed on the same animals. Since we do not have access to the joint distributions, we leverage the additional information on the observed groups to infer the linear regression model.
We propose two estimators of the regression parameters, one based on the method of moments and the other based on optimal transport, as well as estimators of the confidence intervals based on the stratified bootstrap procedure.
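The stratified bootstrap mentioned at the end of this abstract can be illustrated with a short sketch. It resamples within each observed group so that every replicate keeps the original group sizes, then reads a percentile interval from the replicates. The statistic here is a plain mean rather than the thesis's regression estimators, and the group names, data values and parameters are assumptions chosen for illustration.

```python
import random
from statistics import mean


def stratified_bootstrap_ci(strata, stat=mean, n_boot=2000, alpha=0.05, seed=0):
    """Percentile confidence interval for stat, resampling with replacement
    inside each stratum so every replicate keeps the original group sizes."""
    rng = random.Random(seed)
    replicates = []
    for _ in range(n_boot):
        sample = []
        for values in strata.values():
            sample.extend(rng.choices(values, k=len(values)))
        replicates.append(stat(sample))
    replicates.sort()
    lower = replicates[int((alpha / 2) * n_boot)]
    upper = replicates[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical measurements grouped by experimental condition.
strata = {
    "dose_low": [1.1, 0.9, 1.3, 1.0, 1.2],
    "dose_high": [2.0, 2.4, 1.9, 2.2, 2.1],
}
low, high = stratified_bootstrap_ci(strata)
```

Resampling within strata, rather than from the pooled data, prevents a replicate from over- or under-representing one experimental group purely by chance.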
11

LOVINO, MARTA. "Algorithms for complex systems in the life sciences." Doctoral thesis, Politecnico di Torino, 2021. http://hdl.handle.net/11583/2910082.

12

Serra, Angela. "Multi-view learning and data integration for omics data." Doctoral thesis, Universita degli studi di Salerno, 2017. http://hdl.handle.net/10556/2580.

Abstract:
2015 - 2016
In recent years, the advancement of high-throughput technologies, combined with the constant decrease of data-storage costs, has led to the production of large amounts of data from different experiments that characterise the same entities of interest. This information may relate to specific aspects of a phenotypic entity (e.g., gene expression), or can include the comprehensive and parallel measurement of multiple molecular events (e.g., DNA modifications, RNA transcription and protein translation) in the same samples. Exploiting such complex and rich data is needed in the frame of systems biology for building global models able to explain complex phenotypes. For example, the use of genome-wide data in cancer research, for the identification of groups of patients with similar molecular characteristics, has become a standard approach for applications in therapy response, prognosis prediction, and drug development. Moreover, the integration of gene expression data regarding cell treatment by drugs with information regarding the chemical structure of the drugs has allowed scientists to perform more accurate drug repositioning tasks. Unfortunately, there is a big gap between the amount of information and the knowledge into which it is translated. Moreover, there is a huge need for computational methods able to integrate and analyse data to fill this gap. Current research in this area follows two different integrative methods: one uses the complementary information of different measurements for the study of complex phenotypes on the same samples (multi-view learning); the other tends to infer knowledge about the phenotype of interest by integrating and comparing the experiments relating to it with those of different, already known phenotypes through comparative methods (meta-analysis).
Meta-analysis can be thought of as an integrative study of previous results, usually performed by aggregating the summary statistics from different studies. Due to its nature, meta-analysis usually involves homogeneous data. On the other hand, multi-view learning is a more flexible approach that considers the fusion of different data sources to get more stable and reliable estimates. Based on the type of data and the stage of integration, new methodologies have been developed spanning a landscape of techniques comprising graph theory, machine learning and statistics. Depending on the nature of the data and on the statistical problem to address, the integration of heterogeneous data can be performed at different levels: early, intermediate and late. Early integration consists in concatenating data from different views in a single feature space. Intermediate integration consists in transforming all the data sources into a common feature space before combining them. In late integration methodologies, each view is analysed separately and the results are then combined. The purpose of this thesis is twofold: the first objective is the definition of a data integration methodology for patient sub-typing (MVDA) and the second is the development of a tool for phenotypic characterisation of nanomaterials (INSIdE nano). In this PhD thesis, I present the methodologies and the results of my research. MVDA is a multi-view methodology that aims to discover new statistically relevant patient sub-classes. Identifying patient subtypes of a specific disease is a challenging task, especially in early diagnosis. This is a crucial point for the treatment, because not all the patients affected by the same disease will have the same prognosis or need the same drug treatment. This problem is usually solved by using transcriptomic data to identify groups of patients that share the same gene patterns.
The main idea underlying this research work is to combine more omics data for the same patients to obtain a better characterisation of their disease profile. The proposed methodology is a late integration approach based on clustering. It works by evaluating the patient clusters in each single view and then combining the clustering results of all the views by factorising the membership matrices in a late integration manner. The effectiveness and the performance of our method were evaluated on six multi-view cancer datasets related to breast cancer, glioblastoma, prostate and ovarian cancer. The omics data used for the experiments are gene and miRNA expression, RNASeq and miRNASeq, protein expression and copy number variation. In all the cases, patient sub-classes with statistical significance were found, identifying novel sub-groups not previously emphasised in the literature. The experiments were also conducted using prior information as a new view in the integration process, to obtain higher accuracy in patient classification. The method outperformed single-view clustering on all the datasets; moreover, it performs better than other multi-view clustering algorithms and, unlike other existing methods, it can quantify the contribution of single views to the results. The method has also been shown to be stable when the datasets are perturbed by removing one patient at a time and evaluating the normalized mutual information between all the resulting clusterings. These observations suggest that the integration of prior information with genomic features in sub-typing analysis is an effective strategy for identifying disease subgroups. INSIdE nano (Integrated Network of Systems bIology Effects of nanomaterials) is a novel tool for the systematic contextualisation of the effects of engineered nanomaterials (ENMs) in the biomedical context.
In recent years, omics technologies have been increasingly used to thoroughly characterise the molecular mode of action of ENMs. It is possible to contextualise the molecular effects of different types of perturbations by comparing their patterns of alterations. While this approach has been successfully used for drug repositioning, a comprehensive contextualisation of the ENM mode of action is still missing to date. The idea behind the tool is to use analytical strategies to contextualise or position the ENM with respect to relevant phenotypes that have been studied in the literature (such as diseases, drug treatments, and other chemical exposures) by comparing their patterns of molecular alteration. This could greatly increase the knowledge of the ENM molecular effects and in turn contribute to the definition of relevant pathways of toxicity, as well as help in predicting the potential involvement of ENMs in pathogenetic events or in novel therapeutic strategies. The main hypothesis is that suggestive patterns of similarity between sets of phenotypes could be an indication of a biological association to be further tested in toxicological or therapeutic frames. Based on the expression signature associated with each phenotype, the strength of similarity between each pair of perturbations has been evaluated and used to build a large network of phenotypes. To ensure the usability of INSIdE nano, a robust and scalable computational infrastructure has been developed to scan this large phenotypic network, and an effective web-based graphical user interface has been built. In particular, INSIdE nano was scanned to search for clique sub-networks: quadruplet structures of heterogeneous nodes (a disease, a drug, a chemical and a nanomaterial) completely interconnected by strong patterns of similarity (or anti-similarity).
The predictions have been evaluated for a set of known associations between diseases and drugs, based on drug indications in clinical practice, and between diseases and chemicals, based on literature evidence of causal exposure, with a focus on the possible involvement of nanomaterials in the most robust cliques. The evaluation of INSIdE nano confirmed that it highlights known disease-drug and disease-chemical connections. Moreover, disease similarities agree with the information based on their clinical features, and the similarities among drugs and chemicals mirror their resemblance in chemical structure. Altogether, the results suggest that INSIdE nano can also be successfully used to contextualise the molecular effects of ENMs and infer their connections to other better-studied phenotypes, speeding up their safety assessment as well as opening new perspectives concerning their usefulness in biomedicine. [edited by author]
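The quadruplet search described in the abstract can be pictured with a small sketch. The Python below is only an illustration under stated assumptions: the node names, the `similar` edge list and the function `heterogeneous_cliques` are hypothetical, an edge stands for a pair whose similarity (or anti-similarity) has already passed some threshold, and the brute-force enumeration is not the scalable infrastructure the thesis reports.

```python
from itertools import combinations, product


def heterogeneous_cliques(nodes_by_type, edges):
    """Return every selection of one node per type whose members are all
    pairwise connected, i.e. a clique with one node of each type."""
    # Store undirected edges as frozensets so lookup ignores direction.
    edge_set = {frozenset(edge) for edge in edges}
    types = sorted(nodes_by_type)
    cliques = []
    # Try every combination that picks exactly one node from each type.
    for combo in product(*(nodes_by_type[t] for t in types)):
        if all(frozenset(pair) in edge_set for pair in combinations(combo, 2)):
            cliques.append(dict(zip(types, combo)))
    return cliques


# Toy network: every name below is made up for illustration.
nodes_by_type = {
    "disease": ["diseaseA", "diseaseB"],
    "drug": ["drugA"],
    "chemical": ["chemX"],
    "nanomaterial": ["enm1", "enm2"],
}
similar = [
    ("diseaseA", "drugA"), ("diseaseA", "chemX"), ("diseaseA", "enm1"),
    ("drugA", "chemX"), ("drugA", "enm1"), ("chemX", "enm1"),
    ("diseaseB", "drugA"), ("diseaseB", "enm2"),
]

found = heterogeneous_cliques(nodes_by_type, similar)
print(found)
```

In this toy network only the quadruplet containing `diseaseA` and `enm1` is fully interconnected; `diseaseB` and `enm2` each lack at least one edge, so no clique includes them.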
The technological advancement of high-throughput technologies, combined with the constant decrease in storage costs, has led to the production of large amounts of data from different experiments that characterise the same entities of interest. This information may relate to specific phenotypic aspects (for example, gene expression), or may include global and parallel measurements of different molecular aspects (for example, DNA modifications, RNA transcription and protein translation) in the same samples. Analysing such complex data is useful in the field of systems biology for building models capable of explaining complex phenotypes. For example, the use of genome-wide data in cancer research, for the identification of groups of patients with similar molecular characteristics, has become a standard approach for more accurate early prognosis and for the identification of specific therapies. Moreover, the integration of gene expression data regarding the treatment of cells with drugs has allowed scientists to achieve high accuracy in drug repositioning. Unfortunately, there is a large gap between the data produced by the numerous experiments and the information into which they are translated. The scientific community therefore has a strong need for computational methods to integrate and analyse such data in order to fill this gap. Research in the field of multi-view analysis follows two different methods of integrative analysis: one uses the complementary information of different measurements to study complex phenotypes on different samples (multi-view learning); the other tends to infer knowledge about the phenotype of interest of an entity by comparing the experiments relating to it with those of other phenotypic entities already known in the literature (meta-analysis).
Meta-analysis can be thought of as a comparative study of the results identified in a particular experiment with respect to those of previous studies. Due to its nature, meta-analysis usually involves homogeneous data. Multi-view learning, on the other hand, is a more flexible approach that considers the fusion of different data sources to obtain more stable and reliable estimates. Depending on the type of data and the level of integration, new methodologies have been developed starting from techniques based on graph theory, machine learning and statistics. Depending on the nature of the data and the statistical problem to be solved, the integration of heterogeneous data can be performed at different levels: early, intermediate and late integration. Early integration techniques consist in concatenating the data of the different views into a single feature space. Intermediate integration techniques consist in transforming all the data sources into a single common space before combining them. In late integration techniques, each view is analysed separately and the results are then combined. The purpose of this thesis is twofold: the first objective is the definition of a data integration methodology for patient sub-typing (MVDA) and the second is the development of a tool for the phenotypic characterisation of nanomaterials (INSIdEnano). In this doctoral thesis I present the methodologies and the results of my research. MVDA is a multi-view technique that aims to discover new statistically relevant patient sub-types. Identifying patient subtypes for a specific disease is an objective of high importance in clinical practice, especially for the early diagnosis of diseases. This problem is generally solved by using transcriptomic data to identify groups of patients that share the same patterns of gene alteration.
The main idea underlying this research work is to combine several types of omics data for the same patients to obtain a better characterisation of their profile. The proposed methodology is a late integration approach based on clustering. For each view, the patients are clustered and the clustering is represented in the form of membership matrices. The results of all the views are then combined through a matrix factorisation technique to obtain the final multi-view metaclusters. The feasibility and performance of our method were evaluated on six multi-view datasets related to breast cancer, glioblastoma, prostate cancer and ovarian cancer. The omics data used for the experiments relate to gene expression, miRNA expression, RNASeq, miRNASeq, protein expression and copy number variation. In all the datasets, patient sub-types with statistical relevance were identified, revealing new subgroups not previously known in the literature. Further experiments were conducted using prior knowledge about the macro-classes of the patients. This information was treated as an additional view in the integration process to obtain higher accuracy in patient classification. The proposed method performs better than classical clustering algorithms on all the datasets. MVDA obtained better results than other early and intermediate integration algorithms. In addition, the method can compute the contribution of each single view to the final result. The results also show that the method is stable when the dataset is perturbed by removing one patient at a time (leave-one-out).
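The leave-one-out stability check, which the English abstract says was scored with normalized mutual information (NMI), might look roughly like the sketch below. Several things here are assumptions rather than details from the thesis: the geometric-mean normalisation of NMI (other normalisations exist), the helper names, the toy one-dimensional "patients" and the mean-threshold `cluster_fn`. As a simplification, each leave-one-out clustering is compared with the full-data clustering instead of with every other perturbed clustering.

```python
from collections import Counter
from math import log, sqrt


def entropy(labels):
    """Shannon entropy (natural log) of a list of cluster labels."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def normalized_mutual_information(a, b):
    """NMI of two labelings of the same items, normalised by the
    geometric mean of their entropies (an assumed convention)."""
    if len(a) != len(b):
        raise ValueError("labelings must cover the same items")
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    mi = sum((c / n) * log(c * n / (count_a[x] * count_b[y]))
             for (x, y), c in joint.items())
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0 or h_b == 0:
        # A single-cluster labeling carries no information.
        return 1.0 if h_a == h_b else 0.0
    return mi / sqrt(h_a * h_b)


def leave_one_out_nmi(values, cluster_fn):
    """Re-cluster with each item removed and score agreement with the
    full-data clustering restricted to the remaining items."""
    full = cluster_fn(values)
    scores = []
    for i in range(len(values)):
        kept = values[:i] + values[i + 1:]
        reference = full[:i] + full[i + 1:]
        scores.append(normalized_mutual_information(reference, cluster_fn(kept)))
    return scores


def threshold_clusters(values):
    """Hypothetical stand-in for a real clustering: split at the mean."""
    mean = sum(values) / len(values)
    return [int(v > mean) for v in values]


toy_patients = [1.0, 1.2, 0.9, 5.0, 5.3, 4.8]
scores = leave_one_out_nmi(toy_patients, threshold_clusters)
print(min(scores))
```

NMI is invariant to relabeling of clusters, which is why it suits comparisons between independent clustering runs whose cluster identifiers need not match.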
These observations suggest that integrating prior information and genomic features, to be used jointly during the analysis, is a winning strategy for identifying disease sub-types. INSIdE nano (Integrated Network of Systems bIology Effects of nanomaterials) is an innovative tool for the systematic contextualisation of the effects of nanoparticles (ENMs) in biomedical contexts. In recent years, omics technologies have been widely applied to characterise nanomaterials at the molecular level. The molecular-level effect of different types of perturbations can be contextualised by comparing their patterns of gene alteration. While this approach has been applied successfully in the field of drug repositioning, an extensive contextualisation of the effect of nanomaterials on cells is currently missing. The idea behind the tool is to use comparative analysis strategies to contextualise or position nanomaterials with respect to relevant phenotypes that have been studied in the literature (such as human diseases, drug treatments or exposures to chemical substances) by comparing their patterns of molecular alteration. This could increase the knowledge of the molecular effect of nanomaterials and contribute to the definition of new toxicological pathways, or identify possible involvement of nanomaterials in pathological events or in new therapeutic strategies. The underlying hypothesis is that identifying patterns of similarity between sets of phenotypes could indicate a biological association that must subsequently be tested in a toxicological or therapeutic setting. Based on the gene expression signature associated with each phenotype, the similarity between each pair of perturbations was evaluated and used to build a large network of interactions between phenotypes.
To ensure the usability of INSIdE nano, a robust and scalable computational infrastructure was developed to analyse this network. In addition, a website was built that allows users to query and visualise the network simply and efficiently. In particular, INSIdE nano was analysed by searching for all possible cliques of four heterogeneous elements (a nanomaterial, a drug, a disease and a chemical substance). A clique is a fully connected sub-network, in which every element is linked to all the others. Of all the cliques, only those for which the associations between drug and disease and between drug and chemical substances are known were considered significant. The known connections between drugs and diseases are based on the fact that the drug is prescribed to treat that disease. The known connections between diseases and chemical substances are based on evidence in the literature that those substances cause the disease. The focus was placed on the possible involvement of nanomaterials with the diseases present in these cliques. The evaluation of INSIdE nano confirmed that it highlights known connections between diseases and drugs and between diseases and chemical substances. Moreover, the similarity between diseases computed from genes is consistent with the information based on their clinical data. Likewise, the similarities between drugs and chemical substances mirror their similarities based on chemical structure. Overall, the results suggest that INSIdE nano can be used to contextualise the molecular effect of nanomaterials and to infer their connections to phenotypes previously studied in the literature. This method speeds up the assessment of their toxicity and opens new perspectives for their use in biomedicine. [edited by the author]
XV n.s.
13

Nonell, Mazelon Lara 1972. "New approaches in omics data modelling." Doctoral thesis, Universitat Pompeu Fabra, 2019. http://hdl.handle.net/10803/668053.

Abstract:
Breakthroughs in technology have allowed the extraction of large amounts of so-called omics data. The analysis and integration of this type of data by means of advanced statistical and bioinformatics methods will allow improvements in the management of diseases. The diversity and complexity of omics data have encouraged the development of hundreds of new statistical methods to meet this objective. Therefore, having appropriate methods to accommodate different data distributions and to model complex data structures becomes essential. This thesis presents advances in three directions in this regard. First, the study of several methods to assess non-linear associations, which is relevant when assessing the effect of environmental exposures (i.e., the exposome) on complex diseases. The study is accompanied by the development of the R package nlOmicAssoc. Second, the simplex distribution is proposed to analyse methylome data, since this distribution properly fits the beta values that are generated in this type of study. The extension to generalized linear models with simplex response is also proposed. Lastly, an R package, HOmics, has been developed to incorporate a priori biological knowledge into association studies by using Bayesian hierarchical models. It also implements methods to model the dependence between omics data, enabling data integration.
Technological advances have allowed us to obtain large amounts of so-called omics data. The analysis and integration of this kind of data by means of advanced statistical and bioinformatics methods should improve the management of diseases. The diversity and complexity of omics data have encouraged the development of hundreds of new statistical methods to meet this objective. It is therefore essential to have methods that accommodate the appropriate distributions and model complex data structures. In response, this thesis presents advances in three directions. First, the study of different methods for analysing non-linear associations, which is highly relevant in studies of association between environmental exposures (i.e., the exposome) and complex diseases. This analysis is accompanied by the development of the R package nlOmicAssoc. Second, the simplex distribution is proposed for analysing methylome data, since this distribution fits the beta values generated in this kind of study. The extension to generalized linear models with simplex response is also formulated. Lastly, the R package HOmics incorporates biological knowledge into association studies by means of Bayesian hierarchical models. It also implements methods to model the dependence between omics data, enabling data integration.
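For readers unfamiliar with the simplex distribution mentioned in the abstract, the sketch below evaluates its standard textbook density on the open interval (0, 1), which is the range of methylation beta values. It is not taken from the thesis or from any of its R packages; the function names and the midpoint-rule check are illustrative assumptions.

```python
from math import exp, pi, sqrt


def simplex_density(y, mu, sigma2):
    """Density of the simplex distribution with mean mu in (0, 1) and
    dispersion sigma2 > 0, evaluated at y in (0, 1)."""
    if not (0.0 < y < 1.0 and 0.0 < mu < 1.0 and sigma2 > 0.0):
        raise ValueError("need 0 < y < 1, 0 < mu < 1 and sigma2 > 0")
    # Unit deviance d(y; mu) of the simplex distribution.
    deviance = (y - mu) ** 2 / (y * (1.0 - y) * mu ** 2 * (1.0 - mu) ** 2)
    normaliser = sqrt(2.0 * pi * sigma2 * (y * (1.0 - y)) ** 3)
    return exp(-deviance / (2.0 * sigma2)) / normaliser


def integrate_density(mu, sigma2, steps=20000):
    """Midpoint-rule integral of the density over (0, 1), as a sanity check."""
    width = 1.0 / steps
    return width * sum(simplex_density((i + 0.5) * width, mu, sigma2)
                       for i in range(steps))


total = integrate_density(mu=0.5, sigma2=1.0)
print(round(total, 4))
```

The numerical integral should come out close to 1, which is a quick way to confirm that the density has been transcribed correctly before using it in a likelihood.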
14

Wang, Zhi. "Module-Based Analysis for "Omics" Data." Thesis, North Carolina State University, 2015. http://pqdtopen.proquest.com/#viewpdf?dispub=3690212.

Abstract:

This thesis focuses on methodologies and applications of module-based analysis (MBA) in omics studies to investigate the relationships between phenotypes and biomarkers, e.g., SNPs, genes, and metabolites. As an alternative to traditional single-biomarker approaches, MBA may increase the detectability and reproducibility of results, because biomarkers tend to have moderate individual effects but a significant aggregate effect; it may also improve the interpretability of findings and facilitate the construction of follow-up biological hypotheses, because MBA assesses biomarker effects in a functional context, e.g., pathways and biological processes. Finally, for exploratory "omics" studies, which usually begin with a full scan of a long list of candidate biomarkers, MBA provides a natural way to reduce the total number of tests, and hence relax the multiple-testing burden and improve power.

The first MBA project focuses on genetic association analysis that assesses the main and interaction effects for sets of genetic (G) and environmental (E) factors rather than for individual factors. We develop a kernel machine regression approach to evaluate the complete effect profile (i.e., the G, E, and G-by-E interaction effects separately or in combination) and construct a kernel function for the gene-environment (GE) interaction directly from the genetic kernel and the environmental kernel. We use simulation studies and real data applications to show improved performance of the kernel machine (KM) regression method over the commonly adopted PC regression methods across a wide range of scenarios. The largest gain in power occurs when the underlying effect structure involves complex GE interactions, suggesting that the proposed method could be a useful and powerful tool for performing exploratory or confirmatory analyses in GxE-GWAS.
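The abstract does not say how the GE kernel is built from the genetic and environmental kernels. One common construction, assumed here purely for illustration, is the element-wise (Hadamard) product of the two kernel matrices, which is again a valid kernel. The function names and the toy genotype and exposure values below are hypothetical.

```python
def linear_kernel(rows):
    """Gram matrix of inner products between sample rows."""
    return [[sum(x * y for x, y in zip(r1, r2)) for r2 in rows] for r1 in rows]


def interaction_kernel(k_g, k_e):
    """Element-wise (Hadamard) product of two kernel matrices
    computed on the same subjects in the same order."""
    return [[g * e for g, e in zip(row_g, row_e)]
            for row_g, row_e in zip(k_g, k_e)]


# Toy data: 3 subjects, 2 SNPs coded as allele counts, 1 exposure value.
genotypes = [[0, 1], [1, 2], [2, 0]]
exposures = [[1.0], [0.5], [2.0]]

k_g = linear_kernel(genotypes)    # genetic similarity between subjects
k_e = linear_kernel(exposures)    # environmental similarity
k_ge = interaction_kernel(k_g, k_e)

for row in k_ge:
    print(row)
```

With linear kernels on both sides, this product equals the linear kernel computed on all pairwise SNP-by-exposure product features, so the interaction terms never have to be formed explicitly.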

In the second MBA project, we extend the kernel machine framework developed in the first project to model biomarkers with network structure. A network summarizes the functional interplay among biological units; incorporating network information can model the biological effects more precisely, enhance the ability to detect true signals, and facilitate our understanding of the underlying biological mechanisms. In this work, we develop two kernel functions to capture different network structure information. Through simulations and a metabolomics study, we show that the proposed network-based methods can have markedly improved power over approaches that ignore network information.

Metabolites are the end products of cellular processes and reflect the ultimate responses of biological systems to genetic variations or environmental exposures. Because of these unique properties of metabolites, pharmacometabolomics aims to understand the underlying signatures that contribute to individual variations in drug responses and to identify biomarkers that can help predict response. To facilitate the mining of pharmacometabolomic data, we establish an MBA pipeline that has great practical value in the detection and interpretation of signatures, which may potentially indicate a functional basis for the drug response. We illustrate the utility of the pipeline by investigating two scientific questions in an aspirin study: (1) which metabolite changes can be attributed to aspirin intake, and (2) which metabolic signatures can help predict aspirin resistance. Results show that the MBA pipeline enables us to identify metabolic signatures that are not found in a preliminary single-metabolite analysis.

15

Müller, Nikola. "Finding correlations and independences in omics data." Diss., lmu, 2012. http://nbn-resolving.de/urn:nbn:de:bvb:19-144027.

16

Cicek, A. Ercument. "METABOLIC NETWORK-BASED ANALYSES OF OMICS DATA." Case Western Reserve University School of Graduate Studies / OhioLINK, 2013. http://rave.ohiolink.edu/etdc/view?acc_num=case1372866879.

17

Sathyanarayanan, Anita. "Integration of multi-omics data in cancer." Thesis, Queensland University of Technology, 2021. https://eprints.qut.edu.au/225924/1/Anita_Sathyanarayanan_Thesis.pdf.

Abstract:
Cancer is a complex disease with multiple molecular (omics) factors influencing the risk, development, prognosis, and treatment. The availability of large-scale multi-omics data has provided the opportunity to analyse these data jointly using advanced statistical approaches and to identify cancer drivers and regulatory pathways underpinning the disease. In the first study, this thesis provides much-needed guidance for conducting multi-omics analysis using open-source software tools. Next, it introduces an enrichment pipeline developed using imputation-based integration of multi-omics data, which was applied to breast and prostate cancers to identify the associated biomarkers and genes.
18

Bersanelli, Matteo <1987>. "Mathematical Physics Techniques for Omics Data Integration." Doctoral thesis, Alma Mater Studiorum - Università di Bologna, 2017. http://amsdottorato.unibo.it/7812/1/Bersanelli_Matteo_tesi.pdf.

Abstract:
Nowadays different types of high-throughput technologies allow us to collect information on the molecular components of biological systems. Each such technology is designed to simultaneously collect large sets of molecular data of a specific omic kind. In order to draw a more comprehensive view of biological processes, experimental data collected on different layers have to be integrated and analyzed. The complexity of biological systems, the technological limits, the large number of biological variables and the relatively low number of biological samples make integrative analyses a challenge. Hence, the development of methods for omics integration is one of the most relevant problems computational scientists are addressing nowadays. The most representative and promising techniques for the analysis of omics data are presented and broadly divided into categories. In the literature we notice a growing interest in approaches that use graphs for modeling the relationships among omic variables. In particular, we found that algorithms propagating molecular information on networks are being proposed in several applications and are often related to actual physical models. We considered the chemical master equation (CME) framework to model the exchange of information in biological networks as a stochastic process on the network. In this context we defined new algorithms and pipelines for the analysis of omics data. In particular, we propose two network-based methods with applications to both synthetic and prostate adenocarcinoma data. In both applications the molecular alterations are mapped onto the protein-protein interaction network. In the first application we defined a novel methodology for extracting modules of connected genes that present the most significant differential molecular information between two classes of samples.
In the second application we measure the degree to which a distribution of deleterious molecular information on a given network diverts the normal trajectories of information flow, using a perturbative approach to the CME.
19

Wack, Maxime. "Dimension longitudinale du suivi omique dans les entrepôts de données cliniques : application aux cancers suivis par biopsie liquide." Electronic Thesis or Diss., Université Paris Cité, 2024. http://www.theses.fr/2024UNIP5258.

Abstract:
A new technique in viral genomics enables the capture and sequencing of HPV (Human Papilloma Virus) in carrier patients. Integrating these data with the medical information in Clinical Data Warehouses (CDWs) opens new perspectives in translational research on virus-induced cancers. However, the necessary genomic data are not available in CDWs but in dedicated tools, which limits their use for these studies. We propose viroCapt, a bioinformatics pipeline that automates the analysis of HPV capture data and enables the characterisation of HPV-induced cancers. The use of viroCapt highlighted the need to integrate genomic data longitudinally in CDWs, particularly for cancer follow-up by liquid biopsy. The limited ability of CDWs to integrate these data and their longitudinal relationships led us to design gitOmmix, a method combining file version control systems and knowledge representations, to address this need. The work arising from the use of viroCapt showed its value in the follow-up of HPV-induced cancers and, more generally, virus-induced cancers. In addition, we designed a model enabling the integration of longitudinal omics data in CDWs. gitOmmix is generalisable to any large-scale data, agnostic of the CDW system, and improves adherence to the FAIR principles by adding provenance and access to source data. Our contribution enables a better characterisation of virus-induced cancers and highlights new challenges in translational research, motivating the design of a method for managing provenance and large-scale data in CDWs.
A novel technique in viral genomics enables capturing and sequencing HPV (Human Papilloma Virus) DNA present in patients with lesions associated with HPV. The integration of genomic data with information present in Clinical Data Warehouses (CDWs) opens new avenues in translational research on HPV-induced cancers. However, the necessary genomic data are not available in CDWs, but usually in dedicated tools, which strongly constrains such studies. We propose viroCapt, a bioinformatics pipeline automating the analysis of HPV capture data, enabling the characterization of patients with HPV-induced cancers. Using viroCapt in translational research highlighted the need to integrate longitudinal genomic data in CDWs, particularly in the case of ctDNA monitoring in cancer follow-up. This led us to consider the limits of CDWs in handling large files and longitudinal relationships. For this reason, we designed gitOmmix, a method combining file versioning systems and formal provenance knowledge representation to address longitudinal data integration. We show that viroCapt supports HPV-induced cancer follow-up and generalizes to other virus-induced cancers. We designed and implemented a model enabling the longitudinal collection of omic data in CDWs, supported by robust tools and standards. gitOmmix generalizes to other large biomedical data, is agnostic of any CDW system, and supports adherence to FAIR principles by adding provenance and versioned data access. Our contribution helped characterize virus-induced cancers and exposed new challenges in translational research. This motivated designing a general method to handle provenance and longitudinal management of high-throughput data in CDWs.
20

MASPERO, DAVIDE. "Computational strategies to dissect the heterogeneity of multicellular systems via multiscale modelling and omics data analysis." Doctoral thesis, Università degli Studi di Milano-Bicocca, 2022. http://hdl.handle.net/10281/368331.

Abstract:
Heterogeneity pervades biological systems and manifests itself in structural and functional differences observed both among different individuals of the same group (e.g., organisms or diseases) and among the constituent elements of a single individual (e.g., cells). The study of the heterogeneity of biological systems and, in particular, of multicellular systems is fundamental for the mechanistic understanding of complex physiological and pathological phenomena (e.g., cancer), as well as for the definition of effective prognostic, diagnostic and therapeutic strategies. This work focuses on the development and application of computational methods and mathematical models for characterising the heterogeneity of multicellular systems and of the cancer cell subpopulations underlying the evolution of a neoplastic disease. Similar methodologies have been developed to characterise viral evolution and heterogeneity effectively. The research is divided into two complementary parts, the first aimed at defining methods for the analysis and integration of omics data generated by sequencing experiments, the second at the multiscale modelling and simulation of multicellular systems. Regarding the first strand, next-generation sequencing technologies make it possible to generate enormous amounts of omics data, relating for example to the genome or transcriptome of a given individual, through bulk or single-cell sequencing experiments. One of the main challenges in computer science is to define computational methods for extracting useful information from such data, taking into account the high levels of data-specific errors, mainly due to technological limitations. In particular, this work concentrated on developing methods for the analysis of gene expression data and genomic mutations.
In detail, an exhaustive comparison of machine-learning methods for the denoising and imputation of single-cell RNA-sequencing data was carried out. In addition, methods were developed for mapping expression profiles onto metabolic networks, through an innovative framework that made it possible to stratify cancer patients according to their metabolism. A subsequent extension of the method made it possible to analyse the distribution of metabolic fluxes within a population of cells, via a flux balance analysis approach. Regarding the analysis of mutational profiles, the first method for reconstructing phylogenomic models from longitudinal data at single-cell resolution was designed and implemented; it exploits a framework that combines a Markov Chain Monte Carlo with a new weighted likelihood function. Similarly, a framework was developed that exploits the profiles of low-frequency mutations to reconstruct robust phylogenies and likely chains of infection, through the analysis of sequencing data from viral samples. The same mutational profiles also make it possible to deconvolve the signal into the signatures associated with the specific molecular mechanisms that generate such mutations, through an approach based on non-negative matrix factorization. The research on computational simulation led to the development of a multiscale model in which the simulation of cell population dynamics, represented through a Cellular Potts Model, is coupled with the optimisation of a metabolic model associated with each synthetic cell. With this model it is possible to represent hypotheses in mathematical terms and observe the properties that emerge from those assumptions.
Infine, un primo tentativo per combinare i due approcci metodologici ha condotto all'integrazione di dati di single-cell RNA-seq all'interno del modello multiscala, consentendo di formulare ipotesi data-driven sulle proprietà emergenti del sistema.
Heterogeneity pervades biological systems and manifests itself in the structural and functional differences observed both among different individuals of the same group (e.g., organisms or disease systems) and among the constituent elements of a single individual (e.g., cells). The study of the heterogeneity of biological systems and, in particular, of multicellular systems is fundamental for the mechanistic understanding of complex physiological and pathological phenomena (e.g., cancer), as well as for the definition of effective prognostic, diagnostic, and therapeutic strategies. This work focuses on developing and applying computational methods and mathematical models for characterising the heterogeneity of multicellular systems and, especially, cancer cell subpopulations underlying the evolution of neoplastic pathology. Similar methodologies have been developed to characterise viral evolution and heterogeneity effectively. The research is divided into two complementary portions, the first aimed at defining methods for the analysis and integration of omics data generated by sequencing experiments, the second at modelling and multiscale simulation of multicellular systems. Regarding the first strand, next-generation sequencing technologies allow us to generate vast amounts of omics data, for example, related to the genome or transcriptome of a given individual, through bulk or single-cell sequencing experiments. One of the main challenges in computer science is to define computational methods to extract useful information from such data, taking into account the high levels of data-specific errors, mainly due to technological limitations. In particular, in the context of this work, we focused on developing methods for the analysis of gene expression and genomic mutation data. In detail, an exhaustive comparison of machine-learning methods for denoising and imputation of single-cell RNA-sequencing data has been performed. 
Moreover, methods for mapping expression profiles onto metabolic networks have been developed through an innovative framework that has allowed one to stratify cancer patients according to their metabolism. A subsequent extension of the method allowed us to analyse the distribution of metabolic fluxes within a population of cells via a flux balance analysis approach. Regarding the analysis of mutational profiles, the first method for reconstructing phylogenomic models from longitudinal data at single-cell resolution has been designed and implemented, exploiting a framework that combines a Markov Chain Monte Carlo with a novel weighted likelihood function. Similarly, a framework that exploits low-frequency mutation profiles to reconstruct robust phylogenies and likely chains of infection has been developed by analysing sequencing data from viral samples. The same mutational profiles also allow us to deconvolve the signal in the signatures associated with specific molecular mechanisms that generate such mutations through an approach based on non-negative matrix factorisation. The research conducted with regard to the computational simulation has led to the development of a multiscale model, in which the simulation of cell population dynamics, represented through a Cellular Potts Model, is coupled to the optimisation of a metabolic model associated with each synthetic cell. Using this model, it is possible to represent assumptions in mathematical terms and observe properties emerging from these assumptions. Finally, we present a first attempt to combine the two methodological approaches which led to the integration of single-cell RNA-seq data within the multiscale model, allowing data-driven hypotheses to be formulated on the emerging properties of the system.
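The signature deconvolution mentioned in this abstract rests on non-negative matrix factorisation (NMF). As a rough illustration only, and not the thesis's implementation, the sketch below factorises a toy mutation-count matrix with the classic Lee-Seung multiplicative updates for the squared-error objective. The matrix `V`, the chosen rank, and every function name are assumptions made for this example.

```python
import random

def matmul(A, B):
    """Plain-Python matrix product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def squared_error(V, W, H):
    """Squared Frobenius norm of V - W @ H."""
    WH = matmul(W, H)
    return sum((v - x) ** 2 for rv, rx in zip(V, WH) for v, x in zip(rv, rx))

def nmf(V, rank, iters, seed=0, eps=1e-12):
    """Approximate V (categories x samples) as W @ H with non-negative factors."""
    rng = random.Random(seed)
    m, n = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(m)]
    H = [[rng.random() + 0.1 for _ in range(n)] for _ in range(rank)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * a / (d + eps) for h, a, d in zip(rh, ra, rd)]
             for rh, ra, rd in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (d + eps) for w, a, d in zip(rw, ra, rd)]
             for rw, ra, rd in zip(W, num, den)]
    return W, H

# Hypothetical toy data: 4 mutation categories, 5 samples, 2 hidden signatures.
true_signatures = [[8, 0], [2, 1], [0, 6], [1, 3]]
true_exposures = [[1, 2, 0, 3, 1], [2, 0, 3, 1, 1]]
V = matmul(true_signatures, true_exposures)

W_init, H_init = nmf(V, rank=2, iters=0)   # random starting point only
W, H = nmf(V, rank=2, iters=300)           # same start, after 300 updates
err_init = squared_error(V, W_init, H_init)
err_fit = squared_error(V, W, H)
```

Columns of `W` play the role of signatures and rows of `H` the per-sample exposures. In practice the rank has to be chosen, and the factors are only identified up to scaling and permutation.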
21

Zheng, Ning. "Mediation modeling and analysis for high-throughput omics data." Thesis, Uppsala universitet, Statistiska institutionen, 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-256318.

Abstract:
There is a strong need for powerful unified statistical methods for discovering the underlying genetic architecture of complex traits with the assistance of omics information. In this paper, two methods aiming to detect novel associations between the human genome and complex traits using intermediate omics data are developed based on statistical mediation modeling. We demonstrate theoretically that, given proper mediators, the proposed statistical mediation models have better power than genome-wide association studies (GWAS) that ignore the mediators, and can therefore detect associations that standard GWAS miss. For each of the modeling methods in this paper, an empirical example is given, where the association between a SNP and BMI missed by standard GWAS can be discovered by mediation analysis.
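To make the idea of mediation concrete, here is a minimal sketch of the textbook product-of-coefficients decomposition on simulated data. It is not the pair of methods developed in the thesis. The effect sizes, the noise levels, and all variable and function names are assumptions chosen for illustration.

```python
import random

def ols_slope(x, y):
    """Slope of the least-squares line y ~ x (with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

def ols_two_predictors(x1, x2, y):
    """Slopes (b1, b2) of y ~ x1 + x2 (with intercept), via centred normal equations."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    c1 = [v - m1 for v in x1]
    c2 = [v - m2 for v in x2]
    cy = [v - my for v in y]
    s11 = sum(a * a for a in c1)
    s22 = sum(b * b for b in c2)
    s12 = sum(a * b for a, b in zip(c1, c2))
    s1y = sum(a * v for a, v in zip(c1, cy))
    s2y = sum(b * v for b, v in zip(c2, cy))
    det = s11 * s22 - s12 * s12
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

# Simulated cohort (assumed model): the SNP acts on the trait only through the mediator.
rng = random.Random(0)
snp = [rng.choice([0, 1, 2]) for _ in range(500)]           # genotype dosage
mediator = [2.0 * g + rng.gauss(0, 1) for g in snp]         # e.g. an omics measurement
trait = [3.0 * m + rng.gauss(0, 0.5) for m in mediator]     # e.g. a BMI-like phenotype

a = ols_slope(snp, mediator)                         # SNP -> mediator
b, direct = ols_two_predictors(mediator, snp, trait) # mediator -> trait, SNP -> trait given mediator
indirect = a * b                                     # mediated effect
total = ols_slope(snp, trait)                        # what a marginal SNP-trait scan estimates
```

For ordinary least squares fitted on the same sample, `total` equals `direct + indirect` up to rounding, which is why the mediator carries information about the SNP-trait association.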
22

Ayati, Marzieh. "Algorithms to Integrate Omics Data for Personalized Medicine." Case Western Reserve University School of Graduate Studies / OhioLINK, 2018. http://rave.ohiolink.edu/etdc/view?acc_num=case1527679638507616.

23

Campanella, Gianluca. "Statistical analysis of '-omics' data : developments and applications." Thesis, Imperial College London, 2015. http://hdl.handle.net/10044/1/32109.

Abstract:
In recent years, increasingly efficient molecular biology techniques created new opportunities to harness large-scale repositories of biological material collected in epidemiological studies; however, methods to manipulate and analyse the wealth of information thus generated have lagged behind. The introductory chapter of this thesis presents the multifaceted field of 'computational epidemiology' from the perspectives of molecular biology, measurement theory, and statistical modelling. Focusing on measurement of DNA methylation levels, the author also reviews the state of the art, proposes novel pre-processing methods and evaluation frameworks, and provides recommendations for genome-wide studies of DNA methylation levels using Illumina Infinium® HumanMethylation450 BeadChips. The remaining chapters, in the form of three self-contained scientific articles, cover applications on the following topics: (i) DNA methylation differences associated with internal migration patterns within Italy; (ii) associations of DNA methylation profiles with adiposity measures, targeted gene expression, biomarkers of lipid and glucose metabolism, and risk of developing three obesity-associated diseases; (iii) associations of a dietary score with blood pressure, and with urinary metabolites as characterised by NMR spectroscopy. The thesis is concluded with general remarks and the presentation of some open problems that offer potential for future research.
24

Budimir, Iva <1992>. "Stochastic Modeling and Correlation Analysis of Omics Data." Doctoral thesis, Alma Mater Studiorum - Università di Bologna, 2021. http://amsdottorato.unibo.it/9792/1/Budimir_Iva_tesi.pdf.

Abstract:
We studied the properties of three different types of omics data: protein domains in bacteria, gene length in metazoan genomes and methylation in humans. Gene elongation and protein domain diversification are some of the most important mechanisms in the evolution of functional complexity. For this reason, the investigation of the dynamic processes that led to their current configuration can highlight the important aspects of genome and proteome evolution and consequently of the evolution of living organisms. The potential of methylation to regulate the expression of genes is usually attributed to the groups of close CpG sites. We performed the correlation analysis to investigate the collaborative structure of all CpGs on chromosome 21. The long-tailed distributions of gene length and protein domain occurrences were successfully described by the stochastic evolutionary model and fitted with the Poisson Log-Normal distribution. This approach included both demographic and environmental stochasticity and the Gompertzian density regulation. The parameters of the fitted distributions were compared at the evolutionary scale. This allowed us to define a novel protein-domain-based phylogenetic method for bacteria which performed well at the intraspecies level. In the context of gene length distribution, we derived a new generalized population dynamics model for diverse subcommunities which allowed us to jointly model both coding and non-coding genomic sequences. A possible application of this approach is a method for differentiation between protein-coding genes and pseudogenes based on their length. General properties of the methylation correlation structure were firstly analyzed for the large data set of healthy controls and later compared to the Down syndrome (DS) data set. The CpGs demonstrated strong group behaviour even across the large genomic distances. 
The differences detected in DS were surprisingly small, possibly because the small DS sample size reduced the power of the statistical analysis.
25

Zandonà, Alessandro. "Predictive networks for multi meta-omics data integration." Doctoral thesis, Università degli studi di Trento, 2017. https://hdl.handle.net/11572/367893.

Abstract:
The role of microbiome in disease onset and in equilibrium is being exposed by a wealth of high-throughput omics methods. All key research directions, e.g., the study of gut microbiome dysbiosis in IBD/IBS, indicate the need for bioinformatics methods that can model the complexity of the microbial communities ecology and unravel its disease-associated perturbations. A most promising direction is the “meta-omics” approach, that allows a profiling based on various biological molecules at the metagenomic scale (e.g., metaproteomics, metametabolomics) as well as different “microbial” omes (eukaryotes and viruses) within a system biology approach. This thesis introduces a bioinformatic framework for microbiota datasets that combines predictive profiling, differential network analysis and meta-omics integration. In detail, the framework identifies biomarkers discriminating amongst clinical phenotypes, through machine learning techniques (Random Forest or SVM) based on a complete Data Analysis Protocol derived by two initiatives funded by FDA: the MicroArray Quality Control-II and Sequencing Quality Control projects. The biomarkers are interpreted in terms of biological networks: the framework provides a setup for networks inference, quantification of networks differences based on the glocal Hamming and Ipsen-Mikhailov (HIM) distance and detection of network communities. The differential analysis of networks allows the study of microbiota structural organization as well as the evolving trajectories of microbial communities associated to the dynamics of the target phenotypes. Moreover, the framework combines a novel similarity network fusion method and machine learning to identify biomarkers from the integration of multiple meta-omics data. The framework implementation requires only standard open source computational biology tools, as a combination of R/Bioconductor and Python functions. 
In particular, full scripts for meta-omics integration are available in a GitHub repository to ease reuse (https://github.com/AleZandona/INF). The pipeline has been validated on original data from three different clinical datasets. First, the predictive profiling and the network differential analysis have been applied on a pediatric Inflammatory Bowel Disease (IBD) cohort (in faecal vs biopsy environments) and controls, in collaboration with a multidisciplinary team at the Ospedale Pediatrico Bambino Gesú (Rome, I). Then, the meta-omics integration has been tested on a paired bacterial and fungal gut microbiota human IBD datasets from the Gastroenterology Department of the Saint Antoine Hospital (Paris, F), thanks to the collaboration with “Commensals and Probiotics-Host Interactions” team at INRA (Jouy-en-Josas, F). Finally, the framework has been validated on a bacterial-fungal gut microbiota dataset from children affected by Rett syndrome. The different nature of datasets used for validation naturally supports the extension of the framework on different omics datasets. Besides, clinical practice can take advantage of our framework, given the reproducibility and robustness of results, ensured by the adopted Data Analysis Protocol, as well as the biological relevance of the findings, confirmed by the clinical collaborators. Specifically, the omics-based dysbiosis profiles and the inferred biological networks can support the current diagnostic tools to reveal disease-associated perturbations at a much prodromal earlier stage of disease and may be used for disease prevention, diagnosis and prognosis.
26

Zandonà, Alessandro. "Predictive networks for multi meta-omics data integration." Doctoral thesis, University of Trento, 2017. http://eprints-phd.biblio.unitn.it/2547/1/zandona2017_phdthesis.pdf.

Abstract:
The role of microbiome in disease onset and in equilibrium is being exposed by a wealth of high-throughput omics methods. All key research directions, e.g., the study of gut microbiome dysbiosis in IBD/IBS, indicate the need for bioinformatics methods that can model the complexity of the microbial communities ecology and unravel its disease-associated perturbations. A most promising direction is the “meta-omics” approach, that allows a profiling based on various biological molecules at the metagenomic scale (e.g., metaproteomics, metametabolomics) as well as different “microbial” omes (eukaryotes and viruses) within a system biology approach. This thesis introduces a bioinformatic framework for microbiota datasets that combines predictive profiling, differential network analysis and meta-omics integration. In detail, the framework identifies biomarkers discriminating amongst clinical phenotypes, through machine learning techniques (Random Forest or SVM) based on a complete Data Analysis Protocol derived by two initiatives funded by FDA: the MicroArray Quality Control-II and Sequencing Quality Control projects. The biomarkers are interpreted in terms of biological networks: the framework provides a setup for networks inference, quantification of networks differences based on the glocal Hamming and Ipsen-Mikhailov (HIM) distance and detection of network communities. The differential analysis of networks allows the study of microbiota structural organization as well as the evolving trajectories of microbial communities associated to the dynamics of the target phenotypes. Moreover, the framework combines a novel similarity network fusion method and machine learning to identify biomarkers from the integration of multiple meta-omics data. The framework implementation requires only standard open source computational biology tools, as a combination of R/Bioconductor and Python functions. 
In particular, full scripts for meta-omics integration are available in a GitHub repository to ease reuse (https://github.com/AleZandona/INF). The pipeline has been validated on original data from three different clinical datasets. First, the predictive profiling and the network differential analysis have been applied on a pediatric Inflammatory Bowel Disease (IBD) cohort (in faecal vs biopsy environments) and controls, in collaboration with a multidisciplinary team at the Ospedale Pediatrico Bambino Gesú (Rome, I). Then, the meta-omics integration has been tested on a paired bacterial and fungal gut microbiota human IBD datasets from the Gastroenterology Department of the Saint Antoine Hospital (Paris, F), thanks to the collaboration with “Commensals and Probiotics-Host Interactions” team at INRA (Jouy-en-Josas, F). Finally, the framework has been validated on a bacterial-fungal gut microbiota dataset from children affected by Rett syndrome. The different nature of datasets used for validation naturally supports the extension of the framework on different omics datasets. Besides, clinical practice can take advantage of our framework, given the reproducibility and robustness of results, ensured by the adopted Data Analysis Protocol, as well as the biological relevance of the findings, confirmed by the clinical collaborators. Specifically, the omics-based dysbiosis profiles and the inferred biological networks can support the current diagnostic tools to reveal disease-associated perturbations at a much prodromal earlier stage of disease and may be used for disease prevention, diagnosis and prognosis.
27

Bussoli, Ilaria. "Heterogeneous Graphical Models with Applications to Omics Data." Doctoral thesis, Università degli studi di Padova, 2019. http://hdl.handle.net/11577/3423293.

Abstract:
Thanks to the advances in bioinformatics and high-throughput methodologies of the last decades, an unprecedented amount of biological data from experiments in metabolomics, genomics and proteomics is available. This has led researchers to conduct increasingly comprehensive molecular profiling of biological samples across multiple aspects of genomic activity, thus introducing new challenges in the development of statistical tools to integrate and model multi-omics data. The main research objective of this thesis is to develop a statistical framework for modelling the interactions between genes when their activity is measured on different domains; to do so, our approach relies on the concept of multilayer network and on how structures of this type can be combined with graphical models for mixed data, i.e., data comprising variables of different nature (e.g., continuous, categorical, skewed, to name a few). We further develop an algorithm for learning the structure of the undirected multilayer networks underlying the proposed models, and show its promising results through empirical analyses on cancer data downloaded from the public TCGA consortium.
28

Kim, Jieun. "Computational tools for the integrative analysis of multi-omics data to decipher trans-omics networks." Thesis, The University of Sydney, 2022. https://hdl.handle.net/2123/28524.

Abstract:
Regulatory networks define the phenotype, morphology, and function of cells. These networks are built from the basic building blocks of the cell (DNA, RNA, and proteins) and cut across the respective omics layers (genome, transcriptome, and proteome). The resulting omics networks depict a near-infinite number of possible nodes and edges that intricately connect the ‘omes’. With the rapid advancement of technologies that generate omics data from bulk samples and now at single-cell resolution, the life sciences face the challenge of connecting these omes to generate trans-omics networks. To this end, this thesis addresses some of the pressing challenges in trans-omics network reconstruction and the integrative analysis of omics data at both bulk and single-cell resolution: 1) the lack of an integrated pipeline for processing and downstream analysis of lesser-studied omics layers; 2) the need for an integrative framework to reconstruct transcriptional networks and discover novel regulators of transcription; and 3) the development of tools for the reconstruction of single-cell multi-modal transcriptional regulatory networks (TRNs). I envision that the work in this thesis will contribute to the integrative study of bulk and single-cell trans-omics data, which I believe will become essential and standard in molecular biology as the comprehensiveness and accuracy of omics measurements, and of the databases connecting different omics layers, improve.
29

Erten, Mehmet Sinan. "Algorithms for discovering disease genes by integrating 'omics data." Case Western Reserve University School of Graduate Studies / OhioLINK, 2013. http://rave.ohiolink.edu/etdc/view?acc_num=case1343769483.

30

Ding, Hao. "Visualization and Integrative analysis of cancer multi-omics data." The Ohio State University, 2016. http://rave.ohiolink.edu/etdc/view?acc_num=osu1467843712.

31

Nikolayeva, Iryna. "Network and machine learning approaches to dengue omics data." Thesis, Sorbonne Paris Cité, 2017. http://www.theses.fr/2017USPCB032/document.

Abstract:
The last 20 years have seen the emergence of powerful measurement technologies, enabling omics analysis of diverse diseases. They often provide non-invasive means to study the etiology of newly emerging complex diseases, such as mosquito-borne dengue infection. My dissertation concentrates on adapting and applying network and machine learning approaches to genomic and transcriptomic data. The first part goes beyond a previously published genome-wide analysis of 4,026 individuals by applying network analysis to find groups of interacting genes in a gene functional interaction network that, taken together, are associated with severe dengue. In this part, I first recalculated association p-values of the sequenced polymorphisms, then worked on mapping polymorphisms to functionally related genes, and finally explored different pathway and gene interaction databases to find groups of genes that are jointly associated with severe dengue. The second part of my dissertation presents a theoretical approach to studying a size bias of active network search algorithms. My theoretical analysis suggests that the best score of subnetworks of a given size should be size-normalized, based on the hypothesis that it is a sample from an extreme value distribution, and not a sample from the normal distribution, as usually assumed in the literature. I then suggest a theoretical solution to this bias. The third part introduces a new subnetwork search tool that I co-designed. Its underlying model and the corresponding efficient algorithm avoid the size bias found in existing methods and generate easily comprehensible results. I present an application to transcriptomic dengue data. In the fourth and last part, I describe the identification of a biomarker that detects dengue severity outcome upon arrival at the hospital using a novel machine learning approach. This approach combines two-dimensional monotonic regression with feature selection. 
The underlying model goes beyond the commonly used linear approaches, while allowing control of the number of transcripts in the biomarker. The small number of transcripts, along with their visual representation, maximizes the understanding and interpretability of the biomarker by biomedical professionals. I present an 18-gene biomarker that distinguishes, upon arrival at the hospital, patients who will develop severe dengue from those who will not, with high and robust predictive performance. The predictive performance of the biomarker has been confirmed on two datasets that used different transcriptomic technologies and different blood cell subtypes.
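The size bias discussed in the second part of this abstract can be seen in a small simulation. The sketch below is an assumption-laden illustration, not the thesis's tool or its theoretical correction: it ignores network connectivity (every k-node subset counts as a candidate subnetwork), scores a subset by the sum of its z-scores divided by the square root of k, and calibrates against an empirical null instead of a fitted extreme value distribution. All names and parameters are hypothetical.

```python
import itertools
import math
import random
import statistics

def best_score(z, k):
    """Highest aggregate z-score over all k-node subsets (brute force)."""
    return max(sum(c) / math.sqrt(k) for c in itertools.combinations(z, k))

def null_best_scores(n_nodes, k, trials, rng):
    """Best scores of size k when every node score is pure N(0, 1) noise."""
    return [best_score([rng.gauss(0, 1) for _ in range(n_nodes)], k)
            for _ in range(trials)]

def size_normalised(score, null_scores):
    """Express a best score relative to the null best scores of the same size."""
    return (score - statistics.mean(null_scores)) / statistics.stdev(null_scores)

rng = random.Random(1)
null_k1 = null_best_scores(12, 1, 200, rng)
null_k3 = null_best_scores(12, 3, 200, rng)
mean_k1 = statistics.mean(null_k1)
mean_k3 = statistics.mean(null_k3)
```

Each individual subset score is standard normal under the null, yet the maximum over all subsets is not, and its typical value grows with k here because more candidates compete. Comparing raw best scores across sizes therefore favours larger subnetworks, which is the motivation for normalising per size.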
32

Jagtap, Surabhi. "Multilayer Graph Embeddings for Omics Data Integration in Bioinformatics." Electronic Thesis or Diss., université Paris-Saclay, 2023. http://www.theses.fr/2023UPAST014.

Abstract:
Biological systems are composed of interacting biomolecules at different molecular levels. With the advent of high-throughput technologies, omics data at each molecular level can be easily obtained. These huge, complex multi-omics data can provide insights into the flow of information across multiple levels, unraveling the mechanisms underlying the biological condition of interest. Integration of different omics data types is often expected to elucidate potential causative changes that lead to specific phenotypes or targeted treatments. With the recent advances in network science, we choose to handle this integration issue by representing omics data through networks. In this thesis, we have developed three models, namely BraneExp, BraneNet, and BraneMF, for learning node embeddings from multilayer biological networks generated with omics data. We aim to tackle various challenging problems arising in multi-omics data integration by developing expressive and scalable methods capable of leveraging the rich latent structural semantics of real-world networks.
33

Patrizi, Sara. "Multi-omics approaches to complex diseases in children." Doctoral thesis, Università degli Studi di Trieste, 2022. http://hdl.handle.net/11368/3015193.

Abstract:
“-Omic” technologies can detect the entirety of the molecules in the biological sample of interest, in a non-targeted and non-biased fashion. The integration of multiple types of omics data, known as “multi-omics” or “vertical omics”, can provide a better understanding of how the cause of a disease leads to its functional consequences. This is particularly valuable in the study of complex diseases, which are caused by the interaction of multiple genetic and regulatory factors with contributions from the environment. In the present work, appropriate multi-omics approaches are applied to two complex conditions that usually first manifest in childhood, have a rising incidence, and have gaps in the knowledge of their molecular pathology: Congenital Lung Malformations and Coeliac Disease (CD). The aims are, respectively, to verify whether cancer-associated genomic variants or DNA methylation features exist in the malformed lung tissue, and to find common alterations in the methylome and the transcriptome of small intestine epithelial cells of children with CD. The Congenital Lung Malformations project used whole genome methylation microarrays and whole genome sequencing; the Coeliac Disease project used whole genome methylation microarrays and mRNA sequencing. Differentially methylated regions in possibly cancer-related genes were found in every one of the 20 lung malformation samples included. Moreover, 5 malformed samples had at least one somatic missense single nucleotide variant in genes known as lung cancer drivers, and 5 malformed samples had a total of 2 deletions of lung cancer driver tumour suppressors and 10 amplifications of lung cancer driver oncogenes. The data showed that congenital lung malformations can have premalignant genetic and epigenetic features that are impossible to predict from clinical information alone.
In the second project, Principal Component Analysis of the whole genome methylation data showed that CD patients divide into two clusters, one of which overlaps with controls. 174 genes were differentially methylated compared to the controls in both clusters. Principal Component Analysis of gene expression data (mRNA-Seq) showed a distribution similar to that of the methylation data, and 442 genes were differentially expressed in both clusters. Six genes, mainly related to interferon response and antigen processing and presentation, were both differentially expressed and differentially methylated in both clusters. These results show that the intestinal epithelial cells of individuals with CD are highly variable from a molecular point of view, but share some fundamental differences that enable them to respond to interferons and to process and present antigens more efficiently than controls. Despite their limitations, the present studies show that targeted multi-omics approaches can be set up to answer relevant disease-specific questions by investigating many cellular functions at once, often generating new hypotheses and making unexpected discoveries in the process.
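The step of finding genes that are altered in both patient clusters and in both omics layers amounts to intersecting four gene lists. A minimal sketch of that idea, assuming hypothetical placeholder gene identifiers and variable names that do not come from the thesis:

```python
# Hypothetical inputs: genes flagged per patient cluster and per omics layer.
# The identifiers below are placeholders, not results from the thesis.
diff_methylated = {
    "cluster_1": {"GENE_A", "GENE_B", "GENE_C", "GENE_D"},
    "cluster_2": {"GENE_B", "GENE_C", "GENE_D", "GENE_E"},
}
diff_expressed = {
    "cluster_1": {"GENE_C", "GENE_D", "GENE_F"},
    "cluster_2": {"GENE_C", "GENE_D", "GENE_G"},
}


def shared_across_clusters(genes_by_cluster):
    """Return the genes that are flagged in every cluster."""
    return set.intersection(*genes_by_cluster.values())


# Genes altered in both clusters, separately for each omics layer.
methylated_in_all = shared_across_clusters(diff_methylated)
expressed_in_all = shared_across_clusters(diff_expressed)

# Genes altered in both clusters AND in both layers.
shared_both_layers = methylated_in_all & expressed_in_all
print(sorted(shared_both_layers))
```

Keeping the per-layer intersections as separate intermediate sets makes it easy to report the layer-specific counts alongside the final shared list.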
34

Tellaroli, Paola. "Three topics in omics research." Doctoral thesis, Università degli studi di Padova, 2015. http://hdl.handle.net/11577/3423912.

Abstract:
The rather generic title of this thesis reflects the fact that several aspects of biological phenomena have been investigated. Most of this work investigates the limitations of one of the essential tools for analyzing gene expression data: cluster analysis. With several hundred clustering methods in existence, there is clearly no shortage of clustering algorithms but, at the same time, satisfactory answers to some basic questions are still to come. In particular, we present a novel algorithm for the clustering of static data and a new strategy for the clustering of short time-course data. Finally, we analyzed data from Cap Analysis Gene Expression, a relatively new technology that is useful for genome-wide promoter analysis and still mostly unexplored.
The rather generic title of this thesis reflects the fact that several aspects of biological phenomena have been investigated. Most of this work was devoted to investigating the limits of one of the essential tools for analysing gene expression data: cluster analysis. With several hundred clustering methods in existence, there is clearly no shortage of clustering algorithms but, at the same time, some fundamental questions have not yet received satisfactory answers. In particular, we present a new clustering algorithm for static data and a new strategy for clustering short time-course data. Finally, we analysed data from a relatively new technology called Cap Analysis Gene Expression, which is useful for genome-wide promoter analysis and still largely unexplored.
35

Lu, Yingzhou. "Multi-omics Data Integration for Identifying Disease Specific Biological Pathways." Thesis, Virginia Tech, 2018. http://hdl.handle.net/10919/83467.

Abstract:
Pathway analysis is an important task for gaining novel insights into the molecular architecture of many complex diseases. With the advancement of new sequencing technologies, a large amount of quantitative gene expression data has been continuously acquired. The emergence of omics data sets such as proteomics has facilitated the investigation of disease-relevant pathways. Although much work has previously been done to explore single-omics data, little work has been reported using multi-omics data integration, mainly due to methodological and technological limitations. While a single omics data type can provide useful information about the underlying biological processes, multi-omics data integration would give a much more comprehensive picture of the cause-effect processes responsible for diseases and their subtypes. This project investigates the combination of miRNAseq, proteomics, and RNAseq data on seven types of muscular dystrophies and a control group. These unique multi-omics data sets provide us with the opportunity to identify the most relevant disease-specific biological pathways. We first perform the t-test and the OVEPUG test separately to define the differentially expressed genes in the protein and mRNA data sets. In multi-omics data sets, miRNAs also play a significant role in muscle development by regulating their target genes in the mRNA data set. To exploit the relationship between miRNA and gene expression, we consult the commonly used gene library TargetScan to collect all miRNA-mRNA and miRNA-protein co-expression pairs. Next, by conducting statistical analysis such as Pearson's correlation coefficient or the t-test, we measure the biologically expected correlation of each gene with its upstream miRNAs and identify those showing negative correlation among the aforementioned miRNA-mRNA and miRNA-protein pairs.
Furthermore, we identify and assess the most relevant disease-specific pathways by inputting the differentially expressed genes and the negatively correlated genes into the gene-set libraries respectively, and further characterize these prioritized marker subsets using IPA (Ingenuity Pathway Analysis) or KEGG. We will then use Fisher's method to combine the p-values derived from the separate gene sets into a joint significance test assessing common pathway relevance. In conclusion, we will find all negatively correlated miRNA-mRNA and miRNA-protein pairs and identify several pathophysiological pathways related to muscular dystrophies by gene set enrichment analysis. This novel multi-omics data integration study and the subsequent pathway identification will shed new light on pathophysiological processes in muscular dystrophies and improve our understanding of the molecular pathophysiology of muscle disorders, supporting the prevention and treatment of disease and better health in the long term.
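Fisher's method, named in the abstract above, combines k independent p-values through the statistic X = -2 Σ ln p_i, which follows a chi-square distribution with 2k degrees of freedom under the null hypothesis. A minimal sketch, assuming the tests are independent; the function name and the example p-values are illustrative and not taken from the thesis:

```python
import math


def fisher_combined_pvalue(p_values):
    """Combine independent p-values with Fisher's method."""
    if not p_values:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(p_values)
    # Half of the test statistic: X / 2 = -sum(ln p_i).
    half_stat = -sum(math.log(p) for p in p_values)
    # With an even number of degrees of freedom (2k), the chi-square
    # survival function has a closed form:
    #   P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


# Hypothetical p-values for one pathway from three separate gene sets.
print(fisher_combined_pvalue([0.04, 0.20, 0.01]))
```

The closed-form survival function avoids a dependency on a statistics library; with SciPy available, `scipy.stats.combine_pvalues` offers the same method.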
Master of Science
36

Zampieri, Guido. "Prioritisation of candidate disease genes via multi-omics data integration." Doctoral thesis, Università degli studi di Padova, 2018. http://hdl.handle.net/11577/3421826.

Abstract:
The uncovering of genes linked to human diseases is a pressing challenge in molecular biology, towards the full achievement of precision medicine. Next-generation technologies provide an unprecedented amount of biological information, but at the same time they unveil enormous numbers of candidate disease genes and pose novel challenges at multiple analytical levels. Multi-omics data integration is currently the principal strategy to prioritise candidate disease genes. In particular, kernel-based methods are a powerful resource for the integration of biological knowledge, but their use is often precluded by their limited scalability. In this thesis, we propose a scalable kernel-based method for gene prioritisation which implements a novel multiple kernel learning approach, based on a semi-supervised perspective and on the optimisation of the margin distribution in binary problems. Our method is optimised to cope with strongly unbalanced settings where known disease genes are few and large-scale predictions are required. Importantly, it can deal efficiently both with a large number of candidate genes and with an arbitrary number of data sources. Through the simulation of real case studies, we show that our method outperforms a wide range of state-of-the-art methods and has enhanced scalability compared to existing kernel-based approaches for genomic data. We apply the proposed method to investigate the potential role, for disease gene prediction, of metabolic rearrangements caused by genetic perturbations. To this end, we use constraint-based modelling of metabolism to generate gene-specific information at a genome scale, which is mined via machine learning. Moreover, we compare constraint-based modelling and our kernel-based method as alternative integration strategies for omics data such as transcriptional profiles.
Experimental assessments across various cancers demonstrate that information on metabolic rewiring reconstructed in silico can be valuable to prioritise associated genes, although accuracy strongly depends on the cancer type. Despite these fluctuations, predictions achieved starting from metabolic modelling are largely complementary to those from gene expression or pathway annotations, highlighting the potential of this approach to identify novel genes involved in cancer.
The discovery of genes linked to human diseases is a pressing challenge in molecular biology, on the way to fully achieving precision medicine. Next-generation technologies provide an unprecedented amount of biological information, but at the same time they reveal enormous numbers of candidate disease genes and pose new challenges at multiple levels of analysis. Multi-omic data integration is currently the main strategy for prioritising candidate disease genes. In particular, kernel-based methods are a powerful resource for integrating biological knowledge, yet their use is often precluded by their limited scalability. In this thesis, we propose a new scalable kernel method for gene prioritisation, which applies a new multiple kernel learning approach based on a semi-supervised perspective and on the optimisation of the margin distribution in binary problems. Our method is optimised to cope with strongly unbalanced conditions in which few known disease genes are available and large-scale predictions are required. Significantly, it can handle both a large number of candidates and an arbitrary number of information sources. Through the simulation of real case studies, we show that our method outperforms a wide range of state-of-the-art methods and scales better than existing kernel methods for genomic data. We apply the proposed method to study the potential role, for disease gene prediction, of metabolic rearrangements caused by genetic perturbations. To this end, we use constraint-based models of metabolism to generate genome-scale information on genes, which is analysed via machine learning.
Moreover, we compare constraint-based models and our kernel-based method as alternative integration strategies for omic data such as transcriptional profiles. Experimental evaluations across various cancers show that metabolic rearrangements reconstructed in silico can be useful for prioritising the associated genes, although accuracy strongly depends on the cancer type. Despite these fluctuations, predictions based on metabolic models are largely complementary to those based on gene expression or pathway annotations, highlighting the potential of this approach to identify new genes involved in cancer.
37

Mönchgesang, Susann [Verfasser]. "Metabolomics and biochemical omics data - integrative approaches : [kumulative Dissertation] / Susann Mönchgesang." Halle, 2017. http://d-nb.info/1131075994/34.

38

Konrad, Attila. "Investigation of Pathway Analysis Tools for mapping omics data to pathways." Thesis, Malmö högskola, Fakulteten för teknik och samhälle (TS), 2014. http://urn.kb.se/resolve?urn=urn:nbn:se:mau:diva-20843.

Abstract:
This thesis reviews analysis tools from an interdisciplinary perspective. There are many different analysis tools today that analyse specific types of omics data, and we therefore investigate how many there are and what they can do. By defining a number of specific requirements, such as how many types of omics data a tool can handle and the accuracy of its analysis, one can see which analysis tools are most suitable for mapping omics data. The results show that no analysis tool available today meets the specified requirements or the main goal when the software is tested. The Ingenuity analysis tool comes closest to the requirements we are looking for. At the end user's request, two analysis tools were tested to see whether a combination of them can meet the end user's requirements. The Uniprot batch converter analysis tool was tested with FEvER, but the result was not successful, since the combination of these tools is no better than the Ingenuity analysis tool. Focus then turns to an alternative combination, a website called NCBI. The website has a search engine connected to several different analysis tools that are free to use. Through the search engine, “omics” data can be combined and more than one input value can be handled at a time. Because technology is advancing rapidly, however, new analysis tools are needed for data handling, and in the near future we may have an analysis tool that meets the end users' requirements.
This thesis examines pathway analysis tools (PATs) from a multidisciplinary view. Many PATs exist today that analyse specific types of omics data; we therefore investigate them and what they can do. By defining specific requirements, such as how many omics data types a tool can handle and the accuracy of the PAT, the most suitable PAT for mapping omics data to pathways can be identified. Results show that no PAT found today fulfils the specific set of requirements or the main goal through software testing. The Ingenuity PAT comes closest to fulfilling the requirements. At the request of the end user, two PATs were tested in combination to see if they can fulfil the requirements of the end user. Uniprot batch converter was tested with FEvER, and the results were not successful, since the combination of the two PATs is no better than the Ingenuity PAT. Focus then turned to an alternative combination, a website called NCBI that has search engines connected to several freely available PATs, thus fulfilling the requirements. Through the search engine, “omics” data can be combined and more than one input can be taken at a time. Since technology is rapidly moving forward, the need for new tools for data interpretation also grows. This means that in the near future we may be able to find a PAT that fulfils the requirements of the end users.
39

Castleberry, Alissa. "Integrated Analysis of Multi-Omics Data Using Sparse Canonical Correlation Analysis." The Ohio State University, 2019. http://rave.ohiolink.edu/etdc/view?acc_num=osu15544898045976.

40

Strbenac, Dario. "Novel Preprocessing Approaches for Omics Data Types and Their Performance Evaluation." Thesis, The University of Sydney, 2016. http://hdl.handle.net/2123/16007.

Abstract:
A diverse range of high-dimensional datasets has recently become available to help elucidate the functioning of biological systems and defects within those systems leading to disease. All of these new technologies come with the challenges of determining how the raw data should be efficiently processed or normalised and, subsequently, how the data can best be summarised for more complex downstream analysis. There are many approaches to summarising and normalising omics data, with new methods frequently being developed. To date, there has not been a comprehensive evaluation of existing methods for many omics data types. This thesis focusses on systematically evaluating existing methods for three different types of omics data and, having identified limitations in the current methods, also proposes new approaches to improve their quality. Firstly, CAGE-seq data are considered. A two-stage method based on a novel region-finding algorithm followed by a classifier that integrates sequence patterns surrounding the identified regions is shown to possess superior performance to two existing methods. Similarly, a novel data summarisation approach to gene expression data, which integrates changes in location and scale into a unified metric, demonstrates benefits in two-class classification problems. The error rates are found to be competitive with existing methods, and the feature selection has higher stability and increased biological relevance. Finally, in the proteomics setting, there are many choices for how to summarise peptides to proteins, as well as issues relating to batch effects and whether internal controls are necessary. By developing a broad variety of performance metrics, and an accompanying web-based framework, novel recommendations about peptide-to-protein summaries and batch correction algorithms are made, and a surprising result regarding the necessity of internal standards is revealed.
41

Pestarino, Luca <1992>. "Challenges and Opportunities of Machine Learning for Clinical and Omics Data." Doctoral thesis, Alma Mater Studiorum - Università di Bologna, 2022. http://amsdottorato.unibo.it/10091/1/PhD_Thesis_Pestarino_Luca.pdf.

Abstract:
Clinical and omics data are a promising field of application for machine learning techniques, even though these methods are not yet systematically adopted in healthcare institutions. Although artificial intelligence has proved successful in predicting pathologies and identifying their causes, the systematic adoption of these techniques still presents challenging issues due to the peculiarities of the analysed data. The aim of this thesis is to apply machine learning algorithms to both clinical and omics data sets in order to predict a patient's state of health and gain better insight into the possible causes of the analysed diseases. In doing so, many of the issues that arise when working with medical data are discussed, and possible solutions are proposed so that machine learning can provide feasible results and possibly become an effective and reliable support tool for healthcare systems.
42

Salviato, Elisa. "Computational methods for the discovery of molecular signatures from Omics Data." Doctoral thesis, Università degli studi di Padova, 2018. http://hdl.handle.net/11577/3421961.

Abstract:
Molecular biomarkers, derived from high-throughput technologies, are the foundations of "next-generation" precision medicine. Despite a decade of intense efforts and investments, the number of clinically valid biomarkers is modest. Indeed, the "big-data" nature of omics data poses new challenges that require improved strategies for data analysis and interpretation. In this thesis, two themes are proposed, both aimed at improving the statistical and computational methodology in the field of signature discovery. The first work aims at identifying serum miRNAs to be used as diagnostic biomarkers associated with ovarian cancer. In particular, a guideline and an ad-hoc microarray normalization strategy for the analysis of circulating miRNAs are proposed. In the second work, a new approach for the identification of functional molecular signatures based on Gaussian graphical models is presented. The model can explore the topological information contained in the biological pathways and highlight the potential sources of differential behaviors in two experimental conditions.
Molecular biomarkers, obtained through high-throughput sequencing platforms, form the basis of next-generation personalised medicine. Despite a decade of effort and investment, the number of clinically valid biomarkers remains modest. The "big-data" nature of omic data has in fact introduced new challenges that require improvements both in analysis tools and in tools for exploring results. This thesis proposes two central themes, both aimed at improving statistical and computational methodologies for the identification of molecular signatures. The first work centres on identifying serum miRNAs in patients with ovarian carcinoma that can be used diagnostically. In particular, it proposes guidelines for the analysis process and an ad-hoc normalisation for microarray data to be used in the context of circulating molecules. The second work presents a new approach based on Gaussian graphical models for identifying functional molecular signatures. The proposed method can explore the information contained in biological pathways and highlight the potential origin of differential behaviour between two experimental conditions.
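In a Gaussian graphical model, two variables are joined by an edge only if their partial correlation, given all other variables, is non-zero. A minimal three-variable sketch of that idea, assuming hypothetical correlation values and a hypothetical function name; it illustrates the general concept and is not the method proposed in the thesis:

```python
import math


def partial_correlation(r_xy, r_xz, r_yz):
    """Partial correlation of X and Y given Z, from pairwise Pearson correlations."""
    denom = math.sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))
    if denom == 0.0:
        raise ValueError("X or Y is perfectly correlated with Z")
    return (r_xy - r_xz * r_yz) / denom


# Hypothetical chain X - Z - Y: X and Y correlate only through Z,
# so r_xy equals r_xz * r_yz and the partial correlation vanishes.
r_xz, r_yz = 0.8, 0.5
r_xy = r_xz * r_yz
print(partial_correlation(r_xy, r_xz, r_yz))  # no direct X-Y edge

# With a stronger marginal X-Y correlation, dependence remains after
# conditioning on Z, which corresponds to a direct X-Y edge.
print(partial_correlation(0.9, r_xz, r_yz))
```

With more than three variables, the same quantities are usually read off the inverse covariance (precision) matrix rather than computed pair by pair.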
43

Boyd, Joseph. "BioBridge: Bringing Data Exploration to Biologists." Digital WPI, 2014. https://digitalcommons.wpi.edu/etd-theses/1186.

Abstract:
Since the completion of the Human Genome Project in 2003, biologists have become exceptionally good at producing data. Indeed, biological data has experienced a sustained exponential growth rate, putting effective and thorough analysis beyond the reach of many biologists. This thesis presents BioBridge, an interactive visualization tool developed to bring intuitive data exploration to biologists. BioBridge is designed to work on omics-style tabular data in general and thus has broad applicability. This work describes the design and evaluation of BioBridge's primary visualization, the Entity View, as well as the accompanying user interface. The Entity View arranges glyphs representing biological entities (e.g. genes, proteins, metabolites) along with related text mining results to provide biological context. Throughout development the goal has been to maximize accessibility and usability for biologists who are not computationally inclined. Evaluations were done with three informal case studies, one of a metabolome dataset and two of microarray datasets. BioBridge is a proof of concept that there is an underexploited niche in the data analysis ecosystem for tools that prioritize accessibility and usability. The use case studies, while anecdotal, are very encouraging. These studies indicate that BioBridge is well suited for the task of data exploration. With further development, BioBridge could become more flexible and usable as additional use case datasets are explored and more feedback is gathered.
44

Gadaleta, Emanuela. "A multidisciplinary computational approach to model cancer-omics data : organising, integrating and mining multiple sources of data." Thesis, Queen Mary, University of London, 2015. http://qmro.qmul.ac.uk/xmlui/handle/123456789/8141.

Abstract:
It is imperative that the cancer research community has the means with which to effectively locate, access, manage, analyse and interpret the plethora of data values being generated by novel technologies. This thesis addresses this unmet requirement by using pancreatic cancer and breast cancer as prototype malignancies to develop a generic integrative transcriptomic model. The analytical workflow was initially applied to publicly available pancreatic cancer data from multiple experimental types. The transcriptomic landscape of comparative groups was examined both in isolation and relative to each other. The main observations included (i) a clear separation of profiles based on experimental type, (ii) identification of three subgroups within normal tissue samples resected adjacent to pancreatic cancer, each showing disruptions to biofunctions previously associated with pancreatic cancer, and (iii) that cell lines and xenograft models are not representative of changes occurring during pancreatic tumourigenesis. Previous studies examined transcriptomic profiles across 306 biological and experimental samples, including breast cancer. The plethora of clinical and survival data readily available for breast cancer, compared to the paucity of publicly available pancreatic cancer data, allowed for expansion of the pipeline’s infrastructure to include functionalities for cross-platform and survival analysis. Application of this enhanced pipeline to multiple cohorts of triple negative and basal-like breast cancers identified differential risk groups within these breast cancer subtypes. All of the main experimental findings of this thesis are being integrated with the Pancreatic Expression Database and the Breast Cancer Campaign Tissue Bank bioinformatics portal, which enhances the sharing capacity of this information and ensures its exposure to a wider audience.
45

Eichner, Johannes [Verfasser]. "Machine learning and statistical methods for preclinical omics data analysis / Johannes Eichner." München : Verlag Dr. Hut, 2015. http://d-nb.info/1079768874/34.

46

Wrzodek, Clemens [Verfasser]. "Inference and integration of biochemical networks with multilayered omics data / Clemens Wrzodek." München : Verlag Dr. Hut, 2013. http://d-nb.info/1042307652/34.

47

Müller, Nikola [Verfasser], and Christian [Akademischer Betreuer] Böhm. "Finding correlations and independences in omics data / Nikola Müller. Betreuer: Christian Böhm." München : Universitätsbibliothek der Ludwig-Maximilians-Universität, 2012. http://d-nb.info/1023435594/34.

48

Barcelona, Cabeza Rosa. "Genomics tools in the cloud: the new frontier in omics data analysis." Doctoral thesis, Universitat Politècnica de Catalunya, 2021. http://hdl.handle.net/10803/672757.

Abstract:
Substantial technological advancements in next generation sequencing (NGS) have revolutionized the genomic field. Over the last years, the speed and throughput of NGS technologies have increased while their costs have decreased, allowing us to achieve base-by-base interrogation of the human genome in an efficient and affordable way. All these advances have led to a growing application of NGS technologies in clinical practice to identify genomic variations and their relationship with certain diseases. However, there is still a need to improve data accessibility, processing and interpretation, due to both the huge amount of data generated by these sequencing technologies and the large number of tools available to process it. In addition to the large number of algorithms for variant discovery, each type of variation and data requires the use of a specific algorithm. Therefore, a solid background in bioinformatics is required both to select the most suitable algorithm in each case and to execute it successfully. On that basis, the aim of this project is to facilitate the processing of sequencing data for variant identification and interpretation by non-bioinformaticians. This is achieved by creating high-performance workflows with a strong scientific basis that remain accessible and easy to use, as well as a simple and highly intuitive platform for data interpretation. An exhaustive bibliographic review was carried out, in which the best existing algorithms were selected to create automatic pipelines for the discovery of germline short variants (SNPs and indels) and germline structural variants (SVs), including both CNVs and chromosomal rearrangements, from modern human DNA. In addition to creating variant discovery pipelines, a pipeline has been implemented for in silico optimization of CNV detection from WES and TS data (isoCNV). This optimization pipeline has been shown to increase the sensitivity of CNV discovery using only NGS data.
Such increased sensitivity is especially important for diagnosis in clinical settings. Furthermore, a variant discovery workflow has been developed by integrating WES and RNA-seq data (varRED), which has been shown to increase the number of variants identified over those identified when using WES data alone. It is important to note that variant discovery matters not only for modern populations; the study of variation in ancient genomes is also essential to understand past human evolution. Thus, a germline short variant discovery pipeline for ancient WGS samples has been implemented. This workflow has been applied to a human mandible dated to between 16980 and 16510 calibrated years before the present. The ancient short variants discovered were reported without further interpretation due to the low sample coverage. Finally, GINO has been implemented to facilitate the interpretation of the variants identified by the workflows developed in the context of this thesis. GINO is an easy-to-use platform for the visualization and interpretation of germline variants, available under user license. This thesis has delivered the tools needed for high-performance identification of all types of germline variants, as well as a powerful platform to interpret the identified variants in a simple and fast way. Using this platform allows non-bioinformaticians to focus on interpreting results without having to worry about data processing, with the guarantee of scientifically sound results. Furthermore, it has laid the foundations for implementing a platform for comprehensive analysis and visualization of genomic data in the cloud in the near future.
Technological advances in next-generation sequencing (NGS) have revolutionized the field of genomics. The increase in speed and throughput of NGS technologies in recent years, together with the reduction in their cost, has made it possible to interrogate the human genome base by base efficiently and affordably. These advances have increased the use of NGS technologies in clinical practice to identify genomic variations and their relationship with particular diseases. However, the accessibility, processing and interpretation of the data still need to be improved, given the enormous amount of data generated and the large number of tools available to process them. In addition to the large number of algorithms available for variant discovery, each type of variation and of data requires a specific algorithm. Solid training in bioinformatics is therefore required both to select the most suitable algorithm and to run it correctly. On that basis, the aim of this project is to facilitate the processing of sequencing data for the identification and interpretation of variants by non-bioinformaticians. This is achieved by creating high-performance workflows with a solid scientific basis that remain accessible and easy to use, together with a simple and highly intuitive platform for data interpretation. An exhaustive literature review was carried out, in which the best algorithms were selected to build automated workflows for the discovery of germline short variants (SNPs and indels) and germline structural variants (SV), including both CNVs and chromosomal rearrangements, in modern human DNA.
In addition to the workflows for variant discovery, a workflow has been implemented for the in silico optimization of CNV detection from WES and TS data (isoCNV). This optimization has been shown to increase detection sensitivity using NGS data alone, which is especially important for clinical diagnosis. Furthermore, a workflow has been developed for variant discovery through the integration of WES and RNA-seq data (varRED), which has been shown to increase the number of variants detected over those identified when only WES data are used. It is important to note that variant identification matters not only for modern populations: the study of variation in ancient genomes is essential to understanding human evolution. For this reason, a workflow has been implemented for the identification of short variants from ancient WGS samples. This workflow has been applied to a human mandible dated to 16980-16510 BC. The ancient variants discovered there were reported without further interpretation because of the low sample coverage. Finally, GINO has been implemented to facilitate the interpretation of the variants identified by the workflows developed in this thesis. GINO is an easy-to-use platform for the visualization and interpretation of germline variants that requires a user license. This thesis has delivered the tools needed for high-performance identification of all types of germline variants, as well as a powerful platform for visualizing those variants simply and quickly. The platform allows non-bioinformaticians to focus on interpreting the results without having to worry about data processing, with the guarantee that the results are scientifically robust.
Furthermore, it has laid the foundations for implementing, in the near future, a platform for the complete analysis and visualization of genomic data.
Bioinformatics
49

Wolf, Beat. "Reducing the complexity of OMICS data analysis." Reviewer: Thomas Dandekar. Würzburg: Universität Würzburg, 2017. http://d-nb.info/1142114295/34.

50

Sala, Claudia. "Stochastic Modeling and Statistical Properties of Biological Systems Inferred from Omics Data." Doctoral thesis, Alma Mater Studiorum - Università di Bologna, 2017. http://amsdottorato.unibo.it/7810/1/sala_claudia_tesi.pdf.

Abstract:
In this thesis we aim to describe the dynamic processes that govern the evolution of two very different ecological systems. First, we consider the ensemble of bacteria that populate the intestine (gut microbiota, GM), which has been shown to have a great impact on human health and is associated with several metabolic and immunological diseases. Then, we deal with the set of protein domains contained in the genomes of living organisms. In general, the neutrality hypothesis, which Hubbell proposed as the Ockham’s razor for ecology, is a respectable approximation for both the GM and the protein domain ecosystems. In the first case, a birth-death model that takes demographic noise into account is able to describe the population dynamics if we relax the neutrality assumption and consider two non-interacting niches within which species equivalence holds. Interestingly, the biodiversity index derived from our modeling predicts healthy aging with better accuracy than common indices. When constructing the empirical Relative Species Abundance (RSA) distribution for GM, a fundamental step is the clustering of particular DNA sequences (16S rRNA). This critical task makes it possible to redefine the concept of species according to the phylogenetic tree. Here we introduce LOC-kNN, a parameter-free clustering algorithm recently developed by d’Errico et al., and adapt it for this purpose. LOC-kNN detects clusters as density peaks based on the dataset topography and, although it still has difficulty detecting small clusters, shows promising performance. Finally, for the protein domain ecosystem, environmental noise should also be taken into account. This noise has a multiplicative effect and, together with the Gompertzian death hypothesis, predicts a Poisson Log-Normal RSA. The model fits the protein domain RSA well and captures the dynamics of genome evolution, showing good agreement with the phylogenetic distances among bacteria.