Dissertations / Theses on the topic 'Bioinformatics predictions'

Author: Grafiati

Published: 4 June 2021

Last updated: 1 February 2022

Consult the top 50 dissertations / theses for your research on the topic 'Bioinformatics predictions.'

1

Åkesson, Julia. "Robust Community Predictions of Hubs in Gene Regulatory Networks." Thesis, Linköpings universitet, Bioinformatik, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-153200.

Abstract:

Many diseases, such as cardiovascular diseases, cancer and diabetes, originate from several malfunctions in biological systems. The human body is regulated by a wide range of biological systems, composed of biological entities interacting in complex networks, responsible for carrying out specific functions. Some parts of the networks, such as hubs serving as master regulators, are more important for maintaining a function. To find the cause of diseases, where hubs are possible disease regulators, it is critical to know the structure of these biological systems. Such structures can be reverse engineered from high-throughput data with measured levels of biological entities. However, the complexity of biological systems makes inferring their structure a complicated task, demanding the use of computational methods, called network inference methods. Today, many network inference methods have been developed that predict the interactions of biological networks, with varying degrees of success. In the DREAM5 challenge, 35 network inference methods were evaluated on how well interactions in gene regulatory networks (GRNs) were predicted. Herein, in contrast to the DREAM5 challenge, we have evaluated network inference methods’ ability to predict hubs in GRNs. In accordance with the DREAM5 challenge, different methods performed the best on different data sets. Moreover, we discovered that network inference methods were not able to identify hubs from groups of similarly expressed genes. Also, we noticed that hubs in GRNs had a distinct expression in the data, leading to the development of a new method (the PCA method) for the prediction of hubs. Furthermore, the DREAM5 challenge showed that community predictions, combining the predictions from many network inference methods, resulted in more robust predictions of interactions. Herein, the community approach was applied to predicting hubs, with the conclusion that community predictions are the more robust approach. However, we also concluded that it was enough to combine 6-7 network inference methods to achieve robust predictions of hubs.
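
As a rough illustration of the community idea described in this abstract, the sketch below combines hub scores from several inference methods by averaging each gene's rank across methods. Average-rank aggregation is an assumption made here for illustration; the abstract does not state the combination rule used in the thesis, and the function name `community_hub_ranking` and the toy scores are hypothetical.

```python
from statistics import mean


def community_hub_ranking(method_scores):
    """Rank genes as hub candidates by their average rank across methods.

    method_scores: list of dicts, one per inference method, mapping a gene
    name to that method's hub score (higher = more hub-like). Only genes
    scored by every method are ranked. This is an illustrative aggregation
    rule, not the procedure from the thesis.
    """
    # Keep the genes that every method has scored.
    genes = sorted(set.intersection(*(set(s) for s in method_scores)))

    # Convert each method's scores to ranks (1 = most hub-like).
    # Ties within a method are broken alphabetically for simplicity.
    rank_tables = []
    for scores in method_scores:
        ordered = sorted(genes, key=lambda g: (-scores[g], g))
        rank_tables.append({g: i + 1 for i, g in enumerate(ordered)})

    # Average the ranks and sort genes by that average (lowest first).
    avg_rank = {g: mean(table[g] for table in rank_tables) for g in genes}
    return sorted(genes, key=lambda g: (avg_rank[g], g))


if __name__ == "__main__":
    # Toy hub scores from three hypothetical inference methods.
    toy = [
        {"A": 0.9, "B": 0.5, "C": 0.1},
        {"A": 0.2, "B": 0.8, "C": 0.1},
        {"A": 0.7, "B": 0.6, "C": 0.3},
    ]
    print(community_hub_ranking(toy))
```

Working on ranks rather than raw scores keeps methods with different score scales comparable, which is one common reason to aggregate this way.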

2

Bernsel, Andreas. "Sequence-based predictions of membrane-protein topology, homology and insertion." Doctoral thesis, Stockholms universitet, Institutionen för biokemi och biofysik, 2008. http://urn.kb.se/resolve?urn=urn:nbn:se:su:diva-8126.

Abstract:

Membrane proteins comprise around 20-30% of a typical proteome and play crucial roles in a wide variety of biochemical pathways. Apart from their general biological significance, membrane proteins are of particular interest to the pharmaceutical industry, being targets for more than half of all available drugs. This thesis focuses on prediction methods for membrane proteins that ultimately rely on their amino acid sequence only. By identifying soluble protein domains in membrane protein sequences, we were able to constrain and improve prediction of membrane protein topology, i.e. what parts of the sequence span the membrane and what parts are located on the cytoplasmic and extra-cytoplasmic sides. Using predicted topology as input to a profile-profile based alignment protocol, we managed to increase sensitivity to detect distant membrane protein homologs. Finally, experimental measurements of the level of membrane integration of systematically designed transmembrane helices in vitro were used to derive a scale of position-specific contributions to helix insertion efficiency for all 20 naturally occurring amino acids. Notably, position within the helix was found to be an important factor for the contribution to helix insertion efficiency for polar and charged amino acids, reflecting the highly anisotropic environment of the membrane. Using the scale to predict natural transmembrane helices in protein sequences revealed that, whereas helices in single-spanning proteins are typically hydrophobic enough to insert by themselves, a large part of the helices in multi-spanning proteins seem to require stabilizing helix-helix interactions for proper membrane integration. Implementing the scale to predict full transmembrane topologies yielded results comparable to the best statistics-based topology prediction methods.

3

Yang, Sen. "Disease, Drug, and Target Association Predictions by Integrating Multiple Heterogeneous Sources." Case Western Reserve University School of Graduate Studies / OhioLINK, 2012. http://rave.ohiolink.edu/etdc/view?acc_num=case1342194249.

4

Wagih, Omar. "Elucidating the mechanistic impact of single nucleotide variants in model organisms." Thesis, University of Cambridge, 2018. https://www.repository.cam.ac.uk/handle/1810/271713.

Abstract:

Understanding how genetic variation propagates to differences in phenotypes in individuals is an ongoing challenge in genetics. Genome-wide association studies have allowed for the identification of many trait-associated genomic loci. However, they are limited in their ability to explain the altered cellular mechanism. Genetic variation can drive disease by altering a range of mechanisms, including signalling networks, transcription factor (TF) binding, and protein folding. Understanding the impact of variants on such processes has key implications in therapeutics, drug development, and more. This thesis aims to utilise computational predictors to shed light on how cellular mechanisms are altered in the context of genetic variation and better understand how they drive both molecular and organism-level phenotypes. Many binding events in the cell are mediated by short stretches of sequence motifs. The ability to discover these underlying rules of binding could greatly aid our understanding of variant impact. Kinase–substrate phosphorylation is one of the most prominent post-translational modifications (PTMs) mediated by such motifs. We first describe a computational method which utilises interaction and phosphorylation data to predict sequence preferences of kinases. Our method was applied to 57% of human kinases, capturing known, well-characterised kinase specificities as well as novel ones. We experimentally validate four understudied kinases to show that predicted models closely resemble true specificities. We further demonstrate that this method can be applied to different organisms and can be used for other phospho-recognition domains. The described approach allows for an extended repertoire of sequence specificities to be generated, particularly in organisms for which little data is available. TF-DNA binding is another mechanism driven by sequence motifs, which is key for the tight regulation of gene expression and can be greatly altered by genetic variation. We have comprehensively benchmarked current methods used to predict non-coding variant effects on TF-DNA binding by employing over 20,000 compiled allele-specific ChIP-seq variants across 94 TFs. We show that machine learning-based approaches significantly outperform more rudimentary methods such as the position weight matrix. We further note that models for many TFs with distinct binding specificities were unable to accurately assess the impact of variants. For these TFs, we explore alternative mechanisms underlying TF binding, such as methylation, co-operative binding, and DNA shape, that drive poor performance. Our results demonstrate the complexity of predicting non-coding variant effects and the importance of incorporating alternative mechanisms into models. Finally, we describe a comprehensive effort to compile and benchmark state-of-the-art sequence and structure-based predictors of mutational consequences and predict the effect of coding and non-coding variants in the reference genomes of human, yeast, and E. coli. Predicted mechanisms include the impact on protein stability, interaction interfaces, and PTMs. These variant effects are provided through mutfunc, a fast and intuitive web tool by which users can interactively explore pre-computed mechanistic variant impact predictions. We validate computed predictions by analysing known pathogenic disease variants and provide mechanistic hypotheses for causal variants of unknown function. We further use our predictions to devise gene-level functionality scores in human and yeast individuals, which we then use to perform gene-phenotype association analyses and uncover novel gene-phenotype associations.
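
The position weight matrix named in this abstract as the rudimentary baseline can be sketched as follows. This is the generic textbook log-odds PWM, assuming a uniform background and a pseudocount; it is not the benchmark code from the thesis, and the function names and toy binding sites are hypothetical.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM from equal-length example binding sites.

    Each column maps a base to log2(frequency / background), where the
    frequency is smoothed with a pseudocount so unseen bases get a finite
    (negative) score. A uniform background is an assumption of this sketch.
    """
    length = len(sites[0])
    total = len(sites) + len(BASES) * pseudocount
    pwm = []
    for pos in range(length):
        column = [site[pos] for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in BASES
        })
    return pwm


def score(pwm, seq):
    """Sum the per-position log-odds scores of seq under the PWM."""
    return sum(column[base] for column, base in zip(pwm, seq))


def variant_effect(pwm, ref_seq, alt_seq):
    """Score difference (alt minus ref); negative suggests weaker binding."""
    return score(pwm, alt_seq) - score(pwm, ref_seq)


if __name__ == "__main__":
    # Toy motif instances; a T>G change at the third position is scored.
    pwm = build_pwm(["TATA", "TATA", "TATA", "TACA"])
    print(variant_effect(pwm, "TATA", "TAGA"))
```

Because each position is scored independently, a model of this kind cannot capture effects such as co-operative binding or DNA shape, which the abstract names as mechanisms behind poor predictions.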

5

Rezwan, Faisal Ibne. "Improving computational predictions of Cis-regulatory binding sites in genomic data." Thesis, University of Hertfordshire, 2011. http://hdl.handle.net/2299/7133.

Abstract:

Cis-regulatory elements are the short regions of DNA to which specific regulatory proteins bind, and these interactions subsequently influence the level of transcription for associated genes by inhibiting or enhancing the transcription process. It is known that much of the genetic change underlying morphological evolution takes place in these regions, rather than in the coding regions of genes. Identifying these sites in a genome is a non-trivial problem. Experimental (wet-lab) methods for finding binding sites exist, but all have some limitations regarding their applicability, accuracy, availability or cost. On the other hand, computational methods for predicting the position of binding sites are less expensive and faster. Unfortunately, however, these algorithms perform rather poorly, some missing most binding sites and others over-predicting their presence. The aim of this thesis is to develop and improve computational approaches for the prediction of transcription factor binding sites (TFBSs) by integrating the results of computational algorithms and other sources of complementary biological evidence. Previous related work involved the use of machine learning algorithms for integrating predictions of TFBSs, with particular emphasis on the use of the Support Vector Machine (SVM). This thesis has built upon, extended and considerably improved this earlier work. Data from two organisms were used here. First, the relatively simple genome of yeast was used. In yeast, the binding sites are fairly well characterised and they are normally located near the genes that they regulate. The techniques used on the yeast genome were also tested on the more complex genome of the mouse. It is known that the regulatory mechanisms of the eukaryotic species, mouse, are considerably more complex, and it was therefore interesting to investigate the techniques described here on such an organism. The initial results were, however, not particularly encouraging: although a small improvement on the base algorithms could be obtained, the predictions were still of low quality. This was the case for both the yeast and mouse genomes. However, when the negatively labeled vectors in the training set were changed, a substantial improvement in performance was observed. The first change was to choose regions in the mouse genome a long way (distal) from a gene, over 4000 base pairs away, as regions not containing binding sites. This produced a major improvement in performance. The second change was simply to use randomised training vectors, which contained no meaningful biological information, as the negative class. This gave some improvement for the yeast genome, but had a very substantial benefit for the mouse data, considerably improving on the aforementioned distal negative training data. In fact, the resulting classifier was finding over 80% of the binding sites in the test set, and moreover 80% of the predictions were correct. The final experiment used an updated version of the yeast dataset, using more state-of-the-art algorithms and more recent TFBS annotation data. Here it was found that using randomised or distal negative examples once again gave very good results, comparable to the results obtained on the mouse genome. Another source of negative data was tried for this yeast data, namely using vectors taken from intronic regions. Interestingly, this gave the best results.

6

Ishtiaq, Khandker S. "Robust Modeling and Predictions of Greenhouse Gas Fluxes from Forest and Wetland Ecosystems." FIU Digital Commons, 2015. http://digitalcommons.fiu.edu/etd/2287.

Abstract:

The land-atmospheric exchanges of carbon dioxide (CO2) and methane (CH4) are major drivers of global warming and climatic changes. The greenhouse gas (GHG) fluxes indicate the dynamics and potential storage of carbon in terrestrial and wetland ecosystems. Appropriate modeling and prediction tools can provide a quantitative understanding and valuable insights into the ecosystem carbon dynamics, while aiding the development of engineering and management strategies to limit emissions of GHGs and enhance carbon sequestration. This dissertation focuses on the development of data-analytics tools and engineering models by employing a range of empirical and semi-mechanistic approaches to robustly predict ecosystem GHG fluxes at variable scales. Scaling-based empirical models were developed by using an extended stochastic harmonic analysis algorithm to achieve spatiotemporally robust predictions of the diurnal cycles of net ecosystem exchange (NEE). A single set of model parameters representing different days/sites successfully estimated the diurnal NEE cycles for various ecosystems. A systematic data-analytics framework was then developed to determine the mechanistic, relative linkages of various climatic and environmental drivers with the GHG fluxes. The analytics, involving big data for diverse ecosystems of the AmeriFLUX network, revealed robust latent patterns: a strong control of radiation-energy variables, a moderate control of temperature-hydrology variables, and a relatively weak control of aerodynamic variables on the terrestrial CO2 fluxes. The data-analytics framework was then employed to determine the relative controls of different climatic, biogeochemical and ecological drivers on CO2 and CH4 fluxes from coastal wetlands. The knowledge was leveraged to develop nonlinear, predictive models of GHG fluxes using a small set of environmental variables. 
The models were presented in an Excel spreadsheet as an ecological engineering tool to estimate and predict the net ecosystem carbon balance of the wetland ecosystems. The research also investigated the emergent biogeochemical-ecological similitude and scaling laws of wetland GHG fluxes by employing dimensional analysis from fluid mechanics. Two environmental regimes were found to govern the wetland GHG fluxes. The discovered similitude and scaling laws can guide the development of data-based mechanistic models to robustly predict wetland GHG fluxes under a changing climate and environment.

7

Papadimitriou, Sofia. "Towards multivariant pathogenicity predictions: Using machine-learning to directly predict and explore disease-causing oligogenic variant combinations." Doctoral thesis, Universite Libre de Bruxelles, 2020. http://hdl.handle.net/2013/ULB-DIPOT:oai:dipot.ulb.ac.be:2013/312576.

Abstract:

The emergence of statistical and predictive methods able to analyse genomic data has revolutionised the field of medical genetics, allowing the identification of disease-causing gene variants (i.e. mutations) for several human genetic diseases. Although these approaches have greatly improved our understanding of Mendelian «one gene – one phenotype» genetic models, studying diseases related to more intricate models that involve causative variants in several genes (i.e. oligogenic diseases) still remains a challenge, either due to the lack of sufficient methodologies and disease-specific cohorts to study or the phenotypic complexity associated with such diseases. This situation makes it difficult not only to understand the genetic mechanisms of the disease, but also to offer proper counseling and support to the patient. Until recently, no specialized predictive methods existed to directly predict causative variant combinations for oligogenic diseases. However, with the advent of data on variant combinations in gene pairs (i.e. bilocus variant combinations) leading to disease, collected at the Digenic Diseases Database (DIDA), we hypothesized that the transition from single to variant combination pathogenicity predictors is now possible.

To investigate this hypothesis, we organised our research along two main routes. First, we developed an interpretable variant combination pathogenicity predictor, called VarCoPP, for gene pairs. For this goal, we trained multiple Random Forest algorithms on pathogenic bilocus variant combinations from DIDA against neutral data from the 1000 Genomes Project and investigated the contribution of the incorporated variant, gene and gene pair features to the prediction outcome. In the second part, we explored the usefulness of different gene pair burden scores based on this novel predictive method in discovering oligogenic signatures in neurodevelopmental diseases, which involve a spectrum of monogenic to polygenic cases. We performed a preliminary analysis on the Deciphering Developmental Diseases (DDD) project, containing exome data of 4195 families, and assessed the capability of our scores in supporting already diagnosed monogenic cases, discovering significant pairs compared to control cases and linking patients in communities based on the genetic burden they share, using the Leiden community detection algorithm.

The performance of VarCoPP shows that it is possible to predict disease-causing bilocus variant combinations with good accuracy both during cross-validation and when testing on new cases. We also show its relevance for disease-related gene panels, and enhance its clinical applicability by defining confidence zones that guarantee with 95% or 99% probability that a prediction is indeed a true positive, guiding clinical researchers towards the most relevant results. This method and additional biological annotations are incorporated in an online platform called ORVAL that allows the prediction and exploration of candidate disease-causing oligogenic variant combinations with predicted gene networks, based on patient variant data. Our preliminary analysis on the DDD cohort shows that, although all bilocus burden scores show advantages, disadvantages and certain types of biases, taking the maximum pathogenicity score present inside a gene pair seems to provide, at the moment, the most unbiased results. We also show that our predictive methods enable us to detect patient communities inside DDD, based exclusively on the shared pathogenic bilocus burden between patients, with more than half of these communities containing enriched phenotypic and molecular pathway information. Our predictive method is also able to bring to the surface genes not officially known to be involved in disease but nevertheless biologically relevant, as well as a few examples of potential oligogenicity inside the network, paving the way for further exploration of oligogenic signatures for neurodevelopmental diseases.

Doctorat en Sciences (Doctorate in Sciences)

8

Hennerdal, Aron, and Arne Elofsson. "Rapid membrane protein topology prediction." Stockholms universitet, Institutionen för biokemi och biofysik, 2011. http://urn.kb.se/resolve?urn=urn:nbn:se:su:diva-61921.

Abstract:

State-of-the-art methods for predicting the topology of α-helical membrane proteins are based on the use of time-consuming multiple sequence alignments obtained from PSI-BLAST or other sources. Here, we examine whether it is possible to use the consensus of topology prediction methods that are based on single sequences to obtain an accuracy similar to that of the more accurate multiple sequence-based methods. We show that TOPCONS-single performs better than any of the other topology prediction methods tested here, but ~6% worse than the best method that utilizes multiple sequence alignments. AVAILABILITY AND IMPLEMENTATION: TOPCONS-single is available as a web server from http://single.topcons.net/ and is also included for local installation from the web site. In addition, consensus-based topology predictions for the entire International Protein Index (IPI) are available from the web server and will be updated at regular intervals.
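
To make the notion of a consensus over several single-sequence predictors concrete, the sketch below takes a per-residue majority vote over topology strings. This is a deliberately simplified stand-in and should not be read as how TOPCONS-single derives its consensus; the label alphabet (`i` inside, `M` membrane, `o` outside), the function name and the toy predictions are assumptions for illustration.

```python
from collections import Counter


def consensus_topology(predictions):
    """Per-residue majority vote over equal-length topology strings.

    predictions: list of strings, one per prediction method, using the
    assumed labels 'i' (inside), 'M' (membrane) and 'o' (outside).
    On a tie, the label from the earliest-listed method among the tied
    labels wins. The result is not guaranteed to be a valid topology.
    """
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("all predictions must have the same length")

    consensus = []
    for labels in zip(*predictions):
        counts = Counter(labels)
        top = max(counts.values())
        # Walk the labels in method order so ties resolve deterministically.
        winner = next(label for label in labels if counts[label] == top)
        consensus.append(winner)
    return "".join(consensus)


if __name__ == "__main__":
    # Three hypothetical single-sequence predictions for a 7-residue toy protein.
    print(consensus_topology(["iiMMMoo", "iiMMMMo", "iMMMMoo"]))
```

A plain vote can produce segments that violate topological grammar (for example, a membrane region too short to span the bilayer), so a practical consensus method would need an additional step to enforce a valid topology.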

9

Habtemariam, Mesay. "Bioinformatics Approach to Probe Protein-Protein Interactions: Understanding the Role of Interfacial Solvent in the Binding Sites of Protein-Protein Complexes; Network Based Predictions and Analysis of Human Proteins that Play Critical Roles in HIV Pathogenesis." VCU Scholars Compass, 2013. http://scholarscompass.vcu.edu/etd/2997.

Abstract:

The thesis work contains two projects under the same umbrella. The first project provides a detailed analysis of the behavior of interfacial water molecules at protein-protein complexes, in this case focusing on homodimeric complexes, and investigates their effect with respect to different residue types. For that reason, a homodimeric data set, which includes high-resolution (≤ 2.30 Å) X-ray crystal structures of 252 (140 biological and 112 non-biological) protein complexes, was chosen to explore fundamental differences between interfaces that Nature has “engineered” and interfaces found under man-made conditions. The data set comprised 5391 water molecules located at most 4 Å from both interfacing proteins. Our analysis applied a suite of modeling tools based on HINT, a program for hydropathic analysis developed in our laboratory. HINT is based on the experimental measurement of the hydrophobic effect. The second project is designed to explore various means of suppressing the expression of human genes that play a critical role in HIV pathogenesis. To achieve this aim, an Affymetrix Human HG Focus Target Array data set, which measures expression levels in the PBMCs of HIV-seronegative and HIV-seropositive individuals, was analyzed with Pathway Studio 9.0 software. This work gives insight into the important mechanisms of human protein interactions in HIV-seropositive individuals and their implications. We thereby identified the types of microRNAs that suppress the human genes that play a major role in HIV replication in a cell.

10

Hvidsten, Torgeir R. "Predicting Function of Genes and Proteins from Sequence, Structure and Expression Data." Doctoral thesis, Uppsala : Acta Universitatis Upsaliensis : Univ.-bibl. [distributor], 2004. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-4490.

11

Wallner, Björn. "Protein Structure Prediction : Model Building and Quality Assessment." Doctoral thesis, Stockholm University, Department of Biochemistry and Biophysics, 2005. http://urn.kb.se/resolve?urn=urn:nbn:se:su:diva-649.

Abstract:

Proteins play a crucial role in all biological processes. The wide range of protein functions is made possible through the many different conformations that the protein chain can adopt. The structure of a protein is extremely important for its function, but determining the structure of a protein experimentally is both difficult and time consuming. In fact, with the current methods it is not possible to study all the billions of proteins in the world by experiments. Hence, for the vast majority of proteins the only way to get structural information is through the use of a method that predicts the structure of a protein based on the amino acid sequence.

This thesis focuses on improving the current protein structure prediction methods by combining different prediction approaches together with machine-learning techniques. This work has resulted in some of the best automatic servers in the world: Pcons and Pmodeller. As a part of the improvement of our automatic servers, I have also developed one of the best methods for predicting the quality of a protein model: ProQ. In addition, I have also developed methods to predict the local quality of a protein, based on the structure (ProQres) and based on evolutionary information (ProQprof). Finally, I have also performed the first large-scale benchmark of publicly available homology modeling programs.

12

Viklund, Håkan. "Formalizing life : Towards an improved understanding of the sequence-structure relationship in alpha-helical transmembrane proteins." Doctoral thesis, Stockholm University, Department of Biochemistry and Biophysics, 2007. http://urn.kb.se/resolve?urn=urn:nbn:se:su:diva-7144.

Abstract:

Genes coding for alpha-helical transmembrane proteins constitute roughly 25% of the total number of genes in a typical organism. As these proteins are vital parts of many biological processes, an improved understanding of them is important for achieving a better understanding of the mechanisms that constitute life.

All proteins consist of an amino acid sequence that folds into a three-dimensional structure in order to perform its biological function. The work presented in this thesis is directed towards improving the understanding of the relationship between sequence and structure for alpha-helical transmembrane proteins. Specifically, five original methods for predicting the topology of alpha-helical transmembrane proteins have been developed: PRO-TMHMM, PRODIV-TMHMM, OCTOPUS, Toppred III and SCAMPI.

A general conclusion from these studies is that approaches that use multiple sequence information achieve the best prediction accuracy. Further, the properties of reentrant regions have been studied, both with respect to sequence and structure. One result of this study is an improved definition of the topological grammar of transmembrane proteins, which is used in OCTOPUS and shown to further improve topology prediction. Finally, Z-coordinates, an alternative system for representation of topological information for transmembrane proteins that is based on distance to the membrane center, has been introduced, and a method for predicting Z-coordinates from amino acid sequence, Z-PRED, has been developed.

13

Lindefelt, Lisa. "Predicting gene expression using artificial neural networks." Thesis, University of Skövde, Department of Computer Science, 2002. http://urn.kb.se/resolve?urn=urn:nbn:se:his:diva-707.

Abstract:

Today one of the greatest aims within the area of bioinformatics is to gain a complete understanding of the functionality of genes and the systems behind gene regulation. Regulatory relationships among genes seem to be of a complex nature since transcriptional control is the result of complex networks interpreting a variety of inputs. It is therefore essential to develop analytical tools detecting complex genetic relationships.

This project examines the possibility of the data mining technique artificial neural network (ANN) detecting regulatory relationships between genes. As an initial step for finding regulatory relationships with the help of ANN the goal of this project is to train an ANN to predict the expression of an individual gene. The genes predicted are the nuclear receptor PPAR-g and the insulin receptor. Predictions of the two target genes respectively were made using different datasets of gene expression data as input for the ANN. The results of the predictions of PPAR-g indicate that it is not possible to predict the expression of PPAR-g under the circumstances for this experiment. The results of the predictions of the insulin receptor indicate that it is not possible to discard using ANN for predicting the gene expression of an individual gene.

14

Freyhult, Eva. "A Study in RNA Bioinformatics : Identification, Prediction and Analysis." Doctoral thesis, Uppsala : Acta Universitatis Upsaliensis, 2007. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-8305.

15

Hillerton, Thomas. "Predicting adverse drug reactions in cancer treatment using a neural network based approach." Thesis, Högskolan i Skövde, Institutionen för biovetenskap, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:his:diva-15659.

16

Telele, Nigus Fikrie. "Predicting interspecies transmission and pandemic risks of coronaviruses." Thesis, Högskolan i Skövde, Institutionen för biovetenskap, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:his:diva-19495.

17

Michel, Mirco. "From Sequence to Structure : Using predicted residue contacts to facilitate template-free protein structure prediction." Doctoral thesis, Stockholms universitet, Institutionen för biokemi och biofysik, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:su:diva-141946.

Abstract:

Despite the fundamental role of experimental protein structure determination, computational methods are essential for bridging the ever-growing gap between available protein sequence and structure data. Common structure prediction methods rely on experimental data, which is not available for about half of the known protein families. Recent advancements in amino acid contact prediction have revolutionized the field of protein structure prediction. Contacts can be used to guide template-free structure predictions that do not rely on experimentally solved structures of homologous proteins. Such methods are now able to produce accurate models for a wide range of protein families. We developed PconsC2, an approach that improved existing contact prediction methods by recognizing intra-molecular contact patterns and reducing noise. An inherent problem of contact prediction based on maximum entropy models is that large alignments with over 1000 effective sequences are needed to infer contacts accurately. These are, however, not available for more than 80% of all protein families that do not have a representative structure in PDB. With PconsC3, we could extend the applicability of contact prediction to families as small as 100 effective sequences by combining global inference methods with machine learning based on local pairwise measures. By introducing PconsFold, a pipeline for contact-based structure prediction, we could show that improvements in contact prediction accuracy translate to more accurate models. Finally, we applied a similar technique to Pfam, a comprehensive database of known protein families. In addition to using a faster folding protocol, we employed model quality assessment methods, crucial for estimating the confidence in the accuracy of predicted models. We propose models to be accurate for 558 families that do not have a representative known structure. Out of those, over 75% have not been reported before.

At the time of the doctoral defense, the following papers were unpublished and had a status as follows: Paper 2: Submitted. Paper 4: In press.
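The abstract's thresholds (over 1000 versus about 100 effective sequences) refer to a redundancy-corrected alignment size. The sketch below shows one common way such a count can be computed; it is an illustration under stated assumptions, not the thesis's own procedure. The function name `effective_sequences`, the 80% identity cutoff, and the treatment of gaps as ordinary characters are all assumptions made for this example.

```python
def effective_sequences(alignment, identity_cutoff=0.8):
    """Redundancy-corrected sequence count for an aligned set of sequences.

    Each sequence gets weight 1 / (number of sequences, itself included,
    that share at least `identity_cutoff` fractional identity with it).
    The cutoff value is an assumption chosen for illustration.
    """
    if not alignment or not alignment[0]:
        raise ValueError("alignment must contain non-empty sequences")
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    n = len(alignment)
    neighbours = [1] * n  # every sequence counts as its own neighbour
    for i in range(n):
        for j in range(i + 1, n):
            matches = sum(a == b for a, b in zip(alignment[i], alignment[j]))
            if matches / length >= identity_cutoff:
                neighbours[i] += 1
                neighbours[j] += 1
    return sum(1.0 / k for k in neighbours)


if __name__ == "__main__":
    # Two identical sequences share one unit of weight; the third is distinct.
    toy_alignment = ["ACDE", "ACDE", "WWWW"]
    print(effective_sequences(toy_alignment))
```

The pairwise loop is quadratic in the number of sequences, which is fine for a toy example but would need a faster scheme for large alignments.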

18

Dean, M. K. "Bioinformatics approach to predicting protein interactions." Thesis, University of Essex, 2003. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.275862.

19

Janvid, Vincent. "Building a genomic variant based prediction model for lung cancer toxicity." Thesis, KTH, Tillämpad fysik, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-297411.

Abstract:

Since the completion of the Human Genome Project in 2003, the evident complexity of our genome and its regulation has only grown. The idea that having sequenced the human genome would solve this mystery was quickly discarded. With the decreasing costs of DNA sequencing, a plethora of new methods have evolved to further understand the role of the non-coding regions of our genome, which make up 98% of its length. Genetic variations in these regions are therefore abundant in the human population, but their effects are hard to characterize. Many non-coding variants have been linked to complex diseases such as cancer predisposition. This thesis aims to investigate the potential effects of non-coding variants on drug toxicity, that is, how severe the adverse effects of a drug are for the treated patients. More specifically, it studies the effects of two cancer drugs, Gemcitabine and Carboplatin, on a set of 96 patients with lung cancer. To do this, we use spatial data acquired by the promoter-targeting method HiCap as well as expression data obtained from blood cell lines. Using the variants obtained through whole genome sequencing of the patients, a supervised learning approach was attempted to predict the final toxicity experienced by the patients. The large number of variants present among the comparably few patients resulted in poor accuracy. The conclusion was drawn that the resolution of HiCap is too low compared to the density of variants in the non-coding regions. Additional data, such as transcription factor ChIP-seq data and transcription factor motifs, are needed to locate potentially contributing variants within the interactions.

20

Tsirigos, Konstantinos. "Bioinformatics Methods for Topology Prediction of Membrane Proteins." Doctoral thesis, Stockholms universitet, Institutionen för biokemi och biofysik, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:su:diva-138479.

Abstract:

Membrane proteins are key elements of the cell since they are associated with a variety of very important biological functions crucial to its survival. They are implicated in cellular recognition and adhesion, act as molecular receptors, transport substrates through membranes and exhibit specific enzymatic activity. This thesis is focused on integral membrane proteins, most of which contain transmembrane segments that form an alpha helix and are composed of mainly hydrophobic residues, spanning the lipid bilayer. A more specialized and less well-studied case is that of integral membrane proteins found in the outer membrane of Gram-negative bacteria and (presumably) in the outer envelope of mitochondria and chloroplasts, whose transmembrane segments are formed by amphipathic beta strands that create a closed barrel (beta-barrels). The importance of transmembrane proteins, as well as the inherent difficulties in crystallizing and obtaining three-dimensional structures of these, dictates the need for developing computational algorithms and tools that will allow for a reliable and fast prediction of their structural and functional features. In order to elucidate their function, we must acquire knowledge about their structure and topology with relation to the membrane. Therefore, a large number of computational methods have been developed in order to predict the transmembrane segments and the overall topology of transmembrane proteins. In this thesis, I initially describe a large-scale benchmark of many topology prediction tools in order to devise a strategy that will allow for better detection of alpha-helical membrane proteins in a proteome. Then, I describe the construction of improved machine-learning algorithms and computer software for accurate topology prediction of transmembrane proteins and discrimination of such proteins from non-transmembrane proteins. Finally, I introduce a fast way to obtain a position-specific scoring matrix, which is essential for modern topology prediction methods.

At the time of the doctoral defense, the following paper was unpublished and had a status as follows: Paper 3: Manuscript.

21

Johansson-Åkhe, Isak. "PePIP : a Pipeline for Peptide-Protein Interaction-site Prediction." Thesis, Linköpings universitet, Institutionen för fysik, kemi och biologi, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-138411.

Abstract:

Protein-peptide interactions play a major role in several biological processes, such as cell proliferation and cancer cell life-cycles. Accurate computational methods for predicting protein-protein interactions exist, but few of these methods can be extended to predicting interactions between a protein and a particularly small or intrinsically disordered peptide. In this thesis, PePIP is presented. PePIP is a pipeline for predicting where on a given protein a given peptide will most probably bind. The pipeline utilizes structural alignment to search the Protein Data Bank for possible templates for the interaction to be predicted, using the larger chain as the query. The possible templates are then evaluated as to whether they can represent the query protein and peptide using a Random Forest classifier, and the best templates are found by using the evaluation from the Random Forest in combination with hierarchical clustering. These final templates are then combined to give a prediction of the binding site. PePIP is shown to be highly accurate when tested on a set of 502 experimentally determined protein-peptide structures, suggesting a binding site on the correct part of the protein surface roughly 4 out of 5 times.

22

Schröder, Michael, Rainer Winnenburg, and Conrad Plake. "Improved mutation tagging with gene identifiers applied to membrane protein stability prediction." Saechsische Landesbibliothek- Staats- und Universitaetsbibliothek Dresden, 2015. http://nbn-resolving.de/urn:nbn:de:bsz:14-qucosa-177379.

Abstract:

Background: The automated retrieval and integration of information about protein point mutations in combination with structure, domain and interaction data from literature and databases promises to be a valuable approach to study structure-function relationships in biomedical data sets. Results: We developed a rule- and regular expression-based protein point mutation retrieval pipeline for PubMed abstracts, which achieves an F-measure of 87% for the mutation retrieval task on a benchmark dataset. In order to link mutations to their proteins, we utilize a named entity recognition algorithm for the identification of gene names co-occurring in the abstract, and establish links based on sequence checks. Conversely, we could show that gene recognition improved from 77% to 91% F-measure when considering mutation information given in the text. To demonstrate practical relevance, we utilize mutation information from text to evaluate a novel solvation energy based model for the prediction of stabilizing regions in membrane proteins. For five G protein-coupled receptors we identified 35 relevant single mutations and associated phenotypes, of which none had been annotated in the UniProt or PDB database. In 71% of cases, the reported phenotypes were in agreement with the model predictions, supporting a relation between mutations and stability issues in membrane proteins. Conclusion: We present a reliable approach for the retrieval of protein mutations from PubMed abstracts for any set of genes or proteins of interest. We further demonstrate how amino acid substitution information from text can be utilized for protein structure stability studies on the basis of a novel energy model.
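To make the regular-expression idea and the F-measure concrete, here is a minimal sketch. It is not the authors' pipeline: the pattern below only recognizes the one-letter notation (for example A123T), and the names `MUTATION_RE`, `find_point_mutations`, and `f_measure`, as well as the example sentence, are hypothetical and chosen for illustration.

```python
import re

# The 20 standard amino acids in one-letter code.
ONE_LETTER = "ACDEFGHIKLMNPQRSTVWY"

# Wild-type residue, 1-based position, mutant residue (e.g. "A123T").
# Assumption: real pipelines also handle three-letter codes and free-text
# phrasings, which this sketch deliberately ignores.
MUTATION_RE = re.compile(
    rf"\b([{ONE_LETTER}])([1-9][0-9]*)([{ONE_LETTER}])\b"
)


def find_point_mutations(text):
    """Return (wild_type, position, mutant) tuples found in the text."""
    return [
        (m.group(1), int(m.group(2)), m.group(3))
        for m in MUTATION_RE.finditer(text)
    ]


def f_measure(predicted, gold):
    """Harmonic mean of precision and recall over two collections of items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    sentence = "The A123T and G45D substitutions reduced stability."
    found = find_point_mutations(sentence)
    print(found)
    print(f_measure(found, [("A", 123, "T"), ("L", 10, "P")]))
```

A pattern this simple will also match unrelated tokens of the same shape, which is one reason the abstract describes linking each mention to a gene and checking it against the sequence.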

23

Phan, John H. "Biomarker discovery and clinical outcome prediction using knowledge based-bioinformatics." Diss., Georgia Institute of Technology, 2009. http://hdl.handle.net/1853/33855.

Abstract:

Advances in high-throughput genomic and proteomic technology have led to a growing interest in cancer biomarkers. These biomarkers can potentially improve the accuracy of cancer subtype prediction and, subsequently, the success of therapy. However, identification of statistically and biologically relevant biomarkers from high-throughput data can be unreliable due to the nature of the data, e.g., high technical variability, small sample size, and high dimensionality. Due to the lack of available training samples, data-driven machine learning methods are often insufficient without the support of knowledge-based algorithms. We investigate the benefits of using knowledge-based algorithms to solve clinical prediction problems. Because we are interested in identifying biomarkers that are also feasible in clinical prediction models, we focus on two analytical components: feature selection and predictive model selection. In addition to data variance, we must also consider the variance of analytical methods. There are many existing feature selection algorithms, each of which may produce different results. Moreover, it is not trivial to identify model parameters that maximize the sensitivity and specificity of clinical prediction. Thus, we introduce a method that uses independently validated biological knowledge to reduce the space of relevant feature selection algorithms and to improve the reliability of clinical predictors. Finally, we implement several functions of this knowledge-based method as a web-based, user-friendly, and standards-compatible software application.

24

Kiełbasa, Szymon M. "Bioinformatics of eukaryotic gene regulation." Doctoral thesis, Humboldt-Universität zu Berlin, Mathematisch-Naturwissenschaftliche Fakultät I, 2006. http://dx.doi.org/10.18452/15562.

Abstract:

Understanding the mechanisms which control gene expression is one of the fundamental problems of molecular biology. Detailed experimental studies of regulation are laborious due to the complex and combinatorial nature of interactions among the molecules involved. Therefore, computational techniques are used to suggest candidate mechanisms for further investigation. This thesis presents three methods that improve the prediction of the regulation of gene transcription. The first approach finds binding sites recognized by a transcription factor based on statistical over-representation of short motifs in a set of promoter sequences. A successful application of this method to several gene families of yeast is shown. More advanced techniques are needed for the analysis of gene regulation in higher eukaryotes. Libraries provide hundreds of profiles recognized by transcription factors. Dependencies between them result in multiple predictions of the same binding sites, which later need to be filtered out. The second method presented here offers a way to reduce the number of profiles by identifying similarities between them. Still, the complex nature of interaction between transcription factors makes reliable predictions of binding sites difficult. Exploiting independent sources of information reduces the false prediction rate. The third method proposes a novel approach associating gene annotations with regulation by multiple transcription factors and the binding sites recognized by them. The utility of the method is demonstrated on several well-known sets of transcription factors. RNA interference provides a way of efficiently down-regulating gene expression. Difficulties in predicting efficient siRNA sequences motivated the development of a library containing siRNA sequences and related experimental details described in the literature. This library, presented in the last chapter, is publicly available at http://www.human-sirna-database.net

25

Amanzadi, Amirhossein. "Predicting safe drug combinations with Graph Neural Networks (GNN)." Thesis, Uppsala universitet, Institutionen för farmaceutisk biovetenskap, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-446691.

Abstract:

Many people, especially the elderly, take multiple drugs for the treatment of complex or co-existing diseases. Identifying side effects caused by polypharmacy is crucial for reducing patient mortality and morbidity, which in turn improves quality of life. Since the space of possible drug combinations is immense, it is infeasible to examine all of them in the lab. In silico models can offer a convenient solution; however, due to the lack of a sufficient amount of homogeneous data, it is difficult to develop models that are both reliable and scalable in accurately predicting polypharmacy side effects. Recent advances in the field of representation learning have utilized the power of graph networks to harmonize information from heterogeneous biological databases and interactomes. This thesis takes advantage of those techniques and combines them with state-of-the-art Graph Neural Network algorithms to implement a deep learning pipeline capable of predicting the adverse drug reactions of any given drug pair.

26

Nair, Karthik. "Optimisation of autoencoders for prediction of SNPs determining phenotypes in wheat." Thesis, Uppsala universitet, Institutionen för biologisk grundutbildning, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-437451.

Abstract:

The increase in demand for food has resulted in increased demand for tools that help streamline the plant breeding process in order to create new varieties of crops. Identifying the underlying genetic mechanisms of favourable characteristics is essential in order to make the best breeding decisions. In this project we have developed a modified autoencoder model which allows for lateral phenotype injection into the latent layer, in order to identify causal SNPs for phenotypes of interest in wheat. SNP and phenotype data for 500 samples of Lantmännen SW Seed, provided by Lantmännen, were used to train the network. An artificial phenotype created using a single SNP was used during training instead of a real phenotype, since the relationship between this phenotype and the SNP is already known. The modified training model with lateral phenotype injection showed a significant increase in genotype concordance of the artificial phenotype when compared to the control model without phenotype injection. The causal SNP was successfully identified using a concordance terrain graph, in which the difference in concordance of individual SNPs between the modified model and the control model was plotted against the genomic position of each SNP. The model requires further testing to elucidate its behaviour for phenotypes linked to multiple SNPs.
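The quantity plotted in the concordance terrain graph can be sketched as a per-SNP difference in concordance between two models. The code below is an illustration only: it assumes genotypes are stored as rows of integer codes (one row per sample, one column per SNP) and that concordance means the fraction of samples whose reconstructed genotype equals the true one. The function names and the toy matrices are hypothetical, not taken from the thesis.

```python
def per_snp_concordance(true_genotypes, reconstructed):
    """Fraction of samples with a correctly reconstructed genotype, per SNP.

    Both arguments are lists of rows (samples) of genotype codes (SNPs).
    """
    n_samples = len(true_genotypes)
    n_snps = len(true_genotypes[0])
    return [
        sum(
            true_genotypes[s][j] == reconstructed[s][j]
            for s in range(n_samples)
        ) / n_samples
        for j in range(n_snps)
    ]


def concordance_gain(true_genotypes, modified_output, control_output):
    """Per-SNP concordance of the modified model minus that of the control."""
    modified = per_snp_concordance(true_genotypes, modified_output)
    control = per_snp_concordance(true_genotypes, control_output)
    return [m - c for m, c in zip(modified, control)]


if __name__ == "__main__":
    truth = [[0, 1, 2], [1, 1, 0]]
    modified = [[0, 1, 2], [1, 1, 0]]  # toy output of the injected model
    control = [[0, 1, 1], [1, 1, 2]]   # toy output of the control model
    gain = concordance_gain(truth, modified, control)
    # The SNP with the largest gain is the candidate flagged by the plot.
    print(gain, gain.index(max(gain)))
```

Plotting `gain` against genomic position would give the terrain described in the abstract, with a peak at the SNP whose reconstruction benefits most from phenotype injection.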

27

Chen, Huiling Zhou Huan Xiang Ferrone Frank A. "Prediction of protein structures and protein-protein interactions : a bioinformatics approach /." Philadelphia, Pa. : Drexel University, 2005. http://dspace.library.drexel.edu/handle/1860/481.

28

Fagerberg, Linn. "Mapping the human proteome using bioinformatic methods." Doctoral thesis, KTH, Proteomik, 2011. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-31477.

Abstract:

The fundamental goal of proteomics is to gain an understanding of the expression and function of the proteome on the level of individual proteins, on the level of defined cell types and on the level of the entire organism. In this thesis, the human proteome is explored using membrane protein topology prediction methods to define the human membrane proteome and by global protein expression profiling, which relies on a complex study of the location and expression levels of proteins in tissues and cells. A whole-proteome analysis was performed based on the predicted protein-coding genes of humans using a selection of membrane protein topology prediction methods. The study used a majority decision-based method, which estimated that approximately 26% of the human genes encode for a membrane protein. The prediction results are displayed in a visualization tool to facilitate the selection of antigens to be used for antibody generation. Global protein expression profiles in a large number of cells and tissues in the human body were analyzed for more than 4000 protein targets, based on data from the antibody-based immunohistochemistry and immunofluorescence methods within the framework of the Human Protein Atlas project. The results revealed few cell-type specific proteins and a high fraction of human proteins expressed in most cells, suggesting that cell and tissue specificity is attained by a fine-tuned regulation of protein levels. The expression profiles were also used to analyze the relationship between 45 cell lines by hierarchical clustering and principal component analysis. The global protein expression patterns overall reflected the tumor origin of the cells, and also allowed for identification of proteins of importance for distinguishing different categories of cell lines, as defined by phenotype of progenitor cell. In addition, the protein distribution in 16 subcellular compartments in three of the human cell lines was mapped. 
A large fraction of proteins were localized in two or more compartments and, in line with previous results, a majority of proteins were detected in all three cell lines. Finally, mass spectrometry-based protein expression levels were compared to RNA-seq-based transcript expression levels in three cell lines. Highly ubiquitous mRNA expression was found and the changes of expression levels between the cell lines showed high correlations between proteins and transcripts. Large general differences in abundance of proteins from various functional classes were observed. A comparison between categories based on expression levels revealed that, in general, genes with varying expression levels between the cell lines or only expressed in one cell line were highly enriched for cell-surface proteins. These studies show a path for a systematic analysis to characterize the proteome in human cells, tissues and organs.
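The majority decision step mentioned in the abstract can be illustrated with a small voting sketch. This is a hedged example, not the study's implementation: it assumes each topology predictor returns a yes/no membrane call per gene and that ties count as non-membrane. The function names and the toy gene labels are hypothetical.

```python
def majority_is_membrane(votes):
    """True if strictly more than half of the predictors call 'membrane'."""
    return sum(votes) > len(votes) / 2


def membrane_fraction(predictions):
    """Fraction of genes called membrane proteins by majority decision.

    `predictions` maps a gene identifier to a list of boolean calls,
    one per prediction method.
    """
    calls = [majority_is_membrane(votes) for votes in predictions.values()]
    return sum(calls) / len(calls)


if __name__ == "__main__":
    toy_calls = {
        "geneA": [True, True, False],
        "geneB": [False, False, True],
        "geneC": [True, True, True],
        "geneD": [False, False, False],
    }
    print(membrane_fraction(toy_calls))
```

Using an odd number of predictors avoids ties; with an even number, the tie rule becomes a design choice that shifts the estimated fraction.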

29

Tubeuf, Helene. "Développement de stratégies de criblage de mutations d'épissage dans des gènes de prédisposition au cancer. Demystifying the splicing code: new bioinformatics insights for the interpretation of genetic variants A staggering number of genetic variations affect the splicing pattern of BRCA2 exon 7: validation of the predictive power of splicing-dedicated silico analyses MLH1 exon 7, an emblematic exon sensitive to intronic mutations but not to alterations of exonic splicing regulators, sheds light into the performance of SRE-dedicated bioinformatics approaches Calibration of pathogenicity of partial splicing defects: The model of BRCA2 Exon 3." Thesis, Normandie, 2019. http://www.theses.fr/2019NORMR009.

Abstract:

The development of high-throughput DNA sequencing has greatly facilitated the screening of genetic variations within patient genomes. Henceforth, one of the main challenges in medical genetics is no longer the detection of variations, but their functional and clinical interpretation. Recently, we showed by using splicing reporter minigene assays that, although splicing mutations, and in particular those affecting splicing regulation, are more prevalent than initially estimated, they can now be predicted by using dedicated bioinformatics tools. We thus extended the evaluation of the predictive power of four newly developed computational approaches through a comparative study of the scores obtained by these approaches with experimental data for a total of about 1200 exonic variations. Our findings demonstrated the reliability of these approaches, used alone or in combination, and allowed us to offer recommendations for their use as a filtering tool to prioritize the variations to be analysed in splicing-dedicated functional assays. Nevertheless, an exhaustive mutational analysis targeting MLH1 exon 7 highlighted the apparent failure of these approaches, even though they had been validated by studies focused on BRCA2 exon 7, MAPT exon 10 and MSH2 exon 5, suggesting that these methods might not be equally applicable to all exons and/or genes. Indeed, we showed that this exon has particular characteristics, i.e. remarkably strong splice sites, conferring on it total resistance to splicing regulation mutations and defeating prediction tools. These findings help to better determine the limitations of these bioinformatics tools while contributing to their improvement. In spite of these advances, the pathogenicity assessment of splicing mutations remains complicated, especially for those leading to in-frame and/or partial splicing anomalies. By using variant-induced partial BRCA2 exon 3 skipping as a model system, we showed that BRCA2 tumor suppressor function tolerates a substantial reduction in expression level, as a BRCA2 allele producing as much as 70% of transcript encoding deficient protein may not necessarily confer a high risk of developing cancer. Altogether, these data have important implications for the molecular diagnosis and clinical management of patients and their relatives, with a direct benefit for families with suspected hereditary cancer, and should contribute to the interpretation of variants of unknown significance (VUS) identified by high-throughput sequencing in any other genetic disease.

30

Bae, Kyounghwa. "Bayesian model-based approaches with MCMC computation to some bioinformatics problems." Texas A&M University, 2005. http://hdl.handle.net/1969.1/2396.

Abstract:

Bioinformatics applications can address the transfer of information at several stages of the central dogma of molecular biology, including transcription and translation. This dissertation focuses on using Bayesian models to interpret biological data in bioinformatics, with Markov chain Monte Carlo (MCMC) as the inference method. First, we use our approach to interpret data at the transcription level. We propose a two-level hierarchical Bayesian model for variable selection on cDNA microarray data. A cDNA microarray quantifies the mRNA levels of thousands of genes simultaneously in one sample. By observing the expression patterns of genes under various treatment conditions, important clues about gene function can be obtained. We consider a multivariate Bayesian regression model and assign priors that favor sparseness in terms of the number of variables (genes) used. We introduce the use of different priors to promote different degrees of sparseness using a unified two-level hierarchical Bayesian model. Second, we apply our method to a problem related to the translation level. We develop hidden Markov models to model linker/non-linker sequence regions in a protein sequence. We use a linker index to exploit differences in amino acid composition between regions from sequence information alone. A goal of protein structure prediction is to take an amino acid sequence (represented as a sequence of letters) and predict its tertiary structure. The identification of linker regions in a protein sequence is valuable in predicting the three-dimensional structure. Because of the complexities of both models encountered in practice, we employ MCMC, particularly Gibbs sampling (Gelfand and Smith, 1990), for parameter inference.

31

Xia, Jing. "Bioinformatics analyses of alternative splicing, est-based and machine learning-based prediction." Thesis, Manhattan, Kan. : Kansas State University, 2008. http://hdl.handle.net/2097/1113.

32

Loewe, Laurence. "Evolutionary bioinformatics predicting genetic stability of asexual genomes by global computing /." [S.l.] : [s.n.], 2003. http://deposit.ddb.de/cgi-bin/dokserv?idn=969894201.

33

Choi, Yoonjoo. "Protein loop structure prediction." Thesis, University of Oxford, 2011. http://ora.ox.ac.uk/objects/uuid:bd5c1b9b-89ba-4225-bc17-85d3f5067e58.

Abstract:

This dissertation concerns the study and prediction of loops in protein structures. Proteins perform crucial functions in living organisms. Despite their importance, we are currently unable to predict their three-dimensional structure accurately. Loops are segments that connect regular secondary structures of proteins. They tend to be located on the surface of proteins and often interact with other biological agents. As loops are generally subject to more frequent mutations than the rest of the protein, their sequences and structural conformations can vary significantly even within the same protein family. Although homology modelling is the most accurate computational method for protein structure prediction, difficulties still arise in predicting protein loops. Protein loop structure prediction is therefore a bottleneck in solving the protein structure prediction problem. Reflecting on the success of homology modelling, I implement an improved version of a database search method, FREAD. I show how sequence similarity as quantified by environment specific substitution scores can be used to significantly improve loop prediction. FREAD performs appreciably better for an identifiable subset of loops (two thirds of shorter loops and half of the longer loops tested) than ab initio methods; FREAD's predictive ability is length independent. In general, it produces results within 2 Å root mean square deviation (RMSD) from the native conformations, compared to an average of over 10 Å for loop length 20 for any of the other tested ab initio methods. I then examine FREAD's predictive ability on a specific type of loop called complementarity determining regions (CDRs) in antibodies. CDRs consist of six hypervariable loops and form the majority of the antigen binding site. I examine CDR loop structure prediction as a general case of the loop structure prediction problem. FREAD achieves accuracy similar to specific CDR predictors.
However, it fails to accurately predict CDR-H3, which is known to be the most challenging CDR. Various FREAD versions including FREAD with contact information (ConFREAD) are examined. The FREAD variants improve predictions for CDR-H3 on homology models and docked structures. Lastly, I focus on the local properties of protein loops and demonstrate that the protein loop structure prediction problem is a local protein folding problem. The end-to-end distance of loops (loop span) follows a distinctive frequency distribution, regardless of secondary structure elements connected or the number of residues in the loop. I show that the loop span distribution follows a Maxwell-Boltzmann distribution. Based on my research, I propose future directions in protein loop structure prediction including estimating experimentally undetermined local structures using FREAD, multiple loop structure prediction using contact information and a novel ab initio method which makes use of loop stretch.
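
The accuracy figures quoted in this abstract are RMSD values between a predicted loop and its native conformation. As a rough illustration of that metric only (not the evaluation code of the thesis), the sketch below computes RMSD for two coordinate lists that are assumed to be already superimposed. The function name `rmsd` and the three-atom coordinates are hypothetical.

```python
import math

def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equal-length lists of (x, y, z) points.

    Assumption: the two structures are already superimposed; no fitting is done here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))

# Hypothetical three-atom loop: the model is displaced 2 Å along x from the native.
native = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
model = [(2.0, 0.0, 0.0), (3.5, 0.0, 0.0), (5.0, 0.0, 0.0)]
print(rmsd(native, model))
```

In practice the loop anchors or the whole structure would be fitted first; that step is left out to keep the sketch short.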

34

Grech, Brian James. "Bioinformatic prediction of conserved promoters across multiple whole genomes of Chlamydia." Queensland University of Technology, 2007. http://eprints.qut.edu.au/16521/.

Abstract:

The genome sequencing projects have generated a wealth of genomic data and the analysis of this data has provided many interesting findings. However, genome-wide analysis of bacteria for promoters has lagged behind, because the high level of background noise in bacterial genomes has made it difficult to predict promoters accurately. One approach to overcome this problem is to predict phylogenetically conserved promoters across multiple genomes of different bacteria, thus filtering out many of the false positives predicted by current methods. However, there are no programmes capable of doing this. Therefore, the work presented in this thesis has developed a position weight matrix (PWM) based programme called Multiscan that predicts conserved promoters across multiple bacterial genomes. Since Chlamydia is one of the most sequenced bacterial genera and has a high level of conservation of genes and large-scale conservation of gene order between species, Multiscan was developed and tested on Chlamydia. When Multiscan analysed a genome-wide dataset of equivalent non-coding regions (NCRs) upstream of genes from Chlamydia trachomatis, Chlamydia pneumoniae and Chlamydia caviae for phylogenetically conserved σ66 promoters, it predicted 42 promoters. Since only one of the 42 promoters predicted by Multiscan had previously available biological data to confirm its prediction, an additional subset of 10 of the remaining 41 σ66 promoters was analysed in C. trachomatis by mapping the 5' end of the transcripts. The primer extension assay synthesised cDNA products of the correct length for seven of the 10 genes chosen. When the performance of Multiscan was compared to one of the accepted methods for genome-wide prediction of promoters in bacteria, the "standard PWM method", Multiscan predicted 32 more promoters than the "standard PWM method" in Chlamydia. Furthermore, the promoters predicted by Multiscan had up to three more mismatches from the Escherichia coli σ70 consensus sequence than the promoters predicted by the standard PWM method. Although Multiscan predicted 42 promoters that were well conserved across the three chlamydial species, the analysis was unable to identify the 14 known σ66 promoters in C. trachomatis. These promoters were missed (1) because they were dissimilar to the E. coli σ70 consensus sequence and/or (2) because the promoters were poorly conserved across the three chlamydial species. To address the second possibility, the 14 false negatives were analysed by another phylogenetic footprinting method. Fourteen sets of equivalent NCRs located upstream of the homologous genes from the three chlamydiae were aligned with the computer programme Clustal W and the alignment was analysed "by eye" for evidence of phylogenetic footprints containing the 14 false negatives. The analysis identified that seven of the 14 false negatives were poorly conserved across the chlamydial species. Analysis of two of the seven promoters that could not be footprinted, the promoters of ltuA and ltuB, by mapping the transcriptional start sites in C. caviae, confirmed their poor conservation across C. trachomatis and C. caviae. This analysis showed that substantial differences exist in chlamydial σ66 promoters from equivalent NCRs upstream of genes. This study has developed a new computer programme for genome-wide prediction of promoters that are phylogenetically conserved and has shown the value of this programme by identifying seven new well conserved promoters and seven candidate poorly conserved promoters in Chlamydia.
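
The abstract describes PWM-based promoter prediction filtered by conservation across equivalent upstream regions. The sketch below illustrates that general idea only; it is not Multiscan. The function names, the toy training sites (variants of the E. coli σ70 -10 consensus TATAAT), the pseudocount, the uniform background and the score threshold are all assumptions made for illustration.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Log-odds position weight matrix from equal-length aligned sites."""
    pwm = []
    for pos in range(len(sites[0])):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2(counts[b] / total / background) for b in BASES})
    return pwm

def best_hit(pwm, sequence):
    """Return (score, offset) of the highest-scoring window in `sequence`."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        hits.append((sum(col[base] for col, base in zip(pwm, window)), start))
    return max(hits)  # raises ValueError if the sequence is shorter than the PWM

def conserved(pwm, regions, threshold):
    """True if every equivalent upstream region has a hit at or above `threshold`."""
    return all(best_hit(pwm, region)[0] >= threshold for region in regions)

# Hypothetical training sites and upstream regions from three genomes.
sites = ["TATAAT", "TATAAT", "TACAAT", "TATGAT"]
pwm = build_pwm(sites)
regions = ["GGCGTATAATGCC", "CCGTATAATGGCA", "ACGTTATAATCGG"]
print([best_hit(pwm, r)[1] for r in regions])
print(conserved(pwm, regions, threshold=5.0))
```

Requiring a hit in every genome is the simplest possible conservation filter; a real tool would also have to decide which regions count as equivalent and how far a hit may move between species.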

35

Shu, Nanjiang. "Protein structure prediction: zinc-binding sites, one-dimensional structure and remote homology." Doctoral thesis, Stockholm: Department of Materials and Environmental Chemistry (MMK), Stockholm University, 2010. http://urn.kb.se/resolve?urn=urn:nbn:se:su:diva-34094.

Abstract:

Diss. (summary) Stockholm: Stockholm University, 2010. At the time of the doctoral defense, the following paper was unpublished and had a status as follows: Paper 3: Manuscript. Includes 4 papers.

36

Thorburn, Henrik. "Applying Bioinformatic Techniques to Identify Cold-associated Genes in Oat." Thesis, University of Skövde, Department of Computer Science, 2002. http://urn.kb.se/resolve?urn=urn:nbn:se:his:diva-728.

Abstract:

As the interest in biological sequence analysis increases, more efficient techniques to sequence, map and analyse genome data are needed. One frequently used technique is EST sequencing, which has proven to be a fast and cheap method to extract genome data. EST sequencing generates large numbers of low-quality sequences which have to be managed and analysed further.

Performing complete searches and finding guaranteed results is very time consuming. This dissertation project presents a method that can be used to perform rapid gene prediction of function-specific genes in EST data, as well as the results and an estimation of the accuracy of the method.

This dissertation project applies various methods and techniques to actual data, attempting to identify genes involved in cold-associative processes in plants. The presented method consists of three steps. First, a database of genes known to have cold-associated properties is assembled. These genes are extracted from other, already sequenced and analysed organisms. Secondly, this database is used to identify homologues in an unanalysed EST dataset, generating a candidate list of cold-associated genes. Last, each of the identified candidate cold-associative genes is verified, both to estimate the accuracy of the rapid gene prediction and to support the removal of candidates which are not cold-associative.

The method was applied to a previously unanalysed Avena sativa EST dataset, and was able to identify 135 candidate genes from approximately 9500 ESTs. Out of these, 103 were verified as cold-associated genes.

37

Transell, Mark Marriott. "The Use of bioinformatics techniques to perform time-series trend matching and prediction." Diss., University of Pretoria, 2012. http://hdl.handle.net/2263/37061.

Abstract:

Process operators often have process faults and alarms due to recurring failures on process equipment. It is also the case that some processes do not have enough input information or process models to use conventional modelling or machine learning techniques for early fault detection. A proof of concept for online streaming prediction software based on matching process behaviour to historical motifs has been developed, making use of the Basic Local Alignment Search Tool (BLAST) used in the Bioinformatics field. Execution times of as low as 1 second have been recorded, demonstrating that online matching is feasible. Three techniques have been tested and compared in terms of their computational efficiency, robustness and selectivity, with results shown in Table 1:

• Symbolic Aggregate Approximation combined with PSI-BLAST
• Naive Triangular Representation with PSI-BLAST
• Dynamic Time Warping

Table 1: Properties of different motif-matching methods

| Property | SAX-PSIBLAST | TER-PSIBLAST | DTW |
|---|---|---|---|
| Noise tolerance (Selectivity) | Acceptable | Inconclusive | Good |
| Vertical Shift tolerance | None | Perfect | Poor |
| Matching speed | Acceptable | Acceptable | Fast |
| Match speed scaling | O < O(mn) | O < O(mn) | O(mn) |
| Dimensionality Reduction Tolerance | Good | Inconclusive | Acceptable |

It is recommended that a method using a weighted confidence measure for each technique be investigated for the purpose of online process event handling and operator alerts.

Keywords: SAX, BLAST, motif-matching, Dynamic Time Warping

Dissertation (MEng)--University of Pretoria, 2012.
Chemical Engineering
unrestricted
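
Dynamic Time Warping is listed in this abstract with O(mn) scaling. A minimal textbook sketch of the dynamic-programming recurrence shows where that cost comes from. This is not the dissertation's implementation; the function name `dtw_distance`, the absolute-difference local cost and the toy series are assumptions.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance with absolute-difference local cost.

    Fills an (m+1) x (n+1) table, hence O(mn) time and memory.
    """
    m, n = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (n + 1) for _ in range(m + 1)]
    cost[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[m][n]

motif = [0.0, 1.0, 2.0, 1.0, 0.0]
stretched = [0.0, 0.0, 1.0, 2.0, 2.0, 1.0, 0.0]  # same shape, warped in time
shifted = [v + 5.0 for v in motif]                # same shape, vertical offset

print(dtw_distance(motif, stretched))
print(dtw_distance(motif, shifted))
```

The time-warped copy aligns at zero cost while the vertically shifted copy does not, which is consistent with the poor vertical shift tolerance reported for DTW. Normalising each series before matching is one common way to reduce that sensitivity.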

38

Hon, Jiří. "Vyhledávání příbuzných proteinů s modifikovanou funkcí" [Searching for related proteins with modified function]. Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2015. http://www.nusl.cz/ntk/nusl-234914.

Abstract:

Protein engineering is a young, dynamic discipline with a great number of potential practical applications. However, its success is primarily based on perfect knowledge and use of all existing information about protein function and structure. To achieve that, protein engineering is supported by many bioinformatic tools and analyses. The goal of this project is to create a new tool for protein engineering that enables researchers to identify related proteins with modified function in ever-growing biological databases. The tool is designed as an automated workflow of existing bioinformatic analyses that leads to the identification of proteins with the same type of enzymatic function, but with slightly modified properties, primarily in terms of selectivity, reaction speed and stability.

39

Hennerdal, Aron. "Investigation of multivariate prediction methods for the analysis of biomarker data." Thesis, Linköping University, The Department of Physics, Chemistry and Biology, 2006. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-5889.

Abstract:

The paper describes predictive modelling of biomarker data stemming from patients suffering from multiple sclerosis. Improvements to multivariate analyses of the data are investigated with the goal of increasing the capability to assign samples to the correct subgroups from the data alone.

The effects of different preceding scalings of the data are investigated, and combinations of multivariate modelling methods and variable selection methods are evaluated. Attempts are made to merge the predictive capabilities of the method combinations through voting procedures. A technique for improving the result of PLS modelling, called bagging, is evaluated.

Of the multivariate analysis methods tried, the best are found to be partial least squares (PLS) and support vector machines (SVM). It is concluded that the scaling has little effect on the prediction performance for most methods. The method combinations have interesting properties: the default variable selections of the multivariate methods are not always the best. Bagging improves performance, but at a high cost. No reasons for drastically changing the workflows of the biomarker data analysis are found, but slight improvements are possible. Further research is needed.

40

Sriram, Aparna. "Predicting Gene Relations Using Bayesian Networks." University of Akron / OhioLINK, 2011. http://rave.ohiolink.edu/etdc/view?acc_num=akron1302619630.

41

Marko, Adam Christian. "Structure prediction and virtual screening: Application to G protein-coupled receptors." Diss., Search in ProQuest Dissertations & Theses. UC Only, 2009. http://gateway.proquest.com/openurl?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation&res_dat=xri:pqdiss&rft_dat=xri:pqdiss:1469757.

42

Kehr, Stephanie. "Expanding the SnoRNA Interaction Network." Doctoral thesis, Universitätsbibliothek Leipzig, 2016. http://nbn-resolving.de/urn:nbn:de:bsz:15-qucosa-216221.

Abstract:

Small nucleolar RNAs (snoRNAs) are one of the most abundant and evolutionarily ancient groups of small non-coding RNAs. Their main function is to target chemical modifications of ribosomal RNAs (rRNAs) and small nuclear RNAs (snRNAs). They fall into two classes, box C/D snoRNAs and box H/ACA snoRNAs, which are clearly distinguished by conserved sequence motifs and the type of modification that they govern. The box H/ACA snoRNAs are responsible for targeting pseudouridylation sites and the box C/D snoRNAs for directing 2’-O-methylation of ribonucleotides. A subclass that localizes to the Cajal bodies, termed scaRNAs, is responsible for methylation and pseudouridylation of snRNAs. In addition, an amazing diversity of non-canonical functions of individual snoRNAs has arisen. The modification patterns in rRNAs and snRNAs are retained during evolution, making it even possible to project them from yeast onto human. The stringent conservation of modification sites and the slow evolution of rRNAs and snRNAs contrast with the rapid evolution of snoRNA sequences. Recent studies that incorporate high-throughput sequencing experiments still identify previously undetected snoRNAs even in well-studied organisms such as human. The snoRNAbase, which has been the standard database for human snoRNAs, has not been updated since 2006 and lacks these new data. Along with the lack of a centralized data collection across species that also incorporates snoRNA class-specific characteristics, this created the need to integrate distributed data from literature and databases into a comprehensive snoRNA set. Although several snoRNA studies included pro forma target predictions in individual species, and more and more studies focus on non-canonical functions of subclasses, a systematic survey of the guiding function and especially the functional homologies of snoRNAs was not available. 
To establish a sound set of snoRNAs, a computational snoRNA annotation pipeline named snoStrip, which identifies homologous snoRNAs in related species, was employed. For large-scale investigation of snoRNA function, state-of-the-art target predictions were performed with our software RNAsnoop and PLEXY. Further, a new measure, the Interaction Conservation Index (ICI), was developed to evaluate the conservation of snoRNA function. The snoStrip pipeline was applied to vertebrate species for which the genome sequence was available. In addition, it was used in several ncRNA annotation studies (48 avian, spotted gar) of newly assembled genomes to contribute the snoRNA genes. Detailed target analysis of the new vertebrate snoRNA set revealed that in general the functions of homologous snoRNAs are evolutionarily stable; thus, members of the same snoRNA family guide equivalent modifications. The conservation of snoRNA sequences is high at target binding regions while the remaining sequence varies significantly. In addition to elucidating principles of correlated evolution it was possible, with the help of the ICI measure, to assign functions to previously orphan snoRNAs and to associate snoRNAs as partners to known but so far unexplained chemical modifications. As a further pattern, redundant guiding became apparent. For many modification sites more than one snoRNA encodes the appropriate antisense element (ASE), which could ensure constant modification through snoRNAs that have different expression patterns. Furthermore, predictions of snoRNA functions in conjunction with sequence conservation could identify distant homologies. Due to the high overall entropy of snoRNA sequences, such relationships are hard to detect by means of sequence homology search methods alone. The snoRNA interaction network was further expanded through novel snoRNAs that were detected in data from high-throughput experiments in human and mouse. 
Through subsequent target analysis the new snoRNAs could immediately explain known modifications that had no appropriate snoRNA guide assigned before. In a further study a full catalog of expressed snoRNAs in human was provided. Besides canonical snoRNAs, recent findings such as AluACAs, sno-lncRNAs and extraordinarily short SNORD-like transcripts were also taken into account. Again the target analysis workflow identified previously undetected connections between snoRNA guides and modifications. In particular, some species- or clade-specific interactions of SNORD-like genes emerged that seem to act as bona fide snoRNA guides for rRNA and snRNA modifications. For all high-confidence new snoRNA genes identified during this work, official gene names were requested from the HUGO Gene Nomenclature Committee (HGNC) to avoid further naming confusion.

43

Parakkal, Sreenivasan Akshai. "Deep learning prediction of Quantmap clusters." Thesis, Uppsala universitet, Institutionen för biologisk grundutbildning, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-445909.

Abstract:

The hypothesis that similar chemicals exert similar biological activities has been widely adopted in the field of drug discovery and development. Quantitative Structure-Activity Relationship (QSAR) models have been used ubiquitously in drug discovery to understand the function of chemicals in biological systems. A common QSAR modeling method calculates similarity scores between chemicals to assess their biological function. However, due to the fact that some chemicals can be similar and yet have different biological activities, or conversely can be structurally different yet have similar biological functions, various methods have instead been developed to quantify chemical similarity at the functional level. Quantmap is one such method, which utilizes biological databases to quantify the biological similarity between chemicals. Quantmap uses quantitative molecular network topology analysis to cluster chemical substances based on their bioactivities. This method by itself, unfortunately, cannot assign new chemicals (those which may not yet have biological data) to the derived clusters. Owing to the fact that there is a lack of biological data for many chemicals, deep learning models were explored in this project with respect to their ability to correctly assign unknown chemicals to Quantmap clusters. The deep learning methods explored included both convolutional and recurrent neural networks. Transfer learning/pretraining based approaches and data augmentation methods were also investigated. The best performing model, among those considered, was the Seq2seq model (a recurrent neural network containing two joint networks, a perceiver and an interpreter network) without pretraining, but including data augmentation.

44

Meraba, Rebone Leboreng. "Evaluating the predictive performance of cytotoxic T lymphocyte epitope prediction tools using Elispot assay data." Master's thesis, University of Cape Town, 2018. http://hdl.handle.net/11427/27972.

Abstract:

Computational T-cell epitope prediction tools have been previously devised to predict potential human leukocyte antigen (HLA) binding peptides from protein sequences. These tools are complements of enzyme-linked immunosorbent spot (ELISpot) assays, a very commonly applied immunological technique that is used both to identify regions of pathogen genomes that trigger an immune response and to characterize the relationships between an individual's complement of HLA alleles and the degree of immunity that they display. If computational tools could accurately predict HLA-peptide binding, then these tools might be usable as a cheap and reliable alternative to ELISpot assays. A web-based IFN-γ ELISpot assay dataset sharing resource, called IMMUNO-SHARE, was developed to enable the simple and straightforward storage and dissemination amongst researchers of large volumes of IFN-γ ELISpot assay data. Such experimental data were next used to make HLA-peptide binding predictions with four frequently used T-cell epitope prediction tools: netMHC 3.2, IEDB_ANN, IEDB_ARB Matrix and IEDB_SMM. The predictive performances of all four tools, individually and collectively, were statistically assessed using non-parametric Spearman rank-order correlation tests. It was found that none of the four tested tools yielded binding affinity predictions that were detectably correlated with the observed ELISpot data. High false positive rates, where high predicted binding affinities between peptides and patient HLAs corresponded in these patients with no appreciable immune responses, were apparent for all four of the tested methods. The low degree of correlation between ELISpot data and HLA-peptide binding predictions and, in particular, high false positive rates and relatively low true positive and true negative rates indicate that the four tested tools would require substantial improvement before they could be seen as a viable alternative to ELISpot assays. 
Given that the accuracy of predictions of each of the four methods tested is largely dependent on both the quantity and quality of known true binder and true non-binder datasets that were used to train the HLA-peptide binding prediction methods implemented by the tools, it is plausible that the accuracy of these tools could be increased with larger training datasets. Retraining either the current methods or the next generation of prediction tools would therefore be greatly facilitated by the availability of large quantities of publicly available HLA-peptide binding interaction information. It is hoped that IMMUNO-SHARE or some other ELISpot data sharing resource could eventually meet this need.
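
The abstract assesses the tools with Spearman rank-order correlation between predicted binding and observed ELISpot responses. As a rough sketch of that statistic (not the thesis's analysis script), the code below ranks both variables with ties averaged and then takes the Pearson correlation of the ranks. The function names and the example numbers are hypothetical.

```python
import math

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical predicted binding scores and ELISpot spot counts for six peptides.
predicted_score = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
spot_counts = [12, 0, 85, 0, 40, 3]
print(spearman_rho(predicted_score, spot_counts))
```

A coefficient near zero on real data would correspond to the "no detectable correlation" finding reported above; a significance test on the coefficient is a separate step not shown here.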

45

Davis, Fred Pejman. "Prediction of potential host-pathogen protein interactions by structure." Diss., Search in ProQuest Dissertations & Theses. UC Only, 2007. http://gateway.proquest.com/openurl?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation&res_dat=xri:pqdiss&rft_dat=xri:pqdiss:3261227.

46

Carlsson, Jonas. "Mutational effects on protein structure and function." Doctoral thesis, Linköpings universitet, Bioinformatik, 2009. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-50491.

Abstract:

In this thesis several important proteins are investigated from a structural perspective. Some of the proteins are disease related while others have important but not completely characterised functions. The techniques used are general, as demonstrated by applications to metabolic proteins (CYP21, CYP11B1, IAPP, ADH3), regulatory proteins (p53, GDNF) and a transporter protein (ANTR1). When the protein CYP21 (steroid 21-hydroxylase) is deficient it causes CAH (congenital adrenal hyperplasia). For this protein, there are about 60 known mutations with characterised clinical phenotypes. Using manual structural analysis we managed to explain the severity of all but one of the mutations. By observing the properties of these mutations we could make good predictions for mutations that were unclassified at the time. For the cancer suppressor protein p53, there are over a thousand mutations with known activity. To be able to analyse such a large number of mutations we developed an automated method for evaluation of the mutation effect called PREDMUT. In this method we include twelve different prediction parameters, including two energy parameters calculated using an energy minimization procedure. The method manages to differentiate severe mutations from non-severe mutations with 77% accuracy on all possible single base substitutions and with 88% on mutations found in breast cancer patients. The automated prediction was further applied to CYP11B1 (steroid 11-beta-hydroxylase), which in a similar way to CYP21 causes CAH when deficient. A generalized method applicable to any kind of globular protein was developed. The method was subsequently evaluated on nine additional proteins for which mutants were known with annotated disease phenotypes. This prediction achieved 84% accuracy on CYP11B1 and 81% accuracy in total on the evaluation proteins while leaving 8% as unclassified. 
By increasing the number of unclassified mutations, the accuracy on the remaining mutations could be increased for the evaluation proteins, substantially increasing the classification quality as measured by the Matthews correlation coefficient. Servers with predictions for all possible single base substitutions are provided for p53, CYP21 and CYP11B1. The amyloid formation of IAPP (islet amyloid polypeptide) is strongly connected to diabetes and has been studied using both molecular dynamics and Monte Carlo energy minimization. The effects of mutations on the amount and speed of amyloid formation were investigated using three approaches. Applying a consensus of the three methods to a number of interesting mutations, 94% of the mutations could be correctly classified as amyloid forming or not, evaluated with in vitro measurements. In the brain there are many proteins whose functions and interactions are largely unknown. GDNF (glial cell line-derived neurotrophic factor) and NCAM (neural cell adhesion molecule) are two such neuron-connected proteins that are known to interact. The form of interaction was studied using protein-protein docking, where a docking interface was found mediated by four oppositely charged residues in the respective proteins. This interface was subsequently confirmed by mutagenesis experiments. The NCAM dimer interface upon binding to the GDNF dimer was also mapped, as well as an additional interacting protein, GFRα1, which was successfully added to the protein complex without any clashes. A large and well studied protein family is the alcohol dehydrogenase family, ADH. A class of this family is ADH3 (alcohol dehydrogenase class III), which has several known substrates and inhibitors. By using virtual screening we tried to characterize new ligands. 
As some ligands were already known we could incorporate this knowledge when the compound docking simulations were scored and thereby find two new substrates and two new inhibitors which were subsequently successfully tested in vitro. ANTR1 (anion transporter 1) is a membrane-bound transporter important for photosynthesis in plants. To be able to study the amino acid residues involved in inorganic phosphate transportation a homology model of the protein was created. Important residues were then mapped onto the structure using conservation analysis and we were in this way able to propose roles of amino acid residues involved in the transportation of inorganic phosphate. Key residues were subsequently mutated in vitro and a transportation process could be postulated. To conclude, we have used several molecular modelling techniques to find functional clues, interaction sites and new ligands. Furthermore, we have investigated the effect of mutations on the function and structure of a multitude of disease related proteins.
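
The abstract reports both accuracy and the Matthews correlation coefficient (MCC) for the severe/non-severe classification. A short sketch of the two measures, computed from a 2x2 confusion table, shows how they relate. The confusion-table counts below are invented for illustration and are not taken from the thesis; returning 0.0 for an empty row or column is a common convention, stated here as an assumption.

```python
import math

def accuracy(tp, tn, fp, fn):
    """Fraction of classified cases that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

def matthews_cc(tp, tn, fp, fn):
    """Matthews correlation coefficient of a binary confusion table."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumed convention when a row or column of the table is empty
    return (tp * tn - fp * fn) / denom

# Hypothetical counts: severe mutations are the positive class.
tp, tn, fp, fn = 40, 41, 10, 9
print(accuracy(tp, tn, fp, fn))
print(matthews_cc(tp, tn, fp, fn))
```

Unlike accuracy, MCC uses all four cells of the table, so it stays informative when the two classes are unbalanced. Mutations left unclassified are simply excluded from the counts in this sketch.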

47

Leung, Shuen-yi. "Predicting metabolic pathways from metabolic networks." Click to view the E-thesis via HKUTO, 2009. http://sunzi.lib.hku.hk/hkuto/record/B42664317.

48

Hronský, Patrik. "Bioinformatický nástroj pro predikci rozpustnosti proteinů" [Bioinformatics tool for protein solubility prediction]. Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2016. http://www.nusl.cz/ntk/nusl-255363.

Abstract:

This master's thesis addresses the solubility of recombinant proteins and its prediction. It describes protein synthesis as well as the process of recombinant protein creation. Recombinant protein synthesis is of great importance, for example, to the pharmaceutical industry. This synthesis is not a simple task and does not always produce viable proteins. Protein solubility is an important factor determining the viability of the resulting proteins. It is of course favourable for companies that take part in recombinant protein synthesis to focus their effort and resources on proteins that will be viable in the end. In this regard, bioinformatics is of great help, as it is capable, with the help of machine learning, of predicting the solubility of proteins, for example based on their sequences. This thesis introduces the reader to the basic principles of machine learning and presents several machine learning methods used in the field of protein solubility prediction. It deals with the definition of a dataset, which is later used to test selected predictors as well as to train the ensemble predictor, which is the main focus of this thesis. It also focuses on several specific protein solubility predictors and explains the basic principles upon which they are built, as well as the results of their testing. In the end, it presents the ensemble predictor of protein solubility.

49

Youngs, Noah. "Positive-Unlabeled Learning in the Context of Protein Function Prediction." Thesis, New York University, 2014. http://pqdtopen.proquest.com/#viewpdf?dispub=3665223.

Abstract:

With the recent proliferation of large, unlabeled data sets, a particular subclass of semisupervised learning problems has become more prevalent. Known as positive-unlabeled learning (PU learning), this scenario provides only positive labeled examples, usually just a small fraction of the entire data set, with the remaining examples unknown and thus potentially belonging to either the positive or negative class. Since the vast majority of traditional machine learning classifiers require both positive and negative examples in the training set, a new class of algorithms has been developed to deal with PU learning problems.

A canonical example of this scenario is topic labeling of a large corpus of documents. Once the size of a corpus reaches into the thousands, it becomes largely infeasible to have a curator read even a sizable fraction of the documents and annotate them with topics. In addition, the entire set of topics may not be known, or may change over time, making it impossible for a curator to annotate which documents are NOT about certain topics. Thus a machine learning algorithm needs to be able to learn from a small set of positive examples, without knowledge of the negative class, and knowing that the unlabeled training examples may contain an arbitrary number of additional but as yet unknown positive examples.

Another example of a PU learning scenario recently garnering attention is the protein function prediction problem (PFP problem). While the number of organisms with fully sequenced genomes continues to grow, the progress of annotating those sequences with the biological functions that they perform lags far behind. Machine learning methods have already been successfully applied to this problem, but with many organisms having only a small number of positive annotated training examples, and with almost no labeled negative examples available, PU learning algorithms have the potential to make large gains in predictive performance.

The first part of this dissertation motivates the protein function prediction problem, explores previous work, and introduces novel methods that improve upon previously reported benchmarks for a particular type of learning algorithm, known as Gaussian Random Field Label Propagation (GRFLP). In addition, we present improvements to the computational efficiency of the GRFLP algorithm, and a modification to the traditional structure of the PFP learning problem that allows for simultaneous prediction across multiple species.

The second part of the dissertation focuses specifically on the positive-unlabeled aspects of the PFP problem. Two novel algorithms are presented, and rigorously compared to existing PU learning techniques in the context of protein function prediction. Additionally, we take a step back and examine some of the theoretical considerations of the PU scenario in general, and provide an additional novel algorithm applicable in any PU context. This algorithm is tailored for situations in which the labeled positive examples are a small fraction of the set of true positive examples, and where the labeling process may be subject to some type of bias rather than being a random selection of true positives (arguably some of the most difficult PU learning scenarios).

The third and fourth sections return to the PFP problem, examining the power of tertiary structure as a predictor of protein function, as well as presenting two case studies of function prediction performance on novel benchmarks. Lastly, we conclude with several promising avenues of future research into both PU learning in general, and the protein function prediction problem specifically.
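
The abstract names Gaussian Random Field Label Propagation but gives no algorithmic detail, so the following is only a minimal sketch of the textbook form of label propagation on a graph, not the dissertation's method. Each unlabeled node repeatedly takes the weighted average of its neighbors' scores while labeled nodes stay fixed. The function name `propagate_labels`, the toy chain graph, and the use of an explicit negative label are assumptions made for illustration; in a true PU setting no confirmed negatives would be available.

```python
def propagate_labels(weights, labels, iterations=200):
    """Iteratively average neighbor scores; labeled nodes stay clamped.

    weights: symmetric adjacency matrix as a list of lists (toy input).
    labels:  dict mapping node index -> 1.0 (positive) or 0.0 (assumed negative).
    Returns one score per node.
    """
    n = len(weights)
    # Unlabeled nodes start at a neutral 0.5 (an arbitrary starting point).
    scores = [labels.get(i, 0.5) for i in range(n)]
    for _ in range(iterations):
        updated = list(scores)
        for i in range(n):
            if i in labels:
                continue  # clamped: labeled nodes never change
            degree = sum(weights[i])
            if degree > 0:
                updated[i] = sum(w * s for w, s in zip(weights[i], scores)) / degree
        scores = updated
    return scores


# Toy chain graph 0 - 1 - 2 - 3 with node 0 labeled positive
# and node 3 treated as negative (an assumption of this example).
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
scores = propagate_labels(chain, {0: 1.0, 3: 0.0})
```

On this chain the unlabeled nodes settle at scores that fall off with distance from the positive node, which is the behavior that makes such scores usable for ranking candidate annotations.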


50

Midic, Uros. "Genome-Wide Prediction of Intrinsic Disorder; Sequence Alignment of Intrinsically Disordered Proteins." Diss., Temple University Libraries, 2012. http://cdm16002.contentdm.oclc.org/cdm/ref/collection/p245801coll10/id/159800.


Abstract:

Computer and Information Science
Ph.D.

Intrinsic disorder (ID) is defined as a lack of stable tertiary and/or secondary structure under physiological conditions in vitro. Intrinsically disordered proteins (IDPs) are highly abundant in nature. IDPs possess a number of crucial biological functions, being involved in regulation, recognition, signaling, and control; their functional repertoire complements the functions of ordered proteins. Intrinsically disordered regions (IDRs) of IDPs have a different amino-acid composition than structured regions and proteins. This fact has been exploited for the development of predictors of ID; the best predictors currently achieve around 80% per-residue accuracy. Earlier studies revealed that some IDPs are associated with various human diseases, including cancer, cardiovascular disease, amyloidoses, neurodegenerative diseases, diabetes and others.

We developed a methodology for prediction and analysis of the abundance of intrinsic disorder on the genome scale, which combines data from various gene and protein databases and utilizes several ID prediction tools. We used this methodology to perform a large-scale computational analysis of the abundance of (predicted) ID in transcripts of various classes of disease-related genes. We further analyzed the relationships between ID and the occurrence of alternative splicing and Molecular Recognition Features (MoRFs) in human disease classes.

An important, never-before-addressed issue with such genome-wide applications of ID predictors is that, for less-studied organisms, in addition to the experimentally confirmed protein sequences there is a large number of putative sequences, which have been predicted with automated annotation procedures and lack experimental confirmation. In the human genome, these predicted sequences have significantly higher predicted disorder content. I investigated the hypothesis that this discrepancy is not real, and that it is instead due to incorrectly annotated parts of the putative protein sequences that exhibit some similarities to confirmed IDRs, which leads to high predicted ID content. I developed a procedure to create synthetic nonsense peptide sequences by translation of non-coding regions of genomic sequences and translation of coding regions with incorrect codon alignment. I further trained several classifiers to discriminate between confirmed sequences and synthetic nonsense sequences, and used these predictors to estimate the abundance of incorrectly annotated regions in putative sequences, as well as to explore the link between such regions and intrinsic disorder.

Sequence alignment is an essential tool in modern bioinformatics. Substitution matrices, such as the BLOSUM family, contain 20x20 parameters which are related to the evolutionary rates of amino acid substitutions. I explored various strategies for extension of sequence alignment to utilize the (predicted) disorder/structure information about the sequences being aligned. These strategies employ an extended 40-symbol alphabet which contains 20 symbols for amino acids in ordered regions and 20 symbols for amino acids in IDRs, as well as expanded 40x40 and 40x20 matrices. The new matrices exhibit significant and substantial differences in the substitution scores for IDRs and structured regions. Tests on a reference dataset show that 40x40 matrices perform worse than the standard 20x20 matrices, while 40x20 matrices, used in a scenario where ID is predicted for a query sequence but not for the target sequences, have at least comparable performance. However, I also demonstrate that the variations in performance between 20x20 and 40x20 matrices are insignificant compared to the variation in obtained matrices that occurs when the underlying algorithm for calculation of substitution matrices is changed.

Temple University--Theses
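
To make the extended alphabet concrete, here is a minimal sketch of one possible encoding, not the dissertation's own implementation. It assumes a convention the abstract does not specify: uppercase letters stand for residues in ordered regions and lowercase letters for residues in IDRs, which gives 40 symbols in total. The names `encode_with_disorder` and `score_pair`, and the toy scores, are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def encode_with_disorder(sequence, disordered):
    """Map residues to a 40-symbol alphabet.

    Assumed convention: uppercase = ordered region, lowercase = disordered region.
    `disordered` is a per-residue list of booleans (e.g. from an ID predictor).
    """
    if len(sequence) != len(disordered):
        raise ValueError("sequence and disorder mask must have the same length")
    encoded = []
    for residue, is_disordered in zip(sequence.upper(), disordered):
        if residue not in AMINO_ACIDS:
            raise ValueError(f"unsupported residue: {residue!r}")
        encoded.append(residue.lower() if is_disordered else residue)
    return "".join(encoded)


def score_pair(matrix, query_symbol, target_symbol, default=0):
    """Look up a 40x20-style score.

    The query symbol keeps its order/disorder case; the target is reduced to
    the plain 20-letter alphabet, mirroring the scenario where ID is predicted
    only for the query sequence.
    """
    return matrix.get((query_symbol, target_symbol.upper()), default)


# Toy example with made-up scores (not taken from any real matrix).
toy_matrix = {("K", "K"): 5, ("k", "K"): 3}
encoded = encode_with_disorder("MKTAY", [False, True, True, False, False])
ordered_score = score_pair(toy_matrix, "K", "K")
disordered_score = score_pair(toy_matrix, "k", "K")
```

Keeping the query in the 40-symbol form while collapsing the target to 20 symbols is what lets a 40x20 matrix be used without running a disorder predictor over the whole target database.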

