Dissertations / Theses on the topic 'Biology, Genetics|Biology, Bioinformatics|Computer Science'
Consult the top 50 dissertations / theses for your research on the topic 'Biology, Genetics|Biology, Bioinformatics|Computer Science.'
Wang, Jeremy R. "Analysis and Visualization of Local Phylogenetic Structure within Species." Thesis, The University of North Carolina at Chapel Hill, 2013. http://pqdtopen.proquest.com/#viewpdf?dispub=3562960.
While it is interesting to examine the evolutionary history and phylogenetic relationship between species, for example, in a sort of "tree of life", there is also a great deal to be learned from examining population structure and relationships within species. A careful description of phylogenetic relationships within species provides insights into causes of phenotypic variation, including disease susceptibility. The better we are able to understand the patterns of genotypic variation within species, the better these populations may be used as models to identify causative variants and possible therapies, for example through targeted genome-wide association studies (GWAS). My thesis describes a model of local phylogenetic structure, how it can be effectively derived under various circumstances, and useful applications and visualizations of this model to aid genetic studies.
I introduce a method for discovering phylogenetic structure among individuals of a population by partitioning the genome into a minimal set of intervals within which there is no evidence of recombination. I describe two extensions of this basic method. The first allows it to be applied to heterozygous, in addition to homozygous, genotypes and the second makes it more robust to errors in the source genotypes.
I demonstrate the predictive power of my local phylogeny model using a novel method for genome-wide genotype imputation. This imputation method achieves very high accuracy—on the order of the accuracy rate in the sequencing technology—by imputing genotypes in regions of shared inheritance based on my local phylogenies.
Comparative genomic analysis within species can be greatly aided by appropriate visualization and analysis tools. I developed a framework for web-based visualization and analysis of multiple individuals within a species, with my model of local phylogeny providing the underlying structure. I will describe the utility of these tools and the applications for which they have found widespread use.
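The interval partitioning described in this abstract can be illustrated with a minimal sketch. It assumes the classic four-gamete test as the "evidence of recombination" criterion and a greedy left-to-right scan; the abstract does not state which criterion or search strategy the thesis uses, and the names `compatible` and `partition_intervals` are hypothetical.

```python
def compatible(site_a, site_b):
    """Four-gamete test (assumed criterion): two biallelic sites show no
    evidence of recombination if fewer than four distinct allele pairs
    are observed across the samples."""
    return len(set(zip(site_a, site_b))) < 4


def partition_intervals(sites):
    """Greedy left-to-right scan over sites (one list of alleles per site).

    The current interval is extended until a new site is incompatible with
    some site already in it; a new interval then starts at that site.
    Returns (start, end) index pairs with inclusive ends.
    """
    intervals = []
    start = 0
    for j in range(1, len(sites)):
        if any(not compatible(sites[i], sites[j]) for i in range(start, j)):
            intervals.append((start, j - 1))
            start = j
    if sites:
        intervals.append((start, len(sites) - 1))
    return intervals


# Toy homozygous genotypes: rows are sites, columns are four samples.
toy_sites = [
    [0, 0, 1, 1],
    [0, 1, 0, 1],  # shows all four allele pairs with the site above
    [0, 1, 0, 1],
]
toy_intervals = partition_intervals(toy_sites)
```

On the toy data the first two sites display all four allele pairs, so the scan closes an interval after the first site. Handling heterozygous or error-prone genotypes, which the abstract mentions as extensions, is outside this sketch.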
Guturu, Harendra. "Deciphering human gene regulation using computational and statistical methods." Thesis, Stanford University, 2014. http://pqdtopen.proquest.com/#viewpdf?dispub=3581147.
It is estimated that at least 10-20% of the mammalian genome is dedicated towards regulating the 1-2% of the genome that codes for proteins. This non-coding, regulatory layer is a necessity for the development of complex organisms, but is poorly understood compared to the genetic code used to translate coding DNA into proteins. In this dissertation, I will discuss methods developed to better understand the gene regulatory layer. I begin, in Chapter 1, with a broad overview of gene regulation, the motivation for studying it, the state of the art in its historical context, and where to look going forward.
In Chapter 2, I discuss a computational method developed to detect transcription factor (TF) complexes. The method compares co-occurring motif spacings in conserved versus unconserved regions of the human genome to detect evolutionarily constrained binding sites of rigid transcription factor (TF) complexes. Structural data were integrated to explore overlapping motif arrangements while ensuring physical plausibility of the TF complex. Using this approach, I predicted 422 physically realistic TF complex motifs at 18% false discovery rate (FDR). I found that the set of complexes is enriched in known TF complexes. Additionally, novel complexes were supported by chromatin immunoprecipitation sequencing (ChIP-seq) datasets. Analysis of the structural modeling revealed three cooperativity mechanisms and a tendency of TF pairs to synergize through overlapping binding to the same DNA base pairs in opposite grooves or strands. The TF complexes and associated binding site predictions are made available as a web resource at http://complex.stanford.edu.
Next, in Chapter 3, I discuss how gene enrichment analysis can be applied to genome-wide conserved binding sites to successfully infer regulatory functions for a given TF complex. A genomic screen predicted 732,568 combinatorial binding sites for 422 TF complex motifs. From these predictions, I inferred 2,440 functional roles, which are consistent with known functional roles of TF complexes. In these functional associations, I found interesting themes such as promiscuous partnering of TFs (such as ETS) in the same functional context (T cells). Additionally, functional enrichment identified two novel TF complex motifs associated with spinal cord patterning genes and mammary gland development genes, respectively. Based on these predictions, I discovered novel spinal cord patterning enhancers (5/9, 56% validation rate) and enhancers active in MCF7 cells (11/19, 53% validation rate). This set replete with thousands of additional predictions will serve as a powerful guide for future studies of regulatory patterns and their functional roles.
Then, in Chapter 4, I outline a method developed to predict disease susceptibility due to gene mis-regulation. The method interrogates ensembles of conserved binding sites of regulatory factors disrupted by an individual's variants and then looks for their most significant congregation next to a group of functionally related genes. Strikingly, when the method is applied to five different full human genomes, the top enriched function for each is reflective of their very different medical histories. These results suggest that erosion of gene regulation results in function specific mutation loads that manifest as disease predispositions in a familial lineage. Additionally, this aggregate analysis method addresses the problem that although many human diseases have a genetic component involving many loci, the majority of studies are statistically underpowered to isolate the many contributing loci.
Finally, I conclude in Chapter 5 with a summary of my findings throughout my research and future directions of research based on my findings.
Brewer, Judy. "Metabolic Modeling of Inborn Errors of Metabolism: Carnitine Palmitoyltransferase II Deficiency and Respiratory Chain Complex I Deficiency." Thesis, Harvard University, 2015. http://nrs.harvard.edu/urn-3:HUL.InstRepos:24078365.
Zou, James Yang. "Algorithms and Models for Genome Biology." Thesis, Harvard University, 2014. http://dissertations.umi.com/gsas.harvard:11280.
Mathematics
Nicol, Megan E. "Unraveling the Nexus: Investigating the Regulatory Genetic Networks of Hereditary Ataxias." Ohio University Honors Tutorial College / OhioLINK, 2014. http://rave.ohiolink.edu/etdc/view?acc_num=ouhonors1400604580.
Kiritchenko, Svetlana. "Hierarchical text categorization and its application to bioinformatics." Thesis, University of Ottawa (Canada), 2006. http://hdl.handle.net/10393/29298.
Parmidge, Amelia J. "NEPIC, a Semi-Automated Tool with a Robust and Extensible Framework that Identifies and Tracks Fluorescent Image Features." Thesis, Mills College, 2014. http://pqdtopen.proquest.com/#viewpdf?dispub=1556025.
As fluorescent imaging techniques for biological systems have advanced in recent years, scientists have used fluorescent imaging more and more to capture the state of biological systems at different moments in time. For many researchers, analysis of the fluorescent image data has become the limiting factor of this new technique. Although identification of fluorescing neurons in an image is (seemingly) easily done by the human visual system, manual delineation of the exact pixels comprising these fluorescing regions of interest (or fROIs) in digital images does not scale up well, being time-consuming, reiterative, and error-prone. This thesis introduces NEPIC, the Neuron-to-Environment Pixel Intensity Calculator, which seeks to help resolve this issue. NEPIC is a semi-automated tool for finding and tracking the cell body of a single neuron over an entire movie of grayscale calcium image data. NEPIC also provides a highly extensible, open source framework that could easily support finding and tracking other kinds of fROIs. When tested on calcium image movies of the AWC neuron in C. elegans under highly variant conditions, NEPIC correctly identified the neuronal cell body in 95.48% of the movie frames, and successfully tracked this cell body feature across 98.60% of the frame transitions in the movies. Although support for finding and tracking multiple fROIs has yet to be implemented, NEPIC displays promise as a tool for assisting researchers in the bulk analysis of fluorescent imaging data.
Daniels, Noah Manus. "Remote Homology Detection in Proteins Using Graphical Models." Thesis, Tufts University, 2013. http://pqdtopen.proquest.com/#viewpdf?dispub=3563611.
Given the amino acid sequence of a protein, researchers often infer its structure and function by finding homologous, or evolutionarily-related, proteins of known structure and function. Since structure is typically more conserved than sequence over long evolutionary distances, recognizing remote protein homologs from their sequence poses a challenge.
We first consider all proteins of known three-dimensional structure, and explore how they cluster according to different levels of homology. An automatic computational method reasonably approximates a human-curated hierarchical organization of proteins according to their degree of homology.
Next, we return to homology prediction, based only on the one-dimensional amino acid sequence of a protein. Menke, Berger, and Cowen proposed a Markov random field model to predict remote homology for beta-structural proteins, but their formulation was computationally intractable on many beta-strand topologies.
We show two different approaches to approximate this random field, both of which make it computationally tractable, for the first time, on all protein folds. One method simplifies the random field itself, while the other retains the full random field, but approximates the solution through stochastic search. Both methods achieve improvements over the state of the art in remote homology detection for beta-structural protein folds.
Chen, Hui 1974. "Algorithms and statistics for the detection of binding sites in coding regions." Thesis, McGill University, 2006. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=97926.
The inter-species sequence conservation observed in coding regions may be the result of two types of selective pressure: the selective pressure on the protein encoded and, sometimes, the selective pressure on the binding sites. To predict some region in coding regions as a binding site, one needs to make sure that the conservation observed in this region is not due to the selective pressure on the protein encoded. To achieve this, COSMO built a null model with only the selective pressure on the protein encoded and computed p-values for the observed conservation scores, conditional on the fixed set of amino acids observed at the leaves.
It is believed, however, that the selective pressure on the protein assumed in COSMO is overly strong. Consequently, some interesting regions may be left undetected. In this thesis, a new method, COSMO-2, is developed to relax this assumption.
The amino acids are first classified into a fixed number of overlapping functional classes by applying an expectation maximization algorithm on a protein database. Two probabilities for each gene position are then calculated: (i) the probability of observing a certain degree of conservation in the orthologous sequences generated under each class in the null model (i.e. the p-value of the observed conservation under each class); and (ii) the probability that the codon column associated with that gene position belongs to each class. The p-value of the observed conservation for each gene position is the sum of the products of the two probabilities for all classes. Regions with low p-values are identified as potential binding sites.
Five sets of orthologous genes are analyzed using COSMO-2. The results show that COSMO-2 can detect the interesting regions identified by COSMO and can detect more interesting regions than COSMO in some cases.
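The per-position p-value described in this abstract is a sum over classes of two factors: the p-value of the observed conservation under each class, and the probability that the codon column belongs to that class. A minimal sketch of that combination follows; the function name, argument names, and the example numbers are hypothetical, and the sketch does not attempt the class learning or the null model themselves.

```python
def position_p_value(p_value_by_class, class_probability):
    """Combine per-class p-values into one p-value for a gene position.

    p_value_by_class[k]  : p-value of the observed conservation under class k
    class_probability[k] : probability that the codon column belongs to class k
    Both inputs are assumed to be precomputed elsewhere.
    """
    if len(p_value_by_class) != len(class_probability):
        raise ValueError("one p-value and one probability are needed per class")
    return sum(p * w for p, w in zip(p_value_by_class, class_probability))


# Illustrative numbers only: two classes, the first much more likely.
example_p = position_p_value([0.01, 0.5], [0.9, 0.1])
```

With these illustrative inputs the combined value is 0.9 × 0.01 + 0.1 × 0.5 = 0.059; positions whose combined value falls below a chosen cutoff would be the candidate binding-site regions.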
Chen, Xiaoyu 1974. "Computational detection of tissue-specific cis-regulatory modules." Thesis, McGill University, 2006. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=97927.
It is believed that tissue-specific cis-regulatory modules (CRMs) tend to regulate nearby genes in a certain tissue and that they consist of binding sites for transcription factors (TFs) that are also expressed in that tissue. These facts allow us to make use of tissue-specific gene expression data to detect tissue-specific CRMs and improve the specificity of module prediction.
We build a Bayesian network to integrate the sequence information about TF binding sites and the expression information about TFs and regulated genes. The network is then used to infer whether a given genomic region indeed has regulatory activity in a given tissue. A novel EM algorithm incorporating probability tree learning is proposed to train the Bayesian network in an unsupervised way. A new probability tree learning algorithm is developed to learn the conditional probability distribution for a variable in the network that has a large number of hidden variables as its parents.
Our approach is evaluated using biological data, and the results show that it is able to correctly discriminate among human liver-specific modules, erythroid-specific modules, and negative-control regions, even though no prior knowledge about the TFs and the target genes is employed in our algorithm. On a genome-wide scale, our network is trained to identify tissue-specific CRMs in ten tissues. Some known tissue-specific modules are rediscovered, and a set of novel modules is predicted to be related to tissue-specific expression.
Siek, Katie A. "The design and evaluation of an assistive application for dialysis patients." [Bloomington, Ind.] : Indiana University, 2006. http://gateway.proquest.com/openurl?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation&res_dat=xri:pqdiss&rft_dat=xri:pqdiss:3223070.
"Title from dissertation home page (viewed June 28, 2007)." Source: Dissertation Abstracts International, Volume: 67-06, Section: B, page: 3242. Adviser: Kay H. Connelly.
Ahlert, Darla. "Application of Graph Theoretic Clustering on Some Biomedical Data Sets." Thesis, Southern Illinois University at Edwardsville, 2015. http://pqdtopen.proquest.com/#viewpdf?dispub=1588658.
Clustering algorithms have become a popular way to analyze biomedical data sets and in particular, gene expression data. Since these data sets are often large, it is difficult to gather useful information from them as a whole. Clustering is a proven method to extract knowledge about the data that can eventually lead to many discoveries in the biological world. Hierarchical clustering is used frequently to interpret gene expression data, but recently, graph-theoretic clustering algorithms have started to gain some traction for analysis of this type of data. We consider five graph-theoretic clustering algorithms run over a post-mortem gene expression dataset, as well as a few different biomedical data sets, in which the ground truth, or class label, is known for each data point. We then externally evaluate the algorithms based on the accuracy of the resulting clusters against the ground truth clusters. Comparing the results of each of the algorithms run over all of the datasets, we found that our algorithms are efficient on the real biomedical datasets but find gene expression data especially difficult to handle.
Roberts, Adam. "Ambiguous fragment assignment for high-throughput sequencing experiments." Thesis, University of California, Berkeley, 2014. http://pqdtopen.proquest.com/#viewpdf?dispub=3616509.
As the cost of short-read, high-throughput DNA sequencing continues to fall rapidly, new uses for the technology have been developed aside from its original purpose in determining the genome of various species. Many of these new experiments use the sequencer as a digital counter for measuring biological activities such as gene expression (RNA-Seq) or protein binding (ChIP-Seq).
A common problem faced in the analysis of these data is that of sequenced fragments that are "ambiguous", meaning they resemble multiple loci in a reference genome or other sequence. In early analyses, such ambiguous fragments were ignored or were assigned to loci using simple heuristics. However, statistical approaches using maximum likelihood estimation have been shown to greatly improve the accuracy of downstream analyses and have become widely adopted. Optimization procedures based on the expectation-maximization (EM) algorithm are often employed by these methods to find the optimal sets of alignments, with frequent enhancements to the model. Nevertheless, these improvements increase complexity, which, along with an exponential growth in the size of sequencing datasets, has led to new computational challenges.
Herein, we present our model for ambiguous fragment assignment for RNA-Seq, which includes the most comprehensive set of parameters of any model introduced to date, as well as various methods we have explored for scaling our optimization procedure. These methods include the use of an online EM algorithm and a distributed EM solution implemented on the Spark cluster computing system. Our advances have resulted in the first efficient solution to the problem of fragment assignment in sequencing.
Furthermore, we are the first to create a fully generalized model for ambiguous fragment assignment and present details on how our method can provide solutions for additional high-throughput sequencing assays including ChIP-Seq, Allele-Specific Expression (ASE), and the detection of RNA-DNA Differences (RDDs) in RNA-Seq.
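The EM-based assignment of ambiguous fragments mentioned in this abstract can be illustrated with a textbook batch EM over target abundances. This is a deliberately stripped-down sketch, not the thesis's model: it ignores target length, bias, error, and the other parameters the abstract alludes to, and it is neither the online nor the Spark-distributed variant. The function name and input layout are hypothetical.

```python
def em_abundances(compat, n_targets, n_iter=100):
    """Estimate relative target abundances from ambiguous fragments.

    compat    : list with one entry per fragment, each a non-empty list of
                target indices the fragment is compatible with
    n_targets : number of candidate targets (for example, transcripts)
    Assumes at least one fragment is given.
    """
    theta = [1.0 / n_targets] * n_targets  # start from uniform abundances
    for _ in range(n_iter):
        counts = [0.0] * n_targets
        # E-step: split each fragment across its compatible targets in
        # proportion to the current abundance estimates.
        for targets in compat:
            total = sum(theta[t] for t in targets)
            for t in targets:
                counts[t] += theta[t] / total
        # M-step: renormalise the expected counts into abundances.
        n = sum(counts)
        theta = [c / n for c in counts]
    return theta


# Two fragments unique to target 0, one unique to target 1, one ambiguous.
toy_theta = em_abundances([[0], [0], [0, 1], [1]], n_targets=2)
```

In the toy input the ambiguous fragment is gradually apportioned mostly to target 0, and the estimates settle near 2/3 and 1/3.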
Huang, Huanhua. "Taxonomic assignment of gene sequences using hidden Markov models." Thesis, Northern Arizona University, 2014. http://pqdtopen.proquest.com/#viewpdf?dispub=1563863.
Our ability to study communities of microorganisms has been vastly improved by the development of high-throughput DNA sequencers. These technologies, however, can only sequence short fragments of organisms' genomes at a time, which introduces many challenges in translating sequencing results into biological insight. The field of bioinformatics has arisen in part to address these problems.
One bioinformatics problem is assigning a genetic sequence to a source organism. It is now common to use high-throughput, short-read sequencing technologies, such as the Illumina MiSeq, to sequence the 16S rRNA gene from a community of microorganisms. Researchers use this information to generate a profile of the different microbial organisms (i.e., the taxonomic composition) present in an environmental sample. There are a number of approaches for assigning taxonomy to genetic sequences, but all suffer from problems with accuracy. The methods that have been most widely used are pairwise alignment methods, like BLAST, UCLUST, and RTAX, and probability-based methods, such as RDP and MOTHUR. These methods can classify microbial sequences with high accuracy when sequences are long (e.g., a thousand bases); however, accuracy decreases as sequences get shorter. Current high-throughput sequencing technologies generate sequences between about 150 and 500 bases in length.
In my thesis I have developed new software for assigning taxonomy to short DNA sequences using profile Hidden Markov Models (HMMs). HMMs have been applied in related areas, such as assigning biological functions to protein sequences, and I hypothesize that they might be useful for achieving high-accuracy taxonomic assignments from 16S rRNA gene sequences. My method builds models of 16S rRNA sequences for different taxonomic groups (kingdom, phylum, class, order, family, genus and species) using the Greengenes 16S rRNA database. Given a sequence with unknown taxonomic origin, my method searches each kingdom model to determine the most likely kingdom. It then searches all of the phyla within the highest scoring kingdom to determine the most likely phylum. This iterative process continues until the sequence cannot be assigned at a taxonomic level with a user-defined confidence level, or until a species-level assignment is made that meets the user-defined confidence level.
I next evaluated this method on both artificial and real microbial community data, with both qualitative and quantitative metrics of method performance. The evaluation results showed that in the qualitative analyses (specificity and sensitivity) my method is not as good as the previously existing methods. However, the accuracy in the quantitative analysis was better than some other pre-existing methods. This suggests that my current implementation is sensitive to false positives, but is better at classifying more sequences than the other methods.
I present my method, my evaluations, and suggestions for next steps that might improve the performance of my HMM-based taxonomic classifier.
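The top-down search described in this abstract (pick the best kingdom, then the best phylum within it, and so on until the confidence threshold is no longer met) can be sketched as follows. The scoring step is a stand-in: `confidence_fn` is a hypothetical callable that would wrap a profile HMM search and return a confidence between 0 and 1, and the abstract does not say how that confidence is computed. The taxonomy fragment and numbers are illustrative only.

```python
def assign_taxonomy(sequence, children, confidence_fn, min_confidence, root="root"):
    """Descend the taxonomy one rank at a time.

    children       : dict mapping a taxon name to the list of its child taxa
    confidence_fn  : hypothetical callable (taxon, sequence) -> confidence
    min_confidence : user-defined threshold; the descent stops at the first
                     rank whose best-scoring taxon falls below it
    Returns the assigned lineage, from the highest rank downwards.
    """
    lineage = []
    node = root
    while children.get(node):
        best = max(children[node], key=lambda taxon: confidence_fn(taxon, sequence))
        if confidence_fn(best, sequence) < min_confidence:
            break
        lineage.append(best)
        node = best
    return lineage


# Toy taxonomy and fixed confidences standing in for HMM search results.
toy_children = {
    "root": ["Bacteria", "Archaea"],
    "Bacteria": ["Firmicutes", "Proteobacteria"],
}
toy_confidence = {"Bacteria": 0.99, "Archaea": 0.2,
                  "Firmicutes": 0.6, "Proteobacteria": 0.3}


def toy_confidence_fn(taxon, sequence):
    return toy_confidence[taxon]


strict = assign_taxonomy("ACGT", toy_children, toy_confidence_fn, 0.8)
lenient = assign_taxonomy("ACGT", toy_children, toy_confidence_fn, 0.5)
```

With the stricter threshold the toy sequence is assigned only at the kingdom level, while the more lenient threshold lets the descent continue to the phylum.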
Thavappiragasam, Mathialakan. "A web semantic for SBML merge." Thesis, University of South Dakota, 2014. http://pqdtopen.proquest.com/#viewpdf?dispub=1566784.
The manipulation of XML-based relational representations of biological systems (BioML, for Bioscience Markup Language) is a big challenge in systems biology. The needs of biologists, like translational study of biological systems, cause their challenges to become greater due to the material received in next generation sequencing. Among these BioMLs, SBML is the de facto standard file format for the storage and exchange of quantitative computational models in systems biology, supported by more than 257 software packages to date. The SBML standard is used by several biological systems modeling tools and several databases for representation and knowledge sharing. Several subsystems are integrated in order to construct a complex biosystem. The issue of combining biological subsystems by merging SBML files has been addressed in several algorithms and tools. But it remains impossible to build an automatic merge system that implements reusability, flexibility, scalability and sharability. The technique existing algorithms use is name-based component comparison. This does not allow integration into a Workflow Management System (WMS) to build pipelines and also does not include the mapping of quantitative data needed for a good analysis of the biological system. In this work, we present a deterministic merging algorithm that is consumable in a given WMS engine, and designed using a novel biological model similarity algorithm. This model merging system is designed with integration of four submodules: SBMLChecker, SBMLAnot, SBMLCompare, and SBMLMerge, for model quality checking, annotation, comparison, and merging respectively. The tools are integrated into the BioExtract server leveraging iPlant collaborative resources to support users by allowing them to process large models and design workflows. These tools are also embedded into a user-friendly online version, SW4SBMLm.
Youngs, Noah. "Positive-Unlabeled Learning in the Context of Protein Function Prediction." Thesis, New York University, 2014. http://pqdtopen.proquest.com/#viewpdf?dispub=3665223.
With the recent proliferation of large, unlabeled data sets, a particular subclass of semisupervised learning problems has become more prevalent. Known as positive-unlabeled learning (PU learning), this scenario provides only positive labeled examples, usually just a small fraction of the entire dataset, with the remaining examples unknown and thus potentially belonging to either the positive or negative class. Since the vast majority of traditional machine learning classifiers require both positive and negative examples in the training set, a new class of algorithms has been developed to deal with PU learning problems.
A canonical example of this scenario is topic labeling of a large corpus of documents. Once the size of a corpus reaches into the thousands, it becomes largely infeasible to have a curator read even a sizable fraction of the documents, and annotate them with topics. In addition, the entire set of topics may not be known, or may change over time, making it impossible for a curator to annotate which documents are NOT about certain topics. Thus a machine learning algorithm needs to be able to learn from a small set of positive examples, without knowledge of the negative class, and knowing that the unlabeled training examples may contain an arbitrary number of additional but as yet unknown positive examples.
Another example of a PU learning scenario recently garnering attention is the protein function prediction problem (PFP problem). While the number of organisms with fully sequenced genomes continues to grow, the progress of annotating those sequences with the biological functions that they perform lags far behind. Machine learning methods have already been successfully applied to this problem, but with many organisms having a small number of positive annotated training examples, and the lack of availability of almost any labeled negative examples, PU learning algorithms have the potential to make large gains in predictive performance.
The first part of this dissertation motivates the protein function prediction problem, explores previous work, and introduces novel methods that improve upon previously reported benchmarks for a particular type of learning algorithm, known as Gaussian Random Field Label Propagation (GRFLP). In addition, we present improvements to the computational efficiency of the GRFLP algorithm, and a modification to the traditional structure of the PFP learning problem that allows for simultaneous prediction across multiple species.
The second part of the dissertation focuses specifically on the positive-unlabeled aspects of the PFP problem. Two novel algorithms are presented, and rigorously compared to existing PU learning techniques in the context of protein function prediction. Additionally, we take a step back and examine some of the theoretical considerations of the PU scenario in general, and provide an additional novel algorithm applicable in any PU context. This algorithm is tailored for situations in which the labeled positive examples are a small fraction of the set of true positive examples, and where the labeling process may be subject to some type of bias rather than being a random selection of true positives (arguably some of the most difficult PU learning scenarios).
The third and fourth sections return to the PFP problem, examining the power of tertiary structure as a predictor of protein function, as well as presenting two case studies of function prediction performance on novel benchmarks. Lastly, we conclude with several promising avenues of future research into both PU learning in general, and the protein function prediction problem specifically.
Lee, Lawrence Chet-Lun. "Text mining of point mutation information from biomedical literature." Diss., Search in ProQuest Dissertations & Theses. UC Only, 2008. http://gateway.proquest.com/openurl?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation&res_dat=xri:pqdiss&rft_dat=xri:pqdiss:3339194.
Hudson, Cody Landon. "Protein structure analysis and prediction utilizing the Fuzzy Greedy K-means Decision Forest model and Hierarchically-Clustered Hidden Markov Models method." Thesis, University of Central Arkansas, 2014. http://pqdtopen.proquest.com/#viewpdf?dispub=1549796.
Structural genomics is a field of study that strives to derive and analyze the structural characteristics of proteins through means of experimentation and prediction using software and other automatic processes. Alongside implications for more effective drug design, the main motivation for structural genomics concerns the elucidation of each protein’s function, given that the structure of a protein almost completely governs its function. Historically, the approach to derive the structure of a protein has been through exceedingly expensive, complex, and time-consuming methods such as x-ray crystallography and nuclear magnetic resonance (NMR) spectroscopy.
In response to the inadequacies of these methods, three families of approaches developed in a relatively new branch of computer science known as bioinformatics. The aforementioned families include threading, homology-modeling, and the de novo approach. However, even these methods fail either due to impracticalities, the inability to produce novel folds, rampant complexity, inherent limitations, etc. In their stead, this work proposes the Fuzzy Greedy K-means Decision Forest model, which utilizes sequence motifs that transcend protein family boundaries to predict local tertiary structure, such that the method is cheap, effective, and can produce semi-novel folds due to its local (rather than global) prediction mechanism. This work further extends the FGK-DF model with a new algorithm, the Hierarchically Clustered-Hidden Markov Models (HC-HMM) method to extract protein primary sequence motifs in a more accurate and adequate manner than currently exhibited by the FGK-DF model, allowing for more accurate and powerful local tertiary structure predictions. Both algorithms are critically examined, their methodology thoroughly explained and tested against a consistent data set, the results thereof discussed at length.
Dinh, Hieu Trung. "Algorithms for DNA Sequence Assembly and Motif Search." University of Connecticut, 2013.
Bliven, Spencer Edward. "Structure-Preserving Rearrangements| Algorithms for Structural Comparison and Protein Analysis." Thesis, University of California, San Diego, 2015. http://pqdtopen.proquest.com/#viewpdf?dispub=3716489.
Protein structure is fundamental to a deep understanding of how proteins function. Since structure is highly conserved, structural comparison can provide deep information about the evolution and function of protein families. The Protein Data Bank (PDB) continues to grow rapidly, providing copious opportunities for advancing our understanding of proteins through large-scale searches and structural comparisons. In this work I present several novel structural comparison methods for specific applications, as well as apply structure comparison tools systematically to better understand global properties of protein fold space.
Circular permutation describes a relationship between two proteins where the N-terminal portion of one protein is related to the C-terminal portion of the other. Proteins that are related by a circular permutation generally share the same structure despite the rearrangement of their primary sequence. This non-sequential relationship makes them difficult for many structure alignment tools to detect. Combinatorial Extension for Circular Permutations (CE-CP) was developed to align proteins that may be related by a circular permutation. It is widely available due to its incorporation into the RCSB PDB website.
Symmetry and structural repeats are common in protein structures at many levels. The CE-Symm tool was developed in order to detect internal pseudosymmetry within individual polypeptide chains. Such internal symmetry can arise from duplication events, so aligning the individual symmetry units provides insights about conservation and evolution. In many cases, internal symmetry can be shown to be important for a number of functions, including ligand binding, allostery, folding, stability, and evolution.
Structural comparison tools were applied comprehensively across all PDB structures for systematic analysis. Pairwise structural comparisons of all proteins in the PDB have been computed using the Open Science Grid computing infrastructure, and are kept continually up-to-date with the release of new structures. These provide a network-based view of protein fold space. CE-Symm was also applied to systematically survey the PDB for internally symmetric proteins. It is able to detect symmetry in ~20% of all protein families. Such PDB-wide analyses give insights into the complex evolution of protein folds.
Westbrook, Anthony. "The Paladin Suite| Multifaceted Characterization of Whole Metagenome Shotgun Sequences." Thesis, University of New Hampshire, 2018. http://pqdtopen.proquest.com/#viewpdf?dispub=10685940.
Whole metagenome shotgun sequencing is a powerful approach for assaying many aspects of microbial communities, including the functional and symbiotic potential of each contributing community member. The research community currently lacks tools that efficiently align DNA reads against protein references, the technique necessary for constructing functional profiles. This thesis details the creation of PALADIN, a novel modification of the Burrows-Wheeler Aligner that provides orders-of-magnitude improved efficiency by directly mapping in protein space. In addition to performance considerations, utilizing PALADIN and associated tools as the foundation of metagenomic pipelines also allows for novel characterization and downstream analysis.
The accuracy and efficiency of PALADIN were compared against existing applications that employ nucleotide or protein alignment algorithms. Using both simulated and empirically obtained reads, PALADIN consistently outperformed all compared alignment tools across a variety of metrics, mapping reads nearly 8,000 times faster than the widely utilized protein aligner, BLAST. A variety of analysis techniques were demonstrated using this data, including detecting horizontal gene transfer, performing taxonomic grouping, and generating declustered references.
Jain, Mudita 1968. "Algorithms for physical mapping using unique probes." Diss., The University of Arizona, 1996. http://hdl.handle.net/10150/290622.
Altman, Erik R. (Erik Richter). "Genetic algorithms and cache replacement policy." Thesis, McGill University, 1991. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=61096.
If better replacement policies exist, they may not be obvious. One way to find better policies is to study a large number of address traces for common patterns. Such an undertaking involves so much data that some automated method of generating and evaluating policies is required. Genetic Algorithms provide such a method, and have been used successfully on a wide variety of tasks (21).
The best replacement policy found using this approach had a mean improvement in overall hit rate of 0.6% over LRU for the benchmarks used. This corresponds to 27% of the 2.2% mean difference between LRU and OPT. Performance of the best of these replacement policies was found to be generally superior to shadow cache (33), an enhanced replacement policy similar to some of those used here.
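To make the comparison above concrete, the sketch below computes hit rates for LRU and for OPT (evict the block whose next use is farthest in the future) on a toy address trace, and checks the quoted ratio of 0.6 to 2.2. The trace, the capacity, and all function names are assumptions made for illustration; the thesis's benchmarks and its genetic-algorithm search are not reproduced here.

```python
from collections import OrderedDict


def lru_hit_rate(trace, capacity):
    """Hit rate of a fully associative LRU cache over a list of addresses."""
    cache = OrderedDict()
    hits = 0
    for addr in trace:
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)        # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict least recently used
            cache[addr] = True
    return hits / len(trace)


def opt_hit_rate(trace, capacity):
    """Hit rate of the offline optimal policy on the same trace."""
    cache = set()
    hits = 0
    for i, addr in enumerate(trace):
        if addr in cache:
            hits += 1
            continue
        if len(cache) >= capacity:
            def next_use(block):
                # Position of the next reference, or infinity if none.
                try:
                    return trace.index(block, i + 1)
                except ValueError:
                    return float("inf")
            cache.remove(max(cache, key=next_use))
        cache.add(addr)
    return hits / len(trace)


def gap_closed(gain_over_lru, lru_to_opt_gap):
    """Fraction of the LRU-to-OPT gap recovered by a policy."""
    return gain_over_lru / lru_to_opt_gap


toy_trace = [1, 2, 3, 1, 2, 4, 1, 2, 3, 4]  # hypothetical trace
print(lru_hit_rate(toy_trace, 3), opt_hit_rate(toy_trace, 3))
print(round(gap_closed(0.6, 2.2) * 100))    # figures quoted in the abstract
```

OPT needs knowledge of future references, so it serves only as an upper bound; a policy's gain over LRU divided by the LRU-to-OPT gap is a natural way to express how much of the available headroom the policy captures.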
Yang, Qian 1973. "RNA sequence alignment and secondary structure prediction." Thesis, McGill University, 2005. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=82453.
Tang, Zuojian 1967. "Identifying mouse genes putatively transcriptionally regulated by the glucocorticoid receptor." Thesis, McGill University, 2005. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=82437.
Chen, Huiling Zhou Huan Xiang Ferrone Frank A. "Prediction of protein structures and protein-protein interactions : a bioinformatics approach /." Philadelphia, Pa. : Drexel University, 2005. http://dspace.library.drexel.edu/handle/1860/481.
Tcheimegni, Elie. "Kernel Based Relevance Vector Machine for Classification of Diseases." Thesis, Bowie State University, 2013. http://pqdtopen.proquest.com/#viewpdf?dispub=3558597.
Motivated by the improved characterization of diseases and cancers that an ability to predict the occurrence of related syndromes would facilitate, this work employs a data-driven approach to developing cancer classification/prediction models using the Relevance Vector Machine (RVM), a probabilistic kernel-based learning machine.
Drawing from the work of Bertrand Luvision, Chao Dong, and the classification of electrocardiogram signals by S. Karpagachelvi, which show the superiority of the RVM approach over traditional classifiers, the problem addressed in this research is to design a program that pipes components together in graphical workflows. Such a program could help improve the classification/regression accuracy of two modeling methods (support vector machines and kernel-based relevance vector machines) for better prediction of related diseases, and then compare the two methods using clinical data.
Would applying the relevance vector machine to the classification of these data improve their coverage? We developed a hierarchical Bayesian model for binary and bivariate data classification using the RBF and sigmoid kernels, with different parameterizations and varied thresholds. The parameters of the kernel function are considered as model parameters. The findings allow us to conclude that RVM is almost equal to SVM in training efficiency and classification accuracy, but RVM performs better in sparsity, generalization ability, and decision speed.
Meanwhile, the use of RVM raises some issues because it uses fewer support vectors, yet it trains much faster with a non-linear kernel than SVM-light. Finally, we test those approaches on a corpus of publicly released phenotype data. Further research with more patients' data is needed to improve prediction accuracy. Appendices provide the SVM and RVM derivations in detail. One important area of focus is the development of models for predicting cancers.
Keywords: Support Vector Machines, Relevance Vector Machine, RapidMiner, Tanagra, accuracy values.
Alouani, David James. "THE AGING PROCESS OF C. ELEGANS VIEWED THROUGH TIME DEPENDENT PROTEIN EXPRESSION ANALYSIS." Case Western Reserve University School of Graduate Studies / OhioLINK, 2015. http://rave.ohiolink.edu/etdc/view?acc_num=case1436393267.
Rajabi, Zeyad. "BIAS : bioinformatics integrated application software and discovering relationships between transcription factors." Thesis, McGill University, 2004. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=81427.
Wu, Chao. "Intelligent Data Mining on Large-scale Heterogeneous Datasets and its Application in Computational Biology." University of Cincinnati / OhioLINK, 2014. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1406880774.
Batt, Gregory. "Design, optimization and control in systems and synthetic biology." Habilitation à diriger des recherches, Université Paris-Diderot - Paris VII, 2014. http://tel.archives-ouvertes.fr/tel-00958566.
Takane, Marina. "Inference of gene regulatory networks from large scale gene expression data." Thesis, McGill University, 2003. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=80883.
Hu, Chunxiao. "Microfluidic electrophysiological device for genetic and chemical biology screening of nematodes." Thesis, University of Southampton, 2013. https://eprints.soton.ac.uk/368250/.
Qiu, Shuhao. "Computational Simulation and Analysis of Mutations: Nucleotide Fixation, Allelic Age and rare Genetic Variations in population." University of Toledo Health Science Campus / OhioLINK, 2016. http://rave.ohiolink.edu/etdc/view?acc_num=mco1430494327.
Krohn, Jonathan Jacob Pastushchyn. "Genes contributing to variation in fear-related behaviour." Thesis, University of Oxford, 2013. http://ora.ox.ac.uk/objects/uuid:1e8e40bd-9a98-405f-9463-e9423f0a60ca.
Seidel, Richard Alan. "Conservation Biology of the Gammarus pecos Species Complex: Ecological Patterns across Aquatic Habitats in an Arid Ecosystem." Miami University / OhioLINK, 2009. http://rave.ohiolink.edu/etdc/view?acc_num=miami1251472290.
Kuntala, Prashant Kumar. "Optimizing Biomarkers From an Ensemble Learning Pipeline." Ohio University / OhioLINK, 2017. http://rave.ohiolink.edu/etdc/view?acc_num=ohiou1503592057943043.
Bebek, Gurkan. "Functional Characteristics of Cancer Driver Genes in Colorectal Cancer." Case Western Reserve University School of Graduate Studies / OhioLINK, 2017. http://rave.ohiolink.edu/etdc/view?acc_num=case1495012693440067.
DeBlasio, Daniel. "NEW COMPUTATIONAL APPROACHES FOR MULTIPLE RNA ALIGNMENT AND RNA SEARCH." Master's thesis, University of Central Florida, 2009. http://digital.library.ucf.edu/cdm/ref/collection/ETD/id/4070.
M.S.
School of Electrical Engineering and Computer Science
Engineering and Computer Science
Computer Science MS
Yerardi, Jason T. "The Implementation and Evaluation of Bioinformatics Algorithms for the Classification of Arabinogalactan-Proteins in Arabidopsis thaliana." Ohio University / OhioLINK, 2011. http://rave.ohiolink.edu/etdc/view?acc_num=ohiou1301069861.
Deeter, Anthony E. "A Web-Based Software System Utilizing Consensus Networks to Infer Gene Interactions." University of Akron / OhioLINK, 2018. http://rave.ohiolink.edu/etdc/view?acc_num=akron152302071289795.
Evans, Daniel T. "A SNP Microarray Analysis Pipeline Using Machine Learning Techniques." Ohio University / OhioLINK, 2010. http://rave.ohiolink.edu/etdc/view?acc_num=ohiou1289950347.
Ford, Colby Tyler. "An Integrated Phylogeographic Analysis of the Bantu Migration." Thesis, The University of North Carolina at Charlotte, 2018. http://pqdtopen.proquest.com/#viewpdf?dispub=10748780.
"Bantu" is a term used to describe lineages of people in around 600 different ethnic groups on the African continent ranging from modern-day Cameroon to South Africa. The migration of the Bantu people, which occurred around 3,000 years ago, was influential in spreading culture, language, and genetic traits and helped to shape human diversity on the continent. Research in the 1970s was completed to geographically divide the Bantu languages into 16 zones now known as "Guthrie zones" (Guthrie, 1971).
Researchers have postulated the migratory pattern of the Bantu people by examining cultural information, linguistic traits, or small genetic datasets. These studies offer differing results due to variations in the data type used. Here, an assessment of the Bantu migration is made using a large dataset of combined cultural data and genetic (Y-chromosomal and mitochondrial) data.
One working hypothesis is that the Bantu expansion can be characterized by a primary split in lineages, which occurred early on and prior to the population spreading south through what is now called the Congolese forest (i.e. "early split"). A competing hypothesis is that the split occurred south of the forest (i.e. "late split").
Using the comprehensive dataset, a phylogenetic tree was developed on which to reconstruct the relationships of the Bantu lineages. With an understanding of these lineages in hand, the changes between Guthrie zones were traced geospatially.
Evidence supporting the "early split" hypothesis was found; however, evidence for several complex and convoluted paths across the continent was also found. These findings were then analyzed using dimensionality reduction and machine learning techniques to further understand the confidence of the model.
Stanfield, Zachary. "Comprehensive Characterization of the Transcriptional Signaling of Human Parturition through Integrative Analysis of Myometrial Tissues and Cell Lines." Case Western Reserve University School of Graduate Studies / OhioLINK, 2019. http://rave.ohiolink.edu/etdc/view?acc_num=case1562863761406809.
Ramraj, Varun. "Exploiting whole-PDB analysis in novel bioinformatics applications." Thesis, University of Oxford, 2014. http://ora.ox.ac.uk/objects/uuid:6c59c813-2a4c-440c-940b-d334c02dd075.
Kalluru, Vikram Gajanan. "Identify Condition Specific Gene Co-expression Networks." The Ohio State University, 2012. http://rave.ohiolink.edu/etdc/view?acc_num=osu1338304258.
Dabdoub, Shareef Majed. "Applied Visual Analytics in Molecular, Cellular, and Microbiology." The Ohio State University, 2011. http://rave.ohiolink.edu/etdc/view?acc_num=osu1322602183.
Zhong, Cuncong. "Computational Methods for Comparative Non-coding RNA Analysis: From Structural Motif Identification to Genome-wide Functional Classification." Doctoral diss., University of Central Florida, 2013. http://digital.library.ucf.edu/cdm/ref/collection/ETD/id/5894.
Ph.D.
Doctorate
Computer Science
Engineering and Computer Science
Computer Science
Dutta, Sara. "A multi-scale computational investigation of cardiac electrophysiology and arrhythmias in acute ischaemia." Thesis, University of Oxford, 2014. http://ora.ox.ac.uk/objects/uuid:f5f68d8b-7a60-4109-91c8-6b1d80c7ee5b.
Hayes, Matthew. "Algorithms to Resolve Large Scale and Complex Structural Variants in the Human Genome." Case Western Reserve University School of Graduate Studies / OhioLINK, 2013. http://rave.ohiolink.edu/etdc/view?acc_num=case1372864570.