Dissertations / Theses on the topic 'Biology, Genetics|Biology, Bioinformatics|Computer Science'
1

Wang, Jeremy R. "Analysis and Visualization of Local Phylogenetic Structure within Species." Thesis, The University of North Carolina at Chapel Hill, 2013. http://pqdtopen.proquest.com/#viewpdf?dispub=3562960.

Abstract:

While it is interesting to examine the evolutionary history and phylogenetic relationship between species, for example, in a sort of "tree of life", there is also a great deal to be learned from examining population structure and relationships within species. A careful description of phylogenetic relationships within species provides insights into causes of phenotypic variation, including disease susceptibility. The better we are able to understand the patterns of genotypic variation within species, the better these populations may be used as models to identify causative variants and possible therapies, for example through targeted genome-wide association studies (GWAS). My thesis describes a model of local phylogenetic structure, how it can be effectively derived under various circumstances, and useful applications and visualizations of this model to aid genetic studies.

I introduce a method for discovering phylogenetic structure among individuals of a population by partitioning the genome into a minimal set of intervals within which there is no evidence of recombination. I describe two extensions of this basic method. The first allows it to be applied to heterozygous, in addition to homozygous, genotypes and the second makes it more robust to errors in the source genotypes.
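The standard criterion for "no evidence of recombination" between two biallelic sites is the four-gamete test. As an illustration only (not Wang's actual implementation), a greedy left-to-right partition of sites into mutually compatible intervals might look like:

```python
def four_gamete_compatible(site_a, site_b):
    """Two biallelic sites show no evidence of recombination between them
    if at most 3 of the 4 possible two-site haplotypes are observed."""
    return len(set(zip(site_a, site_b))) <= 3

def partition_into_intervals(sites):
    """Greedy left-to-right scan: start a new interval whenever the next
    site is incompatible with any site already in the current interval.
    For this interval-covering structure, the greedy scan yields a minimal
    number of intervals."""
    intervals, start = [], 0
    for i in range(1, len(sites)):
        if any(not four_gamete_compatible(sites[j], sites[i])
               for j in range(start, i)):
            intervals.append((start, i - 1))
            start = i
        # else: site i joins the current interval
    intervals.append((start, len(sites) - 1))
    return intervals

# Toy example: rows are sites, columns are haploid individuals (0/1 alleles)
sites = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],   # identical pattern: compatible with site 0
    [0, 1, 0, 1],   # all four gametes vs. sites 0-1: new interval
    [0, 1, 0, 1],
]
print(partition_into_intervals(sites))  # → [(0, 1), (2, 3)]
```
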

I demonstrate the predictive power of my local phylogeny model using a novel method for genome-wide genotype imputation. This imputation method achieves very high accuracy—on the order of the accuracy rate in the sequencing technology—by imputing genotypes in regions of shared inheritance based on my local phylogenies.
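As a hedged sketch of the general idea only (the dissertation's imputation model is far richer), imputing within a single non-recombining interval can be pictured as copying missing calls from the reference haplotype that best matches the target at its observed sites:

```python
def impute_in_interval(target, references):
    """Impute missing calls (None) in `target` by copying from the
    reference haplotype that agrees with it best at the observed sites --
    a toy stand-in for 'shared inheritance' within one interval."""
    def matches(ref):
        return sum(1 for t, r in zip(target, ref) if t is not None and t == r)
    best = max(references, key=matches)
    return [r if t is None else t for t, r in zip(target, best)]

refs = [[0, 0, 1, 1, 0],
        [1, 1, 0, 0, 1]]
target = [0, None, 1, None, 0]   # agrees with refs[0] at all observed sites
print(impute_in_interval(target, refs))  # → [0, 0, 1, 1, 0]
```
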

Comparative genomic analysis within species can be greatly aided by appropriate visualization and analysis tools. I developed a framework for web-based visualization and analysis of multiple individuals within a species, with my model of local phylogeny providing the underlying structure. I will describe the utility of these tools and the applications for which they have found widespread use.

2

Guturu, Harendra. "Deciphering human gene regulation using computational and statistical methods." Thesis, Stanford University, 2014. http://pqdtopen.proquest.com/#viewpdf?dispub=3581147.

Abstract:

It is estimated that at least 10-20% of the mammalian genome is dedicated to regulating the 1-2% of the genome that codes for proteins. This non-coding, regulatory layer is a necessity for the development of complex organisms, but is poorly understood compared to the genetic code used to translate coding DNA into proteins. In this dissertation, I will discuss methods developed to better understand the gene regulatory layer. I begin, in Chapter 1, with a broad overview of gene regulation: the motivation for studying it, the state of the art in its historical context, and where to look forward.

In Chapter 2, I discuss a computational method developed to detect transcription factor (TF) complexes. The method compares co-occurring motif spacings in conserved versus unconserved regions of the human genome to detect evolutionarily constrained binding sites of rigid transcription factor (TF) complexes. Structural data were integrated to explore overlapping motif arrangements while ensuring physical plausibility of the TF complex. Using this approach, I predicted 422 physically realistic TF complex motifs at 18% false discovery rate (FDR). I found that the set of complexes is enriched in known TF complexes. Additionally, novel complexes were supported by chromatin immunoprecipitation sequencing (ChIP-seq) datasets. Analysis of the structural modeling revealed three cooperativity mechanisms and a tendency of TF pairs to synergize through overlapping binding to the same DNA base pairs in opposite grooves or strands. The TF complexes and associated binding site predictions are made available as a web resource at http://complex.stanford.edu.

Next, in Chapter 3, I discuss how gene enrichment analysis can be applied to genome-wide conserved binding sites to successfully infer regulatory functions for a given TF complex. A genomic screen predicted 732,568 combinatorial binding sites for 422 TF complex motifs. From these predictions, I inferred 2,440 functional roles, which are consistent with known functional roles of TF complexes. In these functional associations, I found interesting themes such as promiscuous partnering of TFs (e.g., ETS) in the same functional context (T cells). Additionally, functional enrichment identified two novel TF complex motifs associated with spinal cord patterning genes and mammary gland development genes, respectively. Based on these predictions, I discovered novel spinal cord patterning enhancers (5/9, 56% validation rate) and enhancers active in MCF7 cells (11/19, 53% validation rate). This set, replete with thousands of additional predictions, will serve as a powerful guide for future studies of regulatory patterns and their functional roles.

Then, in Chapter 4, I outline a method developed to predict disease susceptibility due to gene mis-regulation. The method interrogates ensembles of conserved binding sites of regulatory factors disrupted by an individual's variants and then looks for their most significant congregation next to a group of functionally related genes. Strikingly, when the method is applied to five different full human genomes, the top enriched function for each is reflective of their very different medical histories. These results suggest that erosion of gene regulation results in function specific mutation loads that manifest as disease predispositions in a familial lineage. Additionally, this aggregate analysis method addresses the problem that although many human diseases have a genetic component involving many loci, the majority of studies are statistically underpowered to isolate the many contributing loci.

Finally, I conclude in Chapter 5 with a summary of my findings throughout my research and future directions of research based on my findings.

3

Brewer, Judy. "Metabolic Modeling of Inborn Errors of Metabolism: Carnitine Palmitoyltransferase II Deficiency and Respiratory Chain Complex I Deficiency." Thesis, Harvard University, 2015. http://nrs.harvard.edu/urn-3:HUL.InstRepos:24078365.

Abstract:
The research goal was to assess the current capabilities of a metabolic modeling environment to support exploration of inborn errors of metabolism (IEMs), and to assess whether, drawing on evidence from published studies of IEMs, these capabilities correlate with clinical measures of energy production, fatty acid oxidation, accumulation of toxic by-products of defective metabolism, and mitigation via therapeutic agents. IEMs comprise several hundred disorders of energy production, often with significant impact on morbidity and mortality. Despite advances in genomic medicine, the majority of therapeutic options for IEMs are currently supportive only, and most are only weakly evidenced. Metabolic modeling could potentially offer an in silico alternative for exploring therapeutic possibilities. This research established models of two IEMs, carnitine palmitoyltransferase (CPT) II deficiency and respiratory chain complex I deficiency, allowing exploration of combinations of IEMs at different degrees of enzyme deficiency. It utilized a modified version of the human metabolic network reconstruction, Recon 2, which includes known metabolic reactions and metabolites in human cells and allows constraint-based modeling within a computational and mathematical representation of human metabolism. It utilized the Matlab-based COBRA (Constraint-based Reconstruction and Analysis) Toolbox 2.0, and a customized suite of functions, to model ATP production, long-chain fatty acid (LCFA) oxidation, and acylcarnitine accumulation in response to varying defect levels, inputs, and a simulated candidate therapy.
Following significant curation of the metabolic network reconstruction and customization of COBRA/Matlab functions, this study demonstrated that ATP production and LCFA oxidation were within expected ranges and correlated with clinical data for enzyme deficiencies, while acylcarnitine accumulation inversely correlated with the degree of enzyme deficiency; it also proved possible to simulate upregulation of enzyme activity with a therapeutic agent. Results of the curation effort contributed to the development of an updated version of the metabolic reconstruction Recon 2. Customization of modeling approaches resulted in a suite of re-usable Matlab functions and scripts, compatible with COBRA Toolbox methods, available for further exploration of IEMs. While this research points to the potentially greater suitability of kinetic modeling for some aspects of metabolic modeling of IEMs, it helps to demonstrate the viability of constraint-based steady-state modeling as a means to explore clinically relevant measures of metabolic function for single and combined inborn errors of metabolism.
4

Zou, James Yang. "Algorithms and Models for Genome Biology." Thesis, Harvard University, 2014. http://dissertations.umi.com/gsas.harvard:11280.

Abstract:
New advances in genomic technology make it possible to address some of the most fundamental questions in biology for the first time. They also highlight a need for new approaches to analyze and model massive amounts of complex data. In this thesis, I present six research projects that illustrate the exciting interaction between high-throughput genomic experiments, new machine learning algorithms, and mathematical modeling. This interdisciplinary approach gives insights into questions ranging from how variations in the epigenome lead to diseases across human populations to how the slime mold finds the shortest path. The algorithms and models developed here are also of interest to the broader machine learning community, and have applications in other domains such as text modeling.
Mathematics
5

Nicol, Megan E. "Unraveling the Nexus: Investigating the Regulatory Genetic Networks of Hereditary Ataxias." Ohio University Honors Tutorial College / OhioLINK, 2014. http://rave.ohiolink.edu/etdc/view?acc_num=ouhonors1400604580.

6

Kiritchenko, Svetlana. "Hierarchical text categorization and its application to bioinformatics." Thesis, University of Ottawa (Canada), 2006. http://hdl.handle.net/10393/29298.

Abstract:
In a hierarchical categorization problem, categories are partially ordered to form a hierarchy. In this dissertation, we explore two main aspects of hierarchical categorization: learning algorithms and performance evaluation. We introduce the notion of consistent hierarchical classification that makes classification results more comprehensible and easily interpretable for end-users. Among the previously introduced hierarchical learning algorithms, only a local top-down approach produces consistent classification. The present work extends this algorithm to the general case of DAG class hierarchies and possible internal class assignments. In addition, a new global hierarchical approach aimed at performing consistent classification is proposed. This is a general framework of converting a conventional "flat" learning algorithm into a hierarchical one. An extensive set of experiments on real and synthetic data indicate that the proposed approach significantly outperforms the corresponding "flat" as well as the local top-down method. For evaluation purposes, we use a novel hierarchical evaluation measure that is superior to the existing hierarchical and non-hierarchical evaluation techniques according to a number of formal criteria. Also, this dissertation presents the first endeavor of applying the hierarchical text categorization techniques to the tasks of bioinformatics. Three bioinformatics problems are addressed. The objective of the first task, indexing biomedical articles with Medical Subject Headings (MeSH), is to associate documents with biomedical concepts from the specialized vocabulary of MeSH. In the second application, we tackle a challenging problem of gene functional annotation from biomedical literature. Our experiments demonstrate a considerable advantage of hierarchical text categorization techniques over the "flat" method on these two tasks. 
In the third application, our goal is to enrich the analysis of plain experimental data with biological knowledge. In particular, we incorporate the functional information on genes directly into the clustering process of microarray data with the outcome of an improved biological relevance and value of clustering results.
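A minimal sketch of what "consistent" hierarchical classification means in practice (illustrative only; the dissertation's algorithms are considerably more sophisticated, and the class names and scores below are made up): a class is predicted only when its entire ancestor chain is also predicted, so a confident child cannot be assigned under a rejected parent.

```python
def consistent_top_down(parent, scores, threshold=0.5):
    """Local top-down consistent classification: a class is predicted only
    if its score passes the threshold AND its parent was predicted too.
    parent: class -> parent class (None for top-level classes)."""
    assigned = set()
    changed = True
    while changed:
        changed = False
        for c, s in scores.items():
            if c not in assigned and s >= threshold and \
               (parent[c] is None or parent[c] in assigned):
                assigned.add(c)
                changed = True
    return assigned

parent = {"science": None, "biology": "science", "genetics": "biology"}
scores = {"science": 0.9, "biology": 0.4, "genetics": 0.7}
# 'genetics' scores above threshold but is blocked: its parent 'biology'
# was rejected, so a flat classifier would produce an inconsistent label set.
print(sorted(consistent_top_down(parent, scores)))  # → ['science']
```
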
7

Parmidge, Amelia J. "NEPIC, a Semi-Automated Tool with a Robust and Extensible Framework that Identifies and Tracks Fluorescent Image Features." Thesis, Mills College, 2014. http://pqdtopen.proquest.com/#viewpdf?dispub=1556025.

Abstract:

As fluorescent imaging techniques for biological systems have advanced in recent years, scientists have used fluorescent imaging more and more to capture the state of biological systems at different moments in time. For many researchers, analysis of the fluorescent image data has become the limiting factor of this new technique. Although identification of fluorescing neurons in an image is (seemingly) easily done by the human visual system, manual delineation of the exact pixels comprising these fluorescing regions of interest (or fROIs) in digital images does not scale up well, being time-consuming, reiterative, and error-prone. This thesis introduces NEPIC, the Neuron-to-Environment Pixel Intensity Calculator, which seeks to help resolve this issue. NEPIC is a semi-automated tool for finding and tracking the cell body of a single neuron over an entire movie of grayscale calcium image data. NEPIC also provides a highly extensible, open source framework that could easily support finding and tracking other kinds of fROIs. When tested on calcium image movies of the AWC neuron in C. elegans under highly variant conditions, NEPIC correctly identified the neuronal cell body in 95.48% of the movie frames, and successfully tracked this cell body feature across 98.60% of the frame transitions in the movies. Although support for finding and tracking multiple fROIs has yet to be implemented, NEPIC displays promise as a tool for assisting researchers in the bulk analysis of fluorescent imaging data.

8

Daniels, Noah Manus. "Remote Homology Detection in Proteins Using Graphical Models." Thesis, Tufts University, 2013. http://pqdtopen.proquest.com/#viewpdf?dispub=3563611.

Abstract:

Given the amino acid sequence of a protein, researchers often infer its structure and function by finding homologous, or evolutionarily-related, proteins of known structure and function. Since structure is typically more conserved than sequence over long evolutionary distances, recognizing remote protein homologs from their sequence poses a challenge.

We first consider all proteins of known three-dimensional structure, and explore how they cluster according to different levels of homology. An automatic computational method reasonably approximates a human-curated hierarchical organization of proteins according to their degree of homology.

Next, we return to homology prediction, based only on the one-dimensional amino acid sequence of a protein. Menke, Berger, and Cowen proposed a Markov random field model to predict remote homology for beta-structural proteins, but their formulation was computationally intractable on many beta-strand topologies.

We show two different approaches to approximate this random field, both of which make it computationally tractable, for the first time, on all protein folds. One method simplifies the random field itself, while the other retains the full random field, but approximates the solution through stochastic search. Both methods achieve improvements over the state of the art in remote homology detection for beta-structural protein folds.

9

Chen, Hui 1974. "Algorithms and statistics for the detection of binding sites in coding regions." Thesis, McGill University, 2006. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=97926.

Abstract:
This thesis deals with the problem of detecting binding sites in coding regions. A new comparative analysis method is developed by improving an existing method called COSMO.
The inter-species sequence conservation observed in coding regions may be the result of two types of selective pressure: the selective pressure on the protein encoded and, sometimes, the selective pressure on the binding sites. To predict a region within a coding sequence as a binding site, one needs to make sure that the conservation observed in that region is not due to the selective pressure on the protein encoded. To achieve this, COSMO built a null model with only the selective pressure on the protein encoded and computed p-values for the observed conservation scores, conditional on the fixed set of amino acids observed at the leaves.
It is believed, however, that the selective pressure on the protein assumed in COSMO is overly strong. Consequently, some interesting regions may be left undetected. In this thesis, a new method, COSMO-2, is developed to relax this assumption.
The amino acids are first classified into a fixed number of overlapping functional classes by applying an expectation maximization algorithm on a protein database. Two probabilities for each gene position are then calculated: (i) the probability of observing a certain degree of conservation in the orthologous sequences generated under each class in the null model (i.e. the p-value of the observed conservation under each class); and (ii) the probability that the codon column associated with that gene position belongs to each class. The p-value of the observed conservation for each gene position is the sum of the products of the two probabilities for all classes. Regions with low p-values are identified as potential binding sites.
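The scoring rule described above (the p-value at a position is the sum, over classes, of the class membership probability times the conditional p-value) translates directly into code. A toy sketch, with made-up class names and numbers:

```python
def combined_pvalue(class_posteriors, class_pvalues):
    """COSMO-2-style score for one gene position: sum over amino-acid
    functional classes of P(codon column belongs to class) times the
    p-value of the observed conservation under that class's null model."""
    return sum(class_posteriors[c] * class_pvalues[c] for c in class_posteriors)

# Hypothetical classes and values for one gene position:
posteriors = {"hydrophobic": 0.7, "polar": 0.2, "charged": 0.1}  # sums to 1
pvalues    = {"hydrophobic": 0.01, "polar": 0.30, "charged": 0.50}
print(round(combined_pvalue(posteriors, pvalues), 3))  # → 0.117
```

Positions whose combined p-value falls below a chosen cutoff would then be flagged as potential binding sites.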
Five sets of orthologous genes are analyzed using COSMO-2. The results show that COSMO-2 can detect the interesting regions identified by COSMO and can detect more interesting regions than COSMO in some cases.
10

Chen, Xiaoyu 1974. "Computational detection of tissue-specific cis-regulatory modules." Thesis, McGill University, 2006. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=97927.

Abstract:
A cis-regulatory module (CRM) is a DNA region of a few hundred base pairs that consists of a cluster of several transcription factor binding sites and regulates the expression of a nearby gene. This thesis presents a new computational approach to CRM detection.
It is believed that tissue-specific CRMs tend to regulate nearby genes in a certain tissue and that they consist of binding sites for transcription factors (TFs) that are also expressed in that tissue. These facts allow us to make use of tissue-specific gene expression data to detect tissue-specific CRMs and improve the specificity of module prediction.
We build a Bayesian network to integrate the sequence information about TF binding sites and the expression information about TFs and regulated genes. The network is then used to infer whether a given genomic region indeed has regulatory activity in a given tissue. A novel EM algorithm incorporating probability tree learning is proposed to train the Bayesian network in an unsupervised way. A new probability tree learning algorithm is developed to learn the conditional probability distribution for a variable in the network that has a large number of hidden variables as its parents.
Our approach is evaluated using biological data, and the results show that it is able to correctly discriminate among human liver-specific modules, erythroid-specific modules, and negative-control regions, even though no prior knowledge about the TFs and the target genes is employed in our algorithm. In a genome-wide scale, our network is trained to identify tissue-specific CRMs in ten tissues. Some known tissue-specific modules are rediscovered, and a set of novel modules are predicted to be related with tissue-specific expression.
11

Siek, Katie A. "The design and evaluation of an assistive application for dialysis patients." [Bloomington, Ind.] : Indiana University, 2006. http://gateway.proquest.com/openurl?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation&res_dat=xri:pqdiss&rft_dat=xri:pqdiss:3223070.

Abstract:
Thesis (Ph.D.)--Indiana University, Dept. of Computer Science, 2006.
"Title from dissertation home page (viewed June 28, 2007)." Source: Dissertation Abstracts International, Volume: 67-06, Section: B, page: 3242. Adviser: Kay H. Connelly.
12

Ahlert, Darla. "Application of Graph Theoretic Clustering on Some Biomedical Data Sets." Thesis, Southern Illinois University at Edwardsville, 2015. http://pqdtopen.proquest.com/#viewpdf?dispub=1588658.

Abstract:

Clustering algorithms have become a popular way to analyze biomedical data sets and, in particular, gene expression data. Since these data sets are often large, it is difficult to gather useful information from them as a whole. Clustering is a proven method to extract knowledge about the data that can eventually lead to many discoveries in the biological world. Hierarchical clustering is used frequently to interpret gene expression data, but recently, graph-theoretic clustering algorithms have started to gain some traction for analysis of this type of data. We consider five graph-theoretic clustering algorithms run over a post-mortem gene expression dataset, as well as a few different biomedical data sets, in which the ground truth, or class label, is known for each data point. We then externally evaluate the algorithms based on the accuracy of the resulting clusters against the ground-truth clusters. Comparing the results of each of the algorithms run over all of the datasets, we found that our algorithms are efficient on the real biomedical datasets but find gene expression data especially difficult to handle.
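External evaluation of clusters against known class labels is commonly done with pair-counting measures such as the Rand index; whether this particular measure is the one used in the thesis is not stated, so the following is a generic, self-contained sketch:

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """Pair-counting external evaluation: the fraction of point pairs on
    which the predicted clustering agrees with the ground-truth classes
    (both in the same cluster, or both in different clusters)."""
    agree = total = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        agree += same_true == same_pred
        total += 1
    return agree / total

truth = [0, 0, 0, 1, 1, 1]
pred  = [0, 0, 1, 1, 1, 1]   # one point mis-clustered
print(round(rand_index(truth, pred), 3))  # → 0.667
```
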

13

Roberts, Adam. "Ambiguous fragment assignment for high-throughput sequencing experiments." Thesis, University of California, Berkeley, 2014. http://pqdtopen.proquest.com/#viewpdf?dispub=3616509.

Abstract:

As the cost of short-read, high-throughput DNA sequencing continues to fall rapidly, new uses for the technology have been developed aside from its original purpose in determining the genome of various species. Many of these new experiments use the sequencer as a digital counter for measuring biological activities such as gene expression (RNA-Seq) or protein binding (ChIP-Seq).

A common problem faced in the analysis of these data is that of sequenced fragments that are "ambiguous", meaning they resemble multiple loci in a reference genome or other sequence. In early analyses, such ambiguous fragments were ignored or were assigned to loci using simple heuristics. However, statistical approaches using maximum likelihood estimation have been shown to greatly improve the accuracy of downstream analyses and have become widely adopted. Optimization based on the expectation-maximization (EM) algorithm is often employed by these methods to find the optimal sets of alignments, with frequent enhancements to the model. Nevertheless, these improvements increase complexity, which, along with an exponential growth in the size of sequencing datasets, has led to new computational challenges.

Herein, we present our model for ambiguous fragment assignment for RNA-Seq, which includes the most comprehensive set of parameters of any model introduced to date, as well as various methods we have explored for scaling our optimization procedure. These methods include the use of an online EM algorithm and a distributed EM solution implemented on the Spark cluster computing system. Our advances have resulted in the first efficient solution to the problem of fragment assignment in sequencing.
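The core EM iteration for ambiguous fragment assignment, stripped of all of the model's additional parameters, can be sketched as follows (a toy batch version for illustration, not the online or distributed implementations described above):

```python
def em_fragment_assignment(alignments, n_targets, iters=100):
    """Classic EM for ambiguous fragments. E-step: split each fragment
    among its candidate targets in proportion to the current abundance
    estimates. M-step: re-estimate abundances from the expected counts.
    `alignments`: one list of candidate target ids per fragment."""
    theta = [1.0 / n_targets] * n_targets        # uniform initialization
    for _ in range(iters):
        counts = [0.0] * n_targets
        for cands in alignments:
            z = sum(theta[t] for t in cands)
            for t in cands:
                counts[t] += theta[t] / z        # expected assignment
        total = sum(counts)
        theta = [c / total for c in counts]      # new abundance estimates
    return theta

# 3 unambiguous fragments on target 0, 1 on target 1, 2 ambiguous fragments:
alignments = [[0], [0], [0], [1], [0, 1], [0, 1]]
theta = em_fragment_assignment(alignments, 2)
print([round(t, 2) for t in theta])  # → [0.75, 0.25]
```

At the fixed point the ambiguous fragments are split in proportion to the abundances implied by the unambiguous ones, which is exactly the intuition behind likelihood-based fragment assignment.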

Furthermore, we are the first to create a fully generalized model for ambiguous fragment assignment and present details on how our method can provide solutions for additional high-throughput sequencing assays including ChIP-Seq, Allele-Specific Expression (ASE), and the detection of RNA-DNA Differences (RDDs) in RNA-Seq.

14

Huang, Huanhua. "Taxonomic assignment of gene sequences using hidden Markov models." Thesis, Northern Arizona University, 2014. http://pqdtopen.proquest.com/#viewpdf?dispub=1563863.

Abstract:

Our ability to study communities of microorganisms has been vastly improved by the development of high-throughput DNA sequencers. These technologies, however, can only sequence short fragments of organisms' genomes at a time, which introduces many challenges in translating sequencing results into biological insight. The field of bioinformatics has arisen in part to address these problems.

One bioinformatics problem is assigning a genetic sequence to a source organism. It is now common to use high-throughput, short-read sequencing technologies, such as the Illumina MiSeq, to sequence the 16S rRNA gene from a community of microorganisms. Researchers use this information to generate a profile of the different microbial organisms (i.e., the taxonomic composition) present in an environmental sample. There are a number of approaches for assigning taxonomy to genetic sequences, but all suffer from problems with accuracy. The most widely used methods are pairwise alignment methods, like BLAST, UCLUST, and RTAX, and probability-based methods, such as RDP and MOTHUR. These methods can classify microbial sequences with high accuracy when sequences are long (e.g., a thousand bases); however, accuracy decreases as sequences get shorter. Current high-throughput sequencing technologies generate sequences between about 150 and 500 bases in length.

In my thesis I have developed new software for assigning taxonomy to short DNA sequences using profile Hidden Markov Models (HMMs). HMMs have been applied in related areas, such as assigning biological functions to protein sequences, and I hypothesized that they might be useful for achieving high-accuracy taxonomic assignments from 16S rRNA gene sequences. My method builds models of 16S rRNA sequences for different taxonomic groups (kingdom, phylum, class, order, family, genus, and species) using the Greengenes 16S rRNA database. Given a sequence with unknown taxonomic origin, my method searches each kingdom model to determine the most likely kingdom. It then searches all of the phyla within the highest-scoring kingdom to determine the most likely phylum. This iterative process continues until the sequence cannot be assigned at a taxonomic level with a user-defined confidence level, or until a species-level assignment is made that meets the user-defined confidence level.
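The iterative descent just described can be sketched in a few lines; the tree, scores, and threshold below are all made up for illustration, with a plain score function standing in for the per-taxon HMM searches:

```python
def assign_taxonomy(children, score, root="root", confidence=0.8):
    """Hierarchical assignment: at each rank, score the query against every
    child model, descend into the best-scoring child, and stop as soon as
    the best score falls below the confidence threshold.
    `children`: taxon -> list of sub-taxa; `score`: taxon -> model score."""
    path = []
    node = root
    while children.get(node):
        best = max(children[node], key=score)
        if score(best) < confidence:
            break                      # cannot assign at this rank
        path.append(best)
        node = best
    return path

# Toy tree and made-up model scores (stand-ins for HMM search results):
children = {"root": ["Bacteria", "Archaea"],
            "Bacteria": ["Proteobacteria", "Firmicutes"],
            "Proteobacteria": ["Gammaproteobacteria"]}
scores = {"Bacteria": 0.95, "Archaea": 0.10,
          "Proteobacteria": 0.90, "Firmicutes": 0.30,
          "Gammaproteobacteria": 0.60}
print(assign_taxonomy(children, scores.get))
# → ['Bacteria', 'Proteobacteria']  (stops: 0.60 < 0.8 confidence)
```
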

I next evaluated this method on both artificial and real microbial community data, with both qualitative and quantitative metrics of method performance. The evaluation results showed that in the qualitative analyses (specificity and sensitivity) my method is not as good as the previously existing methods. However, the accuracy in the quantitative analysis was better than some other pre-existing methods. This suggests that my current implementation is sensitive to false positives, but is better at classifying more sequences than the other methods.

I present my method, my evaluations, and suggestions for next steps that might improve the performance of my HMM-based taxonomic classifier.

15

Thavappiragasam, Mathialakan. "A web semantic for SBML merge." Thesis, University of South Dakota, 2014. http://pqdtopen.proquest.com/#viewpdf?dispub=1566784.

Abstract:

The manipulation of XML-based relational representations of biological systems (BioMLs, for Bioscience Markup Languages) is a big challenge in systems biology. The needs of biologists, such as translational studies of biological systems, make these challenges greater still, given the volume of material produced by next-generation sequencing. Among these BioMLs, SBML is the de facto standard file format for the storage and exchange of quantitative computational models in systems biology, supported by more than 257 software packages to date. The SBML standard is used by several biological systems modeling tools and several databases for representation and knowledge sharing. Several subsystems must be integrated in order to construct a complex biological system. The issue of combining biological subsystems by merging SBML files has been addressed in several algorithms and tools, but it remains impossible to build an automatic merge system that implements reusability, flexibility, scalability, and sharability. Existing algorithms rely on name-based component comparison, which does not allow integration into a Workflow Management System (WMS) to build pipelines, and which does not include the mapping of quantitative data needed for a good analysis of the biological system. In this work, we present a deterministic merging algorithm that is consumable in a given WMS engine and designed using a novel biological model similarity algorithm. This model merging system integrates four sub-modules: SBMLChecker, SBMLAnot, SBMLCompare, and SBMLMerge, for model quality checking, annotation, comparison, and merging respectively. The tools are integrated into the BioExtract server, leveraging iPlant collaborative resources to support users in processing large models and designing workflows. These tools are also embedded into a user-friendly online version, SW4SBMLm.

16

Youngs, Noah. "Positive-Unlabeled Learning in the Context of Protein Function Prediction." Thesis, New York University, 2014. http://pqdtopen.proquest.com/#viewpdf?dispub=3665223.

Abstract:

With the recent proliferation of large, unlabeled data sets, a particular subclass of semisupervised learning problems has become more prevalent. Known as positive-unlabeled learning (PU learning), this scenario provides only positive labeled examples, usually just a small fraction of the entire dataset, with the remaining examples unknown and thus potentially belonging to either the positive or negative class. Since the vast majority of traditional machine learning classifiers require both positive and negative examples in the training set, a new class of algorithms has been developed to deal with PU learning problems.

A canonical example of this scenario is topic labeling of a large corpus of documents. Once the size of a corpus reaches into the thousands, it becomes largely infeasible to have a curator read even a sizable fraction of the documents, and annotate them with topics. In addition, the entire set of topics may not be known, or may change over time, making it impossible for a curator to annotate which documents are NOT about certain topics. Thus a machine learning algorithm needs to be able to learn from a small set of positive examples, without knowledge of the negative class, and knowing that the unlabeled training examples may contain an arbitrary number of additional but as yet unknown positive examples.

Another example of a PU learning scenario recently garnering attention is the protein function prediction problem (PFP problem). While the number of organisms with fully sequenced genomes continues to grow, the progress of annotating those sequences with the biological functions that they perform lags far behind. Machine learning methods have already been successfully applied to this problem, but with many organisms having a small number of positive annotated training examples, and the lack of availability of almost any labeled negative examples, PU learning algorithms have the potential to make large gains in predictive performance.

The first part of this dissertation motivates the protein function prediction problem, explores previous work, and introduces novel methods that improve upon previously reported benchmarks for a particular type of learning algorithm, known as Gaussian Random Field Label Propagation (GRFLP). In addition, we present improvements to the computational efficiency of the GRFLP algorithm, and a modification to the traditional structure of the PFP learning problem that allows for simultaneous prediction across multiple species.

The second part of the dissertation focuses specifically on the positive-unlabeled aspects of the PFP problem. Two novel algorithms are presented, and rigorously compared to existing PU learning techniques in the context of protein function prediction. Additionally, we take a step back and examine some of the theoretical considerations of the PU scenario in general, and provide an additional novel algorithm applicable in any PU context. This algorithm is tailored for situations in which the labeled positive examples are a small fraction of the set of true positive examples, and where the labeling process may be subject to some type of bias rather than being a random selection of true positives (arguably some of the most difficult PU learning scenarios).

The third and fourth sections return to the PFP problem, examining the power of tertiary structure as a predictor of protein function, as well as presenting two case studies of function prediction performance on novel benchmarks. Lastly, we conclude with several promising avenues of future research into both PU learning in general, and the protein function prediction problem specifically.

APA, Harvard, Vancouver, ISO, and other styles
17

Lee, Lawrence Chet-Lun. "Text mining of point mutation information from biomedical literature." Diss., Search in ProQuest Dissertations & Theses. UC Only, 2008. http://gateway.proquest.com/openurl?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation&res_dat=xri:pqdiss&rft_dat=xri:pqdiss:3339194.

Full text
APA, Harvard, Vancouver, ISO, and other styles
18

Hudson, Cody Landon. "Protein structure analysis and prediction utilizing the Fuzzy Greedy K-means Decision Forest model and Hierarchically-Clustered Hidden Markov Models method." Thesis, University of Central Arkansas, 2014. http://pqdtopen.proquest.com/#viewpdf?dispub=1549796.

Full text
Abstract:

Structural genomics is a field of study that strives to derive and analyze the structural characteristics of proteins through means of experimentation and prediction using software and other automatic processes. Alongside implications for more effective drug design, the main motivation for structural genomics concerns the elucidation of each protein’s function, given that the structure of a protein almost completely governs its function. Historically, the approach to derive the structure of a protein has been through exceedingly expensive, complex, and time-consuming methods such as X-ray crystallography and nuclear magnetic resonance (NMR) spectroscopy.

In response to the inadequacies of these methods, three families of approaches have been developed in a relatively new branch of computer science known as bioinformatics: threading, homology modeling, and the de novo approach. However, even these methods fail, whether due to impracticality, the inability to produce novel folds, rampant complexity, or other inherent limitations. In their stead, this work proposes the Fuzzy Greedy K-means Decision Forest (FGK-DF) model, which utilizes sequence motifs that transcend protein family boundaries to predict local tertiary structure, such that the method is cheap, effective, and can produce semi-novel folds due to its local (rather than global) prediction mechanism. This work further extends the FGK-DF model with a new algorithm, the Hierarchically-Clustered Hidden Markov Models (HC-HMM) method, to extract protein primary sequence motifs more accurately and adequately than the current FGK-DF model allows, enabling more accurate and powerful local tertiary structure predictions. Both algorithms are critically examined, their methodology thoroughly explained and tested against a consistent data set, and the results discussed at length.

APA, Harvard, Vancouver, ISO, and other styles
19

Dinh, Hieu Trung. "Algorithms for DNA Sequence Assembly and Motif Search." University of Connecticut, 2013.

Find full text
APA, Harvard, Vancouver, ISO, and other styles
20

Bliven, Spencer Edward. "Structure-Preserving Rearrangements| Algorithms for Structural Comparison and Protein Analysis." Thesis, University of California, San Diego, 2015. http://pqdtopen.proquest.com/#viewpdf?dispub=3716489.

Full text
Abstract:

Protein structure is fundamental to a deep understanding of how proteins function. Since structure is highly conserved, structural comparison can provide deep information about the evolution and function of protein families. The Protein Data Bank (PDB) continues to grow rapidly, providing copious opportunities for advancing our understanding of proteins through large-scale searches and structural comparisons. In this work I present several novel structural comparison methods for specific applications, as well as apply structure comparison tools systematically to better understand global properties of protein fold space.

Circular permutation describes a relationship between two proteins where the N-terminal portion of one protein is related to the C-terminal portion of the other. Proteins that are related by a circular permutation generally share the same structure despite the rearrangement of their primary sequence. This non-sequential relationship makes them difficult for many structure alignment tools to detect. Combinatorial Extension for Circular Permutations (CE-CP) was developed to align proteins that may be related by a circular permutation. It is widely available due to its incorporation into the RCSB PDB website.
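At the sequence level, the circular-permutation relationship can be illustrated with the classic string-doubling trick (a toy sketch only; CE-CP itself aligns 3D structures and tolerates mutations, whereas this exact-match check assumes identical hypothetical sequences):

```python
def is_circular_permutation(a, b):
    # b is a circular permutation of a exactly when the two strings have
    # the same length and b occurs inside a concatenated with itself:
    # every rotation of a appears as a substring of a + a.
    return len(a) == len(b) and b in (a + a)

# Hypothetical 6-residue sequence with its N- and C-terminal halves swapped.
print(is_circular_permutation("MKLVNT", "VNTMKL"))
print(is_circular_permutation("MKLVNT", "MKVLNT"))
```

Real structure alignment must instead search over all candidate split points while scoring 3D superposition, which is why non-sequential relationships defeat tools that assume a collinear alignment.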

Symmetry and structural repeats are common in protein structures at many levels. The CE-Symm tool was developed in order to detect internal pseudosymmetry within individual polypeptide chains. Such internal symmetry can arise from duplication events, so aligning the individual symmetry units provides insights about conservation and evolution. In many cases, internal symmetry can be shown to be important for a number of functions, including ligand binding, allostery, folding, stability, and evolution.

Structural comparison tools were applied comprehensively across all PDB structures for systematic analysis. Pairwise structural comparisons of all proteins in the PDB have been computed using the Open Science Grid computing infrastructure, and are kept continually up-to-date with the release of new structures. These provide a network-based view of protein fold space. CE-Symm was also applied to systematically survey the PDB for internally symmetric proteins. It is able to detect symmetry in ~20% of all protein families. Such PDB-wide analyses give insights into the complex evolution of protein folds.

APA, Harvard, Vancouver, ISO, and other styles
21

Westbrook, Anthony. "The Paladin Suite| Multifaceted Characterization of Whole Metagenome Shotgun Sequences." Thesis, University of New Hampshire, 2018. http://pqdtopen.proquest.com/#viewpdf?dispub=10685940.

Full text
Abstract:

Whole metagenome shotgun sequencing is a powerful approach for assaying many aspects of microbial communities, including the functional and symbiotic potential of each contributing community member. The research community currently lacks tools that efficiently align DNA reads against protein references, the technique necessary for constructing functional profiles. This thesis details the creation of PALADIN—a novel modification of the Burrows-Wheeler Aligner that provides orders-of-magnitude improved efficiency by directly mapping in protein space. In addition to performance considerations, utilizing PALADIN and associated tools as the foundation of metagenomic pipelines also allows for novel characterization and downstream analysis.
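The key step of mapping DNA reads in protein space is translating each read before alignment; a minimal sketch of forward-frame translation (the codon table here is deliberately truncated for brevity, with unknown codons rendered as 'X'; PALADIN's actual implementation is far more involved and also handles reverse frames):

```python
# Partial codon table for illustration only (the standard code has 64 entries).
CODONS = {
    'ATG': 'M', 'TGG': 'W', 'TTT': 'F', 'TTC': 'F', 'GGA': 'G',
    'GGC': 'G', 'GCT': 'A', 'AAA': 'K', 'TAA': '*', 'TGA': '*',
}

def translate_frames(dna):
    # Translate the three forward reading frames of a DNA read.
    frames = []
    for offset in range(3):
        protein = ''.join(CODONS.get(dna[i:i + 3], 'X')
                          for i in range(offset, len(dna) - 2, 3))
        frames.append(protein)
    return frames

print(translate_frames("ATGTGGTTTGGA"))
```

Aligning the translated frames against a protein reference, rather than aligning nucleotides directly, is what allows a protein-space mapper to tolerate synonymous mutations at the DNA level.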

The accuracy and efficiency of PALADIN were compared against existing applications that employ nucleotide or protein alignment algorithms. Using both simulated and empirically obtained reads, PALADIN consistently outperformed all compared alignment tools across a variety of metrics, mapping reads nearly 8,000 times faster than the widely utilized protein aligner, BLAST. A variety of analysis techniques were demonstrated using this data, including detecting horizontal gene transfer, performing taxonomic grouping, and generating declustered references.

APA, Harvard, Vancouver, ISO, and other styles
22

Jain, Mudita 1968. "Algorithms for physical mapping using unique probes." Diss., The University of Arizona, 1996. http://hdl.handle.net/10150/290622.

Full text
Abstract:
DNA molecules are sequences of characters over a four-letter alphabet. Determining the text of the DNA sequence contained in human cells is the goal of the Human Genome Project. The structure of a DNA sequence is reconstructed from a set of shorter fragments sampled from it at unknown locations, as it is usually too long to be determined directly. We consider the problem when the fragments are very long and each fragment has a fingerprint consisting of the presence of two or three pre-selected, smaller sequences called probes within it. These probes have a unique location along the original DNA sequence. The fingerprints contain false negative and false positive errors, and the fragments may be chimeric. A physical map of a DNA sequence is a reconstruction of the order of the probes and fragments along it. In short, given a collection of fragments, with fingerprints for each fragment taken from a collection of probes, and parameters that bound the rates of false negatives, false positives, and chimeras in the input data, the problem is to find the most likely probe ordering. Physical mapping is NP-complete when the input data contains errors. To construct physical maps, we first determine neighbourhoods of probes and clones that are highly likely to be adjacent on the original DNA sequence. We then use a new, versatile integer linear programming formulation of the problem to derive heuristics for ordering probes within neighbourhoods. This formulation provides a single, uniform representation for diverse data such as end-clone probes and in-situ hybridization, and provides a natural medium for the integration of previously constructed maps with newer data. We also present an ordering heuristic based upon end-clone data. Finally, we connect these local permutations into a larger, more global probe permutation. For this we use heuristics that have at their core previously mapped data.
All heuristics are implemented and evaluated by comparing the computed probe orderings to the original probe orderings for simulated data.
APA, Harvard, Vancouver, ISO, and other styles
23

Altman, Erik R. (Erik Richter). "Genetic algorithms and cache replacement policy." Thesis, McGill University, 1991. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=61096.

Full text
Abstract:
The most common and generally best performing replacement algorithm in modern caches is LRU. Despite LRU's superiority, it is still possible that other feasible and implementable replacement policies could yield better performance. (34) found that an optimal replacement policy (OPT) would often have a miss rate 70% that of LRU.
If better replacement policies exist, they may not be obvious. One way to find better policies is to study a large number of address traces for common patterns. Such an undertaking involves such a large amount of data, that some automated method of generating and evaluating policies is required. Genetic Algorithms provide such a method, and have been used successfully on a wide variety of tasks (21).
The best replacement policy found using this approach had a mean improvement in overall hit rate of 0.6% over LRU for the benchmarks used. This corresponds to 27% of the 2.2% mean difference between LRU and OPT. Performance of the best of these replacement policies was found to be generally superior to shadow cache (33), an enhanced replacement policy similar to some of those used here.
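The LRU-versus-OPT gap that motivates this search can be reproduced on a toy trace; the following sketch (not the thesis's simulator; the reference trace is hypothetical) compares LRU against Belady's offline-optimal policy, which evicts the line whose next use lies farthest in the future:

```python
def lru_misses(trace, size):
    cache = []                      # most-recently used block kept at the end
    misses = 0
    for ref in trace:
        if ref in cache:
            cache.remove(ref)       # hit: refresh recency
        else:
            misses += 1
            if len(cache) == size:
                cache.pop(0)        # evict the least-recently used block
        cache.append(ref)
    return misses

def opt_misses(trace, size):
    cache = set()
    misses = 0
    for i, ref in enumerate(trace):
        if ref not in cache:
            misses += 1
            if len(cache) == size:
                # Belady's OPT: evict the block reused farthest in the future
                # (or never reused at all).
                def next_use(block):
                    for j in range(i + 1, len(trace)):
                        if trace[j] == block:
                            return j
                    return float('inf')
                cache.remove(max(cache, key=next_use))
            cache.add(ref)
    return misses

trace = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]   # toy reference trace
print(lru_misses(trace, 3), opt_misses(trace, 3))
```

OPT is unrealizable in hardware because it requires future knowledge, but it bounds the improvement any implementable policy (such as those evolved by a genetic algorithm) could achieve over LRU.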
APA, Harvard, Vancouver, ISO, and other styles
24

Yang, Qian 1973. "RNA sequence alignment and secondary structure prediction." Thesis, McGill University, 2005. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=82453.

Full text
Abstract:
Functional RNA sequences typically have structural elements that are highly conserved during evolution. Here we present an algorithmic method for multiple alignment of RNAs, taking into consideration both structural similarity and sequence identity. Furthermore, we performed a comparative analysis on pairing probability matrices of a set of aligned orthologous sequences and predicted the conserved secondary structure. Our alignment method outperforms the most widely used multiple alignment tool, Clustal W, and the structure prediction approach we propose can generate a more accurate secondary structure for 5S rRNA than existing approaches such as Alifold. In addition, our algorithms are efficient in terms of CPU time and memory usage compared to most existing methods for secondary structure prediction.
APA, Harvard, Vancouver, ISO, and other styles
25

Tang, Zuojian 1967. "Identifying mouse genes putatively transcriptionally regulated by the glucocorticoid receptor." Thesis, McGill University, 2005. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=82437.

Full text
Abstract:
The glucocorticoid receptor (GR) is one of many steroid hormone receptors. It controls broad physiological gene networks, confers pathological effects in a range of disease states, and offers an excellent target for therapeutic intervention. Therefore, it is necessary to better understand the mechanisms of GR regulation. In particular, we are interested in better understanding protein-nucleotide interactions (transcription factors interacting with transcription factor binding sites). Upon glucocorticoid hormone binding, the GR forms a protein-nucleotide interaction with a specific transcription factor binding site known as a glucocorticoid response element (GRE). This research has employed three different but complementary bioinformatics approaches to identify mouse genes putatively transcriptionally regulated by the GR. Firstly, we focus on the problem of searching for putative GREs in the complete mouse genome using a position weight matrix. This produced a large number of putative GREs, most of which are likely false positive predictions. Secondly, two different strategies are used to improve the accuracy of our framework: combinatorial analysis of multiple TFs/modules of TFBSs, and phylogenetic footprinting (PF). The number of putative GREs can be reduced by 97.9% using the module of TFBSs analysis, 97.7% using the PF analysis, and 99.9% using both module and PF analyses. In each step, a statistical test has been used to measure the significance of the results.
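A position weight matrix scan of the kind described in the first step can be sketched as follows (the matrix here is a hypothetical 4 bp motif, not a real GRE matrix; scores are log-odds against a uniform background):

```python
import math

# Hypothetical PWM for a 4 bp motif: one dict of per-base probabilities
# per motif position (illustration only, not a real GRE matrix).
pwm = [
    {'A': 0.7, 'C': 0.1, 'G': 0.1, 'T': 0.1},
    {'A': 0.1, 'C': 0.1, 'G': 0.7, 'T': 0.1},
    {'A': 0.1, 'C': 0.1, 'G': 0.1, 'T': 0.7},
    {'A': 0.1, 'C': 0.7, 'G': 0.1, 'T': 0.1},
]
BACKGROUND = 0.25  # uniform background model

def score(window):
    # Log-odds score of a window against the background model.
    return sum(math.log2(pwm[i][base] / BACKGROUND)
               for i, base in enumerate(window))

def scan(seq, threshold):
    # Slide the PWM along the sequence and report windows above threshold.
    hits = []
    for i in range(len(seq) - len(pwm) + 1):
        s = score(seq[i:i + len(pwm)])
        if s >= threshold:
            hits.append((i, round(s, 2)))
    return hits

print(scan("TTAGTCAGGAGTCT", 4.0))
```

Because a genome-wide scan of such a short matrix produces many chance hits, filtering steps like the module and phylogenetic-footprinting analyses described above are needed to prune false positives.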
APA, Harvard, Vancouver, ISO, and other styles
26

Chen, Huiling Zhou Huan Xiang Ferrone Frank A. "Prediction of protein structures and protein-protein interactions : a bioinformatics approach /." Philadelphia, Pa. : Drexel University, 2005. http://dspace.library.drexel.edu/handle/1860/481.

Full text
APA, Harvard, Vancouver, ISO, and other styles
27

Tcheimegni, Elie. "Kernel Based Relevance Vector Machine for Classification of Diseases." Thesis, Bowie State University, 2013. http://pqdtopen.proquest.com/#viewpdf?dispub=3558597.

Full text
Abstract:

Motivated by the improved depiction of diseases and cancers that would be facilitated by an ability to predict the occurrence of related syndromes, this work employs a data-driven approach to developing cancer classification/prediction models using the Relevance Vector Machine (RVM), a probabilistic kernel-based learning machine.

Drawing from the work of Bertrand Luvision and Chao Dong, and from the classification of electrocardiogram signals by S. Karpagachelvi, which show the superiority of the RVM approach over traditional classifiers, the problem addressed in this research is to design a program that pipes components together in graphic workflows, which could help improve the classification/regression accuracy of two model structure methods (Support Vector Machines and kernel-based Relevance Vector Machines) for better prediction of related diseases, and then to compare the two methods using clinical data.

Would the application of the Relevance Vector Machine to the classification of these data improve their coverage? We developed a hierarchical Bayesian model for binary and bivariate data classification using RBF and sigmoid kernels, with different parameterizations and varied thresholds. The parameters of the kernel function are treated as model parameters. The results allow us to conclude that RVM is almost equal to SVM in training efficiency and classification accuracy, but that RVM performs better in sparsity, generalization ability, and decision speed.

Meanwhile, the use of RVM raises some issues: it uses fewer support vectors, yet trains much faster for non-linear kernels than SVM-light. Finally, we test these approaches on a corpus of publicly released phenotype data. Further research is needed to improve prediction accuracy with more patients' data. Appendices provide the SVM and RVM derivations in detail. One important area of focus is the development of models for predicting cancers.

Keywords: Support Vector Machines, Relevance Vector Machine, RapidMiner, Tanagra, accuracy values.

APA, Harvard, Vancouver, ISO, and other styles
28

Alouani, David James. "THE AGING PROCESS OF C. ELEGANS VIEWED THROUGH TIME DEPENDENT PROTEIN EXPRESSION ANALYSIS." Case Western Reserve University School of Graduate Studies / OhioLINK, 2015. http://rave.ohiolink.edu/etdc/view?acc_num=case1436393267.

Full text
APA, Harvard, Vancouver, ISO, and other styles
29

Rajabi, Zeyad. "BIAS : bioinformatics integrated application software and discovering relationships between transcription factors." Thesis, McGill University, 2004. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=81427.

Full text
Abstract:
In the first part of this thesis, we present a new development platform especially tailored to bioinformatics research and software development, called Bias (Bioinformatics Integrated Application Software), designed to provide the tools necessary for carrying out integrative bioinformatics research. Bias follows an object-relational strategy for providing persistent objects, allows third-party tools to be easily incorporated within the system, and supports standards and data-exchange protocols common to bioinformatics. The second part of this thesis concerns the design and implementation of modules and libraries within Bias related to transcription factors. We present a module in Bias that focuses on discovering competitive relationships between mouse and yeast transcription factors. By competitive relationships we mean the competitive binding of two transcription factors for a given binding site. We also present a method that divides a transcription factor's set of binding sites into two or more different sets when constructing PSSMs.
APA, Harvard, Vancouver, ISO, and other styles
30

Wu, Chao. "Intelligent Data Mining on Large-scale Heterogeneous Datasets and its Application in Computational Biology." University of Cincinnati / OhioLINK, 2014. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1406880774.

Full text
APA, Harvard, Vancouver, ISO, and other styles
31

Batt, Gregory. "Design, optimization and control in systems and synthetic biology." Habilitation à diriger des recherches, Université Paris-Diderot - Paris VII, 2014. http://tel.archives-ouvertes.fr/tel-00958566.

Full text
Abstract:
How good is our understanding of the way cells treat information and make decisions? To what extent does our current understanding enable us to reprogram and control the way cells behave? In this manuscript I describe several approaches developed for the computational analysis of the dynamics of biological networks. In particular, I present work done on (i) the analysis of large gene networks with partial information on parameter values, (ii) the use of specification languages to express observations or desired properties in an abstract manner and to efficiently search for parameters satisfying these properties, and (iii) recent efforts to use models to drive gene expression in real time at the cellular level.
APA, Harvard, Vancouver, ISO, and other styles
32

Takane, Marina. "Inference of gene regulatory networks from large scale gene expression data." Thesis, McGill University, 2003. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=80883.

Full text
Abstract:
With the advent of the age of genomics, an increasing number of genes have been identified and their functions documented. However, not as much is known of specific regulatory relations among genes (e.g. gene A up-regulates gene B). At the same time, there is an increasing number of large-scale gene expression datasets, in which the mRNA transcript levels of tens of thousands of genes are measured at a number of time points, or under a number of different conditions. A number of studies have proposed to find gene regulatory networks from such datasets. Our method is a modification of the continuous-time neural network method of Wahde & Hertz [25, 26]. The genetic algorithm used to update weights was replaced with Levenberg-Marquardt optimization. We tested our method on artificial data as well as Spellman's yeast cell cycle data [22]. Results indicated that this method was able to detect salient regulatory relations between genes.
APA, Harvard, Vancouver, ISO, and other styles
33

Hu, Chunxiao. "Microfluidic electrophysiological device for genetic and chemical biology screening of nematodes." Thesis, University of Southampton, 2013. https://eprints.soton.ac.uk/368250/.

Full text
Abstract:
Genetic and chemical biology screens of C. elegans have been of enormous benefit in providing fundamental insight into neural function and neuroactive drugs. Recently the exploitation of microfluidic devices has added greater power to this experimental approach, providing more discrete and higher-throughput phenotypic analysis of neural systems. This repertoire is extended through the design of a semi-automated microfluidic device, NeuroChip, which has been optimised for selecting worms based on the electrophysiological features of the pharyngeal neural network. This device has the capability to sort mutant from wild-type worms based on high-definition extracellular electrophysiological recordings. NeuroChip resolves discrete differences in excitatory, inhibitory and neuromodulatory components of the neural network from individual animals. Worms may be fed into the device consecutively from a reservoir and recovered unharmed. It combines microfluidics with integrated electrode recording for sequential trapping, restraining, recording, releasing and recovering of C. elegans. Thus mutant worms may be selected, recovered and propagated, enabling mutagenesis screens based on an electrophysiological phenotype. Drugs may be rapidly applied during the recording, thus permitting compound screening. For toxicology, this analysis can provide a precise description of sub-lethal effects on neural function. The chamber has been modified to accommodate the L2 larval stage of C. elegans and the J2 stage of G. pallida, showing applicability to small nematodes, including parasitic species that are otherwise not tractable to this experimental approach. NeuroChip may be combined with optogenetics for targeted interrogation of the function of the neural circuit. NeuroChip thus adds a new tool for exploitation of C. elegans and G. pallida and has applications in neurogenetics, drug discovery and neurotoxicology.
APA, Harvard, Vancouver, ISO, and other styles
34

Qiu, Shuhao. "Computational Simulation and Analysis of Mutations: Nucleotide Fixation, Allelic Age and rare Genetic Variations in population." University of Toledo Health Science Campus / OhioLINK, 2016. http://rave.ohiolink.edu/etdc/view?acc_num=mco1430494327.

Full text
APA, Harvard, Vancouver, ISO, and other styles
35

Krohn, Jonathan Jacob Pastushchyn. "Genes contributing to variation in fear-related behaviour." Thesis, University of Oxford, 2013. http://ora.ox.ac.uk/objects/uuid:1e8e40bd-9a98-405f-9463-e9423f0a60ca.

Full text
Abstract:
Anxiety and depression are highly prevalent diseases with common heritable elements, but the particular genetic mechanisms and biological pathways underlying them are poorly understood. Part of the challenge in understanding the genetic basis of these disorders is that they are polygenic and often context-dependent. In my thesis, I apply a series of modern statistical tools to ascertain some of the myriad genetic and environmental factors that underlie fear-related behaviours in nearly two thousand heterogeneous stock mice, which serve as animal models of anxiety and depression. Using a Bayesian method called Sparse Partitioning and a frequentist method called Bagphenotype, I identify gene-by-sex interactions that contribute to variation in fear-related behaviours, such as those displayed in the elevated plus maze and the open field test, although I demonstrate that the contributions are generally small. Also using Bagphenotype, I identify hundreds of gene-by-environment interactions related to these traits. The interacting environmental covariates are diverse, ranging from experimenter to season of the year. With gene expression data from a brain structure associated with anxiety called the hippocampus, I generate modules of co-expressed genes and map them to the genome. Two of these modules were enriched for key nervous system components — one for dendritic spines, another for oligodendrocyte markers — but I was unable to find significant correlations between them and fear-related behaviours. Finally, I employed another Bayesian technique, Sparse Instrumental Variables, which takes advantage of conditional probabilities to identify hippocampus genes whose expression appears not just to be associated with variation in fear-related behaviours, but to cause variation in those phenotypes.
APA, Harvard, Vancouver, ISO, and other styles
36

Seidel, Richard Alan. "Conservation Biology of the Gammarus pecos Species Complex: Ecological Patterns across Aquatic Habitats in an Arid Ecosystem." Miami University / OhioLINK, 2009. http://rave.ohiolink.edu/etdc/view?acc_num=miami1251472290.

Full text
APA, Harvard, Vancouver, ISO, and other styles
37

Kuntala, Prashant Kumar. "Optimizing Biomarkers From an Ensemble Learning Pipeline." Ohio University / OhioLINK, 2017. http://rave.ohiolink.edu/etdc/view?acc_num=ohiou1503592057943043.

Full text
APA, Harvard, Vancouver, ISO, and other styles
38

Bebek, Gurkan. "Functional Characteristics of Cancer Driver Genes in Colorectal Cancer." Case Western Reserve University School of Graduate Studies / OhioLINK, 2017. http://rave.ohiolink.edu/etdc/view?acc_num=case1495012693440067.

Full text
APA, Harvard, Vancouver, ISO, and other styles
39

DeBlasio, Daniel. "NEW COMPUTATIONAL APPROACHES FOR MULTIPLE RNA ALIGNMENT AND RNA SEARCH." Master's thesis, University of Central Florida, 2009. http://digital.library.ucf.edu/cdm/ref/collection/ETD/id/4070.

Full text
Abstract:
In this thesis we explore the theory and history behind RNA alignment. Normal sequence alignments as studied by computer scientists can be completed in $O(n^2)$ time in the naive case. The process involves taking two input sequences and finding the list of edits that can transform one sequence into the other. This process is applied to biology in many forms, such as the creation of multiple alignments and the search of genomic sequences. When you take into account the RNA sequence structure, the problem becomes even harder. Multiple RNA structure alignment is particularly challenging because covarying mutations make sequence information alone insufficient. Existing tools for multiple RNA alignments first generate pair-wise RNA structure alignments and then build the multiple alignment using only the sequence information. Here we present PMFastR, an algorithm which iteratively uses a sequence-structure alignment procedure to build a multiple RNA structure alignment. PMFastR also has low memory consumption, allowing for the alignment of large sequences such as 16S and 23S rRNA. Specifically, we reduce the memory consumption to $\sim O(band^2*m)$ where $band$ is the banding size. Other solutions are $\sim O(n^2*m)$ where $n$ and $m$ are the lengths of the target and query respectively. The algorithm also provides a method to utilize a multi-core environment. We present results on benchmark data sets from BRAliBase, which show PMFastR outperforms other state-of-the-art programs. Furthermore, we regenerate 607 Rfam seed alignments and show that our automated process creates multiple alignments similar to the manually-curated Rfam seed alignments. While these methods can also be applied directly to genome sequence search, the abundance of new multiple-species genome alignments presents a new area for exploration. Many multiple alignments of whole genomes are available and these alignments keep growing in size.
These alignments can provide more information to the searcher than just a single sequence. Using the methodology from sequence-structure alignment we developed AlnAlign, which searches an entire genome alignment using RNA sequence structure. While programs have been readily available to align alignments, this is the first to our knowledge that is specifically designed for RNA sequences. This algorithm is presented only in theory and is yet to be tested.
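The $O(n^2)$ edit-distance dynamic program mentioned at the start of this abstract can be sketched as follows (a generic textbook formulation over plain sequences, not PMFastR itself, which additionally scores secondary structure):

```python
def edit_distance(a, b):
    # Classic O(n*m) dynamic program: dp[i][j] is the minimum number of
    # edits (substitution, insertion, deletion) turning a[:i] into b[:j].
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i                # delete all of a[:i]
    for j in range(m + 1):
        dp[0][j] = j                # insert all of b[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # delete a[i-1]
                           dp[i][j - 1] + 1,        # insert b[j-1]
                           dp[i - 1][j - 1] + cost) # match / substitute
    return dp[n][m]

print(edit_distance("GCAUGC", "GCUAGC"))
```

Banding, as used by PMFastR, restricts $j$ to a window around $i$, which is what reduces the memory footprint from the full $n \times m$ table to a band of width $band$.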
M.S.
School of Electrical Engineering and Computer Science
Engineering and Computer Science
Computer Science MS
APA, Harvard, Vancouver, ISO, and other styles
40

Yerardi, Jason T. "The Implementation and Evaluation of Bioinformatics Algorithms for the Classification of Arabinogalactan-Proteins in Arabidopsis thaliana." Ohio University / OhioLINK, 2011. http://rave.ohiolink.edu/etdc/view?acc_num=ohiou1301069861.

Full text
APA, Harvard, Vancouver, ISO, and other styles
41

Deeter, Anthony E. "A Web-Based Software System Utilizing Consensus Networks to Infer Gene Interactions." University of Akron / OhioLINK, 2018. http://rave.ohiolink.edu/etdc/view?acc_num=akron152302071289795.

Full text
APA, Harvard, Vancouver, ISO, and other styles
42

Evans, Daniel T. "A SNP Microarray Analysis Pipeline Using Machine Learning Techniques." Ohio University / OhioLINK, 2010. http://rave.ohiolink.edu/etdc/view?acc_num=ohiou1289950347.

Full text
APA, Harvard, Vancouver, ISO, and other styles
43

Ford, Colby Tyler. "An Integrated Phylogeographic Analysis of the Bantu Migration." Thesis, The University of North Carolina at Charlotte, 2018. http://pqdtopen.proquest.com/#viewpdf?dispub=10748780.

Full text
Abstract:

"Bantu" is a term used to describe lineages of people in around 600 different ethnic groups on the African continent, ranging from modern-day Cameroon to South Africa. The migration of the Bantu people, which occurred around 3,000 years ago, was influential in spreading culture, language, and genetic traits and helped to shape human diversity on the continent. Research in the 1970s divided the Bantu languages geographically into 16 zones, now known as "Guthrie zones" (Guthrie, 1971).

Researchers have postulated the migratory pattern of the Bantu people by examining cultural information, linguistic traits, or small genetic datasets. These studies offer differing results due to variations in the data type used. Here, an assessment of the Bantu migration is made using a large dataset of combined cultural data and genetic (Y-chromosomal and mitochondrial) data.

One working hypothesis is that the Bantu expansion can be characterized by a primary split in lineages, which occurred early on and prior to the population spreading south through what is now called the Congolese forest (i.e. "early split"). A competing hypothesis is that the split occurred south of the forest (i.e. "late split").

Using the comprehensive dataset, a phylogenetic tree was developed on which to reconstruct the relationships of the Bantu lineages. With an understanding of these lineages in hand, the changes between Guthrie zones were traced geospatially.

Evidence supporting the "early split" hypothesis was found; however, evidence for several complex and convoluted paths across the continent was also found. These findings were then analyzed using dimensionality reduction and machine learning techniques to further assess the confidence of the model.

APA, Harvard, Vancouver, ISO, and other styles
44

Stanfield, Zachary. "Comprehensive Characterization of the Transcriptional Signaling of Human Parturition through Integrative Analysis of Myometrial Tissues and Cell Lines." Case Western Reserve University School of Graduate Studies / OhioLINK, 2019. http://rave.ohiolink.edu/etdc/view?acc_num=case1562863761406809.

Full text
APA, Harvard, Vancouver, ISO, and other styles
45

Ramraj, Varun. "Exploiting whole-PDB analysis in novel bioinformatics applications." Thesis, University of Oxford, 2014. http://ora.ox.ac.uk/objects/uuid:6c59c813-2a4c-440c-940b-d334c02dd075.

Full text
Abstract:
The Protein Data Bank (PDB) is the definitive electronic repository for experimentally-derived protein structures, composed mainly of those determined by X-ray crystallography. Approximately 200 new structures are added weekly to the PDB, and at the time of writing, it contains approximately 97,000 structures. This represents an expanding wealth of high-quality information, but there seem to be few bioinformatics tools that consider and analyse these data as an ensemble. This thesis explores the development of three efficient, fast algorithms and software implementations to study protein structure using the entire PDB. The first project is a crystal-form matching tool that takes a unit cell and quickly (< 1 second) retrieves the most related matches from the PDB. The unit cell matches are combined with sequence alignments using a novel Family Clustering Algorithm to display the results in a user-friendly way. The software tool, Nearest-cell, has been incorporated into the X-ray data collection pipeline at the Diamond Light Source, and is also available as a public web service. The bulk of the thesis is devoted to the study and prediction of protein disorder. Initially, in trying to update and extend an existing predictor, RONN, the limitations of the method were exposed, and a novel predictor (called MoreRONN) was developed that incorporates a novel sequence-based clustering approach to disorder data inferred from the PDB and DisProt. MoreRONN is now clearly the best-in-class disorder predictor and will soon be offered as a public web service. The third project explores the development of a clustering algorithm for protein structural fragments that can work on the scale of the whole PDB. While protein structures have long been clustered into loose families, there has to date been no comprehensive analytical clustering of short (~6 residue) fragments.
A novel fragment clustering tool was built that is now leading to a public database of fragment families and representative structural fragments that should prove extremely helpful for both basic understanding and experimentation. Together, these three projects exemplify how cutting-edge computational approaches applied to extensive protein structure libraries can provide user-friendly tools that address critical everyday issues for structural biologists.
APA, Harvard, Vancouver, ISO, and other styles
46

Kalluru, Vikram Gajanan. "Identify Condition Specific Gene Co-expression Networks." The Ohio State University, 2012. http://rave.ohiolink.edu/etdc/view?acc_num=osu1338304258.

Full text
APA, Harvard, Vancouver, ISO, and other styles
47

Dabdoub, Shareef Majed. "Applied Visual Analytics in Molecular, Cellular, and Microbiology." The Ohio State University, 2011. http://rave.ohiolink.edu/etdc/view?acc_num=osu1322602183.

Full text
APA, Harvard, Vancouver, ISO, and other styles
48

Zhong, Cuncong. "Computational Methods for Comparative Non-coding RNA Analysis: From Structural Motif Identification to Genome-wide Functional Classification." Doctoral diss., University of Central Florida, 2013. http://digital.library.ucf.edu/cdm/ref/collection/ETD/id/5894.

Full text
Abstract:
Non-coding RNA (ncRNA) plays critical functional roles in biological systems, such as regulation, catalysis, and modification. Non-coding RNAs exert their functions based on their specific structures, which makes a thorough understanding of their structures a key step towards their complete functional annotation. In this dissertation, we cover a suite of computational methods for the comparison of ncRNA secondary and 3D structures, and their applications to ncRNA molecular structural annotation and genome-wide functional surveys. Specifically, we have contributed the following five computational methods. First, we have developed an alignment algorithm to compare RNA structural motifs, which are recurrent RNA 3D structural fragments. Second, we have improved upon the previous alignment algorithm by incorporating base-stacking information and devising a new branch-and-bound algorithm. Third, we have developed a clustering pipeline for RNA structural motif classification using the above alignment methods. Fourth, we have generalized the clustering pipeline to a genome-wide analysis of RNA secondary structures. Finally, we have devised an ultra-fast alignment algorithm for RNA secondary structure using the sparse dynamic programming technique. A large number of novel RNA structural motif instances and ncRNA elements have been discovered throughout these studies. We anticipate that these computational methods will significantly facilitate the analysis of ncRNA structures in the future.
Ph.D.
Doctorate
Computer Science
Engineering and Computer Science
Computer Science
APA, Harvard, Vancouver, ISO, and other styles
49

Dutta, Sara. "A multi-scale computational investigation of cardiac electrophysiology and arrhythmias in acute ischaemia." Thesis, University of Oxford, 2014. http://ora.ox.ac.uk/objects/uuid:f5f68d8b-7a60-4109-91c8-6b1d80c7ee5b.

Full text
Abstract:
Sudden cardiac death is one of the leading causes of mortality in the western world. One of the main factors is myocardial ischaemia, in which there is a mismatch between blood demand and supply to the heart, which may lead to disturbed cardiac excitation patterns known as arrhythmias. Ischaemia is a dynamic and complex process, characterised by many electrophysiological changes that vary through space and time. Ischaemia-induced arrhythmic mechanisms, and the safety and efficacy of certain therapies, are still not fully understood. Most experimental studies are carried out in animals, due to the ethical and practical limitations of human experiments. Extrapolation of mechanisms from animal to human is therefore challenging, but can be facilitated by in silico models. Since the first cardiac cell model was built over 50 years ago, computer simulations have provided a wealth of information and insight that is not possible to obtain through experiments alone. Mathematical models and computational simulations therefore provide a powerful and complementary tool for the study of multi-scale problems. The aim of this thesis is to investigate pro-arrhythmic electrophysiological consequences of acute myocardial ischaemia, using a multi-scale computational modelling and simulation framework. Firstly, we present a novel method, combining computational simulations and optical mapping experiments, to characterise ischaemia-induced spatial differences modulating arrhythmic risk in rabbit hearts. Secondly, we use computer models to extend our investigation of acute ischaemia to human, by carrying out a thorough analysis of recent human action potential models under varied ischaemic conditions, to test their applicability to simulate ischaemia. Finally, we combine state-of-the-art knowledge and techniques to build a human whole ventricles model, in which we investigate how anti-arrhythmic drugs modulate arrhythmic mechanisms in the presence of ischaemia.
APA, Harvard, Vancouver, ISO, and other styles
50

Hayes, Matthew. "Algorithms to Resolve Large Scale and Complex StructuralVariants in the Human Genome." Case Western Reserve University School of Graduate Studies / OhioLINK, 2013. http://rave.ohiolink.edu/etdc/view?acc_num=case1372864570.

Full text
APA, Harvard, Vancouver, ISO, and other styles
