Dissertations / Theses on the topic 'Sequence alignment (Bioinformatics)'

Consult the top 50 dissertations / theses for your research on the topic 'Sequence alignment (Bioinformatics)'.

1

Birney, Ewan. "Sequence alignment in bioinformatics." Thesis, University of Cambridge, 2000. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.621653.

2

Jiang, Tianwei. "Sequence alignment: algorithm development and applications." View abstract or full-text, 2009. http://library.ust.hk/cgi/db/thesis.pl?ECED%202009%20JIANG.

3

Zhao, Kaiyong. "GPU accelerated sequence alignment." HKBU Institutional Repository, 2016. https://repository.hkbu.edu.hk/etd_oa/378.

Abstract:
DNA sequence alignment is a fundamental task in gene information processing: it searches for the location of a query string (usually newly collected DNA data) in existing, very large DNA sequence databases. Because of the huge amount of newly generated DNA data and the complexity of approximate string matching, sequence alignment is time-consuming, so reducing alignment time is a significant research problem. Several string-alignment algorithms based on hash comparison, suffix arrays and the Burrows-Wheeler transform (BWT) have been proposed for DNA sequence alignment. Although these algorithms achieve O(N) running time, they still cannot meet the increasing demand when run on traditional CPUs. Recently, GPUs have been widely accepted as efficient accelerators for many scientific and commercial applications. A typical GPU has thousands of processing cores, which can speed up repetitive computations significantly compared to multi-core CPUs. However, sequence alignment is a computation with intensive data access, i.e., it is memory-bound: access to GPU memory and I/O influences performance more than the computing capability of the GPU cores. By analyzing GPU memory and I/O characteristics, this thesis develops novel parallel algorithms for DNA sequence alignment applications. The thesis consists of six parts. The first two parts cover basic knowledge of DNA sequence alignment and GPU computing. The third part investigates the performance of data access on different types of GPU memory. The fourth part describes a parallel method to accelerate short-read sequence alignment based on the BWT algorithm. The fifth part proposes a parallel algorithm for accelerating BLASTN, one of the most popular sequence alignment tools, showing how multi-threaded control and multiple GPU cards can accelerate BLASTN significantly. The sixth part concludes the thesis.
To summarize, by analyzing the layout of GPU memory and comparing data access under multithreaded modes, this thesis derives an effective optimization method for sequence alignment on GPUs. The outcomes can help practitioners in bioinformatics work more efficiently by significantly reducing sequence alignment time.
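The BWT-based matching the abstract refers to can be illustrated, independently of any GPU code, with a minimal FM-index backward search in Python. The genome string and reads below are made-up toy inputs, and a production index would use compressed rank structures rather than full occurrence tables; this is only a sketch of the idea.

```python
def bwt_index(text):
    """Build a tiny FM-index: suffix array, first-column counts C, occurrence table."""
    text += "$"
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)
    # C[c] = number of characters in the text strictly smaller than c
    chars = sorted(set(bwt))
    C, total = {}, 0
    for c in chars:
        C[c] = total
        total += bwt.count(c)
    # occ[c][i] = occurrences of c in bwt[:i]
    occ = {c: [0] for c in chars}
    for ch in bwt:
        for c in chars:
            occ[c].append(occ[c][-1] + (1 if ch == c else 0))
    return sa, C, occ

def backward_search(pattern, C, occ, n):
    """Count exact occurrences of pattern via LF-mapping (backward search)."""
    lo, hi = 0, n + 1          # n + 1 rows, including the '$' sentinel
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + occ[c][lo]
        hi = C[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo

genome = "ACGTACGTGACG"
sa, C, occ = bwt_index(genome)
print(backward_search("ACG", C, occ, len(genome)))  # "ACG" occurs 3 times
```

The same backward search is what BWT-based short-read aligners execute per read, which is why memory-access patterns, rather than arithmetic, dominate their running time.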
4

Ho, Ngai-lam, and 何毅林. "Algorithms on constrained sequence alignment." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2004. http://hub.hku.hk/bib/B30201949.

5

Lu, Yue. "Improving the quality of multiple sequence alignment." College Station, Tex.: Texas A&M University, 2008. http://hdl.handle.net/1969.1/ETD-TAMU-3111.

6

Ye, Yongtao, and 叶永滔. "Aligning multiple sequences adaptively." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2014. http://hdl.handle.net/10722/206465.

Abstract:
With the rapid development of genome sequencing, an ever-increasing number of molecular biology analyses rely on the construction of an accurate multiple sequence alignment (MSA), such as motif detection, phylogeny inference and structure prediction. Although many methods have been developed over the last two decades, most of them may perform poorly on some types of input, in particular when families of sequences fall below thirty percent similarity. This thesis therefore introduces two effective approaches to improve the overall quality of multiple sequence alignment. First, by considering the similarity of the input sequences, we propose an adaptive approach that computes better substitution matrices for each pair of sequences and then applies progressive alignment. For example, for inputs with high similarity we consider the whole sequences and align them with a global pair hidden Markov model, while for those with moderately low similarity we may ignore the flanking regions and use local pair hidden Markov models. To test the effectiveness of this approach, we implemented a multiple sequence alignment tool called GLProbs and compared its performance with a dozen leading tools on three benchmark alignment databases; GLProbs' alignments have the best scores in almost all tests. We also evaluated the practicality of GLProbs' alignments in three biological applications, namely phylogenetic tree reconstruction, protein secondary structure prediction and the detection of high-risk members of the HPV-E6 family for cervical cancer, and the results are very encouraging. Second, building on this study, we propose another tool, PnpProbs, which constructs better multiple sequence alignments by better handling of guide trees. It classifies input sequences into two types: normally related sequences and distantly related sequences.
For normally related sequences, it uses an adaptive approach to construct the guide tree and aligns the sequences progressively along it. More precisely, it first estimates the input's discrepancy by computing the standard deviation of the sequences' percent identities and, based on this estimate, chooses the best method to construct the guide tree. For distantly related sequences, PnpProbs abandons the guide tree; instead it uses the non-progressive sequence annealing method to construct the multiple sequence alignment. By combining the strengths of progressive and non-progressive methods, and with a better way to construct the guide tree, PnpProbs significantly improves the quality of multiple sequence alignments, not only for general input sequences but also for very distantly related ones. With these encouraging empirical results, our software tools have gradually been adopted by the community; for example, GLProbs has been incorporated into the JAva Bioinformatics Analysis Web Services system (JABAWS).
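The dispatch step described above (discrepancy estimated as the standard deviation of pairwise percent identities, distantly related inputs diverted to non-progressive alignment) can be sketched as follows. The thresholds and category labels are illustrative assumptions, not values taken from PnpProbs, and the toy inputs are assumed to be pre-aligned, equal-length strings.

```python
from itertools import combinations
from statistics import mean, pstdev

def percent_identity(a, b):
    """Identity over the columns of two equal-length (pre-aligned) strings."""
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)

def choose_strategy(aligned_seqs, id_threshold=40.0, sd_threshold=15.0):
    """Toy dispatcher in the spirit of PnpProbs: progressive alignment for
    normally related inputs, sequence annealing for distantly related ones.
    Thresholds here are made-up illustrations, not the tool's actual values."""
    ids = [percent_identity(a, b) for a, b in combinations(aligned_seqs, 2)]
    if mean(ids) < id_threshold:
        return "non-progressive (sequence annealing)"
    # high spread of identities -> pick the guide tree adaptively
    if pstdev(ids) > sd_threshold:
        return "progressive, adaptive guide tree"
    return "progressive, standard guide tree"

seqs = ["ACGTAC", "ACGTTC", "ACGAAC"]
print(choose_strategy(seqs))
```

A real implementation would estimate percent identity from unaligned sequences (e.g. via k-mer counting) rather than require a prior alignment; the point here is only the decision structure.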
7

Aniba, Mohamed Radhouane. "Knowledge based expert system development in bioinformatics : applied to multiple sequence alignment of protein sequences." Strasbourg, 2010. https://publication-theses.unistra.fr/public/theses_doctorat/2010/ANIBA_Mohamed_Radhouane_2010.pdf.

Abstract:
The objective of this PhD project was the development of an integrated expert system to test, evaluate and optimize all the stages of the construction and analysis of a multiple sequence alignment. The new system was validated using standard benchmark cases and brings a new vision to software development in bioinformatics: knowledge-guided systems. The architecture used to build the expert system is highly modular and flexible, allowing AlexSys to evolve as new algorithms are made available. In the future, AlexSys will be used to further optimize each stage of the alignment process, for example by optimizing the input parameters of the different alignment programs. The inference engine could also be extended to identify combinations of algorithms that could provide complementary information about the input sequences. For example, well-aligned regions from different aligners could be identified and combined into a single consensus alignment. Additional structural and functional information could also be exploited to improve the final alignment accuracy. Finally, a crucial aspect of any bioinformatics tool is its accessibility and usability. Therefore, we are currently developing a web server and a web-services-based distributed system, and we will also design a novel visualization module that will provide an intuitive, user-friendly interface to all the information retrieved and constructed by AlexSys.
8

Isa, Mohammad Nazrin. "High performance reconfigurable architectures for biological sequence alignment." Thesis, University of Edinburgh, 2013. http://hdl.handle.net/1842/7721.

Abstract:
Bioinformatics and computational biology (BCB) is a rapidly developing multidisciplinary field encompassing a wide range of domains, including genomic sequence alignment, a fundamental tool in molecular biology for detecting homology between sequences. Sequence alignment is currently attracting close attention because of its great impact on quality of life, facilitating early disease diagnosis, identifying the characteristics of newly discovered sequences, and supporting drug engineering. With the vast growth of genomic data, searching for sequence homology over huge databases (often measured in gigabytes) cannot produce results within a realistic time, hence the need for acceleration. Following the exponential increase of biological databases resulting from the Human Genome Project (HGP), supercomputers and other parallel architectures, such as special-purpose Very Large Scale Integration (VLSI) chips, graphics processing units (GPUs) and field-programmable gate arrays (FPGAs), have become popular acceleration platforms. Nevertheless, there are always trade-offs between area, speed, power, cost, development time and reusability when selecting an acceleration platform. FPGAs generally offer more flexibility, higher performance and lower overheads; however, they suffer from a relatively low-level programming model compared with off-the-shelf processors such as standard microprocessors and GPUs. Because of these limitations, optimized FPGA core implementations are crucial for this technology to become viable in high-performance computing (HPC). This research proposes the use of state-of-the-art reprogrammable system-on-chip technology on FPGAs to accelerate three widely used sequence alignment algorithms: the Smith-Waterman algorithm with affine gap penalties, the profile hidden Markov model (HMM) algorithm, and the Basic Local Alignment Search Tool (BLAST) algorithm.
The three novel aspects of this research are as follows. First, the algorithms are designed and implemented in hardware, with each core achieving the highest performance compared to the state of the art. Second, an efficient scheduling strategy based on double buffering is adopted in the hardware architectures: when the alignment-matrix computation is overlapped with PE configuration in a folded systolic array, the overall throughput of the core increases significantly, because the PE configuration time is bounded and PEs are configured in parallel irrespective of the number of PEs in the array. In addition, using only two configuration elements per PE optimizes hardware resources and makes PE systolic arrays scalable without relying on restricted on-board memory resources. Finally, a new performance metric is devised that facilitates effective comparison of designs across different FPGA devices and families: the normalized performance indicator (speed-up per area per process technology) factors out the area and lithography advantages of any FPGA, resulting in fairer comparisons. The cores were designed in Verilog HDL and prototyped on an Alpha Data ADM-XRC-5LX card with a Virtex-5 XC5VLX110-3FF1153 FPGA. The implementation results show that the proposed architectures achieved giga cell updates per second (GCUPS) performances of 26.8, 29.5 and 24.2 for the Smith-Waterman with affine gap penalty, profile HMM and BLAST algorithms respectively. In terms of speed-up, the designed cores were compared against their corresponding software and previously reported FPGA implementations. Against equivalent software execution, the hardware acceleration of the optimal alignment algorithm yielded an average speed-up of 269x over the SSEARCH 35 software.
For profile HMM-based sequence alignment, the designed core achieved speed-ups of 103x and 8.3x over HMMER 2.0 and the latest version of HMMER (version 3.0) respectively, while the hardware implementation of gapped BLAST with the two-hit method achieved a greater than tenfold speed-up over the latest NCBI BLAST software. For comparison against other reported FPGA implementations, the proposed normalized performance indicator was used to evaluate the designed architectures fairly: the first architecture achieved more than 50 percent improvement, the profile HMM core gained a normalized speed-up of 1.34, and the gapped BLAST core achieved an 11x speed-up after factoring out the advantages of the Virtex-5 FPGA. Further analysis of cost and power showed that the core achieved 0.46 MCUPS per dollar and 958.1 MCUPS per watt, demonstrating that FPGAs can be an attractive platform for high-performance computation, with a smaller area footprint and a more economical, 'green' profile than other acceleration platforms. Higher throughput can be achieved by redeploying the cores on newer, bigger and faster FPGAs with minimal design effort.
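The first of the accelerated algorithms, Smith-Waterman with affine gap penalties, is easiest to see in its software form (Gotoh's three-matrix recurrences); the hardware cores compute the same cell updates in systolic fashion. A minimal reference sketch, with illustrative scoring parameters rather than the thesis's own:

```python
def smith_waterman_affine(s, t, match=2, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Local alignment score with affine gap penalties (Gotoh's recurrences).
    H: best score of an alignment ending at (i, j); E/F: best score of an
    alignment ending in a gap in s / in t. Scores are illustrative defaults."""
    n, m = len(s), len(t)
    NEG = float("-inf")
    H = [[0] * (m + 1) for _ in range(n + 1)]
    E = [[NEG] * (m + 1) for _ in range(n + 1)]  # gap in s (horizontal move)
    F = [[NEG] * (m + 1) for _ in range(n + 1)]  # gap in t (vertical move)
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            E[i][j] = max(E[i][j - 1] + gap_extend, H[i][j - 1] + gap_open)
            F[i][j] = max(F[i - 1][j] + gap_extend, H[i - 1][j] + gap_open)
            sub = match if s[i - 1] == t[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best

# 7 matches (14) minus one single-residue gap open (3) -> 11
print(smith_waterman_affine("ACGTACGT", "ACGACGT"))
```

Each cell depends only on its left, upper and diagonal neighbours, which is exactly why the computation maps so well onto a systolic array of processing elements, one anti-diagonal per clock cycle.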
9

Kim, Eagu. "Inverse Parametric Alignment for Accurate Biological Sequence Comparison." Diss., The University of Arizona, 2008. http://hdl.handle.net/10150/193664.

Abstract:
For as long as biologists have been computing alignments of sequences, the question of what values to use for scoring substitutions and gaps has persisted. In practice, substitution scores are usually chosen by convention, and gap penalties are often found by trial and error. In contrast, a rigorous way to determine parameter values appropriate for aligning biological sequences is to solve the problem of inverse parametric sequence alignment: given examples of biologically correct reference alignments, find parameter values that make the examples score as close as possible to optimal alignments of their sequences. The reference alignments currently available contain regions where the alignment is not specified, which leads to a version of the problem with partial examples. In this dissertation, we develop a new polynomial-time algorithm for inverse parametric sequence alignment that is simple to implement, fast in practice, and can learn hundreds of parameters simultaneously from hundreds of examples. Computational results with partial examples show that the best possible values for all 212 parameters of the standard alignment scoring model for protein sequences can be computed from 200 examples in 4 hours of computation on a standard desktop machine. We also consider a new scoring model with a small number of additional parameters that incorporates predicted secondary structure for the protein sequences. By learning parameter values for this new secondary-structure-based model, we can improve on the alignment accuracy of the standard model by as much as 15% for sequences with less than 25% identity.
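The inverse problem can be mimicked, very crudely, by a brute-force search over a single gap parameter: pick the value under which a trusted reference alignment scores closest to the optimal alignment of its sequences. The dissertation's algorithm is exact and polynomial-time over hundreds of parameters; this sketch, with made-up sequences and a linear gap model, only conveys the objective function.

```python
def score_alignment(a, b, match, mismatch, gap):
    """Score a fixed (reference) pairwise alignment under given parameters."""
    s = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            s += gap
        elif x == y:
            s += match
        else:
            s += mismatch
    return s

def optimal_score(s, t, match, mismatch, gap):
    """Global (Needleman-Wunsch) optimum under the same linear-gap model."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * gap
    for j in range(1, m + 1):
        D[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            D[i][j] = max(D[i - 1][j - 1] + sub,
                          D[i - 1][j] + gap,
                          D[i][j - 1] + gap)
    return D[n][m]

def fit_gap_penalty(ref_a, ref_b, match=2, mismatch=-1, gaps=range(-5, 0)):
    """Brute-force inverse alignment over one parameter: choose the gap
    penalty minimizing (optimal score - reference score), which is always
    non-negative since the optimum dominates any fixed alignment."""
    s, t = ref_a.replace("-", ""), ref_b.replace("-", "")
    return min(gaps, key=lambda g: optimal_score(s, t, match, mismatch, g)
               - score_alignment(ref_a, ref_b, match, mismatch, g))

print(fit_gap_penalty("ACGT-A", "AC-TTA"))  # -1 makes the reference near-optimal
```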
10

Midic, Uros. "Genome-Wide Prediction of Intrinsic Disorder; Sequence Alignment of Intrinsically Disordered Proteins." Diss., Temple University Libraries, 2012. http://cdm16002.contentdm.oclc.org/cdm/ref/collection/p245801coll10/id/159800.

Abstract:
Intrinsic disorder (ID) is defined as a lack of stable tertiary and/or secondary structure under physiological conditions in vitro. Intrinsically disordered proteins (IDPs) are highly abundant in nature and possess a number of crucial biological functions, being involved in regulation, recognition, signaling and control; their functional repertoire complements the functions of ordered proteins. Intrinsically disordered regions (IDRs) of IDPs have a different amino-acid composition than structured regions and proteins, a fact that has been exploited to develop predictors of ID; the best predictors currently achieve around 80% per-residue accuracy. Earlier studies revealed that some IDPs are associated with various human diseases, including cancer, cardiovascular disease, amyloidoses, neurodegenerative diseases, diabetes and others. We developed a methodology for prediction and analysis of the abundance of intrinsic disorder on the genome scale, which combines data from various gene and protein databases and utilizes several ID prediction tools. We used this methodology to perform a large-scale computational analysis of the abundance of (predicted) ID in transcripts of various classes of disease-related genes, and further analyzed the relationships between ID and the occurrence of alternative splicing and Molecular Recognition Features (MoRFs) in human disease classes. An important, previously unaddressed issue with such genome-wide applications of ID predictors is that, for less-studied organisms, in addition to experimentally confirmed protein sequences there is a large number of putative sequences, predicted by automated annotation procedures and lacking experimental confirmation. In the human genome, these predicted sequences have significantly higher predicted disorder content.
I investigated the hypothesis that this discrepancy is an artifact: that it is due to incorrectly annotated parts of the putative protein sequences, which exhibit some similarity to confirmed IDRs and thus lead to high predicted ID content. I developed a procedure to create synthetic nonsense peptide sequences by translating non-coding regions of genomic sequences and by translating coding regions with incorrect codon alignment. I then trained several classifiers to discriminate between confirmed sequences and synthetic nonsense sequences, and used these predictors to estimate the abundance of incorrectly annotated regions in putative sequences and to explore the link between such regions and intrinsic disorder. Sequence alignment is an essential tool in modern bioinformatics. Substitution matrices, such as the BLOSUM family, contain 20x20 parameters related to the evolutionary rates of amino-acid substitutions. I explored various strategies for extending sequence alignment to utilize (predicted) disorder/structure information about the sequences being aligned. These strategies employ an extended 40-symbol alphabet containing 20 symbols for amino acids in ordered regions and 20 symbols for amino acids in IDRs, as well as expanded 40x40 and 40x20 matrices. The new matrices exhibit significant and substantial differences between the substitution scores for IDRs and those for structured regions. Tests on a reference dataset show that the 40x40 matrices perform worse than the standard 20x20 matrices, while the 40x20 matrices, used in a scenario where ID is predicted for a query sequence but not for the target sequences, perform at least comparably. However, I also demonstrate that the variation in performance between the 20x20 and 40x20 matrices is insignificant compared to the variation in obtained matrices that occurs when the underlying algorithm for calculating substitution matrices is changed.
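The extended 40-symbol alphabet can be sketched with a simple re-encoding step that an aligner (or a 40x40 / 40x20 substitution matrix) would consume. The lowercase-for-disordered convention below is a hypothetical encoding chosen for readability, not the dissertation's actual symbol set, and the mask values in the example are made up.

```python
AMINO = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino-acid letters

def encode(seq, disorder_mask):
    """Re-encode a protein sequence over a 40-symbol alphabet: uppercase for
    ordered residues, lowercase for residues predicted to be intrinsically
    disordered. A matrix indexed by these 40 symbols can then score
    substitutions in IDRs differently from structured regions."""
    out = []
    for aa, disordered in zip(seq, disorder_mask):
        if aa not in AMINO:
            raise ValueError(f"unknown residue {aa!r}")
        out.append(aa.lower() if disordered else aa)
    return "".join(out)

# toy peptide with a (made-up) disordered C-terminal tail
print(encode("MKLVNPESG", [0, 0, 0, 0, 0, 1, 1, 1, 1]))  # MKLVNpesg
```

In the 40x20 scenario the abstract describes, only the query would be encoded this way, while the targets stay in the plain 20-letter alphabet.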
11

Peng, Yu, and 彭煜. "Iterative de Bruijn graph assemblers for second-generation sequencing reads." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2012. http://hub.hku.hk/bib/B50534051.

Abstract:
The recent advance of second-generation sequencing technologies has made it possible to generate a vast amount of short reads from a DNA (cDNA) sample. Current short-read assemblers use the de Bruijn graph, in which each vertex is a k-mer and each edge from vertex u to vertex v indicates that u and v appear consecutively in a read, to produce contigs. De Bruijn graph assemblers face three major problems: (1) branches, due to errors and repeats; (2) gaps, due to low or uneven sequencing depth; and (3) errors, due to sequencing errors. The choice of k is a crucial trade-off: a low k value leads to fewer gaps but more branches, while a high k value leads to fewer branches but more gaps. In this thesis, I first analyze the fundamental genome assembly problem and then propose an iterative de Bruijn graph assembler (IDBA), which iterates from low to high k values to construct a de Bruijn graph with fewer branches and fewer gaps than any de Bruijn graph assembler using a fixed k value. Second-generation sequencing data from metagenomic, single-cell and transcriptome samples is then investigated, and IDBA is tailored with special treatments for the specific issues of each kind of data. For metagenomic sequencing data, a graph-partition algorithm is proposed to separate the de Bruijn graph into dense components, which represent similar regions in subspecies of the same species, and multiple sequence alignment is used to produce a consensus for each component. For sequencing data with highly uneven depth, such as single-cell and metagenomic data, a method called local assembly is designed to reconstruct missing k-mers in low-depth regions. Then, based on the observation that short, relatively low-depth contigs are more likely to be erroneous, progressive depth thresholds on contigs are used to remove errors in both low-depth and high-depth regions iteratively.
For transcriptome sequencing data, a variant of the progressive-depth method is adopted to decompose the de Bruijn graph into components corresponding to transcripts of the same gene, and the transcripts in each component are then found by considering read and paired-end read support. Extensive experiments on both simulated and real data show that the IDBA assemblers outperform existing assemblers, constructing longer contigs with higher completeness and similar or better accuracy. The running time of the IDBA assemblers is comparable to that of existing algorithms, while their memory cost is usually lower.
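The branch/gap trade-off that motivates iterating over k can be seen in a few lines: building the de Bruijn graph of the same toy reads at a small and a larger k changes the number of branching vertices. The reads below are made-up examples.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Build a de Bruijn graph for a fixed k: vertices are k-mers, and an
    edge u -> v means u and v appear consecutively (overlapping by k-1
    characters) in some read."""
    edges = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            u, v = read[i:i + k], read[i + 1:i + k + 1]
            edges[u].add(v)
    return edges

def count_branches(edges):
    """Vertices with out-degree > 1: the branches IDBA reduces by raising k."""
    return sum(1 for succs in edges.values() if len(succs) > 1)

reads = ["ACGTACG", "CGTACGT", "GTACGTT"]
g3 = de_bruijn(reads, 3)   # small k: more connectivity, but a branch appears
g5 = de_bruijn(reads, 5)   # larger k: the branch disappears
print(count_branches(g3), count_branches(g5))  # 1 0
```

At k=3 the repeat-induced branch at vertex CGT (followed by both GTA and GTT) vanishes at k=5, where the longer context disambiguates it; the cost of large k, not visible in this tiny example, is that low-coverage regions fall apart into gaps.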
12

Gharazi, Elham. "IMAP : design and implementation of an interactive and integrative programming environment for progressive multiple sequence alignment and phylogeny reconstruction." PhD thesis, Faculty of Engineering and Information Technologies, 2011. http://hdl.handle.net/2123/9328.

13

Lee, Marianne M. "A two-pronged approach to improve distant homology detection." Columbus, Ohio : Ohio State University, 2009. http://rave.ohiolink.edu/etdc/view?acc%5Fnum=osu1242235868.

14

Hossain, A. S. Md Mukarram. "Scalable tools for high-throughput viral sequence analysis." Thesis, University of Cambridge, 2017. https://www.repository.cam.ac.uk/handle/1810/276228.

Abstract:
Viral sequence data are increasingly being used to estimate evolutionary and epidemiological parameters in order to understand the dynamics of viral diseases. This thesis focuses on developing novel and improved computational methods for high-throughput analysis of large viral sequence datasets. I have developed a novel computational pipeline, Pipelign, to detect potentially unrelated sequences within groups of viral sequences during sequence alignment; Pipelign detected a large number of unrelated and mis-annotated sequences in several viral sequence datasets collected from GenBank. I subsequently developed ANVIL, a machine-learning-based recombination detection and subtyping framework for pathogen sequences, whose performance was benchmarked using two large HIV datasets collected from the Los Alamos HIV Sequence Database and the UK HIV Drug Resistance Database, as well as on simulated data. Finally, I present a computational pipeline named Phlow for rapid phylodynamic inference from heterochronous pathogen sequence data. Phlow combines specialised, published analysis tools to infer important phylodynamic parameters from large datasets; it was run on three empirical viral datasets and its outputs were compared with published results, showing that Phlow is suitable for high-throughput exploratory phylodynamic analysis of large viral datasets. Combined, these three novel computational tools offer a comprehensive system for large-scale viral sequence analysis addressing three important aspects: 1) establishing accurate evolutionary histories, 2) recombination detection and subtyping, and 3) inferring phylodynamic history from heterochronous sequence datasets.
15

Kamali, Amir Hossein. "Automated, Systematic and Parallel Approaches to Software Testing in Bioinformatics." Thesis, The University of Sydney, 2016. http://hdl.handle.net/2123/14673.

Abstract:
Software quality assurance becomes especially critical when bioinformatics tools are used in a translational medical setting, such as the analysis and interpretation of biological data. We must ensure that only validated algorithms are used, and that they are implemented correctly in the analysis pipeline and not disrupted by hardware or software failure. In this thesis, I review common quality assurance practice and guidelines for bioinformatics software testing. Furthermore, I present a novel cloud-based framework to enable automated testing of genetic sequence alignment programs. The framework performs testing based on gold-standard simulated data sets and on metamorphic testing. I demonstrate its effectiveness using two widely used sequence alignment programs, BWA and Bowtie, together with fault-seeded 'mutant' versions of both. This preliminary study demonstrates that such a cloud-based software testing framework is an effective and promising way to implement quality assurance for bioinformatics software used in genomic medicine.
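A metamorphic relation of the kind this framework relies on can be sketched with a stand-in exact-match "aligner" (the real framework drives BWA and Bowtie): a read that maps at position p must map at the mirrored position on the reverse-complemented reference, so faults can be detected without knowing the correct alignment. The helper names and reference string are made up, and the relation as written assumes the read matches at a unique position.

```python
def revcomp(s):
    """Reverse complement of a DNA string."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[c] for c in reversed(s))

def align(read, ref):
    """Stand-in for a real aligner such as BWA or Bowtie:
    leftmost exact-match position, or -1 if the read does not map."""
    return ref.find(read)

def metamorphic_revcomp_check(read, ref):
    """Metamorphic relation: if the read maps at position p on ref, its
    reverse complement must map at len(ref) - p - len(read) on the
    reverse-complemented reference. No oracle alignment is needed,
    only self-consistency between the two runs."""
    p = align(read, ref)
    q = align(revcomp(read), revcomp(ref))
    if p == -1 or q == -1:
        return p == -1 and q == -1   # unmapped reads must stay unmapped
    return q == len(ref) - p - len(read)

ref = "ACGTTGCAAGT"
print(metamorphic_revcomp_check("TTGC", ref))  # a consistent aligner -> True
```

A fault-seeded "mutant" aligner that, say, drops the last base of each read would violate this relation on some inputs, which is exactly how such a test flags the bug without gold-standard alignments.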
16

Herman, Joseph L. "Multiple sequence analysis in the presence of alignment uncertainty." Thesis, University of Oxford, 2014. http://ora.ox.ac.uk/objects/uuid:88a56d9f-a96e-48e3-b8dc-a73f3efc8472.

Abstract:
Sequence alignment is one of the most intensely studied problems in bioinformatics, and is an important step in a wide range of analyses. An issue that has gained much attention in recent years is the fact that downstream analyses are often highly sensitive to the specific choice of alignment. One way to address this is to jointly sample alignments along with other parameters of interest. In order to extend the range of applicability of this approach, the first chapter of this thesis introduces a probabilistic evolutionary model for protein structures on a phylogenetic tree; since protein structures typically diverge much more slowly than sequences, this allows for more reliable detection of remote homologies, improving the accuracy of the resulting alignments and trees, and reducing sensitivity of the results to the choice of dataset. In order to carry out inference under such a model, a number of new Markov chain Monte Carlo approaches are developed, allowing for more efficient convergence and mixing on the high-dimensional parameter space. The second part of the thesis presents a directed acyclic graph (DAG)-based approach for representing a collection of sampled alignments. This DAG representation allows the initial collection of samples to be used to generate a larger set of alignments under the same approximate distribution, enabling posterior alignment probabilities to be estimated reliably from a reasonable number of samples. If desired, summary alignments can then be generated as maximum-weight paths through the DAG, under various types of loss or scoring functions. The acyclic nature of the graph also permits various other types of algorithms to be easily adapted to operate on the entire set of alignments in the DAG. In the final part of this work, methodology is introduced for alignment-DAG-based sequence annotation using hidden Markov models, and RNA secondary structure prediction using stochastic context-free grammars. 
Results on test datasets indicate that the additional information contained within the DAG allows for improved predictions, resulting in substantial gains over simply analysing a set of alignments one by one.
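The extraction of a summary alignment as a maximum-weight path through the DAG can be sketched as follows; the graph, weights and node names are toy stand-ins for alignment columns and their estimated posterior probabilities, and real loss functions would weight edges differently.

```python
from functools import lru_cache

def max_weight_path(dag, source, sink):
    """Maximum-weight source-to-sink path in a DAG via memoized recursion.
    dag maps each vertex to a list of (successor, edge_weight) pairs.
    Returns (total weight, path as a tuple of vertices)."""
    @lru_cache(maxsize=None)
    def best(u):
        if u == sink:
            return (0.0, (sink,))
        candidates = []
        for v, w in dag.get(u, []):
            total, path = best(v)
            if path is not None:          # v can actually reach the sink
                candidates.append((w + total, (u,) + path))
        return max(candidates) if candidates else (float("-inf"), None)
    return best(source)

# toy alignment DAG: two candidate column paths between boundary nodes S and E,
# edge weights standing in for estimated posterior column probabilities
dag = {
    "S": [("c1", 0.9), ("c2", 0.6)],
    "c1": [("c3", 0.4)],
    "c2": [("c3", 0.8)],
    "c3": [("E", 1.0)],
}
print(max_weight_path(dag, "S", "E"))
```

Because the graph is acyclic, the memoized recursion visits each vertex once, so summary extraction stays linear in the size of the DAG even when it encodes a large collection of sampled alignments.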
APA, Harvard, Vancouver, ISO, and other styles
17

Ye, Lin, and 叶林. "Exploring microbial community structures and functions of activated sludge by high-throughput sequencing." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2012. http://hub.hku.hk/bib/B48079649.

Full text
Abstract:
To investigate the diversities and abundances of nitrifiers and to apply high-throughput sequencing technologies to analyze the overall microbial community structures and functions in wastewater treatment bioreactors were the major objectives of this study. Specifically, this study was conducted: (1) to investigate the diversities and abundances of AOA, AOB and NOB in bioreactors, (2) to explore the bacterial communities in bioreactors using 454 pyrosequencing, and (3) to analyze the metagenomes of activated sludge using Illumina sequencing. A lab-scale nitrification bioreactor was operated for 342 days under low DO (0.15~0.5 mg/L) and high nitrogen loading (0.26~0.52 kg-N/(m³·d)). T-RFLP and cloning analysis showed that there was only one dominant AOA, AOB and NOB species each in the bioreactor. The amoA gene of the dominant AOA had a similarity of 89.3% with the isolated AOA species Nitrosopumilus maritimus SCM1. The AOB species detected in the bioreactor belonged to the genus Nitrosomonas. The abundance of AOB was more than 40 times larger than that of AOA. The percentage of NOB in total bacteria increased from not detectable to 30% when DO changed from 0.15 to 0.5 mg/L. Compared with traditional methods, pyrosequencing analysis of the bacteria in this bioreactor provided unprecedented information: 494 bacterial OTUs were obtained at a 3% distance cutoff. Furthermore, 454 pyrosequencing was applied to investigate the bacterial communities of activated sludge samples from 14 WWTPs in Asia (mainland China, Hong Kong, and Singapore) and North America (Canada and the United States). The results revealed large numbers of OTUs in activated sludge, i.e. 1183~3567 OTUs per sludge sample at a 3% distance cutoff. Clear geographical differences among these samples were observed. The AOB amoA genes in different WWTPs were found to be quite diverse while the 16S rRNA genes were relatively conserved. 
To explore microbial community structures and functions in the above-mentioned lab-scale bioreactor and a full-scale bioreactor, over six gigabases of metagenomic sequence data and 150,000 paired-end reads of PCR amplicons were generated from the activated sludge in the two bioreactors on the Illumina HiSeq2000 platform. Three kinds of sequences (16S rRNA amplicons, 16S rRNA gene tags and predicted genes) were used to conduct taxonomic assignment, and their applicabilities and reliabilities were compared. Specifically, based on 16S rRNA and amoA gene sequences, AOB were found to be more abundant than AOA in the two bioreactors. Furthermore, the analysis of the metabolic profiles and pathways indicated that the overall pathways in the two bioreactors were quite similar. However, the abundances of some specific genes in the two bioreactors differed. In addition, 454 pyrosequencing was also used to detect potentially pathogenic bacteria in environmental samples. It was found that the most abundant potentially pathogenic bacteria in the WWTPs were affiliated with Aeromonas and Clostridium. Aeromonas veronii, Aeromonas hydrophila and Clostridium perfringens were the species most similar to the potentially pathogenic bacteria found in this study. Overall, the percentage of sequences closely related to known pathogenic bacteria was about 0.16% of the total sequences. Additionally, a Java application (BAND) was developed for graphical visualization of microbial abundance data.
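The OTU counts above come from clustering reads at a 3% distance cutoff. As a rough illustration only (real pipelines use alignment-based distances and quality filtering; the greedy centroid scheme, equal-length reads and Hamming distance here are simplifying assumptions), reads can be grouped so that each joins the first cluster representative within the cutoff:

```python
def hamming_frac(a, b):
    # Fraction of mismatching positions (toy distance; real OTU pipelines
    # use alignment-based distances on variable-length reads).
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b)) / len(a)

def greedy_otus(reads, cutoff=0.03):
    """Greedy centroid clustering: each read joins the first centroid
    within `cutoff` distance, otherwise it founds a new OTU."""
    centroids = []
    members = []
    for r in reads:
        for i, c in enumerate(centroids):
            if hamming_frac(r, c) <= cutoff:
                members[i].append(r)
                break
        else:
            centroids.append(r)
            members.append([r])
    return centroids, members
```

A read differing from a centroid at 2 of 100 positions (2% distance) joins that OTU; a read at 100% distance founds a new one.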
APA, Harvard, Vancouver, ISO, and other styles
18

Cameron, Michael, and mcam@mc-mc net. "Efficient Homology Search for Genomic Sequence Databases." RMIT University. Computer Science and Information Technology, 2006. http://adt.lib.rmit.edu.au/adt/public/adt-VIT20070509.162443.

Full text
Abstract:
Genomic search tools can provide valuable insights into the chemical structure, evolutionary origin and biochemical function of genetic material. A homology search algorithm compares a protein or nucleotide query sequence to each entry in a large sequence database and reports alignments with highly similar sequences. The exponential growth of public data banks such as GenBank has necessitated the development of fast, heuristic approaches to homology search. The versatile and popular blast algorithm, developed by researchers at the US National Center for Biotechnology Information (NCBI), uses a four-stage heuristic approach to efficiently search large collections for analogous sequences while retaining a high degree of accuracy. Despite an abundance of alternative approaches to homology search, blast remains the only method to offer fast, sensitive search of large genomic collections on modern desktop hardware. As a result, the tool has found widespread use with millions of queries posed each day. A significant investment of computing resources is required to process this large volume of genomic searches and a cluster of over 200 workstations is employed by the NCBI to handle queries posed through the organisation's website. As the growth of sequence databases continues to outpace improvements in modern hardware, blast searches are becoming slower each year and novel, faster methods for sequence comparison are required. In this thesis we propose new techniques for fast yet accurate homology search that result in significantly faster blast searches. First, we describe improvements to the final, gapped alignment stages where the query and sequences from the collection are aligned to provide a fine-grain measure of similarity. We describe three new methods for aligning sequences that roughly halve the time required to perform this computationally expensive stage. 
Next, we investigate improvements to the first stage of search, where short regions of similarity between a pair of sequences are identified. We propose a novel deterministic finite automaton data structure that is significantly smaller than the codeword lookup table employed by ncbi-blast, resulting in improved cache performance and faster search times. We also discuss fast methods for nucleotide sequence comparison. We describe novel approaches for processing sequences that are compressed using the byte packed format already utilised by blast, where four nucleotide bases from a strand of DNA are stored in a single byte. Rather than decompress sequences to perform pairwise comparisons, our innovations permit sequences to be processed in their compressed form, four bases at a time. Our techniques roughly halve average query evaluation times for nucleotide searches with no effect on the sensitivity of blast. Finally, we present a new scheme for managing the high degree of redundancy that is prevalent in genomic collections. Near-duplicate entries in sequence data banks are highly detrimental to retrieval performance, however existing methods for managing redundancy are both slow, requiring almost ten hours to process the GenBank database, and crude, because they simply purge highly-similar sequences to reduce the level of internal redundancy. We describe a new approach for identifying near-duplicate entries that is roughly six times faster than the most successful existing approaches, and a novel approach to managing redundancy that reduces collection size and search times but still provides accurate and comprehensive search results. Our improvements to blast have been integrated into our own version of the tool. We find that our innovations more than halve average search times for nucleotide and protein searches, and have no significant effect on search accuracy. 
Given the enormous popularity of blast, this represents a very significant advance in computational methods to aid life science research.
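The byte-packed nucleotide comparison described in the abstract can be sketched as follows. This is a hedged illustration, not blast's actual code: it packs four 2-bit bases into each byte and then compares packed sequences byte by byte, so four bases are examined per comparison without any decompression:

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # 2 bits per base

def pack(seq):
    """Pack a DNA string, 4 bases per byte; for simplicity this toy
    version requires the length to be a multiple of 4."""
    assert len(seq) % 4 == 0
    out = bytearray()
    for i in range(0, len(seq), 4):
        b = 0
        for ch in seq[i:i + 4]:
            b = (b << 2) | CODE[ch]
        out.append(b)
    return bytes(out)

def matching_words(p1, p2):
    """Compare two packed sequences one byte (= 4 bases) at a time,
    counting word positions where all 4 bases agree -- no decompression."""
    return sum(1 for x, y in zip(p1, p2) if x == y)
```

For example, `"ACGT"` packs to the single byte `0b00011011` (27), and comparing `"ACGTACGT"` with `"ACGTTTTT"` touches only two bytes instead of eight characters.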
APA, Harvard, Vancouver, ISO, and other styles
19

Gupta, Kapil. "Combinatorial optimization and application to DNA sequence analysis." Diss., Atlanta, Ga. : Georgia Institute of Technology, 2008. http://hdl.handle.net/1853/26676.

Full text
Abstract:
Thesis (Ph.D)--Industrial and Systems Engineering, Georgia Institute of Technology, 2009. Committee Chair: Lee, Eva K.; Committee Member: Barnes, Earl; Committee Member: Fan, Yuhong; Committee Member: Johnson, Ellis; Committee Member: Yuan, Ming. Part of the SMARTech Electronic Thesis and Dissertation Collection.
APA, Harvard, Vancouver, ISO, and other styles
20

Helal, Manal Computer Science & Engineering Faculty of Engineering UNSW. "Indexing and partitioning schemes for distributed tensor computing with application to multiple sequence alignment." Awarded by: University of New South Wales. Computer Science & Engineering, 2009. http://handle.unsw.edu.au/1959.4/44781.

Full text
Abstract:
This thesis investigates indexing and partitioning schemes for high dimensional scientific computational problems. Building on the foundation offered by Mathematics of Arrays (MoA) for tensor-based computation, the ultimate contribution of the thesis is a unified partitioning scheme that works independently of dataset dimension and shape. Consequently, portability is ensured between different high performance machines, cluster architectures, and potentially computational grids. The Multiple Sequence Alignment (MSA) problem in computational biology has an optimal dynamic programming based solution, but it becomes computationally infeasible as its dimensionality (the number of sequences) increases. Even sub-optimal approximations may be unmanageable for more than eight sequences. Furthermore, no existing MSA algorithms have been formulated in a manner invariant over the number of sequences. This thesis presents an optimal distributed MSA method based on MoA. The latter offers a set of constructs that help represent multidimensional arrays in memory in a linear, concise and efficient way. Using MoA allows the partitioning of the dynamic programming algorithm to be expressed independently of dimension. MSA is the highest dimensional scientific problem considered for MoA-based partitioning to date. Two partitioning schemes are presented: the first is a master/slave approach which is based on both master/slave scheduling and slave/slave coupling. The second approach is a peer-to-peer design, in which the scheduling and dependency communication are calculated independently by each process, with no need for a master scheduler. A search space reduction technique is introduced to cater for the exponential expansion as the problem dimensionality increases. This technique relies on defining a hyper-diagonal through the tensor space, and choosing a band of neighbouring partitions around the diagonal to score. 
In contrast, other sub-optimal methods in the literature only consider projections on the surface of the hyper-cube. The resulting massively parallel design produces a scalable solution that has been implemented on high performance machines and cluster architectures. Experimental results for these implementations are presented for both simulated and real datasets. Comparisons between the reduced search space technique of this thesis with other sub-optimal methods for the MSA problem are presented.
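The hyper-diagonal band reduction can be illustrated with a toy partition filter. The band distance used here (maximum deviation of any coordinate from the mean coordinate) is an assumption for illustration; the thesis may define the band around the diagonal differently:

```python
from itertools import product

def in_diagonal_band(index, band):
    """True if a k-dimensional partition index lies within `band` of the
    hyper-diagonal, measured as the maximum deviation of any coordinate
    from the mean coordinate (one plausible band definition)."""
    mean = sum(index) / len(index)
    return max(abs(i - mean) for i in index) <= band

def band_partitions(shape, band):
    """Enumerate only the partitions of a k-dimensional grid that fall
    inside the band -- the cells actually scored."""
    return [ix for ix in product(*(range(n) for n in shape))
            if in_diagonal_band(ix, band)]
```

With `band=0` on a 3×3×3 grid, only the three cells on the exact diagonal survive; widening the band trades computation for alignment optimality.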
APA, Harvard, Vancouver, ISO, and other styles
21

Pappas, Nicholas Peter. "Searching Biological Sequence Databases Using Distributed Adaptive Computing." Thesis, Virginia Tech, 2003. http://hdl.handle.net/10919/31074.

Full text
Abstract:
Genetic research projects currently can require enormous computing power to process the vast quantities of data available. Further, DNA sequencing projects are generating data at an exponential rate that outpaces the development of microprocessor technology; thus, new, faster methods and techniques for processing this data are needed. One common type of processing involves searching a sequence database for the most similar sequences. Here we present a distributed database search system that utilizes adaptive computing technologies. The search is performed using the Smith-Waterman algorithm, a common sequence comparison algorithm. To reduce the total search time, an initial search is performed using a version of the algorithm, implemented in adaptive computing hardware, which is designed to perform the initial search efficiently. A final search is performed using a complete version of the algorithm. This two-stage search, employing adaptive and distributed hardware, achieves a performance increase of several orders of magnitude over similar processor-based systems.
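As a reference point for the two-stage search described above, a minimal software version of the Smith-Waterman recurrence (score-only, with a linear gap penalty; the hardware implementation is of course structured very differently) looks like this:

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment score: the best-scoring local
    region between a and b, with the score matrix floored at zero."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Floor at 0 lets a fresh local alignment start anywhere.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best
```

For example, identical 4-mers score 8 with these parameters, completely dissimilar sequences score 0, and the shared local region "ACG" between "TTACG" and "AACGT" scores 6.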
APA, Harvard, Vancouver, ISO, and other styles
22

Steinfadt, Shannon Irene. "Smith-Waterman Sequence Alignment For Massively Parallel High-Performance Computing Architectures." Kent State University / OhioLINK, 2010. http://rave.ohiolink.edu/etdc/view?acc_num=kent1271656353.

Full text
APA, Harvard, Vancouver, ISO, and other styles
23

Johansson, Joakim. "Modifying a Protein-Protein Interaction Identifier with a Topology and Sequence-Order Independent Structural Comparison Method." Thesis, Linköpings universitet, Bioinformatik, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-147777.

Full text
Abstract:
Using computational methods to identify protein-protein interactions (PPIs) supports experimental techniques while requiring less time and fewer resources. PPIs can be identified through a template-based approach that describes how unstudied proteins interact by aligning a common structural template that exists in both interacting proteins. A pipeline that uses this is InterPred, which combines homology modelling and massive template comparison to construct coarse interaction models. These models are reviewed by a machine learning classifier that identifies models showing traits of being true, which can be further refined with a docking technique. However, InterPred depends on complex structural information that might not be available for unstudied proteins, while it is suggested that PPIs depend on the shape and interface of proteins. A method that aligns structures based on interface attributes is InterComp, which uses topological and sequence-order independent structural comparison. Implementing this method in InterPred restricts the structural information to the interface of proteins, which could lead to the discovery of undetected PPI models. The results showed that the modified pipeline was not comparable in terms of receiver operating characteristic (ROC) performance. However, the modified pipeline could identify new potential PPIs that were undetected by InterPred.
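Interface-restricted comparison starts from deciding which residues actually form the interface. A minimal sketch, assuming one representative coordinate per residue (e.g. the C-alpha atom) and a hypothetical 8 Å cutoff; real tools such as InterComp use much richer structural representations:

```python
from math import dist  # Euclidean distance, Python 3.8+

def interface_pairs(chain_a, chain_b, cutoff=8.0):
    """Residue index pairs whose representative atoms lie within
    `cutoff` angstroms across the two chains -- a minimal notion of an
    interaction interface."""
    pairs = []
    for i, ca in enumerate(chain_a):
        for j, cb in enumerate(chain_b):
            if dist(ca, cb) <= cutoff:
                pairs.append((i, j))
    return pairs
```

The residues appearing in these pairs would then be the only ones passed on to the topology- and sequence-order-independent alignment step.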
APA, Harvard, Vancouver, ISO, and other styles
24

Tångrot, Jeanette. "Structural Information and Hidden Markov Models for Biological Sequence Analysis." Doctoral thesis, Umeå universitet, Institutionen för datavetenskap, 2008. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-1629.

Full text
Abstract:
Bioinformatics is a fast-developing field, which makes use of computational methods to analyse and structure biological data. An important branch of bioinformatics is structure and function prediction of proteins, which is often based on finding relationships to already characterized proteins. It is known that two proteins with very similar sequences also share the same 3D structure. However, there are many proteins with similar structures that have no clear sequence similarity, which makes it difficult to find these relationships. In this thesis, two methods for annotating protein domains are presented, one aiming at assigning the correct domain family or families to a protein sequence, and the other aiming at fold recognition. Both methods use hidden Markov models (HMMs) to find related proteins, and both exploit the fact that structure is more conserved than sequence, but in two different ways. Most of the research presented in the thesis focuses on the structure-anchored HMMs, saHMMs. For each domain family, an saHMM is constructed from a multiple structure alignment of carefully selected representative domains, the saHMM-members. These saHMM-members are collected in the so-called "midnight ASTRAL set", and are chosen so that all saHMM-members within the same family have mutual sequence identities below a threshold of about 20%. In order to construct the midnight ASTRAL set and the saHMMs, a pipeline of software tools was developed. The saHMMs are shown to be able to detect the correct family relationships with very high accuracy, and they perform better than the standard tool Pfam in assigning the correct domain families to new domain sequences. We also introduce the FI-score, which is used to measure the performance of the saHMMs, in order to select the optimal model for each domain family. The saHMMs are made available for searching through the FISH server, and can be used for assigning family relationships to protein sequences. 
The other approach presented in the thesis is secondary structure HMMs (ssHMMs). These HMMs are designed to use both the sequence and the predicted secondary structure of a query protein when scoring it against the model. A rigorous benchmark is used, which shows that HMMs made from multiple sequences result in better fold recognition than those based on single sequences. Adding secondary structure information to the HMMs improves the ability of fold recognition further, both when using true and predicted secondary structures for the query sequence.
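The idea of scoring residues and predicted secondary structure jointly can be sketched with a gapless toy profile. The per-position emission dictionaries, background frequencies and pseudo-probability floor below are illustrative assumptions; real ssHMMs also model insert and delete states:

```python
import math

def sshmm_score(seq, ss, aa_profile, ss_profile, bg_aa=0.05, bg_ss=1 / 3):
    """Log-odds score of a (sequence, predicted secondary structure) pair
    against a gapless per-position profile: each position contributes an
    amino-acid term and a secondary-structure term (H/E/C)."""
    assert len(seq) == len(ss) == len(aa_profile) == len(ss_profile)
    score = 0.0
    for x, s, aa_p, ss_p in zip(seq, ss, aa_profile, ss_profile):
        score += math.log(aa_p.get(x, 1e-4) / bg_aa)   # residue term
        score += math.log(ss_p.get(s, 1e-4) / bg_ss)   # secondary-structure term
    return score
```

A query whose residues and predicted states both fit the profile scores well above background; one that matches neither scores far below it, which is what lets the secondary-structure term rescue remote homologs with weak sequence signal.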
APA, Harvard, Vancouver, ISO, and other styles
25

Darbha, Sriram. "RNA Homology Searches Using Pair Seeding." Thesis, University of Waterloo, 2005. http://hdl.handle.net/10012/1172.

Full text
Abstract:
Due to the increasing number of non-coding RNAs (ncRNAs) discovered recently, there is interest in identifying homologs of a given structured RNA sequence. Exhaustive homology searching for structured RNA molecules using covariance models is infeasible on genome-length sequences. Hence, heuristic methods are employed, but they largely ignore structural information in the query. We present a novel method, which uses secondary structure information, to perform homology searches for a structured RNA molecule. We define the concept of a "pair seed" and theoretically model alignments of random and related paired regions to compute expected sensitivity and specificity. We show that our method gives theoretical gains in sensitivity and specificity compared to a BLAST-based heuristic approach. We provide experimental verification of this gain. We also show that pair seeds can be effectively combined with the spaced seeds approach to nucleotide homology search. The hybrid search method has theoretical specificity superior to that of the BLAST seed. We provide experimental evaluation of our hypotheses. Finally, we note that our method is easily modified to process pseudo-knotted regions in the query, something outside the scope of covariance model based methods.
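The kind of sensitivity analysis used for seeding can be illustrated with a standard toy model (not the thesis's pair-seed model itself): the probability that a contiguous length-k seed hits at least once in a conserved region of length L, where each position matches independently with probability p, computed by a run-length dynamic program:

```python
def prob_seed_hit(L, k, p):
    """Probability of at least one run of k consecutive matches among L
    independent positions that each match with probability p -- a toy
    model of contiguous-seed sensitivity in a conserved region."""
    # state[r] = P(no k-run so far, current trailing run of matches = r)
    state = [0.0] * k
    state[0] = 1.0
    for _ in range(L):
        new = [0.0] * k
        new[0] = sum(state) * (1 - p)        # a mismatch resets the run
        for r in range(k - 1):
            new[r + 1] += state[r] * p       # a match extends the run
        state = new                          # a run reaching k is a hit
    return 1.0 - sum(state)
```

For example, `prob_seed_hit(2, 1, 0.5)` is 0.75, the chance of at least one match in two coin flips; increasing k sharply lowers sensitivity at fixed identity p, which is the trade-off seed design tries to escape.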
APA, Harvard, Vancouver, ISO, and other styles
26

Kemena, Carsten 1983. "Improving the accuracy and the efficiency of multiple sequence alignment methods." Doctoral thesis, Universitat Pompeu Fabra, 2012. http://hdl.handle.net/10803/128678.

Full text
Abstract:
Sequence alignment is one of the basic methods to compare biological sequences and the cornerstone of a wide range of different analyses. Due to this privileged position at the beginning of many studies, its accuracy is of great importance; in fact, every result based on an alignment depends on the alignment quality. This has been confirmed in several recent papers investigating the effect of alignment methods on phylogenetic reconstruction and the estimation of positive selection. In this thesis, I present several projects dedicated to the problem of developing more accurate multiple sequence alignments and how to evaluate them. I addressed the problem of structural protein alignment evaluation, the accurate structural alignment of RNA sequences and the alignment of large sequence data sets.
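Alignment evaluation against a trusted reference is commonly done with column-based scores. A minimal sketch of the total-column (TC) measure used in alignment benchmarks, assuming `-` as the gap character and equal-length rows within each alignment:

```python
def columns(alignment):
    """Map each alignment column to the tuple of residue indices it
    aligns (None for gap positions)."""
    counters = [0] * len(alignment)
    cols = []
    for col in zip(*alignment):
        ids = []
        for row, ch in enumerate(col):
            if ch == "-":
                ids.append(None)
            else:
                ids.append(counters[row])
                counters[row] += 1
        cols.append(tuple(ids))
    return cols

def total_column_score(test, reference):
    """Fraction of reference columns recovered exactly by the test
    alignment (the 'TC' measure)."""
    test_cols = set(columns(test))
    ref_cols = columns(reference)
    return sum(c in test_cols for c in ref_cols) / len(ref_cols)
```

With reference `["AC-T", "ACGT"]` and test `["A-CT", "ACGT"]`, two of the four reference columns are reproduced, giving a TC score of 0.5.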
APA, Harvard, Vancouver, ISO, and other styles
27

Hatherley, Rowan. "Structural bioinformatics studies and tool development related to drug discovery." Thesis, Rhodes University, 2016. http://hdl.handle.net/10962/d1020021.

Full text
Abstract:
This thesis is divided into two distinct sections which can be combined under the broad umbrella of structural bioinformatics studies related to drug discovery. The first section involves the establishment of an online South African natural products database. Natural products (NPs) are chemical entities synthesised in nature and are unrivalled in their structural complexity, chemical diversity, and biological specificity, which has long made them crucial to the drug discovery process. South Africa is rich in both plant and marine biodiversity and a great deal of research has gone into isolating compounds from organisms found in this country. However, there is no official database containing this information, making it difficult to access for research purposes. This information was extracted manually from literature to create a database of South African natural products. In order to make the information accessible to the general research community, a website, named "SANCDB", was built to enable compounds to be quickly and easily searched for and downloaded in a number of different chemical formats. The content of the database was assessed and compared to other established natural product databases. Currently, SANCDB is the only database of natural products in Africa with an online interface. The second section of the thesis was aimed at performing structural characterisation of proteins with the potential to be targeted for antimalarial drug therapy. This looked specifically at: (1) the interactions between an exported heat shock protein (Hsp) from Plasmodium falciparum (P. falciparum), PfHsp70-x, and various host and exported parasite J proteins; and (2) the interface between PfHsp90 and the heat shock organising protein (PfHop). The PfHsp70-x:J protein study provided additional insight into how these two proteins potentially interact. 
Analysis of the PfHsp90:PfHop also provided a structural insight into the interaction interface between these two proteins and identified residues that could be targeted due to their contribution to the stability of the Hsp90:Hop binding complex and differences between parasite and human proteins. These studies inspired the development of a homology modelling tool, which can be used to assist researchers with homology modelling, while providing them with step-by-step control over the entire process. This thesis presents the establishment of a South African NP database and the development of a homology modelling tool, inspired by protein structural studies. When combined, these two applications have the potential to contribute greatly towards in silico drug discovery research.
APA, Harvard, Vancouver, ISO, and other styles
28

Brandström, Mikael. "Bioinformatic Analysis of Mutation and Selection in the Vertebrate Non-coding Genome." Doctoral thesis, Uppsala University, Department of Evolution, Genomics and Systematics, 2007. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-8240.

Full text
Abstract:
The majority of the vertebrate genome sequence is not coding for proteins. In recent years, the evolution of this noncoding fraction of the genome has gained interest. These studies have been greatly facilitated by the availability of full genome sequences. The aim of this thesis is to study evolution of the noncoding vertebrate genome through bioinformatic analysis of large-scale genomic datasets.

In a first analysis we addressed the use of conservation of sequence between highly diverged genomes to infer function. We provided evidence for a turnover of the patterns of negative selection. Hence, measures of constraint based on comparisons of diverged genomes might underestimate the functional proportion of the genome.

In the following analyses we focused on length variation as found in small-scale insertion and deletion (indel) polymorphisms and microsatellites. For indels in chicken, replication slippage is a likely mutation mechanism, as a large proportion of the indels are parts of tandem duplicates. Using a set of microsatellite polymorphisms in chicken, where we avoid ascertainment bias, we showed that polymorphism is positively correlated with microsatellite length and AT-content. Furthermore, interruptions in the microsatellite sequence decrease the levels of polymorphism.

We also analysed the association between microsatellite polymorphism and recombination in the human genome. Here we found increased levels of microsatellite polymorphism in human recombination hotspots and also similar increases in the frequencies of single nucleotide polymorphisms (SNPs) and indels. This points towards natural selection shaping the levels of variation. Alternatively, recombination is mutagenic for all three kinds of polymorphisms.

Finally, I present the program ILAPlot. It is a tool for visualisation, exploration and data extraction based on BLAST.

Our combined results highlight the intricate connections between evolutionary phenomena. It also emphasises the importance of length variability in genome evolution, as well as the gradual difference between indels and microsatellites.
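Microsatellite analyses like those above start from locating tandem repeats in genomic sequence. A minimal scanner for perfect repeats (the motif sizes and minimum copy number are arbitrary illustrative choices; real tools also handle interrupted and imperfect repeats, which this thesis shows matter for polymorphism levels):

```python
def find_microsatellites(seq, motif_sizes=(1, 2, 3, 4), min_copies=3):
    """Scan for perfect tandem repeats, returning (start, motif, copies)
    for every maximal run of >= min_copies copies of a short motif."""
    hits = []
    n = len(seq)
    for m in motif_sizes:
        i = 0
        while i + m * min_copies <= n:
            motif = seq[i:i + m]
            copies = 1
            # Extend the run while the next m characters repeat the motif.
            while seq[i + copies * m: i + (copies + 1) * m] == motif:
                copies += 1
            if copies >= min_copies:
                hits.append((i, motif, copies))
                i += copies * m  # skip past the reported run
            else:
                i += 1
    return hits
```

Scanning `"GGCACACACATT"` reports the single dinucleotide repeat `(CA)4` starting at position 2.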
APA, Harvard, Vancouver, ISO, and other styles
29

Ozer, Hatice Gulcin. "Residue Associations In Protein Family Alignments." The Ohio State University, 2008. http://rave.ohiolink.edu/etdc/view?acc_num=osu1211570026.

Full text
APA, Harvard, Vancouver, ISO, and other styles
30

Montañola, Lacort Alberto. "The pairwise problem with High Performance Computing Systems, contextualized as a key part to solve the Multiple Sequence Alignment problem." Doctoral thesis, Universitat de Lleida, 2016. http://hdl.handle.net/10803/385282.

Full text
Abstract:
The multiple sequence alignment (MSA) problem, a central challenge in bioinformatics, is a key element for understanding the inner workings of the genome. It consists of aligning sequences in optimal time while guaranteeing a given level of quality. The problem is a challenge for high-performance computing because of its high memory and processing requirements. Different implementations were studied, and are compared and presented in this thesis. We have contributed improvements to the first steps of the MSA problem in several ways. To reduce computing time and memory usage, we adapted T-Coffee to work in parallel using lightweight threads. Furthermore, we developed a parallel pairwise sequence alignment method with an efficient mapping of sequences to nodes. Finally, we present a method for determining the minimal amount of system resources required to solve a problem of a given size, in order to configure the system for efficient use.
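The pairwise stage discussed above is naturally parallel: the n(n-1)/2 alignments that fill the distance matrix are independent tasks. As a minimal illustrative sketch of that idea (not the thesis implementation; the thread pool, the edit-distance scoring, and all names here are assumptions), the pairs can simply be farmed out to worker threads:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

def edit_distance(a, b):
    # Classic O(|a|*|b|) dynamic programming, kept in linear space.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # (mis)match
        prev = cur
    return prev[-1]

def distance_matrix(seqs, workers=4):
    # Each of the n*(n-1)/2 pairs is an independent task.
    pairs = list(combinations(range(len(seqs)), 2))
    d = [[0] * len(seqs) for _ in seqs]
    with ThreadPoolExecutor(max_workers=workers) as ex:
        dists = ex.map(lambda p: edit_distance(seqs[p[0]], seqs[p[1]]), pairs)
        for (i, j), dist in zip(pairs, dists):
            d[i][j] = d[j][i] = dist
    return d

m = distance_matrix(["GATTACA", "GACTATA", "ATTACA"])
```

In CPython, threads only pay off here if the inner alignment releases the GIL (e.g. runs in a C extension); distributing the pairs across processes or cluster nodes, as in the thesis, is the heavyweight alternative.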
APA, Harvard, Vancouver, ISO, and other styles
31

Yin, Zhaoming. "Enhance the understanding of whole-genome evolution by designing, accelerating and parallelizing phylogenetic algorithms." Diss., Georgia Institute of Technology, 2014. http://hdl.handle.net/1853/51875.

Full text
Abstract:
The advent of new technologies has increased the speed and reduced the cost of sequencing biological data. Making biological sense of this genomic data is a major challenge for algorithm design as well as for the high-performance computing community. There are many open problems in bioinformatics, such as how new functional genes arise, why genes are organized into chromosomes, how species are connected through the evolutionary tree of life, or why arrangements are subject to change. Phylogenetic analyses have become essential to research on the evolutionary tree of life. They can help us track the history of species and the relationships between different genes or genomes through millions of years. One of the fundamentals of phylogenetic construction is the computation of distances between genomes. Because rearrangement events exhibit far more complicated combinatorial patterns, distance computation remains an active topic that belongs as much to mathematics as to biology. The distance computation is especially hard when the two input genomes have unequal gene content (with insertions/deletions and duplications). In this thesis, we discuss our contributions to distance estimation for unequal gene order data. The problem of finding the median of three genomes is the key step in building the most parsimonious phylogenetic trees from genome rearrangement data. For genomes with unequal content, to the best of our knowledge, no existing algorithm finds the median. In this thesis, we contribute to median computation in two respects. 1) On the algorithm engineering side, we harness the power of streaming graph analytics to implement an exact DCJ median algorithm that runs as fast as heuristic algorithms and helps construct better phylogenetic trees. 2) On the algorithmic side, we give a theoretical formulation of the problem of finding the median of genomes with unequal gene content, which leads to the design and implementation of an efficient Lin-Kernighan-based heuristic median algorithm. Inferring the phylogeny (evolutionary history) of a set of given species is the ultimate goal once the distance and median models are chosen. For more than a decade, biologists and computer scientists have studied how to infer phylogenies from genome rearrangement events measured on gene order data. While evolution is not an inherently parsimonious process, maximum parsimony (MP) analysis has been widely applied to phylogeny inference to study the evolutionary patterns of genome rearrangements. MP phylogenetic inference from genome rearrangement data generally raises two problems: first, given a set of modern genomes, how to compute the topology of the corresponding phylogenetic tree; second, given the topology of a model tree, how to infer the gene orders of the ancestral species. Assembling an MP phylogenetic tree constructor involves multiple NP-hard problems which, unfortunately, are stacked one on top of another: to solve one NP-hard problem, we need to solve multiple NP-hard sub-problems. For phylogenetic tree construction with unequal-content genomes as input, there are three layers of NP-hard problems. In this thesis, we mainly discuss our contributions to the design and implementation of the software package DCJUC (Phylogeny Inference using the DCJ model to cope with Unequal Content Genomes), which helps to achieve both of these goals. Aside from the biological problems, another concern is using the power of parallel computing to accelerate these algorithms on huge data sets, such as high-resolution gene order data. On the one hand, all of the methods for tackling these phylogenetic problems are based on branch-and-bound algorithms, which are quite irregular and unfriendly to parallel computing. To parallelize them, we need to improve the efficiency of localized memory access and the load-balancing methods so that each thread is fully utilized. On the other hand, a revolution is taking place in computing with the availability of commodity graphics processors such as Nvidia GPUs and of many-core machines such as the Cray XMT or the Intel Xeon Phi coprocessor with 60 cores. These architectures provide a new way to achieve high performance at much lower cost. However, these machines are not easily programmed, and scientific codes are hard to tune well on them. We explore the potential of these architectures to accelerate branch-and-bound-based phylogenetic algorithms.
APA, Harvard, Vancouver, ISO, and other styles
32

Alderson, Rosanna Grace. "Tracking the evolution of function in diverse enzyme superfamilies." Thesis, University of St Andrews, 2016. http://hdl.handle.net/10023/10496.

Full text
Abstract:
Tracking the evolution of function in enzyme superfamilies is key in understanding how important biological functions and mechanisms have evolved. New genes are being sequenced at a rate that far surpasses the capacity for characterization by wet-lab techniques. Moreover, bioinformatics allows for the use of methods not amenable to wet-lab experimentation. We now face a situation in which we are aware of the existence of many gene families but are ignorant of what they do and how they function. Even for families with many structurally and functionally characterized members, the prediction of function of ancestral sequences can be used to elucidate past patterns of evolution and highlight likely future trajectories. In this thesis, we apply in silico structure and function methods to predict the functions of protein sequences from two diverse superfamily case studies. In the first, the metallo-β-lactamase superfamily, many members have been structurally and functionally characterised. In this work, we asked how many times the same function has independently evolved in the same superfamily using ancestral sequence reconstruction, homology modelling and alignment to catalytic templates. We found that in only 5% of the evolutionary scenarios assessed was there evidence of a lactam-hydrolysing ancestor. This could be taken as strong evidence that metallo-β-lactamase function has evolved independently on multiple occasions. This finding has important implications for predicting the evolution of antibiotic resistance in this protein fold. However, as discussed, the interpretation of this statistic is not clear-cut. In the second case study, we analysed protein sequences of the DUF-62 superfamily. In contrast to the metallo-β-lactamase superfamily, very few members of this superfamily have been structurally and functionally characterised.
We used the analysis of alignment, gene context, species tree reconciliation and comparison of the rates of evolution to ask if other functions or cellular roles might exist in this family other than the ones already established. We find that multiple lines of evidence present a compelling case for the evolution of different functions within the Archaea, and propose possible cellular interactions and roles for members of this enzyme family.
APA, Harvard, Vancouver, ISO, and other styles
33

Zafalon, Geraldo Francisco Donegá. "Aplicação de estratégias híbridas em algoritmos de alinhamento múltiplo de sequências para ambientes de computação paralela e distribuída." Universidade de São Paulo, 2014. http://www.teses.usp.br/teses/disponiveis/3/3141/tde-28082015-120515/.

Full text
Abstract:
Bioinformatics has developed rapidly in recent years. The need to process large sets of sequences, whether of nucleotides or amino acids, has stimulated the development of many algorithmic techniques to solve this problem in a feasible way. Multiple sequence alignment algorithms have played an important role because, with the reduced computational complexity they provide, it is possible to perform alignments with more than two sequences. However, with the fast growth of both the number and the length of sequences in a set, the use of multiple alignment algorithms without new optimization strategies became almost impossible. Therefore, high-performance computing has emerged as one of the resources to be used, through the parallelization of many strategies for execution on large computational systems. Moreover, with the continued expansion of sequence sets, other optimization strategies have been coupled with parallel multiple sequence alignments. Thus, the development of multiple sequence alignment tools based on hybrid strategies has become the solution with the best results. In this work, we present the development of a hybrid strategy for progressive multiple sequence alignment, an approach whose use is widespread in bioinformatics. In this approach, we combined the parallelization and partitioning of the sequence sets in the score matrix calculation stage, and the optimization of the phylogenetic tree reconstruction and multiple alignment stages through ant colony and parallel simulated annealing algorithms, respectively.
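The progressive strategy described above has three stages: distance matrix, guide tree, and tree-ordered alignment. As a generic textbook sketch of the middle stage (a plain sequential average-linkage/UPGMA clustering; the parallel, metaheuristic-optimized version described in the abstract is considerably more involved), the guide tree can be built by repeatedly merging the two closest clusters:

```python
def upgma(dist):
    """Average-linkage clustering of a symmetric distance matrix.
    Returns the merge order as nested tuples of leaf indices."""
    n = len(dist)
    tree = {i: i for i in range(n)}          # cluster id -> subtree
    size = {i: 1 for i in range(n)}          # cluster id -> number of leaves
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    new_id = n
    while len(tree) > 1:
        a, b = min(d, key=d.get)             # closest pair of clusters
        ta, tb = tree.pop(a), tree.pop(b)
        sa, sb = size.pop(a), size.pop(b)
        for c in list(tree):                 # average-linkage update
            dac = d.pop((min(a, c), max(a, c)))
            dbc = d.pop((min(b, c), max(b, c)))
            d[(c, new_id)] = (sa * dac + sb * dbc) / (sa + sb)
        del d[(a, b)]
        tree[new_id], size[new_id] = (ta, tb), sa + sb
        new_id += 1
    return next(iter(tree.values()))

# Sequences 0 and 1 are closest, so they are aligned first; the final
# stage would then align sequence 2 against their profile.
order = upgma([[0, 2, 6], [2, 0, 6], [6, 6, 0]])
```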
APA, Harvard, Vancouver, ISO, and other styles
34

Dwivedi, Bhakti. "Impact of molecular evolutionary footprints on phylogenetic accuracy a simulation study /." Dayton, Ohio : University of Dayton, 2009. http://rave.ohiolink.edu/etdc/view?acc_num=dayton1250807136.

Full text
APA, Harvard, Vancouver, ISO, and other styles
35

Gomes, Mireille. "Role of mutual information for predicting contact residues in proteins." Thesis, University of Oxford, 2012. http://ora.ox.ac.uk/objects/uuid:5ec3c90c-73fb-494f-ad2e-efc718406aa4.

Full text
Abstract:
Mutual Information (MI) based methods are used to predict contact residues within proteins and between interacting proteins. There have been many high-impact papers citing the successful use of MI for determining contact residues in a particular protein of interest, or in certain types of proteins, such as homotrimers. In this dissertation we have carried out a systematic study to assess whether this popularly employed contact prediction tool is useful on a global scale. After testing the original MI and leading MI-based methods on large, cross-species datasets, we found that in general the performance of these methods for predicting contact residues both within (intra-protein) and between proteins (inter-protein) is weak. We observe that all MI variants have a bias towards surface residues, and therefore predict surface residues instead of contact residues. This finding is in contrast to the relatively good performance of i-Patch (Hamer et al. [2010]), a statistical scoring tool for inter-protein contact prediction. i-Patch uses as input surface residues only, groups amino acids by physiochemical properties, and assumes the existence of patches of contact residues on interacting proteins. We examine whether using these ideas would improve the performance of MI. Since inter-protein contact residues are only on the surface of each protein, to disentangle surface from contact prediction we filtered out the confounding buried residues. We observed that considering surface residues only does indeed improve the inter-protein contact prediction ability of all tested MI methods. We examined a specific "successful" case study in the literature and demonstrated that here, even when considering surface residues only, the most accurate MI-based inter-protein contact predictor, MIc, performs no better than random.
We have developed two novel MI variants; the first groups amino acids by their physiochemical properties, and the second considers patches of residues on the interacting proteins. In our analyses these new variants highlight the delicate trade-off between signal and noise that must be achieved when using MI for inter-protein contact prediction. The input for all tested MI methods is a multiple sequence alignment of homologous proteins. In a further attempt to understand why the MI methods perform poorly, we have investigated the influence of gaps in the alignment on intra-protein contact prediction. Our results suggest that depending on the evaluation criteria and the alignment construction algorithm employed, a gap cutoff of around 10% would maximise the performance of MI methods, whereas the popularly employed 0% gap cutoff may lead to predictions that are no better than random guesses. Based on the insight we have gained through our analyses, we end this dissertation by identifying a number of ways in which the contact residue prediction ability of MI variants may be improved, including direct coupling analysis.
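The MI score at the heart of all these variants is the empirical mutual information between two alignment columns. A minimal sketch of the plain, uncorrected score (without the surface filtering, physiochemical grouping, or MIc-style corrections discussed above; function name is illustrative):

```python
import math
from collections import Counter

def column_mi(col_x, col_y):
    """Empirical mutual information (bits) between two MSA columns,
    given as equal-length lists of residues (gap rows assumed removed)."""
    n = len(col_x)
    px, py = Counter(col_x), Counter(col_y)
    pxy = Counter(zip(col_x, col_y))
    return sum((c / n) * math.log2((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())

# Perfect covariation between two binary columns gives 1 bit;
# statistically independent columns give 0 bits.
mi_covarying = column_mi(list("AALL"), list("DDKK"))
mi_independent = column_mi(list("AALL"), list("DKDK"))
```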
APA, Harvard, Vancouver, ISO, and other styles
36

Vellozo, Augusto Fernandes. "Alinhamento de seqüências com rearranjos." Universidade de São Paulo, 2007. http://www.teses.usp.br/teses/disponiveis/45/45134/tde-04052007-185842/.

Full text
Abstract:
Sequence comparison done by alignment algorithms is one of the most fundamental tasks in bioinformatics. The evolutionary mutations considered in these alignments are insertions, deletions and substitutions of nucleotides. This work deals with generalizations of alignment algorithms so that other mutations, known as rearrangements, are also considered; more specifically, we consider inversions, tandem duplications and duplications by transposition. Alignment with inversions has no known polynomial algorithm, and a simplification of the problem that considers only non-overlapping inversions was proposed by Schöniger and Waterman in 1992. In 2003, two independent works proposed O(n^4)-time algorithms to align two sequences with non-overlapping inversions. We developed two algorithms that solve this problem: one in O(n^3 log n) time and another, under some conditions on the scoring system, in O(n^3) time, both in O(n^2) memory. In 1997, Benson proposed an alignment model that recognizes tandem duplications in addition to insertions, deletions and substitutions. He proposed two exact algorithms to align two sequences with tandem duplications: one in O(n^5) time and O(n^2) memory, and another in O(n^4) time and O(n^3) memory. We propose an algorithm to align two sequences with tandem duplications in O(n^3) time and O(n^2) memory. We also propose an algorithm to align two sequences with transposons (a more general type of duplication than tandem duplication), in O(n^3) time and O(n^2) memory.
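The baseline model that these rearrangement-aware algorithms generalize is the standard O(n^2) dynamic program over insertions, deletions and substitutions; the inversion and duplication variants add extra cases to this recurrence. A linear-space sketch of the baseline score (the scoring values are illustrative, not the thesis's):

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    # Global alignment score via the classic recurrence, row by row.
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        cur = [i * gap]
        for j, cb in enumerate(b, 1):
            cur.append(max(prev[j - 1] + (match if ca == cb else mismatch),
                           prev[j] + gap,     # gap in b
                           cur[-1] + gap))    # gap in a
        prev = cur
    return prev[-1]
```

Aligning "ACGT" against "AGT", for instance, pays one gap against three matches under these weights.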
APA, Harvard, Vancouver, ISO, and other styles
37

Mokin, Sergey. "Measuring deviation from a deeply conserved consensus in protein multiple sequence alignments." Thesis, McGill University, 2008. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=21956.

Full text
Abstract:
Proteins across species show variable degrees of conservation. Different patterns of conservation in the columns of an alignment indicate different evolutionary pressures on sequences. Protein conservation analysis is useful for a wide variety of applications, including disease mutation assessment, pseudogene analysis and functional residue prediction. This study describes a novel measure of column conservation in protein multiple sequence alignments (‘MSA'), and the application of this measure to calculate statistical deviation from alignment consensus (‘SDAC'). We have assessed SDAC for two case studies of sequences: (a) putative pseudogenes in Mycobacteria, and (b) young lineage-specific retrotransposed sequences in the human and mouse genomes. In the procedure, we rank residue positions for deep conservation, and evaluate statistically significant violations of the MSA consensus. The novel conservation measure clearly indicated a variable degree of physiochemical conservation for a given column entropy. That, in turn, enabled us to detect deviations from the physiochemical consensus in a protein MSA which are not found by entropy measures.
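The entropy baseline that the new measure improves on is computed directly from residue frequencies; the limitation noted above is that two columns with identical entropy can differ greatly in physiochemical conservation (e.g. L/I/V/F, all hydrophobic, versus L/D/K/G). A minimal sketch:

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (bits) of one MSA column; 0 = fully conserved."""
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in Counter(column).values())

# Same entropy, very different physiochemical conservation:
hydrophobic = column_entropy(list("LIVF"))   # all hydrophobic residues
mixed = column_entropy(list("LDKG"))         # chemically diverse residues
```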
APA, Harvard, Vancouver, ISO, and other styles
38

Almeida, André Atanasio Maranhão 1981. "Novas abordagens para o problema do alinhamento múltiplo de sequências." [s.n.], 2013. http://repositorio.unicamp.br/jspui/handle/REPOSIP/275646.

Full text
Abstract:
Advisor: Zanoni Dias. Doctoral thesis, Universidade Estadual de Campinas, Instituto de Computação. Sequence alignment is one of the most important tasks in bioinformatics. It is a basic operation used by several other procedures in the field, such as sequence database searches, visualization of the effect of evolution across an entire protein family, phylogenetic tree construction and identification of preserved motifs. Sequences can be aligned in pairs, producing a pairwise alignment. Three or more sequences can also be aligned simultaneously, producing a multiple sequence alignment (MSA). MSAs can be used for pattern recognition to characterize protein families and for secondary and tertiary protein structure prediction. Let l be the sequence length: an exact pairwise alignment can be built in O(l^2) time, whereas multiple sequence alignment is an NP-hard problem. In this work, heuristic methods were developed for multiple protein sequence alignment. The main approaches and methods applied to the problem were studied, and a series of aligners was implemented and evaluated. First, 342 multiple aligners were built using the progressive approach. This widely used approach for MSA construction is composed of three steps: in the first, a distance matrix is computed; then a guide tree is built from the matrix; and finally the MSA is constructed through pairwise alignments whose order is defined by the tree. The developed aligners combine distinct methods applied to each of these steps, and both the quality of the resulting alignments and the performance of the methods used at each step were evaluated. Next, evaluations were performed in the context of consistency-based alignment. In that approach, an MSA is considered optimal when it agrees with the majority of all possible optimal pairwise alignments. MUMMALS, a known consistency-based aligner, was modified in this evaluation: the k-mer counting method was changed in two distinct ways, varying the k value and the compressed alphabet, and, in another evaluation, the k-mer counting and guide tree construction methods were replaced by methods that performed well in the progressive-approach tests. In total, 89 variations of the original MUMMALS algorithm were implemented and evaluated and, although MUMMALS already produces high-quality alignments, significant improvements were achieved. In the last stage of the work, iterative algorithms were implemented and evaluated. These methods are characterized by their dependence on other aligners: the other aligners generate initial alignments, and the iterative aligner refines them through a series of cycles until their quality stabilizes. Two non-stochastic iterative aligners were implemented and evaluated, as well as a genetic algorithm (GA) for MSA construction, implemented as a parameterizable environment for running genetic algorithms for MSA, called ALGAe. Other approaches to multiple sequence alignment construction were incorporated into ALGAe, namely block-based, consensus-based and template-based approaches. The first was applied to the generation of individuals for the initial population, the second was used to define a crossover operator (which uses the M-COFFEE tool to build consensus alignments from individuals of the current GA population), and the third was used to define a fitness function (which uses the PSIPRED tool to predict the secondary structures of the sequences). ALGAe allows a wide variety of further evaluations to be carried out.
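The first step of the progressive pipelines evaluated here is the distance matrix, and k-mer counting (the component modified in the MUMMALS experiments) is the usual cheap way to fill it without computing full alignments. An illustrative sketch (the shared-k-mer normalization below is one common choice, not necessarily the one used in the thesis):

```python
from collections import Counter

def kmer_distance(a, b, k=3):
    """Distance in [0, 1]: 1 minus the fraction of shared k-mers,
    normalized by the shorter sequence's k-mer count."""
    ka = Counter(a[i:i + k] for i in range(len(a) - k + 1))
    kb = Counter(b[i:i + k] for i in range(len(b) - k + 1))
    shared = sum(min(ka[w], kb[w]) for w in ka)
    return 1.0 - shared / (min(len(a), len(b)) - k + 1)
```

Identical sequences score 0.0 and sequences sharing no k-mer score 1.0; compressed alphabets, as explored in the thesis, simply remap residues before counting.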
APA, Harvard, Vancouver, ISO, and other styles
39

Liang, Chengzhi. "COPIA: A New Software for Finding Consensus Patterns in Unaligned Protein Sequences." Thesis, University of Waterloo, 2001. http://hdl.handle.net/10012/1050.

Full text
Abstract:
The consensus pattern problem (CPP) aims at finding conserved regions, or motifs, in unaligned sequences. This problem is NP-hard under various scoring schemes. To solve this problem for protein sequences more efficiently, a new scoring scheme and a randomized algorithm based on a substitution matrix are proposed here. Any practical solution to a bioinformatics problem must observe two principles: (1) the problem that it solves accurately describes the real problem; in CPP, this requires that the scoring scheme be able to distinguish a real motif from background; (2) it provides an efficient algorithm to solve the mathematical problem. A key question in protein motif-finding is how to determine the motif length. One problem in EM algorithms for solving CPP is how to find good starting points to reach the global optimum. These two questions were both well addressed under this scoring scheme, which made the randomized algorithm both fast and accurate in practice. A software tool, COPIA (COnsensus Pattern Identification and Analysis), has been developed implementing this algorithm. Experiments using sequences from the von Willebrand factor (vWF) family showed that it worked well at finding multiple motifs and repeats. COPIA's ability to find repeats also makes it useful for illustrating the internal structures of multidomain proteins. Comparative studies using several groups of protein sequences demonstrated that COPIA performed better than the commonly used motif-finding programs.
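As a toy illustration of what CPP asks for, the sketch below scans each unaligned sequence for the window that best matches a given pattern under a simple match count. (COPIA itself scores candidates with a substitution matrix and discovers the consensus rather than taking it as input; every name here is illustrative.)

```python
def best_windows(seqs, pattern):
    """For each sequence, return (start, window) of the length-|pattern|
    window with the most positions matching the pattern."""
    L = len(pattern)
    hits = []
    for s in seqs:
        start = max(range(len(s) - L + 1),
                    key=lambda i: sum(x == y for x, y in zip(s[i:i + L], pattern)))
        hits.append((start, s[start:start + L]))
    return hits

hits = best_windows(["XXGATCXX", "GATCYY", "ZZZGATC"], "GATC")
```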
APA, Harvard, Vancouver, ISO, and other styles
40

Wang, Shu 1973. "On multiple sequence alignment." Thesis, 2007. http://hdl.handle.net/2152/3715.

Full text
Abstract:
The tremendous increase in biological sequence data presents us with an opportunity to understand the molecular and cellular basis for cellular life. Comparative studies of these sequences have the potential, when applied with sufficient rigor, to decipher the structure, function, and evolution of cellular components. The accuracy and detail of these studies are directly proportional to the quality of the sequence alignments. Given the large number of sequences per family of interest, and the increasing number of families to study, improving the speed, accuracy, and scalability of MSA is becoming an increasingly important task. In the past, much of the interest has been in global MSA; in recent years, the focus has shifted from global MSA to local MSA, which is needed to align variable sequences from different families/species. In this dissertation, we developed two new algorithms for fast and scalable local MSA: a three-way-consistency-based MSA and a biclustering-based MSA. The first algorithm is a three-way-Consistency-Based MSA (CBMSA). CBMSA applies alignment consistency heuristics in the form of a new three-way alignment to MSA. While the three-way consistency approach maintains the same time complexity as the traditional pairwise consistency approach, it provides more reliable consistency information and better alignment quality. We quantify the benefit of using three-way consistency as compared to pairwise consistency. We have also compared CBMSA to a suite of leading MSA programs, and CBMSA consistently performs favorably. We also developed another new MSA algorithm, a biclustering-based MSA. Biclustering is a clustering method that simultaneously clusters both the domain and range of a relation. A challenge in MSA is that the alignment of sequences is often intended to reveal groups of conserved functional subsequences.
Simultaneously, the grouping of the sequences can impact the alignment; this is precisely the kind of dual situation biclustering algorithms are intended to address. We define a representation of the MSA problem enabling the application of biclustering algorithms. We develop a computer program for local MSA, BlockMSA, that combines biclustering with divide-and-conquer. BlockMSA simultaneously finds groups of similar sequences and locally aligns subsequences within them. Further alignment is accomplished by dividing both the set of sequences and their contents. The net result is both a multiple sequence alignment and a hierarchical clustering of the sequences. BlockMSA was compared with a suite of leading MSA programs. With respect to quantitative measures of MSA, BlockMSA scores comparably to or better than the other leading MSA programs. With respect to biological validation of MSA, the other leading MSA programs lag behind BlockMSA in their ability to identify the most highly conserved regions.
APA, Harvard, Vancouver, ISO, and other styles
41

Morgenstern, Burkhard, Nadine Werner, Sonja J. Prohaska, et al. "Multiple sequence alignment with user-defined constraints at GOBICS." 2005. https://ul.qucosa.de/id/qucosa%3A32117.

Full text
Abstract:
Most multi-alignment methods are fully automated, i.e. they are based on a fixed set of mathematical rules. For various reasons, such methods may fail to produce biologically meaningful alignments. Herein, we describe a semi-automatic approach to multiple sequence alignment where biological expert knowledge can be used to influence the alignment procedure. The user can specify parts of the sequences that are biologically related to each other; our software program uses these sites as anchor points and creates a multiple alignment respecting these user-defined constraints. By using known functionally, structurally or evolutionarily related positions of the input sequences as anchor points, our method can produce alignments that reflect the true biological relationships among the input sequences more accurately than fully automated procedures can do.
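The anchoring idea can be sketched for the pairwise case: user-defined anchor points cut both sequences into segments that are aligned independently, so the anchored columns always remain matched. In the sketch below the segment aligner is a plain Needleman-Wunsch rather than the DIALIGN-style engine behind the actual server, and the scoring parameters and example sequences are invented.

```python
# Anchored pairwise alignment sketch: anchors are (i, j) cut points that
# must line up between sequences a and b; each segment pair is aligned
# independently with a basic global aligner.

def nw(a, b, match=1, mismatch=-1, gap=-2):
    """Plain Needleman-Wunsch returning the two aligned strings."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:  # traceback from the bottom-right corner
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))

def anchored_align(a, b, anchors):
    """Align a and b while respecting user-defined anchor cut points."""
    cuts = [(0, 0)] + sorted(anchors) + [(len(a), len(b))]
    parts_a, parts_b = [], []
    for (i1, j1), (i2, j2) in zip(cuts, cuts[1:]):
        seg_a, seg_b = nw(a[i1:i2], b[j1:j2])
        parts_a.append(seg_a); parts_b.append(seg_b)
    return "".join(parts_a), "".join(parts_b)
```

For example, `anchored_align("ACGTAC", "ACTTAC", [(3, 3)])` forces position 3 of each sequence to start a new, independently aligned segment.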
APA, Harvard, Vancouver, ISO, and other styles
42

Zablocki, Fabien B. R. "Multiple sequence alignment using particle swarm optimization." 2007. http://upetd.up.ac.za/thesis/available/etd-01162009-131115/.

Full text
APA, Harvard, Vancouver, ISO, and other styles
43

Ma, Fangrui. "Biological sequence analyses theory, algorithms, and applications /." 2009. http://proquest.umi.com/pqdweb?did=1821098721&sid=1&Fmt=2&clientId=14215&RQT=309&VName=PQD.

Full text
Abstract:
Thesis (Ph.D.)--University of Nebraska-Lincoln, 2009.<br>Title from title screen (site viewed October 13, 2009). PDF text: xv, 233 p. : ill. ; 4 Mb. UMI publication number: AAT 3360173. Includes bibliographical references. Also available in microfilm and microfiche formats.
APA, Harvard, Vancouver, ISO, and other styles
44

Zablocki, Fabien Bernard Roman. "Multiple sequence alignment using particle swarm optimization." Diss., 2008. http://hdl.handle.net/2263/23406.

Full text
Abstract:
The recent advent of bioinformatics has given rise to the central and recurrent problem of optimally aligning biological sequences. Many techniques have been proposed in an attempt to solve this complex problem, with varying degrees of success. This thesis investigates the application of a computational intelligence technique known as particle swarm optimization (PSO) to the multiple sequence alignment (MSA) problem. Firstly, the performance of the standard PSO (S-PSO) and its characteristics are fully analyzed. Secondly, a scalability study is conducted that aims at expanding the S-PSO's application to complex MSAs, as well as studying the behaviour of three other kinds of PSOs on the same problems. Experimental results show that the PSO is efficient in solving the MSA problem and compares favorably with the well-known CLUSTAL X and T-COFFEE.<br>Dissertation (MSc)--University of Pretoria, 2009.<br>Computer Science<br>Unrestricted
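For readers unfamiliar with the technique, a minimal standard PSO loop looks like the sketch below; the objective is a toy quadratic rather than an alignment score, and the inertia and acceleration parameters are illustrative defaults, not the values tuned in the thesis.

```python
import random

# Minimal S-PSO: each particle keeps a position, a velocity, and its
# personal best; the swarm shares a global best. Velocities follow the
# standard update with inertia w and cognitive/social coefficients c1, c2.

def pso(f, dim, n_particles=20, iters=200, w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = random.Random(seed)
    pos = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [f(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = f(pos[i])
            if val < pbest_val[i]:       # update personal best
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:      # and global best
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

best, best_val = pso(lambda p: sum(x * x for x in p), dim=3)
print(round(best_val, 6))
```

Applying this to MSA requires encoding an alignment (e.g. gap positions) as a particle position and using an alignment quality score as `f`, which is where the design work of the thesis lies.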
APA, Harvard, Vancouver, ISO, and other styles
45

"Development of bioinformatics platforms for methylome and transcriptome data analysis." 2014. http://library.cuhk.edu.hk/record=b6115790.

Full text
Abstract:
High-throughput massive parallel sequencing technologies, or Next-Generation Sequencing (NGS) technologies, have greatly accelerated biological and medical research. With the ever-growing throughput and complexity of the NGS technologies, bioinformatics methods and tools are urgently needed for analyzing the large amount of data and discovering the meaningful information behind it. In this thesis, I mainly worked on developing bioinformatics algorithms for two research fields in which NGS technologies are employed in depth and novel bioinformatics algorithms are highly needed: DNA methylation data analysis and large intergenic noncoding RNA discovery.<br>DNA methylation is one of the important epigenetic modifications that control the transcriptional regulation of genes. Whole genome bisulfite sequencing (BS-seq) is one of the most precise methodologies for DNA methylation study, as it allows us to perform whole-methylome research at single-base resolution.
To analyze the large amount of data generated by BS-seq experiments, I have co-developed and optimized Methy-Pipe, an integrated bioinformatics pipeline that performs both sequencing read alignment and methylation state decoding. Furthermore, I have developed a novel algorithm for mining Differentially Methylated Regions (DMRs), which can be used for large-scale methylation marker discovery. Methy-Pipe has been routinely used in our laboratory for methylomic studies, including non-invasive prenatal diagnosis and early cancer detection in human plasma.<br>Large intergenic noncoding RNAs, or lincRNAs, are a very important novel family of gene regulators in many biological processes, such as post-transcriptional regulation, splicing, and aging. Due to the highly tissue-specific expression pattern of the lincRNAs, a large proportion remains undiscovered. The development of Whole Transcriptome Shotgun Sequencing, also known as RNA-seq, combined with de novo or ab initio assembly, promises large-scale discovery of novel lincRNAs and hence the building of a complete transcriptome catalog. However, efficiently and accurately identifying the novel lincRNAs from the large transcriptome data still remains a bioinformatics challenge. To fill this gap, I have developed two bioinformatics tools: (I) iSeeRNA, for distinguishing lincRNAs from mRNAs, and (II) sebnif, for comprehensive filtering towards high-quality lincRNA screening; both have been used in various biological systems and showed satisfactory performance.<br>In summary, I have developed several bioinformatics algorithms that help researchers take advantage of the strength of the NGS technologies (methylome and transcriptome studies) and explore the biological nature behind the large amount of data.<br>Sun, Kun.<br>Thesis (Ph.D.)
Chinese University of Hong Kong, 2014.<br>Includes bibliographical references (leaves 118-126).<br>Abstracts also in Chinese.
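A drastically simplified sliding-window DMR caller, in the spirit of (but not identical to) the pipeline described above, can be sketched as follows; the window size, cutoff, and per-CpG methylation ratios are all invented for illustration.

```python
# Toy DMR caller: each sample is a list of per-CpG methylation ratios at
# the same genomic positions. Windows whose mean methylation difference
# exceeds a cutoff are flagged, and overlapping windows are merged into
# maximal candidate regions.

def call_dmrs(sample_a, sample_b, win=5, cutoff=0.25):
    assert len(sample_a) == len(sample_b)
    hits = []
    for start in range(0, len(sample_a) - win + 1):
        mean_a = sum(sample_a[start:start + win]) / win
        mean_b = sum(sample_b[start:start + win]) / win
        if abs(mean_a - mean_b) >= cutoff:
            hits.append((start, start + win))
    merged = []  # merge overlapping/touching windows
    for s, e in hits:
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    return merged

a = [0.9, 0.9, 0.8, 0.9, 0.9, 0.9, 0.2, 0.1, 0.2, 0.1, 0.9, 0.9]
b = [0.9] * 12
print(call_dmrs(a, b))  # → [(3, 12)]
```

A production caller such as the one in Methy-Pipe would additionally use statistical tests, read depth, and biological replicates rather than a fixed mean-difference cutoff.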
APA, Harvard, Vancouver, ISO, and other styles
46

Jensen, Kyle, and Gregory Stephanopoulos. "Bioinformatics and Handwriting/Speech Recognition: Unconventional Applications of Similarity Search Tools." 2005. http://hdl.handle.net/1721.1/30381.

Full text
Abstract:
This work introduces two unconventional applications for sequence alignment algorithms outside the domain of bioinformatics: handwriting recognition and speech recognition. In each application we treated data samples, such as the path of a written pen stroke, as a protein sequence and used the FastA sequence alignment tool to classify unknown data samples, such as a written character. That is, we handled the handwriting and speech recognition problems like the protein annotation problem: given a sequence of unknown function, we annotate the sequence via sequence alignment. This approach achieves classification rates of 99.65% and 93.84% for handwriting and speech recognition, respectively. In addition, we provide a framework for applying sequence alignment to a variety of other non-traditional problems.<br>Singapore-MIT Alliance (SMA)
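The core idea — treat a non-biological signal as a sequence and classify it by similarity search — can be sketched as below. The direction alphabet and reference strokes are invented, and difflib's `ratio` stands in for the FastA alignment score used in the actual work.

```python
from difflib import SequenceMatcher

# Toy stroke classifier: each pen stroke is encoded as a string over an
# invented 8-letter direction alphabet (standing in for amino acids) and
# classified by its best similarity against labeled reference strokes.

REFS = {
    "circle": "ABCDEFGH",   # directions sweep around while drawing an 'o'
    "vline":  "CCCCCCCC",   # straight downward stroke
    "zigzag": "AEAEAEAE",   # alternating directions
}

def classify(stroke):
    """Return the label of the reference with the highest similarity."""
    return max(REFS, key=lambda k: SequenceMatcher(None, stroke, REFS[k]).ratio())

print(classify("CCCCBCCC"))  # noisy vertical line → vline
```

Swapping `SequenceMatcher` for a real alignment tool with a tuned substitution matrix is exactly the step that gives the paper its reported accuracy.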
APA, Harvard, Vancouver, ISO, and other styles
47

Loving, Joshua. "Bit-parallel and SIMD alignment algorithms for biological sequence analysis." Thesis, 2017. https://hdl.handle.net/2144/27172.

Full text
Abstract:
High-throughput next-generation sequencing techniques have hugely decreased the cost and increased the speed of sequencing, resulting in an explosion of sequencing data. This motivates the development of high-efficiency sequence alignment algorithms. In this thesis, I present multiple bit-parallel and Single Instruction Multiple Data (SIMD) algorithms that greatly accelerate the processing of biological sequences. The first chapter describes the BitPAl bit-parallel algorithms for global alignment with general integer scoring, which assigns integer weights for match, mismatch, and insertion/deletion. The bit-parallel approach represents individual cells in an alignment scoring matrix as bits in computer words and emulates the calculation of scores by a series of logic operations. Bit-parallelism has previously been applied to other pattern matching problems, producing fast algorithms. In timed tests, we show that BitPAl runs 7-25 times faster than a standard iterative algorithm. The second part involves two approaches to alignment with substitution scoring, which assigns a potentially different substitution weight to every pair of alphabet characters, better representing the relative rates of different mutations. The first approach extends the existing BitPAl method. The second approach is a new SIMD algorithm that uses partial sums of adjacent score differences. I present a simple partial sum method as well as one that uses parallel scan for additional acceleration. Results demonstrate that these algorithms are significantly faster than existing SIMD dynamic programming algorithms. Finally, I describe two extensions to the partial sums algorithm. The first adds support for affine gap penalty scoring. Affine gap scoring reflects the biological expectation that gaps are more likely to be continuous than distributed throughout a region, which it models by introducing a gap opening penalty and a gap extension penalty.
The second extension is an algorithm that uses the partial sums method to calculate the tandem alignment of a pattern against a text sequence using a single pattern copy. Next generation sequencing data provides a wealth of information to researchers. Extracting that information in a timely manner increases the utility and practicality of sequence analysis algorithms. This thesis presents a family of algorithms which provide alignment scores in less time than previous algorithms.
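The bit-parallel idea can be illustrated with Myers' classic bit-vector algorithm, which BitPAl generalizes to arbitrary integer scoring. The sketch below computes plain Levenshtein distance; Python's unbounded integers stand in for the machine words of a real implementation.

```python
# Myers' bit-parallel edit distance: column deltas of the DP matrix are
# encoded as positive (pv) and negative (mv) bit vectors, and each text
# character is processed with a constant number of word operations.

def edit_distance(pattern, text):
    m = len(pattern)
    if m == 0:
        return len(text)
    peq = {}                       # per-character match masks
    for i, c in enumerate(pattern):
        peq[c] = peq.get(c, 0) | (1 << i)
    mask = (1 << m) - 1
    high = 1 << (m - 1)
    pv, mv, score = mask, 0, m     # score tracks the last DP row
    for c in text:
        eq = peq.get(c, 0)
        xv = eq | mv
        xh = (((eq & pv) + pv) ^ pv) | eq
        ph = mv | (~(xh | pv) & mask)
        mh = pv & xh
        if ph & high:
            score += 1
        if mh & high:
            score -= 1
        ph = ((ph << 1) | 1) & mask  # carry in +1: row 0 of the DP is 0..n
        mh = (mh << 1) & mask
        pv = mh | (~(xv | ph) & mask)
        mv = ph & xv
    return score

print(edit_distance("kitten", "sitting"))  # → 3
```

The inner loop is branch-light and word-parallel, which is why such formulations run many times faster than the cell-by-cell dynamic program.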
APA, Harvard, Vancouver, ISO, and other styles
48

Xu, Tianchuan. "Loop Prediction and Homology Modeling with High Resolution." Thesis, 2020. https://doi.org/10.7916/d8-1dqz-y485.

Full text
Abstract:
The three-dimensional (3D) structure of a protein is essential for guiding structure-based drug discovery. To achieve robust homology modeling with atomic-level accuracy, reliable loop predictions are required. Here, a novel hierarchical protocol of the Protein Local Optimization Program (PLOP) is designed to produce sub-2-angstrom predictions on loop regions in homology modeling. Dramatic improvements in both speed and accuracy have been realized with the implementation of a specially designed clustering and adaptive loop closure algorithm. Four prediction rounds are designed for homology modeling as the high-level protocol of PLOP, which allows later rounds to employ the educated guess of backbone atom positions and hydrogen bonding information inherited from the previous rounds, contributing additional prediction accuracy. The success of PLOP has been demonstrated with four different data sets, mainly concentrating on homology modeling of the H3 loops of antibodies. A GPU-accelerated sampling algorithm and deep learning models are implemented, which are able to produce promising predictions as input templates for PLOP in the context of homology modeling.
APA, Harvard, Vancouver, ISO, and other styles
49

Lee, En-Shiun Annie. "Training of Template-Specific Weighted Energy Function for Sequence-to-Structure Alignment." Thesis, 2008. http://hdl.handle.net/10012/4060.

Full text
Abstract:
Threading is a protein structure prediction method that uses a library of template protein structures in the following steps: first, the target sequence is matched to the template library and the best template structure is selected; second, the predicted structure of the target sequence is modeled on this selected template structure. The decelerating rate at which new folds are added to the Protein Data Bank promises eventual completion of the template structure library. This thesis uses a new set of template-specific weights to improve the energy function for sequence-to-structure alignment in the template selection step of the threading process. The weights are estimated using least squares methods, with the quality of the modelling step in the threading process as the label. These new weights show an average 12.74% improvement in estimating the label. Further family analysis shows a correlation between the performance of the new weights and the number of seeds in Pfam.
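The weight-estimation step described above amounts to an ordinary least-squares fit. A sketch with synthetic data might look like this; the feature matrix, labels, and weights are invented stand-ins, not the thesis's actual energy terms or modeling-quality labels.

```python
import numpy as np

# Least-squares training of per-template weights: each row of X holds the
# energy-term values for one candidate alignment against the template, and
# y holds the quality label of the resulting model. The synthetic data is
# generated from known weights plus a little noise, so the fit should
# recover those weights.

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0, 0.5])        # hypothetical term weights
X = rng.normal(size=(200, 3))              # energy terms per alignment
y = X @ true_w + rng.normal(scale=0.01, size=200)

w, *_ = np.linalg.lstsq(X, y, rcond=None)  # fitted template-specific weights
print(np.round(w, 2))
```

In the thesis each template gets its own fitted weight vector, so template selection scores every candidate with weights tuned to that template rather than one global set.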
APA, Harvard, Vancouver, ISO, and other styles
50

"Characterization and Analysis of a Novel Platform for Profiling the Antibody Response." Doctoral diss., 2011. http://hdl.handle.net/2286/R.I.14273.

Full text
Abstract:
Immunosignaturing is a new immunodiagnostic technology that uses random-sequence peptide microarrays to profile the humoral immune response. Though the peptides have little sequence homology to any known protein, binding of serum antibodies may be detected, and the pattern correlated to disease states. The aim of my dissertation is to analyze the factors affecting the binding patterns using monoclonal antibodies and to determine how much information may be extracted from the sequences. Specifically, I examined the effects of antibody concentration, competition, peptide density, and antibody valence. Peptide binding could be detected at the low concentrations relevant to immunosignaturing, and a monoclonal's signature could even be detected in the presence of a 100-fold excess of naive IgG. I also found that peptide density was important, but this effect was not due to bivalent binding. Next, I examined in more detail how a polyreactive antibody binds to the random-sequence peptides compared to protein-sequence-derived peptides, and found that it bound to many peptides from both sets, but with low apparent affinity. An in-depth look at peptide physicochemical properties and sequence complexity revealed some correlations with properties, but they were generally small and varied greatly between antibodies. However, on a limited-diversity but larger peptide library, I found that sequence complexity was important for antibody binding. The redundancy in that library did enable the identification of specific sub-sequences recognized by an antibody. The current immunosignaturing platform has little repetition of sub-sequences, so I evaluated several methods to infer antibody epitopes. I found two methods that had modest prediction accuracy, and I developed a software application called GuiTope to facilitate the epitope prediction analysis. None of the methods had sufficient accuracy to identify an unknown antigen from a database.
In conclusion, the characteristics of the immunosignaturing platform observed through monoclonal antibody experiments demonstrate its promise as a new diagnostic technology. However, a major limitation is the difficulty in connecting the signature back to the original antigen, though larger peptide libraries could facilitate these predictions.<br>Dissertation/Thesis<br>Ph.D. Molecular and Cellular Biology 2011
APA, Harvard, Vancouver, ISO, and other styles