Dissertations / Theses on the topic 'Bioinformatics pipeline'

Consult the top 46 dissertations / theses for your research on the topic 'Bioinformatics pipeline.'

1

Johansson-Åkhe, Isak. "PePIP : a Pipeline for Peptide-Protein Interaction-site Prediction." Thesis, Linköpings universitet, Institutionen för fysik, kemi och biologi, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-138411.

Abstract:
Protein-peptide interactions play a major role in several biological processes, such as cell proliferation and cancer cell life-cycles. Accurate computational methods for predicting protein-protein interactions exist, but few of these methods can be extended to predicting interactions between a protein and a particularly small or intrinsically disordered peptide. In this thesis, PePIP is presented. PePIP is a pipeline for predicting where on a given protein a given peptide will most probably bind. The pipeline uses structural alignment to search the Protein Data Bank for possible templates for the interaction to be predicted, using the larger chain as the query. The possible templates are then evaluated as to whether they can represent the query protein and peptide using a Random Forest classifier, and the best templates are found by combining the Random Forest evaluation with hierarchical clustering. These final templates are then combined to give a prediction of the binding site. PePIP is shown to be highly accurate when tested on a set of 502 experimentally determined protein-peptide structures, suggesting a binding site on the correct part of the protein surface roughly 4 out of 5 times.
2

Ishak, Helena. "Developing a ChIP-seq pipeline that analyzes the human genome and its repetitive sequences." Thesis, Uppsala universitet, Institutionen för biologisk grundutbildning, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-335914.

3

Lember, Geivi. "Sepsis-associated Escherichia coli whole-genome sequencing analysis using in-house developed pipeline and 1928 diagnostics tool." Thesis, Högskolan i Skövde, Institutionen för biovetenskap, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:his:diva-19841.

Abstract:
Sepsis is a life-threatening condition that is caused by a dysregulated host response to infection. Timely detection of sepsis and antibiotic treatment is important for the patient’s recovery from sepsis. Usually, when sepsis is detected, immediate antibiotic treatment is started with broad-spectrum antibiotics as it takes time to determine the correct antibiotic susceptibility. To overcome this problem, next-generation sequencing is seen as one possible development in clinical diagnostics in the future. Automated bioinformatics pipelines could be used initially for surveillance purposes but eventually for rapid clinical diagnosis. Therefore, the results of 1928 Diagnostics, an automated pipeline for whole-genome sequencing (WGS) data analysis, were compared with the results of an in-house developed pipeline for manual data processing by analyzing sepsis-associated Escherichia coli (SEPEC) WGS data. The pipelines were compared by assessing their predicted antimicrobial resistance (AMR) genes, virulence genes and epidemiological relatedness. In addition, the predicted resistance genes were compared to phenotypic antimicrobial susceptibility testing (AST) data from the clinical microbiology laboratory. All the results obtained from the 1928 Diagnostics and in-house pipeline were similar but differed in the number of virulence/predicted AMR genes, AMR gene variants, detection of species and epidemiologically related E. coli samples. Moreover, the predicted AMR genes from both pipelines did not show a good overall relation to the phenotypic AST result. More studies are needed to make predictions of genes from the WGS analysis more reliable so that WGS analysis can be used as a diagnostics tool in clinical laboratories in the future.
4

Ramsay, Trevor. "A Motif Discovery and Analysis Pipeline for Heterogeneous Next-Generation Sequencing Data." Thesis, University of California, Davis, 2015. http://pqdtopen.proquest.com/#viewpdf?dispub=1599520.

Abstract:
Bioinformatics has made great strides in understanding the regulation of gene expression, but many of the tools developed for this purpose depend on data from a limited number of species. Despite their unique genetic attributes, there remains a dearth of research into undomesticated trees. The poplar tree, Populus trichocarpa, has undergone multiple rounds of genome duplication during its evolution. In addition, its life cycle differs from that of the annual crop and model plants previously studied, which creates significant technical challenges in understanding the unique biology of these trees. For example, secondary growth occurs as the tree stems thicken and creates secondary xylem (wood) and phloem (inner bark) for transporting water and the products of photosynthesis, respectively. Because of this, the research group I work with studies the secondary growth of P. trichocarpa (Spicer, 2010) (Groover, et al., 2010) (Groover, et al., 2006) (Groover, 2005).
The genomic tools to investigate gene regulation in P. trichocarpa are readily available. Next-generation sequencing technologies such as RNA-Seq and ChIP-Seq can be used to understand gene expression and the binding of transcription factors to specific locations in the genome. Similarly, a variety of specialized bioinformatic tools such as EdgeR, Cufflinks, and MACS can be used to analyze gene binding and expression from sequencing data provided by ChIP-seq and RNA-seq (Blahnik, et al., 2010) (Mortazavi, et al., 2008) (Robinson, 2010) (Robinson, 2007) (Robinson, et al., 2008) (McCarthy, 2012) (Trapnell, 2013) (Zhang, 2008). The binding and expression data these tools provide form a foundation for analyzing the regulation of gene expression in P. trichocarpa.
The goal of my project is to provide a motif discovery and analysis pipeline for analyses of Populus species. The pipeline uses heterogeneous data collected from poplar and aspen mutants to elucidate the gene regulatory mechanisms involved in secondary growth. The experiments target transcription factors related to secondary growth, and through analysis of the variety of transcription factor binding experiments, I have identified the motifs involved in gene regulation of secondary growth within P. trichocarpa (Filkov, et al., 2008).
5

Garcia, Krystine. "Bioinformatics Pipeline for Improving Identification of Modified Proteins by Neutral Loss Peak Filtering." Ohio University / OhioLINK, 2015. http://rave.ohiolink.edu/etdc/view?acc_num=ohiou1440157843.

6

Norris, Shaun W. "A Pipeline for Creation of Genome-Scale Metabolic Reconstructions." VCU Scholars Compass, 2016. http://scholarscompass.vcu.edu/etd/4667.

Abstract:
The decreasing costs of next generation sequencing technologies and the increasing speeds at which they work have led to an abundance of 'omic datasets. The need for tools and methods to analyze, annotate, and model these datasets to better understand biological systems is growing. Here we present a novel software pipeline to reconstruct the metabolic model of an organism in silico starting from its genome sequence, and a novel compilation of biological databases to better serve the generation of metabolic models. We validate these methods using five Gardnerella vaginalis strains and compare the gene annotation results to NCBI and the FBA results to Model SEED models. We found that our gene annotations were larger and highly similar in terms of function and gene types to the gene annotations downloaded from NCBI. Further, we found that our FBA models required a minimal addition of transport reactions, sources, and escapes, indicating that our draft pathway models were very complete. We also found that on average our solutions contained more reactions than the models obtained from Model SEED, due to a large number of baseline reactions and gene products found in ASGARD.
7

Kuntala, Prashant Kumar. "Optimizing Biomarkers From an Ensemble Learning Pipeline." Ohio University / OhioLINK, 2017. http://rave.ohiolink.edu/etdc/view?acc_num=ohiou1503592057943043.

8

Milani, Renato 1985. "Desenvolvimento de um pipeline para analise em larga escala de um chip de proteinas quinases." [s.n.], 2010. http://repositorio.unicamp.br/jspui/handle/REPOSIP/314742.

Abstract:
Advisors: Eduardo Galembeck, Carmen Verissima Ferreira. Master's dissertation, Universidade Estadual de Campinas, Instituto de Biologia, 2010.
The activity of protein kinases governs many biological processes through signaling cascades that lead to distinct outputs, from homeostasis to disease. However, analysis of the reversible phosphorylation catalyzed by these enzymes is hindered by the inherent complexity of interactions in this signaling system. Consequently, focusing on a single kinase may not be enough to completely unfold the mechanisms behind observed phenomena. In this sense, together with other omics approaches like mRNA expression analysis, peptide arrays have shown increasing popularity, particularly ones containing kinase substrate sequences. The lack of uniformity in statistical analysis of these chips, though, has been a major issue for the field. In this paper, we propose PepMatrix, a fast and accurate method for selecting so-called "reliable replicate spots" and automatically retrieving differential activity and annotation information about phosphorylation events identified in a peptide array of kinase substrates. Here, we present several cases where this new methodology was applied to biological datasets. We successfully identified putative up- and down-regulated kinases, many of which were confirmed to have altered activity by Western blot. Moreover, the results emphasized the need for a true systems biology approach to the cellular signaling events alongside a critical replicate selection method. The high degree of analysis uniformity we achieved with this method provides a powerful and reliable addition for high-throughput kinome analysis.
Degree: Master in Functional and Molecular Biology (area: Biochemistry).
9

Isak, Sylvin. "Increasing bioinformatics in third world countries : Studies of S.digitata and P.Polymyxa to further bioinformatics in east Africa." Thesis, Uppsala universitet, Institutionen för biologisk grundutbildning, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-293636.

Abstract:
Despite an increase of biotechnical studies in third world countries, the bioinformatical side is largely lacking. In this paper we attempt to further the bioinformatical capabilities of east Africa. The project consisted of two teaching segments for east African doctoral students, one as part of an academic workshop at ILRI, Kenya, and one in a small class at SLU, Sweden. The project also included the generation of two simple-to-use bioinformatical pipelines with the explicit aim of being reused by novice bioinformaticians from the very same region. The viability of the pipelines was verified by generating transcriptional expression level differences for Paenibacillus polymyxa strain A26 and whole genome annotations for Setaria digitata. Both pipelines may have some merit for the collaborative effort between ILRI and SLU to annotate Eleusine coracana, a drought-resilient crop, the annotation of which may save lives. The teaching material, source code for the pipelines and overall teaching impression have been included in this paper.
10

Andersson, Christoffer. "PELICAN : a PipELIne, including a novel redundancy-eliminating algorithm, to Create and maintain a topicAl family-specific Non-redundant protein database." Thesis, University of Skövde, School of Humanities and Informatics, 2005. http://urn.kb.se/resolve?urn=urn:nbn:se:his:diva-960.

Abstract:
The increasing number of biological databases today requires that users are able to search more efficiently among as well as within individual databases. One of the most widespread problems is redundancy, i.e. the problem of duplicated information in sets of data. This thesis aims at implementing an algorithm that differs from other related attempts by using the genomic positions of sequences, instead of similarity-based sequence comparisons, when making a sequence data set non-redundant. In an automatic updating procedure, the algorithm drastically increases the possibility to update and to maintain the topicality of a non-redundant database. The procedure creates a biologically sound non-redundant data set with accuracy comparable to other algorithms focusing on making data sets non-redundant.
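The idea of removing redundancy by genomic position rather than by sequence similarity can be illustrated with a minimal sketch. It is not the PELICAN algorithm, whose details the abstract does not give. It assumes a hypothetical `Record` schema with chromosome, strand and 1-based inclusive coordinates, treats records with overlapping intervals on the same chromosome and strand as redundant, and keeps the longest record of each overlapping cluster.

```python
from collections import defaultdict
from typing import NamedTuple


class Record(NamedTuple):
    """One sequence entry with its genomic location (hypothetical schema)."""
    seq_id: str
    chrom: str
    strand: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def remove_positional_redundancy(records):
    """Keep one representative per cluster of overlapping genomic positions.

    Assumption for this sketch: records on the same chromosome and strand
    whose intervals overlap are redundant, and the longest one is kept.
    """
    by_locus = defaultdict(list)
    for rec in records:
        by_locus[(rec.chrom, rec.strand)].append(rec)

    kept = []
    for group in by_locus.values():
        group.sort(key=lambda r: (r.start, r.end))
        best = group[0]
        cluster_end = best.end
        for rec in group[1:]:
            if rec.start <= cluster_end:
                # Overlaps the current cluster: extend it, keep the longer record.
                cluster_end = max(cluster_end, rec.end)
                if rec.end - rec.start > best.end - best.start:
                    best = rec
            else:
                # Gap found: close the current cluster and start a new one.
                kept.append(best)
                best, cluster_end = rec, rec.end
        kept.append(best)
    return sorted(kept, key=lambda r: (r.chrom, r.strand, r.start))
```

Because no sequence comparison is involved, the cost is dominated by sorting the records, which is one reason a position-based approach suits an automatic update procedure.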
11

Evans, Daniel T. "A SNP Microarray Analysis Pipeline Using Machine Learning Techniques." Ohio University / OhioLINK, 2010. http://rave.ohiolink.edu/etdc/view?acc_num=ohiou1289950347.

12

Kremer, Frederico Schmitt. "Genix: desenvolvimento de uma nova pipeline automatizada para anotação de genomas microbianos." Universidade Federal de Pelotas, 2016. http://repositorio.ufpel.edu.br:8080/handle/prefix/3732.

Abstract:
Funding: Conselho Nacional de Pesquisa e Desenvolvimento Científico e Tecnológico (CNPq).
The advent of next-generation sequencing (NGS) significantly reduced the cost of genome sequencing projects. The easier it is to generate genomic data, the more accurate the annotation steps must be to avoid both the loss of information and the accumulation of erroneous features that may affect the accuracy of further analysis. In the case of bacterial genomes, a range of web annotation software has been developed; however, many applications have not incorporated the steps required to improve the output (e.g., false-positive/spurious ORF filtering and a more complete non-coding RNA annotation). The present work describes the implementation of Genix, a new bacterial genome annotation pipeline that combines the functionality of the programs Prodigal, tRNAscan-SE, RNAmmer, Aragorn, INFERNAL, NCBI-BLAST+, CD-HIT, Rfam and UniProt, with the intention of increasing the effectiveness of the annotation results. To evaluate the accuracy of Genix, we used as models of study the reference genomes of Escherichia coli K-12, Leptospira interrogans strain Fiocruz L1-130, Listeria monocytogenes EGD-e and Mycobacterium tuberculosis H37Rv. The results obtained by Genix were compared to the original annotation and to those from the annotation pipelines RAST and BASys, considering new, missing and exclusive genes, functional annotation information and the prediction of spurious ORFs. To quantify the annotation accuracy, a new metric, called "annotation discrepancy", was developed. In a comparative analysis, Genix showed the smallest discrepancy for the four genomes, ranging from 0.96 to 5.71%. The highest discrepancy was observed in the L. interrogans genome, for which RAST and BASys resulted in discrepancies greater than 14.0%. Additionally, several spurious proteins were identified in the annotations generated by RAST and BASys and, in smaller number, in the reference annotations, indicating that the use of the Antifam database allows better control of the number of false-positive genes. Based on the evaluations, it was possible to show that Genix is able to generate annotations with good accuracy (low discrepancy), low omission of relevant (functional) genes and a small number of false-positive genes.
13

Halstead, Holly. "ARG-MATEE Automated Pipeline for Detection of Antimicrobial Resistance in WGS Data Collected from Pig Farms and Surrounding Communities." Thesis, Uppsala universitet, Institutionen för biologisk grundutbildning, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-414587.

Abstract:
As part of recognizing the interconnected nature of different sectors in relation to health, AMR (antimicrobial resistance) has emerged as an issue of high global importance. E. coli isolates were taken from pig farms in Thailand, which serves as a point of interest in the study of ARGs (antimicrobial resistance genes) in emerging economies. The fecal samples were collected from pigs, humans who came in contact with the pigs, and humans who did not have contact with pigs, to be analyzed for ARGs, virulence genes, and plasmids. Data was analyzed with an automated pipeline in the form of ARG-MATEE, the Antimicrobial Resistance Gene Multi-Analysis Tool for Enteric E. coli, a tool designed in this study to be used here and in future investigations. ARG-MATEE regulates and records internal software versions in a produced report, which also includes data tables for all non-phylogeny results in Boyce–Codd normal form and data visualizations for plasmids, ARGs, virulence genes, and phylogeny. Through the use of ARG-MATEE, the iss virulence gene was seen to be significantly different between testing groups, as it is present in only human testing groups, suggesting the loss of function of the iss gene in pigs and showing host specialization.
14

Pranckeviciene, Erinija. "Bioinformatics Tools for the Analysis of Gene-Phenotype Relationships Coupled with a Next Generation ChIP-Sequencing Data Analysis Pipeline." Thesis, Université d'Ottawa / University of Ottawa, 2015. http://hdl.handle.net/10393/31940.

Abstract:
The rapidly advancing high-throughput and next generation sequencing technologies facilitate deeper insights into the molecular mechanisms underlying the expression of phenotypes in living organisms. Experimental data and scientific publications following this technological advancement have rapidly accumulated in public databases. Meaningful analysis of currently available data in genomic databases requires sophisticated computational tools and algorithms, and presents considerable challenges to molecular biologists without specialized training in bioinformatics. To study their phenotype of interest, molecular biologists must prioritize large lists of poorly characterized genes generated in high-throughput experiments. To date, prioritization tools have primarily been designed to work with phenotypes of human diseases as defined by the genes known to be associated with those diseases. There is therefore a need for more prioritization tools for phenotypes that are not related to diseases, and for diseases with which no genes have yet been associated. Chromatin immunoprecipitation followed by next generation sequencing (ChIP-Seq) is a method of choice to study the gene regulation processes responsible for the expression of cellular phenotypes. Among publicly available computational pipelines for the processing of ChIP-Seq data, there is a lack of tools for the downstream analysis of composite motifs and preferred binding distances of the DNA binding proteins. This thesis aims to address the gap in the tools available to process high-throughput ChIP-Seq data, providing rapid analysis and interpretation of large lists of poorly characterized genes. Additionally, programs for the analysis of preferred binding distances of transcription factors were integrated into the pipeline for expedited results. A gene prioritization algorithm linking genes to non-disease phenotypes described by meaningful keywords was developed. 
This algorithm can be used to process candidate genetic targets of a transcription factor produced by a computational pipeline for ChIP-Seq data analysis.
15

Hafez, Khafaga Ahmed Ibrahem 1987. "Bioinformatics approaches for integration and analysis of fungal omics data oriented to knowledge discovery and diagnosis." Doctoral thesis, TDX (Tesis Doctorals en Xarxa), 2021. http://hdl.handle.net/10803/671160.

Abstract:
The aim of this thesis has been to develop a series of bioinformatic resources for analysis of NGS data, proteomics, or other omics technologies in the field of study and diagnosis of yeast infections. In particular, we have explored and designed distinct computational techniques to identify novel biomarker candidates of resistance traits, to predict features of DNA/RNA sequences, and to optimize sequencing strategies for host-pathogen transcriptome sequencing studies (Dual RNA-seq). We have designed and developed an efficient bioinformatic solution composed of a server-side component, constituted by distinct pipelines for VariantSeq, Denovoseq and RNAseq analyses, and another component constituted by distinct GUI-based software that lets the user access, manage and run the pipelines through friendly interfaces. We have also designed, developed and validated SeqEditor, software for sequence analysis and primer design for species identification and detection in PCR diagnosis. Finally, we have developed CandidaMine, an integrated data warehouse of fungal omics for data analysis and knowledge discovery.
16

Chen, Jonathan Jun Feng. "Data Mining/Machine Learning Techniques for Drug Discovery: Computational and Experimental Pipeline Development." University of Akron / OhioLINK, 2018. http://rave.ohiolink.edu/etdc/view?acc_num=akron1524661027035591.

17

Nordin, Jessika. "Assessment of variant load in an idiopathic autoinflammatory index patient." Thesis, Uppsala universitet, Institutionen för biologisk grundutbildning, 2014. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-236025.

18

Xu, Guorong. "Computational Pipeline for Human Transcriptome Quantification Using RNA-seq Data." ScholarWorks@UNO, 2011. http://scholarworks.uno.edu/td/343.

Abstract:
The main theme of this thesis research is concerned with developing a computational pipeline for processing Next-generation RNA sequencing (RNA-seq) data. RNA-seq experiments generate tens of millions of short reads for each DNA/RNA sample. The alignment of a large volume of short reads to a reference genome is a key step in NGS data analysis. Although storing alignment information in the Sequence Alignment/Map (SAM) or Binary SAM (BAM) format is now standard, biomedical researchers still have difficulty accessing useful information. In order to assist biomedical researchers to conveniently access essential information from NGS data files in SAM/BAM format, we have developed a Graphical User Interface (GUI) software tool named SAMMate to pipeline human transcriptome quantification. SAMMate allows researchers to easily process NGS data files in SAM/BAM format and is compatible with both single-end and paired-end sequencing technologies. It also allows researchers to accurately calculate gene expression abundance scores.
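The abstract does not say which abundance score SAMMate computes. As an illustration only, the sketch below uses RPKM (reads per kilobase of transcript per million mapped reads), a common RNA-seq abundance measure. The function name and the example numbers are hypothetical.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    RPKM = read_count * 1e9 / (gene_length_bp * total_mapped_reads)
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical example: 500 reads on a 2,000 bp gene in a library of
# 10 million mapped reads.
example_score = rpkm(500, 2_000, 10_000_000)
```

Dividing by gene length and by library size makes scores comparable between long and short genes and between samples sequenced to different depths.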
19

Mohamed, Saleem Mohamed Ashick. "Pipeline intégratif multidimensionnel d'analyse de données NGS pour l'étude du devenir cellulaire." Thesis, Strasbourg, 2015. http://www.theses.fr/2015STRAJ072/document.

Abstract:
Epigenomics would help us understand why various cell types exhibit different behaviours. Aberrant changes in reversible epigenetic modifications observed in cancer have raised interest in epigenetic targeted therapy. As epigenetic studies may involve comparing multi-profile sequencing data, there is an imminent need for novel approaches and tools to address underlying technical variabilities. We have developed NGS-QC, a QC system to infer the experimental quality of the data, and Epimetheus, a quantile-based multi-profile normalization tool for histone modification datasets that corrects technical variation among samples. Further, we have employed these tools in an allele-specific analysis to understand the epigenetic status of X chromosome inactivation (XCI) in breast cancer cells, where disappearance of the inactive X (Xi) is frequent. Our analysis has revealed perturbation in the epigenetic landscape of X and aberrant gene reactivation in Xi, including genes associated with cancer promotion.
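To make "quantile-based multi-profile normalization" concrete, here is a minimal sketch of textbook quantile normalization in pure Python. It is not the Epimetheus implementation, which the abstract does not describe. It assumes each profile is a list of signal values over the same genomic bins, and the function name is hypothetical.

```python
def quantile_normalize(profiles):
    """Force every profile to share the same value distribution.

    profiles: list of equal-length lists, one list of signal values per sample.
    The k-th smallest value of each sample is replaced by the mean of the
    k-th smallest values across all samples. Ties are broken by position.
    """
    if not profiles:
        return []
    n = len(profiles[0])
    if any(len(p) != n for p in profiles):
        raise ValueError("all profiles must have the same length")

    # Reference distribution: mean across samples at each rank.
    sorted_profiles = [sorted(p) for p in profiles]
    reference = [sum(col) / len(profiles) for col in zip(*sorted_profiles)]

    normalized = []
    for p in profiles:
        # Indices of the sample's values in ascending order (stable sort).
        order = sorted(range(n), key=lambda i: p[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized
```

After normalization, all samples contain the same set of values in different orders, so differences in rank between profiles survive while differences in overall signal scale are removed.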
20

Pignata, Luiz Fernando Martins. "Pipeline para Análise In Sílico de Dados de Expressão de miRNAs e mRNAs em Células de Mamíferos." Universidade de São Paulo, 2012. http://www.teses.usp.br/teses/disponiveis/17/17135/tde-14062012-132754/.

Abstract:
Os microRNAs estão envolvidos no processo de regulação da expressão gênica da célula, onde a molécula de microRNA se liga com o RNA mensageiro interrompendo, assim, a expressão do respectivo gene pela interrupção da tradução. A bioinformática tem auxiliado na identificação de vários genes codificadores de microRNAs em plantas e animais, incluindo mamíferos, por meio de analises de dados de microarray; assim como na predição de estruturas. Os dados de expressão de microRNAs e RNAs mensageiros foram obtidos por meio de cooperação firmada entre o Laboratório de Bioinformática do Departamento de Genética da Faculdade de Medicina de Ribeirão Preto - USP, coordenado pela orientadora desse projeto, e o Laboratório de Imunogenética Molecular do mesmo departamento, coordenado pelo Professor Doutor Geraldo A. S. Passos. Durante o desenvolvimento e os testes realizados, foram utilizados dados (valores numéricos de dados de expressão coletados por microarrays) provenientes da comparação da expressão de microRNAs e RNAs do timo de camundongos non obese diabetic que reproduzem diabetes melitus do tipo 1, e dados provenientes da comparação da expressão de microRNAs e RNAs de outros experimentos. O presente projeto teve como objetivo o desenvolvimento de um pipeline para a análise in silico de dados de expressão gênica de microRNAs e mRNAs obtidos por microarray. Com base em dados de expressão de microRNAs e RNAs mensageiros, foi possível a análise de diversas ferramentas e o desenvolvimento e ajuste de scripts para que seja possível a análise sequencial de tais dados. Dessa forma, o pipeline desenvolvido inclui a quantificação dos dados de expressão gênica a partir das lâminas de microarray, a normalização dos dados, as análises estatísticas das sequências diferencialmente expressas utilizando o Multi Experiment Viewer, a construção de redes de interação microRNAs-RNAs mensageiros e a busca de alvos de microRNAs baseada nesta rede, ambos pelo GenMir++. 
O pipeline desenvolvido é executado com facilidade e possibilitou a correta análise dos dados, evitando desperdício de tempo em análises de bancada. A partir dos resultados obtidos, novos alvos de miRNA foram encontrados com o uso do pipeline e comprovados em bancada. Tais resultados foram apresentados no 55º Congresso Brasileiro de Genética com o resumo intitulado MicroRNA-mRNA Network Controlling the Promiscuous Gene Expression in the Thymus of NOD (Non Obese Diabetic) Mice: Implications in the Emergence of Type 1 Diabetes Mellitus.<br>MicroRNAs are involved in the regulation of gene expression in the cell. The miRNA molecule binds to the messenger RNA and interrupts gene expression by disrupting translation. Through microarray data analysis, bioinformatics has aided the identification of several genes that encode miRNAs in plants and animals, including mammals. It is also very useful for predicting structures. Data on miRNA and mRNA expression were obtained through a collaboration between the Bioinformatics Laboratory and the Molecular Immunogenetics Laboratory of the Department of Genetics of the Faculty of Medicine of Ribeirão Preto - USP, coordinated by professors Silvana Giuliatti and Geraldo A. S. Passos, respectively. During the development and testing stages of the research, microarray data (numerical values of the expression) were obtained from the comparison between the expression of miRNA and mRNA in the thymus of non obese diabetic mice with diabetes mellitus type 1, as well as from comparisons of their expression in other experiments. The present study aimed to develop a pipeline for in silico analysis of miRNA and mRNA gene expression data obtained by microarray. Based on miRNA and mRNA expression data, it was possible to evaluate several tools and to develop and adjust scripts that allow the sequential analysis of such data. 
The pipeline includes the quantification of gene expression data from microarray slides, the normalization of the data, the statistical analysis of differentially expressed sequences using Multi Experiment Viewer, the construction of miRNA-mRNA interaction networks, and the search for miRNA targets based on such networks, both using GenMir++. The pipeline is easy to run and allowed the correct analysis of the data, avoiding wasted time in bench analysis. From the results, new miRNA targets were found using the pipeline and were then verified at the bench. The results were presented at the 55th Brazilian Genetics Congress in the abstract entitled "MicroRNA-mRNA Network Controlling the Promiscuous Gene Expression in the Thymus of NOD (Non Obese Diabetic) Mice: Implications in the Emergence of Type 1 Diabetes Mellitus".
21

Amgarten, Deyvid Emanuel. "Análise computacional da diversidade viral presente na comunidade microbiana do processo de compostagem do Zoológico de São Paulo." Universidade de São Paulo, 2016. http://www.teses.usp.br/teses/disponiveis/95/95131/tde-14072017-161226/.

Abstract:
O estudo da diversidade viral em amostras ambientais tem se tornado cada vez mais importante devido a funções-chave desempenhadas por esses organismos. Estudos recentes têm fornecido evidências de que vírus de bactérias (bacteriófagos) podem ser os principais determinantes em ciclos biogeoquímicos de grandes ecossistemas, além de atuarem no fluxo de genes entre comunidades ambientais e na plasticidade funcional das mesmas frente a estresses ambientais. Neste trabalho, propomos a investigação e caracterização da diversidade viral presente em amostras de compostagem através de abordagens não dependentes e dependentes de cultivo. Na primeira abordagem, coletamos amostras seriadas de uma unidade de compostagem do zoológico de São Paulo para realização de sequenciamento metagenômico. O conjunto de sequências gerado foi extensivamente minerado (data-mining) para a produção de resultados de diversidade e abundância de táxons virais ao longo do processo de compostagem. Adicionalmente, procedemos com a montagem e recuperação de sequências virais candidatas a genomas completos e/ou parciais de novos vírus ambientais. Os dois protocolos computacionais utilizados para a mineração de dados encontram-se definidos e automatizados, podendo ser aplicados em quaisquer conjuntos de dados de sequenciamento metagenômico ou metatranscritômico obtidos através da plataforma Illumina. A segunda abordagem correspondeu ao isolamento e caracterização de novos fagos de Pseudomonas obtidos de amostras de compostagem. Três novos fagos foram identificados e tiveram os seus genomas sequenciados. A caracterização genômica desses fagos revelou genomas com alto grau de novidade, insights sobre a evolução de Caudovirales e a presença de genes de tRNA, cuja função pode estar relacionada com um mecanismo dos fagos para contornar o viés traducional apresentado pela bactéria hospedeira. 
A caracterização experimental dos novos fagos isolados demonstrou grande potencial para lise e dissolução de biofilme da cepa Pseudomonas aeruginosa PA14, conhecida como agente causador de infecções hospitalares em pacientes imunodeprimidos. Em suma, os dados reunidos nesta dissertação caracterizam a diversidade presente no viroma da compostagem e contribuem para o entendimento dos perfis taxonômico, funcional e ecológico do processo.<br>The study of viral diversity in environmental samples has become increasingly important due to the key roles that these organisms play in our ecosystems. Recent publications provide evidence that viruses of bacteria (bacteriophages) may be key players in the biogeochemical cycles of large ecosystems, such as oceans and forests. They may also be determinant in gene flow among populations and in the plasticity of communities in the face of environmental stresses. In this work, we propose the investigation and characterization of the viral diversity in composting samples through culture-independent and culture-dependent approaches. In the first approach, we sampled a composting unit at the Sao Paulo Zoo Park at different time points and proceeded with metagenomic sequencing. The dataset generated was extensively mined to provide results on the diversity and abundance of viral taxa throughout the composting process. Additionally, we proceeded with the assembly and retrieval of sequences that are candidates for partial and/or complete viral genomes. The two computational protocols were automated as pipelines and can be applied to any metagenomic dataset of Illumina reads. The second approach refers to the isolation and characterization of new Pseudomonas phages obtained from composting samples. Three new phages were identified and their genomes were sequenced. 
A detailed characterization of these genomes revealed a high degree of novelty, insights into the evolution of tailed phages and the presence of tRNA genes, which may be related to a mechanism to bypass host translational bias. The experimental characterization of the new phages demonstrated great potential to lyse bacterial cells and to degrade Pseudomonas aeruginosa PA14 biofilms. In short, the data presented in this dissertation shed light on the diversity of the composting virome, as well as on the functional and ecological profiles of viruses in the composting environment.
22

Yousaf, Afsheen [Verfasser], Ina [Gutachter] Koch, and Christine M. [Gutachter] Freitag. "Integrative bioinformatics pipeline for genome-wide association studies in neuropsychiatry and the subsequent application in Autism Spectrum Disorder cohorts / Afsheen Yousaf ; Gutachter: Ina Koch, Christine M. Freitag." Frankfurt am Main : Universitätsbibliothek Johann Christian Senckenberg, 2020. http://d-nb.info/1210555689/34.

23

Chalco, Jesus Pascual Mena. "Identificação de regiões codificantes de proteína através da transformada modificada de Morlet." Universidade de São Paulo, 2005. http://www.teses.usp.br/teses/disponiveis/45/45134/tde-05062007-115359/.

Abstract:
Um tópico importante na análise de seqüências biológicas é a busca de genes, ou seja, a identificação de regiões codificantes de proteínas. Esta identificação permite a posterior procura de significado, descrição ou categorização biológica do organismo analisado. Atualmente, vários métodos combinam reconhecimento de padrões com conhecimento coletado de conjuntos de treinamento ou de comparações com banco de dados genômicos. Entretanto, a acurácia desses métodos está ainda longe do satisfatório. Novos métodos de processamento de seqüências de DNA e de identificação de genes podem ser criados através da busca por conteúdo (search-by-content). O padrão periódico de DNA em regiões codificantes de proteína, denominada periodicidade de três bases, vem sendo considerado uma propriedade dessas regiões. As técnicas de processamento digital de sinais fornecem uma base robusta para a identificação de regiões com periodicidade de três bases. Nesta dissertação, são apresentados um pipeline, os conceitos básicos da identificação genômica, e métodos de processamento digital de sinais utilizados para a identificação de regiões codificantes de proteínas. Introduzimos um novo método para a identificação dessas regiões, baseado na transformada proposta, denominada Transformada Modificada de Morlet. Apresentamos vários resultados experimentais obtidos a partir de seqüências de DNA sintéticas e reais. As principais contribuições do trabalho consistem no desenvolvimento de um pipeline para projetos genoma e na criação de um método de identificação de regiões codificantes onde a periodicidade de três bases seja latente. O método apresenta desempenho superior e vantagens importantes em comparação ao método tradicional baseado na transformada de Fourier de tempo reduzido.<br>An important topic in biological sequence analysis is gene finding, i.e. the identification of protein coding regions. 
This identification allows the subsequent search for the meaning, description or biological categorization of the analyzed organism. Currently, several methods combine pattern recognition with knowledge collected from training datasets or from comparison with genomic databases. Nonetheless, the accuracy of these methods is still far from satisfactory. New methods for DNA sequence processing and gene identification can be created through search-by-content. The periodic pattern of DNA in protein coding regions, called three-base periodicity, has been considered a property of these regions. Digital signal processing techniques provide a robust basis for identifying regions with three-base periodicity. In this work, we present a bioinformatics pipeline, the basic concepts of genomic identification, and the digital signal processing methods used to identify protein coding regions. We introduce a new method for identifying these regions, based on a newly proposed transform called the Modified Morlet Transform. We present several experimental results obtained from synthetic and real DNA sequences. The main contributions are the development of a bioinformatics pipeline for genome projects and the creation of a method for identifying protein coding regions where the three-base periodicity is latent. The method shows superior performance and important advantages in comparison with the traditional method based on the short-time Fourier transform.
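As an illustration of the three-base periodicity mentioned in this abstract, here is a minimal sketch of the traditional windowed-Fourier idea that the thesis compares against. It is not the Modified Morlet Transform, whose definition is not given here. The function names, the window length of 30 and the step of 3 are assumptions chosen for the example.

```python
import cmath


def period3_power(seq):
    """Spectral power at frequency 1/3, summed over the four base indicator sequences."""
    total = 0.0
    for base in "ACGT":
        # DFT coefficient at frequency 1/3 of the 0/1 indicator sequence of this base.
        coeff = sum(cmath.exp(-2j * cmath.pi * t / 3)
                    for t, b in enumerate(seq) if b == base)
        total += abs(coeff) ** 2
    return total


def sliding_period3(seq, window=30, step=3):
    """Return (start, power) per window; window and step are illustrative defaults."""
    return [(start, period3_power(seq[start:start + window]))
            for start in range(0, len(seq) - window + 1, step)]
```

Windows with a strong period-3 signal score high, while a window with no three-base pattern scores close to zero. Any threshold for calling a region coding would have to be tuned on real data.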
24

Mittal, Vinay K. "Detection and characterization of gene-fusions in breast and ovarian cancer using high-throughput sequencing." Diss., Georgia Institute of Technology, 2014. http://hdl.handle.net/1853/54014.

Abstract:
Gene-fusions are a prevalent class of genetic variants that are often employed as cancer biomarkers and therapeutic targets. In recent years, high-throughput sequencing of the cellular genome and transcriptome has emerged as a promising approach for the investigation of gene-fusions at the DNA and RNA level. However, the large volumes of sequencing data and the complexity of gene-fusion structures present unique computational challenges. This dissertation describes research that first addresses the bioinformatics challenges associated with the analysis of massive volumes of sequencing data by developing a bioinformatics pipeline and more applied integrated computational workflows. Application of high-throughput sequencing and the proposed bioinformatics approaches to the breast and ovarian cancer study reveals unexpectedly complex structures of gene-fusions and their functional significance in the onset and progression of cancer. Integrative analysis of gene-fusions at the DNA and RNA level shows the key importance of the regulation of gene-fusions at the transcription level in cancer.
25

Segundo, Edgar José Garcia Neto. "Hardware paralelo reconfigurável para identificação de alinhamentos de sequências de DNA." Universidade do Estado do Rio de Janeiro, 2012. http://www.bdtd.uerj.br/tde_busca/arquivo.php?codArquivo=7434.

Abstract:
Amostras de DNA são encontradas em fragmentos, obtidos em vestígios de uma cena de crime, ou coletados de amostras de cabelo ou sangue, para testes genéticos ou de paternidade. Para identificar se esse fragmento pertence ou não a uma sequência de DNA, é necessário compará-los com uma sequência determinada, que pode estar armazenada em um banco de dados para, por exemplo, apontar um suspeito. Para tal, é preciso uma ferramenta eficiente para realizar o alinhamento da sequência de DNA encontrada com a armazenada no banco de dados. O alinhamento de sequências de DNA, em inglês DNA matching, é o campo da bioinformática que tenta entender a relação entre as sequências genéticas e suas relações funcionais e parentais. Essa tarefa é frequentemente realizada através de softwares que varrem clusters de base de dados, demandando alto poder computacional, o que encarece o custo de um projeto de alinhamento de sequências de DNA. Esta dissertação apresenta uma arquitetura de hardware paralela, para o algoritmo BLAST, que permite o alinhamento de um par de sequências de DNA. O algoritmo BLAST é um método heurístico e atualmente é o mais rápido. A estratégia do BLAST é dividir as sequências originais em subsequências menores de tamanho w. Após realizar as comparações nessas pequenas subsequências, as etapas do BLAST analisam apenas as subsequências que forem idênticas. Com isso, o algoritmo diminui o número de testes e combinações necessárias para realizar o alinhamento. Para cada sequência idêntica há três etapas, a serem realizadas pelo algoritmo: semeadura, extensão e avaliação. A solução proposta se inspira nas características do algoritmo para implementar um hardware totalmente paralelo e com pipeline entre as etapas básicas do BLAST. A arquitetura de hardware proposta foi implementada em FPGA e os resultados obtidos mostram a comparação entre área ocupada, número de ciclos e máxima frequência de operação permitida, em função dos parâmetros de alinhamento. 
O resultado é uma arquitetura de hardware em lógica reconfigurável, escalável, eficiente e de baixo custo, capaz de alinhar pares de sequências utilizando o algoritmo BLAST.<br>DNA samples are found in fragments, obtained from traces at a crime scene or collected from hair or blood samples, for genetic or paternity tests. To identify whether such a fragment belongs to a given DNA sequence, it is necessary to compare it with a known sequence, which usually comes from a database, for instance to point to a suspect. To this end, we need an efficient tool to align the DNA sequence found with the ones stored in the database. The alignment of DNA sequences is a field of bioinformatics that helps to understand the relationship between genetic sequences and their functional and parental relationships. This task is often performed by software that scans clusters of databases, which requires high computing effort and thus increases the cost of DNA sequence alignment projects. This work presents a parallel hardware architecture for the BLAST algorithm that performs pairwise DNA alignment. This is the original version of the BLAST algorithm, from which several other versions were derived. The BLAST algorithm is a heuristic method and is currently the fastest algorithm for sequence alignment. The strategy of BLAST is to divide the sequences into smaller subsequences of size w. After comparing these subsequences, the remaining steps analyze only the subsequences that are identical, thus reducing the number of tests and combinations needed to perform the alignment. For each identical subsequence found, the algorithm follows three steps: seeding, extension and evaluation. The proposed hardware architecture builds on the characteristics of the algorithm to implement fully parallel hardware in which the basic steps of BLAST are pipelined. 
The proposed architecture was implemented in FPGA and the results show a comparison between the area occupied, number of cycles and maximum frequency of operation permitted, as a function of alignment parameters. The result is a hardware architecture in reconfigurable logic, scalable, efficient and with low cost, capable of aligning the pairs of sequences using BLAST algorithm.
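The seeding, extension and evaluation steps described in this abstract can be sketched in software as follows. This is a simplified illustration, not the FPGA architecture from the thesis: the function names are hypothetical, and the extension only follows exact matches without gaps, whereas real BLAST scores the extension and stops at a drop-off threshold.

```python
from collections import defaultdict


def find_seeds(query, subject, w):
    """Seeding: return (query_pos, subject_pos) pairs where words of length w match."""
    index = defaultdict(list)
    for j in range(len(subject) - w + 1):
        index[subject[j:j + w]].append(j)
    return [(i, j)
            for i in range(len(query) - w + 1)
            for j in index.get(query[i:i + w], [])]


def extend_seed(query, subject, i, j, w):
    """Extension: grow a seed left and right while the characters keep matching."""
    q_start, s_start = i, j
    while q_start > 0 and s_start > 0 and query[q_start - 1] == subject[s_start - 1]:
        q_start -= 1
        s_start -= 1
    q_end, s_end = i + w, j + w
    while q_end < len(query) and s_end < len(subject) and query[q_end] == subject[s_end]:
        q_end += 1
        s_end += 1
    return q_start, s_start, q_end - q_start


def best_hit(query, subject, w):
    """Evaluation: keep the longest extended hit, or None when there is no seed."""
    hits = [extend_seed(query, subject, i, j, w) for i, j in find_seeds(query, subject, w)]
    return max(hits, key=lambda hit: hit[2]) if hits else None
```

Because only positions that share an identical word of length w are extended, most of the search space is never examined. This is the saving the abstract attributes to the BLAST strategy.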
26

Sakaram, Suraj. "Delineating ΔNp63α's function in epithelial cells". Wright State University / OhioLINK, 2016. http://rave.ohiolink.edu/etdc/view?acc_num=wright1484411625682248.

27

Ferro, Milene. "Desenvolvimento e validação de protocolos para a anotação automática de sequências ORESTES de Eimeria spp. de galinha doméstica." Universidade de São Paulo, 2008. http://www.teses.usp.br/teses/disponiveis/42/42135/tde-31032009-114017/.

Abstract:
A coccidiose aviária é uma doença entérica causada por protozoários parasitas do gênero Eimeria. Visando uma maior compreensão dos mecanismos moleculares envolvidos na regulação do ciclo de vida dos parasitas, foram geradas 15.000 seqüências expressas (ORESTES) para cada uma das três espécies mais importantes: E. tenella, E. maxima e E. acervulina. O presente trabalho consistiu no desenvolvimento de componentes de anotação automática de seqüências para o sistema EGene, plataforma previamente desenvolvida pelo nosso grupo (Durham et al. Bioinformatics 21: 2812-2813, 2005) para a construção de processamentos encadeados (pipelines). Estes componentes foram utilizados para a construção de pipelines de anotação automática de seqüências-consenso obtidas a partir da montagem dos ORESTES de Eimeria spp. A anotação consistiu na identificação dos genes e atribuição da função dos respectivos produtos protéicos, baseando-se em um conjunto de evidências. As seqüências também foram classificadas e quantificadas utilizando-se um vocabulário controlado de termos de ontologia gênica (GO).<br>Avian coccidiosis is an enteric disease caused by protozoan parasites of the genus Eimeria. Aiming at obtaining a better understanding of the molecular mechanisms that regulate the life cycle of the parasites, our group generated 15,000 expressed sequences (ORESTES) for each one of the three most important species: E. tenella, E. maxima and E. acervulina. In the present work, we report the development of a set of components for the automated sequence annotation through EGene, a platform for pipeline construction previously described by our group (Durham et al. Bioinformatics 21: 2812-2813, 2005). These components were used to construct pipelines for the automated annotation of assembled sequences of ORESTES of Eimeria spp. The annotation process consisted in the identification of genes and the corresponding protein function based on a set of evidences. 
The sequences were also mapped and quantified using a controlled vocabulary of gene ontology (GO) terms.
28

Highnam, Gareth Wei An. "Optimizing analysis pipelines for improved variant discovery." Diss., Virginia Tech, 2014. http://hdl.handle.net/10919/47451.

29

Wu, Mei. "Detection of aberrant events in RNA for clinical diagnostics." Thesis, Uppsala universitet, Institutionen för biologisk grundutbildning, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-448361.

Abstract:
Rare diseases are estimated to affect 3.75% of the global population, which roughly translates to 300 million affected individuals. A large proportion of patients still do not have a diagnosis, and current approaches such as chromosomal microarray (CMA), whole exome sequencing (WES), and whole genome sequencing (WGS), which target DNA and the exome, aim to resolve that very first step. RNA-seq serves as a powerful approach complementing the aforementioned methods, which have reached a plateau in diagnostic yield. RNA-seq can facilitate the finding of aberrant events that appear during transcription, e.g. splicing, changes in gene expression and monoallelic expression. In this study, we aimed to establish RNA-seq analysis pipelines and evaluate whether RNA-seq could be utilized to enhance diagnostic yield. A total of 47 clinical samples were analysed along with the publicly controlled GEUVADIS dataset to evaluate the potential of RNA-seq in a clinical setting. The pilot pipeline, an RNA-seq analysis wrapper around the Detection of RNA Outlier Pipeline (DROP), detected a highly ranked splicing variant in a positive control sample that was hard to identify in a WGS analysis. The other two positive control samples, with aberrant expression, were also detected by the pipeline. Additionally, the pipeline gave a manageable list of candidate genes per affected sample in the population, along with corroborating graphs that can support decision-making for clinicians. The results of this pipeline proved successful for integrating RNA-seq and, therefore, we anticipate an increase in diagnosis.
30

Dall'Olio, Daniele. "Implementazione, creazione e ottimizzazione di una pipeline per l'analisi biofisica su cluster a basso consumo energetico." Bachelor's thesis, Alma Mater Studiorum - Università di Bologna, 2017. http://amslaurea.unibo.it/14073/.

Abstract:
This thesis studies the computational efficiency of low-power compute nodes for biophysical analysis, compared with traditional nodes. The work is part of a project to assess the feasibility of using low-power machines for high-performance computing. The aim of the research is to show that low-power clusters can provide computing power comparable to that of traditional ones. The system on which the thesis focuses is one of the most recent methods in research on the genetic mutations that cause various types of tumours: the GATK-LODn system. In the course of the thesis, one component of this method was reimplemented as a pipeline in Snakemake, which allowed more accurate management of the operations involved in order to optimize overall execution. The thesis examines this bioinformatics algorithm to assess whether the capabilities of low-power nodes can really be compared with those of traditional ones, since it demands high computational performance, memory and storage capacity. The first chapter explains the elements of the project. It presents the GATK-LODn method, then describes the part of the method that was reimplemented with Snakemake and examines the capabilities of this tool in depth. Finally, it explains what "low-power node" means and describes the characteristics of the nodes used in the analyses. The second chapter explains how the program works, detailing the parameters used, and highlights the steps required to use the method correctly. It also describes the phases of the statistical study and the type of simulations carried out. Finally, the most relevant results for each rule of the pipeline are discussed in terms of execution time and memory usage.
31

Robitaille, Alexis. "Detection and identification of papillomavirus sequences in NGS data of human DNA samples : a bioinformatic approach." Thesis, Lyon, 2019. http://www.theses.fr/2019LYSE1358.

Abstract:
Les papillomavirus humains (HPV) constituent une famille de petits virus à double brin d’ADN qui ont un tropisme pour les cellules épithéliales de la peau et des muqueuses. Plus de 200 types d’HPV ont été découverts, et classifiés en plusieurs genres taxonomiques en fonction de la constitution de leur séquence ADN. De part le rôle de certains HPV dans les maladies affectant les humains, allant de l’apparition de verrues anogénitales bénignes jusqu’au développement d’un cancer, il est nécessaire de développer des méthodes de détection et de caractérisation de la population d’HPV dans un échantillon d’ADN. Elles sont nécessaires à la clarification du rôle de l’HPV dans les différentes étapes de la progression de la maladie. Cette détection d’HPV lors d’approches ciblées en laboratoire a principalement reposé sur des méthodes de PCR couplées avec du séquençage Sanger. Avec l’introduction des nouvelles technologies de séquençage haut débit (NGS), ces approches peuvent être revisitées afin d’intégrer la puissance de séquençage de ces technologies. Alors que des outils d’analyse in-silico ont été développés pour la recherche de virus, connus ou nouveaux, à partir de données de NGS, aucun outil approprié n’est disponible pour la classification et l’identification de nouvelles séquences virales à partir de données produites par des méthodes de séquençage d’amplicons. Dans cette thèse, la première partie présente cinq nouveaux génomes d’HPV isolés via l’utilisation d’amorces d’amplification dégénérées ciblant le gène L1 à partir d’échantillons de peau humaine. Puis, dans une seconde partie, nous présentons PVAmpliconFinder, un outil d’analyse de données conçu pour identifier et classifier rapidement des séquences connues et potentiellement nouvelles de la famille Papillomaviridae, à partir de données de NGS d’amplicons générées par PCR via l’utilisation d’oligonucleotides dégénérés ciblants les HPV. 
Enfin, les caractéristiques de PVAmpliconFinder sont présentées, ainsi que plusieurs applications sur des données biologiques obtenues lors du séquençage d’amplicons de spécimens humains. Ces applications ont permis la découverte de nouveaux types d’HPV.<br>Human Papillomaviruses (HPV) are a family of small double-stranded DNA viruses that have a tropism for the mucosal and cutaneous epithelia. More than 200 types of HPV have been discovered so far and are classified into several genera based on their DNA sequence. Due to the role of some HPV types in human disease, ranging from benign anogenital warts to cancer, methods to detect and characterize the HPV population in a DNA sample have been developed. These detection methods are needed to clarify the implications of HPV at the various stages of the disease. The detection of HPV with targeted wet-lab approaches has traditionally used PCR-based methods coupled with cloning and Sanger sequencing. With the introduction of next generation sequencing (NGS), these approaches can be improved by integrating the sequencing power of NGS. While computational tools have been developed for metagenomic approaches to search for known or novel viruses in NGS data, no appropriate bioinformatic tool has been available for the classification and identification of novel viral sequences from data produced by amplicon-based methods. In this thesis, we first describe five fully reconstructed novel HPV genomes detected from skin samples after amplification using degenerate L1 primers. Then, in the second part, we present PVAmpliconFinder, a data analysis workflow designed to rapidly identify and classify known and potentially new Papillomaviridae sequences from NGS amplicon sequencing with degenerate PV primers. This thesis describes the features of PVAmpliconFinder and presents several applications using biological data obtained from amplicon sequencing of human specimens, leading to the identification of new HPV types.
32

Rizzo, Stefano Giovanni. "Una base dati per il knowledge discovery in genetica medica." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2013. http://amslaurea.unibo.it/6207/.

Abstract:
Advances in sequencing technologies in recent years have made it possible to catalogue genetic variants in human samples, leading to new discoveries and insights in medical, pharmaceutical and evolutionary research and in population studies. The volume of sequence data produced is very large, and identifying variants requires several stages of processing of the genetic information, each of which generates further information. Alongside this immense accumulation of data, the scientific community has needed to organize the data in repositories, at first only to share research results and later to enable statistical studies directly on the genetic data. Large-scale studies involve data volumes on the order of petabytes, whose maintenance continues to challenge infrastructure. Given the variety and quantity of data produced, databases play a primary role in this challenge. Data models and data organization in this field can make a difference not only for scalability but also, and above all, for readiness for data mining. Storing these data in files with quasi-standard formats, the size of those files, and the computational requirements involved make it difficult to write efficient analysis software and discourage large-scale studies and studies on heterogeneous data. Before designing the database, we therefore studied the evolution over the last twenty years of the quasi-standard formats for biological flat files, which contain heterogeneous metadata and the actual nucleotide sequences, with records that lack structural relationships. This evolution recently culminated in the use of the XML standard, but delimited flat files remain the formats best supported by tools and online platforms.
This was followed by an analysis of the internal data organization of public biological databases. These databases contain genes, genetic variants, protein structures, phenotype ontologies, disease-gene relationships and drug-gene relationships. The public databases studied include OMIM, Entrez, KEGG, UniProt and GO. The main objective in studying and modeling the genetic database was to structure the data so as to integrate the heterogeneous data produced and make data mining computationally feasible. The choice of Hadoop/MapReduce technology proves particularly effective here, owing to its guaranteed scalability and its efficiency in the most complex and parallel statistical analyses, such as those involving multi-locus allelic variants.
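As a rough illustration of why a map/reduce formulation suits multi-locus variant statistics, the sketch below counts how often each combination of genotypes occurs across samples. It is a single-process Python analogy, not Hadoop code and not the thesis's implementation; the record layout, the locus names (`rs1`, `rs2`) and the function names are all hypothetical.

```python
from collections import Counter
from functools import reduce

# Hypothetical input: one record per sample, mapping locus name -> genotype.
samples = [
    {"rs1": "A/A", "rs2": "C/T"},
    {"rs1": "A/G", "rs2": "C/T"},
    {"rs1": "A/A", "rs2": "C/T"},
]


def map_sample(sample, loci):
    """Map step: emit a partial count keyed by the sample's multi-locus genotype."""
    key = tuple(sample[locus] for locus in loci)
    return Counter({key: 1})


def reduce_counts(left, right):
    """Reduce step: merge two partial counts (Counter addition is associative)."""
    return left + right


def count_multilocus(records, loci):
    """Count each multi-locus genotype combination across all records."""
    partials = (map_sample(record, loci) for record in records)
    return reduce(reduce_counts, partials, Counter())


counts = count_multilocus(samples, ["rs1", "rs2"])
```

Because the reduce step is associative, partial counts could be merged in any order, which is the property that lets a framework such as MapReduce distribute the work across nodes.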
33

Bogaerts, Márquez María 1991. "Identification of environmental variables in Drosophila melanogaster natural populations." Doctoral thesis, TDX (Tesis Doctorals en Xarxa), 2022. http://hdl.handle.net/10803/673159.

Abstract:
Understanding how species adapt to the environment is still an open question in Evolutionary Biology. While the focus has been on the genetic basis, the analysis of the environmental factors that drive these adaptive processes lags behind. Our main goal is to identify the main environmental variables that contribute to adaptation. We used natural D. melanogaster populations from Europe and North America, and analyzed both SNPs and transposable elements (TEs). To accurately detect and estimate TE population frequencies, we updated the T-lex algorithm and released a new version: T-lex3.
We performed a Genome-Environment Association (GEA) analysis to associate TE and SNP allele frequencies with several environmental variables, and we identified temperature, rainfall and wind as the most relevant variables involved in environmental adaptation. In addition, we found 10 TEs associated with at least one environmental variable. Finally, we developed a bioinformatic pipeline that integrates more than 200 D. melanogaster genomes from around the world, which will facilitate environmental analysis in space and time.
34

Östlund, Emma. "BacIL - En Bioinformatisk Pipeline för Analys av Bakterieisolat." Thesis, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-388087.

Abstract:
Listeria monocytogenes and Campylobacter spp. are bacteria that can sometimes cause severe illness in humans. Both can be found as contaminants in food that has been produced, stored or prepared improperly, which is why it is important to ensure that food is handled correctly. The National Food Agency (Livsmedelsverket) is the Swedish authority responsible for food safety. One of its important tasks is to track and prevent food-related disease outbreaks in collaboration with other authorities. For this purpose, bacterial samples are regularly collected at border control, food production facilities and retail, as well as from suspected food items and drinking water during outbreaks, and epidemiological analyses are employed to determine the type of bacteria present and whether they can be linked to a common source. One part of these epidemiological analyses involves bioinformatic analysis of the bacterial DNA. This includes determination of sequence type and serotype, as well as calculation of similarities between samples. Such analyses require data processing in several different steps, which are usually performed by a bioinformatician using different computer programs. Currently the National Food Agency outsources most of these analyses to other authorities and companies, and the purpose of this project was to develop a pipeline that would allow these analyses to be performed in-house. The result was a pipeline named BacIL (Bacterial Identification and Linkage), which automatically performs sequence typing, serotyping and SNP analysis of Listeria monocytogenes, as well as sequence typing and SNP analysis of Campylobacter jejuni, C. coli and C. lari. The results of the SNP analysis are used to create clusters that can be used to identify related samples. The pipeline decreases the number of programs that have to be started manually from more than ten to two.
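The abstract does not say how BacIL turns SNP results into clusters. One common approach, shown here purely as an assumption, is single-linkage clustering: two isolates join the same cluster when their pairwise SNP distance is at or below a threshold. The profiles, sample names and threshold below are made up for illustration.

```python
from itertools import combinations


def snp_distance(a, b):
    """Count positions at which two equal-length SNP profiles differ."""
    if len(a) != len(b):
        raise ValueError("profiles must have equal length")
    return sum(1 for x, y in zip(a, b) if x != y)


def cluster_by_snp_distance(profiles, max_distance):
    """Single-linkage clustering of samples via union-find.

    profiles: dict mapping sample name -> string of SNP alleles.
    Returns a sorted list of clusters, each a sorted list of sample names.
    """
    parent = {name: name for name in profiles}

    def find(name):
        # Follow parent links to the cluster root, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(profiles, 2):
        if snp_distance(profiles[a], profiles[b]) <= max_distance:
            parent[find(a)] = find(b)

    clusters = {}
    for name in profiles:
        clusters.setdefault(find(name), set()).add(name)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical isolates: s1/s2 differ at one site, as do s3/s4.
profiles = {"s1": "ACGTAC", "s2": "ACGTAT", "s3": "TTGTCC", "s4": "TTGTCA"}
clusters = cluster_by_snp_distance(profiles, max_distance=1)
```

Single linkage can chain distant isolates together through intermediates, so the threshold would need to be chosen per organism and validated against known outbreaks.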
35

Waury, Katharina. "A pipeline for the identification and examination of proteins implicated in frontotemporal dementia." Thesis, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-416827.

Abstract:
Frontotemporal dementia is a neurodegenerative disorder with high heterogeneity at the genetic, pathological and clinical levels. The familial form of the disease is mainly caused by pathogenic variants of three genes: C9orf72, MAPT and GRN. As there is no clear correlation between the mutation and the clinical phenotype, symptom severity or age of onset, the demand for predictive biomarkers is high. While there is no fluid biomarker for frontotemporal dementia in use yet, there is strong hope that changes in protein concentrations in the blood or cerebrospinal fluid can aid prognosis many years before symptoms develop. Increasing amounts of data are becoming available from long-term studies of families affected by familial frontotemporal dementia, but their analysis is time-consuming and labor-intensive. In this project, a pipeline was built for the automated analysis of proteomics data. Specifically, it aims to identify proteins useful for differentiating between two groups by using random forest, a supervised machine learning method. The pipeline's results for a data set containing blood plasma protein concentrations of healthy controls and participants affected by frontotemporal dementia were promising, and the general applicability of the pipeline was demonstrated with an independent breast cancer proteomics data set.
36

"A Robust scRNA-seq Data Analysis Pipeline for Measuring Gene Expression Noise." Master's thesis, 2017. http://hdl.handle.net/2286/R.I.44010.

Abstract:
The past decade has seen a drastic increase in collaboration between Computer Science (CS) and Molecular Biology (MB). Current foci in CS such as deep learning require very large amounts of data, and MB research can often be rapidly advanced by analysis and models from CS. Areas where CS could aid MB include the analysis of sequences to find binding sites and the prediction of protein folding patterns. Stem-like cells can be maintained and replicated over long periods and can differentiate into various tissue types. These behaviors are made possible by controlling the expression of specific genes. These genes then cascade into a network effect by either promoting or repressing downstream gene expression. The expression level of all gene transcripts within a single cell can be analyzed using single cell RNA sequencing (scRNA-seq). A significant portion of the noise in scRNA-seq data results from extrinsic factors and can only be removed by a customized scRNA-seq analysis pipeline. scRNA-seq experiments utilize next-gen sequencing to measure genome-scale gene expression levels with single-cell resolution. Almost every step during analysis and quantification requires the use of an often empirically determined threshold, which makes quantification of noise less accurate. In addition, each research group often develops its own data analysis pipeline, making it impossible to compare data from different groups. To remedy this problem, a streamlined and standardized scRNA-seq data analysis and normalization protocol was designed and developed. After analyzing multiple experiments, we identified the possible pipeline stages and the tools needed. Our pipeline is capable of handling data with adapters and barcodes, which was not the case with pipelines from some experiments. Our pipeline can be used to analyze single-experiment scRNA-seq data and also to compare scRNA-seq data across experiments.
Various processes such as data gathering, file conversion and data merging were automated in the pipeline. The main focus was to standardize and normalize single-cell RNA-seq data to minimize technical noise introduced by disparate platforms.
Dissertation/Thesis: Masters Thesis, Bioengineering, 2017
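The abstract does not name the normalization the pipeline applies. As a generic illustration of why normalization makes cells from different platforms comparable, the sketch below applies log-transformed counts-per-million scaling, a widely used baseline that removes differences in sequencing depth. The function name, cell identifiers and counts are hypothetical.

```python
import math


def log_cpm(counts_by_cell, scale=1_000_000):
    """Scale each cell's counts to a common library size, then apply log(1 + x).

    counts_by_cell: dict mapping cell id -> dict of gene -> raw read count.
    """
    normalized = {}
    for cell, gene_counts in counts_by_cell.items():
        total = sum(gene_counts.values())
        if total == 0:
            raise ValueError(f"cell {cell!r} has no counts")
        normalized[cell] = {
            gene: math.log1p(count * scale / total)
            for gene, count in gene_counts.items()
        }
    return normalized


# Two hypothetical cells with identical expression proportions but a
# tenfold difference in sequencing depth.
cells = {
    "c1": {"g1": 10, "g2": 30},
    "c2": {"g1": 100, "g2": 300},
}
norm = log_cpm(cells)
```

After scaling, the two cells have the same normalized values, so the remaining variation between cells is no longer driven by depth alone.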
37

Morris, Joseph P. "An analysis pipeline for the processing, annotation, and dissemination of expressed sequence tags." 2009. http://etd.louisville.edu/data/UofL0482t2009.pdf.

Abstract:
Thesis (M.Eng.)--University of Louisville, 2009. Title and description from thesis home page (viewed May 22, 2009). Department of Computer Engineering and Computer Science. Vita. "May 2009." Includes bibliographical references (p. 39-41).
38

Tan, Yuxiang. "Computational approaches for whole-transcriptome cancer analysis based on RNA sequencing data." Thesis, 2016. https://hdl.handle.net/2144/14502.

Abstract:
RNA-Seq (Whole Transcriptome Shotgun Sequencing) provides an ideal platform to study the complete set of transcripts for a specific developmental stage or physiological condition. It reveals not only expression-level changes, but also structural changes in the coding sequences, including gene rearrangements. In this dissertation, I present my contributions to the development of computational tools for the robust and efficient analysis of RNA-seq data to support cancer research. To automate the laborious and computationally intensive procedure of RNA-seq data management, I worked on the development of Hydra, an RNA-seq pipeline for the parallel processing and quality control of large numbers of samples. With user-friendly reports on quality control and running checkpoints, Hydra makes the data processing procedure fast, efficient and reliable. Here, I report my application of the pipeline to the analysis of patient-derived lymphoma xenograft samples, to show Hydra’s ability to detect abnormalities (e.g., mouse tissue contamination) in the sequencing data. Because fusions play an important role in carcinogenesis, fusion detection has become an important area of methodological research. Several computational methods have been developed to identify fusion transcripts from RNA-seq data. However, all these methods require realignment to the transcriptome, a computationally expensive task, unnecessary in many cases. Here, I present QueryFuse, a novel gene-specific fusion-detection algorithm for aligned RNA-seq data. It is designed to help biologists find and/or computationally validate fusions of interest quickly, and to annotate the detected events with visualization and detailed properties of the supporting reads. By focusing the fusion detection on read pairs aligned to query genes, we can not only reduce realignment time, but also afford to use a more accurate but computationally expensive local aligner. 
In the extensive evaluation I performed, I obtained comparable or better results compared with two widely adopted tools (deFuse and TophatFusion) on two simulated datasets, as well as on cell line datasets with known fusions. Finally, I contributed to the identification of a novel fusion event in lymphoma, with potential therapeutic implications in clinical samples. I validated this fusion in silico by my putative reference method before experimental validation.
39

Venkatraman, Anand. "Validation of a novel expressed sequence tag (EST) clustering method and development of a phylogenetic annotation pipeline for livestock gene families." Thesis, 2008. http://hdl.handle.net/1969.1/ETD-TAMU-3112.

Abstract:
Prediction of functions of genes in a genome is a key step in all genome sequencing projects. Sequences that carry out important functions are likely to be conserved between evolutionarily distant species and can be identified using cross-species comparisons. In the absence of completed genomes and the accompanying high-quality annotations, expressed sequence tags (ESTs) from random cDNA clones are the primary tools for functional genomics. EST datasets are fragmented and redundant, necessitating clustering of ESTs into groups that are likely to have been derived from the same genes. EST clustering helps reduce the search space for sequence homology searching and improves the accuracy of function predictions using EST datasets. This dissertation is a case study that describes clustering of Bos taurus and Sus scrofa EST datasets, and utilizes the EST clusters to make computational function predictions using a comparative genomics approach. We used a novel EST clustering method, TAMUClust, to cluster bovine ESTs and compare its performance to the bovine EST clusters from TIGR Gene Indices (TGI) by using bovine ESTs aligned to the bovine genome assembly as a gold standard. This comparison study reveals that TAMUClust and TGI are similar in performance. Comparisons of TAMUClust and TGI with predicted bovine gene models reveal that both datasets are similar in transcript coverage. We describe here the design and implementation of an annotation pipeline for predicting functions of the Bos taurus (cattle) and Sus scrofa (pig) transcriptomes. EST datasets were clustered into gene families using Ensembl protein family clusters as a framework. Following clustering, the EST consensus sequences were assigned predicted function by transferring annotations of the Ensembl vertebrate protein(s) they are grouped to after sequence homology searches and phylogenetic analysis. 
The annotations benefit the livestock community by helping narrow down the gamut of direct experiments needed to verify function.
40

Robert, Bonnie-Jean. "A pipeline for differential expression analysis of RNA-seq data and the effect of filter cutoff on performance." Thesis, 2017. https://dspace.library.uvic.ca//handle/1828/8538.

Abstract:
RNA sequencing is a powerful new approach to analyzing differential expression of transcripts between treatments. Many statistical methods are now available to test for differential expression, and each one reports results differently. This thesis presents a workflow of five popular methods and discusses the results. A pipeline was built in the R language to analyze four of these packages using a real RNA-seq dataset. At present, researchers must prepare RNA-seq data prior to analysis to achieve reliable results. Filtering is a necessary preparatory step in which transcripts exhibiting low levels of gene expression are removed from further analysis. Yet little research is available to guide researchers on how best to choose this threshold. This thesis introduces a study designed to determine whether the choice of filter threshold has a significant effect on individual package performance. Increasing the filtering threshold was shown to decrease the sensitivity and increase the specificity of the four statistical methods studied.
Graduate
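The filtering step and the two performance measures can be sketched as follows. The thesis built its pipeline in R; this Python sketch is only an illustration, and the filtering rule (mean count across samples), the transcript names and the counts are assumptions rather than the thesis's own choices.

```python
def filter_low_expression(counts, threshold):
    """Keep transcripts whose mean count across samples meets the threshold."""
    return {
        transcript: values
        for transcript, values in counts.items()
        if sum(values) / len(values) >= threshold
    }


def sensitivity_specificity(called, truth, universe):
    """Compare called transcripts with the truly differential ones.

    Transcripts removed by filtering are simply absent from `called`, so a
    filtered true positive counts as a false negative.
    """
    tp = len(called & truth)
    fn = len(truth - called)
    fp = len(called - truth)
    tn = len(universe - called - truth)
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity


# Hypothetical counts for four transcripts across two samples.
counts = {"t1": [50, 60], "t2": [1, 0], "t3": [5, 7], "t4": [200, 180]}
kept = filter_low_expression(counts, threshold=5)

# Hypothetical calls from one method versus a known truth set.
sens, spec = sensitivity_specificity(
    called={"t1", "t3"}, truth={"t1", "t2"}, universe=set(counts)
)
```

Here `t2` is truly differential but falls below the threshold, which illustrates how a stricter filter can lower sensitivity even when the statistical method itself is unchanged.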
41

Filomena, João Pedro Fernandes Lourenço da. "Adaptation of genoqual pipeline to new upstream applications and to run independently from galaxy portal." Master's thesis, 2021. http://hdl.handle.net/10400.26/36701.

Abstract:
This internship was carried out at the Instituto Gulbenkian de Ciência (IGC) as part of a master's degree in biological and chemical engineering. The intern worked with terms and tools related to bioinformatics, metagenomics and NGS, and focused on updating “GenoQual”, a genomic analysis pipeline developed by IGC a few years earlier. Since GenoQual was last updated, the tools used in bioinformatics have evolved naturally, and newer alternatives and updates to the tools used by GenoQual have appeared. One of the most important updates was the release of QIIME 2, which brought improvements and new features over QIIME 1, which GenoQual was still using at the time. The main objective of the internship was to update the pipeline's Python code so that it would be compatible with a more recent Python version, and to add new functionality to the GenoQual pipeline, namely compatibility with QIIME 2 and Kraken 2. The project was organized into two distinct stages. The first was updating GenoQual's code from Python 2.7 to Python 3.x.
The second stage was updating the packages used by the original version of GenoQual to ensure that the pipeline remained compatible with the newer versions of those packages and could make use of their improvements and new functionality. GenoQual's code was successively updated to be compatible with Python 3.8, and the QIIME 2 microbiome bioinformatics platform and the Kraken 2 taxonomic classifier were proposed as additions to the GenoQual pipeline so that it could perform both 16S and WGS analyses.
42

(8815928), Samantha Jurecki. "APPLICATION AND VALIDATION OF THE EDNA-METABARCODED MIFISH/MITOFISH PIPELINE FOR ASSESSMENT OF NATIVE AND NON-NATIVE FISH COMMUNITIES OF LAKE MICHIGAN." Thesis, 2020.

Abstract:
Environmental DNA (eDNA) is being used increasingly for biomonitoring of communities (e.g., microbes, macroinvertebrates, fish species) across terrestrial and aquatic ecosystems. Developing methods that combine eDNA approaches with metagenomic barcoded amplicon sequencing (eDNA-metabarcoding) are now providing a powerful noninvasive and cost-effective means for comprehensively surveying biodiversity in a wide range of habitats. Invasive species have a substantial impact on the ecology and economics of the Great Lakes region, and eDNA-metabarcoding methods have recently been applied in monitoring non-native, as well as native, fish populations in the freshwater systems there. In this research, we validated an eDNA-metabarcoding approach that uses established platforms, the MiFish/MitoFish pipeline, for fish community monitoring on Lake Michigan. For validation, we compared survey results from our eDNA-metabarcoding approach to those obtained using traditional surveys (e.g., electrofishing and seining). We also sampled a closed 180,000-gallon freshwater fish tank system to see how well our methods characterized a known native fish population that resided in the tank. Finally, we applied the approach to monitoring invasive and native fish populations in southern Lake Michigan at a site that is currently undergoing restoration to improve the aquatic habitats. We were able to reliably capture the fish community structure of the native fish tank as well as those of open waters on the lake using our methods. Diversity patterns detected at the restoration site using our eDNA-metabarcoding approach accurately reflected those of the historical record, which have taken many years to establish by conventional means. Overall, this study suggests eDNA-metabarcoding is an efficient, credible, and powerful approach to biomonitoring.
APA, Harvard, Vancouver, ISO, and other styles
43

Wills, Bailey. "Optimization of Marker Sets and Tools for Phenotype, Ancestry, and Identity using Genetics and Proteomics." Thesis, 2019. http://hdl.handle.net/1805/19916.

Full text
Abstract:
Indiana University-Purdue University Indianapolis (IUPUI)
In the forensic science community, there is a pressing need for tools to assist investigations when standard DNA profiling methods are uninformative. Methods such as Forensic DNA Phenotyping (FDP) and proteomics aim to address this problem and aid investigations when other methods have been exhausted. FDP is useful because it provides physical appearance information, while proteomics allows for the examination of difficult samples, such as hair, to infer human identity and ancestry. To create a “biological eye witness” or develop informative probability of identity match statistics through proteomically inferred genetic profiles, it is necessary to keep improving these methods. Currently, two developmentally validated FDP prediction assays, HIrisPlex and HIrisPlex-S, are run on capillary electrophoresis (CE) to develop a phenotypic prediction for eye, hair, and skin color based on 41 variants. Although highly useful, these assays are limited on CE by a cap of 25 variants per assay. To overcome these limitations and expand the capacities of FDP, we successfully designed and validated a massively parallel sequencing (MPS) assay for use on both the ThermoFisher Scientific Ion Torrent and Illumina MiSeq systems that incorporates all HIrisPlex-S variants into one sensitive assay. With the migration of this assay to an MPS platform, we were able to create a semi-automated pipeline to extract SNP-specific sequencing data that can then be easily uploaded to the freely accessible online phenotypic prediction tool (found at https://hirisplex.erasmusmc.nl), and a mixture deconvolution tool with built-in read count thresholds. Based on sequencing read counts, this tool can be used to assist in the separation of difficult two-person mixture samples and outline the confidence in each genotype call.
In addition to FDP, proteomic methods, specifically hair protein analysis, open up possibilities for forensic investigations when standard DNA profiling methods come up short. Here, we analyzed 233 genetically variant peptides (GVPs) within hair-associated proteins and genes for 66 individuals. We assessed the proteomic method's ability to accurately infer and detect genotypes at each of the 233 SNPs and generated statistics for the probability of identity (PID). Of these markers, 32 passed all quality control and population genetics criteria and displayed an average PID of 3.58 × 10⁻⁴. A population genetics assessment was also conducted to identify any SNP that could be used to infer ancestry and/or identity. This information is valuable for the future use of this set of markers for human identification in forensic science settings.
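For readers unfamiliar with the probability of identity, the standard textbook calculation for a biallelic SNP under Hardy-Weinberg equilibrium is the sum of the squared genotype frequencies, and independent loci are combined by multiplication. The sketch below shows that calculation; how the thesis derived its reported PID from the GVP data is not stated in the abstract, and the allele frequencies used here are hypothetical.

```python
from math import prod


def locus_pid(p):
    """PID at one biallelic SNP with allele frequencies p and q = 1 - p.

    Under Hardy-Weinberg equilibrium the genotype frequencies are p^2, 2pq
    and q^2; two unrelated individuals match with the sum of their squares.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    q = 1.0 - p
    return p**4 + (2 * p * q) ** 2 + q**4


def combined_pid(allele_freqs):
    """Multiply per-locus PIDs, assuming the loci are independent (unlinked)."""
    return prod(locus_pid(p) for p in allele_freqs)


# Hypothetical panel of two SNPs, each with allele frequency 0.5.
single = locus_pid(0.5)
panel = combined_pid([0.5, 0.5])
```

A single SNP is weakly discriminating (its PID cannot fall below 0.375), which is why panels of many independent markers are needed to reach small combined values.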
APA, Harvard, Vancouver, ISO, and other styles
44

Jean-Louis, Martineau. "Séquençage d’exomes d’une cohorte de familles caucasiennes simplex dont les patients sont atteints du syndrome d’interruption de la tige hypophysaire." Thèse, 2017. http://hdl.handle.net/1866/21577.

45

"Biomedical Information Extraction Pipelines for Public Health in the Age of Deep Learning." Doctoral diss., 2019. http://hdl.handle.net/2286/R.I.55694.

Full text
Abstract:
Unstructured texts containing biomedical information from sources such as electronic health records, scientific literature, discussion forums, and social media offer an opportunity to extract information for a wide range of applications in biomedical informatics. Building scalable and efficient pipelines for natural language processing and extraction of biomedical information plays an important role in the implementation and adoption of applications in areas such as public health. Advancements in machine learning and deep learning techniques have enabled rapid development of such pipelines. This dissertation presents entity extraction pipelines for two public health applications: virus phylogeography and pharmacovigilance. For virus phylogeography, geographical locations are extracted from biomedical scientific texts for metadata enrichment in the GenBank database containing 2.9 million virus nucleotide sequences. For pharmacovigilance, tools are developed to extract adverse drug reactions from social media posts to open avenues for post-market drug surveillance from non-traditional sources. Across these pipelines, high variance is observed in extraction performance among the entities of interest while using state-of-the-art neural network architectures. To explain the variation, linguistic measures are proposed to serve as indicators for entity extraction performance and to provide deeper insight into the domain complexity and the challenges associated with entity extraction. For both the phylogeography and pharmacovigilance pipelines presented in this work, the annotated datasets and applications are open source and freely available to the public to foster further research in public health.
Dissertation/Thesis: Doctoral Dissertation, Biomedical Informatics, 2019
46

AJMANI, NISHA. "Transcriptomic analysis of ovarian development in parasitic Ichthyomyzon castaneus (chestnut lamprey) and non-parasitic Ichthyomyzon fossor (northern brook lamprey)." 2017. http://hdl.handle.net/1993/32179.

Full text
Abstract:
Lampreys are primitive jawless fishes that diverged over 550 million years ago. As adults, they are either parasitic or non-parasitic. In non-parasitic species, sexual differentiation and oocyte development generally occur earlier than in parasitic species; fecundity is reduced and sexual maturation is accelerated following metamorphosis. The genes controlling ovarian differentiation and maturation in lampreys are poorly understood. This study used RNA-Seq data in the parasitic chestnut lamprey Ichthyomyzon castaneus and non-parasitic northern brook lamprey Ichthyomyzon fossor to identify suites of genes expressed during different stages of ovarian development that show different developmental trajectories with respect to ovarian differentiation and sexual maturation. For this, reference-guided and de novo assembly pipelines were designed for studying a non-model species. To test and explore the relative advantages of the pipelines, expression of insulin superfamily genes was used. This research helps to identify genes involved in lamprey ovarian development and provides insight into evolution of the insulin superfamily in vertebrates.
May 2017