To see the other types of publications on this topic, follow the link: DNA - Data processing.

Dissertations / Theses on the topic 'DNA - Data processing'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the top 34 dissertations / theses for your research on the topic 'DNA - Data processing.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse dissertations / theses in a wide variety of disciplines and organise your bibliography correctly.

1

高銘謙 and Ming-him Ko. "A multi-agent model for DNA analysis." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 1999. http://hub.hku.hk/bib/B31222778.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

Camerlengo, Terry Luke. "Techniques for Storing and Processing Next-Generation DNA Sequencing Data." The Ohio State University, 2014. http://rave.ohiolink.edu/etdc/view?acc_num=osu1388502159.

Full text
APA, Harvard, Vancouver, ISO, and other styles
3

Huang, Songbo, and 黄颂博. "Detection of splice junctions and gene fusions via short read alignment." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2011. http://hub.hku.hk/bib/B45862527.

Full text
APA, Harvard, Vancouver, ISO, and other styles
4

Leung, Chi-ming, and 梁志銘. "Motif discovery for DNA sequences." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2006. http://hub.hku.hk/bib/B3859755X.

Full text
APA, Harvard, Vancouver, ISO, and other styles
5

Oelofse, Andries Johannes. "Development of a MIAME-compliant microarray data management system for functional genomics data integration." Pretoria : [s.n.], 2006. http://upetd.up.ac.za/thesis/available/etd-08222007-135249.

Full text
APA, Harvard, Vancouver, ISO, and other styles
6

Cheng, Lok-lam, and 鄭樂霖. "Approximate string matching in DNA sequences." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2003. http://hub.hku.hk/bib/B29350591.

Full text
APA, Harvard, Vancouver, ISO, and other styles
7

Karanam, Suresh Kumar. "Automation of comparative genomic promoter analysis of DNA microarray datasets." Thesis, Available online, Georgia Institute of Technology, 2004:, 2003. http://etd.gatech.edu/theses/available/etd-04062004-164658/unrestricted/karanam%5Fsuresh%5Fk%5F200312%5Fms.pdf.

Full text
APA, Harvard, Vancouver, ISO, and other styles
8

Labuschagne, Jan Phillipus Lourens. "Development of a data processing toolkit for the analysis of next-generation sequencing data generated using the primer ID approach." University of the Western Cape, 2018. http://hdl.handle.net/11394/6736.

Full text
Abstract:
Philosophiae Doctor - PhD
Sequencing an HIV quasispecies with next generation sequencing technologies yields a dataset with significant amplification bias and errors resulting from both the PCR and sequencing steps. Both the amplification bias and the sequencing error can be reduced by labelling each cDNA (generated during the reverse transcription of the viral RNA to DNA prior to PCR) with a random sequence tag called a Primer ID (PID). Processing PID data requires additional computational steps, presenting a barrier to the uptake of this method. MotifBinner is an R package designed to handle PID data with a focus on resolving potential problems in the dataset. MotifBinner groups sequences into bins by their PID tags, identifies and removes false unique bins produced by sequencing errors in the PID tags, and removes outlier sequences from within a bin. MotifBinner produces a consensus sequence for each bin, as well as a detailed report for the dataset, covering the number of sequences per bin, the number of outlying sequences per bin, rates of chimerism, the number of degenerate letters in the final consensus sequences and the most divergent consensus sequences (potential contaminants). We characterized the ability of the PID approach to reduce the effect of sequencing error, to detect minority variants in viral quasispecies and to reduce the rates of PCR-induced recombination. We produced reference samples with known variants at known frequencies to study the effectiveness of increasing PCR elongation time, decreasing the number of PCR cycles, and sample partitioning, by means of dPCR (droplet PCR), on PCR-induced recombination. After sequencing these artificial samples with the PID approach, each consensus sequence was compared to the known variants. There are complex relationships between the sample preparation protocol and the characteristics of the resulting dataset. We produce a set of recommendations that can be used to inform the sample preparation that is most useful for a particular study. The AMP trial infuses HIV-negative patients with the VRC01 antibody and monitors for HIV infections. Accurately timing the infection event and reconstructing the founder viruses of these infections are critical for relating infection risk to antibody titer and homology between the founder virus and antibody binding sites. Dr. Paul Edlefsen at the Fred Hutch Cancer Research Institute developed a pipeline that performs infection timing and founder reconstruction. Here, we document a portion of the pipeline, produce detailed tests for that portion of the pipeline and investigate the robustness of some of the tools used in the pipeline to violations of their assumptions.
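The abstract describes Primer ID processing at a high level: reads sharing a PID tag are binned, suspiciously small bins are discarded, and each surviving bin is collapsed to a consensus. MotifBinner itself is an R package; the Python sketch below only illustrates that core binning-and-consensus idea. The function names, the read data, the 9-base PID at the 3' end and the minimum bin size of three are all invented for illustration.

```python
from collections import Counter, defaultdict

def bin_by_primer_id(reads, pid_len=9, min_bin_size=3):
    """Group reads by the Primer ID at their 3' end and keep only bins that
    are large enough to be trusted (tiny bins often stem from PID errors)."""
    bins = defaultdict(list)
    for read in reads:
        pid, insert = read[-pid_len:], read[:-pid_len]
        bins[pid].append(insert)
    return {pid: seqs for pid, seqs in bins.items() if len(seqs) >= min_bin_size}

def consensus(seqs):
    """Simple per-position majority consensus for equal-length sequences."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))

# toy reads: 10-base insert followed by a 9-base Primer ID
reads = ["ACGTACGTAAGGCTTAGTT", "ACGTACGTTAGGCTTAGTT", "ACGTACGTAAGGCTTAGTT",
         "ACGAACGTAAGGCTTAGAC", "ACGTACGTAAGGCTTAGAC", "ACGTACGTAAGGCTTAGAC"]
for pid, seqs in bin_by_primer_id(reads, pid_len=9).items():
    print(pid, consensus(seqs))
```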
APA, Harvard, Vancouver, ISO, and other styles
9

Bärmann, Daniel. "Aufzählen von DNA-Codes." Master's thesis, Universität Potsdam, 2006. http://opus.kobv.de/ubp/volltexte/2006/1026/.

Full text
Abstract:
In this work a model for enumerating DNA codes is developed. By introducing an order on the set of DNA codewords and extending it to the set of all codes, the model assists in the discovery of DNA codes with properties such as non-overlappingness, compliance, comma-freeness, sticky-freeness, overhang-freeness, subword-compliance, solidness and others, with respect to a given involution on the set of codewords. A tool built on this model can be used to find codes with arbitrary combinations of code properties with respect to the standard Watson-Crick DNA involution. The work also investigates the optimality of DNA codes with respect to their information rate, as well as finding solid DNA codes.
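As a small illustration of the kind of code properties listed above, the sketch below implements the Watson-Crick involution (reverse complement) and a naive comma-freeness check for a code of equal-length words. It is not the thesis's enumeration model, just a minimal example with an invented three-word code.

```python
from itertools import product

def wc_involution(word):
    """Watson-Crick involution of a DNA word: reverse complement."""
    comp = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(comp[c] for c in reversed(word))

def is_comma_free(code):
    """A code (set of words of equal length n) is comma-free if no codeword
    occurs at a proper offset (1..n-1) inside any concatenation of two codewords."""
    n = len(next(iter(code)))
    for u, v in product(code, repeat=2):
        uv = u + v
        for offset in range(1, n):
            if uv[offset:offset + n] in code:
                return False
    return True

code = {"ACG", "TCG", "AAC"}                      # toy code
print(is_comma_free(code))                        # True for this example
print({w: wc_involution(w) for w in code})        # each word and its WC image
```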
APA, Harvard, Vancouver, ISO, and other styles
10

Shmeleva, Nataliya V. "Making sense of cDNA : automated annotation, storing in an interactive database, mapping to genomic DNA." Thesis, Georgia Institute of Technology, 2002. http://hdl.handle.net/1853/25178.

Full text
APA, Harvard, Vancouver, ISO, and other styles
11

Zheng, Hao. "Prediction and analysis of the methylation status of CpG islands in human genome." Diss., Georgia Institute of Technology, 2012. http://hdl.handle.net/1853/43631.

Full text
Abstract:
DNA methylation serves as a major epigenetic modification crucial to normal organismal development and to the onset and progression of complex diseases such as cancer. Computational predictions of DNA methylation profiles serve multiple purposes. First, accurate predictions can contribute valuable information for speeding up genome-wide DNA methylation profiling, so that experimental resources can be focused on a few selected regions while computational procedures are applied to the bulk of the genome. Second, computational predictions can extract functional features and construct useful models of DNA methylation based on existing data, and can therefore be used as an initial step toward quantitative identification of critical factors or pathways controlling DNA methylation patterns. Third, computational prediction of DNA methylation can provide benchmark data to calibrate DNA methylation profiling equipment and to consolidate profiling results from different equipment or techniques. This thesis is based on our study of the computational analysis of the DNA methylation patterns of the human genome. In particular, we have established computational models (1) to predict the methylation patterns of CpG islands in normal conditions, and (2) to detect the CpG islands that are unmethylated in normal conditions but aberrantly methylated in cancer. When evaluated using the CD4 lymphocyte data of the Human Epigenome Project (HEP) data set, based on bisulfite sequencing, our computational models for predicting the methylation status of CpG islands in normal conditions achieve a high accuracy of 93-94%, specificity of 94%, and sensitivity of 92-93%. When evaluated using the aberrant methylation data from the MethCancerDB database for aberrantly methylated genes in cancer, our models for detecting the CpG islands that are unmethylated in normal conditions but aberrantly methylated in colon or prostate cancer achieve an accuracy of 92-93%, specificity of 98-99%, and sensitivity of 92-93%.
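The figures quoted above (accuracy, specificity, sensitivity) follow the usual confusion-matrix definitions for a binary methylation call. The short sketch below, with invented labels, shows how such numbers are computed; it is not taken from the thesis.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity and specificity for a binary methylation call
    (1 = methylated, 0 = unmethylated)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),   # fraction of methylated islands recovered
        "specificity": tn / (tn + fp),   # fraction of unmethylated islands recovered
    }

print(classification_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 0, 1, 1]))
```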
APA, Harvard, Vancouver, ISO, and other styles
12

Jordan, Philip. "Using high-throughput data to imply gene function : assessment of selected genes for roles in mitotic and meiotic DNA processing." Thesis, University of Edinburgh, 2006. http://hdl.handle.net/1842/12320.

Full text
Abstract:
A method of integrating large data sets of S. cerevisiae was developed here to select novel genes that possess characteristics implying their involvement in DNA processing. From this method 54 genes were selected and mutants of these genes were screened for phenotypic abnormalities associated with both meiotic and mitotic DNA processing. Eleven mutants were found to be sensitive to hydroxyurea (Δsoh1, Δrtt101, Δfyv6, Δhur1/Δpmr1, Δbre5, Δmms22, Δbre1, Δvid21, Δsgf73, Δrmd11, Δdef1, and Δlge1; 20.4%) and two (Δdef1 and Δmms22; 3.7%) were found to be sensitive to methyl methanesulfonate during vegetative growth. Six mutants were found to have low levels of nuclear division during meiosis (Δbre1, Δvid21, Δsgf73, Δrmd11, Δdef1, and Δlge1; 11.1%), four mutants displayed reductions in heteroallelic recombination in meiosis (Δsoh1, Δbre5, Δhda2 and Δygl250w; 7.4%) and higher levels of nondisjunction in meiosis were observed in two mutants (Δsoh1 and Δypl107; 3.7%). Genes that were required for normal meiotic nuclear division (MND genes) were further assessed by analysing pre-meiotic DNA replication. One mutant (Δvid21) failed to initiate pre-meiotic DNA replication; the other five mutants (Δbre1, Δsgf73, Δrmd11, Δdef1, and Δlge1) were observed to have a delay in entry into, and a lengthened time to complete, pre-meiotic DNA replication. From cytological analysis of the MND mutants able to undergo pre-meiotic DNA replication it was shown that in a Δdef1 strain homologous chromosomes associate during meiosis I but chromosome synapsis is abnormal. Finally, a mutation that interrupts PMR1 was shown to over-replicate its genome during meiosis and form asci with more than the normal four spores, which are mostly inviable.
APA, Harvard, Vancouver, ISO, and other styles
13

Zhang, Yue. "Detection copy number variants profile by multiple constrained optimization." HKBU Institutional Repository, 2017. https://repository.hkbu.edu.hk/etd_oa/439.

Full text
Abstract:
Copy number variation (CNV), caused by genome rearrangement, generally refers to increases or decreases in the copy number of large genome segments whose lengths are more than 1 kb. Such copy number variations mainly appear as sub-microscopic deletions and duplications. Copy number variation is an important component of genome structural variation, and is one of the pathogenic factors of human diseases. Next generation sequencing technology is a popular CNV detection method and has been widely used in various fields of life science research. It possesses the advantages of high throughput and low cost. By tailoring NGS technology, it is possible to sequence individual cells. Such single-cell sequencing can reveal the gene expression status and genomic variation profile of a single cell. Single-cell sequencing is promising in the study of tumors, developmental biology, neuroscience and other fields. However, there are two challenging problems encountered in CNV detection for NGS data. The first is that, since single-cell sequencing requires a special genome amplification step to accumulate enough sample material, a large amount of bias is introduced, making the calling of copy number variants rather challenging. The performance of many popular copy number calling methods, designed for bulk sequencing, is not consistent, and these methods cannot be applied directly to single-cell sequencing data. The second is to simultaneously analyze genome data for multiple samples, so that similar cells can be assembled and subgrouped accurately and efficiently. The high level of noise in single-cell sequencing data negatively affects the reliability of sequence reads and leads to inaccurate patterns of variation. To handle the problem of reliably finding CNVs in NGS data, in this thesis we first establish a workflow for analyzing NGS and single-cell sequencing data. CNV identification is formulated as a quadratic optimization problem with both sparsity and smoothness constraints. Tailored from the alternating direction minimization (ADM) framework, an efficient numerical solution is designed accordingly. The proposed model was tested extensively to demonstrate its superior performance. It is shown that the proposed approach can successfully reconstruct CNVs, especially somatic copy number alteration patterns, from raw data. Compared with existing counterparts, it achieved superior or comparable performance in the detection of CNVs. To tackle the issue of recovering the hidden blocks within multiple single-cell DNA-sequencing samples, we present a permutation-based model to rearrange the samples such that similar ones are positioned adjacently. The permutation is guided by the total variation (TV) norm of the recovered copy number profiles, and is continued until the TV norm is minimized, when similar samples are stacked together to reveal block patterns. Accordingly, an efficient numerical scheme for finding this permutation is designed, tailored from the alternating direction method of multipliers. Application of this method to both simulated and real data demonstrates its ability to recover the hidden structures of single-cell DNA sequences.
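The thesis formulates CNV calling as a quadratic program with sparsity and smoothness terms solved by an ADM scheme; that solver is not reproduced here. As a rough stand-in, the sketch below replaces the l1/TV terms with an l2 penalty on first differences, which has a closed-form solution and still shows how a smoothness-regularized profile is recovered from a noisy read-depth ratio. The data, the function name and the lambda value are all invented.

```python
import numpy as np

def smooth_profile(y, lam=10.0):
    """Solve min_x ||x - y||^2 + lam * ||D x||^2, where D takes first
    differences; the solution satisfies (I + lam * D^T D) x = y.
    This is a smoothed (not sparse) stand-in for a sparsity+smoothness objective."""
    n = len(y)
    D = np.diff(np.eye(n), axis=0)          # (n-1) x n first-difference operator
    A = np.eye(n) + lam * D.T @ D
    return np.linalg.solve(A, y)

# toy read-depth ratio with one amplified segment plus noise
rng = np.random.default_rng(0)
y = np.concatenate([np.ones(40), 1.5 * np.ones(20), np.ones(40)])
y += rng.normal(scale=0.2, size=y.size)
x = smooth_profile(y, lam=25.0)
print(np.round(x[35:45], 2))                # values rise near the breakpoint at 40
```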
APA, Harvard, Vancouver, ISO, and other styles
14

Hall, Richard James. "Development of methods for improving throughput in the processing of single particle cryo-electron microscopy data, applied to the reconstruction of E. coli RNA polymerase holoenzyme - DNA complex." Thesis, Imperial College London, 2004. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.411621.

Full text
APA, Harvard, Vancouver, ISO, and other styles
15

Haibe-Kains, Benjamin. "Identification and assessment of gene signatures in human breast cancer." Doctoral thesis, Universite Libre de Bruxelles, 2009. http://hdl.handle.net/2013/ULB-DIPOT:oai:dipot.ulb.ac.be:2013/210348.

Full text
Abstract:
This thesis addresses the use of machine learning techniques to develop clinical diagnostic tools for breast cancer using molecular data. These tools are designed to assist physicians in their evaluation of the clinical outcome of breast cancer (referred to as prognosis).

The traditional approach to evaluating breast cancer prognosis is based on the assessment of clinico-pathologic factors known to be associated with breast cancer survival. These factors are used to make recommendations about whether further treatment is required after the removal of a tumor by surgery. Treatment such as chemotherapy depends on the estimation of patients' risk of relapse. Although current approaches do provide good prognostic assessment of breast cancer survival, clinicians are aware that there is still room for improvement in the accuracy of their prognostic estimations.

In the late nineties, new high throughput technologies such as the gene expression profiling through microarray technology emerged. Microarrays allowed scientists to analyze for the first time the expression of the whole human genome ("transcriptome"). It was hoped that the analysis of genome-wide molecular data would bring new insights into the critical, underlying biological mechanisms involved in breast cancer progression, as well as significantly improve prognostic prediction. However, the analysis of microarray data is a difficult task due to their intrinsic characteristics: (i) thousands of gene expressions are measured for only few samples; (ii) the measurements are usually "noisy"; and (iii) they are highly correlated due to gene co-expressions. Since traditional statistical methods were not adapted to these settings, machine learning methods were picked up as good candidates to overcome these difficulties. However, applying machine learning methods for microarray analysis involves numerous steps, and the results are prone to overfitting. Several authors have highlighted the major pitfalls of this process in the early publications, shedding new light on the promising but overoptimistic results.

Since 2002, large comparative studies have been conducted in order to identify the key characteristics of successful methods for class discovery and classification. Yet methods able to identify robust molecular signatures that can predict breast cancer prognosis have been lacking. To fill this important gap, this thesis presents an original methodology dealing specifically with the analysis of microarray and survival data in order to build prognostic models and provide an honest estimation of their performance. The approach used for signature extraction consists of a set of original methods for feature transformation, feature selection and prediction model building. A novel statistical framework is presented for performance assessment and comparison of risk prediction models.
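The abstract does not spell out the performance measures used in its assessment framework. One standard measure for risk prediction models on survival data is Harrell's concordance index, sketched below purely as a generic illustration (it is not claimed to be the thesis's own framework); all inputs are invented.

```python
def concordance_index(times, events, risk):
    """Harrell's C: fraction of usable pairs in which the higher-risk subject
    fails earlier. events: 1 = event observed, 0 = censored."""
    concordant, usable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # a pair is usable if subject i has an observed event
            # and fails strictly before subject j's follow-up time
            if events[i] == 1 and times[i] < times[j]:
                usable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / usable if usable else float("nan")

times  = [5, 8, 12, 3, 9]          # follow-up times
events = [1, 0, 1, 1, 0]           # event indicators
risk   = [0.9, 0.4, 0.5, 0.8, 0.2] # predicted risk scores
print(round(concordance_index(times, events, risk), 3))   # 0.857
```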

In terms of applications, we show that these methods, used in combination with a priori biological knowledge of breast cancer and numerous public microarray datasets, have resulted in some important discoveries. In particular, the research presented here develops (i) a robust model for the identification of breast molecular subtypes and (ii) a new prognostic model that takes into account the molecular heterogeneity of breast cancers observed previously, in order to improve traditional clinical guidelines and state-of-the-art gene signatures.
Doctorat en Sciences

APA, Harvard, Vancouver, ISO, and other styles
16

Kontos, Kevin. "Gaussian graphical model selection for gene regulatory network reverse engineering and function prediction." Doctoral thesis, Universite Libre de Bruxelles, 2009. http://hdl.handle.net/2013/ULB-DIPOT:oai:dipot.ulb.ac.be:2013/210301.

Full text
Abstract:
One of the most important and challenging "knowledge extraction" tasks in bioinformatics is the reverse engineering of gene regulatory networks (GRNs) from DNA microarray gene expression data. Indeed, as a result of the development of high-throughput data-collection techniques, biology is experiencing a data flood phenomenon that pushes biologists toward a new view of biology--systems biology--that aims at system-level understanding of biological systems.

Unfortunately, even for small model organisms such as the yeast Saccharomyces cerevisiae, the number p of genes is much larger than the number n of expression data samples. The dimensionality issue induced by this "small n, large p" data setting renders standard statistical learning methods inadequate. Restricting the complexity of the models makes it possible to deal with this serious impediment. Indeed, by introducing (a priori undesirable) bias in the model selection procedure, one reduces the variance of the selected model, thereby increasing its accuracy.

Gaussian graphical models (GGMs) have proven to be a very powerful formalism to infer GRNs from expression data. Standard GGM selection techniques can unfortunately not be used in the "small n, large p" data setting. One way to overcome this issue is to resort to regularization. In particular, shrinkage estimators of the covariance matrix--required to infer GGMs--have proven to be very effective. Our first contribution consists in a new shrinkage estimator that improves upon existing ones through the use of a Monte Carlo (parametric bootstrap) procedure.

Another approach to GGM selection in the "small n, large p" data setting consists in reverse engineering limited-order partial correlation graphs (q-partial correlation graphs) to approximate GGMs. Our second contribution consists in an inference algorithm, the q-nested procedure, that builds a sequence of nested q-partial correlation graphs to take advantage of the smaller order graphs' topology to infer higher order graphs. This allows us to significantly speed up the inference of such graphs and to avoid problems related to multiple testing. Consequently, we are able to consider higher order graphs, thereby increasing the accuracy of the inferred graphs.

Another important challenge in bioinformatics is the prediction of gene function. An example of such a prediction task is the identification of genes that are targets of the nitrogen catabolite repression (NCR) selection mechanism in the yeast Saccharomyces cerevisiae. The study of model organisms such as Saccharomyces cerevisiae is indispensable for the understanding of more complex organisms. Our third contribution consists in extending the standard two-class classification approach by enriching the set of variables and comparing several feature selection techniques and classification algorithms.

Finally, our fourth contribution formulates the prediction of NCR target genes as a network inference task. We use GGM selection to infer multivariate dependencies between genes, and, starting from a set of genes known to be sensitive to NCR, we classify the remaining genes. We hence avoid problems related to the choice of a negative training set and take advantage of the robustness of GGM selection techniques in the "small n, large p" data setting.
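The first contribution above concerns shrinkage estimation of the covariance matrix in the "small n, large p" setting. The sketch below shows only the generic shrinkage idea, blending the sample covariance with a diagonal target using a fixed coefficient alpha; the thesis's estimator chooses that coefficient via a Monte Carlo (parametric bootstrap) procedure, which is not reproduced here, and the data and function name are invented.

```python
import numpy as np

def shrinkage_covariance(X, alpha=0.2):
    """Shrink the sample covariance toward a diagonal target:
    S_shrunk = (1 - alpha) * S + alpha * diag(S). A fixed alpha is used
    here purely for illustration."""
    S = np.cov(X, rowvar=False)
    target = np.diag(np.diag(S))
    return (1 - alpha) * S + alpha * target

# "small n, large p": 20 samples, 100 genes
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 100))
S_shrunk = shrinkage_covariance(X, alpha=0.3)
# the shrunk estimate is far better conditioned than the singular sample covariance
print(np.linalg.cond(S_shrunk) < np.linalg.cond(np.cov(X, rowvar=False)))
```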
Doctorat en Sciences

APA, Harvard, Vancouver, ISO, and other styles
17

Hart, Dennis L., Johnny J. Pappas, and John E. Lindegren. "Desktop GPS Analyst Standardized GPS Data Processing and Analysis on a Personal Computer." International Foundation for Telemetering, 1996. http://hdl.handle.net/10150/611424.

Full text
Abstract:
International Telemetering Conference Proceedings / October 28-31, 1996 / Town and Country Hotel and Convention Center, San Diego, California
In the last few years there has been a proliferation of GPS receivers and receiver manufacturers. At the same time, a growing number of DoD test programs require high-accuracy Time-Space-Position-Information (TSPI) with diminishing test support funds and/or need a wide-area, low-altitude or surface tracking capability. The Air Force Development Test Center (AFDTC) recognized the growing requirements for using GPS in test programs and the need for a low-cost, portable TSPI processing capability, which sparked the development of the Desktop GPS Analyst. The Desktop GPS Analyst is a personal computer (PC) based software application for the generation of GPS-based TSPI.
APA, Harvard, Vancouver, ISO, and other styles
18

Abyad, Emad. "Modeled Estimates of Solar Direct Normal Irradiance and Diffuse Horizontal Irradiance in Different Terrestrial Locations." Thesis, Université d'Ottawa / University of Ottawa, 2017. http://hdl.handle.net/10393/36499.

Full text
Abstract:
The transformation of solar energy into electricity is starting to impact the overall worldwide energy production mix. Photovoltaic-generated electricity can play a significant role in minimizing the use of non-renewable energy sources. Sunlight consists of three main components: global horizontal irradiance (GHI), direct normal irradiance (DNI) and diffuse horizontal irradiance (DHI). Typically, these components are measured using specialized instruments in order to study solar radiation at any location. However, these measurements are not always available, especially in the case of the DNI and DHI components of sunlight. Consequently, many models have been developed to estimate these components from available GHI data. These models have their own merits. For this thesis, solar radiation data collected at four locations have been analyzed. The data come from Al-Hanakiyah (Saudi Arabia), Boulder (U.S.), Ma’an (Jordan), and Ottawa (Canada). The BRL, Reindl*, DISC, and Perez models have been used to estimate DNI and DHI data from the experimentally measured GHI data. The findings show that the Reindl* and Perez models offer similar accuracy in computing DNI and DHI values when compared with detailed experimental data for Al-Hanakiyah and Ma’an. For Boulder, the Perez and BRL models have similar abilities to estimate DHI values, and the DISC and Perez models are better estimators of DNI. The Reindl* model performs better when modeling DHI and DNI for the Ottawa data. The BRL and DISC models show similar error metrics, except in the case of the Ma’an location, where the BRL model shows high error metric values in terms of MAE, RMSE, and standard deviation (σ). The Boulder and Ottawa datasets were not complete, which affected the outcomes with regard to the model performance metrics. Moreover, the metrics show very high, unreasonable values in terms of RMSE and σ. It is advised that a global model be developed by collecting data from many locations, as a way to help minimize the error between the actual and modeled values, since the current models have their own limitations. Availability of multi-year data, parameters such as albedo and aerosols, and data at one-minute to hourly time steps could help minimize the error between measured and modeled data. In addition to having accurate data, analysis of spectral data is important to evaluate their impact on solar technologies.
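The error measures cited above (MAE, RMSE, and the standard deviation of the residuals) can be computed directly from paired measured and modeled irradiance series, as in this small sketch; the values and the function name are invented for illustration.

```python
import numpy as np

def error_metrics(measured, modeled):
    """MAE, RMSE and standard deviation of the residuals between
    measured and modeled irradiance (e.g. DNI in W/m^2)."""
    residuals = np.asarray(modeled, dtype=float) - np.asarray(measured, dtype=float)
    return {
        "MAE": np.mean(np.abs(residuals)),
        "RMSE": np.sqrt(np.mean(residuals ** 2)),
        "sigma": np.std(residuals),
    }

measured = [820.0, 760.0, 900.0, 640.0]   # hypothetical measured DNI
modeled = [805.0, 790.0, 870.0, 700.0]    # hypothetical modeled DNI
print(error_metrics(measured, modeled))
```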
APA, Harvard, Vancouver, ISO, and other styles
19

Castagno, Thomas A. "The effect of knee pads on gait and comfort." Link to electronic thesis, 2004. http://www.wpi.edu/Pubs/ETD/Available/etd-0426104-174716.

Full text
APA, Harvard, Vancouver, ISO, and other styles
20

Scarlato, Michele. "Sicurezza di rete, analisi del traffico e monitoraggio." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2012. http://amslaurea.unibo.it/3223/.

Full text
Abstract:
The work was divided into three macro-areas. The first concerns a theoretical analysis of how intrusions work, of which software is used to carry them out, and of how to protect against them (using the devices generically referred to as firewalls). A second macro-area analyzes an intrusion carried out from the outside against sensitive servers of a LAN. This analysis is conducted on the files captured by the two network interfaces configured in promiscuous mode on a probe placed in the LAN. There are two interfaces so that the probe can connect to two LAN segments with two different subnet masks. The attack is analyzed using various software tools. A third part of the work can in fact be identified: the part in which the files captured by the two interfaces are analyzed, first with software that handles full-content data, such as Wireshark, then with software that handles session data, processed with Argus, and finally the statistical data, processed with Ntop. The penultimate chapter, the one before the conclusions, covers the installation of Nagios and its configuration for monitoring, through plugins, the remaining disk space on a remote agent machine and the MySQL and DNS services. Naturally, Nagios can be configured to monitor any type of service offered on the network.
APA, Harvard, Vancouver, ISO, and other styles
21

"Pattern analysis of microarray data." Thesis, 2009. http://library.cuhk.edu.hk/record=b6074754.

Full text
Abstract:
DNA microarray technology is the most notable high-throughput technology to have emerged for functional genomics in recent years. Patterns in microarray data provide clues about gene functions, cell types, and interactions among genes or gene products. Since the scale of microarray data keeps growing, there is an urgent need for the development of methods and tools for the analysis of these huge amounts of complex data.
Interesting patterns in microarray data can be patterns that appear with significant frequencies or patterns that exhibit special trends. Firstly, an algorithm to find biclusters with coherent values is proposed. For these biclusters the subsets of genes (or samples) show some similarity, such as low Euclidean distance or high Pearson correlation coefficient. We propose the Average Correlation Value (ACV) to measure the homogeneity of a bicluster. ACV outperforms other alternatives because it is applicable to more types of biclusters. Our algorithm applies a dominant-set approach to create sets of sorting vectors for the rows of the data matrix. In this way, the co-expressed rows of the data matrix can be gathered. By alternately sorting and transposing the data matrix, the blocks of co-expressed subsets are gathered. A weighted correlation coefficient is used to measure the similarity at the gene level and the sample level. The weights are updated each time using the sorting vector of the previous iteration. Genes/samples which are assigned higher weights contribute more to the similarity measure when they are used as features for the other dimension. Unlike two-way clustering or divide-and-conquer algorithms, our approach does not break the structure of the whole data and can find multiple overlapping biclusters. The method also has low computation cost compared to exhaustive enumeration and distribution parameter identification methods.
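The abstract does not give the exact formula for ACV. One plausible reading, used only for illustration here, is the mean absolute Pearson correlation over all row pairs and over all column pairs of the bicluster, taking the larger of the two, so that a perfectly coherent bicluster scores close to 1. The function name and the toy bicluster below are invented.

```python
import numpy as np

def acv(bicluster):
    """Average Correlation Value of a bicluster (rows = genes, cols = samples):
    mean absolute Pearson correlation over distinct row pairs and over distinct
    column pairs, taking the larger of the two."""
    def mean_abs_corr(M):
        C = np.abs(np.corrcoef(M))
        n = C.shape[0]
        return (C.sum() - n) / (n * n - n)   # drop the diagonal of ones
    B = np.asarray(bicluster, dtype=float)
    return max(mean_abs_corr(B), mean_abs_corr(B.T))

# rows follow the same trend up to scaling/offset, so ACV is close to 1
B = [[1, 2, 3, 4], [2, 4, 6, 8], [1.5, 2.5, 3.5, 4.5]]
print(round(acv(B), 3))
```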
Next, algorithms to find biclusters with coherent evolutions, more specifically order-preserving patterns, are proposed. In an Order Preserving Cluster (OP-Cluster) genes induce the same relative order on samples, while the exact magnitudes of the data are disregarded. By converting each gene expression vector into an ordered label sequence, we transform the problem into finding frequent orders appearing in the sequence set. Two heuristic algorithms, Growing Prefix and Suffix (GPS) and Growing Frequent Position (GFP), are presented. The results show these methods both have good scale-up properties. They output larger OP-Clusters more efficiently and have lower space and computation cost compared to existing methods.
We propose the idea of Discovering Distinct Patterns (DDP) in gene expression data. The distinct patterns correspond to genes with significantly different patterns. DDP is useful for scaling down the analysis when there is little prior knowledge. A DDP algorithm is proposed which iteratively picks out pairs of genes with the largest dissimilarities. Experiments are carried out on both synthetic data sets and real microarray data. The results show the effectiveness and efficiency of the approach in finding functionally significant genes. The usefulness of genes with distinct patterns for constructing simplified gene regulatory networks is further discussed.
Teng, Li.
Adviser: Laiwan Chan.
Source: Dissertation Abstracts International, Volume: 71-01, Section: B, page: 0446.
Thesis (Ph.D.)--Chinese University of Hong Kong, 2009.
Includes bibliographical references (leaves 118-128).
Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web.
Electronic reproduction. Ann Arbor, MI : ProQuest Information and Learning Company, [200-] System requirements: Adobe Acrobat Reader. Available via World Wide Web.
Abstracts in English and Chinese.
APA, Harvard, Vancouver, ISO, and other styles
22

"The analysis of cDNA sequences: an algorithm for alignment." 1997. http://library.cuhk.edu.hk/record=b5889263.

Full text
Abstract:
by Lam Fung Ming.
Thesis (M.Phil.)--Chinese University of Hong Kong, 1997.
Includes bibliographical references (leaves 45-47).
Chapter CHAPTER 1 --- INTRODUCTION --- p.1
Chapter CHAPTER 2 --- BACKGROUND --- p.4
Section 2.1 DNA Cloning --- p.5
Section 2.1.1 Principles of cell-based DNA cloning --- p.5
Section 2.1.2. Polymerase Chain Reaction --- p.8
Section 2.2 DNA Libraries --- p.10
Section 2.3. Expressed Sequence Tags --- p.11
"Section 2.4 dbEST - Database for ""Expressed Sequence Tag""" --- p.13
Chapter CHAPTER 3 --- REDUCTION OF PARTIAL SEQUENCE REDUNDANCY AND CDNA ALIGNMENT --- p.15
Section 3.1 Materials --- p.15
Section 3.2 Our Algorithm --- p.16
Section 3.3 Data Storage --- p.24
Section 3.4 Criterion of Alignment --- p.27
Section 3.5 Pairwise Alignment --- p.29
Chapter CHAPTER 4 --- RESULTS AND DISCUSSION --- p.32
Chapter CHAPTER 5 --- CONCLUSION AND FUTURE DEVELOPMENT --- p.42
REFERENCES --- p.45
APPENDIX --- p.i
APA, Harvard, Vancouver, ISO, and other styles
23

"Bioinformatics analyses for next-generation sequencing of plasma DNA." 2012. http://library.cuhk.edu.hk/record=b5549423.

Full text
Abstract:
The presence of fetal DNA in the cell-free plasma of pregnant women was first described in 1997. The initial clinical applications of this phenomenon focused on the detection of paternally inherited traits such as sex and rhesus D blood group status. The development of massively parallel sequencing technologies has allowed more sophisticated analyses of circulating cell-free DNA in maternal plasma. For example, through the determination of the proportional representation of chromosome 21 sequences in maternal plasma, noninvasive prenatal diagnosis of fetal Down syndrome can be achieved with an accuracy of >98%. In the first part of my thesis, I have developed bioinformatics algorithms to perform genome-wide construction of the fetal genetic map from the massively parallel sequencing data of the maternal plasma DNA sample of a pregnant woman. The construction of the fetal genetic map from the maternal plasma sequencing data is very challenging because fetal DNA constitutes only approximately 10% of the maternal plasma DNA. Moreover, as the fetal DNA in maternal plasma exists as short fragments of less than 200 bp, existing bioinformatics techniques for genome construction are not applicable for this purpose. For the construction of the genome-wide fetal genetic map, I have used the genomes of the father and the mother as scaffolds and calculated the fractional fetal DNA concentration. First, I looked at the paternal-specific sequences in maternal plasma to determine which portions of the father’s genome had been passed on to the fetus. For the determination of the maternal inheritance, I have developed the Relative Haplotype Dosage (RHDO) approach. This method is based on the principle that the portion of the maternal genome inherited by the fetus would be present in slightly higher concentration in the maternal plasma. The use of haplotype information can enhance the efficacy of using the sequencing data. Thus, the maternal inheritance can be determined with a much lower sequencing depth than by just looking at individual loci in the genome. This algorithm makes it feasible to use genome-wide scanning to diagnose fetal genetic disorders prenatally in a noninvasive way.
With the emergence of targeted massively parallel sequencing, the sequencing cost per base is dropping dramatically. Even though the first part of the thesis has already developed a method to estimate the fractional fetal DNA concentration using parental genotype information, that method cannot deduce the fractional fetal DNA concentration directly from sequencing data without prior knowledge of the genotypes. In the second part of this thesis, I propose a statistical mixture-model-based method, FetalQuant, which uses maximum likelihood to estimate the fractional fetal DNA concentration directly from targeted massively parallel sequencing of maternal plasma DNA. This method allows fetal DNA concentration estimation superior to the existing methods in terms of obviating the need for genotype information without loss of accuracy. Furthermore, by using Bayes’ rule, this method can distinguish the informative SNPs where the mother is homozygous and the fetus is heterozygous, which has the potential to detect dominantly inherited disorders.
Besides genetic analysis at the DNA level, epigenetic markers are also valuable for the development of noninvasive diagnostics. In the third part of this thesis, I have also developed a bioinformatics algorithm to efficiently analyze genome-wide DNA methylation status based on the massively parallel sequencing of bisulfite-converted DNA. DNA methylation is one of the most important mechanisms for regulating gene expression. The study of DNA methylation for different genes is important for the understanding of different physiological and pathological processes. Currently, the most popular method for analyzing DNA methylation status is bisulfite sequencing. The principle of this method is based on the fact that unmethylated cytosine residues are chemically converted to uracil on bisulfite treatment whereas methylated cytosines remain unchanged. The converted uracil and unconverted cytosine can then be discriminated on sequencing. With the emergence of massively parallel sequencing platforms, it is possible to perform this bisulfite sequencing analysis on a genome-wide scale. However, the bioinformatics analysis of genome-wide bisulfite sequencing data is much more complicated than analyzing the data from individual loci. Thus, I have developed Methyl-Pipe, a bioinformatics program for analyzing the genome-wide DNA methylation status of DNA samples based on massively parallel sequencing. In the first step of this algorithm, an in-silico converted reference genome is produced by converting all the cytosine residues to thymine residues. Then, the sequenced reads of bisulfite-converted DNA are aligned to this modified reference sequence. Finally, post-processing of the alignments removes non-unique and low-quality mappings and characterizes the methylation pattern in a genome-wide manner. Making use of this new program, potential fetal-specific hypomethylated regions, which can be used as blood biomarkers, can be identified in a genome-wide manner.
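As a toy illustration of the fetal-fraction idea discussed above: at SNPs where the mother is homozygous and the fetus heterozygous, the fetal-specific allele is expected at a frequency of about half the fetal fraction, so the fraction can be read off the allele counts. The sketch below uses invented read counts and assumes the informative SNPs are already known; FetalQuant itself fits a binomial mixture model and does not require that prior knowledge.

```python
def fetal_fraction(snp_counts):
    """Estimate the fetal DNA fraction from SNPs where the mother is
    homozygous (AA) and the fetus heterozygous (AB): the fetal-specific
    B allele is carried only by fetal DNA, so its expected allele
    frequency is f/2 and f = 2 * B / (A + B) aggregated over SNPs."""
    b_total = sum(b for a, b in snp_counts)
    depth_total = sum(a + b for a, b in snp_counts)
    return 2.0 * b_total / depth_total

# (A-allele reads, fetal-specific B-allele reads) at three informative SNPs
print(fetal_fraction([(190, 10), (285, 15), (95, 5)]))   # ~0.10
```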
Jiang, Peiyong.
Thesis (Ph.D.)--Chinese University of Hong Kong, 2012.
Includes bibliographical references (leaves 100-105).
Abstracts also in Chinese.
Chapter SECTION I : --- BACKGROUND --- p.1
Chapter CHAPTER 1: --- Circulating nucleic acids and Next-generation sequencing --- p.2
Chapter 1.1 --- Circulating nucleic acids --- p.2
Chapter 1.2 --- Next-generation sequencing --- p.3
Chapter 1.3 --- Bioinformatics analyses --- p.9
Chapter 1.4 --- Applications of the NGS --- p.11
Chapter 1.5 --- Aims of this thesis --- p.12
Chapter SECTION II : --- Mathematically decoding fetal genome in maternal plasma --- p.14
Chapter CHAPTER 2: --- Characterizing the maternal and fetal genome in plasma at single base resolution --- p.15
Chapter 2.1 --- Introduction --- p.15
Chapter 2.2 --- SNP categories and principle --- p.17
Chapter 2.3 --- Clinical cases and SNP genotyping --- p.20
Chapter 2.4 --- Sequencing depth and fractional fetal DNA concentration determination --- p.24
Chapter 2.5 --- Filtering of genotyping errors for maternal genotypes --- p.26
Chapter 2.6 --- Constructing fetal genetic map in maternal plasma --- p.27
Chapter 2.7 --- Sequencing error estimation --- p.36
Chapter 2.8 --- Paternal-inherited alleles --- p.38
Chapter 2.9 --- Maternally-derived alleles by RHDO analysis --- p.39
Chapter 2.10 --- Recombination breakpoint simulation and detection --- p.49
Chapter 2.11 --- Prenatal diagnosis of β- thalassaemia --- p.51
Chapter 2.12 --- Discussion --- p.53
Chapter SECTION III : --- Statistical model for fractional fetal DNA concentration estimation --- p.56
Chapter CHAPTER 3: --- FetalQuant: deducing the fractional fetal DNA concentration from massively parallel sequencing of maternal plasma DNA --- p.57
Chapter 3.1 --- Introduction --- p.57
Chapter 3.2 --- Methods --- p.60
Chapter 3.2.1 --- Maternal-fetal genotype combinations --- p.60
Chapter 3.2.2 --- Binomial mixture model and likelihood --- p.64
Chapter 3.2.3 --- Fractional fetal DNA concentration fitting --- p.66
Chapter 3.3 --- Results --- p.71
Chapter 3.3.1 --- Datasets --- p.71
Chapter 3.3.2 --- Evaluation of FetalQuant algorithm --- p.75
Chapter 3.3.3 --- Simulation --- p.78
Chapter 3.3.4 --- Sequencing depth and the number of SNPs required by FetalQuant --- p.81
Chapter 3.5 --- Discussion --- p.85
Chapter SECTION IV : --- NGS-based data analysis pipeline development --- p.88
Chapter CHAPTER 4: --- Methyl-Pipe: Methyl-Seq bioinformatics analysis pipeline --- p.89
Chapter 4.1 --- Introduction --- p.89
Chapter 4.2 --- Methods --- p.89
Chapter 4.2.1 --- Overview of Methyl-Pipe --- p.90
Chapter 4.3 --- Results and discussion --- p.96
Chapter SECTION V : --- CONCLUDING REMARKS --- p.97
Chapter CHAPTER 5: --- Conclusion and future perspectives --- p.98
Chapter 5.1 --- Conclusion --- p.98
Chapter 5.2 --- Future perspectives --- p.99
Reference --- p.100
APA, Harvard, Vancouver, ISO, and other styles
24

Oelofse, Andries Johannes. "Development of a MIAME-compliant microarray data management system for functional genomics data integration." Diss., 2007. http://hdl.handle.net/2263/27456.

Full text
APA, Harvard, Vancouver, ISO, and other styles
25

Narayanaswamy, Rammohan 1978. "Genome-wide analyses of single cell phenotypes using cell microarrays." Thesis, 2008. http://hdl.handle.net/2152/3967.

Full text
Abstract:
The past few decades have witnessed a revolution in recombinant DNA and nucleic acid sequencing technologies. Recently, however, technologies capable of massively high-throughput, genome-wide data collection, combined with computational and statistical tools for data mining, integration and modeling, have enabled the construction of predictive networks that capture cellular regulatory states, paving the way for ‘Systems biology’. Consequently, protein interactions can be captured in the context of a cellular interaction network and emergent ‘system’ properties arrived at, that may not have been possible by conventional biology. The ability to generate data from multiple, non-redundant experimental sources is one of the important facets of systems biology. Towards this end, we have established a novel platform called ‘spotted cell microarrays’ for conducting image-based genetic screens. We have subsequently used spotted cell microarrays for studying multidimensional phenotypes in yeast under different regulatory states. In particular, we studied the response to mating pheromone using a cell microarray comprised of the yeast non-essential deletion library and analyzed morphology changes to identify novel genes involved in mating. An important aspect of the mating response pathway is large-scale spatiotemporal change to the proteome, an aspect of proteomics that is still largely obscure. In our next study, we used an imaging screen and a computational approach to predict and validate the complement of proteins that polarize and change localization towards the mating projection tip. By adopting such hybrid approaches, we have been able to study not only proteins involved in specific pathways but also their behavior in a systemic context, leading to a broader comprehension of cell function. Lastly, we have performed a novel metabolic starvation-based screen using the GFP-tagged collection to study proteome dynamics in response to nutrient limitation and are currently in the process of rationalizing our observations through follow-up experiments. We believe this study has implications for evolutionarily conserved cellular mechanisms such as protein turnover, quiescence and aging. Our technique has therefore been applied towards addressing several interesting aspects of yeast cellular physiology and behavior and is now being extended to mammalian cells.
APA, Harvard, Vancouver, ISO, and other styles
26

Murti, Bayu Tri. "Experimental and computational studies on sensing of DNA damage in Alzheimer's disease." Thesis, 2017. http://hdl.handle.net/10321/2670.

Full text
Abstract:
Submitted in fulfilment of the requirements of Master's Degree in Chemistry, Durban University of Technology, 2017.
DNA damage plays a pivotal role in the pathogenesis of Alzheimer’s disease (AD); therefore, an innovative ss-DNA/dopamine/TiO2/FTO electrode strategy was developed to detect genotoxicity upon photocatalytic reactions. This study involves a computational and electrochemical investigation towards the direct measurement of DNA damage. Computational chemistry was useful to resolve the intricate chemistry problems behind the electrode construction. The computational protocols, comprising density functional theory (DFT) calculations, Metropolis Monte Carlo (MC) adsorption studies, and molecular dynamics (MD) simulations, were carried out in parallel. The DFT calculations elucidated the structural, electronic, and vibrational properties of the electrode components, in good agreement with the experimental parameters. The MC simulations, carried out using simulated annealing, predicted the adsorption process within the layer-by-layer electrode and generated reliable inputs for the MD simulations. A 100 ns MD simulation performed in the canonical ensemble provided information on thermodynamic parameters such as total energy, temperature, and potential energy profiles, as well as radii of gyration and atomic density profiles. Binding energies calculated from the MD trajectories revealed increasing interaction energies for the layer-by-layer electrode, in agreement with the electrochemical characterization studies (i.e. gradual decrease of the cyclic voltammogram (CV) response as well as increasing diameter of the electrochemical impedance spectroscopy (EIS) semicircle upon electrode modification). The higher binding energies may lead to smaller changes in the electrochemical polarizability, which directly affects the decrease of the redox peak current and the enhancement of the charge transfer resistance. HOMO-LUMO levels from DFT are also taken into account to explain electron transfer phenomena within the layer construction, leading to the alteration of CV behaviour. Experimentally, the ss-DNA was electronically linked to the TiO2/FTO surface through dopamine as a molecular anchor. Electrochemical measurements using cyclic voltammetry and EIS were employed to characterize the electrode modifications. Square wave voltammetry was subsequently used to measure the DNA damage and the potency of antioxidant treatment using ascorbic acid (AA), owing to its ability to protect the DNA from damage. The presence of AA significantly protected the DNA from damage and could therefore serve as a potential treatment in AD. Theoretically, guanine residues were predicted by DFT to be the most reactive sites of the ss-DNA involved in the genotoxic reactions. Overall, the theoretical studies successfully validated the experimental study, as well as providing the molecular basis of the interaction phenomena underlying the electrode construction. Our results highlight the potential application of this methodology to screen for genotoxicity in Alzheimer’s disease, suggesting the important role of theoretical studies in predicting molecular interactions and validating DNA-based sensors and bioelectronics.
APA, Harvard, Vancouver, ISO, and other styles
27

"Generalized pattern matching applied to genetic analysis." Thesis, 2011. http://library.cuhk.edu.hk/record=b6075184.

Full text
Abstract:
The approximate pattern matching problem is, given a reference sequence T, a pattern (query) Q, and a maximum allowed error e, to find all substrings of the reference such that the edit distance between each substring and the pattern is smaller than or equal to the maximum allowed error. Though it is a well-studied problem in Computer Science, it has seen a resurgence in Bioinformatics in recent years, largely due to the emergence of next-generation high-throughput sequencing technologies. This thesis contributes a novel generalized pattern matching framework and applies it to pattern matching problems in general and alternative splicing (AS) detection in particular. AS detection requires mapping a large amount of next-generation sequencing short-read data to a reference human genome, which is the first and an important step in analyzing the sequenced data for further biological analysis. The four parts of my research are as follows.
In the first part of my research work, we propose a novel deterministic pattern matching algorithm which applies Agrep, a well-known bit-parallel matching algorithm, to a truncated suffix array. Due to the linear cost of Agrep, the cost of our approach is linear in the number of characters processed in the truncated suffix array. We analyze the matching cost theoretically, and obtain empirical costs from experiments. We carry out experiments using both synthetic and real DNA sequence data (queries) and search them in Chromosome X of a reference human genome. The experimental results show that our approach achieves a speed-up of several orders of magnitude over the standard Agrep algorithm.
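The thesis applies the bit-parallel Agrep algorithm to a truncated suffix array; that implementation is not reproduced here. For orientation only, the sketch below solves the same approximate matching problem with the textbook Sellers dynamic programme, reporting every end position in the text where the pattern matches within k edits. The example strings are invented.

```python
def approx_search(text, pattern, k):
    """Sellers' dynamic programme: report all end positions j (1-based) such
    that some substring of text ending at j matches pattern within edit
    distance k. (A simple stand-in for the bit-parallel Agrep algorithm.)"""
    m = len(pattern)
    prev = list(range(m + 1))          # distances against the empty text
    hits = []
    for j, tc in enumerate(text, start=1):
        curr = [0] * (m + 1)           # a match may start anywhere in the text
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == tc else 1
            curr[i] = min(prev[i - 1] + cost,  # match / substitution
                          prev[i] + 1,         # insertion into the pattern
                          curr[i - 1] + 1)     # deletion from the pattern
        if curr[m] <= k:
            hits.append(j)
        prev = curr
    return hits

print(approx_search("ACCGTGATCGA", "GATC", k=1))   # includes the exact hit ending at 9
```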
In the fourth part, we focus on the seeding strategies for alternative splicing detection. We review the history of seeding-and-extending (SAE), and assess both theoretically and empirically the seeding strategies adopted in existing splicing detection tools, including Bowtie's heuristic seeding and ABMapper's exact seeding, against the novel complementary quad-seeding strategy we propose and the corresponding novel splice detection tool, CS4splice, which can handle inexact seeding (with errors) and all three types of errors: mismatch (substitution), insertion, and deletion. We carry out experiments using short reads (queries) of length 105 bp, comprising several data sets with various levels of errors, and align them back to a reference human genome (hg18). On average, CS4splice can align 88.44% (recall rate) of 427,786 short reads perfectly back to the reference, while the other existing tools achieve much smaller recall rates: SpliceMap 48.72%, MapSplice 58.41%, and ABMapper 51.39%. The accuracies of CS4splice are also the highest or very close to the highest in all the experiments carried out. However, due to the complementary quad-seeding that CS4splice uses, it requires more computational resources, about twice (or more) those of the other alternative splicing detection tools, which we consider practicable and worthwhile.
In the second part, we define a novel generalized pattern (query) and a framework of generalized pattern matching, for which we propose a heuristic matching algorithm. Simply speaking, a generalized pattern is Q1G1Q2...Qc-1Gc-1Qc, which consists of several substrings Qi with gaps Gi occurring in between consecutive substrings. The prototypes of the generalized pattern come from several real biological problems that can all be modeled as generalized pattern matching problems. Based on the well-known seeding-and-extending heuristic, we propose a dual-seeding strategy, with which we solve the matching problem effectively and efficiently. We also develop a specialized matching tool called Gpattern-match. We carry out experiments using 10,000 generalized patterns and search them in a reference human genome (hg18). Over 98.74% of them can be recovered from the reference. It takes 1-2 seconds on average to recover a pattern, and the memory peak is a little more than 1 GB.
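To make the generalized pattern Q1G1Q2...Gc-1Qc concrete, the sketch below naively scans a text for ordered substrings separated by bounded gaps. It only illustrates the pattern definition with invented inputs; the thesis's Gpattern-match uses a dual-seeding heuristic on a reference genome rather than this kind of linear scanning.

```python
def match_generalized(text, parts, gaps):
    """Greedily locate a generalized pattern Q1 G1 Q2 ... Gc-1 Qc in text.
    parts = [Q1, ..., Qc]; gaps = [(lo1, hi1), ...] gives the allowed gap
    length between consecutive parts. Returns the part start positions, or None."""
    pos = text.find(parts[0])
    while pos != -1:
        starts, cur_end, ok = [pos], pos + len(parts[0]), True
        for part, (lo, hi) in zip(parts[1:], gaps):
            hit = -1
            for gap in range(lo, hi + 1):          # try the smallest gap first
                if text.startswith(part, cur_end + gap):
                    hit = cur_end + gap
                    break
            if hit == -1:
                ok = False
                break
            starts.append(hit)
            cur_end = hit + len(part)
        if ok:
            return starts
        pos = text.find(parts[0], pos + 1)         # retry from the next anchor
    return None

print(match_generalized("TTACGTTTTGGCATTT", ["ACG", "GGC"], [(2, 6)]))  # [2, 9]
```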
In the third part, a natural extension of the second part, we model a real biological problem, alternative splicing detection, as a generalized pattern matching problem, and solve it using a proposed bi-directional seeding-and-extending algorithm. Unlike the other tools, which depend on third-party tools, our mapping tool, ABMapper, is not only stand-alone but also performs unbiased alignments. We carry out experiments using 427,786 real next-generation sequencing short reads (queries) and align them back to a reference human genome (hg18). ABMapper achieves 98.92% accuracy and a 98.17% recall rate, and is much better than the other state-of-the-art tools: SpliceMap achieves 94.28% accuracy and a 78.13% recall rate, while TopHat achieves 88.99% accuracy and a 76.33% recall rate. When the seed length is set to 12 in ABMapper, the whole searching and alignment process takes about 20 minutes, and the memory peak is a little more than 2 GB.
Ni, Bing.
Adviser: Kwong-Sak Leung.
Source: Dissertation Abstracts International, Volume: 73-06, Section: B, page: .
Thesis (Ph.D.)--Chinese University of Hong Kong, 2011.
Includes bibliographical references (leaves 151-161).
Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web.
Electronic reproduction. [Ann Arbor, MI] : ProQuest Information and Learning, [201-] System requirements: Adobe Acrobat Reader. Available via World Wide Web.
Electronic reproduction. Ann Arbor, MI : ProQuest Information and Learning Company, [200-] System requirements: Adobe Acrobat Reader. Available via World Wide Web.
Abstract also in Chinese.
APA, Harvard, Vancouver, ISO, and other styles
28

"Applications of evolutionary algorithms on biomedical systems." 2007. http://library.cuhk.edu.hk/record=b5893179.

Full text
Abstract:
Tse, Sui Man.
Thesis (M.Phil.)--Chinese University of Hong Kong, 2007.
Includes bibliographical references (leaves 95-104).
Abstracts in English and Chinese.
Abstract --- p.i
Acknowledgement --- p.v
Chapter 1 --- Introduction --- p.1
Chapter 1.1 --- Motivation --- p.1
Chapter 1.1.1 --- Basic Concepts and Definitions --- p.2
Chapter 1.2 --- Evolutionary Algorithms --- p.5
Chapter 1.2.1 --- Chromosome Encoding --- p.6
Chapter 1.2.2 --- Selection --- p.7
Chapter 1.2.3 --- Crossover --- p.9
Chapter 1.2.4 --- Mutation --- p.10
Chapter 1.2.5 --- Elitism --- p.11
Chapter 1.2.6 --- Niching --- p.11
Chapter 1.2.7 --- Population Manipulation --- p.13
Chapter 1.2.8 --- Building Blocks --- p.13
Chapter 1.2.9 --- Termination Conditions --- p.14
Chapter 1.2.10 --- Co-evolution --- p.14
Chapter 1.3 --- Local Search --- p.15
Chapter 1.4 --- Memetic Algorithms --- p.16
Chapter 1.5 --- Objective --- p.17
Chapter 1.6 --- Summary --- p.17
Chapter 2 --- Background --- p.18
Chapter 2.1 --- Multiple Drugs Tumor Chemotherapy --- p.18
Chapter 2.2 --- Bioinformatics --- p.22
Chapter 2.2.1 --- Basics of Bioinformatics --- p.24
Chapter 2.2.2 --- Applications on Biomedical Systems --- p.26
Chapter 3 --- A New Drug Administration Dynamic Model --- p.29
Chapter 3.1 --- Three Drugs Mathematical Model --- p.31
Chapter 3.1.1 --- Rate of Change of Different Subpopulations --- p.32
Chapter 3.1.2 --- Rate of Change of Different Drug Concentrations --- p.35
Chapter 3.1.3 --- Toxicity Effects --- p.35
Chapter 3.1.4 --- Summary --- p.36
Chapter 4 --- Memetic Algorithm - Iterative Dynamic Programming (MA-IDP) --- p.38
Chapter 4.1 --- Problem Formulation: Optimal Control Problem (OCP) for Multidrug Optimization --- p.38
Chapter 4.2 --- Proposed Memetic Optimization Algorithm --- p.40
Chapter 4.2.1 --- Iterative Dynamic Programming (IDP) --- p.40
Chapter 4.2.2 --- Adaptive Elitist-population-based Genetic Algorithm (AEGA) --- p.44
Chapter 4.2.3 --- Memetic Algorithm - Iterative Dynamic Programming (MA-IDP) --- p.50
Chapter 4.3 --- Summary --- p.56
Chapter 5 --- MA-IDP: Experiments and Results --- p.57
Chapter 5.1 --- Experiment Settings --- p.57
Chapter 5.2 --- Optimization Results --- p.61
Chapter 5.3 --- Extension to Other Multidrug Scheduling Models --- p.62
Chapter 5.4 --- Summary --- p.65
Chapter 6 --- DNA Sequencing by Hybridization (SBH) --- p.66
Chapter 6.1 --- Problem Formulation: Reconstructing a DNA Sequence from Hybridization Data --- p.70
Chapter 6.2 --- Proposed Memetic Optimization Algorithm --- p.71
Chapter 6.2.1 --- Chromosome Encoding --- p.71
Chapter 6.2.2 --- Fitness Function --- p.73
Chapter 6.2.3 --- Crossover --- p.74
Chapter 6.2.4 --- Hill Climbing Local Search for Sequencing by Hybridization --- p.76
Chapter 6.2.5 --- Elitism and Diversity --- p.79
Chapter 6.2.6 --- Outline of Algorithm: MA-HC-SBH --- p.81
Chapter 6.3 --- Summary --- p.82
Chapter 7 --- DNA Sequencing by Hybridization (SBH): Experiments and Results --- p.83
Chapter 7.1 --- Experiment Settings --- p.83
Chapter 7.2 --- Experiment Results --- p.85
Chapter 7.3 --- Summary --- p.89
Chapter 8 --- Conclusion --- p.90
Chapter 8.1 --- Multiple Drugs Cancer Chemotherapy Schedule Optimization --- p.90
Chapter 8.2 --- Use of the MA-IDP --- p.91
Chapter 8.3 --- DNA Sequencing by Hybridization (SBH) --- p.92
Chapter 8.4 --- Use of the MA-HC-SBH --- p.92
Chapter 8.5 --- Future Work --- p.93
Chapter 8.6 --- Item Learned --- p.93
Chapter 8.7 --- Papers Published --- p.94
Bibliography --- p.95
APA, Harvard, Vancouver, ISO, and other styles
29

Desai, Akshay A. "Data analysis and creation of epigenetics database." Thesis, 2014. http://hdl.handle.net/1805/4452.

Full text
Abstract:
Indiana University-Purdue University Indianapolis (IUPUI)
This thesis is aimed at creating a pipeline for analyzing DNA methylation epigenetics data, together with a data model structured well enough to store the pipeline's analysis results. In addition to storing the results, the model is designed to hold information that helps researchers draw meaningful epigenetic interpretations from the available results. Current major epigenetics resources such as PubMeth, MethyCancer, MethDB and NCBI's Epigenomics database fail to provide a holistic view of epigenetics. They provide datasets produced by different analysis techniques, which raises an important data-integration issue. These resources also omit numerous factors defining the epigenetic nature of a gene, and some struggle to keep their stored data up to date. This has diminished their validity and their coverage of epigenetics data. In this thesis we tackle a major branch of epigenetics: DNA methylation. As a case study to prove the effectiveness of our pipeline, we used stage-wise DNA methylation and expression raw data for lung adenocarcinoma (LUAD) from the TCGA data repository. The pipeline helped us identify progressive methylation patterns across different stages of LUAD. It also identified some key targets with potential as drug targets. Along with the results from the methylation data analysis pipeline, we combined data from various online resources such as the KEGG, GO, UCSC and BioGRID databases, which helped us overcome the shortcomings of existing data collections and present a resource that serves as a complete solution for studying DNA methylation epigenetics data.
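As a sketch of the kind of stage-wise analysis described above, the snippet below computes the mean methylation (beta value) per tumor stage for each probe and keeps probes whose methylation rises monotonically across stages, a simple stand-in for a "progressive methylation pattern" filter. The data layout, threshold, and names are invented for illustration; this is not the thesis pipeline.

def mean_beta_per_stage(beta_by_stage):
    # beta_by_stage: {stage: {probe_id: [beta values across samples]}}
    # Returns {probe_id: {stage: mean beta}}.
    means = {}
    for stage, probes in beta_by_stage.items():
        for probe, values in probes.items():
            means.setdefault(probe, {})[stage] = sum(values) / len(values)
    return means

def progressive_probes(means, stage_order, min_step=0.05):
    # Probes whose mean methylation increases by at least min_step at
    # every stage transition (a toy "progressive pattern" criterion).
    hits = []
    for probe, by_stage in means.items():
        levels = [by_stage.get(s) for s in stage_order]
        if None in levels:
            continue
        if all(b - a >= min_step for a, b in zip(levels, levels[1:])):
            hits.append(probe)
    return hits

# Toy example: two probes measured across three LUAD stages.
data = {
    "I":   {"cg0001": [0.20, 0.25], "cg0002": [0.60, 0.55]},
    "II":  {"cg0001": [0.35, 0.30], "cg0002": [0.58, 0.62]},
    "III": {"cg0001": [0.50, 0.45], "cg0002": [0.61, 0.59]},
}
print(progressive_probes(mean_beta_per_stage(data), ["I", "II", "III"]))   # ['cg0001']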
APA, Harvard, Vancouver, ISO, and other styles
30

"Theoretical investigation of cisplatin-deoxyribonucleic acid crosslink products using hybrid molecular dynamics + quantum mechanics method." 2009. http://library.cuhk.edu.hk/record=b5893997.

Full text
Abstract:
Yan, Changqing.
Thesis (M.Phil.)--Chinese University of Hong Kong, 2009.
Includes bibliographical references (leaves 92-97).
Abstracts in English and Chinese.
ABSTRACT (ENGLISH) --- p.iii
ABSTRACT (CHINESE) --- p.iv
ACKNOWLEDGMENTS --- p.v
LIST OF ABBREVIATIONS --- p.vi
TABLE OF CONTENTS --- p.vii
LIST OF FIGURES --- p.ix
LIST OF TABLES --- p.x
Chapter CHAPTER ONE: --- BACKGROUND INFORMATION --- p.1
Chapter 1.1 --- Introduction --- p.1
Chapter 1.2 --- Deoxyribonucleic Acid --- p.2
Chapter 1.3 --- DNA Studies --- p.9
Chapter 1.4 --- Cisplatin Studies --- p.11
Chapter 1.5 --- Scope of the Thesis --- p.13
Chapter CHAPTER TWO: --- METHODOLOGY AND COMPUTATION --- p.16
Chapter 2.1 --- Introduction --- p.16
Chapter 2.2 --- Molecular Dynamics Simulation --- p.16
Chapter 2.3 --- Quantum Mechanics Calculation --- p.23
Chapter 2.4 --- Verification of Methodology --- p.25
Chapter 2.4.1 --- Backbone Torsion Angles --- p.25
Chapter 2.4.2 --- N7-N7 Distance --- p.30
Chapter 2.4.3 --- Location of HOMO --- p.33
Chapter 2.5 --- Summary --- p.35
Chapter CHAPTER THREE: --- UNDERSTANDING OF THE CISPLATIN-DNA CROSSLINKS --- p.36
Chapter 3.1 --- Introduction --- p.36
Chapter 3.2 --- MO Analysis --- p.37
Chapter 3.3 --- Potential Binding Products with the Ligand --- p.37
Chapter 3.3.1 --- "1,2-d(GpG) Intrastrand Crosslink" --- p.43
Chapter 3.3.2 --- "1,2-d(ApG) Intrastrand Crosslink" --- p.43
Chapter 3.3.3 --- "1,3-d(GpXpG) Intrastrand Crosslink" --- p.44
Chapter 3.3.4 --- d(GpC)d(GpC) Interstrand Crosslink --- p.44
Chapter 3.3.5 --- d(GpXpC)d(GpXpC) Interstrand Crosslink --- p.44
Chapter 3.3.6 --- Summary --- p.45
Chapter 3.4 --- Potential Binding Products Analysis --- p.47
Chapter 3.4.1 --- Site Identification Convention --- p.47
Chapter 3.4.2 --- Potential Binding Products Analysis --- p.48
Chapter 3.4.3 --- Applications --- p.53
Chapter 3.5 --- Cisplatin-DNA Crosslink Products Analysis --- p.56
Chapter 3.5.1 --- "1,2-d(GpG) and 1,2-d(ApG) Intrastrand Crosslinks" --- p.61
Chapter 3.5.2 --- "1,3-d(GpXpG) Intrastrand and d(GpXpC)d(GpXpC) Interstrand Crosslinks" --- p.62
Chapter 3.5.3 --- d(GpC)d(GpC) Interstrand Crosslinks --- p.63
Chapter 3.5.4 --- Platination at Terminal Positions --- p.65
Chapter 3.6 --- Summary --- p.65
Chapter CHAPTER FOUR: --- CONCLUDING REMARKS --- p.67
APPENDIX I: BACKBONE TORSION ANGLES AND SUGAR RING CONFORMATIONS OF THE OPTIMIZED GEOMETRIES --- p.69
APPENDIX II: BACKBONE TORSION ANGLES OF THE EXPERIMENTAL SEQUENCES FROM NUCLEIC ACID DATABASE (NDB) --- p.77
REFERENCES --- p.92
APA, Harvard, Vancouver, ISO, and other styles
31

Zhao, Huiying. "Protein function prediction by integrating sequence, structure and binding affinity information." Thesis, 2014. http://hdl.handle.net/1805/3913.

Full text
Abstract:
Indiana University-Purdue University Indianapolis (IUPUI)
Proteins are nano-machines that work inside every living organism. Functional disruption of one or several proteins is the cause of many diseases. However, the functions of most proteins are yet to be annotated, because inexpensive sequencing techniques are dramatically speeding up the discovery of new protein sequences (265 million and counting), and experimental examination of every protein in all its possible functional categories is simply impractical. Thus, it is necessary to develop computational function-prediction tools that complement and guide experimental studies. In this study, we developed a series of predictors for highly accurate prediction of proteins with DNA-binding, RNA-binding and carbohydrate-binding capability. These predictors use a template-based technique that combines sequence and structural information with predicted binding affinity. Both sequence- and structure-based approaches were developed. Results indicate the importance of binding-affinity prediction for improving the sensitivity and precision of function prediction. Application of these methods to the human genome and structural genomics targets demonstrated their usefulness in annotating proteins of unknown function and discovering moonlighting proteins with DNA-, RNA-, or carbohydrate-binding function. In addition, we also investigated disruption of protein function by naturally occurring genetic variations due to insertions and deletions (INDELs). We found that protein structures are the most critical features in recognising disease-causing non-frameshifting INDELs. The predictors for function prediction are available at http://sparks-lab.org/spot, and the predictor for classification of non-frameshifting INDELs is available at http://sparks-lab.org/ddig.
APA, Harvard, Vancouver, ISO, and other styles
32

"DINA: a hybrid multicast-unicast fully interactive video-on-demand system." 2001. http://library.cuhk.edu.hk/record=b5890841.

Full text
Abstract:
by Ng Chi Ho.
Thesis (M.Phil.)--Chinese University of Hong Kong, 2001.
Includes bibliographical references (leaves 64-65).
Abstracts in English and Chinese.
ABSTRACT --- p.I
ACKNOWLEDGEMENT --- p.II
TABLE OF CONTENTS --- p.III
LIST OF TABLES --- p.VI
LIST OF FIGURES --- p.VII
LIST OF ABBREVIATIONS --- p.X
Chapter CHAPTER 1 --- INTRODUCTION --- p.1
Chapter 1.1 --- Overview --- p.1
Chapter 1.2 --- Related works --- p.5
Chapter 1.3 --- Organization of this Thesis --- p.6
Chapter CHAPTER 2 --- BACKGROUND --- p.7
Chapter 2.1 --- Introduction to VOD Systems --- p.7
Chapter 2.1.1 --- Pure unicast VOD System --- p.8
Chapter 2.1.2 --- Pure multicast VOD System --- p.9
Chapter 2.1.3 --- Centralized VOD System --- p.9
Chapter 2.1.4 --- Distributed VOD System --- p.10
Chapter 2.1.5 --- Hybrid VOD System (DINA) --- p.11
Chapter 2.1.6 --- Comparisons --- p.12
Chapter 2.2 --- Interactive Functions --- p.14
Chapter 2.2.1 --- Speedup --- p.14
Chapter 2.2.2 --- Split and merge (I and S streams) --- p.14
Chapter 2.2.3 --- Prerecord --- p.15
Chapter 2.3 --- Error Recovery --- p.16
Chapter 2.3.1 --- Pure FEC --- p.17
Chapter 2.3.2 --- Pure ARQ --- p.17
Chapter 2.3.3 --- Hybrid ARQ --- p.18
Chapter 2.3.4 --- Rate-Compatible Punctured Convolutional Codes --- p.18
Chapter CHAPTER 3 --- HYBRID MULTICAST-UNICAST VOD SYSTEM --- p.21
Chapter 3.1 --- System Overview --- p.21
Chapter 3.1.1 --- VSC (Video Server Cluster) --- p.22
Chapter 3.1.2 --- DIS (Distributed Interactive Server) --- p.24
Chapter 3.1.3 --- NAK (Negative Acknowledgement Server) --- p.25
Chapter 3.1.4 --- CS (Client Stations) --- p.26
Chapter 3.1.5 --- MBN (Multicast Backbone Network) --- p.27
Chapter 3.1.6 --- LDN (Local Distribution Network) --- p.27
Chapter 3.2 --- Interactive Functions --- p.28
Chapter 3.2.1 --- Hybrid Multicast- Unicast --- p.28
Chapter 3.2.2 --- Pause --- p.30
Chapter 3.2.3 --- Slow Forward (SF) --- p.33
Chapter 3.2.4 --- Slow Backward (SB) --- p.35
Chapter 3.2.5 --- Fast Forward (FF) / Fast Backward (FB) --- p.38
Chapter 3.2.6 --- Jump Forward (JF) / Jump Backward (JB) --- p.41
Chapter 3.3 --- System Performance --- p.46
Chapter 3.3.1 --- System Model --- p.46
Chapter 3.3.2 --- Simulation Results --- p.47
Chapter 3.3.3 --- Trade off --- p.53
Chapter CHAPTER 4 --- DISTRIBUTED TYPE-II HARQ --- p.54
Chapter 4.1 --- Algorithm Description --- p.54
Chapter 4.1.1 --- Design details --- p.54
Chapter 4.1.2 --- Simulation Results --- p.59
Chapter CHAPTER 5 --- CONCLUSION --- p.62
BIBLIOGRAPHY --- p.64
APA, Harvard, Vancouver, ISO, and other styles
33

(8771429), Ashley S. Dale. "3D OBJECT DETECTION USING VIRTUAL ENVIRONMENT ASSISTED DEEP NETWORK TRAINING." Thesis, 2021.

Find full text
Abstract:

An RGBZ synthetic dataset consisting of five object classes in a variety of virtual environments and orientations was combined with a small sample of real-world image data and used to train the Mask R-CNN (MR-CNN) architecture in a variety of configurations. When the MR-CNN architecture was initialized with MS COCO weights and the heads were trained with a mix of synthetic and real-world data, F1 scores improved in four of the five classes: the average maximum F1-score over all classes and all epochs for the networks trained with synthetic data is F1* = 0.91, compared to F1 = 0.89 for the networks trained exclusively with real data, and the standard deviation of the maximum mean F1-score for synthetically trained networks is σ*_F1 = 0.015, compared to σ_F1 = 0.020 for the networks trained exclusively with real data. Various backgrounds in synthetic data were shown to have negligible impact on F1 scores, opening the door to abstract backgrounds and minimizing the need for intensive synthetic data fabrication. When the MR-CNN architecture was initialized with MS COCO weights and depth data was included in the training data, the network was shown to rely heavily on the initial convolutional input to feed features into the network, the image depth channel was shown to influence mask generation, and the image color channels were shown to influence object classification. A set of latent variables for a subset of the synthetic dataset was generated with a Variational Autoencoder and then analyzed using Principal Component Analysis and Uniform Manifold Approximation and Projection (UMAP). The UMAP analysis showed no meaningful distinction between real-world and synthetic data, and a small bias towards clustering based on image background.

APA, Harvard, Vancouver, ISO, and other styles
34

(9034049), Miguel Villarreal-Vasquez. "Anomaly Detection and Security Deep Learning Methods Under Adversarial Situation." Thesis, 2020.

Find full text
Abstract:

Advances in Artificial Intelligence (AI), or more precisely in Neural Networks (NNs), and fast processing technologies (e.g. Graphics Processing Units, or GPUs) in recent years have positioned NNs as one of the main machine learning algorithms used to solve a diversity of problems in both academia and industry. While they have proved effective at solving many tasks, the lack of security guarantees and of understanding of their internal processing hinders their wide adoption in general and in cybersecurity-related applications. In this dissertation, we present the findings of a comprehensive study aimed at enabling the adoption of state-of-the-art NN algorithms in the development of enterprise solutions. Specifically, this dissertation focuses on (1) the development of defensive mechanisms to protect NNs against adversarial attacks and (2) the application of NN models for anomaly detection in enterprise networks.

In this state of affairs, this work makes the following contributions. First, we performed a thorough study of the different adversarial attacks against NNs. We concentrate on the attacks referred to as trojan attacks and introduce a novel model-hardening method that removes any trojan (i.e. misbehavior) inserted into NN models at training time. We carefully evaluate our method and establish the correct metrics for testing the efficiency of defensive methods against these types of attacks: (1) accuracy with benign data, (2) attack success rate, and (3) accuracy with adversarial data. Prior work evaluates its solutions using the first two metrics only, which do not suffice to guarantee robustness against untargeted attacks. Our method is compared with the state of the art; the results show that it outperforms prior methods. Second, we propose a novel approach to detecting anomalies using LSTM-based models. Our method analyzes at runtime the event sequences generated by the Endpoint Detection and Response (EDR) system of a renowned security company and efficiently detects uncommon patterns. The new detection method is compared with the EDR system; the results show that our method achieves a higher detection rate. Finally, we present a Moving Target Defense technique that reacts to detected anomalies in order to also mitigate the detected attacks. The technique efficiently replaces the entire stack of virtual nodes, making ongoing attacks on the system ineffective.
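To illustrate the LSTM-based anomaly detection idea in the second contribution, the sketch below trains a next-event predictor over integer event IDs and scores each observed event by the probability the model assigns to it, so that low-probability events can be flagged as uncommon patterns. It is a minimal PyTorch sketch under invented names, data, and threshold, not the dissertation's implementation.

import torch
import torch.nn as nn

class NextEventLSTM(nn.Module):
    def __init__(self, num_events, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(num_events, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_events)

    def forward(self, seqs):                  # seqs: (batch, seq_len) of event IDs
        h, _ = self.lstm(self.embed(seqs))    # (batch, seq_len, hidden_dim)
        return self.out(h)                    # logits over the next event

def anomaly_scores(model, seqs):
    # Probability the model assigns to each observed next event; low = anomalous.
    model.eval()
    with torch.no_grad():
        probs = torch.softmax(model(seqs[:, :-1]), dim=-1)
        observed = seqs[:, 1:].unsqueeze(-1)
        return probs.gather(-1, observed).squeeze(-1)

# Toy usage: train on "normal" sequences, then flag unlikely events.
num_events = 50
model = NextEventLSTM(num_events)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
normal = torch.randint(0, num_events, (64, 20))       # stand-in for EDR event logs
for _ in range(5):                                    # a few illustrative epochs
    logits = model(normal[:, :-1])
    loss = loss_fn(logits.reshape(-1, num_events), normal[:, 1:].reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
flagged = anomaly_scores(model, normal[:2]) < 0.01    # threshold is arbitrary here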

APA, Harvard, Vancouver, ISO, and other styles
