
Dissertations / Theses on the topic 'Multivariate data analysis'



Consult the top 50 dissertations / theses for your research on the topic 'Multivariate data analysis.'




1

Oliveira, Irene. "Correlated data in multivariate analysis." Thesis, University of Aberdeen, 2003. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.401414.

Abstract:
After presenting Principal Component Analysis (PCA) and its relationship with time series data sets, we describe most of the existing techniques in this field. Various techniques, e.g. Singular Spectrum Analysis (SSA), Hilbert EOF, Extended EOF or Multichannel Singular Spectrum Analysis (MSSA), and Principal Oscillation Pattern (POP) Analysis, can be used for such data. The way each method uses the data matrix, or the covariance or correlation matrix, distinguishes it from the others. SSA may be considered a PCA performed on lagged versions of a single time series, by which we may decompose the original series into its main components. Following SSA we have its multivariate version (MSSA), where the initial data matrix is augmented with lagged versions of each variable (time series), so that past (or future) behaviour can be used to reanalyse the information between variables. In POP Analysis a linear system involving the vector field, x_{t+1} = A x_t + n_t, is analysed in order to "know" x at time t+1 given the information from time t. The matrix A is estimated by using not only the covariance matrix but also the matrix of covariances between the system at the current time and at lag 1. In Hilbert EOF we try to get some (future) information from the internal correlation in each variable by using the Hilbert transform of each series in an augmented complex matrix, with the data themselves in the real part and the Hilbert time series in the imaginary part, X_t + i X_t^H. In addition to these ideas from the statistics and other literature, we develop a new methodology as a modification of HEOF and POP Analysis, namely Hilbert Oscillation Patterns (HOP) Analysis, and the related idea of Hilbert Canonical Correlation Analysis (HCCA), by using the system x_t^H = A x_t + n_t.
Theory and assumptions are presented, and HOP results are related to those extracted from a Canonical Correlation Analysis between the time series data matrix and its Hilbert transform. Examples are given to show the differences and similarities between the results of the HCCA technique and those from PCA, MSSA, HEOF and POPs. We also present PCA for time series as observations, where a technique of linear algebra (PCA) becomes a problem in functional analysis, leading to Functional PCA (FPCA). We further adapt PCA to exploit symmetry, discussing the theoretical and practical behaviour of PCA on the even part (EPCA) and odd part (OPCA) of the data, and its application to functional data. Comparisons are made between PCA and this modification for the reconstruction of data sets in which considerations of symmetry are especially relevant.
2

Prelorendjos, Alexios. "Multivariate analysis of metabonomic data." Thesis, University of Strathclyde, 2014. http://oleg.lib.strath.ac.uk:80/R/?func=dbin-jump-full&object_id=24286.

Abstract:
Metabonomics is one of the main technologies used in biomedical sciences to improve understanding of how various biological processes of living organisms work. It is considered a more advanced technology than e.g. genomics and proteomics, as it can provide important evidence of molecular biomarkers for the diagnosis of diseases and the evaluation of beneficial and adverse drug effects by studying the metabolic profiles of living organisms. This is achievable by studying samples of various types, such as tissues and biofluids. The findings of a metabonomics study for a specific disease, disorder or drug effect could be applied to other diseases, disorders or drugs, making metabonomics an important tool for biomedical research. This thesis aims to review and study various multivariate statistical techniques which can be used in the exploratory analysis of metabonomics data. To motivate this research, a metabonomics data set containing the metabolic profiles of a group of patients with epilepsy was used. More specifically, the metabolic fingerprints (proton NMR spectra of blood serum) of 125 patients with epilepsy were obtained from the Western Infirmary, Glasgow, for the purposes of this project. These data were originally collected as baseline data in a study to investigate whether treatment with Anti-Epileptic Drugs (AEDs) affects the seizure levels of patients with pharmacoresistant epilepsy. The response to drug treatment, in terms of the reduction in seizure levels, enabled two main categories of response to be identified: responders and non-responders to AEDs. We explore the use of statistical methods used in metabonomics to analyse these data. Novel aspects of the thesis are the application of Self Organising Maps (SOM) and of Fuzzy Clustering Methods to pattern recognition in metabonomics data.
Part I of the thesis defines metabonomics and the other main "omics" technologies, and gives a detailed description of the metabonomics data to be analysed, as well as a description of the two main analytical chemical techniques, Mass Spectrometry (MS) and Nuclear Magnetic Resonance (NMR) spectroscopy, that can be used to generate metabonomics data. Pre-processing and pre-treatment methods that are commonly used on NMR-generated metabonomics data to enhance their quality and accuracy are also discussed. In Part II, several unsupervised statistical techniques are reviewed and applied to the epilepsy data to investigate the capability of these techniques to discriminate the patients according to their type of response. The techniques reviewed include Principal Components Analysis (PCA), multi-dimensional scaling (both classical scaling and Sammon's non-linear mapping) and clustering techniques. The latter include hierarchical clustering (with emphasis on agglomerative nesting algorithms), partitioning methods (fuzzy and hard clustering algorithms) and competitive learning algorithms (Self Organizing Maps). The advantages and disadvantages of the different methods are examined for this kind of data. Results of the exploratory multivariate analyses showed that no natural clusters of patients existed with regard to their response to AEDs; therefore, none of these techniques was capable of discriminating these patients according to their clinical characteristics. To examine the capability of an unsupervised technique such as PCA to identify groups in data such as these metabolic fingerprints of patients with epilepsy, a simulation algorithm was developed to run a series of experiments, covered in Part III of the thesis. The aim of the simulation study is to investigate the extent of the difference between clusters in the data, and under what conditions this difference is detectable by unsupervised techniques.
Furthermore, the study examines whether the existence or lack of variation in the mean-shifted variables affects the discriminating ability of the unsupervised techniques (in this case PCA). In each simulation experiment, a reference and a test data set were generated based on the original epilepsy data, and the discriminating capability of PCA was assessed. A test set was generated by mean-shifting a pre-selected number of variables in a reference set. Three methods of selecting the variables to mean-shift (maximum and minimum standard deviations and maximum means), five subsets of variables of sizes 1, 3, 20, 120 and 244 (the total number of variables in the data sets) and three sample sizes (100, 500 and 1000) were used. Average values over 100 runs of an experiment for two statistics, the misclassification rate and the average separation (Webb, 2002), were recorded. Results showed that the number of mean-shifted variables (in general) and the method used to select the variables (in some cases) are important factors for the discriminating ability of PCA, whereas the sample size of the two data sets does not play any role (although experiments with large sample sizes showed greater stability in the two statistics over the 100 runs of any experiment). The results have implications for the use of PCA with metabonomics data generally.
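The simulation design sketched in this abstract can be illustrated roughly as follows (our own simplified reconstruction on synthetic Gaussian data, not the thesis code or the epilepsy data): mean-shift the variables with the largest standard deviations in a test set, project reference and test sets onto the first principal components, and score the separation with a misclassification rate.

```python
import numpy as np

rng = np.random.default_rng(42)

def mean_shift_experiment(n_samples=500, n_vars=50, n_shift=20, shift=1.5):
    """One run: draw a reference and a test set from the same distribution,
    mean-shift the n_shift variables with maximum standard deviation in the
    test set, project onto the first two PCs, and return the misclassification
    rate of a nearest-centroid rule in PC space."""
    sd = rng.uniform(0.5, 2.0, n_vars)           # per-variable spread
    ref = rng.normal(0.0, sd, (n_samples, n_vars))
    test = rng.normal(0.0, sd, (n_samples, n_vars))
    chosen = np.argsort(sd)[-n_shift:]           # "maximum std" selection method
    test[:, chosen] += shift * sd[chosen]
    X = np.vstack([ref, test])
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:2].T                       # first two PC scores
    labels = np.repeat([0, 1], n_samples)
    c0 = scores[labels == 0].mean(axis=0)
    c1 = scores[labels == 1].mean(axis=0)
    pred = (np.linalg.norm(scores - c1, axis=1)
            < np.linalg.norm(scores - c0, axis=1)).astype(int)
    return float(np.mean(pred != labels))

rate = mean_shift_experiment()
print(f"misclassification rate: {rate:.3f}")
```

With 20 strongly shifted variables the two groups separate cleanly along the first PC; shrinking `n_shift` or `shift` degrades the rate, mirroring the finding that the number of mean-shifted variables drives PCA's discriminating ability.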
3

Yang, Di. "Analysis guided visual exploration of multivariate data." Worcester, Mass. : Worcester Polytechnic Institute, 2007. http://www.wpi.edu/Pubs/ETD/Available/etd-050407-005925/.

4

Lans, Ivo A. van der. "Nonlinear multivariate analysis for multiattribute preference data." [Leiden] : DSWO Press, Leiden University, 1992. http://catalog.hathitrust.org/api/volumes/oclc/28733326.html.

5

Zhu, Liang. "Semiparametric analysis of multivariate longitudinal data." Diss., Columbia, Mo. : University of Missouri-Columbia, 2008. http://hdl.handle.net/10355/6044.

Abstract:
Thesis (Ph. D.)--University of Missouri-Columbia, 2008.
The entire dissertation/thesis text is included in the research.pdf file; the official abstract appears in the short.pdf file (which also appears in research.pdf); a non-technical general description, or public abstract, appears in the public.pdf file. Title from title screen of research.pdf file (viewed on August 3, 2009). Vita. Includes bibliographical references.
6

Tavares, Nuno Filipe Ramalho da Cunha. "Multivariate analysis applied to clinical analysis data." Master's thesis, Faculdade de Ciências e Tecnologia, 2014. http://hdl.handle.net/10362/12288.

Abstract:
Dissertation for the degree of Master in Industrial Engineering and Management
Folate, vitamin B12, iron and hemoglobin are essential for metabolic functions in the body. Deficiencies of these can cause several known pathologies and, untreated, can be responsible for severe morbidity and even death. The objective of this study is to characterize a population residing in the metropolitan area of Lisbon and Setubal concerning serum levels of folate, vitamin B12, iron and hemoglobin, as well as to find evidence of correlations between these parameters and illnesses, mainly cardiovascular, gastrointestinal, neurological and anemia. Clinical analysis data were collected and submitted to multivariate analysis. First the data were screened with Spearman correlation and Kruskal-Wallis analysis of variance to study correlations and variability between groups. To characterize the population, we used cluster analysis with Ward's linkage method. Finally, a sensitivity analysis was performed to strengthen the results. Spearman correlation showed positive correlations of iron with ferritin, transferrin and hemoglobin. The Kruskal-Wallis test showed significant differences in these biomarkers between persons aged 0 to 29, 30 to 59 and over 60 years old. Cluster analysis proved to be a useful tool for characterizing a population based on its biomarkers, showing evidence of low folate levels for the population in general, and hemoglobin levels below the reference values. Iron and vitamin B12 were within the reference range for most of the population. Low levels of these parameters were registered mainly in patients with cardiovascular, gastrointestinal and neurological diseases and anemia.
7

Rehman, Naveed Ur. "Data-driven time-frequency analysis of multivariate data." Thesis, Imperial College London, 2011. http://hdl.handle.net/10044/1/9116.

Abstract:
Empirical Mode Decomposition (EMD) is a data-driven method for the decomposition and time-frequency analysis of real-world nonstationary signals. Its main advantages over other time-frequency methods are its locality, data-driven nature, multiresolution-based decomposition, higher time-frequency resolution and its ability to capture oscillations of any type (nonharmonic signals). These properties have made EMD a viable tool for real-world nonstationary data analysis. Recent advances in sensor and data acquisition technologies have brought to light new classes of signals typically containing several data channels. Currently, such signals are almost invariably processed channel-wise, which is suboptimal. It is, therefore, imperative to design multivariate extensions of the existing nonlinear and nonstationary analysis algorithms, as they are expected to give more insight into the dynamics and the interdependence between multiple channels of such signals. To this end, this thesis presents multivariate extensions of the empirical mode decomposition algorithm and illustrates their advantages with regard to multivariate nonstationary data analysis. Some important properties of such extensions are also explored, including their ability to exhibit wavelet-like dyadic filter bank structures for white Gaussian noise (WGN), and their capacity to align similar oscillatory modes from multiple data channels. Owing to the generality of the proposed methods, an improved multivariate EMD-based algorithm is introduced which solves some inherent problems in the original EMD algorithm. Finally, to demonstrate the potential of the proposed methods, simulations on the fusion of multiple real-world signals (wind, images and inertial body motion data) support the analysis.
8

Droop, Alastair Philip. "Correlation Analysis of Multivariate Biological Data." Thesis, University of York, 2009. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.507622.

9

Collins, Gary Stephen. "Multivariate analysis of flow cytometry data." Thesis, University of Exeter, 2000. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.324749.

10

Haydock, Richard. "Multivariate analysis of Raman spectroscopy data." Thesis, University of Nottingham, 2015. http://eprints.nottingham.ac.uk/30697/.

Abstract:
This thesis is concerned with developing techniques for analysing Raman spectroscopic images. A Raman spectroscopic image differs from a standard image in that, in place of red, green and blue quantities, each pixel contains a spectrum of light intensities. These spectra are used to identify the chemical components from which the image subject, for example a tablet, is comprised. The study of these types of images is known as chemometrics, with the majority of chemometric methods based on multivariate statistical and image analysis techniques. The work in this thesis has two main foci. The first of these is the spectral decomposition of a Raman image, the purpose of which is to identify the component chemicals and their concentrations. The standard method for this is to fit a bilinear model to the image, where both parts of the model, representing components and concentrations, must be estimated. As the standard bilinear model is non-identifiable in its solutions, we investigate the range of possible solutions in the solution space with a random walk. We also derive an improved model for spectral decomposition, combining cluster analysis techniques and the standard bilinear model. For this purpose we apply the expectation-maximisation algorithm to a Gaussian mixture model with bilinear means to represent our spectra and concentrations. This reduces noise in the estimated chemical components by separating the Raman image subject from the background. The second focus of this thesis is the analysis of our spectral decomposition results. For testing whether the chemical components are uniformly mixed, we derive test statistics for identifying patterns in the image based on Minkowski measures, grey-level co-occurrence matrices and neighbouring-pixel correlations. However, with a non-identifiable model, any hypothesis tests performed on the solutions will be specific to that solution alone.
Therefore, to obtain conclusions for a range of solutions, we combined our test statistics with our random walk. We also investigate the analysis of a time series of Raman images as the subject dissolved. Using models composed of Gaussian cumulative distribution functions, we are able to estimate the changes in concentration levels of dissolving tablets between scan times; the results allowed us to describe the dissolution process in terms of the quantities of the component chemicals.
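The dissolution modelling described here can be sketched as a Gaussian cumulative distribution function fitted to concentration estimates over scan times (a hypothetical illustration on synthetic data; the parameter names and the `scipy` fit are our own choices, not the thesis code):

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.special import erf

def remaining_concentration(t, t0, sigma, c0):
    """Remaining concentration of one component: the dissolved fraction by
    time t follows a Gaussian CDF with midpoint t0 and width sigma."""
    phi = 0.5 * (1.0 + erf((t - t0) / (sigma * np.sqrt(2.0))))
    return c0 * (1.0 - phi)

# Synthetic scan times and noisy concentration estimates for one component.
rng = np.random.default_rng(1)
t = np.linspace(0.0, 60.0, 30)
obs = remaining_concentration(t, t0=25.0, sigma=8.0, c0=1.0)
obs = obs + rng.normal(0.0, 0.01, t.size)

# Recover the midpoint, width and initial concentration from the noisy curve.
popt, _ = curve_fit(remaining_concentration, t, obs, p0=[30.0, 5.0, 1.0])
print(np.round(popt, 1))
```

The fitted midpoint `t0` gives the time at which half of the component has dissolved, which is one natural way to summarise dissolution between scans.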
11

Lee, Yau-wing. "Modelling multivariate survival data using semiparametric models." Click to view the E-thesis via HKUTO, 2000. http://sunzi.lib.hku.hk/hkuto/record/B4257528X.

12

Tardif, Geneviève. "Multivariate Analysis of Canadian Water Quality Data." Thesis, Université d'Ottawa / University of Ottawa, 2015. http://hdl.handle.net/10393/32245.

Abstract:
Physical-chemical water quality data from lotic water monitoring sites across Canada were integrated into one dataset. Two overlapping matrices of data were analyzed with principal component analysis (PCA) and cluster analysis to uncover structure and patterns in the data. The first matrix (Matrix A) had 107 sites located throughout Canada and the following water quality parameters: pH, specific conductance (SC), and total phosphorus (TP). The second matrix (Matrix B) included more variables: calcium (Ca), chloride (Cl), total alkalinity (T_ALK), dissolved oxygen (DO), water temperature (WT), pH, SC and TP, for a subset of 42 sites. Landscape characteristics were calculated for each water quality monitoring site, and their importance in explaining water quality data was examined through redundancy analysis. The first principal components in the analyses of Matrices A and B were most correlated with SC, suggesting this parameter is the most representative of water quality variance at the scale of Canada. Overlaying cluster analysis results on PCA information proved an excellent means of identifying the major water characteristics defining each group; mapping cluster analysis group membership provided information on their spatial distribution and was found informative with regard to the probable environmental influences on each group. Redundancy analyses produced significant predictive models of water quality, demonstrating that landscape characteristics are determinant factors in water quality at the country scale. The proportion of cropland and the mean annual total precipitation in the drainage area were the landscape variables with the most variance explained. Assembling a consistent dataset of water quality data from monitoring locations throughout Canada proved difficult due to the unevenness of the monitoring programs in place. It is therefore recommended that a standard for the monitoring of a minimum core set of water quality variables be implemented throughout the country to support future nation-wide analyses of water quality data.
13

Snavely, Anna Catherine. "Multivariate Data Analysis with Applications to Cancer." Thesis, Harvard University, 2012. http://dissertations.umi.com/gsas.harvard:10371.

Abstract:
Multivariate data is common in a wide range of settings. As data structures become increasingly complex, additional statistical tools are required to perform proper analyses. In this dissertation we develop and evaluate methods for the analysis of multivariate data generated from cancer trials. In the first chapter we consider the analysis of clustered survival data that can arise from multicenter clinical trials. In particular, we review and compare marginal and conditional models numerically through simulations and discuss model selection techniques. A multicenter clinical trial of children with acute lymphoblastic leukemia is used to illustrate the findings. The second and third chapters both address the setting where multiple outcomes are collected when the outcome of interest cannot be measured directly. A head and neck cancer trial in which multiple outcomes were collected to measure dysphagia was the particular motivation for this part of the dissertation. Specifically, in the second chapter we propose a semiparametric latent variable transformation model that incorporates measurable outcomes of mixed types, including censored outcomes. This method extends traditional approaches by allowing the relationship between the measurable outcomes and latent variable to be unspecified, rendering more robust inference. Using this approach we can directly estimate the treatment (or other covariate) effect on the unobserved latent variable, enhancing interpretation. In the third chapter, the basic model from the second chapter is maintained, but additional parametric assumptions are made. This model still has the advantages of allowing for censored measurable outcomes and being able to estimate a treatment effect on the latent variable, but has the added advantage of good performance in a small data set. Together the methods proposed in the second and third chapters provide a comprehensive approach for the analysis of complex multiple outcomes data.
14

Bolton, Richard John. "Multivariate analysis of multiproduct market research data." Thesis, University of Exeter, 2000. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.302542.

15

Durif, Ghislain. "Multivariate analysis of high-throughput sequencing data." Thesis, Lyon, 2016. http://www.theses.fr/2016LYSE1334/document.

Abstract:
The statistical analysis of Next-Generation Sequencing (NGS) data raises many computational challenges regarding modeling and inference, especially because of the high dimensionality of genomic data. The research work in this manuscript concerns hybrid dimension-reduction methods that rely on both compression (representation of the data in a lower-dimensional space) and variable selection. Developments are made concerning the sparse Partial Least Squares (PLS) regression framework for supervised classification, and the sparse matrix factorization framework for unsupervised exploration. In both situations, our main purpose is the reconstruction and visualization of the data. First, we present a new sparse PLS approach, based on an adaptive sparsity-inducing penalty, that is suitable for logistic regression to predict the label of a discrete outcome; such a method is used for prediction (fate of patients or specific type of unidentified single cells) based on gene expression profiles. The main issue in such a framework is to account for the response when discarding irrelevant variables. We highlight the direct link between the derivation of the algorithms and the reliability of the results. Then, motivated by questions regarding single-cell data analysis, we propose a flexible model-based approach for the factorization of count matrices that accounts for over-dispersion as well as zero-inflation (both characteristic of single-cell data), for which we derive an estimation procedure based on variational inference. In this scheme, we consider probabilistic variable selection based on a spike-and-slab model suitable for count data. The interest of our procedure for data reconstruction, visualization and clustering is illustrated by simulation experiments and by preliminary results on single-cell data analysis. All proposed methods are implemented in two R packages, "plsgenomics" and "CMF", based on high-performance computing.
16

李友榮 and Yau-wing Lee. "Modelling multivariate survival data using semiparametric models." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2000. http://hub.hku.hk/bib/B4257528X.

17

Zhou, Feifei, and 周飞飞. "Cure models for univariate and multivariate survival data." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2011. http://hub.hku.hk/bib/B45700977.

18

Bergfors, Linus. "Explorative Multivariate Data Analysis of the Klinthagen Limestone Quarry Data." Thesis, Uppsala University, Department of Information Technology, 2010. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-122575.

Abstract:
Today's quarry planning at Klinthagen is rough, which provides an opportunity to introduce exciting new methods to improve quarry gain and efficiency. Nordkalk AB, active at Klinthagen, wishes to open a new quarry at a nearby location. To exploit future quarries in an efficient manner and ensure production quality, multivariate statistics may help gather important information.

In this thesis the possibilities of the multivariate statistical approaches of Principal Component Analysis (PCA) and Partial Least Squares (PLS) regression were evaluated on the Klinthagen bore data. PCA data were spatially interpolated by kriging, which was also evaluated and compared to inverse distance weighting (IDW) interpolation.

Principal component analysis supplied an overview of the relations among the variables, but also visualised the problems involved in linking geophysical data to geochemical data, and the inaccuracy introduced by poor data quality.

The PLS regression further emphasised the geochemical-geophysical problems, but also showed good precision when applied to strictly geochemical data.

Spatial interpolation by Kriging did not result in significantly better approximations than the less complex control interpolation by IDW.

In order to improve the information content of the data when modelled by PCA, a more discrete sampling method would be advisable. Data quality may cause trouble, though with today's sampling technique it was considered to be of minor consequence.

To predict a single geophysical component from chemical variables, further geophysical data are needed to complement the existing data and achieve satisfactory PLS models.

The stratified rock composition caused trouble when spatially interpolated. Further investigations should be performed to develop more suitable interpolation techniques.
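For reference, the IDW control interpolation mentioned in this abstract can be sketched in a few lines (a generic textbook formulation, not the thesis implementation): each unsampled point receives a weighted average of the observations, with weights proportional to inverse distance raised to a power.

```python
import numpy as np

def idw(points, values, targets, power=2.0):
    """Inverse distance weighted interpolation of scattered 2-D observations."""
    # Pairwise distances between target locations and sampled points.
    d = np.linalg.norm(targets[:, None, :] - points[None, :, :], axis=2)
    d = np.maximum(d, 1e-12)            # guard against division by zero
    w = 1.0 / d ** power
    return (w @ values) / w.sum(axis=1)

pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
vals = np.array([1.0, 2.0, 3.0, 4.0])
grid = np.array([[0.5, 0.5], [0.0, 0.0]])
est = idw(pts, vals, grid)
print(est)
```

The centre point averages all four observations, while a target coinciding with a sampled point returns that point's own value; unlike kriging, IDW ignores the spatial covariance structure, which is why the thesis uses it as the less complex control.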

19

Ehlers, Rene. "Maximum likelihood estimation procedures for categorical data." Pretoria : [s.n.], 2002. http://upetd.up.ac.za/thesis/available/etd-07222005-124541.

20

Hopkins, Julie Anne. "Sampling designs for exploratory multivariate analysis." Thesis, University of Sheffield, 2000. http://etheses.whiterose.ac.uk/14798/.

Abstract:
This thesis is concerned with problems of variable selection, influence of sample size and related issues in the applications of various techniques of exploratory multivariate analysis (in particular, correspondence analysis, biplots and canonical correspondence analysis) to archaeology and ecology. Data sets (both published and new) are used to illustrate these methods and to highlight the problems that arise - these practical examples are returned to throughout as the various issues are discussed. Much of the motivation for the development of the methodology has been driven by the needs of the archaeologists providing the data, who were consulted extensively during the study. The first (introductory) chapter includes a detailed description of the data sets examined and the archaeological background to their collection. Chapters Two, Three and Four explain in detail the mathematical theory behind the three techniques. Their uses are illustrated on the various examples of interest, raising data-driven questions which become the focus of the later chapters. The main objectives are to investigate the influence of various design quantities on the inferences made from such multivariate techniques. Quantities such as the sample size (e.g. number of artefacts collected), the number of categories of classification (e.g. of sites, wares, contexts) and the number of variables measured compete for fixed resources in archaeological and ecological applications. Methods of variable selection and the assessment of the stability of the results are further issues of interest and are investigated using bootstrapping and procrustes analysis. Jack-knife methods are used to detect influential sites, wares, contexts, species and artefacts. Some existing methods of investigating issues such as those raised above are applied and extended to correspondence analysis in Chapters Five and Six. 
Adaptations of these methods are proposed for biplots in Chapters Seven and Eight and for canonical correspondence analysis in Chapter Nine. Chapter Ten concludes the thesis.
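The procrustes-based stability assessment mentioned in this abstract can be sketched as follows (a minimal illustration on synthetic ordination scores using `scipy.spatial.procrustes`; not the thesis code): two configurations of the same sites are superimposed by translation, scaling and rotation, and the remaining disparity measures how much the ordination changed.

```python
import numpy as np
from scipy.spatial import procrustes

rng = np.random.default_rng(0)
sites = rng.normal(size=(30, 2))                              # ordination scores for 30 sites
perturbed = sites + rng.normal(scale=0.05, size=sites.shape)  # scores after perturbing the data

# procrustes standardises both configurations and optimally superimposes them;
# disparity is the residual sum of squared differences (0 means identical shapes).
_, _, disparity = procrustes(sites, perturbed)
print(f"procrustes disparity: {disparity:.4f}")
```

A small disparity under perturbation or resampling indicates a stable ordination; influential sites can be flagged by the jack-knife idea in the abstract, i.e. recomputing the disparity with each site left out in turn.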
21

Lawal, Najib. "Modelling and multivariate data analysis of agricultural systems." Thesis, University of Manchester, 2015. https://www.research.manchester.ac.uk/portal/en/theses/modelling-and-multivariate-data-analysis-of-agricultural-systems(f6b86e69-5cff-4ffb-a696-418662ecd694).html.

Full text
Abstract:
The broader research area investigated during this programme was conceived from a goal to contribute towards solving the challenge of food security in the 21st century through the reduction of crop loss and minimisation of fungicide use. This is to be achieved through the introduction of an empirical approach to agricultural disease monitoring. In line with this, the SYIELD project, initiated by a consortium involving University of Manchester and Syngenta, among others, proposed a novel biosensor design that can electrochemically detect viable airborne pathogens by exploiting the biology of plant-pathogen interaction. This approach offers improvement on the inefficient and largely experimental methods currently used. Within this context, this PhD focused on the adoption of multidisciplinary methods to address three key objectives that are central to the success of the SYIELD project: local spore ingress near canopies, the evaluation of a suitable model that can describe spore transport, and multivariate analysis of the potential monitoring network built from these biosensors. The local transport of spores was first investigated by carrying out a field trial experiment at Rothamsted Research UK in order to investigate spore ingress in OSR canopies, generate reliable data for testing the prototype biosensor, and evaluate a trajectory model. During the experiment, spores were air-sampled and quantified using established manual detection methods. Results showed that the manual methods, such as colourimetric detection, are more sensitive than the proposed biosensor, suggesting the proxy measurement mechanism used by the biosensor may not be reliable in live deployments where spores are likely to be contaminated by impurities and other inhibitors of oxalic acid production. Spores quantified using the more reliable quantitative Polymerase Chain Reaction proved informative and provided novel data of high experimental value. 
The dispersal of this data was found to fit a power decay law, a finding that is consistent with experiments in other crops. In the second area investigated, a 3D backward Lagrangian Stochastic model was parameterised and evaluated with the field trial data. The bLS model, parameterised with Monin-Obukhov Similarity Theory (MOST) variables, showed good agreement with experimental data and compared favourably in terms of performance statistics with a recent application of an LS model in a maize canopy. Results obtained from the model were found to be more accurate above the canopy than below it. This was attributed to a higher error during initialisation of release velocities below the canopy. Overall, the bLS model performed well and demonstrated suitability for adoption in estimating above-canopy spore concentration profiles which can further be used for designing efficient deployment strategies. The final area of focus was the monitoring of a potential biosensor network. A novel framework based on Multivariate Statistical Process Control concepts was proposed and applied to data from a pollution-monitoring network. The main limitation of traditional MSPC in spatial data applications was identified as a lack of spatial awareness by the PCA model when considering correlation breakdowns caused by an incoming erroneous observation. This resulted in misclassification of healthy measurements as erroneous. The proposed Kriging-augmented MSPC approach was able to incorporate this capability and significantly reduce the number of false alarms.
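As a rough sketch of the conventional PCA-based MSPC baseline that this thesis builds on (the Kriging augmentation is not reproduced here, and the data below are invented), a PCA model fitted to in-control observations yields the two classic monitoring statistics: Hotelling's T² on the scores and the squared prediction error (SPE, also called Q) on the residuals:

```python
import numpy as np

def fit_pca_monitor(X, n_components):
    """Fit a PCA model on in-control data; return the pieces needed
    for T^2 / SPE monitoring of new observations."""
    mu = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
    P = Vt[:n_components].T                     # retained loadings
    var = s[:n_components] ** 2 / (len(X) - 1)  # score variances
    return mu, P, var

def monitor(x, mu, P, var):
    """Hotelling's T^2 and SPE for a single new observation."""
    t = P.T @ (x - mu)                 # scores within the model plane
    t2 = float(np.sum(t ** 2 / var))   # Hotelling's T^2
    resid = (x - mu) - P @ t           # part the model cannot explain
    spe = float(resid @ resid)         # squared prediction error (Q)
    return t2, spe
```

An observation that breaks the learned correlation structure inflates the SPE even when each individual reading looks healthy, which is the kind of correlation breakdown the abstract describes.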
APA, Harvard, Vancouver, ISO, and other styles
22

Cai, Jianwen. "Generalized estimating equations for censored multivariate failure time data /." Thesis, Connect to this title online; UW restricted, 1992. http://hdl.handle.net/1773/9581.

Full text
APA, Harvard, Vancouver, ISO, and other styles
23

Billah, Baki. "The analysis of multivariate incomplete failure time data." Thesis, National Library of Canada = Bibliothèque nationale du Canada, 1995. http://www.collectionscanada.ca/obj/s4/f2/dsk2/ftp04/mq25823.pdf.

Full text
APA, Harvard, Vancouver, ISO, and other styles
24

Rawizza, Mark Alan. "Time-series analysis of multivariate manufacturing data sets." Thesis, Massachusetts Institute of Technology, 1996. http://hdl.handle.net/1721.1/10895.

Full text
APA, Harvard, Vancouver, ISO, and other styles
25

Ritchie, Elspeth Kathryn. "Application of multivariate data analysis in biopharmaceutical production." Thesis, University of Newcastle upon Tyne, 2016. http://hdl.handle.net/10443/3356.

Full text
Abstract:
In 2004, the FDA launched the Process Analytical Technology (PAT) initiative to support product and process development. Even before this, the biologics manufacturing industry was working to implement PAT. While a strong focus of PAT is the implementation of new monitoring technologies, there is also a strong emphasis on the use of multivariate data analysis (MVDA). Effective implementation and integration of MVDA is of particular interest as it can be applied retroactively to historical datasets in addition to current datasets. However, translation of academic research into industrial ways of working can be slowed or prevented by many obstacles, from proposed solutions being workable only by the original academic to a need to prove that time invested in developing MVDA models and methodologies will result in positive business impacts (e.g. reduction of costs or man hours). The presented research applied MVDA techniques to datasets from three scales typically encountered during investigations of biologics manufacturing processes: a single-product dataset; a single-product, multi-scale dataset; and a multi-product, multi-scale, single-platform dataset. These datasets were interrogated using multiple approaches and with multiple objectives (e.g. indicators/causes of productivity variation, comparison of pH measurement technologies). Individual project outcomes culminated in the creation of a robust statistical toolbox. The toolbox captures an array of MVDA techniques from PCA and PLS to decision trees employing k-NN. These are supported by frameworks and guidance for implementation based on interrogation aims encountered in a contract manufacturing environment. The presented frameworks ranged from extraction of indirectly captured information (Chapter 4) to meta-analytical strategies (Chapter 6). 
Software-based tools generated during the research ranged from the translation of high-frequency online monitoring data into robust summary statistics with intuitive meaning (Appendix A) to tools enabling a potential reduction in confounding from underlying variation in dataset structures through the use of alternative progression variables (Chapter 5). Each tool was designed to fit into current and future planned ways of working at the sponsor company. The presented research demonstrates a range of investigation aims and challenges encountered in a contract manufacturing organisation, with benefits in ease of integration into normal work process flows and savings in time and human resources.
APA, Harvard, Vancouver, ISO, and other styles
26

Wang, Lianming. "Statistical analysis of multivariate interval-censored failure time data." Diss., Columbia, Mo. : University of Missouri-Columbia, 2006. http://hdl.handle.net/10355/4375.

Full text
Abstract:
Thesis (Ph.D.)--University of Missouri-Columbia, 2006.
The entire dissertation/thesis text is included in the research.pdf file; the official abstract appears in the short.pdf file (which also appears in the research.pdf); a non-technical general description, or public abstract, appears in the public.pdf file. Title from title screen of research.pdf file, viewed on May 2, 2007. Vita. Includes bibliographical references.
APA, Harvard, Vancouver, ISO, and other styles
27

Nicolini, Olivier. "LIBS Multivariate Analysis with Machine Learning." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-286595.

Full text
Abstract:
Laser-Induced Breakdown Spectroscopy (LIBS) is a spectroscopic technique used for chemical analysis of materials. By analyzing the spectrum obtained with this technique it is possible to understand the chemical composition of a sample. The possibility to analyze materials in a contactless and online fashion, without sample preparation, makes LIBS one of the most interesting techniques for chemical composition analysis. However, despite its intrinsic advantages, LIBS analysis suffers from poor accuracy and limited reproducibility of the results due to interference effects caused by the chemical composition of the sample or other experimental factors. How to improve the accuracy of the analysis by extracting useful information from high-dimensional LIBS data remains the main challenge of this technique. In the present work, with the aim of proposing a robust analysis method, I present a pipeline for multivariate regression on LIBS data composed of preprocessing, feature selection, and regression. First, raw data are preprocessed by application of intensity filtering, normalization and baseline correction to mitigate the effect of interference factors such as laser energy fluctuations or the presence of baseline in the spectrum. Feature selection allows finding the most informative lines for an element that are then used as input in the subsequent regression phase to predict the element concentration. Partial Least Squares (PLS) and Elastic Net showed the best predictive ability among the regression methods investigated, while Interval PLS (iPLS) and Iterative Predictor Weighting PLS (IPW-PLS) proved to be the best feature selection algorithms for this type of data. By applying these feature selection algorithms on the full LIBS spectrum before regression with PLS or Elastic Net it is possible to get accurate predictions in a robust fashion.
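The regression stage of such a pipeline can be illustrated with a minimal single-response PLS (PLS1, NIPALS-style deflation) sketch. This is a generic textbook construction on invented "spectra", not the thesis code, and a real LIBS workflow would run the preprocessing and feature-selection stages first:

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Minimal PLS1 (single response) via NIPALS-style deflation."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xc.T @ yc                    # covariance direction
        w /= np.linalg.norm(w)
        t = Xc @ w                       # scores
        tt = t @ t
        p = Xc.T @ t / tt                # X loadings
        c = (yc @ t) / tt                # y loading
        Xc = Xc - np.outer(t, p)         # deflate X and y
        yc = yc - c * t
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    B = W @ np.linalg.solve(P.T @ W, q)  # regression vector in X-space
    return x_mean, y_mean, B

def pls1_predict(X, x_mean, y_mean, B):
    return y_mean + (X - x_mean) @ B
```

In practice the number of components is chosen by cross-validation rather than fixed in advance.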
APA, Harvard, Vancouver, ISO, and other styles
28

Chen, Man-Hua. "Statistical analysis of multivariate interval-censored failure time data." Diss., Columbia, Mo. : University of Missouri-Columbia, 2007. http://hdl.handle.net/10355/4776.

Full text
Abstract:
Thesis (Ph.D.)--University of Missouri-Columbia, 2007.
The entire dissertation/thesis text is included in the research.pdf file; the official abstract appears in the short.pdf file (which also appears in the research.pdf); a non-technical general description, or public abstract, appears in the public.pdf file. Title from title screen of research.pdf file (viewed on March 6, 2009). Includes bibliographical references.
APA, Harvard, Vancouver, ISO, and other styles
29

Sheppard, Therese. "Extending covariance structure analysis for multivariate and functional data." Thesis, University of Manchester, 2010. https://www.research.manchester.ac.uk/portal/en/theses/extending-covariance-structure-analysis-for-multivariate-and-functional-data(e2ad7f12-3783-48cf-b83c-0ca26ef77633).html.

Full text
Abstract:
For multivariate data, when testing homogeneity of covariance matrices arising from two or more groups, Bartlett's (1937) modified likelihood ratio test statistic is appropriate to use under the null hypothesis of equal covariance matrices, where the null distribution of the test statistic is based on the restrictive assumption of normality. Zhang and Boos (1992) provide a pooled bootstrap approach when the data cannot be assumed to be normally distributed. We give three alternative bootstrap techniques for testing homogeneity of covariance matrices when it is both inappropriate to pool the data into one single population as in the pooled bootstrap procedure and when the data are not normally distributed. We further show that our alternative bootstrap methodology can be extended to testing Flury's (1988) hierarchy of covariance structure models. Where deviations from normality exist, we show, by simulation, that the normal theory log-likelihood ratio test statistic is less viable than our bootstrap methodology. For functional data, Ramsay and Silverman (2005) and Lee et al. (2002) together provide four computational techniques for functional principal component analysis (PCA) followed by covariance structure estimation. When the smoothing method for smoothing individual profiles is based on using least squares cubic B-splines or regression splines, we find that the ensuing covariance matrix estimate suffers from loss of dimensionality. We show that ridge regression can be used to resolve this problem, but only for the discretisation and numerical quadrature approaches to estimation, and that choice of a suitable ridge parameter is not arbitrary. We further show the unsuitability of regression splines when deciding on the optimal degree of smoothing to apply to individual profiles. To gain insight into smoothing parameter choice for functional data, we compare kernel and spline approaches to smoothing individual profiles in a nonparametric regression context. 
Our simulation results justify a kernel approach using a new criterion based on predicted squared error. We also show by simulation that, when taking account of correlation, a kernel approach using a generalized cross validatory type criterion performs well. These data-based methods for selecting the smoothing parameter are illustrated prior to a functional PCA on a real data set.
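The pooled bootstrap procedure of Zhang and Boos (1992), the baseline from which this thesis's alternative bootstrap techniques depart, can be sketched roughly as follows. This is a generic reconstruction on invented data, not the thesis's own methods:

```python
import numpy as np

def bartlett_stat(groups):
    """Bartlett-type statistic for homogeneity of covariance matrices."""
    k = len(groups)
    ns = [len(g) for g in groups]
    covs = [np.cov(g, rowvar=False) for g in groups]
    N = sum(ns)
    pooled = sum((n - 1) * S for n, S in zip(ns, covs)) / (N - k)
    stat = (N - k) * np.log(np.linalg.det(pooled))
    stat -= sum((n - 1) * np.log(np.linalg.det(S)) for n, S in zip(ns, covs))
    return stat

def pooled_bootstrap_pvalue(groups, n_boot=200, seed=0):
    """Pooled bootstrap null distribution (after Zhang and Boos, 1992):
    centre each group, pool the centred data, then resample groups of
    the original sizes from the pool."""
    rng = np.random.default_rng(seed)
    observed = bartlett_stat(groups)
    pool = np.vstack([g - g.mean(axis=0) for g in groups])
    exceed = 0
    for _ in range(n_boot):
        resampled = [pool[rng.integers(0, len(pool), size=len(g))]
                     for g in groups]
        if bartlett_stat(resampled) >= observed:
            exceed += 1
    return (exceed + 1) / (n_boot + 1)
```

Centring each group before pooling imposes the null hypothesis of a common covariance structure on the resampling distribution; the thesis's point is that pooling itself can be inappropriate, motivating its alternatives.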
APA, Harvard, Vancouver, ISO, and other styles
30

Wan, Chung-him, and 溫仲謙. "Analysis of zero-inflated count data." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2009. http://hub.hku.hk/bib/B43703719.

Full text
APA, Harvard, Vancouver, ISO, and other styles
31

Wan, Chung-him. "Analysis of zero-inflated count data." Click to view the E-thesis via HKUTO, 2009. http://sunzi.lib.hku.hk/hkuto/record/B43703719.

Full text
APA, Harvard, Vancouver, ISO, and other styles
32

陳志昌 and Chee-cheong Chan. "Compositional data analysis of voting patterns." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 1993. http://hub.hku.hk/bib/B31977236.

Full text
APA, Harvard, Vancouver, ISO, and other styles
33

Chan, Chee-cheong. "Compositional data analysis of voting patterns." [Hong Kong : University of Hong Kong], 1993. http://sunzi.lib.hku.hk/hkuto/record.jsp?B13787160.

Full text
APA, Harvard, Vancouver, ISO, and other styles
34

Nothnagel, Carien. "Multivariate data analysis using spectroscopic data of fluorocarbon alcohol mixtures / Nothnagel, C." Thesis, North-West University, 2012. http://hdl.handle.net/10394/7064.

Full text
Abstract:
Pelchem, a commercial subsidiary of Necsa (South African Nuclear Energy Corporation), produces a range of commercial fluorocarbon products while driving research and development initiatives to support the fluorine product portfolio. One such initiative is to develop improved analytical techniques to analyse product composition during development and to quality-assure products. Generally the C–F type products produced by Necsa are in a solution of anhydrous HF, and cannot be directly analyzed with traditional techniques without derivatisation. A technique such as vibrational spectroscopy, which can analyze these products directly without further preparation, will have a distinct advantage. However, spectra of mixtures of similar compounds are complex and not suitable for traditional quantitative regression analysis. Multivariate data analysis (MVA) can be used in such instances to exploit the complex nature of spectra to extract quantitative information on the composition of mixtures. A selection of fluorocarbon alcohols was made to act as representatives for fluorocarbon compounds. Experimental design theory was used to create a calibration range of mixtures of these compounds. Raman and infrared (NIR and ATR–IR) spectroscopy were used to generate spectral data of the mixtures and this data was analyzed with MVA techniques by the construction of regression and prediction models. Selected samples from the mixture range were chosen to test the predictive ability of the models. Analysis and regression models (PCR, PLS2 and PLS1) gave good model fits (R² values larger than 0.9). Raman spectroscopy was the most efficient technique and gave a high prediction accuracy (at 10% accepted standard deviation), provided the minimum mass of a component exceeded 16% of the total sample. The infrared techniques also performed well in terms of fit and prediction. The NIR spectra were subjected to signal saturation as a result of using long path length sample cells. 
This was shown to be the main reason for the loss in efficiency of this technique compared to Raman and ATR–IR spectroscopy. It was shown that multivariate data analysis of spectroscopic data of the selected fluorocarbon compounds could be used to quantitatively analyse mixtures with the possibility of further optimization of the method. The study was a representative study indicating that the combination of MVA and spectroscopy can be used successfully in the quantitative analysis of other fluorocarbon compound mixtures.
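Of the model families named above, principal component regression (PCR) is the simplest to sketch: regress the response on the leading PCA scores of the spectra and fold the coefficients back into the original variable space. A generic illustration on invented data, not the fluorocarbon spectra themselves:

```python
import numpy as np

def pcr_fit(X, y, n_components):
    """Principal component regression: PCA on X, then least squares
    of y on the leading scores, mapped back to the original variables."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_components].T              # leading loadings
    T = Xc @ V                           # scores
    coef, *_ = np.linalg.lstsq(T, y - y_mean, rcond=None)
    return x_mean, y_mean, V @ coef      # regression vector in X-space

def pcr_predict(X, x_mean, y_mean, b):
    return y_mean + (X - x_mean) @ b
```

Unlike PLS, the components here are chosen purely by the variance of X, without reference to the response.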
Thesis (M.Sc. (Chemistry))--North-West University, Potchefstroom Campus, 2012.
APA, Harvard, Vancouver, ISO, and other styles
35

Ahmadi-Nedushan, Behrooz 1966. "Multivariate statistical analysis of monitoring data for concrete dams." Thesis, McGill University, 2002. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=82815.

Full text
Abstract:
Major dams in the world are often instrumented in order to validate numerical models, to gain insight into the behavior of the dam, to detect anomalies, and to enable a timely response either in the form of repairs, reservoir management, or evacuation. Advances in automated data monitoring systems make it possible to regularly collect data on a large number of instruments for a dam. Managing this data is a major concern since traditional means of monitoring each instrument are time-consuming and personnel-intensive. Among tasks that need to be performed are: identification of faulty instruments, removal of outliers, data interpretation, model fitting and management of alarms for detecting statistically significant changes in the response of a dam.
Statistical models such as multiple linear regression and back-propagation neural networks have been used to estimate the response of individual instruments. Multiple linear regression models are of two kinds: (1) Hydro-Seasonal-Time (HST) models and (2) models that consider concrete temperatures as predictors.
Univariate, bivariate, and multivariate methods are proposed for the identification of anomalies in the instrumentation data. The source of these anomalies can be either bad readings, faulty instruments, or changes in dam behavior.
The proposed methodologies are applied to three different dams, Idukki, Daniel Johnson and Chute-a-Caron, which are respectively an arch, multiple arch and a gravity dam. Displacements, strains, flow rates, and crack openings of these three dams are analyzed.
This research also proposes various multivariate statistical analyses and artificial neural network techniques to analyze dam monitoring data. One of these methods, Principal Component Analysis (PCA), is concerned with explaining the variance-covariance structure of a data set through a few linear combinations of the original variables. The general objectives are (1) data reduction and (2) data interpretation. Other multivariate analysis methods such as canonical correlation analysis, partial least squares and nonlinear principal component analysis are discussed. The advantages of methodologies for noise reduction, the reduction of the number of variables that have to be monitored, the prediction of response parameters, and the identification of faulty readings are discussed. Results indicated that dam responses are generally correlated and that only a few principal components can summarize the behavior of a dam.
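The observation that a few principal components can summarize a dam's behavior can be made concrete through the proportion of variance explained. The toy "instrument" data below are invented, with two latent factors standing in for drivers such as reservoir level and temperature:

```python
import numpy as np

def explained_variance_ratio(X):
    """Proportion of total variance carried by each principal component."""
    s = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
    var = s ** 2
    return var / var.sum()

# ten hypothetical instruments driven by two underlying factors
rng = np.random.default_rng(5)
F = rng.normal(size=(300, 2))          # two latent factors over time
L = rng.normal(size=(2, 10))           # factor loadings per instrument
X = F @ L + 0.1 * rng.normal(size=(300, 10))
ratio = explained_variance_ratio(X)
```

When the readings are strongly correlated, the leading two entries of `ratio` dominate, which is the data-reduction argument made in the abstract.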
APA, Harvard, Vancouver, ISO, and other styles
36

Das, Mitali. "Motion within music : the analysis of multivariate MIDI data." Thesis, University of York, 2001. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.367466.

Full text
APA, Harvard, Vancouver, ISO, and other styles
37

Edberg, Alexandra. "Monitoring Kraft Recovery Boiler Fouling by Multivariate Data Analysis." Thesis, KTH, Skolan för kemi, bioteknologi och hälsa (CBH), 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-230906.

Full text
Abstract:
This work deals with fouling in the recovery boiler at Montes del Plata, Uruguay. Multivariate data analysis has been used to analyze the large amount of data that was available in order to investigate how different parameters affect the fouling problems. Principal Component Analysis (PCA) and Partial Least Squares projection (PLS) have been used in this work. PCA has been used to compare average values between time periods with high and low fouling problems while PLS has been used to study the correlation structures between the variables and consequently give an indication of which parameters might be changed to improve the availability of the boiler. The results show that this recovery boiler tends to have problems with fouling that might depend on the distribution of air, the black liquor pressure or the dry solid content of the black liquor. The results also show that multivariate data analysis is a powerful tool for analyzing these types of fouling problems.
APA, Harvard, Vancouver, ISO, and other styles
38

Chang, Janis. "Analysis of ordered categorical data." Thesis, University of British Columbia, 1988. http://hdl.handle.net/2429/27857.

Full text
Abstract:
Methods of testing for a location shift between two populations in a longitudinal study are investigated when the data of interest are ordered, categorical and non-linear. A non-standard analysis involving modelling of data over time with transition probability matrices is discussed. Next, the relative efficiencies of statistics more frequently used for the analysis of such categorical data at a single time point are examined. The Wilcoxon rank sum, McCullagh, and two-sample t statistics are compared for the analysis of such cross-sectional data using simulation and efficacy calculations. Simulation techniques are then utilized in comparing the stratified Wilcoxon, McCullagh and chi-squared-type statistics in their efficiencies at detecting a location shift when the data are examined over two time points. The distribution of a chi-squared-type statistic based on the simple contingency table constructed by merely noting whether a subject improved, stayed the same or deteriorated is derived. Applications of these methods and results to a data set of multiple sclerosis patients, some of whom were treated with interferon and some of whom received a placebo, are provided throughout the thesis and our findings are summarized in the last chapter.
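The Wilcoxon rank sum statistic compared here must cope with the heavy ties that ordered categorical responses produce, which is conventionally done with midranks and a tie-corrected variance in the normal approximation. A generic sketch (the category codes below are invented, not the trial data):

```python
import numpy as np

def rank_sum_z(x, y):
    """Wilcoxon rank sum z-score with midranks and tie-corrected variance,
    suitable for ordered categorical data with many ties."""
    combined = np.concatenate([x, y]).astype(float)
    order = combined.argsort(kind="stable")
    sorted_vals = combined[order]
    ranks = np.empty(len(combined))
    i = 0
    while i < len(sorted_vals):                 # walk over tied blocks
        j = i
        while j < len(sorted_vals) and sorted_vals[j] == sorted_vals[i]:
            j += 1
        ranks[order[i:j]] = 0.5 * (i + j + 1)   # midrank of the block
        i = j
    n1, n2 = len(x), len(y)
    N = n1 + n2
    W = ranks[:n1].sum()                        # rank sum of the first sample
    mean = n1 * (N + 1) / 2.0
    _, counts = np.unique(combined, return_counts=True)
    tie = float((counts ** 3 - counts).sum())   # tie correction term
    var = n1 * n2 / 12.0 * (N + 1 - tie / (N * (N - 1)))
    return (W - mean) / np.sqrt(var)
```

A strongly negative z-score indicates the first sample sits in lower categories; identical samples give exactly zero.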
Faculty of Science, Department of Statistics, Graduate
APA, Harvard, Vancouver, ISO, and other styles
39

Eslava-Gomez, Guillermina. "Projection pursuit and other graphical methods for multivariate data." Thesis, University of Oxford, 1989. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.236118.

Full text
APA, Harvard, Vancouver, ISO, and other styles
40

Siluyele, Ian John. "Power studies of multivariate two-sample tests of comparison." Thesis, University of the Western Cape, 2007. http://etd.uwc.ac.za/index.php?module=etd&action=viewtitle&id=gen8Srv25Nme4_6355_1255091702.

Full text
Abstract:

The multivariate two-sample tests provide a means to test the match between two multivariate distributions. Although many tests exist in the literature, relatively little is known about the relative power of these procedures. The studies reported in the thesis contrast the effectiveness, in terms of power, of seven such tests in a Monte Carlo study. The relative power of the tests was investigated against location, scale, and correlation alternatives.
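Such a power study can be sketched generically: simulate data under a chosen alternative many times, apply the test at a fixed level, and record the rejection rate. The permutation test on the distance between mean vectors below is an invented stand-in for the seven tests actually compared in the thesis:

```python
import numpy as np

def perm_pvalue(x, y, n_perm, rng):
    """Permutation p-value for the distance between the two mean vectors."""
    stat = lambda a, b: np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))
    obs = stat(x, y)
    z = np.vstack([x, y])
    n = len(x)
    hits = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(z))
        if stat(z[idx[:n]], z[idx[n:]]) >= obs:
            hits += 1
    return (hits + 1) / (n_perm + 1)

def estimate_power(shift, n_sim=100, n=30, dim=3, alpha=0.05, seed=1):
    """Monte Carlo estimate of power against a mean-shift alternative."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sim):
        x = rng.normal(size=(n, dim))
        y = rng.normal(size=(n, dim)) + shift  # shift every coordinate
        if perm_pvalue(x, y, 99, rng) <= alpha:
            rejections += 1
    return rejections / n_sim
```

At shift zero the rejection rate estimates the test's size, which should sit near the nominal alpha; scale and correlation alternatives would be simulated analogously.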

APA, Harvard, Vancouver, ISO, and other styles
41

Minnen, David. "Unsupervised discovery of activity primitives from multivariate sensor data." Diss., Atlanta, Ga. : Georgia Institute of Technology, 2008. http://hdl.handle.net/1853/24623.

Full text
Abstract:
Thesis (Ph.D.)--Computing, Georgia Institute of Technology, 2009.
Committee Chair: Thad Starner; Committee Member: Aaron Bobick; Committee Member: Bernt Schiele; Committee Member: Charles Isbell; Committee Member: Irfan Essa
APA, Harvard, Vancouver, ISO, and other styles
42

Fitzgerald-DeHoog, Lindsay M. "Multivariate analysis of proteomic data| Functional group analysis using a global test." Thesis, California State University, Long Beach, 2015. http://pqdtopen.proquest.com/#viewpdf?dispub=1602759.

Full text
Abstract:

Proteomics is a relatively new discipline being implemented in life science fields. Proteomics allows a whole-systems approach to discerning changes in organismal physiology due to physical perturbations. The advantages of a proteomic approach may be counteracted by the difficulty of analyzing the data in a meaningful way, owing to inherent problems with statistical assumptions. Furthermore, analyzing significant protein volume differences among treatment groups often requires analysis of numerous proteins even when limiting analyses to a particular protein type or physiological pathway. Improper use of traditional techniques leads to problems with multiple hypothesis testing.

This research will examine two common techniques used to analyze proteomic data and will apply these to a novel proteomic data set. In addition, a Global Test originally developed for gene array data will be employed to assess its utility for proteomic data and its ability to counteract the multiple hypothesis testing problems encountered with traditional analyses.

APA, Harvard, Vancouver, ISO, and other styles
43

Kurtovic, Sanela. "Directed Evolution of Glutathione Transferases Guided by Multivariate Data Analysis." Doctoral thesis, Uppsala University, Department of Biochemistry and Organic Chemistry, 2008. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-8718.

Full text
Abstract:

Evolution of enzymes with novel functional properties has gained much attention in recent years. Naturally evolved enzymes are adapted to work in living cells under physiological conditions, circumstances that are not always available for industrial processes calling for novel and better catalysts. Furthermore, altering enzyme function also affords insight into how enzymes work and how natural evolution operates.

Previous investigations have explored catalytic properties in the directed evolution of mutant libraries with high sequence variation. Before this study was initiated, functional analysis of mutant libraries was, to a large extent, restricted to uni- or bivariate methods. Consequently, there was a need to apply multivariate data analysis (MVA) techniques in this context. Directed evolution was approached by DNA shuffling of glutathione transferases (GSTs) in this thesis. GSTs are multifarious enzymes that have detoxication of both exo- and endogenous compounds as their primary function. They catalyze the nucleophilic attack by the tripeptide glutathione on many different electrophilic substrates.

Several multivariate analysis tools, e.g. principal component (PC), hierarchical cluster, and K-means cluster analyses, were applied to large mutant libraries assayed with a battery of GST substrates. By this approach, evolvable units (quasi-species) fit for further evolution were identified. It was clear that different substrates undergoing different kinds of chemical transformation can group together in a multi-dimensional substrate-activity space, thus being responsible for a certain quasi-species cluster. Furthermore, the importance of the chemical environment, or substrate matrix, in enzyme evolution was recognized. Diverging substrate selectivity profiles among homologous enzymes acting on substrates performing the same kind of chemistry were identified by MVA. Important structure-function activity relationships with the prodrug azathioprine were elucidated by segment analysis of a shuffled GST mutant library. Together, these results illustrate important methods applied to molecular enzyme evolution.

APA, Harvard, Vancouver, ISO, and other styles
44

Stenlund, Hans. "Improving interpretation by orthogonal variation : Multivariate analysis of spectroscopic data." Doctoral thesis, Umeå universitet, Kemiska institutionen, 2011. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-43476.

Full text
Abstract:
The desire to use the tools and concepts of chemometrics when studying problems in the life sciences, especially biology and medicine, has prompted chemometricians to shift their focus away from their field's traditional emphasis on model predictivity and towards the more contemporary objective of optimizing information exchange via model interpretation. The complex data structures that are captured by modern advanced analytical instruments open up new possibilities for extracting information from complex data sets. This in turn imposes higher demands on the quality of data and the modeling techniques used. The introduction of the concept of orthogonal variation in the late 1990s led to a shift of focus within chemometrics; the information gained from analysis of orthogonal structures complements that obtained from the predictive structures that were the discipline's previous focus. OPLS, which was introduced in the early 2000s, refined this view by formalizing the model structure and the separation of orthogonal variations. Orthogonal variation stems from experimental/analytical issues such as time trends, process drift, storage, sample handling, and instrumental differences, or from inherent properties of the sample such as age, gender, genetics, and environmental influence. The usefulness and versatility of OPLS has been demonstrated in over 500 citations, mainly in the fields of metabolomics and transcriptomics but also in NIR, UV and FTIR spectroscopy. In all cases, the predictive precision of OPLS is identical to that of PLS, but OPLS is superior when it comes to the interpretation of both predictive and orthogonal variation. Thus, OPLS models the same data structures but provides increased scope for interpretation, making it more suitable for contemporary applications in the life sciences. This thesis discusses four different research projects, including analyses of NIR, FTIR and NMR spectroscopic data. 
The discussion includes comparisons of OPLS and PLS models of complex datasets in which experimental variation conceals and confounds relevant information. The PLS and OPLS methods are discussed in detail. In addition, the thesis describes new OPLS-based methods developed to accommodate hyperspectral images for supervised modeling. Proper handling of orthogonal structures revealed the weaknesses in the analytical chains examined. In all of the studies described, the orthogonal structures were used to validate the quality of the generated models as well as gaining new knowledge. These aspects are crucial in order to enhance the information exchange from both past and future studies.
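The core idea behind the OPLS separation described above can be illustrated with a minimal single-component sketch in numpy, following the standard published OPLS correction step; the toy data, variable names, and regularization choices here are our own illustration, not the thesis's implementation:

```python
import numpy as np

def opls_filter(X, y):
    """Single-component OPLS correction: split X into a y-predictive part
    and one y-orthogonal component, and remove the latter."""
    X = X - X.mean(axis=0)                     # column-center predictors
    y = y - y.mean()                           # center the response
    w = X.T @ y
    w /= np.linalg.norm(w)                     # y-predictive weight vector
    t = X @ w                                  # predictive scores
    p = X.T @ t / (t @ t)                      # loadings of X on t
    w_orth = p - (w @ p) * w                   # part of p orthogonal to w
    w_orth /= np.linalg.norm(w_orth)
    t_orth = X @ w_orth                        # orthogonal scores
    p_orth = X.T @ t_orth / (t_orth @ t_orth)
    X_filtered = X - np.outer(t_orth, p_orth)  # strip the orthogonal variation
    return X_filtered, t_orth, p_orth

# Toy data: y depends on one latent direction, while a second, y-unrelated
# direction (e.g. instrumental drift) inflates X and blurs interpretation.
rng = np.random.default_rng(0)
t1, t2 = rng.normal(size=100), rng.normal(size=100)
P = rng.normal(size=(2, 10))
X = np.outer(t1, P[0]) + np.outer(t2, P[1]) + 0.01 * rng.normal(size=(100, 10))
y = t1
Xf, t_o, p_o = opls_filter(X, y)
print(Xf.shape)  # (100, 10)
```

By construction the extracted orthogonal scores are uncorrelated with y, which is exactly what makes the remaining predictive part, and the orthogonal part itself, separately interpretable.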
APA, Harvard, Vancouver, ISO, and other styles
45

Combrexelle, Sébastien. "Multifractal analysis for multivariate data with application to remote sensing." Phd thesis, Toulouse, INPT, 2016. http://oatao.univ-toulouse.fr/16477/1/Combrexelle.pdf.

Full text
Abstract:
Texture characterization is a central element in many image processing applications. Texture analysis can be embedded in the mathematical framework of multifractal analysis, enabling the study of the fluctuations in regularity of image intensity and providing practical tools for their assessment, the coefficients or wavelet leaders. Although successfully applied in various contexts, multifractal analysis suffers at present from two major limitations. First, the accurate estimation of multifractal parameters for image texture remains a challenge, notably for small sample sizes. Second, multifractal analysis has so far been limited to the analysis of a single image, while the data available in applications are increasingly multivariate. The main goal of this thesis is to develop practical contributions to overcome these limitations. The first limitation is tackled by introducing a generic statistical model for the logarithm of wavelet leaders, parametrized by multifractal parameters of interest. This statistical model enables us to counterbalance the variability induced by small sample sizes and to embed the estimation in a Bayesian framework. This yields robust and accurate estimation procedures, effective both for small and large images. The multifractal analysis of multivariate images is then addressed by generalizing this Bayesian framework to hierarchical models able to account for the assumption that multifractal properties evolve smoothly in the dataset. This is achieved via the design of suitable priors relating the dynamical properties of the multifractal parameters of the different components composing the dataset. Different priors are investigated and compared in this thesis by means of numerical simulations conducted on synthetic multivariate multifractal images.
This work is further completed by the investigation of the potential benefit of multifractal analysis and the proposed Bayesian methodology for remote sensing via the example of hyperspectral imaging.
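The Bayesian wavelet-leader machinery of the thesis is well beyond a short example, but the basic ingredient of multifractal analysis, a log-log regression of wavelet-coefficient energies across dyadic scales, can be sketched as follows (Haar-type details, our own simplified estimator, not the thesis's method):

```python
import numpy as np

def haar_scaling_exponent(x, n_scales=6):
    """Estimate a self-similarity exponent H from how the energy of Haar-type
    detail coefficients grows across dyadic scales: log2 E[d_j^2] ~ 2H*j + c."""
    js, log_energy = [], []
    for j in range(1, n_scales + 1):
        s = 2 ** j                             # window length at scale j
        n = (len(x) // s) * s
        blocks = x[:n].reshape(-1, s)
        half = s // 2
        # difference of half-window means = unnormalized Haar detail
        d = blocks[:, half:].mean(axis=1) - blocks[:, :half].mean(axis=1)
        js.append(j)
        log_energy.append(np.log2(np.mean(d ** 2)))
    slope = np.polyfit(js, log_energy, 1)[0]   # fit the log-log scaling law
    return slope / 2

# Ordinary Brownian motion has H = 0.5; the estimate should land nearby.
rng = np.random.default_rng(0)
bm = np.cumsum(rng.normal(size=2 ** 14))
H = haar_scaling_exponent(bm)
print(round(H, 2))
```

For short signals (or small image patches) the regression becomes noisy, which is precisely the small-sample regime that motivates the Bayesian modeling of log-leaders in the thesis.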
APA, Harvard, Vancouver, ISO, and other styles
46

Duchesne, Carl. "Improvement of processes and product quality through multivariate data analysis /." *McMaster only, 2000.

Find full text
APA, Harvard, Vancouver, ISO, and other styles
47

Fernandes, Gomes da Silva Alexandre Miguel. "Methods for the analysis of multivariate lifetime data with frailty." Thesis, University of Reading, 2004. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.408331.

Full text
APA, Harvard, Vancouver, ISO, and other styles
48

Robson, Geoffrey. "Multiple outlier detection and cluster analysis of multivariate normal data." Thesis, Stellenbosch : Stellenbosch University, 2003. http://hdl.handle.net/10019.1/53508.

Full text
Abstract:
Thesis (MscEng)--Stellenbosch University, 2003.
ENGLISH ABSTRACT: Outliers may be defined as observations that are sufficiently aberrant to arouse the suspicion of the analyst as to their origin. They could be the result of human error, in which case they should be corrected, but they may also be an interesting exception, and this would deserve further investigation. Identification of outliers typically consists of an informal inspection of a plot of the data, but this is unreliable for dimensions greater than two. A formal procedure for detecting outliers allows for consistency when classifying observations. It also enables one to automate the detection of outliers by using computers. The special case of univariate data is treated separately to introduce essential concepts, and also because it may well be of interest in its own right. We then consider techniques used for detecting multiple outliers in a multivariate normal sample, and go on to explain how these may be generalized to include cluster analysis. Multivariate outlier detection is based on the Minimum Covariance Determinant (MCD) subset, and is therefore treated in detail. Exact bivariate algorithms were refined and implemented, and the solutions were used to establish the performance of the commonly used heuristic, Fast–MCD.
AFRIKAANSE OPSOMMING (translated): Outliers are defined as observations that deviate from expected behaviour to such an extent that the analyst is suspicious of their origin. These observations may be the result of human error, in which case they must be corrected. They may, however, also be an interesting phenomenon that warrants further investigation. The identification of outliers is typically carried out informally by inspecting a graphical representation of the data, but this approach is unreliable for dimensions greater than two. A formal procedure for determining outliers results in more consistent classification of sample data, and also creates the opportunity for effective computer implementation of the techniques. Initially the special case of univariate data is treated in order to introduce essential concepts, but also since it is an area of great importance in its own right. Techniques for the identification of multiple outliers in multivariate, normally distributed data are then considered, and it is investigated how these ideas can be generalized to include cluster analysis. The so-called Minimum Covariance Determinant (MCD) subset is fundamental to the identification of multivariate outliers, and is therefore examined in detail. Deterministic bivariate algorithms were refined and implemented, and used to investigate the effectiveness of the commonly used heuristic algorithm, Fast-MCD.
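The Fast-MCD heuristic mentioned in this abstract rests on the concentration (C-) step: refit mean and covariance on a subset, then keep the h points with the smallest Mahalanobis distances. A minimal sketch with random elemental starts (our own simplification, not the exact thesis algorithms) could look like this:

```python
import numpy as np

def mcd_subset(X, h, n_starts=50, n_csteps=10, seed=0):
    """Approximate the Minimum Covariance Determinant subset: random elemental
    starts followed by concentration (C-) steps, in the spirit of Fast-MCD."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    best_det, best_idx = np.inf, None
    for _ in range(n_starts):
        idx = rng.choice(n, size=p + 1, replace=False)   # elemental start
        for _ in range(n_csteps):
            mu = X[idx].mean(axis=0)
            S = np.cov(X[idx], rowvar=False) + 1e-9 * np.eye(p)
            diff = X - mu
            d2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(S), diff)
            idx = np.argsort(d2)[:h]                     # keep the h closest
        det = np.linalg.det(np.cov(X[idx], rowvar=False))
        if det < best_det:                               # smallest determinant wins
            best_det, best_idx = det, idx
    return best_idx, best_det

# 95 bivariate normal points plus 5 planted gross outliers.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
X[:5] += 10.0
idx, det = mcd_subset(X, h=53)
outliers_kept = set(range(5)) & {int(i) for i in idx}
print(len(idx), sorted(outliers_kept))
```

Because any subset containing a gross outlier has a much larger covariance determinant, the best subset found excludes the planted outliers, which can then be flagged by their Mahalanobis distances to the robust fit.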
APA, Harvard, Vancouver, ISO, and other styles
49

Morris, Nathan J. "Multivariate and Structural Equation Models for Family Data." Case Western Reserve University School of Graduate Studies / OhioLINK, 2009. http://rave.ohiolink.edu/etdc/view?acc_num=case1247004562.

Full text
APA, Harvard, Vancouver, ISO, and other styles
50

Hu, Zongliang. "New developments in multiple testing and multivariate testing for high-dimensional data." HKBU Institutional Repository, 2018. https://repository.hkbu.edu.hk/etd_oa/534.

Full text
Abstract:
This thesis aims to develop some new and novel methods in advancing multivariate testing and multiple testing for high-dimensional small sample size data. In Chapter 2, we propose a likelihood ratio test framework for testing normal mean vectors in high-dimensional data under two common scenarios: the one-sample test and the two-sample test with equal covariance matrices. We derive the test statistics under the assumption that the covariance matrices follow a diagonal matrix structure. In comparison with the diagonal Hotelling's tests, our proposed test statistics display some interesting characteristics. In particular, they are a summation of the log-transformed squared t-statistics rather than a direct summation of those components. More importantly, to derive the asymptotic normality of our test statistics under the null and local alternative hypotheses, we do not need the requirement that the covariance matrices follow a diagonal matrix structure. As a consequence, our proposed test methods are very flexible and readily applicable in practice. Monte Carlo simulations and a real data analysis are also carried out to demonstrate the advantages of the proposed methods. In Chapter 3, we propose a pairwise Hotelling's method for testing high-dimensional mean vectors. The new test statistics strike a compromise between using all of the correlations and abandoning them completely. To achieve the goal, we perform a screening procedure, pick up the paired covariates with strong correlations, and construct a classical Hotelling's statistic for each pair. For the individual covariates without strong correlations with others, we apply squared t-statistics to account for their respective contributions to the multivariate testing problem. As a consequence, our proposed test statistics involve a combination of the collected pairwise Hotelling's test statistics and squared t-statistics.
The asymptotic normality of our test statistics under the null and local alternative hypotheses is also derived under some regularity conditions. Numerical studies and two real data examples demonstrate the efficacy of our pairwise Hotelling's test. In Chapter 4, we propose a regularized t distribution and also explore its applications in multiple testing. The motivation of this topic dates back to microarray studies, where the expression levels of thousands of genes are measured simultaneously by the microarray technology. To identify genes that are differentially expressed between two or more groups, one needs to conduct a hypothesis test for each gene. However, as microarray experiments are often run with a small number of replicates, Student's t-tests using the sample means and standard deviations may suffer from low power for detecting differentially expressed genes. To overcome this problem, we first propose a regularized t distribution and derive its statistical properties including the probability density function and the moments. The noncentral regularized t distribution is also introduced for the power analysis. To demonstrate the usefulness of the proposed test, we apply the regularized t distribution to the gene expression detection problem. Simulation studies and two real data examples show that the regularized t-test outperforms the existing tests including Student's t-test and the Bayesian t-test in a wide range of settings, in particular when the sample size is small.
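The regularized t distribution itself is specific to this thesis, but the general idea it builds on, stabilizing small-sample t-statistics by padding the per-gene standard error with a regularization constant, can be sketched as follows (all names and the choice of regularizer are our own illustration):

```python
import numpy as np

def moderated_t(group1, group2, s0=None):
    """Two-sample t-statistics computed row-wise (rows = genes/features),
    with a constant s0 added to the standard error as a simple regularizer."""
    n1, n2 = group1.shape[1], group2.shape[1]
    m1, m2 = group1.mean(axis=1), group2.mean(axis=1)
    v1 = group1.var(axis=1, ddof=1)
    v2 = group2.var(axis=1, ddof=1)
    # pooled standard deviation per gene
    sp = np.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    se = sp * np.sqrt(1.0 / n1 + 1.0 / n2)
    if s0 is None:
        s0 = np.median(se)        # data-driven choice of the regularizer
    return (m1 - m2) / (se + s0)  # padded denominator tames tiny variances

# Small-replicate setting: 1000 genes, 4 arrays per group,
# with the first 50 genes truly shifted between the groups.
rng = np.random.default_rng(2)
g1 = rng.normal(size=(1000, 4))
g2 = rng.normal(size=(1000, 4))
g2[:50] += 2.0
t_mod = moderated_t(g1, g2)
print(t_mod.shape)  # (1000,)
```

The padding prevents genes whose sample variance is accidentally tiny from dominating the ranking, which is the failure mode of the plain Student's t-test at small sample sizes that the chapter addresses.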
APA, Harvard, Vancouver, ISO, and other styles

To the bibliography