
Dissertations / Theses on the topic 'Multivariate analysis – Data processing'

Consult the top 50 dissertations / theses for your research on the topic 'Multivariate analysis – Data processing.'


1

Jonsson, Pär. "Multivariate processing and modelling of hyphenated metabolite data." Doctoral thesis, Umeå universitet, Kemi, 2005. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-663.

Full text
Abstract:
One trend in the ‘omics’ sciences is the generation of increasing amounts of data, describing complex biological samples. To cope with this and facilitate progress towards reliable diagnostic tools, it is crucial to develop methods for extracting representative and predictive information. In global metabolite analysis (metabolomics and metabonomics) NMR, GC/MS and LC/MS are the main platforms for data generation. Multivariate projection methods (e.g. PCA, PLS and O-PLS) have been recognized as efficient tools for data analysis within subjects such as biology and chemistry due to their ability to provide interpretable models based on many correlated variables. In global metabolite analysis, these methods have been successfully applied in areas such as toxicology, disease diagnosis and plant functional genomics. This thesis describes the development of processing methods for the unbiased extraction of representative and predictive information from metabolic GC/MS and LC/MS data characterizing biofluids, e.g. plant extracts, urine and blood plasma. In order to allow the multivariate projections to detect and highlight differences between samples, one requirement of the processing methods is that they must extract a common set of descriptors from all samples and still retain the metabolically relevant information in the data. In Papers I and II this was done by applying a hierarchical multivariate compression approach to both GC/MS and LC/MS data. In the study described in Paper III a hierarchical multivariate curve resolution strategy (H-MCR) was developed for simultaneously resolving multiple GC/MS samples into pure profiles. In Paper IV the H-MCR method was applied to a drug toxicity study in rats, where the method’s potential for biomarker detection and identification was exemplified. Finally, the H-MCR method was extended, as described in Paper V, allowing independent samples to be processed and predicted using a model based on an existing set of representative samples. The fact that these processing methods proved to be valid for predicting the properties of new independent samples indicates that it is now possible for global metabolite analysis to be extended beyond isolated studies. In addition, the results facilitate high-throughput analysis, because predicting the nature of samples is rapid compared to the actual processing. In summary this research highlights the possibilities for using global metabolite analysis in diagnosis.
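To make the compression idea concrete, the following is a minimal sketch of windowed multivariate compression of unfolded GC/MS data: each retention-time window is reduced to a few PCA scores, which together form a common set of descriptors for all samples. The window count, number of scores and SVD-based PCA are illustrative assumptions, not the exact H-MCR procedure of the thesis.

```python
import numpy as np

def hierarchical_compression(X, n_windows=10, k=2):
    """Compress each retention-time window to k PCA scores per sample,
    then concatenate the scores as a common set of descriptors.
    X: samples x (retention time points * m/z channels), unfolded."""
    n, p = X.shape
    blocks = np.array_split(np.arange(p), n_windows)
    descriptors = []
    for idx in blocks:
        Xb = X[:, idx] - X[:, idx].mean(axis=0)   # centre each window
        # SVD-based PCA of the window; scores = left vectors * singular values
        U, s, _ = np.linalg.svd(Xb, full_matrices=False)
        descriptors.append(U[:, :k] * s[:k])
    return np.hstack(descriptors)                 # n x (n_windows * k)

# toy usage: 20 samples, 1000 unfolded variables
X = np.random.rand(20, 1000)
T = hierarchical_compression(X)
print(T.shape)  # (20, 20)
```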
2

Siluyele, Ian John. "Power studies of multivariate two-sample tests of comparison." Thesis, University of the Western Cape, 2007. http://etd.uwc.ac.za/index.php?module=etd&action=viewtitle&id=gen8Srv25Nme4_6355_1255091702.

Full text
Abstract:

The multivariate two-sample tests provide a means to test the match between two multivariate distributions. Although many tests exist in the literature, relatively little is known about the relative power of these procedures. The studies reported in this thesis contrast the effectiveness, in terms of power, of seven such tests in a Monte Carlo study. The relative power of the tests was investigated against location, scale, and correlation alternatives.
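As an illustration of this kind of Monte Carlo power study, the sketch below estimates the power of one classical two-sample procedure, Hotelling's T², against a location alternative. The choice of test, sample sizes and shift size are assumptions for illustration; the seven tests compared in the thesis are not specified here.

```python
import numpy as np
from scipy import stats

def hotelling_t2_pvalue(X, Y):
    """Two-sample Hotelling T^2 test via its F transformation."""
    n1, p = X.shape
    n2 = Y.shape[0]
    d = X.mean(0) - Y.mean(0)
    S = ((n1 - 1) * np.cov(X, rowvar=False)
         + (n2 - 1) * np.cov(Y, rowvar=False)) / (n1 + n2 - 2)
    t2 = (n1 * n2) / (n1 + n2) * d @ np.linalg.solve(S, d)
    f = (n1 + n2 - p - 1) / (p * (n1 + n2 - 2)) * t2
    return stats.f.sf(f, p, n1 + n2 - p - 1)

def power(shift, p=3, n=30, reps=2000, alpha=0.05,
          rng=np.random.default_rng(0)):
    """Monte Carlo power against a location alternative of size `shift`."""
    hits = 0
    for _ in range(reps):
        X = rng.standard_normal((n, p))
        Y = rng.standard_normal((n, p)) + shift  # mean-shift every coordinate
        hits += hotelling_t2_pvalue(X, Y) < alpha
    return hits / reps

print(power(0.0), power(0.5))  # empirical size vs. power
```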

3

Vitale, Raffaele. "Novel chemometric proposals for advanced multivariate data analysis, processing and interpretation." Doctoral thesis, Universitat Politècnica de València, 2017. http://hdl.handle.net/10251/90442.

Full text
Abstract:
The present Ph.D. thesis, primarily conceived to support and reinforce the relation between academic and industrial worlds, was developed in collaboration with Shell Global Solutions (Amsterdam, The Netherlands) in the endeavour of applying and possibly extending well-established latent variable-based approaches (i.e. Principal Component Analysis - PCA - Partial Least Squares regression - PLS - or Partial Least Squares Discriminant Analysis - PLSDA) for complex problem solving not only in the fields of manufacturing troubleshooting and optimisation, but also in the wider environment of multivariate data analysis. To this end, novel efficient algorithmic solutions are proposed throughout all chapters to address very disparate tasks, from calibration transfer in spectroscopy to real-time modelling of streaming flows of data. The manuscript is divided into the following six parts, focused on various topics of interest: Part I - Preface, where an overview of this research work, its main aims and justification is given together with a brief introduction on PCA, PLS and PLSDA; Part II - On kernel-based extensions of PCA, PLS and PLSDA, where the potential of kernel techniques, possibly coupled to specific variants of the recently rediscovered pseudo-sample projection, formulated by the English statistician John C. Gower, is explored and their performance compared to that of more classical methodologies in four different application scenarios: segmentation of Red-Green-Blue (RGB) images, discrimination of on-/off-specification batch runs, monitoring of batch processes and analysis of mixture designs of experiments; Part III - On the selection of the number of factors in PCA by permutation testing, where an extensive guideline on how to accomplish the selection of PCA components by permutation testing is provided through the comprehensive illustration of an original algorithmic procedure implemented for such a purpose; Part IV - On modelling common and distinctive sources of variability in multi-set data analysis, where several practical aspects of two-block common and distinctive component analysis (carried out by methods like Simultaneous Component Analysis - SCA - DIStinctive and COmmon Simultaneous Component Analysis - DISCO-SCA - Adapted Generalised Singular Value Decomposition - Adapted GSVD - ECO-POWER, Canonical Correlation Analysis - CCA - and 2-block Orthogonal Projections to Latent Structures - O2PLS) are discussed, a new computational strategy for determining the number of common factors underlying two data matrices sharing the same row- or column-dimension is described, and two innovative approaches for calibration transfer between near-infrared spectrometers are presented; Part V - On the on-the-fly processing and modelling of continuous high-dimensional data streams, where a novel software system for rational handling of multi-channel measurements recorded in real time, the On-The-Fly Processing (OTFP) tool, is designed; Part VI - Epilogue, where final conclusions are drawn, future perspectives are delineated, and annexes are included.
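As one concrete example related to Part III, the sketch below selects the number of PCA components by permutation testing: singular values of the data are compared with those of column-permuted data, and leading components exceeding the permutation quantile are retained. This simple non-sequential variant is an assumption for illustration, not the thesis' exact algorithm.

```python
import numpy as np

def n_components_by_permutation(X, n_perm=200, alpha=0.05,
                                rng=np.random.default_rng(1)):
    """Retain leading components whose singular value exceeds the (1-alpha)
    quantile of singular values obtained after independently permuting
    each column (which destroys between-variable correlation)."""
    Xc = X - X.mean(0)
    s_obs = np.linalg.svd(Xc, compute_uv=False)
    s_null = np.empty((n_perm, len(s_obs)))
    for b in range(n_perm):
        Xp = np.column_stack([rng.permutation(col) for col in Xc.T])
        s_null[b] = np.linalg.svd(Xp - Xp.mean(0), compute_uv=False)
    thresh = np.quantile(s_null, 1 - alpha, axis=0)
    passing = s_obs > thresh
    # number of leading components that beat the permutation threshold
    return len(s_obs) if passing.all() else int(np.argmin(passing))

X = np.random.default_rng(0).standard_normal((50, 30))
print(n_components_by_permutation(X))  # ~0 for pure noise
```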
Vitale, R. (2017). Novel chemometric proposals for advanced multivariate data analysis, processing and interpretation [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/90442
4

Doshi, Punit Rameshchandra. "Adaptive prefetching for visual data exploration." Link to electronic thesis, 2003. http://www.wpi.edu/Pubs/ETD/Available/etd-0131103-203307.

Full text
Abstract:
Thesis (M.S.)--Worcester Polytechnic Institute.
Keywords: Adaptive prefetching; Large-scale multivariate data visualization; Semantic caching; Hierarchical data exploration; Exploratory data analysis. Includes bibliographical references (p.66-70).
5

Cannon, Paul C. "Extending the information partition function : modeling interaction effects in highly multivariate, discrete data /." Diss., Brigham Young University, 2008. http://contentdm.lib.byu.edu/ETD/image/etd2263.pdf.

Full text
6

Forshed, Jenny. "Processing and analysis of NMR data : Impurity determination and metabolic profiling." Doctoral thesis, Stockholm : Dept. of analytical chemistry, Stockholm university, 2005. http://urn.kb.se/resolve?urn=urn:nbn:se:su:diva-712.

Full text
7

Guamán, Novillo Ana Verónica. "Multivariate Signal Processing for Quantitative and Qualitative Analysis of Ion Mobility Spectrometry data, applied to Biomedical Applications and Food Related Applications." Doctoral thesis, Universitat de Barcelona, 2015. http://hdl.handle.net/10803/349210.

Full text
Abstract:
There are several applications where the measurement of VOCs proves useful, such as toxic leak detection, air quality measurement, explosives detection, monitoring of food and beverage quality, diagnosis of diseases, etc. Some of these applications call for fast or even real-time responses. In this context, few analytical techniques can perform gas-phase analysis; among them is Ion Mobility Spectrometry (IMS). IMS is a fast analytical technique based on the time of flight of ions in a drift tube. The response of IMS typically lasts a few seconds, but it can be even less than a second. This fast response has driven its use towards novel applications, such as biomedical and food applications (bio-related applications). Nonetheless, it has also brought the need to analyze complex spectra with hundreds of compounds. Tackling this disadvantage is the main focus of this thesis, where new algorithms for enhancing IMS performance in bio-related applications are investigated. The nonlinear behaviour and charge competition of IMS responses are important issues that need to be addressed: both effects have a direct impact on the interpretation of IMS spectra, especially when real datasets are studied. Additionally, univariate spectral analysis, where peak information is extracted manually, becomes unfeasible in bio-related applications. In this context, this work introduces multivariate methodologies for quantitative and qualitative analysis. For the quantitative analysis, calibration models were built using a univariate methodology, Partial Least Squares (PLS) and Multivariate Curve Resolution (MCR) techniques, in order to tackle the main issues of IMS such as nonlinearities and mixture effects. Univariate techniques provide poor or overoptimistic results that understate the potential of IMS, whereas the results show a real improvement in performance when multivariate techniques are used. Comparing MCR and PLS, the main difference is the interpretability that MCR offers. For the qualitative analysis, two different approaches were devised for building class-discrimination models. The first approach consisted of building a model through principal component analysis and linear discriminant analysis, together with a robust cross-validation methodology to obtain reliable results. This methodology was applied to wine samples, with the aim of discriminating them by origin; the results were fully satisfactory, as the model was able to separate four groups with a high accuracy rate. The second approach involves the use of the Multivariate Curve Resolution - Lasso algorithm to extract pure components from samples of rats' breath, followed by a feature selection technique to obtain the most representative feature subset. Here the objective was to find a model that discriminates rats with sepsis from control rats. The results show that a few pure IMS components suffice to generate a discriminatory model, which means that specific compounds in the breath are linked to the disease. In summary, the main objective of this work is to resolve open issues in stand-alone IMS applied to bio-related analyses. Two major lines of investigation are proposed in this thesis: (i) qualitative analysis and (ii) quantitative analysis. The qualitative analysis covers pre-processing algorithms and the development of new methodologies for building models in bio-related applications. The quantitative analysis focuses on highlighting the importance of using multivariate techniques instead of univariate techniques. To reach the objectives of this thesis, a set of datasets was created, which is detailed in the thesis. The results and main conclusions are explained in depth in the extended text.
The objective of this thesis is the development of new multivariate signal-processing methodologies for IMS spectra. A comparison between three IMS spectrometers was carried out; such a comparative study based on multivariate processing is practically unprecedented in this field. In this case, a study with three amines was performed and the limit of detection was determined; the results showed that the three spectrometers had similar performance, despite their different operating conditions. A specific technique was proposed to remove low-frequency noise coupled to the IMS spectra; using PCA or ICA (multivariate methods) was observed to notably improve the signal-to-noise ratio compared with conventional techniques. Spectral alignment was studied and solutions based on different state-of-the-art methods were proposed; including reference compounds to guarantee that the alignment process is adequate proved advantageous, and when this is not possible a staged alignment is advised, first within a single sample and then between samples. Qualitative models were built to differentiate or discriminate classes from IMS measurements; two multivariate models with cross-validation techniques were proposed, and the results show the great potential of IMS in this respect. The quantitative performance of IMS using multivariate methods was evaluated and compared with the univariate methods common in the IMS field; the univariate models were not able to resolve typical IMS behaviours such as nonlinearity and mixture effects, whereas the multivariate techniques showed better performance. Projection techniques such as PLS were compared with deconvolution techniques such as MCR in its two versions, ALS and Lasso; the results were quite similar, although MCR offers the important advantage of more interpretable results.
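A minimal sketch of the first qualitative approach (principal component analysis followed by linear discriminant analysis, evaluated with cross-validation), using scikit-learn on simulated stand-in spectra; the data shapes, component count and class structure are illustrative assumptions.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score, StratifiedKFold

rng = np.random.default_rng(0)
X = rng.standard_normal((80, 500))   # stand-in for 80 IMS spectra
y = np.repeat([0, 1, 2, 3], 20)      # e.g. four wine origins
X[y == 1, :50] += 0.5                # inject a weak class effect

# PCA for dimension reduction, then LDA for class discrimination
clf = make_pipeline(StandardScaler(), PCA(n_components=10),
                    LinearDiscriminantAnalysis())
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv)
print(scores.mean())                 # cross-validated accuracy
```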
8

Cannon, Paul C. "Extending the Information Partition Function: Modeling Interaction Effects in Highly Multivariate, Discrete Data." BYU ScholarsArchive, 2007. https://scholarsarchive.byu.edu/etd/1234.

Full text
Abstract:
Because of the huge amounts of data made available by the technology boom in the late twentieth century, new methods are required to turn data into usable information. Much of this data is categorical in nature, which makes estimation difficult in highly multivariate settings. In this thesis we review various multivariate statistical methods, discuss various statistical methods of natural language processing (NLP), and discuss a general class of models described by Erosheva (2002) called generalized mixed membership models. We then propose extensions of the information partition function (IPF) derived by Engler (2002), Oliphant (2003), and Tolley (2006) that will allow modeling of discrete, highly multivariate data in linear models. We report results of the modified IPF model on the World Health Organization's Survey on Global Aging (SAGE).
9

Oller, Moreno Sergio. "Data processing for Life Sciences measurements with hyphenated Gas Chromatography-Ion Mobility Spectrometry." Doctoral thesis, Universitat de Barcelona, 2018. http://hdl.handle.net/10803/523539.

Full text
Abstract:
Recent progress in analytical chemistry instrumentation has increased the amount of data available for analysis. This progress has been accompanied by computational improvements that have enabled new possibilities to analyze larger amounts of data. These two factors have made it possible to analyze more complex samples in multiple life science fields, such as biology, medicine, pharmacology, or food science. One of the techniques that has benefited from these improvements is Gas Chromatography - Ion Mobility Spectrometry (GC-IMS). This technique is useful for the detection of Volatile Organic Compounds (VOCs) in complex samples. Ion Mobility Spectrometry is an analytical technique for characterizing chemical substances based on the velocity of gas-phase ions in an electric field. It is able to detect trace levels of volatile chemicals, reaching ppb concentrations for some analytes. While the instrument has moderate selectivity, it is very fast in the analysis, as an ion mobility spectrum can be acquired in tens of milliseconds. As it operates at ambient pressure, it is found not only as laboratory instrumentation but also on-site, to perform screening applications; for instance, it is often used in airports for the detection of drugs and explosives. To enhance the selectivity of the IMS, especially for the analysis of complex samples, a gas chromatograph can be used for sample pre-separation, at the expense of the length of the analysis. While there is better instrumentation and more computational power, better algorithms are still needed to exploit and extract all the information present in the samples. In particular, GC-IMS has not received much attention compared to other analytical techniques. In this work we address some of the data analysis issues for GC-IMS. With respect to pre-processing, we explore several baseline estimation methods and we suggest a variation of Asymmetric Least Squares, a popular baseline estimation technique, that is able to cope with signals that present large peaks or a large dynamic range. This baseline estimation method is used on Gas Chromatography - Mass Spectrometry signals as well, as it suits both techniques. Furthermore, we characterize spectral misalignments in a study lasting several months, and propose an alignment method based on monotonic cubic splines for their correction. Based on the misalignment characterization we propose an optimal time span between consecutive calibrant samples. We then explore the use of Multivariate Curve Resolution methods for the deconvolution of overlapped peaks and their extraction into pure components. We propose the use of a sliding window in the retention time axis to extract the pure components from smaller windows, tracking the pure components across windows. This approach is able to extract analytes with lower response than standard MCR, i.e. compounds that have a low variance in the overall matrix. Finally, we apply some of these developments to real-world applications: a dataset for the prevention of fraud and quality control in the classification of olive oils, measured with GC-IMS, and data for biomarker discovery of prostate cancer by analyzing the headspace of urine samples with a GC-MS instrument.
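The baseline method named above builds on the classic Asymmetric Least Squares (AsLS) of Eilers and Boelens; a minimal sketch of that reference method follows (the thesis proposes a variation of it, not reproduced here). The smoothness parameter lam and asymmetry parameter p are typical illustrative values.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

def asls_baseline(y, lam=1e5, p=0.01, n_iter=10):
    """Asymmetric least squares baseline (Eilers & Boelens):
    minimise sum_i w_i (y_i - z_i)^2 + lam * sum (second difference of z)^2,
    with w_i = p where y > z (points likely on peaks) and 1 - p elsewhere."""
    L = len(y)
    D = sparse.diags([1, -2, 1], [0, -1, -2], shape=(L, L - 2))
    w = np.ones(L)
    for _ in range(n_iter):
        W = sparse.spdiags(w, 0, L, L)
        z = spsolve((W + lam * D @ D.T).tocsc(), w * y)
        w = np.where(y > z, p, 1 - p)   # reweight: down-weight peak points
    return z

# usage: subtract the estimated baseline from a spectrum
y = np.abs(np.random.default_rng(0).standard_normal(1000)).cumsum() / 50
corrected = y - asls_baseline(y)
```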
10

Alexander, Miranda Abhilash. "Spectral factor model for time series learning." Doctoral thesis, Universite Libre de Bruxelles, 2011. http://hdl.handle.net/2013/ULB-DIPOT:oai:dipot.ulb.ac.be:2013/209812.

Full text
Abstract:
Today's computerized processes generate massive amounts of streaming data. In many applications, data is collected for modeling the processes. The process model is hoped to drive objectives such as decision support, data visualization, business intelligence, automation and control, pattern recognition and classification, etc. However, we face significant challenges in data-driven modeling of processes. Apart from the errors, outliers and noise in the data measurements, the main challenge is due to a large dimensionality, which is the number of variables each data sample measures. The samples often form a long temporal sequence called a multivariate time series where any one sample is influenced by the others.

We wish to build a model that will ensure robust generation, reviewing, and representation of new multivariate time series that are consistent with the underlying process.

In this thesis, we adopt a modeling framework to extract characteristics from multivariate time series that correspond to dynamic variation-covariation common to the measured variables across all the samples. Those characteristics of a multivariate time series are named its 'commonalities' and a suitable measure for them is defined. What makes the multivariate time series model versatile is the assumption regarding the existence of a latent time series of known or presumed characteristics and much lower dimensionality than the measured time series; the result is the well-known 'dynamic factor model'.

Original variants of existing methods for estimating the dynamic factor model are developed: The estimation is performed using the frequency-domain equivalent of the dynamic factor model named the 'spectral factor model'. To estimate the spectral factor model, ideas are sought from the asymptotic theory of spectral estimates. This theory is used to attain a probabilistic formulation, which provides maximum likelihood estimates for the spectral factor model parameters. Then, maximum likelihood parameters are developed with all the analysis entirely in the spectral-domain such that the dynamically transformed latent time series inherits the commonalities maximally.

The main contribution of this thesis is a learning framework using the spectral factor model. We define learning as the ability of a computational model of a process to robustly characterize the data the process generates for purposes of pattern matching, classification and prediction. Hence, the spectral factor model could be claimed to have learned a multivariate time series if the latent time series, when dynamically transformed, extracts the commonalities reliably and maximally. The spectral factor model will be used for two main multivariate time series learning applications: First, real-world streaming datasets obtained from various processes are to be classified; in this exercise, human brain magnetoencephalography signals obtained during various cognitive and physical tasks are classified. Second, the commonalities are put to the test by asking for reliable prediction of a multivariate time series given its past evolution; share prices in a portfolio are forecasted as part of this challenge.

For both spectral factor modeling and learning, an analytical solution as well as an iterative solution are developed. While the analytical solution is based on low-rank approximation of the spectral density function, the iterative solution is based on the expectation-maximization algorithm. For the human brain signal classification exercise, a strategy for comparing similarities between the commonalities for various classes of multivariate time series processes is developed. For the share price prediction problem, a vector autoregressive model whose parameters are enriched with the maximum likelihood commonalities is designed. In both these learning problems, the spectral factor model gives commendable performance with respect to competing approaches.
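A rough sketch of the low-rank, frequency-wise idea behind the spectral factor model: estimate the cross-spectral density matrix by Welch averaging and keep its leading eigenvectors at each frequency. The Welch estimator, segment length and factor count are illustrative assumptions, not the thesis' maximum likelihood estimator.

```python
import numpy as np
from scipy.signal import csd

def spectral_factor_loadings(X, q=2, fs=1.0, nperseg=256):
    """Estimate the cross-spectral density matrix S(f) of a multivariate
    time series X (channels x samples) by Welch averaging, then keep the
    leading q eigenvectors per frequency as spectral factor loadings
    (a low-rank approximation of the spectral density)."""
    d, n = X.shape
    f, _ = csd(X[0], X[0], fs=fs, nperseg=nperseg)
    S = np.empty((len(f), d, d), dtype=complex)
    for i in range(d):
        for j in range(d):
            _, S[:, i, j] = csd(X[i], X[j], fs=fs, nperseg=nperseg)
    loadings = np.empty((len(f), d, q), dtype=complex)
    for k in range(len(f)):
        w, V = np.linalg.eigh(S[k])       # Hermitian eigendecomposition
        loadings[k] = V[:, ::-1][:, :q]   # top-q eigenvectors
    return f, loadings

f, L = spectral_factor_loadings(np.random.default_rng(0).standard_normal((4, 4096)))
print(L.shape)  # (frequencies, channels, factors)
```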

Doctorate in Sciences

11

Ablin, Pierre. "Exploration of multivariate EEG /MEG signals using non-stationary models." Thesis, Université Paris-Saclay (ComUE), 2019. http://www.theses.fr/2019SACLT051.

Full text
Abstract:
Independent Component Analysis (ICA) models a set of signals as linear combinations of independent sources. This analysis method plays a key role in electroencephalography (EEG) and magnetoencephalography (MEG) signal processing. Applied to such signals, it makes it possible to isolate interesting brain sources, locate them, and separate them from artifacts. ICA belongs to the toolbox of many neuroscientists, and is a part of the processing pipeline of many research articles. Yet, the most widely used algorithms date back to the 1990s. They are often quite slow, and stick to the standard ICA model, without more advanced features. The goal of this thesis is to develop practical ICA algorithms to help neuroscientists. We follow two axes. The first one is that of speed. We consider the optimization problems solved by two of the ICA algorithms most widely used by practitioners: Infomax and FastICA. We develop a novel technique based on preconditioning the L-BFGS algorithm with Hessian approximations. The resulting algorithm, Picard, is tailored for real data applications, where the independence assumption is never entirely true. On M/EEG data, it converges faster than the 'historical' implementations. Another possibility to accelerate ICA is to use incremental methods, which process a few samples at a time instead of the whole dataset. Such methods have gained huge interest in recent years due to their ability to scale well to very large datasets. We propose an incremental algorithm for ICA, with important descent guarantees. As a consequence, the proposed algorithm is simple to use and does not have a critical and hard-to-tune parameter like a learning rate. Along a second axis, we propose to incorporate noise in the ICA model. Such a model is notoriously hard to fit under the standard non-Gaussian hypothesis of ICA, and would render estimation extremely slow. Instead, we rely on a spectral diversity assumption, which leads to a practical algorithm, SMICA. The noise model opens the door to new possibilities, like finer estimation of the sources, and the use of ICA as a statistically sound dimension reduction technique. Thorough experiments on M/EEG datasets demonstrate the usefulness of this approach. All algorithms developed in this thesis are open-source and available online. The Picard algorithm is included in the largest M/EEG processing libraries: MNE in Python and EEGLAB in Matlab.
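Since the thesis states that Picard is open-sourced, a minimal usage sketch follows, assuming the python-picard package and its picard() entry point as I recall them; the mixing setup is a toy illustration.

```python
import numpy as np
from picard import picard  # pip install python-picard (assumed API)

rng = np.random.default_rng(0)
n, T = 4, 10000
S = rng.laplace(size=(n, T))        # non-Gaussian sources
A = rng.standard_normal((n, n))     # mixing matrix
X = A @ S                           # observed signals (channels x time)

# K: whitening matrix, W: unmixing of the whitened data, Y: sources
K, W, Y = picard(X, max_iter=200, random_state=0)
print(Y.shape)  # (4, 10000)
```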
12

Siepka, Damian. "Development of multidimensional spectral data processing procedures for analysis of composition and mixing state of aerosol particles by Raman and FTIR spectroscopy." Thesis, Lille 1, 2017. http://www.theses.fr/2017LIL10188/document.

Full text
Abstract:
Suitably adjusted, multivariate data processing methods and procedures can significantly improve the process of obtaining knowledge of a sample's composition. Spectroscopic techniques have capabilities for fast analysis of various samples and have been developed for research and industrial purposes. This creates great possibilities for the advanced molecular analysis of complex samples, such as atmospheric aerosols. Airborne particles affect air quality, human health and ecosystem condition, and play an important role in the Earth's climate system. The purpose of this thesis is twofold. On an analytical level, a functional algorithm for the evaluation of the quantitative composition of atmospheric particles from measurements of individual particles by Raman microspectroscopy (RMS) was established. On a constructive level, a readily accessible analytical system for Raman and FTIR data processing was developed. The potential of single-particle analysis by RMS has been exploited through the designed analytical algorithm, based on a combination of multivariate curve resolution and multivariate data treatment, for an efficient description of the chemical mixing of aerosol particles. The algorithm was applied to particles collected in a copper mine in Bolivia and provides a new way of describing a sample. The new user-friendly software, which includes pre-treatment algorithms and several easy-to-access, common multivariate data treatments, is equipped with a graphical interface. The created software was applied to some challenging aspects of pattern recognition in the scope of Raman and FTIR spectroscopy, for coal mine particles, biogenic particles and organic pigments.
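A minimal sketch of the multivariate curve resolution idea the algorithm builds on: alternating least squares factoring of a spectral data matrix into non-negative concentrations and spectra. Clipping-based non-negativity and the random initialisation are simplifying assumptions (production MCR-ALS implementations use constrained least squares and informed initial estimates).

```python
import numpy as np

def mcr_als(D, n_components, n_iter=50, rng=np.random.default_rng(0)):
    """Minimal MCR-ALS: factor D (pixels x wavenumbers) into non-negative
    concentrations C and spectra S so that D ~ C @ S.T, by alternating
    least squares with clipping to enforce non-negativity."""
    S = np.abs(rng.standard_normal((D.shape[1], n_components)))
    for _ in range(n_iter):
        C = np.clip(D @ S @ np.linalg.pinv(S.T @ S), 0, None)
        S = np.clip(D.T @ C @ np.linalg.pinv(C.T @ C), 0, None)
    return C, S

# toy usage: recover two overlapping components from a mixed matrix
D = np.abs(np.random.default_rng(1).random((100, 300)))
C, S = mcr_als(D, n_components=2)
print(np.linalg.norm(D - C @ S.T))   # residual of the bilinear fit
```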
13

Derksen, Timothy J. (Timothy John). "Processing of outliers and missing data in multivariate manufacturing data." Thesis, Massachusetts Institute of Technology, 1996. http://hdl.handle.net/1721.1/38800.

Full text
Abstract:
Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1996.
Includes bibliographical references (leaf 64).
by Timothy J. Derksen.
M.Eng.
14

Jonsson, Pär. "Multivariate processing and modelling of hyphenated metabolite data /." Umeå : Dept. of Chemistry, Umeå University, 2005. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-663.

Full text
15

Oliveira, Irene. "Correlated data in multivariate analysis." Thesis, University of Aberdeen, 2003. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.401414.

Full text
Abstract:
After presenting Principal Component Analysis (PCA) and its relationship with time series data sets, we describe most of the existing techniques in this field. Various techniques, e.g. Singular Spectrum Analysis (SSA), Hilbert EOF, Extended EOF or Multichannel Singular Spectrum Analysis (MSSA), and Principal Oscillation Pattern (POP) Analysis, can be used for such data. The way each method uses the data matrix, or the covariance or correlation matrix, makes it different from the others. SSA may be considered as a PCA performed on lagged versions of a single time series, where we may decompose the original time series into some main components. Following SSA we have its multivariate version (MSSA), where we augment the initial data matrix with lagged versions of each variable (time series), so that past (or future) behaviour can be used to reanalyse the information between variables. In POP Analysis a linear system involving the vector field is analysed, x(t+1) = A x(t) + n(t), in order to 'know' x at time t+1 given the information from time t. The matrix A is estimated by using not only the covariance matrix but also the matrix of covariances between the system at the current time and at lag 1. In Hilbert EOF we try to get some (future) information from the internal correlation in each variable by using the Hilbert transform of each series in an augmented complex matrix, X(t) + i X^H(t), with the data themselves in the real part and the Hilbert-transformed series in the imaginary part. In addition to all these ideas from the statistics and other literature, we develop a new methodology as a modification of HEOF and POP Analysis, namely Hilbert Oscillation Patterns (HOP) Analysis, and the related idea of Hilbert Canonical Correlation Analysis (HCCA), by using the system x^H(t) = A x(t) + n(t). Theory and assumptions are presented, and HOP results are related to the results extracted from a Canonical Correlation Analysis between the time series data matrix and its Hilbert transform. Some examples are given to show the differences and similarities of the results of the HCCA technique with those from PCA, MSSA, HEOF and POPs. We also present PCA for time series as observations, where a technique of linear algebra (PCA) becomes a problem in functional analysis, leading to Functional PCA (FPCA). We also adapt PCA to allow for this and discuss the theoretical and practical behaviour of using PCA on the even part (EPCA) and odd part (OPCA) of the data, and its application to functional data. Comparisons are made between PCA and this modification for the reconstruction of data sets for which considerations of symmetry are especially relevant.
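A worked sketch of the POP estimation step described above: A is obtained from the lag-0 and lag-1 covariance matrices, and the Principal Oscillation Patterns are its eigenvectors. The data here are random placeholders.

```python
import numpy as np

def pop_matrix(X):
    """Estimate A in x(t+1) = A x(t) + n(t) from a multivariate series
    X (time x variables): A = C1 @ inv(C0), where C0 is the lag-0 and
    C1 the lag-1 covariance matrix."""
    Xc = X - X.mean(0)
    C0 = Xc[:-1].T @ Xc[:-1] / (len(X) - 1)
    C1 = Xc[1:].T @ Xc[:-1] / (len(X) - 1)
    return C1 @ np.linalg.inv(C0)

# Principal Oscillation Patterns are the eigenvectors of A;
# complex eigenvalue pairs correspond to damped oscillatory modes.
A = pop_matrix(np.random.default_rng(0).standard_normal((500, 3)))
eigvals, pops = np.linalg.eig(A)
print(eigvals)
```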
16

Prelorendjos, Alexios. "Multivariate analysis of metabonomic data." Thesis, University of Strathclyde, 2014. http://oleg.lib.strath.ac.uk:80/R/?func=dbin-jump-full&object_id=24286.

Full text
Abstract:
Metabonomics is one of the main technologies used in biomedical sciences to improve understanding of how various biological processes of living organisms work. It is considered a more advanced technology than e.g. genomics and proteomics, as it can provide important evidence of molecular biomarkers for the diagnosis of diseases and the evaluation of beneficial and adverse drug effects, by studying the metabolic profiles of living organisms. This is achievable by studying samples of various types such as tissues and biofluids. The findings of a metabonomics study for a specific disease, disorder or drug effect could be applied to other diseases, disorders or drugs, making metabonomics an important tool for biomedical research. This thesis aims to review and study various multivariate statistical techniques which can be used in the exploratory analysis of metabonomics data. To motivate this research, a metabonomics data set containing the metabolic profiles of a group of patients with epilepsy was used. More specifically, the metabolic fingerprints (proton NMR spectra) of 125 patients with epilepsy, of blood serum type, were obtained from the Western Infirmary, Glasgow, for the purposes of this project. These data were originally collected as baseline data in a study to investigate whether treatment with Anti-Epileptic Drugs (AEDs) affects the seizure levels of patients with pharmacoresistant epilepsy. The response to the drug treatment, in terms of the reduction in seizure levels of these patients, enabled two main categories of response to be identified, i.e. responders and non-responders to AEDs. We explore the use of statistical methods used in metabonomics to analyse these data. Novel aspects of the thesis are the use of Self Organising Maps (SOM) and of Fuzzy Clustering Methods for pattern recognition in metabonomics data. Part I of the thesis defines metabonomics and the other main "omics" technologies, and gives a detailed description of the metabonomics data to be analysed, as well as a description of the two main analytical chemical techniques, Mass Spectrometry (MS) and Nuclear Magnetic Resonance (NMR) Spectroscopy, that can be used to generate metabonomics data. Pre-processing and pre-treatment methods that are commonly used on NMR-generated metabonomics data to enhance the quality and accuracy of the data are also discussed. In Part II, several unsupervised statistical techniques are reviewed and applied to the epilepsy data to investigate the capability of these techniques to discriminate the patients according to their type of response. The techniques reviewed include Principal Components Analysis (PCA), Multi-dimensional scaling (both Classical scaling and Sammon's non-linear mapping) and Clustering techniques. The latter include Hierarchical clustering (with emphasis on Agglomerative Nesting algorithms), Partitioning methods (Fuzzy and Hard clustering algorithms) and Competitive Learning algorithms (Self Organizing Maps). The advantages and disadvantages of the different methods are examined for this kind of data. Results of the exploratory multivariate analyses showed that no natural clusters of patients existed with regard to their response to AEDs; therefore none of these techniques was capable of discriminating these patients according to their clinical characteristics.
To examine the capability of an unsupervised technique such as PCA to identify groups in data of this kind, based on metabolic fingerprints of patients with epilepsy, a simulation algorithm was developed to run a series of experiments, covered in Part III of the thesis. The aim of the simulation study is to investigate the extent of the difference between clusters in the data, and under what conditions this difference is detectable by unsupervised techniques. Furthermore, the study examines whether the existence or lack of variation in the mean-shifted variables affects the discriminating ability of the unsupervised techniques (in this case PCA). In each simulation experiment, a reference and a test data set were generated based on the original epilepsy data, and the discriminating capability of PCA was assessed. A test set was generated by mean-shifting a pre-selected number of variables in a reference set. Three methods of selecting the variables to mean-shift (maximum standard deviations, minimum standard deviations and maximum means), five subsets of variables of sizes 1, 3, 20, 120 and 244 (the total number of variables in the data sets) and three sample sizes (100, 500 and 1000) were used. Average values over 100 runs of an experiment for two statistics, the misclassification rate and the average separation (Webb, 2002), were recorded. Results showed that the number of mean-shifted variables (in general) and the method used to select the variables (in some cases) are important factors for the discriminating ability of PCA, whereas the sample size of the two data sets does not play any role in the experiments (although experiments with large sample sizes showed greater stability in the results for the two statistics over the 100 runs of any experiment). The results have implications for the use of PCA with metabonomics data generally.
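A condensed sketch of one simulation run in the spirit of Part III: mean-shift the highest-standard-deviation variables of a test set, project reference and test data onto the first two principal components, and record a misclassification rate. The nearest-centroid rule standing in for the thesis' exact statistics is an assumption for illustration.

```python
import numpy as np

def one_experiment(p=244, n=100, n_shift=20, delta=1.0,
                   rng=np.random.default_rng(0)):
    """One run of the mean-shift experiment: build a reference and a test
    set, shift the n_shift highest-variance variables in the test set,
    project the pooled data onto the first two PCs and report how well a
    nearest-centroid rule separates the two sets."""
    base = rng.standard_normal((2 * n, p))
    shift_idx = np.argsort(base.std(0))[::-1][:n_shift]  # max-sd selection
    X = base.copy()
    X[n:, shift_idx] += delta                            # mean-shift test set
    Xc = X - X.mean(0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = Xc @ Vt[:2].T                                    # PCA scores
    labels = np.repeat([0, 1], n)
    centroids = np.array([T[labels == g].mean(0) for g in (0, 1)])
    pred = np.argmin(((T[:, None, :] - centroids) ** 2).sum(-1), axis=1)
    return (pred != labels).mean()                       # misclassification

print(one_experiment(delta=0.0), one_experiment(delta=2.0))
```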
17

Tavares, Nuno Filipe Ramalho da Cunha. "Multivariate analysis applied to clinical analysis data." Master's thesis, Faculdade de Ciências e Tecnologia, 2014. http://hdl.handle.net/10362/12288.

Full text
Abstract:
Dissertation submitted for the degree of Master in Industrial Engineering and Management
Folate, vitamin B12, iron and hemoglobin are essential for metabolic functions in the body. Their deficiency can be the cause of several known pathologies and, untreated, can be responsible for severe morbidity and even death. The objective of this study is to characterize a population residing in the metropolitan area of Lisbon and Setubal concerning serum levels of folate, vitamin B12, iron and hemoglobin, as well as to find evidence of correlations between these parameters and illnesses, mainly cardiovascular, gastrointestinal and neurological diseases and anemia. Clinical analysis data was collected and submitted to multivariate analysis. First the data was screened with Spearman correlation and the Kruskal-Wallis analysis of variance to study correlations and variability between groups. To characterize the population, we used cluster analysis with Ward's linkage method. Finally a sensitivity analysis was performed to strengthen the results. Positive correlations of iron with ferritin, transferrin and hemoglobin were observed with the Spearman correlation. The Kruskal-Wallis analysis of variance showed significant differences in these biomarkers between persons aged 0 to 29, 30 to 59 and over 60 years old. Cluster analysis proved to be a useful tool for characterizing a population based on its biomarkers, showing evidence of low folate levels for the population in general, and hemoglobin levels below the reference values. Iron and vitamin B12 were within the reference range for most of the population. Low levels of these parameters were registered mainly in patients with cardiovascular, gastrointestinal and neurological diseases and anemia.
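A minimal sketch of the analysis pipeline named in the abstract (Spearman screening, Kruskal-Wallis comparison across age groups, and Ward-linkage clustering), using SciPy on random stand-in data; the column meanings and group coding are illustrative assumptions.

```python
import numpy as np
from scipy import stats
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# stand-in columns: folate, vitamin B12, iron, hemoglobin
data = rng.standard_normal((200, 4))
age_group = rng.integers(0, 3, size=200)   # 0: 0-29, 1: 30-59, 2: 60+

# screening: Spearman correlation and Kruskal-Wallis across age groups
rho, pval = stats.spearmanr(data[:, 2], data[:, 3])   # iron vs hemoglobin
kw = stats.kruskal(*(data[age_group == g, 0] for g in range(3)))

# characterisation: agglomerative clustering with Ward's linkage
Z = linkage(data, method='ward')
clusters = fcluster(Z, t=3, criterion='maxclust')
print(rho, kw.pvalue, np.bincount(clusters)[1:])      # cluster sizes
```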
APA, Harvard, Vancouver, ISO, and other styles
18

Rehman, Naveed Ur. "Data-driven time-frequency analysis of multivariate data." Thesis, Imperial College London, 2011. http://hdl.handle.net/10044/1/9116.

Full text
Abstract:
Empirical Mode Decomposition (EMD) is a data-driven method for the decomposition and time-frequency analysis of real world nonstationary signals. Its main advantages over other time-frequency methods are its locality, data-driven nature, multiresolution-based decomposition, higher time-frequency resolution and its ability to capture oscillation of any type (nonharmonic signals). These properties have made EMD a viable tool for real world nonstationary data analysis. Recent advances in sensor and data acquisition technologies have brought to light new classes of signals containing typically several data channels. Currently, such signals are almost invariably processed channel-wise, which is suboptimal. It is, therefore, imperative to design multivariate extensions of the existing nonlinear and nonstationary analysis algorithms as they are expected to give more insight into the dynamics and the interdependence between multiple channels of such signals. To this end, this thesis presents multivariate extensions of the empirical mode decomposition algorithm and illustrates their advantages with regards to multivariate nonstationary data analysis. Some important properties of such extensions are also explored, including their ability to exhibit wavelet-like dyadic filter bank structures for white Gaussian noise (WGN), and their capacity to align similar oscillatory modes from multiple data channels. Owing to the generality of the proposed methods, an improved multivariate EMD-based algorithm is introduced which solves some inherent problems in the original EMD algorithm. Finally, to demonstrate the potential of the proposed methods, simulations on the fusion of multiple real world signals (wind, images and inertial body motion data) support the analysis.
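For readers unfamiliar with EMD, the following is a compact sketch of a single-channel sifting step, the core of the standard algorithm; it is not the multivariate extension developed in the thesis, which computes envelopes along multiple projection directions, and the fixed iteration count and ignored boundary effects are simplifying assumptions.

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import argrelextrema

def sift_once(x, t):
    """One sifting iteration: subtract the mean of the extrema envelopes."""
    maxima = argrelextrema(x, np.greater)[0]
    minima = argrelextrema(x, np.less)[0]
    if len(maxima) < 2 or len(minima) < 2:
        return x                                   # too few extrema for envelopes
    upper = CubicSpline(t[maxima], x[maxima])(t)   # upper envelope
    lower = CubicSpline(t[minima], x[minima])(t)   # lower envelope
    return x - (upper + lower) / 2.0

t = np.linspace(0, 1, 1000)
x = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 40 * t)

imf = x.copy()
for _ in range(10):        # fixed number of sifting iterations, for brevity
    imf = sift_once(imf, t)
residue = x - imf          # sift the residue again to extract further IMFs
```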
APA, Harvard, Vancouver, ISO, and other styles
19

Droop, Alastair Philip. "Correlation Analysis of Multivariate Biological Data." Thesis, University of York, 2009. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.507622.

Full text
APA, Harvard, Vancouver, ISO, and other styles
20

Collins, Gary Stephen. "Multivariate analysis of flow cytometry data." Thesis, University of Exeter, 2000. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.324749.

Full text
APA, Harvard, Vancouver, ISO, and other styles
21

Zhu, Liang. "Semiparametric analysis of multivariate longitudinal data." Diss., Columbia, Mo. : University of Missouri-Columbia, 2008. http://hdl.handle.net/10355/6044.

Full text
Abstract:
Thesis (Ph. D.)--University of Missouri-Columbia, 2008.
The entire dissertation/thesis text is included in the research.pdf file; the official abstract appears in the short.pdf file (which also appears in the research.pdf); a non-technical general description, or public abstract, appears in the public.pdf file. Title from title screen of research.pdf file (viewed on August 3, 2009). Vita. Includes bibliographical references.
APA, Harvard, Vancouver, ISO, and other styles
22

Haydock, Richard. "Multivariate analysis of Raman spectroscopy data." Thesis, University of Nottingham, 2015. http://eprints.nottingham.ac.uk/30697/.

Full text
Abstract:
This thesis is concerned with developing techniques for analysing Raman spectroscopic images. A Raman spectroscopic image differs from a standard image in that, in place of red, green and blue quantities, each pixel contains a spectrum of light intensities. These spectra are used to identify the chemical components from which the image subject, for example a tablet, is comprised. The study of these types of images is known as chemometrics, with the majority of chemometric methods based on multivariate statistical and image analysis techniques. The work in this thesis has two main foci. The first of these is the spectral decomposition of a Raman image, the purpose of which is to identify the component chemicals and their concentrations. The standard method for this is to fit a bilinear model to the image, where both parts of the model, representing components and concentrations, must be estimated. As the standard bilinear model is non-identifiable in its solutions, we investigate the range of possible solutions in the solution space with a random walk. We also derive an improved model for spectral decomposition, combining cluster analysis techniques with the standard bilinear model. For this purpose we apply the expectation maximisation algorithm to a Gaussian mixture model with bilinear means, to represent our spectra and concentrations. This reduces noise in the estimated chemical components by separating the Raman image subject from the background. The second focus of this thesis is the analysis of our spectral decomposition results. For testing the chemical components for uniform mixing, we derive test statistics for identifying patterns in the image based on Minkowski measures, grey level co-occurrence matrices and neighbouring pixel correlations. However, with a non-identifiable model, any hypothesis test performed on a solution is specific to that solution alone; therefore, to obtain conclusions for a range of solutions, we combined our test statistics with our random walk. We also investigate the analysis of a time series of Raman images as the subject dissolved. Using models comprised of Gaussian cumulative distribution functions, we estimate the changes in concentration levels of dissolving tablets between the scan times, the results of which allowed us to describe the dissolution process in terms of the quantities of component chemicals.
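As context for the bilinear model mentioned above, here is a minimal alternating-least-squares sketch of the decomposition D ≈ C S^T on simulated data; the non-negativity clipping, dimensions and initialisation are illustrative assumptions, and the thesis's mixture-model extension and random-walk exploration are not shown.

```python
import numpy as np

rng = np.random.default_rng(1)
n_pixels, n_wavenumbers, n_components = 400, 300, 3

S_true = np.abs(rng.normal(size=(n_wavenumbers, n_components)))  # pure spectra
C_true = np.abs(rng.normal(size=(n_pixels, n_components)))       # concentrations
D = C_true @ S_true.T + 0.01 * rng.normal(size=(n_pixels, n_wavenumbers))

C = np.abs(rng.normal(size=(n_pixels, n_components)))  # random initialisation
for _ in range(100):                                   # alternating least squares
    S = np.clip(np.linalg.lstsq(C, D, rcond=None)[0].T, 0, None)
    C = np.clip(np.linalg.lstsq(S, D.T, rcond=None)[0].T, 0, None)

# any rotation/scaling of (C, S) that fits D equally well illustrates the
# non-identifiability that motivates exploring the solution space
print(np.linalg.norm(D - C @ S.T) / np.linalg.norm(D))
```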
APA, Harvard, Vancouver, ISO, and other styles
23

Lans, Ivo A. van der. "Nonlinear multivariate analysis for multiattribute preference data." [Leiden] : DSWO Press, Leiden University, 1992. http://catalog.hathitrust.org/api/volumes/oclc/28733326.html.

Full text
APA, Harvard, Vancouver, ISO, and other styles
24

Yang, Di. "Analysis guided visual exploration of multivariate data." Worcester, Mass. : Worcester Polytechnic Institute, 2007. http://www.wpi.edu/Pubs/ETD/Available/etd-050407-005925/.

Full text
APA, Harvard, Vancouver, ISO, and other styles
25

Snavely, Anna Catherine. "Multivariate Data Analysis with Applications to Cancer." Thesis, Harvard University, 2012. http://dissertations.umi.com/gsas.harvard:10371.

Full text
Abstract:
Multivariate data is common in a wide range of settings. As data structures become increasingly complex, additional statistical tools are required to perform proper analyses. In this dissertation we develop and evaluate methods for the analysis of multivariate data generated from cancer trials. In the first chapter we consider the analysis of clustered survival data that can arise from multicenter clinical trials. In particular, we review and compare marginal and conditional models numerically through simulations and discuss model selection techniques. A multicenter clinical trial of children with acute lymphoblastic leukemia is used to illustrate the findings. The second and third chapters both address the setting where multiple outcomes are collected when the outcome of interest cannot be measured directly. A head and neck cancer trial in which multiple outcomes were collected to measure dysphagia was the particular motivation for this part of the dissertation. Specifically, in the second chapter we propose a semiparametric latent variable transformation model that incorporates measurable outcomes of mixed types, including censored outcomes. This method extends traditional approaches by allowing the relationship between the measurable outcomes and latent variable to be unspecified, rendering more robust inference. Using this approach we can directly estimate the treatment (or other covariate) effect on the unobserved latent variable, enhancing interpretation. In the third chapter, the basic model from the second chapter is maintained, but additional parametric assumptions are made. This model still has the advantages of allowing for censored measurable outcomes and being able to estimate a treatment effect on the latent variable, but has the added advantage of good performance in a small data set. Together the methods proposed in the second and third chapters provide a comprehensive approach for the analysis of complex multiple outcomes data.
APA, Harvard, Vancouver, ISO, and other styles
26

Bolton, Richard John. "Multivariate analysis of multiproduct market research data." Thesis, University of Exeter, 2000. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.302542.

Full text
APA, Harvard, Vancouver, ISO, and other styles
27

Durif, Ghislain. "Multivariate analysis of high-throughput sequencing data." Thesis, Lyon, 2016. http://www.theses.fr/2016LYSE1334/document.

Full text
Abstract:
The statistical analysis of Next-Generation Sequencing data raises many computational challenges regarding modeling and inference, especially because of the high dimensionality of genomic data. The research work in this manuscript concerns hybrid dimension reduction methods that rely on both compression (representation of the data in a lower dimensional space) and variable selection. Developments are made concerning the sparse Partial Least Squares (PLS) regression framework for supervised classification, and the sparse matrix factorization framework for unsupervised exploration. In both situations, our main purpose will be to focus on the reconstruction and visualization of the data. First, we will present a new sparse PLS approach, based on an adaptive sparsity-inducing penalty, that is suitable for logistic regression to predict the label of a discrete outcome. For instance, such a method will be used for prediction (fate of patients or specific type of unidentified single cells) based on gene expression profiles. The main issue in such a framework is to account for the response to discard irrelevant variables. We will highlight the direct link between the derivation of the algorithms and the reliability of the results. Then, motivated by questions regarding single-cell data analysis, we propose a flexible model-based approach for the factorization of count matrices that accounts for over-dispersion as well as zero-inflation (both characteristic of single-cell data), for which we derive an estimation procedure based on variational inference. In this scheme, we consider probabilistic variable selection based on a spike-and-slab model suitable for count data. The interest of our procedure for data reconstruction, visualization and clustering will be illustrated by simulation experiments and by preliminary results on single-cell data analysis. All proposed methods were implemented in two R packages, "plsgenomics" and "CMF", based on high performance computing.
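To make the two model families concrete, here is a rough sketch on simulated expression-like data; note that scikit-learn's PLSRegression is the classical, non-sparse variant, so it stands in for, and is not, the adaptively penalised sparse PLS of the thesis (implemented in its R package "plsgenomics").

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(2)
n, p = 60, 2000                          # few samples, many "genes"
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:10] = 2.0                          # only 10 informative variables
y = X @ beta + rng.normal(size=n)

pls = PLSRegression(n_components=3).fit(X, y)
enet = ElasticNet(alpha=0.5, l1_ratio=0.9).fit(X, y)   # sparse linear model

print("PLS R^2:", pls.score(X, y))
print("Elastic Net non-zero coefficients:", int(np.sum(enet.coef_ != 0)))
```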
APA, Harvard, Vancouver, ISO, and other styles
28

Tardif, Geneviève. "Multivariate Analysis of Canadian Water Quality Data." Thesis, Université d'Ottawa / University of Ottawa, 2015. http://hdl.handle.net/10393/32245.

Full text
Abstract:
Physical-chemical water quality data from lotic water monitoring sites across Canada were integrated into one dataset. Two overlapping matrices of data were analyzed with principal component analysis (PCA) and cluster analysis to uncover structure and patterns in the data. The first matrix (Matrix A) had 107 sites located throughout Canada, and the following water quality parameters: pH, specific conductance (SC), and total phosphorus (TP). The second matrix (Matrix B) included more variables: calcium (Ca), chloride (Cl), total alkalinity (T_ALK), dissolved oxygen (DO), water temperature (WT), pH, SC and TP, for a subset of 42 sites. Landscape characteristics were calculated for each water quality monitoring site and their importance in explaining water quality data was examined through redundancy analysis. The first principal components in the analyses of Matrices A and B were most correlated with SC, suggesting this parameter is the most representative of water quality variance at the scale of Canada. Overlaying cluster analysis results on PCA information proved to be an excellent means of identifying the major water characteristics defining each group; mapping cluster membership provided information on the groups' spatial distribution and was informative with regard to the probable environmental influences on each group. Redundancy analyses produced significant predictive models of water quality, demonstrating that landscape characteristics are determinant factors in water quality at the country scale. The proportion of cropland and the mean annual total precipitation in the drainage area were the landscape variables with the most variance explained. Assembling a consistent dataset of water quality data from monitoring locations throughout Canada proved difficult due to the unevenness of the monitoring programs in place. It is therefore recommended that a standard for monitoring a minimum core set of water quality variables be implemented throughout the country to support future nation-wide analyses of water quality data.
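The analysis pattern described, PCA with cluster membership overlaid, can be sketched as follows; the simulated site-by-parameter matrix and the choice of four clusters are placeholders, not the thesis's data or settings.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
sites = rng.normal(size=(107, 3))        # stand-in for pH, SC, TP at 107 sites
Z = StandardScaler().fit_transform(sites)

scores = PCA(n_components=2).fit_transform(Z)                       # ordination
groups = AgglomerativeClustering(n_clusters=4, linkage="ward").fit_predict(Z)

# plotting `scores` coloured by `groups` gives the "clusters on PCA" view
# used to characterise the water chemistry defining each group
```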
APA, Harvard, Vancouver, ISO, and other styles
29

Bergfors, Linus. "Explorative Multivariate Data Analysis of the Klinthagen Limestone Quarry Data." Thesis, Uppsala University, Department of Information Technology, 2010. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-122575.

Full text
Abstract:
Quarry planning at Klinthagen is currently rough, which provides an opportunity to introduce new methods to improve quarry gain and efficiency. Nordkalk AB, active at Klinthagen, wishes to start a new quarry at a nearby location. To exploit future quarries in an efficient manner and ensure production quality, multivariate statistics may help gather important information.

In this thesis the possibilities of the multivariate statistical approaches of Principal Component Analysis (PCA) and Partial Least Squares (PLS) regression were evaluated on the Klinthagen bore data. PCA data were spatially interpolated by Kriging, which was also evaluated and compared to IDW interpolation.

Principal component analysis supplied an overview of the relations between the variables, but also visualised the problems involved in linking geophysical data to geochemical data and the inaccuracy introduced by lacking data quality.

The PLS regression further emphasised the geochemical-geophysical problems, but also showed good precision when applied to strictly geochemical data.

Spatial interpolation by Kriging did not result in significantly better approximations than the less complex control interpolation by IDW.

In order to improve the information content of the data when modelled by PCA, a more discrete sampling method would be advisable. Data quality may cause problems, though with today's sampling technique it was considered to be of minor consequence.

When a single geophysical component is to be predicted from chemical variables, further geophysical data are needed to complement existing data to achieve satisfactory PLS models.

The stratified rock composition caused problems when spatially interpolated. Further investigations should be performed to develop more suitable interpolation techniques.
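For reference, the control method mentioned above, inverse distance weighting (IDW), amounts to a few lines of NumPy; the coordinates and values below are simulated stand-ins for the bore data.

```python
import numpy as np

def idw(xy_known, values, xy_query, power=2.0, eps=1e-12):
    """Predict at query points as distance-weighted means of known values."""
    d = np.linalg.norm(xy_query[:, None, :] - xy_known[None, :, :], axis=2)
    w = 1.0 / (d ** power + eps)           # eps guards against division by zero
    return (w @ values) / w.sum(axis=1)

rng = np.random.default_rng(4)
xy = rng.uniform(0, 100, size=(50, 2))     # bore hole positions (simulated)
v = np.sin(xy[:, 0] / 20) + rng.normal(scale=0.1, size=50)

grid = np.dstack(np.meshgrid(np.arange(100.0), np.arange(100.0))).reshape(-1, 2)
surface = idw(xy, v, grid).reshape(100, 100)
```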

APA, Harvard, Vancouver, ISO, and other styles
30

Lee, Yau-wing. "Modelling multivariate survival data using semiparametric models." Click to view the E-thesis via HKUTO, 2000. http://sunzi.lib.hku.hk/hkuto/record/B4257528X.

Full text
APA, Harvard, Vancouver, ISO, and other styles
31

Irick, Nancy. "Post Processing Data Analysis." International Foundation for Telemetering, 2009. http://hdl.handle.net/10150/606091.

Full text
Abstract:
ITC/USA 2009 Conference Proceedings / The Forty-Fifth Annual International Telemetering Conference and Technical Exhibition / October 26-29, 2009 / Riviera Hotel & Convention Center, Las Vegas, Nevada
Once the test is complete, the job of the Data Analyst begins. Files from the various acquisition systems are collected. It is the job of the analyst to put together these files in a readable format so the success or failure of the test can be determined. This paper will discuss the process of breaking down these files, comparing data from different systems, and methods of presenting the data.
APA, Harvard, Vancouver, ISO, and other styles
32

李友榮 and Yau-wing Lee. "Modelling multivariate survival data using semiparametric models." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2000. http://hub.hku.hk/bib/B4257528X.

Full text
APA, Harvard, Vancouver, ISO, and other styles
33

Billah, Baki. "The analysis of multivariate incomplete failure time data." Thesis, National Library of Canada = Bibliothèque nationale du Canada, 1995. http://www.collectionscanada.ca/obj/s4/f2/dsk2/ftp04/mq25823.pdf.

Full text
APA, Harvard, Vancouver, ISO, and other styles
34

Rawizza, Mark Alan. "Time-series analysis of multivariate manufacturing data sets." Thesis, Massachusetts Institute of Technology, 1996. http://hdl.handle.net/1721.1/10895.

Full text
APA, Harvard, Vancouver, ISO, and other styles
35

Ritchie, Elspeth Kathryn. "Application of multivariate data analysis in biopharmaceutical production." Thesis, University of Newcastle upon Tyne, 2016. http://hdl.handle.net/10443/3356.

Full text
Abstract:
In 2004, the FDA launched the Process Analytical Technology (PAT) initiative to support product and process development. Even before this, the biologics manufacturing industry was working to implement PAT. While a strong focus of PAT is the implementation of new monitoring technologies, there is also a strong emphasis on the use of multivariate data analysis (MVDA). Effective implementation and integration of MVDA is of particular interest as it can be applied retroactively to historical datasets in addition to current datasets. However, translation of academic research into industrial ways of working can be slowed or prevented by many obstacles, from proposed solutions being workable only by the original academic to a need to prove that time invested in developing MVDA models and methodologies will result in positive business impacts (e.g. reduction of costs or man hours). The presented research applied MVDA techniques to datasets from three scales typically encountered during investigations of biologics manufacturing processes: a single-product dataset; a single-product, multi-scale dataset; and a multi-product, multi-scale, single-platform dataset. These datasets were interrogated using multiple approaches and with multiple objectives (e.g. indicators/causes of productivity variation, comparison of pH measurement technologies). Individual project outcomes culminated in the creation of a robust statistical toolbox. The toolbox captures an array of MVDA techniques, from PCA and PLS to decision trees employing k-NN. These are supported by frameworks and guidance for implementation based on interrogation aims encountered in a contract manufacturing environment. The presented frameworks ranged from extraction of indirectly captured information (Chapter 4) to meta-analytical strategies (Chapter 6). Software-based tools generated during the research ranged from translation of high-frequency online monitoring data into robust summary statistics with intuitive meaning (Appendix A) to tools enabling a potential reduction in confounding underlying variation in dataset structures through the use of alternative progression variables (Chapter 5). Each tool was designed to fit into current and future planned ways of working at the sponsor company. The presented research demonstrates a range of investigation aims and challenges encountered in a contract manufacturing organisation, with demonstrated benefits from ease of integration into normal work process flows and savings in time and human resources.
APA, Harvard, Vancouver, ISO, and other styles
36

Lawal, Najib. "Modelling and multivariate data analysis of agricultural systems." Thesis, University of Manchester, 2015. https://www.research.manchester.ac.uk/portal/en/theses/modelling-and-multivariate-data-analysis-of-agricultural-systems(f6b86e69-5cff-4ffb-a696-418662ecd694).html.

Full text
Abstract:
The broader research area investigated during this programme was conceived from a goal to contribute towards solving the challenge of food security in the 21st century through the reduction of crop loss and minimisation of fungicide use. This is aimed to be achieved through the introduction of an empirical approach to agricultural disease monitoring. In line with this, the SYIELD project, initiated by a consortium involving the University of Manchester and Syngenta, among others, proposed a novel biosensor design that can electrochemically detect viable airborne pathogens by exploiting the biology of plant-pathogen interaction. This approach offers an improvement on the inefficient and largely experimental methods currently used. Within this context, this PhD focused on the adoption of multidisciplinary methods to address three key objectives that are central to the success of the SYIELD project: local spore ingress near canopies, the evaluation of a suitable model that can describe spore transport, and multivariate analysis of the potential monitoring network built from these biosensors. The local transport of spores was first investigated by carrying out a field trial experiment at Rothamsted Research, UK, in order to investigate spore ingress in OSR canopies, generate reliable data for testing the prototype biosensor, and evaluate a trajectory model. During the experiment, spores were air-sampled and quantified using established manual detection methods. Results showed that the manual methods, such as colourimetric detection, are more sensitive than the proposed biosensor, suggesting the proxy measurement mechanism used by the biosensor may not be reliable in live deployments where spores are likely to be contaminated by impurities and other inhibitors of oxalic acid production. Spores quantified using the more reliable quantitative Polymerase Chain Reaction proved informative and provided novel data of high experimental value. The dispersal of this data was found to fit a power decay law, a finding that is consistent with experiments in other crops. In the second area investigated, a 3D backward Lagrangian Stochastic (bLS) model was parameterised and evaluated with the field trial data. The bLS model, parameterised with Monin-Obukhov Similarity Theory (MOST) variables, showed good agreement with experimental data and compared favourably in terms of performance statistics with a recent application of an LS model in a maize canopy. Results obtained from the model were found to be more accurate above the canopy than below it. This was attributed to a higher error during initialisation of release velocities below the canopy. Overall, the bLS model performed well and demonstrated suitability for adoption in estimating above-canopy spore concentration profiles, which can further be used for designing efficient deployment strategies. The final area of focus was the monitoring of a potential biosensor network. A novel framework based on Multivariate Statistical Process Control (MSPC) concepts was proposed and applied to data from a pollution-monitoring network. The main limitation of traditional MSPC in spatial data applications was identified as a lack of spatial awareness by the PCA model when considering correlation breakdowns caused by an incoming erroneous observation. This resulted in misclassification of healthy measurements as erroneous. The proposed Kriging-augmented MSPC approach was able to incorporate this capability and significantly reduce the number of false alarms.
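The power decay law reported for spore dispersal can be fitted with a standard curve fit; the distances and counts below are invented for illustration only.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(d, a, b):
    return a * d ** (-b)

distance = np.array([1.0, 2.0, 4.0, 8.0, 16.0])       # m from the source
count = np.array([950.0, 420.0, 180.0, 70.0, 30.0])   # spores sampled (invented)

(a, b), _ = curve_fit(power_law, distance, count, p0=(1000.0, 1.0))
print(f"fitted decay: {a:.0f} * d^(-{b:.2f})")
```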
APA, Harvard, Vancouver, ISO, and other styles
37

Hopkins, Julie Anne. "Sampling designs for exploratory multivariate analysis." Thesis, University of Sheffield, 2000. http://etheses.whiterose.ac.uk/14798/.

Full text
Abstract:
This thesis is concerned with problems of variable selection, influence of sample size and related issues in the applications of various techniques of exploratory multivariate analysis (in particular, correspondence analysis, biplots and canonical correspondence analysis) to archaeology and ecology. Data sets (both published and new) are used to illustrate these methods and to highlight the problems that arise - these practical examples are returned to throughout as the various issues are discussed. Much of the motivation for the development of the methodology has been driven by the needs of the archaeologists providing the data, who were consulted extensively during the study. The first (introductory) chapter includes a detailed description of the data sets examined and the archaeological background to their collection. Chapters Two, Three and Four explain in detail the mathematical theory behind the three techniques. Their uses are illustrated on the various examples of interest, raising data-driven questions which become the focus of the later chapters. The main objectives are to investigate the influence of various design quantities on the inferences made from such multivariate techniques. Quantities such as the sample size (e.g. number of artefacts collected), the number of categories of classification (e.g. of sites, wares, contexts) and the number of variables measured compete for fixed resources in archaeological and ecological applications. Methods of variable selection and the assessment of the stability of the results are further issues of interest and are investigated using bootstrapping and procrustes analysis. Jack-knife methods are used to detect influential sites, wares, contexts, species and artefacts. Some existing methods of investigating issues such as those raised above are applied and extended to correspondence analysis in Chapters Five and Six. Adaptions of them are proposed for biplots in Chapters Seven and Eight and for canonical correspondence analysis in Chapter Nine. Chapter Ten concludes the thesis.
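The bootstrap-plus-procrustes idea for assessing stability can be sketched for correspondence analysis as follows; the contingency table is invented, the CA is reduced to an SVD of standardised residuals, and multinomial resampling stands in for the thesis's specific designs.

```python
import numpy as np
from scipy.spatial import procrustes

def ca_coords(N):
    """Column standard coordinates from the SVD of standardised residuals."""
    P = N / N.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    return Vt[:2].T / np.sqrt(c)[:, None]        # first two dimensions

rng = np.random.default_rng(5)
N = rng.integers(5, 60, size=(8, 5)).astype(float)   # e.g. sites x wares counts
base = ca_coords(N)

spreads = []
total, probs = int(N.sum()), (N / N.sum()).ravel()
for _ in range(200):                                 # bootstrap the table
    boot = rng.multinomial(total, probs).reshape(N.shape).astype(float)
    _, aligned, _ = procrustes(base, ca_coords(boot))  # align before comparing
    spreads.append(aligned)
stability = np.std(np.stack(spreads), axis=0)        # per-category stability
```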
APA, Harvard, Vancouver, ISO, and other styles
38

Zhou, Feifei, and 周飞飞. "Cure models for univariate and multivariate survival data." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2011. http://hub.hku.hk/bib/B45700977.

Full text
APA, Harvard, Vancouver, ISO, and other styles
39

Petersson, Henrik. "Multivariate Exploration and Processing of Sensor Data-applications with multidimensional sensor systems." Doctoral thesis, Linköpings universitet, Tillämpad Fysik, 2008. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-14879.

Full text
Abstract:
A sensor is a device that transforms a physical, chemical, or biological stimulus into a readable signal. Sensors form an integral part of modern technology, and many are trying to take the development of sensor technology further. Sensor systems are becoming more and more complex and may contain a wide range of different sensors, where each may deliver a multitude of signals. Although the data generated by modern sensor systems contain lots of information, the information may not be clearly visible. Appropriate handling of data becomes crucial to reveal what is sought, but unfortunately, that process is not always straightforward and there are many aspects to consider. Therefore, analysis of multidimensional sensor data has become a science. The topic of this thesis is signal processing of multidimensional sensor data. Surveys are given of methods to explore data and to use the data to quantify or classify samples. It is also discussed how to avoid the rise of artifacts and how to compensate for sensor deficiencies. Special interest is put on methods that are practically applicable to chemical gas sensors. The merits and limitations of chemical sensors are discussed, and it is argued that multivariate data analysis plays an important role when using such sensors. The contribution made by this thesis is primarily on techniques dealing with difficulties related to the operation of sensors in applications. In the second paper, a method is suggested that aims at suppressing the negative effects caused by unwanted sensor-to-sensor differences. If such differences are not suppressed sufficiently, systems where sensors occasionally must be replaced may degrade and lose performance. The strong point of the suggested method is its relative ease of use in large-scale production of sensor components and when integrating sensors into mass-market products. The third paper presents a method that facilitates and speeds up the process of assembling an array of sensors that is optimal for a particular application. The method combines multivariate data analysis with the 'Scanning Light Pulse Technique'. In the first and fourth papers, the problem of source separation is studied. In two separate applications, one using gas sensors for combustion control and one using acoustic sensors for ground surveillance, it was identified that the current sensors output mixtures of both interesting and interfering signals. By different means, the two papers apply and evaluate methods to extract the relevant information under such circumstances.
A sensor is a component that transforms a physical, chemical, or biological quantity or quality into a readable signal. Sensors are today an important part of most high-technology products, and sensor research is an active field. The complexity of sensor-based systems is increasing, and it is becoming possible to register ever more types of measurement signals. The measurement signals are not always directly interpretable, so signal processing becomes an essential tool for extracting the important information sought. Signal processing of sensor signals is unfortunately not an uncomplicated procedure, and there are many aspects to consider. For this reason, signal processing and analysis of sensor signals has developed into a research field of its own. This thesis treats methods for analysing complex multidimensional sensor signals. An introduction is given to methods for classifying and quantifying, from measurements, the properties of measured objects. An overview is given of the effects that can arise from imperfections in the sensors, and methods to avoid or alleviate the problems these imperfections can give rise to are discussed. Special weight is given to methods that are directly applicable and useful for systems of chemical sensors. The thesis includes four papers, each illustrating how the described methods can be used in practical situations.
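As a concrete example of the source-separation problem studied in the first and fourth papers, here is a blind source separation sketch using FastICA; the synthetic sources, the mixing matrix and the choice of ICA (rather than the methods actually applied in the papers) are all assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(6)
t = np.linspace(0, 10, 4000)
s1 = np.sin(2 * np.pi * 1.0 * t)              # signal of interest
s2 = np.sign(np.sin(2 * np.pi * 0.3 * t))     # interfering signal
S = np.c_[s1, s2] + 0.05 * rng.normal(size=(t.size, 2))

A = np.array([[1.0, 0.6],
              [0.4, 1.0]])                    # unknown mixing
X = S @ A.T                                   # what the sensors actually deliver

recovered = FastICA(n_components=2, random_state=0).fit_transform(X)
```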
APA, Harvard, Vancouver, ISO, and other styles
40

Nicolini, Olivier. "LIBS Multivariate Analysis with Machine Learning." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-286595.

Full text
Abstract:
Laser-Induced Breakdown Spectroscopy (LIBS) is a spectroscopic technique used for chemical analysis of materials. By analyzing the spectrum obtained with this technique it is possible to determine the chemical composition of a sample. The possibility to analyze materials in a contactless and online fashion, without sample preparation, makes LIBS one of the most interesting techniques for chemical composition analysis. However, despite its intrinsic advantages, LIBS analysis suffers from poor accuracy and limited reproducibility of the results due to interference effects caused by the chemical composition of the sample or other experimental factors. How to improve the accuracy of the analysis by extracting useful information from high-dimensional LIBS data remains the main challenge of this technique. In the present work, with the purpose of proposing a robust analysis method, I present a pipeline for multivariate regression on LIBS data composed of preprocessing, feature selection, and regression. First, raw data are preprocessed by application of intensity filtering, normalization and baseline correction to mitigate the effect of interference factors such as laser energy fluctuations or the presence of a baseline in the spectrum. Feature selection allows finding the most informative lines for an element, which are then used as input in the subsequent regression phase to predict the element concentration. Partial Least Squares (PLS) and Elastic Net showed the best predictive ability among the regression methods investigated, while Interval PLS (iPLS) and Iterative Predictor Weighting PLS (IPW-PLS) proved to be the best feature selection algorithms for this type of data. By applying these feature selection algorithms to the full LIBS spectrum before regression with PLS or Elastic Net, it is possible to get accurate predictions in a robust fashion.
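The pipeline shape described, normalisation, baseline correction, feature selection, then PLS, can be sketched as below; the spectra are simulated, and the polynomial baseline and variance-based feature filter are crude stand-ins for the thesis's preprocessing and its iPLS/IPW-PLS selection.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(7)
n, p = 80, 1024
xs = np.arange(p)
spectra = np.abs(rng.normal(size=(n, p))) + np.linspace(0.0, 1.0, p)  # drifting baseline
y = spectra[:, 100] * 3.0 + rng.normal(scale=0.1, size=n)             # "concentration"

spectra = spectra / spectra.sum(axis=1, keepdims=True)   # total-intensity normalisation
coeffs = np.polynomial.polynomial.polyfit(xs, spectra.T, deg=2)   # per-spectrum baseline
spectra = spectra - np.polynomial.polynomial.polyval(xs, coeffs)  # baseline correction

variances = spectra.var(axis=0)
keep = variances > np.quantile(variances, 0.9)           # naive feature selection
model = PLSRegression(n_components=5).fit(spectra[:, keep], y)
print("R^2:", model.score(spectra[:, keep], y))
```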
APA, Harvard, Vancouver, ISO, and other styles
41

Ehlers, Rene. "Maximum likelihood estimation procedures for categorical data." Pretoria : [s.n.], 2002. http://upetd.up.ac.za/thesis/available/etd-07222005-124541.

Full text
APA, Harvard, Vancouver, ISO, and other styles
42

Cai, Jianwen. "Generalized estimating equations for censored multivariate failure time data /." Thesis, Connect to this title online; UW restricted, 1992. http://hdl.handle.net/1773/9581.

Full text
APA, Harvard, Vancouver, ISO, and other styles
43

Nothnagel, Carien. "Multivariate data analysis using spectroscopic data of fluorocarbon alcohol mixtures / Nothnagel, C." Thesis, North-West University, 2012. http://hdl.handle.net/10394/7064.

Full text
Abstract:
Pelchem, a commercial subsidiary of Necsa (South African Nuclear Energy Corporation), produces a range of commercial fluorocarbon products while driving research and development initiatives to support the fluorine product portfolio. One such initiative is to develop improved analytical techniques to analyse product composition during development and to quality-assure produce. Generally the C–F type products produced by Necsa are in a solution of anhydrous HF, and cannot be directly analyzed with traditional techniques without derivatisation. A technique such as vibrational spectroscopy, which can analyze these products directly without further preparation, has a distinct advantage. However, spectra of mixtures of similar compounds are complex and not suitable for traditional quantitative regression analysis. Multivariate data analysis (MVA) can be used in such instances to exploit the complex nature of the spectra to extract quantitative information on the composition of mixtures. A selection of fluorocarbon alcohols was made to act as representatives for fluorocarbon compounds. Experimental design theory was used to create a calibration range of mixtures of these compounds. Raman and infrared (NIR and ATR–IR) spectroscopy were used to generate spectral data of the mixtures, and these data were analyzed with MVA techniques by the construction of regression and prediction models. Selected samples from the mixture range were chosen to test the predictive ability of the models. Analysis and regression models (PCR, PLS2 and PLS1) gave good model fits (R2 values larger than 0.9). Raman spectroscopy was the most efficient technique and gave a high prediction accuracy (at 10% accepted standard deviation), provided the minimum mass of a component exceeded 16% of the total sample. The infrared techniques also performed well in terms of fit and prediction. The NIR spectra were subject to signal saturation as a result of using long path length sample cells; this was shown to be the main reason for the loss in efficiency of this technique compared to Raman and ATR–IR spectroscopy. It was shown that multivariate data analysis of spectroscopic data of the selected fluorocarbon compounds could be used to quantitatively analyse mixtures, with the possibility of further optimization of the method. This representative study indicates that the combination of MVA and spectroscopy can be used successfully in the quantitative analysis of other fluorocarbon compound mixtures.
Thesis (M.Sc. (Chemistry))--North-West University, Potchefstroom Campus, 2012.
APA, Harvard, Vancouver, ISO, and other styles
44

Ahmadi-Nedushan, Behrooz 1966. "Multivariate statistical analysis of monitoring data for concrete dams." Thesis, McGill University, 2002. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=82815.

Full text
Abstract:
Major dams in the world are often instrumented in order to validate numerical models, to gain insight into the behavior of the dam, to detect anomalies, and to enable a timely response either in the form of repairs, reservoir management, or evacuation. Advances in automated data monitoring systems make it possible to regularly collect data on a large number of instruments for a dam. Managing these data is a major concern, since traditional means of monitoring each instrument are time-consuming and personnel-intensive. Among the tasks that need to be performed are: identification of faulty instruments, removal of outliers, data interpretation, model fitting and management of alarms for detecting statistically significant changes in the response of a dam.
Statistical models such as multiple linear regression and back-propagation neural networks have been used to estimate the response of individual instruments. Multiple linear regression models are of two kinds: (1) Hydro-Seasonal-Time (HST) models and (2) models that consider concrete temperatures as predictors.
Univariate, bivariate, and multivariate methods are proposed for the identification of anomalies in the instrumentation data. The source of these anomalies can be bad readings, faulty instruments, or changes in dam behavior.
The proposed methodologies are applied to three different dams, Idukki, Daniel Johnson and Chute-a-Caron, which are, respectively, an arch, a multiple-arch and a gravity dam. Displacements, strains, flow rates, and crack openings of these three dams are analyzed.
This research also proposes various multivariate statistical analysis and artificial neural network techniques to analyze dam monitoring data. One of these methods, Principal Component Analysis (PCA), is concerned with explaining the variance-covariance structure of a data set through a few linear combinations of the original variables; the general objectives are (1) data reduction and (2) data interpretation. Other multivariate analysis methods, such as canonical correlation analysis, partial least squares and nonlinear principal component analysis, are also discussed. The advantages of these methodologies for noise reduction, the reduction of the number of variables that have to be monitored, the prediction of response parameters, and the identification of faulty readings are examined. Results indicated that dam responses are generally correlated and that only a few principal components can summarize the behavior of a dam.
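The anomaly-detection side of this work can be illustrated with a standard PCA monitoring sketch: fit PCA on a healthy reference period, then flag new readings with large Hotelling T^2 or residual (SPE) statistics; the simulated instruments and control logic here are illustrative, not the dams' actual data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(8)
reference = rng.normal(size=(500, 12))                 # 12 instruments, healthy period
reference[:, 1] = reference[:, 0] + 0.1 * rng.normal(size=500)   # correlated pair

scaler = StandardScaler().fit(reference)
pca = PCA(n_components=3).fit(scaler.transform(reference))

def t2_and_spe(x):
    """Hotelling T^2 and squared prediction error for new readings."""
    z = scaler.transform(x)
    scores = pca.transform(z)
    t2 = np.sum(scores**2 / pca.explained_variance_, axis=1)
    residual = z - pca.inverse_transform(scores)
    spe = np.sum(residual**2, axis=1)
    return t2, spe

faulty = rng.normal(size=(1, 12))
faulty[0, 1] += 6.0                 # one instrument breaks the correlation
print(t2_and_spe(faulty))           # large values flag the anomaly
```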
APA, Harvard, Vancouver, ISO, and other styles
45

Wang, Lianming. "Statistical analysis of multivariate interval-censored failure time data." Diss., Columbia, Mo. : University of Missouri-Columbia, 2006. http://hdl.handle.net/10355/4375.

Full text
Abstract:
Thesis (Ph.D.)--University of Missouri-Columbia, 2006.
The entire dissertation/thesis text is included in the research.pdf file; the official abstract appears in the short.pdf file (which also appears in the research.pdf); a non-technical general description, or public abstract, appears in the public.pdf file. Title from title screen of research.pdf file (viewed on May 2, 2007). Vita. Includes bibliographical references.
APA, Harvard, Vancouver, ISO, and other styles
46

Das, Mitali. "Motion within music : the analysis of multivariate MIDI data." Thesis, University of York, 2001. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.367466.

Full text
APA, Harvard, Vancouver, ISO, and other styles
47

Chen, Man-Hua. "Statistical analysis of multivariate interval-censored failure time data." Diss., Columbia, Mo. : University of Missouri-Columbia, 2007. http://hdl.handle.net/10355/4776.

Full text
Abstract:
Thesis (Ph.D.)--University of Missouri-Columbia, 2007.
The entire dissertation/thesis text is included in the research.pdf file; the official abstract appears in the short.pdf file (which also appears in the research.pdf); a non-technical general description, or public abstract, appears in the public.pdf file. Title from title screen of research.pdf file (viewed on March 6, 2009). Includes bibliographical references.
APA, Harvard, Vancouver, ISO, and other styles
48

Edberg, Alexandra. "Monitoring Kraft Recovery Boiler Fouling by Multivariate Data Analysis." Thesis, KTH, Skolan för kemi, bioteknologi och hälsa (CBH), 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-230906.

Full text
Abstract:
This work deals with fouling in the recovery boiler at Montes del Plata, Uruguay. Multivariate data analysis was used to analyze the large amount of data that was available, in order to investigate how different parameters affect the fouling problems. Principal Component Analysis (PCA) and Partial Least Squares (PLS) projection were used in this work. PCA was used to compare average values between time periods with high and low fouling problems, while PLS was used to study the correlation structures between the variables and consequently give an indication of which parameters might be changed to improve the availability of the boiler. The results show that this recovery boiler tends to have problems with fouling that might depend on the distribution of air, the black liquor pressure, or the dry solids content of the black liquor. The results also show that multivariate data analysis is a powerful tool for analyzing these types of fouling problems.
APA, Harvard, Vancouver, ISO, and other styles
49

Sheppard, Therese. "Extending covariance structure analysis for multivariate and functional data." Thesis, University of Manchester, 2010. https://www.research.manchester.ac.uk/portal/en/theses/extending-covariance-structure-analysis-for-multivariate-and-functional-data(e2ad7f12-3783-48cf-b83c-0ca26ef77633).html.

Full text
Abstract:
For multivariate data, when testing homogeneity of covariance matrices arising from two or more groups, Bartlett's (1937) modified likelihood ratio test statistic is appropriate under the null hypothesis of equal covariance matrices, where the null distribution of the test statistic is based on the restrictive assumption of normality. Zhang and Boos (1992) provide a pooled bootstrap approach when the data cannot be assumed to be normally distributed. We give three alternative bootstrap techniques for testing homogeneity of covariance matrices when it is both inappropriate to pool the data into one single population, as in the pooled bootstrap procedure, and when the data are not normally distributed. We further show that our alternative bootstrap methodology can be extended to testing Flury's (1988) hierarchy of covariance structure models. Where deviations from normality exist, we show, by simulation, that the normal theory log-likelihood ratio test statistic is less viable compared with our bootstrap methodology. For functional data, Ramsay and Silverman (2005) and Lee et al. (2002) together provide four computational techniques for functional principal component analysis (PCA) followed by covariance structure estimation. When the smoothing method for individual profiles is based on least squares cubic B-splines or regression splines, we find that the ensuing covariance matrix estimate suffers from loss of dimensionality. We show that ridge regression can be used to resolve this problem, but only for the discretisation and numerical quadrature approaches to estimation, and that the choice of a suitable ridge parameter is not arbitrary. We further show the unsuitability of regression splines when deciding on the optimal degree of smoothing to apply to individual profiles. To gain insight into smoothing parameter choice for functional data, we compare kernel and spline approaches to smoothing individual profiles in a nonparametric regression context. Our simulation results justify a kernel approach using a new criterion based on predicted squared error. We also show by simulation that, when taking account of correlation, a kernel approach using a generalized cross-validatory type criterion performs well. These data-based methods for selecting the smoothing parameter are illustrated prior to a functional PCA on a real data set.
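To fix ideas, here is a sketch of the pooled bootstrap baseline in the spirit of Zhang and Boos (1992), against which the thesis's alternative (non-pooled) bootstraps are set; the group sizes, data and Bartlett-style statistic are illustrative.

```python
import numpy as np

rng = np.random.default_rng(9)

def log_lr_stat(groups):
    """Bartlett-style statistic comparing pooled and per-group covariances."""
    k, n_tot = len(groups), sum(len(g) for g in groups)
    pooled = sum((len(g) - 1) * np.cov(g.T) for g in groups) / (n_tot - k)
    stat = (n_tot - k) * np.linalg.slogdet(pooled)[1]
    for g in groups:
        stat -= (len(g) - 1) * np.linalg.slogdet(np.cov(g.T))[1]
    return stat

g1 = rng.normal(size=(40, 3))
g2 = 1.5 * rng.normal(size=(50, 3))       # inflated covariance in group 2
observed = log_lr_stat([g1, g2])

pooled = np.vstack([g1 - g1.mean(0), g2 - g2.mean(0)])   # centre, then pool
null = [log_lr_stat([pooled[rng.integers(0, len(pooled), 40)],
                     pooled[rng.integers(0, len(pooled), 50)]])
        for _ in range(999)]
print("bootstrap p-value:", np.mean(np.array(null) >= observed))
```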
APA, Harvard, Vancouver, ISO, and other styles
50

陳志昌 and Chee-cheong Chan. "Compositional data analysis of voting patterns." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 1993. http://hub.hku.hk/bib/B31977236.

Full text
APA, Harvard, Vancouver, ISO, and other styles