To see the other types of publications on this topic, follow the link: High dimensional data.

Dissertations / Theses on the topic 'High dimensional data'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the top 50 dissertations / theses for your research on the topic 'High dimensional data.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Browse dissertations / theses in a wide variety of disciplines and organise your bibliography correctly.

1

Wauters, John. "Independence Screening in High-Dimensional Data." Thesis, The University of Arizona, 2016. http://hdl.handle.net/10150/623083.

Full text
Abstract:
High-dimensional data, data in which the number of dimensions exceeds the number of observations, is increasingly common in statistics. The term "ultra-high dimensional" is defined by Fan and Lv (2008) as describing the situation where log(p) is of order O(n^a) for some a in the interval (0, ½). It arises in many contexts such as gene expression data, proteomic data, imaging data, tomography, and finance, as well as others. High-dimensional data present a challenge to traditional statistical techniques. In traditional statistical settings, models have a small number of features, chosen based on an assumption of what features may be relevant to the response of interest. In the high-dimensional setting, many of the techniques of traditional feature selection become computationally intractable or do not yield unique solutions. Current research in modeling high-dimensional data is heavily focused on methods that screen the features before modeling; that is, methods that eliminate noise features as a pre-modeling dimension reduction. Typically, noise features are identified by exploiting properties of independent random variables, hence the term "independence screening." There are methods for modeling high-dimensional data without feature screening first (e.g. LASSO or SCAD), but simulation studies show screen-first methods perform better as dimensionality increases. Many proposals for independence screening exist, but in my literature review certain themes recurred: A) The assumption of sparsity: that all the useful information in the data is actually contained in a small fraction of the features (the "active" features), the rest being essentially random noise (the "inactive" features). B) In many newer methods, initial dimension reduction by feature screening reduces the problem from the high-dimensional case to a classical case; feature selection then proceeds by a classical method. C) In the initial screening, removal of features independent of the response is highly desirable, as such features literally give no information about the response. D) For the initial screening, some statistic is applied pairwise to each feature in combination with the response; the specific statistic is chosen so that when the two random variables are independent, a specific known value is expected for the statistic. E) Features are ranked by the absolute difference between the calculated statistic and the expected value of that statistic in the independent case, i.e. features that are most different from the independent case are most preferred. F) Proof is typically offered that, asymptotically, the method retains the true active features with probability approaching one. G) Where possible, an iterative version of the process is explored, as iterative versions do much better at identifying features that are active in their interactions, but not active individually.
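As a concrete illustration of the screening recipe summarised above (apply a pairwise statistic to each feature and the response, rank features by how far the statistic falls from its value under independence, and keep the top-ranked ones), here is a minimal sketch using plain Pearson correlation as the pairwise statistic, in the spirit of Fan and Lv's sure independence screening. The data, sizes, and function names are illustrative, not taken from the thesis.

    import numpy as np

    def correlation_screening(X, y, d):
        """Rank features by |corr(X_j, y)| and keep the top d.

        Under independence the expected correlation is 0, so features are
        ranked by the absolute deviation from that value.
        X : (n, p) feature matrix, y : (n,) response, d : number kept.
        """
        Xc = X - X.mean(axis=0)
        yc = y - y.mean()
        # Pearson correlation of every feature with the response.
        num = Xc.T @ yc
        den = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum()) + 1e-12
        scores = np.abs(num / den)
        keep = np.argsort(scores)[::-1][:d]   # indices of the retained features
        return keep, scores

    # toy example: 200 observations, 5000 features, the first 10 active
    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 5000))
    beta = np.zeros(5000); beta[:10] = 2.0
    y = X @ beta + rng.standard_normal(200)
    keep, _ = correlation_screening(X, y, d=50)
    print("top-ranked features:", sorted(int(i) for i in keep[:10]))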
APA, Harvard, Vancouver, ISO, and other styles
2

Zeugner, Stefan. "Macroeconometrics with high-dimensional data." Doctoral thesis, Universite Libre de Bruxelles, 2012. http://hdl.handle.net/2013/ULB-DIPOT:oai:dipot.ulb.ac.be:2013/209640.

Full text
Abstract:
CHAPTER 1:

The default g-priors predominant in Bayesian Model Averaging tend to over-concentrate posterior mass on a tiny set of models - a feature we denote as 'supermodel effect'. To address it, we propose a 'hyper-g' prior specification, whose data-dependent shrinkage adapts posterior model distributions to data quality. We demonstrate the asymptotic consistency of the hyper-g prior, and its interpretation as a goodness-of-fit indicator. Moreover, we highlight the similarities between hyper-g and 'Empirical Bayes' priors, and introduce closed-form expressions essential to computational feasibility. The robustness of the hyper-g prior is demonstrated via simulation analysis, and by comparing four vintages of economic growth data.

CHAPTER 2:

Ciccone and Jarocinski (2010) show that inference in Bayesian Model Averaging (BMA) can be highly sensitive to small data perturbations. In particular, they demonstrate that the importance attributed to potential growth determinants varies tremendously over different revisions of international income data. They conclude that 'agnostic' priors appear too sensitive for this strand of growth empirics. In response, we show that the instability they found owes much to a specific BMA set-up: First, comparing the same countries over data revisions improves robustness. Second, much of the remaining variation can be reduced by applying an equally 'agnostic', but flexible prior.

CHAPTER 3:

This chapter explores the link between the leverage of the US financial sector, of households and of non-financial businesses, and real activity. We document that leverage is negatively correlated with the future growth of real activity, and positively linked to the conditional volatility of future real activity and of equity returns.

The joint information in sectoral leverage series is more relevant for predicting future real activity than the information contained in any individual leverage series. Using in-sample regressions and out-of-sample forecasts, we show that the predictive power of leverage is roughly comparable to that of macro and financial predictors commonly used by forecasters.

Leverage information would not have made it possible to predict the 'Great Recession' of 2008-2009 any better than conventional macro/financial predictors.

CHAPTER 4:

Model averaging has proven popular for inference with many potential predictors in small samples. However, it is frequently criticized for a lack of robustness with respect to prediction and inference. This chapter explores the reasons for such robustness problems and proposes to address them by transforming the subset of potential 'control' predictors into principal components in suitable datasets. A simulation analysis shows that this approach yields robustness advantages vs. both standard model averaging and principal component-augmented regression. Moreover, we devise a prior framework that extends model averaging to uncertainty over the set of principal components and show that it offers considerable improvements with respect to the robustness of estimates and inference about the importance of covariates. Finally, we empirically benchmark our approach with popular model averaging and PC-based techniques in evaluating financial indicators as alternatives to established macroeconomic predictors of real economic activity.
Doctorate in Economics and Management Sciences

APA, Harvard, Vancouver, ISO, and other styles
3

Boulesteix, Anne-Laure. "Dimension reduction and Classification with High-Dimensional Microarray Data." Diss., lmu, 2005. http://nbn-resolving.de/urn:nbn:de:bvb:19-28017.

Full text
APA, Harvard, Vancouver, ISO, and other styles
4

Samko, Oksana. "Low dimension hierarchical subspace modelling of high dimensional data." Thesis, Cardiff University, 2009. http://orca.cf.ac.uk/54883/.

Full text
Abstract:
Building models of high-dimensional data in a low dimensional space has become extremely popular in recent years. Motion tracking, facial animation, stock market tracking, digital libraries and many other different models have been built and tuned to specific application domains. However, when the underlying structure of the original data is unknown, the modelling of such data is still an open question. The problem is of interest as capturing and storing large amounts of high dimensional data has become trivial, yet the capability to process, interpret, and use this data is limited. In this thesis, we introduce novel algorithms for modelling high dimensional data with an unknown structure, which allow us to represent the data with good accuracy and in a compact manner. This work presents a novel fully automated dynamic hierarchical algorithm, together with a novel automatic data partitioning method to work alongside existing specific models (talking head, human motion). Our algorithm is applicable to hierarchical data visualisation and classification, meaningful pattern extraction and recognition, and new data sequence generation. During our work we also investigated problems related to low dimensional data representation: automatic optimal input parameter estimation, and robustness against noise and outliers. We show the potential of our modelling with many data domains: talking head, motion, audio, etc., and we believe that it can be adapted to other domains.
APA, Harvard, Vancouver, ISO, and other styles
5

Ruan, Lingyan. "Statistical analysis of high dimensional data." Diss., Georgia Institute of Technology, 2010. http://hdl.handle.net/1853/37135.

Full text
Abstract:
This century is surely the century of data (Donoho, 2000). Data analysis has been an emerging activity over the last few decades. High dimensional data in particular is more and more pervasive with the advance of massive data collection systems, such as microarrays, satellite imagery, and financial data. However, analysis of high dimensional data is challenging because of the so-called curse of dimensionality (Bellman 1961). This research dissertation presents several methodologies in the application of high dimensional data analysis. The first part discusses a joint analysis of multiple microarray gene expressions. Microarray analysis dates back to Golub et al. (1999) and has drawn much attention since then. One common goal of microarray analysis is to determine which genes are differentially expressed. These genes behave significantly differently between groups of individuals. However, in microarray analysis, there are thousands of genes but few arrays (samples, individuals) and thus reproducibility remains relatively low. It is natural to consider joint analyses that could combine microarrays from different experiments effectively in order to achieve improved accuracy. In particular, we present a model-based approach for better identification of differentially expressed genes by incorporating data from different studies. The model can accommodate in a seamless fashion a wide range of studies including those performed at different platforms, and/or under different but overlapping biological conditions. Model-based inferences can be done in an empirical Bayes fashion. Because of the information sharing among studies, the joint analysis dramatically improves inferences based on individual analysis. Simulation studies and real data examples are presented to demonstrate the effectiveness of the proposed approach under a variety of complications that often arise in practice. The second part is about covariance matrix estimation in high dimensional data. First, we propose a penalised likelihood estimator for the high dimensional t-distribution. The Student t-distribution is of increasing interest in mathematical finance, education and many other applications. However, its application is limited by the difficulty of estimating the covariance matrix for high dimensional data. We show that by imposing a LASSO penalty on the Cholesky factors of the covariance matrix, an EM algorithm can efficiently compute the estimator, and it performs much better than other popular estimators. Secondly, we propose an estimator for high dimensional Gaussian mixture models. Finite Gaussian mixture models are widely used in statistics thanks to their great flexibility. However, parameter estimation for Gaussian mixture models with high dimensionality can be rather challenging because of the huge number of parameters that need to be estimated. For such purposes, we propose a penalized likelihood estimator to specifically address such difficulties. The LASSO penalty we impose on the inverse covariance matrices encourages sparsity of their entries and therefore helps reduce the dimensionality of the problem. We show that the proposed estimator can be efficiently computed via an Expectation-Maximization algorithm. To illustrate the practical merits of the proposed method, we consider its application in model-based clustering and mixture discriminant analysis. Numerical experiments with both simulated and real data show that the new method is a valuable tool in handling high dimensional data.
Finally, we present structured estimators for high dimensional Gaussian mixture models. The graphical representation of every cluster in a Gaussian mixture model may have the same or a similar structure, which is an important feature in many applications, such as image processing, speech recognition and gene network analysis. Failure to consider this shared structure would deteriorate the estimation accuracy. To address such issues, we propose two structured estimators, a hierarchical Lasso estimator and a group Lasso estimator. An EM algorithm can be applied to conveniently solve the estimation problem. We show that when clusters share similar structures, the proposed estimators perform much better than the separate Lasso estimator.
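The thesis places a LASSO penalty on Cholesky factors or inverse covariance matrices inside an EM algorithm. As a much simpler stand-in for that idea (a single Gaussian rather than a t-distribution or a mixture), the sketch below uses scikit-learn's GraphicalLasso, which fits a sparse inverse covariance matrix under an l1 penalty; the data and penalty level are purely illustrative.

    import numpy as np
    from sklearn.covariance import GraphicalLasso

    rng = np.random.default_rng(1)
    # p = 40 variables, n = 60 observations: more variables than would be
    # comfortable for the unpenalised sample covariance.
    X = rng.standard_normal((60, 40))

    model = GraphicalLasso(alpha=0.2)   # alpha = strength of the l1 penalty
    model.fit(X)
    precision = model.precision_        # sparse estimate of the inverse covariance
    print("share of (near-)zero entries:",
          np.mean(np.abs(precision) < 1e-8))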
APA, Harvard, Vancouver, ISO, and other styles
6

Shen, Xilin. "Multiscale analysis of high dimensional data." Connect to online resource, 2007. http://gateway.proquest.com/openurl?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation&res_dat=xri:pqdiss&rft_dat=xri:pqdiss:3284443.

Full text
APA, Harvard, Vancouver, ISO, and other styles
7

Wang, Wanjie. "Clustering Problems for High Dimensional Data." Research Showcase @ CMU, 2014. http://repository.cmu.edu/dissertations/384.

Full text
Abstract:
We consider a clustering problem where we observe feature vectors Xi ∈ Rp, i = 1, 2, ..., n, from several possible classes. The class labels are unknown and the main interest is to estimate these labels. We propose a three-step clustering procedure where we first evaluate the significance of each feature by the Kolmogorov-Smirnov statistic, then we select the small fraction of features for which the Kolmogorov-Smirnov scores exceed a preselected threshold t > 0, and then use only the selected features for clustering by one version of Principal Component Analysis (PCA). In this procedure, one of the main challenges is how to set the threshold t. We propose a new approach to set the threshold, whose core is the so-called Signal-to-Noise Ratio (SNR) in post-selection PCA. SNR is reminiscent of the recent innovation of Higher Criticism; for this reason, we call the proposed threshold the Higher Criticism Threshold (HCT), even though it is significantly different from the HCT proposed earlier by [Donoho 2008] in the context of classification. Motivated by many examples in Big Data, we study spectral clustering with the HCT for a model where the signals are both rare and weak, in the two-class clustering case. Through delicate PCA, we forge a close link between the HCT and the ideal threshold choice, and show that the HCT yields optimal results in the spectral clustering approach. The approach is successfully applied to three gene microarray data sets, where it compares favorably with existing clustering methods. Our analysis is subtle and requires new development in Random Matrix Theory (RMT). One challenge we face is that most results in RMT cannot be applied directly to our case: existing results are usually for matrices with i.i.d. entries, but the object of interest in the current case is the post-selection data matrix, where (due to feature selection) the columns are non-independent and have hard-to-track distributions. We develop intricate new RMT to overcome this problem. We also find the theoretical approximation for the tail distribution of the Kolmogorov-Smirnov statistic under the null and alternative hypotheses. With this theoretical approximation, we can establish the effectiveness of the KS statistic. In addition, we find the fundamental limits for the clustering problem, the signal recovery problem, and the detection problem under the Asymptotic Rare and Weak model. We find the boundary such that when the model parameters are beyond the boundary, inference is impossible, and otherwise there are methods (usually exhaustive search) that achieve the inference.
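A minimal sketch of the three-step procedure described above (a Kolmogorov-Smirnov score per feature, a threshold, then PCA-based clustering on the retained features) is given below. It substitutes a fixed quantile threshold for the data-driven Higher Criticism Threshold derived in the thesis, so it only illustrates the pipeline, not the HCT itself; all data are synthetic.

    import numpy as np
    from scipy.stats import kstest
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    n, p, useful = 100, 2000, 20
    labels = rng.integers(0, 2, n)                 # hidden two-class structure
    X = rng.standard_normal((n, p))
    X[:, :useful] += 1.5 * labels[:, None]         # a few weak signal features

    # Step 1: KS score of each standardised feature against N(0, 1).
    Z = (X - X.mean(0)) / X.std(0)
    ks = np.array([kstest(Z[:, j], 'norm').statistic for j in range(p)])

    # Step 2: keep features whose score exceeds a threshold
    # (here a fixed quantile; the thesis derives a data-driven HCT instead).
    t = np.quantile(ks, 0.99)
    sel = ks > t

    # Step 3: PCA on the selected columns, then cluster the leading scores.
    scores = PCA(n_components=1).fit_transform(Z[:, sel])
    est = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scores)
    agree = max(np.mean(est == labels), np.mean(est != labels))
    print(f"clustering agreement with the hidden labels: {agree:.2f}")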
APA, Harvard, Vancouver, ISO, and other styles
8

Wang, Wanjie. "CLUSTERING PROBLEMS FOR HIGH DIMENSIONAL DATA." Research Showcase @ CMU, 2014. http://repository.cmu.edu/dissertations/443.

Full text
Abstract:
We consider a clustering problem where we observe feature vectors Xi ∈ Rp, i = 1, 2, ..., n, from several possible classes. The class labels are unknown and the main interest is to estimate these labels. We propose a three-step clustering procedure where we first evaluate the significance of each feature by the Kolmogorov-Smirnov statistic, then we select the small fraction of features for which the Kolmogorov-Smirnov scores exceed a preselected threshold t > 0, and then use only the selected features for clustering by one version of Principal Component Analysis (PCA). In this procedure, one of the main challenges is how to set the threshold t. We propose a new approach to set the threshold, whose core is the so-called Signal-to-Noise Ratio (SNR) in post-selection PCA. SNR is reminiscent of the recent innovation of Higher Criticism; for this reason, we call the proposed threshold the Higher Criticism Threshold (HCT), even though it is significantly different from the HCT proposed earlier by [Donoho 2008] in the context of classification. Motivated by many examples in Big Data, we study spectral clustering with the HCT for a model where the signals are both rare and weak, in the two-class clustering case. Through delicate PCA, we forge a close link between the HCT and the ideal threshold choice, and show that the HCT yields optimal results in the spectral clustering approach. The approach is successfully applied to three gene microarray data sets, where it compares favorably with existing clustering methods. Our analysis is subtle and requires new development in Random Matrix Theory (RMT). One challenge we face is that most results in RMT cannot be applied directly to our case: existing results are usually for matrices with i.i.d. entries, but the object of interest in the current case is the post-selection data matrix, where (due to feature selection) the columns are non-independent and have hard-to-track distributions. We develop intricate new RMT to overcome this problem. We also find the theoretical approximation for the tail distribution of the Kolmogorov-Smirnov statistic under the null and alternative hypotheses. With this theoretical approximation, we can establish the effectiveness of the KS statistic. In addition, we find the fundamental limits for the clustering problem, the signal recovery problem, and the detection problem under the Asymptotic Rare and Weak model. We find the boundary such that when the model parameters are beyond the boundary, inference is impossible, and otherwise there are methods (usually exhaustive search) that achieve the inference.
APA, Harvard, Vancouver, ISO, and other styles
9

Csikós, Mónika. "Efficient Approximations of High-Dimensional Data." Thesis, Université Gustave Eiffel, 2022. http://www.theses.fr/2022UEFL2004.

Full text
Abstract:
In this thesis, we study approximations of set systems (X,S), where X is a base set and S consists of subsets of X called ranges. Given a finite set system, our goal is to construct a small subset of X such that each range is 'well-approximated'. In particular, for a given parameter epsilon in (0,1), we say that a subset A of X is an epsilon-approximation of (X,S) if for any range R in S, the fractions |A cap R|/|A| and |R|/|X| are epsilon-close. Research on such approximations started in the 1950s, with random sampling being the key tool for showing their existence. Since then, the notion of approximations has become a fundamental structure across several communities: learning theory, statistics, combinatorics and algorithms. A breakthrough in the study of approximations dates back to 1971, when Vapnik and Chervonenkis studied set systems with finite VC dimension, which turned out to be a key parameter for characterising their complexity. For instance, if a set system (X,S) has VC dimension d, then a uniform sample of O(d/epsilon^2) points is an epsilon-approximation of (X,S) with high probability. Importantly, the size of the approximation depends only on epsilon and d, and it is independent of the input sizes |X| and |S|! In the first part of this thesis, we give a modular, self-contained, intuitive proof of the above uniform sampling guarantee. In the second part, we give an improvement of a 30-year-old algorithmic bottleneck: constructing matchings with low crossing number. This can be applied to build approximations with improved guarantees. Finally, we answer a 30-year-old open problem of Blumer et al. by proving tight lower bounds on the VC dimension of unions of half-spaces, a geometric set system that appears in several applications, e.g. coreset constructions.
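The definition of an epsilon-approximation and the O(d/epsilon^2) uniform-sampling guarantee can be checked empirically on a toy set system. The sketch below uses intervals on a line (a family of low VC dimension) and reports the worst range error of a uniform sample; it only illustrates the definition, not the matching-based constructions developed in the thesis, and all sizes are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)

    # Base set X = n points on a line; ranges = intervals [i, j) on a coarse grid.
    n = 2000
    points = np.arange(n)
    ranges = [(i, j) for i in range(0, n, 100) for j in range(i + 100, n + 1, 100)]

    eps, d = 0.05, 2
    m = int(np.ceil(d / eps**2))                 # sample size of order d / eps^2
    sample = rng.choice(points, size=m, replace=True)

    def discrepancy(i, j):
        # |A cap R| / |A|  versus  |R| / |X|  for the range R = [i, j)
        in_range_X = ((points >= i) & (points < j)).mean()
        in_range_A = ((sample >= i) & (sample < j)).mean()
        return abs(in_range_A - in_range_X)

    worst = max(discrepancy(i, j) for i, j in ranges)
    print(f"sample size {m}, worst range error {worst:.3f} (target eps = {eps})")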
APA, Harvard, Vancouver, ISO, and other styles
10

Qin, Yingli. "Statistical inference for high-dimensional data." [Ames, Iowa : Iowa State University], 2009. http://gateway.proquest.com/openurl?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation&res_dat=xri:pqdiss&rft_dat=xri:pqdiss:3389139.

Full text
APA, Harvard, Vancouver, ISO, and other styles
11

Lou, Qiang. "LEARNING FROM INCOMPLETE HIGH-DIMENSIONAL DATA." Diss., Temple University Libraries, 2013. http://cdm16002.contentdm.oclc.org/cdm/ref/collection/p245801coll10/id/214785.

Full text
Abstract:
Computer and Information Science
Ph.D.
Data sets with irrelevant and redundant features and a large fraction of missing values are common in real-life applications. Learning from such data usually requires some preprocessing, such as selecting informative features and imputing missing values based on the observed data. These steps can provide more accurate and more efficient prediction as well as a better understanding of the data distribution. In my dissertation I describe my work on both of these aspects, as well as my follow-up work on feature selection in incomplete datasets without imputing missing values. In the last part of my dissertation, I present my current work on the more challenging situation where the high-dimensional data varies in time. The first two parts of my dissertation consist of my methods that handle such data in a straightforward way: imputing missing values first, and then applying a traditional feature selection method to select informative features. We propose two novel methods, one for imputing missing values and the other for selecting informative features. The proposed imputation method fills in the missing attributes by exploiting temporal correlation of attributes, correlations among multiple attributes collected at the same time and space, and spatial correlations among attributes from multiple sources. The proposed feature selection method aims to find a minimum subset of the most informative variables for classification/regression by efficiently approximating the Markov Blanket, which is a set of variables that can shield a certain variable from the target. I present, in the third part, how to perform feature selection in incomplete high-dimensional data without imputation, since imputation methods only work well when data is missing completely at random, when the fraction of missing values is small, or when there is prior knowledge about the data distribution. We define the objective function of the uncertainty margin-based feature selection method to maximize each instance's uncertainty margin in its own relevant subspace. In the optimization, we take into account the uncertainty of each instance due to the missing values. The experimental results on synthetic and 6 benchmark data sets with few missing values (less than 25%) provide evidence that our method can select the same accurate features as the alternative methods which apply an imputation method first. However, when there is a large fraction of missing values (more than 25%) in the data, our feature selection method outperforms the alternatives, which impute missing values first. In the fourth part, I introduce my method for the more challenging situation where the high-dimensional data varies in time. The existing way to handle such data is to flatten the temporal data into a single static data matrix, and then apply a traditional feature selection method. In order to keep the dynamics in the time series data, our method avoids flattening the data in advance. We propose a way to measure the distance between multivariate temporal data from two instances. Based on this distance, we define a new objective function based on the temporal margin of each data instance. A fixed-point gradient descent method is proposed to solve the formulated objective function to learn the optimal feature weights. The experimental results on real temporal microarray data provide evidence that the proposed method can identify more informative features than the alternatives that flatten the temporal data in advance.
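The "straightforward way" described first in this abstract, impute missing values and then select informative features, can be assembled from generic components. The sketch below uses scikit-learn's SimpleImputer and SelectKBest as stand-ins; it is not the author's spatio-temporal imputation or Markov-blanket selection method, and all sizes are illustrative.

    import numpy as np
    from sklearn.impute import SimpleImputer
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.pipeline import Pipeline

    rng = np.random.default_rng(0)
    n, p = 300, 500
    X = rng.standard_normal((n, p))
    y = (X[:, 0] + X[:, 1] + 0.5 * rng.standard_normal(n) > 0).astype(int)
    X[rng.random((n, p)) < 0.15] = np.nan        # ~15% of the entries go missing

    pipe = Pipeline([
        ("impute", SimpleImputer(strategy="mean")),   # fill missing values first
        ("select", SelectKBest(f_classif, k=20)),     # then keep 20 informative features
    ])
    X_reduced = pipe.fit_transform(X, y)
    selected = pipe.named_steps["select"].get_support(indices=True)
    print(X_reduced.shape, "first selected indices:", selected[:5])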
Temple University--Theses
APA, Harvard, Vancouver, ISO, and other styles
12

Dannenberg, Matthew. "Pattern Recognition in High-Dimensional Data." Scholarship @ Claremont, 2016. https://scholarship.claremont.edu/hmc_theses/76.

Full text
Abstract:
Vast amounts of data are produced all the time. Yet this data does not easily equate to useful information: extracting information from large amounts of high dimensional data is nontrivial. People are simply drowning in data. A recent and growing source of high-dimensional data is hyperspectral imaging. Hyperspectral images allow for massive amounts of spectral information to be contained in a single image. In this thesis, a robust supervised machine learning algorithm is developed to efficiently perform binary object classification on hyperspectral image data by making use of the geometry of Grassmann manifolds. This algorithm can consistently distinguish between a large range of even very similar materials, returning very accurate classification results with very little training data. When distinguishing between dissimilar locations like crop fields and forests, this algorithm consistently classifies more than 95 percent of points correctly. On more similar materials, more than 80 percent of points are classified correctly. This algorithm will allow for very accurate information to be extracted from these large and complicated hyperspectral images.
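One simple way to exploit subspace geometry for classification, loosely in the spirit of the Grassmann-manifold approach mentioned above, is a nearest-subspace rule: represent each class by the span of a few singular vectors of its training samples and assign a new sample to the class with the smallest projection residual. The sketch below is a hypothetical simplification on synthetic data, not the thesis's algorithm.

    import numpy as np

    rng = np.random.default_rng(0)

    def class_basis(samples, k=3):
        """Orthonormal basis of a k-dimensional subspace fitted to one class."""
        U, _, _ = np.linalg.svd(samples.T, full_matrices=False)   # columns = samples
        return U[:, :k]

    def nearest_subspace(x, bases):
        """Assign x to the class whose subspace leaves the smallest residual."""
        residuals = [np.linalg.norm(x - B @ (B.T @ x)) for B in bases]
        return int(np.argmin(residuals))

    # two synthetic 'spectra' classes in 50 dimensions
    d = 50
    means = [np.sin(np.linspace(0, 3, d)), np.cos(np.linspace(0, 3, d))]
    train = [m + 0.1 * rng.standard_normal((30, d)) for m in means]
    bases = [class_basis(t) for t in train]

    test_point = means[1] + 0.1 * rng.standard_normal(d)
    print("predicted class:", nearest_subspace(test_point, bases))   # expect 1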
APA, Harvard, Vancouver, ISO, and other styles
13

Pacella, Massimo. "High-dimensional statistics for complex data." Doctoral thesis, Universita degli studi di Salerno, 2018. http://hdl.handle.net/10556/3016.

Full text
Abstract:
2016 - 2017
High dimensional data analysis has become a popular research topic in recent years, due to the emergence of various new applications in several fields of science underscoring the need to analyse massive data sets. One of the main challenges in analysing high dimensional data concerns the interpretability of estimated models as well as the computational efficiency of the procedures adopted. Such a purpose can be achieved through the identification of the relevant variables that really affect the phenomenon of interest, so that effective models can subsequently be constructed and applied to solve practical problems. The first two chapters of the thesis are devoted to studying high dimensional statistics for variable selection. We first give a short but exhaustive review of the main techniques developed for the general problem of variable selection using nonparametric statistics. Lastly, in Chapter 3 we present our proposal for a feature screening approach for non-additive models, developed by using conditional information in the estimation procedure... [edited by Author]
XXX ciclo
APA, Harvard, Vancouver, ISO, and other styles
14

ZANCO, ALESSANDRO. "High-dimensional data driven parameterized macromodeling." Doctoral thesis, Politecnico di Torino, 2022. http://hdl.handle.net/11583/2971991.

Full text
APA, Harvard, Vancouver, ISO, and other styles
15

Hassan, Tahir Mohammed. "Data-independent vs. data-dependent dimension reduction for pattern recognition in high dimensional spaces." Thesis, University of Buckingham, 2017. http://bear.buckingham.ac.uk/199/.

Full text
Abstract:
There has been a rapid emergence of new pattern recognition/classification techniques in a variety of real world applications over the last few decades. In most pattern recognition/classification applications, the pattern of interest is modelled by a data vector/array of very high dimension. The main challenges in such applications are related to the efficiency of retrieving, analysing, and verifying/classifying the pattern/object of interest. The "Curse of Dimension" is a reference to these challenges and is commonly addressed by Dimension Reduction (DR) techniques. Several DR techniques have been developed and implemented in a variety of applications. The most common DR schemes are dependent on a dataset of "typical samples" (e.g. Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA)). However, data-independent DR schemes (e.g. the Discrete Wavelet Transform (DWT) and Random Projections (RP)) are becoming more desirable due to the low density ratio of samples to dimension. In this thesis, we critically review both types of techniques, and highlight advantages and disadvantages in terms of efficiency and impact on recognition accuracy. We study the theoretical justification for the existence of DR transforms that preserve, within tolerable error, distances between would-be feature vectors modelling objects of interest. We observe that data-dependent DRs do not specifically attempt to preserve distances, and the problems of overfitting and bias are consequences of a low density ratio of samples to dimension. Accordingly, the focus of our investigations is more on data-independent DR schemes and in particular on the different ways of generating RPs as an efficient DR tool. RPs suitable for pattern recognition applications are only restricted by a lower bound on the reduced dimension that depends on the tolerable error. Besides the known RPs that are generated in accordance with some probability distributions, we investigate and test the performance of differently constructed over-complete Hadamard mxn (m<
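A minimal example of a data-independent DR scheme is a Gaussian random projection: the reduced dimension depends only on the number of points and the tolerable error, not on the original dimension, and pairwise distances are approximately preserved (the Johnson-Lindenstrauss flavour of guarantee referred to above). The sketch below checks this numerically; it does not implement the over-complete Hadamard constructions investigated in the thesis, and the constant in the target dimension is illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 100, 10_000            # 100 high-dimensional points
    X = rng.standard_normal((n, p))

    eps = 0.2
    k = int(np.ceil(8 * np.log(n) / eps**2))       # reduced dimension, independent of p
    R = rng.standard_normal((p, k)) / np.sqrt(k)   # data-independent random projection
    Y = X @ R

    # distortion of pairwise distances for a few random (distinct) pairs
    pairs = rng.integers(0, n, size=(200, 2))
    pairs = pairs[pairs[:, 0] != pairs[:, 1]]
    d_hi = np.linalg.norm(X[pairs[:, 0]] - X[pairs[:, 1]], axis=1)
    d_lo = np.linalg.norm(Y[pairs[:, 0]] - Y[pairs[:, 1]], axis=1)
    ratio = d_lo / d_hi
    print(f"k = {k}, distance ratios in [{ratio.min():.2f}, {ratio.max():.2f}]")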
APA, Harvard, Vancouver, ISO, and other styles
16

Yahya, Waheed Babatunde. "Sequential Dimension Reduction and Prediction Methods with High-dimensional Microarray Data." Diss., lmu, 2009. http://nbn-resolving.de/urn:nbn:de:bvb:19-102544.

Full text
APA, Harvard, Vancouver, ISO, and other styles
17

Liu, Jinze Wang Wei. "New approaches for clustering high dimensional data." Chapel Hill, N.C. : University of North Carolina at Chapel Hill, 2006. http://dc.lib.unc.edu/u?/etd,584.

Full text
Abstract:
Thesis (Ph. D.)--University of North Carolina at Chapel Hill, 2006.
Title from electronic title page (viewed Oct. 10, 2007). "... in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Computer Science." Discipline: Computer Science; Department/School: Computer Science.
APA, Harvard, Vancouver, ISO, and other styles
18

Mansoor, Rashid. "Assessing Distributional Properties of High-Dimensional Data." Doctoral thesis, Internationella Handelshögskolan, Högskolan i Jönköping, IHH, Economics, Finance and Statistics, 2013. http://urn.kb.se/resolve?urn=urn:nbn:se:hj:diva-22547.

Full text
Abstract:
This doctoral thesis consists of five papers in the field of multivariate statistical analysis of high-dimensional data. Because of the wide application and methodological scope, the individual papers in the thesis necessarily target a number of different statistical issues. In the first paper, Monte Carlo simulations are used to investigate a number of tests of multivariate non-normality with respect to their increasing dimension asymptotic (IDA) properties as the dimension p grows proportionally with the number of observations n such that p/n → c, where c is a constant. In the second paper, a new test for non-normality that utilizes principal components is proposed for cases when p/n → c. The power and size of the test are examined through Monte Carlo simulations where different combinations of p and n are used. The third paper treats the problem of the relation between the second central moment of a distribution and its first raw moment. In order to make inference about the systematic relationship between mean and standard deviation, a model that captures this relationship by a slope parameter (β) is proposed, and three different estimators of this parameter are developed and their consistency proven in the context where the number of variables increases proportionally to the number of observations. In the fourth paper, a Bayesian regression approach is taken to model the relationship between the mean and standard deviation of the excess return and to test hypotheses regarding the β parameter. An empirical example involving Stockholm exchange market data is included. Finally, the fifth paper proposes three new methods to test for panel cointegration.
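Purely as an illustration of how a principal-component-based normality check might be assembled (this is not the test proposed in the second paper), one can apply a univariate normality test to a few leading principal component scores and combine the p-values with a Bonferroni correction; everything below is an assumption-laden toy example.

    import numpy as np
    from scipy.stats import shapiro
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    n, p = 200, 100                       # p grows with n in the IDA setting
    X = rng.standard_normal((n, p))       # H0: multivariate normal data

    k = 10                                # number of leading components examined
    scores = PCA(n_components=k).fit_transform(X)
    pvals = np.array([shapiro(scores[:, j])[1] for j in range(k)])

    # Bonferroni-combined decision at level 0.05
    reject = (pvals < 0.05 / k).any()
    print("reject multivariate normality:", reject)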
APA, Harvard, Vancouver, ISO, and other styles
19

Sun, Yizhi. "Statistical Analysis of Structured High-dimensional Data." Diss., Virginia Tech, 2018. http://hdl.handle.net/10919/97505.

Full text
Abstract:
High-dimensional data such as multi-modal neuroimaging data and large-scale networks carry an excessive amount of information, and can be used to test various scientific hypotheses or discover important patterns in complicated systems. While considerable efforts have been made to analyze high-dimensional data, existing approaches often rely on simple summaries which could miss important information, and many challenges in modeling complex structures in data remain unaddressed. In this dissertation, we focus on analyzing structured high-dimensional data, including functional data with important local regions and network data with community structures. The first part of this dissertation concerns the detection of "important" regions in functional data. We propose a novel Bayesian approach that enables region selection in the functional data regression framework. The selection of regions is achieved through encouraging sparse estimation of the regression coefficient, where nonzero regions correspond to regions that are selected. To achieve sparse estimation, we adopt a compactly supported and potentially over-complete basis to capture local features of the regression coefficient function, and assume a spike-and-slab prior on the coefficients of the basis functions. To encourage continuous shrinkage of nearby regions, we assume an Ising hyper-prior which takes into account the neighboring structure of the basis functions. This neighboring structure is represented by an undirected graph. We perform posterior sampling through Markov chain Monte Carlo algorithms. The practical performance of the proposed approach is demonstrated through simulations as well as near-infrared and sonar data. The second part of this dissertation focuses on constructing diversified portfolios using stock return data in the Center for Research in Security Prices (CRSP) database maintained by the University of Chicago. Diversification is a risk management strategy that involves mixing a variety of financial assets in a portfolio. This strategy helps reduce the overall risk of the investment and improve performance of the portfolio. To construct portfolios that effectively diversify risks, we first construct a co-movement network using the correlations between stock returns over a training time period. Correlation characterizes the synchrony among stock returns and thus helps us understand whether two or more stocks have common risk attributes. Based on the co-movement network, we apply multiple network community detection algorithms to detect groups of stocks with common co-movement patterns. Stocks within the same community tend to be highly correlated, while stocks across different communities tend to be less correlated. A portfolio is then constructed by selecting stocks from different communities. The average return of the constructed portfolio over a testing time period is finally compared with the S&P 500 market index. Our constructed portfolios demonstrate outstanding performance during a non-crisis period (2004-2006) and good performance during a financial crisis period (2008-2010).
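A minimal sketch of the portfolio construction described in the second part (build a co-movement network from return correlations, detect communities, pick one asset per community) is shown below on synthetic returns. Greedy modularity maximisation stands in for whichever community detection algorithms the thesis applies, and the correlation threshold is illustrative.

    import numpy as np
    from networkx.algorithms.community import greedy_modularity_communities
    import networkx as nx

    rng = np.random.default_rng(0)
    n_days, n_stocks, n_groups = 500, 30, 5

    # synthetic returns with a block (sector-like) correlation structure
    group = np.repeat(np.arange(n_groups), n_stocks // n_groups)
    common = rng.standard_normal((n_days, n_groups))
    returns = 0.7 * common[:, group] + 0.7 * rng.standard_normal((n_days, n_stocks))

    corr = np.corrcoef(returns.T)

    # co-movement network: edge whenever the correlation exceeds a threshold
    G = nx.Graph()
    G.add_nodes_from(range(n_stocks))
    thr = 0.3
    for i in range(n_stocks):
        for j in range(i + 1, n_stocks):
            if corr[i, j] > thr:
                G.add_edge(i, j, weight=corr[i, j])

    communities = greedy_modularity_communities(G)
    portfolio = [sorted(c)[0] for c in communities]       # one stock per community
    print("detected communities:", [sorted(c) for c in communities])
    print("diversified portfolio:", portfolio)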
PHD
APA, Harvard, Vancouver, ISO, and other styles
20

Harvey, William John. "Understanding High-Dimensional Data Using Reeb Graphs." The Ohio State University, 2012. http://rave.ohiolink.edu/etdc/view?acc_num=osu1342614959.

Full text
APA, Harvard, Vancouver, ISO, and other styles
21

Green, Brittany. "Ultra-high Dimensional Semiparametric Longitudinal Data Analysis." University of Cincinnati / OhioLINK, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1593171378846243.

Full text
APA, Harvard, Vancouver, ISO, and other styles
22

Huo, Shuning. "Bayesian Modeling of Complex High-Dimensional Data." Diss., Virginia Tech, 2020. http://hdl.handle.net/10919/101037.

Full text
Abstract:
With the rapid development of modern high-throughput technologies, scientists can now collect high-dimensional complex data in different forms, such as medical images and genomics measurements. However, acquisition of more data does not automatically lead to better knowledge discovery. One needs efficient and reliable analytical tools to extract useful information from complex datasets. The main objective of this dissertation is to develop innovative Bayesian methodologies to enable effective and efficient knowledge discovery from complex high-dimensional data. It contains two parts—the development of computationally efficient functional mixed models and the modeling of data heterogeneity via the Dirichlet Diffusion Tree. The first part focuses on tackling the computational bottleneck in Bayesian functional mixed models. We propose a computational framework called the variational functional mixed model (VFMM). This new method facilitates efficient data compression and high-performance computing in basis space. We also propose a new multiple testing procedure in basis space, which can be used to detect significant local regions. The effectiveness of the proposed model is demonstrated through two datasets, a mass spectrometry dataset in a cancer study and a neuroimaging dataset in an Alzheimer's disease study. The second part is about modeling data heterogeneity by using Dirichlet Diffusion Trees. We propose a Bayesian latent tree model that incorporates covariates of subjects to characterize the heterogeneity and uncover the latent tree structure underlying the data. This innovative model may reveal the hierarchical evolution process through branch structures and estimate systematic differences between groups of samples. We demonstrate the effectiveness of the model through a simulation study and real brain tumor data.
Doctor of Philosophy
With the rapid development of modern high-throughput technologies, scientists can now collect high-dimensional data in different forms, such as engineering signals, medical images, and genomics measurements. However, acquisition of such data does not automatically lead to efficient knowledge discovery. The main objective of this dissertation is to develop novel Bayesian methods to extract useful knowledge from complex high-dimensional data. It has two parts—the development of an ultra-fast functional mixed model and the modeling of data heterogeneity via Dirichlet Diffusion Trees. The first part focuses on developing approximate Bayesian methods in functional mixed models to estimate parameters and detect significant regions. Two datasets demonstrate the effectiveness of the proposed method—a mass spectrometry dataset in a cancer study and a neuroimaging dataset in an Alzheimer's disease study. The second part focuses on modeling data heterogeneity via Dirichlet Diffusion Trees. The method helps uncover the underlying hierarchical tree structures and estimate systematic differences between groups of samples. We demonstrate the effectiveness of the method through brain tumor imaging data.
APA, Harvard, Vancouver, ISO, and other styles
23

Williams, Andre. "Stereotype Logit Models for High Dimensional Data." VCU Scholars Compass, 2010. http://scholarscompass.vcu.edu/etd/147.

Full text
Abstract:
Gene expression studies are of growing importance in the field of medicine. In fact, subtypes within the same disease have been shown to have differing gene expression profiles (Golub et al., 1999). Often, researchers are interested in differentiating a disease by a categorical classification indicative of disease progression. For example, it may be of interest to identify genes that are associated with progression and to accurately predict the state of progression using gene expression data. One challenge when modeling microarray gene expression data is that there are more genes (variables) than there are observations. In addition, the genes usually demonstrate a complex variance-covariance structure. Therefore, modeling a categorical variable reflecting disease progression using gene expression data presents the need for methods capable of handling an ordinal outcome in the presence of a high dimensional covariate space. In this research we present a method that combines the stereotype regression model (Anderson, 1984) with an elastic net penalty (Friedman et al., 2010), capable of modeling an ordinal outcome for high-throughput genomic datasets. Results from applying the proposed method to both simulated and gene expression data are reported, and the effectiveness of the proposed method compared to a univariable and a heuristic approach is discussed.
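Scikit-learn does not implement the stereotype logit model, but the penalised-fitting idea can be illustrated with a rough stand-in: an elastic-net-penalised multinomial logistic regression on a synthetic ordinal outcome with far more genes than samples. Names and settings below are illustrative only and do not reproduce the thesis's method.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n, p = 150, 1000                                  # more genes than samples
    X = rng.standard_normal((n, p))
    # ordinal outcome (e.g. disease stage 0/1/2) driven by a handful of genes
    latent = X[:, 0] + X[:, 1] - X[:, 2] + 0.5 * rng.standard_normal(n)
    y = np.digitize(latent, np.quantile(latent, [1/3, 2/3]))

    model = LogisticRegression(
        penalty="elasticnet", solver="saga", l1_ratio=0.5, C=0.5, max_iter=5000
    )
    model.fit(X, y)
    nonzero = np.unique(np.nonzero(model.coef_)[1])   # genes with any nonzero weight
    print("number of genes retained:", nonzero.size)
    print("a few retained indices:", nonzero[:10])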
APA, Harvard, Vancouver, ISO, and other styles
24

Chi, Yuan. "Machine learning techniques for high dimensional data." Thesis, University of Liverpool, 2015. http://livrepository.liverpool.ac.uk/2033319/.

Full text
Abstract:
This thesis presents data processing techniques for three different but related application areas: embedding learning for classification, fusion of low bit depth images, and 3D reconstruction from 2D images. For embedding learning for classification, a novel manifold embedding method is proposed for the automated processing of large, varied data sets. The method is based on binary classification, where the embeddings are constructed so as to determine one or more unique features for each class individually from a given dataset. The proposed method is applied to examples of multiclass classification that are relevant for large scale data processing for surveillance (e.g. face recognition), where the aim is to augment decision making by reducing extremely large sets of data to a manageable level before displaying the selected subset of data to a human operator. In addition, an indicator for a weighted pairwise constraint is proposed to balance the contributions from different classes to the final optimisation, in order to better control the relative positions between the important data samples from either the same class (intraclass) or different classes (interclass). The effectiveness of the proposed method is evaluated through comparison with seven existing techniques for embedding learning, using four established databases of faces, consisting of various poses, lighting conditions and facial expressions, as well as two standard text datasets. The proposed method performs better than these existing techniques, especially for cases with small sets of training data samples. For fusion of low bit depth images, using low bit depth images instead of full images offers a number of advantages for aerial imaging with UAVs, where there is a limited transmission rate/bandwidth. For example, it reduces the need for data transmission, removes superfluous details, and reduces the computational loading of on-board platforms (especially for small or micro-scale UAVs). The main drawback of using low bit depth imagery is that image details of the scene are discarded. Fortunately, these can be reconstructed by fusing a sequence of related low bit depth images which have been properly aligned. To reduce computational complexity and obtain a less distorted result, a similarity transformation is used to approximate the geometric alignment between two images of the same scene. The transformation is estimated using a phase correlation technique. It is shown that the phase correlation method is capable of registering low bit depth images without any modification, or any pre- and/or post-processing. For 3D reconstruction from 2D images, a method is proposed to deal with dense reconstruction after a sparse reconstruction (i.e. a sparse 3D point cloud) has been created employing the structure from motion technique. Instead of generating a dense 3D point cloud, the proposed method forms a triangle from three points in the sparse point cloud, and then maps the corresponding components in the 2D images back to the point cloud. Compared to existing methods that use a similar approach, this method reduces the computational cost. Instead of utilising every triangle in the 3D space to do the mapping from 2D to 3D, it uses a large triangle to replace a number of small triangles in flat and almost flat areas. Compared to the reconstruction results obtained by existing techniques that aim to generate a dense point cloud, the proposed method can achieve a better result while the computational cost is comparable.
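The translation part of phase-correlation registration mentioned above fits in a few lines of NumPy: whiten the cross-power spectrum so only phase remains, invert the FFT, and read the shift off the correlation peak. The sketch below tests it on a quantised (low bit depth) image pair; estimating the full similarity transform (rotation and scale) would need log-polar extensions not shown here, and the images are synthetic.

    import numpy as np

    def phase_correlation_shift(img1, img2):
        """Estimate the integer (row, col) translation taking img2 onto img1."""
        F1 = np.fft.fft2(img1)
        F2 = np.fft.fft2(img2)
        cross_power = F1 * np.conj(F2)
        cross_power /= np.abs(cross_power) + 1e-12     # keep the phase only
        corr = np.abs(np.fft.ifft2(cross_power))
        peak = np.unravel_index(np.argmax(corr), corr.shape)
        # wrap large indices around to negative shifts
        shifts = [p if p <= s // 2 else p - s for p, s in zip(peak, corr.shape)]
        return tuple(shifts)

    rng = np.random.default_rng(0)
    base = rng.random((128, 128))
    shifted = np.roll(base, shift=(7, -12), axis=(0, 1))     # known translation
    # quantise to 2 bits per pixel to mimic low bit depth imagery
    q = lambda im: np.floor(im * 4) / 4
    print(phase_correlation_shift(q(shifted), q(base)))      # expect (7, -12)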
APA, Harvard, Vancouver, ISO, and other styles
25

McWilliams, Brian Victor Parulian. "Projection based models for high dimensional data." Thesis, Imperial College London, 2011. http://hdl.handle.net/10044/1/9577.

Full text
Abstract:
In recent years, many machine learning applications have arisen which deal with the problem of finding patterns in high dimensional data. Principal component analysis (PCA) has become ubiquitous in this setting. PCA performs dimensionality reduction by estimating latent factors which minimise the reconstruction error between the original data and its low-dimensional projection. We initially consider a situation where influential observations exist within the dataset which have a large, adverse effect on the estimated PCA model. We propose a measure of “predictive influence” to detect these points based on the contribution of each point to the leave-one-out reconstruction error of the model using an analytic PRedicted REsidual Sum of Squares (PRESS) statistic. We then develop a robust alternative to PCA to deal with the presence of influential observations and outliers which minimizes the predictive reconstruction error. In some applications there may be unobserved clusters in the data, for which fitting PCA models to subsets of the data would provide a better fit. This is known as the subspace clustering problem. We develop a novel algorithm for subspace clustering which iteratively fits PCA models to subsets of the data and assigns observations to clusters based on their predictive influence on the reconstruction error. We study the convergence of the algorithm and compare its performance to a number of subspace clustering methods on simulated data and in real applications from computer vision involving clustering object trajectories in video sequences and images of faces. We extend our predictive clustering framework to a setting where two high-dimensional views of data have been obtained. Often, only either clustering or predictive modelling is performed between the views. Instead, we aim to recover clusters which are maximally predictive between the views. In this setting two-block partial least squares (TB-PLS) is a useful model. TB-PLS performs dimensionality reduction in both views by estimating latent factors that are highly predictive. We fit TB-PLS models to subsets of data and assign points to clusters based on their predictive influence under each model which is evaluated using a PRESS statistic. We compare our method to state-of-the-art algorithms in real applications in webpage and document clustering and find that our approach to predictive clustering yields superior results. Finally, we propose a method for dynamically tracking multivariate data streams based on PLS. Our method learns a linear regression function from multivariate input and output streaming data in an incremental fashion while also performing dimensionality reduction and variable selection. Moreover, the recursive regression model is able to adapt to sudden changes in the data generating mechanism and also identifies the number of latent factors. We apply our method to the enhanced index tracking problem in computational finance.
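The thesis derives an analytic PRESS statistic; the brute-force leave-one-out version below conveys what the "predictive influence" of an observation on a PCA model measures (refit without the point, reconstruct it, record the error). It is a naive illustration on synthetic data, not the analytic computation used in the thesis.

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    n, p, k = 60, 10, 2
    X = rng.standard_normal((n, p)) @ np.diag([3, 2] + [0.3] * (p - 2))
    X[0] += 8.0                                     # one clearly influential observation

    def loo_press_contributions(X, k):
        """Per-observation contribution to the leave-one-out reconstruction error."""
        out = np.empty(len(X))
        for i in range(len(X)):
            train = np.delete(X, i, axis=0)
            pca = PCA(n_components=k).fit(train)
            x_hat = pca.inverse_transform(pca.transform(X[i:i + 1]))
            out[i] = np.sum((X[i] - x_hat) ** 2)
        return out

    press_i = loo_press_contributions(X, k)
    print("most influential observation:", int(np.argmax(press_i)))   # expect 0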
APA, Harvard, Vancouver, ISO, and other styles
26

Mahammad, Beigi Majid. "Kernel methods for high-dimensional biological data." [S.l. : s.n.], 2008.

Find full text
APA, Harvard, Vancouver, ISO, and other styles
27

Köchert, Karl. "From high-dimensional data to disease mechanisms." Doctoral thesis, Humboldt-Universität zu Berlin, Mathematisch-Naturwissenschaftliche Fakultät I, 2011. http://dx.doi.org/10.18452/16297.

Full text
Abstract:
Aberrant activation of the NOTCH signaling pathway contributes decisively to a wide range of human malignancies. Based on the analysis of high-dimensional microarray datasets of classical Hodgkin lymphoma cases and non-Hodgkin cases, we identified a Hodgkin lymphoma-specific NOTCH signature. This signature is significantly dominated by the essential NOTCH co-activator Mastermind-like 2 (MAML2). Building on this result, we further investigated the role of MAML2 in the context of the aberrantly regulated, Hodgkin lymphoma-specific NOTCH pathway. The significant overexpression of MAML2 in Hodgkin lymphoma was verified in several Hodgkin lymphoma cell lines and also by immunohistochemical analysis of primary Hodgkin lymphoma cases. Using knockdown of MAML2, and inhibition of the NOTCH pathway with a competitively acting, dominant-negative, truncated variant of MAML1, we then showed that the overexpression of MAML2 is the limiting factor for the Hodgkin lymphoma-specific, pathological deregulation of NOTCH signaling. The MAML2-mediated overactivation of the NOTCH pathway is, moreover, essential for the proliferation of Hodgkin lymphoma cell lines and for the aberrant expression of the NOTCH target genes HES7 and HEY1. The constitutive presence of activated, intracellular NOTCH1 in Hodgkin lymphoma cell lines further implies that the pathway is activated in a cell-autonomous manner in Hodgkin lymphoma. This work thus uncovers a new, pathologically highly potent mechanism of NOTCH pathway deregulation.
Inappropriate activation of the NOTCH signaling pathway, e.g. by activating mutations, contributes to the pathogenesis of various human malignancies. Using a bottom-up approach based on the acquisition of high-dimensional microarray data of classical Hodgkin lymphoma (cHL) and non-Hodgkin B cell lymphomas as control, we identify a cHL-specific NOTCH gene-expression signature dominated by the NOTCH co-activator Mastermind-like 2 (MAML2). This formed the basis for demonstrating that aberrant expression of the essential NOTCH co-activator MAML2 provides an alternative mechanism to activate NOTCH signaling in human lymphoma cells. Using immunohistochemistry we detected high-level MAML2 expression in several B cell-derived lymphoma types, including cHL cells, whereas in normal B cells no staining for MAML2 was detectable. Inhibition of MAML protein activity by a dominant negative form of MAML, or by shRNAs targeting MAML2 in cHL cells, resulted in down-regulation of the NOTCH target genes HES7 and HEY1, which we identified as overexpressed in cHL cells, and in reduced proliferation. In order to target the NOTCH transcriptional complex directly, we developed short peptide constructs that competitively inhibit NOTCH-dependent transcriptional activity, as demonstrated by NOTCH reporter assays and EMSA analyses. We conclude that NOTCH signaling is aberrantly activated in a cell-autonomous manner in cHL. This is mediated by high-level expression of the essential NOTCH co-activator MAML2, a protein that is only weakly expressed in B cells from healthy donors. Using short peptide constructs we moreover show that this approach is promising with regard to the development of NOTCH pathway inhibitors that will also work in NOTCH-associated malignancies that are resistant to γ-secretase inhibition.
APA, Harvard, Vancouver, ISO, and other styles
28

Salaro, Rossana <1994&gt. "Multinomial Logistic Regression with High Dimensional Data." Master's Degree Thesis, Università Ca' Foscari Venezia, 2018. http://hdl.handle.net/10579/13814.

Full text
Abstract:
This thesis investigates multinomial logistic regression in the presence of high-dimensional data. Multinomial logistic regression has been widely used to model categorical data in a variety of fields, including the health, physical and social sciences. In this thesis we apply to multinomial logistic regression three different kinds of dimensionality reduction techniques, namely ridge regression, the lasso and principal components regression. These methods reduce the dimensions of the design matrix used to build the multinomial logistic regression model by selecting those explanatory variables that most affect the response variable. We carry out an extensive simulation study to compare and contrast the three reduction methods. Moreover, we illustrate the multinomial regression model on different case studies that allow us to highlight the benefits and limits of the different approaches.
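One of the three reductions compared in the thesis, principal components regression, can be expressed as a simple pipeline: reduce the design matrix with PCA, then fit a multinomial logistic model on the scores. The sketch below uses synthetic data with a latent factor structure; the component count and sizes are illustrative assumptions, not settings from the thesis.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    n, p = 200, 500                                  # high-dimensional design matrix
    F = rng.standard_normal((n, 5))                  # latent factors
    W = rng.standard_normal((5, p))
    X = F @ W + 0.5 * rng.standard_normal((n, p))    # correlated explanatory variables
    logits = F[:, :2] @ np.array([[1.5, -1.0], [1.0, 0.8]])
    y = np.argmax(np.column_stack([np.zeros(n), logits]), axis=1)   # 3 categories

    pcr = Pipeline([
        ("pca", PCA(n_components=20)),                # dimensionality reduction step
        ("logit", LogisticRegression(max_iter=2000))  # multinomial logistic on the scores
    ])
    acc = cross_val_score(pcr, X, y, cv=5).mean()
    print(f"5-fold accuracy of the PCR-multinomial pipeline: {acc:.2f}")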
APA, Harvard, Vancouver, ISO, and other styles
29

Blake, Patrick Michael. "Biclustering and Visualization of High Dimensional Data using VIsual Statistical Data Analyzer." Thesis, Virginia Tech, 2019. http://hdl.handle.net/10919/87392.

Full text
Abstract:
Many data sets have too many features for conventional pattern recognition techniques to work properly. This thesis investigates techniques that alleviate these difficulties. One such technique, biclustering, clusters data in both dimensions and is inherently resistant to the challenges posed by having too many features. However, the algorithms that implement biclustering have limitations in that the user must know at least the structure of the data and how many biclusters to expect. This is where the VIsual Statistical Data Analyzer, or VISDA, can help. It is a visualization tool that successively and progressively explores the structure of the data, identifying clusters along the way. This thesis proposes coupling VISDA with biclustering to overcome some of the challenges of data sets with too many features. Further, to increase the performance, usability, and maintainability as well as reduce costs, VISDA was translated from Matlab to a Python version called VISDApy. Both VISDApy and the overall process were demonstrated with real and synthetic data sets. The results of this work have the potential to improve analysts' understanding of the relationships within complex data sets and their ability to make informed decisions from such data.
Master of Science
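For readers new to biclustering, the following sketch clusters a synthetic checkerboard matrix in both dimensions with scikit-learn's SpectralBiclustering. It is a generic stand-in for the biclustering step described above, not the VISDApy pipeline, and the cluster counts are assumptions made for the example.

    # Minimal sketch: bicluster rows and columns of a synthetic data matrix.
    import numpy as np
    from sklearn.datasets import make_checkerboard
    from sklearn.cluster import SpectralBiclustering

    data, rows, cols = make_checkerboard(shape=(300, 50), n_clusters=(4, 3),
                                         noise=10, random_state=0)

    model = SpectralBiclustering(n_clusters=(4, 3), method="log", random_state=0)
    model.fit(data)

    # Each sample and each feature receives a bicluster label.
    print("row cluster sizes:   ", np.bincount(model.row_labels_))
    print("column cluster sizes:", np.bincount(model.column_labels_))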
APA, Harvard, Vancouver, ISO, and other styles
30

Chung, David H. S. "High-dimensional glyph-based visualization and interactive techniques." Thesis, Swansea University, 2014. https://cronfa.swan.ac.uk/Record/cronfa42276.

Full text
Abstract:
The advancement of modern technology and scientific measurement has led to datasets growing in both size and complexity, exposing the need for more efficient and effective ways of visualizing and analysing data. Despite the amount of progress in visualization methods, high-dimensional data still pose a number of significant challenges in terms of the technical ability to realise such a mapping, and how accurately the result is actually interpreted. The different data sources and characteristics which arise from a wide range of scientific domains, as well as specific design requirements, constantly create new challenges for visualization research. This thesis presents several contributions to the field of glyph-based visualization. Glyphs are parametrised objects which encode one or more data values in their appearance (also referred to as visual channels), such as their size, colour, shape, and position. They have been widely used to convey information visually, and are especially well suited for displaying complex, multi-faceted datasets. Their major strength is the ability to depict patterns of data in the context of a spatial relationship, where multi-dimensional trends can often be perceived more easily. Our research is set in the broad scope of multi-dimensional visualization, addressing several aspects of glyph-based techniques, including visual design, perception, placement, interaction, and applications. In particular, this thesis presents a comprehensive study of one interaction technique, namely sorting, for supporting various analytical tasks. We have outlined the concepts of glyph-based sorting, identified a set of design criteria for sorting interactions, designed and prototyped a user interface for sorting multivariate glyphs, developed a visual analytics technique to support sorting, conducted an empirical study on the perceptual orderability of visual channels used in glyph design, and applied glyph-based sorting to event visualization in sports applications. The content of this thesis is organised into two parts. Part I provides an overview of the basic concepts of glyph-based visualization, before describing the state of the art in this field. We then present a collection of novel glyph-based approaches to address challenges arising from real-world applications; these are detailed in Part II. Our first approach involves designing glyphs to depict the composition of multiple error-sensitivity fields. This work addresses the problem of single camera positioning, using both 2D and 3D methods to support camera configuration based on various constraints in the context of a real-world environment. Our second approach presents glyphs to visualize actions and events "at a glance". We discuss the relative merits of using metaphoric glyphs in comparison to other types of glyph designs for the particular problem of real-time sports analysis. As a result of this research, we delivered a visualization software package, MatchPad, on a tablet computer. It successfully helped coaching staff and team analysts to examine actions and events in detail whilst maintaining a clear overview of the match, and assisted in their decision making during the matches. Abstract shortened by ProQuest.
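To make the basic encoding idea concrete, the sketch below maps four variables of each record to the visual channels position, size and colour with Matplotlib. It is only the elementary glyph idea on synthetic data, assuming NumPy and Matplotlib are available; it does not reproduce the glyph designs or sorting interactions developed in the thesis.

    # Minimal glyph sketch: four variables encoded as x, y, marker size, colour.
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    data = rng.random((100, 4))          # columns: x, y, magnitude, score

    fig, ax = plt.subplots()
    points = ax.scatter(data[:, 0], data[:, 1],
                        s=50 + 400 * data[:, 2],                 # size channel
                        c=data[:, 3], cmap="viridis", alpha=0.7)  # colour channel
    fig.colorbar(points, ax=ax, label="variable 4")
    ax.set_xlabel("variable 1")
    ax.set_ylabel("variable 2")
    plt.show()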
APA, Harvard, Vancouver, ISO, and other styles
31

Battey, Heather Suzanne. "Dimension reduction and automatic smoothing in high dimensional and functional data analysis." Thesis, University of Cambridge, 2011. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.609849.

Full text
APA, Harvard, Vancouver, ISO, and other styles
32

Weng, Jiaying. "TRANSFORMS IN SUFFICIENT DIMENSION REDUCTION AND THEIR APPLICATIONS IN HIGH DIMENSIONAL DATA." UKnowledge, 2019. https://uknowledge.uky.edu/statistics_etds/40.

Full text
Abstract:
The big data era poses great challenges as well as opportunities for researchers to develop efficient statistical approaches to analyze massive data. Sufficient dimension reduction (SDR) is an important tool in modern data analysis and has received extensive attention in both academia and industry. In this dissertation, we introduce inverse regression estimators using Fourier transforms, which are superior to existing SDR methods in two respects: (1) they avoid slicing the response variable, and (2) they can readily be extended to the high-dimensional data problem. For the ultra-high dimensional problem, we investigate both eigenvalue decomposition and minimum discrepancy approaches to achieve optimal solutions, and we also develop a novel and efficient optimization algorithm to obtain the sparse estimates. We derive asymptotic properties of the proposed estimators and demonstrate their efficiency gains compared to the traditional estimators. The oracle properties of the sparse estimates are derived. Simulation studies and real data examples are used to illustrate the effectiveness of the proposed methods. The wavelet transform is another tool that effectively detects information from the time-localization of high frequencies. Parallel to the proposed Fourier transform methods, we also develop a wavelet transform version of the approach and derive the asymptotic properties of the resulting estimators.
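To clarify what "slicing the response" means, the sketch below implements classical sliced inverse regression (SIR), the slicing-based baseline that the Fourier-transform estimators above are designed to avoid, on a synthetic single-index model. The number of slices and the data-generating model are arbitrary choices for the example; this is not the thesis's estimator.

    # Sketch of classical SIR: standardise X, average within response slices,
    # and take the leading eigenvector of the between-slice covariance.
    import numpy as np

    rng = np.random.default_rng(0)
    n, p, H = 500, 10, 10
    X = rng.standard_normal((n, p))
    beta = np.zeros(p); beta[0] = 1.0            # true direction (assumed)
    y = np.sin(X @ beta) + 0.1 * rng.standard_normal(n)

    # Standardise the predictors.
    Xc = X - X.mean(axis=0)
    Sigma = np.cov(Xc, rowvar=False)
    eigval, eigvec = np.linalg.eigh(Sigma)
    Sigma_inv_sqrt = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T
    Z = Xc @ Sigma_inv_sqrt

    # Slice the response and average Z within each slice.
    order = np.argsort(y)
    M = np.zeros((p, p))
    for chunk in np.array_split(order, H):
        m = Z[chunk].mean(axis=0)
        M += (len(chunk) / n) * np.outer(m, m)

    # Leading eigenvector of M, mapped back to the original scale.
    w = np.linalg.eigh(M)[1][:, -1]
    b_hat = Sigma_inv_sqrt @ w
    print("estimated direction (normalised):", b_hat / np.linalg.norm(b_hat))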
APA, Harvard, Vancouver, ISO, and other styles
33

Polin, Afroza. "Simultaneous Inference for High Dimensional and Correlated Data." Bowling Green State University / OhioLINK, 2019. http://rave.ohiolink.edu/etdc/view?acc_num=bgsu1563182262263262.

Full text
APA, Harvard, Vancouver, ISO, and other styles
34

Bressan, Marco José Miguel. "Statistical Independence for classification for High Dimensional Data." Doctoral thesis, Universitat Autònoma de Barcelona, 2003. http://hdl.handle.net/10803/3034.

Full text
APA, Harvard, Vancouver, ISO, and other styles
35

Landfors, Mattias. "Normalization and analysis of high-dimensional genomics data." Doctoral thesis, Umeå universitet, Institutionen för matematik och matematisk statistik, 2012. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-53486.

Full text
Abstract:
In the mid-1990s the microarray technology was introduced. The technology allowed for genome-wide analysis of gene expression in one experiment. Since its introduction, similar high-throughput methods have been developed in other fields of molecular biology. These high-throughput methods provide measurements for hundreds up to millions of variables in a single experiment, and a rigorous data analysis is necessary in order to answer the underlying biological questions. Further complications arise in data analysis as technological variation is introduced into the data, due to the complexity of the experimental procedures in these experiments. This technological variation needs to be removed in order to draw relevant biological conclusions from the data. The process of removing the technical variation is referred to as normalization or pre-processing. During the last decade a large number of normalization and data analysis methods have been proposed. In this thesis, data from two types of high-throughput methods are used to evaluate the effect pre-processing methods have on further analyses. In areas where problems in current methods are identified, novel normalization methods are proposed. The evaluations of known and novel methods are performed on simulated data, real data, and data from an in-house produced spike-in experiment.
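As an example of the kind of pre-processing step evaluated above, the following sketch applies quantile normalisation, a standard normalisation for expression matrices, to synthetic arrays with different scales. It is not one of the novel methods proposed in the thesis; the data are simulated placeholders.

    # Quantile normalisation: force every row (array) to share the same
    # empirical distribution, removing array-wide technical scale effects.
    import numpy as np

    def quantile_normalise(X):
        ranks = X.argsort(axis=1).argsort(axis=1)          # rank of each value per row
        mean_quantiles = np.sort(X, axis=1).mean(axis=0)   # average sorted profile
        return mean_quantiles[ranks]

    rng = np.random.default_rng(0)
    raw = rng.lognormal(mean=0.0, sigma=1.0, size=(4, 1000)) * \
          np.array([[0.5], [1.0], [2.0], [4.0]])           # arrays with different scales
    norm = quantile_normalise(raw)
    print("row medians before:", np.round(np.median(raw, axis=1), 2))
    print("row medians after: ", np.round(np.median(norm, axis=1), 2))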
APA, Harvard, Vancouver, ISO, and other styles
36

Muja, Marius. "Scalable nearest neighbour methods for high dimensional data." Thesis, University of British Columbia, 2013. http://hdl.handle.net/2429/44402.

Full text
Abstract:
For many computer vision and machine learning problems, large training sets are key for good performance. However, the most computationally expensive part of many computer vision and machine learning algorithms consists of finding nearest neighbour matches to high dimensional vectors that represent the training data. We propose new algorithms for approximate nearest neighbour matching and evaluate and compare them with previous algorithms. For matching high dimensional features, we find two algorithms to be the most efficient: the randomized k-d forest and a new algorithm proposed in this thesis, the priority search k-means tree. We also propose a new algorithm for matching binary features by searching multiple hierarchical clustering trees and show it outperforms methods typically used in the literature. We show that the optimal nearest neighbour algorithm and its parameters depend on the dataset characteristics and describe an automated configuration procedure for finding the best algorithm to search a particular dataset. In order to scale to very large datasets that would otherwise not fit in the memory of a single machine, we propose a distributed nearest neighbour matching framework that can be used with any of the algorithms described in the thesis. All this research has been released as an open source library called FLANN (Fast Library for Approximate Nearest Neighbours), which has been incorporated into OpenCV and is now one of the most popular libraries for nearest neighbour matching.
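Since the abstract notes that FLANN has been incorporated into OpenCV, the sketch below shows approximate nearest-neighbour matching through cv2.FlannBasedMatcher on random descriptor-like vectors. It assumes the opencv-python package is installed, and the index and search parameters (a small randomized k-d forest, a fixed number of checks) are illustrative defaults, not the automatically configured settings described in the thesis.

    # Approximate k-NN matching with FLANN as exposed through OpenCV.
    import numpy as np
    import cv2

    rng = np.random.default_rng(0)
    database = rng.random((10000, 128)).astype(np.float32)   # SIFT-like descriptors
    queries = rng.random((5, 128)).astype(np.float32)

    FLANN_INDEX_KDTREE = 1
    index_params = dict(algorithm=FLANN_INDEX_KDTREE, trees=4)   # randomized k-d forest
    search_params = dict(checks=64)                              # leaves to inspect

    matcher = cv2.FlannBasedMatcher(index_params, search_params)
    matches = matcher.knnMatch(queries, database, k=3)           # 3 nearest neighbours

    for i, cands in enumerate(matches):
        print(f"query {i}:", [(c.trainIdx, round(c.distance, 3)) for c in cands])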
APA, Harvard, Vancouver, ISO, and other styles
37

Winiger, Joakim. "Estimating the intrinsic dimensionality of high dimensional data." Thesis, KTH, Matematisk statistik, 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-163170.

Full text
Abstract:
This report presents a review of some methods for estimating what is known as intrinsic dimensionality (ID). The principle behind intrinsic dimensionality estimation is that, frequently, it is possible to find some structure in data which makes it possible to re-express it using a smaller number of coordinates (dimensions). The main objective of the report is to solve a common problem: given a (typically high-dimensional) dataset, determine whether some of the dimensions are redundant, and if so, find a lower-dimensional representation of it. We introduce different approaches for ID estimation, motivate them theoretically and compare them using both synthetic and real datasets. The first three methods estimate the ID of a dataset while the fourth finds a low-dimensional version of the data. This is a useful order in which to organize the task: given an estimate of the ID of a dataset, construct a simpler version of the dataset using this number of dimensions. The results show that it is possible to obtain a remarkable reduction in the dimensionality of high-dimensional data. The different methods give similar results despite their different theoretical backgrounds, and behave as expected when used on synthetic datasets with known ID.
This report provides a review of different methods for estimating the intrinsic dimension (ID). The principle behind the concept of ID is that it is often possible to find structure in data which makes it possible to re-express the same data with a smaller number of coordinates (dimensions). The aim of this project is to solve a common problem: given a (typically high-dimensional) dataset, determine whether the number of dimensions is redundant, and if so, find a representation of the dataset that has a smaller number of dimensions. We introduce different approaches to intrinsic dimension estimation, review the theory behind them, and compare their results on both synthetic and real datasets. The first three methods estimate the intrinsic dimension of the data, while the fourth finds a lower-dimensional version of a dataset. This order is practical for the purpose of the project: once we have an estimate of the intrinsic dimension of a dataset, we can use this estimate to construct a simpler version of the dataset with that number of dimensions. The results show that for high-dimensional data the number of dimensions can be reduced considerably. The different methods give similar results despite their different theoretical backgrounds, and give the expected results when applied to synthetic datasets whose intrinsic dimensions are already known.
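As a small worked example of the idea, the sketch below applies one simple ID estimate, the number of principal components needed to explain 95% of the variance, to 3-dimensional data embedded in 50 dimensions. It is a generic illustration, not necessarily one of the four estimators reviewed in the thesis, and the 95% threshold is an arbitrary choice.

    # PCA-based intrinsic-dimensionality estimate on data with known ID = 3.
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    latent = rng.standard_normal((1000, 3))             # intrinsic dimension = 3
    embedding = rng.standard_normal((3, 50))
    X = latent @ embedding + 0.01 * rng.standard_normal((1000, 50))

    pca = PCA().fit(X)
    cumvar = np.cumsum(pca.explained_variance_ratio_)
    id_estimate = int(np.searchsorted(cumvar, 0.95)) + 1
    print("estimated intrinsic dimension:", id_estimate)   # expect about 3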
APA, Harvard, Vancouver, ISO, and other styles
38

Zhang, Peng. "Structured sensing for estimation of high-dimensional data." Thesis, Imperial College London, 2016. http://hdl.handle.net/10044/1/49415.

Full text
Abstract:
Efficient estimation and processing of high-dimensional data is important in many scientific and engineering domains. In this thesis, we explore structured sensing methods for high-dimensional signals from three different perspectives: structured random matrices for compressed sensing and corrupted sensing, atomic norm regularization for massive multiple-input multiple-output (MIMO) systems, and variable density sampling for random fields. Designing efficient sensing systems for high-dimensional data by appealing to the prior knowledge that their intrinsic information is usually small has become popular in recent years. As a starting point, compressed sensing has proven to be feasible for estimating sparse signals when the number of measurements is far less than the dimensionality of the signals. Besides fully random sensing matrices, many structured sensing matrices have been designed to reduce the computation and storage cost. We propose a unified structured sensing framework and prove the associated restricted isometry property. We demonstrate that the proposed framework encompasses many existing designs. In addition, we construct new structured sensing models based on the proposed framework. Furthermore, we consider a generalized problem where the compressive measurements are affected by both dense noise and sparse corruption. We show that in some cases the proposed framework can still guarantee faithful recovery of both the sparse signal and the corruption. The next part of the thesis is concerned with channel estimation and faulty antenna detection in massive MIMO systems. By leveraging the intrinsic information of the channel matrix through the atomic norm, we propose new algorithms and demonstrate their performance for both channel estimation and faulty antenna detection. In the last part, we propose a variable density sampling method for the estimation of high-dimensional random fields. While conventional uniform sampling requires a number of samples increasing exponentially with the dimension, we show that faithful recovery can be guaranteed with a polynomial number of random samples.
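For orientation, the sketch below runs a textbook compressed sensing experiment: a sparse signal is measured with a dense Gaussian sensing matrix and recovered with an l1 (lasso) decoder. It illustrates the recovery problem only; the structured sensing matrices and guarantees developed in the thesis are not reproduced, and the problem sizes and regularisation strength are arbitrary.

    # Textbook compressed sensing: y = A x + noise, recover x with the lasso.
    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    n, m, k = 400, 120, 8                      # signal length, measurements, sparsity
    x = np.zeros(n)
    x[rng.choice(n, k, replace=False)] = rng.standard_normal(k)

    A = rng.standard_normal((m, n)) / np.sqrt(m)   # dense random sensing matrix
    y = A @ x + 0.01 * rng.standard_normal(m)      # noisy compressive measurements

    decoder = Lasso(alpha=0.005, max_iter=50000)
    decoder.fit(A, y)
    x_hat = decoder.coef_
    print("relative recovery error:",
          np.linalg.norm(x_hat - x) / np.linalg.norm(x))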
APA, Harvard, Vancouver, ISO, and other styles
39

Nilsson, Mårten. "Augmenting High-Dimensional Data with Deep Generative Models." Thesis, KTH, Robotik, perception och lärande, RPL, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-233969.

Full text
Abstract:
Data augmentation is a technique that can be performed in various ways to improve the training of discriminative models. The recent developments in deep generative models offer new ways of augmenting existing data sets. In this thesis, a framework for augmenting annotated data sets with deep generative models is proposed, together with a method for quantitatively evaluating the quality of the generated data sets. Using this framework, two data sets for pupil localization were generated with different generative models, including both well-established models and a novel model proposed for this purpose. The novel model was shown, both qualitatively and quantitatively, to generate the best data sets. A set of smaller experiments on standard data sets also revealed cases where this generative model could improve the performance of an existing discriminative model. The results indicate that generative models can be used to augment or replace existing data sets when training discriminative models.
Data augmentation is a technique that can be performed in several ways to improve the training of discriminative models. Recent advances in deep generative models have opened up new ways of augmenting existing data sets. In this work, a framework for augmenting annotated data sets with the help of deep generative models is proposed. In addition, a method for quantitative evaluation of the quality of generated data sets has been developed. Using this framework, two data sets for pupil localization were generated with different generative models. Both well-established models and a new model developed for this purpose were tested. The new model was shown, both qualitatively and quantitatively, to generate the best data sets. A number of smaller experiments on standard data sets showed examples of cases where this generative model could improve the performance of an existing discriminative model. The results indicate that generative models can be used to augment or replace existing data sets when training discriminative models.
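The augmentation loop itself can be sketched compactly. Below, a Gaussian mixture model stands in for the deep generative model (the thesis uses far heavier, GAN-style models), fitted per class and sampled to enlarge a standard data set; the data set, component count and sample sizes are assumptions for the example.

    # Class-conditional augmentation with a simple generative stand-in model.
    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.mixture import GaussianMixture

    X, y = load_digits(return_X_y=True)
    augmented_X, augmented_y = [X], [y]

    for label in np.unique(y):
        gmm = GaussianMixture(n_components=5, covariance_type="diag", random_state=0)
        gmm.fit(X[y == label])                      # fit a generator per class
        samples, _ = gmm.sample(100)                # draw synthetic class members
        augmented_X.append(samples)
        augmented_y.append(np.full(100, label))

    X_aug = np.vstack(augmented_X)
    y_aug = np.concatenate(augmented_y)
    print("original size:", X.shape, "augmented size:", X_aug.shape)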
APA, Harvard, Vancouver, ISO, and other styles
40

Schlosser, Pascal [author], and Martin [academic supervisor] Schumacher. "Netboost: statistical modeling strategies for high-dimensional data." Freiburg : Universität, 2019. http://d-nb.info/1237220505/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
41

Li, Hao. "Feature cluster selection for high-dimensional data analysis." Diss., Online access via UMI:, 2007.

Find full text
APA, Harvard, Vancouver, ISO, and other styles
42

Wang, Kaijun. "Graph-based Modern Nonparametrics For High-dimensional Data." Diss., Temple University Libraries, 2019. http://cdm16002.contentdm.oclc.org/cdm/ref/collection/p245801coll10/id/578840.

Full text
Abstract:
Statistics
Ph.D.
Developing nonparametric statistical methods and inference procedures for high-dimensional large data has been a challenging frontier problem of statistics. To attack this problem, a clear rising trend has been observed in recent years with a radically different viewpoint, "graph-based nonparametrics," which is the main research focus of this dissertation. The basic idea consists of two steps: (i) representation step: code the given data using graphs; (ii) analysis step: apply statistical methods to the graph-transformed problem to systematically tackle various types of data structures. Under this general framework, this dissertation develops two major research directions. Chapter 2, based on Mukhopadhyay and Wang (2019a), introduces a new nonparametric method for the high-dimensional k-sample comparison problem that is distribution-free, robust, and continues to work even when the dimension of the data is larger than the sample size. The proposed theory is based on modern LP-nonparametric tools and unexplored connections with spectral graph theory. The key is to construct a specially designed weighted graph from the data and to reformulate the k-sample problem into a community detection problem. The procedure is shown to possess various desirable properties along with a characteristic exploratory flavor that has practical consequences. The numerical examples show surprisingly good performance of our method under a broad range of realistic situations. Chapter 3, based on Mukhopadhyay and Wang (2019b), revisits some foundational questions about network modeling that are still unsolved. In particular, we present a unified statistical theory of the fundamental spectral graph methods (e.g., Laplacian, Modularity, Diffusion map, regularized Laplacian, Google PageRank model), which are often viewed as spectral heuristic-based empirical mystery facts. Despite half a century of research, this question has been one of the most formidable open issues, if not the core problem, in modern network science. Our approach integrates modern nonparametric statistics, mathematical approximation theory (of integral equations), and computational harmonic analysis in a novel way to develop a theory that unifies and generalizes the existing paradigm. From a practical standpoint, it is shown that this perspective can provide adequate guidance for designing next-generation computational tools for large-scale problems. As an example, we describe the high-dimensional change-point detection problem. Chapter 4 discusses some further extensions and applications of our methodologies to regularized spectral clustering and spatial graph regression problems. The dissertation concludes with a discussion of two important areas of future study.
Temple University--Theses
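The two-step "code the data as a graph, then analyse the graph" idea can be illustrated with a few lines of scikit-learn: build a symmetrised k-NN graph and partition it spectrally. This is a generic illustration of the representation/analysis split, not the LP-nonparametric k-sample test developed in the dissertation; the neighbourhood size and cluster count are assumptions.

    # Step (i): represent the data as a weighted graph; step (ii): analyse it.
    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.neighbors import kneighbors_graph
    from sklearn.cluster import SpectralClustering

    X, _ = make_blobs(n_samples=300, centers=2, n_features=50, random_state=0)

    A = kneighbors_graph(X, n_neighbors=10, mode="connectivity")
    A = (0.5 * (A + A.T)).toarray()          # symmetrised adjacency matrix

    labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                                random_state=0).fit_predict(A)
    print("community sizes:", np.bincount(labels))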
APA, Harvard, Vancouver, ISO, and other styles
43

GALVANI, MARTA. "Predictive and Clustering Methods for High dimensional data." Doctoral thesis, Università degli studi di Pavia, 2020. http://hdl.handle.net/11571/1361035.

Full text
APA, Harvard, Vancouver, ISO, and other styles
44

Zhang, Liangwei. "Big Data Analytics for eMaintenance : Modeling of high-dimensional data streams." Licentiate thesis, Luleå tekniska universitet, Drift, underhåll och akustik, 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:ltu:diva-17012.

Full text
Abstract:
Big Data analytics has attracted intense interest from both academia and industry recently for its attempt to extract information, knowledge and wisdom from Big Data. In industry, with the development of sensor technology and Information & Communication Technologies (ICT), reams of high-dimensional data streams are being collected and curated by enterprises to support their decision-making. Fault detection from these data is one of the important applications in eMaintenance solutions with the aim of supporting maintenance decision-making. Early discovery of system faults may ensure the reliability and safety of industrial systems and reduce the risk of unplanned breakdowns. Both high dimensionality and the properties of data streams impose stringent challenges on fault detection applications. From the data modeling point of view, high dimensionality may cause the notorious “curse of dimensionality” and lead to the accuracy deterioration of fault detection algorithms. On the other hand, fast-flowing data streams require fault detection algorithms to have low computing complexity and give real-time or near real-time responses upon the arrival of new samples. Most existing fault detection models work on relatively low-dimensional spaces. Theoretical studies on high-dimensional fault detection mainly focus on detecting anomalies on subspace projections of the original space. However, these models are either arbitrary in selecting subspaces or computationally intensive. In considering the requirements of fast-flowing data streams, several strategies have been proposed to adapt existing fault detection models to online mode for them to be applicable in stream data mining. Nevertheless, few studies have simultaneously tackled the challenges associated with high dimensionality and data streams. In this research, an Angle-based Subspace Anomaly Detection (ABSAD) approach to fault detection from high-dimensional data is developed. Both analytical study and numerical illustration demonstrated the efficacy of the proposed ABSAD approach. Based on the sliding window strategy, the approach is further extended to an online mode with the aim of detecting faults from high-dimensional data streams. Experiments on synthetic datasets proved that the online ABSAD algorithm can be adaptive to the time-varying behavior of the monitored system, and hence applicable to dynamic fault detection.
Approved: 2015-05-12. The following person will hold a licentiate seminar for the degree of Licentiate of Engineering. Name: Liangwei Zhang. Subject: Operation and Maintenance Engineering. Thesis: Big Data Analytics for eMaintenance. Examiner: Professor Uday Kumar, Division of Operation, Maintenance and Acoustics, Department of Civil, Environmental and Natural Resources Engineering, Luleå University of Technology. Discussant: Professor Wolfgang Birk, Division of Signals and Systems, Department of Computer Science, Electrical and Space Engineering, Luleå University of Technology. Time: Wednesday 10 June 2015, 10:00. Place: E243, Luleå University of Technology.
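The online, sliding-window strategy described in the abstract above can be sketched with a generic detector: each new sample is scored against statistics of the most recent window. This is only a stand-in for illustration, not the ABSAD algorithm; the window length, threshold and injected fault are assumptions.

    # Generic sliding-window fault detector on a high-dimensional stream.
    import numpy as np
    from collections import deque

    def sliding_window_alarms(stream, window=200, threshold=4.0):
        buffer = deque(maxlen=window)
        alarms = []
        for t, x in enumerate(stream):
            if len(buffer) == window:
                mu = np.mean(buffer, axis=0)
                sigma = np.std(buffer, axis=0) + 1e-9
                score = np.max(np.abs((x - mu) / sigma))   # worst coordinate z-score
                if score > threshold:
                    alarms.append(t)
            buffer.append(x)                               # window slides forward
        return alarms

    rng = np.random.default_rng(0)
    stream = rng.standard_normal((2000, 50))
    stream[1500, 7] += 10.0                                # inject a fault
    print("alarms raised at:", sliding_window_alarms(stream))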
APA, Harvard, Vancouver, ISO, and other styles
45

François, Damien. "High-dimensional data analysis : optimal metrics and feature selection." Université catholique de Louvain, 2007. http://edoc.bib.ucl.ac.be:81/ETD-db/collection/available/BelnUcetd-01152007-162739/.

Full text
Abstract:
High-dimensional data are everywhere: texts, sounds, spectra, images, etc., are described by thousands of attributes. However, many of the data analysis tools at our disposal (coming from statistics, artificial intelligence, etc.) were designed for low-dimensional data. Many of the explicit or implicit assumptions made while developing the classical data analysis tools do not carry over to high-dimensional data. For instance, many tools rely on the Euclidean distance to compare data elements. But the Euclidean distance concentrates in high-dimensional spaces: all distances between data elements seem identical. The Euclidean distance is furthermore incapable of distinguishing important attributes from irrelevant ones. This thesis therefore focuses on the choice of a relevant distance function to compare high-dimensional data, and on the selection of the relevant attributes. In Part One of the thesis, the phenomenon of the concentration of distances is considered, and its consequences for data analysis tools are studied. It is shown that for nearest neighbour search, the Euclidean distance and the Gaussian kernel, both heavily used, may not be appropriate; it is thus proposed to use fractional metrics and generalised Gaussian kernels. Part Two of the thesis focuses on the problem of feature selection in the case of a large number of initial features. Two methods are proposed to (1) reduce the computational burden of the feature selection process and (2) cope with the instability induced by the high correlation between features that often appears with high-dimensional data. Most of the concepts studied and presented in this thesis are illustrated on chemometric data, and more particularly on spectral data, with the objective of inferring a physical or chemical property of a material by analysing the spectrum of the light it reflects.
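The concentration effect mentioned above is easy to observe numerically: the relative contrast (farthest minus nearest distance, divided by the nearest) shrinks as the dimension grows, and shrinks more slowly for a fractional metric than for the Euclidean one. The sketch below demonstrates this on uniform random data; it is purely illustrative and not taken from the thesis.

    # Relative contrast of Minkowski distances as dimension grows.
    import numpy as np

    def relative_contrast(X, p):
        d = np.sum(np.abs(X - X[0]) ** p, axis=1) ** (1.0 / p)   # distances to point 0
        d = d[1:]
        return (d.max() - d.min()) / d.min()

    rng = np.random.default_rng(0)
    for dim in (2, 10, 100, 1000):
        X = rng.random((2000, dim))
        print(f"dim={dim:4d}  euclidean={relative_contrast(X, 2.0):6.3f}  "
              f"fractional(p=0.5)={relative_contrast(X, 0.5):6.3f}")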
APA, Harvard, Vancouver, ISO, and other styles
46

Radovanović, Miloš. "High-Dimensional Data Representations and Metrics for Machine Learning and Data Mining." PhD thesis, Univerzitet u Novom Sadu, Prirodno-matematički fakultet u Novom Sadu, 2011. https://www.cris.uns.ac.rs/record.jsf?recordId=77530&source=NDLTD&language=en.

Full text
Abstract:
In the current information age, massive amounts of data are gathered, at a rate prohibiting their effective structuring, analysis, and conversion into useful knowledge. This information overload is manifested both in large numbers of data objects recorded in data sets, and large numbers of attributes, also known as high dimensionality. This dissertation deals with problems originating from high dimensionality of data representation, referred to as the “curse of dimensionality,” in the context of machine learning, data mining, and information retrieval. The described research follows two angles: studying the behavior of (dis)similarity metrics with increasing dimensionality, and exploring feature-selection methods, primarily with regard to document representation schemes for text classification. The main results of the dissertation, relevant to the first research angle, include theoretical insights into the concentration behavior of cosine similarity, and a detailed analysis of the phenomenon of hubness, which refers to the tendency of some points in a data set to become hubs by being included in unexpectedly many k-nearest neighbor lists of other points. The mechanisms behind the phenomenon are studied in detail, both from a theoretical and empirical perspective, linking hubness with the (intrinsic) dimensionality of data, describing its interaction with the cluster structure of data and the information provided by class labels, and demonstrating the interplay of the phenomenon and well known algorithms for classification, semi-supervised learning, clustering, and outlier detection, with special consideration being given to time-series classification and information retrieval. Results pertaining to the second research angle include quantification of the interaction between various transformations of high-dimensional document representations, and feature selection, in the context of text classification.
In the current information age, massive amounts of data are collected at a rate that does not allow their effective structuring, analysis, and conversion into useful knowledge. This information overload is manifested both through the large number of objects included in data sets and through the large number of attributes, also known as high dimensionality. The dissertation deals with problems arising from the high dimensionality of data representation, often called the "curse of dimensionality," in the context of machine learning, data mining and information retrieval. The described research follows two directions: studying the behaviour of (dis)similarity metrics with respect to increasing dimensionality, and studying feature selection methods, primarily in interaction with document representation techniques for text classification. The central results of the dissertation, relevant to the first research direction, include theoretical insights into the concentration phenomenon of the cosine similarity measure, and a detailed analysis of the hubness phenomenon, which refers to the tendency of some points in a data set to become hubs by being included in unexpectedly many k-nearest-neighbour lists of other points. The mechanisms that drive the phenomenon are studied in detail, both from a theoretical and an empirical perspective. Hubness is linked to the (intrinsic) dimensionality of data, its interaction with the cluster structure of data and with the information provided by class labels is described, and its effect on well-known algorithms for classification, semi-supervised learning, clustering and outlier detection is demonstrated, with special attention given to time-series classification and information retrieval. The results concerning the second research direction include a quantification of the interaction between different transformations of high-dimensional document representations and feature selection, in the context of text classification.
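The hubness phenomenon analysed above is simple to reproduce: compute the k-occurrence N_k(x), the number of times a point x appears in other points' k-nearest-neighbour lists, and watch its distribution become strongly right-skewed as the dimension grows. The sketch below does this on uniform random data with k = 10; it is an illustration only, not an experiment from the dissertation.

    # Skewness of the k-occurrence distribution as dimension increases.
    import numpy as np
    from scipy.stats import skew
    from sklearn.neighbors import NearestNeighbors

    rng = np.random.default_rng(0)
    k = 10
    for dim in (3, 20, 100):
        X = rng.random((2000, dim))
        nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
        _, idx = nn.kneighbors(X)                 # first neighbour is the point itself
        counts = np.bincount(idx[:, 1:].ravel(), minlength=len(X))
        print(f"dim={dim:3d}  skewness of N_k: {skew(counts):.2f}  "
              f"max N_k: {counts.max()}")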
APA, Harvard, Vancouver, ISO, and other styles
47

Vege, Sri Harsha. "Ensemble of Feature Selection Techniques for High Dimensional Data." TopSCHOLAR®, 2012. http://digitalcommons.wku.edu/theses/1164.

Full text
Abstract:
Data mining involves the use of data analysis tools to discover previously unknown, valid patterns and relationships from large amounts of data stored in databases, data warehouses, or other information repositories. Feature selection is an important preprocessing step of data mining that helps increase the predictive performance of a model. The main aim of feature selection is to choose a subset of features with high predictive information and eliminate irrelevant features with little or no predictive information. Using a single feature selection technique may generate local optima. In this thesis we propose an ensemble approach to feature selection, where multiple feature selection techniques are combined to yield more robust and stable results. The ensemble of multiple feature ranking techniques is performed in two steps. The first step involves creating a set of different feature selectors, each providing its sorted order of features, while the second step aggregates the results of all feature ranking techniques. The ensemble method used in our study is the frequency count, which is accompanied by the mean to resolve any frequency count collisions. The experiments conducted in this work are performed on datasets collected from the Kent Ridge bio-medical data repository. The Lung Cancer dataset and the Lymphoma dataset are selected from the repository for the experiments. The Lung Cancer dataset consists of 57 attributes and 32 instances, and the Lymphoma dataset consists of 4027 attributes and 96 instances. Experiments are performed on the reduced datasets obtained from feature ranking. These datasets are used to build the classification models. Model performance is evaluated in terms of the AUC (area under the receiver operating characteristic curve) metric. ANOVA tests are also performed on the AUC performance metric. Experimental results suggest that an ensemble of multiple feature selection techniques is more effective than an individual feature selection technique.
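A frequency-count ensemble of rankers in the spirit described above can be sketched briefly: each ranker votes for its top-m features, votes are counted, and ties are broken by the mean rank. The particular rankers, the value of m, and the synthetic data below are illustrative choices, not the exact configuration used in the thesis.

    # Ensemble feature ranking: aggregate several rankers by frequency count.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import f_classif, mutual_info_classif
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=96, n_features=500, n_informative=15,
                               random_state=0)
    m = 50                                              # top-m votes per ranker

    scores = [
        f_classif(X, y)[0],                             # ANOVA F-score
        mutual_info_classif(X, y, random_state=0),      # mutual information
        RandomForestClassifier(random_state=0).fit(X, y).feature_importances_,
    ]
    ranks = np.array([np.argsort(np.argsort(-s)) for s in scores])   # 0 = best

    votes = (ranks < m).sum(axis=0)                     # frequency count
    mean_rank = ranks.mean(axis=0)                      # tie-breaker
    order = np.lexsort((mean_rank, -votes))             # votes first, then mean rank
    print("selected features:", order[:m])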
APA, Harvard, Vancouver, ISO, and other styles
48

Ding, Yuanyuan. "Handling complex, high dimensional data for classification and clustering /." Full text available from ProQuest UM Digital Dissertations, 2007. http://0-proquest.umi.com.umiss.lib.olemiss.edu/pqdweb?index=0&did=1400971141&SrchMode=2&sid=1&Fmt=2&VInst=PROD&VType=PQD&RQT=309&VName=PQD&TS=1219343482&clientId=22256.

Full text
APA, Harvard, Vancouver, ISO, and other styles
49

Tillander, Annika. "Classification models for high-dimensional data with sparsity patterns." Doctoral thesis, Stockholms universitet, Statistiska institutionen, 2013. http://urn.kb.se/resolve?urn=urn:nbn:se:su:diva-95664.

Full text
Abstract:
Today's high-throughput data collection devices, e.g. spectrometers and gene chips, create information in abundance. However, this poses serious statistical challenges, as the number of features is usually much larger than the number of observed units. Further, in this high-dimensional setting, only a small fraction of the features are likely to be informative for any specific project. In this thesis, three different approaches to two-class supervised classification in this high-dimensional, low-sample-size setting are considered. There are classifiers that are known to mitigate the issues of high dimensionality, e.g. distance-based classifiers such as Naive Bayes. However, these classifiers are often computationally intensive, although less so for discrete data. Hence, continuous features are often transformed into discrete features. In the first paper, a discretization algorithm suitable for high-dimensional data is suggested and compared with other discretization approaches. Further, the effect of discretization on the misclassification probability in the high-dimensional setting is evaluated. Linear classifiers are more stable, which motivates adjusting the linear discriminant procedure to the high-dimensional setting. In the second paper, a two-stage estimation procedure for the inverse covariance matrix, applying Lasso-based regularization and Cuthill-McKee ordering, is suggested. The estimation gives a block-diagonal approximation of the covariance matrix, which in turn leads to an additive classifier. In the third paper, an asymptotic framework that represents sparse and weak block models is derived and a technique for block-wise feature selection is proposed. Probabilistic classifiers have the advantage of providing the probability of membership in each class for new observations rather than simply assigning them to a class. In the fourth paper, a method is developed for constructing a Bayesian predictive classifier. Given the block-diagonal covariance matrix, the resulting Bayesian predictive and marginal classifier provides an efficient solution to the high-dimensional problem by splitting it into smaller tractable problems. The relevance and benefits of the proposed methods are illustrated using both simulated and real data.
With today's technology, for example spectrometers and gene chips, data are generated in large quantities. This abundance of data is not only an advantage but also causes certain problems; typically, the number of variables (p) is considerably larger than the number of observations (n). This gives so-called high-dimensional data, which require new statistical methods, since the traditional methods were developed for the opposite situation (p < n). Moreover, usually very few of all these variables are relevant for any given project, and the strength of the information in the relevant variables is often weak. This type of data is therefore often described as sparse and weak. Identifying the relevant variables is usually likened to finding a needle in a haystack. This thesis takes up three different ways of classifying in this type of high-dimensional data. Here, classifying means that, given access to a data set with both explanatory variables and an outcome variable, a function or algorithm is trained to predict the outcome variable based on the explanatory variables only. The type of real data used in the thesis is microarrays: cell samples that show the activity of the genes in the cell. The goal of the classification is, with the help of the variation in activity among the thousands of genes (the explanatory variables), to determine whether the cell sample comes from cancer tissue or normal tissue (the outcome variable). There are classification methods that can handle high-dimensional data, but these are often computationally intensive and therefore work better for discrete data. By transforming continuous variables into discrete ones (discretization), the computation time can be reduced and the classification made more efficient. The thesis studies how discretization affects the prediction accuracy of the classification, and a very efficient discretization method for high-dimensional data is proposed. Linear classification methods have the advantage of being stable. The drawback is that they require an invertible covariance matrix, which the covariance matrix is not for high-dimensional data. The thesis proposes a way of estimating the inverse of sparse covariance matrices with a block-diagonal matrix. This matrix also has the advantage of leading to additive classification, which makes it possible to select whole blocks of relevant variables. The thesis also presents a method for identifying and selecting the blocks. There are also probabilistic classification methods, which have the advantage of giving, for an observation, the probability of belonging to each of the possible outcomes, unlike most other classification methods, which only predict the outcome. The thesis proposes such a Bayesian method, given the block-diagonal matrix and normally distributed outcome classes. The relevance and advantages of the proposed methods are demonstrated by applying them to simulated and real high-dimensional data.
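The first-stage idea of a Lasso-regularised inverse covariance estimate with (approximately) block-diagonal structure can be illustrated with scikit-learn's GraphicalLasso, as in the sketch below. The block detection via Cuthill-McKee ordering and the Bayesian predictive classifier themselves are not reproduced; the covariance model, sample size and regularisation strength are assumptions for the example.

    # Sparse inverse covariance estimation on data with two independent blocks.
    import numpy as np
    from scipy.linalg import block_diag
    from sklearn.covariance import GraphicalLasso

    rng = np.random.default_rng(0)
    block = np.full((5, 5), 0.6) + 0.4 * np.eye(5)      # correlated 5-feature block
    Sigma = block_diag(block, block)                     # two independent blocks
    X = rng.multivariate_normal(np.zeros(10), Sigma, size=60)

    model = GraphicalLasso(alpha=0.2).fit(X)
    precision = model.precision_
    # Cross-block entries of the precision matrix should be driven to (near) zero.
    print("within-block nonzeros:", np.count_nonzero(np.round(precision[:5, :5], 3)))
    print("cross-block nonzeros: ", np.count_nonzero(np.round(precision[:5, 5:], 3)))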
APA, Harvard, Vancouver, ISO, and other styles
50

Zhao, Jiwu [author]. "Automatic subspace clustering for high-dimensional data / Jiwu Zhao." Düsseldorf : Universitäts- und Landesbibliothek der Heinrich-Heine-Universität Düsseldorf, 2014. http://d-nb.info/1047907658/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles