
Dissertations / Theses on the topic 'High-Dimensional Regression'

Consult the top 50 dissertations / theses for your research on the topic 'High-Dimensional Regression.'

1

Fang, Zhou. "Reweighting methods in high dimensional regression." Thesis, University of Oxford, 2012. http://ora.ox.ac.uk/objects/uuid:26f8541a-9e2d-466a-84aa-e6850c4baba9.

Abstract:
In this thesis, we focus on the application of covariate reweighting with Lasso-style methods for regression in high dimensions, particularly where p ≥ n. We focus in particular on the case of sparse regression under a priori grouping structures. In such problems, even in the linear case, accurate estimation is difficult. Various authors have suggested ideas such as the Group Lasso and the Sparse Group Lasso, based on convex penalties, or alternatively methods like the Group Bridge, which rely on convergence under repetition to some local minimum of a concave penalised likelihood. We propose in this thesis a methodology that uses concave penalties to inspire a procedure whereby we compute weights from an initial estimate and then fit a single reweighted Lasso. This procedure -- the Co-adaptive Lasso -- obtains excellent results in empirical experiments, and we present some theoretical prediction and estimation error bounds. Further, several extensions and variants of the procedure are discussed and studied. In particular, we propose a Lasso-style method for additive isotonic regression in high dimensions, the Liso algorithm, and enhance it using the Co-adaptive methodology. We also propose a method of producing rule-based regression estimates for high-dimensional non-parametric regression that often outperforms the current leading method, the RuleFit algorithm. We also discuss extensions involving robust statistics applied to weight computation, repetition of the algorithm, and online computation.
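
The two-step reweighting idea can be sketched with a generic reweighted lasso. The weight rule below (inverse absolute value of the initial coefficients, as in the adaptive lasso) is an assumption for illustration; the Co-adaptive Lasso's actual weight construction is specified in the thesis.

```python
import numpy as np
from sklearn.linear_model import Lasso

def reweighted_lasso(X, y, alpha=0.1, eps=1e-6):
    """Two-step reweighted lasso sketch: fit once, derive covariate
    weights from the initial estimate, then solve a single weighted
    lasso. The 1/|beta| weights are illustrative, not the thesis's rule."""
    init = Lasso(alpha=alpha).fit(X, y)
    w = 1.0 / (np.abs(init.coef_) + eps)   # large weight => stronger penalty
    Xw = X / w                             # weighted L1 penalty via column rescaling
    fit = Lasso(alpha=alpha).fit(Xw, y)
    return fit.coef_ / w                   # map back to the original scale
```

The column-rescaling trick works because penalising gamma with a plain lasso on X diag(1/w) is equivalent to penalising beta = gamma / w with the weighted penalty sum of w_j |beta_j|.
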
2

Meier, Lukas Dieter. "High-dimensional regression problems with special structure /." Zürich : ETH, 2008. http://e-collection.ethbib.ethz.ch/show?type=diss&nr=18129.

3

Hashem, Hussein Abdulahman. "Regularized and robust regression methods for high dimensional data." Thesis, Brunel University, 2014. http://bura.brunel.ac.uk/handle/2438/9197.

Abstract:
Recently, variable selection in high-dimensional data has attracted much research interest. Classical stepwise subset selection methods are widely used in practice, but when the number of predictors is large these methods are difficult to implement. In these cases, modern regularization methods have become a popular choice, as they perform variable selection and parameter estimation simultaneously. However, the estimation procedure becomes more difficult and challenging when the data suffer from outliers or when the assumption of normality is violated, such as in the case of heavy-tailed errors. In these cases, quantile regression is the most appropriate method to use. In this thesis we combine these two classical approaches to produce regularized quantile regression methods. Chapter 2 presents a comparative simulation study of regularized and robust regression methods when the response variable is continuous. In Chapter 3, we develop a quantile regression model with a group lasso penalty for binary response data when the predictors have a grouped structure and the data suffer from outliers. In Chapter 4, we extend this method to the case of censored response variables. Numerical examples on simulated and real data are used to evaluate the performance of the proposed methods in comparison with other existing methods.
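
The base ingredient, an L1-penalized (median) quantile regression, can be sketched with scikit-learn; the group-lasso, binary-response, and censored extensions developed in the thesis are not part of this API and are omitted here.

```python
import numpy as np
from sklearn.linear_model import QuantileRegressor

rng = np.random.default_rng(0)
n, p = 100, 200                     # p > n: high-dimensional setting
X = rng.standard_normal((n, p))
y = X[:, 0] - 2 * X[:, 1] + rng.standard_t(df=2, size=n)  # heavy-tailed errors

# L1-penalized median regression: robust to outliers and selects variables.
model = QuantileRegressor(quantile=0.5, alpha=0.05, solver="highs").fit(X, y)
print("nonzero coefficients:", np.flatnonzero(model.coef_))
```
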
4

Aldahmani, Saeed. "High-dimensional linear regression problems via graphical models." Thesis, University of Essex, 2017. http://repository.essex.ac.uk/19207/.

Abstract:
This thesis introduces a new method for solving the linear regression problem where the number of observations n is smaller than the number of variables (predictors) v. In contrast to existing methods such as ridge regression, Lasso and Lars, the proposed method uses the idea of graphical models and provides unbiased parameter estimates under certain conditions. In addition, the new method provides a detailed graphical conditional correlation structure for the predictors, whereby the real causal relationships between predictors can be identified. Furthermore, the proposed method is extended into a hybrid with ridge regression to improve efficiency in terms of computation and model selection. In the extended method, less important variables are regularised by a ridge-type penalty, and the model search is carried out over the space of important covariates. This significantly reduces computational cost while giving unbiased estimates for the important variables, as well as increasing the efficiency of model selection. Moreover, the extended method is used to deal with the issue of portfolio selection within the Markowitz mean-variance framework, with n < v. Various simulations and real data analyses were conducted to compare the two novel methods with the aforementioned existing methods. Our experiments indicate that the new methods outperform all the other methods when n < v.
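
One way to see the graphical-model route: for jointly Gaussian (y, X), the regression coefficients can be read off the joint precision matrix. The sketch below estimates that matrix with scikit-learn's graphical lasso; it is a generic illustration under this Gaussian assumption, not the estimator developed in the thesis.

```python
import numpy as np
from sklearn.covariance import GraphicalLassoCV

def graphical_regression(X, y):
    """Recover regression coefficients (on the standardized scale) from
    an estimated joint precision matrix Theta of (y, X), using the
    identity beta = -Theta[0, 1:] / Theta[0, 0]. Illustrative only."""
    Z = np.column_stack([y, X])
    Z = (Z - Z.mean(0)) / Z.std(0)         # standardize each column
    theta = GraphicalLassoCV().fit(Z).precision_
    return -theta[0, 1:] / theta[0, 0]
```
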
5

Wang, Tao. "Variable selection and dimension reduction in high-dimensional regression." HKBU Institutional Repository, 2013. http://repository.hkbu.edu.hk/etd_ra/1544.

6

Lee, Wai Hong. "Variable selection for high dimensional transformation model." HKBU Institutional Repository, 2010. http://repository.hkbu.edu.hk/etd_ra/1161.

7

Chen, Xiaohui. "Lasso-type sparse regression and high-dimensional Gaussian graphical models." Thesis, University of British Columbia, 2012. http://hdl.handle.net/2429/42271.

Abstract:
High-dimensional datasets, where the number of measured variables is larger than the sample size, are not uncommon in modern real-world applications such as functional Magnetic Resonance Imaging (fMRI) data. Conventional statistical signal processing tools and mathematical models can fail at handling those datasets. Therefore, developing statistically valid models and computationally efficient algorithms for high-dimensional situations is of great importance in tackling practical and scientific problems. This thesis mainly focuses on the following two issues: (1) recovery of sparse regression coefficients in linear systems; (2) estimation of a high-dimensional covariance matrix and its inverse, both subject to additional random noise. In the first part, we focus on Lasso-type sparse linear regression. We propose two improved versions of the Lasso estimator when the signal-to-noise ratio is low: (i) leveraging adaptive robust loss functions; (ii) adopting a fully Bayesian modeling framework. In solution (i), we propose a robust Lasso with a convex combined loss function and study its asymptotic behavior. We further extend the asymptotic analysis to the Huberized Lasso, which is shown to be consistent even if the noise distribution is Cauchy. In solution (ii), we propose a fully Bayesian Lasso by unifying a discrete prior on model size and a continuous prior on regression coefficients in a single modeling framework. Since the proposed Bayesian Lasso has variable model sizes, we propose a reversible-jump MCMC algorithm to obtain its numeric estimates. In the second part, we focus on the estimation of large covariance and precision matrices. In high-dimensional situations, the sample covariance is an inconsistent estimator, so regularized estimation is needed. For covariance matrix estimation, we propose a shrinkage-to-tapering estimator and show that it has attractive theoretical properties for estimating general, large covariance matrices. For precision matrix estimation, we propose a computationally efficient algorithm based on the thresholding operator and a Neumann series expansion. We prove that the proposed estimator is consistent in several senses under the spectral norm. Moreover, we show that the proposed estimator is minimax in a class of precision matrices that are approximately inversely closed.
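
A Huberized lasso of the kind analyzed here can be solved with a simple proximal-gradient (ISTA) loop. This is a minimal sketch of the estimator's objective (Huber loss plus L1 penalty), not the thesis's own analysis or solver.

```python
import numpy as np

def huber_grad(r, delta=1.345):
    """Derivative of the Huber loss, applied elementwise to residuals r."""
    return np.where(np.abs(r) <= delta, r, delta * np.sign(r))

def huberized_lasso(X, y, lam=0.1, delta=1.345, n_iter=500):
    """Proximal gradient (ISTA) for (1/n) * sum Huber(x_i'b - y_i) + lam*||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    L = np.linalg.norm(X, 2) ** 2 / n      # Lipschitz bound for the smooth part
    for _ in range(n_iter):
        r = X @ beta - y
        g = X.T @ huber_grad(r, delta) / n
        z = beta - g / L
        beta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold
    return beta
```
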
8

Chen, Chi. "Variable selection in high dimensional semi-varying coefficient models." HKBU Institutional Repository, 2013. https://repository.hkbu.edu.hk/etd_oa/11.

Abstract:
With the development of computing and sampling technologies, high dimensionality has become an important characteristic of commonly used scientific data, such as data from bioinformatics, information engineering, and the social sciences. The varying coefficient model is a flexible and powerful statistical model for exploring dynamic patterns in many scientific areas. It is a natural extension of classical parametric models with good interpretability, and is becoming increasingly popular in data analysis. The main objective of this thesis is to apply the varying coefficient model to analyze high-dimensional data, and to investigate the properties of regularization methods for high-dimensional varying coefficient models. We first discuss how to apply local polynomial smoothing and the smoothly clipped absolute deviation (SCAD) penalized method to estimate varying coefficient models when the dimension of the model diverges with the sample size. Based on the nonconcave penalized method and local polynomial smoothing, we suggest a regularization method to select significant variables from the model and estimate the corresponding coefficient functions simultaneously. Importantly, our proposed method can also identify constant coefficients at the same time. We investigate the asymptotic properties of our proposed method and show that it has the so-called “oracle property.” We apply the nonparametric independence screening (NIS) method to varying coefficient models with ultra-high-dimensional data. Based on the marginal varying coefficient model estimation, we establish the sure independent screening property under some regularity conditions for our proposed sure screening method. Combined with our proposed regularization method, we can systematically deal with high-dimensional or ultra-high-dimensional data using varying coefficient models. The nonconcave penalized method is a very effective variable selection method. However, maximizing such a penalized likelihood function is computationally challenging, because the objective functions are nondifferentiable and nonconcave. The local linear approximation (LLA) and local quadratic approximation (LQA) are two popular algorithms for dealing with such optimization problems. In this thesis, we revisit these two algorithms. We investigate the convergence rate of LLA and show that the rate is linear. We also study the statistical properties of the one-step estimate based on LLA under a generalized statistical model with a diverging number of dimensions. We suggest a modified version of LQA to overcome its drawbacks under high-dimensional models. Our proposed method avoids calculating the inverse of the Hessian matrix in the modified Newton-Raphson algorithm based on LQA. Our proposed methods are investigated in numerical studies and in a real case study in Chapter 5.
9

Breheny, Patrick John Huang Jian. "Regularized methods for high-dimensional and bi-level variable selection." Iowa City : University of Iowa, 2009. http://ir.uiowa.edu/etd/325.

10

Villegas, Santamaría Mauricio. "Contributions to High-Dimensional Pattern Recognition." Doctoral thesis, Universitat Politècnica de València, 2011. http://hdl.handle.net/10251/10939.

Abstract:
This thesis gathers some contributions to statistical pattern recognition particularly targeted at problems in which the feature vectors are high-dimensional. Three pattern recognition scenarios are addressed, namely pattern classification, regression analysis and score fusion. For each of these, an algorithm for learning a statistical model is presented. In order to address the difficulty that is encountered when the feature vectors are high-dimensional, adequate models and objective functions are defined. The strategy of learning simultaneously a dimensionality reduction function and the pattern recognition model parameters is shown to be quite effective, making it possible to learn the model without discarding any discriminative information. Another topic that is addressed in the thesis is the use of tangent vectors as a way to take better advantage of the available training data. Using this idea, two popular discriminative dimensionality reduction techniques are shown to be effectively improved. For each of the algorithms proposed throughout the thesis, several data sets are used to illustrate the properties and the performance of the approaches. The empirical results show that the proposed techniques perform considerably well, and furthermore the models learned tend to be very computationally efficient.
Villegas Santamaría, M. (2011). Contributions to High-Dimensional Pattern Recognition [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/10939
11

Luo, Weiqi. "Spatial/temporal modelling of crop disease data using high-dimensional regression." Thesis, University of Leeds, 2008. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.493292.

Abstract:
Septoria tritici is one of the most serious foliar diseases of winter wheat across England and Wales, causing considerable reductions in yield quality and production. There are increasing pressures to control this disease using disease forecasting systems associated with various meteorological factors.
12

Herath, Herath Mudiyanselage Wiranthe Bandara. "TENSOR REGRESSION AND TENSOR TIME SERIES ANALYSES FOR HIGH DIMENSIONAL DATA." OpenSIUC, 2019. https://opensiuc.lib.siu.edu/theses/2585.

Abstract:
Many real data are naturally represented as a multidimensional array called a tensor. In classical regression and time series models, the predictors and covariate variables are treated as vectors. However, due to the high dimensionality of the predictor variables, these types of models are inefficient for analyzing multidimensional data. In contrast, tensor-structured models use predictors and covariate variables in a tensor format. Tensor regression and tensor time series models can reduce high-dimensional data to a low-dimensional framework and lead to efficient estimation and prediction. In this thesis, we discuss modeling and estimation procedures for both tensor regression models and tensor time series models. The results of simulation studies and a numerical analysis are provided.
13

Zhang, Yuankun. "(Ultra-)High Dimensional Partially Linear Single Index Models for Quantile Regression." University of Cincinnati / OhioLINK, 2018. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1535703962712806.

14

Andersson, Niklas. "Regression-Based Monte Carlo For Pricing High-Dimensional American-Style Options." Thesis, Umeå universitet, Institutionen för fysik, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-119013.

Abstract:
Pricing financial derivatives is an essential part of the financial industry. For some derivatives a closed-form solution exists; however, the pricing of high-dimensional American-style derivatives is still today a challenging problem. This project focuses on the derivative called the option, and especially the pricing of American-style basket options, i.e. options with both an early exercise feature and multiple underlying assets. In high-dimensional problems, which is definitely the case for American-style options, Monte Carlo methods are advantageous. Therefore, in this thesis, regression-based Monte Carlo has been used to determine early exercise strategies for the option. The well-known Least Squares Monte Carlo (LSM) algorithm of Longstaff and Schwartz (2001) has been implemented and compared to Robust Regression Monte Carlo (RRM) by C. Jonen (2011). The difference between these methods is that robust regression is used instead of least squares regression to calculate continuation values of American-style options. Since robust regression is more stable against outliers, this approach is claimed by C. Jonen to give better estimates of the option price. The techniques were hard to compare without the duality approach of Andersen and Broadie (2004), so this method was added as well. The numerical tests then indicate that the exercise strategy determined using RRM produces a higher lower bound and a tighter upper bound compared to LSM. The difference between the upper and lower bounds could be up to 4 times smaller using RRM. Importance sampling and quasi-Monte Carlo have also been used to reduce the variance in the estimation of the option price and to speed up the convergence rate.
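
For reference, a compact NumPy sketch of the LSM algorithm on a single-asset American put (a basket option would use multivariate regressors); parameter values are illustrative only.

```python
import numpy as np

def lsm_american_put(S0=36.0, K=40.0, r=0.06, sigma=0.2, T=1.0,
                     n_steps=50, n_paths=100_000, seed=0):
    """Least Squares Monte Carlo (Longstaff-Schwartz 2001) sketch."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    # Simulate geometric Brownian motion paths.
    dW = rng.standard_normal((n_paths, n_steps)) * np.sqrt(dt)
    S = S0 * np.exp(np.cumsum((r - 0.5 * sigma**2) * dt + sigma * dW, axis=1))
    cash = np.maximum(K - S[:, -1], 0.0)          # payoff at maturity
    for t in range(n_steps - 2, -1, -1):
        cash *= np.exp(-r * dt)                   # discount one step back
        itm = K - S[:, t] > 0                     # regress on in-the-money paths only
        x = S[itm, t]
        # Continuation value via least squares on a quadratic basis.
        A = np.column_stack([np.ones(x.size), x, x**2])
        cont = A @ np.linalg.lstsq(A, cash[itm], rcond=None)[0]
        exercise = np.maximum(K - x, 0.0) > cont
        cash[itm] = np.where(exercise, np.maximum(K - x, 0.0), cash[itm])
    return np.exp(-r * dt) * cash.mean()

print(round(lsm_american_put(), 3))   # roughly 4.47 for these textbook parameters
```
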
15

Xie, Fang. "ℓ1 penalized methods in high-dimensional regressions and its theoretical properties." Thesis, University of Macau, 2018. http://umaclib3.umac.mo/record=b3952485.

16

Lo, Shin-Lian. "High-dimensional classification and attribute-based forecasting." Diss., Georgia Institute of Technology, 2010. http://hdl.handle.net/1853/37193.

Abstract:
This thesis consists of two parts. The first part focuses on high-dimensional classification problems in microarray experiments. The second part deals with forecasting problems with a large number of categories in predictors. Classification problems in microarray experiments refer to discriminating subjects with different biologic phenotypes or known tumor subtypes, as well as to predicting the clinical outcomes or the prognostic stages of subjects. One important characteristic of microarray data is that the number of genes is much larger than the sample size. The penalized logistic regression method is known for simultaneous variable selection and classification. However, the performance of this method declines as the number of variables increases. With this concern, in the first study, we propose a new classification approach that employs the penalized logistic regression method iteratively with a controlled size of gene subsets to maintain variable selection consistency and classification accuracy. The second study is motivated by a modern microarray experiment that includes two layers of replicates. This new experimental setting makes most existing classification methods, including penalized logistic regression, inappropriate to apply directly, because the assumption of independent observations is violated. To solve this problem, we propose a new classification method that incorporates random effects into penalized logistic regression, such that the heterogeneity among different experimental subjects and the correlations from repeated measurements can be taken into account. An efficient hybrid algorithm is introduced to tackle the computational challenges in estimation and integration. Applications to a breast cancer study show that the proposed classification method obtains smaller models with higher prediction accuracy than the method based on the assumption of independent observations. The second part of this thesis develops a new forecasting approach for large-scale datasets associated with a large number of predictor categories and with structured predictors. The new approach, going beyond conventional tree-based methods, incorporates a general linear model and hierarchical splits to make trees more comprehensive, efficient, and interpretable. Through an empirical study in the air cargo industry and a simulation study containing several different settings, the new approach produces higher forecasting accuracy and higher computational efficiency than existing tree-based methods.
17

Ratnasingam, Suthakaran. "Sequential Change-point Detection in Linear Regression and Linear Quantile Regression Models Under High Dimensionality." Bowling Green State University / OhioLINK, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=bgsu159050606401363.

18

Pettersson, Anders. "High-Dimensional Classification Models with Applications to Email Targeting." Thesis, KTH, Matematisk statistik, 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-168203.

Abstract:
Email communication is valuable for any modern company, since it offers an easy means of spreading important information or advertising new products, features or offers, and much more. Being able to identify which customers would be interested in certain information would make it possible to significantly improve a company's email communication, avoiding customers starting to ignore messages and creating unnecessary badwill. This thesis focuses on targeting customers by applying statistical learning methods to historical data provided by the music streaming company Spotify. An important aspect was the high dimensionality of the data, which places certain demands on the applied methods. A binary classification model was created, where the target was whether a customer would open the email or not. Two approaches were used for targeting the customers: logistic regression, both with and without regularization, and a random forest classifier, chosen for their ability to handle the high dimensionality of the data. The performance of the suggested models was then evaluated on both a training set and a test set using statistical validation methods such as cross-validation, ROC curves and lift charts. The models were studied under both large-sample and high-dimensional scenarios. The high-dimensional scenario represents the case when the number of observations, N, is of the same order as the number of features, p, and the large-sample scenario represents the case when N ≫ p. Lasso-based variable selection was performed for both of these scenarios to study the informative value of the features. This study demonstrates that it is possible to greatly improve the opening rate of emails by targeting users, even in the high-dimensional scenario. The results show that increasing the amount of training data over a thousandfold will only improve the performance marginally. Instead, efficient customer targeting can be achieved by using a few highly informative variables selected by Lasso regularization.
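
A toy version of the comparison on synthetic data (the Spotify data are proprietary; the generated dataset below is a stand-in with N of the same order as p):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic high-dimensional binary "opened email" problem.
X, y = make_classification(n_samples=2000, n_features=1000,
                           n_informative=20, random_state=0)

lasso_lr = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
forest = RandomForestClassifier(n_estimators=300, random_state=0)

for name, clf in [("L1 logistic regression", lasso_lr), ("random forest", forest)]:
    auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: cross-validated AUC = {auc:.3f}")
```
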
19

Gusnanto, Arief. "Regression on high-dimensional predictor space : with application in chemometrics and microarray data /." Stockholm, 2004. http://diss.kib.ki.se/2004/91-7140-153-9/.

20

Yi, Congrui. "Penalized methods and algorithms for high-dimensional regression in the presence of heterogeneity." Diss., University of Iowa, 2016. https://ir.uiowa.edu/etd/2299.

Abstract:
In fields such as statistics, economics and biology, heterogeneity is an important topic concerning the validity of data inference and the discovery of hidden patterns. This thesis focuses on penalized methods for regression analysis in the presence of heterogeneity in a potentially high-dimensional setting. Two possible strategies to deal with heterogeneity are: robust regression methods that provide heterogeneity-resistant coefficient estimation, and direct detection of heterogeneity while estimating coefficients accurately at the same time. We consider the first strategy for two robust regression methods, Huber loss regression and quantile regression with Lasso or Elastic-Net penalties, which have been studied theoretically but lack efficient algorithms. We propose a new algorithm, Semismooth Newton Coordinate Descent, to solve them. The algorithm is a novel combination of the Semismooth Newton Algorithm and Coordinate Descent that applies to penalized optimization problems with both nonsmooth loss and nonsmooth penalty. We prove its convergence properties, and show its computational efficiency through numerical studies. We also propose a nonconvex penalized regression method, Heterogeneity Discovery Regression (HDR), as a realization of the second idea. We establish theoretical results that guarantee statistical precision for any local optimum of the objective function with high probability. We also compare the numerical performance of HDR with competitors including Huber loss regression, quantile regression and least squares through simulation studies and a real data example. In these experiments, HDR methods are able to detect heterogeneity accurately, and also largely outperform the competitors in terms of coefficient estimation and variable selection.
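
For context, here is the plain cyclic coordinate descent that Semismooth Newton Coordinate Descent generalizes, shown for the squared-error lasso (a baseline sketch, not the SNCD algorithm itself):

```python
import numpy as np

def lasso_cd(X, y, lam, n_sweeps=200):
    """Cyclic coordinate descent for (1/2n)||y - Xb||^2 + lam*||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    r = y.copy()                            # running residual y - X @ beta
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_sweeps):
        for j in range(p):
            # Correlation of feature j with the partial residual.
            rho = X[:, j] @ r / n + col_sq[j] * beta[j]
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            r += X[:, j] * (beta[j] - new)  # keep the residual in sync
            beta[j] = new
    return beta
```
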
21

Vasquez, Monica M., and Monica M. Vasquez. "Penalized Regression Methods in the Study of Serum Biomarkers for Overweight and Obesity." Diss., The University of Arizona, 2017. http://hdl.handle.net/10150/625637.

Abstract:
The study of circulating biomarkers and their association with disease outcomes has become progressively more complex due to advances in the measurement of these biomarkers through multiplex technologies. Although the availability of numerous serum biomarkers is highly promising, multiplex assays present statistical challenges due to the high dimensionality of these data. In this dissertation, three studies are presented that address these challenges using L1 penalized regression methods. In the first part of the dissertation, an extensive simulation study is performed for the logistic regression model that compares the Least Absolute Shrinkage and Selection Operator (LASSO) method with five LASSO-type methods, given scenarios that are present in serum biomarker research, such as high correlation between biomarkers, weak associations with the outcome, and a sparse number of true signals. Results show that the choice of optimal LASSO-type method depends on the data structure and should be guided by the research objective. The methods are then applied to the Tucson Epidemiological Study of Airway Obstructive Disease (TESAOD) for the identification of serum biomarkers of overweight and obesity. Measurement of serum biomarkers using multiplex technologies may be more variable than with traditional single-biomarker methods. Measurement error may induce bias in parameter estimation and complicate the variable selection process. In the second part of the dissertation, an existing measurement error correction method for penalized linear regression with the L1 penalty has been adapted to accommodate validation data on a randomly selected subset of the study sample. A simulation study and an analysis of TESAOD data demonstrate that the proposed approach improves variable selection and reduces bias in parameter estimation with validation data as small as 10 percent of the study sample. In the third part of the dissertation, a measurement error correction method that utilizes validation data is proposed for the penalized logistic regression model with the L1 penalty. A simulation study and an analysis of TESAOD data are used to evaluate the proposed method. Results show an improvement in variable selection.
22

Wang, Fan. "Penalised regression for high-dimensional data : an empirical investigation and improvements via ensemble learning." Thesis, University of Cambridge, 2019. https://www.repository.cam.ac.uk/handle/1810/289419.

Abstract:
In a wide range of applications, datasets are generated for which the number of variables p exceeds the sample size n. Penalised likelihood methods are widely used to tackle regression problems in these high-dimensional settings. In this thesis, we carry out an extensive empirical comparison of the performance of popular penalised regression methods in high-dimensional settings and propose new methodology that uses ensemble learning to enhance the performance of these methods. The relative efficacy of different penalised regression methods in finite-sample settings remains incompletely understood. Through a large-scale simulation study, consisting of more than 1,800 data-generating scenarios, we systematically consider the influence of various factors (for example, sample size and sparsity) on method performance. We focus on three related goals (prediction, variable selection and variable ranking) and consider six widely used methods. The results are supported by a semi-synthetic data example. Our empirical results complement existing theory and provide a resource to compare performance across a range of settings and metrics. We then propose a new ensemble learning approach for improving the performance of penalised regression methods, called STructural RANDomised Selection (STRANDS). The approach, which builds and improves upon the Random Lasso method, consists of two steps. In both steps, we reduce dimensionality by repeated subsampling of variables. We apply a penalised regression method to each subsampled dataset and average the results. In the first step, subsampling is informed by variable correlation structure, and in the second step, by variable importance measures from the first step. STRANDS can be used with any sparse penalised regression approach as the "base learner". In simulations, we show that STRANDS typically improves upon its base learner, and demonstrate that taking account of the correlation structure in the first step can help to improve the efficiency with which the model space may be explored. We propose another ensemble learning method to improve the prediction performance of Ridge Regression in sparse settings. Specifically, we combine Bayesian Ridge Regression with a probabilistic forward selection procedure, where inclusion of a variable at each stage is probabilistically determined by a Bayes factor. We compare the prediction performance of the proposed method to penalised regression methods using simulated data.
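
A stripped-down sketch of the subsample-and-average idea follows (uniform subsampling only; STRANDS's correlation-informed first step and importance-weighted second step are omitted):

```python
import numpy as np
from sklearn.linear_model import LassoCV

def random_subspace_lasso(X, y, n_learners=50, frac=0.2, seed=0):
    """Fit lassos on random variable subsets and average the coefficients,
    in the spirit of Random Lasso / STRANDS. Simplified illustration only."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    k = max(1, int(frac * p))
    coefs = np.zeros(p)
    for _ in range(n_learners):
        idx = rng.choice(p, size=k, replace=False)   # random variable subset
        fit = LassoCV(cv=5).fit(X[:, idx], y)
        coefs[idx] += fit.coef_
    return coefs / n_learners                        # zeros count toward the average
```
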
23

Ohlsson, Henrik. "Regression on Manifolds with Implications for System Identification." Licentiate thesis, Linköping University, Linköping University, Automatic Control, 2008. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-15467.

Abstract:
The trend today is to use many inexpensive sensors instead of a few expensive ones, since the same accuracy can generally be obtained by fusing several dependent measurements. It also follows that the robustness against failing sensors is improved. As a result, the need for high-dimensional regression techniques is increasing.

As measurements are dependent, the regressors will be constrained to some manifold. There is then a representation of the regressors, of the same dimension as the manifold, containing all predictive information. Since the manifold is commonly unknown, this representation has to be estimated using data. For this, manifold learning can be utilized. Having found a representation of the manifold-constrained regressors, this low-dimensional representation can be used in an ordinary regression algorithm to find a prediction of the output. This has been further developed in the Weight Determination by Manifold Regularization (WDMR) approach.

In most regression problems, prior information can improve prediction results. This is also true for high-dimensional regression problems. Research on including physical prior knowledge in high-dimensional regression, i.e., gray-box high-dimensional regression, has however been rather limited. We explore the possibilities of including prior knowledge in high-dimensional manifold-constrained regression by means of regularization. The result will be called gray-box WDMR. In gray-box WDMR we have the possibility to restrict ourselves to predictions which are physically plausible. This is done by incorporating dynamical models for how the regressors evolve on the manifold.
24

Minnier, Jessica. "Inference and Prediction for High Dimensional Data via Penalized Regression and Kernel Machine Methods." Thesis, Harvard University, 2012. http://dissertations.umi.com/gsas.harvard:10327.

Abstract:
Analysis of high-dimensional data often seeks to identify a subset of important features and assess their effects on the outcome. Furthermore, the ultimate goal is often to build a prediction model with these features that accurately assesses risk for future subjects. Such statistical challenges arise in the study of genetic associations with health outcomes. However, accurate inference and prediction with genetic information remain challenging, in part due to the complexity in the genetic architecture of human health and disease. A valuable approach for improving prediction models with a large number of potential predictors is to build a parsimonious model that includes only important variables. Regularized regression methods are useful, though they often pose challenges for inference due to nonstandard limiting distributions or finite-sample distributions that are difficult to approximate. In Chapter 1 we propose and theoretically justify a perturbation-resampling method to derive confidence regions and covariance estimates for marker effects estimated from regularized procedures with a general class of objective functions and concave penalties. Our methods outperform their asymptotic-based counterparts, even when effects are estimated as zero. In Chapters 2 and 3 we focus on genetic risk prediction. The difficulty of accurate risk assessment in genetic studies can in part be attributed to several potential obstacles: sparsity in marker effects, a large number of weak signals, and non-linear effects. Single-marker analyses often lack power to select informative markers and typically do not account for non-linearity. One approach to gain predictive power and efficiency is to group markers based on biological knowledge such as genetic pathways or gene structure. In Chapter 2 we propose and theoretically justify a multi-stage method for risk assessment that imposes a naive Bayes kernel machine (KM) model to estimate gene-set-specific risk models, and then aggregates information across all gene sets by adaptively estimating gene-set weights via a regularization procedure. In Chapter 3 we extend these methods to meta-analyses by introducing sampling-based weights in the KM model. This permits building risk prediction models with multiple studies that have heterogeneous sampling schemes.
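
A generic resampling loop of this flavor (random exponential observation weights, percentile intervals) can be sketched as follows; the specific perturbation scheme and theory of Chapter 1 are not reproduced here.

```python
import numpy as np
from sklearn.linear_model import Lasso

def perturbation_ci(X, y, alpha=0.1, B=500, level=0.95, seed=0):
    """Percentile intervals from refitting the lasso under random
    mean-one exponential observation weights. Generic sketch only."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    draws = np.empty((B, p))
    for b in range(B):
        w = rng.exponential(size=n)               # perturb the objective
        draws[b] = Lasso(alpha=alpha).fit(X, y, sample_weight=w).coef_
    lo, hi = np.quantile(draws, [(1 - level) / 2, (1 + level) / 2], axis=0)
    return lo, hi
```
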
25

Wang, Guoshen. "Analysis of Additive Risk Model with High Dimensional Covariates Using Correlation Principal Component Regression." Digital Archive @ GSU, 2008. http://digitalarchive.gsu.edu/math_theses/51.

Abstract:
One problem of interest is to relate genes to survival outcomes of patients, for the purpose of building regression models to predict future patients' survival based on their gene expression data. Applying the semiparametric additive risk model of survival analysis, this thesis proposes a new approach to the analysis of gene expression data, with a focus on the model's predictive ability. The method modifies correlation principal component regression to handle the censoring problem of survival data. We also employ the time-dependent AUC and RMSEP to assess how well the model predicts survival time. Furthermore, the proposed method is able to identify significant genes which are related to the disease. Finally, the proposed approach is illustrated on a simulated data set, the diffuse large B-cell lymphoma (DLBCL) data set, and a breast cancer data set. The results show that the model fits both of the real data sets very well.
26

Sarac, Ferdi. "Development of unsupervised feature selection methods for high dimensional biomedical data in regression domain." Thesis, Northumbria University, 2017. http://nrl.northumbria.ac.uk/36260/.

Abstract:
In line with technological developments, there is almost no limit to collecting data of high dimension in various fields, including bioinformatics. In most cases, these high-dimensional datasets contain many irrelevant or noisy features which need to be filtered out to find a small but biologically meaningful set of attributes. Although there have been various attempts to select predictive feature sets from high-dimensional data in classification and clustering, there have only been limited attempts to do this for regression problems. Since supervised feature selection methods tend to identify noisy features in addition to discriminative variables, unsupervised feature selection methods (USFSMs) are generally regarded as more unbiased approaches. The aim of this thesis is, therefore, to provide: (i) a comprehensive overview of feature selection methods for regression problems, where feature selection methods are shown along with their types, references, sources, and code repositories; (ii) a taxonomy of feature selection methods for regression problems, to assist researchers in selecting appropriate feature selection methods for their research; (iii) a deep learning based unsupervised feature selection framework, DFSFR; and (iv) a K-means based unsupervised feature selection method, KBFS. To the best of our knowledge, DFSFR is the first deep learning based method to be designed particularly for regression tasks. In addition, a hybrid USFSM, DKBFS, is proposed, which combines KBFS and DFSFR to select discriminative features from very high dimensional data. The proposed frameworks are compared with state-of-the-art USFSMs, including Multi Cluster Feature Selection (MCFS), Embedded Unsupervised Feature Selection (EUFS), Infinite Feature Selection (InFS), Spectral Regression Feature Selection (SPFS), Laplacian Score Feature Selection (LapFS), and Term Variance Feature Selection (TV), along with the entire feature sets as well as the methods used in previous studies. To evaluate the effectiveness of the proposed methods, four different case studies are considered: (i) a low-dimensional RV144 vaccine dataset; (ii) three different high-dimensional peptide binding affinity datasets; (iii) a very high dimensional GSE44763 dataset; and (iv) a very high dimensional GSE40279 dataset. Experimental results from these data sets are used to validate the effectiveness of the proposed methods. Compared to state-of-the-art feature selection methods, the proposed methods achieve improvements in prediction accuracy of as much as 9% for the RV144 vaccine dataset, 75% for the peptide binding affinity datasets, 3% for the GSE44763 dataset, and 55% for the GSE40279 dataset.
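
A k-means-based unsupervised selector can be sketched as below: cluster the features and keep one representative per cluster. The representative rule (feature nearest the centroid) is an assumption for illustration; KBFS's actual construction is defined in the thesis.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def kmeans_feature_selection(X, n_select=50, seed=0):
    """Unsupervised selection: k-means on the transposed (standardized)
    data matrix, keeping the feature closest to each cluster centroid."""
    Z = StandardScaler().fit_transform(X).T          # rows are features
    km = KMeans(n_clusters=n_select, n_init=10, random_state=seed).fit(Z)
    selected = []
    for c in range(n_select):
        members = np.flatnonzero(km.labels_ == c)
        d = np.linalg.norm(Z[members] - km.cluster_centers_[c], axis=1)
        selected.append(members[np.argmin(d)])       # nearest-to-centroid feature
    return np.sort(selected)
```
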
27

Gu, Chao. "Advancing Bechhofer's Ranking Procedures to High-dimensional Variable Selection." Bowling Green State University / OhioLINK, 2021. http://rave.ohiolink.edu/etdc/view?acc_num=bgsu1626653022254095.

28

Breheny, Patrick John. "Regularized methods for high-dimensional and bi-level variable selection." Diss., University of Iowa, 2009. https://ir.uiowa.edu/etd/325.

Abstract:
Many traditional approaches cease to be useful when the number of variables is large in comparison with the sample size. Penalized regression methods have proved to be an attractive approach, both theoretically and empirically, for dealing with these problems. This thesis focuses on the development of penalized regression methods for high-dimensional variable selection. The first part of this thesis deals with problems in which the covariates possess a grouping structure that can be incorporated into the analysis to select important groups as well as important members of those groups. I introduce a framework for grouped penalization that encompasses the previously proposed group lasso and group bridge methods, sheds light on the behavior of grouped penalties, and motivates the proposal of a new method, group MCP. The second part of this thesis develops fast algorithms for fitting models with complicated penalty functions such as grouped penalization methods. These algorithms combine the idea of local approximation of penalty functions with recent research into coordinate descent algorithms to produce highly efficient numerical methods for fitting models with complicated penalties. Importantly, I show these algorithms to be both stable and linear in the dimension of the feature space, allowing them to be efficiently scaled up to very large problems. In the third part of this thesis, I extend the idea of false discovery rates to penalized regression. The Karush-Kuhn-Tucker conditions describing penalized regression estimates provide testable hypotheses involving partial residuals. I use these hypotheses to connect the previously disparate fields of multiple comparisons and penalized regression, develop estimators for the false discovery rates of methods such as the lasso and elastic net, and establish theoretical results. Finally, the methods from all three sections are studied in a number of simulations and applied to real data from gene expression and genetic association studies.
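
As a concrete anchor for the penalties discussed here, the MCP and its univariate ("firm") thresholding solution, the scalar building block of such coordinate descent algorithms (group MCP applies the same penalty at the group level):

```python
import numpy as np

def mcp_penalty(beta, lam, gamma=3.0):
    """Minimax concave penalty (MCP) of Zhang (2010)."""
    b = np.abs(beta)
    return np.where(b <= gamma * lam,
                    lam * b - b**2 / (2 * gamma),
                    0.5 * gamma * lam**2)

def mcp_threshold(z, lam, gamma=3.0):
    """Minimizer of 0.5*(z - b)^2 + MCP(b): firm thresholding (gamma > 1)."""
    soft = np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)
    return np.where(np.abs(z) <= gamma * lam,
                    soft / (1 - 1 / gamma),   # inflated soft-threshold
                    z)                        # no shrinkage for large |z|
```
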
29

Ren, Sheng. "New Methods of Variable Selection and Inference on High Dimensional Data." University of Cincinnati / OhioLINK, 2017. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1511883302569683.

30

Zuber, Verena. "A Multivariate Framework for Variable Selection and Identification of Biomarkers in High-Dimensional Omics Data." Doctoral thesis, Universitätsbibliothek Leipzig, 2012. http://nbn-resolving.de/urn:nbn:de:bsz:15-qucosa-101223.

Abstract:
In this thesis, we address the identification of biomarkers in high-dimensional omics data. The identification of valid biomarkers is especially relevant for personalized medicine, which depends on accurate prediction rules. Moreover, biomarkers elucidate the provenance of disease, or molecular changes related to disease. From a statistical point of view, the identification of biomarkers is best cast as variable selection. In particular, we refer to variables as the molecular attributes under investigation, e.g. genes, genetic variation, or metabolites; and we refer to observations as the specific samples whose attributes we investigate, e.g. patients and controls. Variable selection in high-dimensional omics data is a complicated challenge due to the characteristic structure of omics data. For one, omics data is high-dimensional, comprising cellular information in unprecedented detail. Moreover, there is an intricate correlation structure among the variables due to, e.g., internal cellular regulation or external, latent factors. Variable selection for uncorrelated data is well established. In contrast, there is no consensus on how to approach variable selection under correlation. Here, we introduce a multivariate framework for variable selection that explicitly accounts for the correlation among markers. In particular, we present two novel quantities for variable importance: the correlation-adjusted t (CAT) score for classification, and the correlation-adjusted (marginal) correlation (CAR) score for regression. The CAT score is defined as the Mahalanobis-decorrelated t-score vector, and the CAR score as the Mahalanobis-decorrelated correlation between the predictor variables and the outcome. We derive the CAT and CAR score from a predictive point of view in linear discriminant analysis and regression; both quantities assess the weight of a decorrelated and standardized variable on the prediction rule. Furthermore, we discuss properties of both scores and relations to established quantities. Above all, the CAT score decomposes Hotelling's T² and the CAR score the proportion of variance explained. Notably, the decomposition of total variance into explained and unexplained variance in the linear model can be rewritten in terms of CAR scores. To render our approach applicable to high-dimensional omics data, we devise an efficient algorithm for shrinkage estimates of the CAT and CAR score. Subsequently, we conduct extensive simulation studies to investigate the performance of our novel approaches in ranking and prediction under correlation. Here, CAT and CAR scores consistently improve over marginal approaches in terms of more true positives selected and a lower model error. Finally, we illustrate the application of the CAT and CAR score on real omics data. In particular, we analyze genomics, transcriptomics, and metabolomics data. We ascertain that the CAT and CAR score are competitive with or outperform state-of-the-art techniques in terms of true positives detected and prediction error.
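
Following the definition quoted above, the CAR score vector is R^{-1/2} times the marginal correlations. Below is a plain, unshrunk NumPy sketch; when p is large relative to n, the thesis's shrinkage estimates are needed, and the eigenvalue clipping here is only a crude guard.

```python
import numpy as np

def car_scores(X, y, eps=1e-8):
    """CAR scores: Mahalanobis-decorrelated marginal correlations,
    R^{-1/2} @ cor(X, y), computed from sample correlations."""
    Xs = (X - X.mean(0)) / X.std(0)
    ys = (y - y.mean()) / y.std()
    n = X.shape[0]
    R = Xs.T @ Xs / n                 # sample correlation matrix of X
    rxy = Xs.T @ ys / n               # marginal correlations with y
    vals, vecs = np.linalg.eigh(R)
    vals = np.clip(vals, eps, None)   # crude guard; not a substitute for shrinkage
    return vecs @ ((vecs.T @ rxy) / np.sqrt(vals))
```
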
31

Miller, Ryan. "Marginal false discovery rate approaches to inference on penalized regression models." Diss., University of Iowa, 2018. https://ir.uiowa.edu/etd/6474.

Abstract:
Datasets containing a large number of variables are becoming increasingly common, and sparsity-inducing penalized regression methods, such as the lasso, have become a popular analysis tool for these datasets due to their ability to perform variable selection naturally. However, quantifying the importance of the variables selected by these models is a difficult task. These difficulties are compounded by the tendency of the most predictive models, for example those chosen using procedures like cross-validation, to include substantial numbers of noise variables with no real relationship to the outcome. To address the task of performing inference on penalized regression models, this thesis proposes false discovery rate approaches for a broad class of penalized regression models. This work includes the development of an upper bound for the number of noise variables in a model, as well as local false discovery rate approaches that quantify the likelihood of each individual selection being a false discovery. These methods are applicable to a wide range of penalties, such as the lasso, elastic net, SCAD, and MCP; a wide range of models, including linear regression, generalized linear models, and Cox proportional hazards models; and are also extended to the group regression setting under the group lasso penalty. In addition to studying these methods in numerous simulation studies, their practical utility is demonstrated using real data from several high-dimensional genome-wide association studies.
32

Mahmood, Nozad. "Sparse Ridge Fusion For Linear Regression." Master's thesis, University of Central Florida, 2013. http://digital.library.ucf.edu/cdm/ref/collection/ETD/id/5986.

Abstract:
For a linear regression, the traditional technique deals with the case where the number of observations n exceeds the number of predictor variables p (n > p).
33

Liu, Li. "Grouped variable selection in high dimensional partially linear additive Cox model." Diss., University of Iowa, 2010. https://ir.uiowa.edu/etd/847.

Abstract:
In the analysis of survival outcomes supplemented with both clinical information and high-dimensional gene expression data, the traditional Cox proportional hazards model fails to meet some emerging needs in biological research. First, the number of covariates is generally much larger than the sample size. Secondly, predicting an outcome with individual gene expressions is inadequate, because a gene's expression is regulated by multiple biological processes and functional units. There is a need to understand the impact of changes at a higher level, such as molecular function, cellular component, biological process, or pathway. The change at a higher level is usually measured with a set of gene expressions related to the biological process. That is, we need to model the outcome with gene sets as variable groups, where the gene sets may also be partially overlapping. In this thesis work, we investigate the impact of a penalized Cox regression procedure on regularization, parameter estimation, variable group selection, and nonparametric modeling of nonlinear effects with a time-to-event outcome. We formulate the problem as a partially linear additive Cox model with high-dimensional data. We group genes into gene sets and approximate the nonparametric components by truncated series expansions with B-spline bases. After grouping and approximation, the problem of variable selection becomes that of selecting groups of coefficients in a gene set or in an approximation. We apply the group Lasso to obtain an initial solution path and reduce the dimension of the problem, and then update the whole solution path with the adaptive group Lasso. We also propose a generalized group Lasso method to provide more freedom in specifying the penalty and excluding covariates from being penalized. A modified Newton-Raphson method is designed for stable and rapid computation. The core programs are written in the C language. A user-friendly R interface is implemented to perform all the calculations by calling the core programs. We demonstrate the asymptotic properties of the proposed methods. Simulation studies are carried out to evaluate the finite sample performance of the proposed procedure using several tuning parameter selection methods for choosing the point on the solution path as the final estimator. We also apply the proposed approach to two real data examples.
34

Seetharaman, Indu. "Consistent bi-level variable selection via composite group bridge penalized regression." Kansas State University, 2013. http://hdl.handle.net/2097/15980.

Abstract:
We study composite group bridge penalized regression methods for conducting bi-level variable selection in high-dimensional linear regression models with a diverging number of predictors. The proposed method combines the ideas of bridge regression (Huang et al., 2008a) and group bridge regression (Huang et al., 2009) to achieve variable selection consistency at both the individual and group levels simultaneously, i.e., the important groups and the important individual variables within each group can both be correctly identified with probability approaching one as the sample size increases to infinity. The method takes full advantage of the prior grouping information, and the established bi-level oracle properties ensure that the method is immune to possible group misidentification. A related adaptive group bridge estimator, which uses adaptive penalization for improving bi-level selection, is also investigated. Simulation studies show that the proposed methods have superior performance in comparison to many existing methods.
35

Massias, Mathurin. "Sparse high dimensional regression in the presence of colored heteroscedastic noise : application to M/EEG source imaging." Electronic Thesis or Diss., Université Paris-Saclay (ComUE), 2019. http://www.theses.fr/2019SACLT053.

Abstract:
Understanding the functioning of the brain under normal and pathological conditions is one of the challenges of the 21textsuperscript{st} century.In the last decades, neuroimaging has radically affected clinical and cognitive neurosciences.Amongst neuroimaging techniques, magneto- and electroencephalography (M/EEG) stand out for two reasons: their non-invasiveness, and their excellent time resolution.Reconstructing the neural activity from the recordings of magnetic field and electric potentials is the so-called bio-magnetic inverse problem.Because of the limited number of sensors, this inverse problem is severely ill-posed, and additional constraints must be imposed in order to solve it.A popular approach, considered in this manuscript, is to assume spatial sparsity of the solution: only a few brain regions are involved in a short and specific cognitive task.Solutions exhibiting such a neurophysiologically plausible sparsity pattern can be obtained through L21-penalized regression approaches.However, this regularization requires to solve time-consuming high-dimensional and non-smooth optimization problems, with iterative (block) proximal gradients solvers.% Issues of M/EEG: noise:Additionally, M/EEG recordings are usually corrupted by strong non-white noise, which breaks the classical statistical assumptions of inverse problems. To circumvent this, it is customary to whiten the data as a preprocessing step,and to average multiple repetitions of the same experiment to increase the signal-to-noise ratio.Averaging measurements has the drawback of removing brain responses which are not phase-locked, ie do not happen at a fixed latency after the stimuli presentation onset.%Making it faster.In this work, we first propose speed improvements of iterative solvers used for the L21-regularized bio-magnetic inverse problem.Typical improvements, screening and working sets, exploit the sparsity of the solution: by identifying inactive brain sources, they reduce the dimensionality of the optimization problem.We introduce a new working set policy, derived from the state-of-the-art Gap safe screening rules.In this framework, we also propose duality improvements, yielding a tighter control of optimality and improving feature identification techniques.This dual construction extrapolates on an asymptotic Vector AutoRegressive regularity of the dual iterates, which we connect to manifold identification of proximal algorithms.Beyond the L21-regularized bio-magnetic inverse problem, the proposed methods apply to the whole class of sparse Generalized Linear Models.%Better handling of the noiseSecond, we introduce new concomitant estimators for multitask regression.Along with the neural sources estimation, concomitant estimators jointly estimate the noise covariance matrix.We design them to handle non-white Gaussian noise, and to exploit the multiple repetitions nature of M/EEG experiments.Instead of averaging the observations, our proposed method, CLaR, uses them all for a better estimation of the noise.The underlying optimization problem is jointly convex in the regression coefficients and the noise variable, with a ``smooth + proximable'' composite structure.It is therefore solvable via standard alternate minimization, for which we apply the improvements detailed in the first part.We provide a theoretical analysis of our objective function, linking it to the smoothing of Schatten norms.We demonstrate the benefits of the proposed approach for source localization on real M/EEG datasets.Our improved solvers and 
refined modeling of the noise pave the way for a faster and more statistically efficient processing of M/EEG recordings, allowing for interactive data analysis and scaling approaches to larger and larger M/EEG datasets
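The block proximal gradient iteration mentioned above is driven by the row-wise soft-thresholding operator of the L21 norm. A bare-bones Python/numpy sketch of such a solver (plain ISTA, without the thesis's screening rules, working sets, or dual extrapolation) might look as follows; shapes and the fixed step size are illustrative assumptions.

import numpy as np

def block_soft_threshold(B, tau):
    """Row-wise prox of the L2,1 norm: shrinks each row toward zero."""
    norms = np.linalg.norm(B, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - tau / np.maximum(norms, 1e-12))
    return scale * B

def multitask_lasso_ista(X, Y, lam, n_iter=500):
    """Proximal gradient for min_B ||Y - XB||_F^2 / (2n) + lam * sum_j ||B_j||_2."""
    n, p = X.shape
    L = np.linalg.norm(X, ord=2) ** 2 / n  # Lipschitz constant of the smooth part
    B = np.zeros((p, Y.shape[1]))
    for _ in range(n_iter):
        grad = X.T @ (X @ B - Y) / n
        B = block_soft_threshold(B - grad / L, lam / L)
    return B

Rows of B that the prox sets exactly to zero correspond to inactive sources; screening and working-set strategies accelerate this same iteration by restricting it to the rows likely to stay active.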
APA, Harvard, Vancouver, ISO, and other styles
36

Margevicius, Seunghee P. "Modeling of High-Dimensional Clinical Longitudinal Oxygenation Data from Retinopathy of Prematurity." Case Western Reserve University School of Graduate Studies / OhioLINK, 2018. http://rave.ohiolink.edu/etdc/view?acc_num=case1523022165691473.

Full text
APA, Harvard, Vancouver, ISO, and other styles
37

Liley, Albert James. "Statistical co-analysis of high-dimensional association studies." Thesis, University of Cambridge, 2017. https://www.repository.cam.ac.uk/handle/1810/270628.

Full text
Abstract:
Modern medical practice and science involve complex phenotypic definitions. Understanding patterns of association across this range of phenotypes requires co-analysis of high-dimensional association studies in order to characterise shared and distinct elements. In this thesis I address several problems in this area, with a general linking aim of making more efficient use of available data. The main application of these methods is in the analysis of genome-wide association studies (GWAS) and similar studies. Firstly, I developed methodology for a Bayesian conditional false discovery rate (cFDR) for leveraging GWAS results using summary statistics from a related disease. I extended an existing method to enable a shared-control design, increasing power and applicability, and developed an approximate bound on the false discovery rate (FDR) for the procedure. Using the new method I identified several new variant-disease associations. I then developed a second application of the shared-control design in the context of study replication, enabling improvement in power at the cost of changing the spectrum of sensitivity to systematic errors in study cohorts. This has application in studies on rare diseases or in between-case analyses. I then developed a method for partially characterising heterogeneity within a disease by modelling the bivariate distribution of case-control and within-case effect sizes. Using an adaptation of a likelihood-ratio test, this allows an assessment to be made of whether disease heterogeneity corresponds to differences in disease pathology. I applied this method to a range of simulated and real datasets, enabling insight into the cause of heterogeneity in autoantibody positivity in type 1 diabetes (T1D). Finally, I investigated the relation of subtypes of juvenile idiopathic arthritis (JIA) to adult diseases, using modified genetic risk scores and linear discriminants in a penalised regression framework. The contribution of this thesis is a range of methodological developments in the comparative analysis of high-dimensional association studies. Methods such as these will have wide application in the analysis of GWAS and similar areas, particularly in the development of stratified medicine.
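For intuition, the empirical flavour of a conditional FDR for a pair of p-value vectors (one per trait) can be sketched in a few lines of Python. The estimator form below is the standard empirical one (the p-value divided by the empirical conditional CDF) and is only a hedged stand-in for the Bayesian cFDR developed in the thesis; the quadratic loop is written for clarity, not speed.

import numpy as np

def empirical_cfdr(p, q):
    """Empirical conditional FDR estimate for each variant.

    cFDR_i ~= p_i * #{j : q_j <= q_i} / #{j : p_j <= p_i and q_j <= q_i},
    i.e. p_i divided by the empirical Pr(P <= p_i | Q <= q_i).
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    out = np.empty_like(p)
    for i in range(len(p)):
        denom = np.sum((p <= p[i]) & (q <= q[i]))  # joint tail count
        numer = np.sum(q <= q[i])                  # conditioning tail count
        out[i] = min(1.0, p[i] * numer / max(denom, 1))
    return out

Variants with small cFDR are those whose association signal is strengthened by conditioning on the related disease.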
APA, Harvard, Vancouver, ISO, and other styles
38

Swinson, Michael D. "Statistical Modeling of High-Dimensional Nonlinear Systems: A Projection Pursuit Solution." Diss., Available online, Georgia Institute of Technology, 2005. http://etd.gatech.edu/theses/available/etd-11232005-204333/.

Full text
Abstract:
Thesis (Ph. D.)--Mechanical Engineering, Georgia Institute of Technology, 2006.
Shapiro, Alexander, Committee Member ; Vidakovic, Brani, Committee Member ; Ume, Charles, Committee Member ; Sadegh, Nader, Committee Chair ; Liang, Steven, Committee Member. Vita.
APA, Harvard, Vancouver, ISO, and other styles
39

Jonen, Christian [Verfasser], Rüdiger [Akademischer Betreuer] Seydel, and Caren [Akademischer Betreuer] Tischendorf. "Efficient Pricing of High-Dimensional American-Style Derivatives : A Robust Regression Monte Carlo Method / Christian Jonen. Gutachter: Rüdiger Seydel ; Caren Tischendorf." Köln : Universitäts- und Stadtbibliothek Köln, 2011. http://d-nb.info/103811179X/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
40

Hermann, Philipp [Verfasser], and Hajo [Akademischer Betreuer] Holzmann. "High-dimensional, robust, heteroscedastic variable selection with the adaptive LASSO, and applications to random coefficient regression / Philipp Hermann ; Betreuer: Hajo Holzmann." Marburg : Philipps-Universität Marburg, 2021. http://d-nb.info/1236692187/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
41

Jiang, Wei. "Statistical inference with incomplete and high-dimensional data - modeling polytraumatized patients." Thesis, université Paris-Saclay, 2020. http://www.theses.fr/2020UPASM013.

Full text
Abstract:
The problem of missing data has existed since the beginning of data analysis, as missing values are related to the process of obtaining and preparing data. In applications of modern statistics and machine learning, where the collection of data is becoming increasingly complex and where multiple sources of information are combined, large databases often have an extraordinarily high number of missing values. These data therefore present important methodological and technical challenges for analysis: from visualization to modeling, including estimation, variable selection, predictive capabilities, and software implementation. Moreover, although high-dimensional data with missing values are considered a common difficulty in statistical analysis today, only a few solutions are available. The objective of this thesis is to provide new methodologies for performing statistical inference with missing data, in particular for high-dimensional data. The most important contribution is to provide a comprehensive framework for dealing with missing values, from estimation to model selection, based on likelihood approaches. The proposed methodology does not rely on a specific pattern of missingness and allows a good balance between quality of inference and computational efficiency. The contribution of the thesis consists of three parts. In Chapter 2, we focus on performing logistic regression with missing values in a joint modeling framework, using a stochastic approximation of the EM algorithm. We discuss parameter estimation, variable selection, and prediction for incomplete new observations. Through extensive simulations, we show that the estimators are unbiased and have good confidence interval coverage properties, which outperforms the popular imputation-based approach. The method is then applied to pre-hospital data to predict the risk of hemorrhagic shock, in collaboration with medical partners - the Traumabase group of Paris hospitals. Indeed, the proposed model improves the prediction of bleeding risk compared to the prediction made by physicians. In Chapters 3 and 4, we focus on model selection issues for high-dimensional incomplete data, aimed in particular at controlling false discoveries. For linear models, the adaptive Bayesian version of SLOPE (ABSLOPE) we propose in Chapter 3 addresses these issues by embedding the sorted l1 regularization within a Bayesian spike-and-slab framework. Alternatively, in Chapter 4, aiming at more general models beyond linear regression, we consider these questions in a model-X framework, where the conditional distribution of the response as a function of the covariates is not specified. To do so, we combine the knockoff methodology with multiple imputation. Through extensive simulations, we demonstrate satisfactory performance in terms of power, FDR, and estimation bias for a wide range of scenarios. In an application to the medical data set, we build a model to predict patient platelet levels from pre-hospital and hospital data. Finally, we provide two open-source software packages with tutorials, in order to help decision making in the medical field and users facing missing values.
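As a point of reference for the popular imputation-based approach that the thesis compares against, here is a minimal Python sketch of multiple imputation followed by l1-penalized logistic selection; the imputer settings, penalty strength, and selection-frequency summary are illustrative assumptions, and this is not the SAEM or ABSLOPE methodology of the thesis.

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LogisticRegression

def mi_lasso_selection(X_missing, y, n_imputations=5, C=0.5):
    """Multiple imputation + l1 logistic regression baseline.

    Returns, for each feature, the fraction of imputed data sets in
    which the lasso kept a nonzero coefficient.
    """
    p = X_missing.shape[1]
    counts = np.zeros(p)
    for seed in range(n_imputations):
        imputer = IterativeImputer(sample_posterior=True, random_state=seed)
        X_imp = imputer.fit_transform(X_missing)   # one draw of completed data
        clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
        clf.fit(X_imp, y)
        counts += (clf.coef_.ravel() != 0)
    return counts / n_imputations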
APA, Harvard, Vancouver, ISO, and other styles
42

Klau, Simon [Verfasser], and Anne-Laure [Akademischer Betreuer] Boulesteix. "Addressing the challenges of uncertainty in regression models for high dimensional and heterogeneous data from observational studies / Simon Klau ; Betreuer: Anne-Laure Boulesteix." München : Universitätsbibliothek der Ludwig-Maximilians-Universität, 2020. http://d-nb.info/1220631884/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
43

Huynh, Bao Tuyen. "Estimation and feature selection in high-dimensional mixtures-of-experts models." Thesis, Normandie, 2019. http://www.theses.fr/2019NORMC237.

Full text
Abstract:
This thesis deals with the problem of modeling and estimating high-dimensional mixtures-of-experts (MoE) models, towards effective density estimation, prediction, and clustering of such heterogeneous and high-dimensional data. We propose new strategies based on regularized maximum-likelihood estimation (MLE) of MoE models to overcome the limitations of standard methods, including MLE with Expectation-Maximization (EM) algorithms, and to simultaneously perform feature selection, so that sparse models are encouraged in such a high-dimensional setting. We first introduce a parameter estimation and variable selection methodology for mixtures of experts, based on l1 (lasso) regularization and the EM framework, for regression and clustering suited to high-dimensional contexts. Then, we extend the method to regularized mixtures of experts for discrete data, including classification. We develop efficient algorithms to maximize the proposed l1-penalized observed-data log-likelihood function. Our proposed strategies enjoy efficient monotone maximization of the optimized criterion and, unlike previous approaches, do not rely on approximations of the penalty functions, avoid matrix inversion, and exploit the efficiency of the coordinate ascent algorithm, particularly within the proximal Newton-based approach.
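To fix ideas, here is a stripped-down Python EM for a mixture of linear regressions with constant gates, a deliberate simplification of the softmax-gated, l1-penalized MoE models studied in the thesis; the M-step below uses unpenalized weighted least squares where the thesis would use a penalized update.

import numpy as np
from scipy.stats import norm

def em_mixture_of_regressions(X, y, K=2, n_iter=100, seed=0):
    """EM for a K-component mixture of linear regressions (constant gates)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    betas = rng.normal(size=(K, p))
    sigmas = np.ones(K)
    pis = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: responsibilities of each component for each observation
        dens = np.stack([pis[k] * norm.pdf(y, X @ betas[k], sigmas[k])
                         for k in range(K)])
        resp = dens / np.maximum(dens.sum(axis=0, keepdims=True), 1e-300)
        # M-step: weighted least squares per expert (thesis: l1-penalized)
        for k in range(K):
            w = resp[k]
            sw = np.sqrt(w)
            betas[k] = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
            r = y - X @ betas[k]
            sigmas[k] = np.sqrt((w * r ** 2).sum() / w.sum())
        pis = resp.mean(axis=1)
    return betas, sigmas, pis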
APA, Harvard, Vancouver, ISO, and other styles
44

Escalante, Bañuelos Alberto Nicolás [Verfasser], Laurenz [Gutachter] Wiskott, and Rolf [Gutachter] Würtz. "Extensions of hierarchical slow feature analysis for efficient classsification and regression on high-dimensional data / Alberto Nicolás Escalante Bañuelos ; Gutachter: Laurenz Wiskott, Rolf Würtz." Bochum : Ruhr-Universität Bochum, 2017. http://d-nb.info/1140223186/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
45

Lannsjö, Fredrik. "Forecasting the Business Cycle using Partial Least Squares." Thesis, KTH, Matematisk statistik, 2014. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-151378.

Full text
Abstract:
Partial Least Squares is both a regression method and a tool for variable selection that is especially appropriate for models based on numerous (possibly correlated) variables. While PLS is a well-established modeling tool in chemometrics, this thesis adapts it to financial data to predict the movements of the business cycle, represented by the OECD Composite Leading Indicators. High-dimensional data are used, and a model with automated variable selection through a genetic algorithm is developed to forecast different economic regions, with good results in out-of-sample tests.
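A minimal Python sketch of the core PLS regression step (without the thesis's genetic-algorithm variable selection) using scikit-learn is shown below; the data shapes, split, and toy signal are illustrative assumptions.

import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 120))                 # many, possibly correlated, indicators
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=300)

train, test = slice(0, 250), slice(250, 300)    # respect time ordering when splitting
pls = PLSRegression(n_components=3).fit(X[train], y[train])
y_hat = pls.predict(X[test]).ravel()
print(np.corrcoef(y[test], y_hat)[0, 1])        # out-of-sample fit

The key design choice is the number of latent components: a handful of PLS components summarise the high-dimensional predictor block in the directions most covarying with the target.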
APA, Harvard, Vancouver, ISO, and other styles
46

Kim, Byung-Jun. "Semiparametric and Nonparametric Methods for Complex Data." Diss., Virginia Tech, 2020. http://hdl.handle.net/10919/99155.

Full text
Abstract:
A variety of complex data has emerged in many research fields such as epidemiology, genomics, and analytical chemistry with the development of science, technology, and design schemes over the past few decades. For example, in epidemiology, the matched case-crossover study design is used to investigate the association between clustered binary disease outcomes and a covariate measured with error within a certain period, by stratifying on subjects' conditions. In genomics, highly correlated and high-dimensional (HCHD) data are required to identify important genes and their interaction effects on diseases. In analytical chemistry, multiple time series data are generated to recognize the complex patterns among multiple classes. Due to this great diversity, we encounter three problems in analyzing those complex data in this dissertation. We have then provided several contributions to semiparametric and nonparametric methods for dealing with the following problems: the first is to propose a method for testing the significance of a functional association under the matched study; the second is to develop a method to simultaneously identify important variables and build a network in HCHD data; the third is to propose a multi-class dynamic model for recognizing a pattern in the time-trend analysis. For the first topic, we propose a semiparametric omnibus test for testing the significance of a functional association between clustered binary outcomes and covariates with measurement error, taking into account the effect modification of matching covariates. We develop a flexible omnibus test that does not require a specific alternative form of the hypothesis. The advantages of our omnibus test are demonstrated through simulation studies and 1-4 bidirectional matched data analyses from an epidemiology study. For the second topic, we propose a joint semiparametric kernel machine network approach to provide a connection between variable selection and network estimation. Our approach is a unified and integrated method that can simultaneously identify important variables and build a network among them. We develop our approach under a semiparametric kernel machine regression framework, which allows for the possibility that each variable might be nonlinear and is likely to interact with the others in a complicated way. We demonstrate our approach using simulation studies and a real application to genetic pathway analysis. Lastly, for the third project, we propose a Bayesian focal-area detection method for a multi-class dynamic model under a Bayesian hierarchical framework. Two-step Bayesian sequential procedures are developed to estimate patterns and detect focal intervals, which can be used for gas chromatography. We demonstrate the performance of our proposed method using a simulation study and a real application to gas chromatography on the Fast Odor Chromatographic Sniffer (FOX) system.
Doctor of Philosophy
A variety of complex data has emerged in many research fields such as epidemiology, genomics, and analytical chemistry with the development of science, technology, and design schemes over the past few decades. For example, in epidemiology, the matched case-crossover study design is used to investigate the association between clustered binary disease outcomes and a covariate measured with error within a certain period, by stratifying on subjects' conditions. In genomics, highly correlated and high-dimensional (HCHD) data are required to identify important genes and their interaction effects on diseases. In analytical chemistry, multiple time series data are generated to recognize the complex patterns among multiple classes. Due to this great diversity, we encounter three problems in analyzing the following three types of data: (1) matched case-crossover data, (2) HCHD data, and (3) time-series data. We contribute to the development of statistical methods to deal with such complex data. First, under the matched study, we discuss an idea about hypothesis testing to effectively determine the association between observed factors and the risk of the disease of interest. Because, in practice, we do not know the specific form of the association, it can be challenging to set a specific alternative hypothesis. Reflecting this reality, we allow for the possibility that some observations are measured with error. By considering these measurement errors, we develop a testing procedure under the matched case-crossover framework. This testing procedure has the flexibility to make inferences under various hypothesis settings. Second, we consider data where the number of variables is very large compared to the sample size, and the variables are correlated with each other. In this case, our goal is to identify important variables for the outcome among a large number of variables and to build their network. For example, identifying a few genes associated with diabetes among the whole genome can be used to develop biomarkers. With our proposed approach in the second project, we can identify differentially expressed and important genes and their network structure while taking the outcome into account. Lastly, we consider the scenario of changing patterns of interest over time, with application to gas chromatography. We propose an efficient detection method to effectively distinguish the patterns of multi-level subjects in time-trend analysis. We suggest that our proposed method can give valuable information for an efficient search for distinguishable patterns, so as to reduce the burden of examining all observations in the data.
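The kernel machine regression at the core of the second topic can be illustrated with a plain RBF kernel ridge fit in Python; this generic sketch is a hedged stand-in for the joint semiparametric kernel machine network approach, and it includes neither the variable selection nor the network estimation of the proposed method. The toy data and kernel parameters are assumptions.

import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 30))                   # e.g. genes in one pathway
y = np.sin(X[:, 0]) * X[:, 1] + rng.normal(scale=0.3, size=200)  # nonlinear + interaction

# RBF kernel machine: captures nonlinear effects and gene-gene interactions
model = KernelRidge(kernel="rbf", alpha=1.0, gamma=0.05).fit(X, y)
print(model.predict(X[:5]))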
APA, Harvard, Vancouver, ISO, and other styles
47

Gentry, Amanda E. "Penalized mixed-effects ordinal response models for high-dimensional genomic data in twins and families." VCU Scholars Compass, 2018. https://scholarscompass.vcu.edu/etd/5575.

Full text
Abstract:
The Brisbane Longitudinal Twin Study (BLTS) was conducted in Australia and funded by the US National Institute on Drug Abuse (NIDA). Adolescent twins were sampled as part of this study and surveyed about their substance use as part of the Pathways to Cannabis Use, Abuse and Dependence project. The methods developed in this dissertation were designed for the purpose of analyzing a subset of the Pathways data that includes demographics, cannabis use metrics, personality measures, and imputed genotypes (SNPs) for 493 complete twin pairs (986 subjects). The primary goal was to determine what combination of SNPs and additional covariates may predict cannabis use, measured on an ordinal scale as: "never tried," "used moderately," or "used frequently". To conduct this analysis, we extended the ordinal Generalized Monotone Incremental Forward Stagewise (GMIFS) method to mixed models. This extension allows an unpenalized set of covariates to be coerced into the model, as well as flexibility for user-specified correlation patterns between twins in a family. The proposed methods are applicable to high-dimensional (genomic or otherwise) data with an ordinal response and a specific, known covariance structure within clusters.
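The incremental logic behind GMIFS is easiest to see on a linear model with squared-error loss: repeatedly take a tiny step on the coordinate with the steepest gradient. The Python sketch below illustrates only this generic forward stagewise scheme; the dissertation's version handles ordinal logistic likelihoods, an unpenalized covariate block, and family correlation structures, none of which appear here.

import numpy as np

def forward_stagewise(X, y, eps=0.01, n_steps=2000):
    """Incremental forward stagewise regression (squared-error loss).

    Assumes standardized columns of X and a centered response y.
    GMIFS applies the same tiny-step idea to other likelihoods.
    """
    n, p = X.shape
    beta = np.zeros(p)
    r = y.astype(float).copy()
    for _ in range(n_steps):
        corr = X.T @ r                      # gradient direction per coordinate
        j = np.argmax(np.abs(corr))
        step = eps * np.sign(corr[j])
        beta[j] += step
        r -= step * X[:, j]                 # update residual incrementally
        if np.abs(corr).max() < 1e-8:
            break
    return beta

Stopping early (fewer steps) plays the role of the penalty: coefficients of unhelpful predictors simply never move away from zero.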
APA, Harvard, Vancouver, ISO, and other styles
48

Sjödin, Hällstrand Andreas. "Bilinear Gaussian Radial Basis Function Networks for classification of repeated measurements." Thesis, Linköpings universitet, Matematisk statistik, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-170850.

Full text
Abstract:
The Growth Curve Model is a bilinear statistical model which can be used to analyse several groups of repeated measurements. Normally the Growth Curve Model is defined in such a way that the permitted sampling frequency of the repeated measurements is limited by the number of observed individuals in the data set. In this thesis, we examine the possibilities of utilizing highly frequently sampled measurements to increase classification accuracy for real-world data. That is, we look at the case where the regular Growth Curve Model is not defined, due to the relationship between the sampling frequency and the number of observed individuals. When working with this high-frequency data, we develop a new method of basis selection for the regression analysis, which yields what we call a Bilinear Gaussian Radial Basis Function Network (BGRBFN); we then compare it to more conventional polynomial and trigonometric functional bases. Finally, we examine whether Tikhonov regularization can be used to further increase the classification accuracy in the high-frequency data case. Our findings suggest that the BGRBFN performs better than the conventional methods in both classification accuracy and functional approximability. The results also suggest that both high-frequency data and, furthermore, Tikhonov regularization can be used to increase classification accuracy.
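The basic construction, a Gaussian RBF design matrix combined with Tikhonov-regularized least squares, can be sketched in Python as follows; the basis count, width, and toy curve are illustrative assumptions, and the bilinear, multi-group structure of the full model is omitted.

import numpy as np

def gaussian_rbf_design(t, centers, width):
    """Design matrix of Gaussian radial basis functions evaluated at times t."""
    return np.exp(-((t[:, None] - centers[None, :]) ** 2) / (2.0 * width ** 2))

def ridge_rbf_fit(t, y, n_basis=10, width=0.1, lam=1e-3):
    """Tikhonov-regularized least squares on a Gaussian RBF basis."""
    centers = np.linspace(t.min(), t.max(), n_basis)
    Phi = gaussian_rbf_design(t, centers, width)
    # normal equations with Tikhonov term lam * I
    w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(n_basis), Phi.T @ y)
    return centers, w

# toy repeated-measurement curve sampled at high frequency
t = np.linspace(0, 1, 500)
y = np.sin(2 * np.pi * t) + np.random.default_rng(2).normal(scale=0.1, size=500)
centers, w = ridge_rbf_fit(t, y)

Because the RBF basis is fixed, the fit stays well-posed even when the number of time points far exceeds the number of individuals, which is exactly the regime the thesis targets.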
APA, Harvard, Vancouver, ISO, and other styles
49

Courtois, Émeline. "Score de propension en grande dimension et régression pénalisée pour la détection automatisée de signaux en pharmacovigilance Propensity Score-Based Approaches in High Dimension for Pharmacovigilance Signal Detection: an Empirical Comparison on the French Spontaneous Reporting Database New adaptive lasso approaches for variable selection in automated pharmacovigilance signal detection." Thesis, université Paris-Saclay, 2020. http://www.theses.fr/2020UPASR009.

Full text
Abstract:
Post-marketing pharmacovigilance aims to detect adverse effects of marketed drugs as early as possible. It relies on large databases of individual case safety reports, that is, cases of adverse events suspected to be drug-induced, reported by health professionals. Several automated signal detection tools have been developed to mine these large amounts of data in order to highlight suspicious adverse event-drug combinations. Classical signal detection methods are based on disproportionality analyses of counts aggregating patients' reports. Recently, multiple regression-based methods have been proposed to account for multiple drug exposures. In Chapter 2, we propose a signal detection method based on the high-dimensional propensity score (HDPS). An empirical study, conducted on the French pharmacovigilance database with a reference signal set pertaining to drug-induced liver injury (DILIrank), is carried out to compare the performance of this method (in 12 variants) to methods based on lasso-penalized regressions. In this work, the influence of the score estimation method is minimal, unlike the score integration method. In particular, HDPS weighting with matching weights shows good performance, comparable to that of lasso-based methods. In Chapter 3, we propose a method based on a lasso extension, the adaptive lasso, which allows variable-specific penalties to be introduced through adaptive weights. We propose two new weights adapted to spontaneous reporting data, as well as the use of the BIC for the choice of the penalty term. An extensive simulation study is performed to compare the performance of our proposals with other implementations of the adaptive lasso, a disproportionality method, lasso-based methods, and HDPS-based methods. The proposed methods show overall better results in terms of false discoveries and sensitivity than competing methods. An empirical study similar to the one conducted in Chapter 2 completes the evaluation. All the evaluated methods are implemented in the R package "adapt4pv" available on CRAN. Alongside the methodological developments in spontaneous reporting, there has been growing interest in the use of medico-administrative databases for signal detection in pharmacovigilance. Methodological research efforts in this area are still in their early stages. In Chapter 4, we explore detection strategies exploiting spontaneous reports and the French national health insurance permanent sample (Echantillon Généraliste des Bénéficiaires, EGB). We first evaluate the performance of a detection conducted on the EGB using DILIrank. Then, we consider a detection conducted on spontaneous reports, based on an adaptive lasso integrating, through its weights, information on the drug exposure of control individuals measured in the EGB. In both cases, the contribution of medico-administrative data is difficult to evaluate because of the relatively small size of the EGB.
APA, Harvard, Vancouver, ISO, and other styles
50

Ternes, Nils. "Identification de biomarqueurs prédictifs de la survie et de l'effet du traitement dans un contexte de données de grande dimension." Thesis, Université Paris-Saclay (ComUE), 2016. http://www.theses.fr/2016SACLS278/document.

Full text
Abstract:
With the recent revolution in genomics and in stratified medicine, the development of molecular signatures is becoming increasingly important for predicting the prognosis (prognostic biomarkers) and the treatment effect (predictive biomarkers) for each patient. However, the large quantity of available information has made false positives more and more frequent in biomedical research. The high-dimensional setting (i.e., number of biomarkers ≫ sample size) raises several statistical challenges, such as the identifiability of the models, the instability of the selected coefficients, and the multiple testing issue. The aim of this thesis was to propose and evaluate statistical methods for the identification of these biomarkers and for the individual prediction of survival probabilities for new patients, in the context of the Cox regression model. For variable selection in a high-dimensional setting, the lasso penalty is commonly used. In the prognostic setting, an empirical extension of the lasso penalty has been proposed to be more stringent in the estimation of the tuning parameter λ in order to select fewer false positives. In the predictive setting, focus has been given to biomarker-by-treatment interactions in the setting of a randomized clinical trial. Twelve approaches have been proposed for selecting these interactions, such as lasso (standard, adaptive, grouped, or ridge+lasso), boosting, dimension reduction of the main effects, and a model incorporating arm-specific biomarker effects. Finally, several strategies were studied to obtain an individual survival prediction with a corresponding confidence interval for a future patient from a penalized regression model, while limiting potential overfitting. The performance of the approaches was evaluated through simulation studies combining null and alternative scenarios. The methods were also illustrated on several data sets containing gene expression data in breast cancer.
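To make the biomarker-by-treatment interaction setup concrete, the Python sketch below builds the interaction design and selects interactions with an l1 penalty. For brevity it uses a simulated binary endpoint and a logistic lasso as a stand-in for the penalized Cox model of the thesis; all data, effect sizes, and the penalty level are toy assumptions.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n, p = 400, 200
trt = rng.integers(0, 2, size=n)               # randomized treatment arm (0/1)
X = rng.normal(size=(n, p))                    # biomarkers (e.g. gene expression)
inter = X * trt[:, None]                       # biomarker-by-treatment interactions
design = np.hstack([trt[:, None], X, inter])   # [treatment | main effects | interactions]

# toy truth: one prognostic biomarker and one predictive (interaction) biomarker
logit = 0.5 * trt + X[:, 0] + 1.2 * inter[:, 1]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))

fit = LogisticRegression(penalty="l1", solver="liblinear", C=0.2).fit(design, y)
coefs = fit.coef_.ravel()
selected_inter = np.nonzero(coefs[1 + p:])[0]  # indices of selected interactions
print(selected_inter)

Nonzero coefficients in the interaction block are the candidate predictive biomarkers; the thesis compares twelve variants of this selection step, several of which penalize the main-effect and interaction blocks differently.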
APA, Harvard, Vancouver, ISO, and other styles