Academic literature on the topic 'Variable selection bias'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the lists of relevant articles, books, theses, conference reports, and other scholarly sources on the topic 'Variable selection bias.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Journal articles on the topic "Variable selection bias"

1

Canan, Chelsea, Catherine Lesko, and Bryan Lau. "Instrumental Variable Analyses and Selection Bias." Epidemiology 28, no. 3 (May 2017): 396–98. http://dx.doi.org/10.1097/ede.0000000000000639.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

Shin, Sung-Chul, Yeon-Joo Jeong, and Moon Sup Song. "Bias Reduction in Split Variable Selection in C4.5." Communications for Statistical Applications and Methods 10, no. 3 (December 1, 2003): 627–35. http://dx.doi.org/10.5351/ckss.2003.10.3.627.

Full text
APA, Harvard, Vancouver, ISO, and other styles
3

Yeob Choi, Byeong, Jason P. Fine, and M. Alan Brookhart. "Bias testing, bias correction, and confounder selection using an instrumental variable model." Statistics in Medicine 39, no. 29 (August 27, 2020): 4386–404. http://dx.doi.org/10.1002/sim.8730.

Full text
APA, Harvard, Vancouver, ISO, and other styles
4

Shih, Yu-Shan, and Hsin-Wen Tsai. "Variable selection bias in regression trees with constant fits." Computational Statistics & Data Analysis 45, no. 3 (April 2004): 595–607. http://dx.doi.org/10.1016/s0167-9473(03)00036-7.

Full text
APA, Harvard, Vancouver, ISO, and other styles
5

García, O. "Estimating top height with variable plot sizes." Canadian Journal of Forest Research 28, no. 10 (October 1, 1998): 1509–17. http://dx.doi.org/10.1139/x98-128.

Full text
Abstract:
Conventional top height estimates are biased if the area of the sample plot differs from that on which the definition is based. Sources of bias include a sampling selection effect and spatial autocorrelation. The problem was studied in relation to the use of data sets with varying spatial detail for modelling Douglas-fir (Pseudotsuga menziesii (Mirb.) Franco) plantation growth. Improved top height estimators, developed taking into account the selection effect, eliminated the bias. Bias was reduced, but not eliminated completely, when the estimators were tested using more highly autocorrelated eucalypt data.
APA, Harvard, Vancouver, ISO, and other styles
6

Zhao, Pei Xin. "Penalized Estimation Based Variable Selection for Semiparametric Regression Models with Endogenous Covariates." Advanced Materials Research 1079-1080 (December 2014): 843–46. http://dx.doi.org/10.4028/www.scientific.net/amr.1079-1080.843.

Full text
Abstract:
In this paper, we study the variable selection problem for the parametric components of semiparametric regression models with endogenous variables. Based on the penalized empirical likelihood technology and the bias adjustment method, we propose a penalized empirical likelihood based variable selection procedure. Simulation studies show that the proposed variable selection procedure is workable, and the resulting estimator is consistent.
APA, Harvard, Vancouver, ISO, and other styles
7

Swanson, Sonja A. "A Practical Guide to Selection Bias in Instrumental Variable Analyses." Epidemiology 30, no. 3 (May 2019): 345–49. http://dx.doi.org/10.1097/ede.0000000000000973.

Full text
APA, Harvard, Vancouver, ISO, and other styles
8

Marshall, Andrew, Leilei Tang, and Alistair Milne. "Variable reduction, sample selection bias and bank retail credit scoring." Journal of Empirical Finance 17, no. 3 (June 2010): 501–12. http://dx.doi.org/10.1016/j.jempfin.2009.12.003.

Full text
APA, Harvard, Vancouver, ISO, and other styles
9

Qin, Xiao, and Junhee Han. "Variable Selection Issues in Tree-Based Regression Models." Transportation Research Record: Journal of the Transportation Research Board 2061, no. 1 (January 2008): 30–38. http://dx.doi.org/10.3141/2061-04.

Full text
Abstract:
Recently, there has been increasing interest in the use of classification and regression tree (CART) analysis. A tree-based regression model can be constructed by recursively partitioning the data with such criteria as to yield the maximum reduction in the variability of the response. Unfortunately, the exhaustive search may yield a bias in variable selection, and it tends to choose as a splitter a categorical variable that has many distinct values. In this study, an unbiased tree-based regression model, generalized unbiased interaction detection and estimation (GUIDE), is introduced for its robustness against the variable selection bias. Not only are the underlying theoretical differences behind CART and GUIDE in variable selection presented, but also the outcomes of the two different tree-based regression models are compared and analyzed by utilizing intersection inventory and crash data. The results underscore GUIDE's strength in selecting variables equally. A simulation shed additional light on the resulting negative impact when an algorithm was inappropriately applied to the data. This paper concludes by addressing the strengths and weaknesses of—and, more important, the differences between—the two hierarchical tree-based regression models, CART and GUIDE, and advises on the appropriate application. It is anticipated that the GUIDE model will provide a new perspective for users of tree-based models and will offer an advantage over existing methods. Users in transportation should choose the appropriate method and utilize it to their advantage.
APA, Harvard, Vancouver, ISO, and other styles
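The variable selection bias described in the Qin and Han abstract above (exhaustive-search splitting favours predictors offering many candidate split points) is easy to reproduce in simulation. The sketch below is only an illustration of that mechanism, assuming scikit-learn's DecisionTreeClassifier as a stand-in for a CART-style exhaustive search; it is not the GUIDE or CART implementation used in the paper.

# Illustration of variable selection bias in exhaustive-search splitting.
# Both predictors are pure noise; the one with more distinct values
# (more candidate split points) is still preferred at the root.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
wins = {"binary": 0, "continuous": 0}

for _ in range(500):
    n = 200
    x_binary = rng.integers(0, 2, size=n)        # 2 distinct values
    x_continuous = rng.uniform(size=n)           # ~200 distinct values
    y = rng.integers(0, 2, size=n)               # response independent of both
    X = np.column_stack([x_binary, x_continuous])

    root = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)
    split_var = root.tree_.feature[0]            # index of the variable used at the root
    if split_var == 0:
        wins["binary"] += 1
    elif split_var == 1:
        wins["continuous"] += 1

print(wins)  # the continuous noise variable wins in the large majority of runs

Because GUIDE separates variable selection (via significance tests) from split-point search, it does not inherit this multiple-comparisons advantage for high-cardinality predictors.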
10

Nishi, Hayato, Yasushi Asami, and Chihiro Shimizu. "Housing features and rent: estimating the microstructures of rental housing." International Journal of Housing Markets and Analysis 12, no. 2 (April 1, 2019): 210–25. http://dx.doi.org/10.1108/ijhma-09-2018-0067.

Full text
Abstract:
Purpose: While consumers previously did not have information on detailed housing features via traditional media, such as magazines, nowadays, owing to progress in information technology, they can access detailed information on various housing features via housing information websites. Therefore, detailed housing features may affect current rents to some extent. This paper aims to identify the effects of detailed housing features on rent and on omitted variable bias in Tokyo, Japan. Design/methodology/approach: This paper applies the hedonic approach. To identify the effects of features that were not observed previously, we use a unique data set that contains various housing features and over 200,000 housing units. This data set makes it possible to simulate situations in which the researcher cannot obtain some variables, and this simulation shows which variables cause omitted variable bias. Findings: The analysis shows that housing features significantly influence housing rent. If significant housing feature variables are not included in the hedonic model, the estimated coefficients show omitted variable bias. Additionally, unit-specific features such as an auto-locking door can cause omitted variable bias on location-specific features such as accessibility to downtown. Originality/value: This paper shows empirical evidence that detailed housing features can cause omitted variable bias on other features, including variables that are often used in previous research. The results from our unique data set can serve as a guide for variable selection to reduce omitted variable bias.
APA, Harvard, Vancouver, ISO, and other styles
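Because the Nishi, Asami and Shimizu abstract turns on omitted variable bias in a hedonic rent model, a toy version of the mechanism may help. The sketch below uses invented variable names and coefficients (not data or code from the paper): when a unit-specific amenity correlated with accessibility is omitted, the accessibility coefficient absorbs part of its effect.

# Toy illustration of omitted variable bias in a hedonic rent regression.
# 'accessibility' and 'amenity' are hypothetical regressors; rent depends on both.
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
accessibility = rng.normal(size=n)
amenity = 0.6 * accessibility + rng.normal(scale=0.8, size=n)   # correlated with accessibility
rent = 2.0 * accessibility + 1.0 * amenity + rng.normal(size=n)

def ols(X, y):
    X = np.column_stack([np.ones(len(y)), X])                   # add intercept
    return np.linalg.lstsq(X, y, rcond=None)[0]

full = ols(np.column_stack([accessibility, amenity]), rent)
short = ols(accessibility, rent)

print("full model, accessibility coef:", round(full[1], 2))     # ~2.0 (unbiased)
print("short model, accessibility coef:", round(short[1], 2))   # ~2.6 (biased upward)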

Dissertations / Theses on the topic "Variable selection bias"

1

Strobl, Carolin, Anne-Laure Boulesteix, Achim Zeileis, and Torsten Hothorn. "Bias in Random Forest Variable Importance Measures: Illustrations, Sources and a Solution." Department of Statistics and Mathematics, WU Vienna University of Economics and Business, 2006. http://epub.wu.ac.at/1274/1/document.pdf.

Full text
Abstract:
Variable importance measures for random forests have been receiving increased attention as a means of variable selection in many classification tasks in bioinformatics and related scientific fields, for instance to select a subset of genetic markers relevant for the prediction of a certain disease. We show that random forest variable importance measures are a sensible means for variable selection in many applications, but are not reliable in situations where potential predictor variables vary in their scale level or their number of categories. This is particularly important in genomics and computational biology, where predictors often include variables of different types. Simulation studies are presented illustrating that, when random forest variable importance measures are used with data of varying types, the results are misleading because suboptimal predictor variables may be artificially preferred in variable selection. The two mechanisms underlying this deficiency are biased variable selection in the individual classification trees used to build the random forest on one hand, and effects induced by bootstrap sampling with replacement on the other hand. We propose to employ an alternative implementation of random forests, that provides unbiased variable selection in the individual classification trees. When this method is applied using subsampling without replacement, the resulting variable importance measures can be used reliably for variable selection even in situations where the potential predictor variables vary in their scale level or their number of categories. The usage of both random forest algorithms and their variable importance measures in the R system for statistical computing is illustrated and documented thoroughly in an application re-analysing data from a study on RNA editing. Therefore the suggested method can be applied straightforwardly by scientists in bioinformatics research. (author's abstract)
Series: Research Report Series / Department of Statistics and Mathematics
APA, Harvard, Vancouver, ISO, and other styles
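Strobl et al. propose an alternative random forest implementation in R with unbiased trees and subsampling as the remedy; that method is not reproduced here. The sketch below is a minimal Python illustration, assuming scikit-learn, of the symptom the report documents: impurity-based importances from an ordinary random forest favour high-cardinality noise predictors, whereas permutation importance on held-out data does not.

# All three predictors are noise, but they differ in their number of distinct values.
# Impurity-based importance is biased toward the high-cardinality predictor;
# permutation importance on held-out data is close to zero for all of them.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
n = 1_000
X = np.column_stack([
    rng.integers(0, 2, size=n),      # binary noise
    rng.integers(0, 10, size=n),     # 10-category noise (integer-coded)
    rng.uniform(size=n),             # continuous noise
])
y = rng.integers(0, 2, size=n)       # response unrelated to X

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)

print("impurity importances:   ", forest.feature_importances_.round(3))
perm = permutation_importance(forest, X_te, y_te, n_repeats=20, random_state=0)
print("permutation importances:", perm.importances_mean.round(3))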
2

Tseng, Shih-Hsien. "Bayesian and Semi-Bayesian regression applied to manufacturing wooden products." The Ohio State University, 2008. http://rave.ohiolink.edu/etdc/view?acc_num=osu1199240473.

Full text
APA, Harvard, Vancouver, ISO, and other styles
3

Ditrich, Josef. "Možnosti redukce výběrového zkreslení v ratingových modelech [Options for reducing selection bias in rating models]." Doctoral thesis, Vysoká škola ekonomická v Praze, 2009. http://www.nusl.cz/ntk/nusl-201116.

Full text
Abstract:
Nowadays, the use of credit scoring models in the financial sector is common practice. Credit scoring plays an important role in the profitability and transparency of the lending business. Given the high credit volumes, even a small improvement in the discriminatory and predictive power of a credit scoring model may provide substantial additional profit. Scoring models are applied to the through-the-door population; however, for creating them or adjusting already existing credit rules, it is usual to use only the data corresponding to accepted applicants, for whom payment discipline can be observed. This discrepancy can lead to reject bias (or selection bias in general). Methods trying to eliminate or reduce this phenomenon are known by the term reject inference. In general, these methods try to assess the behavior of rejected applicants or to obtain additional information about them. In this dissertation, I dealt with the enlargement method, which is based on the random acceptance of applicants who would otherwise have been rejected. This method is not only time consuming but also expensive. Therefore, I looked for ways to reduce the cost of acquiring additional information about rejected applicants. As a result, I have proposed a modification which I call the enlargement method with a sorting variable. It was validated on a real bank database with two possible sorting variables, and the results were compared with the original version of the method. It was shown that both tested approaches can reduce the cost of the method while retaining the accuracy of the scoring models.
APA, Harvard, Vancouver, ISO, and other styles
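The reject bias targeted by Ditrich's thesis can be made concrete with a toy simulation. The sketch below is not the enlargement method from the thesis; the acceptance rule, feature names and coefficients are invented. It shows the basic problem: a scorecard trained only on accepted applicants is miscalibrated for the full through-the-door population.

# Toy reject-bias demonstration: the legacy accept/reject rule uses a variable (x2)
# that the new scorecard does not see, so training on accepted applicants only
# underestimates default risk in the full applicant population.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n = 50_000
x1 = rng.normal(size=n)                          # feature available to the scorecard
x2 = 0.5 * x1 + rng.normal(size=n)               # risk driver used by the legacy rule only
p_default = 1.0 / (1.0 + np.exp(-(-2.0 + 0.8 * x1 + 0.8 * x2)))
default = rng.binomial(1, p_default)

accepted = x2 < 0.0                              # legacy rule rejects the riskier half (high x2)

model = LogisticRegression().fit(x1[accepted].reshape(-1, 1), default[accepted])
pred_all = model.predict_proba(x1.reshape(-1, 1))[:, 1]

print("population default rate:                ", round(default.mean(), 3))
print("mean predicted PD (accepted-only model):", round(pred_all.mean(), 3))
# The accepted-only model systematically underestimates the population default rate.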
4

Cai, Mingxuan. "BIVAS: a scalable Bayesian method for bi-level variable selection." HKBU Institutional Repository, 2018. https://repository.hkbu.edu.hk/etd_oa/482.

Full text
Abstract:
In this thesis, we consider a Bayesian bi-level variable selection problem in high-dimensional regressions. In many practical situations, it is natural to assign group membership to each predictor. Examples include that genetic variants can be grouped at the gene level and a covariate from different tasks naturally forms a group. Thus, it is of interest to select important groups as well as important members from those groups. The existing methods based on Markov Chain Monte Carlo (MCMC) are often computationally intensive and not scalable to large data sets. To address this problem, we consider variational inference for bi-level variable selection (BIVAS). In contrast to the commonly used mean-field approximation, we propose a hierarchical factorization to approximate the posterior distribution, by utilizing the structure of bi-level variable selection. Moreover, we develop a computationally efficient and fully parallelizable algorithm based on this variational approximation. We further extend the developed method to model data sets from multi-task learning. The comprehensive numerical results from both simulation studies and real data analysis demonstrate the advantages of BIVAS for variable selection, parameter estimation and computational efficiency over existing methods. The BIVAS software with support of parallelization is implemented in R package `bivas' available at https://github.com/mxcai/bivas.
APA, Harvard, Vancouver, ISO, and other styles
5

Xie, Diqiong. "Bias and variance of treatment effect estimators using propensity-score matching." Diss., University of Iowa, 2011. https://ir.uiowa.edu/etd/4980.

Full text
Abstract:
Observational studies are an indispensable complement to randomized clinical trials (RCT) for comparison of treatment effectiveness. Often RCTs cannot be carried out due to the costs of the trial, ethical questions and rarity of the outcome. When noncompliance and missing data are prevalent, RCTs become more like observational studies. The main problem is to adjust for the selection bias in the observational study. One increasingly used method is propensity-score matching. Compared to traditional multi-covariate matching methods, matching on the propensity score alleviates the curse of dimensionality. It allows investigators to balance multiple covariate distributions between treatment groups by matching on a single score. This thesis focuses on the large sample properties of the matching estimators of the treatment effect. The first part of this thesis deals with problems of the analytic supports of the logit propensity score and various matching methods. The second part of this thesis focuses on the matching estimators of additive and multiplicative treatment effects. We derive the asymptotic order of the biases and asymptotic distributions of the matching estimators. We also derive the large sample variance estimators for the treatment effect estimators. The methods and theoretical results are applied and checked in a series of simulation studies. The third part of this thesis is devoted to a comparison between propensity-score matching and multiple linear regression using simulation.
APA, Harvard, Vancouver, ISO, and other styles
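Xie's thesis derives the asymptotic bias and variance of propensity-score matching estimators; those derivations are not reproduced here. The sketch below is only a minimal version of the estimator itself on simulated data (logistic propensity model, one-to-one nearest-neighbour matching on the estimated score, with replacement), intended to show the mechanics.

# Minimal propensity-score matching estimate of the treatment effect on the treated.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
n = 5_000
X = rng.normal(size=(n, 2))
p_treat = 1.0 / (1.0 + np.exp(-(X[:, 0] - 0.5 * X[:, 1])))     # true propensity
T = rng.binomial(1, p_treat)
tau = 2.0                                                       # true treatment effect
Y = tau * T + X[:, 0] + X[:, 1] + rng.normal(size=n)

# Estimate the propensity score, then match each treated unit to the
# nearest control on the estimated score (with replacement).
ps = LogisticRegression().fit(X, T).predict_proba(X)[:, 1]
treated, control = np.where(T == 1)[0], np.where(T == 0)[0]
matches = control[np.abs(ps[treated][:, None] - ps[control][None, :]).argmin(axis=1)]

att = (Y[treated] - Y[matches]).mean()
naive = Y[T == 1].mean() - Y[T == 0].mean()
print("naive difference in means:", round(naive, 2))   # confounded by X
print("matched ATT estimate:     ", round(att, 2))     # should be close to the true effect of 2.0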
6

Romero, Merino Enrique. "Learning with Feed-forward Neural Networks: Three Schemes to Deal with the Bias/Variance Trade-off." Doctoral thesis, Universitat Politècnica de Catalunya, 2004. http://hdl.handle.net/10803/6644.

Full text
Abstract:
In terms of the Bias/Variance decomposition, very flexible (i.e., complex) Supervised Machine Learning systems may lead to unbiased estimators but with high variance. A rigid model, in contrast, may lead to small variance but high bias. There is a trade-off between the bias and variance contributions to the error, where the optimal performance is achieved.

In this work we present three schemes related to the control of the Bias/Variance decomposition for Feed-forward Neural Networks (FNNs) with the (sometimes modified) quadratic loss function:

1. An algorithm for sequential approximation with FNNs, named Sequential Approximation with Optimal Coefficients and Interacting Frequencies (SAOCIF). Most of the sequential approximations proposed in the literature select the new frequencies (the non-linear weights) guided by the approximation of the residue of the partial approximation. We propose a sequential algorithm where the new frequency is selected taking into account its interactions with the previously selected ones. The interactions are discovered by means of their optimal coefficients (the linear weights). A number of heuristics can be used to select the new frequencies. The aim is that the same level of approximation may be achieved with fewer hidden units than if we only try to match the residue as well as possible. In terms of the Bias/Variance decomposition, it will be possible to obtain simpler models with the same bias. The idea behind SAOCIF can be extended to approximation in Hilbert spaces, maintaining orthogonal-like properties. In this case, the importance of the interacting frequencies lies in the expectation of increasing the rate of approximation. Experimental results show that the idea of interacting frequencies makes it possible to construct better approximations than matching the residue.

2. A study and comparison of different criteria to perform Feature Selection (FS) with Multi-Layer Perceptrons (MLPs) and the Sequential Backward Selection (SBS) procedure within the wrapper approach. FS procedures control the Bias/Variance decomposition by means of the input dimension, establishing a clear connection with the curse of dimensionality. Several critical decision points are studied and compared. First, the stopping criterion. Second, the data set where the value of the loss function is measured. Finally, we also compare two ways of computing the saliency (i.e., the relative importance) of a feature: either first train a network and then remove temporarily every feature, or train a different network with every feature temporarily removed. The experiments are performed for linear and non-linear models. Experimental results suggest that the increase in the computational cost associated with retraining a different network with every feature temporarily removed prior to computing the saliency can be rewarded with a significant performance improvement, especially if non-linear models are used. Although this idea could be thought of as very intuitive, it has hardly been used in practice. Regarding the data set where the value of the loss function is measured, it seems clear that the SBS procedure for MLPs benefits from measuring the loss function in a validation set. A somewhat non-intuitive conclusion is drawn by looking at the stopping criterion, where it can be seen that forcing overtraining may be as useful as early stopping.

3. A modification of the quadratic loss function for classification problems, inspired by Support Vector Machines (SVMs) and the AdaBoost algorithm, named the Weighted Quadratic Loss (WQL) function. The modification consists of weighting the contribution of every example to the total error. In the linearly separable case, the solution of the hard margin SVM also minimizes the proposed loss function. The hardness of the resulting solution can be controlled, as in SVMs, so that this scheme may also be used for the non-linearly separable case. The error weighting proposed in WQL forces the training procedure to pay more attention to the points with a smaller margin. Therefore, the scheme tries to control variance by not attempting to overfit the points that are already well classified. The model shares several properties with the SVM framework, with some additional advantages. On the one hand, the final solution is neither restricted to an architecture with as many hidden units as points (or support vectors) in the data set nor to the use of kernel functions. The frequencies are not restricted to be a subset of the data set. On the other hand, it makes it possible to deal with multiclass and multilabel problems in a natural way. Experimental results confirming these claims are shown.

Extensive experimental work has been done with the proposed schemes, including artificial data sets, well-known benchmark data sets and two real-world problems from the Natural Language Processing domain. In addition to widely used activation functions, such as the hyperbolic tangent or the Gaussian function, other activation functions have been tested. In particular, sinusoidal MLPs showed very good behavior. The experimental results can be considered very satisfactory. The schemes presented in this work have been found to be very competitive when compared to other existing schemes described in the literature. In addition, they can be combined with one another, since they deal with complementary aspects of the whole learning process.
APA, Harvard, Vancouver, ISO, and other styles
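Since the thesis above is organised around the bias/variance decomposition of the quadratic loss, a small numerical decomposition may help readers who have not seen one computed. The sketch below is generic and uses polynomial regression instead of feed-forward networks (an assumption made purely to keep the example short): squared bias and variance are estimated at a grid of test points by refitting models of different flexibility on many independent training sets.

# Empirical bias^2 / variance decomposition for models of different flexibility.
import numpy as np

rng = np.random.default_rng(5)
f = lambda x: np.sin(2.0 * np.pi * x)          # true regression function
x_test = np.linspace(0.05, 0.95, 50)

for degree in (1, 3, 9):                        # rigid, moderate, very flexible
    preds = []
    for _ in range(300):                        # many independent training sets
        x = rng.uniform(size=30)
        y = f(x) + rng.normal(scale=0.3, size=30)
        coef = np.polyfit(x, y, deg=degree)
        preds.append(np.polyval(coef, x_test))
    preds = np.array(preds)
    bias2 = ((preds.mean(axis=0) - f(x_test)) ** 2).mean()
    variance = preds.var(axis=0).mean()
    print(f"degree {degree:2d}: bias^2 = {bias2:.3f}, variance = {variance:.3f}")

The rigid model shows high bias and low variance, the flexible one the reverse, which is the trade-off the thesis's three schemes aim to control.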
7

Bhatti, Sajjad Haider. "Estimation of the mincerian wage model addressing its specification and different econometric issues." Phd thesis, Université de Bourgogne, 2012. http://tel.archives-ouvertes.fr/tel-00780563.

Full text
Abstract:
In this doctoral thesis, we estimated Mincer's (1974) semi-logarithmic wage function for French and Pakistani labour force data. This model is considered a standard tool for estimating the relationship between earnings/wages and different contributory factors. Despite its wide and extensive use, simple estimation of the Mincerian model is biased because of different econometric problems. The main sources of bias noted in the literature are endogeneity of schooling, measurement error, and sample selectivity. We have tackled the endogeneity and measurement error biases via an instrumental variables two-stage least squares approach, for which we have proposed two new instrumental variables. The first instrumental variable is defined as "the average years of schooling in the family of the concerned individual" and the second instrumental variable is defined as "the average years of schooling in the country, of a particular age group, of a particular gender, at the particular time when an individual joined the labour force". Schooling is found to be endogenous for both countries. Comparing the two instruments, we selected the second as the more appropriate. We have applied the Heckman (1979) two-step procedure to eliminate possible sample selection bias, which was found to be significantly positive for both countries; this means that in both countries, people who decided not to participate in the labour force as wage workers would have earned less than participants if they had decided to work as wage earners. We have estimated a specification that tackles endogeneity and sample selectivity together, as the existing literature shows a relative scarcity of such studies globally and, in particular, an absence of such studies for France and Pakistan. Differences in coefficients proved the worth of such a specification. We have also estimated the model semi-parametrically, but contrary to the general norm in the context of the Mincerian model, our semi-parametric estimation contained a non-parametric component from the first-stage schooling equation instead of a non-parametric component from the selection equation. For both countries, we found the parametric model to be more appropriate. We found the errors to be heteroscedastic for the data from both countries and then applied adaptive estimation to control the adverse effects of heteroscedasticity. Comparing simple and adaptive estimations, we prefer the adaptive specification of the parametric model for both countries. Finally, we have applied quantile regression to the model selected from the mean regression. Quantile regression revealed that different explanatory factors exert different influences in different parts of the wage distributions of the two countries. For both Pakistan and France, this is the first study that corrects for both sample selectivity and endogeneity in a single specification within a quantile regression framework.
APA, Harvard, Vancouver, ISO, and other styles
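The core estimator in Bhatti's thesis is instrumental-variables two-stage least squares for the schooling coefficient of a Mincerian wage equation. The sketch below is a simplified, simulated version with invented coefficients, a single instrument and no Heckman selection step; it only shows why 2SLS removes the bias that plain OLS suffers when schooling is correlated with unobserved ability.

# OLS vs. two-stage least squares on a toy Mincer-style wage equation
# with schooling made endogenous through unobserved ability.
import numpy as np

rng = np.random.default_rng(6)
n = 20_000
ability = rng.normal(size=n)                               # unobserved
instrument = rng.normal(size=n)                            # e.g. a family/cohort schooling average
schooling = 10 + 1.0 * instrument + 1.0 * ability + rng.normal(size=n)
log_wage = 1.0 + 0.08 * schooling + 0.5 * ability + rng.normal(scale=0.3, size=n)

def ols(X, y):
    X = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X, y, rcond=None)[0]

beta_ols = ols(schooling, log_wage)[1]

# Stage 1: project schooling on the instrument; Stage 2: regress log wage on the projection.
schooling_hat = np.column_stack([np.ones(n), instrument]) @ ols(instrument, schooling)
beta_2sls = ols(schooling_hat, log_wage)[1]

print("true return to schooling: 0.080")
print("OLS estimate: ", round(beta_ols, 3))    # biased upward by omitted ability
print("2SLS estimate:", round(beta_2sls, 3))   # approximately unbiased
# Note: standard errors from this naive two-step procedure are not valid.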
8

Shandilya, Sharad. "ASSESSMENT AND PREDICTION OF CARDIOVASCULAR STATUS DURING CARDIAC ARREST THROUGH MACHINE LEARNING AND DYNAMICAL TIME-SERIES ANALYSIS." VCU Scholars Compass, 2013. http://scholarscompass.vcu.edu/etd/3198.

Full text
Abstract:
In this work, new methods of feature extraction, feature selection, stochastic data characterization/modeling, variance reduction and measures for parametric discrimination are proposed. These methods have implications for data mining, machine learning, and information theory. A novel decision-support system is developed in order to guide intervention during cardiac arrest. The models are built upon knowledge extracted with signal-processing, non-linear dynamic and machine-learning methods. The proposed ECG characterization, combined with information extracted from PetCO2 signals, shows viability for decision-support in clinical settings. The approach, which focuses on integration of multiple features through machine learning techniques, suits well to inclusion of multiple physiologic signals. Ventricular Fibrillation (VF) is a common presenting dysrhythmia in the setting of cardiac arrest whose main treatment is defibrillation through direct current countershock to achieve return of spontaneous circulation. However, often defibrillation is unsuccessful and may even lead to the transition of VF to more nefarious rhythms such as asystole or pulseless electrical activity. Multiple methods have been proposed for predicting defibrillation success based on examination of the VF waveform. To date, however, no analytical technique has been widely accepted. For a given desired sensitivity, the proposed model provides a significantly higher accuracy and specificity as compared to the state-of-the-art. Notably, within the range of 80-90% of sensitivity, the method provides about 40% higher specificity. This means that when trained to have the same level of sensitivity, the model will yield far fewer false positives (unnecessary shocks). Also introduced is a new model that predicts recurrence of arrest after a successful countershock is delivered. To date, no other work has sought to build such a model. I validate the method by reporting multiple performance metrics calculated on (blind) test sets.
APA, Harvard, Vancouver, ISO, and other styles
9

"Addressing the Variable Selection Bias and Local Optimum Limitations of Longitudinal Recursive Partitioning with Time-Efficient Approximations." Doctoral diss., 2019. http://hdl.handle.net/2286/R.I.54792.

Full text
Abstract:
Longitudinal recursive partitioning (LRP) is a tree-based method for longitudinal data. It takes a sample of individuals that were each measured repeatedly across time, and it splits them based on a set of covariates such that individuals with similar trajectories become grouped together into nodes. LRP does this by fitting a mixed-effects model to each node every time that it becomes partitioned and extracting the deviance, which is the measure of node purity. LRP is implemented using the classification and regression tree algorithm, which suffers from a variable selection bias and does not guarantee reaching a global optimum. Additionally, fitting mixed-effects models to each potential split only to extract the deviance and discard the rest of the information is a computationally intensive procedure. Therefore, in this dissertation, I address the high computational demand, the variable selection bias, and the local optimum solution. I propose three approximation methods that reduce the computational demand of LRP and, at the same time, allow for a straightforward extension to recursive partitioning algorithms that do not have a variable selection bias and can reach the global optimum solution. In the three proposed approximations, a mixed-effects model is fit to the full data, and the growth curve coefficients for each individual are extracted. Then, (1) a principal component analysis is fit to the set of coefficients and the principal component score is extracted for each individual, (2) a one-factor model is fit to the coefficients and the factor score is extracted, or (3) the coefficients are summed. The three methods result in each individual having a single score that represents the growth curve trajectory. Therefore, now that the outcome is a single score for each individual, any tree-based method may be used to partition the data and group the individuals together. Once the individuals are assigned to their final nodes, a mixed-effects model is fit to each terminal node with the individuals belonging to it. I conduct a simulation study, where I show that the approximation methods achieve the proposed goals while maintaining a level of out-of-sample prediction accuracy similar to that of LRP. I then illustrate and compare the methods using applied data.
Dissertation/Thesis
Doctoral Dissertation Psychology 2019
APA, Harvard, Vancouver, ISO, and other styles
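The three approximations in the abstract share one pipeline: extract per-person growth-curve coefficients, collapse them to a single score, and hand that score to any off-the-shelf tree method. The sketch below is a loose Python rendering of option (1); it substitutes per-person least-squares fits for the dissertation's mixed-effects model (an assumption made to keep the example dependency-free) and uses scikit-learn's PCA and decision tree.

# Approximate the longitudinal-recursive-partitioning shortcut:
# per-person growth coefficients -> one PCA score per person -> ordinary regression tree.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(7)
n_people, times = 300, np.arange(5.0)
group = rng.integers(0, 2, size=n_people)                 # covariate that shapes the slope

coefs = []
for i in range(n_people):
    slope = 0.5 + 1.5 * group[i] + rng.normal(scale=0.2)
    intercept = rng.normal()
    y = intercept + slope * times + rng.normal(scale=0.5, size=times.size)
    # Per-person least-squares growth curve (stand-in for mixed-model coefficients).
    coefs.append(np.polyfit(times, y, deg=1))              # [slope, intercept]
coefs = np.array(coefs)

score = PCA(n_components=1).fit_transform(coefs).ravel()   # one trajectory score per person
tree = DecisionTreeRegressor(max_depth=2).fit(group.reshape(-1, 1), score)
print("mean trajectory score by group:",
      [round(score[group == g].mean(), 2) for g in (0, 1)])
# The tree now partitions individuals on the covariate using the single score as outcome.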
10

Lutu, P. E. N. (Patricia Elizabeth Nalwoga). "Dataset selection for aggregate model implementation in predictive data mining." Thesis, 2010. http://hdl.handle.net/2263/29486.

Full text
Abstract:
Data mining has become a commonly used method for the analysis of organisational data, for purposes of summarizing data in useful ways and identifying non-trivial patterns and relationships in the data. Given the large volumes of data that are collected by business, government, non-government and scientific research organizations, a major challenge for data mining researchers and practitioners is how to select relevant data for analysis in sufficient quantities, in order to meet the objectives of a data mining task. This thesis addresses the problem of dataset selection for predictive data mining. Dataset selection was studied in the context of aggregate modeling for classification. The central argument of this thesis is that, for predictive data mining, it is possible to systematically select many dataset samples and employ different approaches (different from current practice) to feature selection, training dataset selection, and model construction. When a large amount of information in a large dataset is utilised in the modeling process, the resulting models will have a high level of predictive performance and should be more reliable. Aggregate classification models, also known as ensemble classifiers, have been shown to provide a high level of predictive accuracy on small datasets. Such models are known to achieve a reduction in the bias and variance components of the prediction error of a model. The research for this thesis was aimed at the design of aggregate models and the selection of training datasets from large amounts of available data. The objectives for the model design and dataset selection were to reduce the bias and variance components of the prediction error for the aggregate models. Design science research was adopted as the paradigm for the research. Large datasets obtained from the UCI KDD Archive were used in the experiments. Two classification algorithms, See5 for classification tree modeling and K-Nearest Neighbour, were used in the experiments. The two methods of aggregate modeling that were studied are One-Vs-All (OVA) and positive-Vs-negative (pVn) modeling. While OVA is an existing method that has been used for small datasets, pVn is a new method of aggregate modeling proposed in this thesis. Methods for feature selection from large datasets, and methods for training dataset selection from large datasets, for OVA and pVn aggregate modeling, were studied. The feature selection experiments revealed that the use of many samples, robust measures of correlation, and validation procedures results in the reliable selection of relevant features for classification. A new algorithm for feature subset search, based on the decision rule-based approach to heuristic search, was designed, and the performance of this algorithm was compared to two existing algorithms for feature subset search. The experimental results revealed that the new algorithm makes better decisions for feature subset search. The information provided by a confusion matrix was used as a basis for the design of OVA and pVn base models, which are then combined into one aggregate model. A new construct called a confusion graph was used in conjunction with new algorithms for the design of pVn base models. A new algorithm for combining base model predictions and resolving conflicting predictions was designed and implemented. Experiments to study the performance of the OVA and pVn aggregate models revealed that the aggregate models provide a high level of predictive accuracy compared to single models.
Finally, theoretical models to depict the relationships between the factors that influence feature selection and training dataset selection for aggregate models are proposed, based on the experimental results.
Thesis (PhD)--University of Pretoria, 2010.
Computer Science
APA, Harvard, Vancouver, ISO, and other styles
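Of the two aggregate-modeling schemes studied in the thesis, One-Vs-All is standard enough to sketch; pVn is the thesis's own contribution and is not reproduced here. The minimal example below, assuming scikit-learn and its bundled iris data rather than the UCI KDD datasets used in the thesis, shows the OVA decomposition of a multiclass problem into one binary base model per class.

# One-Vs-All aggregate classification: one binary base model per class.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

ova = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_tr, y_tr)
print("number of binary base models:", len(ova.estimators_))    # one per class
print("test accuracy of the aggregate:", round(ova.score(X_te, y_te), 3))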

Books on the topic "Variable selection bias"

1

Boudreau, Joseph F., and Eric S. Swanson. Monte Carlo methods. Oxford University Press, 2018. http://dx.doi.org/10.1093/oso/9780198708636.003.0007.

Full text
Abstract:
Monte Carlo methods are those designed to obtain numerical answers with the use of random numbers. This chapter discusses random engines, which provide a pseudo-random pattern of bits, and their use in sampling a variety of nonuniform distributions, for both continuous and discrete variables. A wide selection of uniform and nonuniform variate generators from the C++ standard library is reviewed, and common techniques for generating custom nonuniform variates are discussed. The chapter presents the use of Monte Carlo to evaluate integrals, particularly multidimensional integrals, and then introduces the important method of Markov chain Monte Carlo, suitable for solving a wide range of scientific problems that require the sampling of complicated multivariate distributions. Relevant topics in probability and statistics are also introduced in this chapter. Finally, the topics of thermalization, autocorrelation, multimodality, and Gibbs sampling are presented.
APA, Harvard, Vancouver, ISO, and other styles
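The chapter's starting point, turning uniform random numbers into draws from a nonuniform distribution, can be demonstrated briefly. The sketch below is a generic inverse-transform example; the book works with the C++ standard library, so the language and everything else in this snippet are assumptions rather than the book's code.

# Inverse-transform sampling: map uniform variates through the inverse CDF.
import numpy as np

rng = np.random.default_rng(8)
u = rng.uniform(size=100_000)

# Continuous example: exponential(rate) has inverse CDF  F^{-1}(u) = -ln(1 - u) / rate.
rate = 2.0
exp_draws = -np.log(1.0 - u) / rate
print("exponential mean (target 0.5):", round(exp_draws.mean(), 3))

# Discrete example: invert the cumulative probabilities with a binary search.
probs = np.array([0.2, 0.5, 0.3])
discrete_draws = np.searchsorted(np.cumsum(probs), u)
print("empirical frequencies:", np.bincount(discrete_draws) / u.size)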

Book chapters on the topic "Variable selection bias"

1

Baskin, Igor I., Gilles Marcou, Dragos Horvath, and Alexandre Varnek. "Cross-Validation and the Variable Selection Bias." In Tutorials in Chemoinformatics, 163–73. Chichester, UK: John Wiley & Sons, Ltd, 2017. http://dx.doi.org/10.1002/9781119161110.ch10.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

Richards, Joseph W. "Overcoming Sample Selection Bias in Variable Star Classification." In Astrostatistics and Data Mining, 213–21. New York, NY: Springer New York, 2012. http://dx.doi.org/10.1007/978-1-4614-3323-1_22.

Full text
APA, Harvard, Vancouver, ISO, and other styles
3

Marquis, Bastien, and Maarten Jansen. "Correction for Optimisation Bias in Structured Sparse High-Dimensional Variable Selection." In Springer Proceedings in Mathematics & Statistics, 357–65. Cham: Springer International Publishing, 2020. http://dx.doi.org/10.1007/978-3-030-57306-5_32.

Full text
APA, Harvard, Vancouver, ISO, and other styles
4

Munson, M. Arthur, and Rich Caruana. "On Feature Selection, Bias-Variance, and Bagging." In Machine Learning and Knowledge Discovery in Databases, 144–59. Berlin, Heidelberg: Springer Berlin Heidelberg, 2009. http://dx.doi.org/10.1007/978-3-642-04174-7_10.

Full text
APA, Harvard, Vancouver, ISO, and other styles
5

Rosales-Pérez, Alejandro, Hugo Jair Escalante, Jesus A. Gonzalez, Carlos A. Reyes-Garcia, and Carlos A. Coello Coello. "Bias and Variance Multi-objective Optimization for Support Vector Machines Model Selection." In Pattern Recognition and Image Analysis, 108–16. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013. http://dx.doi.org/10.1007/978-3-642-38628-2_12.

Full text
APA, Harvard, Vancouver, ISO, and other styles
6

Arreola, Julio, Damián Gibaja, J. Agustín Franco, and Marcelo Sánchez-Oro. "Comparison of the Bias and Weighting of Variables in Neural Networks (ANN) for the Selection of the Type of Housing in Spain and Mexico." In Studies in Computational Intelligence, 19–34. Cham: Springer International Publishing, 2021. http://dx.doi.org/10.1007/978-3-030-72065-0_2.

Full text
APA, Harvard, Vancouver, ISO, and other styles
7

Hankin, David G., Michael S. Mohr, and Ken B. Newman. "Ratio and regression estimation." In Sampling Theory, 104–39. Oxford University Press, 2019. http://dx.doi.org/10.1093/oso/9780198815792.003.0007.

Full text
Abstract:
Inexpensive and/or readily available auxiliary variable, x, values may often be available at little or no cost. If these variables are highly correlated with the target variable, y, then use of ratio or regression estimators may greatly reduce sampling variance. These estimators are not unbiased, but bias is generally small compared to the target of estimation and contributes a very small proportion of overall mean square error, the relevant measure of accuracy for biased estimators. Ratio estimation can also be incorporated in the context of stratified designs, again possibly offering a reduction in overall sampling variance. Model-based prediction offers an alternative to the design-based ratio and regression estimators and we present an overview of this approach. In model-based prediction, the y values associated with population units are viewed as realizations of random variables which are assumed to be related to auxiliary variables according to specified models. The realized values of the target variable are known for the sample, but must be predicted using an assumed model dependency on the auxiliary variable for the non-sampled units in the population. Insights from model-based thinking may assist the design-based sampling theorist in selection of an appropriate estimator. Similarly, we show that insights from design-based estimation may improve estimation of uncertainty in model-based mark-recapture estimation.
APA, Harvard, Vancouver, ISO, and other styles
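The chapter's central claim (ratio estimators are slightly biased but can have far smaller mean square error than the plain sample mean when x and y are highly correlated) can be checked with a short simulation. The sketch below uses an invented population and simple random sampling without replacement; it is not code from the book.

# Plain expansion estimator vs. ratio estimator of a population mean,
# compared over repeated simple random samples.
import numpy as np

rng = np.random.default_rng(9)
N, n = 5_000, 100
x = rng.gamma(shape=4.0, scale=2.0, size=N)          # auxiliary variable, known for all units
y = 3.0 * x + rng.normal(scale=2.0, size=N)          # target variable, highly correlated with x
y_mean, x_mean = y.mean(), x.mean()

plain, ratio = [], []
for _ in range(5_000):
    s = rng.choice(N, size=n, replace=False)         # simple random sample of unit indices
    plain.append(y[s].mean())
    ratio.append(y[s].mean() / x[s].mean() * x_mean) # ratio estimator uses the known x mean

plain, ratio = np.array(plain), np.array(ratio)
for name, est in (("plain mean", plain), ("ratio estimator", ratio)):
    bias = est.mean() - y_mean
    mse = ((est - y_mean) ** 2).mean()
    print(f"{name:16s} bias = {bias:+.4f}   MSE = {mse:.4f}")

The ratio estimator's small bias contributes almost nothing to its mean square error, which is the point the chapter makes about accuracy for biased estimators.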
8

Poast, Paul. "Analyzing Alliance Treaty Negotiation Outcomes." In Arguing about Alliances, 64–106. Cornell University Press, 2019. http://dx.doi.org/10.7591/cornell/9781501740244.003.0004.

Full text
Abstract:
This chapter explores basic patterns in the data described in the previous chapter using cross tabulations. These tabulations show that having strategic and operational compatibility is strongly associated with a higher rate of agreement in alliance treaty negotiations. They also demonstrate that agreement can be reached, though less often, even between states that lack ideal war plan compatibility. The suggestive evidence offered by these cross tabulations is useful, but the cross tabulations also raise questions. While the initial patterns are supportive of this book's theory, the chapter is concerned about potential complications in the data that could undermine the ability to draw inferences about the relationships between variables. These potential complications include selection bias and omitted variable bias. The chapter then identifies how and under what conditions the existence of an outside option influences the outcome of alliance treaty negotiations.
APA, Harvard, Vancouver, ISO, and other styles
9

"An Algorithm for Causal Inference in the Presence of Latent Variables and Selection Bias." In Computation, Causation, and Discovery. The MIT Press, 1999. http://dx.doi.org/10.7551/mitpress/2006.003.0009.

Full text
APA, Harvard, Vancouver, ISO, and other styles
10

Hankin, David G., Michael S. Mohr, and Ken B. Newman. "Basic concepts." In Sampling Theory, 11–22. Oxford University Press, 2019. http://dx.doi.org/10.1093/oso/9780198815792.003.0002.

Full text
Abstract:
This chapter provides a conceptual, visual and non-quantitative presentation of the basic principles of sampling theory which are developed in formal quantitative fashion in subsequent chapters. Included are summaries of (a) basic terminology used throughout the text (population, sample, estimator, estimate), (b) components of a sampling strategy (sampling frame, sampling design, estimator), (c) properties of estimators (bias, sampling variance, mean square error), and (d) sampling distribution of an estimator. Simple or familiar settings are used to illustrate the differences between simple frames (listings of population units from which a sample of units is selected) and complex frames (sampling units consist of groupings of population units), and to illustrate the different components of a sampling strategy. A bullseye target with associated dart throws is used to distinguish the important estimator properties of bias, sampling variance, and mean square error. The performances of randomized sampling procedures and purposive or judgment selection of “representative samples” are contrasted using two examples: (1) an historical contrast of estimated abundance of Oregon coastal coho salmon (Oncorhynchus kisutch) based on purposive representative reach surveys and on stratified random surveys, and (2) a repeatable classroom exercise pitting judgment sampling against simple random sampling for estimation of mean weight in a population of agates collected from northern California beaches.
APA, Harvard, Vancouver, ISO, and other styles
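The chapter's dartboard picture of bias, sampling variance and mean square error can also be shown numerically. The sketch below is a generic illustration with an invented population (not the salmon or agate examples from the chapter): a simple random sample estimates the mean without bias, a scheme that favours larger units is biased, and MSE decomposes into variance plus squared bias.

# Bias, sampling variance and MSE of two selection schemes for estimating a mean.
import numpy as np

rng = np.random.default_rng(10)
population = rng.lognormal(mean=1.0, sigma=0.6, size=2_000)   # skewed population of unit sizes
true_mean = population.mean()
n = 25

def summarize(name, estimates):
    estimates = np.array(estimates)
    bias = estimates.mean() - true_mean
    var = estimates.var()
    print(f"{name:22s} bias = {bias:+.3f}  variance = {var:.3f}  MSE = {var + bias**2:.3f}")

srs, size_biased = [], []
p_large = population / population.sum()            # selection probability grows with unit size
for _ in range(10_000):
    srs.append(rng.choice(population, size=n, replace=False).mean())
    size_biased.append(rng.choice(population, size=n, replace=False, p=p_large).mean())

summarize("simple random sample", srs)
summarize("size-biased selection", size_biased)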

Conference papers on the topic "Variable selection bias"

1

"Variable Liquidity and Selection Bias in Transaction Indices of institutional Commercial Property." In 9th European Real Estate Society Conference: ERES Conference 2002. ERES, 2002. http://dx.doi.org/10.15396/eres2002_167.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

Simon, Donald L., and Sanjay Garg. "Optimal Tuner Selection for Kalman Filter-Based Aircraft Engine Performance Estimation." In ASME Turbo Expo 2009: Power for Land, Sea, and Air. ASMEDC, 2009. http://dx.doi.org/10.1115/gt2009-59684.

Full text
Abstract:
A linear point design methodology for minimizing the error in on-line Kalman filter-based aircraft engine performance estimation applications is presented. This technique specifically addresses the underdetermined estimation problem, where there are more unknown parameters than available sensor measurements. A systematic approach is applied to produce a model tuning parameter vector of appropriate dimension to enable estimation by a Kalman filter, while minimizing the estimation error in the parameters of interest. Tuning parameter selection is performed using a multi-variable iterative search routine which seeks to minimize the theoretical mean-squared estimation error. This paper derives theoretical Kalman filter estimation error bias and variance values at steady-state operating conditions, and presents the tuner selection routine applied to minimize these values. Results from the application of the technique to an aircraft engine simulation are presented and compared to the conventional approach of tuner selection. Experimental simulation results are found to be in agreement with theoretical predictions. The new methodology is shown to yield a significant improvement in on-line engine performance estimation accuracy.
APA, Harvard, Vancouver, ISO, and other styles
3

Smart, Lucinda, Yanping Li, J. Bruce Nestleroth, and Suzanne Ward. "Interaction Rule Guidance for Corrosion Features Reported by ILI." In 2018 12th International Pipeline Conference. American Society of Mechanical Engineers, 2018. http://dx.doi.org/10.1115/ipc2018-78284.

Full text
Abstract:
Corrosion anomalies which reduce the strength of the pipeline must be mitigated appropriately. When corrosion defects have varying morphologies it is not always simple to determine the point at which the corrosion region becomes a safety concern, particularly for complex corrosion areas where multiple corrosion anomalies may interact with one another. Therefore, understanding how various anomalies may interact is important to determining the overall remaining strength of a pipeline under pressure. Many criteria for this spacing and how to apply the rules are recommended in the literature and have been studied either as the focus or periphery by several more, but no single criterion is provided as regulation. The task is left to the pipeline operator to choose the interaction rule for what is defined as ‘closely spaced corrosion.’ The method by which the failure pressure is calculated should be considered as varying levels of conservatism are inherent in these assessments. Recommendations for interaction guidelines have been determined by either empirical or analytical approaches. The empirical approaches may be limited when an insufficient number and variety of pipes can be burst tested. Many analytical approaches are based upon relationships of remaining wall and simple corrosion morphologies which may not be applicable to real world corrosion. The source of the corrosion anomaly data is an important variable when selecting and applying interaction rules. In-line inspections (ILI) are the most common methods by which to obtain corrosion anomaly data, but each technology has an inherent measurement error and bias which should be considered. This paper will go into detail on each of the items discussed, present the current state of research into this subject in the industry, and will present a general recommendation for selection of an interaction criterion for corrosion features reported by ILI.
APA, Harvard, Vancouver, ISO, and other styles
4

Lall, Amrita, Hamid Khakpour Nejadkhaki, and John Hall. "An Integrative Framework for Design and Control Optimization of a Variable-Ratio Gearbox in a Wind Turbine With Active Blades." In ASME 2016 International Design Engineering Technical Conferences and Computers and Information in Engineering Conference. American Society of Mechanical Engineers, 2016. http://dx.doi.org/10.1115/detc2016-60244.

Full text
Abstract:
A variable ratio gearbox (VRG) provides discrete variable rotor speed operation, and thus increases wind capture, for small fixed-speed wind turbines. It is a low-cost, reliable alternative to conventional variable speed operation, which requires special power-conditioning equipment. The authors’ previous work has demonstrated the benefit of using a VRG in a fixed-speed system with passive blades. This work characterizes the performance of the VRG when used with active blades. The main contribution of the study is an integrative design framework that maximizes power production while mitigating stress in the blade root. As part of the procedure, three gear ratios are selected for the VRG. It establishes the control rules by defining the gear ratio and pitch angle used in relation to wind speed and mechanical torque. A 300 kW wind turbine model is used for a case study that demonstrates how the framework is implemented. The model consists of aerodynamic, mechanical, and electrical submodels, which work collaboratively to convert kinetic air to electrical power. Blade element momentum theory is used in the aerodynamic model to compute the blade loads. The resulting torque is passed through a mechanical system and subsequently to the induction machine model to generate power. The BEM method also provides the thrust and bending loads that contribute to blade-root stress. The stress in the root of the blade is also computed in response to these loads, as well as those caused by gravity and centrifugal force. Two case studies are performed using wind data that was obtained from the National Renewable Energy Laboratory (NREL). Each of these represents an installation site with a unique set of wind conditions that are used to customize the wind turbine design. The framework uses dynamic programming to simulate the performance of an exhaustive set of combinations. Each combination is evaluated over each set of recorded wind data. The combinations are evaluated in terms of the total energy and stress that is produced over the simulation period. Weights are applied to a multi-objective cost function that identifies the optimal design configurations with respect to the design objectives. As a final design step, a VRG combination is selected, and a control algorithm is established for each set of wind data. During operation, the cost function can also be used to bias the system towards higher power production or lower stress. The results suggest a VRG can improve wind energy production in Region 2 by roughly 10% in both the low and high wind regions. In both cases, stress is also increased in Region 2, as the power increases. However, the stress in Region 3 may be reduced for some wind speeds through the optimal selection of gear combinations.
APA, Harvard, Vancouver, ISO, and other styles
5

Xia, Rui, Zhenchun Pan, and Feng Xu. "Instance Weighting with Applications to Cross-domain Text Classification via Trading off Sample Selection Bias and Variance." In Twenty-Seventh International Joint Conference on Artificial Intelligence {IJCAI-18}. California: International Joint Conferences on Artificial Intelligence Organization, 2018. http://dx.doi.org/10.24963/ijcai.2018/624.

Full text
Abstract:
Domain adaptation is an important problem in natural language processing (NLP) due to the distributional difference between the labeled source domain and the target domain. In this paper, we study the domain adaptation problem from the instance weighting perspective. By using density ratio as the instance weight, the traditional instance weighting approaches can potentially correct the sample selection bias in domain adaptation. However, researchers often failed to achieve good performance when applying instance weighting to domain adaptation in NLP and many negative results were reported in the literature. In this work, we conduct an in-depth study on the causes of the failure, and find that previous work only focused on reducing the sample selection bias, but ignored another important factor, sample selection variance, in domain adaptation. On this basis, we propose a new instance weighting framework by trading off two factors in instance weight learning. We evaluate our approach on two cross-domain text classification tasks and compare it with eight instance weighting methods. The results prove our approach's advantages in domain adaptation performance, optimization efficiency and parameter stability.
APA, Harvard, Vancouver, ISO, and other styles
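The instance-weighting idea in the abstract, using a density ratio to correct sample selection bias between source and target domains, can be sketched with a probabilistic domain classifier as the ratio estimator. The code below is a minimal covariate-shift example on synthetic data, not the authors' bias/variance trade-off framework or their text-classification experiments.

# Density-ratio instance weighting for covariate shift: a domain classifier
# (target vs. source) supplies importance weights for source-domain training.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(11)

def make_domain(center, n):
    x = rng.normal(loc=[center, 0.0], scale=1.0, size=(n, 2))
    y = (np.abs(x[:, 0]) < 1.0).astype(int)      # the same labeling rule in both domains
    return x, y

X_src, y_src = make_domain(-1.0, 4_000)          # labeled source domain
X_tgt, y_tgt = make_domain(+1.0, 4_000)          # target domain (labels used only for scoring)

# Estimate the density ratio p_target(x) / p_source(x) via P(domain = target | x).
X_dom = np.vstack([X_src, X_tgt])
dom = np.r_[np.zeros(len(X_src)), np.ones(len(X_tgt))]
p_tgt = LogisticRegression().fit(X_dom, dom).predict_proba(X_src)[:, 1]
weights = p_tgt / (1.0 - p_tgt)

plain = LogisticRegression(max_iter=1000).fit(X_src, y_src)
weighted = LogisticRegression(max_iter=1000).fit(X_src, y_src, sample_weight=weights)
print("target accuracy, unweighted source model:", round(plain.score(X_tgt, y_tgt), 3))
print("target accuracy, instance-weighted model:", round(weighted.score(X_tgt, y_tgt), 3))

Under the invented shift above, the unweighted model generalises poorly to the target region, while the weighted fit should recover most of the lost accuracy; the paper's contribution is balancing this bias correction against the variance that extreme weights introduce.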
6

McShane, Michael J., Gerard L. Cote, and Clifford H. Spiegelman. "Variable selection for quantitative determination of glucose concentration with near-infrared spectroscopy." In BiOS '97, Part of Photonics West, edited by Alexander V. Priezzhev, Toshimitsu Asakura, and Robert C. Leif. SPIE, 1997. http://dx.doi.org/10.1117/12.273614.

Full text
APA, Harvard, Vancouver, ISO, and other styles
7

Bronson, Robert J., Hans R. Depold, Ravi Rajamani, Somnath Deb, William H. Morrison, and Krishna R. Pattipati. "Data Normalization for Engine Health Monitoring." In ASME Turbo Expo 2005: Power for Land, Sea, and Air. ASMEDC, 2005. http://dx.doi.org/10.1115/gt2005-68039.

Full text
Abstract:
In this paper we present a systematic data-driven parameter correction and estimation process consisting of outlier detection and removal, relevant input parameter selection, advanced statistical and empirical correlation, and prediction fusion to reduce variance in relevant engine parameter estimates. We model engine parameter deviations from nominal, and show that these methods can result in significant reductions in bias and variance modeling errors. Reducing the error variance increases the signal-to-noise ratio, thereby increasing the reliability and speed of fault-detection algorithms. The overall objective function is to reduce the measurement variances without masking faults. Key parameters modeled include fuel flow, rotor speed(s), and measured temperatures.
APA, Harvard, Vancouver, ISO, and other styles
8

He, Shuangchi, and Jitendra Tugnait. "On Bias-Variance Trade-Off in Superimposed Training-Based Doubly Selective Channel Estimation." In 2006 40th Annual Conference on Information Sciences and Systems. IEEE, 2006. http://dx.doi.org/10.1109/ciss.2006.286666.

Full text
APA, Harvard, Vancouver, ISO, and other styles
9

Henriksson, M., S. Borguet, O. Léonard, and T. Grönstedt. "On Inverse Problems in Turbine Engine Parameter Estimation." In ASME Turbo Expo 2007: Power for Land, Sea, and Air. ASMEDC, 2007. http://dx.doi.org/10.1115/gt2007-27756.

Full text
Abstract:
This paper extends previous work on model order reduction based on singular value decomposition. It is shown how the decrease in estimator variance must be balanced against the bias on the estimate inevitably introduced by solving the inverse problem in a reduced order space. A proof for the decrease in estimator variance by means of multi-point analysis is provided. The proof relies on comparing the Cramer-Rao lower bound of the single point and the multi-point estimators. Model order selection is discussed in the presence of a varying degree of a priori parameter information, through the use of a regularization parameter. Simulation results on the SR-30 turbojet engine indicate that the theoretically attainable multi-point improvements are difficult to realize in practical jet engine applications.
APA, Harvard, Vancouver, ISO, and other styles
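The trade-off the abstract describes (reducing the model order lowers estimator variance but introduces bias) can be reproduced with a generic linear inverse problem. The sketch below is not the authors' engine model; it is a small numpy illustration with an invented ill-conditioned system, comparing the full least-squares solution with rank-truncated SVD solutions over many noise realizations.

# Bias/variance trade-off of truncated-SVD solutions to an ill-conditioned inverse problem.
import numpy as np

rng = np.random.default_rng(12)
m, p = 40, 10
U, _ = np.linalg.qr(rng.normal(size=(m, p)))
V, _ = np.linalg.qr(rng.normal(size=(p, p)))
s = np.logspace(0, -3, p)                       # rapidly decaying singular values
A = U @ np.diag(s) @ V.T                        # ill-conditioned forward model
x_true = rng.normal(size=p)

def solve(b, rank):
    Ur, sr, Vr = U[:, :rank], s[:rank], V[:, :rank]
    return Vr @ ((Ur.T @ b) / sr)               # truncated-SVD pseudo-inverse solution

for rank in (p, 6, 3):
    estimates = np.array([solve(A @ x_true + 0.02 * rng.normal(size=m), rank)
                          for _ in range(2_000)])
    bias2 = ((estimates.mean(axis=0) - x_true) ** 2).sum()
    var = estimates.var(axis=0).sum()
    # (the small nonzero bias^2 at full rank is Monte Carlo error)
    print(f"rank {rank:2d}: bias^2 = {bias2:10.4f}   variance = {var:10.4f}")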
10

Gormez, Z., O. Kursun, A. Sertbas, N. Aydin, and H. Seker. "Statistical bias and variance of gene selection and cross validation methods: A case study on hypertension prediction." In 2012 IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI). IEEE, 2012. http://dx.doi.org/10.1109/bhi.2012.6211658.

Full text
APA, Harvard, Vancouver, ISO, and other styles

Reports on the topic "Variable selection bias"

1

Brown, H. C., K. Ganesan, and R. K. Dhar. Enolboration 3. An Examination of the Effect of Variable Steric Requirements of R on the Stereoselective Enolboration of Ketones with R2BCl/Et3N. Bis(Bicyclo(2.2.2)Octyl)Chloroborane/Triethylamine - A New Reagent Which Achieves the Selective Generation of E Enolborinates from Representative Ketones. Fort Belvoir, VA: Defense Technical Information Center, April 1992. http://dx.doi.org/10.21236/ada250066.

Full text
APA, Harvard, Vancouver, ISO, and other styles