Dissertations / Theses on the topic 'Rang et sélection (statistique)'
Consult the top 50 dissertations / theses for your research on the topic 'Rang et sélection (statistique).'
Meunier, Hervé. "Algorithmes évolutionnaires parallèles pour l'optimisation multi-objectif de réseaux de télécommunications mobiles." Lille 1, 2002. https://pepite-depot.univ-lille.fr/RESTREINT/Th_Num/2002/50376-2002-93.pdf.
Full text
Chambaz, Antoine. "Segmentation spatiale et sélection de modèle : théorie et applications statistiques." Paris 11, 2003. http://www.theses.fr/2003PA112012.
Full text
We tackle in this thesis the elaboration of an original method that refines the localization of mobile telecommunication traffic in urban areas for France Télécom R&D. This work involves both practical and theoretical developments. Our point of view is statistical in nature; the major themes are spatial segmentation and model selection. We first introduce the various datasets from which our approach stems; they cast some light on the original problem. We motivate the choice of a heteroscedastic regression model. We then present a practical nonparametric regression method based on CART regression trees and their Bagging and Boosting extensions by resampling. These classical methods are designed for homoscedastic models; we propose an adaptation to heteroscedastic ones, including an original analysis of variable importance. We apply the method to various traffic datasets and comment on the final results. This practical work motivates the theoretical study of the consistency of a family of estimators of the order of a segmented model and of its associated segmentation. We also address, in a general framework of model selection within a nested family of models, the estimation of the order of a model. We are particularly concerned with consistency properties and rates of under- or overestimation. We tackle the problem with a linear functional approach, i.e. an approach where the events of interest are described as events concerning the empirical measure. This allows us to derive general results that gather and enhance earlier ones. A wide range of techniques is involved: classical arguments of M-estimation, concentration and maximal inequalities for dependent variables, Stein's lemma, penalization, Large and Moderate Deviation Principles for the empirical measure, and an "à la Huber" trick.
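The Bagging extension mentioned above rests on resampling: fit the base learner on bootstrap copies of the training set and average the resulting predictions. A minimal NumPy sketch of that generic idea (not the thesis's heteroscedastic adaptation of CART; the polynomial base learner and all parameter values are illustrative assumptions):

```python
import numpy as np

def bagged_predict(x_train, y_train, x_test, fit, predict, B=25, seed=0):
    # Bagging: fit the base learner on B bootstrap resamples of the
    # training set, then average the B predictions on the test points.
    rng = np.random.default_rng(seed)
    n = len(y_train)
    preds = []
    for _ in range(B):
        b = rng.integers(0, n, size=n)       # bootstrap sample (with replacement)
        model = fit(x_train[b], y_train[b])
        preds.append(predict(model, x_test))
    return np.mean(preds, axis=0)
```

Any `fit`/`predict` pair can serve as the base learner; with a tree learner instead of the polynomial used below, this recovers the usual bagged-trees setting.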
Bi, Duyan. "Segmentation d'images basée sur les statistiques de rangs des niveaux de gris." Tours, 1997. http://www.theses.fr/1997TOUR4005.
Full text
Savalle, Pierre-André. "Interactions entre rang et parcimonie en estimation pénalisée, et détection d'objets structurés." Thesis, Châtenay-Malabry, Ecole centrale de Paris, 2014. http://www.theses.fr/2014ECAP0051/document.
Full text
This thesis is organized in two independent parts. The first part focuses on convex matrix estimation problems where rank and sparsity are taken into account simultaneously. In the context of graphs with community structure, a common assumption is that the underlying adjacency matrices are block-diagonal in an appropriate basis. However, such graphs are usually far from complete, so their adjacency representations are also inherently sparse. This suggests that combining the sparsity hypothesis with the low-rank hypothesis may allow such objects to be modeled more accurately. To this end, we propose and analyze a convex penalty that promotes both low rank and high sparsity at the same time. Although the low-rank hypothesis reduces over-fitting by decreasing the modeling capacity of a matrix model, the opposite may be desirable when enough data is available. We study such an example in the context of localized multiple kernel learning, which extends multiple kernel learning by allowing each kernel to select different support vectors. In this framework, multiple kernel learning corresponds to a rank-one estimator, while higher-rank estimators have been observed to improve generalization performance. We propose a novel family of large-margin methods for this problem that, unlike previous methods, are both convex and theoretically grounded. The second part of the thesis is about detecting objects or signals that exhibit combinatorial structure, and we present two such problems. First, we consider detection in the statistical hypothesis testing sense, in models where anomalous signals correspond to correlated values at different sensors. In most existing work, detection procedures are provided with a full sample from all the sensors. However, the experimenter may have the capacity to make targeted measurements in an on-line and adaptive manner, and we investigate such adaptive sensing procedures.
Finally, we consider the task of identifying and localizing objects in images. This is an important problem in computer vision, where hand-crafted features are usually used. Following recent successes in learning ad-hoc representations for similar problems, we integrate the method of deformable part models with high-dimensional features from convolutional neural networks, and show that this significantly decreases the error rates of existing part-based models.
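The first part above combines a sparsity assumption with a low-rank assumption on a matrix. A minimal NumPy sketch of that combination, assuming a simple two-step shrinkage heuristic rather than the convex penalty actually analyzed in the thesis: soft-threshold the entries (sparsity), then soft-threshold the singular values (rank).

```python
import numpy as np

def soft_threshold(x, t):
    # Entrywise soft-thresholding: the proximal operator of the l1 norm.
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def sparse_low_rank_denoise(M, t_l1, t_nuc):
    # Heuristic two-step shrinkage: threshold the entries to promote
    # sparsity, then shrink the singular values to promote low rank.
    S = soft_threshold(M, t_l1)
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    return U @ np.diag(soft_threshold(sv, t_nuc)) @ Vt
```

On a noisy block-diagonal adjacency matrix (the community-structure setting described above), this recovers a matrix whose rank matches the number of blocks.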
Challita, Nicole. "Contributions à la sélection des attributs de signaux non stationnaires pour la classification." Thesis, Troyes, 2018. http://www.theses.fr/2018TROY0012.
Full text
To monitor the functioning of a system, the number of measurements and attributes can now be very large, but it is desirable to reduce the size of the problem by keeping only the discriminating features, both to learn the monitoring rule and to reduce the processing demand. The problem is therefore to select a subset of attributes that yields the best possible classification performance. This dissertation reviews existing methods for feature selection and proposes two new ones. The first, named "EN-ReliefF", combines the sequential ReliefF method with a weighted regression approach, the Elastic Net. The second is inspired by neural networks: it is formulated as an optimization problem that simultaneously defines a non-linear regression fitted to the learning data and a parsimonious weighting of the features; the weights are then used to select the relevant features. Both methods are tested on synthetic data and on data from rotating machines. Experimental results show the effectiveness of both methods: notable characteristics are the stability of selection and the ability to handle linearly correlated attributes for "EN-ReliefF", and the sensitivity and ability to handle non-linear dependencies for the second method.
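For orientation, a minimal sketch of the Relief idea that "EN-ReliefF" builds on, assuming binary labels and L1 neighborhoods; this is the generic weighting scheme, not the thesis's combination with the Elastic Net:

```python
import numpy as np

def relief_weights(X, y):
    # Basic Relief for binary labels: a feature gains weight when it
    # separates a sample from its nearest "miss" (other class) and
    # loses weight when it differs from its nearest "hit" (same class).
    n, d = X.shape
    w = np.zeros(d)
    for i in range(n):
        dist = np.abs(X - X[i]).sum(axis=1)
        dist[i] = np.inf                       # exclude the sample itself
        same = (y == y[i])
        hit = np.argmin(np.where(same, dist, np.inf))
        miss = np.argmin(np.where(same, np.inf, dist))
        w += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
    return w / n
```

Features with large weights discriminate between the classes; a threshold or a top-k rule then performs the selection.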
Boisbunon, Aurélie. "Sélection de modèle : une approche décisionnelle." Phd thesis, Université de Rouen, 2013. http://tel.archives-ouvertes.fr/tel-00793898.
Full text
Estampes, Ludovic d'. "Traitement statistique des processus alpha-stables : mesures de dépendance et identification des AR stables : tests séquentiels tronqués." Toulouse, INPT, 2003. http://www.theses.fr/2003INPT031H.
Full textKalakech, Mariam. "Sélection semi-supervisée d'attributs : application à la classification de textures couleur." Thesis, Lille 1, 2011. http://www.theses.fr/2011LIL10018/document.
Full text
Within the framework of this thesis, we are interested in feature selection methods based on graph theory in unsupervised, semi-supervised and supervised learning contexts. We are particularly interested in feature ranking scores based on must-link and cannot-link constraints. Indeed, these constraints are easy to obtain in real applications: they only require stating, for two data samples, whether they are similar and should therefore be grouped together or not, without detailed information on the classes to be found. Constraint scores have shown good performance for semi-supervised feature selection. However, these scores strongly depend on the must-link and cannot-link subsets built by the user. We therefore propose a new semi-supervised constraint score that uses both pairwise constraints and local properties of the unconstrained data. Experiments on artificial and real databases show that this new score is less sensitive to the given constraints than previous scores while providing similar performance. Semi-supervised feature selection was also successfully applied to color texture classification: among the many texture features that can be extracted from color images, it is necessary to select the most relevant ones to improve the quality of classification.
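A minimal sketch of a constraint score of the kind discussed above, assuming a simple ratio of must-link to cannot-link squared differences; the score proposed in the thesis additionally exploits local properties of the unconstrained data:

```python
import numpy as np

def constraint_score(X, must_link, cannot_link):
    # Per-feature constraint score: mean squared difference over
    # must-link pairs divided by that over cannot-link pairs.
    # A relevant feature varies little within must-link pairs and a
    # lot across cannot-link pairs, so lower scores are better.
    ml = np.mean([(X[i] - X[j]) ** 2 for i, j in must_link], axis=0)
    cl = np.mean([(X[i] - X[j]) ** 2 for i, j in cannot_link], axis=0)
    return ml / (cl + 1e-12)
```

Ranking the features by increasing score and keeping the first k yields the selected subset.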
Olteanu, Madalina. "Modèles à changements de régime : applications aux données financières." Phd thesis, Université Panthéon-Sorbonne - Paris I, 2006. http://tel.archives-ouvertes.fr/tel-00133132.
Full text
We propose to study these questions through two approaches. The first consists in proving the weak consistency of a penalized maximum likelihood estimator under stationarity and weak-dependence conditions. The assumptions introduced on the bracketing entropy of the class of generalized score functions are then verified in a linear Gaussian framework. The second, more empirical, approach stems from unsupervised classification methods and combines Kohonen maps with a hierarchical classification, for which a new dispersion measure based on the residual sum of squares is introduced.
Reynaud-Bouret, Patricia. "Estimation adaptative de l'intensité de certains processus ponctuels par sélection de modèle." Phd thesis, Paris 11, 2002. http://tel.archives-ouvertes.fr/tel-00081412.
Full textde sélection de modèle au cadre particulier de l'estimation d'intensité de
processus ponctuels. Plus précisément, nous voulons montrer que les
estimateurs par projection pénalisés de l'intensité sont adaptatifs soit dans
une famille d'estimateurs par projection, soit pour le risque minimax. Nous
nous sommes restreints à deux cas particuliers : les processus de Poisson
inhomogènes et les processus de comptage à intensité
multiplicative d'Aalen.
Dans les deux cas, nous voulons trouver une inégalité de type
oracle, qui garantit que les estimateurs par projection pénalisés ont un risque
du même ordre de grandeur que le meilleur estimateur par projection pour une
famille de modèles donnés. La clé qui permet de prouver des inégalités de
type oracle est le phénomène de concentration de la mesure ou plus précisément
la connaissance d'inégalités exponentielles, qui permettent de contrôler en
probabilité les déviations de statistiques de type khi-deux au dessus de leur
moyenne. Nous avons prouvé deux types d'inégalités de concentration. La
première n'est valable que pour les processus de Poisson. Elle est comparable
en terme d'ordre de grandeur à l'inégalité de M. Talagrand pour les suprema de
processus empiriques. La deuxième est plus grossière mais elle est valable
pour des processus de comptage beaucoup plus généraux.
Cette dernière inégalité met en oeuvre des techniques de
martingales dont nous nous sommes inspirés pour prouver des inégalités de
concentration pour des U-statistiques dégénérées d'ordre 2 ainsi que pour des
intégrales doubles par rapport à une mesure de Poisson recentrée.
Nous calculons aussi certaines bornes inférieures pour les
risques minimax et montrons que les estimateurs par projection pénalisés
atteignent ces vitesses.
Lehéricy, Luc. "Estimation adaptative pour les modèles de Markov cachés non paramétriques." Thesis, Université Paris-Saclay (ComUE), 2018. http://www.theses.fr/2018SACLS550/document.
Full text
During my PhD, I have been interested in the theoretical properties of nonparametric hidden Markov models. Nonparametric models avoid the loss of performance coming from an inappropriate choice of parametrization, hence a recent interest in applications. In a first part, I focus on estimating the number of hidden states. I introduce two consistent estimators: the first is based on a penalized least squares criterion, and the second on a spectral method. Once the order is known, it is possible to estimate the other parameters. In a second part, I consider two adaptive estimators of the emission distributions. Adaptivity means that their rate of convergence adapts to the regularity of the target distribution. Contrary to existing methods, these estimators adapt to the regularity of each distribution instead of only the worst regularity. The third part focuses on the misspecified setting, that is, when the observations may not come from a hidden Markov model. I control the prediction error of the maximum likelihood estimator when the true distribution satisfies general forgetting and mixing assumptions. Finally, I introduce a nonhomogeneous variant of hidden Markov models, hidden Markov models with trends, and show that the maximum likelihood estimator of such models is consistent.
Naveau, Marion. "Procédures de sélection de variables en grande dimension dans les modèles non-linéaires à effets mixtes. Application en amélioration des plantes." Electronic Thesis or Diss., université Paris-Saclay, 2024. http://www.theses.fr/2024UPASM031.
Full text
Mixed-effects models analyze observations collected repeatedly from several individuals, attributing variability to different sources (intra-individual, inter-individual, residual). Accounting for this variability is essential to characterize the underlying biological mechanisms without bias. These models use covariates and random effects to describe variability among individuals: covariates explain differences due to observed characteristics, while random effects represent the variability not attributable to measured covariates. In a high-dimensional context, where the number of covariates exceeds the number of individuals, identifying influential covariates is challenging, as selection bears on latent variables of the model. Many procedures have been developed for linear mixed-effects models, but contributions for non-linear models are rare and lack theoretical foundations. This thesis aims to develop a high-dimensional covariate selection procedure for non-linear mixed-effects models by studying its practical implementation and theoretical properties. The procedure is based on a Gaussian spike-and-slab prior and the SAEM algorithm (Stochastic Approximation of Expectation-Maximization). Posterior contraction rates around true parameter values in a non-linear mixed-effects model under a discrete spike-and-slab prior are obtained, comparable to those observed in linear models. The work in this thesis is motivated by practical questions in plant breeding, where these models describe plant development as a function of genotypes and environmental conditions. The covariates considered are generally numerous, since varieties are characterized by thousands of genetic markers, most of which have no effect on certain phenotypic traits. The statistical method developed in the thesis is applied to a real dataset related to this application.
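A minimal sketch of a Gaussian spike-and-slab prior of the kind mentioned above, with illustrative hyperparameter values; the discrete spike-and-slab variant and the SAEM machinery of the thesis are not reproduced here:

```python
import numpy as np

def _gauss(x, s):
    # Centered Gaussian density with standard deviation s.
    return np.exp(-0.5 * (x / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

def spike_slab_logpdf(beta, w=0.1, s_spike=0.01, s_slab=1.0):
    # Gaussian spike-and-slab prior: a mixture of a narrow "spike"
    # (coefficient essentially null) and a wide "slab" (active
    # coefficient), with prior inclusion probability w.
    return np.log((1.0 - w) * _gauss(beta, s_spike) + w * _gauss(beta, s_slab))

def inclusion_probability(beta, w=0.1, s_spike=0.01, s_slab=1.0):
    # Posterior probability that beta came from the slab component.
    slab = w * _gauss(beta, s_slab)
    return slab / (slab + (1.0 - w) * _gauss(beta, s_spike))
```

Coefficients with inclusion probability close to one are retained; the spike absorbs the many markers with no effect.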
Paolillo, José. "L'institutionnalisation du discours sur l'Université de rang mondial dans le système d'enseignement supérieur Péruvien : le cas de l’Université Catholique Santo Toribio de Mogrovejo au Chiclayo." Electronic Thesis or Diss., Université de Montpellier (2022-....), 2024. http://www.theses.fr/2024UMOND009.
Full text
We explain the construction of the World-Class University (WCU) concept and its particular relationship with international university rankings, and we identify and describe the main actors that make up the global scenario. Particular emphasis is placed on the so-called "Big Three" because of the crucial importance of their diffusion in the institutionalization of the concept. We likewise examine the implications of the concept at the national level (Peru), and then at the micro-organizational level (the Catholic University "Santo Toribio de Mogrovejo" in the city of Chiclayo, Peru). We present a review of the literature related to the concept, starting from studies carried out at the global level, then those at the Latin American level, and ending with those at the Peruvian national level. Next, we present the conceptual framework of neo-institutional theory (NIT), which helps us clarify institutionalization through its discursive variant. We highlight the arguments that led us to adopt this theory, based on the founding articles and their perspectives. Subsequently, we give a brief overview of the discursive trajectory, as well as a model of discursive institutionalization built on the relationships between actions, texts, discourse and institutions. Finally, we present our research questions, which range from the international level to the scope of the Catholic University "Santo Toribio de Mogrovejo" (USAT), through discourse interpretation at the national level. In the second part, we ground our selection decision in qualitative research and rely on a single-case design. We start from our epistemological position, placing our research within the interpretivist paradigm. We then explain the reasons for choosing a qualitative methodology and integrate Langley's (1999) longitudinal approach into our work.
We describe in detail the data collection, as well as the interpretation and analysis considerations: immersion, interviews, the conduct of interviews, the selection of interviewed actors, and the collection of primary and secondary data. Finally, our results suggest two important aspects: the identification of particular structures of institutionalization of the WCU concept at each level of analysis (international, national and organizational), and the identification of levers outside our analysis model coming from the field, namely quality, language and economic resources.
Makkhongkaew, Raywat. "Semi-supervised co-selection : instances and features : application to diagnosis of dry port by rail." Thesis, Lyon, 2016. http://www.theses.fr/2016LYSE1341.
Full text
We are drowning in massive data but starved for knowledge retrieval. It is well known, through the dimensionality tradeoff, that more data bring more information but at a price in computational complexity, which has to be paid in some way. When the labeled sample is too small to bring sufficient information about the target concept, supervised learning faces a serious challenge. Unsupervised learning can be an alternative, but since such algorithms ignore label information, important hints from the labeled data are left out, which generally degrades performance. Using both labeled and unlabeled data, as in semi-supervised learning, is expected to perform better; it is well adapted to large-scale applications where labels are hard and costly to obtain. In addition, when data are large, feature selection and instance selection are two important dual operations for removing irrelevant information. Combined with semi-supervised learning, both tasks pose distinct challenges to the machine learning and data mining communities, for dimensionality reduction and knowledge retrieval. In this thesis, we focus on the co-selection of instances and features in the context of semi-supervised learning. In this context, co-selection becomes a more challenging problem, as the data contain labeled and unlabeled examples sampled from the same population. For such semi-supervised co-selection, we propose two unified frameworks that efficiently integrate the labeled and unlabeled parts into the co-selection process. The first framework is based on weighted constrained clustering and the second on similarity-preserving selection. Both approaches evaluate the usefulness of features and instances in order to select the most relevant ones simultaneously. Finally, we present a variety of empirical studies over high-dimensional data sets that are well known in the literature.
The results are promising and prove the efficiency and effectiveness of the proposed approaches. In addition, the developed methods are validated on a real-world application, over data provided by the State Railway of Thailand (SRT). The purpose is to build application models from our methodological contributions to diagnose the performance of rail dry port systems. First, we present the results of some ensemble methods applied to a first, fully labeled data set. Second, we show how our co-selection approaches can improve the performance of learning algorithms over partially labeled data provided by the SRT.
Arlot, Sylvain. "Rééchantillonnage et Sélection de modèles." Phd thesis, Université Paris Sud - Paris XI, 2007. http://tel.archives-ouvertes.fr/tel-00198803.
Full text
Most of this thesis is devoted to the precise calibration of model selection methods that are optimal in practice, for the prediction problem. We study V-fold cross-validation (very commonly used, but poorly understood in theory, in particular regarding the choice of V) and several penalization methods. We propose methods for the precise calibration of penalties, both in their general form and in their multiplicative constants. Resampling makes it possible to solve difficult problems, notably regression with a variable noise level. We validate these methods theoretically from a non-asymptotic point of view, by proving oracle inequalities and adaptation properties. These results rely, among other things, on concentration inequalities.
A second problem we address is that of confidence regions and multiple testing when high-dimensional observations with general, unknown correlations are available. Resampling methods make it possible to escape the curse of dimensionality and to "learn" these correlations. We mainly propose two methods, and for each we prove a non-asymptotic control of its level.
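A minimal sketch of V-fold cross-validation as a model selection device, one of the methods studied above; the polynomial model family, the squared loss and V = 5 are illustrative assumptions:

```python
import numpy as np

def vfold_risk(x, y, degree, V=5, seed=0):
    # Estimate the prediction risk (squared loss) of a polynomial model
    # of the given degree by V-fold cross-validation.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, V)
    risks = []
    for k in range(V):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(V) if j != k])
        coefs = np.polyfit(x[train], y[train], degree)   # fit on V-1 folds
        risks.append(np.mean((y[test] - np.polyval(coefs, x[test])) ** 2))
    return float(np.mean(risks))
```

Selecting the model with the smallest cross-validated risk is the procedure whose behavior (notably the choice of V) the thesis analyzes.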
Genuer, Robin. "Forêts aléatoires : aspects théoriques, sélection de variables et applications." Phd thesis, Université Paris Sud - Paris XI, 2010. http://tel.archives-ouvertes.fr/tel-00550989.
Full textSidi, Zakari Ibrahim. "Sélection de variables et régression sur les quantiles." Thesis, Lille 1, 2013. http://www.theses.fr/2013LIL10081/document.
Full text
This work is a contribution to the selection of statistical models, and more specifically to variable selection in penalized linear quantile regression when the dimension is high. It focuses on two points of the selection process: the stability of selection and the inclusion of variables through the grouping effect. As a first contribution, we propose a transition from penalized least squares regression to quantile regression (QR). A bootstrap approach based on the selection frequency of each variable is proposed for the construction of linear models (LM); in most cases, the QR approach provides more significant coefficients. A second contribution is to adapt some algorithms of the "random" LASSO (Least Absolute Shrinkage and Selection Operator) family to QR and to propose selection stability methods. Examples from food security illustrate the results obtained. In the setting of penalized QR in high dimension, the grouping effect property is established under weak conditions, together with oracle properties. Two examples on real and simulated data illustrate the regularization paths of the proposed algorithms. The last contribution deals with variable selection for generalized linear models (GLM) using nonconcave penalized likelihood. We propose an algorithm to maximize the penalized likelihood for a broad class of non-convex penalty functions. The convergence of the algorithm and the oracle property of the estimator obtained after one iteration are established. Simulations and an application to real data are also presented.
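Quantile regression replaces the squared loss with the pinball (check) loss. A minimal NumPy sketch of that loss and of the fact that minimizing it recovers an empirical quantile; the grid search over a scalar intercept is an illustrative simplification, not the penalized high-dimensional QR studied above:

```python
import numpy as np

def check_loss(u, tau):
    # Pinball (check) loss: the criterion minimized in quantile regression.
    return np.where(u >= 0, tau * u, (tau - 1.0) * u)

def fit_scalar_quantile(y, tau, grid):
    # Minimizing the mean check loss over candidate values recovers the
    # empirical tau-quantile (up to the grid resolution).
    risks = [np.mean(check_loss(y - b, tau)) for b in grid]
    return float(grid[int(np.argmin(risks))])
```

Adding an l1 penalty on regression coefficients to this criterion gives the penalized QR setting of the thesis.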
Mbina, Mbina Alban. "Contributions à la sélection des variables en statistique multidimensionnelle et fonctionnelle." Thesis, Lille 1, 2017. http://www.theses.fr/2017LIL10102/document.
Full text
This thesis focuses on variable selection in linear models and in the additive functional linear model. More precisely, we propose three variable selection methods. The first concerns the selection of continuous variables in a multidimensional linear model; a comparative study based on prediction loss shows that our method performs better than the method of An et al. (2013). Secondly, we propose a new selection method for mixed variables (a mix of discrete and continuous variables). This method generalizes, to the mixed framework, the method of NKIET (2012); more precisely, it is based on a generalization of a linear canonical invariance criterion to the framework of discrimination with mixed variables. A comparative study based on the correct classification rate shows that our method is equivalent to the method of MAHAT et al. (2007) in the case of two groups. Thirdly, we propose an approach to variable selection in an additive functional linear model. A simulation study based on the Hausdorff distance illustrates our approach.
Lerasle, Matthieu. "Rééchantillonnage et sélection de modèles optimale pour l'estimation de la densité." Toulouse, INSA, 2009. http://eprint.insa-toulouse.fr/archive/00000290/.
Full textVerzelen, Nicolas. "Modèles graphiques gaussiens et sélection de modèles." Phd thesis, Université Paris Sud - Paris XI, 2008. http://tel.archives-ouvertes.fr/tel-00352802.
Full text
Using the link between graphical models and linear regression with a Gaussian design, we develop an approach based on model selection techniques. The resulting procedures are analyzed from a non-asymptotic point of view. In particular, we prove oracle inequalities and minimax adaptation properties that remain valid in high dimension. The practical performance of the statistical methods is illustrated on simulated data as well as on real data.
Nédélec, Elodie. "Quelques problèmes liés à la théorie statistique de l'apprentissage et applications." Paris 11, 2004. http://www.theses.fr/2004PA112297.
Full text
This thesis deals with three problems in learning theory. We observe a sample of (X,Y) satisfying the relation Y = s(X) + e, where e is centered conditionally on X. Our aim is to estimate s*, a function of the regression function s, with few assumptions on s. We use a minimum contrast procedure. We denote by F a set of functions such that s belongs to F, consider a collection of models and an empirical contrast g on F, and study the minimum contrast estimator on a fixed model. We define a loss function l(u,v) for all u, v in F in order to evaluate the quality of the minimum contrast estimators, and we are then interested in the risk of these estimators, defined as the expectation of the loss between s* and the estimators. We look for estimators with a low risk. In this thesis we study, for different examples in learning theory, the risk on one model and the model selection procedure.
Ben, Ishak Anis. "Sélection de variables par les machines à vecteurs supports pour la discrimination binaire et multiclasse en grande dimension." Aix-Marseille 2, 2007. http://www.theses.fr/2007AIX22067.
Full textHarel, Michel. "Convergence faible de la statistique linéaire de rang pour des variables aléatoires faiblement dépendantes et non stationnaires." Paris 11, 1989. http://www.theses.fr/1989PA112359.
Full text
This work consists of three parts. The first deals with the weak convergence of the corrected truncated empirical process for sequences of non-stationary random variables.
Vandewalle, Vincent. "Estimation et sélection en classification semi-supervisée." Phd thesis, Université des Sciences et Technologie de Lille - Lille I, 2009. http://tel.archives-ouvertes.fr/tel-00447141.
Full textViallefont, Valérie. "Analyses bayesiennes du choix de modèles en épidémiologie : sélection de variables et modélisation de l'hétérogénéité pour des évènements." Paris 11, 2000. http://www.theses.fr/2000PA11T023.
Full text
This dissertation has two separate parts. In the first part, we compare different strategies for variable selection in a multivariate logistic regression model. Covariate and confounder selection in case-control studies is often carried out using either a two-step method or a stepwise variable selection method. Inference is then carried out conditionally on the selected model, but this ignores the model uncertainty implicit in the variable selection process, and so underestimates uncertainty about relative risks. It is well known, and shown again in our study, that the p-values computed after variable selection can greatly overstate the strength of conclusions. We propose Bayesian Model Averaging (BMA) as a formal way of taking account of model uncertainty in a logistic regression context. The BMA method, which takes into account several models, each associated with its posterior probability, yields an easily interpreted summary: the posterior probability that a variable is a risk factor, and its estimate averaged over the set of models. We conduct two comparative simulation studies: the first has a simple design including only one risk factor and one confounder; the second mimics an epidemiological cohort study dataset, with a large number of potential risk factors. Our criteria are the mean bias, the rates of type I and type II errors, and the assessment of uncertainty in the results, which is both more accurate and more explicit under the BMA analysis. The methods are applied and compared in the context of a previously published case-control study of cervical cancer, and the choice of the prior distributions is discussed. In the second part, we focus on the modelling of rare events via a Poisson distribution, which sometimes reveals substantial over-dispersion, indicating that some unexplained discontinuity arises in the data. We suggest modelling this over-dispersion by a Poisson mixture.
In a hierarchical Bayesian model, the posterior distributions of the unknown quantities in the mixture (number of components, weights, and Poisson parameters) can be estimated by MCMC algorithms, including reversible jump algorithms, which allow the dimension of the mixture to vary. We focus on the difficulty of finding a weakly informative prior for the Poisson parameters: different priors are detailed and compared. The performance of the different moves created for changing dimension is then investigated. The model is extended by the introduction of covariates, with homogeneous or heterogeneous effects. Simulated data sets are designed for the different comparisons, and the model is finally illustrated in two different contexts: an ecological analysis of digestive cancer mortality along the coasts of France, and a dataset of accident counts at road junctions.
d'Estampes, Ludovic. "Traitement statistique des processus alpha-stables : mesures de dépendance et identification des AR stables. Tests séquentiels tronqués." Phd thesis, Institut National Polytechnique de Toulouse - INPT, 2003. http://tel.archives-ouvertes.fr/tel-00005216.
Full textBrault, Vincent. "Estimation et sélection de modèle pour le modèle des blocs latents." Thesis, Paris 11, 2014. http://www.theses.fr/2014PA112238/document.
Full textClassification aims at partitioning data sets into homogeneous subsets: the observations within a class are more similar to each other than to the observations of other classes. The problem is compounded when the statistician wants to obtain a cross-classification of the individuals and the variables. The latent block model uses a distribution for each crossing of an object class and a variable class, and observations are assumed to be independent conditionally on the choice of these classes. However, factorizing the joint distribution of the labels is impossible, obstructing the calculation of the log-likelihood and the use of the EM algorithm. Several methods and criteria exist to find these partitions, some frequentist, some Bayesian, some stochastic... In this thesis, we first proposed sufficient conditions to obtain the identifiability of the model. In a second step, we studied two algorithms proposed to counteract the problem of the EM algorithm: the VEM algorithm (Govaert and Nadif (2008)) and the SEM-Gibbs algorithm (Keribin, Celeux and Govaert (2010)). In particular, we analyzed the combination of both and highlighted why the algorithms degenerate (a term used to say that they return empty classes). By choosing priors wisely, we then proposed a Bayesian adaptation to limit this phenomenon. In particular, we used a Gibbs sampler and proposed a stopping criterion based on the statistic of Brooks and Gelman (1998). We also proposed an adaptation of the Largest Gaps algorithm (Channarond et al. (2012)). Adapting their proofs, we showed that the label and parameter estimators obtained are consistent when the numbers of rows and columns tend to infinity. Furthermore, we proposed a method to select the numbers of row and column classes, whose estimates are also consistent when the numbers of rows and columns are very large. To estimate the number of classes, we studied the ICL criterion (Integrated Completed Likelihood), for which we proposed an exact form.
After studying its asymptotic approximation, we proposed a BIC criterion (Bayesian Information Criterion), and we conjecture that the two criteria select the same results and that these estimates are consistent, a conjecture supported by theoretical and empirical results. Finally, we compared the different combinations and proposed a methodology for co-clustering.
Thouvenot, Vincent. "Estimation et sélection pour les modèles additifs et application à la prévision de la consommation électrique." Thesis, Université Paris-Saclay (ComUE), 2015. http://www.theses.fr/2015SACLS184/document.
Full textFrench electricity load forecasting has undergone major changes over the past decade. These changes are due, among other things, to the opening of the electricity market (and the economic crisis), which calls for the development of new automatic, time-adaptive prediction methods. The advent of innovative technologies also requires the development of such automatic methods, because we have to study thousands or tens of thousands of time series. We adopt a semi-parametric approach for time prediction based on additive models. We present an automatic procedure for covariate selection in an additive model. We combine the Group LASSO, which is selection consistent, with P-splines, which are estimation consistent. Our estimation and model selection results are valid without assuming that the norm of each of the true non-zero components is bounded away from zero, and only require that the norms of non-zero components converge to zero at a certain rate. Real applications to local and aggregate load forecasting are provided. Keywords: Additive Model, Group LASSO, Load Forecasting, Multi-stage estimator, P-Splines, Variable selection
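As a hedged illustration of the estimation-consistent building block, here is a penalized spline smoother in the spirit of P-splines, using a truncated-power basis and a simple ridge penalty. The thesis combines genuine B-splines with the Group LASSO; this sketch does neither, and all data and tuning values are invented.

```python
import numpy as np

def pspline_fit(x, y, n_knots=20, penalty=1e-3):
    """Penalized spline smoother: truncated-power cubic basis + ridge penalty.
    Illustrative stand-in for the P-splines used in additive models."""
    knots = np.linspace(x.min(), x.max(), n_knots + 2)[1:-1]
    design = lambda t: np.column_stack(
        [np.ones_like(t), t] + [np.clip(t - k, 0, None) ** 3 for k in knots])
    B = design(x)
    P = np.eye(B.shape[1])
    P[0, 0] = P[1, 1] = 0.0          # leave the polynomial part unpenalized
    coef = np.linalg.solve(B.T @ B + penalty * P, B.T @ y)
    return lambda t: design(t) @ coef

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 300))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 300)
f = pspline_fit(x, y)
err = np.sqrt(np.mean((f(x) - np.sin(2 * np.pi * x)) ** 2))
```

With 300 noisy observations of a sine curve, the penalized fit recovers the true function to well within the noise level.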
Grelaud, Aude. "Méthodes sans vraisemblance appliquées à l'étude de la sélection naturelle et à la prédiction de structure tridimensionnelle des protéines." Paris 9, 2009. https://portail.bu.dauphine.fr/fileviewer/index.php?doc=2009PA090048.
Full textGendre, Xavier. "Estimation par sélection de modèle en régression hétéroscédastique." Phd thesis, Université de Nice Sophia-Antipolis, 2009. http://tel.archives-ouvertes.fr/tel-00397608.
Full textLa première partie de cette thèse consiste dans l'étude du problème d'estimation de la moyenne et de la variance d'un vecteur gaussien à coordonnées indépendantes. Nous proposons une méthode de choix de modèle basée sur un critère de vraisemblance pénalisé. Nous validons théoriquement cette approche du point de vue non-asymptotique en prouvant des majorations de type oracle du risque de Kullback de nos estimateurs et des vitesses de convergence uniforme sur les boules de Hölder.
Un second problème que nous abordons est l'estimation de la fonction de régression dans un cadre hétéroscédastique à dépendances connues. Nous développons des procédures de sélection de modèle tant sous des hypothèses gaussiennes que sous des conditions de moment. Des inégalités oracles non-asymptotiques sont données pour nos estimateurs ainsi que des propriétés d'adaptativité. Nous appliquons en particulier ces résultats à l'estimation d'une composante dans un modèle de régression additif.
Roux, Camille. "Effets de la sélection naturelle et de l'histoire démographique sur les patrons de polymorphisme nucléaire : comparaisons interspécifiques chez Arabidopsis halleri et A. lyrata entre le fond génomique et deux régions cibles de la sélection." Thesis, Lille 1, 2010. http://www.theses.fr/2010LIL10157/document.
Full textThe dichotomous view of life has long been used to represent the diversity observed in nature. The recent expansion of sequence data has revealed large discrepancies between the phylogenies of genes and species, forming the so-called "mosaic structure" of genomes. This complex pattern is the result of different neutral and adaptive evolutionary processes shaping the diversity of life. These processes explain the shared polymorphism observed between two different species. Trans-specific polymorphism (TSP) is generated by neutral retention of ancestral polymorphism, introgression and genetic homoplasy. Functional TSP is the result of the same processes together with the effects of natural selection. While local adaptation of a species contributes to the reduction of TSP, natural selection may increase TSP in the case of balancing selection. Using the pair of closely related plant species Arabidopsis halleri and A. lyrata, we compared the patterns of polymorphism observed in the genomic background to those observed in the neighborhood of target regions of balancing selection, in order to measure the relative importance of selection and demography. Demographic analysis by ABC from the genomic background leads to the rejection of the hypothesis of recent migration between these two species, and supports the importance of the evolution of tolerance to heavy metals in the speciation process of A. halleri. Finally, by measuring the patterns of polymorphism around the S-locus, we showed that balancing selection affects the linked neutral polymorphism only very locally.
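The ABC rejection principle behind the demographic analysis can be sketched on a toy example: inferring a single Poisson rate from a mean summary statistic. The thesis's models of migration and speciation are of course far richer; every number below is invented.

```python
import numpy as np

rng = np.random.default_rng(42)
obs = rng.poisson(5.0, size=200)           # "observed" data, true rate 5
s_obs = obs.mean()                          # summary statistic

# ABC rejection: draw rates from the prior, simulate data of the same size,
# keep a draw whenever its summary statistic is close to the observed one
prior_draws = rng.uniform(0.0, 20.0, size=50_000)
accepted = []
for lam in prior_draws:
    s_sim = rng.poisson(lam, size=200).mean()
    if abs(s_sim - s_obs) < 0.1:
        accepted.append(lam)
accepted = np.array(accepted)
post_mean = accepted.mean()
```

The accepted draws form an approximate posterior sample; its mean concentrates near the true rate as the tolerance shrinks.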
Caron, François. "Inférence bayésienne pour la détermination et la sélection de modèles stochastiques." Ecole Centrale de Lille, 2006. http://www.theses.fr/2006ECLI0012.
Full textWe are interested in the addition of uncertainty in hidden Markov models. The inference is made in a Bayesian framework based on Monte Carlo methods. We consider multiple sensors that may switch between several states of work. An original jump model is developed for different kinds of situations, including synchronous/asynchronous data and the binary valid/invalid case. The model/algorithm is applied to the positioning of a land vehicle equipped with three sensors. One of them is a GPS receiver, whose data are potentially corrupted due to multipath phenomena. We then consider the estimation of the probability density functions of the evolution and observation noises in hidden Markov models. First, the case of linear models is addressed, and MCMC and particle filter algorithms are developed and applied to three different applications. Then the estimation of probability density functions in nonlinear models is addressed. For that purpose, time-varying Dirichlet processes are defined for the online estimation of time-varying probability density functions.
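A bootstrap particle filter, one of the Monte Carlo tools mentioned above, can be sketched on a toy linear-Gaussian model. The parameter values are invented for illustration; the thesis builds jump models and noise-density estimation on top of this machinery.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 200, 1000
a, q, r = 0.9, 0.5, 1.0                 # transition coef, state noise, obs noise

# simulate a linear-Gaussian hidden Markov model
x = np.zeros(T)
for t in range(1, T):
    x[t] = a * x[t - 1] + q * rng.normal()
y = x + r * rng.normal(size=T)

# bootstrap particle filter: propagate, weight by likelihood, resample
parts = rng.normal(0.0, 1.0, N)
est = np.zeros(T)
for t in range(T):
    parts = a * parts + q * rng.normal(size=N)      # propagate through dynamics
    logw = -0.5 * ((y[t] - parts) / r) ** 2         # Gaussian observation weights
    w = np.exp(logw - logw.max())
    w /= w.sum()
    est[t] = w @ parts                              # filtered posterior mean
    parts = rng.choice(parts, size=N, p=w)          # multinomial resampling

rmse_pf = np.sqrt(np.mean((est - x) ** 2))
rmse_obs = np.sqrt(np.mean((y - x) ** 2))
```

The filtered mean tracks the hidden state markedly better than the raw observations, which is the basic gain the filter provides.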
El, Matouat Abdelaziz. "Sélection du nombre de paramètres d'un modèle comparaison avec le critère d'Akaike." Rouen, 1987. http://www.theses.fr/1987ROUES054.
Full textLounici, Karim. "Estimation Statistique En Grande Dimension, Parcimonie et Inégalités D'Oracle." Phd thesis, Université Paris-Diderot - Paris VII, 2009. http://tel.archives-ouvertes.fr/tel-00435917.
Full textPluntz, Matthieu. "Sélection de variables en grande dimension par le Lasso et tests statistiques - application à la pharmacovigilance." Electronic Thesis or Diss., université Paris-Saclay, 2024. http://www.theses.fr/2024UPASR002.
Full textVariable selection in high-dimensional regressions is a classic problem in health data analysis. It aims to identify a limited number of factors associated with a given health event among a large number of candidate variables such as genetic factors or environmental or drug exposures. The Lasso regression (Tibshirani, 1996) provides a series of sparse models where variables appear one after another as the regularization parameter decreases. It requires a procedure for choosing this parameter and thus the associated model. In this thesis, we propose procedures for selecting one of the models of the Lasso path, which belong to, or are inspired by, the statistical testing paradigm. Thus, we aim to control the risk of selecting at least one false positive (Family-Wise Error Rate, FWER), unlike most existing post-processing methods for the Lasso, which accept false positives more easily. Our first proposal is a generalization of the Akaike Information Criterion (AIC) which we call the Extended AIC (EAIC). We penalize the log-likelihood of the model under consideration by its number of parameters weighted by a function of the total number of candidate variables and the targeted level of FWER, but not of the number of observations. We obtain this function by observing the relationship between comparing the information criteria of nested sub-models of a high-dimensional regression and performing multiple likelihood ratio tests, about which we prove an asymptotic property. Our second proposal is a test of the significance of a variable appearing on the Lasso path. Its null hypothesis depends on a set A of already selected variables and states that A contains all the active variables. As the test statistic, we aim to use the regularization parameter value at which a first variable outside A is selected by the Lasso. This choice faces the fact that the null hypothesis is not specific enough to define the distribution of this statistic and thus its p-value.
We solve this by replacing the statistic with its conditional p-value, which we define conditionally on the non-penalized estimated coefficients of the model restricted to A. We estimate the conditional p-value with an algorithm that we call simulation-calibration, where we simulate outcome vectors and then calibrate them on the observed outcome's estimated coefficients. We adapt the calibration heuristically to the case of generalized linear models (binary and Poisson), in which it turns into an iterative and stochastic procedure. We prove that using our test controls the risk of selecting a false positive in linear models, both when the null hypothesis is verified and, under a correlation condition, when the set A does not contain all active variables. We evaluate the performance of both procedures through extensive simulation studies, which cover both the potential selection of a variable under the null hypothesis (or its equivalent for the EAIC) and the overall model selection procedure. We observe that our proposals compare well to their closest existing counterparts: the BIC and its extended versions for the EAIC, and Lockhart et al.'s (2014) covariance test for the simulation-calibration test. We also illustrate both procedures in the detection of exposures associated with drug-induced liver injuries (DILI) in the French national pharmacovigilance database (BNPV), measuring their performance using the DILIrank reference set of known associations.
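The Lasso path and the variable-entry order at the heart of both procedures can be sketched with a plain coordinate-descent Lasso run over a decreasing grid of regularization parameters. Neither the EAIC nor simulation-calibration is implemented here, and the simulated design is invented.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=100):
    """Cyclic coordinate descent for (1/2)||y - Xb||^2 + n*lam*||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    resid = y.copy()
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            resid += X[:, j] * beta[j]               # remove j's contribution
            rho = X[:, j] @ resid
            beta[j] = np.sign(rho) * max(abs(rho) - n * lam, 0.0) / col_sq[j]
            resid -= X[:, j] * beta[j]               # put updated contribution back
    return beta

rng = np.random.default_rng(3)
n, p = 200, 30
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [4.0, -3.0, 2.0]                     # three active variables
y = X @ beta_true + rng.normal(size=n)

# walk a decreasing grid of lambdas and record the order in which variables enter
entry_order = []
for lam in np.geomspace(4.0, 0.1, 25):
    active = np.flatnonzero(lasso_cd(X, y, lam))
    for j in active:
        if j not in entry_order:
            entry_order.append(j)
```

On this well-separated design, the three truly active variables enter the path before any noise variable, which is the behavior the selection procedures above exploit.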
Rohart, Florian. "Prédiction phénotypique et sélection de variables en grande dimension dans les modèles linéaires et linéaires mixtes." Thesis, Toulouse, INSA, 2012. http://www.theses.fr/2012ISAT0027/document.
Full textRecent technologies have provided scientists with genomics and post-genomics high-dimensional data: more variables are measured than there are individuals. These high-dimensional datasets usually require additional assumptions in order to be analyzed, such as a sparsity condition, which means that only a small subset of the variables is supposed to be relevant. In this high-dimensional context we worked on a real dataset which comes from the pig species and high-throughput biotechnologies. Metabolomic data were measured with NMR spectroscopy and phenotypic data were mainly obtained post-mortem. There are two objectives. On the one hand, we aim at obtaining good predictions for the production phenotypes, and on the other hand we want to pinpoint the metabolomic data that explain the phenotype under study. Thanks to the Lasso method applied in a linear model, we show that metabolomic data have real predictive power for some phenotypes important for livestock production, such as the lean meat percentage and daily food consumption. The second objective is a problem of variable selection. Classic statistical tools such as the Lasso method or the FDR procedure are investigated and new powerful methods are developed. We propose a variable selection method based on multiple hypothesis testing. This procedure is designed to perform in linear models and non-asymptotic results are given under a condition on the signal. Since supplemental data are available on the real dataset, such as the batch or the family relationships between the animals, linear mixed models are also considered. A new algorithm for fixed effects selection is developed, which turns out to be faster than the usual ones. Thanks to its structure, it can be combined with any variable selection method built for linear models. However, the convergence property of this algorithm depends on the method that is used.
The multiple hypothesis testing procedure shows good empirical results. All the mentioned methods are applied to the real data and biological relationships are emphasized.
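The FDR procedure mentioned among the classic tools is, in its Benjamini-Hochberg form, short enough to sketch; the p-values below are made up for illustration.

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg step-up procedure.
    Returns the indices of the rejected hypotheses at FDR level q."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    below = p[order] <= q * np.arange(1, m + 1) / m     # p_(i) <= q*i/m
    if not below.any():
        return np.array([], dtype=int)
    k = np.max(np.flatnonzero(below))                    # largest i passing the bound
    return np.sort(order[:k + 1])                        # reject all smaller p-values

pvals = [0.001, 0.008, 0.012, 0.041, 0.042, 0.06, 0.3, 0.7]
rejected = benjamini_hochberg(pvals, q=0.05)
```

Note the step-up logic: a p-value above its own line can still be rejected if a later order statistic passes the bound.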
Roche, Angelina. "Modélisation statistique pour données fonctionnelles : approches non-asymptotiques et méthodes adaptatives." Phd thesis, Université Montpellier II - Sciences et Techniques du Languedoc, 2014. http://tel.archives-ouvertes.fr/tel-01023919.
Full textHaury, Anne-Claire. "Sélection de variables à partir de données d'expression : signatures moléculaires pour le pronostic du cancer du sein et inférence de réseaux de régulation génique." Phd thesis, Ecole Nationale Supérieure des Mines de Paris, 2012. http://pastel.archives-ouvertes.fr/pastel-00818345.
Full textComminges, Laëtitia. "Quelques contributions à la sélection de variables et aux tests non-paramétriques." Thesis, Paris Est, 2012. http://www.theses.fr/2012PEST1068/document.
Full textReal-world data are often extremely high-dimensional, severely under-constrained and interspersed with a large number of irrelevant or redundant features. Relevant variable selection is a compelling approach for addressing statistical issues in the scenario of high-dimensional, noisy data with small sample size. First, we address the issue of variable selection in the regression model when the number of variables is very large. The main focus is on the situation where the number of relevant variables is much smaller than the ambient dimension. Without assuming any parametric form of the underlying regression function, we obtain tight conditions making it possible to consistently estimate the set of relevant variables. Second, we consider the problem of testing a particular type of composite null hypothesis under a nonparametric multivariate regression model. For a given quadratic functional $Q$, the null hypothesis states that the regression function $f$ satisfies the constraint $Q[f] = 0$, while the alternative corresponds to the functions for which $Q[f]$ is bounded away from zero. We provide minimax rates of testing and the exact separation constants, along with a sharp-optimal testing procedure, for diagonal and nonnegative quadratic functionals, and we can apply this to testing the relevance of a variable. Studying minimax rates for quadratic functionals which are neither positive nor negative reveals two different regimes: "regular" and "irregular". We apply this to the issue of testing the equality of the norms of two functions observed in noisy environments.
Zulian, Marine. "Méthodes de sélection et de validation de modèles à effets mixtes pour la médecine génomique." Thesis, Institut polytechnique de Paris, 2020. http://www.theses.fr/2020IPPAX003.
Full textThe study of complex biological phenomena such as human pathophysiology, or the pharmacokinetics and pharmacodynamics of a drug, can be enriched by modelling and simulation approaches. Technological advances in genetics allow the establishment of data sets from larger and more heterogeneous populations. The challenge is then to develop tools that integrate genomic and phenotypic data to explain inter-individual variability. In this thesis, we develop methods that take into account the complexity of biological data and of the underlying processes. Curation steps on the genomic covariates allow us to limit the number of potential covariates and the correlations between them. We propose an algorithm for selecting covariates in a mixed effects model whose structure is constrained by the physiological process. In particular, we illustrate the developed methods on two medical applications: real high blood pressure data and simulated tramadol (opioid) metabolism data.
Saumard, Adrien. "Estimation par Minimum de Contraste Régulier et Heuristique de Pente en Sélection de Modèles." Phd thesis, Université Rennes 1, 2010. http://tel.archives-ouvertes.fr/tel-00569372.
Full textSokolovska, Nataliya. "Contributions à l'estimation de modèles probabilistes discriminants : apprentissage semi-supervisé et sélection de caractéristiques." Phd thesis, Paris, Télécom ParisTech, 2010. https://pastel.hal.science/pastel-00006257.
Full textIn this thesis, we investigate the use of parametric probabilistic models for classification tasks in the domain of natural language processing. We focus in particular on discriminative models, such as logistic regression and its generalization, conditional random fields (CRFs). Discriminative probabilistic models directly model the conditional probability of a class given an observation. Logistic regression has been widely used due to its simplicity and effectiveness. Conditional random fields allow structural dependencies to be taken into consideration and are therefore used for structured output prediction. In this study, we address two aspects of modern machine learning, namely semi-supervised learning and model selection, in the context of CRFs. The contribution of this thesis is twofold. First, we consider the framework of semi-supervised learning, propose a novel semi-supervised estimator and show that it is preferable to standard logistic regression. Second, we study model selection approaches for discriminative models, in particular for CRFs, and propose to penalize the CRFs with the elastic net. Since the penalty term is not differentiable at zero, we consider coordinate-wise optimization. The comparison with the performance of other methods demonstrates the competitiveness of the elastic-net-penalized CRFs.
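An elastic-net penalty on a full CRF is beyond a short snippet, but the same penalty on plain logistic regression can be sketched, here solved by proximal gradient rather than the thesis's coordinate-wise scheme. Data and penalty levels are illustrative.

```python
import numpy as np

def logistic_enet(X, y, l1=0.05, l2=0.01, n_iter=2000):
    """Elastic-net logistic regression via proximal gradient (ISTA sketch)."""
    n, p = X.shape
    beta = np.zeros(p)
    # Lipschitz constant of the smooth part (log-loss gradient + ridge term)
    L = np.linalg.norm(X, 2) ** 2 / (4 * n) + l2
    for _ in range(n_iter):
        prob = 1 / (1 + np.exp(-(X @ beta)))
        grad = X.T @ (prob - y) / n + l2 * beta
        b = beta - grad / L
        beta = np.sign(b) * np.maximum(np.abs(b) - l1 / L, 0.0)  # soft-threshold
    return beta

rng = np.random.default_rng(7)
n, p = 400, 20
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -2.0, 1.5]
y = (rng.random(n) < 1 / (1 + np.exp(-(X @ beta_true)))).astype(float)
beta_hat = logistic_enet(X, y)
```

The l1 part zeroes out most irrelevant coefficients while the l2 part stabilizes the estimate, which is exactly the trade-off the elastic net is designed for.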
Sokolovska, Nataliya. "Contributions à l'estimation de modèles probabilistes discriminants : apprentissage semi-supervisé et sélection de caractéristiques." Phd thesis, Ecole nationale supérieure des telecommunications - ENST, 2010. http://tel.archives-ouvertes.fr/tel-00557662.
Full textLeyrat, Clémence. "Biais de sélection dans les essais en clusters : intérêt d'une approche de type score de propension pour le diagnostic et l'analyse statistique." Paris 7, 2014. http://www.theses.fr/2014PA077064.
Full textThis work aimed to study propensity score (PS)-based approaches for the analysis of cluster randomized trials (CRTs) with selection bias. First, we used Monte Carlo simulations to compare the performance of four PS-based methods (direct adjustment, inverse weighting, stratification and matching) and classical multivariable regression when analyzing the results of a CRT with selection bias. For continuous outcomes, both multivariable regression and the PS-based methods (except matching) removed the bias. Conversely, only direct adjustment on the PS provided an unbiased estimate of the treatment effect for a low-incidence binary outcome. Second, we developed a tool for detecting selection bias that relies on the area under the receiver operating characteristic curve of the PS model. This tool provides, for a fixed number of covariates and sample size, a threshold value beyond which one could consider that selection bias exists. This work also highlights the complexity of implementing PS-based methods in the context of CRTs, because of the hierarchical structure of the data, as well as the challenges linked to the choice of the statistical method in a causal inference framework.
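A minimal, non-clustered sketch of the inverse-probability-weighting idea compared in the simulations: estimate the propensity score by logistic regression, then reweight. The clustered structure and the other three PS methods are omitted, and all numbers are invented.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Logistic regression by Newton-Raphson (IRLS), illustrative sketch."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-(X @ beta)))
        H = X.T @ (X * (p * (1 - p))[:, None]) + 1e-8 * np.eye(X.shape[1])
        beta += np.linalg.solve(H, X.T @ (y - p))
    return beta

rng = np.random.default_rng(5)
n = 5000
x = rng.normal(size=n)                        # confounder
z = (rng.random(n) < 1 / (1 + np.exp(-1.5 * x))).astype(float)  # biased treatment
y = 1.0 * z + 2.0 * x + rng.normal(size=n)    # true treatment effect = 1

naive = y[z == 1].mean() - y[z == 0].mean()   # confounded difference in means

X_ps = np.column_stack([np.ones(n), x])
ps = 1 / (1 + np.exp(-(X_ps @ fit_logistic(X_ps, z))))
w = z / ps + (1 - z) / (1 - ps)               # inverse-probability weights
ipw = (np.sum(w * z * y) / np.sum(w * z)
       - np.sum(w * (1 - z) * y) / np.sum(w * (1 - z)))
```

The naive contrast is badly biased by the confounder, while the weighted (Hajek-style) contrast recovers the treatment effect.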
Celisse, Alain. "Sélection de modèle par validation-croisée en estimation de la densité, régression et détection de ruptures." Phd thesis, Université Paris Sud - Paris XI, 2008. http://tel.archives-ouvertes.fr/tel-00346320.
Full textBourguignon, Pierre Yves Vincent. "Parcimonie dans les modèles Markoviens et application à l'analyse des séquences biologiques." Thesis, Evry-Val d'Essonne, 2008. http://www.theses.fr/2008EVRY0042.
Full textMarkov chains, as a universal model for discrete-valued processes with finite memory, are omnipresent in applied statistics. Their applications range from text compression to the analysis of biological sequences. Their practical use with finite samples, however, systematically requires a compromise between the memory length of the model, which conditions the complexity of the interactions the model may capture, and the amount of information carried by the data, whose limitation negatively impacts the quality of estimation. Context trees, as an extension of the class of Markov chain models, provide the modeller with a finer granularity in this model selection process, by allowing the memory length to vary across contexts. Several popular modelling methods are based on this class of models, in fields such as text indexation or text compression (Context Tree Maximization and Context Tree Weighting). We propose an extension of the class of context tree models, the parsimonious context trees, which further allow the fusion of sibling nodes in the context tree. They provide the modeller with a yet finer granularity to perform the model selection task, at the price of an increased computational cost. Thanks to a Bayesian approach to this problem borrowed from compression techniques, we succeeded in designing an algorithm that exactly optimizes the Bayesian criterion, while benefiting from a dynamic programming scheme that minimizes the computational complexity of the model selection task. This algorithm is able to run in reasonable space and time on alphabets up to size 10, and has been applied to diverse datasets to establish the good performance of this approach.
Pain, Michel. "Mouvement brownien branchant et autres modèles hiérarchiques en physique statistique." Thesis, Sorbonne université, 2019. http://www.theses.fr/2019SORUS305.
Full textBranching Brownian motion (BBM) is a particle system where particles move and reproduce randomly. Firstly, we study precisely the phase transition occurring for this particle system close to its minimum, in the setting of the so-called near-critical case. Then, we describe the universal 1-stable fluctuations appearing in the front of BBM and identify the typical behavior of particles contributing to them. A version of BBM with selection, where particles are killed when going down at a distance larger than L from the highest particle, is also studied: we see how this selection rule affects the speed of the fastest individuals in the population when L is large. Thereafter, motivated by temperature chaos in spin glasses, we study the 2-dimensional discrete Gaussian free field, which is a model with an approximately hierarchical structure and properties similar to BBM, and show that, from this perspective, it behaves differently from the Random Energy Model. Finally, the last part of this thesis is dedicated to the Derrida-Retaux model, which is also defined by a hierarchical structure. We introduce a continuous-time version of this model and exhibit a family of exactly solvable solutions, which allows us to answer several conjectures stated for the discrete-time model.
Perthame, Emeline. "Stabilité de la sélection de variables pour la régression et la classification de données corrélées en grande dimension." Thesis, Rennes 1, 2015. http://www.theses.fr/2015REN1S122/document.
Full textThe analysis of high-throughput data has renewed the statistical methodology for feature selection. Such data are characterized both by their high dimension and by their heterogeneity, as the true signal and several confounding factors are often observed at the same time. In such a framework, the usual statistical approaches are questioned and can lead to misleading decisions, as they are initially designed under an independence assumption among variables. The goal of this thesis is to contribute to the improvement of variable selection methods in regression and supervised classification issues, by accounting for the dependence between selection statistics. All the methods proposed in this thesis are based on a factor model of the covariates, which assumes that variables are conditionally independent given a vector of latent variables. A part of this thesis focuses on the analysis of event-related potentials (ERP) data. ERPs are now widely collected in psychological research to determine the time courses of mental events. In the significance analysis of the relationships between event-related potentials and experimental covariates, the psychological signal is often both rare, since it only occurs on short intervals, and weak, regarding the huge between-subject variability of ERP curves. Indeed, these data are characterized by a temporal dependence pattern that is both strong and complex. Moreover, studying the effect of the experimental condition on brain activity at each instant is a multiple testing issue. We propose to decorrelate the test statistics by jointly modeling the signal and the time dependence among test statistics, from a prior knowledge of time points during which the signal is null. Second, an extension of decorrelation methods is proposed to handle a variable selection issue in the framework of linear supervised classification models. The contribution of the factor model assumption in the general framework of Linear Discriminant Analysis is studied.
It is shown that the optimal linear classification rule conditionally on these factors is more efficient than the non-conditional rule. Next, an Expectation-Maximization algorithm for the estimation of the model parameters is proposed. This method of data decorrelation is compatible with a prediction purpose. The issues of detection and identification of a signal when features are dependent are then addressed more analytically. We focus on the Higher Criticism (HC) procedure, defined under the assumptions of a sparse signal of low amplitude and independence among tests. It is shown in the literature that this method reaches theoretical detection bounds. Properties of HC under dependence are studied and the bounds of detectability and estimability are extended to arbitrarily complex situations of dependence. Finally, in the context of signal identification, an extension of Higher Criticism Thresholding based on innovations is proposed.
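The Higher Criticism statistic itself is compact enough to sketch under independence, here in the "HC+" form that discards the very smallest p-values; the dependent-case extensions of the thesis are not reproduced, and the simulated signal is invented.

```python
import numpy as np
from math import erf

def higher_criticism(pvals):
    """HC+ statistic of Donoho and Jin: maximal standardized excess of small p-values."""
    p = np.sort(np.asarray(pvals, dtype=float))
    n = len(p)
    i = np.arange(1, n + 1)
    hc = np.sqrt(n) * (i / n - p) / np.sqrt(p * (1 - p) + 1e-12)
    return hc[p >= 1.0 / n].max()        # HC+ guard: ignore p-values below 1/n

rng = np.random.default_rng(0)
n = 10_000
null_p = rng.uniform(size=n)                        # global null: uniform p-values
z = rng.normal(size=n)
z[:50] += 3.0                                       # sparse signal of moderate amplitude
sf = lambda t: 0.5 * (1 - erf(t / np.sqrt(2)))      # standard normal survival function
alt_p = 2 * np.vectorize(sf)(np.abs(z))             # two-sided p-values
hc_null, hc_alt = higher_criticism(null_p), higher_criticism(alt_p)
```

Even though no individual shifted test is overwhelmingly significant, the HC statistic clearly separates the sparse-signal sample from the null sample.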
Cherfaoui, Farah. "Echantillonnage pour l'accélération des méthodes à noyaux et sélection gloutonne pour les représentations parcimonieuses." Electronic Thesis or Diss., Aix-Marseille, 2022. http://www.theses.fr/2022AIXM0256.
Full textThe contributions of this thesis are divided into two parts. The first part is dedicated to the acceleration of kernel methods and the second to optimization under sparsity constraints. Kernel methods are widely known and used in machine learning. However, their implementation complexity is high, and they become unusable when the number of data points is large. We first propose an approximation of the ridge leverage scores. We then use these scores to define a probability distribution for the sampling process of the Nyström method, in order to speed up kernel methods. We also propose a new kernel-based framework for representing and comparing discrete probability distributions. We then exploit the link between our framework and the maximum mean discrepancy to propose an accurate and fast approximation of the latter. The second part of this thesis is devoted to optimization under sparsity constraints, for sparse signal approximation and random forest pruning. First, we prove, under certain conditions on the coherence of the dictionary, reconstruction and convergence properties of the Frank-Wolfe algorithm. Then, we use the OMP algorithm to reduce the size of random forests and thus the storage space they require. The pruned forest consists of a subset of trees from the initial forest, selected and weighted by OMP in order to minimize its empirical prediction error.
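The Nyström idea can be sketched with plain uniform column sampling; the thesis's contribution, sampling from approximate ridge leverage scores, is replaced here by uniform sampling, and the data and kernel bandwidth are invented.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.05):
    """Gaussian RBF kernel matrix between row sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
K = rbf_kernel(X, X)

# Nystrom: rebuild K from m sampled landmark columns
m = 100
idx = rng.choice(len(X), m, replace=False)
C = K[:, idx]                        # n x m block of sampled columns
W = K[np.ix_(idx, idx)]              # m x m landmark kernel
K_approx = C @ np.linalg.pinv(W) @ C.T

rel_err = np.linalg.norm(K - K_approx) / np.linalg.norm(K)
```

With a smooth kernel whose spectrum decays quickly, 100 landmarks out of 500 points already give a small relative Frobenius error, which is the speed/accuracy trade-off leverage-score sampling then improves.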
Peel, Thomas. "Algorithmes de poursuite stochastiques et inégalités de concentration empiriques pour l'apprentissage statistique." Thesis, Aix-Marseille, 2013. http://www.theses.fr/2013AIXM4769/document.
Full textThe first part of this thesis introduces new algorithms for the sparse encoding of signals. Based on Matching Pursuit (MP), they focus on the following problem: how to reduce the computation time of the selection step of MP. As an answer, we sub-sample the dictionary in rows and columns at each iteration. We show that this theoretically grounded approach has good empirical performance. We then propose a block coordinate gradient descent algorithm for feature selection problems in the multiclass classification setting. Thanks to the use of error-correcting output codes, this task can be seen as a problem of simultaneous sparse encoding of signals. The second part presents new empirical Bernstein inequalities. First, they concern the theory of U-statistics and are applied in order to design generalization bounds for ranking algorithms. These bounds take advantage of a variance estimator and we propose an efficient algorithm to compute it. Then, we present an empirical version of the Bernstein-type inequality for martingales by Freedman [1975]. Again, the strength of our result lies in the variance estimator, computable from the data. This allows us to propose generalization bounds for online learning algorithms which improve on the state of the art and pave the way for a new family of learning algorithms taking advantage of this empirical information.
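The flavor of an empirical Bernstein bound, a confidence width driven by the empirical variance rather than the worst-case range, can be sketched as follows. The constants follow the Maurer-Pontil form for i.i.d. bounded variables, quoted from memory; the thesis's U-statistics and martingale versions differ.

```python
import numpy as np

def empirical_bernstein_bound(x, delta=0.05, a=0.0, b=1.0):
    """Empirical Bernstein deviation bound (Maurer-Pontil style) for
    i.i.d. samples in [a, b]; illustrative, constants as in the literature."""
    n = len(x)
    var = x.var(ddof=1)                       # empirical variance, computable from data
    log_term = np.log(2.0 / delta)
    return (np.sqrt(2 * var * log_term / n)
            + 7 * (b - a) * log_term / (3 * (n - 1)))

rng = np.random.default_rng(0)
x = (rng.random(2000) < 0.05).astype(float)   # rare-event data in [0, 1], low variance
eps_bern = empirical_bernstein_bound(x)
eps_hoeff = np.sqrt(np.log(2 / 0.05) / (2 * len(x)))   # Hoeffding width, range-driven
```

Because the empirical variance is far below the worst case, the Bernstein-type width beats the Hoeffding width, which is exactly the advantage the thesis exploits for ranking and online learning bounds.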