
Dissertations / Theses on the topic 'Class imbalance'



Consult the top 50 dissertations / theses for your research on the topic 'Class imbalance.'


You can also download the full text of each academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse dissertations / theses in a wide variety of disciplines and organise your bibliography correctly.

1

Wang, Shuo. "Ensemble diversity for class imbalance learning." Thesis, University of Birmingham, 2011. http://etheses.bham.ac.uk//id/eprint/1793/.

Full text
Abstract:
This thesis studies the diversity issue of classification ensembles for class imbalance learning problems. Class imbalance learning refers to learning from imbalanced data sets, in which some classes of examples (minority) are highly under-represented compared to other classes (majority). The very skewed class distribution degrades the learning ability of many traditional machine learning methods, especially in the recognition of examples from the minority classes, which are often deemed to be more important and interesting. Although quite a few ensemble learning approaches have been proposed to handle the problem, no in-depth research exists to explain why and when they can be helpful. Our objectives are to understand how ensemble diversity affects the classification performance for a class imbalance problem according to single-class and overall performance measures, and to make the best use of diversity to improve the performance. As the first stage, we study the relationship between ensemble diversity and generalization performance for class imbalance problems. We investigate mathematical links between single-class performance and ensemble diversity. It is found that how the single-class measures change along with diversity falls into six different situations. These findings are then verified in class imbalance scenarios through empirical studies. The impact of diversity on overall performance is also investigated empirically. Strong correlations between diversity and the performance measures are found. Diversity shows a positive impact on the recognition of the minority class and benefits the overall performance of ensembles in class imbalance learning. Our results help to understand if and why ensemble diversity can help to deal with class imbalance problems. Encouraged by the positive role of diversity in class imbalance learning, we then focus on a specific ensemble learning technique, the negative correlation learning (NCL) algorithm, which considers diversity explicitly when creating ensembles and has achieved great empirical success. We propose a new learning algorithm based on the idea of NCL, named AdaBoost.NC, for classification problems. An "ambiguity" term decomposed from the 0-1 error function is introduced into the training framework of AdaBoost. It demonstrates superiority in both effectiveness and efficiency. Its good generalization performance is explained by theoretical and empirical evidence. It can be viewed as the first NCL algorithm specializing in classification problems. Most existing ensemble methods for class imbalance problems suffer from the problems of overfitting and over-generalization. To improve this situation, we address the class imbalance issue by making use of ensemble diversity. We investigate the generalization ability of NCL algorithms, including AdaBoost.NC, to tackle two-class imbalance problems. We find that NCL methods integrated with random oversampling are effective in recognizing minority class examples without losing the overall performance, especially the AdaBoost.NC tree ensemble. This is achieved by providing smoother and less overfitting classification boundaries for the minority class. The results here show the usefulness of diversity and open up a novel way to deal with class imbalance problems. Since the two-class imbalance is not the only scenario in real-world applications, multi-class imbalance problems deserve equal attention.
To understand what problems multi-class can cause and how it affects the classification performance, we study the multi-class difficulty by analyzing the multi-minority and multi-majority cases respectively. Both lead to a significant performance reduction. The multi-majority case appears to be more harmful. The results reveal possible issues that a class imbalance learning technique could have when dealing with multi-class tasks. Following this part of analysis and the promising results of AdaBoost.NC on two-class imbalance problems, we apply AdaBoost.NC to a set of multi-class imbalance domains with the aim of solving them effectively and directly. Our method shows good generalization in minority classes and balances the performance across different classes well without using any class decomposition schemes. Finally, we conclude this thesis with how the study has contributed to class imbalance learning and ensemble learning, and propose several possible directions for future research that may improve and extend this work.
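AdaBoost.NC itself is specific to the thesis, but the general recipe it is paired with here, random oversampling of the minority class followed by boosting, can be sketched with standard libraries. The following Python snippet is only an illustrative approximation under assumed data and parameters: scikit-learn's plain AdaBoost stands in for AdaBoost.NC and the dataset is synthetic.

    # Illustrative sketch: random oversampling of the minority class before
    # fitting a boosted ensemble. This is NOT AdaBoost.NC; the thesis adds an
    # explicit ambiguity (negative-correlation) penalty that plain AdaBoost lacks.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report
    from imblearn.over_sampling import RandomOverSampler

    # Synthetic two-class imbalanced data (assumed example, roughly 5% minority).
    X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    # Rebalance only the training split, then fit the ensemble.
    X_bal, y_bal = RandomOverSampler(random_state=0).fit_resample(X_tr, y_tr)
    ens = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X_bal, y_bal)
    print(classification_report(y_te, ens.predict(X_te), digits=3))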
APA, Harvard, Vancouver, ISO, and other styles
2

Nataraj, Vismitha, and Sushmitha Narayanan. "Resolving Class Imbalance using Generative Adversarial Networks." Thesis, Högskolan i Halmstad, Akademin för informationsteknologi, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:hh:diva-41405.

Full text
APA, Harvard, Vancouver, ISO, and other styles
3

Tran, Quang Duc. "One-class classification : an approach to handle class imbalance in multimodal biometric authentication." Thesis, City, University of London, 2014. http://openaccess.city.ac.uk/19662/.

Full text
Abstract:
Biometric verification is the process of authenticating a person's identity using his/her physiological and behavioural characteristics. It is well-known that multimodal biometric systems can further improve the authentication accuracy by combining information from multiple biometric traits at various levels, namely sensor, feature, match score and decision levels. Fusion at match score level is generally preferred due to the trade-off between information availability and fusion complexity. However, combining match scores poses a number of challenges when treated as a two-class classification problem, due to the highly imbalanced class distributions. Most conventional classifiers assume equally balanced classes. They do not work well when samples of one class vastly outnumber the samples of the other class. These challenges become even more significant when the fusion is based on user-specific processing, due to the limited availability of genuine samples per user. This thesis aims at exploring the paradigm of one-class classification to advance the classification performance of imbalanced biometric data sets. The contributions of the research can be enumerated as follows. Firstly, a thorough investigation of various one-class classifiers, including Gaussian Mixture Model, k-Nearest Neighbour, K-means clustering and Support Vector Data Description, has been provided. These classifiers are applied in learning the user-specific and user-independent descriptions for the biometric decision inference. It is demonstrated that the one-class classifiers are particularly useful in handling the imbalanced learning problem in multimodal biometric authentication. The user-specific approach is a better alternative to the user-independent counterpart because it is able to overcome the so-called within-class sub-concepts problem, which arises very often in multimodal biometric systems due to the existence of user variation. Secondly, a novel adapted score fusion scheme that consists of one-class classifiers and is trained using both the genuine user and impostor samples has been proposed. This method also replaces the user-independent description with a user-specific description to learn the characteristics of the impostor class, thus reducing the degree of imbalance between the classes. Extensive experiments are conducted on the BioSecure DS2 and XM2VTS databases to illustrate the potential of the proposed adapted score fusion scheme, which provides a relative improvement in terms of Equal Error Rate of 32% and 20% as compared to the standard sum of scores and likelihood ratio based score fusion, respectively. Thirdly, a hybrid boosting algorithm, called r-ABOC, has been developed, which is capable of exploiting the natural capabilities of both the well-known Real AdaBoost and one-class classification to further improve the system performance without causing overfitting. However, unlike the conventional Real AdaBoost, the individual classifiers in the proposed scheme are trained on the same data set, but with different parameter choices. This not only generates high diversity, which is vital to the success of r-ABOC, but also reduces the number of user-specified parameters. A comprehensive empirical study using the BioSecure DS2 and XM2VTS databases demonstrates that r-ABOC may achieve a performance gain in terms of Half Total Error Rate of up to 28% with respect to other state-of-the-art biometric score fusion techniques.
Finally, a Robust Imputation based on Group Method of Data Handling (RIBG) has been proposed to handle the missing data problem in the BioSecure DS2 database. RIBG is able to provide accurate predictions of incomplete score vectors. It is observed to achieve better performance than state-of-the-art imputation techniques, including mean, median and k-NN imputation. An important feature of RIBG is that it does not require any parameter fine-tuning and hence is amenable to immediate application.
APA, Harvard, Vancouver, ISO, and other styles
4

SENG, Kruy. "Cost-sensitive deep neural network ensemble for class imbalance problem." Digital Commons @ Lingnan University, 2018. https://commons.ln.edu.hk/otd/32.

Full text
Abstract:
In data mining, classification is the task of building a model which classifies data into a given set of categories. Most classification algorithms assume the class distribution of the data to be roughly balanced. In real-life applications such as direct marketing, fraud detection and churn prediction, the class imbalance problem usually occurs. The class imbalance problem refers to the situation in which the number of examples belonging to one class is significantly greater than the number belonging to the others. When training a standard classifier on class-imbalanced data, the classifier is usually biased toward the majority class. However, the minority class is the class of interest and is more significant than the majority class. In the literature, existing methods such as data-level, algorithmic-level and cost-sensitive learning approaches have been proposed to address this problem. The experiments discussed in these studies were usually conducted on relatively small data sets or even on artificial data. The performance of the methods on modern real-life data sets, which are more complicated, is unclear. In this research, we study the background and some of the state-of-the-art approaches which handle the class imbalance problem. We also propose two cost-sensitive methods to address the class imbalance problem, namely the Cost-Sensitive Deep Neural Network (CSDNN) and the Cost-Sensitive Deep Neural Network Ensemble (CSDE). CSDNN is a deep neural network based on Stacked Denoising Autoencoders (SDAE). We propose CSDNN by incorporating cost information of the majority and minority classes into the cost function of SDAE to make it cost-sensitive. The other proposed method, CSDE, is an ensemble learning version of CSDNN which is proposed to improve the generalization performance on the class imbalance problem. In the first step, a deep neural network based on SDAE is created for layer-wise feature extraction. Next, we perform Bagging's resampling procedure with undersampling to split the training data into a number of bootstrap samples. In the third step, we apply a layer-wise feature extraction method to extract new feature samples from each of the hidden layers of the SDAE. Lastly, ensemble learning is performed by using each of the new feature samples to train a CSDNN classifier with a random cost vector. Experiments are conducted to compare the proposed methods with the existing methods. We examine their performance on real-life data sets in business domains. The results show that the proposed methods obtain promising results in handling the class imbalance problem and also outperform all the other compared methods. There are three major contributions of this work. First, we propose the CSDNN method, in which misclassification costs are considered in the training process. Second, we incorporate random undersampling with layer-wise feature extraction to perform ensemble learning. Third, this is the first work that conducts experiments on the class imbalance problem using large real-life data sets in different business domains ranging from direct marketing, churn prediction, credit scoring and fraud detection to fake review detection.
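The core idea of making a neural network's objective cost-sensitive can be illustrated with a class-weighted cross-entropy loss. The sketch below is a hedged, generic example in PyTorch under assumed costs and toy data; it is not the CSDNN/CSDE implementation, which builds on stacked denoising autoencoders.

    # Hedged sketch: class-weighted cross-entropy as a cost-sensitive objective.
    import torch
    import torch.nn as nn

    n_features, n_classes = 20, 2
    model = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, n_classes))

    # Assumed misclassification costs: minority-class (index 1) errors cost 10x more.
    class_costs = torch.tensor([1.0, 10.0])
    loss_fn = nn.CrossEntropyLoss(weight=class_costs)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    X = torch.randn(256, n_features)        # toy feature batch
    y = (torch.rand(256) < 0.05).long()     # ~5% minority labels
    for _ in range(10):                     # a few illustrative training steps
        optimizer.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        optimizer.step()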
APA, Harvard, Vancouver, ISO, and other styles
5

Barnabé-Lortie, Vincent. "Active Learning for One-class Classification." Thesis, Université d'Ottawa / University of Ottawa, 2015. http://hdl.handle.net/10393/33001.

Full text
Abstract:
Active learning is a common solution for reducing labeling costs and maximizing the impact of human labeling efforts in binary and multi-class classification settings. However, when we are faced with extreme levels of class imbalance, a situation in which it is not safe to assume that we have a representative sample of the minority class, it has been shown effective to replace the binary classifier with a one-class classifier. In such a setting, traditional active learning methods, and many of those previously proposed in the literature for one-class classifiers, prove to be inappropriate, as they rely on assumptions about the data that no longer hold. In this thesis, we propose a novel approach to active learning designed for one-class classification. The proposed method does not rely on many of the inappropriate assumptions of its predecessors and leads to more robust classification performance. The gist of this method consists of labeling, in priority, the instances considered by previous iterations of a one-class classification model to fit the learned class the least. Throughout the thesis, we provide evidence for the merits of our method, then deepen our understanding of these merits by exploring the properties of the method that allow it to outperform the alternatives.
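The querying rule described above, label first the points that the current one-class model considers least like the learned class, can be sketched as follows. This is a hedged illustration with assumed data and labeling budget, using scikit-learn's OneClassSVM as a stand-in for the thesis's one-class learner.

    # Illustrative sketch of the querying idea: at each round, ask for labels on
    # the unlabeled points that the current one-class model scores lowest.
    import numpy as np
    from sklearn.svm import OneClassSVM

    rng = np.random.default_rng(0)
    X_labeled = rng.normal(0, 1, size=(50, 2))      # known target-class examples
    X_pool = rng.normal(0.5, 2, size=(1000, 2))     # unlabeled pool
    budget_per_round = 10

    for round_ in range(5):
        occ = OneClassSVM(gamma="scale", nu=0.1).fit(X_labeled)
        scores = occ.decision_function(X_pool)      # low score = fits the class least
        query_idx = np.argsort(scores)[:budget_per_round]
        # An oracle would label X_pool[query_idx]; here we simply move them across.
        X_labeled = np.vstack([X_labeled, X_pool[query_idx]])
        X_pool = np.delete(X_pool, query_idx, axis=0)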
APA, Harvard, Vancouver, ISO, and other styles
6

Dutta, Ila. "Data Mining Techniques to Identify Financial Restatements." Thesis, Université d'Ottawa / University of Ottawa, 2018. http://hdl.handle.net/10393/37342.

Full text
Abstract:
Data mining is a multi-disciplinary field of science and technology widely used in developing predictive models and data visualization in various domains. Although there are numerous data mining algorithms and techniques across multiple fields, it appears that there is no consensus on the suitability of a particular model, or on the ways to address data preprocessing issues. Moreover, the effectiveness of data mining techniques depends on the evolving nature of data. In this study, we focus on the suitability and robustness of various data mining models for analyzing real financial data to identify financial restatements. From a data mining perspective, it is quite interesting to study financial restatements for the following reasons: (i) the restatement data is highly imbalanced, which requires adequate attention in model building, (ii) there are many financial and non-financial attributes that may affect financial restatement predictive models, which requires careful implementation of data mining techniques to develop parsimonious models, and (iii) the class imbalance issue becomes more complex in a dataset that includes both intentional and unintentional restatement instances. Most of the previous studies focus on fraudulent (or intentional) restatements and the literature has largely ignored unintentional restatements. Intentional (i.e. fraudulent) restatement instances are rare and likely to have more distinct features compared to non-restatement cases. However, unintentional cases are comparatively more prevalent and likely to have fewer distinct features that separate them from non-restatement cases. A dataset containing unintentional restatement cases is likely to have more class overlapping issues that may impact the effectiveness of predictive models. In this study, we developed predictive models based on all restatement cases (both intentional and unintentional restatements) using a real, comprehensive and novel dataset which includes 116 attributes and approximately 1,000 restatement and 19,517 non-restatement instances over the period 2009 to 2014. To the best of our knowledge, no other study has developed predictive models for financial restatements using post-financial-crisis events. In order to avoid redundant attributes, we use three feature selection techniques, Correlation based feature subset selection (CfsSubsetEval), Information gain attribute evaluation (InfoGainEval) and Stepwise forward selection (FwSelect), and generate three datasets with reduced attributes. Our restatement dataset is highly skewed and highly biased towards the non-restatement (majority) class. We applied various algorithms (e.g. random undersampling (RUS), Cluster based undersampling (CUS) (Sobhani et al., 2014), random oversampling (ROS), Synthetic minority oversampling technique (SMOTE) (Chawla et al., 2002), Adaptive synthetic sampling (ADASYN) (He et al., 2008), and Tomek links with SMOTE) to address class imbalance in the financial restatement dataset. We perform classification employing six different classifiers, Decision tree (DT), Artificial neural network (ANN), Naïve Bayes (NB), Random forest (RF), Bayesian belief network (BBN) and Support vector machine (SVM), using 10-fold cross-validation, and test the efficiency of the various predictive models using minority class recall, minority class F-measure and G-mean.
We also experiment with different ensemble methods (bagging and boosting) with the base classifiers and employ other meta-learning algorithms (stacking and cost-sensitive learning) to improve model performance. While applying the cluster-based undersampling technique, we find that various classifiers (e.g. SVM, BBN) show a high success rate in terms of minority class recall. For example, the SVM classifier shows a minority recall value of 96%, which is quite encouraging. However, the ability of these classifiers to detect majority class instances is dismal. We find that some variations of synthetic oversampling such as ‘Tomek Link + SMOTE’ and ‘ADASYN’ show promising results in terms of both minority recall and G-mean. Using the InfoGainEval feature selection method, the RF classifier shows minority recall values of 92.6% for ‘Tomek Link + SMOTE’ and 88.9% for ‘ADASYN’, respectively. The corresponding G-mean values are 95.2% and 94.2% for these two oversampling techniques, which shows that the RF classifier is quite effective in predicting both minority and majority classes. We find further improvement in the results for the RF classifier with the cost-sensitive learning algorithm using the ‘Tomek Link + SMOTE’ oversampling technique. Subsequently, we develop some decision rules to detect restatement firms based on a subset of important attributes. To the best of our knowledge, only Kim et al. (2016) perform a data mining study using only pre-financial-crisis restatement data. Kim et al. (2016) employed a matching-sample-based undersampling technique and used logistic regression, SVM and BBN classifiers to develop financial restatement predictive models. The study’s highest reported G-mean is 70%. Our results with clustering-based undersampling are similar to the performance measures reported by Kim et al. (2016). However, our synthetic-oversampling-based results show better predictive ability. The RF classifier shows a very high degree of predictive capability for minority class instances (97.4%) and a very high G-mean value (95.3%) with cost-sensitive learning. Yet, we recognize that Kim et al. (2016) use a different restatement dataset (with pre-crisis restatement cases) and hence a direct comparison of results may not be fully justified. Our study makes contributions to the data mining literature by (i) presenting predictive models for financial restatements with a comprehensive dataset, (ii) focussing on various data mining techniques and presenting a comparative analysis, and (iii) addressing the class imbalance issue by identifying the most effective technique. To the best of our knowledge, we used the most comprehensive dataset to develop our predictive models for identifying financial restatements.
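A minimal sketch of the kind of pipeline evaluated in the study, SMOTE-style rebalancing of the training data followed by a random forest scored with minority recall and G-mean, is given below. The dataset, parameters and library choices (scikit-learn and imbalanced-learn) are assumptions for illustration, not the thesis setup.

    # Hedged sketch: rebalance the training split with SMOTE, fit a random forest,
    # and report minority-class recall and G-mean on held-out data.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import recall_score
    from imblearn.over_sampling import SMOTE
    from imblearn.metrics import geometric_mean_score

    X, y = make_classification(n_samples=20000, n_features=30,
                               weights=[0.95, 0.05], random_state=1)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

    X_bal, y_bal = SMOTE(random_state=1).fit_resample(X_tr, y_tr)
    rf = RandomForestClassifier(n_estimators=200, random_state=1).fit(X_bal, y_bal)

    pred = rf.predict(X_te)
    print("minority recall:", recall_score(y_te, pred, pos_label=1))
    print("G-mean:", geometric_mean_score(y_te, pred))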
APA, Harvard, Vancouver, ISO, and other styles
7

Batuwitage, Manohara Rukshan Kannangara. "Enhanced class imbalance learning methods for support vector machines application to human miRNA gene classification." Thesis, University of Oxford, 2010. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.531966.

Full text
APA, Harvard, Vancouver, ISO, and other styles
8

Mathur, Tanmay. "Improving Classification Results Using Class Imbalance Solutions & Evaluating the Generalizability of Rationale Extraction Techniques." Miami University / OhioLINK, 2015. http://rave.ohiolink.edu/etdc/view?acc_num=miami1420335486.

Full text
APA, Harvard, Vancouver, ISO, and other styles
9

Iosifidis, Vasileios [Verfasser], and Eirini [Akademischer Betreuer] Ntoutsi. "Semi-supervised learning and fairness-aware learning under class imbalance / Vasileios Iosifidis ; Betreuer: Eirini Ntoutsi." Hannover : Gottfried Wilhelm Leibniz Universität Hannover, 2020. http://d-nb.info/1217782168/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
10

Bellinger, Colin. "Beyond the Boundaries of SMOTE: A Framework for Manifold-based Synthetic Oversampling." Thesis, Université d'Ottawa / University of Ottawa, 2016. http://hdl.handle.net/10393/34643.

Full text
Abstract:
Within machine learning, the problem of class imbalance refers to the scenario in which one or more classes are significantly outnumbered by the others. In the most extreme case, the minority class is not only significantly outnumbered by the majority class, but is also considered to be rare, or absolutely imbalanced. Class imbalance appears in a wide variety of important domains, ranging from oil spill and fraud detection, to text classification and medical diagnosis. Given this, it has been deemed one of the ten most important research areas in data mining, and for more than a decade now the machine learning community has been coming together in an attempt to unequivocally solve the problem. The fundamental challenge in the induction of a classifier from imbalanced training data is in managing the prediction bias. The current state-of-the-art methods deal with this by readjusting misclassification costs or by applying resampling methods. In cases of absolute imbalance, these methods are insufficient; rather, it has been observed that we need more training examples. The nature of class imbalance, however, dictates that additional examples cannot be acquired, and thus, synthetic oversampling becomes the natural choice. We recognize the importance of selecting algorithms with assumptions and biases that are appropriate for the properties of the target data, and argue that this is of absolute importance when it comes to developing synthetic oversampling methods because a large generative leap must be made from a relatively small training set. In particular, our research into gamma-ray spectral classification has demonstrated the benefits of incorporating prior knowledge of conformance to the manifold assumption into the synthetic oversampling algorithms. We empirically demonstrate the negative impact of the manifold property on the state-of-the-art methods, and propose a framework for manifold-based synthetic oversampling. We algorithmically present the generic form of the framework and demonstrate formalizations of it with PCA and the denoising autoencoder. Through use of the helix and swiss roll datasets, which are standards in the manifold learning community, we visualize and qualitatively analyze the benefits of our proposed framework. Moreover, we unequivocally show the framework to be superior on three real-world gamma-ray spectral datasets and on sixteen benchmark UCI datasets in general. Specifically, our results demonstrate that the framework for manifold-based synthetic oversampling produces higher area under the ROC results than the current state-of-the-art and degrades less on data that conforms to the manifold assumption.
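The PCA formalization of manifold-based synthetic oversampling can be sketched roughly as follows: embed the minority class in a low-dimensional PCA space, generate new points there, and map them back so they stay close to the learned manifold. This is a hedged reconstruction with an assumed noise scale and component count, not the author's implementation.

    # Rough sketch: generate synthetic minority examples in PCA space, then map back.
    import numpy as np
    from sklearn.decomposition import PCA

    def pca_synthetic_oversample(X_min, n_new, n_components=2, noise=0.05, seed=0):
        rng = np.random.default_rng(seed)
        pca = PCA(n_components=n_components).fit(X_min)
        Z = pca.transform(X_min)
        # Pick existing embedded points and perturb them within the embedded space.
        idx = rng.integers(0, len(Z), size=n_new)
        Z_new = Z[idx] + noise * rng.standard_normal((n_new, n_components)) * Z.std(axis=0)
        return pca.inverse_transform(Z_new)

    X_minority = np.random.default_rng(1).normal(size=(40, 10))   # toy minority set
    X_synthetic = pca_synthetic_oversample(X_minority, n_new=200)
    print(X_synthetic.shape)   # (200, 10)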
APA, Harvard, Vancouver, ISO, and other styles
11

Kueterman, Nathan. "Comparative Study of Classification Methods for the Mitigation of Class Imbalance Issues in Medical Imaging Applications." University of Dayton / OhioLINK, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=dayton1591611376235015.

Full text
APA, Harvard, Vancouver, ISO, and other styles
12

Jagelid, Michelle, and Maria Movin. "A Comparison of Resampling Techniques to Handle the Class Imbalance Problem in Machine Learning : Conversion prediction of Spotify Users - A Case Study." Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-208876.

Full text
Abstract:
Spotify uses a freemium business model, meaning that it has two main products: one free, limited product and one premium product for paying customers. In this study we investigated machine learning models’ abilities, given user activity data, to predict conversion from free to premium. Predicting which users convert from free to premium is a class-imbalanced problem, meaning that the ratio of converters to non-converters is skewed. Three methods were investigated: logistic regression, decision trees, and gradient boosting trees. We also studied whether different resampling methods, which balance the training datasets, can improve the classification performance of the models. We showed that machine learning models are able to find patterns in user data that can be used to predict conversion. Additionally, for all of the investigated classification methods, resampling increased the models’ performance. The best-performing methods in our study were logistic regression and the gradient boosting tree trained on data oversampled with random duplicates of converters up to equal numbers of converters and non-converters.
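A hedged sketch of the evaluated setup, oversampling converters in the training data to a 1:1 ratio and then fitting logistic regression and a gradient boosting model, is shown below with synthetic placeholder data rather than Spotify data.

    # Illustrative sketch: random oversampling to a balanced training set,
    # then two of the compared model families.
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from imblearn.over_sampling import RandomOverSampler

    X, y = make_classification(n_samples=10000, weights=[0.97, 0.03], random_state=2)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=2)
    X_bal, y_bal = RandomOverSampler(sampling_strategy=1.0, random_state=2).fit_resample(X_tr, y_tr)

    for model in (LogisticRegression(max_iter=1000), GradientBoostingClassifier()):
        model.fit(X_bal, y_bal)
        print(type(model).__name__, model.score(X_te, y_te))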
APA, Harvard, Vancouver, ISO, and other styles
13

Pezzicoli, Francesco. "Statistical Physics - Machine Learning Interplay : from Addressing Class Imbalance with Replica Theory to Predicting Dynamical Heterogeneities with SE(3)-equivariant Graph Neural Networks." Electronic Thesis or Diss., université Paris-Saclay, 2024. http://www.theses.fr/2024UPASG115.

Full text
Abstract:
This thesis explores the relationship between Machine Learning (ML) and Statistical Physics (SP), addressing two significant challenges at the interface between the two fields. First, I examine the problem of Class Imbalance (CI) in the supervised learning setting by introducing an analytically tractable model grounded in statistical mechanics, providing a theoretical framework to analyze and interpret CI. Some non-trivial phenomena are observed: for example, a balanced training set often results in sub-optimal performance. Second, I study the phenomenon of dynamical arrest in supercooled liquids (structural glasses) through advanced ML models. Leveraging SE(3)-equivariant Graph Neural Networks, I am able to reach or surpass state-of-the-art accuracy in the task of predicting dynamical properties from static structure. This suggests the emergence of a growing "amorphous order" that correlates with particle dynamics, and it emphasizes the importance of directional features in identifying this order. Together, these contributions demonstrate the potential of SP in addressing ML challenges and the utility of ML models in advancing the physical sciences.
APA, Harvard, Vancouver, ISO, and other styles
14

Yella, Jaswanth. "Machine Learning-based Prediction and Characterization of Drug-drug Interactions." University of Cincinnati / OhioLINK, 2018. http://rave.ohiolink.edu/etdc/view?acc_num=ucin154399419112613.

Full text
APA, Harvard, Vancouver, ISO, and other styles
15

Ringdahl, Benjamin. "Gaussian Process Multiclass Classification : Evaluation of Binarization Techniques and Likelihood Functions." Thesis, Linnéuniversitetet, Institutionen för matematik (MA), 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-87952.

Full text
Abstract:
In binary Gaussian process classification, the prior class membership probabilities are obtained by transforming a Gaussian process to the unit interval, typically either with the logistic likelihood function or the cumulative Gaussian likelihood function. Multiclass classification problems can be handled by any binary classifier by means of so-called binarization techniques, which reduce the multiclass problem to a number of binary problems. In addition to introducing the mathematics behind the theory and methods of Gaussian process classification, we compare the binarization techniques one-against-all and one-against-one in the context of Gaussian process classification, and we also compare the performance of the logistic likelihood and the cumulative Gaussian likelihood. This is done by means of two experiments: one general experiment where the methods are tested on several publicly available datasets, and one more specific experiment where the methods are compared with respect to class imbalance and class overlap on several artificially generated datasets. The results indicate that there is no significant difference between the choices of binarization technique and likelihood function for typical datasets, although the one-against-one technique showed slightly more consistent performance. However, the second experiment revealed some differences in how the methods react to varying degrees of class imbalance and class overlap. Most notably, the logistic likelihood was a dominant factor and the one-against-one technique performed better than one-against-all.
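The two binarization schemes compared above can be illustrated with scikit-learn's generic one-vs-one and one-vs-rest wrappers around a binary probabilistic classifier; a Gaussian process classifier with default settings stands in for the thesis's models, and the Iris data is only a convenient example.

    # Minimal sketch of the one-against-one and one-against-all binarization schemes.
    from sklearn.datasets import load_iris
    from sklearn.gaussian_process import GaussianProcessClassifier
    from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

    X, y = load_iris(return_X_y=True)
    ovo = OneVsOneClassifier(GaussianProcessClassifier()).fit(X, y)
    ovr = OneVsRestClassifier(GaussianProcessClassifier()).fit(X, y)
    print("one-vs-one accuracy:", ovo.score(X, y))
    print("one-vs-rest accuracy:", ovr.score(X, y))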
APA, Harvard, Vancouver, ISO, and other styles
16

Brandt, Jakob, and Emil Lanzén. "A Comparative Review of SMOTE and ADASYN in Imbalanced Data Classification." Thesis, Uppsala universitet, Statistiska institutionen, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-432162.

Full text
Abstract:
In this thesis, the performance of two over-sampling techniques, SMOTE and ADASYN, is compared. The comparison is done on three imbalanced data sets using three different classification models and evaluation metrics, while varying the way the data is pre-processed. The results show that both SMOTE and ADASYN improve the performance of the classifiers in most cases. It is also found that SVM in conjunction with SMOTE performs better than with ADASYN as the degree of class imbalance increases. Furthermore, both SMOTE and ADASYN increase the relative performance of the Random forest as the degree of class imbalance grows. However, no pre-processing method consistently outperforms the other in its contribution to better performance as the degree of class imbalance varies.
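A minimal sketch of the comparison setup, rebalancing the training data with SMOTE or ADASYN and comparing the resulting classifier scores, is given below; the data set, classifier and metric are assumed stand-ins for those used in the thesis.

    # Hedged sketch: pre-process training data with SMOTE or ADASYN and compare.
    from sklearn.datasets import make_classification
    from sklearn.svm import SVC
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import f1_score
    from imblearn.over_sampling import SMOTE, ADASYN

    X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=3)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=3)

    for sampler in (SMOTE(random_state=3), ADASYN(random_state=3)):
        X_bal, y_bal = sampler.fit_resample(X_tr, y_tr)
        clf = SVC().fit(X_bal, y_bal)
        print(type(sampler).__name__, f1_score(y_te, clf.predict(X_te)))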
APA, Harvard, Vancouver, ISO, and other styles
17

Prati, Ronaldo Cristiano. ""Novas abordagens em aprendizado de máquina para a geração de regras, classes desbalanceadas e ordenação de casos"." Universidade de São Paulo, 2006. http://www.teses.usp.br/teses/disponiveis/55/55134/tde-01092006-155445/.

Full text
Abstract:
Machine learning algorithms are often the most appropriate algorithms for a great variety of data mining applications. However, most machine learning research to date has mainly dealt with the well-circumscribed problem of finding a model (generally a classifier) given a single, small and relatively clean dataset in the attribute-value form, where the attributes have previously been chosen to facilitate learning. Furthermore, the end-goal is simple and well-defined, such as accurate classifiers in the classification problem. Data mining opens up new directions for machine learning research, and lends new urgency to others. With data mining, machine learning is now removing each one of these constraints. Therefore, machine learning's many valuable contributions to data mining are reciprocated by the latter's invigorating effect on it. In this thesis, we explore this interaction by proposing new solutions to some problems that arise from the application of machine learning algorithms to data mining applications. More specifically, we contribute to the following problems. New approaches to rule learning: in this category, we propose two new methods for rule learning. In the first one, we propose a new method for finding exceptions to general rules. The second one is a rule selection algorithm, named Roccer, based on the ROC graph. Rules come from a larger external set of rules and the algorithm performs a selection step based on the current convex hull in the ROC graph. Proportion of examples among classes: we investigated several aspects related to this issue. Firstly, we carried out a series of experiments on artificial data sets in order to verify our hypothesis that overlapping among classes is a complicating factor in highly skewed data sets. We also carried out a broad experimental analysis with several methods (some of them proposed by us) that artificially balance skewed datasets. Our experiments show that, in general, over-sampling methods perform better than under-sampling methods. Finally, we investigated the relationship between class imbalance and small disjuncts, as well as the influence of the proportion of examples among classes on the process of labelling unlabelled cases in the semi-supervised learning algorithm Co-training. New method for combining rankings: we propose a new method, called BordaRank, to construct ensembles of rankings based on Borda count voting, which can be applied to any binary ranking problem in which several rankings are available. Experimental results show an improvement in performance over the individual base rankings, as well as performance comparable to more sophisticated algorithms that use numerical predictions, rather than rankings, to build ensembles for the binary ranking problem.
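The Borda count fusion idea behind BordaRank can be sketched in a few lines; this is a generic reconstruction of Borda count voting over rankings, not the author's code, and the example rankings are made up.

    # Simple sketch of Borda-count fusion of several rankings of the same items.
    import numpy as np

    def borda_fuse(rankings, n_items):
        # Each item receives (n_items - position) points from every ranking;
        # the fused ranking orders items by total points.
        points = np.zeros(n_items)
        for ranking in rankings:
            for position, item in enumerate(ranking):
                points[item] += n_items - position
        return np.argsort(-points)

    rankings = [[0, 2, 1, 3], [2, 0, 1, 3], [0, 1, 3, 2]]   # three base rankings
    print(borda_fuse(rankings, n_items=4))                  # fused order: [0 2 1 3]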
APA, Harvard, Vancouver, ISO, and other styles
18

Siddique, Nahian A. "PATTERN RECOGNITION IN CLASS IMBALANCED DATASETS." VCU Scholars Compass, 2016. http://scholarscompass.vcu.edu/etd/4480.

Full text
Abstract:
Class imbalanced datasets constitute a significant portion of the machine learning problems of interest, where recognizing the 'rare class' is the primary objective for most applications. Traditional linear machine learning algorithms are often not effective in recognizing the rare class. In this research work, a specifically optimized feed-forward artificial neural network (ANN) is proposed and developed to train from moderately to highly imbalanced datasets. The proposed methodology deals with the difficulty of the classification task in multiple stages: by optimizing the training dataset, modifying the kernel function to generate the gram matrix, and optimizing the NN structure. First, the training dataset is extracted from the available sample set through an iterative process of selective under-sampling. Then, the proposed artificial NN comprises a kernel function optimizer to specifically enhance class boundaries for imbalanced datasets by conformally transforming the kernel functions. Finally, a single hidden layer weighted neural network structure is proposed to train models from the imbalanced dataset. The proposed NN architecture is derived to effectively classify any binary dataset with even a very high imbalance ratio, given appropriate parameter tuning and a sufficient number of processing elements. The effectiveness of the proposed method is tested on accuracy-based performance metrics, achieving close to and above 90% with several imbalanced datasets of a generic nature, and compared with state-of-the-art methods. The proposed model is also used for classification of a 25GB computed tomographic colonography database to test its applicability to big data. The effectiveness of under-sampling and kernel optimization for training the NN model from the modified kernel gram matrix representing the imbalanced data distribution is also analyzed experimentally. Computation time analysis shows the feasibility of the system for practical purposes. This report is concluded with a discussion of the prospects of the developed model and suggestions for further development work in this direction.
APA, Harvard, Vancouver, ISO, and other styles
19

Abouelenien, Mohamed. "Boosting for Learning From Imbalanced, Multiclass Data Sets." Thesis, University of North Texas, 2013. https://digital.library.unt.edu/ark:/67531/metadc407775/.

Full text
Abstract:
In many real-world applications, it is common to have uneven number of examples among multiple classes. The data imbalance, however, usually complicates the learning process, especially for the minority classes, and results in deteriorated performance. Boosting methods were proposed to handle the imbalance problem. These methods need elongated training time and require diversity among the classifiers of the ensemble to achieve improved performance. Additionally, extending the boosting method to handle multi-class data sets is not straightforward. Examples of applications that suffer from imbalanced multi-class data can be found in face recognition, where tens of classes exist, and in capsule endoscopy, which suffers massive imbalance between the classes. This dissertation introduces RegBoost, a new boosting framework to address the imbalanced, multi-class problems. This method applies a weighted stratified sampling technique and incorporates a regularization term that accommodates multi-class data sets and automatically determines the error bound of each base classifier. The regularization parameter penalizes the classifier when it misclassifies instances that were correctly classified in the previous iteration. The parameter additionally reduces the bias towards majority classes. Experiments are conducted using 12 diverse data sets with moderate to high imbalance ratios. The results demonstrate superior performance of the proposed method compared to several state-of-the-art algorithms for imbalanced, multi-class classification problems. More importantly, the sensitivity improvement of the minority classes using RegBoost is accompanied with the improvement of the overall accuracy for all classes. With unpredictability regularization, a diverse group of classifiers are created and the maximum accuracy improvement reaches above 24%. Using stratified undersampling, RegBoost exhibits the best efficiency. The reduction in computational cost is significant reaching above 50%. As the volume of training data increase, the gain of efficiency with the proposed method becomes more significant.
APA, Harvard, Vancouver, ISO, and other styles
20

Andersson, Melanie. "Multi-Class Imbalanced Learning for Time Series Problem : An Industrial Case Study." Thesis, Uppsala universitet, Avdelningen för systemteknik, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-412799.

Full text
Abstract:
Classification problems with multiple classes and imbalanced sample sizes present a different challenge than binary classification problems. Methods have been proposed to handle imbalanced learning; however, most of them are specifically designed for binary classification problems. Multi-class imbalance imposes additional challenges when applied to time series classification problems, such as weather classification. In this thesis, we introduce, apply and evaluate a new algorithm for handling multi-class imbalanced problems involving time series data. Our proposed algorithm is designed to handle both multi-class imbalance and time series classification problems and is inspired by the Imbalanced Fuzzy-Rough Ordered Weighted Average Nearest Neighbor Classification algorithm. The feasibility of our proposed algorithm is studied through an empirical evaluation performed on a telecom use-case at Ericsson, Sweden, where data from commercial microwave links is used for weather classification. Our proposed algorithm is compared to the model currently used at Ericsson, which is a one-dimensional convolutional neural network, as well as to three other deep learning models. The empirical evaluation indicates that the performance of our proposed algorithm for weather classification is comparable to that of the current solution. Our proposed algorithm and the current solution are the two best performing models of the study.
APA, Harvard, Vancouver, ISO, and other styles
21

Ghanem, Amal Saleh. "Probabilistic models for mining imbalanced relational data." Thesis, Curtin University, 2009. http://hdl.handle.net/20.500.11937/2266.

Full text
Abstract:
Most data mining and pattern recognition techniques are designed for learning from flat data files with the assumption of equal populations per class. However, most real-world data are stored as rich relational databases that generally have imbalanced class distributions. For such domains, a rich relational technique is required to accurately model the different objects and relationships in the domain, which cannot be easily represented as a set of simple attributes, and at the same time handle the imbalanced class problem. Motivated by the significance of mining imbalanced relational databases, which represent the majority of real-world data, learning techniques for mining imbalanced relational domains are investigated. In this thesis, the employment of probabilistic models in mining relational databases is explored, in particular the Probabilistic Relational Models (PRMs) that were proposed as an extension of attribute-based Bayesian Networks. The effectiveness of PRMs in mining real-world databases was explored by learning PRMs from a real-world university relational database. A visual data mining tool is also proposed to aid the interpretation of the outcomes of the learned PRM models. Despite the effectiveness of PRMs in relational learning, the performance of PRMs as predictive models is significantly hindered by the imbalanced class problem. This is due to the fact that PRMs share the assumption, common to other learning techniques, of relatively balanced class distributions in the training data. Therefore, this thesis proposes a number of models utilizing the effectiveness of PRMs in relational learning and extending it for mining imbalanced relational domains. The first model introduced in this thesis examines the problem of mining imbalanced relational domains for a single two-class attribute. The model is proposed by enriching the PRM learning with the ensemble learning technique. The premise behind this model is that an ensemble of models will attain better performance than a single model, as misclassifications committed by one of the models can often be correctly classified by others. Based on this approach, another model is introduced to address the problem of mining multiple imbalanced attributes, in which it is important to predict several attributes rather than a single one. In this model, the ensemble bagging sampling approach is exploited to attain a single model for mining several attributes. Finally, the thesis outlines the problem of imbalanced multi-class classification and introduces a generalized framework to handle this problem for both relational and non-relational domains.
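The bagging-with-sampling idea used in the second model can be illustrated, in a propositional (non-relational) setting, with imbalanced-learn's BalancedBaggingClassifier, which draws a class-balanced bootstrap sample for each base learner. This is only a hedged stand-in; the thesis applies the idea to Probabilistic Relational Models, which is not reproduced here.

    # Hedged sketch: an ensemble in which each bag is resampled to balanced classes.
    from sklearn.datasets import make_classification
    from imblearn.ensemble import BalancedBaggingClassifier

    X, y = make_classification(n_samples=8000, weights=[0.93, 0.07], random_state=4)
    ens = BalancedBaggingClassifier(n_estimators=25, random_state=4).fit(X, y)
    print(ens.score(X, y))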
APA, Harvard, Vancouver, ISO, and other styles
22

Makki, Sara. "An Efficient Classification Model for Analyzing Skewed Data to Detect Frauds in the Financial Sector." Thesis, Lyon, 2019. http://www.theses.fr/2019LYSE1339/document.

Full text
Abstract:
Différents types de risques existent dans le domaine financier, tels que le financement du terrorisme, le blanchiment d’argent, la fraude de cartes de crédit, la fraude d’assurance, les risques de crédit, etc. Tout type de fraude peut entraîner des conséquences catastrophiques pour des entités telles que les banques ou les compagnies d’assurances. Ces risques financiers sont généralement détectés à l'aide des algorithmes de classification. Dans les problèmes de classification, la distribution asymétrique des classes, également connue sous le nom de déséquilibre de classe (class imbalance), est un défi très commun pour la détection des fraudes. Des approches spéciales d'exploration de données sont utilisées avec les algorithmes de classification traditionnels pour résoudre ce problème. Le problème de classes déséquilibrées se produit lorsque l'une des classes dans les données a beaucoup plus d'observations que l’autre classe. Ce problème est plus vulnérable lorsque l'on considère dans le contexte des données massives (Big Data). Les données qui sont utilisées pour construire les modèles contiennent une très petite partie de groupe minoritaire qu’on considère positifs par rapport à la classe majoritaire connue sous le nom de négatifs. Dans la plupart des cas, il est plus délicat et crucial de classer correctement le groupe minoritaire plutôt que l'autre groupe, comme la détection de la fraude, le diagnostic d’une maladie, etc. Dans ces exemples, la fraude et la maladie sont les groupes minoritaires et il est plus délicat de détecter un cas de fraude en raison de ses conséquences dangereuses qu'une situation normale. Ces proportions de classes dans les données rendent très difficile à l'algorithme d'apprentissage automatique d'apprendre les caractéristiques et les modèles du groupe minoritaire. Ces algorithmes seront biaisés vers le groupe majoritaire en raison de leurs nombreux exemples dans l'ensemble de données et apprendront à les classer beaucoup plus rapidement que l'autre groupe. Dans ce travail, nous avons développé deux approches : Une première approche ou classifieur unique basée sur les k plus proches voisins et utilise le cosinus comme mesure de similarité (Cost Sensitive Cosine Similarity K-Nearest Neighbors : CoSKNN) et une deuxième approche ou approche hybride qui combine plusieurs classifieurs uniques et fondu sur l'algorithme k-modes (K-modes Imbalanced Classification Hybrid Approach : K-MICHA). Dans l'algorithme CoSKNN, notre objectif était de résoudre le problème du déséquilibre en utilisant la mesure de cosinus et en introduisant un score sensible au coût pour la classification basée sur l'algorithme de KNN. Nous avons mené une expérience de validation comparative au cours de laquelle nous avons prouvé l'efficacité de CoSKNN en termes de taux de classification correcte et de détection des fraudes. D’autre part, K-MICHA a pour objectif de regrouper des points de données similaires en termes des résultats de classifieurs. Ensuite, calculez les probabilités de fraude dans les groupes obtenus afin de les utiliser pour détecter les fraudes de nouvelles observations. Cette approche peut être utilisée pour détecter tout type de fraude financière, lorsque des données étiquetées sont disponibles. La méthode K-MICHA est appliquée dans 3 cas : données concernant la fraude par carte de crédit, paiement mobile et assurance automobile. Dans les trois études de cas, nous comparons K-MICHA au stacking en utilisant le vote, le vote pondéré, la régression logistique et l’algorithme CART. 
Nous avons également comparé avec Adaboost et la forêt aléatoire. Nous prouvons l'efficacité de K-MICHA sur la base de ces expériences. Nous avons également appliqué K-MICHA dans un cadre Big Data en utilisant H2O et R. Nous avons pu traiter et analyser des ensembles de données plus volumineux en très peu de temps
There are different types of risk in the financial domain, such as terrorist financing, money laundering, credit card fraud and insurance fraud, that may result in catastrophic consequences for entities such as banks or insurance companies. These financial risks are usually detected using classification algorithms. In classification problems, the skewed distribution of classes, also known as class imbalance, is a very common challenge in financial fraud detection, where special data mining approaches are used along with traditional classification algorithms to tackle this issue. The class imbalance problem occurs when one of the classes has many more instances than another, and it becomes more severe in a big data context. The datasets used to build and train the models contain an extremely small portion of the minority group, also known as positives, in comparison to the majority class, known as negatives. In most cases it is more delicate and crucial to correctly classify the minority group rather than the majority, as in fraud detection or disease diagnosis: the fraud and the disease are the minority groups, and because of its dangerous consequences it is more critical to detect a fraud record than a normal one. These class proportions make it very difficult for a machine learning classifier to learn the characteristics and patterns of the minority group; classifiers are biased towards the majority group because of its many examples in the dataset and learn to classify it much faster than the other group. After conducting a thorough study of the challenges faced in class imbalance cases, we found that we still cannot reach an acceptable sensitivity (i.e. good classification of the minority group) without a significant decrease in accuracy. This leads to another challenge, which is the choice of performance measures used to evaluate models. In these cases the choice is not straightforward, since accuracy or sensitivity alone is misleading; we use other measures, such as the precision-recall curve or the F1-score, to evaluate the trade-off between accuracy and sensitivity. Our objective is to build an imbalanced classification model that considers the extreme class imbalance and the false alarms, in a big data framework. We developed two approaches: a Cost-Sensitive Cosine Similarity K-Nearest Neighbor (CoSKNN) as a single classifier, and a K-modes Imbalance Classification Hybrid Approach (K-MICHA) as an ensemble learning methodology. In CoSKNN, our aim was to tackle the imbalance problem by using cosine similarity as a distance metric and by introducing a cost-sensitive score for the classification using the KNN algorithm. We conducted a comparative validation experiment in which we prove the effectiveness of CoSKNN in terms of accuracy and fraud detection. The aim of K-MICHA, on the other hand, is to cluster similar data points in terms of the classifiers' outputs and then to calculate the fraud probabilities in the obtained clusters in order to use them for detecting frauds in new transactions. This approach can be applied to the detection of any type of financial fraud for which labelled data are available. In the end, we applied K-MICHA to credit card, mobile payment and auto insurance fraud datasets. In all three case studies, we compare K-MICHA with stacking using voting, weighted voting, logistic regression and CART. We also compared with AdaBoost and random forest. We prove the efficiency of K-MICHA based on these experiments.
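The cost-sensitive, cosine-similarity KNN idea behind CoSKNN can be illustrated with a small sketch; this is a generic illustration rather than the exact scoring used in the thesis, and the `minority_weight` parameter is an assumed knob for up-weighting fraud neighbours.

```python
import numpy as np

def cosine_similarity(query, X):
    """Cosine similarity between one query vector and every row of X."""
    num = X @ query
    den = np.linalg.norm(X, axis=1) * np.linalg.norm(query) + 1e-12
    return num / den

def cost_sensitive_knn(X_train, y_train, query, k=5, minority_weight=5.0):
    """Vote among the k most cosine-similar neighbours, weighting votes by
    similarity and giving minority-class (fraud) neighbours extra weight."""
    sims = cosine_similarity(query, X_train)
    nn = np.argsort(-sims)[:k]                       # indices of the top-k neighbours
    weights = sims[nn] * np.where(y_train[nn] == 1, minority_weight, 1.0)
    fraud_score = weights[y_train[nn] == 1].sum()
    normal_score = weights[y_train[nn] == 0].sum()
    return int(fraud_score >= normal_score)

# toy usage on random data (1 = fraud, heavily under-represented)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = (rng.random(500) < 0.03).astype(int)
print(cost_sensitive_knn(X, y, X[0], k=7))
```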
APA, Harvard, Vancouver, ISO, and other styles
23

Gladh, Marcus, and Daniel Sahlin. "Image Synthesis Using CycleGAN to Augment Imbalanced Data for Multi-class Weather Classification." Thesis, Linköpings universitet, Medie- och Informationsteknik, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-176991.

Full text
Abstract:
In the last decade, convolutional neural networks have been used extensively for image classification and recognition tasks in a number of fields. For image weather classification, data can be both sparse and unevenly distributed amongst labels in the training set. To improve the performance of the classifier, traditional augmentation techniques are often used to increase the size of the training set and help the classifier converge towards a desirable solution. This can be met with varying results, which is why this work investigates another augmentation approach based on image synthesis. The idea is to make use of the fact that most datasets contain at least one label that is well represented; in weather image datasets, this is often the sunny label. CycleGAN is a framework capable of image-to-image translation (i.e. synthesizing images to represent a new label) using unpaired data, which makes it attractive since it does not put any unnecessary requirements on the data collection. To test whether the synthesized images can be used as an augmentation approach, the number of training samples in one label was deliberately reduced sequentially and supplemented with CycleGAN-synthesized images. The results show that adding synthesized images using CycleGAN works as an augmentation approach, since the performance of the classifier remained relatively unchanged even when the number of images was low, in this case as few as 198 training samples in the label that represented foggy weather. Compared to traditional augmentation techniques, CycleGAN proved to be more stable as the number of images in the training set decreased. A modification to CycleGAN, which used weight demodulation instead of instance normalization in its generators, removed artifacts that could otherwise appear during training and improved the overall visual quality of the synthesized images.

The thesis work was carried out at the Department of Science and Technology (ITN) of the Faculty of Science and Engineering, Linköping University.

APA, Harvard, Vancouver, ISO, and other styles
24

Tumati, Saini. "A Combined Approach to Handle Multi-class Imbalanced Data and to Adapt Concept Drifts using Machine Learning." University of Cincinnati / OhioLINK, 2021. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1623240328088387.

Full text
APA, Harvard, Vancouver, ISO, and other styles
25

Yang, Shaojie. "A Data Augmentation Methodology for Class-imbalanced Image Processing in Prognostic and Health Management." University of Cincinnati / OhioLINK, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=ucin161375046654683.

Full text
APA, Harvard, Vancouver, ISO, and other styles
26

Olaitan, Olubukola. "SCUT-DS: Methodologies for Learning in Imbalanced Data Streams." Thesis, Université d'Ottawa / University of Ottawa, 2018. http://hdl.handle.net/10393/37243.

Full text
Abstract:
The automation of most of our activities has led to the continuous production of data that arrive in the form of fast-arriving streams. In a supervised learning setting, instances in these streams are labeled as belonging to a particular class. When the number of classes in the data stream is more than two, such a data stream is referred to as a multi-class data stream. A multi-class imbalanced data stream describes the situation where the instance distribution of the classes is skewed, such that instances of some classes occur more frequently than others. Classes with frequently occurring instances are referred to as the majority classes, while classes whose instances occur less frequently are denoted as the minority classes. Classification algorithms, or supervised learning techniques, use historic instances to build models, which are then used to predict the classes of unseen instances. Multi-class imbalanced data stream classification poses a great challenge to classical classification algorithms, because traditional algorithms are usually biased towards the majority classes, of which they see more examples when building the model. These traditional algorithms yield low predictive accuracy for the minority instances and need to be augmented, often with some form of sampling, in order to improve their overall performance. In the literature, in both static and streaming environments, most studies focus on the binary class imbalance problem, and research on multi-class imbalance in the data stream environment is limited. A number of researchers have proceeded by transforming a multi-class imbalanced setting into multiple binary class problems; however, such a transformation does not allow the stream to be studied in its original form and may introduce bias. The research conducted in this thesis aims to address this gap by proposing a novel online learning methodology that combines oversampling of the minority classes with cluster-based majority class under-sampling, without decomposing the data stream into multiple binary sets. Rather, sampling involves continuously selecting a balanced number of instances across all classes for model building. Our focus is on improving the rate of correctly predicting instances of the minority classes in multi-class imbalanced data streams, through the introduction of the Synthetic Minority Over-sampling Technique (SMOTE) and Cluster-based Under-sampling - Data Streams (SCUT-DS) methodologies. In this work, we dynamically balance the classes by utilizing a windowing mechanism during the incremental sampling process. Our SCUT-DS algorithms are evaluated using six different types of classification techniques, and their results are compared against a state-of-the-art algorithm. Our contributions are tested using both synthetic and real data sets. The experimental results show that the approaches developed in this thesis yield high prediction rates for minority instances contained in the multiple minority classes within a non-evolving stream.
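The combination of minority-class SMOTE with cluster-based under-sampling of the majority class can be sketched for a single window of data as follows; this is only an illustration of the idea using scikit-learn and imbalanced-learn, not the SCUT-DS streaming algorithm itself, and it assumes every class in the window has at least a handful of examples.

```python
import numpy as np
from sklearn.cluster import KMeans
from imblearn.over_sampling import SMOTE

def rebalance_window(X, y, majority_label, random_state=0):
    """Under-sample the majority class down to the size of the largest
    minority class by replacing it with K-means centroids, then let SMOTE
    raise every smaller class up to that same size."""
    counts = {c: int(np.sum(y == c)) for c in np.unique(y) if c != majority_label}
    target = max(counts.values())                     # new per-class size

    X_maj = X[y == majority_label]
    km = KMeans(n_clusters=target, n_init=10, random_state=random_state).fit(X_maj)

    X_new = np.vstack([km.cluster_centers_, X[y != majority_label]])
    y_new = np.concatenate([np.full(target, majority_label), y[y != majority_label]])
    # SMOTE oversamples every remaining class up to the largest class size
    return SMOTE(random_state=random_state).fit_resample(X_new, y_new)
```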
APA, Harvard, Vancouver, ISO, and other styles
27

Ayhan, Dilber. "Multi-class Classification Methods Utilizing Mahalanobis Taguchi System And A Re-sampling Approach For Imbalanced Data Sets." Master's thesis, METU, 2009. http://etd.lib.metu.edu.tr/upload/3/12610521/index.pdf.

Full text
Abstract:
Classification approaches are used in many areas to identify or estimate the classes to which different observations belong. In this thesis, the classification approach known as the Mahalanobis Taguchi System (MTS) is analyzed and further improved for multi-class classification problems. MTS tries to identify significant variables and classifies a new observation based on its Mahalanobis distance (MD). In the first part of this study, sample size problems, which are encountered mostly in small data sets, and multicollinearity problems, which constitute some limitations of MTS, are analyzed, and a re-sampling approach is explored as a solution. Our re-sampling approach, which only works for data sets with two classes, is a combination of over-sampling and under-sampling. The over-sampling is based on SMOTE, which generates synthetic observations between the nearest neighbors of observations in the minority class. In addition, MTS models are used to test the performance of several re-sampling parameters, for which the most appropriate values are sought specific to each case. In the second part, multi-class classification methods based on MTS are developed. An algorithm named Feature Weighted Multi-class MTS-I (FWMMTS-I) is inspired by the feature-weighted MD; it relaxes the assumption that the MD contributions of all variables are added up with equal weight, so that noisy variables receive weights close to zero and do not mask the other variables. As a second multi-class classification algorithm, the original MTS method is extended to multi-class problems, which is called Multi-class MTS (MMTS). In addition, an approach comparable to that of Su and Hsiao (2009), which also considers variable weights, is studied with a modification in the MD calculation; it is named Feature Weighted Multi-class MTS-II (FWMMTS-II). The methods are compared on eight different multi-class data sets using 5-fold stratified cross-validation. Results show that FWMMTS-I is as accurate as MMTS, and both are better than FWMMTS-II. Interestingly, the Mahalanobis Distance Classifier (MDC), which uses all the variables directly in the classification model, performed equally well on the studied data sets.
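For readers unfamiliar with MD-based classification, a minimal sketch of a plain Mahalanobis distance classifier follows; it is not the MTS or the feature-weighted variants developed in the thesis, and the small ridge term added to the covariance is an assumption to keep it invertible.

```python
import numpy as np

class MahalanobisDistanceClassifier:
    """Assigns a sample to the class whose mean is closest in Mahalanobis
    distance, using a per-class covariance estimate."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.means_ = {c: X[y == c].mean(axis=0) for c in self.classes_}
        self.inv_covs_ = {
            c: np.linalg.inv(np.cov(X[y == c], rowvar=False)
                             + 1e-6 * np.eye(X.shape[1]))  # ridge keeps it invertible
            for c in self.classes_
        }
        return self

    def predict(self, X):
        # squared Mahalanobis distance of every sample to every class mean
        dists = np.stack([
            np.einsum('ij,jk,ik->i', X - self.means_[c],
                      self.inv_covs_[c], X - self.means_[c])
            for c in self.classes_
        ], axis=1)
        return self.classes_[np.argmin(dists, axis=1)]
```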
APA, Harvard, Vancouver, ISO, and other styles
28

Orriols, Puig Albert. "New Challenges in Learning Classifier Systems: Mining Rarities and Evolving Fuzzy Models." Doctoral thesis, Universitat Ramon Llull, 2008. http://hdl.handle.net/10803/9159.

Full text
Abstract:
During the last decade, Michigan-style learning classifier systems (LCSs) - genetic-based machine learning (GBML) methods that combine apportionment of credit techniques and genetic algorithms (GAs) to evolve a population of classifiers online - have been enjoying a renaissance. Together with the formulation of first generation systems, there have been crucial advances in (1) systematic design of new competent LCSs, (2) applications in important domains, and (3) theoretical analyses for design. Despite these successful designs and applications, there still remain difficult challenges that need to be addressed to increase our comprehension of how LCSs behave and to scalably and efficiently solve real-world problems.
The purpose of this thesis is to address two important challenges - shared by the machine learning community - with Michigan-style LCSs: (1) learning from domains that contain rare classes and (2) evolving highly legible models in which human-like reasoning mechanisms are employed. Extracting accurate models from rare classes is critical since the key, unperceptive knowledge usually resides in the rarities, and many traditional learning techniques are not able to model rarity accurately. Besides, these difficulties are increased in online learning, where the learner receives a stream of examples and has to detect rare classes on the fly. Evolving highly legible models is crucial in some domains such as medical diagnosis, in which human experts may be more interested in the explanation of the prediction than in the prediction itself.
The contributions of this thesis take two Michigan-style LCSs as starting point: the extended classifier system (XCS) and the supervised classifier system (UCS). XCS is taken as the first reference of this work since it is the most influential LCS. UCS is a recent LCS design that has inherited the main components of XCS and has specialized them for supervised learning. As this thesis is especially concerned with classification problems, UCS is also considered in this study. Since UCS is still a young system, for which there are several open issues that need further investigation, its learning architecture is first revised and updated. Moreover, to illustrate the key differences between XCS and UCS, the behavior of both systems is compared, showing that UCS converges more quickly than XCS on a collection of boundedly difficult problems.
The study of learning from rare classes with LCSs starts with an analytical approach in which the problem is decomposed in five critical elements, and facetwise models are derived for each element. The analysis is used as a tool for designing configuration guidelines that enable XCS and UCS to solve problems that previously eluded solution. Thereafter, the two LCSs are compared with several highly-influential learners on a collection of real-world problems with rare classes, appearing as the two best techniques of the comparison. Moreover, re-sampling the training data set to eliminate the presence of rare classes is demonstrated to benefit, on average, the performance of LCSs.
The challenge of building more legible models and using human-like reasoning mechanisms is addressed with the design of a new LCS for supervised learning that combines the online evaluation capabilities of LCSs, the search robustness over complex spaces of GAs, and the legible knowledge representation and principled reasoning mechanisms of fuzzy logic. The system resulting from this crossbreeding of ideas, referred to as Fuzzy-UCS, is studied in detail and compared with several highly competent learning systems, demonstrating the competitiveness of the new architecture in terms of the accuracy and the interpretability of the evolved models. In addition, the benefits provided by the online architecture are exemplified by extracting accurate classification models from large data sets.
Overall, the advances and key insights provided in this thesis help advance our understanding of how LCSs work and prepare these types of systems to face increasingly difficult problems, which abound in current industrial and scientific applications. Furthermore, experimental results highlight the robustness and competitiveness of LCSs with respect to other machine learning techniques, which encourages their use to face new challenging real-world applications.
APA, Harvard, Vancouver, ISO, and other styles
29

Anne, Chaitanya. "Advanced Text Analytics and Machine Learning Approach for Document Classification." ScholarWorks@UNO, 2017. http://scholarworks.uno.edu/td/2292.

Full text
Abstract:
Text classification is used in information extraction and retrieval from a given text, and it has been considered an important step in managing the vast and expanding number of records available in digital form. This thesis addresses the problem of classifying patent documents into fifteen different categories or classes, where some classes overlap with others for practical reasons. For the development of the classification model using machine learning techniques, useful features have been extracted from the given documents. The features are used to classify patent documents as well as to generate useful tag-words. The overall objective of this work is to systematize NASA's patent management by developing a set of automated tools that can assist NASA to manage and market its portfolio of intellectual properties (IP), and to enable easier discovery of relevant IP by users. We have identified an array of methods that can be applied, such as k-Nearest Neighbors (kNN), two variations of the Support Vector Machine (SVM) algorithm, and two tree-based classification algorithms: Random Forest and J48. The major research steps in this work consist of filtering techniques for variable selection, information gain and feature correlation analysis, and training and testing potential models using effective classifiers. Further, the obstacles associated with the imbalanced data were mitigated by adding synthetic data wherever appropriate, which resulted in a superior SVM-based classification model.
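One of the model families mentioned above can be sketched as a TF-IDF plus linear SVM pipeline with balanced class weights as a simple counterweight to imbalance; the thesis instead mitigated imbalance by adding synthetic data, so this is an illustrative variant, and the toy corpus and category names below are invented for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# assumed toy corpus; the thesis worked on NASA patent documents
docs = ["thermal protection coating for re-entry vehicles",
        "machine learning method for fault detection",
        "propulsion system nozzle design",
        "sensor array for structural health monitoring"]
labels = ["materials", "software", "propulsion", "sensors"]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),   # unigram + bigram features
    LinearSVC(class_weight="balanced"),               # up-weights rare categories
)
model.fit(docs, labels)
print(model.predict(["coating for heat shield materials"]))
```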
APA, Harvard, Vancouver, ISO, and other styles
30

Lento, Gabriel Carneiro. "Random forest em dados desbalanceados: uma aplicação na modelagem de churn em seguro saúde." reponame:Repositório Institucional do FGV, 2017. http://hdl.handle.net/10438/18256.

Full text
Abstract:
In this work we study churn in health insurance, that is, predicting which clients will cancel the product or service within a preset time frame. Traditionally, the probability that a client will cancel the service is modeled using logistic regression. Recently, modern machine learning techniques have become popular in churn modeling, having been applied in the areas of telecommunications, banking, and car insurance, among others. One of the big challenges in this problem is that only a fraction of all customers cancel the service, meaning that we have to deal with highly imbalanced class probabilities. Under-sampling and over-sampling techniques have been used to overcome this issue. We use random forests, which are ensembles of decision trees, where each tree fits a subsample of the data constructed using either under-sampling or over-sampling. We compare the different specifications of random forests using various metrics that are robust to imbalanced classes, both in-sample and out-of-sample. We observe that random forests using imbalanced random subsamples with fewer observations than the original data present the best overall performance. Random forests also perform better than the classical logistic regression often used by health insurance companies to model churn.
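The core idea, fitting each tree of the forest on a resampled subsample, can be sketched as follows; this is a generic balanced-bootstrap forest built from scikit-learn decision trees, not the exact random forest specifications compared in the dissertation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class UnderSampledForest:
    """Random forest in which every tree is trained on a balanced bootstrap
    sample: each tree sees as many majority (non-churn) cases as minority
    (churn) cases."""
    def __init__(self, n_trees=100, random_state=0):
        self.n_trees = n_trees
        self.rng = np.random.default_rng(random_state)

    def fit(self, X, y):
        pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
        n = len(pos)                                  # minority class size
        self.trees_ = []
        for _ in range(self.n_trees):
            idx = np.concatenate([self.rng.choice(pos, n, replace=True),
                                  self.rng.choice(neg, n, replace=True)])
            tree = DecisionTreeClassifier(max_features="sqrt")
            self.trees_.append(tree.fit(X[idx], y[idx]))
        return self

    def predict_proba_churn(self, X):
        # fraction of trees that vote "churn" for each sample
        return np.mean([t.predict(X) for t in self.trees_], axis=0)
```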
APA, Harvard, Vancouver, ISO, and other styles
31

Briend, Cyril. "Le contrat d'adhésion entre professionnels." Thesis, Sorbonne Paris Cité, 2015. http://www.theses.fr/2015USPCB177/document.

Full text
Abstract:
The professional, presumed able to defend his interests, in contrast to the employee or the consumer, has proven over the last few decades to be just as much a victim of imbalanced contracts. The emergence of powerful private companies in various sectors clearly leads to inequalities between professionals. Our study underlines how difficult it is to find the right criterion for identifying a professional weaker party. It is impossible to say that, in general, one company is stronger than another, because the legal person party to the agreement can hide interests that are hard to grasp at first sight. Nor can the judge arbitrate prices in an authoritarian way without risking a misuse of his function. We argue for the following idea: a business-to-business agreement is to be qualified as an adhesion contract as long as it has not given rise to adequate bargaining; the judge therefore has to examine the bargaining process and the circumstances preceding the contract. Many criteria can help the judge, such as the size of each company, market shares, the words exchanged by the parties, their good or bad faith, or the efforts they have made. While we consider the bargaining analysis ultimately the soundest choice, we must also acknowledge its limitations. It would not be realistic to assume that the judge can always discover with certainty every circumstance prior to the agreement. This is why we add to the bargaining analysis a system of presumptions, albeit rebuttable ones, when the difference in the size of the companies or the disproportion of the provisions leaves no room for doubt. We also bring to light the strategies used by the stronger parties to bypass the bargaining analysis, such as harmful clauses or internationalization tactics, and we therefore opt for strengthened mandatory rules in both national and international law. Once the bargaining analysis is done, we try to suggest sanctions commensurate with the problem. The judge, in our opinion, must be able to modify the agreement in a flexible way, either retroactively or during the performance of the agreement. The gravity of some contractual behaviors leads us to consider a more dissuasive criminal law, or a "quasi-criminal" law, to sanction those behaviors in a more suitable manner. Nevertheless, the contractual protection of professionals is above all a matter of procedure. A proceeding for interim measures adjusted to this objective is well suited to meet the need for celerity that hinders weaker parties in bringing their actions. We also underline the importance of a class action system that effectively overcomes the obstacle of the cost of a lawsuit. Conversely, the legal certainty of businesses leads us to propose a protection procedure based on a soft law system. First Part: The identification of the business-to-business adhesion contract. Second Part: The judicial treatment of business-to-business adhesion contracts.
APA, Harvard, Vancouver, ISO, and other styles
32

Chang, Yu-shan, and 張毓珊. "Developing Data Mining Models for Class Imbalance Problems." Thesis, 2009. http://ndltd.ncl.edu.tw/handle/57781951199735409394.

Full text
Abstract:
Master's thesis
Chaoyang University of Technology
Master's Program, Department of Information Management
Academic year 98 (2009-2010)
In classification problems, the class imbalance problem causes a bias in the training of classifiers and results in low predictive accuracy over the minority class examples. The problem arises from imbalanced data in which almost all examples belong to one class and far fewer instances belong to the others. Compared with the majority examples, the minority examples usually form the more interesting class, such as rare diseases in medical diagnosis data, failures in inspection data, and frauds in credit screening data. When inducing knowledge from an imbalanced data set, traditional data mining algorithms achieve high classification accuracy for the majority class but an unacceptable error rate for the minority class; they are therefore not suitable for handling class imbalanced data. In order to tackle the class imbalance problem, this study aims to (1) find a robust classifier among different candidates, including Decision Tree (DT), Logistic Regression (LR), Mahalanobis Distance (MD), and Support Vector Machines (SVM); and (2) propose two novel methods called MD-SVM (a new two-phase learning scheme) and SWAI (SOM Weights As Input). Experimental results indicate that the proposed MD-SVM and SWAI have better performance in identifying minority class examples than traditional techniques such as under-sampling, cost adjusting, and cluster-based sampling.
APA, Harvard, Vancouver, ISO, and other styles
33

Huang, Li-Jyun, and 黃俐君. "Resolving Intra-Class Imbalance for GAN-based Data Augmentation." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/xgc4e2.

Full text
Abstract:
Master's thesis
National Chiao Tung University
Institute of Computer Science and Engineering
Academic year 106 (2017-2018)
Most equipment failure and defect detection applications share a main problem: data imbalance. As a result, most classifiers do not predict very well. Researchers have proposed techniques to reduce or augment the data, and the resulting classifiers show improved accuracy, but still not well enough, since images of some specific types are sparse and some new types of data may still not have a sufficient number of examples for training. The reason for these problems is that most data augmentation algorithms mainly deal with data imbalance between categories. After clustering a single category, we find that even within a category the forms of the data may still be very diverse and imbalanced. Therefore, we modify the design of the Generative Adversarial Network (GAN), a deep-learning-based data augmentation approach, to take this intra-category data imbalance into account. In this thesis, we propose ACGAN and GAN_BIAS, GAN systems with a systematic control for generating diverse defective images. In order to make the generative adversarial network automatically balance the data of different clusters, we use an actor-critic algorithm to adjust the weights of the various clusters in the loss function. The experimental results show that ACGAN and GAN_BIAS are more effective than a traditional GAN in dealing with the imbalance between clusters within a class.
APA, Harvard, Vancouver, ISO, and other styles
34

Pan, Yi-Ying, and 潘怡瑩. "Clustering-based Data Preprocessing Approach for the Class Imbalance Problem." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/94nys8.

Full text
Abstract:
Master's thesis
National Central University
Department of Information Management
Academic year 106 (2017-2018)
The class imbalance problem is an important issue in data mining. It occurs when the number of samples in one class is much larger than in the other classes. Traditional classifiers tend to misclassify most samples of the minority class into the majority class in order to maximize overall accuracy, which makes it hard to establish a good classification rule for the minority class. The class imbalance problem occurs in many real-world applications, such as fault diagnosis, medical diagnosis and face recognition. To deal with it, a clustering-based data preprocessing approach is proposed, in which two different clustering techniques, affinity propagation clustering and K-means clustering, are used individually to divide the majority class into several subclasses, resulting in multiclass data. This approach can effectively reduce the class imbalance ratio of the training dataset, shorten the training time and improve classification performance. Our experiments use forty-four small class imbalance datasets from KEEL and eight high-dimensional datasets from NASA to build five types of classification models: C4.5, MLP, Naïve Bayes, SVM and k-NN (k=5). In addition, a classifier ensemble algorithm is also employed. This research compares the AUC results of the different clustering techniques, the different classification models and the number of clusters used for K-means clustering, in order to find the best configuration of the proposed approach and to compare it with methods from the literature. Finally, the experimental results on the KEEL datasets show that the k-NN (k=5) algorithm is the best choice regardless of whether affinity propagation or K-means (K=5) is used, and the results on the NASA datasets show that the proposed approach is superior to the literature methods for high-dimensional datasets.
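The clustering-based preprocessing step can be illustrated with a short sketch: the majority class is split into K pseudo-classes with K-means, a classifier is trained on the resulting multiclass labels, and predictions are mapped back to the original binary labels. The label names and the choice of k below are illustrative assumptions, not the configurations evaluated in the thesis.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

def split_majority_into_subclasses(X, y, k=5, majority_label=0, random_state=0):
    """Relabel the majority class into k cluster-based pseudo-classes so the
    classifier sees a much lower imbalance ratio."""
    maj = (y == majority_label)
    clusters = KMeans(n_clusters=k, n_init=10,
                      random_state=random_state).fit_predict(X[maj])
    y_new = np.where(maj, "", "minority").astype(object)
    y_new[maj] = [f"majority_{c}" for c in clusters]
    return y_new

# toy usage: train on the pseudo-classes, then map predictions back to binary
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (rng.random(300) < 0.1).astype(int)                  # 1 = minority class
clf = KNeighborsClassifier(n_neighbors=5).fit(X, split_majority_into_subclasses(X, y))
pred_binary = (clf.predict(X) == "minority").astype(int)
```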
APA, Harvard, Vancouver, ISO, and other styles
35

Lu, Yi-Wei, and 呂逸瑋. "Conditional Generative Adversarial Network for Defect Classification with Class Imbalance." Thesis, 2019. http://ndltd.ncl.edu.tw/handle/gku365.

Full text
Abstract:
Master's thesis
Yuan Ze University
Department of Information Management
Academic year 107 (2018-2019)
Automated Optical Inspection (AOI) is used for defect inspection during industrial manufacturing processes. It uses optical instruments to capture the surface of products and identifies defects through machine vision techniques. Deep learning and convolutional neural networks automatically produce features that are useful for identifying defects correctly. However, class imbalance between the number of defect samples and normal samples is typical in industrial processes and leads to poor accuracy of deep learning models. This thesis proposes a framework named CGANC, which integrates a conditional generative adversarial network (GAN) that can generate synthetic images automatically, producing additional defect images to adjust the data distribution and counter the class imbalance. Finally, a convolutional neural network trained on the manipulated data achieves better defect classification results than one trained on the original data.
APA, Harvard, Vancouver, ISO, and other styles
36

Lin, Li-wei, and 林立為. "A Study of Developing the Methods for Solving Class Imbalance Problems." Thesis, 2010. http://ndltd.ncl.edu.tw/handle/02009794153964859900.

Full text
Abstract:
Master's thesis
Chaoyang University of Technology
Master's Program, Department of Information Management
Academic year 98 (2009-2010)
Class imbalance problems have attracted much attention in the field of machine learning. The problem arises from training examples in which the number of examples of one particular class is much larger than in the other classes. When learning from such imbalanced data, traditional machine learning algorithms achieve relatively high accuracy over the majority examples but an unacceptable error rate for the minority class instances, which are usually the important ones. In order to solve this problem, this study proposes two novel methods, called Modified Cluster Based Sampling (MCBS) and BPN Based Voting Scheme (BPS). Seven data sets from the UCI repository and three real cases of bloggers' sentiment classification are used to verify the effectiveness of the proposed methods, and four-fold cross-validation experiments are carried out to obtain high-quality solutions. MCBS improves on the shortcomings of the traditional cluster-based sampling method, while BPS enhances the traditional voting scheme by using a BPN network to obtain the optimal vote weights. In addition, the proposed methodologies are applied to classify textual sentiment data, which usually suffers from high dimensionality and small sample sizes. Experimental results indicate that, compared with conventional treatments for imbalanced data, such as under-sampling, cluster-based sampling, self-organizing map weights, a two-stage learning strategy and one-class learning, the proposed methods not only increase the ability to detect minority examples but also achieve stable classification performance.
APA, Harvard, Vancouver, ISO, and other styles
37

Komba, Lyee, and Lyee Komba. "Sampling Techniques for Class Imbalance Problem in Aviation Safety Incidents Data." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/jg2y52.

Full text
Abstract:
Master's thesis
National Taipei University of Technology
International Graduate Program in Electrical Engineering and Computer Science
Academic year 106 (2017-2018)
Like other industries, the aviation industry acquires a variety of data every day through numerous data management systems. Structured and unstructured data are collected through aircraft systems, maintenance systems, supply systems, ticketing and booking systems, and many other systems used in the daily operations of the aviation business. Data mining can be used to analyze all these different types of data to generate meaningful information that can improve future performance, safety and profitability for aviation business and operations. This thesis presents data mining methods, based on aviation incident data, for predicting incidents with fatal consequences. Other work applying data mining techniques within the aviation industry includes prediction of passenger travel, meteorological prediction, component failure prediction and other fatal-incident prediction studies aimed at finding the right features. This study uses the public dataset from the Federal Aviation Administration Accidents and Incidents Data System (FAA AIDS) website, covering records from the year 2000 to 2017. Our goal is to build a prediction model for fatal incidents and to generate decision rules or factors contributing to incidents with fatal results; in this way the model becomes a predictive risk management system for aviation safety. The aviation industry generally operates in a safe state because of the transition from reactive safety and risk management to a proactive safety management approach, and now to a predictive approach based on data mining techniques such as those in this study. Over time, the number of systems has increased and the number of aviation accidents and serious incidents has decreased; accordingly, only 0.6% of the incidents in our analysis had fatal consequences. During the data preprocessing stage, the problem of an unbalanced dataset is encountered, which prompts us to propose techniques to solve it. Unbalanced datasets are datasets in which the minority classes are represented by far fewer records than the majority class, which is especially problematic when the analysis is aimed at the minority class; not dealing with this issue correctly may result in poorly performing models or misclassified data. With the growth of the travelling population in the aviation community, safety is paramount, so building a relatively precise model is important, and doing so requires efficient preprocessing and resampling of the data. This thesis therefore also addresses the issue of unbalanced data, producing balanced data that can be used to train a precise classifier. We applied the following sampling techniques in RStudio to resolve the imbalance: oversampling, under-sampling, SMOTE and bootstrap sampling. The datasets resulting from these techniques are used to train different classifiers, and the performance of the classifiers is measured and discussed in this thesis.
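The thesis applied these resampling techniques in R; the sketch below shows an analogous comparison in Python using imbalanced-learn, with a synthetic dataset whose minority rate roughly mimics the 0.6% of fatal incidents. The library calls and parameters are assumptions for illustration, not the thesis code.

```python
from collections import Counter
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

# synthetic stand-in: ~0.6% of samples are "fatal" incidents
X, y = make_classification(n_samples=5000, weights=[0.994, 0.006],
                           n_informative=5, random_state=0)

samplers = {"oversampling": RandomOverSampler(random_state=0),
            "undersampling": RandomUnderSampler(random_state=0),
            "SMOTE": SMOTE(random_state=0)}
for name, sampler in samplers.items():
    X_res, y_res = sampler.fit_resample(X, y)
    print(name, Counter(y_res))    # each technique yields a balanced training set
```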
APA, Harvard, Vancouver, ISO, and other styles
38

Dai, Yu-Ting, and 戴郁庭. "Missing value imputation for class imbalance data: a dynamic warping approach." Thesis, 2019. http://ndltd.ncl.edu.tw/handle/6cax3v.

Full text
Abstract:
Master's thesis
National Central University
Department of Information Management
Academic year 107 (2018-2019)
In a world full of information, more and more companies want to use data to improve their competitiveness. However, the problems of class imbalance and missing values have always been important issues in the real world. Class imbalance datasets occur in many fields, such as medical diagnosis and bankruptcy prediction: the number of samples of the majority class is much larger than that of the minority class, and the data distribution is skewed. In order to maximize the overall classification accuracy, a prediction model built by a standard classifier tends to misclassify minority samples as the majority class because of this skewed distribution. If the precious minority class also contains missing data, the available data become even rarer. In this thesis, dynamic time warping is used as the core of the missing value imputation task: its alignment property is used to address missing data in the minority class, which contains only a small number of samples, and the method does not require complete reference samples. In the experiments, missing rates of 10%, 30%, 50%, 70% and 90% in the minority class are simulated. We use 17 KEEL datasets, build two classification models (SVM and decision tree), and examine the AUC (area under the curve) for the different methods. The experimental results show that dynamic time warping performs well at missing rates of 50% to 90%, outperforming the KNN imputation method.
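A compact sketch of the underlying building block follows: a classic dynamic-time-warping distance plus a nearest-neighbour imputation that copies missing values from the most similar donor row. Unlike the thesis method, this simplified illustration assumes a pool of complete donor rows is available.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping distance between
    two 1-D sequences."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def impute_by_dtw(sample, complete_rows):
    """Fill NaNs in `sample` with values taken from the complete row whose
    observed part is closest to the sample under DTW."""
    observed = ~np.isnan(sample)
    dists = [dtw_distance(sample[observed], row[observed]) for row in complete_rows]
    donor = complete_rows[int(np.argmin(dists))]
    filled = sample.copy()
    filled[~observed] = donor[~observed]
    return filled
```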
APA, Harvard, Vancouver, ISO, and other styles
39

Buda, Mateusz. "A systematic study of the class imbalance problem in convolutional neural networks." Thesis, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-219872.

Full text
Abstract:
In this study, we systematically investigate the impact of class imbalance on classification performance of convolutional neural networks and compare frequently used methods to address the issue. Class imbalance refers to significantly different number of examples among classes in a training set. It is a common problem that has been comprehensively studied in classical machine learning, yet very limited systematic research is available in the context of deep learning. We define and parameterize two representative types of imbalance, i.e. step and linear. Using three benchmark datasets of increasing complexity, MNIST, CIFAR-10 and ImageNet, we investigate the effects of imbalance on classification and perform an extensive comparison of several methods to address the issue: oversampling, undersampling, two-phase training, and thresholding that compensates for prior class probabilities. Our main evaluation metric is area under the receiver operating characteristic curve (ROC AUC) adjusted to multi-class tasks since overall accuracy metric is associated with notable difficulties in the context of imbalanced data. Based on results from our experiments we conclude that (i) the effect of class imbalance on classification performance is detrimental and increases with the extent of imbalance and the scale of a task; (ii) the method of addressing class imbalance that emerged as dominant in almost all analyzed scenarios was oversampling; (iii) oversampling should be applied to the level that totally eliminates the imbalance, whereas undersampling can perform better when the imbalance is only removed to some extent; (iv) thresholding should be applied to compensate for prior class probabilities when overall number of properly classified cases is of interest; (v) as opposed to some classical machine learning models, oversampling does not necessarily cause overfitting of convolutional neural networks.
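The thresholding method mentioned in point (iv), compensating predicted scores for the prior class probabilities, can be sketched in a few lines; this is a generic illustration (the study applied the idea to CNN outputs), and the toy probabilities and priors are made up.

```python
import numpy as np

def threshold_by_priors(probs, priors):
    """Divide each predicted class probability by the training-set prior of
    that class, then pick the class with the largest corrected score."""
    corrected = probs / np.asarray(priors)           # broadcast over samples
    return np.argmax(corrected, axis=1)

# toy example: a classifier biased toward the majority class (prior 0.95)
probs = np.array([[0.90, 0.10],                      # raw argmax would pick class 0
                  [0.60, 0.40]])
priors = [0.95, 0.05]
print(threshold_by_priors(probs, priors))            # -> [1 1] after correction
```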
APA, Harvard, Vancouver, ISO, and other styles
40

Yao, Guan-Ting, and 姚冠廷. "A Two-Stage Hybrid Data Preprocessing Approach for the Class Imbalance Problem." Thesis, 2017. http://ndltd.ncl.edu.tw/handle/dm48kk.

Full text
Abstract:
Master's thesis
National Central University
Department of Information Management
Academic year 105 (2016-2017)
The class imbalance problem is an important issue in data mining. A skewed class distribution occurs when the number of examples representing one class is much lower than that of the other classes. Traditional classifiers tend to misclassify most samples of the minority class into the majority class because they maximize overall accuracy, and this phenomenon limits the construction of effective classifiers for the precious minority class. The problem occurs in many real-world applications, such as fault diagnosis, medical diagnosis and face recognition. To deal with the class imbalance problem, a two-stage hybrid data preprocessing framework based on clustering and instance selection techniques is proposed. This approach filters out noisy data in the majority class and reduces the execution time for classifier training; more importantly, it decreases the effect of class imbalance and performs very well in the classification task. Our experiments use 44 class imbalance datasets from KEEL to build four types of classification models: C4.5, k-NN, Naïve Bayes and MLP, and a classifier ensemble algorithm is also employed. Two kinds of clustering techniques and three kinds of instance selection algorithms are combined in order to find the best combination suited to the proposed method. The experimental results show that the proposed framework performs better than many well-known state-of-the-art approaches in terms of AUC. In particular, the proposed framework combined with bagging-based MLP ensemble classifiers performs the best, providing an AUC of 92%.
APA, Harvard, Vancouver, ISO, and other styles
41

Marath, Sathi. "Large-Scale Web Page Classification." Thesis, 2010. http://hdl.handle.net/10222/13130.

Full text
Abstract:
Web page classification is the process of assigning predefined categories to web pages. Empirical evaluations of classifiers such as Support Vector Machines (SVMs), k-Nearest Neighbor (k-NN), and Naïve Bayes (NB) have shown that these algorithms are effective in classifying small segments of web directories. The effectiveness of these algorithms, however, has not been thoroughly investigated for large-scale web page classification of such popular web directories as Yahoo! and LookSmart. Such web directories have hundreds of thousands of categories, deep hierarchies, spindle-shaped category and document distributions over the hierarchies, and skewed category distributions over the documents. These statistical properties indicate class imbalance and rarity within the dataset. In hierarchical datasets similar to web directories, expanding the content of each category using the web pages of the child categories helps to decrease the degree of rarity. This process, however, results in a localized overabundance of positive instances, especially in the upper-level categories of the hierarchy. The class imbalance, rarity and localized overabundance of positive instances make applying classification algorithms to web directories very difficult, and the problem has not been thoroughly studied. To our knowledge, the maximum number of categories previously classified on web taxonomies is 246,279 categories of the Yahoo! directory using hierarchical SVMs, which led to a Macro-F1 of only 12%. We designed a unified framework for the content-based classification of imbalanced hierarchical datasets. The complete Yahoo! web directory of 639,671 categories and 4,140,629 web pages is used to set up the experiments. In a hierarchical dataset, the prior probability distribution of the subcategories indicates the presence or absence of class imbalance, rarity and the overabundance of positive instances within the dataset. Based on the prior probability distribution and associated machine learning issues, we partitioned the subcategories of the Yahoo! web directory into five mutually exclusive groups. The effectiveness of different data-level, algorithmic and architectural solutions to the associated machine learning issues is explored. The best-performing classification technologies for each prior probability distribution are then identified and integrated into the Yahoo! web directory classification model. The methodology is evaluated using a DMOZ subset of 17,217 categories and 130,594 web pages, and we statistically verified that it works equally well on large and small datasets. The average classifier performance in terms of macro-averaged F1-measure achieved in this research for the Yahoo! web directory and the DMOZ subset is 81.02% and 84.85%, respectively.
APA, Harvard, Vancouver, ISO, and other styles
42

Li, Chao-Ting, and 李兆庭. "Annotation-Effective Active Learning for Extreme Class Imbalance Problem: Application to Lymphocyte Detection in H&E Stained Liver Histopathological Image." Thesis, 2019. http://ndltd.ncl.edu.tw/handle/d6q35g.

Full text
Abstract:
Master's thesis
National Cheng Kung University
Institute of Computer and Communication Engineering
107
Medical image segmentation is a fundamental challenge in medical image analysis. A major concern when applying deep learning to biomedical images is the insufficient number of annotated samples: the annotation process requires specialty-oriented knowledge and there are often very many instances (e.g. cells) in each image, so annotation incurs a great deal of effort and cost. Another concern is the class imbalance problem, a critical obstacle that commonly occurs in biomedical images. In the application of lymphocyte detection, an important lymphocyte subpopulation is far less numerous than the other cells, which biases training toward the majority class. Traditional labeling strategies, such as standard active learning, are ineffective in finding enough minority samples to train on. Hence, this study develops a low-cost manual annotation method for efficient lymphocyte detection in domains exhibiting extreme class imbalance. To address these problems, the thesis proposes an active learning framework that reduces the total labeling workload while tackling the extreme class imbalance problem by both under-sampling the majority class and over-sampling the minority class. Experimental results show that the proposed method achieves an annotation-effective solution for segmentation under extreme class imbalance. The contribution of the proposed method is three-fold: (1) an active learning framework that addresses the extreme class imbalance problem by both under-sampling the majority class and over-sampling the minority class; (2) good performance for lymphocyte detection in histopathological images with fewer labeled samples; and (3) a quantitative analysis of lymphocytes that supports more objective diagnosis.
APA, Harvard, Vancouver, ISO, and other styles
43

Σκρεπετός, Δημήτριος. "Σχεδιασμός και υλοποίηση πολυκριτηριακής υβριδικής μεθόδου ταξινόμησης βιολογικών δεδομένων με χρήση εξελικτικών αλγορίθμων και νευρωνικών δικτύων [Design and implementation of a multi-criteria hybrid method for the classification of biological data using evolutionary algorithms and neural networks]." Thesis, 2014. http://hdl.handle.net/10889/8037.

Full text
Abstract:
Hard classification problems in Bioinformatics, such as microRNA gene prediction and protein-protein interaction (PPI) prediction, demand powerful classifiers that achieve good prediction accuracy, handle missing values, are interpretable, and do not suffer from the class imbalance problem. One widely used classifier is the neural network, which however requires its architecture and other parameters to be specified, while its training algorithms usually converge to local minima. For these reasons, a multi-objective evolutionary method is proposed that uses evolutionary algorithms to optimise many of the aforementioned performance criteria of a neural network and also to find an optimised architecture and a global minimum for its synaptic weights. The resulting population is then used as an ensemble to perform the classification.
APA, Harvard, Vancouver, ISO, and other styles
44

Huang, Cheng-Ting, and 黃正廷. "Hybrid Sampling Strategy to Class-Imbalanced Classification Problem." Thesis, 2015. http://ndltd.ncl.edu.tw/handle/44824582642600593412.

Full text
Abstract:
Master's thesis
Yuan Ze University
Department of Computer Science and Engineering
105
Due to the imbalanced nature of manufacturing production data, in which the number of defective products is far smaller than the number of non-defective products, the classification capability of machine learning is impaired and the prediction accuracy for the minority class is greatly reduced. There are two known approaches to the problem: improving the data imbalance, or improving the classification algorithm. This thesis adopts the data-level approach, also called sampling. Under-sampling and over-sampling are the two major sampling methods; under-sampling may remove useful data, while over-sampling may cause overfitting or create noisy data. This study combines over-sampling and under-sampling to process the imbalanced manufacturing data, and then compares the prediction capability of four machine learning algorithms in terms of sensitivity, specificity, G-mean and other related statistics. The results show that the combined sampling significantly reduces the training time on optical thin-film manufacturing data, and that classification with Random Forest, LibSVM or k-nearest neighbor (KNN) dramatically improves the G-mean, which accounts for the accuracy of both the majority and the minority class. Key words: manufacturing data, over-sampling, under-sampling, machine learning, sensitivity, specificity.
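A hedged sketch of this kind of hybrid sampling, assuming the imbalanced-learn package; the sampling ratios and the Random Forest classifier are illustrative choices rather than the exact setup of the thesis:

import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

# Synthetic data stands in for the manufacturing dataset (class 1 = defective).
X, y = make_classification(n_samples=3000, weights=[0.95, 0.05], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

# Over-sample the defective class part of the way, then under-sample the
# non-defective class, so neither step has to do all the rebalancing on its own.
X_os, y_os = SMOTE(sampling_strategy=0.5, random_state=1).fit_resample(X_tr, y_tr)
X_bal, y_bal = RandomUnderSampler(sampling_strategy=0.8, random_state=1).fit_resample(X_os, y_os)

clf = RandomForestClassifier(n_estimators=200, random_state=1).fit(X_bal, y_bal)
tn, fp, fn, tp = confusion_matrix(y_te, clf.predict(X_te)).ravel()
sensitivity = tp / (tp + fn)   # recall on the defective (minority) class
specificity = tn / (tn + fp)   # recall on the non-defective (majority) class
print("G-mean:", np.sqrt(sensitivity * specificity))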
APA, Harvard, Vancouver, ISO, and other styles
45

Jhang, Jing-Shang, and 張景翔. "Clustering-Based Under-sampling in Class Imbalanced Data." Thesis, 2016. http://ndltd.ncl.edu.tw/handle/63638063606226309685.

Full text
Abstract:
Master's thesis
National Central University
Department of Information Management
104
The class imbalance problem is an important issue in data mining. It occurs when the number of samples representing one class is much smaller than the numbers for the other classes. A classification model built from a class-imbalanced dataset is likely to misclassify most samples of the minority class into the majority class because it maximizes the accuracy rate. The problem is present in many real-world applications, such as fault diagnosis, medical diagnosis and face recognition. One of the most popular types of solution is data sampling, for example under-sampling the majority class or over-sampling the minority class to balance the dataset. Under-sampling balances the class distribution by eliminating majority class samples, but it may discard useful data; conversely, over-sampling replicates minority class samples, but it increases the likelihood of overfitting. We therefore propose several resampling methods based on the k-means clustering technique: to decrease the probability of uneven resampling, representative samples are selected to replace the majority class samples in the training dataset. The experiments use 44 small class-imbalanced datasets and two large-scale datasets to build five types of classification models (C4.5, SVM, MLP, k-NN with k=5 and Naïve Bayes), and a classifier ensemble algorithm is also employed. The study compares AUC results across different resampling techniques, models and numbers of clusters, and additionally divides the imbalance ratio into three intervals, in order to find the best configuration and compare it with methods from the literature. The experimental results show that combining the MLP classifier with clustering-based under-sampling using the nearest neighbors of the cluster centers performs best in terms of AUC over both small and large-scale datasets.
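A minimal sketch of clustering-based under-sampling in the spirit described above, assuming scikit-learn: the majority class is clustered with k-means and only the real sample nearest to each cluster center is kept as its representative.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_maj, X_min = X[y == 0], X[y == 1]

# Use as many clusters as there are minority samples so the reduced majority
# class ends up roughly the same size as the minority class.
k = len(X_min)
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_maj)

# For each cluster center, keep the real majority sample closest to it.
nearest = pairwise_distances_argmin(km.cluster_centers_, X_maj)
X_maj_reduced = X_maj[np.unique(nearest)]

X_bal = np.vstack([X_maj_reduced, X_min])
y_bal = np.hstack([np.zeros(len(X_maj_reduced)), np.ones(len(X_min))])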
APA, Harvard, Vancouver, ISO, and other styles
46

Neto, Appio Indiano do Brazil Americano. "Previsão automática de fraude em transações financeiras [Automatic fraud prediction in financial transactions]." Master's thesis, 2021. http://hdl.handle.net/10071/23741.

Full text
Abstract:
The detection of fraud in online transaction payments is an increasing challenge, especially given the growth in recent years in the consumption of products and services through e-commerce. This dissertation describes a modeling process using machine learning techniques applied to a fraud detection problem, taking as reference the performance of teams participating in a competition promoted by the Kaggle platform. Attention is directed in particular to data sampling techniques for dealing with class imbalance, to data preparation techniques for anomaly detection and knowledge mining, and finally to ensemble learning methods. The main contribution of this work, compared with other works that used the same dataset, is to demonstrate the importance of the large-scale creation of informative features for the model's performance. The core technique is the iterative creation of new features by comparing a set of variables of each transaction with several statistical measures of the group to which the transaction belongs.
APA, Harvard, Vancouver, ISO, and other styles
47

Wu, Ping-Yi, and 吳秉怡. "Optimal Re-sampling Strategy for Multi-Class Imbalanced Data." Thesis, 2013. http://ndltd.ncl.edu.tw/handle/72083335909084160963.

Full text
Abstract:
Master's thesis
National Chiao Tung University
Department of Industrial Engineering and Management
101
In many fields, developing an effective classification model to predict the category of incoming data is an important problem. For example, a classification model can be used to predict which type of goods a customer will purchase, or to determine whether a loan customer will default. However, real-world categorical data are often imbalanced; that is, the sample size of a particular class is significantly greater than that of the others, in which case most classification methods fail to construct an accurate model. Several studies have focused on developing binary classification models, but these models are not appropriate for data involving three or more categories. This study therefore introduces an optimal re-sampling strategy using design of experiments (DOE) and dual response surface (DRS) methodology to improve the accuracy of classification models for multi-class imbalanced data. Real cases from the KEEL dataset repository are used to demonstrate the effectiveness of the proposed procedure.
APA, Harvard, Vancouver, ISO, and other styles
48

Valencia-Zapata, Gustavo A. "Probabilistic Diagnostic Model for Handling Classifier Degradation in Machine Learning." Thesis, 2019.

Find full text
Abstract:
Several studies point out different causes of performance degradation in supervised machine learning. Problems such as class imbalance, overlapping classes, small disjuncts, noisy labels, and sparseness limit the accuracy of classification algorithms. Even though a number of approaches, in the form of either a methodology or an algorithm, try to minimize performance degradation, they have been isolated efforts with limited scope. This research consists of three main parts. In the first part, a novel probabilistic diagnostic model based on identifying the signs and symptoms of each problem is presented. Secondly, the behavior and performance of several supervised algorithms are studied when training sets have such problems, so that the likelihood of success for a given treatment can be estimated across classifiers. Finally, a probabilistic sampling technique based on training set diagnosis is proposed for avoiding classifier degradation.
APA, Harvard, Vancouver, ISO, and other styles
49

Huang, Yi-Quan, and 黃義權. "Deep learning from imbalanced class data for automatic surface defect detection." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/d8ur8p.

Full text
Abstract:
Master's thesis
Yuan Ze University
Department of Industrial Engineering and Management
106
Imbalanced datasets are a common problem in machine learning. In manufacturing, datasets are predominantly composed of positive (defect-free) samples with only a small quantity of negative (defective) samples, since defective samples are generally scarce in a manufacturing process, and a classification model built from such imbalanced data performs poorly. In this study, a GAN (Generative Adversarial Network)-based model is used to generate negative samples from a very limited number of true defects; the true defect-free samples and the synthesized defective samples are then used to train a CNN (Convolutional Neural Network) model. The GAN model thus alleviates the imbalanced data problem arising in manufacturing inspection. The proposed method focuses on the automatic detection of saw-mark and stain defects on solar wafer surfaces. A multicrystalline solar wafer surface presents crystal grains of random shape, size and orientation, resulting in a heterogeneous texture that makes the automatic visual inspection task extremely difficult. The proposed model shows good detection results for inhomogeneous textures such as solar wafers, the randomly textured surfaces of the DAGM 2007 open dataset, natural wood surfaces, and machined surfaces. For the synthesis of defective samples, only 21 to 90 real defective samples are used as input to the CycleGAN model, which outputs 8,000 to 12,000 generated defect samples used to train the CNN. With 12,000 training samples, the detection rate on small-windowed test samples reaches 85 to 95%, and the model also achieves a very high recognition rate on large-sized multicrystalline solar wafer surfaces, machined surfaces, and DAGM textured surfaces.
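A compact PyTorch sketch of the second stage described above, i.e. training a small CNN on real defect-free patches together with GAN-synthesized defective patches; random tensors stand in for the image data and the architecture is an illustrative placeholder, not the network used in the thesis:

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data: 2,000 defect-free and 2,000 synthesized defective 32x32 patches.
x_ok = torch.rand(2000, 1, 32, 32)
x_ng = torch.rand(2000, 1, 32, 32)          # would come from the trained CycleGAN
x = torch.cat([x_ok, x_ng])
y = torch.cat([torch.zeros(2000, dtype=torch.long), torch.ones(2000, dtype=torch.long)])
loader = DataLoader(TensorDataset(x, y), batch_size=64, shuffle=True)

# A small CNN classifier: two conv blocks followed by a linear layer.
cnn = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(), nn.Linear(32 * 8 * 8, 2),
)
opt = torch.optim.Adam(cnn.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(2):
    for xb, yb in loader:
        opt.zero_grad()
        loss = loss_fn(cnn(xb), yb)
        loss.backward()
        opt.step()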
APA, Harvard, Vancouver, ISO, and other styles
50

Last, Felix. "Oversampling for imbalanced learning based on k-means and SMOTE." Master's thesis, 2018. http://hdl.handle.net/10362/31042.

Full text
Abstract:
Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics
Learning from class-imbalanced data continues to be a common and challenging problem in supervised learning as standard classification algorithms are designed to handle balanced class distributions. While different strategies exist to tackle this problem, methods which generate artificial data to achieve a balanced class distribution are more versatile than modifications to the classification algorithm. Such techniques, called oversamplers, modify the training data, allowing any classifier to be used with class-imbalanced datasets. Many algorithms have been proposed for this task, but most are complex and tend to generate unnecessary noise. This work presents a simple and effective oversampling method based on k-means clustering and SMOTE oversampling, which avoids the generation of noise and effectively overcomes imbalances between and within classes. Empirical results of extensive experiments with 71 datasets show that training data oversampled with the proposed method improves classification results. Moreover, k-means SMOTE consistently outperforms other popular oversampling methods. An implementation is made available in the Python programming language.
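For illustration, the cluster-then-SMOTE idea can be exercised through the KMeansSMOTE class shipped with the imbalanced-learn package; the parameter names assume a reasonably recent imbalanced-learn release, and the threshold value below is illustrative:

import numpy as np
from imblearn.over_sampling import KMeansSMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

# Cluster the input space with k-means, select clusters where the minority class
# is sufficiently represented, and apply SMOTE only inside those clusters.
sampler = KMeansSMOTE(cluster_balance_threshold=0.1, random_state=0)
X_res, y_res = sampler.fit_resample(X, y)
print(np.bincount(y_res))   # class counts after oversampling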
APA, Harvard, Vancouver, ISO, and other styles