Dissertations / Theses on the topic 'Class imbalance'
Wang, Shuo. "Ensemble diversity for class imbalance learning." Thesis, University of Birmingham, 2011. http://etheses.bham.ac.uk//id/eprint/1793/.
Nataraj, Vismitha, and Sushmitha Narayanan. "Resolving Class Imbalance using Generative Adversarial Networks." Thesis, Högskolan i Halmstad, Akademin för informationsteknologi, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:hh:diva-41405.
Tran, Quang Duc. "One-class classification : an approach to handle class imbalance in multimodal biometric authentication." Thesis, City, University of London, 2014. http://openaccess.city.ac.uk/19662/.
SENG, Kruy. "Cost-sensitive deep neural network ensemble for class imbalance problem." Digital Commons @ Lingnan University, 2018. https://commons.ln.edu.hk/otd/32.
Barnabé-Lortie, Vincent. "Active Learning for One-class Classification." Thesis, Université d'Ottawa / University of Ottawa, 2015. http://hdl.handle.net/10393/33001.
Dutta, Ila. "Data Mining Techniques to Identify Financial Restatements." Thesis, Université d'Ottawa / University of Ottawa, 2018. http://hdl.handle.net/10393/37342.
Batuwitage, Manohara Rukshan Kannangara. "Enhanced class imbalance learning methods for support vector machines application to human miRNA gene classification." Thesis, University of Oxford, 2010. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.531966.
Mathur, Tanmay. "Improving Classification Results Using Class Imbalance Solutions & Evaluating the Generalizability of Rationale Extraction Techniques." Miami University / OhioLINK, 2015. http://rave.ohiolink.edu/etdc/view?acc_num=miami1420335486.
Iosifidis, Vasileios [author], and Eirini [academic supervisor] Ntoutsi. "Semi-supervised learning and fairness-aware learning under class imbalance / Vasileios Iosifidis ; supervisor: Eirini Ntoutsi." Hannover : Gottfried Wilhelm Leibniz Universität Hannover, 2020. http://d-nb.info/1217782168/34.
Bellinger, Colin. "Beyond the Boundaries of SMOTE: A Framework for Manifold-based Synthetic Oversampling." Thesis, Université d'Ottawa / University of Ottawa, 2016. http://hdl.handle.net/10393/34643.
Kueterman, Nathan. "Comparative Study of Classification Methods for the Mitigation of Class Imbalance Issues in Medical Imaging Applications." University of Dayton / OhioLINK, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=dayton1591611376235015.
Jagelid, Michelle, and Maria Movin. "A Comparison of Resampling Techniques to Handle the Class Imbalance Problem in Machine Learning : Conversion prediction of Spotify Users - A Case Study." Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-208876.
In this study, we investigated whether, given user data from Spotify users, it is possible to predict which users convert from the free version to the premium version. Because more users do not convert than convert, this was a problem with imbalanced classes. Imbalanced classes are a well-known problem in machine learning. Three machine learning methods were investigated: logistic regression, decision trees and gradient boosting trees. Preprocessing methods that give the training data a more even distribution between the classes were investigated, in order to see whether such preprocessing could improve the models' ability to classify new users. We showed that it was possible, using machine learning methods and given user data, to find patterns in the data that could be used to predict which users convert. For all three machine learning methods, preprocessing the training data to a more even class distribution gave better results. Of the models investigated, logistic regression and gradient boosting trees performed best when trained on preprocessed data in which random duplicates of converting users had been added to the data set until the classes were fully balanced.
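The preprocessing step that this abstract reports as most effective, adding random duplicates of minority-class rows until the classes are even, can be sketched as follows. This is a minimal illustration in plain Python and not the authors' code; the function name `random_oversample`, the toy data and the fixed seed are assumptions made for the example.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of the smaller classes until every
    class has as many rows as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # Candidate rows to duplicate for this class.
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Toy data: 8 non-converting users (label 0) and 2 converting users (label 1).
rows = [[i] for i in range(10)]
labels = [0] * 8 + [1] * 2
balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))
```

Resampling of this kind is normally applied to the training split only, so that the evaluation data keeps the original class distribution.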
Pezzicoli, Francesco. "Statistical Physics - Machine Learning Interplay : from Addressing Class Imbalance with Replica Theory to Predicting Dynamical Heterogeneities with SE(3)-equivariant Graph Neural Networks." Electronic Thesis or Diss., université Paris-Saclay, 2024. http://www.theses.fr/2024UPASG115.
This thesis explores the relationship between Machine Learning (ML) and Statistical Physics (SP), addressing two significant challenges at the interface between the two fields. First, I examine the problem of Class Imbalance (CI) in the supervised learning set-up by introducing an analytically tractable model grounded in statistical mechanics: I provide a theoretical framework to analyze and interpret CI. Some non-trivial phenomena are observed: for example, a balanced training set often results in sub-optimal performance. Second, I study the phenomenon of dynamical arrest in supercooled liquids through advanced ML models. Leveraging SE(3)-equivariant Graph Neural Networks, I am able to reach or surpass state-of-the-art accuracy in the task of predicting dynamical properties from static structure. This suggests the emergence of a growing "amorphous order" that correlates with particle dynamics. It also emphasizes the importance of directional features in identifying this order. Together, these contributions demonstrate the potential of SP in addressing ML challenges and the utility of ML models in advancing the physical sciences.
Yella, Jaswanth. "Machine Learning-based Prediction and Characterization of Drug-drug Interactions." University of Cincinnati / OhioLINK, 2018. http://rave.ohiolink.edu/etdc/view?acc_num=ucin154399419112613.
Ringdahl, Benjamin. "Gaussian Process Multiclass Classification : Evaluation of Binarization Techniques and Likelihood Functions." Thesis, Linnéuniversitetet, Institutionen för matematik (MA), 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-87952.
Brandt, Jakob, and Emil Lanzén. "A Comparative Review of SMOTE and ADASYN in Imbalanced Data Classification." Thesis, Uppsala universitet, Statistiska institutionen, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-432162.
Prati, Ronaldo Cristiano. "Novas abordagens em aprendizado de máquina para a geração de regras, classes desbalanceadas e ordenação de casos." Universidade de São Paulo, 2006. http://www.teses.usp.br/teses/disponiveis/55/55134/tde-01092006-155445/.
Machine learning algorithms are often the most appropriate algorithms for a great variety of data mining applications. However, most machine learning research to date has mainly dealt with the well-circumscribed problem of finding a model (generally a classifier) given a single, small and relatively clean dataset in the attribute-value form, where the attributes have previously been chosen to facilitate learning. Furthermore, the end-goal is simple and well-defined, such as accurate classifiers in the classification problem. Data mining opens up new directions for machine learning research, and lends new urgency to others. With data mining, machine learning is now removing each one of these constraints. Therefore, machine learning's many valuable contributions to data mining are reciprocated by the latter's invigorating effect on it. In this thesis, we explore this interaction by proposing new solutions to some problems that arise when machine learning algorithms are applied to data mining. More specifically, we contribute to the following problems. New approaches to rule learning: in this category, we propose two new methods for rule learning. The first is a new method for finding exceptions to general rules. The second is a rule selection algorithm based on the ROC graph. Rules come from a larger external set of rules, and the algorithm performs a selection step based on the current convex hull in the ROC graph. Proportion of examples among classes: we investigated several aspects related to this issue. Firstly, we carried out a series of experiments on artificial data sets in order to verify our hypothesis that overlapping among classes is a complicating factor in highly skewed data sets. We also carried out a broad experimental analysis with several methods (some of them proposed by us) that artificially balance skewed datasets.
Our experiments show that, in general, over-sampling methods perform better than under-sampling methods. Finally, we investigated the relationship between class imbalance and small disjuncts, as well as the influence of the proportion of examples among classes in the process of labelling unlabelled cases in the semi-supervised learning algorithm Co-training. New method for combining rankings: we propose a new method called BordaRanking to construct ensembles of rankings based on Borda count voting, which can be applied whenever only the rankings are available. Results show an improvement upon the base rankings constructed by taking into account the ordering given by classifiers which output continuous-valued scores, as well as a comparable performance with the fusion of such scores.
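To make the Borda count voting mentioned in this abstract concrete, the sketch below combines several rankings of the same items using a textbook Borda count. It is not the thesis's BordaRanking method, whose details are not given here; the function name `borda_aggregate`, the tie-breaking rule and the toy rankings are assumptions for illustration.

```python
from collections import defaultdict


def borda_aggregate(rankings):
    """Combine rankings (best item first) with a plain Borda count.

    An item in position p of a ranking of n items earns n - 1 - p points;
    items are then ordered by total points (hypothetical helper).
    """
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, item in enumerate(ranking):
            scores[item] += n - 1 - position
    # Highest total first; ties are broken by item name for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Three base rankings of four cases, for example from three classifiers.
rankings = [
    ["a", "b", "c", "d"],
    ["b", "a", "d", "c"],
    ["a", "c", "b", "d"],
]
combined = borda_aggregate(rankings)
print(combined)  # ['a', 'b', 'c', 'd']
```

Only the order of the items is used, which is why this kind of voting applies when the underlying classifier scores are unavailable.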
Siddique, Nahian A. "PATTERN RECOGNITION IN CLASS IMBALANCED DATASETS." VCU Scholars Compass, 2016. http://scholarscompass.vcu.edu/etd/4480.
Abouelenien, Mohamed. "Boosting for Learning From Imbalanced, Multiclass Data Sets." Thesis, University of North Texas, 2013. https://digital.library.unt.edu/ark:/67531/metadc407775/.
Andersson, Melanie. "Multi-Class Imbalanced Learning for Time Series Problem : An Industrial Case Study." Thesis, Uppsala universitet, Avdelningen för systemteknik, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-412799.
Ghanem, Amal Saleh. "Probabilistic models for mining imbalanced relational data." Thesis, Curtin University, 2009. http://hdl.handle.net/20.500.11937/2266.
Makki, Sara. "An Efficient Classification Model for Analyzing Skewed Data to Detect Frauds in the Financial Sector." Thesis, Lyon, 2019. http://www.theses.fr/2019LYSE1339/document.
The financial domain carries several types of risk, such as terrorist financing, money laundering, credit card fraud and insurance fraud, that may have catastrophic consequences for entities such as banks or insurance companies. These financial risks are usually detected using classification algorithms. In classification problems, a skewed distribution of classes, also known as class imbalance, is a very common challenge in financial fraud detection, where special data mining approaches are used along with traditional classification algorithms to tackle the issue. The class imbalance problem occurs when one class has more instances than another. The problem becomes more acute in a big data context. The datasets used to build and train the models contain an extremely small portion of the minority group, also known as positives, in comparison with the majority class, known as negatives. In most cases, such as fraud detection or disease diagnosis, it is more delicate and crucial to classify the minority group correctly than the other group. In these examples, the fraud and the disease are the minority groups, and detecting a fraud record matters more than detecting a normal one because of its dangerous consequences. These class proportions make it very difficult for a machine learning classifier to learn the characteristics and patterns of the minority group. Such classifiers are biased towards the majority group because of its many examples in the dataset, and they learn to classify it much faster than the other group. After conducting a thorough study of the challenges faced in class imbalance cases, we found that we still cannot reach an acceptable sensitivity (i.e. good classification of the minority group) without a significant decrease in accuracy. This leads to another challenge: the choice of performance measures used to evaluate models.
In these cases, this choice is not straightforward: accuracy or sensitivity alone is misleading. We use other measures, such as the precision-recall curve or the F1-score, to evaluate this trade-off between accuracy and sensitivity. Our objective is to build an imbalanced classification model that considers the extreme class imbalance and the false alarms, in a big data framework. We developed two approaches: a Cost-Sensitive Cosine Similarity K-Nearest Neighbor (CoSKNN) as a single classifier, and a K-modes Imbalance Classification Hybrid Approach (K-MICHA) as an ensemble learning methodology. In CoSKNN, our aim was to tackle the imbalance problem by using cosine similarity as a distance metric and by introducing a cost-sensitive score for classification using the KNN algorithm. We conducted a comparative validation experiment in which we demonstrate the effectiveness of CoSKNN in terms of accuracy and fraud detection. The aim of K-MICHA, on the other hand, is to cluster similar data points in terms of the classifiers' outputs. The fraud probabilities in the resulting clusters are then calculated and used to detect fraud in new transactions. This approach can be used for the detection of any type of financial fraud for which labelled data are available. Finally, we applied K-MICHA to credit card, mobile payment and auto insurance fraud data sets. In all three case studies, we compare K-MICHA with stacking using voting, weighted voting, logistic regression and CART. We also compare it with AdaBoost and random forest. We demonstrate the efficiency of K-MICHA based on these experiments.
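The point that accuracy alone is misleading under class imbalance can be checked with a few lines of arithmetic. The sketch below uses invented toy predictions (5 frauds among 100 transactions) rather than data from the thesis, and the helper names `confusion_counts` and `summary` are assumptions for the example.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives, false negatives, true negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    return tp, fp, fn, tn


def summary(y_true, y_pred):
    """Accuracy, precision, recall (sensitivity) and F1 for the positive class."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy data: 5 frauds (1) followed by 95 normal transactions (0).
y_true = [1] * 5 + [0] * 95
always_negative = [0] * 100                      # never flags fraud
detector = [1, 1, 1, 1, 0] + [1] * 4 + [0] * 91  # finds 4 frauds, 4 false alarms

print(summary(y_true, always_negative))
print(summary(y_true, detector))
```

Both toy classifiers reach 95% accuracy, yet the first detects no fraud at all (recall and F1 of 0), while the second has recall 0.8 and F1 of about 0.62.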
Gladh, Marcus, and Daniel Sahlin. "Image Synthesis Using CycleGAN to Augment Imbalanced Data for Multi-class Weather Classification." Thesis, Linköpings universitet, Medie- och Informationsteknik, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-176991.
The thesis work was carried out at the Department of Science and Technology (ITN) at the Faculty of Science and Engineering, Linköping University.
Tumati, Saini. "A Combined Approach to Handle Multi-class Imbalanced Data and to Adapt Concept Drifts using Machine Learning." University of Cincinnati / OhioLINK, 2021. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1623240328088387.
Yang, Shaojie. "A Data Augmentation Methodology for Class-imbalanced Image Processing in Prognostic and Health Management." University of Cincinnati / OhioLINK, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=ucin161375046654683.
Olaitan, Olubukola. "SCUT-DS: Methodologies for Learning in Imbalanced Data Streams." Thesis, Université d'Ottawa / University of Ottawa, 2018. http://hdl.handle.net/10393/37243.
Ayhan, Dilber. "Multi-class Classification Methods Utilizing Mahalanobis Taguchi System And A Re-sampling Approach For Imbalanced Data Sets." Master's thesis, METU, 2009. http://etd.lib.metu.edu.tr/upload/3/12610521/index.pdf.
Orriols, Puig Albert. "New Challenges in Learning Classifier Systems: Mining Rarities and Evolving Fuzzy Models." Doctoral thesis, Universitat Ramon Llull, 2008. http://hdl.handle.net/10803/9159.
During the last decade, Michigan-style learning classifier systems (LCSs) - genetic-based machine learning (GBML) methods that combine apportionment of credit techniques and genetic algorithms (GAs) to evolve a population of classifiers online - have been enjoying a renaissance. Together with the formulation of first generation systems, there have been crucial advances in (1) systematic design of new competent LCSs, (2) applications in important domains, and (3) theoretical analyses for design. Despite these successful designs and applications, there still remain difficult challenges that need to be addressed to increase our comprehension of how LCSs behave and to scalably and efficiently solve real-world problems.
The purpose of this thesis is to address two important challenges, shared by the machine learning community, with Michigan-style LCSs: (1) learning from domains that contain rare classes and (2) evolving highly legible models in which human-like reasoning mechanisms are employed. Extracting accurate models from rare classes is critical since the key, otherwise unnoticed knowledge usually resides in the rarities, and many traditional learning techniques are not able to model rarity accurately. Moreover, these difficulties increase in online learning, where the learner receives a stream of examples and has to detect rare classes on the fly. Evolving highly legible models is crucial in some domains such as medical diagnosis, in which human experts may be more interested in the explanation of the prediction than in the prediction itself.
The contributions of this thesis take two Michigan-style LCSs as a starting point: the extended classifier system (XCS) and the supervised classifier system (UCS). XCS is taken as the first reference of this work since it is the most influential LCS. UCS is a recent LCS design that has inherited the main components of XCS and has specialized them for supervised learning. As this thesis is especially concerned with classification problems, UCS is also considered in this study. Since UCS is still a young system, for which there are several open issues that need further investigation, its learning architecture is first revised and updated. Moreover, to illustrate the key differences between XCS and UCS, the behavior of both systems is compared on a collection of boundedly difficult problems.
The study of learning from rare classes with LCSs starts with an analytical approach in which the problem is decomposed into five critical elements, and facetwise models are derived for each element. The analysis is used as a tool for designing configuration guidelines that enable XCS and UCS to solve problems that previously eluded solution. Thereafter, the two LCSs are compared with several highly influential learners on a collection of real-world problems with rare classes, and they emerge as the two best techniques in the comparison. Moreover, re-sampling the training data set to eliminate the presence of rare classes is demonstrated to benefit, on average, the performance of LCSs.
The challenge of building more legible models and using human-like reasoning mechanisms is addressed with the design of a new LCS for supervised learning that combines the online evaluation capabilities of LCSs, the search robustness over complex spaces of GAs, and the legible knowledge representation and principled reasoning mechanisms of fuzzy logic. The system resulting from this crossbreeding of ideas, referred to as Fuzzy-UCS, is studied in detail and compared with several highly competent learning systems, demonstrating the competitiveness of the new architecture in terms of the accuracy and the interpretability of the evolved models. In addition, the benefits provided by the online architecture are exemplified by extracting accurate classification models from large data sets.
Overall, the advances and key insights provided in this thesis help advance our understanding of how LCSs work and prepare these types of systems to face increasingly difficult problems, which abound in current industrial and scientific applications. Furthermore, experimental results highlight the robustness and competitiveness of LCSs with respect to other machine learning techniques, which encourages their use to face new challenging real-world applications.
Anne, Chaitanya. "Advanced Text Analytics and Machine Learning Approach for Document Classification." ScholarWorks@UNO, 2017. http://scholarworks.uno.edu/td/2292.
Lento, Gabriel Carneiro. "Random forest em dados desbalanceados: uma aplicação na modelagem de churn em seguro saúde." reponame:Repositório Institucional do FGV, 2017. http://hdl.handle.net/10438/18256.
In this work we study churn in health insurance, that is, predicting which clients will cancel the product or service within a preset time-frame. Traditionally, the probability that a client will cancel the service is modeled using logistic regression. Recently, modern machine learning techniques have become popular in churn modeling, having been applied in the areas of telecommunications, banking, and car insurance, among others. One of the big challenges in this problem is that only a fraction of all customers cancel the service, meaning that we have to deal with highly imbalanced class probabilities. Under-sampling and over-sampling techniques have been used to overcome this issue. We use random forests, which are ensembles of decision trees, where each of the trees fits a subsample of the data constructed using either under-sampling or over-sampling. We compare the distinct specifications of random forests using various metrics that are robust to imbalanced classes, both in-sample and out-of-sample. We observe that random forests using imbalanced random samples with fewer observations than the original series present a better overall performance. Random forests also present a better performance than the classical logistic regression, often used in health insurance companies to model churn.
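The idea of fitting each tree to its own under-sampled subsample can be illustrated by the sampling step alone. The sketch below is an assumption-laden illustration, not the dissertation's procedure: the function `undersampled_bootstrap` and its `majority_per_minority` parameter are hypothetical, and tree fitting is left out. With a ratio above 1 the subsample stays imbalanced but is smaller than the original data, which is the kind of specification the abstract reports as performing best.

```python
import random


def undersampled_bootstrap(labels, rng, minority=1, majority_per_minority=1.0):
    """Return row indices for one tree's training subsample.

    Draws a bootstrap sample of the minority class, then draws (with
    replacement) `majority_per_minority` times as many majority rows.
    Hypothetical helper; the ratio parameter is an assumption.
    """
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    k = len(minority_idx)
    majority_size = int(round(k * majority_per_minority))
    sample = [rng.choice(minority_idx) for _ in range(k)]
    sample += [rng.choice(majority_idx) for _ in range(majority_size)]
    return sample


# Toy data: 10 churners (1) among 100 clients.
labels = [1] * 10 + [0] * 90
rng = random.Random(0)

# One subsample per tree; each holds 10 churners and 20 non-churners.
subsamples = [
    undersampled_bootstrap(labels, rng, majority_per_minority=2.0)
    for _ in range(5)
]
print([len(s) for s in subsamples])  # [30, 30, 30, 30, 30]
```

Each index list would then be used to select the rows on which one decision tree is trained, and the trees' votes are combined as in an ordinary random forest.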
Briend, Cyril. "Le contrat d'adhésion entre professionnels." Thesis, Sorbonne Paris Cité, 2015. http://www.theses.fr/2015USPCB177/document.
The professional, presumed able to defend his own interests, in contrast to the employee or the consumer, has for some decades also proven to be a victim of imbalanced contracts. The emergence of powerful private companies in various sectors clearly leads to inequalities between professionals. Our study underlines how difficult it is to find the best criterion for identifying the weaker professional party. It is impossible to say that one company is globally stronger than another, because the legal person that is party to the agreement can conceal many interests that are hard to grasp at first sight. Nor can the judge set prices in an authoritarian way without risking a distortion of his role. We shall argue for this idea: a business-to-business agreement is to be classified as an adhesion contract whenever it has not been subject to adequate bargaining, so the judge has to examine the bargaining process and the circumstances preceding the contract. Many criteria can help the judge, such as the size of the companies, market shares, the words exchanged, the good or bad faith of the parties, or the efforts they have made. Even if we consider the bargain analysis the soundest choice, we have to consider its limitations. It would not be realistic to expect the judge always to discover every circumstance prior to the agreement. This is why we shall add a system of presumptions, albeit rebuttable ones, to the bargain analysis for cases where the difference in company size or the disproportion of the provisions is obvious. We shall bring to light the strategies used by the stronger parties to bypass the bargain analysis, such as harmful clauses or internationalization tactics. We shall therefore opt for strict mandatory standards, in national as well as in international law. Once the bargain analysis is done, we shall suggest sanctions suited to the issue.
The judge, in our opinion, must be able to modify the agreement in a very flexible way, either retroactively or during the performance of the agreement. The gravity of various contractual behaviors must lead us to consider a form of criminal or "quasi-criminal" law in order to combat those behaviors by more suitable means. Nevertheless, the protection of the weaker professional party must also be addressed on procedural grounds. A proceeding for interim measures is likely to meet the need for speed that troubles the weaker parties when they bring an action. We shall also underline the advantages of a class action, which could overcome the financial obstacle of the lawsuit. Conversely, the legal certainty of business leads us to favor protection through a soft-law system. First Part: The identification of the business-to-business adhesion contract. Second Part: The judicial treatment of business-to-business adhesion contracts.
Chang, Yu-shan, and 張毓珊. "Developing Data Mining Models for Class Imbalance Problems." Thesis, 2009. http://ndltd.ncl.edu.tw/handle/57781951199735409394.
朝陽科技大學
資訊管理系碩士班
98
In classification problems, class imbalance biases the training of classifiers and results in low predictive accuracy on minority class examples. The problem arises in imbalanced data, in which almost all examples belong to one class and far fewer belong to the others. Compared with the majority examples, the minority examples usually form the more interesting class, such as rare diseases in medical diagnosis data, failures in inspection data, and fraud in credit screening data. When inducing knowledge from an imbalanced data set, traditional data mining algorithms achieve high classification accuracy on the majority class but an unacceptable error rate on the minority class. They are therefore not suitable for handling class-imbalanced data. To tackle the class imbalance problem, this study aims to (1) find a robust classifier among several candidates, including Decision Tree (DT), Logistic Regression (LR), Mahalanobis Distance (MD), and Support Vector Machines (SVM); and (2) propose two novel methods called MD-SVM (a new two-phase learning scheme) and SWAI (SOM Weights As Input). Experimental results indicate that the proposed MD-SVM and SWAI identify minority class examples better than traditional techniques such as under-sampling, cost adjusting, and cluster-based sampling.
Huang, Li-Jyun, and 黃俐君. "Resolving Intra-Class Imbalance for GAN-based Data Augmentation." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/xgc4e2.
國立交通大學
資訊科學與工程研究所
106
Most equipment-failure and defect-detection cases share one main problem: data imbalance. Most classifiers do not predict well on such data. Researchers have proposed techniques to reduce or augment the data. These improve classifier accuracy, but not enough, since some specific types of images are sparse and some new types of data may still lack a sufficient number of training samples. The reason is that most data augmentation algorithms mainly deal with imbalance among categories. After clustering a single category, we find that even within a category the forms of the data may still be very diverse and imbalanced. Therefore, we modify the design of the Generative Adversarial Network (GAN), a deep-learning-based data augmentation algorithm, to take this intra-category imbalance into account. In this thesis, we propose ACGAN and GAN_BIAS, a GAN system with systematic control for generating diverse defective images. To make the generative adversarial network automatically balance the data of different clusters, we use an actor-critic algorithm to adjust the weights of the clusters in the loss function. The experimental results show that ACGAN and GAN_BIAS are more effective than a traditional GAN in dealing with the imbalance between clusters within a class.
Pan, Yi-Ying, and 潘怡瑩. "Clustering-based Data Preprocessing Approach for the Class Imbalance Problem." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/94nys8.
國立中央大學
資訊管理學系
106
The class imbalance problem is an important issue in data mining. It occurs when the number of samples in one class is much larger than in the other classes. Traditional classifiers tend to misclassify most samples of the minority class into the majority class in order to maximize overall accuracy, which makes it hard to establish a good classification rule for the minority class. The class imbalance problem occurs in many real-world applications, such as fault diagnosis, medical diagnosis and face recognition. To deal with it, a clustering-based data preprocessing approach is proposed, in which two clustering techniques, affinity propagation and K-means, are each used to divide the majority class into several subclasses, resulting in multiclass data. This approach can effectively reduce the class imbalance ratio of the training dataset, shorten the classifier training time and improve classification performance. Our experiments use forty-four small class imbalance datasets from KEEL and eight high-dimensional datasets from NASA to build five types of classification models: C4.5, MLP, Naïve Bayes, SVM and k-NN (k=5). In addition, we also employ a classifier ensemble algorithm. This research compares AUC results across clustering techniques, classification models and numbers of K-means clusters in order to find the best configuration of the proposed approach and to compare it with methods from the literature. The experimental results on the KEEL datasets show that the k-NN (k=5) algorithm is the best choice regardless of whether affinity propagation or K-means (K=5) is used; the experimental results on the NASA datasets show that the proposed approach outperforms the methods from the literature on the high-dimensional datasets.
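The effect of splitting the majority class into subclasses can be seen with a little arithmetic. The sketch below takes cluster assignments as given (in the thesis they come from affinity propagation or K-means); the helper names, the 900/100 class sizes, and the five equal clusters are hypothetical.

```python
from collections import Counter

def imbalance_ratio(labels):
    """Size of the largest class divided by the size of the smallest."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

def split_majority(labels, cluster_ids, majority_label):
    """Replace each majority label by a (label, cluster) subclass label."""
    return [
        (y, c) if y == majority_label else y
        for y, c in zip(labels, cluster_ids)
    ]

# Hypothetical data: 900 majority and 100 minority samples.
labels = ["maj"] * 900 + ["min"] * 100
# Hypothetical clustering: five equally sized majority clusters;
# cluster ids of minority samples are ignored.
cluster_ids = [i % 5 for i in range(900)] + [None] * 100

new_labels = split_majority(labels, cluster_ids, "maj")
print(imbalance_ratio(labels))      # 9.0
print(imbalance_ratio(new_labels))  # 1.8
```

The binary problem with a ratio of 9 becomes a six-class problem with a ratio of 1.8; predictions for any of the five subclasses are mapped back to the majority class at test time.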
Lu, Yi-Wei, and 呂逸瑋. "Conditional Generative Adversarial Network for Defect Classification with Class Imbalance." Thesis, 2019. http://ndltd.ncl.edu.tw/handle/gku365.
元智大學
資訊管理學系
107
Automated Optical Inspection (AOI) is used for defect inspection during industrial manufacturing processes. It uses optical instruments to capture images of product surfaces and identifies defects through machine vision techniques. Deep learning and convolutional neural networks automatically produce features that are useful for identifying defects correctly. However, an imbalance between the numbers of defect samples and normal samples is typical in industrial processes, which leads to poor accuracy of deep learning models. This paper proposes a framework named CGANC, which integrates a Conditional Generative Adversarial Network (GAN) that generates synthetic images automatically, in order to produce more defect images and adjust the data distribution for class imbalance. Finally, the paper uses a Convolutional Neural Network to obtain better defect classification results with the augmented data than with the original data.
Lin, Li-wei, and 林立為. "A Study of Developing the Methods for Solving Class Imbalance Problems." Thesis, 2010. http://ndltd.ncl.edu.tw/handle/02009794153964859900.
朝陽科技大學
資訊管理系碩士班
98
Class imbalance problems have attracted much attention in the field of machine learning. The problem arises in training sets in which the number of examples of one class is much larger than that of the other classes. When learning from such imbalanced data, traditional machine learning algorithms achieve relatively high accuracy on the majority examples but an unacceptable error rate on the minority class instances, which are usually the important ones. To solve this problem, this study proposes two novel methods, called "modified cluster based sampling" (MCBS) and "BPN based voting scheme" (BPS). Seven data sets from the UCI data bank and three real cases of bloggers' sentiment classification are used to verify the effectiveness of the proposed methods. Four-fold cross-validation experiments are carried out to obtain reliable results. MCBS addresses the shortcomings of the traditional cluster-based sampling method. The BPS method enhances the traditional voting scheme by using a BPN network to obtain the optimal vote weights. In addition, the proposed methodologies are applied to classify textual sentiment data, which usually suffers from high dimensionality and small sample size. Experimental results indicate that, compared with conventional treatments for imbalanced data, such as under-sampling, cluster-based sampling, the self-organizing map network weights method, the two-stage learning strategy, and one-class learning, the proposed methods not only increase the ability to detect minority examples but also deliver stable classification performance.
Komba, Lyee. "Sampling Techniques for Class Imbalance Problem in Aviation Safety Incidents Data." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/jg2y52.
國立臺北科技大學
電資國際專班
106
Like any other industry, the aviation industry acquires a variety of data every day through numerous data management systems. Structured and unstructured data are collected through aircraft systems, maintenance systems, supply systems, ticketing and booking systems, and many other systems used in the daily operations of the aviation business. Data mining can be used to analyze all these types of data and generate meaningful information that can improve the future performance, safety and profitability of aviation business and operations. This thesis presents data mining methods applied to aviation incident data to predict incidents with fatal consequences. Other applications of data mining within the aviation industry include prediction of passenger travel, meteorological prediction, component failure prediction, and other work on fatal incident prediction aimed at finding the right features. This study uses the public dataset from the Federal Aviation Administration Accident and Incident Data System (FAA AIDS) website, with data records from 2000 to 2017. Our goal is to build a prediction model for fatal incidents and to generate decision rules or factors contributing to incidents that have fatal results. In this way, the model will serve as a predictive risk management system for aviation safety. The aviation industry generally operates in a safe state because of the transition from reactive safety and risk management to a proactive safety management approach, and now to a predictive approach to safety management through data mining techniques such as those in this study and others. Over time, the number of systems has increased and the number of aviation accidents and serious incidents has decreased. As a result, only 0.6% of the incidents in our analysis had fatal consequences.
During the data preprocessing stage, we encountered the problem of an unbalanced dataset, which prompted us to propose techniques to resolve it. Unbalanced datasets are datasets in which the minority classes are represented by far fewer samples than the majority class; this matters especially when the analysis is aimed at the minority class. Not dealing with this issue correctly may result in poorly performing models or misclassified data. With the growth of the travelling population in the aviation community, safety is paramount, so obtaining a relatively precise model is important. To obtain a precise model or classifier, we need to preprocess and resample the data efficiently. This thesis therefore also addresses the issue of unbalanced data in order to produce a balanced dataset that can be used to train a classifier. We applied the following sampling techniques in RStudio to handle the imbalanced data: oversampling, under-sampling, SMOTE and bootstrap sampling. The datasets produced by these techniques are used to train different classifiers, and the performance of the classifiers is measured and discussed in this thesis.
Dai, Yu-Ting, and 戴郁庭. "Missing value imputation for class imbalance data: a dynamic warping approach." Thesis, 2019. http://ndltd.ncl.edu.tw/handle/6cax3v.
國立中央大學
資訊管理學系
107
In a world full of information, more and more companies want to use that information to improve their competitiveness. However, class imbalance and missing values have always been important issues in real-world data. For example, class-imbalanced datasets often occur in fields such as medical diagnosis and bankruptcy prediction. Under class imbalance, the number of samples of the majority class in the dataset is larger than that of the minority class, so the data are skewed. Because a general classifier aims for a high overall classification accuracy, the prediction model it builds is affected by the skewed distribution and tends to misclassify minority samples as belonging to the majority class. If the precious minority class also contains missing data, the available data are even rarer. In this thesis, dynamic time warping is used as the core of the missing value imputation task. The correction feature of dynamic time warping is used to solve the problem of missing data in a minority class that contains only a small number of samples, and the method does not require complete data samples. In the experiments, missing rates of 10%, 30%, 50%, 70%, and 90% are simulated for the minority class data. We use 17 KEEL datasets, construct two classification models (SVM and Decision Tree), and examine the AUC (Area Under Curve) of the different methods. The experimental results show that dynamic time warping performs well at missing rates of 50% to 90%, where it outperforms the KNN imputation method.
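The abstract does not say how dynamic time warping is turned into an imputation rule, so the sketch below shows only the classic DTW distance between two numeric sequences, which such a method could use to find the most similar sample. The absolute-difference local cost and the function name `dtw_distance` are assumptions made for illustration.

```python
import math

def dtw_distance(a, b):
    """Classic dynamic time warping distance with |x - y| as local cost."""
    n, m = len(a), len(b)
    # cost[i][j] holds the minimal accumulated cost of aligning
    # the first i elements of a with the first j elements of b.
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # a[i-1] is matched again
                cost[i][j - 1],      # b[j-1] is matched again
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]

# Sequences of different length can still align perfectly.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # 0.0
print(dtw_distance([1, 2, 3], [2, 3, 4]))     # 2.0
```

Because the warping path may repeat elements, the two sequences need not have the same length, which is one reason a DTW-based comparison can work with incomplete records.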
Buda, Mateusz. "A systematic study of the class imbalance problem in convolutional neural networks." Thesis, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-219872.
In this study, we systematically investigate the effect of class imbalance on the classification performance of convolutional neural networks and compare common methods for addressing the problem. Class imbalance refers to a significant disparity in the number of examples per class in a training set. It is a common problem that has been studied extensively in machine learning, but systematic research in the context of deep learning is very limited. We define and parameterize two representative types of imbalance, step and linear. Using three datasets of increasing complexity, MNIST, CIFAR-10 and ImageNet, we investigate the effects of imbalance on classification and perform an extensive comparison of several methods for addressing the problem: oversampling, undersampling, two-phase training, and thresholding that compensates for prior class probabilities. Our main evaluation metric is the area under the receiver operating characteristic curve (ROC AUC) adjusted to multi-class tasks, since overall accuracy is associated with notable difficulties in the context of imbalanced data. Based on the results of the experiments, we conclude that (i) the effect of class imbalance on classification performance is detrimental and increases with the extent of the imbalance and the scale of the task; (ii) the method of addressing class imbalance that emerged as dominant in almost all analyzed scenarios was oversampling; (iii) oversampling should be applied to the level that completely eliminates the imbalance, whereas undersampling can perform better when the imbalance is removed only to some extent; (iv) thresholding should be applied to compensate for prior class probabilities when the overall number of correctly classified cases is of interest; (v) in contrast to some classical machine learning models, oversampling does not necessarily cause overfitting of convolutional neural networks.
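Thresholding that compensates for prior class probabilities can be sketched as dividing each network output by the corresponding training-set class frequency and renormalizing. This is a minimal illustration of that idea and not the thesis code; the function name `threshold_by_priors` and all numbers are hypothetical.

```python
def threshold_by_priors(posteriors, priors):
    """Divide each class score by its training prior and renormalize."""
    adjusted = [p / q for p, q in zip(posteriors, priors)]
    total = sum(adjusted)
    return [a / total for a in adjusted]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

priors = [0.9, 0.1]      # hypothetical class frequencies in the training set
posteriors = [0.7, 0.3]  # hypothetical network output for one test example

adjusted = threshold_by_priors(posteriors, priors)
print(argmax(posteriors))  # 0
print(argmax(adjusted))    # 1
```

The raw output favors the majority class, while the prior-corrected output favors the minority class, because a score of 0.3 is three times the minority prior of 0.1.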
Yao, Guan-Ting, and 姚冠廷. "A Two-Stage Hybrid Data Preprocessing Approach for the Class Imbalance Problem." Thesis, 2017. http://ndltd.ncl.edu.tw/handle/dm48kk.
國立中央大學
資訊管理學系
105
The class imbalance problem is an important issue in data mining. A skewed class distribution occurs when the number of examples representing one class is much lower than that of the other classes. Traditional classifiers tend to misclassify most samples of the minority class into the majority class because they maximize overall accuracy. This phenomenon limits the construction of effective classifiers for the precious minority class. The problem occurs in many real-world applications, such as fault diagnosis, medical diagnosis and face recognition. To deal with the class imbalance problem, we propose a two-stage hybrid data preprocessing framework based on clustering and instance selection techniques. This approach filters out noisy data in the majority class and can reduce the execution time of classifier training. More importantly, it can decrease the effect of class imbalance and performs very well in the classification task. Our experiments use 44 class imbalance datasets from KEEL to build four types of classification models: C4.5, k-NN, Naïve Bayes and MLP. A classifier ensemble algorithm is also employed. In addition, two kinds of clustering techniques and three kinds of instance selection algorithms are used in order to find the combination best suited to the proposed method. The experimental results show that the proposed framework performs better than many well-known state-of-the-art approaches in terms of AUC. In particular, the proposed framework combined with bagging-based MLP ensemble classifiers performs best, achieving an AUC of 92%.
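AUC, the metric used here in place of overall accuracy, equals the probability that a randomly chosen minority (positive) example receives a higher score than a randomly chosen majority (negative) example. A minimal sketch of that pairwise definition follows; the scores and labels are hypothetical.

```python
def auc(scores, labels):
    """Fraction of positive-negative pairs ranked correctly (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical classifier scores; label 1 marks the minority class.
scores = [0.9, 0.8, 0.35, 0.6, 0.4, 0.2]
labels = [1, 1, 1, 0, 0, 0]

# 7 of the 9 positive-negative pairs are ranked correctly here.
print(auc(scores, labels))
```

The pairwise form is quadratic in the number of samples, which is fine for illustration; it makes clear that AUC depends only on the ranking of scores and not on the class proportions.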
Marath, Sathi. "Large-Scale Web Page Classification." Thesis, 2010. http://hdl.handle.net/10222/13130.
Li, Chao-Ting, and 李兆庭. "Annotation-Effective Active Learning for Extreme Class Imbalance Problem: Application to Lymphocyte Detection in H&E Stained Liver Histopathological Image." Thesis, 2019. http://ndltd.ncl.edu.tw/handle/d6q35g.
國立成功大學
電腦與通信工程研究所
107
Medical image segmentation is a fundamental challenge in medical image analysis. A major concern when applying deep learning to biomedical images is the insufficient number of annotated samples. Since the annotation process requires specialist knowledge and images often contain very many instances (e.g. cells), annotation can incur a great deal of effort and cost. Another concern is class imbalance, a critical obstacle that commonly occurs in biomedical images. In lymphocyte detection, an important lymphocyte subpopulation is far less numerous than other cells, which biases training toward the majority class. Traditional labeling strategies, such as active learning, are ineffective at finding enough minority samples for training. Hence, this study develops a low-cost manual annotation method for efficient lymphocyte detection in domains exhibiting extreme class imbalance. To address these problems, this thesis proposes an active learning framework that reduces the total labeling workload while handling the extreme class imbalance by both under-sampling the majority class and over-sampling the minority class. Experimental results show that the proposed method achieves an annotation-effective solution for segmentation with extremely imbalanced classes. The contribution of the proposed method is threefold: (1) we propose an active learning framework for solving the extreme class imbalance problem by both under-sampling the majority class and over-sampling the minority class; (2) the proposed framework achieves good performance for lymphocyte detection in histopathological images with fewer labeled samples; and (3) a quantitative analysis of lymphocytes is provided to support more objective diagnosis.
Σκρεπετός, Δημήτριος. "Σχεδιασμός και υλοποίηση πολυκριτηριακής υβριδικής μεθόδου ταξινόμησης βιολογικών δεδομένων με χρήση εξελικτικών αλγορίθμων και νευρωνικών δικτύων." Thesis, 2014. http://hdl.handle.net/10889/8037.
Hard classification problems in bioinformatics, such as microRNA prediction and PPI prediction, demand powerful classifiers that have good prediction accuracy, handle missing values, are interpretable, and do not suffer from the class imbalance problem. One widely used classifier is the neural network, which requires its architecture and other parameters to be specified, while its training algorithms usually converge to local minima. For those reasons, we propose a multi-objective evolutionary method, based on evolutionary algorithms, that optimises many of the aforementioned performance criteria of a neural network and also finds an optimised architecture and a global minimum for its weights. The resulting population is then used as an ensemble classifier to perform the classification.
Huang, Cheng-Ting, and 黃正廷. "Hybrid Sampling Strategy to Class-Imbalanced Classification Problem." Thesis, 2015. http://ndltd.ncl.edu.tw/handle/44824582642600593412.
元智大學
資訊工程學系
105
Manufacturing production data are imbalanced: the number of defective products is far smaller than the number of non-defective products. This impairs the classification capability of machine learning, so that prediction accuracy on the minority data is greatly reduced. There are two known approaches to the problem. One is to reduce the data imbalance; the other is to improve the classification algorithm. This thesis adopts the data-level approach, also called sampling. Under-sampling and over-sampling are the two major sampling methods. Under-sampling may remove useful data, while over-sampling may cause overfitting or create noisy data. This study combines over-sampling and under-sampling to process imbalanced manufacturing data, and then compares sensitivity, specificity, G-mean and other related statistics for four different machine learning algorithms. The results show that the approach significantly reduces training time on optical thin-film manufacturing data, and that classification with Random Forest, LibSVM or K-nearest neighbor (KNN) dramatically improves the G-mean, an overall prediction accuracy measure that accounts for accuracy on both the majority and the minority class. Key words: manufacturing data, Over-Sampling, Under-Sampling, machine learning, sensitivity, specificity.
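Sensitivity, specificity and G-mean follow directly from the confusion-matrix counts; G-mean is the geometric mean of the first two. The sketch below uses hypothetical counts, treating defective products as the positive class.

```python
import math

def sensitivity_specificity_gmean(tp, fn, tn, fp):
    """Return (sensitivity, specificity, G-mean) from confusion counts."""
    sensitivity = tp / (tp + fn)  # recall on the positive (defective) class
    specificity = tn / (tn + fp)  # recall on the negative (non-defective) class
    return sensitivity, specificity, math.sqrt(sensitivity * specificity)

# Hypothetical test set: 20 defective and 980 non-defective products.
# A classifier that labels everything non-defective is 98% accurate,
# yet its G-mean is zero because it misses every defect.
trivial = sensitivity_specificity_gmean(tp=0, fn=20, tn=980, fp=0)
print(trivial)  # (0.0, 1.0, 0.0)

# A classifier that finds 16 of the 20 defects at the cost of 98 false alarms.
better = sensitivity_specificity_gmean(tp=16, fn=4, tn=882, fp=98)
print(better)
```

The second classifier has lower plain accuracy (898 of 1000 correct) than the trivial one, but a G-mean of about 0.85, which is why G-mean is preferred to accuracy on imbalanced data.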
Jhang, Jing-Shang, and 張景翔. "Clustering-Based Under-sampling in Class Imbalanced Data." Thesis, 2016. http://ndltd.ncl.edu.tw/handle/63638063606226309685.
國立中央大學
資訊管理學系
104
The class imbalance problem is an important issue in data mining. It occurs when the number of samples representing one class is much smaller than that of the other classes. A classification model built from a class-imbalanced dataset is likely to misclassify most samples of the minority class into the majority class because it maximizes the accuracy rate. The problem is present in many real-world applications, such as fault diagnosis, medical diagnosis and face recognition. One of the most popular types of solution is data sampling, for example under-sampling the majority class or over-sampling the minority class to balance the dataset. Under-sampling balances the class distribution by eliminating majority class samples, but it may discard useful data. Over-sampling, on the contrary, replicates minority class samples, but it can increase the likelihood of overfitting. Therefore, we propose several resampling methods based on the k-means clustering technique. To decrease the probability of uneven resampling, we select representative samples to replace the majority class samples in the training dataset. Our experiments use 44 small class imbalance datasets and two large-scale datasets to build five types of classification models: C4.5, SVM, MLP, k-NN (k=5) and Naïve Bayes. A classifier ensemble algorithm is also employed. The research compares AUC results across resampling techniques, models and numbers of clusters. We also divide the imbalance ratio into three intervals. We try to find the best configuration in our experiments and compare it with methods from the literature. The experimental results show that combining the MLP classifier with clustering-based under-sampling using the nearest neighbors of the cluster centers performs best in terms of AUC on both the small and the large-scale datasets.
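Under-sampling by the nearest neighbors of the cluster centers can be sketched as follows: cluster the majority class with k-means, then keep, for each center, the real majority sample closest to it. This is a minimal pure-Python illustration under assumptions of my own (a plain Lloyd-style k-means, squared Euclidean distance, tuple-valued points); the function names and the toy data are hypothetical and not taken from the thesis.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=20, rng=None):
    """Plain k-means; returns the k cluster centers."""
    rng = rng or random.Random(0)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            groups[nearest].append(p)
        # Recompute each center as the mean of its group;
        # an empty group keeps its previous center.
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[c]
            for c, g in enumerate(groups)
        ]
    return centers

def undersample_by_centers(majority, n_keep, rng=None):
    """Keep the real majority sample nearest to each of n_keep centers."""
    centers = kmeans(majority, n_keep, rng=rng)
    # Two centers may share a nearest sample, so fewer than n_keep
    # distinct samples can be returned.
    picked = {
        min(range(len(majority)), key=lambda i: sq_dist(majority[i], c))
        for c in centers
    }
    return [majority[i] for i in sorted(picked)]

# Hypothetical majority class: 90 points around three locations.
data_rng = random.Random(1)
majority = [
    (data_rng.gauss(cx, 0.3), data_rng.gauss(cy, 0.3))
    for cx, cy in [(0, 0), (5, 5), (0, 5)]
    for _ in range(30)
]
kept = undersample_by_centers(majority, n_keep=3, rng=random.Random(2))
print(len(kept))
```

Setting `n_keep` to the size of the minority class would yield a roughly balanced training set made only of real majority samples, in contrast to using the synthetic cluster centers themselves.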
Neto, Appio Indiano do Brazil Americano. "Previsão automática de fraude em transações financeiras." Master's thesis, 2021. http://hdl.handle.net/10071/23741.
The detection of fraud in online payment transactions is a growing challenge, especially given the increase observed in recent years in the consumption of products and services through e-commerce. This dissertation describes the modeling process with machine learning techniques applied to a fraud detection problem, taking as reference the performance of teams participating in a competition promoted by the Kaggle platform. More specifically, attention is directed to data sampling techniques for dealing with class imbalance, to data preparation techniques for anomaly detection and knowledge mining, and finally to ensemble learning methods. The main contribution of this work, compared with other works that used the same dataset, is to demonstrate the importance of creating informative features on a large scale for the model's performance. The main technique of the process is the iterative creation of new features by comparing a set of variables of each transaction with several statistical measures of the group to which the transaction belongs.
Wu, Ping-Yi, and 吳秉怡. "Optimal Re-sampling Strategy for Multi-Class Imbalanced Data." Thesis, 2013. http://ndltd.ncl.edu.tw/handle/72083335909084160963.
國立交通大學
工業工程與管理系所
101
In many fields, developing an effective classification model to predict the category of incoming data is an important problem. For example, a classification model can be used to predict which types of goods customers will purchase or to determine whether a loan customer will default. However, real-world categorical data are often imbalanced, that is, the sample size of one class is significantly greater than that of the others. In this case, most classification methods fail to construct an accurate model. Several studies have focused on developing binary classification models, but these models are not appropriate for data involving three or more categories. Therefore, this study introduces an optimal re-sampling strategy that uses design of experiments (DOE) and dual response surface methodology (DRS) to improve the accuracy of classification models for multi-class imbalanced data. Real cases from the KEEL dataset repository are used to demonstrate the effectiveness of the proposed procedure.
Valencia-Zapata, Gustavo A. "Probabilistic Diagnostic Model for Handling Classifier Degradation in Machine Learning." Thesis, 2019.
Huang, Yi-Quan, and 黃義權. "Deep learning from imbalanced class data for automatic surface defect detection." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/d8ur8p.
元智大學
工業工程與管理學系
106
Dataset imbalance is a common problem in machine learning. In manufacturing, datasets are predominantly composed of positive (defect-free) samples with only a small quantity of negative (defective) samples, and a manufacturing process generally does not yield enough negative samples. Constructing a classification model from imbalanced datasets results in poor performance. In this study, a GAN (Generative Adversarial Network)-based model is used to generate negative samples from a very limited number of true defects. The true defect-free samples and the synthesized defective samples are then used to train a CNN (Convolutional Neural Network) model. The GAN model solves the imbalanced data problem arising in manufacturing inspection. The proposed method focuses on the automatic detection of saw-mark and stain defects on solar wafer surfaces. A multicrystalline solar wafer surface contains crystal grains of random shapes, sizes and orientations, which results in a heterogeneous texture. This heterogeneous texture makes automatic visual inspection extremely difficult. The proposed model shows good detection results on inhomogeneous textures such as solar wafers, the randomly textured surfaces in the DAGM 2007 open dataset, natural wood surfaces, and machined surfaces. For the synthesis of defective samples, only 21 to 90 real defective samples are used as input to the CycleGAN model. The CycleGAN outputs 8,000 to 12,000 generated defect samples, which are used to train the CNN. Using 12,000 training samples, the detection rate on small-windowed test samples reaches 85% to 95%. The model also achieves a very high recognition rate on large multicrystalline solar wafer surfaces, machined surfaces, and DAGM textured surfaces.
Last, Felix. "Oversampling for imbalanced learning based on k-means and smote." Master's thesis, 2018. http://hdl.handle.net/10362/31042.
Learning from class-imbalanced data continues to be a common and challenging problem in supervised learning, as standard classification algorithms are designed to handle balanced class distributions. While different strategies exist to tackle this problem, methods that generate artificial data to achieve a balanced class distribution are more versatile than modifications to the classification algorithm. Such techniques, called oversamplers, modify the training data, allowing any classifier to be used with class-imbalanced datasets. Many algorithms have been proposed for this task, but most are complex and tend to generate unnecessary noise. This work presents a simple and effective oversampling method based on k-means clustering and SMOTE oversampling, which avoids the generation of noise and effectively overcomes imbalances between and within classes. Empirical results of extensive experiments with 71 datasets show that training data oversampled with the proposed method improves classification results. Moreover, k-means SMOTE consistently outperforms other popular oversampling methods. An implementation is made available in the Python programming language.
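The SMOTE building block that this method relies on creates each synthetic point by interpolating between a minority sample and one of its nearest minority neighbors. The sketch below shows only that interpolation step, not the k-means stage or the thesis implementation; the function name `smote_like`, its parameters, and the toy data are hypothetical.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def smote_like(minority, n_new, k=3, rng=None):
    """Generate n_new synthetic points by SMOTE-style interpolation."""
    rng = rng or random.Random()
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbors of the base point, excluding itself.
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sq_dist(p, base),
        )[:k]
        other = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic

# Hypothetical minority class: the four corners of the unit square.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like(minority, n_new=5, k=2, rng=random.Random(0))
print(len(new_points))  # 5
```

Every synthetic point lies on a segment between two existing minority samples. Restricting the interpolation to samples from the same cluster, as a clustering stage would, is one way to keep such segments from crossing regions occupied by the majority class.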