
Journal articles on the topic 'Class imbalance'


Consult the top 50 journal articles for your research on the topic 'Class imbalance.'


1

Hosen, Md Saikat, and Sai Srujan Gutlapalli. "A Study of Innovative Class Imbalance Dataset Software Defect Prediction Methods." Asian Journal of Applied Science and Engineering 10, no. 1 (December 10, 2021): 52–55. http://dx.doi.org/10.18034/ajase.v10i1.52.

Abstract:
Data mining for software defect prediction is an effective approach for detecting problematic modules. Standard classification methods can speed up knowledge discovery on class-balanced datasets, but real-world data are rarely balanced: one class typically dominates the other, producing class-imbalanced (skewed) data sources. As class imbalance increases, the fault prediction rate decreases. For class-imbalanced data streams, the suggested algorithms use dedicated oversampling and undersampling strategies to remove noisy and weak examples from both the majority and minority classes. We test three techniques on class-imbalanced software defect datasets using four assessment measures. Results indicate that the class imbalance in software defect datasets can be effectively mitigated.
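The combined oversampling/undersampling idea this abstract describes can be sketched in a few lines. This is an illustrative toy (random resampling toward a common class size), not the paper's algorithm, and `hybrid_resample` is a hypothetical helper name:

```python
import random

def hybrid_resample(X, y, seed=0):
    """Balance a dataset by undersampling larger classes and oversampling
    smaller ones toward a common target size (midpoint of the extremes).
    Illustrative sketch only, not the paper's noise-aware method."""
    rng = random.Random(seed)
    classes = sorted(set(y))
    groups = {c: [x for x, label in zip(X, y) if label == c] for c in classes}
    sizes = sorted(len(g) for g in groups.values())
    target = (sizes[0] + sizes[-1]) // 2  # meet in the middle
    X_out, y_out = [], []
    for c in classes:
        g = groups[c]
        if len(g) >= target:                       # undersample majority
            sample = rng.sample(g, target)
        else:                                      # oversample minority (with replacement)
            sample = [rng.choice(g) for _ in range(target)]
        X_out.extend(sample)
        y_out.extend([c] * target)
    return X_out, y_out
```

A real defect-prediction pipeline would additionally filter noisy borderline examples before resampling, as the abstract suggests.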
2

Dube, Lindani, and Tanja Verster. "Enhancing classification performance in imbalanced datasets: A comparative analysis of machine learning models." Data Science in Finance and Economics 3, no. 4 (2023): 354–79. http://dx.doi.org/10.3934/dsfe.2023021.

Abstract:
In the realm of machine learning, where data-driven insights guide decision-making, addressing the challenges posed by class imbalance in datasets has emerged as a crucial concern. The effectiveness of classification algorithms hinges not only on their intrinsic capabilities but also on their adaptability to uneven class distributions, a common issue encountered across diverse domains. This study delves into the intricate interplay between varying class imbalance levels and the performance of ten distinct classification models, unravelling the critical impact of this imbalance on the landscape of predictive analytics. Results showed that random forest (RF) and decision tree (DT) models outperformed others, exhibiting robustness to class imbalance. Logistic regression (LR), stochastic gradient descent classifier (SGDC) and naïve Bayes (NB) models struggled with imbalanced datasets. Adaptive boosting (ADA), gradient boosting (GB), extreme gradient boosting (XGB), light gradient boosting machine (LGBM), and k-nearest neighbour (kNN) models improved with balanced data. Adaptive synthetic sampling (ADASYN) yielded more reliable predictions than the under-sampling (UNDER) technique. This study provides insights for practitioners and researchers dealing with imbalanced datasets, guiding model selection and data balancing techniques. RF and DT models demonstrate superior performance, while LR, SGDC and NB models have limitations. By leveraging the strengths of RF and DT models and addressing class imbalance, classification performance in imbalanced datasets can be enhanced. This study enriches credit risk modelling literature by revealing how class imbalance impacts default probability estimation. The research deepens our understanding of class imbalance's critical role in predictive analytics. Serving as a roadmap for practitioners and researchers dealing with imbalanced data, the findings guide model selection and data balancing strategies, enhancing classification performance despite class imbalance.
3

Zhang, Linbin, Caiguang Zhang, Sinong Quan, Huaxin Xiao, Gangyao Kuang, and Li Liu. "A Class Imbalance Loss for Imbalanced Object Recognition." IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 13 (2020): 2778–92. http://dx.doi.org/10.1109/jstars.2020.2995703.

4

Xue, Jie, and Jinwei Ma. "Extreme Sample Imbalance Classification Model Based on Sample Skewness Self-Adaptation." Symmetry 15, no. 5 (May 14, 2023): 1082. http://dx.doi.org/10.3390/sym15051082.

Abstract:
This paper aims to solve the asymmetric problem of sample classification under extreme class imbalance. Inspired by Krawczyk (2016)'s improvement direction for extreme sample-imbalance classification, the paper adopts the AdaBoost framework and optimizes the sample weight update function in each iteration. The update not only accounts for the sampling weights of misclassified samples but also pays extra attention to misclassified minority-class samples. This makes the model more adaptable to imbalanced and extremely imbalanced class distributions, makes the weight adjustment for hard-to-classify samples more adaptive, and creates symmetry between minority and majority samples by adjusting the class distribution of the datasets. On this basis, the imbalance boosting model Imbalance AdaBoost (ImAdaBoost) is constructed. In the experiments, ImAdaBoost is compared with the original model and mainstream imbalance classification models on imbalanced datasets with different ratios, including an extremely imbalanced dataset. The results show that ImAdaBoost achieves good minority-class recall on weakly extreme and general class-imbalance sets; on the weakly extreme imbalance set, the average minority-class recall of the mainstream imbalance classification models is 7% lower than that of ImAdaBoost. ImAdaBoost keeps minority-class recall at the middle level of the comparison models, and its F1-score performs well, demonstrating stable minority-class classification on extremely imbalanced datasets.
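The direction described here, an AdaBoost-style reweighting step that additionally boosts misclassified minority samples, can be sketched as follows. This is a hypothetical illustration of the idea, not ImAdaBoost's exact update rule, and `gamma` is an assumed extra-boost factor:

```python
import math

def imbalance_weight_update(weights, y_true, y_pred, minority_label, alpha, gamma=1.5):
    """One AdaBoost-style reweighting step. Correct samples are downweighted
    by exp(-alpha), misclassified ones upweighted by exp(alpha), and
    misclassified minority samples get an extra factor gamma.
    Hypothetical sketch of the direction described, not the paper's rule."""
    new_w = []
    for w, t, p in zip(weights, y_true, y_pred):
        if t == p:
            new_w.append(w * math.exp(-alpha))
        else:
            boost = gamma if t == minority_label else 1.0
            new_w.append(w * math.exp(alpha) * boost)
    total = sum(new_w)
    return [w / total for w in new_w]  # renormalise to a distribution
```

With `gamma > 1`, a misclassified minority example ends the round carrying more weight than an equally misclassified majority example, which is the asymmetry the abstract motivates.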
5

Munguía Mondragón, Julio Cesar, Eréndira Rendón Lara, Roberto Alejo Eleuterio, Everardo Efrén Granda Gutiérrez, and Federico Del Razo López. "Density-Based Clustering to Deal with Highly Imbalanced Data in Multi-Class Problems." Mathematics 11, no. 18 (September 21, 2023): 4008. http://dx.doi.org/10.3390/math11184008.

Abstract:
In machine learning and data mining applications, an imbalanced distribution of classes in the training dataset can drastically affect the performance of learning models. The class imbalance problem is frequently observed in real-world classification tasks when the available instances of one class are much fewer than the amount of data available in other classes. Machine learning algorithms that do not consider the class imbalance can introduce a strong bias towards the majority class, while the minority class is usually neglected. Thus, sampling techniques, mainly based on random undersampling and oversampling, have been extensively used in various studies to overcome class imbalance. However, there is still no definitive solution, especially in the domain of multi-class problems. A strategy that combines density-based clustering algorithms with random undersampling and oversampling techniques is studied in this work. To analyze the performance of the studied method, an experimental validation was carried out on a collection of hyperspectral remote sensing images, with a deep learning neural network as the classifier. This data bank contains six datasets with different imbalance ratios, from slight to severe. The experimental results show that the studied method outperforms other state-of-the-art methods in classification performance as measured by the geometric mean of precision, mainly for highly imbalanced datasets.
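The aggregate measure this abstract reports, a geometric mean over per-class precision, can be sketched generically. The paper's exact definition may differ; this is just the textbook form of a geometric-mean score, which collapses to zero when any class is completely missed:

```python
import math

def geometric_mean_precision(y_true, y_pred):
    """Geometric mean of per-class precision: an aggregate that punishes a
    model for neglecting any single class. Generic sketch of the kind of
    measure the abstract reports, not the paper's exact formula."""
    classes = sorted(set(y_true))
    precisions = []
    for c in classes:
        true_of_predicted = [t for t, p in zip(y_true, y_pred) if p == c]
        if not true_of_predicted:      # class never predicted: zero precision
            return 0.0
        precisions.append(sum(1 for t in true_of_predicted if t == c) / len(true_of_predicted))
    return math.prod(precisions) ** (1.0 / len(classes))
```

Because a single zero zeroes the whole product, this kind of score is far more sensitive to minority-class failures than plain accuracy, which is why it suits highly imbalanced multi-class data.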
6

Lango, Mateusz. "Tackling the Problem of Class Imbalance in Multi-class Sentiment Classification: An Experimental Study." Foundations of Computing and Decision Sciences 44, no. 2 (June 1, 2019): 151–78. http://dx.doi.org/10.2478/fcds-2019-0009.

Abstract:
Sentiment classification is an important task which has gained extensive attention both in academia and in industry. Many issues related to this task, such as the handling of negation or of sarcastic utterances, were analyzed and addressed in previous works. However, the issue of class imbalance, which often compromises the prediction capabilities of learning algorithms, has scarcely been studied. In this work, we aim to bridge the gap between imbalanced learning and sentiment analysis. An experimental study including twelve imbalanced-learning preprocessing methods, four feature representations, and a dozen datasets is carried out in order to analyze the usefulness of imbalanced-learning methods for sentiment classification. Moreover, the data difficulty factors commonly studied in imbalanced learning are investigated on sentiment corpora to evaluate the impact of class imbalance.
7

Juba, Brendan, and Hai S. Le. "Precision-Recall versus Accuracy and the Role of Large Data Sets." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 4039–48. http://dx.doi.org/10.1609/aaai.v33i01.33014039.

Abstract:
Practitioners of data mining and machine learning have long observed that the imbalance of classes in a data set negatively impacts the quality of classifiers trained on that data. Numerous techniques for coping with such imbalances have been proposed, but nearly all lack any theoretical grounding. By contrast, the standard theoretical analysis of machine learning admits no dependence on the imbalance of classes at all. The basic theorems of statistical learning establish the number of examples needed to estimate the accuracy of a classifier as a function of its complexity (VC-dimension) and the confidence desired; the class imbalance does not enter these formulas anywhere. In this work, we consider measures of classifier performance in terms of precision and recall, which are widely suggested as more appropriate for the classification of imbalanced data. We observe that whenever the precision is moderately large, the worse of the precision and recall is within a small constant factor of the accuracy weighted by the class imbalance. A corollary of this observation is that a larger number of examples is necessary and sufficient to address class imbalance, a finding we also illustrate empirically.
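The motivation behind the paper's precision/recall framing is easy to reproduce numerically: under heavy imbalance, plain accuracy can be high while the worse of precision and recall stays small. A minimal confusion-matrix sketch with illustrative numbers (not taken from the paper):

```python
def precision_recall_accuracy(tp, fp, fn, tn):
    """Compute precision, recall, and accuracy from binary confusion-matrix
    counts (tp/fp/fn/tn for the minority 'positive' class)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy

# A 5%-positive dataset where the classifier finds few positives:
# accuracy looks excellent while min(precision, recall) is tiny.
p, r, a = precision_recall_accuracy(tp=5, fp=5, fn=45, tn=945)
```

Here accuracy is 0.95 while recall is only 0.10, which is precisely the gap between accuracy-based guarantees and precision/recall-based ones that the paper analyzes.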
8

Hartono, Hartono, Erianto Ongko, and Yeni Risyani. "Combining feature selection and hybrid approach redefinition in handling class imbalance and overlapping for multi-class imbalanced." Indonesian Journal of Electrical Engineering and Computer Science 21, no. 3 (March 10, 2021): 1513. http://dx.doi.org/10.11591/ijeecs.v21.i3.pp1513-1522.

Abstract:
Classification processes often face class imbalance problems: in addition to the uneven distribution of instances, which causes poor performance, overlapping between classes also degrades performance. This paper proposes a method that combines feature selection with the hybrid approach redefinition (HAR) method to handle class imbalance and overlapping in multi-class imbalanced problems. HAR is a hybrid ensemble method for handling the class imbalance problem. The main contribution of this work is a new method that can overcome both class imbalance and overlapping in the multi-class imbalance problem, giving better results in terms of classifier performance and overlap degree. This is achieved by improving an ensemble learning algorithm and a preprocessing technique in HAR using minimizing overlapping selection under SMOTE (MOSS), a very popular feature selection method for handling overlapping. To validate the proposed method, this research uses augmented R-value, mean AUC, mean F-measure, mean G-mean, and mean precision. The performance of the model is evaluated against the hybrid method (MBP+CGE), a popular method for handling class imbalance and overlapping in multi-class imbalanced problems. The proposed method is found to be superior in classifier performance, as indicated by better mean AUC, F-measure, G-mean, and precision.
9

Dube, Lindani, and Tanja Verster. "Interpretability of the random forest model under class imbalance." Data Science in Finance and Economics 4, no. 3 (2024): 446–68. http://dx.doi.org/10.3934/dsfe.2024019.

Abstract:
In predictive modeling, addressing class imbalance is a critical concern, particularly in applications where certain classes are disproportionately represented. This study delved into the implications of class imbalance for the interpretability of random forest models, investigating their performance in churn and fraud detection scenarios. We trained and evaluated random forest models on churn datasets with class imbalances ranging from 20% to 50% and fraud datasets with imbalances from 1% to 15%. The results revealed consistent improvements in precision, recall, F1-score, and accuracy as class imbalance decreased, indicating that models become more precise and accurate in identifying rare events with balanced datasets. Additionally, we employed interpretability techniques such as Shapley values, partial dependence plots (PDPs), and breakdown plots to elucidate the effect of class imbalance on model interpretability. Shapley values showed varying feature importance across different class distributions, with a general decrease as datasets became more balanced. PDPs illustrated a consistent upward trend in estimated values as datasets approached balance, indicating consistent relationships between input variables and predicted outcomes. Breakdown plots highlighted significant changes in individual predictions as class imbalance varied, underscoring the importance of considering class distribution when interpreting model outputs. These findings contribute to our understanding of the complex interplay between class balance, model performance, and interpretability, offering insights for developing more robust and reliable predictive models in real-world applications.
10

Lin, Hsien-I., and Mihn Cong Nguyen. "Boosting Minority Class Prediction on Imbalanced Point Cloud Data." Applied Sciences 10, no. 3 (February 2, 2020): 973. http://dx.doi.org/10.3390/app10030973.

Abstract:
Data imbalance during the training of deep networks can cause the network to neglect minority classes during learning. This paper presents a novel framework for training segmentation networks on imbalanced point cloud data. PointNet, an early deep network used for the segmentation of point cloud data, proved effective in the point-wise classification of balanced data; however, performance degraded when imbalanced data was used. The proposed approach removes between-class imbalances among data points and guides the network to pay more attention to minority classes. Data imbalance is alleviated using a hybrid-sampling method: undersampling decreases the amount of data in majority classes, while oversampling increases the amount of data in minority classes. A balanced focus loss function is also used to emphasize the minority classes through the automated assignment of costs to the various classes based on their density in the point cloud. Experiments demonstrate the effectiveness of the proposed training framework on a point cloud dataset of six objects. The mean intersection over union (mIoU) test accuracy obtained with standard PointNet training was 91% on XYZRGB data and 86% on XYZ data; with the proposed scheme it was 98% on XYZRGB data and 93% on XYZ data.
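The density-weighted focal-style loss described here can be sketched generically: a focal term `(1 - p)^gamma` downweights easy samples, and a per-class weight inversely proportional to class frequency raises the cost of minority classes. This is a standard class-balanced focal-loss sketch, not the paper's exact "balanced focus loss"; `gamma` and the weight normalisation are assumptions:

```python
import math

def balanced_focal_loss(probs, labels, class_counts, gamma=2.0):
    """Mean focal-style loss with per-class weights inversely proportional
    to class density. probs[i] is the predicted probability of the true
    class of sample i; class_counts maps class -> number of points.
    Generic sketch of the idea, not the paper's exact loss."""
    total = sum(class_counts.values())
    # Inverse-frequency weights, normalised so a uniform distribution gives 1.0
    weights = {c: total / (len(class_counts) * n) for c, n in class_counts.items()}
    loss = 0.0
    for p, c in zip(probs, labels):
        loss += weights[c] * (1.0 - p) ** gamma * -math.log(max(p, 1e-12))
    return loss / len(labels)
```

With counts {0: 90, 1: 10}, a minority-class mistake costs roughly nine times a majority-class mistake at the same predicted probability, which is how the loss steers attention toward sparse classes in the point cloud.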
11

Najeeb, Miftah Asharaf, and Alhaam Alariyibi. "Imbalanced Dataset Effect on CNN-Based Classifier Performance for Face Recognition." International Journal of Artificial Intelligence & Applications 15, no. 1 (January 29, 2024): 25–41. http://dx.doi.org/10.5121/ijaia.2024.15102.

Abstract:
Facial Recognition is integral to numerous modern applications, such as security systems, social media platforms, and augmented reality apps. The success of these systems heavily depends on the performance of the Face Recognition models they use, specifically Convolutional Neural Networks (CNNs). However, many real-world classification tasks encounter imbalanced datasets, with some classes significantly underrepresented. Face Recognition models that do not address this class imbalance tend to exhibit poor performance, especially in tasks involving a wide range of faces to identify (multi-class problems). This research examines how class imbalance in datasets impacts the creation of neural network classifiers for Facial Recognition. Initially, we crafted a Convolutional Neural Network model for facial recognition, integrating hybrid resampling methods (oversampling and under-sampling) to address dataset imbalances. In addition, augmentation techniques were implemented to enhance generalization capabilities and overall performance. Through comprehensive experimentation, we assess the influence of imbalanced datasets on the performance of the CNN-based classifier. Using Pins face data, we conducted an empirical study, evaluating conclusions based on accuracy, precision, recall, and F1-score measurements. A comparative analysis demonstrates that the performance of the proposed Convolutional Neural Network classifier diminishes in the presence of dataset class imbalances. Conversely, the proposed system, utilizing data resampling techniques, notably enhances classification performance for imbalanced datasets. This study underscores the efficacy of data resampling approaches in augmenting the performance of Face Recognition models, presenting prospects for more dependable and efficient future systems.
12

Sowah, Robert A., Moses A. Agebure, Godfrey A. Mills, Koudjo M. Koumadi, and Seth Y. Fiawoo. "New Cluster Undersampling Technique for Class Imbalance Learning." International Journal of Machine Learning and Computing 6, no. 3 (June 2016): 205–14. http://dx.doi.org/10.18178/ijmlc.2016.6.3.599.

13

Patel, Harshita, Dharmendra Singh Rajput, G. Thippa Reddy, Celestine Iwendi, Ali Kashif Bashir, and Ohyun Jo. "A review on classification of imbalanced data for wireless sensor networks." International Journal of Distributed Sensor Networks 16, no. 4 (April 2020): 155014772091640. http://dx.doi.org/10.1177/1550147720916404.

Abstract:
Classification of imbalanced data has been a vastly explored issue over the last two decades and remains just as important, because data are essential today and become critical when distributed into several classes. The term imbalance refers to an uneven distribution of data across classes, which severely affects the performance of traditional classifiers: classifiers become biased toward the class with the larger amount of data. Data generated by wireless sensor networks exhibit several such imbalances. This review article analyses the imbalance issue for wireless sensor networks and other application domains, helping the community understand the WHAT, WHY, and WHEN of imbalance in data and its remedies.
14

Gautam, Subrat, and Ratul Dey. "METHODS FOR CLASSIFICATION OF IMBALANCED DATA: A REVIEW." International Research Journal of Computer Science 9, no. 4 (April 30, 2022): 89–95. http://dx.doi.org/10.26562/irjcs.2021.v0904.004.

Abstract:
Imbalance in datasets poses numerous challenges to implementing data analytics with machine learning in real-world applications. Data imbalance occurs when the sample size of one class is much smaller or larger than that of another class. The performance of predictive models is greatly affected when the dataset is highly imbalanced, and the effect grows with sample size. Overall, imbalanced training data have a major negative impact on performance: leading machine learning techniques cope with an imbalanced dataset by effectively ignoring the minority class while minimising error on the majority class. This article presents a review of different approaches to classifying imbalanced datasets and their application areas.
15

Megahed, Fadel M., Ying-Ju Chen, Aly Megahed, Yuya Ong, Naomi Altman, and Martin Krzywinski. "The class imbalance problem." Nature Methods 18, no. 11 (October 15, 2021): 1270–72. http://dx.doi.org/10.1038/s41592-021-01302-4.

16

Narwane, Swati V., and Sudhir D. Sawarkar. "Effects of Class Imbalance Using Machine Learning Algorithms." International Journal of Applied Evolutionary Computation 12, no. 1 (January 2021): 1–17. http://dx.doi.org/10.4018/ijaec.2021010101.

Abstract:
Class imbalance is a major hurdle for machine-learning-based systems. The dataset is the backbone of machine learning and must be studied to handle class imbalance. The purpose of this paper is to investigate the effect of class imbalance on datasets. The proposed methodology determines model accuracy across class distributions. To find possible solutions, the behaviour of an imbalanced dataset was investigated. The study considers two case studies, with datasets ranging from balanced to unbalanced class distributions. The datasets were tested with standard machine learning algorithms using training and test splits: model accuracy for each class distribution was measured on the training set, and the built model was then tested on each binary class individually. Results show that, to improve system performance, it is essential to address class imbalance problems. The study concludes that the system produces biased results due to the majority class. In the future, the multiclass imbalance problem can be studied using advanced algorithms.
17

Brzezinski, Dariusz, Leandro L. Minku, Tomasz Pewinski, Jerzy Stefanowski, and Artur Szumaczuk. "The impact of data difficulty factors on classification of imbalanced and concept drifting data streams." Knowledge and Information Systems 63, no. 6 (April 1, 2021): 1429–69. http://dx.doi.org/10.1007/s10115-021-01560-w.

Abstract:
Class imbalance introduces additional challenges when learning classifiers from concept-drifting data streams. Most existing work focuses on designing new algorithms for dealing with the global imbalance ratio and does not consider other data complexities. Independent research on static imbalanced data has highlighted the influential role of local data difficulty factors such as minority class decomposition and the presence of unsafe types of examples. Despite often being present in real-world data, the interactions between concept drifts and local data difficulty factors have not yet been investigated in concept-drifting data streams. We thoroughly study the impact of such interactions on drifting imbalanced streams. For this purpose, we put forward a new categorization of concept drifts for class-imbalanced problems. Through comprehensive experiments with synthetic and real data streams, we study the influence of concept drifts, global class imbalance, local data difficulty factors, and their combinations on the predictions of representative online classifiers. Experimental results reveal the high influence of the newly considered factors and their local drifts, as well as differences in existing classifiers' reactions to such factors. Combinations of multiple factors are the most challenging for classifiers. Although existing classifiers are partially capable of coping with global class imbalance, new approaches are needed to address the challenges posed by imbalanced data streams.
18

Sun, Jie, Xin Liu, Wenguo Ai, and Qianyuan Tian. "Dynamic financial distress prediction based on class-imbalanced data batches." International Journal of Financial Engineering 8, no. 3 (May 14, 2021): 2150026. http://dx.doi.org/10.1142/s2424786321500262.

Abstract:
This study proposes two approaches for dynamic financial distress prediction (FDP) based on class-imbalanced data batches by considering both concept drift and class imbalance. One is based on sliding time window and synthetic minority over-sampling technique (SMOTE) and the other is based on sliding time window and majority class partition. Support vector machine, multiple discriminant analysis (MDA) and logistic regression are used as base classifiers in the experiments on a real-world dataset. The results indicate that the two approaches perform better than the pure dynamic FDP (DFDP) models without class imbalance processing and the static FDP models either with or without class imbalance processing.
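The sliding-time-window mechanism that both approaches share can be sketched in a few lines. Only the batching is shown; the per-window rebalancing (SMOTE or majority-class partition) and the base classifiers are omitted, and the parameter names are assumptions:

```python
def sliding_window_batches(stream, window_size, step):
    """Yield overlapping training windows over a time-ordered data stream
    (a list of samples). Each window would then be rebalanced (e.g. with
    SMOTE) and used to refit the base classifier, so the model tracks
    concept drift batch by batch. Sketch of the window mechanics only."""
    last_start = max(len(stream) - window_size, 0)
    for start in range(0, last_start + 1, step):
        yield stream[start:start + window_size]
```

Each yielded window is a training batch whose class distribution is then corrected before fitting, which is how the paper combines drift handling (the window) with imbalance handling (the resampling).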
19

Lee, Heewon, and Sangtae Ahn. "Improving the Performance of Object Detection by Preserving Balanced Class Distribution." Mathematics 11, no. 21 (October 27, 2023): 4460. http://dx.doi.org/10.3390/math11214460.

Abstract:
Object detection is a task that performs position identification and label classification of objects in images or videos. The information obtained through this process plays an essential role in various computer vision tasks. In object detection, the data used for training and validation typically originate from public datasets that are well balanced in terms of the number of objects ascribed to each class in an image. In real-world scenarios, however, datasets with much greater class imbalance, i.e., very different numbers of objects for each class, are far more common, and this imbalance may reduce detection performance on unseen test images. In this study, we therefore propose a method that evenly distributes the classes in an image for training and validation, solving the class imbalance problem in object detection. Our proposed method aims to maintain a uniform class distribution through multi-label stratification. We tested the method not only on public datasets that typically exhibit balanced class distribution but also on private datasets that may have imbalanced class distribution, and found it to be more effective on datasets containing severe imbalance and less data. Our findings indicate that the proposed method can be effectively used on datasets with substantially imbalanced class distribution.
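The core idea, splitting data so each class keeps the same proportion in train and validation, can be sketched in the single-label case. The paper's multi-label stratification (images carry several object classes at once) is more involved; this simplified stand-in only illustrates the balance-preserving split:

```python
import random
from collections import defaultdict

def stratified_split(items, labels, val_fraction=0.2, seed=0):
    """Split items into train/validation so each class contributes
    (approximately) the same fraction to both sides. Single-label sketch
    of the stratification idea, not the paper's multi-label algorithm."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for item, label in zip(items, labels):
        by_class[label].append(item)
    train, val = [], []
    for label, group in by_class.items():
        rng.shuffle(group)
        k = int(round(len(group) * val_fraction))
        val.extend((x, label) for x in group[:k])
        train.extend((x, label) for x in group[k:])
    return train, val
```

A naive random split on a heavily imbalanced dataset can leave a rare class entirely out of validation; stratifying per class rules that out by construction.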
20

Hakim, Arif Rahman, Kalamullah Ramli, Muhammad Salman, and Esti Rahmawati Agustina. "Improving Model Performance for Predicting Exfiltration Attacks Through Resampling Strategies." IIUM Engineering Journal 26, no. 1 (January 10, 2025): 420–36. https://doi.org/10.31436/iiumej.v26i1.3547.

Abstract:
Addressing class imbalance is critical in cybersecurity applications, particularly in scenarios like exfiltration detection, where skewed datasets lead to biased predictions and poor generalization for minority classes. This study investigates five Synthetic Minority Oversampling Technique (SMOTE) variants, BorderlineSMOTE, KMeansSMOTE, SMOTENC, SMOTEENN, and SMOTETomek, to mitigate severe imbalance in our customized tactic-labeled dataset, which exhibits dominant majority-class influence and weak class separability. We use seven imbalance metrics to assess each SMOTE variant's impact on class distribution stability and separability. Furthermore, we evaluate model performance across five classifiers: Logistic Regression, Naïve Bayes, Support Vector Machine, Random Forest, and XGBoost. Findings reveal that SMOTEENN consistently enhances performance metrics (accuracy, precision, recall, F1-score, and geometric mean) to an average of 99% across most classifiers, establishing itself as the most adaptable variant for handling imbalance. This study provides a comprehensive framework for selecting resampling strategies to enhance classification efficacy in cybersecurity tasks with imbalanced data.
21

Thamrin, Sri Astuti, Dian Sidik, Hedi Kuswanto, Armin Lawi, and Ansariadi Ansariadi. "Exploration of Obesity Status of Indonesia Basic Health Research 2013 With Synthetic Minority Over-Sampling Techniques." Indonesian Journal of Statistics and Its Applications 5, no. 1 (March 31, 2021): 75–91. http://dx.doi.org/10.29244/ijsa.v5i1p75-91.

Abstract:
The accuracy of data classes is very important in classification with a machine learning approach: the more accurate the datasets and classes, the better the output generated by machine learning. In practice, classification can encounter imbalanced class data, in which the classes do not hold equal portions of the dataset. This imbalance affects classification accuracy, and one of the easiest ways to correct imbalanced classes is to balance them. This study aims to explore the class imbalance problem in a medium-sized case dataset and to address that imbalance. The Synthetic Minority Over-Sampling Technique (SMOTE) is used to overcome the class imbalance in obesity status in the Indonesia 2013 Basic Health Research (RISKESDAS) data. The results show proportions of 13.9% for the obese class and 84.6% for the non-obese class, an imbalance of moderate degree. SMOTE with 600% oversampling can raise the level of the minority (obesity) class so that the obesity status classes become balanced. The SMOTE technique therefore performed better than no resampling in exploring the obesity status of the Indonesia RISKESDAS 2013 data.
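The SMOTE procedure used here (and in several other entries above) generates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. A textbook sketch on small numeric tuples, not the RISKESDAS study's implementation; `k` and the distance function are the usual defaults-style assumptions:

```python
import random

def smote_oversample(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic minority samples: pick a minority point,
    pick one of its k nearest minority neighbours, and interpolate at a
    random fraction between them. Textbook SMOTE sketch."""
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: sq_dist(base, p))[:k]
        pick = rng.choice(neighbours)
        lam = rng.random()  # interpolation fraction in [0, 1)
        synthetic.append(tuple(b + lam * (q - b) for b, q in zip(base, pick)))
    return synthetic
```

Because every synthetic point lies on a segment between two existing minority points, the new samples stay inside the minority region rather than duplicating records, which is SMOTE's advantage over plain random oversampling.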
22

Palli, Abdul Sattar, Jafreezal Jaafar, Abdul Rehman Gilal, Aeshah Alsughayyir, Heitor Murilo Gomes, Abdullah Alshanqiti, and Mazni Omar. "Online Machine Learning from Non-stationary Data Streams in the Presence of Concept Drift and Class Imbalance: A Systematic Review." Journal of Information and Communication Technology 23, no. 1 (January 30, 2024): 105–39. http://dx.doi.org/10.32890/jict2024.23.1.5.

Abstract:
In IoT environments, applications generate continuous non-stationary data streams with built-in problems of concept drift and class imbalance, which cause classifier performance degradation. Imbalanced data affects the classifier during both concept detection and concept adaptation. In general, for concept detection, a separate mechanism called a drift detector is added in parallel with the classifier to detect concept drift. For concept adaptation, the classifier updates itself or trains a new classifier to replace the older one. If the data stream also faces a class imbalance issue, the classifier may not properly adapt to the latest concept. In this survey, we study how existing work addresses the issues of class imbalance and concept drift while learning from non-stationary data streams. We further highlight the limitations of existing work and the challenges caused by other factors of class imbalance along with concept drift in data stream classification. Using our inclusion and exclusion criteria, we narrowed a pool of 1110 studies down to 35 that directly addressed our study objectives. The survey found that issues such as multiple concept drift types, dynamic class imbalance ratios, and multi-class imbalance in the presence of concept drift are still open for further research. We also observed that, while major research efforts have been dedicated to resolving concept drift and class imbalance, not much attention has been given to within-class imbalance, rare examples, and borderline instances when they co-occur with concept drift in multi-class data. This paper concludes with some suggested future directions.
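The drift-detector-in-parallel pattern the survey describes can be illustrated with a deliberately simple detector that watches minority-class recall over a sliding window and flags drift when it collapses relative to its best observed value. This is a hypothetical illustration of the pattern, not a method from the surveyed papers:

```python
from collections import deque

class MinorityRecallDriftDetector:
    """Runs alongside a classifier: feed it (true, predicted) minority
    indicators per sample; it flags drift when windowed minority recall
    drops below `tolerance` times its best observed value.
    Hypothetical sketch of a drift detector sensitive to imbalance."""

    def __init__(self, window=100, tolerance=0.5):
        self.hits = deque(maxlen=window)  # 1 = minority sample caught
        self.best = 0.0
        self.tolerance = tolerance

    def update(self, true_is_minority, predicted_is_minority):
        if not true_is_minority:
            return False                  # only minority samples inform recall
        self.hits.append(1 if predicted_is_minority else 0)
        recall = sum(self.hits) / len(self.hits)
        self.best = max(self.best, recall)
        return len(self.hits) >= 10 and recall < self.tolerance * self.best
```

Tracking minority recall instead of overall error is one way to avoid the failure mode the survey highlights: under heavy imbalance, a drift that destroys minority-class performance barely moves the overall error rate.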
APA, Harvard, Vancouver, ISO, and other styles
23

Dr. P, Ratna Babu, and Lokaiah P. "An effective noise reduction technique for class imbalance classification." International Journal of Psychosocial Rehabilitation 24, no. 04 (February 28, 2020): 985–90. http://dx.doi.org/10.37200/ijpr/v24i4/pr201070.

Full text
APA, Harvard, Vancouver, ISO, and other styles
24

Fu, Cui, Shuisheng Zhou, Dan Zhang, and Li Chen. "Relative Density-Based Intuitionistic Fuzzy SVM for Class Imbalance Learning." Entropy 25, no. 1 (December 24, 2022): 34. http://dx.doi.org/10.3390/e25010034.

Full text
Abstract:
The support vector machine (SVM) has been combined with the intuitionistic fuzzy set to suppress the negative impact of noise and outliers in classification. However, it has some inherent defects, resulting in inaccurate prior distribution estimation for datasets, especially imbalanced datasets with non-normally distributed data, further reducing the performance of the classification model for imbalance learning. To solve these problems, we propose a novel relative density-based intuitionistic fuzzy support vector machine (RIFSVM) algorithm for imbalanced learning in the presence of noise and outliers. In our proposed algorithm, the relative density, estimated from the k-nearest-neighbor distances, is used to calculate the intuitionistic fuzzy numbers. The fuzzy values of the majority class instances are designed by multiplying the score function of the intuitionistic fuzzy number by the imbalance ratio, and the fuzzy values of minority class instances are assigned the intuitionistic fuzzy membership degree. With the strong ability of the relative density to capture prior information and the strong ability of the intuitionistic fuzzy score function to recognize noise and outliers, the proposed RIFSVM not only reduces the influence of class imbalance but also suppresses the impact of noise and outliers, further improving classification performance. Experiments on synthetic and public imbalanced datasets show that our approach achieves better performance in terms of G-means, F-measures, and AUC than other class imbalance classification algorithms.
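The relative-density idea the abstract describes (estimated from k-nearest-neighbor distances) can be illustrated with a small sketch. This is a generic kNN-distance density estimate, not the paper's exact RIFSVM construction:

```python
import math

def knn_density(points, k=3):
    """Relative density of each point: inverse of the mean distance to its
    k nearest neighbours (larger value = denser neighbourhood)."""
    dens = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        dens.append(1.0 / (sum(dists[:k]) / k))
    return dens

# A tight cluster plus one far-away outlier.
pts = [(0, 0), (0.1, 0), (0, 0.1), (0.1, 0.1), (5, 5)]
d = knn_density(pts, k=3)
# The outlier receives a much lower relative density than the cluster
# members, so a density-based fuzzy scheme would down-weight it.
```

In the RIFSVM setting such density values feed the intuitionistic fuzzy numbers, so that low-density points (likely noise or outliers) contribute less to the margin.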
APA, Harvard, Vancouver, ISO, and other styles
25

Malhotra, Ruchika, and Kusum Lata. "Using Ensembles for Class-Imbalance Problem to Predict Maintainability of Open Source Software." International Journal of Reliability, Quality and Safety Engineering 27, no. 05 (March 6, 2020): 2040011. http://dx.doi.org/10.1142/s0218539320400112.

Full text
Abstract:
To facilitate software maintenance and save maintenance cost, numerous machine learning (ML) techniques have been studied to predict the maintainability of software modules or classes. An abundant amount of effort has been put in by the research community to develop software maintainability prediction (SMP) models relating software metrics to the maintainability of modules or classes. When software classes demanding high maintainability effort (HME) are fewer than low maintainability effort (LME) classes, the situation leads to imbalanced datasets for training the SMP models. The imbalanced class distribution in SMP datasets poses a dilemma for various ML techniques because, in the case of an imbalanced dataset, minority class instances are either misclassified by the ML techniques or discarded as noise. Recent developments in predictive modeling have ascertained that ensemble techniques can boost the performance of ML techniques by collating their predictions. Ensembles by themselves, however, do little to solve the class-imbalance problem. Aggregating ensemble techniques with techniques for handling the class-imbalance problem (e.g., data resampling) has therefore led to several proposals in research. This paper evaluates the performance of ensembles for class imbalance in the domain of SMP. Ensembles for the class-imbalance problem (ECIP) are modifications of ensembles which pre-process the imbalanced data using data resampling before the learning process. This study experimentally compares the performance of several ECIP using the performance metrics Balance and g-Mean over eight Apache software datasets. The results of the study advocate that, for imbalanced datasets, ECIP improve the performance of SMP models as compared to classic ensembles.
APA, Harvard, Vancouver, ISO, and other styles
26

Han, Meng, Chunpeng Li, Fanxing Meng, Feifei He, and Ruihua Zhang. "An Adaptive Active Learning Method for Multiclass Imbalanced Data Streams with Concept Drift." Applied Sciences 14, no. 16 (August 15, 2024): 7176. http://dx.doi.org/10.3390/app14167176.

Full text
Abstract:
Learning from multiclass imbalanced data streams with concept drift and variable class imbalance ratios under a limited label budget presents new challenges in the field of data mining. To address these challenges, this paper proposes an adaptive active learning method for multiclass imbalanced data streams with concept drift (AdaAL-MID). Firstly, a dynamic label budget strategy under concept drift scenarios is introduced, which allocates label budgets reasonably at different stages of the data stream to effectively handle concept drift. Secondly, an uncertainty-based label request strategy using a dual-margin dynamic threshold matrix is designed to enhance learning opportunities for minority class instances and those that are challenging to classify, and combined with a random strategy, it can estimate the current class imbalance distribution by accessing only a limited number of instance labels. Finally, an instance-adaptive sampling strategy is proposed, which comprehensively considers the imbalance ratio and classification difficulty of instances, and combined with a weighted ensemble strategy, improves the classification performance of the ensemble classifier in imbalanced data streams. Extensive experiments and analyses demonstrate that AdaAL-MID can handle various complex concept drifts and adapt to changes in class imbalance ratios, and it outperforms several state-of-the-art active learning algorithms.
APA, Harvard, Vancouver, ISO, and other styles
27

WANG, SHUO, LEANDRO L. MINKU, and XIN YAO. "ONLINE CLASS IMBALANCE LEARNING AND ITS APPLICATIONS IN FAULT DETECTION." International Journal of Computational Intelligence and Applications 12, no. 04 (December 2013): 1340001. http://dx.doi.org/10.1142/s1469026813400014.

Full text
Abstract:
Although class imbalance learning and online learning have been extensively studied in the literature separately, online class imbalance learning that considers the challenges of both fields has not drawn much attention. It deals with data streams having very skewed class distributions, such as fault diagnosis of real-time control monitoring systems and intrusion detection in computer networks. To fill in this research gap and contribute to a wide range of real-world applications, this paper first formulates online class imbalance learning problems. Based on the problem formulation, a new online learning algorithm, sampling-based online bagging (SOB), is proposed to tackle class imbalance adaptively. Then, we study how SOB and other state-of-the-art methods can benefit a class of fault detection data under various scenarios and analyze their performance in depth. Through extensive experiments, we find that SOB can balance the performance between classes very well across different data domains and produce stable G-mean when learning constantly imbalanced data streams, but it is sensitive to sudden changes in class imbalance, in which case SOB's predecessor undersampling-based online bagging (UOB) is more robust.
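Oza-style online bagging presents each instance to every base learner k ~ Poisson(λ) times; sampling-based variants such as SOB adjust λ per class so that minority instances are seen more often. A rough sketch of that idea follows; the λ rule below (inverse of the class's running share of the stream, capped) is a simplification and not SOB's exact formula:

```python
import math
import random

def poisson(lam, rng):
    """Knuth's algorithm for sampling k ~ Poisson(lam)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

rng = random.Random(42)
# Stream with a 9:1 class imbalance.
stream = [0] * 900 + [1] * 100
rng.shuffle(stream)

class_counts = {0: 0, 1: 0}   # instances seen so far, per class
exposures = {0: 0, 1: 0}      # training exposures given to a base learner

for y in stream:
    class_counts[y] += 1
    seen = sum(class_counts.values())
    # Oversample the currently under-represented class: lambda is the
    # inverse of the class's running share of the stream (capped for safety).
    lam = min(seen / (2 * class_counts[y]), 10.0)
    exposures[y] += poisson(lam, rng)

# Per-instance exposure is far higher for the minority class, which is
# how sampling-based online bagging rebalances a skewed stream.
rate0 = exposures[0] / class_counts[0]
rate1 = exposures[1] / class_counts[1]
```

Because λ tracks the running class proportions, this kind of scheme adapts as the imbalance ratio changes, though, as the abstract notes for SOB, a sudden change in the imbalance can still be destabilizing.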
APA, Harvard, Vancouver, ISO, and other styles
28

BATUWITA, RUKSHAN, and VASILE PALADE. "ADJUSTED GEOMETRIC-MEAN: A NOVEL PERFORMANCE MEASURE FOR IMBALANCED BIOINFORMATICS DATASETS LEARNING." Journal of Bioinformatics and Computational Biology 10, no. 04 (July 23, 2012): 1250003. http://dx.doi.org/10.1142/s0219720012500035.

Full text
Abstract:
One common and challenging problem faced by many bioinformatics applications, such as promoter recognition, splice site prediction, RNA gene prediction, drug discovery and protein classification, is the imbalance of the available datasets. In most of these applications, the positive data examples are largely outnumbered by the negative data examples, which often leads to the development of sub-optimal prediction models having high negative recognition rate (Specificity = SP) and low positive recognition rate (Sensitivity = SE). When class imbalance learning methods are applied, usually, the SE is increased at the expense of reducing some amount of the SP. In this paper, we point out that in these data-imbalanced bioinformatics applications, the goal of applying class imbalance learning methods would be to increase the SE as high as possible by keeping the reduction of SP as low as possible. We explain that the existing performance measures used in class imbalance learning can still produce sub-optimal models with respect to this classification goal. In order to overcome these problems, we introduce a new performance measure called Adjusted Geometric-mean (AGm). The experimental results obtained on ten real-world imbalanced bioinformatics datasets demonstrate that the AGm metric can achieve a lower rate of reduction of SP than the existing performance metrics, when increasing the SE through class imbalance learning methods. This characteristic of the AGm metric makes it more suitable for achieving the proposed classification goal in imbalanced bioinformatics datasets learning.
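The SE/SP trade-off the abstract discusses can be made concrete with the standard G-mean plus an AGm-style adjustment. The `adjusted_g_mean` closed form below follows the definition as it is commonly cited and should be treated as an assumption; consult the paper for the exact formula:

```python
import math

def se_sp(tp, fn, tn, fp):
    """Sensitivity (positive recall) and specificity (negative recall)."""
    return tp / (tp + fn), tn / (tn + fp)

def g_mean(se, sp):
    return math.sqrt(se * sp)

def adjusted_g_mean(se, sp, n_neg_fraction):
    """AGm as described in the abstract: rewards gains in SE while penalising
    reductions in SP, weighted by the dataset's share of negatives.
    (Assumed closed form; check the original paper for the exact definition.)"""
    if se == 0:
        return 0.0
    return (g_mean(se, sp) + sp * n_neg_fraction) / (1 + n_neg_fraction)

# Before applying a class-imbalance method: high SP, low SE.
se0, sp0 = se_sp(tp=20, fn=80, tn=980, fp=20)
# After: SE rises a lot, SP drops a little.
se1, sp1 = se_sp(tp=70, fn=30, tn=940, fp=60)
nn = 1000 / 1100   # fraction of negative examples in the dataset
```

With these illustrative confusion matrices both G-mean and the adjusted measure improve, but the SP weighting makes the adjusted measure penalize a method that buys sensitivity with a large specificity loss.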
APA, Harvard, Vancouver, ISO, and other styles
29

Law, Theng-Jia, Choo-Yee Ting, Hu Ng, Hui-Ngo Goh, and Albert Quek. "Ensemble-SMOTE: Mitigating Class Imbalance in Graduate on Time Detection." Journal of Informatics and Web Engineering 3, no. 2 (June 13, 2024): 229–50. http://dx.doi.org/10.33093/jiwe.2024.3.2.17.

Full text
Abstract:
In education, detecting whether students will graduate on time is difficult due to high data complexity. Researchers have employed various Machine Learning approaches to identify on-time graduation, but it remains a challenging task due to class imbalance in the dataset. This study aimed to (i) compare various class imbalance treatment methods with different sampling ratios, (ii) propose an ensemble class imbalance treatment method to mitigate the class imbalance problem, and (iii) develop and evaluate predictive models for identifying the likelihood of students graduating on time during their university studies. The dataset was collected from 4007 graduates of a university from the years 2021 and 2022, with 41 variables. After feature selection, various class imbalance treatment methods were compared with sampling ratios ranging from 50% to 90%. Moreover, Ensemble-SMOTE is proposed to aggregate the datasets generated by Synthetic Minority Oversampling Technique variants to mitigate class imbalance effectively. The datasets generated by the class imbalance treatment methods were used as input to the predictive models for detecting on-time graduation. The predictive models were evaluated based on accuracy, precision, recall, F0.5-score, F1-score, F2-score, Area Under the Curve, and Area Under the Precision-Recall Curve. Based on the findings, Logistic Regression with Ensemble-SMOTE outperformed the other predictive models and class imbalance treatment methods, achieving the highest average accuracy (87.24), recall (92.50%), F1-score (91.30%), and F2-score (92.02%) from the 6th until the 10th trimester. To assess the effectiveness of the class imbalance treatment methods, the Friedman test was performed to determine significant differences between the models, after applying the Shapiro-Wilk normality test. Consequently, Ensemble-SMOTE ranked as the top performer, achieving the lowest average rank across the performance metrics. Additional research could incorporate and examine more sophisticated approaches to mitigating class imbalance when the dataset is highly imbalanced.
APA, Harvard, Vancouver, ISO, and other styles
30

Cruz, Rafael M. O., Mariana A. Souza, Robert Sabourin, and George D. C. Cavalcanti. "Dynamic Ensemble Selection and Data Preprocessing for Multi-Class Imbalance Learning." International Journal of Pattern Recognition and Artificial Intelligence 33, no. 11 (October 2019): 1940009. http://dx.doi.org/10.1142/s0218001419400093.

Full text
Abstract:
Class imbalance refers to classification problems in which many more instances are available for certain classes than for others. Such imbalanced datasets require special attention because traditional classifiers generally favor the majority class which has a large number of instances. Ensemble of classifiers has been reported to yield promising results. However, the majority of ensemble methods applied to imbalance learning are static ones. Moreover, they only deal with binary imbalanced problems. Hence, this paper presents an empirical analysis of Dynamic Selection techniques and data preprocessing methods for dealing with multi-class imbalanced problems. We considered five variations of preprocessing methods and 14 Dynamic Selection schemes. Our experiments conducted on 26 multi-class imbalanced problems show that the dynamic ensemble improves the AUC and the G-mean as compared to the static ensemble. Moreover, data preprocessing plays an important role in such cases.
APA, Harvard, Vancouver, ISO, and other styles
31

Cleofas Sánchez, Laura, Magali Guzmán Escobedo, Rosa María Valdovinos Rosas, Cornelio Yáñez Márquez, and Oscar Camacho Nieto. "Using hybrid associative classifier with translation (HACT) for studying imbalanced data sets." Ingeniería e Investigación 32, no. 1 (January 1, 2012): 53–57. http://dx.doi.org/10.15446/ing.investig.v32n1.28522.

Full text
Abstract:
Class imbalance may reduce classifier performance in several pattern recognition problems. This negative effect is more notable for patterns of the least represented class (the minority class). One strategy for handling the problem consists of treating the classes involved (majority and minority) separately in order to balance the data sets (DS). This paper studies the high sensitivity to class imbalance shown by an associative classification model: the hybrid associative classifier with translation (HACT); the impact of imbalanced DS on the associative model's performance was examined. The convenience of using sub-sampling methods to decrease the negative effects of imbalance on associative memories was analysed. The feasibility of this proposal is based on experimental results obtained from eleven real-world datasets.
APA, Harvard, Vancouver, ISO, and other styles
32

., Hartono, Opim Salim Sitompul, Erna Budhiarti Nababan, Tulus ., Dahlan Abdullah, and Ansari Saleh Ahmar. "A New Diversity Technique for Imbalance Learning Ensembles." International Journal of Engineering & Technology 7, no. 2.14 (April 8, 2018): 478. http://dx.doi.org/10.14419/ijet.v7i2.11251.

Full text
Abstract:
Data mining and machine learning techniques designed to solve classification problems require balanced class distributions. In reality, however, datasets sometimes contain one class represented by a large number of instances alongside classes with far fewer instances. This is known as the class imbalance problem. Classifier ensembles are a method often used to overcome class imbalance. Data diversity is one of the cornerstones of ensembles: an ideal ensemble system should have accurate individual classifiers, and when errors occur, they should occur on different objects or instances. This research presents the results of an overview and experimental study using the Hybrid Approach Redefinition (HAR) method for handling class imbalance, which is also expected to yield better data diversity. The research is conducted using 6 datasets with different imbalance ratios and is compared with SMOTEBoost, a re-weighting method often used in handling class imbalance. This study shows that data diversity is related to performance in imbalance learning ensembles and that the proposed method can obtain better data diversity.
APA, Harvard, Vancouver, ISO, and other styles
33

Alfhaid, Mashaal A., and Manal Abdullah. "Classification of Imbalanced Data Stream: Techniques and Challenges." Transactions on Machine Learning and Artificial Intelligence 9, no. 2 (April 23, 2021): 36–52. http://dx.doi.org/10.14738/tmlai.92.9964.

Full text
Abstract:
As the amount of generated data increases every day, data mining and knowledge extraction have grown in importance. In traditional data mining, knowledge extraction can be performed offline. Dealing with stream data mining is different, however, due to continuously arriving data that must be processed in a single scan, besides the appearance of concept drift. As the pre-processing stage is critical in knowledge extraction, imbalanced stream data have gained significant attention among researchers in the last few years. Many real-world applications suffer from class imbalance, including medicine, business, fraud detection, etc. Supervised learning involves classes, whether binary or multi-class. These classes are often imbalanced, divided into a majority (negative) class and a minority (positive) class, which can cause a bias toward the majority class and skew the predictive performance of models. Handling imbalanced streaming data is mandatory for more accurate and reliable learning models. In this paper, we present an overview of data stream mining and its tools, summarize the problem of class imbalance and the different approaches to it, and present the popular evaluation metrics and the challenges arising from imbalanced streaming data.
APA, Harvard, Vancouver, ISO, and other styles
34

Liu, Zhenyan, Yifei Zeng, Pengfei Zhang, Jingfeng Xue, Ji Zhang, and Jiangtao Liu. "An Imbalanced Malicious Domains Detection Method Based on Passive DNS Traffic Analysis." Security and Communication Networks 2018 (June 20, 2018): 1–7. http://dx.doi.org/10.1155/2018/6510381.

Full text
Abstract:
Although existing malicious domain detection techniques have shown great success in many real-world applications, the problem of learning from imbalanced data has rarely been addressed to date. Actual DNS traffic, however, is inherently imbalanced; thus, how to build a malicious domain detection model oriented to imbalanced data is a very important issue worthy of study. This paper proposes a novel imbalanced malicious domain detection method based on passive DNS traffic analysis, which can effectively deal with not only the between-class imbalance problem but also the within-class imbalance problem. The experiments show that the proposed method has favorable performance compared to existing algorithms.
APA, Harvard, Vancouver, ISO, and other styles
35

Cheng, Ruihan, Longfei Zhang, Shiqi Wu, Sen Xu, Shang Gao, and Hualong Yu. "Probability Density Machine: A New Solution of Class Imbalance Learning." Scientific Programming 2021 (September 9, 2021): 1–14. http://dx.doi.org/10.1155/2021/7555587.

Full text
Abstract:
Class imbalance learning (CIL) is an important branch of machine learning, as it is generally difficult for classification models to learn from imbalanced data; meanwhile, skewed data distributions frequently exist in various real-world applications. In this paper, we introduce a novel CIL solution called the Probability Density Machine (PDM). First, in the context of the Gaussian Naive Bayes (GNB) predictive model, we analyze in theory why an imbalanced data distribution degrades predictive performance and conclude that the impact of class imbalance is associated only with the prior probability, not with the conditional probability of the training data. In that context, we then show the rationality of several traditional CIL techniques and indicate the drawback of combining GNB with them. Next, profiting from the idea of K-nearest-neighbors probability density estimation (KNN-PDE), we propose the PDM, an improved GNB-based CIL algorithm. Finally, we conduct experiments on a large number of class-imbalanced datasets, and the proposed PDM algorithm shows promising results.
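The abstract's central observation, that class imbalance enters a GNB model only through the prior probability, can be seen in a tiny 1-D sketch (illustrative numbers; this is not the paper's PDM algorithm):

```python
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def gnb_predict(x, priors, params):
    """1-D Gaussian Naive Bayes: argmax over classes of prior * likelihood."""
    scores = {c: priors[c] * gaussian_pdf(x, *params[c]) for c in priors}
    return max(scores, key=scores.get)

params = {"neg": (0.0, 1.0), "pos": (2.0, 1.0)}   # same spread, different means
x = 1.2                                            # slightly closer to the positive mean

balanced = gnb_predict(x, {"neg": 0.5, "pos": 0.5}, params)
imbalanced = gnb_predict(x, {"neg": 0.95, "pos": 0.05}, params)
# The class-conditional likelihoods are identical in both calls; only the
# prior changed, yet the skewed prior flips the decision to the majority class.
```

The conditional densities (the `params`) are untouched by the imbalance; only the prior term shifts the decision boundary, which is exactly the lever a prior-correcting method like PDM targets.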
APA, Harvard, Vancouver, ISO, and other styles
36

Chen, Jiqiang, Jie Wan, and Litao Ma. "Regularized Discrete Optimal Transport for Class-Imbalanced Classifications." Mathematics 12, no. 4 (February 7, 2024): 524. http://dx.doi.org/10.3390/math12040524.

Full text
Abstract:
Imbalanced class data are commonly observed in pattern analysis, machine learning, and various real-world applications. Conventional approaches often resort to resampling techniques in order to address the imbalance, which inevitably alter the original data distribution. This paper proposes a novel classification method that leverages optimal transport for handling imbalanced data. Specifically, we establish a transport plan between training and testing data without modifying the original data distribution, drawing upon the principles of optimal transport theory. Additionally, we introduce a non-convex interclass regularization term to establish connections between testing samples and training samples with the same class labels. This regularization term forms the basis of a regularized discrete optimal transport model, which is employed to address imbalanced classification scenarios. Subsequently, in line with the concept of maximum minimization, a maximum minimization algorithm is introduced for regularized discrete optimal transport. Subsequent experiments on 17 Keel datasets with varying levels of imbalance demonstrate the superior performance of the proposed approach compared to 11 other widely used techniques for class-imbalanced classification. Additionally, the application of the proposed approach to water quality evaluation confirms its effectiveness.
APA, Harvard, Vancouver, ISO, and other styles
37

Fu, Guang-Hui, Jia-Bao Wang, Min-Jie Zong, and Lun-Zhao Yi. "Feature Ranking and Screening for Class-Imbalanced Metabolomics Data Based on Rank Aggregation Coupled with Re-Balance." Metabolites 11, no. 6 (June 14, 2021): 389. http://dx.doi.org/10.3390/metabo11060389.

Full text
Abstract:
Feature screening is an important and challenging topic in current class-imbalance learning. Most of the existing feature screening algorithms in class-imbalance learning are based on filtering techniques. However, the variable rankings obtained by different filtering techniques generally differ, and this inconsistency among ranking methods is usually ignored in practice. To address this problem, we propose a simple strategy called rank aggregation with re-balance (RAR) for finding key variables in class-imbalanced data. RAR fuses the individual rankings to generate a synthetic rank that takes every ranking into account. The class-imbalanced data are modified via different re-sampling procedures, and RAR is performed in this balanced situation. Five class-imbalanced real datasets and their re-balanced counterparts are employed to test RAR's performance, and RAR is compared with several popular feature screening methods. The results show that RAR is highly competitive, generally outperforming single-filter screening in terms of several assessment metrics. Performing re-balanced pretreatment is hugely effective in rank aggregation when the data are class-imbalanced.
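The rank aggregation step in RAR can be realized with any consensus rule; a Borda-count fusion is a common choice and serves as an illustration here (the paper's exact aggregation rule may differ):

```python
def borda_aggregate(rankings):
    """Rank aggregation by Borda count: each ranking awards (n - position)
    points to a feature; features are re-ranked by their total score."""
    n = len(rankings[0])
    scores = {}
    for ranking in rankings:
        for pos, feat in enumerate(ranking):
            scores[feat] = scores.get(feat, 0) + (n - pos)
    return sorted(scores, key=scores.get, reverse=True)

# Three filter methods disagree on the exact order but broadly favor "f1".
r1 = ["f1", "f2", "f3", "f4"]
r2 = ["f2", "f1", "f3", "f4"]
r3 = ["f1", "f3", "f2", "f4"]
fused = borda_aggregate([r1, r2, r3])
```

The fused ranking smooths out the disagreement among the individual filters, which is the inconsistency problem the RAR strategy is designed to address.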
APA, Harvard, Vancouver, ISO, and other styles
38

Kaope, Cherfly, and Yoga Pristyanto. "The Effect of Class Imbalance Handling on Datasets Toward Classification Algorithm Performance." MATRIK : Jurnal Manajemen, Teknik Informatika dan Rekayasa Komputer 22, no. 2 (March 1, 2023): 227–38. http://dx.doi.org/10.30812/matrik.v22i2.2515.

Full text
Abstract:
Class imbalance is a condition in which the amount of data in the minority class is smaller than that of the majority class. Its impact is the misclassification of the minority class, which can degrade classification performance. Various approaches have been taken to deal with class imbalance, such as the data-level approach, the algorithm-level approach, and cost-sensitive learning. At the data level, one of the methods used is sampling. In this study, the ADASYN, SMOTE, and SMOTE-ENN sampling methods were used to deal with class imbalance, combined with the AdaBoost, K-Nearest Neighbor, and Random Forest classification algorithms. The purpose of this study was to determine the effect of handling class imbalance in the dataset on classification performance. Tests were carried out on five datasets, and based on the classification results, the integration of the ADASYN and Random Forest methods gave better results than the other model schemes. The evaluation criteria include accuracy, precision, true positive rate, true negative rate, and g-mean score. The integration of the ADASYN and Random Forest methods performed 5% to 10% better than the other models.
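The sampling methods compared in this study share one core mechanism: SMOTE-style methods synthesize minority samples by interpolating between a minority instance and one of its minority-class nearest neighbors (ADASYN additionally biases generation toward harder instances, and SMOTE-ENN adds a cleaning step). A minimal, generic sketch of that interpolation step:

```python
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Generate synthetic minority samples by interpolating between a
    minority instance and one of its k nearest minority neighbours
    (the core idea behind SMOTE/ADASYN-style oversampling)."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_new):
        a = rng.choice(minority)
        neighbours = sorted((p for p in minority if p != a),
                            key=lambda p: sum((u - v) ** 2 for u, v in zip(a, p)))[:k]
        b = rng.choice(neighbours)
        t = rng.random()
        # New point lies on the segment between a and b.
        out.append(tuple(u + t * (v - u) for u, v in zip(a, b)))
    return out

minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2)]
synthetic = smote_like(minority, n_new=6)
```

Every synthetic point falls inside the convex region spanned by the real minority samples, which is why these methods enrich the minority class without simply duplicating instances.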
APA, Harvard, Vancouver, ISO, and other styles
39

Choudhary, Roshani, and Sanyam Shukla. "Reduced-Kernel Weighted Extreme Learning Machine Using Universum Data in Feature Space (RKWELM-UFS) to Handle Binary Class Imbalanced Dataset Classification." Symmetry 14, no. 2 (February 14, 2022): 379. http://dx.doi.org/10.3390/sym14020379.

Full text
Abstract:
Class imbalance is a phenomenon of asymmetry that degrades the performance of traditional classification algorithms such as the Support Vector Machine (SVM) and Extreme Learning Machine (ELM). Various modifications of SVM and ELM have been proposed to handle the class imbalance problem, which focus on different aspects to resolve the class imbalance. The Universum Support Vector Machine (USVM) incorporates the prior information in the classification model by adding Universum data to the training data to handle the class imbalance problem. Various other modifications of SVM have been proposed which use Universum data in the classification model generation. Moreover, the existing ELM-based classification models intended to handle class imbalance do not consider the prior information about the data distribution for training. An ELM-based classification model creates two symmetry planes, one for each class. The Universum-based ELM classification model tries to create a third plane between the two symmetric planes using Universum data. This paper proposes a novel hybrid framework called Reduced-Kernel Weighted Extreme Learning Machine Using Universum Data in Feature Space (RKWELM-UFS) to handle the classification of binary class-imbalanced problems. The proposed RKWELM-UFS combines the Universum learning method with a Reduced-Kernelized Weighted Extreme Learning Machine (RKWELM) for the first time to inherit the advantages of both techniques. To generate efficient Universum samples in the feature space, this work uses the kernel trick. The performance of the proposed method is evaluated using 44 benchmark binary class-imbalanced datasets. The proposed method is compared with 10 state-of-the-art classifiers using AUC and G-mean. The statistical t-test and Wilcoxon signed-rank test are used to quantify the performance enhancement of the proposed RKWELM-UFS compared to other evaluated classifiers.
APA, Harvard, Vancouver, ISO, and other styles
40

Ali, Baraa Saeed, Nabil Sarhan, and Mohammed Alawad. "On the Robustness of Compressed Models with Class Imbalance." Computers 13, no. 11 (November 16, 2024): 297. http://dx.doi.org/10.3390/computers13110297.

Full text
Abstract:
Deep learning (DL) models have been deployed in various platforms, including resource-constrained environments such as edge computing, smartphones, and personal devices. Such deployment requires models to have smaller sizes and memory footprints. To this end, many model compression techniques proposed in the literature successfully reduce model sizes and maintain comparable accuracy. However, the robustness of compressed DL models against class imbalance, a natural phenomenon in real-life datasets, is still under-explored. We present a comprehensive experimental study of the performance and robustness of compressed DL models when trained on class-imbalanced datasets. We investigate the robustness of compressed DL models using three popular compression techniques (pruning, quantization, and knowledge distillation) with class-imbalanced variants of the CIFAR-10 dataset and show that compressed DL models are not robust against class imbalance in training datasets. We also show that different compression techniques have varying degrees of impact on the robustness of compressed DL models.
APA, Harvard, Vancouver, ISO, and other styles
41

Li, Zhuang, Jingyan Qin, Xiaotong Zhang, and Yadong Wan. "Addressing Class Overlap under Imbalanced Distribution: An Improved Method and Two Metrics." Symmetry 13, no. 9 (September 7, 2021): 1649. http://dx.doi.org/10.3390/sym13091649.

Full text
Abstract:
Class imbalance, as a phenomenon of asymmetry, has an adverse effect on the performance of most machine learning algorithms, and class overlap is another important factor that affects their classification performance. This paper deals with the two factors simultaneously, addressing class overlap under imbalanced distributions. First, a theoretical analysis of the existing class overlap metrics is conducted. Then, based on this analysis, an improved method and the corresponding metrics to evaluate class overlap under imbalanced distributions are proposed. A well-known collection of imbalanced datasets is used to compare the performance of the different metrics, with performance evaluated using the Pearson correlation coefficient and the ξ correlation coefficient. The experimental results demonstrate that the proposed class overlap metrics outperform the compared metrics on the imbalanced datasets, and the Pearson correlation coefficient with the AUC metric of eight algorithms can be improved by 34.7488% on average.
APA, Harvard, Vancouver, ISO, and other styles
42

Nadeem, Khurram, and Mehdi-Abderrahman Jabri. "Stable variable ranking and selection in regularized logistic regression for severely imbalanced big binary data." PLOS ONE 18, no. 1 (January 17, 2023): e0280258. http://dx.doi.org/10.1371/journal.pone.0280258.

Full text
Abstract:
We develop a novel covariate ranking and selection algorithm for regularized ordinary logistic regression (OLR) models in the presence of severe class-imbalance in high dimensional datasets with correlated signal and noise covariates. Class-imbalance is resolved using response-based subsampling which we also employ to achieve stability in variable selection by creating an ensemble of regularized OLR models fitted to subsampled (and balanced) datasets. The regularization methods considered in our study include Lasso, adaptive Lasso (adaLasso) and ridge regression. Our methodology is versatile in the sense that it works effectively for regularization techniques involving both hard- (e.g. Lasso) and soft-shrinkage (e.g. ridge) of the regression coefficients. We assess selection performance by conducting a detailed simulation experiment involving varying moderate-to-severe class-imbalance ratios and highly correlated continuous and discrete signal and noise covariates. Simulation results show that our algorithm is robust against severe class-imbalance under the presence of highly correlated covariates, and consistently achieves stable and accurate variable selection with very low false discovery rate. We illustrate our methodology using a case study involving a severely imbalanced high-dimensional wildland fire occurrence dataset comprising 13 million instances. The case study and simulation results demonstrate that our framework provides a robust approach to variable selection in severely imbalanced big binary data.
APA, Harvard, Vancouver, ISO, and other styles
43

Tiwari, Himani. "Improvising Balancing Methods for Classifying Imbalanced Data." International Journal for Research in Applied Science and Engineering Technology 9, no. 9 (September 30, 2021): 1535–43. http://dx.doi.org/10.22214/ijraset.2021.38225.

Full text
Abstract:
The class imbalance problem is one of the most challenging problems faced by the machine learning community; it refers to one class having relatively few instances compared to the others. A number of over-sampling and under-sampling approaches have been applied in an attempt to balance the classes. This study provides an overview of the class imbalance issue and examines various balancing methods for dealing with it. To illustrate the differences, an experiment is conducted using multiple simulated data sets, comparing the performance of these sampling methods on different classifiers under various evaluation criteria. In addition, the effect of parameters such as the number of features and the imbalance ratio on classifier performance is also evaluated. Keywords: Imbalanced learning, Over-sampling methods, Under-sampling methods, Classifier performance, Evaluation metrics
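The two baseline balancing families this study compares, random over-sampling and random under-sampling, are simple enough to sketch directly. Below is a minimal numpy version of each (function names are ours; libraries such as imbalanced-learn provide production implementations):

```python
import numpy as np

def random_oversample(X, y, rng):
    """Duplicate random minority rows until both classes are equal-sized."""
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[counts.argmin()]
    min_idx = np.flatnonzero(y == minority)
    extra = rng.choice(min_idx, size=counts.max() - counts.min(), replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]

def random_undersample(X, y, rng):
    """Randomly drop majority rows until both classes are equal-sized."""
    classes, counts = np.unique(y, return_counts=True)
    majority = classes[counts.argmax()]
    maj_idx = np.flatnonzero(y == majority)
    drop = rng.choice(maj_idx, size=counts.max() - counts.min(), replace=False)
    keep = np.setdiff1d(np.arange(len(y)), drop)
    return X[keep], y[keep]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 90 + [1] * 10)
Xo, yo = random_oversample(X, y, rng)   # 90 + 90 rows
Xu, yu = random_undersample(X, y, rng)  # 10 + 10 rows
```

The trade-off the study evaluates is visible here: over-sampling keeps all information but repeats minority rows (risking overfitting), while under-sampling discards potentially useful majority rows.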
APA, Harvard, Vancouver, ISO, and other styles
44

Emamipour, Sajad, Rasoul Sali, and Zahra Yousefi. "A Multi-Objective Ensemble Method for Class Imbalance Learning." International Journal of Big Data and Analytics in Healthcare 2, no. 1 (January 2017): 16–34. http://dx.doi.org/10.4018/ijbdah.2017010102.

Full text
Abstract:
This article describes how class imbalance learning has attracted great attention in recent years, as many real-world application domains suffer from this problem. An imbalanced class distribution occurs when the number of training examples for one class far surpasses that of the other class, often the one of more interest. This problem may cause an important deterioration in classifier performance, in particular on patterns belonging to the less represented classes. Toward this end, the authors developed a hybrid model to address class imbalance learning, with a focus on binary class problems. The model combines the benefits of ensemble classifiers with a multi-objective feature selection technique to achieve higher classification performance, and also proposes non-dominated sets of features. The authors then evaluate the performance of the proposed model by comparing its results with notable algorithms for solving the imbalanced data problem. Finally, they apply the proposed model in the medical domain, predicting life expectancy of post-operative thoracic surgery patients.
APA, Harvard, Vancouver, ISO, and other styles
45

Lin, Ismael, Octavio Loyola-González, Raúl Monroy, and Miguel Angel Medina-Pérez. "A Review of Fuzzy and Pattern-Based Approaches for Class Imbalance Problems." Applied Sciences 11, no. 14 (July 8, 2021): 6310. http://dx.doi.org/10.3390/app11146310.

Full text
Abstract:
The usage of imbalanced databases is a recurrent problem in real-world data such as medical diagnosis, fraud detection, and pattern recognition. Nevertheless, in class imbalance problems, classifiers are commonly biased toward the class with more objects (the majority class) and ignore the class with fewer objects (the minority class). There are different ways to solve the class imbalance problem, and there has been a trend towards the usage of pattern-based and fuzzy approaches due to their favorable results. In this paper, we provide an in-depth review of popular methods for imbalanced databases related to patterns and fuzzy approaches. The reviewed papers include classifiers, data preprocessing, and evaluation metrics. We identify different application domains and describe how the methods are used. Finally, we suggest further research directions according to the analysis of the reviewed papers and the trend of the state of the art.
APA, Harvard, Vancouver, ISO, and other styles
46

Rifqi Fitriadi and Deni Mahdiana. "SYSTEMATIC LITERATURE REVIEW OF THE CLASS IMBALANCE CHALLENGES IN MACHINE LEARNING." Jurnal Teknik Informatika (Jutif) 4, no. 5 (October 5, 2023): 1099–107. http://dx.doi.org/10.52436/1.jutif.2023.4.5.970.

Full text
Abstract:
The significant growth of data poses its own challenges, both in terms of storing, managing, and analyzing the available data. Untreated and unanalyzed data can only provide limited benefits to its owner. In many cases, the data we analyze is imbalanced. An example of natural data imbalance is in detecting financial fraud, where the number of non-fraudulent transactions is usually much higher than fraudulent ones. This imbalance issue can affect the accuracy and performance of machine learning classification models. Many machine learning classification models tend to learn more general patterns in the majority class. As a result, the model may overlook patterns that exist in the minority class. Various research has been conducted to address the problem of imbalanced data. The objective of this systematic literature review is to provide the latest developments regarding the cases, methods used, and evaluation techniques in handling imbalanced data. This research successfully identifies new methods and is expected to provide more choices for researchers so that imbalanced data can be properly handled, and classification models can produce unbiased, accurate, and consistent results.
APA, Harvard, Vancouver, ISO, and other styles
47

Wongvorachan, Tarid, Surina He, and Okan Bulut. "A Comparison of Undersampling, Oversampling, and SMOTE Methods for Dealing with Imbalanced Classification in Educational Data Mining." Information 14, no. 1 (January 16, 2023): 54. http://dx.doi.org/10.3390/info14010054.

Full text
Abstract:
Educational data mining is capable of producing useful data-driven applications (e.g., early warning systems in schools or the prediction of students’ academic achievement) based on predictive models. However, the class imbalance problem in educational datasets could hamper the accuracy of predictive models as many of these models are designed on the assumption that the predicted class is balanced. Although previous studies proposed several methods to deal with the imbalanced class problem, most of them focused on the technical details of how to improve each technique, while only a few focused on the application aspect, especially for the application of data with different imbalance ratios. In this study, we compared several sampling techniques to handle the different ratios of the class imbalance problem (i.e., moderately or extremely imbalanced classifications) using the High School Longitudinal Study of 2009 dataset. For our comparison, we used random oversampling (ROS), random undersampling (RUS), and the combination of the synthetic minority oversampling technique for nominal and continuous (SMOTE-NC) and RUS as a hybrid resampling technique. We used the Random Forest as our classification algorithm to evaluate the results of each sampling technique. Our results show that random oversampling for moderately imbalanced data and hybrid resampling for extremely imbalanced data seem to work best. The implications for educational data mining applications and suggestions for future research are discussed.
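The SMOTE component of the hybrid technique compared above generates synthetic minority points by interpolating between a minority sample and one of its nearest minority neighbours. A minimal numpy sketch of that core idea for continuous features follows; it is not the SMOTE-NC variant the study uses (which additionally handles nominal features), and the function name `smote` is ours:

```python
import numpy as np

def smote(X_min, n_new, k=5, seed=0):
    """Generate synthetic minority samples by interpolating each random
    seed point toward one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise distances among minority points (self-distance masked out).
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]            # k nearest neighbours per point
    seeds = rng.integers(0, len(X_min), n_new)
    neigh = nn[seeds, rng.integers(0, k, n_new)]
    gap = rng.random((n_new, 1))                 # interpolation factor in [0, 1)
    return X_min[seeds] + gap * (X_min[neigh] - X_min[seeds])

rng = np.random.default_rng(0)
X_min = rng.normal(size=(20, 2))                 # the minority class only
X_new = smote(X_min, n_new=30, k=3)
```

In the hybrid resampling the study recommends for extreme imbalance, such synthetic over-sampling of the minority class is combined with random under-sampling (RUS) of the majority class before training the Random Forest.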
APA, Harvard, Vancouver, ISO, and other styles
48

Liu, Xu Ying. "An Empirical Study of Boosting Methods on Severely Imbalanced Data." Applied Mechanics and Materials 513-517 (February 2014): 2510–13. http://dx.doi.org/10.4028/www.scientific.net/amm.513-517.2510.

Full text
Abstract:
Nowadays there are large volumes of data in real-world applications, which pose a great challenge to class-imbalance learning: the large number of majority class examples and severe class imbalance. Previous studies on class-imbalance learning mainly focused on relatively small or moderate class imbalance. In this paper we conduct an empirical study to explore the difference between learning with small or moderate class imbalance and learning with severe class imbalance. The experimental results show that: (1) Traditional methods cannot handle severe class imbalance effectively. (2) AUC, G-mean and F-measure can be very inconsistent under severe class imbalance, which seldom happens when class imbalance is moderate; moreover, G-mean is not appropriate for severe class-imbalance learning because it is not sensitive to changes in the imbalance ratio. (3) When AUC and G-mean are the evaluation metrics, EasyEnsemble is the best method, followed by BalanceCascade and under-sampling. (4) Sampling to slightly under full balance works better for under-sampling under severe class imbalance, and handling false positives is important when designing methods for severe class imbalance.
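EasyEnsemble, the best performer in this study, trains one classifier per balanced subset (all minority examples plus an equal-size random draw of the majority) and combines their outputs. The sketch below shows that sampling-and-voting skeleton in plain numpy; it is a simplification of the published algorithm, which uses AdaBoost base learners rather than the toy nearest-centroid learner used here, and both class names are ours:

```python
import numpy as np

class NearestCentroid:
    """Toy base learner: predict the class with the nearer mean vector."""
    def fit(self, X, y):
        self.c0 = X[y == 0].mean(axis=0)
        self.c1 = X[y == 1].mean(axis=0)
        return self
    def predict(self, X):
        d0 = np.linalg.norm(X - self.c0, axis=1)
        d1 = np.linalg.norm(X - self.c1, axis=1)
        return (d1 < d0).astype(int)

def easy_ensemble_predict(X_train, y_train, X_test, n_subsets=10, seed=0):
    """Train one learner per balanced subset (all minority examples plus an
    equal-size random majority draw) and combine them by majority vote."""
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y_train == 1)   # assumes class 1 is the minority
    maj_idx = np.flatnonzero(y_train == 0)
    votes = np.zeros((n_subsets, len(X_test)))
    for t in range(n_subsets):
        sub = np.concatenate([min_idx,
                              rng.choice(maj_idx, min_idx.size, replace=False)])
        votes[t] = NearestCentroid().fit(X_train[sub], y_train[sub]).predict(X_test)
    return (votes.mean(axis=0) >= 0.5).astype(int)

rng = np.random.default_rng(1)
X0 = rng.normal(0, 1, (950, 2))              # majority cluster near the origin
X1 = rng.normal(4, 1, (50, 2))               # rare positives, shifted away
X_train = np.vstack([X0, X1])
y_train = np.array([0] * 950 + [1] * 50)
pred = easy_ensemble_predict(X_train, y_train,
                             np.array([[0.0, 0.0], [4.0, 4.0]]))
```

Because every subset sees all the minority examples but a different slice of the majority, the ensemble uses far more majority-class information than a single under-sampled model, which is why it copes better with severe imbalance.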
APA, Harvard, Vancouver, ISO, and other styles
49

Pes, Barbara. "Learning from High-Dimensional and Class-Imbalanced Datasets Using Random Forests." Information 12, no. 8 (July 21, 2021): 286. http://dx.doi.org/10.3390/info12080286.

Full text
Abstract:
Class imbalance and high dimensionality are two major issues in several real-life applications, e.g., in the fields of bioinformatics, text mining and image classification. However, while both issues have been extensively studied in the machine learning community, they have mostly been treated separately, and little research has been thus far conducted on which approaches might be best suited to deal with datasets that are class-imbalanced and high-dimensional at the same time (i.e., with a large number of features). This work attempts to give a contribution to this challenging research area by studying the effectiveness of hybrid learning strategies that involve the integration of feature selection techniques, to reduce the data dimensionality, with proper methods that cope with the adverse effects of class imbalance (in particular, data balancing and cost-sensitive methods are considered). Extensive experiments have been carried out across datasets from different domains, leveraging a well-known classifier, the Random Forest, which has proven to be effective in high-dimensional spaces and has also been successfully applied to imbalanced tasks. Our results give evidence of the benefits of such a hybrid approach, when compared to using only feature selection or imbalance learning methods alone.
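The two ingredients of the hybrid strategy studied here, a filter-style feature selection step to cut dimensionality and a cost-sensitive correction for imbalance, can be illustrated separately. The sketch below is an assumption-laden simplification, not the paper's pipeline: the filter ranks features by absolute correlation with the label (the paper evaluates proper feature selection techniques), and the weights follow the common "balanced" convention n_samples / (n_classes × class_count); both function names are ours.

```python
import numpy as np

def balanced_weights(y):
    """Per-sample weights following the 'balanced' convention:
    n_samples / (n_classes * count(class of sample))."""
    classes, counts = np.unique(y, return_counts=True)
    w = len(y) / (len(classes) * counts)
    return w[np.searchsorted(classes, y)]

def select_top_k(X, y, k):
    """Filter step: keep the k features most correlated with the label."""
    score = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    return np.argsort(score)[-k:]

rng = np.random.default_rng(0)
n = 500
y = np.array([0] * 450 + [1] * 50)                    # 9:1 imbalance
X = np.column_stack([y + rng.normal(0, 0.5, n),       # one informative feature
                     rng.normal(0, 1, (n, 9))])       # nine noise features
keep = select_top_k(X, y, k=3)                        # dimensionality reduction
w = balanced_weights(y)                               # cost-sensitive weights
```

In a hybrid pipeline like the one studied, the classifier (a Random Forest in the paper) would then be trained on `X[:, keep]` with sample weights `w`, so minority errors cost nine times more than majority errors.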
APA, Harvard, Vancouver, ISO, and other styles
50

Zhao, Zixue, Tianxiang Cui, Shusheng Ding, Jiawei Li, and Anthony Graham Bellotti. "Resampling Techniques Study on Class Imbalance Problem in Credit Risk Prediction." Mathematics 12, no. 5 (February 28, 2024): 701. http://dx.doi.org/10.3390/math12050701.

Full text
Abstract:
Credit risk prediction heavily relies on historical data provided by financial institutions. The goal is to identify commonalities among defaulting users based on existing information. However, data on defaulters is often limited, leading to a concentration of credit data where positive samples (defaults) are significantly fewer than negative samples (nondefaults). It poses a serious challenge known as the class imbalance problem, which can substantially impact data quality and predictive model effectiveness. To address the problem, various resampling techniques have been proposed and studied extensively. However, despite ongoing research, there is no consensus on the most effective technique. The choice of resampling technique is closely related to the dataset size and imbalance ratio, and its effectiveness varies across different classifiers. Moreover, there is a notable gap in research concerning suitable techniques for extremely imbalanced datasets. Therefore, this study aims to compare popular resampling techniques across different datasets and classifiers while also proposing a novel hybrid sampling method tailored for extremely imbalanced datasets. Our experimental results demonstrate that this new technique significantly enhances classifier predictive performance, shedding light on effective strategies for managing the class imbalance problem in credit risk prediction.
APA, Harvard, Vancouver, ISO, and other styles
