Journal articles on the topic 'Synthetic minority over sampling technique'

Consult the top 50 journal articles for your research on the topic 'Synthetic minority over sampling technique.'

1

Chawla, N. V., K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. "SMOTE: Synthetic Minority Over-sampling Technique." Journal of Artificial Intelligence Research 16 (June 1, 2002): 321–57. http://dx.doi.org/10.1613/jair.953.

Full text
Abstract:
An approach to the construction of classifiers from imbalanced datasets is described. A dataset is imbalanced if the classification categories are not approximately equally represented. Often real-world data sets are predominately composed of "normal" examples with only a small percentage of "abnormal" or "interesting" examples. It is also the case that the cost of misclassifying an abnormal (interesting) example as a normal example is often much higher than the cost of the reverse error. Under-sampling of the majority (normal) class has been proposed as a good means of increasing the sensitivity of a classifier to the minority class. This paper shows that a combination of our method of over-sampling the minority (abnormal) class and under-sampling the majority (normal) class can achieve better classifier performance (in ROC space) than only under-sampling the majority class. This paper also shows that a combination of our method of over-sampling the minority class and under-sampling the majority class can achieve better classifier performance (in ROC space) than varying the loss ratios in Ripper or class priors in Naive Bayes. Our method of over-sampling the minority class involves creating synthetic minority class examples. Experiments are performed using C4.5, Ripper and a Naive Bayes classifier. The method is evaluated using the area under the Receiver Operating Characteristic curve (AUC) and the ROC convex hull strategy.
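The over-sampling step this abstract describes interpolates between a minority example and one of its nearest minority neighbours. A minimal illustrative sketch (not the authors' code), assuming continuous features and Euclidean distance:

```python
import math
import random

def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points: pick a minority point, pick one of
    its k nearest minority neighbours, and interpolate a random fraction
    of the way between them."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x, excluding x itself
        neighbours = sorted((p for p in minority if p is not x),
                            key=lambda p: math.dist(x, p))[:k]
        nn = rng.choice(neighbours)
        gap = rng.random()  # uniform in [0, 1)
        synthetic.append(tuple(xi + gap * (ni - xi)
                               for xi, ni in zip(x, nn)))
    return synthetic

minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_new=6)
```

Because each synthetic point lies on a segment between two real minority points, it stays inside the minority region; combining such points with majority under-sampling reproduces the setup the paper evaluates in ROC space.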
APA, Harvard, Vancouver, ISO, and other styles
2

Bunkhumpornpat, Chumphol, Krung Sinapiromsaran, and Chidchanok Lursinsap. "DBSMOTE: Density-Based Synthetic Minority Over-sampling TEchnique." Applied Intelligence 36, no. 3 (2011): 664–84. http://dx.doi.org/10.1007/s10489-011-0287-y.

3

Shoohi, Liqaa M., and Jamila H. Saud. "Adaptation Proposed Methods for Handling Imbalanced Datasets based on Over-Sampling Technique." Al-Mustansiriyah Journal of Science 31, no. 2 (2020): 25. http://dx.doi.org/10.23851/mjs.v31i2.740.

Abstract:
Classification of imbalanced data is an important issue. Many algorithms have been developed for classification, such as Back Propagation (BP) neural networks, decision trees, Bayesian networks, etc., and have been used repeatedly in many fields. These algorithms struggle with imbalanced data, in which some classes have far more instances than others; the result is poor performance and a bias toward the majority class. In this paper, we propose three techniques based on the Over-Sampling (O.S.) technique for processing an imbalanced dataset and redistributing it into a balanced one. These techniques are the Improved Synthetic Minority Over-Sampling Technique (Improved SMOTE), Borderline-SMOTE + Imbalanced Ratio (IR), and Adaptive Synthetic Sampling (ADASYN) + IR algorithms; each generates synthetic samples for the minority class to achieve balance between the minority and majority classes and then calculates the IR between them. Experimental results show that the Improved SMOTE algorithm outperforms the Borderline-SMOTE + IR and ADASYN + IR algorithms because it achieves a higher balance between the minority and majority classes.
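The imbalance ratio (IR) these techniques compute is conventionally defined as the majority-class size divided by the minority-class size; a one-function illustration:

```python
from collections import Counter

def imbalance_ratio(labels):
    """IR = size of the largest class / size of the smallest class."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

labels = ["negative"] * 90 + ["positive"] * 10
print(imbalance_ratio(labels))  # 9.0
```

An IR of 1.0 indicates a perfectly balanced dataset, which is the target the three techniques above drive toward.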
4

Anusha, Yamijala, R. Visalakshi, and Konda Srinivas. "Imbalanced data classification using improved synthetic minority over-sampling technique." Multiagent and Grid Systems 19, no. 2 (2023): 117–31. http://dx.doi.org/10.3233/mgs-230007.

Abstract:
In data mining, deep learning and machine learning models face class imbalance problems, which result in a lower detection rate for minority class samples. An improved Synthetic Minority Over-sampling Technique (SMOTE) is introduced for effective imbalanced data classification. After collecting the raw data from PIMA, Yeast, E.coli, and Breast cancer Wisconsin databases, the pre-processing is performed using min-max normalization, cleaning, integration, and data transformation techniques to achieve data with better uniqueness, consistency, completeness and validity. An improved SMOTE algorithm is applied to the pre-processed data for proper data distribution, and then the properly distributed data is fed to the machine learning classifiers: Support Vector Machine (SVM), Random Forest, and Decision Tree for data classification. Experimental examination confirmed that the improved SMOTE algorithm with random forest attained significant classification results with Area under Curve (AUC) of 94.30%, 91%, 96.40%, and 99.40% on the PIMA, Yeast, E.coli, and Breast cancer Wisconsin databases.
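The min-max normalization used in the pre-processing stage above rescales each feature column to [0, 1]; a dependency-free sketch:

```python
def min_max_normalize(rows):
    """Scale each feature column to [0, 1] via (x - min) / (max - min);
    constant columns are mapped to 0.0."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [tuple((x - l) / (h - l) if h > l else 0.0
                  for x, l, h in zip(row, lo, hi))
            for row in rows]

data = [(2.0, 100.0), (4.0, 150.0), (6.0, 200.0)]
print(min_max_normalize(data))
# [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]
```

Normalizing before over-sampling matters because SMOTE-style methods rely on nearest-neighbour distances, which would otherwise be dominated by large-scale features.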
5

Bunkhumpornpat, Chumphol, and Krung Sinapiromsaran. "CORE: core-based synthetic minority over-sampling and borderline majority under-sampling technique." International Journal of Data Mining and Bioinformatics 12, no. 1 (2015): 44. http://dx.doi.org/10.1504/ijdmb.2015.068952.

6

Tarawneh, Ahmad S., Ahmad B. A. Hassanat, Khalid Almohammadi, Dmitry Chetverikov, and Colin Bellinger. "SMOTEFUNA: Synthetic Minority Over-Sampling Technique Based on Furthest Neighbour Algorithm." IEEE Access 8 (2020): 59069–82. http://dx.doi.org/10.1109/access.2020.2983003.

7

Duan, Yijun, Xin Liu, Adam Jatowt, et al. "SORAG: Synthetic Data Over-Sampling Strategy on Multi-Label Graphs." Remote Sensing 14, no. 18 (2022): 4479. http://dx.doi.org/10.3390/rs14184479.

Abstract:
In many real-world networks of interest in the field of remote sensing (e.g., public transport networks), nodes are associated with multiple labels, and node classes are imbalanced; that is, some classes have significantly fewer samples than others. However, the research problem of imbalanced multi-label graph node classification remains unexplored. This non-trivial task challenges the existing graph neural networks (GNNs) because the majority class can dominate the loss functions of GNNs and result in the overfitting of the majority class features and label correlations. On non-graph data, minority over-sampling methods (such as the synthetic minority over-sampling technique and its variants) have been demonstrated to be effective for the imbalanced data classification problem. This study proposes and validates a new hypothesis with unlabeled data over-sampling, which is meaningless for imbalanced non-graph data; however, feature propagation and topological interplay mechanisms between graph nodes can facilitate the representation learning of imbalanced graphs. Furthermore, we determine empirically that ensemble data synthesis through the creation of virtual minority samples in the central region of a minority and generation of virtual unlabeled samples in the boundary region between a minority and majority is the best practice for the imbalanced multi-label graph node classification task. Our proposed novel data over-sampling framework is evaluated using multiple real-world network datasets, and it outperforms diverse, strong benchmark models by a large margin.
8

Raveendhran, Nareshkumar, and Nimala Krishnan. "A novel hybrid SMOTE oversampling approach for balancing class distribution on social media text." Bulletin of Electrical Engineering and Informatics 14, no. 1 (2025): 638–46. http://dx.doi.org/10.11591/eei.v14i1.8380.

Abstract:
Depression is a frequent and dangerous medical disorder that adversely affects how a person feels, thinks, and acts, and it is quite prevalent. Early detection and treatment of depression may avoid painful and potentially life-threatening symptoms. An imbalance in the data creates several challenges: most learners become biased toward the majority class and, in extreme situations, may completely dismiss the minority class. For decades, class-imbalance research has employed traditional machine learning methods. To address the challenge of imbalanced data in depression detection, the study balances the class distribution using a hybrid approach combining bidirectional long short-term memory (BI-LSTM) with the synthetic minority over-sampling plus Tomek links and synthetic minority over-sampling plus edited nearest neighbors techniques. This investigation also presents a new approach that combines the synthetic minority oversampling technique with the Kalman filter. The Kalman-synthetic minority oversampling technique (KSMOTE) filters noisy samples out of the final dataset, which consists of both the original data and the samples artificially created by SMOTE. The result was greater accuracy with the BI-LSTM classification scheme compared to the other standard methods for finding depression on both unbalanced and balanced data.
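The SMOTE-plus-Tomek-links cleaning mentioned above removes opposite-class pairs that are each other's nearest neighbour, on the reasoning that such pairs sit on a noisy class boundary. An illustrative detector (a sketch, not the paper's implementation):

```python
import math

def tomek_links(points, labels):
    """Return index pairs (i, j) of opposite-class points that are
    mutual nearest neighbours -- the pairs a Tomek-link cleaner removes."""
    def nearest(i):
        return min((j for j in range(len(points)) if j != i),
                   key=lambda j: math.dist(points[i], points[j]))
    links = set()
    for i in range(len(points)):
        j = nearest(i)
        if labels[i] != labels[j] and nearest(j) == i:
            links.add(tuple(sorted((i, j))))
    return links

pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
lab = ["a", "b", "a", "a"]
print(tomek_links(pts, lab))  # {(0, 1)}
```

Run after SMOTE, this step deletes the borderline pairs so that the oversampled dataset has a cleaner decision boundary.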
9

Chakrabarty, Navoneel, and Sanket Biswas. "Navo Minority Over-sampling Technique (NMOTe): A Consistent Performance Booster on Imbalanced Datasets." Journal of Electronics and Informatics 2, no. 2 (June 2020): 96–136. http://dx.doi.org/10.36548/jei.2020.2.004.

Abstract:
Imbalanced data refers to a problem in machine learning where there is an unequal distribution of instances across classes. Performing a classification task on such data often biases the classifier in favour of the majority class, and the bias is amplified for high-dimensional data. To address this problem, there exist many real-world data mining techniques, such as over-sampling and under-sampling, which can reduce the data imbalance. The Synthetic Minority Oversampling Technique (SMOTe) provided one such state-of-the-art and popular solution to tackle class imbalance, even on high-dimensional data. In this work, a novel and consistent oversampling algorithm is proposed that can further enhance classification performance, especially on binary imbalanced datasets. It is named NMOTe (Navo Minority Oversampling Technique), presented as an upgraded alternative to the existing techniques. A critical analysis and comprehensive overview of the literature has been done to get a deeper insight into the problem statement and the need for an optimal solution. The performance of NMOTe on some standard datasets is established in this work to give a statistical understanding of why it edges out the existing state-of-the-art as a robust technique for solving the two-class data imbalance problem.
10

Singgalen, Yerik Afrianto. "Performance evaluation of SVM with synthetic minority over-sampling technique in sentiment classification." Jurnal Mantik 8, no. 1 (2024): 326–36. http://dx.doi.org/10.35335/mantik.v8i1.5077.

Abstract:
This study investigates the performance of the Support Vector Machine (SVM) algorithm in sentiment analysis tasks within the context of tourism destination branding, utilizing the Cross-Industry Standard Process for Data Mining (CRISP-DM) framework. Specifically, the research compares SVM performance with and without the Synthetic Minority Over-sampling Technique (SMOTE) across various metrics including accuracy, precision, recall, F-measure, and Area Under the Curve (AUC). The analysis is conducted on a dataset comprising textual data extracted from "Wonderful Indonesia" promotional videos featuring Labuan Bajo. Results indicate that SVM without SMOTE achieves a slightly higher accuracy of 97.79% compared to 96.61% with SMOTE. However, a closer examination reveals that SVM without SMOTE accurately classifies all positive instances, while with SMOTE, one positive instance is misclassified as negative. Precision, recall, and F-measure scores for positive instances are also higher without SMOTE, indicating better performance in classifying positive sentiment.
11

Purnawan, I. Ketut Adi, Adhi Dharma Wibawa, Arik Kurniawati, and Mauridhi Hery Purnomo. "Optimizing Diabetic Neuropathy Severity Classification Using Electromyography Signals Through Synthetic Oversampling Techniques." Jurnal Nasional Pendidikan Teknik Informatika (JANAPATI) 13, no. 3 (2024): 681–90. https://doi.org/10.23887/janapati.v13i3.85675.

Abstract:
Electromyography signals are electrical signals generated by muscle activity and are very useful for analyzing the health conditions of muscles and nerves. Data imbalance is a prevalent issue in EMG signal data, especially when addressing patients with varied health conditions and restricted data availability. A major difficulty for machine learning models is class imbalance in datasets, which frequently leads to biased predictions favoring the dominant class and neglecting the minority classes. The data augmentation method employs the Synthetic Minority Over Sampling Technique (SMOTE) and Random Over Sampling (ROS) to address data imbalances and enhance the performance of classification models for underrepresented classes. This study employs an oversampling technique to enhance the efficacy of the XG Boost model. SMOTE exhibits better efficacy relative to competing methods; the application of appropriate oversampling techniques allows models to integrate patterns from both majority and often neglected minority data.
12

Belluano, Poetri Lestari Lokapitasari, Reyna Aprilia Rahma, Herdianti Darwis, and Abdul Rachman Manga. "Analysis of ensemble machine learning classification comparison on the skin cancer MNIST dataset." Computer Science and Information Technologies 5, no. 3 (2024): 235–42. http://dx.doi.org/10.11591/csit.v5i3.p235-242.

Abstract:
This study aims to analyze the performance of various ensemble machine learning methods, such as Adaboost, Bagging, and Stacking, in the context of skin cancer classification using the skin cancer MNIST dataset. We also evaluate the impact of handling dataset imbalance on the classification model’s performance by applying imbalanced data methods such as random under sampling (RUS), random over sampling (ROS), synthetic minority over-sampling technique (SMOTE), and synthetic minority over-sampling technique with edited nearest neighbor (SMOTEENN). The research findings indicate that Adaboost is effective in addressing data imbalance, while imbalanced data methods can significantly improve accuracy. However, the selection of imbalanced data methods should be carefully tailored to the dataset characteristics and clinical objectives. In conclusion, addressing data imbalance can enhance skin cancer classification accuracy, with Adaboost being an exception that shows a decrease in accuracy after applying imbalanced data methods.
13

Yulia, Ery Kurniawati, and Denny Prabowo Yulius. "Model optimisation of class imbalanced learning using ensemble classifier on over-sampling data." International Journal of Artificial Intelligence (IJ-AI) 11, no. 1 (2022): 276–83. https://doi.org/10.11591/ijai.v11.i1.pp276-283.

Abstract:
Data imbalance is one of the problems in the application of machine learning and data mining, and it often occurs in the most essential and needed case entities. Two approaches to overcome this problem are the data-level approach and the algorithm approach. This study aims to obtain the best model on a pap smear dataset by combining a data-level approach with an algorithmic approach to solve data imbalance. Laboratory data are mostly small and imbalanced, and in almost every case the minority entities are the most important and needed. The over-sampling (data-level) methods used in this study are the synthetic minority oversampling technique-nominal (SMOTE-N) and adaptive synthetic-nominal (ADASYN-N) algorithms. The algorithm approach used in this study is an ensemble classifier using AdaBoost and bagging with the classification and regression tree (CART) as the base learner. The best model in accuracy, precision, recall, and f-measure was obtained using ADASYN-N with AdaBoost-CART.
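SMOTE-N adapts SMOTE to nominal features: instead of interpolating numerically, each feature of a synthetic sample takes the most common value among the selected neighbours. A sketch of that synthesis step, assuming the neighbour set has already been chosen:

```python
from collections import Counter

def smoten_sample(neighbours):
    """Synthesize one nominal sample: each feature takes the most common
    value among the neighbours (ties go to the first value seen)."""
    return tuple(Counter(col).most_common(1)[0][0]
                 for col in zip(*neighbours))

neighbours = [("red", "suv"), ("red", "sedan"), ("blue", "sedan")]
print(smoten_sample(neighbours))  # ('red', 'sedan')
```

Because the output values are drawn only from observed categories, the synthetic sample is always a valid nominal record, unlike a numeric interpolation would be.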
14

Wang, Sheng, Liling Ma, and Junzheng Wang. "Fault Diagnosis Method Based on CND-SMOTE and BA-SVM Algorithm." Journal of Physics: Conference Series 2493, no. 1 (2023): 012008. http://dx.doi.org/10.1088/1742-6596/2493/1/012008.

Abstract:
The problem of unbalanced data classification has received extensive attention in the past few years. Unbalanced sample data lead to a low fault-diagnosis and classification accuracy rate, and the capability to classify minority-class fault samples is restricted. To address the insufficient capability of machine-learning classification algorithms to identify minority-class samples in unbalanced data, this paper proposes an improved support vector machine (SVM) classification method based on the synthetic minority over-sampling technique (SMOTE). For the sampler, an improved synthetic minority over-sampling technique based on the characteristics of neighborhood distribution (CND-SMOTE) algorithm is used to equilibrate the minority-class and majority-class samples. For the classifier, the parameter optimization method of support vector machines based on the bat algorithm (BA-SVM) is used to solve the multi-classification problem of faulty samples. Finally, experimental results prove that the CND-SMOTE+BA-SVM algorithm can synthesize high-quality minority fault samples, increase the classification accuracy rate of fault samples, and decrease the time spent on classification.
16

Wibisono, David Leandro, and Zaenal Abidin. "Prediction of Student Graduation Predicts using Hybrid 2D Convolutional Neural Network and Synthetic Minority Over-Sampling Technique." Recursive Journal of Informatics 1, no. 1 (2023): 27–34. http://dx.doi.org/10.15294/rji.v1i1.65646.

Abstract:
With the rapid growth of technology, educational institutions are constantly looking for ways to improve their services and enhance student performance. One of the significant challenges in higher education is predicting the graduation outcome of students. Predicting student graduation can help educators and academic advisors to provide early intervention and support to students who may be at risk of not graduating on time. In this paper, we propose a hybrid 2D convolutional neural network (CNN) and synthetic minority over-sampling technique (SMOTE) to predict the graduation outcome of students.
 Purpose: Knowing the results and how the Hybrid 2D Convolutional Neural Network (CNN) and Synthetic Minority Over-sampling Technique (SMOTE) algorithms work in predicting student graduation predicates. This algorithm uses a dataset based on family background variables and academic data.
 Methods/Study design/approach: This study uses the Hybrid 2D CNN algorithm for the classification process and SMOTE for the minority class over-sampling.
 Result/Findings: The prediction accuracy of the model using SMOTE is 96.31%. Meanwhile, the model that does not use SMOTE obtains an accuracy of 95.32%.
 Novelty/Originality/Value: This research shows that the use of a Hybrid 2D CNN algorithm with SMOTE gives better accuracy than without using SMOTE. The dataset used also proves that family background and student academic data can be used as a reference for predicting student graduation predicates.
17

Antonio, Roy, and Hironimus Leong. "PERFORMANCE OF SYNTHETIC MINORITY OVER-SAMPLING TECHNIQUE ON SUPPORT VECTOR MACHINE AND K-NEAREST NEIGHBOR FOR SENTIMENT ANALYSIS OF METAVERSE IN INDONESIA." Proxies : Jurnal Informatika 6, no. 2 (2024): 160–70. http://dx.doi.org/10.24167/proxies.v6i2.12459.

Abstract:
The metaverse is one of the most discussed topics on social media such as Twitter in Indonesia, and Indonesian society views it both positively and negatively, hence the need for sentiment analysis. However, building a sentiment classification model on unbalanced data reduces performance. For this reason, Synthetic Minority Oversampling is applied with the Support Vector Machine and K-Nearest Neighbor algorithms. The results show that Synthetic Minority Oversampling can improve the accuracy of both algorithms.
18

Sulistiyono, Mulia, Yoga Pristyanto, Sumarni Adi, and Gagah Gumelar. "Implementasi Algoritma Synthetic Minority Over-Sampling Technique untuk Menangani Ketidakseimbangan Kelas pada Dataset Klasifikasi." SISTEMASI 10, no. 2 (2021): 445. http://dx.doi.org/10.32520/stmsi.v10i2.1303.

19

Soltanzadeh, Paria, and Mahdi Hashemzadeh. "RCSMOTE: Range-Controlled synthetic minority over-sampling technique for handling the class imbalance problem." Information Sciences 542 (January 2021): 92–111. http://dx.doi.org/10.1016/j.ins.2020.07.014.

20

Amirullah, Afif, Umi Laili Yuhana, and Muhammad Alfian. "Improve Software Defect Prediction using Particle Swarm Optimization and Synthetic Minority Over-sampling Technique." Scientific Journal of Informatics 11, no. 4 (2025): 1127–36. https://doi.org/10.15294/sji.v11i4.16808.

Abstract:
Purpose: Early detection of software defects is essential to prevent problems with software maintenance. Although much machine learning research has been applied to predicting software defects, most of it has not addressed the problems of data imbalance and feature correlation. This research focuses on overcoming the problem of imbalanced datasets and provides new insights into the impact of these two techniques in improving the accuracy of software defect prediction. Methods: This research compares three algorithms, Random Forest, Logistic Regression, and XGBoost, with Particle Swarm Optimization (PSO) applied for feature selection and SMOTE to overcome the problem of imbalanced data. Algorithm performance is measured using the F1-Score, Precision, Recall, and Accuracy metrics to evaluate the effectiveness of each approach. Result: This research demonstrates the potential of the SMOTE and PSO techniques to enhance the performance of software defect detection models, particularly in ensemble algorithms like Random Forest (RF) and XGBoost (XGB). The application of SMOTE and PSO increased RF accuracy to 87.63% and XGB accuracy to 85.40%, but decreased Logistic Regression (LR) accuracy to 72.98%. The F1-Score, Precision, and Recall metrics showed substantial improvements for RF and XGB, but not for LR. Novelty: Based on the comparison results, the SMOTE and PSO algorithms are shown to improve the Random Forest and XGB models for predicting software defects.
21

Intayoad, Wacharawan, Chayapol Kamyod, and Punnarumol Temdee. "Synthetic Minority Over-Sampling for Improving Imbalanced Data in Educational Web Usage Mining." ECTI Transactions on Computer and Information Technology (ECTI-CIT) 12, no. 2 (2019): 118–29. http://dx.doi.org/10.37936/ecti-cit.2018122.133280.

Abstract:
Educational data mining is the method for extracting and discovering new knowledge from education data. As education data are often complex and imbalanced, a data preprocessing step or suitable learning algorithms are required to obtain accurate analysis and interpretation. Many studies emphasize classification and clustering methods in order to extract insight and comprehensive knowledge from education data, but only a small number of previous works have focused exclusively on the preprocessing of education data, particularly on the topic of imbalanced datasets. The objective of this research is therefore to enhance the accuracy of data classification on educational web usage data. Our study applies synthetic minority over-sampling techniques (SMOTE) to preprocess the raw web usage dataset. The minority class is the group of students who failed the examination and the majority class is the students who passed. In our experiments, four synthetic minority over-sampling methods are applied, SMOTE and its variants Borderline-SMOTE1, Borderline-SMOTE2, and SVM-SMOTE, in order to balance the number of samples in the minority class. The experiments are evaluated by comparing the results from well-known classification methods, namely Naive Bayes, decision tree, and k-nearest neighbors, on real-world education datasets. The results show that the synthetic minority over-sampling methods are capable of improving the detection of the minority class and improve classification performance on precision, recall, and F1-value.
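The Borderline-SMOTE1/2 variants mentioned above first restrict over-sampling to minority points near the class boundary (the "DANGER" set): a minority point qualifies when at least half, but not all, of its k nearest neighbours belong to the majority class. An illustrative sketch using the fail (minority) vs pass (majority) setting of this study:

```python
import math

def borderline_minority(points, labels, minority, k=3):
    """Indices of minority points in Borderline-SMOTE's DANGER set:
    at least half, but not all, of the k nearest neighbours are majority
    (all-majority neighbourhoods are treated as noise, not borderline)."""
    danger = []
    for i, (p, y) in enumerate(zip(points, labels)):
        if y != minority:
            continue
        neigh = sorted((j for j in range(len(points)) if j != i),
                       key=lambda j: math.dist(p, points[j]))[:k]
        majority_count = sum(labels[j] != minority for j in neigh)
        if k / 2 <= majority_count < k:
            danger.append(i)
    return danger

pts = [(0.0, 0.0), (0.2, 0.0), (0.4, 0.0), (1.0, 0.0), (5.0, 5.0)]
lab = ["fail", "pass", "fail", "pass", "pass"]
print(borderline_minority(pts, lab, minority="fail"))  # [0, 2]
```

Only the DANGER points are then over-sampled, which concentrates synthetic "fail" examples where the classifier's decision boundary actually lies.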
22

Malhotra, Ruchika, and Kishwar Khan. "OpTunedSMOTE: A novel model for automated hyperparameter tuning of SMOTE in software defect prediction." Intelligent Data Analysis: An International Journal 29, no. 3 (2024): 787–807. https://doi.org/10.1177/1088467x241301390.

Abstract:
Software Defect Prediction (SDP) plays a crucial role in quality assurance by identifying potential defects early in the software development lifecycle. It is an essential aspect of modern software engineering that significantly contributes to improving software quality and reliability, and it utilizes a variety of techniques, including machine learning algorithms such as decision trees, support vector machines, and neural networks. Much research has tried to improve prediction accuracy but has struggled with imbalanced data and hyperparameter tuning of the algorithms. To deal with this, we propose a novel approach that tunes the hyperparameters of the Synthetic Minority Over-sampling Technique using the Tree-structured Parzen Estimator algorithm within the Optuna framework. Through an analysis of seventeen imbalanced datasets from different public databases, we compare our technique with existing SDP models using K-Nearest Neighbors, Multi-Layer Perceptron, Random Forest, Support Vector Machine, and Extreme Gradient Boosting classifiers. Our findings reveal that optimizing the Synthetic Minority Over-sampling Technique significantly improves the performance of SDP models, resulting in enhanced performance metrics. We have statistically validated our results using Friedman's test.
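The paper drives this tuning with Optuna's Tree-structured Parzen Estimator; as a dependency-free stand-in, the same select-evaluate-keep-best loop over a SMOTE hyperparameter such as k_neighbors can be sketched as an exhaustive search. The objective below is a hypothetical stand-in for a cross-validated F1 score:

```python
def tune_smote_k(evaluate, ks=range(1, 11)):
    """Pick the k_neighbors value whose evaluation score is highest.
    In the paper this search is driven by Optuna's TPE sampler rather
    than a grid."""
    best_k = max(ks, key=evaluate)
    return best_k, evaluate(best_k)

# Hypothetical objective: pretend cross-validated F1 peaks at k = 5.
f1_after_smote = lambda k: 1.0 - 0.05 * abs(k - 5)
print(tune_smote_k(f1_after_smote))  # (5, 1.0)
```

A TPE sampler replaces the exhaustive sweep with a probabilistic model of which regions of the hyperparameter space score well, which matters once several SMOTE parameters are tuned jointly.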
23

Al-Khazaleh, Maisa J., Marwah Alian, and Manar A. Jaradat. "Sentiment analysis of imbalanced Arabic data using sampling techniques and classification algorithms." Bulletin of Electrical Engineering and Informatics 13, no. 1 (2024): 607–18. http://dx.doi.org/10.11591/eei.v13i1.5886.

Abstract:
Sentiment analysis is a popular natural language processing task that recognizes the opinions or feelings of a piece of text. Microblogging platforms such as Twitter are a valuable resource for finding such people’s opinions. The majority of Arabic sentiment analysis studies indicated that the data utilized to train machine learning algorithms is balanced. In this paper, we investigated the impact of sampling techniques and classification algorithms on an imbalanced Arabic dataset about people’s perceptions of COVID-19, with the majority of opinions reflecting people’s fear and stress about the pandemic, and the minority reflecting the belief that the pandemic was a hoax. The experiments concentrated on analyzing the imbalanced learning of Arabic sentiments using over-sampling and under-sampling techniques on seven single machine learning algorithms and two common ensemble algorithms from the bagging and boosting families, respectively. Results show that resampling-based approaches can overcome the difficulty of an imbalanced dataset, and the use of over-sampled data leads to better performance than that of under-sampled data. The results also reveal that using oversampled data from synthetic minority over-sampling technique (SMOTE), borderline-SMOTE, or adaptive synthetic sampling with random forest classifier is the most effective in addressing this classification problem, with F1-score value of 0.99.
24

Jung, Ilok, Jaewon Ji, and Changseob Cho. "EmSM: Ensemble Mixed Sampling Method for Classifying Imbalanced Intrusion Detection Data." Electronics 11, no. 9 (2022): 1346. http://dx.doi.org/10.3390/electronics11091346.

Abstract:
Research on the application of machine learning to the field of intrusion detection is attracting great interest. However, depending on the application, it is difficult to collect the data needed for training and testing, as the least frequent data type reflects the most serious threats, resulting in imbalanced data, which leads to overfitting and hinders precise classification. To solve this problem, in this study, we propose a mixed resampling method using a hybrid synthetic minority oversampling technique with edited nearest neighbours that increases the minority class and removes noisy data to generate a balanced dataset. A bagging ensemble algorithm is then used to optimize the model with the new data. We performed verification using two public intrusion detection datasets: PKDD2007 (balanced) and CSIC2012 (imbalanced). The proposed technique yields improved performance over state-of-the-art techniques. Furthermore, the proposed technique enables improved true positive identification and classification of serious threats that rarely occur, representing a major functional innovation.
APA, Harvard, Vancouver, ISO, and other styles
25

Zazzaro, Gaetano. "COSM: Controlled Over-Sampling Method." Transactions on Machine Learning and Artificial Intelligence 8, no. 2 (2020): 42–51. http://dx.doi.org/10.14738/tmlai.82.7925.

Full text
Abstract:
The class imbalance problem is widespread in data mining, and it can reduce the general performance of a classification model. Many techniques have been proposed to overcome it, thanks to which a model able to handle rare events can be trained. The methodology presented in this paper, called the Controlled Over-Sampling Method (COSM), includes a controller model able to reject new synthetic elements for which there is no certainty of belonging to the minority class. It combines the common machine learning holdout method with an oversampling algorithm, for example the classic SMOTE algorithm. The proposal explained and designed here represents a guideline for the application of oversampling algorithms, as well as a brief overview of techniques for overcoming the class imbalance problem in data mining.
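The "controller" idea can be illustrated with a toy sketch: candidate synthetic points are proposed by SMOTE-style interpolation, and a controller rejects any candidate it cannot confidently assign to the minority class. COSM's controller is a model trained on a holdout split; the stdlib-only stand-in below simply compares nearest-neighbour distances to the two classes, and all names and data are assumptions for illustration.

```python
import math
import random

def controlled_oversample(minority, majority, n_new=10, seed=1):
    """Propose SMOTE-like candidates, then let a 'controller' reject any
    candidate not confidently minority (here, a 1-NN distance check)."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        gap = rng.random()
        cand = tuple(ai + gap * (bi - ai) for ai, bi in zip(a, b))
        nearest_min = min(math.dist(cand, p) for p in minority)
        nearest_maj = min(math.dist(cand, p) for p in majority)
        if nearest_min < nearest_maj:  # controller accepts the candidate
            accepted.append(cand)
    return accepted

minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
majority = [(5.0, 5.0), (6.0, 5.0), (5.0, 6.0), (6.0, 6.0)]
accepted = controlled_oversample(minority, majority)
print(len(accepted))  # 10: here every candidate is far from the majority class
```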
APA, Harvard, Vancouver, ISO, and other styles
26

Sandeep, Yadav. "A Comparative Analysis of Sampling Techniques for Imbalanced Datasets in Machine Learning." INTERNATIONAL JOURNAL OF INNOVATIVE RESEARCH AND CREATIVE TECHNOLOGY 7, no. 5 (2021): 1–7. https://doi.org/10.5281/zenodo.14203644.

Full text
Abstract:
In machine learning, the challenge of class imbalance—where one class is significantly underrepresented compared to others—often leads to models with poor predictive performance, especially for minority classes. This study provides a detailed comparative analysis of sampling techniques designed to address this imbalance, focusing on their effectiveness across different types of imbalanced datasets. The techniques examined include basic undersampling and oversampling, along with more sophisticated synthetic methods like SMOTE (Synthetic Minority Over-sampling Technique), ADASYN (Adaptive Synthetic Sampling), and borderline variants of SMOTE. Using several real-world and synthetic datasets, this research evaluates the performance of these techniques based on key metrics tailored for imbalanced data, such as F1-score, G-mean, precision, recall, and area under the precision-recall curve. Our findings reveal that while undersampling can improve computational efficiency, it may lead to significant data loss and reduced model robustness. Conversely, oversampling, though effective in balancing the dataset, can introduce redundancy and increase model complexity. Among synthetic methods, SMOTE and its variants demonstrate improved performance by generating more diverse samples in the feature space, although they may also introduce noise when not carefully applied. ADASYN was particularly effective in scenarios with higher levels of imbalance, adapting sample generation based on instance difficulty. Ultimately, this study underscores the importance of selecting a sampling method based on the specific dataset characteristics and model requirements, providing practical guidance for practitioners in choosing optimal sampling techniques for achieving balanced and fair machine learning models in imbalanced contexts.
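Of the metrics listed above, the G-mean is the least standard: it is the geometric mean of per-class recalls, so it collapses to zero as soon as any class is ignored entirely, which accuracy does not. A small stdlib-only sketch (illustrative, not from the paper):

```python
def g_mean(y_true, y_pred):
    """Geometric mean of per-class recalls; zero if any class is ignored."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    prod = 1.0
    for r in recalls:
        prod *= r
    return prod ** (1.0 / len(recalls))

y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [1, 0]   # perfect majority recall, 50% minority recall
print(round(g_mean(y_true, y_pred), 4))  # 0.7071
```

A majority-class-only classifier would score 80% accuracy on this toy data but a G-mean of 0, which is why studies of resampling methods prefer it.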
APA, Harvard, Vancouver, ISO, and other styles
27

Kasemtaweechok, Chatchai, and Worasait Suwannik. "Under-sampling technique for imbalanced data using minimum sum of euclidean distance in principal component subset." IAES International Journal of Artificial Intelligence (IJ-AI) 13, no. 1 (2024): 305. http://dx.doi.org/10.11591/ijai.v13.i1.pp305-318.

Full text
Abstract:
Imbalanced datasets are characterized by a substantially smaller number of data points in the minority class compared to the majority class. This imbalance often leads to poor predictive performance of classification models when applied in real-world scenarios. There are three main approaches to handle imbalanced data: over-sampling, under-sampling, and hybrid approach. The over-sampling methods duplicate or synthesize data in the minority class. On the other hand, the under-sampling methods remove majority class data. Hybrid methods combine the noise-removing benefits of under-sampling the majority class with the synthetic minority class creation process of over-sampling. In this research, we applied principal component (PC) analysis, which is normally used for dimensionality reduction, to reduce the amount of majority class data. The proposed method was compared with eight state-of-the-art under-sampling methods across three different classification models: support vector machine, random forest, and AdaBoost. In the experiment, conducted on 35 datasets, the proposed method had higher average values for sensitivity, G-mean, the Matthews correlation coefficient (MCC), and receiver operating characteristic curve (ROC curve) compared to the other under-sampling methods.
APA, Harvard, Vancouver, ISO, and other styles
28

Kasemtaweechok, Chatchai, and Worasait Suwannik. "Under-sampling technique for imbalanced data using minimum sum of euclidean distance in principal component subset." IAES International Journal of Artificial Intelligence (IJ-AI) 13, no. 1 (2024): 305–18. https://doi.org/10.11591/ijai.v13.i1.pp305-318.

Full text
Abstract:
Imbalanced datasets are characterized by a substantially smaller number of data points in the minority class compared to the majority class. This imbalance often leads to poor predictive performance of classification models when applied in real-world scenarios. There are three main approaches to handle imbalanced data: over-sampling, under-sampling, and hybrid approach. The over-sampling methods duplicate or synthesize data in the minority class. On the other hand, the under-sampling methods remove majority class data. Hybrid methods combine the noise-removing benefits of under-sampling the majority class with the synthetic minority class creation process of over-sampling. In this research, we applied principal component (PC) analysis, which is normally used for dimensionality reduction, to reduce the amount of majority class data. The proposed method was compared with eight state-of-the-art under-sampling methods across three different classification models: support vector machine, random forest, and AdaBoost. In the experiment, conducted on 35 datasets, the proposed method had higher average values for sensitivity, G-mean, the Matthews correlation coefficient (MCC), and receiver operating characteristic curve (ROC curve) compared to the other under-sampling methods.
APA, Harvard, Vancouver, ISO, and other styles
29

Rout, Neelam, Debahuti Mishra, and Manas Kumar Mallick. "An advance extended binomial GLMBoost ensemble method with synthetic minority over-sampling technique for handling imbalanced datasets." International Journal of Electrical and Computer Engineering (IJECE) 13, no. 4 (2023): 4357–68. https://doi.org/10.11591/ijece.v13i4.pp4357-4368.

Full text
Abstract:
Classification is an important activity in a variety of domains, and the class imbalance problem has reduced the performance of traditional classification approaches. An imbalance problem arises when mismatched class distributions are discovered among the instances of the classes of a classification dataset. An advanced extended binomial GLMBoost (EBGLMBoost) coupled with the synthetic minority over-sampling technique (SMOTE) is the model proposed in this study to manage imbalance issues. SMOTE is used to ensure that the target variable's distribution is balanced, whereas the GLMBoost ensemble techniques are built to deal with imbalanced datasets. For the entire experiment, twenty different datasets are used, and support vector machine (SVM), Nu-SVM, bagging, and AdaBoost classification algorithms are compared with the suggested method. The model's sensitivity, specificity, geometric mean (G-mean), precision, recall, and F-measure percentages for the training and testing datasets are 99.37, 66.95, 80.81, 99.21, 99.37, 99.29 and 98.61, 54.78, 69.88, 98.77, 96.61, 98.68, respectively. With the help of the Wilcoxon test, it is determined that the proposed technique performs well on unbalanced data. Finally, the proposed solutions are capable of efficiently dealing with the problem of class imbalance.
APA, Harvard, Vancouver, ISO, and other styles
30

Bhuiyan, Rabiul Alam, Mst. Shimu Khatun, Md Taslim, and Md.Alam Hossain. "Handling Class Imbalance in Credit Card Fraud Using Various Sampling Techniques." American Journal of Multidisciplinary Research and Innovation 1, no. 4 (2022): 160–68. https://doi.org/10.54536/ajmri.v1i4.633.

Full text
Abstract:
Over the last few decades, credit card fraud (CCF) has been a severe problem for both cardholders and card providers. Credit card transactions are expanding fast as internet technology advances, relying significantly on the internet. With advanced technology and increased credit card usage, fraud rates are becoming a problem for the economy. However, the credit card dataset is highly imbalanced and skewed. Many classification techniques are used to classify fraud and non-fraud, but in certain conditions they may not generate the best results. Different types of sampling techniques, such as under- and over-sampling, synthetic minority oversampling, and adaptive synthetic techniques, have been used to overcome the class imbalance problem in the credit card dataset. The sampled datasets are then classified using different machine learning techniques like Decision Tree, Random Forest, K-Nearest Neighbors, Logistic Regression, and Naive Bayes. Recall, F1-score, accuracy, precision, and error rate are used to evaluate model performance. The Logistic Regression model achieved the highest result with 99.94% after under-sampling, and the Random Forest model achieved the highest result with 99.964% after over-sampling.
APA, Harvard, Vancouver, ISO, and other styles
31

Krishnan, Ulagapriya, and Pushpa Sangar. "A Rebalancing Framework for Classification of Imbalanced Medical Appointment No-show Data." Journal of Data and Information Science 6, no. 1 (2021): 178–92. http://dx.doi.org/10.2478/jdis-2021-0011.

Full text
Abstract:
Purpose: This paper aims to improve classification performance when the data is imbalanced by applying different sampling techniques available in machine learning. Design/methodology/approach: The medical appointment no-show dataset is imbalanced, and when classification algorithms are applied directly to the dataset, the result is biased towards the majority class, ignoring the minority class. To avoid this issue, multiple sampling techniques such as Random Over Sampling (ROS), Random Under Sampling (RUS), Synthetic Minority Oversampling TEchnique (SMOTE), ADAptive SYNthetic sampling (ADASYN), Edited Nearest Neighbor (ENN), and Condensed Nearest Neighbor (CNN) are applied in order to make the dataset balanced. The performance is assessed with the Decision Tree classifier under the listed sampling techniques, and the best performance is identified. Findings: This study focuses on comparing the performance metrics of various widely used sampling methods. It is revealed that, compared to other techniques, Recall is high when ENN is applied; CNN and ADASYN performed equally well on the imbalanced data. Research limitations: The testing was carried out with a limited dataset and needs to be repeated with a larger dataset. Practical implications: This framework will be useful whenever data is imbalanced in real-world scenarios, which ultimately improves performance. Originality/value: This paper uses the rebalancing framework on the medical appointment no-show dataset to predict no-shows and removes the bias against the minority class.
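Random over-sampling, the simplest of the techniques compared in such studies, just duplicates randomly chosen minority samples until class counts match; random under-sampling analogously discards majority samples. A stdlib-only ROS sketch (function name and toy data are illustrative):

```python
import random

def random_oversample(X, y, seed=0):
    """ROS: duplicate randomly chosen samples of each smaller class
    until every class reaches the size of the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    target = max(len(v) for v in by_class.values())
    Xb, yb = [], []
    for c, pts in by_class.items():
        pts = pts + [rng.choice(pts) for _ in range(target - len(pts))]
        Xb.extend(pts)
        yb.extend([c] * target)
    return Xb, yb

X = [(i, 0) for i in range(9)] + [(0, 1)]
y = [0] * 9 + [1]
Xb, yb = random_oversample(X, y)
print(yb.count(0), yb.count(1))  # 9 9
```

The duplicated points carry no new information, which is why ROS risks overfitting and why synthetic methods such as SMOTE and ADASYN were developed.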
APA, Harvard, Vancouver, ISO, and other styles
32

Chohan, Saifurrachman, Arifin Nugroho, Achmad Maezar Bayu Aji, and Windu Gata. "Analisis Sentimen Pengguna Aplikasi Duolingo Menggunakan Metode Naïve Bayes dan Synthetic Minority Over Sampling Technique." Paradigma - Jurnal Komputer dan Informatika 22, no. 2 (2020): 139–44. http://dx.doi.org/10.31294/p.v22i2.8251.

Full text
APA, Harvard, Vancouver, ISO, and other styles
33

Karthik, M., and M. Krishnan. "Detecting Internet of Things Attacks Using Post Pruning Decision Tree-Synthetic Minority Over Sampling Technique." International Journal of Intelligent Engineering and Systems 14, no. 4 (2021): 105–14. http://dx.doi.org/10.22266/ijies2021.0831.10.

Full text
APA, Harvard, Vancouver, ISO, and other styles
34

Prasojo, Rahman Azis, Muhammad Akmal A. Putra, Ekojono, et al. "Precise transformer fault diagnosis via random forest model enhanced by synthetic minority over-sampling technique." Electric Power Systems Research 220 (July 2023): 109361. http://dx.doi.org/10.1016/j.epsr.2023.109361.

Full text
APA, Harvard, Vancouver, ISO, and other styles
35

Julian, Fajar Azhari, and Fahmi Arif. "Enhancing Cascade Quality Prediction Method in Handling Imbalanced Dataset Using Synthetic Minority Over-Sampling Technique." Industrial Engineering & Management Systems 22, no. 4 (2023): 389–98. http://dx.doi.org/10.7232/iems.2023.22.4.389.

Full text
APA, Harvard, Vancouver, ISO, and other styles
36

Krishnapriya, A., et al. "Machine Learning for Medicare Fraud Detection: Tackling Class Imbalance with SMOTE-ENN." International Journal of Computational Learning & Intelligence 4, no. 4 (2025): 716–24. https://doi.org/10.5281/zenodo.15251088.

Full text
Abstract:
The realm of healthcare fraud detection is continually changing and encounters substantial obstacles, especially when dealing with data imbalance problems. Earlier research primarily concentrated on standard machine learning (ML) methods, which often have difficulty with imbalanced data. This issue manifests in several ways. It involves the danger of overfitting with Random Oversampling (ROS), the creation of noise by the Synthetic Minority Oversampling Technique (SMOTE), and the possible loss of vital information with Random Undersampling (RUS). Furthermore, enhancing model performance, examining hybrid resampling techniques, and refining evaluation metrics are essential for achieving greater accuracy with imbalanced datasets. In this study, we introduce a new technique to address the problem of imbalanced datasets in healthcare fraud detection, specifically focusing on the Medicare Part B dataset. Initially, we carefully remove the categorical feature "Provider Type" from the dataset. This enables us to create new, synthetic instances by randomly copying existing types, thus increasing the diversity within the minority class. Subsequently, we implement a hybrid resampling method called SMOTE-ENN, which combines the Synthetic Minority Over-sampling Technique (SMOTE) with Edited Nearest Neighbours (ENN).
APA, Harvard, Vancouver, ISO, and other styles
37

Rout, Neelam, Debahuti Mishra, and Manas Kumar Mallick. "An advance extended binomial GLMBoost ensemble method with synthetic minority over-sampling technique for handling imbalanced datasets." International Journal of Electrical and Computer Engineering (IJECE) 13, no. 4 (2023): 4357. http://dx.doi.org/10.11591/ijece.v13i4.pp4357-4368.

Full text
Abstract:
Classification is an important activity in a variety of domains, and the class imbalance problem has reduced the performance of traditional classification approaches. An imbalance problem arises when mismatched class distributions are discovered among the instances of the classes of a classification dataset. An advanced extended binomial GLMBoost (EBGLMBoost) coupled with the synthetic minority over-sampling technique (SMOTE) is the model proposed in this study to manage imbalance issues. SMOTE is used to ensure that the target variable's distribution is balanced, whereas the GLMBoost ensemble techniques are built to deal with imbalanced datasets. For the entire experiment, twenty different datasets are used, and support vector machine (SVM), Nu-SVM, bagging, and AdaBoost classification algorithms are compared with the suggested method. The model's sensitivity, specificity, geometric mean (G-mean), precision, recall, and F-measure percentages for the training and testing datasets are 99.37, 66.95, 80.81, 99.21, 99.37, 99.29 and 98.61, 54.78, 69.88, 98.77, 96.61, 98.68, respectively. With the help of the Wilcoxon test, it is determined that the proposed technique performs well on unbalanced data. Finally, the proposed solutions are capable of efficiently dealing with the problem of class imbalance.
APA, Harvard, Vancouver, ISO, and other styles
38

Karthikeyan, S., and T. Kathirvalavakumar. "Genetic Algorithm Based Over-Sampling with DNN in Classifying the Imbalanced Data Distribution Problem." Indian Journal of Science and Technology 16, no. 8 (2023): 547–56. https://doi.org/10.17485/IJST/v16i8.863.

Full text
Abstract:
Objective: Data imbalance exists in many real-life applications. In imbalanced datasets, the minority class data creates wrong inferences during classification, leading to more misclassification. Much research has been done in the past to solve this issue, but as of now no globally working solution for efficient classification has been found. After analyzing the existing literature, it is proposed to minimize misclassification through genetic-based oversampling and a deep neural network (DNN) classifier. Method: In the proposed oversampling method, synthetic samples are generated based on a genetic algorithm. Initial populations for the genetic algorithm are generated using a Gaussian weight initialization technique, and the fittest individuals from the population are selected by Euclidean distance for further processing to generate synthetic data of double the minority class size; the dataset is then classified with the DNN. Findings: The performance of the oversampled training data with the DNN classifier is compared with C4.5 and Support Vector Machine (SVM) classifiers, and the DNN classifier is found to outperform the other two. Data generated using SMOTE and ADASYN are considered for comparison, and the proposed approach outperforms the other approaches. The experiments also show that misclassification is reduced and that the proposed method is statistically significant and comparatively better. Novelty: Initial population generation by Gaussian weight initialization, fittest-sample selection by Euclidean distance, synthetic samples of double the minority class size, and DNN classification to reduce misclassification are the novelty of this work. Keywords: genetic algorithm; Gauss weight initialization; SMOTE; ADASYN; imbalanced data; classification
APA, Harvard, Vancouver, ISO, and other styles
39

Mangatayaru, Pureti, and Naresh. "Spam Message Detection over Social Media: A Supervised Sampling Approach for the Social Web of Things." Journal of Engineering Sciences 16, no. 04 (2025): 140–46. https://doi.org/10.36893/jes.2025.v16i04.023.

Full text
Abstract:
The increasing use of social media has led to a surge in spam messages, including fake advertisements, phishing links, and misinformation. Traditional spam detection methods struggle with evolving spam patterns and imbalanced datasets, where spam messages constitute only a small fraction of total messages. This paper proposes a supervised sampling approach for spam detection in the Social Web of Things (SWoT), leveraging machine learning and natural language processing (NLP) techniques. The system uses Synthetic Minority Oversampling Technique (SMOTE) and cost-sensitive learning to handle class imbalance, improving the classification accuracy of spam detection models. Experimental results on real-world social media datasets demonstrate that the proposed approach enhances spam detection performance, reducing false positives and improving precision.
APA, Harvard, Vancouver, ISO, and other styles
40

Liu, Zhen-Tao, Bao-Han Wu, Dan-Yun Li, Peng Xiao, and Jun-Wei Mao. "Speech Emotion Recognition Based on Selective Interpolation Synthetic Minority Over-Sampling Technique in Small Sample Environment." Sensors 20, no. 8 (2020): 2297. http://dx.doi.org/10.3390/s20082297.

Full text
Abstract:
Speech emotion recognition often encounters the problems of data imbalance and redundant features in different application scenarios. Researchers usually design different recognition models for different sample conditions. In this study, a speech emotion recognition model for a small sample environment is proposed. A data imbalance processing method based on the selective interpolation synthetic minority over-sampling technique (SISMOTE) is proposed to reduce the impact of sample imbalance on emotion recognition results. In addition, a feature selection method based on variance analysis and gradient boosting decision tree (GBDT) is introduced, which can exclude the redundant features that possess poor emotional representation. Results of experiments on speech emotion recognition on three databases (i.e., CASIA, Emo-DB, SAVEE) show that our method obtains average recognition accuracy of 90.28% (CASIA), 75.00% (SAVEE) and 85.82% (Emo-DB) for speaker-dependent speech emotion recognition, which is superior to some state-of-the-art works.
APA, Harvard, Vancouver, ISO, and other styles
41

Zhang, Yining. "Machine learning with oversampling for space debris classification based on radar cross section." Applied and Computational Engineering 49, no. 1 (2024): 102–8. http://dx.doi.org/10.54254/2755-2721/49/20241070.

Full text
Abstract:
Over the past few years, the likelihood of collision between space objects has increased as the quantity of space debris rises. Space debris classification and identification are becoming more crucial to space asset security and space situational awareness. Radar cross section (RCS), one of the essential arguments for tracking space debris, was measured by the European Incoherent Scatter Scientific Association (EISCAT) and other radar systems. This study investigates the effectiveness of seven machine learning methods employed to address the classification of space objects based on RCS data from the European Space Agency (ESA). To tackle the class-imbalance issue in this study (the ratio of space debris to non-debris is approximately 5:1 in the dataset), three oversampling techniques are employed: the Synthetic Minority Oversampling Technique (SMOTE), support-vector-machine SMOTE (SMOTE-SVM) and Adaptive Synthetic Sampling (ADASYN). The experiments show that, in the test set, the combination of SVM with the SMOTE-SVM oversampling technique reaches an accuracy of 99.7%, a precision of 98.7% and a recall of 99.4%, which is better than the rest of the models.
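ADASYN, used here alongside the SMOTE variants, differs from plain SMOTE in how it allocates the synthetic budget: each minority point is weighted by the share of majority samples among its k nearest neighbours, so points near the class border receive more synthetic neighbours. A stdlib-only sketch of that weighting step (function name and toy data are illustrative, not from the paper):

```python
import math

def adasyn_weights(minority, majority, k=3):
    """ADASYN weighting: each minority point's share of the synthetic
    budget is proportional to the fraction of majority samples among
    its k nearest neighbours."""
    weights = []
    for x in minority:
        neighbours = [(math.dist(x, p), 1) for p in majority] + \
                     [(math.dist(x, p), 0) for p in minority if p is not x]
        ratio = sum(lbl for _, lbl in sorted(neighbours)[:k]) / k
        weights.append(ratio)
    total = sum(weights)
    return [w / total for w in weights] if total else \
           [1.0 / len(weights)] * len(weights)

minority = [(0.0, 0.0), (0.9, 0.9)]              # second point is borderline
majority = [(1.0, 1.0), (1.1, 1.0), (1.0, 1.1)]
print(adasyn_weights(minority, majority))  # roughly [0.4, 0.6]
```

The borderline point ends up with the larger weight, so it would receive the larger share of synthetic samples in the subsequent interpolation step.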
APA, Harvard, Vancouver, ISO, and other styles
42

Dharmendra, I. Komang, I. Made Agus Wirahadi Putra, and Yohanes Priyo Atmojo. "Evaluation of the Effectiveness of SMOTE and Random Under Sampling in Emotion Classification of Tweets." INFORMATICS FOR EDUCATORS AND PROFESSIONAL : Journal of Informatics 9, no. 2 (2024): 182. https://doi.org/10.51211/itbi.v9i2.3183.

Full text
Abstract:
This study evaluates the effectiveness of two sampling techniques, SMOTE (Synthetic Minority Over-sampling Technique) and Random Under Sampling (RUS), in improving the performance of several classification models, namely Maximum Entropy, SVM, Random Forest, Neural Network, and Naive Bayes Classification, for handling data imbalance in emotion classification of tweets. The analysis results show that SMOTE consistently provides a more significant improvement in accuracy, precision, recall, and F1-score compared to RUS, especially in Random Forest and Neural Network models. Maximum Entropy and SVM prove to be the best-performing models in both scenarios, while Naive Bayes Classification, although efficient in terms of time, shows lower performance in evaluation metrics. Overall, SMOTE is a more effective sampling technique compared to RUS in handling class imbalance.
APA, Harvard, Vancouver, ISO, and other styles
43

Hadjadj, Hassina, and Halim Sayoud. "Arabic Authorship Attribution Using Synthetic Minority Over-Sampling Technique and Principal Components Analysis for Imbalanced Documents." International Journal of Cognitive Informatics and Natural Intelligence 15, no. 4 (2021): 1–17. http://dx.doi.org/10.4018/ijcini.20211001.oa33.

Full text
Abstract:
Nowadays, dealing with imbalanced data represents a great challenge in data mining as well as in machine learning tasks. In this investigation, we are interested in the problem of class imbalance in the authorship attribution (AA) task, with specific application to Arabic text data. This article proposes a new hybrid approach based on Principal Components Analysis (PCA) and the Synthetic Minority Over-sampling Technique (SMOTE), which considerably improves the performance of authorship attribution on imbalanced data. The used dataset contains 7 Arabic books written by 7 different scholars, which are segmented into text segments of the same size, with an average length of 2900 words per text. The obtained results of our experiments show that the proposed approach, using the SMO-SVM classifier, presents high performance in terms of authorship attribution accuracy (100%), especially with starting character-bigrams. In addition, the proposed method appears quite interesting by improving the AA performance on imbalanced datasets, mainly with function words.
APA, Harvard, Vancouver, ISO, and other styles
44

Almonzer, Salah Nooraldaim, Elobaid Ahmed Abdalla Amal, and Mirghani Seed Amna. "An Advanced Machine Learning Approach for Enhanced Diabetes Prediction." International Journal of Current Science Research and Review 07, no. 12 (2024): 8779–89. https://doi.org/10.5281/zenodo.14292064.

Full text
Abstract:
Diabetes is a chronic health condition affecting millions globally, causing severe complications and burdening healthcare systems. Current machine learning methods for diabetes prediction face challenges such as data imbalance, limited generalizability, and computational inefficiency. This study proposes a novel method that combines K-Nearest Neighbors (KNN), clustering techniques, the Synthetic Minority Over-sampling Technique (SMOTE), and Random Forest for outcome classification to address these issues. The PIMA Indian Diabetes Dataset was used to evaluate the approach, achieving an accuracy of 87.50%. However, the study has limitations, such as dependency on specific datasets and computational complexity. Future work will focus on validating the method across diverse datasets, optimizing computational efficiency, and developing real-time prediction capabilities.
APA, Harvard, Vancouver, ISO, and other styles
45

Bui, My Thi Thien. "Incremental Ensemble Learning Model for Imbalanced Data: a Case Study of Credit Scoring." Journal of Advanced Engineering and Computation 7, no. 2 (2023): 105. http://dx.doi.org/10.55579/jaec.202372.407.

Full text
Abstract:
Imbalanced data is a challenge for classification models. It reduces the overall performance of traditional learning algorithms. Besides, the minority class of imbalanced datasets is misclassified at a high ratio even though it is a crucial object of the classification process. In this paper, a new model called the Lasso-Logistic ensemble is proposed to deal with imbalanced data by utilizing two popular techniques, random over-sampling and random under-sampling. The model was applied to two real imbalanced credit data sets. The results show that the Lasso-Logistic ensemble model offers better performance than the single traditional methods, such as random over-sampling, random under-sampling, the Synthetic Minority Oversampling Technique (SMOTE), and cost-sensitive learning.
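An under-sampling ensemble of the kind described can be sketched with stdlib Python: each member sees all minority samples plus a random, equally sized majority subsample, and the ensemble takes a majority vote. The paper's members are Lasso-penalised logistic regressions; a nearest-centroid rule stands in below, and all names and toy data are assumptions for illustration.

```python
import math
import random

def ensemble_predict(X_maj, X_min, x, n_models=5, seed=0):
    """Each member trains a nearest-centroid rule on all minority samples
    plus an equal-sized random majority subsample (balancing by
    under-sampling); the ensemble takes a majority vote."""
    rng = random.Random(seed)
    c_min = tuple(sum(col) / len(col) for col in zip(*X_min))
    votes = 0
    for _ in range(n_models):
        maj_sub = rng.sample(X_maj, len(X_min))  # balance by under-sampling
        c_maj = tuple(sum(col) / len(col) for col in zip(*maj_sub))
        votes += math.dist(x, c_min) < math.dist(x, c_maj)
    return 1 if votes > n_models // 2 else 0

X_maj = [(5.0 + 0.1 * i, 5.0) for i in range(20)]
X_min = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.2)]
print(ensemble_predict(X_maj, X_min, (0.1, 0.1)))  # 1 (minority side)
print(ensemble_predict(X_maj, X_min, (5.5, 5.0)))  # 0 (majority side)
```

Because every member discards a different majority subsample, the ensemble sees most of the majority data overall while each individual model trains on balanced data, which is the usual argument for under-sampling ensembles over a single under-sampled model.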
APA, Harvard, Vancouver, ISO, and other styles
46

Nedjar, Imane, Mohamed Amine Chikh, and Saïd Mahmoudi. "A topological approach for mammographic density classification using a modified synthetic minority over-sampling technique algorithm." International Journal of Biomedical Engineering and Technology 38, no. 2 (2022): 193. http://dx.doi.org/10.1504/ijbet.2022.10045038.

Full text
APA, Harvard, Vancouver, ISO, and other styles
47

Nedjar, Imane, Saïd Mahmoudi, and Mohamed Amine Chikh. "A topological approach for mammographic density classification using a modified synthetic minority over-sampling technique algorithm." International Journal of Biomedical Engineering and Technology 38, no. 2 (2022): 193. http://dx.doi.org/10.1504/ijbet.2022.120870.

Full text
APA, Harvard, Vancouver, ISO, and other styles
48

Feng, Wei, Gabriel Dauphin, Wenjiang Huang, et al. "Dynamic Synthetic Minority Over-Sampling Technique-Based Rotation Forest for the Classification of Imbalanced Hyperspectral Data." IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 12, no. 7 (2019): 2159–69. http://dx.doi.org/10.1109/jstars.2019.2922297.

Full text
APA, Harvard, Vancouver, ISO, and other styles
49

Zhang, Xiaolong, Xiaoli Lin, Jiafu Zhao, Qianqian Huang, and Xin Xu. "Efficiently Predicting Hot Spots in PPIs by Combining Random Forest and Synthetic Minority Over-Sampling Technique." IEEE/ACM Transactions on Computational Biology and Bioinformatics 16, no. 3 (2019): 774–81. http://dx.doi.org/10.1109/tcbb.2018.2871674.

Full text
APA, Harvard, Vancouver, ISO, and other styles
50

Hao, Ming, Yanli Wang, and Stephen H. Bryant. "An efficient algorithm coupled with synthetic minority over-sampling technique to classify imbalanced PubChem BioAssay data." Analytica Chimica Acta 806 (January 2014): 117–27. http://dx.doi.org/10.1016/j.aca.2013.10.050.

Full text
APA, Harvard, Vancouver, ISO, and other styles