A Hybrid GDHS and GBDT Approach for Handling Multi-Class Imbalanced Data Classification
Abstract
Multi-class imbalanced classification remains a significant challenge in machine learning, particularly when datasets exhibit high Imbalance Ratios (IR) and overlapping feature distributions. Traditional classifiers often fail to accurately represent minority classes, leading to biased models and suboptimal performance. This study proposes a hybrid approach combining Generalization potential and learning Difficulty-based Hybrid Sampling (GDHS) as a preprocessing technique with a Gradient Boosting Decision Tree (GBDT) as the classifier. GDHS enhances minority-class representation through informed oversampling while cleaning majority classes to reduce noise and class overlap. GBDT is then applied to the resampled dataset, leveraging its stage-wise additive learning to focus successive trees on the examples that remain hard to classify. The performance of the proposed GDHS+GBDT model was evaluated across six benchmark datasets with varying IR levels, using the Matthews Correlation Coefficient (MCC), Precision, Recall, and F-Value as metrics. Results show that GDHS+GBDT consistently outperforms competing methods, including SMOTE+XGBoost, CatBoost, and Select-SMOTE+LightGBM, particularly on high-IR datasets such as Red Wine Quality (IR = 68.10) and Page-Blocks (IR = 188.72). The combined method improves minority-class detection while maintaining high overall accuracy.
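The two-stage pipeline described above (oversample minorities, clean overlapping majority samples, then train a boosted classifier) can be sketched as follows. This is a minimal illustrative sketch, not the authors' GDHS algorithm: it substitutes random duplication for GDHS's generalization-potential-guided oversampling and a simple k-nearest-neighbour majority-vote test for its learning-difficulty-based cleaning; the function names `hybrid_sample` and `nearest_labels` are invented for this example.

```python
# Illustrative GDHS-style hybrid sampling (simplified stand-in, NOT the
# authors' exact method): majority points whose neighbourhood is dominated
# by other classes are dropped as overlap noise, and minority classes are
# oversampled by duplication up to the (cleaned) majority count.
import math
import random
from collections import Counter

def nearest_labels(X, y, i, k):
    """Labels of the k nearest neighbours of X[i] (Euclidean distance)."""
    dists = sorted(
        (math.dist(X[i], X[j]), y[j]) for j in range(len(X)) if j != i
    )
    return [label for _, label in dists[:k]]

def hybrid_sample(X, y, k=3, seed=0):
    rng = random.Random(seed)
    counts = Counter(y)
    majority_label = counts.most_common(1)[0][0]

    # Cleaning step: keep a majority sample only if its own class still
    # wins the vote among its k nearest neighbours; minorities are kept.
    keep = [
        i for i in range(len(X))
        if y[i] != majority_label
        or Counter(nearest_labels(X, y, i, k)).most_common(1)[0][0] == y[i]
    ]
    Xc, yc = [X[i] for i in keep], [y[i] for i in keep]

    # Oversampling step: duplicate minority samples until every class
    # matches the cleaned majority-class count.
    target = sum(1 for label in yc if label == majority_label)
    for label in counts:
        if label == majority_label:
            continue
        idx = [i for i in range(len(Xc)) if yc[i] == label]
        for _ in range(target - len(idx)):
            j = rng.choice(idx)
            Xc.append(Xc[j])
            yc.append(label)
    return Xc, yc
```

The rebalanced `(Xc, yc)` would then be passed to any GBDT implementation; because the resampled classes are balanced and overlap noise is reduced, the boosting stages spend their capacity on genuinely hard boundaries rather than on reproducing the majority-class prior.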
DOI: https://doi.org/10.52088/ijesty.v5i3.894
Copyright (c) 2025 Hartono Hartono, Muhammad Khahfi Zuhanda, Rahmad Syah, Sayuti Rahman, Erianto Ongko