Improving machine learning classification models for anaemia type prediction by oversampling imbalanced complete blood count data with SMOTE-based algorithms

Authors : Ladislav Végh, Norbert Annuš, Krisztina Czakóová, Ondrej Takáč

Journal title : AD ALTA: Journal of Interdisciplinary Research

Publisher : MAGNANIMITAS

Print ISSN : 1804-7890

Page Number : 469-475

Journal volume : 14

Journal issue : 2

Original Article

Computer-assisted disease diagnosis is cost-effective and time-saving, increasing accuracy and reducing the need for an additional workforce in medical decision-making. In our prior research, we trained, tested, and compared the accuracies of nine optimizable classification models to diagnose and predict eight anaemia types from Complete Blood Count (CBC) data. This study aimed to improve these classification models by oversampling the original imbalanced dataset with four algorithms related to the Synthetic Minority Over-sampling Technique (SMOTE). The results showed that the validation accuracy increased from 99.22% (Ensemble model) to 99.57% (Tree model), and most importantly, the False Discovery Rate (FDR) for the anaemia type with the highest FDR decreased from 23.1% to 1.5%.
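
The two quantities the abstract relies on can be illustrated with a minimal sketch. It is written in Python with NumPy purely for illustration and is not the authors' implementation. The function names `smote_oversample` and `false_discovery_rate`, the toy data, and the default parameters are assumptions. The first function shows the core idea of basic SMOTE: each synthetic sample is placed at a random point on the segment between a minority-class sample and one of its k nearest minority-class neighbours. The second computes the False Discovery Rate of one class as FP / (FP + TP).

```python
import numpy as np


def smote_oversample(X_minority, n_synthetic, k=5, seed=0):
    """Create synthetic minority samples by interpolation (basic SMOTE idea).

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X_minority, dtype=float)
    n = len(X)
    if n < 2:
        raise ValueError("need at least two minority samples to interpolate")
    k = min(k, n - 1)

    # Pairwise Euclidean distances inside the minority class.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a sample is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a base sample, one of its k neighbours, and a random gap in [0, 1).
    base = rng.integers(0, n, size=n_synthetic)
    picked = neighbours[base, rng.integers(0, k, size=n_synthetic)]
    gap = rng.random((n_synthetic, 1))
    return X[base] + gap * (X[picked] - X[base])


def false_discovery_rate(y_true, y_pred, label):
    """FDR for one class: false positives / all samples predicted as that class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    predicted = y_pred == label
    if predicted.sum() == 0:
        return 0.0  # nothing predicted as this class, so no false discoveries
    false_positives = np.sum(predicted & (y_true != label))
    return float(false_positives / predicted.sum())


if __name__ == "__main__":
    # Toy minority class with two features (made-up values, not CBC data).
    minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(smote_oversample(minority, n_synthetic=3, k=2))
    print(false_discovery_rate(["a", "a", "b", "b"], ["a", "b", "b", "b"], "b"))
```

Because every synthetic point is a convex combination of two existing minority samples, it stays inside the region those samples already span. In practice the oversampling should be applied to the training data only, so that validation results are not inflated by synthetic copies.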

Article DOI & Crossmark Data

DOI : https://doi.org/10.33543/j.1402.469475
