Ladislav Végh
Improving machine learning classification models for anaemia type prediction by oversampling imbalanced complete blood count data with SMOTE-based algorithms
- Authors Details:
- Ladislav Végh
- Norbert Annuš
- Krisztina Czakóová
- Ondrej Takáč
Journal title : AD ALTA: Journal of Interdisciplinary Research
Publisher : MAGNANIMITAS
Print ISSN : 1804-7890
Page Number : 469-475
Journal volume : 14
Journal issue : 2
Original Article
Computer-assisted disease diagnosis is cost-effective and time-saving, increasing accuracy and reducing the need for an additional workforce in medical decision-making. In our prior research, we trained, tested, and compared the accuracies of nine optimizable classification models to diagnose and predict eight anaemia types from Complete Blood Count (CBC) data. This study aimed to improve these classification models by oversampling the original imbalanced dataset with four algorithms related to the Synthetic Minority Over-sampling Technique (SMOTE). The results showed that the validation accuracy increased from 99.22% (Ensemble model) to 99.57% (Tree model), and most importantly, the False Discovery Rate (FDR) for the anaemia type with the highest FDR decreased from 23.1% to 1.5%.
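The two technical ideas in the abstract, SMOTE-style oversampling and the per-class False Discovery Rate, can be illustrated with a minimal sketch. It is not the authors' implementation: the choice of Python with NumPy, the function names `smote_oversample` and `false_discovery_rate`, the parameter `k`, and the toy numbers are all assumptions made for illustration. The sketch follows the basic SMOTE idea of interpolating between a minority sample and one of its nearest minority neighbours, and computes FDR as FP / (FP + TP) for each predicted class.

```python
import numpy as np


def smote_oversample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by SMOTE-style interpolation."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    if n < 2:
        raise ValueError("need at least two minority samples to interpolate")
    k = min(k, n - 1)  # cannot use more neighbours than there are other samples

    # Pairwise Euclidean distances within the minority class
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a sample is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)   # randomly chosen minority samples
    pick = rng.integers(0, k, size=n_new)   # one of the k neighbours of each
    nb = neighbours[base, pick]
    gap = rng.random((n_new, 1))            # interpolation weight in [0, 1)
    return X_min[base] + gap * (X_min[nb] - X_min[base])


def false_discovery_rate(conf_mat):
    """Per-class FDR = FP / (FP + TP); rows are true classes, columns predicted."""
    conf_mat = np.asarray(conf_mat, dtype=float)
    tp = np.diag(conf_mat)
    predicted = conf_mat.sum(axis=0)        # TP + FP for each predicted class
    fp = predicted - tp
    # Classes that were never predicted get an FDR of 0 instead of a division error
    return np.divide(fp, predicted, out=np.zeros_like(fp), where=predicted > 0)


# Hypothetical toy data, not taken from the study
X_minority = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [2.0, 2.0]])
synthetic = smote_oversample(X_minority, n_new=6, k=2)
fdr = false_discovery_rate([[90, 10], [3, 7]])
```

Because every synthetic sample lies on a line segment between two existing minority samples, oversampling adds no values outside the observed range of the minority class. In practice, synthetic samples are usually generated from the training data only, so that validation results are not inflated by near-duplicates of validation samples.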
Article DOI & Crossmark Data
DOI : https://doi.org/10.33543/j.1402.469475