Performance Analysis of Random Forest and CatBoost Models with the SMOTE Technique in Diabetes Risk Prediction

Rony Irfannandhy; Lekso Budi Handoko; Noval Ariyanto

doi:10.29408/edumatic.v8i2.27990

Authors

Rony Irfannandhy, Informatics Engineering Study Program, Universitas Dian Nuswantoro, https://orcid.org/0009-0008-1341-9384
Lekso Budi Handoko, Informatics Engineering Study Program, Universitas Dian Nuswantoro, https://orcid.org/0000-0001-7500-029X
Noval Ariyanto, Informatics Engineering Study Program, Universitas Dian Nuswantoro, https://orcid.org/0009-0005-8277-7650

Keywords:

CatBoost, diabetes, machine learning, prediction model, random forest

Abstract

The prevalence of diabetes mellitus (DM) is increasing globally, making it a serious health problem. Early detection reduces long-term complications. This study evaluates and compares the effectiveness of Random Forest (RF) and CatBoost models combined with the Synthetic Minority Over-sampling Technique (SMOTE) in predicting DM risk, comparing the two models on test data in terms of precision, recall, F1-score, and accuracy. The research is quantitative, and its method comprises exploratory data analysis (EDA), data transformation, splitting the data into training and test sets, implementing RF and CatBoost with SMOTE, and evaluating model performance. The dataset, obtained from Kaggle, contains health records for 768 individuals with eight independent variables (pregnancies, glucose, blood pressure, skin thickness, insulin, body mass index (BMI), DM history, and age) and one target variable (outcome) indicating DM status. SMOTE was applied to balance the class distribution and improve the representation of the minority class, with the aim of making the prediction models more accurate and stable. The SMOTE-RF model achieved 82% accuracy and the SMOTE-CatBoost model 81%. According to the feature importance analysis, the main variables driving DM risk prediction in both models are glucose, BMI, and age. Glucose is the primary DM risk indicator and can be used to make prediction more efficient. In practice, improved early detection through machine learning has the potential to support more accurate clinical decision making by doctors and to help prevent more serious complications of diabetes mellitus.
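To make two of the ideas in the abstract concrete, the following is a minimal pure-Python sketch of (a) the interpolation step behind SMOTE and (b) the four reported metrics computed from binary predictions. It is not the authors' code: the study presumably relied on library implementations of SMOTE, RF, and CatBoost, which are not reproduced here. The function names `smote_oversample` and `binary_metrics`, the neighbour count `k`, and all numbers are assumptions chosen for illustration only, not data or results from the paper.

```python
import random


def smote_oversample(minority, n_new, k=5, seed=0):
    """Create n_new synthetic samples from a list of minority-class points.

    Basic SMOTE idea: pick a minority sample, pick one of its k nearest
    minority neighbours, and place a new point at a random position on the
    line segment between them. Requires at least two minority points.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest minority neighbours of `base` by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # fraction of the way from base to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


def binary_metrics(y_true, y_pred):
    """Precision, recall, F1-score and accuracy for labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Made-up toy values (two features per sample), not taken from the dataset.
toy_minority = [[150.0, 34.0], [180.0, 30.0], [140.0, 40.0], [165.0, 31.0]]
toy_synthetic = smote_oversample(toy_minority, n_new=3, k=2)
print(toy_synthetic)

# Made-up labels and predictions, only to exercise the metric formulas.
toy_true = [1, 1, 1, 0, 0, 0, 0, 1]
toy_pred = [1, 1, 0, 0, 0, 1, 0, 1]
print(binary_metrics(toy_true, toy_pred))
```

In a workflow like the one the abstract describes, oversampling of this kind would normally be applied to the training split only, so that the test set keeps its original class distribution when the metrics are computed.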
