Machine Learning-Based Detection of Polycystic Ovary Syndrome (PCOS): Combining SMOTE, Random Forest, Gradient Boosting, and Bayesian Optimization
Abstract
Polycystic ovary syndrome (PCOS) is a common endocrine disorder among women of reproductive age. It can lead to ovulatory dysfunction, hormonal imbalance, and insulin resistance, and it increases the risk of cardiovascular disease, obesity, and psychological disorders. Despite its high prevalence, approximately 75% of PCOS cases remain undiagnosed in clinical practice because of the complexity of its symptoms and the limitations of current diagnostic methods. To address this issue, this study proposes a machine learning-based approach to improve the accuracy and efficiency of PCOS detection. It compares the performance of two supervised learning algorithms, random forest and gradient boosting, for PCOS prediction. The dataset was obtained from a public repository and contains various clinical features associated with PCOS. To handle class imbalance, the synthetic minority over-sampling technique (SMOTE) was applied to the training data. In addition, Bayesian optimization was used to tune the hyperparameters of each model. Model performance was evaluated with several metrics, with the area under the receiver operating characteristic curve (AUC-ROC) as the primary measure. The gradient boosting model achieved the best results, with an AUC of 0.8983 and a recall of 0.95, indicating high sensitivity in identifying positive PCOS cases. These findings indicate that combining SMOTE with Bayesian optimization is effective in improving predictive accuracy, especially on imbalanced medical datasets. The proposed approach shows promise for integration into clinical decision-support systems to support earlier and more reliable PCOS screening.
Keywords: Bayesian optimization; gradient boosting; PCOS; random forest; SMOTE.
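The abstract names two steps that are easy to illustrate in isolation: SMOTE oversampling of the training data and AUC-ROC scoring. The following is a minimal sketch of both, not the authors' implementation. It uses only the Python standard library. The function names `smote` and `roc_auc`, the toy feature values, and the choice of `k` are assumptions made for illustration. The model training and Bayesian hyperparameter tuning steps are not shown.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples (hypothetical helper).

    Each synthetic point lies on the line segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point, excluding itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random
    negative, with ties counted as one half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy training split with made-up values: 6 negatives, 3 positives.
train_neg = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.2, 0.4], [0.4, 0.2], [0.1, 0.4]]
train_pos = [[0.8, 0.9], [0.9, 0.7], [0.7, 0.8]]

# Oversample only the training positives until both classes are the same size.
new_pos = smote(train_pos, n_new=len(train_neg) - len(train_pos), k=2)
balanced_pos = train_pos + new_pos

# Score hypothetical model outputs on a held-out set that was not oversampled.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Oversampling is applied to the training split only, as the abstract states. The held-out data keeps its original class ratio, so the AUC reflects performance on the real class distribution.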






