Enhancing the Readability of Academic Data for Machine Learning through Preprocessing Techniques

Anna Mayyah Soraya, Faisal Rahutomo, Miftahul Anwar

Abstract

Academic data plays a central role in supporting decision-making in educational institutions. However, the success of machine learning in analyzing academic data and making predictions from it depends heavily on the quality and readability of that data, so careful preprocessing is essential to harness machine learning's full potential. This research designs and implements three preprocessing techniques for academic data: imputation, winsorizing, and dropping duplicate records. Missing values are handled with the Multivariate Imputation by Chained Equations (MICE) method using three different algorithms (linear regression, random forest, and KNN), and the accuracy of the three algorithms in predicting missing values is compared. In addition, winsorizing is applied to outliers, and duplicate records are dropped. Based on testing with evaluation metrics, these preprocessing techniques improve model accuracy by 0.037 for MAE, 0.11 for RMSE, and 0.006 for MSE. The processed data allows the model to function more optimally and produce more reliable results.
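The preprocessing steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: all function names are hypothetical, the winsorizing percentiles are assumed, and the chained-equation step is simplified to two numeric columns with a deterministic linear-regression imputer (the random forest and KNN variants are not shown). It uses only the Python standard library.

```python
import math


def drop_duplicates(rows):
    """Keep the first occurrence of each row, preserving order."""
    seen, unique = set(), []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


def winsorize(values, lower=0.05, upper=0.95):
    """Clip values to assumed lower/upper percentile bounds."""
    ordered = sorted(values)
    n = len(ordered)
    lo = ordered[int(lower * (n - 1))]
    hi = ordered[int(upper * (n - 1))]
    return [min(max(v, lo), hi) for v in values]


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx if sxx else 0.0
    return my - b * mx, b


def chained_impute(rows, n_iter=10):
    """Fill None cells in two-column rows by alternating regressions."""
    cols = [list(c) for c in zip(*rows)]
    missing = [[v is None for v in c] for c in cols]
    # Start from the mean of the observed values in each column.
    for c, miss in zip(cols, missing):
        observed = [v for v in c if v is not None]
        mean = sum(observed) / len(observed)
        c[:] = [mean if m else v for v, m in zip(c, miss)]
    # Cycle over the columns, re-predicting each column's missing
    # cells from the current values of the other column.
    for _ in range(n_iter):
        for target in (0, 1):
            other = 1 - target
            idx = [i for i, m in enumerate(missing[target]) if not m]
            a, b = fit_line([cols[other][i] for i in idx],
                            [cols[target][i] for i in idx])
            for i, m in enumerate(missing[target]):
                if m:
                    cols[target][i] = a + b * cols[other][i]
    return list(zip(*cols))


def error_metrics(actual, predicted):
    """Return MAE, MSE, and RMSE for paired values."""
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / len(errors)
    mse = sum(e * e for e in errors) / len(errors)
    return {"MAE": mae, "MSE": mse, "RMSE": math.sqrt(mse)}


# Made-up records (not data from the study): two numeric fields,
# one duplicate row, and two missing cells marked with None.
records = [(1.0, 2.0), (2.0, 4.0), (3.0, None),
           (4.0, 8.0), (None, 10.0), (2.0, 4.0)]
cleaned = chained_impute(drop_duplicates(records))
```

Comparing imputers as the abstract describes would mean hiding some known values, imputing them with each algorithm, and passing the true and imputed values to `error_metrics`.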
