Tree-Based Ensemble Methods and Their Applications for Predicting Students’ Academic Performance

Indri Dayanah Ayulani (1), Agatha Melinda Yunawan (2), Tamara Prihutaminingsih (3), Devvi Sarwinda (4), Gianinna Ardaneswari (5), Bevina Desjwiandra Handari (6)
(1–6) Department of Mathematics, Universitas Indonesia, Kampus UI Depok, Depok, 16424, Indonesia
How to cite (IJASEIT) :
Ayulani, Indri Dayanah, et al. “Tree-Based Ensemble Methods and Their Applications for Predicting Students’ Academic Performance”. International Journal on Advanced Science, Engineering and Information Technology, vol. 13, no. 3, May 2023, pp. 919-27, doi:10.18517/ijaseit.13.3.16880.
Students’ academic performance is a key aspect of online learning success. Online learning platforms known as Learning Management Systems (LMS) record a wide range of online learning activities. In this research, students’ academic performance in online course X is predicted so that teachers can identify at-risk students much sooner. The predictions use tree-based ensemble methods: Random Forest, XGBoost (Extreme Gradient Boosting), and LightGBM (Light Gradient Boosting Machine). Random Forest is a bagging method, whereas XGBoost and LightGBM are boosting methods. The data were collected from LMS UI, also known as EMAS (e-Learning Management System), and consist of activity records for 232 students (219 passed, 13 failed) in course X. The data are divided into three train–test proportions (80:20, 70:30, and 60:40) and three periods (the first, first two, and first three months of the study period). The data are pre-processed using the SMOTE method to handle class imbalance, applied to all categories both with and without feature selection. The prediction results are compared to determine the best time to predict students’ academic performance and how well each model identifies unsuccessful students. The implementation results show that students’ academic performance can be predicted at the end of the second month, with best prediction rates of 86.8%, 80%, and 75% for the LightGBM, Random Forest, and XGBoost models, respectively, with feature selection. Thus, students at risk of failing still have time to improve their academic performance.
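The core of the pre-processing step above is SMOTE: synthetic minority samples are generated by interpolating between each minority point and one of its k nearest minority-class neighbours. A minimal illustrative sketch of that idea is below; the data, feature count, and parameter choices are invented for illustration and do not reproduce the authors' actual EMAS dataset or pipeline.

```python
import numpy as np

def smote_oversample(X_min, n_new, k=5, seed=None):
    """Generate n_new synthetic minority samples by interpolating between
    each minority point and one of its k nearest minority neighbours
    (the core idea of SMOTE)."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    # Pairwise distances within the minority class only.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # a point is not its own neighbour
    k = min(k, n - 1)
    neighbours = np.argsort(d, axis=1)[:, :k]   # k nearest per sample
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)                     # random minority sample
        j = neighbours[i, rng.integers(k)]      # one of its k neighbours
        gap = rng.random()                      # interpolation factor in [0, 1]
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)

# Toy imbalance with the same ratio as the course data: 219 pass, 13 fail.
rng = np.random.default_rng(0)
X_major = rng.normal(0.0, 1.0, size=(219, 4))   # hypothetical activity features
X_minor = rng.normal(3.0, 1.0, size=(13, 4))
X_syn = smote_oversample(X_minor, n_new=219 - 13, seed=1)
X_bal = np.vstack([X_major, X_minor, X_syn])
print(X_bal.shape)   # (438, 4): both classes now have 219 rows
```

The balanced set would then be fed to the tree-based classifiers (Random Forest, XGBoost, LightGBM) in place of the raw imbalanced data; oversampling is applied to the training split only, so the test split still reflects the true class ratio.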

F. Chen and Y. Cui, “Utilizing student time series behaviour in learning management systems for early prediction of course performance,” J. Learn. Anal., vol. 7, no. 2, pp. 1-17, 2020, doi: 10.18608/JLA.2020.72.1.

G. Akçapınar, A. Altun, and P. Aşkar, “Using learning analytics to develop early-warning system for at-risk students,” Int. J. Educ. Technol. High. Educ., vol. 16, no. 1, 2019, doi: 10.1186/s41239-019-0172-z.

E. Latif and S. Miles, “The Impact of Assignments and Quizzes on Exam Grades: A Difference-in-Difference Approach,” J. Stat. Educ., vol. 28, no. 3, 2020, doi: 10.1080/10691898.2020.1807429.

E. Alyahyan and D. Düştegör, “Predicting academic success in higher education: literature review and best practices,” Int. J. Educ. Technol. High. Educ., vol. 17, no. 1, 2020, doi: 10.1186/s41239-020-0177-7.

E. Popescu and F. Leon, “Predicting Academic Performance Based on Learner Traces in a Social Learning Environment,” IEEE Access, vol. 6, 2018, doi: 10.1109/ACCESS.2018.2882297.

S. Jayaprakash, S. Krishnan, and J. Jaiganesh, “Predicting Students Academic Performance using an Improved Random Forest Classifier,” 2020 Int. Conf. Emerg. Smart Comput. Informatics, ESCI 2020, pp. 238-243, 2020, doi: 10.1109/ESCI48226.2020.9167547.

M. M. De Oliveira, R. Barwaldt, M. R. Pias, and D. B. Espindola, “Understanding the Student Dropout in Distance Learning,” in Proceedings - Frontiers in Education Conference, FIE, 2019, vol. 2019-October, doi: 10.1109/FIE43999.2019.9028433.

S. Helal et al., “Predicting academic performance by considering student heterogeneity,” Knowledge-Based Syst., vol. 161, 2018, doi: 10.1016/j.knosys.2018.07.042.

Y. Zhao et al., “Ensemble learning predicts multiple sclerosis disease course in the SUMMIT study,” npj Digit. Med., vol. 3, no. 1, 2020, doi: 10.1038/s41746-020-00338-8.

T. H. Lee, A. Ullah, and R. Wang, “Bootstrap Aggregating and Random Forest,” Adv. Stud. Theor. Appl. Econom., vol. 52, pp. 389-429, 2020, doi: 10.1007/978-3-030-31150-6_13.

S. Rahman, M. Irfan, M. Raza, K. M. Ghori, S. Yaqoob, and M. Awais, “Performance analysis of boosting classifiers in recognizing activities of daily living,” Int. J. Environ. Res. Public Health, vol. 17, no. 3, 2020, doi: 10.3390/ijerph17031082.

J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques. 2012.

L. Breiman, “Random forests,” Mach. Learn., vol. 45, no. 1, 2001, doi: 10.1023/A:1010933404324.

D. Denisko and M. M. Hoffman, “Classification and interaction in random forests,” Proceedings of the National Academy of Sciences of the United States of America, vol. 115, no. 8. 2018, doi: 10.1073/pnas.1800256115.

K. Lin, Y. Hu, and G. Kong, “Predicting in-hospital mortality of patients with acute kidney injury in the ICU using random forest model,” Int. J. Med. Inform., vol. 125, 2019, doi: 10.1016/j.ijmedinf.2019.02.002.

L. E. Raileanu and K. Stoffel, “Theoretical comparison between the Gini Index and Information Gain criteria,” Ann. Math. Artif. Intell., vol. 41, no. 1, 2004, doi: 10.1023/B:AMAI.0000018580.96245.c6.

K. C. Dewi, H. Murfi, and S. Abdullah, “Analysis Accuracy of Random Forest Model for Big Data - A Case Study of Claim Severity Prediction in Car Insurance,” 2019, doi: 10.1109/ICSITech46713.2019.8987520.

T. Chen and C. Guestrin, “XGBoost: A scalable tree boosting system,” in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, vol. 13-17-August-2016, doi: 10.1145/2939672.2939785.

D. Zhang and Y. Gong, “The Comparison of LightGBM and XGBoost Coupling Factor Analysis and Prediagnosis of Acute Liver Failure,” IEEE Access, 2020, doi: 10.1109/ACCESS.2020.3042848.

S. Wang et al., “A new method of diesel fuel brands identification: SMOTE oversampling combined with XGBoost ensemble learning,” Fuel, vol. 282, 2020, doi: 10.1016/j.fuel.2020.118848.

W. Liang, S. Luo, G. Zhao, and H. Wu, “Predicting hard rock pillar stability using GBDT, XGBoost, and LightGBM algorithms,” Mathematics, vol. 8, no. 5, 2020, doi: 10.3390/MATH8050765.

C. Chen, Q. Zhang, Q. Ma, and B. Yu, “LightGBM-PPI: Predicting protein-protein interactions through LightGBM with multi-information fusion,” Chemom. Intell. Lab. Syst., vol. 191, 2019, doi: 10.1016/j.chemolab.2019.06.003.

G. Ke et al., “LightGBM: A highly efficient gradient boosting decision tree,” in Advances in Neural Information Processing Systems, 2017, vol. 2017-December.

A. A. Taha and S. J. Malebary, “An Intelligent Approach to Credit Card Fraud Detection Using an Optimized Light Gradient Boosting Machine,” IEEE Access, vol. 8, 2020, doi: 10.1109/ACCESS.2020.2971354.

A. S. Hussein, T. Li, C. W. Yohannese, and K. Bashir, “A-SMOTE: A new pre-processing approach for highly imbalanced datasets by improving SMOTE,” Int. J. Comput. Intell. Syst., vol. 12, no. 2, 2019, doi: 10.2991/ijcis.d.191114.002.

N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: Synthetic Minority Over-sampling Technique,” J. Artif. Intell. Res., vol. 16, pp. 321-357, 2002, doi: 10.1613/jair.953.

I. Düntsch and G. Gediga, “Confusion Matrices and Rough Set Data Analysis,” in Journal of Physics: Conference Series, 2019, vol. 1229, no. 1, doi: 10.1088/1742-6596/1229/1/012055.

J. D. Novaković, A. Veljović, S. S. Ilić, Ž. Papić, and T. Milica, “Evaluation of Classification Models in Machine Learning,” Theory Appl. Math. Comput. Sci., vol. 7, no. 1, 2017.

Q. Wu, F. Nasoz, J. Jung, B. Bhattarai, and M. V. Han, “Machine Learning Approaches for Fracture Risk Assessment: A Comparative Analysis of Genomic and Phenotypic Data in 5130 Older Men,” Calcif. Tissue Int., vol. 107, no. 4, 2020, doi: 10.1007/s00223-020-00734-y.

M. Q. R. Pembury Smith and G. D. Ruxton, “Effective use of the McNemar test,” Behav. Ecol. Sociobiol., vol. 74, no. 11, 2020, doi: 10.1007/s00265-020-02916-y.


This work is licensed under a Creative Commons Attribution 4.0 International License.
