Leveraging Data Lake Architecture for Predicting Academic Student Performance

Shameen Aina Abdul Rahim (1), Fatimah Sidi (2), Lilly Suriani Affendey (3), Iskandar Ishak (4), Appak Yessirkep Nurlankyzy (5)
(1) Department of Computer Science, Faculty of Computer Science and Information Technology, Universiti Putra Malaysia, Serdang, Malaysia
(2) Department of Computer Science, Faculty of Computer Science and Information Technology, Universiti Putra Malaysia, Serdang, Malaysia
(3) Department of Computer Science, Faculty of Computer Science and Information Technology, Universiti Putra Malaysia, Serdang, Malaysia
(4) Department of Computer Science, Faculty of Computer Science and Information Technology, Universiti Putra Malaysia, Serdang, Malaysia
(5) Department of Computer Science, Faculty of Information Technologies, L. N. Gumilyov Eurasian National University, Kazakhstan
How to cite (IJASEIT):
Abdul Rahim, Shameen Aina, et al. “Leveraging Data Lake Architecture for Predicting Academic Student Performance”. International Journal on Advanced Science, Engineering and Information Technology, vol. 14, no. 6, Dec. 2024, pp. 2121-9, doi:10.18517/ijaseit.14.6.12408.
In today's rapidly evolving higher-education landscape, managing and analyzing academic data has become increasingly challenging, particularly with respect to the 3Vs of Big Data: volume, variety, and velocity. The amount of data produced by educational institutions, including student records, has increased dramatically. This data originates from various sources, such as learning management systems and student information systems, and takes several forms. Data analytics and predictive modeling have therefore become increasingly significant in education for gaining insight into student performance, for example by identifying at-risk students who are most likely to fail their courses. This study proposes a novel approach for predicting student academic performance, particularly identifying at-risk students, by leveraging a data lake architecture. The proposed methodology comprises the ingestion, transformation, and quality assessment of a combined data source from Universiti Putra Malaysia's Student Information System and learning management system within the data lake environment. With its parallel processing capabilities, this centralized data repository facilitates the training and evaluation of various machine learning models for prediction. Machine learning algorithms such as Support Vector Classifier, Naive Bayes, and Decision Trees are used to build the prediction models, drawing on the data lake's scalability and parallel processing capabilities. This study lays the groundwork for using data lake architecture to improve students' performance.
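The stages named in the abstract (ingestion, transformation, quality assessment, and model training) can be illustrated with a minimal sketch. This is not the authors' implementation: plain Python lists stand in for the data lake, the records, field names (`student_id`, `cgpa`, `logins`, `at_risk`), and helper functions are all hypothetical, and a hand-written Gaussian Naive Bayes is used only because Naive Bayes is one of the algorithm families the abstract mentions.

```python
import math
from collections import defaultdict

# Hypothetical records; field names and values are illustrative assumptions.
SIS = [  # student information system extract
    {"student_id": "S1", "cgpa": 3.6, "at_risk": 0},
    {"student_id": "S2", "cgpa": 1.9, "at_risk": 1},
    {"student_id": "S3", "cgpa": 3.1, "at_risk": 0},
    {"student_id": "S4", "cgpa": None, "at_risk": 1},  # missing value
    {"student_id": "S5", "cgpa": 2.0, "at_risk": 1},
    {"student_id": "S6", "cgpa": 3.4, "at_risk": 0},
]
LMS = [  # learning management system extract
    {"student_id": "S1", "logins": 42},
    {"student_id": "S2", "logins": 5},
    {"student_id": "S3", "logins": 30},
    {"student_id": "S4", "logins": 3},
    {"student_id": "S5", "logins": 8},
    {"student_id": "S6", "logins": 38},
]
FEATURES = ["cgpa", "logins"]


def ingest_and_join(sis, lms):
    """Ingestion + transformation: inner-join both sources on student_id."""
    lms_by_id = {row["student_id"]: row for row in lms}
    return [
        {**s, **lms_by_id[s["student_id"]]}
        for s in sis
        if s["student_id"] in lms_by_id
    ]


def completeness(rows, fields):
    """Quality assessment: fraction of non-missing cells in the given fields."""
    total = len(rows) * len(fields)
    filled = sum(1 for r in rows for f in fields if r.get(f) is not None)
    return filled / total if total else 0.0


def drop_incomplete(rows, fields):
    """Keep only rows where every listed field is present."""
    return [r for r in rows if all(r.get(f) is not None for f in fields)]


def fit_gaussian_nb(rows, features, label):
    """Estimate per-class prior, feature means, and feature variances."""
    by_class = defaultdict(list)
    for r in rows:
        by_class[r[label]].append([r[f] for f in features])
    model = {}
    for cls, samples in by_class.items():
        n = len(samples)
        cols = list(zip(*samples))
        means = [sum(col) / n for col in cols]
        # Small constant avoids zero variance on tiny samples.
        variances = [
            sum((x - m) ** 2 for x in col) / n + 1e-6
            for col, m in zip(cols, means)
        ]
        model[cls] = (n / len(rows), means, variances)
    return model


def predict(model, x):
    """Return the class with the highest log-posterior for feature vector x."""
    def log_posterior(prior, means, variances):
        score = math.log(prior)
        for xi, m, v in zip(x, means, variances):
            score += -0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
        return score
    return max(model, key=lambda cls: log_posterior(*model[cls]))


joined = ingest_and_join(SIS, LMS)
quality = completeness(joined, FEATURES)
clean = drop_incomplete(joined, FEATURES)
model = fit_gaussian_nb(clean, FEATURES, "at_risk")
```

Calling `predict(model, [1.8, 4])` classifies a hypothetical student with a low grade point average and few logins. In a real data lake deployment the join and quality checks would run on a distributed engine rather than in-memory lists, and the classifier would come from an established library.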

This work is licensed under a Creative Commons Attribution 4.0 International License.
