Development of a Python Library to Generate Synthetic Datasets for Artificial Intelligence Education

Seul-ki Kim (1), Yong-ju Jeon (2)
(1) Department of Computer Education, Korea National University of Education, Cheongju, Chungbuk, 28173, Republic of Korea
(2) Department of Computer Education, Andong National University, Gyeongdongro 1375, Andong, GyeongBuk, 36723, Republic of Korea
How to cite (IJASEIT):
Kim, Seul-ki, and Yong-ju Jeon. “Development of a Python Library to Generate Synthetic Datasets for Artificial Intelligence Education”. International Journal on Advanced Science, Engineering and Information Technology, vol. 14, no. 3, June 2024, pp. 936-45, doi:10.18517/ijaseit.14.3.18158.
This study aims to improve the quality of AI education by developing a library of educational datasets and examining its applicability in the classroom. Reflecting the needs of teachers who run AI educational activities, the library offers datasets that vary in topic, form, and size. It can also generate outliers and missing values, so that datasets can be made accessible to students and suited to the educational purpose. The library is written in Python and uses random forest modeling to generate high-quality synthetic datasets. An expert panel evaluated its functionality and suitability for AI education and endorsed its use in the field. In detailed assessments, the generated synthetic datasets accurately mirrored the statistical characteristics of the original datasets, and the modeling results showed high accuracy and high cosine similarity. These results indicate that the library can reconstruct existing datasets for AI education and produce high-quality synthetic datasets. The approach offers a practical response to the current shortage of educational datasets and can improve the quality of instruction. The work benefits both educators and learners, lays a foundation for further research on creating and using educational datasets in AI, and widens the scope of their application in education.
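The abstract does not show the library's interface, so the following is only a minimal sketch of the ideas it names, not the authors' implementation. It assumes a sequential scheme in which each numeric column is predicted from the preceding columns with scikit-learn's `RandomForestRegressor`, then injects missing values and outliers, and finally compares column means by cosine similarity as one possible fidelity check. The function names (`synthesize`, `add_missing_and_outliers`, `cosine_similarity`), the toy columns, and all rates are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def synthesize(df, n_rows, seed=0):
    """Hypothetical helper: synthesize numeric columns one after another."""
    rng = np.random.default_rng(seed)
    cols = list(df.columns)
    # First column: resample observed values (bootstrap of the marginal).
    synth = pd.DataFrame({cols[0]: rng.choice(df[cols[0]].to_numpy(), size=n_rows)})
    for i, col in enumerate(cols[1:], start=1):
        prev = cols[:i]
        model = RandomForestRegressor(n_estimators=50, random_state=seed)
        model.fit(df[prev], df[col])
        # Resampled in-sample residuals restore some spread around the
        # forest's prediction (a simplification; they understate true noise).
        residuals = df[col].to_numpy() - model.predict(df[prev])
        synth[col] = model.predict(synth[prev]) + rng.choice(residuals, size=n_rows)
    return synth


def add_missing_and_outliers(df, column, missing_rate=0.05, outlier_rate=0.02, seed=0):
    """Hypothetical helper: blank out some cells and push others far from the mean."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    out[column] = out[column].astype(float)
    n = len(out)
    n_out = int(round(n * outlier_rate))
    n_miss = int(round(n * missing_rate))
    # Draw disjoint row positions for outliers and missing values.
    idx = rng.choice(n, size=n_out + n_miss, replace=False)
    out_idx, miss_idx = idx[:n_out], idx[n_out:]
    mean, std = out[column].mean(), out[column].std()
    col_pos = out.columns.get_loc(column)
    out.iloc[out_idx, col_pos] = mean + 5 * std  # assumed outlier magnitude
    out.iloc[miss_idx, col_pos] = np.nan
    return out


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy "original" dataset (invented for illustration only).
rng = np.random.default_rng(42)
hours = rng.uniform(0, 10, size=300)
score = 40 + 5 * hours + rng.normal(0, 3, size=300)
original = pd.DataFrame({"study_hours": hours, "score": score})

synthetic = synthesize(original, n_rows=200)
noisy = add_missing_and_outliers(synthetic, "score")
similarity = cosine_similarity(original.mean(), synthetic.mean())
print(f"cosine similarity of column means: {similarity:.4f}")
```

In a classroom setting, `synthetic` would serve as the clean dataset and `noisy` as the variant for practicing data cleaning; the rates control how much preprocessing the students have to do.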

This work is licensed under a Creative Commons Attribution 4.0 International License.
