Generative AI-Driven Multimodal Interaction System Integrating Voice and Motion Recognition