Generative AI-Driven Multimodal Interaction System Integrating Voice and Motion Recognition

DaeSung Jang (1), JongChan Kim (2)
(1)(2) Department of Computer Engineering, Sunchon National University, Jungang-ro, Suncheon-si, Republic of Korea
How to cite (IJASEIT):
D. Jang and J. Kim, “Generative AI-Driven Multimodal Interaction System Integrating Voice and Motion Recognition,” Int. J. Adv. Sci. Eng. Inf. Technol., vol. 15, no. 2, pp. 617–624, Apr. 2025.
This research proposes a two-way interactive algorithm that combines voice and motion recognition with generative AI technology to overcome the limitations of existing systems, which are restricted to recognizing simple commands. Voice and motion recognition are essential for interaction between smart devices and users and for enhancing the user experience, yet current technologies are largely confined to recognizing and executing predefined commands and therefore fail to meet users' diverse and complex needs. To address this problem, this research develops a technology that fuses and integrates voice and motion data using the advanced learning and prediction capabilities of generative AI, provides data tailored to each user's characteristics and situation in real time, and enables more natural and efficient interaction. The main research content includes developing data analysis and processing algorithms that integrate multiple input channels, designing a generative AI-based model for providing customized data to users, and implementing a two-way interactive system that maintains a natural conversational flow. In particular, the research combines generative AI language models with computer vision techniques to analyze user voice and motion data comprehensively, enabling smart devices to understand and respond to user intent accurately. These technologies have the potential to transform the user experience in areas such as smart homes, healthcare, and education. The results of this study are expected to contribute significantly to the development of next-generation smart device interaction systems that improve both the efficiency and the engagement of interactions.
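The abstract does not name a specific software stack, so the sketch below only illustrates the kind of fusion it describes: a spoken utterance and a motion cue are converted to text and a gesture label, merged into a single prompt, and passed to a generative model. It assumes Whisper for speech-to-text and MediaPipe Hands for gesture landmarks; the file names, the fingertip-above-wrist gesture rule, and the generate_response placeholder are illustrative assumptions, not the authors' implementation.

    # Minimal sketch of a voice + motion fusion pipeline (assumptions noted above).
    import cv2
    import mediapipe as mp
    import whisper


    def transcribe_speech(audio_path: str) -> str:
        """Convert a spoken utterance to text with a Whisper model (assumed ASR choice)."""
        model = whisper.load_model("base")
        return model.transcribe(audio_path)["text"].strip()


    def extract_gesture(frame_path: str) -> str:
        """Label a single video frame with a coarse gesture using MediaPipe Hands."""
        frame = cv2.imread(frame_path)
        if frame is None:
            return "no_frame"
        with mp.solutions.hands.Hands(static_image_mode=True, max_num_hands=1) as hands:
            result = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if not result.multi_hand_landmarks:
            return "no_gesture"
        landmarks = result.multi_hand_landmarks[0].landmark
        # Hypothetical rule: index fingertip (landmark 8) above the wrist (landmark 0)
        # is treated as "pointing up"; a real system would use a trained classifier.
        return "point_up" if landmarks[8].y < landmarks[0].y else "other_gesture"


    def generate_response(prompt: str) -> str:
        """Placeholder for the generative AI model; swap in an actual LLM call here."""
        return f"[model response to: {prompt}]"


    def interact(audio_path: str, frame_path: str) -> str:
        """Fuse the two modalities into one prompt and query the generative model."""
        text = transcribe_speech(audio_path)
        gesture = extract_gesture(frame_path)
        prompt = (
            f"The user said: '{text}'. "
            f"The user's detected gesture is: {gesture}. "
            "Respond naturally, taking both modalities into account."
        )
        return generate_response(prompt)


    if __name__ == "__main__":
        print(interact("utterance.wav", "frame.jpg"))

In a full system, generate_response would call the chosen language model and the hand-coded gesture rule would be replaced by a learned motion-recognition model; the sketch only shows how two input channels can be merged into a single prompt for two-way interaction.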



This work is licensed under a Creative Commons Attribution 4.0 International License.
