Text and Sound-Based Feature Extraction and Speech Emotion Classification for Korean

Jaechoon Jo (1), Soo Kyun Kim (2), Yeo-chan Yoon (3)
(1) Department of Computer Education, Jeju National University, 63243 Jeju, Republic of Korea
(2) Department of Computer Education, Jeju National University, 63243 Jeju, Republic of Korea
(3) Department of Artificial Intelligence, Jeju National University, 63243 Jeju, Republic of Korea
How to cite (IJASEIT):
Jo, Jaechoon, et al. “Text and Sound-Based Feature Extraction and Speech Emotion Classification for Korean”. International Journal on Advanced Science, Engineering and Information Technology, vol. 14, no. 3, June 2024, pp. 873-879, doi:10.18517/ijaseit.14.3.18544.
Human emotions conveyed through speech carry information that machines still struggle to interpret, and this study addresses that gap through Speech Emotion Recognition (SER) for human-computer interaction using recent artificial intelligence techniques. Focusing on the auditory attributes of speech, such as tone, pitch, and rhythm, the study proposes an approach that combines deep learning with the Learnable Frontend for Audio Classification (LEAF) and wav2vec 2.0 pre-trained on a large corpus, applied specifically to Korean voice samples. The goal is to show that these models can process complex vocal expressions and thereby raise the precision of emotion classification. By emphasizing auditory emotion cues, the work aims to make machine interactions more intuitive and empathetic in applications such as healthcare and customer service. The results confirm the effectiveness of transformer-based models, in particular wav2vec 2.0 and LEAF, at capturing the subtle emotional states expressed in speech, underscoring the value of auditory cues relative to conventional visual and textual indicators. These findings motivate further research on AI systems capable of nuanced emotion detection and, ultimately, more natural and human-centric interaction between people and machines, a prerequisite for empathetic AI that can understand and respond to human emotions in everyday use.
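
The sketch below illustrates the general shape of a wav2vec 2.0-based emotion classifier of the kind the abstract describes; it is a minimal, hypothetical example, not the authors' implementation. It assumes PyTorch and the Hugging Face transformers package, uses the multilingual facebook/wav2vec2-large-xlsr-53 checkpoint as a stand-in for the pre-trained encoder, and picks a five-class emotion set, mean pooling, and head sizes purely for illustration; the LEAF frontend used in the paper is omitted.

# Minimal, hypothetical sketch -- not the authors' implementation.
# Assumptions: PyTorch, Hugging Face "transformers", the multilingual
# "facebook/wav2vec2-large-xlsr-53" checkpoint, a 5-class emotion set,
# mean pooling, and illustrative head sizes.
import torch
import torch.nn as nn
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model


class SpeechEmotionClassifier(nn.Module):
    """Mean-pools wav2vec 2.0 frame embeddings and maps them to emotion logits."""

    def __init__(self, checkpoint="facebook/wav2vec2-large-xlsr-53", num_emotions=5):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(checkpoint)
        hidden = self.encoder.config.hidden_size
        self.head = nn.Sequential(
            nn.Linear(hidden, 256),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(256, num_emotions),
        )

    def forward(self, input_values, attention_mask=None):
        # Frame-level representations: (batch, frames, hidden)
        frames = self.encoder(input_values, attention_mask=attention_mask).last_hidden_state
        utterance = frames.mean(dim=1)   # utterance-level embedding
        return self.head(utterance)      # emotion logits


if __name__ == "__main__":
    # Raw 16 kHz waveform normalization, as wav2vec 2.0 expects.
    extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000,
                                         padding_value=0.0, do_normalize=True)
    model = SpeechEmotionClassifier().eval()

    waveform = torch.randn(16000 * 3)    # stand-in for a 3-second Korean utterance
    inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

    with torch.no_grad():
        logits = model(inputs.input_values)
    print(logits.shape)                  # torch.Size([1, 5])

In practice the classification head would be trained, and the encoder optionally fine-tuned, on labeled Korean emotional speech rather than the random waveform used here for the shape check.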

This work is licensed under a Creative Commons Attribution 4.0 International License.
