Improvement Model for Speaker Recognition using MFCC-CNN and Online Triplet Mining

Ayu Wirdiani (1), Steven Ndung'u Machetho (2), I Ketut Gede Darma Putra (1), Made Sudarma (3), Rukmi Sari Hartati (3), Henrico Aldy Ferdian (1)
(1) Information Technology, Udayana University, Kampus Bukit Jimbaran, Badung, 80361, Indonesia
(2) Computer Science Department, University of Groningen, Nijenborgh 9, 9747 AG Groningen, Netherlands
(3) Electrical Engineering, Udayana University, Kampus Bukit Jimbaran, Badung, 80361, Indonesia
How to cite (IJASEIT) :
Wirdiani, Ayu, et al. “Improvement Model for Speaker Recognition Using MFCC-CNN and Online Triplet Mining”. International Journal on Advanced Science, Engineering and Information Technology, vol. 14, no. 2, Apr. 2024, pp. 420–427, doi:10.18517/ijaseit.14.2.19396.
Various biometric security systems have been developed, including face, fingerprint, voice, hand-geometry, and iris recognition. Beyond serving as a communication medium, the human voice is also a biometric trait that can be used for identification: it carries characteristics unique enough to distinguish one person from another. A speaker recognition system must therefore capture the features that characterize an individual's voice. This study develops a speaker recognition system based on a Convolutional Neural Network (CNN) and proposes improvements to the fine-tuning layers of the CNN architecture to increase accuracy. The system combines Mel Frequency Cepstral Coefficients (MFCC) for feature extraction from raw audio, a CNN trained with triplet loss to produce 128-dimensional embeddings, and K-Nearest Neighbors (KNN) to classify those embeddings. Experiments were conducted on 50 speakers from the TIMIT dataset, with eight utterances per speaker, and on 60 speakers recorded live with a smartphone. The proposed system achieves high recognition accuracy. Future work could combine several biometric modalities, commonly known as multimodal biometrics, to improve recognition accuracy further.
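The core of the training scheme named in the title, online triplet mining, selects triplets inside each mini-batch rather than precomputing them. A common variant is "batch-hard" mining: for every anchor embedding, take the farthest same-speaker sample as the positive and the closest different-speaker sample as the negative, then apply the margin-based triplet loss. The paper does not publish its exact mining code, so the following NumPy sketch is an illustrative implementation of the generic batch-hard strategy, not the authors' implementation; the margin value 0.2 is an assumed default.

```python
import numpy as np

def pairwise_distances(embeddings):
    # Euclidean distance between every pair of row vectors,
    # computed from the Gram matrix: ||a-b||^2 = ||a||^2 - 2a.b + ||b||^2.
    dot = embeddings @ embeddings.T
    sq = np.diag(dot)
    d2 = sq[:, None] - 2.0 * dot + sq[None, :]
    return np.sqrt(np.maximum(d2, 0.0))   # clamp tiny negatives from rounding

def batch_hard_triplet_loss(embeddings, labels, margin=0.2):
    """Batch-hard online triplet mining.

    For each anchor in the batch, the hardest positive is the farthest
    embedding with the same speaker label, and the hardest negative is
    the closest embedding with a different label.
    """
    d = pairwise_distances(embeddings)
    same = labels[:, None] == labels[None, :]
    np.fill_diagonal(same, False)          # an anchor is not its own positive
    diff = ~ (labels[:, None] == labels[None, :])

    hardest_pos = np.where(same, d, -np.inf).max(axis=1)
    hardest_neg = np.where(diff, d, np.inf).min(axis=1)
    losses = np.maximum(hardest_pos - hardest_neg + margin, 0.0)
    return losses.mean()
```

In the full pipeline this loss would be applied to the CNN's 128-dimensional embedding batches during training; at test time the trained embeddings are classified with KNN. A well-separated batch (same-speaker points close together, different speakers far apart) yields zero loss, while a collapsed batch yields a loss equal to the margin.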


This work is licensed under a Creative Commons Attribution 4.0 International License.
