BEYOND WORDS: HARNESSING SPEECH SOUND FOR SPEAKER AGE AND GENDER DETECTION USING 1D CNN ARCHITECTURE WITH SELF-ATTENTION MECHANISM

(Received: 22-Dec.-2023, Revised: 9-Mar.-2024, Accepted: 20-Mar.-2024)

Authors Umniah Hameed Jaid, Alia Karim Abdulhasan

Keywords Speaker age, Speaker gender, Speaker profiling, Wav2vec embedding, Attention mechanism

Abstract Beyond the immediate content of speech, the voice carries rich information about a speaker's demographics, including age and gender. Estimating a speaker's age and gender has a wide range of applications, from voice forensic analysis to personalized advertising, healthcare monitoring and human-computer interaction. However, pinpointing a speaker's precise age remains difficult because of age ambiguity: utterances from individuals of adjacent ages are frequently indistinguishable. To address this, we propose a novel end-to-end approach that transforms raw audio from Mozilla's Common Voice dataset into high-quality feature representations using Wav2Vec 2.0 embeddings. These embeddings are then fed into our self-attention-based convolutional neural network (CNN) model. To mitigate age ambiguity, we evaluate the effects of different loss functions, such as focal loss and Kullback-Leibler (KL) divergence loss. Additionally, we evaluate estimation accuracy at different speech durations. Experimental results on the Common Voice dataset underscore the efficacy of our approach, showing an age-estimation accuracy of 87% for male speakers, 91% for female speakers and 89% overall, as well as an accuracy of 99.1% for gender prediction.
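
The two loss functions named in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation used in the paper, so everything below is an assumption: the focal loss is written in its common unweighted form, the KL divergence is computed against a Gaussian-smoothed label distribution centred on the true age class (one plausible way to encode age ambiguity), and the names `focal_loss`, `soft_age_labels`, `kl_divergence`, `gamma` and `sigma` are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def focal_loss(probs, target, gamma=2.0):
    """Unweighted focal loss for one sample: -(1 - p_t)^gamma * log(p_t).

    gamma=0 reduces to ordinary cross-entropy; larger gamma down-weights
    samples the model already classifies confidently.
    """
    p_t = probs[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

def soft_age_labels(target, num_classes, sigma=1.0):
    """Assumed label encoding: a Gaussian over age classes centred on the
    true class, so neighbouring ages receive some probability mass."""
    weights = [math.exp(-((k - target) ** 2) / (2.0 * sigma ** 2))
               for k in range(num_classes)]
    total = sum(weights)
    return [w / total for w in weights]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) = sum_k p_k * log(p_k / q_k), with eps for stability."""
    return sum(pk * math.log((pk + eps) / (qk + eps))
               for pk, qk in zip(p, q) if pk > 0)

# Illustrative values only: four age classes, true class index 1.
logits = [0.2, 2.0, 1.5, -1.0]
target = 1
probs = softmax(logits)

cross_entropy = -math.log(probs[target])
fl = focal_loss(probs, target)             # smaller than cross_entropy here
soft = soft_age_labels(target, len(logits))
kl = kl_divergence(soft, probs)            # penalises mismatch with soft labels
```

Under these assumptions, a prediction that lands on an adjacent age class is penalised less by the KL term than one that lands far from the true class, which is the intuition behind using a distribution-based loss for ambiguous age labels.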
