
Speech Emotion Recognition with Complementary Feature Learning Framework and Attentional Feature Fusion Module

Peiyao HUANG¹, Huihui CHENG², Xiaoyu TANG¹,²
1. School of Electronics and Information Engineering, Faculty of Engineering, South China Normal University, Foshan, Guangdong 528225, China
2. School of Physics, South China Normal University, Guangzhou, Guangdong 510006, China

Abstract

To address the limitation that existing deep learning feature extraction methods fail to comprehensively extract and effectively integrate emotional features from speech, this paper proposes a novel speech emotion recognition model that integrates a complementary feature learning framework with an attentional feature fusion module. The complementary feature learning framework consists of two independent representation extraction branches and an interactive complementary representation extraction branch, covering both the independent and the complementary representations of emotional features. To further improve performance, the attentional feature fusion module assigns weights to the different representations according to their contribution to emotion classification, so that the model focuses on the features most beneficial for emotion recognition. Experiments on two public emotion databases (Emo-DB and IEMOCAP) validate the robustness and effectiveness of the proposed model.
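For readers who want a concrete picture, the sketch below shows one way such a three-branch design with attention-based fusion could be wired up in PyTorch. It is a minimal illustration under stated assumptions, not the authors' published implementation: the inputs are assumed to be two pooled, fixed-size acoustic feature vectors per utterance (e.g., spectral and prosodic streams), the interactive branch is realized by simple concatenation, and all module names and dimensions are hypothetical.

    import torch
    import torch.nn as nn

    class AttentionFeatureFusion(nn.Module):
        """Assigns each branch a weight reflecting its estimated contribution
        to emotion classification (illustrative stand-in for the paper's module)."""
        def __init__(self, dim: int):
            super().__init__()
            self.score = nn.Linear(dim, 1)  # one scalar score per branch feature

        def forward(self, feats):
            # feats: list of (batch, dim) tensors, one per branch
            stacked = torch.stack(feats, dim=1)                  # (batch, branches, dim)
            weights = torch.softmax(self.score(stacked), dim=1)  # (batch, branches, 1)
            return (weights * stacked).sum(dim=1)                # weighted sum: (batch, dim)

    class ComplementarySERModel(nn.Module):
        """Two independent branches plus an interactive complementary branch,
        fused by the attention module above."""
        def __init__(self, in_dim_a: int, in_dim_b: int, dim: int = 128, n_classes: int = 4):
            super().__init__()
            self.branch_a = nn.Sequential(nn.Linear(in_dim_a, dim), nn.ReLU())
            self.branch_b = nn.Sequential(nn.Linear(in_dim_b, dim), nn.ReLU())
            # The interactive branch sees both streams to learn complementary cues.
            self.branch_ab = nn.Sequential(nn.Linear(in_dim_a + in_dim_b, dim), nn.ReLU())
            self.fusion = AttentionFeatureFusion(dim)
            self.classifier = nn.Linear(dim, n_classes)

        def forward(self, xa, xb):
            f_a, f_b = self.branch_a(xa), self.branch_b(xb)
            f_ab = self.branch_ab(torch.cat([xa, xb], dim=-1))
            return self.classifier(self.fusion([f_a, f_b, f_ab]))

    # Usage sketch: batch of 8 utterances, two 64-dim pooled feature streams.
    model = ComplementarySERModel(in_dim_a=64, in_dim_b=64)
    logits = model(torch.randn(8, 64), torch.randn(8, 64))  # -> (8, 4)

The softmax over per-branch scores is one simple way to realize contribution-based weighting; the published model may use a different attention formulation.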

Article ID: 2096-7675(2024)01-0052-07

Cite this article:
HUANG P, CHENG H, TANG X. Speech Emotion Recognition with Complementary Feature Learning Framework and Attentional Feature Fusion Module. Journal of Xinjiang University (Natural Science Edition in Chinese and English), 2024, 41(1): 52-58. https://doi.org/10.13568/j.cnki.651094.651316.2023.07.05.0002