Article | Open Access

TACFN: Transformer-Based Adaptive Cross-Modal Fusion Network for Multimodal Emotion Recognition

Feng Liu1 (✉), Ziwang Fu2, Yunlong Wang3, Qijian Zheng1
1 School of Computer Science and Technology, East China Normal University, Shanghai 200062, China
2 MTlab, Meitu (China) Limited, Beijing 100876, China
3 Institute of Acoustics, University of Chinese Academy of Sciences, Beijing 100084, China

Abstract

The fusion technique is key to the multimodal emotion recognition task. Recently, cross-modal attention-based fusion methods have demonstrated high performance and strong robustness. However, cross-modal attention suffers from redundant features and does not capture complementary features well. We find that it is not necessary to use all the information of one modality to reinforce the other during cross-modal interaction, and the features that reinforce a modality may be only a part of it. To this end, we design an innovative Transformer-based Adaptive Cross-modal Fusion Network (TACFN). Specifically, to address redundant features, we let one modality perform intra-modal feature selection through a self-attention mechanism, so that the selected features can adaptively and efficiently interact with the other modality. To better capture the complementary information between the modalities, we obtain a fused weight vector by splicing (concatenation) and use this weight vector to reinforce the features of each modality. We apply TACFN to the RAVDESS and IEMOCAP datasets. For a fair comparison, we use the same unimodal representations to validate the effectiveness of the proposed fusion method. The experimental results show that TACFN brings a significant performance improvement over other methods and reaches state-of-the-art performance. All code and models can be accessed at https://github.com/shuzihuaiyu/TACFN.
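For readers who want a concrete picture of the fusion pipeline summarized above, the PyTorch sketch below illustrates the idea under stated assumptions. It is not the authors' released implementation (see the GitHub repository for that); the class name AdaptiveCrossModalFusion, the dimensions, and the mean-pooling step are illustrative choices.

```python
# Minimal illustrative sketch of the adaptive cross-modal fusion idea described
# in the abstract. NOT the authors' released code; names and pooling choices
# are assumptions made for readability.
import torch
import torch.nn as nn


class AdaptiveCrossModalFusion(nn.Module):
    """Select salient features of modality A via self-attention, let the selected
    features attend to modality B, then reinforce both streams with a weight
    vector obtained by splicing (concatenating) them."""

    def __init__(self, d_model: int = 128, n_heads: int = 4):
        super().__init__()
        # Intra-modal feature selection for modality A (e.g., audio)
        self.self_attn_a = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Cross-modal interaction: selected A features attend to modality B (e.g., visual)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Fused weight vector computed from the spliced (concatenated) representations
        self.gate = nn.Sequential(nn.Linear(2 * d_model, 2 * d_model), nn.Sigmoid())

    def forward(self, x_a: torch.Tensor, x_b: torch.Tensor) -> torch.Tensor:
        # x_a: (batch, len_a, d_model), x_b: (batch, len_b, d_model)
        a_sel, _ = self.self_attn_a(x_a, x_a, x_a)       # intra-modal selection
        a_cross, _ = self.cross_attn(a_sel, x_b, x_b)    # adaptive cross-modal interaction
        pooled_a = a_cross.mean(dim=1)                   # (batch, d_model)
        pooled_b = x_b.mean(dim=1)                       # (batch, d_model)
        spliced = torch.cat([pooled_a, pooled_b], dim=-1)
        w_a, w_b = self.gate(spliced).chunk(2, dim=-1)   # fused weight vector, split per modality
        # Reinforce each modality with its learned weights, then fuse
        return torch.cat([w_a * pooled_a, w_b * pooled_b], dim=-1)


if __name__ == "__main__":
    audio = torch.randn(8, 50, 128)    # toy audio features
    visual = torch.randn(8, 30, 128)   # toy visual features
    fused = AdaptiveCrossModalFusion()(audio, visual)
    print(fused.shape)                 # torch.Size([8, 256])
```

In a full model, a classification head (e.g., a linear layer over the fused vector) would follow to predict the emotion label.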

CAAI Artificial Intelligence Research
Article number: 9150019
Cite this article:
Liu F, Fu Z, Wang Y, et al. TACFN: Transformer-Based Adaptive Cross-Modal Fusion Network for Multimodal Emotion Recognition. CAAI Artificial Intelligence Research, 2023, 2: 9150019. https://doi.org/10.26599/AIR.2023.9150019

Received: 07 July 2023
Accepted: 31 August 2023
Published: 27 October 2023
© The author(s) 2023.

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
