Cross-Modal Complementary Network with Hierarchical Fusion for Multimodal Sentiment Classification

Cheng Peng; Chunxia Zhang; Xiaojun Xue; Jiameng Gao; Hongjian Liang; Zhengdong Niu

doi:10.26599/TST.2021.9010055

Open Access

School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China

School of Information, Production and Systems, Waseda University, Fukuoka 808-0135, Japan

Abstract

Multimodal Sentiment Classification (MSC) uses multimodal data, such as images and texts, to identify the sentiment polarities of users from the information they post on the Internet. MSC has attracted considerable attention because of its wide applications in social computing and opinion mining. However, improper correlation strategies can cause erroneous fusion, because texts and images that are unrelated to each other may be integrated. Moreover, simply concatenating features modality by modality, even when the text and image are truly correlated, cannot fully capture the features within and between modalities. To solve these problems, this paper proposes a Cross-Modal Complementary Network (CMCN) with hierarchical fusion for MSC. The CMCN is designed as a hierarchical structure with three key modules: the feature extraction module, which extracts features from texts and images; the feature attention module, which learns both text and image attention features generated by an image-text correlation generator; and the cross-modal hierarchical fusion module, which fuses features within and between modalities. The CMCN thus provides a hierarchical fusion framework that can fully integrate features from different modalities while reducing the risk of integrating unrelated ones. Extensive experimental results on three public datasets show that the proposed approach significantly outperforms state-of-the-art methods.
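To make the idea of correlation-aware hierarchical fusion concrete, the following is a minimal toy sketch, not the CMCN implementation. It assumes that text and image features are already available as equal-length vectors, and it swaps in a clipped cosine similarity as a stand-in for the image-text correlation generator. All function names (`correlation_gate`, `hierarchical_fuse`) are hypothetical and do not come from the paper.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot(a, b) / (norm_a * norm_b)


def correlation_gate(text_feat, image_feat):
    """Hypothetical stand-in for a learned image-text correlation score.

    Negative similarities are clipped to 0, so the gate lies in [0, 1].
    """
    return max(0.0, cosine(text_feat, image_feat))


def hierarchical_fuse(text_feat, image_feat):
    """Toy two-level fusion (an assumption, not the paper's module).

    Level 1: scale the image features by the correlation gate.
    Level 2: add an element-wise text-image interaction term.
    The result concatenates text, gated image, and interaction features.
    """
    gate = correlation_gate(text_feat, image_feat)
    gated_image = [gate * v for v in image_feat]
    interaction = [t * v for t, v in zip(text_feat, gated_image)]
    return list(text_feat) + gated_image + interaction


related = hierarchical_fuse([1.0, 0.0], [2.0, 0.0])
unrelated = hierarchical_fuse([1.0, 0.0], [0.0, 3.0])
```

In this sketch, an image vector orthogonal to the text vector receives a gate of 0, so only the text features survive in the fused output. That mirrors, in a very simplified way, the goal of reducing the risk of integrating unrelated modal features. In a real model the gate and the fusion layers would be learned jointly rather than fixed.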

Keywords

multimodal sentiment analysis; multimodal fusion; Cross-Modal Complementary Network (CMCN); hierarchical fusion; joint optimization

Tsinghua Science and Technology

Volume 27, Issue 4, August 2022

Pages 664-679

Cite this article:

Peng C, Zhang C, Xue X, et al. Cross-Modal Complementary Network with Hierarchical Fusion for Multimodal Sentiment Classification. Tsinghua Science and Technology, 2022, 27(4): 664-679. https://doi.org/10.26599/TST.2021.9010055

Part of a topical collection:

Special Issue on Social Computing