[3]
Q. Gan, S. Wang, L. Hao, and Q. Ji, A multimodal deep regression Bayesian network for affective video content analyses, in Proc. 2017 IEEE Int. Conf. Computer Vision (ICCV), Venice, Italy, 2017, pp. 5123–5132.
[6]
W. Wang, D. Tran, and M. Feiszli, What makes training multi-modal classification networks hard? in Proc. 2020 IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 2020, pp. 12692–12702.
[7]
L. P. Morency, R. Mihalcea, and P. Doshi, Towards multimodal sentiment analysis: Harvesting opinions from the web, in Proc. 13th Int. Conf. Multimodal Interfaces, Alicante, Spain, 2011, pp. 169–176.
[8]
V. Pérez-Rosas, R. Mihalcea, and L. P. Morency, Utterance-level multimodal sentiment analysis, in Proc. 51st Annual Meeting Association for Computational Linguistics, Sofia, Bulgaria, 2013, pp. 973–982.
[9]
A. Zadeh, R. Zellers, E. Pincus, and L. P. Morency, MOSI: Multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos, arXiv preprint arXiv: 1606.06259, 2016.
[10]
H. Wang, A. Meghawat, L. P. Morency, and E. P. Xing, Select-additive learning: Improving generalization in multimodal sentiment analysis, in Proc. 2017 IEEE Int. Conf. Multimedia and Expo (ICME), Hong Kong, China, 2017, pp. 949–954.
[11]
S. Sahay, E. Okur, S. H. Kumar, and L. Nachman, Low rank fusion based Transformers for multimodal sequences, arXiv preprint arXiv: 2007.02038, 2020.
[12]
W. Rahman, M. K. Hasan, S. Lee, A. Bagher Zadeh, C. Mao, L. P. Morency, and E. Hoque, Integrating multimodal information in large pretrained Transformers, in Proc. 58th Annual Meeting Association for Computational Linguistics, virtual, 2020, pp. 2359–2369.
[14]
D. Hazarika, R. Zimmermann, and S. Poria, MISA: Modality-invariant and -specific representations for multimodal sentiment analysis, in Proc. 28th ACM Int. Conf. Multimedia, Seattle, WA, USA, 2020, pp. 1122–1131.
[15]
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, Attention is all you need, arXiv preprint arXiv: 1706.03762, 2017.
[16]
J. Cheng, I. Fostiropoulos, B. Boehm, and M. Soleymani, Multimodal phased Transformer for sentiment analysis, in Proc. 2021 Conf. Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 2021, pp. 2447–2458.
[17]
F. Lv, X. Chen, Y. Huang, L. Duan, and G. Lin, Progressive modality reinforcement for human multimodal emotion recognition from unaligned multimodal sequences, in Proc. 2021 IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 2021, pp. 2554–2562.
[18]
A. Nagrani, S. Yang, A. Arnab, A. Jansen, C. Schmid, and C. Sun, Attention bottlenecks for multimodal fusion, in Proc. 2021 Annual Conf. Neural Information Processing Systems, virtual, 2021, pp. 14200–14213.
[19]
S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, Aggregated residual transformations for deep neural networks, in Proc. 2017 IEEE Conf. Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 2017, pp. 5987–5995.
[23]
D. Nguyen, K. Nguyen, S. Sridharan, A. Ghasemi, D. Dean, and C. Fookes, Deep spatio-temporal features for multimodal emotion recognition, in Proc. 2017 IEEE Winter Conf. Applications of Computer Vision (WACV), Santa Rosa, CA, USA, 2017, pp. 1215–1223.
[24]
Y. H. H. Tsai, S. Bai, P. P. Liang, J. Z. Kolter, L. P. Morency, and R. Salakhutdinov, Multimodal Transformer for unaligned multimodal language sequences, in Proc. 57th Annual Meeting Association for Computational Linguistics, Florence, Italy, 2019, pp. 6558–6569.
[25]
H. R. Vaezi Joze, A. Shaban, M. L. Iuzzolino, and K. Koishida, MMTM: Multimodal transfer module for CNN fusion, in Proc. 2020 IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 2020, pp. 13286–13296.
[26]
L. Su, C. Hu, G. Li, and D. Cao, MSAF: Multimodal split attention fusion, arXiv preprint arXiv: 2012.07175, 2020.
[27]
J. Wang, M. Xue, R. Culhane, E. Diao, J. Ding, and V. Tarokh, Speech emotion recognition with dual-sequence LSTM architecture, in Proc. 2020 IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 6474–6478.
[28]
W. Dai, Z. Liu, T. Yu, and P. Fung, Modality-transferable emotion embeddings for low-resource multimodal emotion recognition, arXiv preprint arXiv: 2009.09629, 2020.
[29]
Q. Jin, C. Li, S. Chen, and H. Wu, Speech emotion recognition with acoustic and lexical features, in Proc. 2015 IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, Australia, 2015, pp. 4749–4753.
[30]
J. Pennington, R. Socher, and C. Manning, GloVe: Global vectors for word representation, in Proc. 2014 Conf. Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 2014, pp. 1532–1543.
[31]
T. Baltrušaitis, P. Robinson, and L. P. Morency, OpenFace: An open source facial behavior analysis toolkit, in Proc. 2016 IEEE Winter Conf. Applications of Computer Vision (WACV), Lake Placid, NY, USA, 2016, pp. 1–10.
[32]
G. Degottex, J. Kane, T. Drugman, T. Raitio, and S. Scherer, COVAREP—A collaborative voice analysis repository for speech technologies, in Proc. 2014 IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 2014, pp. 960–964.
[33]
D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv: 1412.6980, 2014.
[34]
J. D. S. Ortega, M. Senoussaoui, E. Granger, M. Pedersoli, P. Cardinal, and A. L. Koerich, Multimodal fusion with deep neural networks for audio-video emotion recognition, arXiv preprint arXiv: 1907.03196, 2019.
[35]
A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach, Multimodal compact bilinear pooling for visual question answering and visual grounding, arXiv preprint arXiv: 1606.01847, 2016.
[36]
K. Liu, Y. Li, N. Xu, and P. Natarajan, Learn to combine modalities in multimodal deep learning, arXiv preprint arXiv: 1805.11730, 2018.