[26] A. Zadeh, M. Chen, S. Poria, E. Cambria, and L.-P. Morency, Tensor fusion network for multimodal sentiment analysis, in Proc. 2017 Conf. Empirical Methods in Natural Language Processing, arXiv preprint arXiv:1707.07250.
[27] Z. Liu, Y. Shen, V. B. Lakshminarasimhan, P. P. Liang, A. B. Zadeh, and L.-P. Morency, Efficient low-rank multimodal fusion with modality-specific factors, in Proc. 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, 2018, pp. 2247–2256.
[30] G. Andrew, R. Arora, J. Bilmes, and K. Livescu, Deep canonical correlation analysis, in Proc. Int. Conf. Machine Learning, Atlanta, GA, USA, 2013, pp. 1247–1255.
[31] Y. Dai, F. Gieseke, S. Oehmcke, Y. Wu, and K. Barnard, Attentional feature fusion, in Proc. IEEE Winter Conf. Applications of Computer Vision (WACV), Waikoloa, HI, USA, 2021, pp. 3559–3568.
[32] J.-H. Kim, J. Jun, and B.-T. Zhang, Bilinear attention networks, arXiv preprint arXiv:1805.07932.
[33] J. Lu, J. Yang, D. Batra, and D. Parikh, Hierarchical question-image co-attention for visual question answering, in Proc. 30th Int. Conf. Neural Information Processing Systems, Barcelona, Spain, 2016, pp. 289–297.
[34] Y.-C. Chen, L. Li, L. Yu, A. El Kholy, F. Ahmed, Z. Gan, Y. Cheng, and J. Liu, UNITER: UNiversal image-TExt representation learning, in Proc. 16th European Conf. Computer Vision, Glasgow, UK, 2020, pp. 104–120.
[36] F. Locatello, S. Bauer, M. Lucic, S. Gelly, and O. Bachem, Challenging common assumptions in the unsupervised learning of disentangled representations, in Proc. 36th Int. Conf. Machine Learning, Long Beach, CA, USA, 2019, pp. 4114–4124.
[37] D. Hazarika, R. Zimmermann, and S. Poria, MISA: Modality-invariant and -specific representations for multimodal sentiment analysis, in Proc. 28th ACM Int. Conf. Multimedia, Seattle, WA, USA, 2020, pp. 1122–1131.
[38] D. Yang, S. Huang, H. Kuang, Y. Du, and L. Zhang, Disentangled representation learning for multimodal emotion recognition, in Proc. 30th ACM Int. Conf. Multimedia, Lisboa, Portugal, 2022, pp. 1642–1651.
[39] K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition, in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 2016, pp. 770–778.
[41] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Transformers for image recognition at scale, in Proc. Int. Conf. Learning Representations, arXiv preprint arXiv:2010.11929.
[42] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, Attention is all you need, in Proc. Advances in Neural Information Processing Systems, Long Beach, CA, USA, 2017, pp. 5998–6008.
[46] X. Wang, M. Zhu, D. Bo, P. Cui, C. Shi, and J. Pei, AM-GCN: Adaptive multi-channel graph convolutional networks, in Proc. 26th ACM SIGKDD Int. Conf. Knowledge Discovery & Data Mining, Virtual Event, 2020, pp. 1243–1253.
[47] C. Saillard, O. Dehaene, T. Marchand, O. Moindrot, A. Kamoun, B. Schmauch, and S. Jegou, Self-supervised learning improves dMMR/MSI detection from histology slides across multiple cancers, in Proc. MICCAI Workshop on Computational Pathology, Virtual Event, 2021, pp. 191–205.