As a Turing test in multimedia, visual question answering (VQA) aims to answer a textual question about a given image. Recently, the "dynamic" property of neural networks has been explored as one of the most promising ways of improving the adaptability, interpretability, and capacity of neural network models. Unfortunately, despite the prevalence of dynamic convolutional neural networks, exploiting dynamics in the Transformers of VQA models across all stages in an end-to-end manner remains largely unexplored and nontrivial. Typically, due to the large computation cost of Transformers, researchers are inclined to apply Transformers only to the extracted high-level visual features for downstream vision and language tasks. To this end, we introduce a question-guided dynamic layer into the Transformer, which effectively increases model capacity while requiring fewer Transformer layers for the VQA task. In particular, we refer to the dynamic block in the Transformer as the conditional multi-head self-attention block (cMHSA). Furthermore, our question-guided cMHSA is compatible with the conditional ResNeXt block (cResNeXt). Thus, a novel mixture of conditional gating blocks (McG) is proposed for VQA, which keeps the best of the Transformer, the convolutional neural network (CNN), and dynamic networks. The pure conditional gating CNN model and the conditional gating Transformer model can be viewed as special cases of McG. We quantitatively and qualitatively evaluate McG on the CLEVR and VQA-Abstract datasets. Extensive experiments show that McG achieves state-of-the-art performance on these benchmark datasets.
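The abstract does not give the exact formulation of the question-guided conditional gating, so the following is only a minimal PyTorch sketch of one plausible reading of a cMHSA-style block: a pooled question embedding produces a sigmoid gate that modulates the output of a multi-head self-attention layer over visual tokens. The class name, the gate placement, and the dimensions are illustrative assumptions, not the authors' design.

```python
import torch
import torch.nn as nn


class ConditionalMHSA(nn.Module):
    """Sketch of a question-conditioned ("gated") multi-head self-attention block.

    The gating mechanism below (a sigmoid gate on the attention output,
    derived from the question embedding) is an assumption for illustration.
    """

    def __init__(self, dim: int, num_heads: int, q_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Question embedding -> feature-wise sigmoid gate (assumed mechanism).
        self.gate = nn.Sequential(nn.Linear(q_dim, dim), nn.Sigmoid())

    def forward(self, x: torch.Tensor, q_emb: torch.Tensor) -> torch.Tensor:
        # x:     (batch, tokens, dim)  visual tokens
        # q_emb: (batch, q_dim)        pooled question embedding
        h = self.norm(x)
        attn_out, _ = self.attn(h, h, h)
        g = self.gate(q_emb).unsqueeze(1)  # (batch, 1, dim), broadcast over tokens
        return x + g * attn_out            # question-gated residual update


# Hypothetical usage with made-up sizes:
block = ConditionalMHSA(dim=512, num_heads=8, q_dim=768)
visual = torch.randn(2, 49, 512)    # e.g., a 7x7 feature map flattened into tokens
question = torch.randn(2, 768)      # pooled question embedding
out = block(visual, question)       # (2, 49, 512)
```

In this reading, the same question-derived gate could equally be applied to a ResNeXt branch to obtain a cResNeXt-style block, which is how the two conditional components would remain compatible within McG; the paper itself should be consulted for the actual gating details.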