Regular Paper

Learning a Mixture of Conditional Gating Blocks for Visual Question Answering

School of Statistics and Information, Shanghai University of International Business and Economics, Shanghai 201620, China
Academy for Engineering and Technology, Fudan University, Shanghai 200433, China
School of Data Science, Fudan University, Shanghai 200433, China
School of Computer Science, Fudan University, Shanghai 200433, China

Abstract

As a Turing test in multimedia, visual question answering (VQA) aims to answer a textual question about a given image. Recently, the "dynamic" property of neural networks has been explored as one of the most promising ways of improving the adaptability, interpretability, and capacity of neural network models. Unfortunately, despite the prevalence of dynamic convolutional neural networks, exploiting dynamics in the Transformers of VQA models through all stages in an end-to-end manner remains relatively untouched and nontrivial. Typically, owing to the large computational cost of Transformers, researchers tend to apply Transformers only to extracted high-level visual features for downstream vision and language tasks. To this end, we introduce a question-guided dynamic layer into the Transformer, which effectively increases the model capacity and requires fewer Transformer layers for the VQA task. In particular, we name the dynamic block in the Transformer the Conditional Multi-Head Self-Attention block (cMHSA). Furthermore, our question-guided cMHSA is compatible with the conditional ResNeXt block (cResNeXt). Thus a novel model, a mixture of conditional gating blocks (McG), is proposed for VQA, which keeps the best of the Transformer, the convolutional neural network (CNN), and dynamic networks. A pure conditional gating CNN model and a conditional gating Transformer model can be viewed as special cases of McG. We quantitatively and qualitatively evaluate McG on the CLEVR and VQA-Abstract datasets. Extensive experiments show that McG achieves state-of-the-art performance on these benchmark datasets.
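
The cMHSA block described above conditions the Transformer's multi-head self-attention on the question. As a rough illustration of this kind of question-guided gating (a minimal PyTorch sketch, not the authors' implementation; the class name ConditionalMHSA, the gate_fc layer, and all dimensions are illustrative assumptions), one gate per attention head can be predicted from a pooled question embedding and used to modulate the attention output:

# Minimal sketch of question-conditioned gating on multi-head self-attention.
# Names and shapes are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn

class ConditionalMHSA(nn.Module):
    def __init__(self, embed_dim=512, num_heads=8, q_dim=512):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        # Predict one gate per attention head from the question embedding.
        self.gate_fc = nn.Linear(q_dim, num_heads)
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads

    def forward(self, x, q_emb):
        # x:     (batch, seq_len, embed_dim)  visual token features
        # q_emb: (batch, q_dim)               pooled question embedding
        attn_out, _ = self.attn(x, x, x)
        gates = torch.sigmoid(self.gate_fc(q_emb))            # (batch, num_heads)
        b, n, _ = attn_out.shape
        attn_out = attn_out.reshape(b, n, self.num_heads, self.head_dim)
        # Gate channel groups of the attention output, one group per head.
        attn_out = attn_out * gates[:, None, :, None]
        return attn_out.reshape(b, n, -1) + x                  # residual connection

# Usage: gate 8 head groups over 49 visual tokens with a question vector.
block = ConditionalMHSA()
vis = torch.randn(2, 49, 512)
q = torch.randn(2, 512)
out = block(vis, q)   # (2, 49, 512)

A conditional ResNeXt block (cResNeXt) can be gated analogously, with the question embedding modulating grouped convolution paths instead of attention heads.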

Electronic Supplementary Material

Download File(s)
JCST-2112-12113-Highlights.pdf (502.9 KB)

Journal of Computer Science and Technology
Pages 912-928
Cite this article:
Sun Q, Fu Y-W, Xue X-Y. Learning a Mixture of Conditional Gating Blocks for Visual Question Answering. Journal of Computer Science and Technology, 2024, 39(4): 912-928. https://doi.org/10.1007/s11390-024-2113-0

Received: 26 December 2021
Accepted: 24 January 2024
Published: 20 September 2024
© Institute of Computing Technology, Chinese Academy of Sciences 2024