[1]
Q. Wu, P. Wang, X. Wang, X. He, and W. Zhu, Knowledge-based VQA, in Visual Question Answering, Q. Wu, P. Wang, X. Wang, X. He, and W. Zhu, eds. Singapore: Springer, 2022, pp. 73–90.
[3]
P. Wang, Q. Wu, C. Shen, A. Dick, and A. van den Hengel, FVQA: Fact-based visual question answering, IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 10, pp. 2413–2427, 2018.
[4]
K. Marino, M. Rastegari, A. Farhadi, and R. Mottaghi, OK-VQA: A visual question answering benchmark requiring external knowledge, in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 2019, pp. 3190–3199.
[5]
D. Schwenk, A. Khandelwal, C. Clark, K. Marino, and R. Mottaghi, A-OKVQA: A benchmark for visual question answering using world knowledge, in Computer Vision—ECCV 2022, S. Avidan, G. Brostow, M. Cissé, G. M. Farinella, and T. Hassner, eds. Cham, Switzerland: Springer, 2022, pp. 146–162.
[6]
K. Marino, X. Chen, D. Parikh, A. Gupta, and M. Rohrbach, KRISP: Integrating implicit and symbolic knowledge for open-domain knowledge-based VQA, in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 2021, pp. 14106–14116.
[8]
T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, arXiv preprint arXiv: 2005.14165, 2020.
[9]
P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih, T. Rocktäschel, et al., Retrieval-augmented generation for knowledge-intensive NLP tasks, in Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, eds. New York, NY, USA: Curran Associates, Inc., 2020, pp. 9459–9474.
[10]
D. Hong, B. Zhang, X. Li, Y. Li, C. Li, J. Yao, N. Yokoya, H. Li, P. Ghamisi, X. Jia, et al., SpectralGPT: Spectral remote sensing foundation model, IEEE Trans. Pattern Anal. Mach. Intell.
[13]
S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh, VQA: Visual question answering, in Proc. IEEE Int. Conf. Computer Vision (ICCV), Santiago, Chile, 2015, pp. 2425–2433.
[14]
Y. Srivastava, V. Murali, S. R. Dubey, and S. Mukherjee, Visual question answering using deep learning: A survey and performance analysis, in Computer Vision and Image Processing, S. K. Singh, P. Roy, B. Raman, and P. Nagabhushan, eds. Singapore: Springer, 2021, pp. 75–86.
[17]
H. Jiang, I. Misra, M. Rohrbach, E. Learned-Miller, and X. Chen, In defense of grid features for visual question answering, in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 2020, pp. 10264–10273.
[18]
P. Zhang, X. Li, X. Hu, J. Yang, L. Zhang, L. Wang, Y. Choi, and J. Gao, VinVL: Revisiting visual representations in vision-language models, in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 2021, pp. 5575–5584.
[19]
L. Li, Z. Gan, Y. Cheng, and J. Liu, Relation-aware graph attention network for visual question answering, in Proc. IEEE/CVF Int. Conf. Computer Vision (ICCV), Seoul, Republic of Korea, 2019, pp. 10312–10321.
[20]
Z. Yu, Y. Cui, J. Yu, M. Wang, D. Tao, and Q. Tian, Deep multimodal neural architecture search, in Proc. 28th ACM Int. Conf. Multimedia, Seattle, WA, USA, 2020, pp. 3743–3752.
[21]
Z. Yu, J. Yu, Y. Cui, D. Tao, and Q. Tian, Deep modular co-attention networks for visual question answering, in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 2019, pp. 6274–6283.
[22]
Y. Cui, Z. Yu, C. Wang, Z. Zhao, J. Zhang, M. Wang, and J. Yu, ROSITA: Enhancing vision-and-language semantic alignments via cross- and intra-modal knowledge integration, in Proc. 29th ACM Int. Conf. Multimedia, Virtual Event, 2021, pp. 797–806.
[23]
J. Li, D. Li, C. Xiong, and S. Hoi, BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation, arXiv preprint arXiv: 2201.12086, 2022.
[24]
M. Zhou, L. Yu, A. Singh, M. Wang, Z. Yu, and N. Zhang, Unsupervised vision-and-language pre-training via retrieval-based multi-granular alignment, in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 2022, pp. 16464–16473.
[25]
M. Malinowski, M. Rohrbach, and M. Fritz, Ask your neurons: A neural-based approach to answering questions about images, in Proc. IEEE Int. Conf. Computer Vision (ICCV), Santiago, Chile, 2015, pp. 1–9.
[26]
Z. Yu, J. Yu, J. Fan, and D. Tao, Multi-modal factorized bilinear pooling with co-attention learning for visual question answering, in Proc. IEEE Int. Conf. Computer Vision (ICCV), Venice, Italy, 2017, pp. 1839–1848.
[27]
J. Andreas, M. Rohrbach, T. Darrell, and D. Klein, Neural module networks, in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 2016, pp. 39–48.
[29]
Z. Shao, Z. Yu, M. Wang, and J. Yu, Prompting large language models with answer heuristics for knowledge-based visual question answering, in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), Vancouver, Canada, 2023, pp. 14974–14983.
[30]
S. Pan, L. Luo, Y. Wang, C. Chen, J. Wang, and X. Wu, Unifying large language models and knowledge graphs: A roadmap, IEEE Trans. Knowl. Data Eng.
[32]
Z. Zhang, X. Han, Z. Liu, X. Jiang, M. Sun, and Q. Liu, ERNIE: Enhanced language representation with informative entities, in Proc. 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 2019, pp. 1441–1451.
[33]
Y. Sun, S. Wang, S. Feng, S. Ding, C. Pang, J. Shang, J. Liu, X. Chen, Y. Zhao, Y. Lu, et al., ERNIE 3.0: Large-scale knowledge enhanced pre-training for language understanding and generation, arXiv preprint arXiv: 2107.02137, 2021.
[34]
J. Liu, D. Shen, Y. Zhang, B. Dolan, L. Carin, and W. Chen, What makes good in-context examples for GPT-3? in Proc. Deep Learning Inside Out (DeeLIO 2022): The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, Dublin, Ireland, 2022, pp. 100–114.
[35]
H. Ben-Younes, R. Cadene, M. Cord, and N. Thome, MUTAN: Multimodal tucker fusion for visual question answering, in Proc. IEEE Int. Conf. Computer Vision (ICCV), Venice, Italy, 2017, pp. 2631–2639.
[36]
Z. Zhu, J. Yu, Y. Wang, Y. Sun, Y. Hu, and Q. Wu, Mucko: Multi-layer cross-modal knowledge reasoning for fact-based visual question answering, in Proc. 29th Int. Joint Conf. Artificial Intelligence, Yokohama, Japan, 2020, pp. 1097–1103.
[37]
F. Gardères, M. Ziaeefard, B. Abeloos, and F. Lecue, ConceptBert: Concept-aware representation for visual question answering, in Proc. Findings of the Association for Computational Linguistics: EMNLP 2020, Virtual Event, 2020, pp. 489–498.
[38]
M. Luo, Y. Zeng, P. Banerjee, and C. Baral, Weakly-supervised visual-retriever-reader for knowledge-based question answering, in Proc. 2021 Conf. Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 2021, pp. 6417–6431.
[39]
F. Gao, Q. Ping, G. Thattai, A. Reganti, Y. N. Wu, and P. Natarajan, Transform-retrieve-generate: Natural language-centric outside-knowledge visual question answering, in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 2022, pp. 5057–5067.
[40]
Y. Guo, L. Nie, Y. Wong, Y. Liu, Z. Cheng, and M. Kankanhalli, A unified end-to-end retriever-reader framework for knowledge-based VQA, in Proc. 30th ACM Int. Conf. Multimedia, Lisboa, Portugal, 2022, pp. 2061–2069.
[41]
Y. Jiang, V. Natarajan, X. Chen, M. Rohrbach, D. Batra, and D. Parikh, Pythia v0.1: The winning entry to the VQA challenge 2018, arXiv preprint arXiv: 1807.09956, 2018.
[42]
J. Lu, D. Batra, D. Parikh, and S. Lee, ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks, in Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett, eds. New York, NY, USA: Curran Associates, Inc., 2019, pp. 13–23.
[43]
R. Mokady, A. Hertz, and A. H. Bermano, ClipCap: CLIP prefix for image captioning, arXiv preprint arXiv: 2111.09734, 2021.
[44]
H. Tan and M. Bansal, LXMERT: Learning cross-modality encoder representations from transformers, in Proc. 2019 Conf. Empirical Methods in Natural Language Processing and the 9th Int. Joint Conf. Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 2019, pp. 5100–5111.
[45]
A. Kamath, C. Clark, T. Gupta, E. Kolve, D. Hoiem, and A. Kembhavi, Webly supervised concept expansion for general purpose vision models, in Computer Vision—ECCV 2022, S. Avidan, G. Brostow, M. Cissé, G. M. Farinella, and T. Hassner, eds. Cham, Switzerland: Springer, 2022, pp. 662–681.
[46]
S. Ravi, A. Chinchure, L. Sigal, R. Liao, and V. Shwartz, VLC-BERT: Visual question answering with contextualized commonsense knowledge, in Proc. IEEE/CVF Winter Conf. Applications of Computer Vision (WACV), Waikoloa, HI, USA, 2023, pp. 1155–1165.
[47]
Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh, Making the V in VQA matter: Elevating the role of image understanding in visual question answering, in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 2017, pp. 6325–6334.
[49]
H. Touvron, T. Lavril, G. Izacard, X. Martinet, M. A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al., LLaMA: Open and efficient foundation language models, arXiv preprint arXiv: 2302.13971, 2023.
[50]
H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al., Llama 2: Open foundation and fine-tuned chat models, arXiv preprint arXiv: 2307.09288, 2023.
[51]
F. Ilievski, P. Szekely, and B. Zhang, CSKG: The commonsense knowledge graph, in The Semantic Web, R. Verborgh, K. Hose, H. Paulheim, P. Champin, M. Maleshkova, O. Corcho, P. Ristoski, and M. Alam, eds. Cham, Switzerland: Springer, 2021, pp. 680–696.