[1]
R. Xu, C. Xiong, W. Chen, and J. Corso, Jointly modeling deep video and compositional text to bridge vision and language in a unified framework, in Proc. 29th AAAI Conf. Artificial Intelligence, Austin, TX, USA, 2015.
[2]
V. Gabeur, C. Sun, K. Alahari, and C. Schmid, Multi-modal transformer for video retrieval, in Proc. 16th European Conf. Computer Vision, Glasgow, UK, 2020, pp. 214–229.
[3]
B. Shi, L. Ji, P. Lu, Z. Niu, and N. Duan, Knowledge aware semantic concept expansion for image-text matching, in Proc. 28th Int. Joint Conf. Artificial Intelligence, Macao, China, 2019, pp. 5182–5189.
[4]
J. A. Portillo-Quintero, J. C. Ortiz-Bayliss, and H. Terashima-Marín, A straightforward framework for video retrieval using CLIP, in Proc. 13th Mexican Conf. Pattern Recognition, Mexico City, Mexico, 2021, pp. 3–12.
[5]
H. Luo, L. Ji, M. Zhong, Y. Chen, W. Lei, N. Duan, and T. Li, CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval, arXiv preprint arXiv: 2104.08860, 2021.
[6]
S. K. Gorti, N. Vouitsis, J. Ma, K. Golestan, M. Volkovs, A. Garg, and G. Yu, X-Pool: Cross-modal language-video attention for text-video retrieval, in Proc. 2022 IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 2022, pp. 4996–5005.
[7]
S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives, DBpedia: A nucleus for a web of open data, in Proc. 6th Int. Semantic Web Conf., 2nd Asian Semantic Web Conf., Busan, Republic of Korea, 2007, pp. 722–735.
[8]
R. Speer, J. Chin, and C. Havasi, ConceptNet 5.5: An open multilingual graph of general knowledge, in Proc. 31st AAAI Conf. Artificial Intelligence, San Francisco, CA, USA, 2017.
[9]
Z. Zhang, Y. Shi, C. Yuan, B. Li, P. Wang, W. Hu, and Z. J. Zha, Object relational graph with teacher-recommended learning for video captioning, in Proc. 2020 IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 2020, pp. 13275–13285.
[10]
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., Learning transferable visual models from natural language supervision, in Proc. 38th Int. Conf. Machine Learning, virtual, 2021, pp. 8748–8763.
[11]
W. Kim, B. Son, and I. Kim, ViLT: Vision-and-language transformer without convolution or region supervision, in Proc. 38th Int. Conf. Machine Learning, virtual, 2021, pp. 5583–5594.
[12]
T. J. Fu, L. Li, Z. Gan, K. Lin, W. Y. Wang, L. Wang, and Z. Liu, VIOLET: End-to-end video-language transformers with masked visual-token modeling, arXiv preprint arXiv: 2111.12681, 2021.
[13]
Z. Huang, Z. Zeng, Y. Huang, B. Liu, D. Fu, and J. Fu, Seeing out of tHe bOx: End-to-end pre-training for vision-language representation learning, in Proc. 2021 IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 2021, pp. 12971–12980.
[14]
J. Lei, L. Li, L. Zhou, Z. Gan, T. L. Berg, M. Bansal, and J. Liu, Less is more: ClipBERT for video-and-language learning via sparse sampling, in Proc. 2021 IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 2021, pp. 7327–7337.
[15]
C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid, VideoBERT: A joint model for video and language representation learning, in Proc. 2019 IEEE/CVF Int. Conf. Computer Vision (ICCV), Seoul, Republic of Korea, 2019, pp. 7463–7472.
[16]
G. Li, N. Duan, Y. Fang, M. Gong, and D. Jiang, Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training, in Proc. 34th AAAI Conf. Artificial Intelligence, New York, NY, USA, 2020, pp. 11336–11344.
[17]
H. Xu, G. Ghosh, P. Y. Huang, D. Okhonko, A. Aghajanyan, F. Metze, L. Zettlemoyer, and C. Feichtenhofer, VideoCLIP: Contrastive pre-training for zero-shot video-text understanding, arXiv preprint arXiv: 2109.14084, 2021.
[18]
L. Li, Y. C. Chen, Y. Cheng, Z. Gan, L. Yu, and J. Liu, HERO: Hierarchical encoder for video+language omni-representation pre-training, arXiv preprint arXiv: 2005.00200, 2020.
[19]
M. Bain, A. Nagrani, G. Varol, and A. Zisserman, Frozen in time: A joint video and image encoder for end-to-end retrieval, in Proc. 2021 IEEE/CVF Int. Conf. Computer Vision (ICCV), Montreal, Canada, 2021, pp. 1708–1718.
[20]
M. Wang, J. Xing, and Y. Liu, ActionCLIP: A new paradigm for video action recognition, arXiv preprint arXiv: 2109.08472, 2021.
[21]
R. Mokady, A. Hertz, and A. H. Bermano, ClipCap: CLIP prefix for image captioning, arXiv preprint arXiv: 2111.09734, 2021.
[22]
J. Li, D. Li, C. Xiong, and S. Hoi, BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation, arXiv preprint arXiv: 2201.12086, 2022.
[23]
T. Srinivasan, X. Ren, and J. Thomason, Curriculum learning for data-efficient vision-language alignment, arXiv preprint arXiv: 2207.14525, 2022.
[24]
B. Mustafa, C. Riquelme, J. Puigcerver, R. Jenatton, and N. Houlsby, Multimodal contrastive learning with LIMoE: The language-image mixture of experts, arXiv preprint arXiv: 2206.02770, 2022.
[25]
O. Khattab and M. Zaharia, ColBERT: Efficient and effective passage search via contextualized late interaction over BERT, in Proc. 43rd Int. ACM SIGIR Conf. Research and Development in Information Retrieval, virtual, 2020, pp. 39–48.
[26]
F. He, Q. Wang, Z. Feng, W. Jiang, Y. Lü, Y. Zhu, and X. Tan, Improving video retrieval by adaptive margin, in Proc. 44th Int. ACM SIGIR Conf. Research and Development in Information Retrieval, virtual, 2021, pp. 1359–1368.
[27]
J. Yang, Y. Bisk, and J. Gao, TACo: Token-aware cascade contrastive learning for video-text alignment, in Proc. 2021 IEEE/CVF Int. Conf. Computer Vision (ICCV), Montreal, Canada, 2021, pp. 11542–11552.
[28]
M. Dzabraev, M. Kalashnikov, S. Komkov, and A. Petiushko, MDMMT: Multidomain multimodal transformer for video retrieval, in Proc. 2021 IEEE/CVF Conf. Computer Vision and Pattern Recognition Workshops (CVPRW), Nashville, TN, USA, 2021, pp. 3349–3358.
[29]
Z. Zhu, J. Yu, Y. Wang, Y. Sun, Y. Hu, and Q. Wu, Mucko: Multi-layer cross-modal knowledge reasoning for fact-based visual question answering, arXiv preprint arXiv: 2006.09073, 2020.
[30]
K. Marino, X. Chen, D. Parikh, A. Gupta, and M. Rohrbach, KRISP: Integrating implicit and symbolic knowledge for open-domain knowledge-based VQA, in Proc. 2021 IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 2021, pp. 14106–14116.
[31]
Y. Ding, J. Yu, B. Liu, Y. Hu, M. Cui, and Q. Wu, MuKEA: Multimodal knowledge extraction and accumulation for knowledge-based visual question answering, in Proc. 2022 IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 2022, pp. 5079–5088.
[32]
T. N. Kipf and M. Welling, Semi-supervised classification with graph convolutional networks, arXiv preprint arXiv: 1609.02907, 2016.
[33]
H. Ben-younes, R. Cadene, M. Cord, and N. Thome, MUTAN: Multimodal Tucker fusion for visual question answering, in Proc. 2017 IEEE Int. Conf. Computer Vision (ICCV), Venice, Italy, 2017, pp. 2631–2639.
[34]
F. Gardères, M. Ziaeefard, B. Abeloos, and F. Lecue, ConceptBert: Concept-aware representation for visual question answering, in Proc. 2020 Conf. Empirical Methods in Natural Language Processing, virtual, 2020, pp. 489–498.
[35]
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, Attention is all you need, in Proc. 31st Int. Conf. Neural Information Processing Systems, Long Beach, CA, USA, 2017, pp. 6000–6010.
[36]
C. Malaviya, C. Bhagavatula, A. Bosselut, and Y. Choi, Commonsense knowledge base completion with structural and semantic context, in Proc. 34th AAAI Conf. Artificial Intelligence, New York, NY, USA, 2020, pp. 2925–2933.
[37]
J. Li, R. Selvaraju, A. Gotmare, S. Joty, C. Xiong, and S. C. H. Hoi, Align before fuse: Vision and language representation learning with momentum distillation, in Proc. 35th Conf. Neural Information Processing Systems, virtual, 2021, pp. 9694–9705.
[38]
H. Xue, T. Hang, Y. Zeng, Y. Sun, B. Liu, H. Yang, J. Fu, and B. Guo, Advancing high-resolution video-language representation with large-scale video transcriptions, in Proc. 2022 IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 2022, pp. 5026–5035.
[39]
T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, A simple framework for contrastive learning of visual representations, in Proc. 37th Int. Conf. Machine Learning, Vienna, Austria, 2020, pp. 1597–1607.
[40]
J. Xu, T. Mei, T. Yao, and Y. Rui, MSR-VTT: A large video description dataset for bridging video and language, in Proc. 2016 IEEE Conf. Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 2016, pp. 5288–5296.
[41]
Y. Yu, J. Kim, and G. Kim, A joint sequence fusion model for video question answering and retrieval, in Proc. 15th European Conf. Computer Vision, Munich, Germany, 2018, pp. 471–487.
[42]
L. A. Hendricks, O. Wang, E. Shechtman, J. Sivic, T. Darrell, and B. Russell, Localizing moments in video with natural language, in Proc. 2017 IEEE Int. Conf. Computer Vision (ICCV), Venice, Italy, 2017, pp. 5804–5813.
[43]
D. Chen and W. Dolan, Collecting highly parallel data for paraphrase evaluation, in Proc. 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, OR, USA, 2011, pp. 190–200.
[44]
X. Glorot, A. Bordes, and Y. Bengio, Deep sparse rectifier neural networks, in Proc. 14th Int. Conf. Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 2011, pp. 315–323.
[45]
Y. Liu, S. Albanie, A. Nagrani, and A. Zisserman, Use what you have: Video retrieval using representations from collaborative experts, arXiv preprint arXiv: 1907.13487, 2019.
[46]
Y. Ge, Y. Ge, X. Liu, D. Li, Y. Shan, X. Qie, and P. Luo, Bridging video-text retrieval with multiple choice questions, in Proc. 2022 IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 2022, pp. 16146–16155.
[47]
I. Croitoru, S. V. Bogolin, M. Leordeanu, H. Jin, A. Zisserman, S. Albanie, and Y. Liu, TeachText: Cross-modal generalized distillation for text-video retrieval, in Proc. 2021 IEEE/CVF Int. Conf. Computer Vision (ICCV), Montreal, Canada, 2021, pp. 11563–11573.
[48]
K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, Momentum contrast for unsupervised visual representation learning, in Proc. 2020 IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 2020, pp. 9726–9735.