[1]
R. Xu, C. Xiong, W. Chen, and J. Corso, Jointly modeling deep video and compositional text to bridge vision and language in a unified framework, in Proc. 29th AAAI Conf. Artificial Intelligence, Austin, TX, USA, 2015.
[2]
V. Gabeur, C. Sun, K. Alahari, and C. Schmid, Multi-modal transformer for video retrieval, in Proc. 16th European Conf. Computer Vision, Glasgow, UK, 2020, pp. 214–229.
[3]
B. Shi, L. Ji, P. Lu, Z. Niu, and N. Duan, Knowledge aware semantic concept expansion for image-text matching, in Proc. 28th Int. Joint Conf. Artificial Intelligence, Macao, China, 2019, pp. 5182–5189.
[4]
J. A. Portillo-Quintero, J. C. Ortiz-Bayliss, and H. Terashima-Marín, A straightforward framework for video retrieval using CLIP, in Proc. 13th Mexican Conf. Pattern Recognition, Mexico City, Mexico, 2021, pp. 3–12.
[5]
H. Luo, L. Ji, M. Zhong, Y. Chen, W. Lei, N. Duan, and T. Li, CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval, arXiv preprint arXiv: 2104.08860, 2021.
[6]
S. K. Gorti, N. Vouitsis, J. Ma, K. Golestan, M. Volkovs, A. Garg, and G. Yu, X-Pool: Cross-modal language-video attention for text-video retrieval, in Proc. 2022 IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 2022, pp. 4996–5005.
[7]
S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives, DBpedia: A nucleus for a web of open data, in Proc. 6th Int. Semantic Web Conf., 2nd Asian Semantic Web Conf., Busan, Republic of Korea, 2007, pp. 722–735.
[8]
R. Speer, J. Chin, and C. Havasi, ConceptNet 5.5: An open multilingual graph of general knowledge, in Proc. 31st AAAI Conf. Artificial Intelligence, San Francisco, CA, USA, 2017.
[9]
Z. Zhang, Y. Shi, C. Yuan, B. Li, P. Wang, W. Hu, and Z. J. Zha, Object relational graph with teacher-recommended learning for video captioning, in Proc. 2020 IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 2020, pp. 13275–13285.
[10]
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., Learning transferable visual models from natural language supervision, in Proc. 38th Int. Conf. Machine Learning, virtual, 2021, pp. 8748–8763.
[11]
W. Kim, B. Son, and I. Kim, ViLT: Vision-and-language transformer without convolution or region supervision, in Proc. 38th Int. Conf. Machine Learning, virtual, 2021, pp. 5583–5594.
[12]
T. J. Fu, L. Li, Z. Gan, K. Lin, W. Y. Wang, L. Wang, and Z. Liu, VIOLET: End-to-end video-language transformers with masked visual-token modeling, arXiv preprint arXiv: 2111.12681, 2021.
[13]
Z. Huang, Z. Zeng, Y. Huang, B. Liu, D. Fu, and J. Fu, Seeing out of tHe bOx: End-to-end pre-training for vision-language representation learning, in Proc. 2021 IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 2021, pp. 12971–12980.
[14]
J. Lei, L. Li, L. Zhou, Z. Gan, T. L. Berg, M. Bansal, and J. Liu, Less is more: ClipBERT for video-and-language learning via sparse sampling, in Proc. 2021 IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 2021, pp. 7327–7337.
[15]
C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid, VideoBERT: A joint model for video and language representation learning, in Proc. 2019 IEEE/CVF Int. Conf. Computer Vision (ICCV), Seoul, Republic of Korea, 2019, pp. 7463–7472.
[16]
G. Li, N. Duan, Y. Fang, M. Gong, and D. Jiang, Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training, in Proc. 34th AAAI Conf. Artificial Intelligence, New York, NY, USA, 2020, pp. 11336–11344.
[17]
H. Xu, G. Ghosh, P. Y. Huang, D. Okhonko, A. Aghajanyan, F. Metze, L. Zettlemoyer, and C. Feichtenhofer, VideoCLIP: Contrastive pre-training for zero-shot video-text understanding, arXiv preprint arXiv: 2109.14084, 2021.
[18]
L. Li, Y. C. Chen, Y. Cheng, Z. Gan, L. Yu, and J. Liu, HERO: Hierarchical encoder for video+language omni-representation pre-training, arXiv preprint arXiv: 2005.00200, 2020.
[19]
M. Bain, A. Nagrani, G. Varol, and A. Zisserman, Frozen in time: A joint video and image encoder for end-to-end retrieval, in Proc. 2021 IEEE/CVF Int. Conf. Computer Vision (ICCV), Montreal, Canada, 2021, pp. 1708–1718.
[20]
M. Wang, J. Xing, and Y. Liu, ActionCLIP: A new paradigm for video action recognition, arXiv preprint arXiv: 2109.08472, 2021.
[21]
R. Mokady, A. Hertz, and A. H. Bermano, ClipCap: CLIP prefix for image captioning, arXiv preprint arXiv: 2111.09734, 2021.
[22]
J. Li, D. Li, C. Xiong, and S. Hoi, BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation, arXiv preprint arXiv: 2201.12086, 2022.
[23]
T. Srinivasan, X. Ren, and J. Thomason, Curriculum learning for data-efficient vision-language alignment, arXiv preprint arXiv: 2207.14525, 2022.
[24]
B. Mustafa, C. Riquelme, J. Puigcerver, R. Jenatton, and N. Houlsby, Multimodal contrastive learning with LIMoE: The language-image mixture of experts, arXiv preprint arXiv: 2206.02770, 2022.
[25]
O. Khattab and M. Zaharia, ColBERT: Efficient and effective passage search via contextualized late interaction over BERT, in Proc. 43rd Int. ACM SIGIR Conf. Research and Development in Information Retrieval, virtual, 2020, pp. 39–48.
[26]
F. He, Q. Wang, Z. Feng, W. Jiang, Y. Lü, Y. Zhu, and X. Tan, Improving video retrieval by adaptive margin, in Proc. 44th Int. ACM SIGIR Conf. Research and Development in Information Retrieval, virtual, 2021, pp. 1359–1368.
[27]
J. Yang, Y. Bisk, and J. Gao, TACo: Token-aware cascade contrastive learning for video-text alignment, in Proc. 2021 IEEE/CVF Int. Conf. Computer Vision (ICCV), Montreal, Canada, 2021, pp. 11542–11552.
[28]
M. Dzabraev, M. Kalashnikov, S. Komkov, and A. Petiushko, MDMMT: Multidomain multimodal transformer for video retrieval, in Proc. 2021 IEEE/CVF Conf. Computer Vision and Pattern Recognition Workshops (CVPRW), Nashville, TN, USA, 2021, pp. 3349–3358.
[29]
Z. Zhu, J. Yu, Y. Wang, Y. Sun, Y. Hu, and Q. Wu, Mucko: Multi-layer cross-modal knowledge reasoning for fact-based visual question answering, arXiv preprint arXiv: 2006.09073, 2020.
[30]
K. Marino, X. Chen, D. Parikh, A. Gupta, and M. Rohrbach, KRISP: Integrating implicit and symbolic knowledge for open-domain knowledge-based VQA, in Proc. 2021 IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 2021, pp. 14106–14116.
[31]
Y. Ding, J. Yu, B. Liu, Y. Hu, M. Cui, and Q. Wu, MuKEA: Multimodal knowledge extraction and accumulation for knowledge-based visual question answering, in Proc. 2022 IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 2022, pp. 5079–5088.
[32]
T. N. Kipf and M. Welling, Semi-supervised classification with graph convolutional networks, arXiv preprint arXiv: 1609.02907, 2016.
[33]
H. Ben-younes, R. Cadene, M. Cord, and N. Thome, MUTAN: Multimodal Tucker fusion for visual question answering, in Proc. 2017 IEEE Int. Conf. Computer Vision (ICCV), Venice, Italy, 2017, pp. 2631–2639.
[34]
F. Gardères, M. Ziaeefard, B. Abeloos, and F. Lecue, ConceptBert: Concept-aware representation for visual question answering, in Proc. 2020 Conf. Empirical Methods in Natural Language Processing, virtual, 2020, pp. 489–498.
[35]
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, Attention is all you need, in Proc. 31st Int. Conf. Neural Information Processing Systems, Long Beach, CA, USA, 2017, pp. 6000–6010.
[36]
C. Malaviya, C. Bhagavatula, A. Bosselut, and Y. Choi, Commonsense knowledge base completion with structural and semantic context, in Proc. 34th AAAI Conf. Artificial Intelligence, New York, NY, USA, 2020, pp. 2925–2933.
[37]
J. Li, R. Selvaraju, A. Gotmare, S. Joty, C. Xiong, and S. C. H. Hoi, Align before fuse: Vision and language representation learning with momentum distillation, in Proc. 35th Conf. Neural Information Processing Systems, virtual, 2021, pp. 9694–9705.
[38]
H. Xue, T. Hang, Y. Zeng, Y. Sun, B. Liu, H. Yang, J. Fu, and B. Guo, Advancing high-resolution video-language representation with large-scale video transcriptions, in Proc. 2022 IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 2022, pp. 5026–5035.
[39]
T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, A simple framework for contrastive learning of visual representations, in Proc. 37th Int. Conf. Machine Learning, Vienna, Austria, 2020, pp. 1597–1607.
[40]
J. Xu, T. Mei, T. Yao, and Y. Rui, MSR-VTT: A large video description dataset for bridging video and language, in Proc. 2016 IEEE Conf. Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 2016, pp. 5288–5296.
[41]
Y. Yu, J. Kim, and G. Kim, A joint sequence fusion model for video question answering and retrieval, in Proc. 15th European Conf. Computer Vision, Munich, Germany, 2018, pp. 471–487.
[42]
L. A. Hendricks, O. Wang, E. Shechtman, J. Sivic, T. Darrell, and B. Russell, Localizing moments in video with natural language, in Proc. 2017 IEEE Int. Conf. Computer Vision (ICCV), Venice, Italy, 2017, pp. 5804–5813.
[43]
D. Chen and W. Dolan, Collecting highly parallel data for paraphrase evaluation, in Proc. 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, OR, USA, 2011, pp. 190–200.
[44]
X. Glorot, A. Bordes, and Y. Bengio, Deep sparse rectifier neural networks, in Proc. 14th Int. Conf. Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 2011, pp. 315–323.
[45]
Y. Liu, S. Albanie, A. Nagrani, and A. Zisserman, Use what you have: Video retrieval using representations from collaborative experts, arXiv preprint arXiv: 1907.13487, 2019.
[46]
Y. Ge, Y. Ge, X. Liu, D. Li, Y. Shan, X. Qie, and P. Luo, Bridging video-text retrieval with multiple choice questions, in Proc. 2022 IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 2022, pp. 16146–16155.
[47]
I. Croitoru, S. V. Bogolin, M. Leordeanu, H. Jin, A. Zisserman, S. Albanie, and Y. Liu, TeachText: Cross-modal generalized distillation for text-video retrieval, in Proc. 2021 IEEE/CVF Int. Conf. Computer Vision (ICCV), Montreal, Canada, 2021, pp. 11563–11573.
[48]
K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, Momentum contrast for unsupervised visual representation learning, in Proc. 2020 IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 2020, pp. 9726–9735.