Open Access | Just Accepted

SVMFN-FSAR: Semantic-guided Video Multimodal Fusion Network for Few-shot Action Recognition

Ran Wei1, Rui Yan2, Hongyu Qu3, Xing Li1, Qiaolin Ye1, Liyong Fu4,5 (✉)

1 College of Information Science and Technology & Artificial Intelligence, Nanjing Forestry University, Nanjing 210037, China

2 Department of Computer Science and Technology, Nanjing University, Nanjing 210023, China

3 School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China

4 College of Forestry, Hebei Agricultural University, Baoding 071000, China

5 Institute of Forest Resource Information Techniques, Chinese Academy of Forestry, Beijing 100091, China


Abstract

Few-shot Action Recognition (FSAR) has become a hot topic in areas such as computer vision and forest ecosystem security. FSAR aims to recognize previously unseen classes from only a few labeled video examples. A principal challenge in FSAR is extracting sufficient category-related action semantics from so few samples for classification. Recent studies attempt to compensate for limited visual information with action labels; however, concise action category names yield a less distinct semantic space and limit performance. In this work, we propose a novel Semantic-guided Video Multimodal Fusion Network for FSAR (SVMFN-FSAR). We utilize a Large Language Model (LLM) to expand detailed textual knowledge of each action category, enhancing the distinctiveness of the semantic space and partly alleviating the sample scarcity inherent in FSAR. We then compute a matching metric between the extracted distinctive semantic information and the visual features of unknown-class samples to capture the overall semantics of a video for preliminary classification. In addition, we design a novel Transformer-based semantic-guided temporal interaction module that lets the LLM-expanded knowledge and the visual information complement each other along both the temporal and channel dimensions, improving the quality of sample feature representations. Experimental results on three few-shot benchmarks, Kinetics, UCF101, and HMDB51, consistently demonstrate the effectiveness and interpretability of the proposed method.
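To make the two ideas in the abstract concrete, the following is a minimal PyTorch sketch of (a) a Transformer cross-attention module in which per-frame visual features attend to LLM-expanded semantic tokens along the temporal dimension and are then modulated channel-wise, and (b) a prototype-based matching metric for the few-shot episode. All module names, dimensions, and the cosine-similarity metric are assumptions for illustration; the paper's actual architecture and metric may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticGuidedFusion(nn.Module):
    """Hypothetical fusion of LLM-expanded text embeddings with per-frame
    visual features: temporal cross-attention plus channel-wise gating."""
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        # Temporal path: visual frames (queries) attend to semantic tokens.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Channel path: gate conditioned on pooled semantic embedding.
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, visual, semantic):
        # visual:   (B, T, D) per-frame features from a video backbone
        # semantic: (B, S, D) token embeddings of LLM-expanded descriptions
        attn_out, _ = self.cross_attn(query=visual, key=semantic, value=semantic)
        fused = self.norm(visual + attn_out)        # temporal interaction
        gate = self.gate(semantic.mean(dim=1))      # (B, D) channel weights
        return fused * gate.unsqueeze(1)            # channel modulation

def episode_logits(support, support_labels, query, n_way):
    """Cosine-similarity matching against class prototypes, a common
    few-shot metric (assumed here; not specified by the abstract)."""
    support = F.normalize(support.mean(dim=1), dim=-1)  # (N*K, D) video-level
    query = F.normalize(query.mean(dim=1), dim=-1)      # (Q, D)
    protos = torch.stack(
        [support[support_labels == c].mean(0) for c in range(n_way)])
    return query @ protos.t()                           # (Q, n_way) scores
```

In this sketch, the cross-attention realizes the temporal complementarity between modalities, while the sigmoid gate realizes the channel-wise guidance; the fused features would then be fed to the matching metric for classification.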

Big Data Mining and Analytics
Cite this article:
Wei R, Yan R, Qu H, et al. SVMFN-FSAR: Semantic-guided Video Multimodal Fusion Network for Few-shot Action Recognition. Big Data Mining and Analytics, 2024, https://doi.org/10.26599/BDMA.2024.9020076