SVMFN-FSAR: Semantic-Guided Video Multimodal Fusion Network for Few-Shot Action Recognition

College of Information Science and Technology & Artificial Intelligence, with State Key Laboratory of Tree Genetics and Breeding, and also with Co-Innovation Center for Sustainable Forestry in Southern China, Nanjing Forestry University, Nanjing 210037, China
Department of Computer Science and Technology, Nanjing University, Nanjing 210023, China
School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
College of Forestry, Hebei Agricultural University, Baoding 071000, China, and also with Institute of Forest Resource Information Techniques, Chinese Academy of Forestry, Beijing 100091, China

Abstract

Few-Shot Action Recognition (FSAR) has become a hot topic in areas such as computer vision and forest ecosystem security. FSAR aims to recognize previously unseen classes from only a few labeled video examples. A principal challenge in FSAR is to extract sufficient category-related action semantics from the few available samples for classification. Recent studies attempt to compensate for the limited visual information by exploiting action labels; however, concise category names yield a less distinct semantic space and thus limit performance. In this work, we propose a novel Semantic-guided Video Multimodal Fusion Network for FSAR (SVMFN-FSAR). We use a Large Language Model (LLM) to expand each action category into detailed textual knowledge, which enhances the distinctiveness of the semantic space and partially alleviates the sample scarcity inherent in FSAR. We then match the extracted, distinctive semantic information against the visual information of unknown-class samples to capture the overall semantics of a video for preliminary classification. In addition, we design a novel Transformer-based semantic-guided temporal interaction module, which lets the LLM-expanded knowledge and the visual information complement each other and improves the quality of the sample feature representations. Experimental results on three few-shot benchmarks, Kinetics, UCF101, and HMDB51, consistently demonstrate the effectiveness and interpretability of the proposed method.
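To make the two ideas described above concrete, the sketch below illustrates (i) a preliminary classification step that matches a pooled video feature against LLM-expanded class-text embeddings by cosine similarity, and (ii) a Transformer-style cross-attention block in which the expanded text tokens guide frame-level visual features before temporal interaction. This is a minimal illustrative sketch only; the module names, dimensions, single-layer design, and the PyTorch formulation are our assumptions, not the authors' implementation.

```python
# Illustrative sketch (PyTorch), not the authors' code:
# (1) semantic matching between a pooled video feature and per-class text
#     embeddings obtained from LLM-expanded descriptions, and
# (2) a cross-attention block where text tokens guide frame features,
#     followed by temporal self-attention among frames.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SemanticGuidedTemporalInteraction(nn.Module):
    """Frame features attend to LLM-expanded text tokens, then to each other."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, frames: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # frames: [B, T, dim] per-frame visual features
        # text_tokens: [B, L, dim] token embeddings of the expanded description
        guided, _ = self.cross_attn(frames, text_tokens, text_tokens)
        frames = self.norm1(frames + guided)                  # semantic guidance
        mixed, _ = self.temporal_attn(frames, frames, frames)
        return self.norm2(frames + mixed)                     # temporal interaction


def semantic_matching_logits(video_feat: torch.Tensor,
                             class_text_feat: torch.Tensor,
                             tau: float = 0.07) -> torch.Tensor:
    """Preliminary classification: cosine similarity between the pooled video
    feature [B, dim] and each class's text embedding [N, dim]."""
    v = F.normalize(video_feat, dim=-1)
    t = F.normalize(class_text_feat, dim=-1)
    return v @ t.t() / tau                                    # [B, N] logits


if __name__ == "__main__":
    B, T, L, N, dim = 2, 8, 16, 5, 512
    frames = torch.randn(B, T, dim)        # e.g., frame features from a backbone
    text_tokens = torch.randn(B, L, dim)   # LLM-expanded description tokens
    class_text = torch.randn(N, dim)       # one pooled embedding per class
    fused = SemanticGuidedTemporalInteraction(dim)(frames, text_tokens)
    logits = semantic_matching_logits(fused.mean(dim=1), class_text)
    print(logits.shape)                    # torch.Size([2, 5])
```

In this reading, the matching step supplies a text-driven prior over the candidate classes, while the cross-attention block fuses the two modalities so that the final video representation used for metric-based few-shot classification is semantically sharpened.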

Cite this article:
Wei R, Yan R, Qu H, et al. SVMFN-FSAR: Semantic-Guided Video Multimodal Fusion Network for Few-Shot Action Recognition. Big Data Mining and Analytics, 2025, 8(3): 534-550. https://doi.org/10.26599/BDMA.2024.9020076