Open Access | Just Accepted

SVMFN-FSAR: Semantic-guided Video Multimodal Fusion Network for Few-shot Action Recognition

Ran Wei1, Rui Yan2, Hongyu Qu3, Xing Li1, Qiaolin Ye1, Liyong Fu4,5 (✉)

1 College of Information Science and Technology & Artificial Intelligence, Nanjing Forestry University, Nanjing 210037, China

2 Department of Computer Science and Technology, Nanjing University, Nanjing 210023, China

3 School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China

4 College of Forestry, Hebei Agricultural University, Baoding 071000, China

5 Institute of Forest Resource Information Techniques, Chinese Academy of Forestry, Beijing 100091, China


Abstract

Few-shot Action Recognition (FSAR) has become a hot topic in areas such as computer vision and forest ecosystem security. FSAR aims to recognize previously unseen classes from only a few labeled video examples. A principal challenge in FSAR is extracting sufficient category-related action semantics from so few samples for classification. Recent studies attempt to compensate for limited visual information with action labels; however, concise action category names yield a less distinct semantic space and limit performance. In this work, we propose a novel Semantic-guided Video Multimodal Fusion Network for FSAR (SVMFN-FSAR). We utilize a Large Language Model (LLM) to expand detailed textual knowledge of each action category, enhancing the distinctiveness of the semantic space and partly alleviating the sample scarcity inherent in FSAR. We then compute a matching metric between the extracted distinctive semantic information and the visual features of unknown-class samples to capture the overall semantics of a video for preliminary classification. In addition, we design a novel Transformer-based semantic-guided temporal interaction module that lets the LLM-expanded knowledge and the visual information complement each other along both the temporal and channel dimensions, improving the quality of sample feature representations. Experimental results on three few-shot benchmarks, Kinetics, UCF101, and HMDB51, consistently demonstrate the effectiveness and interpretability of the proposed method.
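To make the two ideas in the abstract concrete, the following is a minimal PyTorch sketch of (a) a Transformer cross-attention module in which per-frame visual features attend to LLM-expanded semantic tokens along the temporal dimension and are then modulated channel-wise, and (b) a prototype-based matching metric for the few-shot episode. All module names, dimensions, and the cosine-similarity metric are assumptions for illustration; the paper's actual architecture and metric may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticGuidedFusion(nn.Module):
    """Hypothetical fusion of LLM-expanded text embeddings with per-frame
    visual features: temporal cross-attention plus channel-wise gating."""
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        # Temporal path: visual frames (queries) attend to semantic tokens.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Channel path: gate conditioned on pooled semantic embedding.
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, visual, semantic):
        # visual:   (B, T, D) per-frame features from a video backbone
        # semantic: (B, S, D) token embeddings of LLM-expanded descriptions
        attn_out, _ = self.cross_attn(query=visual, key=semantic, value=semantic)
        fused = self.norm(visual + attn_out)        # temporal interaction
        gate = self.gate(semantic.mean(dim=1))      # (B, D) channel weights
        return fused * gate.unsqueeze(1)            # channel modulation

def episode_logits(support, support_labels, query, n_way):
    """Cosine-similarity matching against class prototypes, a common
    few-shot metric (assumed here; not specified by the abstract)."""
    support = F.normalize(support.mean(dim=1), dim=-1)  # (N*K, D) video-level
    query = F.normalize(query.mean(dim=1), dim=-1)      # (Q, D)
    protos = torch.stack(
        [support[support_labels == c].mean(0) for c in range(n_way)])
    return query @ protos.t()                           # (Q, n_way) scores
```

In this sketch, the cross-attention realizes the temporal complementarity between modalities, while the sigmoid gate realizes the channel-wise guidance; the fused features would then be fed to the matching metric for classification.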

Big Data Mining and Analytics
Cite this article:
Wei R, Yan R, Qu H, et al. SVMFN-FSAR: Semantic-guided Video Multimodal Fusion Network for Few-shot Action Recognition. Big Data Mining and Analytics, 2024, https://doi.org/10.26599/BDMA.2024.9020076