Abstract
Few-shot Action Recognition (FSAR) has become a hot topic in areas such as computer vision and forest ecosystem security. FSAR aims to recognize previously unseen classes from only a few labeled video examples. A principal challenge in FSAR is to extract sufficient category-related action semantics from these few samples for classification. Recent studies attempt to compensate for limited visual information with action labels; however, concise action category names yield a less distinctive semantic space and limit performance. In this work, we propose a novel Semantic-guided Video Multimodal Fusion Network for FSAR (SVMFN-FSAR). We employ a Large Language Model (LLM) to expand each action category into detailed textual knowledge, which sharpens the distinctions in the semantic space and partially alleviates the sample scarcity inherent to FSAR. We then compute a matching metric between the extracted distinctive semantic information and the visual features of unseen-class samples to capture the overall semantics of a video for preliminary classification. In addition, we design a novel Transformer-based semantic-guided temporal interaction module that lets the LLM-expanded knowledge and the visual information complement each other along both the temporal and channel dimensions, improving the quality of the sample feature representations. Experimental results on three few-shot benchmarks, Kinetics, UCF101, and HMDB51, consistently demonstrate the effectiveness and interpretability of the proposed method.