Open Access | Just Accepted

SVMFN-FSAR: Semantic-guided Video Multimodal Fusion Network for Few-shot Action Recognition

Ran Wei1, Rui Yan2, Hongyu Qu3, Xing Li1, Qiaolin Ye1, and Liyong Fu4,5 (corresponding author)

1 College of Information Science and Technology & Artificial Intelligence, Nanjing Forestry University, Nanjing 210037, China

2 Department of Computer Science and Technology, Nanjing University, Nanjing 210023, China

3 School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China

4 College of Forestry, Hebei Agricultural University, Baoding 071000, China

5 Institute of Forest Resource Information Techniques, Chinese Academy of Forestry, Beijing 100091, China


Abstract

Few-shot Action Recognition (FSAR) has become a hot topic in areas such as computer vision and forest ecosystem security. FSAR aims to recognize previously unseen classes using only a limited number of labeled video examples. A principal challenge in FSAR is to extract category-relevant action semantics from a few samples for classification. Recent studies attempt to supplement visual information with action labels. However, concise category names yield a less distinct semantic space and limit performance. In this work, we propose a novel Semantic-guided Video Multimodal Fusion Network for FSAR (SVMFN-FSAR). We utilize a Large Language Model (LLM) to expand action categories into detailed textual knowledge, enhancing the distinctiveness of the semantic space and partially alleviating the sample scarcity inherent in FSAR tasks. We then perform metric-based matching between the extracted distinctive semantic information and the visual features of unseen-class samples to capture the overall semantics of a video for preliminary classification. In addition, we design a novel Transformer-based semantic-guided temporal interaction module that lets the LLM-expanded knowledge and the visual information complement each other along both the temporal and channel dimensions, improving the quality of sample feature representations. Experimental results on three few-shot benchmarks, Kinetics, UCF101, and HMDB51, consistently demonstrate the effectiveness and interpretability of the proposed method.
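
To make the two ideas in the abstract concrete, the sketch below illustrates (1) metric-based matching of a pooled video representation against class prototypes built from LLM-expanded descriptions, and (2) a Transformer-style cross-attention block in which visual frame tokens are refined by semantic tokens along the temporal and channel dimensions. All module names, dimensions, and the cosine-similarity metric are illustrative assumptions for exposition, not the authors' exact design.

```python
# Hypothetical sketch only; names, dimensions, and the cosine metric are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticGuidedFusion(nn.Module):
    """Toy semantic-guided temporal interaction: frames attend to expanded text tokens,
    then a semantic-driven gate reweights feature channels."""
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        # Cross-attention: visual frame tokens (queries) attend to text tokens (keys/values).
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Channel-wise gate driven by the pooled semantic description.
        self.channel_gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, frame_feats, text_feats):
        # frame_feats: (B, T, D) per-frame visual features
        # text_feats:  (B, S, D) token embeddings of the LLM-expanded description
        attn_out, _ = self.cross_attn(frame_feats, text_feats, text_feats)
        fused = self.norm(frame_feats + attn_out)          # temporal interaction
        gate = self.channel_gate(text_feats.mean(dim=1))   # (B, D) channel weights
        return fused * gate.unsqueeze(1)                   # channel modulation

def semantic_matching_scores(query_video, class_text_protos):
    """Cosine-similarity matching used here as preliminary classification logits."""
    # query_video:       (B, D) pooled representation of an unseen-class video
    # class_text_protos: (C, D) one prototype per class from expanded descriptions
    q = F.normalize(query_video, dim=-1)
    p = F.normalize(class_text_protos, dim=-1)
    return q @ p.t()   # (B, C) similarity scores
```

In this reading, the matching scores provide the coarse, semantics-level classification, while the fusion module supplies the finer-grained feature refinement; how the two are combined and trained is left to the paper itself.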

Big Data Mining and Analytics
Cite this article:
Wei R, Yan R, Qu H, et al. SVMFN-FSAR: Semantic-guided Video Multimodal Fusion Network for Few-shot Action Recognition. Big Data Mining and Analytics, 2024, https://doi.org/10.26599/BDMA.2024.9020076


Received: 14 August 2024
Revised: 07 October 2024
Accepted: 14 October 2024
Available online: 31 October 2024

© The author(s) 2024.

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
