[1] J. Weng, C. Weng, J. Yuan, and Z. Liu, Discriminative spatio-temporal pattern discovery for 3D action recognition, IEEE Trans. Circuits Syst. Video Technol., vol. 29, no. 4, pp. 1077–1089, 2018.
[8] J. Snell, K. Swersky, and R. Zemel, Prototypical networks for few-shot learning, in Proc. 31st Int. Conf. Neural Information Processing Systems, Long Beach, CA, USA, 2017, pp. 4080–4090.
[9] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. S. Torr, and T. M. Hospedales, Learning to compare: Relation network for few-shot learning, in Proc. 2018 IEEE/CVF Conf. Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 2018, pp. 1199–1208.
[10] L. Zhu and Y. Yang, Compound memory networks for few-shot video classification, in Proc. 15th European Conf. Computer Vision, Munich, Germany, 2018, pp. 782–797.
[11] H. Zhang, L. Zhang, X. Qi, H. Li, P. H. S. Torr, and P. Koniusz, Few-shot action recognition with permutation-invariant attention, in Proc. 16th European Conf. Computer Vision, Glasgow, UK, 2020, pp. 525–542.
[12] K. Cao, J. Ji, Z. Cao, C. Y. Chang, and J. C. Niebles, Few-shot video classification via temporal alignment, in Proc. 2020 IEEE/CVF Conf. Computer Vision and Pattern Recognition, Seattle, WA, USA, 2020, pp. 10615–10624.
[13] S. Zhang, J. Zhou, and X. He, Learning implicit temporal alignment for few-shot video classification, in Proc. 30th Int. Joint Conf. Artificial Intelligence, Montreal, Canada, 2021, pp. 1309–1315.
[14] T. Perrett, A. Masullo, T. Burghardt, M. Mirmehdi, and D. Damen, Temporal-relational CrossTransformers for few-shot action recognition, in Proc. 2021 IEEE/CVF Conf. Computer Vision and Pattern Recognition, Nashville, TN, USA, 2021, pp. 475–484.
[15] A. Thatipelli, S. Narayan, S. Khan, R. M. Anwer, F. S. Khan, and B. Ghanem, Spatio-temporal relation modeling for few-shot action recognition, in Proc. 2022 IEEE/CVF Conf. Computer Vision and Pattern Recognition, New Orleans, LA, USA, 2022, pp. 19926–19935.
[16] X. Wang, S. Zhang, Z. Qing, M. Tang, Z. Zuo, C. Gao, R. Jin, and N. Sang, Hybrid relation guided set matching for few-shot action recognition, in Proc. 2022 IEEE/CVF Conf. Computer Vision and Pattern Recognition, New Orleans, LA, USA, 2022, pp. 19916–19925.
[17] X. Wang, S. Zhang, Z. Qing, C. Gao, Y. Zhang, D. Zhao, and N. Sang, MoLo: Motion-augmented long-short contrastive learning for few-shot action recognition, in Proc. 2023 IEEE/CVF Conf. Computer Vision and Pattern Recognition, Vancouver, Canada, 2023, pp. 18011–18021.
[20] X. Wang, S. Zhang, J. Cen, C. Gao, Y. Zhang, D. Zhao, and N. Sang, CLIP-guided prototype modulating for few-shot action recognition, Int. J. Comput. Vision, vol. 132, no. 6, pp. 1899–1912, 2024.
[21] J. Carreira and A. Zisserman, Quo Vadis, action recognition? A new model and the Kinetics dataset, in Proc. 2017 IEEE Conf. Computer Vision and Pattern Recognition, Honolulu, HI, USA, 2017, pp. 4724–4733.
[22] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, HMDB: A large video database for human motion recognition, in Proc. 2011 Int. Conf. Computer Vision, Barcelona, Spain, 2011, pp. 2556–2563.
[23] K. Soomro, A. R. Zamir, and M. Shah, UCF101: A dataset of 101 human actions classes from videos in the wild, arXiv preprint arXiv: 1212.0402, 2012.
[24] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., Learning transferable visual models from natural language supervision, in Proc. 38th Int. Conf. Machine Learning, Virtual Event, 2021, pp. 8748–8763.
[28] S. Zhao, L. Zhang, and X. Liu, DAE-TPGM: A deep autoencoder network based on a two-part-gamma model for analyzing single-cell RNA-seq data, Comput. Biol. Med., vol. 146, p. 105578, 2022.
[31] H. Qu, R. Yan, X. Shu, H. Gao, P. Huang, and G. S. Xie, MVP-Shot: Multi-velocity progressive-alignment framework for few-shot action recognition, arXiv preprint arXiv: 2405.02077, 2024.
[32] P. Kaul, W. Xie, and A. Zisserman, Label, verify, correct: A simple few shot object detection method, in Proc. 2022 IEEE/CVF Conf. Computer Vision and Pattern Recognition, New Orleans, LA, USA, 2022, pp. 14217–14227.
[33] O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra, Matching networks for one shot learning, in Proc. 30th Int. Conf. Neural Information Processing Systems, Barcelona, Spain, 2016, pp. 3637–3645.
[34] C. Finn, P. Abbeel, and S. Levine, Model-agnostic meta-learning for fast adaptation of deep networks, in Proc. 34th Int. Conf. Machine Learning, Sydney, Australia, 2017, pp. 1126–1135.
[35] S. Ravi and H. Larochelle, Optimization as a model for few-shot learning, in Proc. 5th Int. Conf. Learning Representations, Toulon, France, 2017, pp. 1–11.
[36] A. A. Rusu, D. Rao, J. Sygnowski, O. Vinyals, R. Pascanu, S. Osindero, and R. Hadsell, Meta-learning with latent embedding optimization, arXiv preprint arXiv: 1807.05960, 2018.
[37] Q. Cai, Y. Pan, T. Yao, C. Yan, and T. Mei, Memory matching networks for one-shot image recognition, in Proc. 2018 IEEE/CVF Conf. Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 2018, pp. 4080–4088.
[38] T. Munkhdalai, X. Yuan, S. Mehri, and A. Trischler, Rapid adaptation with conditionally shifted neurons, in Proc. 35th Int. Conf. Machine Learning, Stockholm, Sweden, 2018, pp. 3664–3673.
[39] X. Gu, T. Y. Lin, W. Kuo, and Y. Cui, Open-vocabulary object detection via vision and language knowledge distillation, arXiv preprint arXiv: 2104.13921, 2021.
[40] Z. Wang, Y. Lu, Q. Li, X. Tao, Y. Guo, M. Gong, and T. Liu, CRIS: CLIP-driven referring image segmentation, in Proc. 2022 IEEE/CVF Conf. Computer Vision and Pattern Recognition, New Orleans, LA, USA, 2022, pp. 11676–11685.
[41] R. Mokady, A. Hertz, and A. H. Bermano, ClipCap: CLIP prefix for image captioning, arXiv preprint arXiv: 2111.09734, 2021.
[42] M. Narasimhan, A. Rohrbach, and T. Darrell, CLIP-It! Language-guided video summarization, in Proc. 35th Int. Conf. Neural Information Processing Systems, Virtual Event, 2021, p. 1072.
[43] H. Xu, G. Ghosh, P. Y. Huang, D. Okhonko, A. Aghajanyan, F. Metze, L. Zettlemoyer, and C. Feichtenhofer, VideoCLIP: Contrastive pre-training for zero-shot video-text understanding, in Proc. 2021 Conf. Empirical Methods in Natural Language Processing, Virtual Event, 2021, pp. 6787–6800.
[44] M. Wang, J. Xing, and Y. Liu, ActionCLIP: A new paradigm for video action recognition, arXiv preprint arXiv: 2109.08472, 2021.
[45] B. Ni, H. Peng, M. Chen, S. Zhang, G. Meng, J. Fu, S. Xiang, and H. Ling, Expanding language-image pretrained models for general video recognition, in Proc. 17th European Conf. Computer Vision, Tel Aviv, Israel, 2022, pp. 1–18.
[47] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al., GPT-4 technical report, arXiv preprint arXiv: 2303.08774, 2023.
[48] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M. A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al., LLaMA: Open and efficient foundation language models, arXiv preprint arXiv: 2302.13971, 2023.
[50] K. Lehnert, AI insights into theoretical physics and the swampland program: A journey through the cosmos with ChatGPT, arXiv preprint arXiv: 2301.08155, 2023.
[51] R. Tu, C. Ma, and C. Zhang, Causal-discovery performance of ChatGPT in the context of neuropathic pain diagnosis, arXiv preprint arXiv: 2301.13819, 2023.
[52] W. Wu, H. Yao, M. Zhang, Y. Song, W. Ouyang, and J. Wang, GPT4Vis: What can GPT-4 do for zero-shot visual recognition? arXiv preprint arXiv: 2311.15732, 2023.
[53] S. Guo, Y. Wang, S. Li, and N. Saeed, Semantic communications with ordered importance using ChatGPT, arXiv preprint arXiv: 2302.07142, 2023.
[54] K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition, in Proc. 2016 IEEE Conf. Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 2016, pp. 770–778.
[55] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, Attention is all you need, in Proc. 31st Int. Conf. Neural Information Processing Systems, Long Beach, CA, USA, 2017, pp. 6000–6010.
[56] M. Bishay, G. Zoumpourlis, and I. Patras, TARN: Temporal attentive relation network for few-shot and zero-shot action recognition, arXiv preprint arXiv: 1907.09021, 2019.
[57] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool, Temporal segment networks: Towards good practices for deep action recognition, in Proc. 14th European Conf. Computer Vision, Amsterdam, The Netherlands, 2016, pp. 20–36.
[58] D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv: 1412.6980, 2014.
[60] S. Li, H. Liu, R. Qian, Y. Li, J. See, M. Fei, X. Yu, and W. Lin, TA2N: Two-stage action alignment network for few-shot action recognition, in Proc. 36th AAAI Conf. Artificial Intelligence, Vancouver, Canada, 2022, pp. 1404–1411.
[61] J. Wu, T. Zhang, Z. Zhang, F. Wu, and Y. Zhang, Motion-modulated temporal fragment alignment network for few-shot action recognition, in Proc. 2022 IEEE/CVF Conf. Computer Vision and Pattern Recognition, New Orleans, LA, USA, 2022, pp. 9141–9150.
[62] Y. Huang, L. Yang, and Y. Sato, Compound prototype matching for few-shot action recognition, in Proc. 17th European Conf. Computer Vision, Tel Aviv, Israel, 2022, pp. 351–368.
[63] S. Zheng, S. Chen, and Q. Jin, Few-shot action recognition with hierarchical matching and contrastive learning, in Proc. 17th European Conf. Computer Vision, Tel Aviv, Israel, 2022, pp. 297–313.
[64] Z. Zhu, L. Wang, S. Guo, and G. Wu, A closer look at few-shot video classification: A new baseline and benchmark, arXiv preprint arXiv: 2110.12358, 2021.