Three-dimensional skeleton-based action recognition (3D SAR) has gained considerable attention within the computer vision community, owing to the inherent advantages of skeleton data. As a result, a large body of work, based on both conventional handcrafted features and learned feature extraction, has accumulated over the years. However, prior surveys on action recognition have primarily focused on video or red-green-blue (RGB) data-dominated approaches, with limited coverage of skeleton data. Furthermore, despite the extensive application of deep learning methods in this field, there has been a notable absence of work that provides an introductory or comprehensive review from the perspective of deep learning architectures. To address these limitations, this survey first underscores the importance of action recognition and the value of 3D skeleton data as a modality. Subsequently, we provide a comprehensive introduction to mainstream action recognition techniques based on 4 fundamental deep architectures, i.e., recurrent neural networks, convolutional neural networks, graph convolutional networks, and Transformers. The methods built on each architecture are then presented in a data-driven manner with detailed discussion. Finally, we offer insights into the current largest 3D skeleton dataset, NTU-RGB+D, and its new edition, NTU-RGB+D 120, along with an overview of several top-performing algorithms on these datasets. To the best of our knowledge, this survey is the first comprehensive discussion of deep learning-based action recognition using 3D skeleton data.