Regular Paper

Two-Stream Temporal Convolutional Networks for Skeleton-Based Human Action Recognition

School of Software, Shandong University, Jinan 250101, China
Department of Software and IT Engineering, University of Quebec, Montreal H3C 3P8, Canada

Abstract

With the growing popularity of somatosensory interaction devices, human action recognition is becoming attractive in many application scenarios. Skeleton-based action recognition is effective because the skeleton can represent the positions and the structure of key points of the human body. In this paper, we leverage spatiotemporal vectors between skeleton sequences as the input feature representation of the network, which is more sensitive to changes in the human skeleton than representations based on distance and angle features. In addition, we redesign residual blocks with different strides across the depth of the network to improve the ability of temporal convolutional networks (TCNs) to process actions with long-range temporal dependencies. In this work, we propose two-stream temporal convolutional networks (TS-TCNs) that take full advantage of the inter-frame vector feature and the intra-frame vector feature of skeleton sequences in the spatiotemporal representations. The framework integrates the two feature representations of skeleton sequences so that each can compensate for the other's shortcomings. A fusion loss function is used to supervise the training parameters of the two branch networks. Experiments on public datasets show that our network achieves superior performance and attains an improvement of 1.2% over the recent GCN-based (BGC-LSTM) method on the NTU RGB+D dataset.
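The inter-frame and intra-frame vector features described in the abstract can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the function name, the bone-pair list, and the exact feature definitions are illustrative, not the paper's implementation.

```python
import numpy as np

def skeleton_vector_features(seq, bone_pairs):
    """Compute spatiotemporal vector features from a skeleton sequence.

    seq: array of shape (T, J, 3) -- T frames, J joints, 3D coordinates.
    bone_pairs: list of (parent, child) joint-index pairs.
    Returns (inter, intra):
      inter: (T-1, J, 3) joint displacements between consecutive frames,
      intra: (T, B, 3) bone vectors between connected joints within a frame.
    """
    seq = np.asarray(seq, dtype=np.float64)
    # Inter-frame vectors: motion of each joint from frame t to frame t+1.
    inter = seq[1:] - seq[:-1]
    # Intra-frame vectors: body structure within each frame (one vector per bone).
    intra = np.stack([seq[:, c] - seq[:, p] for p, c in bone_pairs], axis=1)
    return inter, intra
```

Intuitively, the inter-frame stream captures motion while the intra-frame stream captures pose, which is why the two representations can complement each other.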
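The fusion loss that supervises the two branch networks could look roughly like the following sketch. The additive score fusion and the branch weight `alpha` are assumptions made here for illustration, not the paper's exact formulation.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Mean cross-entropy of class logits against integer labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def fusion_loss(logits_inter, logits_intra, labels, alpha=0.5):
    """Joint supervision of both branches: a loss on the fused scores plus
    weighted per-branch losses, so each stream also learns on its own."""
    fused = logits_inter + logits_intra  # assumed additive score fusion
    return (softmax_cross_entropy(fused, labels)
            + alpha * softmax_cross_entropy(logits_inter, labels)
            + alpha * softmax_cross_entropy(logits_intra, labels))
```

Keeping the per-branch terms prevents one stream from dominating training: even if the fused prediction is already correct, each branch is still pushed toward a discriminative representation on its own.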

Electronic Supplementary Material

Download File(s)
jcst-35-3-538-Highlights.pdf (222.6 KB)

References

[1]
Aggarwal J K, Xia L. Human activity recognition from 3D data: A review. Pattern Recognition Letters, 2014, 48: 70-80.
[2]
Weinland D, Ronfard R, Boyer E. A survey of vision-based methods for action representation, segmentation and recognition. Computer Vision and Image Understanding, 2011, 115(2): 224-241.
[3]
Han F, Reily B, Hoff W, Zhang H. Space-time representation of people based on 3D skeletal data: A review. Computer Vision and Image Understanding, 2017, 158: 85-105.
[4]
Liu H, Liu B, Zhang H, Li L, Qin X, Zhang G. Crowd evacuation simulation approach based on navigation knowledge and two-layer control mechanism. Information Sciences, 2018, 436/437: 247-267.
[5]
Turaga P, Chellappa R, Subrahmanian V S. Machine recognition of human activities: A survey. IEEE Transactions on Circuits and Systems for Video Technology, 2008, 18(11): 1473-1488.
[6]
Herath S, Harandi M, Porikli F. Going deeper into action recognition: A survey. Image and Vision Computing, 2017, 60: 4-21.
[7]
Hou J H, Chau L P, Thalmann N M, He Y. Compressing 3-D human motions via keyframe-based geometry videos. IEEE Transactions on Circuits and Systems for Video Technology, 2014, 25(1): 51-62.
[8]
Sermanet P, Lynch C, Hsu J, Levine S. Time-contrastive networks: Self-supervised learning from multi-view observation. In Proc. the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops, July 2017, pp.486-487.
[9]
Shotton J, Sharp T, Kipman A, Fitzgibbon A, Finocchio M, Blake A, Cook M, Moore R. Real-time human pose recognition in parts from single depth images. Communications of the ACM, 2011, 56(1): 116-124.
[10]
Li S, Fang Z, Song W, Hao A, Qin H. Bidirectional optimization coupled lightweight networks for efficient and robust multi-person 2D pose estimation. Journal of Computer Science and Technology, 2019, 34(3): 522-536.
[11]
Shahroudy A, Liu J, Ng T T, Wang G. NTU RGB+D: A large scale dataset for 3D human activity analysis. In Proc. the 2016 IEEE Conference on Computer Vision and Pattern Recognition, June 2016, pp.1010-1019.
[12]
Zhu F, Shao L, Xie J, Fang Y. From handcrafted to learned representations for human action recognition: A survey. Image and Vision Computing, 2016, 55: 42-52.
[13]
Huang Z W, Wan C, Probst T, Van Gool L. Deep learning on lie groups for skeleton-based action recognition. In Proc. the 2017 IEEE Conference on Computer Vision and Pattern Recognition, July 2017, pp.1243-1252.
[14]
Ke Q, An S, Bennamoun M, Sohel F, Boussaid F. Skeleton-Net: Mining deep part features for 3-D action recognition. IEEE Signal Processing Letters, 2017, 24(6): 731-735.
[15]
Weng J, Weng C, Yuan J, Liu Z. Discriminative spatiotemporal pattern discovery for 3D action recognition. IEEE Transactions on Circuits and Systems for Video Technology, 2019, 29(4): 1077-1089.
[16]
Liu J, Shahroudy A, Xu D, Kot A C, Wang G. Skeleton-based action recognition using spatiotemporal LSTM network with trust gates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40(12): 3007-3021.
[17]
Lee I, Kim D, Kang S, Lee S. Ensemble deep learning for skeleton-based action recognition using temporal sliding LSTM networks. In Proc. the 2017 IEEE International Conference on Computer Vision, October 2017, pp.1012-1020.
[18]
Zhang P, Xue J, Lan C, Zeng W, Gao Z, Zheng N. Adding attentiveness to the neurons in recurrent neural networks. In Proc. the 15th European Conference on Computer Vision, September 2018, pp.136-152.
[19]
Meng F, Liu H, Liang Y, Tu J, Liu M. Sample fusion network: An end-to-end data augmentation network for skeleton-based human action recognition. IEEE Transactions on Image Processing, 2019, 28(11): 5281-5295.
[20]
Yan S, Xiong Y, Lin D. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proc. the 32nd AAAI Conference on Artificial Intelligence, February 2018, pp.7444-7452.
[21]
Shi L, Zhang Y, Cheng J, Lu H. Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In Proc. the 2019 IEEE Conference on Computer Vision and Pattern Recognition, June 2019, pp.12026-12035.
[22]
Li M, Chen S, Chen X, Zhang Y, Wang Y, Tian Q. Actional-structural graph convolutional networks for skeleton-based action recognition. In Proc. the 2019 IEEE Conference on Computer Vision and Pattern Recognition, June 2019, pp.3595-3603.
[23]
Si C, Chen W, Wang W, Wang L, Tan T. An attention enhanced graph convolutional LSTM network for skeleton-based action recognition. In Proc. the 2019 IEEE Conference on Computer Vision and Pattern Recognition, June 2019, pp.1227-1236.
[24]
Lea C, Flynn M D, Vidal R, Reiter A, Hager G D. Temporal convolutional networks for action segmentation and detection. In Proc. the 2017 IEEE Conference on Computer Vision and Pattern Recognition, July 2017, pp.1003-1012.
[25]
Kim T S, Reiter A. Interpretable 3D human action analysis with temporal convolutional networks. In Proc. the 2017 IEEE Conference on Computer Vision and Pattern Recognition, July 2017, pp.1623-1631.
[26]
Liu J, Shahroudy A, Perez M, Wang G, Duan L Y, Kot A C. NTU RGB+D 120: A large-scale benchmark for 3D human activity understanding. arXiv: 1905.04757, 2019. https://arxiv.org/pdf/1905.04757.pdf, Jan. 2020.
[27]
Jiang W, Nie X, Xia Y, Wu Y, Zhu S C. Cross-view action modeling, learning and recognition. In Proc. the 2014 IEEE Conference on Computer Vision and Pattern Recognition, June 2014, pp.2649-2656.
[28]
Xia L, Chen C C, Aggarwal J K. View invariant human action recognition using histograms of 3D joints. In Proc. the 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, June 2012, pp.20-27.
[29]
Liu Z, Zhang C, Tian Y. 3D-based deep convolutional neural network for action recognition with depth sequences. Image and Vision Computing, 2016, 55: 93-100.
[30]
Wang P, Li W, Wan J, Ogunbona P, Liu X. Cooperative training of deep aggregation networks for RGB-D action recognition. In Proc. the 32nd AAAI Conference on Artificial Intelligence, February 2018, pp.7404-7411.
[31]
Jiang W, Liu Z, Wu Y, Yuan J. Learning actionlet ensemble for 3D human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014, 36(5): 914-927.
[32]
Zhang S, Liu X, Xiao J. On geometric features for skeleton-based action recognition using multilayer LSTM networks. In Proc. the 2017 IEEE Winter Conference on Applications of Computer Vision, March 2017, pp.148-157.
[33]
Zhang P, Lan C, Xing J, Zeng W, Xue J, Zheng N. View adaptive recurrent neural networks for high performance human action recognition from skeleton data. In Proc. the 2017 IEEE International Conference on Computer Vision, October 2017, pp.2136-2145.
[34]
Ke Q, Bennamoun M, An S, Sohel F, Boussaid F. A new representation of skeleton sequences for 3D action recognition. In Proc. the 2017 IEEE Conference on Computer Vision and Pattern Recognition, July 2017, pp.4570-4579.
[35]
Ghorbel E, Boonaert J, Boutteau R, Lecoeuche S, Savatier X. An extension of kernel learning methods using a modified Log-Euclidean distance for fast and accurate skeleton-based human action recognition. Computer Vision and Image Understanding, 2018, 175: 32-43.
[36]
Yuan J, Liu Z, Wu Y. Discriminative subvolume search for efficient action detection. In Proc. the 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, June 2009, pp.2442-2449.
[37]
Liu M, Shi Y, Zheng L, Xu K, Huang H, Manocha D. Recurrent 3D attentional networks for end-to-end active object recognition. Computational Visual Media, 2019, 5(1): 91-104.
[38]
Ioffe S, Szegedy C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proc. the 32nd International Conference on Machine Learning, July 2015, pp.448-456.
[39]
He K, Zhang X, Ren S, Sun J. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proc. the 2015 IEEE International Conference on Computer Vision, December 2015, pp.1026-1034.
[40]
Girija S S. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv: 1603.04467, 2016. https://arxiv.org/abs/1603.04467, Jan. 2020.
[41]
Ofli F, Chaudhry R, Kurillo G, Vidal R, Bajcsy R. Sequence of the most informative joints (SMIJ): A new representation for human skeletal action recognition. In Proc. the 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, June 2012, pp.8-13.
[42]
Zhao R, Wang K, Su H, Ji Q. Bayesian graph convolution LSTM for skeleton based action recognition. In Proc. the 2019 IEEE International Conference on Computer Vision, October 2019, pp.6881-6891.
[43]
Yu Z, Chen W, Guo G. Fusing spatiotemporal features and joints for 3D action recognition. In Proc. the 2013 IEEE Conference on Computer Vision and Pattern Recognition, June 2013, pp.486-491.
Journal of Computer Science and Technology
Pages 538-550
Cite this article:
Jia J-G, Zhou Y-F, Hao X-W, et al. Two-Stream Temporal Convolutional Networks for Skeleton-Based Human Action Recognition. Journal of Computer Science and Technology, 2020, 35(3): 538-550. https://doi.org/10.1007/s11390-020-0405-6


Received: 29 February 2020
Revised: 05 April 2020
Published: 29 May 2020
©Institute of Computing Technology, Chinese Academy of Sciences 2020