Open Access

Global Spatial-Temporal Information Encoder-Decoder Based Action Segmentation in Untrimmed Video

Department of Computer and Cyberspace Security, Fujian Normal University, Fuzhou 350007, China
Department of Information Engineering, Sun Yat-sen University, Kaohsiung 80424, China

Abstract

Action segmentation has made significant progress, but segmenting and recognizing actions in long untrimmed videos remains challenging. Most state-of-the-art methods rely on temporal convolution; however, the limited ability of temporal convolutions to model long-term dependencies and their inherent inflexibility restrict what these models can achieve. Existing action segmentation methods also suffer from over-segmentation, which causes classification errors and degrades segmentation quality. To address these issues, this paper proposes a global spatial-temporal information encoder-decoder based action segmentation method. The proposed method uses the global temporal information captured by a refinement layer to help the Encoder-Decoder (ED) structure locate action segmentation points more accurately while suppressing the over-segmentation that the ED structure tends to produce. The method achieves 93% frame-level accuracy on a newly constructed real-world Tai Chi action dataset. The experimental results demonstrate that the method can segment actions in long videos accurately and efficiently.
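Although the paper's implementation details are not reproduced here, the general pattern the abstract describes, a temporal encoder-decoder producing frame-wise action scores that are then refined by a stage with long-range temporal context, can be illustrated with a minimal PyTorch sketch. The module names (TemporalED, DilatedRefiner), layer counts, and sizes below are assumptions chosen for illustration only and are not the authors' published architecture.

import torch
import torch.nn as nn

class TemporalED(nn.Module):
    # 1D convolutional encoder-decoder over per-frame features (B, C, T).
    def __init__(self, in_dim, hidden, num_classes):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(in_dim, hidden, 3, padding=1), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(hidden, hidden, 3, padding=1), nn.ReLU(), nn.MaxPool1d(2))
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2), nn.Conv1d(hidden, hidden, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2), nn.Conv1d(hidden, num_classes, 3, padding=1))

    def forward(self, x):                      # x: (B, in_dim, T), T divisible by 4
        return self.decoder(self.encoder(x))   # -> (B, num_classes, T)

class DilatedRefiner(nn.Module):
    # Dilated 1D convolutions over the initial class scores; the exponentially
    # growing receptive field supplies long-range temporal context that can
    # smooth away spurious, over-segmented predictions.
    def __init__(self, num_classes, hidden=64, layers=8):
        super().__init__()
        blocks = []
        for i in range(layers):
            d = 2 ** i
            blocks += [nn.Conv1d(hidden if i else num_classes, hidden, 3,
                                 padding=d, dilation=d), nn.ReLU()]
        blocks.append(nn.Conv1d(hidden, num_classes, 1))
        self.net = nn.Sequential(*blocks)

    def forward(self, logits):
        return self.net(torch.softmax(logits, dim=1))

# Usage: per-frame features -> initial frame-wise logits -> refined logits.
B, C, T, K = 2, 2048, 256, 10                  # batch, feature dim, frames, classes
feats = torch.randn(B, C, T)
ed, refiner = TemporalED(C, 128, K), DilatedRefiner(K)
refined = refiner(ed(feats))                   # (B, K, T)
pred = refined.argmax(dim=1)                   # (B, T) per-frame action labels

In the paper, the refinement layer's global temporal information is used to guide the ED structure's segmentation points rather than simply post-processing its output; the sketch above only shows the general idea of refining frame-wise scores with long-range context.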

Tsinghua Science and Technology
Pages 290-302
Cite this article:
Liu Y, Sun Y, Chen Z, et al. Global Spatial-Temporal Information Encoder-Decoder Based Action Segmentation in Untrimmed Video. Tsinghua Science and Technology, 2025, 30(1): 290-302. https://doi.org/10.26599/TST.2024.9010041


Received: 31 October 2023
Revised: 11 January 2024
Accepted: 18 February 2024
Published: 11 September 2024
© The Author(s) 2025.

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
