Open Access

Global Spatial-Temporal Information Encoder-Decoder Based Action Segmentation in Untrimmed Video

Department of Computer and Cyberspace Security, Fujian Normal University, Fuzhou 350007, China
Department of Information Engineering, Sun Yat-sen University, Kaohsiung 80424, China

Abstract

Action segmentation has made significant progress, but segmenting and recognizing actions in long untrimmed videos remains challenging. Most state-of-the-art methods rely on temporal convolution; however, the limited ability of temporal convolutions to model long-term dependencies and their inherent inflexibility restrict what these models can achieve. Existing action segmentation methods also suffer from over-segmentation, which causes classification errors and degrades segmentation quality. To address these issues, this paper proposes a global spatial-temporal information encoder-decoder based action segmentation method. The proposed method uses the global temporal information captured by a refinement layer to help the Encoder-Decoder (ED) structure locate action segmentation points more accurately while suppressing the over-segmentation that the ED structure tends to produce. The method achieves 93% frame-level accuracy on a newly constructed real-world Tai Chi action dataset. The experimental results demonstrate that the method can segment actions in long videos accurately and efficiently.
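Although the paper's implementation details are not reproduced here, the general pattern the abstract describes, a temporal encoder-decoder producing frame-wise action scores that are then refined by a stage with long-range temporal context, can be illustrated with a minimal PyTorch sketch. The module names (TemporalED, DilatedRefiner), layer counts, and sizes below are assumptions chosen for illustration only and are not the authors' published architecture.

import torch
import torch.nn as nn

class TemporalED(nn.Module):
    # 1D convolutional encoder-decoder over per-frame features (B, C, T).
    def __init__(self, in_dim, hidden, num_classes):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(in_dim, hidden, 3, padding=1), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(hidden, hidden, 3, padding=1), nn.ReLU(), nn.MaxPool1d(2))
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2), nn.Conv1d(hidden, hidden, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2), nn.Conv1d(hidden, num_classes, 3, padding=1))

    def forward(self, x):                      # x: (B, in_dim, T), T divisible by 4
        return self.decoder(self.encoder(x))   # -> (B, num_classes, T)

class DilatedRefiner(nn.Module):
    # Dilated 1D convolutions over the initial class scores; the exponentially
    # growing receptive field supplies long-range temporal context that can
    # smooth away spurious, over-segmented predictions.
    def __init__(self, num_classes, hidden=64, layers=8):
        super().__init__()
        blocks = []
        for i in range(layers):
            d = 2 ** i
            blocks += [nn.Conv1d(hidden if i else num_classes, hidden, 3,
                                 padding=d, dilation=d), nn.ReLU()]
        blocks.append(nn.Conv1d(hidden, num_classes, 1))
        self.net = nn.Sequential(*blocks)

    def forward(self, logits):
        return self.net(torch.softmax(logits, dim=1))

# Usage: per-frame features -> initial frame-wise logits -> refined logits.
B, C, T, K = 2, 2048, 256, 10                  # batch, feature dim, frames, classes
feats = torch.randn(B, C, T)
ed, refiner = TemporalED(C, 128, K), DilatedRefiner(K)
refined = refiner(ed(feats))                   # (B, K, T)
pred = refined.argmax(dim=1)                   # (B, T) per-frame action labels

In the paper, the refinement layer's global temporal information is used to guide the ED structure's segmentation points rather than simply post-processing its output; the sketch above only shows the general idea of refining frame-wise scores with long-range context.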

Tsinghua Science and Technology
Pages 290-302
Cite this article:
Liu Y, Sun Y, Chen Z, et al. Global Spatial-Temporal Information Encoder-Decoder Based Action Segmentation in Untrimmed Video. Tsinghua Science and Technology, 2025, 30(1): 290-302. https://doi.org/10.26599/TST.2024.9010041


Received: 31 October 2023
Revised: 11 January 2024
Accepted: 18 February 2024
Published: 11 September 2024
© The Author(s) 2025.

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
