Research Article | Open Access

EfficientPose: Efficient human pose estimation with neural architecture search

School of EIC, Huazhong University of Science and Technology, Wuhan 430074, China
Institute of Artificial Intelligence, Huazhong University of Science and Technology, Wuhan 430074, China

* Wenqian Zhang and Jiemin Fang contributed equally to this work.


Abstract

Human pose estimation from images and video is a key task in many multimedia applications. Previous methods achieve great performance but rarely take efficiency into consideration, which makes it difficult to deploy the networks on lightweight devices. Meanwhile, real-time multimedia applications call for more efficient models for better interaction. Moreover, most deep neural networks for pose estimation directly reuse networks designed for image classification as the backbone, which are not optimized for the pose estimation task. In this paper, we propose an efficient framework for human pose estimation consisting of two parts: an efficient backbone and an efficient head. Using a differentiable neural architecture search method, we customize the backbone network design for pose estimation and reduce its computational cost with negligible accuracy degradation. For the efficient head, we slim the transposed convolutions and propose a spatial information correction module to improve the quality of the final prediction. In experiments, we evaluate our networks on the MPII and COCO datasets. Our smallest model requires only 0.65 GFLOPs yet achieves 88.1% PCKh@0.5 on MPII, and our large model needs only 2 GFLOPs while its accuracy is competitive with that of the state-of-the-art large model, HRNet, which requires 9.5 GFLOPs.
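To make the backbone search concrete, the following is a minimal PyTorch sketch of a DARTS-style differentiable search cell of the kind the abstract alludes to. The candidate operations, channel widths, and the class name `MixedOp` are illustrative assumptions for exposition, not the actual EfficientPose search space.

```python
# A minimal sketch of a DARTS-style differentiable search cell in PyTorch.
# The candidate operations below are placeholders, not EfficientPose's search space.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """Softmax-weighted sum of candidate operations; the weights are learned."""
    def __init__(self, channels: int):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),  # plain conv
            nn.Conv2d(channels, channels, 3, padding=1,
                      groups=channels, bias=False),                   # depthwise conv
            nn.Identity(),                                            # skip connection
        ])
        # One architecture parameter per candidate op, optimized by gradient
        # descent jointly with the network weights (typically on a held-out split).
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

# After the search converges, the op with the largest alpha is kept and the
# rest are pruned, yielding a discrete architecture at roughly the cost of
# one training run.
```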
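Likewise, the efficient head can be pictured as narrow transposed convolutions followed by a lightweight refinement step. The sketch below is a hedged illustration only: the channel widths, the number of upsampling stages, and the depthwise "correction" branch are assumptions standing in for the paper's spatial information correction module, whose exact design is not reproduced here.

```python
# A hedged sketch of a slimmed deconvolution head. Channel widths, stage count,
# and the correction branch are illustrative assumptions, not the paper's design.
import torch
import torch.nn as nn

class SlimHead(nn.Module):
    def __init__(self, in_channels: int = 320, mid_channels: int = 64,
                 num_joints: int = 17):
        super().__init__()
        # Narrow transposed convolutions upsample the backbone features;
        # using far fewer channels than the usual 256 cuts FLOPs substantially.
        self.up = nn.Sequential(
            nn.ConvTranspose2d(in_channels, mid_channels, 4, stride=2, padding=1),
            nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(mid_channels, mid_channels, 4, stride=2, padding=1),
            nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True),
        )
        # Placeholder spatial correction: a cheap depthwise conv that refines
        # the upsampled features before prediction (an assumption for illustration).
        self.correct = nn.Conv2d(mid_channels, mid_channels, 3, padding=1,
                                 groups=mid_channels)
        self.predict = nn.Conv2d(mid_channels, num_joints, 1)

    def forward(self, x):
        x = self.up(x)
        x = x + self.correct(x)
        return self.predict(x)  # per-joint heatmaps

head = SlimHead()
feats = torch.randn(1, 320, 8, 8)  # backbone output (illustrative shape)
print(head(feats).shape)           # torch.Size([1, 17, 32, 32])
```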

Computational Visual Media
Pages 335–347
Cite this article:
Zhang W, Fang J, Wang X, et al. EfficientPose: Efficient human pose estimation with neural architecture search. Computational Visual Media, 2021, 7(3): 335–347. https://doi.org/10.1007/s41095-021-0214-z


Received: 11 December 2020
Accepted: 16 February 2021
Published: 07 April 2021
© The Author(s) 2021

This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Other papers from this open access journal are available free of charge from http://www.springer.com/journal/41095. To submit a manuscript, please go to https://www.editorialmanager.com/cvmj.
