Regular Paper

Bidirectional Optimization Coupled Lightweight Networks for Efficient and Robust Multi-Person 2D Pose Estimation

State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing 100191, China
Beihang University Qingdao Research Institute, Qingdao 266000, China
Department of Computer Science, Stony Brook University, Stony Brook 11790, U.S.A.

Abstract

For multi-person 2D (2-dimensional) pose estimation, current deep learning based methods have exhibited impressive performance, but trade-offs among efficiency, robustness, and accuracy remain unavoidable in existing approaches. In principle, bottom-up methods are more efficient than top-down methods but less accurate. To make full use of their respective advantages, in this paper we design a novel bidirectional optimization coupled lightweight network (BOCLN) architecture for efficient, robust, and general-purpose multi-person 2D pose estimation from natural images. Within the BOCLN framework, the bottom-up network focuses on global features, while the top-down network emphasizes detailed features. The entire framework shares global features along the bottom-up data stream, while the top-down data stream aims to accelerate accurate pose estimation. In particular, to exploit priors on the relationships between human joints, we propose a probability limb heat map that represents the spatial context of the joints and guides the overall pose skeleton prediction, so that each person's pose estimation in cluttered scenes (including crowds) can be as accurate and robust as possible. Benefiting from the novel BOCLN architecture, the time-consuming refinement procedure can be greatly simplified into an efficient lightweight network. Extensive experiments and evaluations on public benchmarks confirm that our new method is more efficient and robust, yet still attains competitive accuracy compared with state-of-the-art methods. Our BOCLN shows even greater promise in online applications.
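
The abstract does not give the formula for the probability limb heat map, so the following is only a minimal sketch of one plausible construction, not the authors' definition. It assumes a limb is the line segment between two joint positions and that the map value at each pixel decays as a Gaussian of the pixel's distance to that segment. The function name `limb_heat_map`, the `sigma` parameter, and the Gaussian form are all assumptions introduced for illustration.

```python
import numpy as np


def limb_heat_map(joint_a, joint_b, height, width, sigma=2.0):
    """Return a (height, width) map that peaks along the segment a-b.

    Hypothetical construction (not taken from the paper): each pixel gets
    exp(-d^2 / (2 * sigma^2)), where d is its distance to the nearest
    point on the segment between the two joints, given as (x, y).
    """
    a = np.asarray(joint_a, dtype=np.float64)
    b = np.asarray(joint_b, dtype=np.float64)

    # Pixel coordinate grid with shape (height, width, 2), stored as (x, y).
    ys, xs = np.mgrid[0:height, 0:width]
    pixels = np.stack([xs, ys], axis=-1).astype(np.float64)

    ab = b - a
    length_sq = float(ab @ ab)
    if length_sq == 0.0:
        # Degenerate limb: both joints coincide, so fall back to a point.
        t = np.zeros((height, width))
    else:
        # Projection parameter of each pixel onto the segment, clipped to [0, 1].
        t = np.clip(((pixels - a) @ ab) / length_sq, 0.0, 1.0)

    nearest = a + t[..., None] * ab
    dist_sq = np.sum((pixels - nearest) ** 2, axis=-1)
    return np.exp(-dist_sq / (2.0 * sigma ** 2))
```

Under these assumptions, calling `limb_heat_map((2, 2), (7, 2), 10, 10)` yields values close to 1 along the horizontal segment from x = 2 to x = 7 at y = 2 and values that fall off smoothly away from it. A map of this kind encodes where a limb is likely to lie between two candidate joints, which is the sort of spatial context the abstract describes.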

Electronic Supplementary Material

Download File(s)
jcst-34-3-522-Highlights.pdf (692.3 KB)

Journal of Computer Science and Technology
Pages 522-536
Cite this article:
Li S, Fang Z, Song W-F, et al. Bidirectional Optimization Coupled Lightweight Networks for Efficient and Robust Multi-Person 2D Pose Estimation. Journal of Computer Science and Technology, 2019, 34(3): 522-536. https://doi.org/10.1007/s11390-019-1924-x


Received: 30 December 2018
Revised: 20 March 2019
Published: 10 May 2019
©2019 Springer Science + Business Media, LLC & Science Press, China