Research Article | Open Access

3D hand pose and shape estimation from monocular RGB via efficient 2D cues

Hubei Key Laboratory of Medical Information Analysis and Tumor Diagnosis & Treatment, School of Biomedical Engineering, South Central Minzu University, Wuhan 430074, China
National Key Laboratory of Science and Technology of Multi-spectral Information Processing, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan 430074, China


Abstract

Estimating 3D hand shape from a single-view RGB image is important for many applications. However, the diversity of hand shapes and postures, depth ambiguity, and occlusion may result in pose errors and noisy hand meshes. Making full use of 2D cues such as 2D pose can effectively improve the quality of 3D hand shape estimation. In this paper, we use 2D joint heatmaps to obtain spatial details for robust pose estimation. We also introduce a depth-independent 2D mesh to avoid depth ambiguity in mesh regression for efficient hand-image alignment. Our method has four cascaded stages: 2D cue extraction, pose feature encoding, initial reconstruction, and reconstruction refinement. Specifically, we first encode the image into semantic features during 2D cue extraction; these features are also used to predict 2D hand joints and a segmentation mask. Then, during the pose feature encoding stage, a hand joints encoder learns spatial information from the joint heatmaps. Next, a coarse 3D hand mesh and a 2D mesh are obtained in the initial reconstruction step; a mesh squeeze-and-excitation block fuses the different hand features to enhance perception of 3D hand structures. Finally, a global mesh refinement stage learns non-local relations between vertices of the hand mesh from the predicted 2D mesh and predicts an offset mesh that fine-tunes the reconstruction. Quantitative and qualitative results on the FreiHAND benchmark dataset demonstrate that our approach achieves state-of-the-art performance.
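The following is a minimal PyTorch-style sketch of the four-stage cascade described in the abstract, intended only to make the data flow concrete. All module names (e.g., MeshSEBlock, HandReconstructionSketch), the feature width, and the 778-vertex mesh size are illustrative assumptions and do not reproduce the authors' architecture; in particular, the refinement stage here is a simple per-vertex offset regressor rather than the paper's non-local refinement.

```python
# Illustrative sketch of the four-stage cascade: 2D cue extraction,
# pose feature encoding, initial reconstruction, and refinement.
# Shapes, layer choices, and the 778-vertex mesh are assumptions.
import torch
import torch.nn as nn

class MeshSEBlock(nn.Module):
    """Squeeze-and-excitation style gating to fuse per-vertex hand features
    (illustrative; the paper's block may differ in detail)."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, V, C) per-vertex features; squeeze over vertices, re-weight channels
        gate = self.fc(x.mean(dim=1))            # (B, C)
        return x * gate.unsqueeze(1)             # channel-wise excitation

class HandReconstructionSketch(nn.Module):
    def __init__(self, num_joints: int = 21, num_vertices: int = 778, feat_dim: int = 256):
        super().__init__()
        # Stage 1: 2D cue extraction -> semantic features, joint heatmaps, segmentation
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_dim, feat_dim, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.heatmap_head = nn.Conv2d(feat_dim, num_joints, 1)
        self.seg_head = nn.Conv2d(feat_dim, 1, 1)
        # Stage 2: pose feature encoding from the joint heatmaps
        self.joint_encoder = nn.Sequential(
            nn.Conv2d(num_joints, feat_dim, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Stage 3: initial reconstruction of a coarse 3D mesh and a depth-free 2D mesh
        self.fuse = MeshSEBlock(feat_dim)
        self.to_vertex_feat = nn.Linear(2 * feat_dim, feat_dim)
        self.mesh3d_head = nn.Linear(feat_dim, 3)
        self.mesh2d_head = nn.Linear(feat_dim, 2)
        self.num_vertices = num_vertices
        # Stage 4: refinement predicts a per-vertex 3D offset from the 2D mesh
        # (a plain MLP stands in for the paper's non-local refinement)
        self.refine = nn.Sequential(
            nn.Linear(2, feat_dim), nn.ReLU(inplace=True), nn.Linear(feat_dim, 3),
        )

    def forward(self, image: torch.Tensor):
        feats = self.backbone(image)                          # (B, C, H', W')
        heatmaps = self.heatmap_head(feats)                   # (B, J, H', W')
        seg = torch.sigmoid(self.seg_head(feats))             # (B, 1, H', W')
        pose_code = self.joint_encoder(heatmaps)              # (B, C)
        img_code = feats.mean(dim=(2, 3))                     # (B, C)
        vert_feat = self.to_vertex_feat(
            torch.cat([img_code, pose_code], dim=-1)
        ).unsqueeze(1).expand(-1, self.num_vertices, -1)      # (B, V, C)
        vert_feat = self.fuse(vert_feat)
        mesh3d = self.mesh3d_head(vert_feat)                  # coarse 3D mesh (B, V, 3)
        mesh2d = self.mesh2d_head(vert_feat)                  # depth-free 2D mesh (B, V, 2)
        mesh3d_refined = mesh3d + self.refine(mesh2d)         # offset-based refinement
        return mesh3d_refined, mesh2d, heatmaps, seg
```

For example, HandReconstructionSketch()(torch.randn(1, 3, 128, 128)) returns the refined 3D mesh, the 2D mesh, the joint heatmaps, and a segmentation mask, mirroring the intermediate outputs the abstract names.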

Electronic Supplementary Material

41095_0346_ESM.pdf (425.4 KB)

Computational Visual Media
Pages 79-96
Cite this article:
Zhang F, Zhao L, Li S, et al. 3D hand pose and shape estimation from monocular RGB via efficient 2D cues. Computational Visual Media, 2024, 10(1): 79-96. https://doi.org/10.1007/s41095-023-0346-4


Received: 03 July 2022
Accepted: 24 March 2023
Published: 30 November 2023
© The Author(s) 2023.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Other papers from this open access journal are available free of charge from http://www.springer.com/journal/41095. To submit a manuscript, please go to https://www.editorialmanager.com/cvmj.
