Open Access

Grasp Detection with Hierarchical Multi-Scale Feature Fusion and Inverted Shuffle Residual

Wenjie Geng1,2, Zhiqiang Cao1,2 (corresponding author), Peiyu Guan1,2, Fengshui Jing1,2, Min Tan1,2, and Junzhi Yu3
1 State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
2 School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China
3 Department of Advanced Manufacturing and Robotics, College of Engineering, Peking University, Beijing 100871, China

† Wenjie Geng and Peiyu Guan contributed equally to this work.


Abstract

Grasp detection plays a critical role in robot manipulation. Mainstream pixel-wise grasp detection networks with an encoder-decoder structure have received much attention due to their good accuracy and efficiency. However, they usually transmit only the high-level feature in the encoder to the decoder, while low-level features are neglected. Low-level features contain abundant detail information, and how to fully exploit them remains unsolved. Meanwhile, the channel information in the high-level feature is also not well mined. Inevitably, the performance of grasp detection degrades. To solve these problems, we propose a grasp detection network with hierarchical multi-scale feature fusion and an inverted shuffle residual. Both low-level and high-level features in the encoder are first fused by the designed skip connections with an attention module, and the fused information is then propagated to the corresponding layers of the decoder for in-depth feature fusion. Such hierarchical fusion guarantees the quality of grasp prediction. Furthermore, an inverted shuffle residual module is created, in which the high-level feature from the encoder is split along the channel dimension and the resulting splits are processed in their respective branches. Through this differentiated processing, more high-dimensional channel information is kept, which enhances the representation ability of the network. In addition, an information enhancement module is added before the encoder to reinforce the input information. The proposed method attains 98.9% image-wise and 97.8% object-wise accuracy on the Cornell grasping dataset, and the experimental results verify its effectiveness.
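
To make the channel-split idea behind the inverted shuffle residual concrete, the following is a minimal PyTorch-style sketch of a block that splits a high-level feature along the channel dimension, processes each split in its own branch, and then shuffles the concatenated channels over a residual connection. Module names, layer choices, and channel counts are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: channel split + per-branch processing + channel
# shuffle with a residual connection. Not the authors' code.
import torch
import torch.nn as nn


def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Interleave channels across groups so information mixes between branches."""
    n, c, h, w = x.size()
    x = x.view(n, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(n, c, h, w)


class InvertedShuffleResidualSketch(nn.Module):
    """Hypothetical block: split channels, process each half differently, shuffle, add residual."""

    def __init__(self, channels: int, expansion: int = 2):
        super().__init__()
        half = channels // 2
        # Branch 1: lightweight depthwise + pointwise processing for one half of the channels.
        self.branch1 = nn.Sequential(
            nn.Conv2d(half, half, 3, padding=1, groups=half, bias=False),
            nn.BatchNorm2d(half),
            nn.Conv2d(half, half, 1, bias=False),
            nn.BatchNorm2d(half),
            nn.ReLU(inplace=True),
        )
        # Branch 2: inverted-bottleneck style expansion then projection for the other half.
        self.branch2 = nn.Sequential(
            nn.Conv2d(half, half * expansion, 1, bias=False),
            nn.BatchNorm2d(half * expansion),
            nn.ReLU(inplace=True),
            nn.Conv2d(half * expansion, half * expansion, 3, padding=1,
                      groups=half * expansion, bias=False),
            nn.BatchNorm2d(half * expansion),
            nn.Conv2d(half * expansion, half, 1, bias=False),
            nn.BatchNorm2d(half),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = x.chunk(2, dim=1)              # split the high-level feature in channel
        out = torch.cat([self.branch1(x1), self.branch2(x2)], dim=1)
        out = channel_shuffle(out, groups=2)     # mix information between the two branches
        return out + x                           # residual connection


if __name__ == "__main__":
    feat = torch.randn(1, 64, 14, 14)            # e.g., a high-level encoder feature map
    print(InvertedShuffleResidualSketch(64)(feat).shape)  # torch.Size([1, 64, 14, 14])
```

The design intent sketched here is that the two halves are processed at different capacities, so high-dimensional channel information is retained in one branch while the other stays cheap; the shuffle then lets the decoder see a mixture of both.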

Cite this article:
Geng W, Cao Z, Guan P, et al. Grasp Detection with Hierarchical Multi-Scale Feature Fusion and Inverted Shuffle Residual. Tsinghua Science and Technology, 2024, 29(1): 244-256. https://doi.org/10.26599/TST.2023.9010003


Received: 02 September 2022
Revised: 07 January 2023
Accepted: 15 January 2023
Published: 21 August 2023
© The author(s) 2024.

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
