Open Access

Grasp Detection with Hierarchical Multi-Scale Feature Fusion and Inverted Shuffle Residual

Wenjie Geng1,2, Zhiqiang Cao1,2 (✉), Peiyu Guan1,2, Fengshui Jing1,2, Min Tan1,2, and Junzhi Yu3
1 State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
2 School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China
3 Department of Advanced Manufacturing and Robotics, College of Engineering, Peking University, Beijing 100871, China

† Wenjie Geng and Peiyu Guan contributed equally to this work.


Abstract

Grasp detection plays a critical role in robot manipulation. Mainstream pixel-wise grasp detection networks with an encoder-decoder structure have received much attention due to their good accuracy and efficiency. However, they usually transmit only the high-level features of the encoder to the decoder, and low-level features are neglected. Low-level features contain abundant detail information, and how to fully exploit them remains an open problem. Meanwhile, the channel information in high-level features is also not well mined. Inevitably, the performance of grasp detection degrades. To solve these problems, we propose a grasp detection network with hierarchical multi-scale feature fusion and an inverted shuffle residual module. Low-level and high-level features in the encoder are first fused through the designed skip connections with an attention module, and the fused information is then propagated to the corresponding layers of the decoder for in-depth feature fusion. This hierarchical fusion guarantees the quality of grasp prediction. Furthermore, an inverted shuffle residual module is created, in which the high-level feature from the encoder is split along the channel dimension and the resulting splits are processed in separate branches. This differentiated processing retains more high-dimensional channel information, which enhances the representation ability of the network. In addition, an information enhancement module is placed before the encoder to reinforce the input. The proposed method attains 98.9% and 97.8% accuracy on the image-wise and object-wise splits of the Cornell grasping dataset, respectively, and the experimental results verify its effectiveness.
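The abstract describes the inverted shuffle residual module only at a high level: the high-level feature is split along the channel dimension, and the resulting parts are processed in separate branches before being recombined. As a rough illustration of that idea, below is a minimal PyTorch sketch; the class name InvertedShuffleResidual, the two-branch layout, the expansion factor, and the use of a ShuffleNet-style channel shuffle are all assumptions of this sketch, not the authors' actual implementation.

```python
import torch
import torch.nn as nn

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    # Standard ShuffleNet-style shuffle: interleave channels across groups
    # so the two branches can exchange information after concatenation.
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w).transpose(1, 2).contiguous()
    return x.view(n, c, h, w)

class InvertedShuffleResidual(nn.Module):
    """Hypothetical sketch: split the input along the channel dimension,
    process each half in its own branch, then concatenate, shuffle, and
    add the input back as a residual. Assumes `channels` is even."""

    def __init__(self, channels: int, expansion: int = 2):
        super().__init__()
        half = channels // 2
        # Branch A: cheap 1x1 refinement of one half of the channels.
        self.branch_a = nn.Sequential(
            nn.Conv2d(half, half, kernel_size=1, bias=False),
            nn.BatchNorm2d(half),
            nn.ReLU(inplace=True),
        )
        # Branch B: expand -> depthwise 3x3 -> project, i.e., the
        # MobileNetV2-style inverted residual pattern.
        self.branch_b = nn.Sequential(
            nn.Conv2d(half, half * expansion, kernel_size=1, bias=False),
            nn.BatchNorm2d(half * expansion),
            nn.ReLU(inplace=True),
            nn.Conv2d(half * expansion, half * expansion, kernel_size=3,
                      padding=1, groups=half * expansion, bias=False),
            nn.BatchNorm2d(half * expansion),
            nn.Conv2d(half * expansion, half, kernel_size=1, bias=False),
            nn.BatchNorm2d(half),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a, b = x.chunk(2, dim=1)                    # split in channel
        out = torch.cat([self.branch_a(a), self.branch_b(b)], dim=1)
        out = channel_shuffle(out, groups=2)        # mix across branches
        return out + x                              # residual connection
```

The split-and-concatenate layout mirrors ShuffleNet-V2-style units, while the expand-depthwise-project branch follows the inverted residual pattern of MobileNetV2; how the paper actually combines these ingredients may differ from this sketch.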

Tsinghua Science and Technology
Pages 244-256
Cite this article:
Geng W, Cao Z, Guan P, et al. Grasp Detection with Hierarchical Multi-Scale Feature Fusion and Inverted Shuffle Residual. Tsinghua Science and Technology, 2024, 29(1): 244-256. https://doi.org/10.26599/TST.2023.9010003


Received: 02 September 2022
Revised: 07 January 2023
Accepted: 15 January 2023
Published: 21 August 2023
© The author(s) 2024.

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
