Open Access

Grasp Detection with Hierarchical Multi-Scale Feature Fusion and Inverted Shuffle Residual

Wenjie Geng1,2, Zhiqiang Cao1,2 (✉), Peiyu Guan1,2, Fengshui Jing1,2, Min Tan1,2, and Junzhi Yu3
1 State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
2 School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China
3 Department of Advanced Manufacturing and Robotics, College of Engineering, Peking University, Beijing 100871, China

† Wenjie Geng and Peiyu Guan contributed equally to this work.


Abstract

Grasp detection plays a critical role in robot manipulation. Mainstream pixel-wise grasp detection networks with an encoder-decoder structure have received much attention due to their good accuracy and efficiency. However, they usually transmit only the high-level features of the encoder to the decoder, and low-level features are neglected. Low-level features contain abundant detail information, and how to fully exploit them remains an open problem. Meanwhile, the channel information in high-level features is also not well mined. Inevitably, the performance of grasp detection degrades. To solve these problems, we propose a grasp detection network with hierarchical multi-scale feature fusion and an inverted shuffle residual module. Low-level and high-level features in the encoder are first fused through the designed skip connections with an attention module, and the fused information is then propagated to the corresponding layers of the decoder for in-depth feature fusion. This hierarchical fusion guarantees the quality of grasp prediction. Furthermore, an inverted shuffle residual module is created, in which the high-level feature from the encoder is split along the channel dimension and the resulting splits are processed in separate branches. This differentiated processing retains more high-dimensional channel information, which enhances the representation ability of the network. In addition, an information enhancement module is placed before the encoder to reinforce the input. The proposed method attains 98.9% and 97.8% accuracy on the image-wise and object-wise splits of the Cornell grasping dataset, respectively, and the experimental results verify its effectiveness.
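The abstract describes the inverted shuffle residual module only at a high level: the high-level feature is split along the channel dimension, and the resulting parts are processed in separate branches before being recombined. As a rough illustration of that idea, below is a minimal PyTorch sketch; the class name InvertedShuffleResidual, the two-branch layout, the expansion factor, and the use of a ShuffleNet-style channel shuffle are all assumptions of this sketch, not the authors' actual implementation.

```python
import torch
import torch.nn as nn

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    # Standard ShuffleNet-style shuffle: interleave channels across groups
    # so the two branches can exchange information after concatenation.
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w).transpose(1, 2).contiguous()
    return x.view(n, c, h, w)

class InvertedShuffleResidual(nn.Module):
    """Hypothetical sketch: split the input along the channel dimension,
    process each half in its own branch, then concatenate, shuffle, and
    add the input back as a residual. Assumes `channels` is even."""

    def __init__(self, channels: int, expansion: int = 2):
        super().__init__()
        half = channels // 2
        # Branch A: cheap 1x1 refinement of one half of the channels.
        self.branch_a = nn.Sequential(
            nn.Conv2d(half, half, kernel_size=1, bias=False),
            nn.BatchNorm2d(half),
            nn.ReLU(inplace=True),
        )
        # Branch B: expand -> depthwise 3x3 -> project, i.e., the
        # MobileNetV2-style inverted residual pattern.
        self.branch_b = nn.Sequential(
            nn.Conv2d(half, half * expansion, kernel_size=1, bias=False),
            nn.BatchNorm2d(half * expansion),
            nn.ReLU(inplace=True),
            nn.Conv2d(half * expansion, half * expansion, kernel_size=3,
                      padding=1, groups=half * expansion, bias=False),
            nn.BatchNorm2d(half * expansion),
            nn.Conv2d(half * expansion, half, kernel_size=1, bias=False),
            nn.BatchNorm2d(half),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a, b = x.chunk(2, dim=1)                    # split in channel
        out = torch.cat([self.branch_a(a), self.branch_b(b)], dim=1)
        out = channel_shuffle(out, groups=2)        # mix across branches
        return out + x                              # residual connection
```

The split-and-concatenate layout mirrors ShuffleNet-V2-style units, while the expand-depthwise-project branch follows the inverted residual pattern of MobileNetV2; how the paper actually combines these ingredients may differ from this sketch.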

Tsinghua Science and Technology
Pages 244-256
Cite this article:
Geng W, Cao Z, Guan P, et al. Grasp Detection with Hierarchical Multi-Scale Feature Fusion and Inverted Shuffle Residual. Tsinghua Science and Technology, 2024, 29(1): 244-256. https://doi.org/10.26599/TST.2023.9010003


Received: 02 September 2022
Revised: 07 January 2023
Accepted: 15 January 2023
Published: 21 August 2023
© The author(s) 2024.

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
