Few-shot object detection has received much attention for its ability to detect novel-class objects using limited annotated data. Transfer learning-based solutions have become popular because they are simple to train and achieve good accuracy; however, enriching feature diversity during training remains challenging, and fine-grained features are insufficient for novel-class detection. To address these problems, this paper proposes a few-shot object detection method based on dual-domain feature fusion and patch-level attention. An elementary domain with more category-agnostic features is superposed on the original base domain to construct a two-stream backbone, which enriches feature diversity. To better integrate the various features, a dual-domain feature fusion is designed, in which feature pairs of the same size are complementarily fused to extract more discriminative features. Moreover, a patch-wise feature refinement, termed patch-level attention, is presented to mine the internal relations among patches, which enhances adaptability to novel classes. In addition, a weighted classification loss is introduced to assist the fine-tuning of the classifier by incorporating extra features from the FPN of the base training model. In this way, the few-shot detection quality for novel-class objects is improved. Experiments on the PASCAL VOC and MS COCO datasets verify the effectiveness of the method.
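The idea of mining relations among patches can be illustrated with a small, generic attention sketch. The abstract does not specify how patch-level attention is formulated, so everything below is an assumption made for illustration: the function name `patch_attention`, the use of scaled dot-product similarity, and the residual addition are hypothetical choices, not the paper's module. Each patch is represented as a plain feature vector.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def patch_attention(patches):
    """Refine each patch vector with a similarity-weighted mix of all patches.

    `patches` is a list of equal-length feature vectors, one per patch.
    Assumption: scaled dot-product similarity plus a residual connection;
    the abstract does not state the actual formulation.
    """
    dim = len(patches[0])
    scale = math.sqrt(dim)
    refined = []
    for query in patches:
        # Similarity of this patch to every patch (including itself).
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in patches]
        weights = softmax(scores)
        # Weighted average of all patch vectors, one feature at a time.
        mixed = [sum(w * p[i] for w, p in zip(weights, patches)) for i in range(dim)]
        # Residual: keep the original patch and add the relational context.
        refined.append([q + m for q, m in zip(query, mixed)])
    return refined
```

In this sketch, patches that resemble each other contribute more to one another's refined features, which is one plausible reading of "internal relations among the patches".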
Grasp detection plays a critical role in robot manipulation. Mainstream pixel-wise grasp detection networks with an encoder-decoder structure have received much attention due to their good accuracy and efficiency. However, they usually pass only the high-level feature from the encoder to the decoder and neglect low-level features. Low-level features contain abundant detail information, and how to fully exploit them remains unsolved. Meanwhile, the channel information in the high-level feature is not well mined. As a result, grasp detection performance is degraded. To solve these problems, we propose a grasp detection network with hierarchical multi-scale feature fusion and an inverted shuffle residual. Both low-level and high-level features in the encoder are first fused by the designed skip connections with an attention module, and the fused information is then propagated to the corresponding layers of the decoder for in-depth feature fusion. Such hierarchical fusion improves the quality of grasp prediction. Furthermore, an inverted shuffle residual module is introduced, in which the high-level feature from the encoder is split along the channel dimension and the resulting split features are processed in their respective branches. Through this differentiated processing, more high-dimensional channel information is retained, which enhances the representation ability of the network. In addition, an information enhancement module is added before the encoder to reinforce the input information. The proposed method attains 98.9% image-wise and 97.8% object-wise accuracy on the Cornell grasping dataset, and the experimental results verify the effectiveness of the method.
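The split-and-branch processing described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's module: the abstract only says the feature is split by channel and each part is processed in its own branch, so the channel shuffle (suggested by the module's name), the residual addition, and the helper names `channel_split`, `channel_shuffle`, and `split_process_merge` are all hypothetical. Each channel is stood in for by a single number rather than a full feature map.

```python
def channel_split(channels, parts=2):
    """Split a list of channels into `parts` contiguous, equal-sized groups."""
    if len(channels) % parts != 0:
        raise ValueError("channel count must be divisible by parts")
    size = len(channels) // parts
    return [channels[i * size:(i + 1) * size] for i in range(parts)]


def channel_shuffle(channels, groups=2):
    """Interleave channels across groups (reshape, transpose, flatten)."""
    grouped = channel_split(channels, groups)
    return [ch for column in zip(*grouped) for ch in column]


def split_process_merge(channels, branches):
    """Apply one branch function per channel group, then merge and shuffle.

    Assumption: the merged result is shuffled and added back to the input
    as a residual; the abstract does not confirm either step.
    """
    groups = channel_split(channels, len(branches))
    processed = []
    for group, branch in zip(groups, branches):
        processed.extend(branch(ch) for ch in group)
    shuffled = channel_shuffle(processed, len(branches))
    # Residual connection: add the input back, channel by channel.
    return [x + y for x, y in zip(channels, shuffled)]
```

For example, `split_process_merge([1, 2, 3, 4], [lambda c: c, lambda c: c * 10])` leaves the first half of the channels unchanged, scales the second half, and interleaves the two halves before the residual addition, so information from both branches ends up mixed across the output channels.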