Regular Paper

CAT: A Simple yet Effective Cross-Attention Transformer for One-Shot Object Detection

School of Computer Science, Northwestern Polytechnical University, Xi’an 710000, China
National Engineering Laboratory for Integrated Aero-Space-Ground-Ocean Big Data Application Technology, Northwestern Polytechnical University, Xi’an 710000, China
School of Computer Science, The University of Adelaide, Adelaide, SA 5005, Australia

Abstract

Given a query patch from a novel class, one-shot object detection aims to detect all instances of that class in a target image through semantic similarity comparison. However, owing to the extremely limited guidance available for the novel class, as well as the unseen appearance differences between the query and target instances, it is difficult to exploit their semantic similarity appropriately and to generalize well. To mitigate this problem, we present a universal Cross-Attention Transformer (CAT) module for accurate and efficient semantic similarity comparison in one-shot object detection. The proposed CAT utilizes the transformer mechanism to comprehensively capture the bi-directional correspondence between any pair of pixels from the query patch and the target image, which empowers us to sufficiently exploit their semantic characteristics for accurate similarity comparison. In addition, CAT enables feature dimensionality compression, which speeds up inference without loss of performance. Extensive experiments on three object detection datasets, MS-COCO, PASCAL VOC, and FSOD, under the one-shot setting demonstrate the effectiveness and efficiency of our model: for example, it surpasses CoAE, a major baseline for this task, by 1.0% in average precision (AP) on MS-COCO while running nearly 2.5 times faster.
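The core idea described above, joint attention over all pixel pairs from the query patch and the target image, can be sketched in a few lines of PyTorch. The block below is an illustrative reconstruction, not the authors' released implementation: it flattens both feature maps into token sequences, concatenates them, and runs standard multi-head self-attention so that information flows in both directions. The names CrossAttentionBlock, d_model, and n_heads are hypothetical choices, not values taken from the paper.

```python
import torch
import torch.nn as nn


class CrossAttentionBlock(nn.Module):
    """Minimal sketch of bi-directional cross-attention between a query
    patch and a target image (an assumption-laden stand-in for CAT, not
    the authors' exact module). Concatenating the two token sequences
    and applying self-attention lets every pixel of either input attend
    to every pixel of the other."""

    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        # d_model stands in for the (possibly compressed) feature
        # dimensionality mentioned in the abstract; the defaults here
        # are illustrative, not reported hyperparameters.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, query_feat: torch.Tensor, target_feat: torch.Tensor):
        # query_feat:  (B, Nq, d_model) tokens from the query patch
        # target_feat: (B, Nt, d_model) tokens from the target image
        tokens = torch.cat([query_feat, target_feat], dim=1)
        attended, _ = self.attn(tokens, tokens, tokens)
        tokens = self.norm(tokens + attended)  # residual + layer norm
        nq = query_feat.shape[1]
        # Split the jointly attended sequence back into the two streams.
        return tokens[:, :nq], tokens[:, nq:]
```

In a full detector, the enriched target-image tokens would be reshaped back into a feature map and passed to a region proposal network and detection head, in the style of the Faster R-CNN pipeline that prior one-shot detectors such as CoAE build on.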

Electronic Supplementary Material

JCST-2106-11743-Highlights.pdf (300.8 KB)

References

[1]
Girshick R, Donahue J, Darrell T, Malik J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proc. the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Jun. 2014, pp.580–587. DOI: 10.1109/CVPR.2014.81.
[2]
Ren S Q, He K M, Girshick R, Sun J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137–1149. DOI: 10.1109/TPAMI.2016.2577031.
[3]
Hsieh T I, Lo Y C, Chen H T, Liu T L. One-shot object detection with co-attention and co-excitation. In Proc. the 33rd International Conference on Neural Information Processing Systems, Dec. 2019, Article No. 245.
[4]
Fan Q, Zhuo W, Tang C K, Tai Y W. Few-shot object detection with attention-RPN and multi-relation detector. In Proc. the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun. 2020, pp.4012–4021. DOI: 10.1109/CVPR42600.2020.00407.
[5]
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser L, Polosukhin I. Attention is all you need. In Proc. the 31st International Conference on Neural Information Processing Systems, Dec. 2017, pp.6000–6010. DOI: 10.5555/3295222.3295349.
[6]
Chen H, Wang Y L, Wang G Y, Qiao Y. LSTD: A low-shot transfer detector for object detection. In Proc. the 32nd AAAI Conference on Artificial Intelligence, Feb. 2018, pp.2836–2843. DOI: 10.1609/aaai.v32i1.11716.
[7]
Kang B Y, Liu Z, Wang X, Yu F, Feng J S, Darrell T. Few-shot object detection via feature reweighting. In Proc. the 2019 IEEE/CVF International Conference on Computer Vision, Oct. 27–Nov. 2, 2019, pp.8419–8428. DOI: 10.1109/ICCV.2019.00851.
[8]
Karlinsky L, Shtok J, Harary S, Schwartz E, Aides A, Feris R, Giryes R, Bronstein A M. RepMet: Representative-based metric learning for classification and few-shot object detection. In Proc. the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun. 2019, pp.5192–5201. DOI: 10.1109/CVPR.2019.00534.
[9]
Osokin A, Sumin D, Lomakin V. OS2D: One-stage one-shot object detection by matching anchor features. In Proc. the 16th European Conference on Computer Vision, Aug. 2020, pp.635–652. DOI: 10.1007/978-3-030-58555-6_38.
[10]
Tay Y, Dehghani M, Bahri D, Metzler D. Efficient transformers: A survey. ACM Computing Surveys, 2023, 55(6): Article No. 109. DOI: 10.1145/3530811.
[11]
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X H, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J, Houlsby N. An image is worth 16x16 words: Transformers for image recognition at scale. In Proc. the 9th International Conference on Learning Representations, May 2021.
[12]
Touvron H, Cord M, Douze M, Massa F, Sablayrolles A, Jégou H. Training data-efficient image transformers & distillation through attention. In Proc. the 38th International Conference on Machine Learning, Jul. 2021, pp.10347–10357.
[13]
Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S. End-to-end object detection with transformers. In Proc. the 16th European Conference on Computer Vision, Aug. 2020, pp.213–229. DOI: 10.1007/978-3-030-58452-8_13.
[14]
Zhu X Z, Su W J, Lu L W, Li B, Wang X G, Dai J F. Deformable DETR: Deformable transformers for end-to-end object detection. In Proc. the 9th International Conference on Learning Representations, May 2021.
[15]
Ye L W, Rochan M, Liu Z, Wang Y. Cross-modal self-attention network for referring image segmentation. In Proc. the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun. 2019, pp.10494–10503. DOI: 10.1109/CVPR.2019.01075.
[16]
Tan H, Bansal M. LXMERT: Learning cross-modality encoder representations from transformers. In Proc. the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, Nov. 2019, pp.5100–5111. DOI: 10.18653/v1/D19-1514.
[17]
Su W J, Zhu X Z, Cao Y, Li B, Lu L W, Wei F R, Dai J F. VL-BERT: Pre-training of generic visual-linguistic representations. In Proc. the 8th International Conference on Learning Representations, Apr. 2020.
[18]
Guo M H, Cai J X, Liu Z N, Mu T J, Martin R R, Hu S M. PCT: Point cloud transformer. Computational Visual Media, 2021, 7(2): 187–199. DOI: 10.1007/s41095-021-0229-5.
[19]
Yuan L, Chen Y P, Wang T, Yu W H, Shi Y J, Jiang Z H, Tay F E H, Feng J S, Yan S C. Tokens-to-token ViT: Training vision transformers from scratch on ImageNet. In Proc. the 2021 IEEE/CVF International Conference on Computer Vision, Oct. 2021, pp.538–547. DOI: 10.1109/ICCV48922.2021.00060.
[20]
He K M, Zhang X Y, Ren S Q, Sun J. Deep residual learning for image recognition. In Proc. the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Jun. 2016, pp.770–778. DOI: 10.1109/CVPR.2016.90.
[21]
Zhang Z M, Warrell J, Torr P H S. Proposal generation for object detection using cascaded ranking SVMs. In Proc. the 2011 IEEE Conference on Computer Vision and Pattern Recognition, Jun. 2011, pp.1497–1504. DOI: 10.1109/CVPR.2011.5995411.
[22]
Lin T Y, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick C L. Microsoft COCO: Common objects in context. In Proc. the 13th European Conference on Computer Vision, Sept. 2014, pp.740–755. DOI: 10.1007/978-3-319-10602-1_48.
[23]
Everingham M, Van Gool L, Williams C K I, Winn J, Zisserman A. The Pascal visual object classes (VOC) challenge. International Journal of Computer Vision, 2010, 88(2): 303–338. DOI: 10.1007/s11263-009-0275-4.
[24]
Chen K, Wang J Q, Pang J M, Cao Y H, Xiong Y, Li X X, Sun S Y, Feng W S, Liu Z W, Xu J R, Zhang Z, Cheng D Z, Zhu C C, Cheng T H, Zhao Q J, Li B Y, Lu X, Zhu R, Wu Y, Dai J F, Wang J D, Shi J P, Ouyang W L, Loy C C, Lin D H. MMDetection: Open MMLab detection toolbox and benchmark. arXiv: 1906.07155, 2019. https://arxiv.org/abs/1906.07155, March 2024.
[25]
Michaelis C, Ustyuzhaninov I, Bethge M, Ecker A S. One-shot instance segmentation. arXiv: 1811.11507, 2018. https://arxiv.org/abs/1811.11507, March 2024.
[26]
Fu K, Zhang T F, Zhang Y, Sun X. OSCD: A one-shot conditional object detection framework. Neurocomputing, 2021, 425: 243–255. DOI: 10.1016/j.neucom.2020.04.092.
[27]
Cen M B, Jung C. Fully convolutional Siamese fusion networks for object tracking. In Proc. the 25th IEEE International Conference on Image Processing, Oct. 2018, pp.3718–3722. DOI: 10.1109/ICIP.2018.8451102.
[28]
Li B, Yan J J, Wu W, Zhu Z, Hu X L. High performance visual tracking with Siamese region proposal network. In Proc. the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2018, pp.8971–8980. DOI: 10.1109/CVPR.2018.00935.
[29]
Wang X, Huang T E, Darrell T, Gonzalez J E, Yu F. Frustratingly simple few-shot object detection. In Proc. the 37th International Conference on Machine Learning, Jul. 2020, Article No. 920.
[30]
Wu X W, Sahoo D, Hoi S. Meta-RCNN: Meta learning for few-shot object detection. In Proc. the 28th ACM International Conference on Multimedia, Oct. 2020, pp.1679–1687. DOI: 10.1145/3394171.3413832.
[31]
Xiao Y, Marlet R. Few-shot object detection and viewpoint estimation for objects in the wild. In Proc. the 16th European Conference on Computer Vision, Aug. 2020, pp.192–210. DOI: 10.1007/978-3-030-58520-4_12.
[32]
Sun B, Li B H, Cai S C, Yuan Y, Zhang C. FSCE: Few-shot object detection via contrastive proposal encoding. In Proc. the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun. 2021, pp.7348–7358. DOI: 10.1109/CVPR46437.2021.00727.
[33]
Wu J X, Liu S T, Huang D, Wang Y H. Multi-scale positive sample refinement for few-shot object detection. In Proc. the 16th European Conference on Computer Vision, Aug. 2020, pp.456–472. DOI: 10.1007/978-3-030-58517-4_27.
[34]
Lin T Y, Dollár P, Girshick R, He K M, Hariharan B, Belongie S. Feature pyramid networks for object detection. In Proc. the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Jun. 2017, pp.936–944. DOI: 10.1109/CVPR.2017.106.
Journal of Computer Science and Technology
Pages 460–471
Cite this article:
Lin W-D, Deng Y-Y, Gao Y, et al. CAT: A Simple yet Effective Cross-Attention Transformer for One-Shot Object Detection. Journal of Computer Science and Technology, 2024, 39(2): 460–471. https://doi.org/10.1007/s11390-024-1743-6


Received: 27 June 2021
Accepted: 18 January 2024
Published: 30 March 2024
© Institute of Computing Technology, Chinese Academy of Sciences 2024