Regular Paper

CAT: A Simple yet Effective Cross-Attention Transformer for One-Shot Object Detection

School of Computer Science, Northwestern Polytechnical University, Xi’an, 710000, China
National Engineering Laboratory for Integrated Aero-Space-Ground-Ocean Big Data Application Technology, Northwestern Polytechnical University, Xi'an, 710000, China
School of Computer Science, The University of Adelaide, Adelaide, SA 5005, Australia

Abstract

Given a query patch from a novel class, one-shot object detection aims to detect all instances of this class in a target image through semantic similarity comparison. However, due to the extremely limited guidance available for the novel class, as well as the unseen appearance difference between the query and target instances, it is difficult to appropriately exploit their semantic similarity and generalize well. To mitigate this problem, we present a universal Cross-Attention Transformer (CAT) module for accurate and efficient semantic similarity comparison in one-shot object detection. The proposed CAT utilizes the transformer mechanism to comprehensively capture the bi-directional correspondence between any pair of pixels from the query and the target image, which allows us to fully exploit their semantic characteristics for accurate similarity comparison. In addition, the proposed CAT enables feature dimensionality compression for inference speedup without performance loss. Extensive experiments on three object detection datasets, MS-COCO, PASCAL VOC, and FSOD, under the one-shot setting demonstrate the effectiveness and efficiency of our model, e.g., it surpasses CoAE, a major baseline in this task, by 1.0% in average precision (AP) on MS-COCO and runs nearly 2.5 times faster.
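The bi-directional correspondence described above can be illustrated with a minimal sketch of cross-attention between two feature sets: pixels of the query patch attend over pixels of the target image, and vice versa, so every paired pixel contributes to the similarity comparison. This is an illustrative NumPy toy, not the paper's actual CAT implementation; the function names, feature sizes, and single-head formulation are assumptions for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(a, b):
    # Each row (pixel feature) of `a` attends over all rows of `b`:
    # scaled dot-product scores give a (n_a, n_b) pairwise similarity map,
    # then each pixel of `a` aggregates the features of `b`.
    d = a.shape[1]
    scores = a @ b.T / np.sqrt(d)         # (n_a, n_b) pairwise similarities
    return softmax(scores, axis=-1) @ b   # (n_a, d) attended features

def bidirectional_cross_attention(query_feats, target_feats):
    # Query pixels attend to target pixels, and target pixels attend to
    # query pixels, capturing correspondence in both directions.
    q_enhanced = cross_attention(query_feats, target_feats)
    t_enhanced = cross_attention(target_feats, query_feats)
    return q_enhanced, t_enhanced

rng = np.random.default_rng(0)
q = rng.standard_normal((9, 32))     # e.g., a 3x3 query patch, 32-dim features
t = rng.standard_normal((196, 32))   # e.g., a 14x14 target feature map
qe, te = bidirectional_cross_attention(q, t)
print(qe.shape, te.shape)            # (9, 32) (196, 32)
```

In this sketch the two attended outputs keep the spatial sizes of their inputs, so they can replace the original feature maps in a downstream detection head; dimensionality compression, as mentioned in the abstract, would correspond to projecting the features to a smaller `d` before attention.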

Electronic Supplementary Material

JCST-2106-11743-Highlights.pdf (300.8 KB)
Journal of Computer Science and Technology
Pages 460-471
Cite this article:
Lin W-D, Deng Y-Y, Gao Y, et al. CAT: A Simple yet Effective Cross-Attention Transformer for One-Shot Object Detection. Journal of Computer Science and Technology, 2024, 39(2): 460-471. https://doi.org/10.1007/s11390-024-1743-6


Received: 27 June 2021
Accepted: 18 January 2024
Published: 30 March 2024
© Institute of Computing Technology, Chinese Academy of Sciences 2024