Review Article | Open Access

Attention mechanisms in computer vision: A survey

BNRist, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
TKLNDST, College of Computer Science, Nankai University, Tianjin 300350, China
School of Computer Science and Informatics, Cardiff University, Cardiff, UK

Abstract

Humans can naturally and effectively find salient regions in complex scenes. Motivated by this observation, attention mechanisms were introduced into computer vision with the aim of imitating this aspect of the human visual system. Such an attention mechanism can be regarded as a dynamic weight adjustment process based on features of the input image. Attention mechanisms have achieved great success in many visual tasks, including image classification, object detection, semantic segmentation, video understanding, image generation, 3D vision, multi-modal tasks, and self-supervised learning. In this survey, we provide a comprehensive review of various attention mechanisms in computer vision and categorize them according to approach, such as channel attention, spatial attention, temporal attention, and branch attention; a related repository, https://github.com/MenghaoGuo/Awesome-Vision-Attentions, is dedicated to collecting related work. We also suggest future directions for attention mechanism research.

Computational Visual Media
Pages 331-368
Cite this article:
Guo M-H, Xu T-X, Liu J-J, et al. Attention mechanisms in computer vision: A survey. Computational Visual Media, 2022, 8(3): 331-368. https://doi.org/10.1007/s41095-022-0271-y

Received: 31 December 2021
Accepted: 18 January 2022
Published: 15 March 2022
© The Author(s) 2022.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Other papers from this open access journal are available free of charge from http://www.springer.com/journal/41095. To submit a manuscript, please go to https://www.editorialmanager.com/cvmj.
