
Short Communication | Open Access

Can attention enable MLPs to catch up with CNNs?

BNRist, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
Cardiff University, Cardiff CF24 3AA, UK

Computational Visual Media
Pages 283-288
Cite this article:
Guo M-H, Liu Z-N, Mu T-J, et al. Can attention enable MLPs to catch up with CNNs? Computational Visual Media, 2021, 7(3): 283-288. https://doi.org/10.1007/s41095-021-0240-x