
Short Communication | Open Access

Can attention enable MLPs to catch up with CNNs?

BNRist, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
Cardiff University, Cardiff CF24 3AA, UK

Computational Visual Media
Pages 283-288
Cite this article:
Guo M-H, Liu Z-N, Mu T-J, et al. Can attention enable MLPs to catch up with CNNs? Computational Visual Media, 2021, 7(3): 283-288. https://doi.org/10.1007/s41095-021-0240-x