[4] Li Y, Fang C, Yang J, Wang Z, Lu X, Yang M H. Universal style transfer via feature transforms. In Proc. the 31st International Conference on Neural Information Processing Systems, December 2017, pp. 385-395.
[6] Deng Y, Tang F, Dong W, Huang H, Ma C, Xu C. Arbitrary video style transfer via multi-channel correlation. In Proc. the 35th AAAI Conference on Artificial Intelligence, February 2021, pp. 1210-1217.
[7] Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser Ł, Polosukhin I. Attention is all you need. In Proc. the 31st International Conference on Neural Information Processing Systems, December 2017, pp. 6000-6010.
[8] Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J, Houlsby N. An image is worth 16x16 words: Transformers for image recognition at scale. In Proc. the 9th International Conference on Learning Representations, May 2021.
[14] Naseer M, Ranasinghe K, Khan S, Hayat M, Khan F, Yang M H. Intriguing properties of vision transformers. In Proc. the 35th Conference on Neural Information Processing Systems, December 2021.
[24] Chen M, Radford A, Child R, Wu J, Jun H, Luan D, Sutskever I. Generative pretraining from pixels. In Proc. the 37th International Conference on Machine Learning, July 2020, pp. 1691-1703.
[28] Kumar M, Weissenborn D, Kalchbrenner N. Colorization transformer. In Proc. the 9th International Conference on Learning Representations, May 2021.
[30] Jiang Y, Chang S, Wang Z. TransGAN: Two pure transformers can make one strong GAN, and that can scale up. In Proc. the 35th Conference on Neural Information Processing Systems, December 2021.
[31] Cordonnier J B, Loukas A, Jaggi M. On the relationship between self-attention and convolutional layers. In Proc. the 8th International Conference on Learning Representations, April 2020.
[32] Xiong R, Yang Y, He D, Zheng K, Zheng S, Xing C, Zhang H, Lan Y, Wang L, Liu T. On layer normalization in the transformer architecture. In Proc. the 37th International Conference on Machine Learning, July 2020, pp. 10524-10533.
[33] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. In Proc. the 3rd International Conference on Learning Representations, May 2015.
[34] Dosovitskiy A, Brox T. Generating images with perceptual similarity metrics based on deep networks. In Proc. the 30th International Conference on Neural Information Processing Systems, December 2016, pp. 658-666.
[37] Kingma D P, Ba J. Adam: A method for stochastic optimization. In Proc. the 3rd International Conference on Learning Representations, May 2015.
[41] Geirhos R, Rubisch P, Michaelis C, Bethge M, Wichmann F A, Brendel W. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In Proc. the 7th International Conference on Learning Representations, May 2019.
[42] Touvron H, Cord M, Douze M, Massa F, Sablayrolles A, Jégou H. Training data-efficient image transformers & distillation through attention. In Proc. the 38th International Conference on Machine Learning, July 2021, pp. 10347-10357.