Review Article | Open Access

Transformers in computational visual media: A survey

NLPR, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100040, China
School of Artificial Intelligence, Jilin University, Changchun 130012, China
Youtu Lab, Tencent Inc., Shanghai 200233, China
CASIA-LLVISION Joint Lab, Beijing 100190, China

Abstract

Transformers, the dominant architecture in natural language processing, have recently attracted much attention from computational visual media researchers owing to their capacity for long-range representation and their high performance. Transformers are sequence-to-sequence models that replace the sequential recurrence of RNNs with a self-attention mechanism, so they can be trained in parallel and can capture global information. This study comprehensively surveys recent visual transformer works. We categorize them by task scenario: backbone design, high-level vision, low-level vision and generation, and multimodal learning, and analyze their key ideas. Unlike previous surveys, we focus mainly on visual transformer methods for low-level vision and generation, and also review the latest works on backbone design in detail. For ease of understanding, we summarize the main contributions of the latest works in tables. In addition to quantitative comparisons, we present image results for low-level vision and generation tasks. Computational costs and source-code links for important works are also given to assist further development.
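
For readers new to the core operation, the sketch below is a minimal, framework-free illustration of the single-head scaled dot-product self-attention that the abstract contrasts with RNN recurrence. The random weight matrices, dimensions, and toy "patch" tokens are assumptions chosen purely for illustration and are not tied to any particular method covered in the survey.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over tokens x of shape (n, d_model)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v            # project tokens to queries/keys/values
    scores = q @ k.T / np.sqrt(k.shape[-1])        # (n, n) pairwise similarities
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability for softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # row-wise softmax: attention weights
    return weights @ v                             # each output mixes information from all tokens

# Toy usage: 4 hypothetical "image patch" embeddings of dimension 8 (illustrative only).
rng = np.random.default_rng(0)
n, d_model = 4, 8
x = rng.normal(size=(n, d_model))
w_q = rng.normal(size=(d_model, d_model))
w_k = rng.normal(size=(d_model, d_model))
w_v = rng.normal(size=(d_model, d_model))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)   # (4, 8): same sequence length, computed in one parallel pass
```

Because the (n, n) attention matrix couples every token with every other token in a single matrix product, there is no sequential dependence between positions; this is the parallel-training and global-context property the abstract refers to.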

Computational Visual Media, Vol. 8, No. 1, Pages 33-62
Cite this article:
Xu Y, Wei H, Lin M, et al. Transformers in computational visual media: A survey. Computational Visual Media, 2022, 8(1): 33-62. https://doi.org/10.1007/s41095-021-0247-3

Received: 17 June 2021
Accepted: 16 July 2021
Published: 27 October 2021
© The Author(s) 2021.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Other papers from this open access journal are available free of charge from http://www.springer.com/journal/41095. To submit a manuscript, please go to https://www.editorialmanager.com/cvmj.
