PVT v2: Improved baselines with Pyramid Vision Transformer

Wenhai Wang; Enze Xie; Xiang Li; Deng-Ping Fan; Kaitao Song; Ding Liang; Tong Lu; Ping Luo; Ling Shao

doi:10.1007/s41095-022-0274-8

| Sign up

PDF (998 KB)

Cite

EndNote(RIS) BibTeX

Collect

Submit Manuscript

Show Outline

Figures (4)

Fig. 1

Fig. 2

Fig. 3

Fig. 4

Tables (6)

Table 1

Table 2

Table 3

Table 4

Table 5

Research Article | Open Access

PVT v2: Improved baselines with Pyramid Vision Transformer

Wenhai Wang^{¹^,²}

(), Enze Xie^³, Xiang Li^⁴, Deng-Ping Fan^⁵

, Kaitao Song^⁴, Ding Liang^⁶, Tong Lu^², Ping Luo^³, Ling Shao^⁷

1Shanghai AI Laboratory, Shanghai 200232, China

2Department of Computer Science and Technology, NanjingUniversity, Nanjing 210023, China

3Department of Computer Science, the University ofHong Kong, Hong Kong 999077, China

4School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210014, China

5Computer Vision Lab, ETH Zurich, Zurich 8092, Switzerland

6SenseTime, Beijing 100080, China

7Inception Institute of Artificial Intelligence, Abu Dhabi, United Arab Emirates

Show Author Information

Graphical Abstract

View original image Download original image

Abstract

Transformers have recently lead to encouraging progress in computer vision. In this work, we present new baselines by improving the original Pyramid Vision Transformer (PVT v1) by adding three designs: (i) a linear complexity attention layer, (ii) an overlapping patch embedding, and (iii) a convolutional feed-forward network. With these modifications, PVT v2 reduces the computational complexity of PVT v1 to linearity and provides significant improvements on fundamental vision tasks such as classification, detection, and segmentation. In particular, PVT v2 achieves comparable or better performance than recent work such as the Swin transformer. We hope this work will facilitate state-of-the-art transformer research in computer vision. Code is available at https://github.com/whai362/PVT.

Keywords

transformers dense prediction image classification object detection semantic segmentation

References

[1]

Dosovitskiy,

; Beyer,

; Kolesnikov,

; Weissenborn,

; Zhai,

; Unterthiner,

; Dehghani,

; Minderer,

; Heigold,

; S. Gelly,

; et al. An image is worth 16

\times

16 words: Transformers for image recognition at scale. In: Proceedings of the International Conference on Learning Representations, 2021.

[2]

Touvron,

; Cord,

; Douze,

; Massa,

; Jégou,

Training data-efficient image transformers & distillation through attention. In: Proceedings of the 38th International Conference on Machine Learning, 2021.

[3]

Wang,

; Xie,

; Li,

; Fan,

D.-P.

; Song,

; Liang,

; Lu,

; Luo,

; Shao,

Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 568-578, 2021.

Crossref

[4]

Wu,

; Xiao,

; Codella,

; Liu,

; Dai,

; Yuan,

; Zhang,

CvT: Introducing convolutions to vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 22-31, 2021.

Crossref

[5]

Liu,

; Lin,

; Cao,

; Hu,

; Wei,

; Zhang,

; Lin,

; Guo,

Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 10012-10022, 2021.

Crossref

[6]

Xu,

; Xu,

; Chang,

; Tu,

Co-scale conv-attentional image transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 9981-9990, 2021.

Crossref

[7]

Graham,

; El-Nouby,

; Touvron,

; Stock,

; Joulin,

; Jégou,

LeViT: A vision transformer in ConvNet’s clothing for faster inference. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 12259-12269, 2021.

Crossref

[8]

Chu,

; Tian,

; Wang,

; Zhang,

; Ren,

; Wei,

; Xia,

; Shen,

Twins: Revisiting the design of spatial attention in vision transformers. In: Proceedings of the 35th Conference on Neural Information Processing Systems, 2021.

[9]

Lin,

T. Y.

; Maire,

; Belongie,

; Hays,

; Perona,

; Ramanan,

; Dollár,

; Zitnick,

C. L.

Microsoft COCO: Common objects in context. In: Computer Vision - ECCV 2014. Lecture Notes in Computer Science, Vol. 8693. Fleet,

; Pajdla,

; Schiele,

; Tuytelaars,

Eds. Springer Cham, 740-755, 2014.

Crossref

[10]

Zhou,

B. L.

; Zhao,

; Puig,

; Fidler,

; Barriuso,

; Torralba,

Scene parsing through ADE20K dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5122-5130, 2017.

Crossref

[11]

Dong,

; Wang,

; Fan,

D.-P.

; Li,

; Fu,

; Shao,

Polyp-PVT: Polyp segmentation with pyramid vision transformers. arXiv preprint arXiv:2108.06932, 2021.

Google Scholar

[12]

Li,

; Wang,

; Wu,

; Chen,

; Hu,

; Li,

; Tang,

; Yang,

Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection. In: Proceedings of the 34th Conference on Neural Information Processing Systems, 2020.

Crossref

[13]

He,

K. M.

; Zhang,

X. Y.

; Ren,

S. Q.

; Sun,

Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In: Proceedings of the IEEE International Conference on Computer Vision, 1026-1034, 2015.

[14]

Deng,

; Dong,

; Socher,

; Li,

L. J.

; Kai,

; Li,

F. F.

ImageNet: A large-scale hierarchical image database. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 248-255, 2009.

Crossref

[15]

Yuan,

; Chen,

; Wang,

; Yu,

; Shi,

; Jiang,

; Tay,

F. E.

; Feng,

; Yan,

Tokens-to-token ViT: Training vision transformers from scratch on ImageNet. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 558-567, 2021.

Crossref

[16]

Han,

; Xiao,

; Wu,

; Guo,

; Xu,

; Wang,

Transformer in transformer. arXiv preprint arXiv:2103.00112, 2021.

Google Scholar

[17]

Chu,

; Tian,

; Zhang,

; Wang,

; Wei,

; Xia,

; Shen,

Conditional positional encodings for vision transformers. arXiv preprint arXiv:2102.10882, 2021.

Google Scholar

[18]

Chen,

C.-F.

; Fan,

; Panda,

CrossViT: Cross-attention multi-scale vision transformer for image classification. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 357-366, 2021.

Crossref

[19]

Li,

; Zhang,

; Cao,

; Timofte,

; van Gool,

LocalViT: Bringing locality to vision transformers. arXiv preprint arXiv:2104.05707, 2021.

Google Scholar

[20]

Islam,

M. A.

; Jia,

; Bruce,

N. D. B.

How much position information do convolutional neural networks encode? In: Proceedings of the International Conference on Learning Representations, 2020.

[21]

Howard,

A. G.

; Zhu,

; Chen,

; Kalenichenko,

; Wang,

; Weyand,

; Andreetto,

; Adam,

MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.

Google Scholar

[22]

Hendrycks,

; Gimpel,

Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415, 2016.

Google Scholar

[23]

Vaswani,

; Shazeer,

; Parmar,

; Uszkoreit,

; Jones,

; Gomez,

A. N.

; Kaiser,

; Polosukhin,

Attention is all you need. In: Proceedings of the 31st International Conference on Neural Information Processing Systems, 6000-6010, 2017.

[24]

He,

K. M.

; Zhang,

X. Y.

; Ren,

S. Q.

; Sun,

Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770-778, 2016.

[25]

Xie,

S. N.

; Girshick,

; Dollar,

; Tu,

Z. W.

; He,

K. M.

Aggregated residual transformations for deep neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5987-5995, 2017.

Crossref

[26]

Radosavovic,

; Kosaraju,

R. P.

; Girshick,

; He,

K. M.

; Dollár,

Designing network design spaces. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10425-10433, 2020.

Crossref

[27]

Russakovsky,

; Deng,

; Su,

; Krause,

; Satheesh,

; Ma,

S. A.

; Huang,

; Karpathy,

; Khosla,

; Bernstein,

; et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision Vol. 115, No. 3, 211-252, 2015.

Crossref Google Scholar

[28]

Szegedy,

; Liu,

; Jia,

Y. Q.

; Sermanet,

; Reed,

; Anguelov,

; Erhan,

; Vanhoucke,

; Rabinovich,

Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1-9, 2015.

Crossref

[29]

Szegedy,

; Vanhoucke,

; Ioffe,

; Shlens,

; Wojna,

Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2818-2826, 2016.

Crossref

[30]

Zhang,

; Cisse,

; Dauphin,

Y. N.

; Lopez-Paz,

mixup: Beyond empirical risk minimization. In: Proceedings of the International Conference on Learning Representations, 2018.

[31]

Zhong,

; Zheng,

; Kang,

G. L.

; Li,

S. Z.

; Yang,

Random erasing data augmentation. Proceedings of the AAAI Conference on Artificial Intelligence Vol. 34, No. 7, 13001-13008, 2020.

Crossref Google Scholar

[32]

Loshchilov,

; Hutter,

Decoupled weight decay regularization. In: Proceedings of the International Conference on Learning Representations, 2019.

[33]

Loshchilov,

; Hutter,

SGDR: Stochastic gradient descent with warm restarts. In: Proceedings of theInternational Conference on Learning Representations, 2017.

[34]

Lin,

T. Y.

; Goyal,

; Girshick,

; He,

K. M.

; Dollár,

Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision, 2999-3007, 2017.

Crossref

[35]

He,

K. M.

; Gkioxari,

; Dollár,

; Girshick,

Mask R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, 2980-2988, 2017.

[36]

Cai,

Z. W.

; Vasconcelos,

Cascade R-CNN: Delving into high quality object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6154-6162, 2018.

Crossref

[37]

Zhang,

S. F.

; Chi,

; Yao,

Y. Q.

; Lei,

; Li,

S. Z.

Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9756-9765, 2020.

Crossref

[38]

Sun,

P. Z.

; Zhang,

R. F.

; Jiang,

; Kong,

; Xu,

C. F.

; Zhan,

; Tomizuka,

; Li,

; Yuan,

; Wang,

; et al. Sparse R-CNN: End-to-end object detection with learnable proposals. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 14449-14458, 2021.

Crossref

[39]

Glorot,

; Bengio,

Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, 249-256, 2010.

[40]

Chen,

; Wang,

J. Q.

; Pang,

J. M.

; Cao,

Y. H.

; Xiong,

; Li,

; Sun,

; Feng,

; Liu,

; Xu,

; et al. MMDetection: Open MMLab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155, 2019.

Google Scholar

[41]

Kirillov,

; Girshick,

; He,

K. M.

; Dollár,

Panoptic feature pyramid networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6392-6401, 2019.

Crossref

[42]

Chen,

L. C.

; Papandreou,

; Kokkinos,

; Murphy,

; Yuille,

A. L.

DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence Vol. 40, No. 4, 834-848, 2018.

Crossref Google Scholar

Computational Visual Media

Volume 8 Issue 3,
September 2022

Pages 415-424

DOI: 10.1007/s41095-022-0274-8

Cite this article:

Wang W, Xie E, Li X, et al. PVT v2: Improved baselines with Pyramid Vision Transformer. Computational Visual Media, 2022, 8(3): 415-424. https://doi.org/10.1007/s41095-022-0274-8

Return

10.1007/s41095-022-0274-8.T001Table 1Detailed settings for PVT v2 series. “-Li” denotes PVT v2 with linear SRA

	Output size	Layer name	Pyramid Vision Transformer v2
	Output size	Layer name	B0	B1	B2	B2-Li	B3	B4	B5
Stage 1	$\frac{H}{4} \times \frac{W}{4}$	Overlapping patch embedding	$S_{1} = 4$
		Overlapping patch embedding	$C_{1} = 32$	$C_{1} = 64$
		Transformer encoder	$\begin{matrix} R_{1} = 8 \\ N_{1} = 1 \\ E_{1} = 8 \\ L_{1} = 2 \end{matrix}$	$\begin{matrix} R_{1} = 8 \\ N_{1} = 1 \\ E_{1} = 8 \\ L_{1} = 2 \end{matrix}$	$\begin{matrix} R_{1} = 8 \\ N_{1} = 1 \\ E_{1} = 8 \\ L_{1} = 3 \end{matrix}$	$\begin{matrix} P_{1} = 7 \\ N_{1} = 1 \\ E_{1} = 8 \\ L_{1} = 3 \end{matrix}$	$\begin{matrix} R_{1} = 8 \\ N_{1} = 1 \\ E_{1} = 8 \\ L_{1} = 3 \end{matrix}$	$\begin{matrix} R_{1} = 8 \\ N_{1} = 1 \\ E_{1} = 8 \\ L_{1} = 3 \end{matrix}$	$\begin{matrix} R_{1} = 8 \\ N_{1} = 1 \\ E_{1} = 4 \\ L_{1} = 3 \end{matrix}$
Stage 2	$\frac{H}{8} \times \frac{W}{8}$	Overlapping patch embedding	$S_{2} = 2$
		Overlapping patch embedding	$C_{2} = 64$	$C_{2} = 128$
		Transformer encoder	$\begin{matrix} R_{2} = 4 \\ N_{2} = 2 \\ E_{2} = 8 \\ L_{2} = 2 \end{matrix}$	$\begin{matrix} R_{2} = 4 \\ N_{2} = 2 \\ E_{2} = 8 \\ L_{2} = 2 \end{matrix}$	$\begin{matrix} R_{2} = 4 \\ N_{2} = 2 \\ E_{2} = 8 \\ L_{2} = 3 \end{matrix}$	$\begin{matrix} P_{2} = 7 \\ N_{2} = 2 \\ E_{2} = 8 \\ L_{2} = 3 \end{matrix}$	$\begin{matrix} R_{2} = 4 \\ N_{2} = 2 \\ E_{2} = 8 \\ L_{2} = 3 \end{matrix}$	$\begin{matrix} R_{2} = 4 \\ N_{2} = 2 \\ E_{2} = 8 \\ L_{2} = 8 \end{matrix}$	$\begin{matrix} R_{2} = 4 \\ N_{2} = 2 \\ E_{2} = 4 \\ L_{2} = 6 \end{matrix}$
Stage 3	$\frac{H}{16} \times \frac{W}{16}$	Overlapping patch embedding	$S_{3} = 2$
		Overlapping patch embedding	$C_{3} = 160$	$C_{3} = 320$
		Transformer encoder	$\begin{matrix} R_{3} = 2 \\ N_{3} = 5 \\ E_{3} = 4 \\ L_{3} = 2 \end{matrix}$	$\begin{matrix} R_{3} = 2 \\ N_{3} = 5 \\ E_{3} = 4 \\ L_{3} = 2 \end{matrix}$	$\begin{matrix} R_{3} = 2 \\ N_{3} = 5 \\ E_{3} = 4 \\ L_{3} = 6 \end{matrix}$	$\begin{matrix} P_{3} = 7 \\ N_{3} = 5 \\ E_{3} = 4 \\ L_{3} = 6 \end{matrix}$	$\begin{matrix} R_{3} = 2 \\ N_{3} = 5 \\ E_{3} = 4 \\ L_{3} = 18 \end{matrix}$	$\begin{matrix} R_{3} = 2 \\ N_{3} = 5 \\ E_{3} = 4 \\ L_{3} = 27 \end{matrix}$	$\begin{matrix} R_{3} = 2 \\ N_{3} = 5 \\ E_{3} = 4 \\ L_{3} = 40 \end{matrix}$
Stage 4	$\frac{H}{32} \times \frac{W}{32}$	Overlapping patch embedding	$S_{4} = 2$
		Overlapping patch embedding	$C_{4} = 256$	$C_{4} = 512$
		Transformer encoder	$\begin{matrix} R_{4} = 1 \\ N_{4} = 8 \\ E_{4} = 4 \\ L_{4} = 2 \end{matrix}$	$\begin{matrix} R_{4} = 1 \\ N_{4} = 8 \\ E_{4} = 4 \\ L_{4} = 2 \end{matrix}$	$\begin{matrix} R_{4} = 1 \\ N_{4} = 8 \\ E_{4} = 4 \\ L_{4} = 3 \end{matrix}$	$\begin{matrix} P_{4} = 7 \\ N_{4} = 8 \\ E_{4} = 4 \\ L_{4} = 3 \end{matrix}$	$\begin{matrix} R_{4} = 1 \\ N_{4} = 8 \\ E_{4} = 4 \\ L_{4} = 3 \end{matrix}$	$\begin{matrix} R_{4} = 1 \\ N_{4} = 8 \\ E_{4} = 4 \\ L_{4} = 3 \end{matrix}$	$\begin{matrix} R_{4} = 1 \\ N_{4} = 8 \\ E_{4} = 4 \\ L_{4} = 3 \end{matrix}$

10.1007/s41095-022-0274-8.T002Table 2Image classification performance on the ImageNet validation set. #Param = millions of parameters. GFLOPs is calculated for input of size $224 \times 224$ . * = performance of the method trained under the strategy of its original paper. Acc = top-1 accuracy. -Li = PVT v2 with linear SRA

Method	#Param	GFLOPs	Acc (%)
PVT v2-B0 (ours)	3.4	0.6	70.5
ResNet18 $^{*}$ [24]	11.7	1.8	69.8
DeiT-Tiny/16 [2]	5.7	1.3	72.2
PVT v1-Tiny [3]	13.2	1.9	75.1
PVT v2-B1 (ours)	13.1	2.1	78.7
ResNet50 $^{*}$ [24]	25.6	4.1	76.1
ResNeXt50-32x4d $^{*}$ [25]	25.0	4.3	77.6
RegNetY-4G [26]	21.0	4.0	80.0
DeiT-Small/16 [2]	22.1	4.6	79.9
T2T-ViT $_{t}$ -14 [15]	22.0	6.1	80.7
PVT v1-Small [3]	24.5	3.8	79.8
TNT-S [16]	23.8	5.2	81.3
Swin-T [5]	29.0	4.5	81.3
CvT-13 [4]	20.0	4.5	81.6
CoaT-Lite Small [6]	20.0	4.0	81.9
Twins-SVT-S [8]	24.0	2.8	81.7
PVT v2-B2-Li (ours)	22.6	3.9	82.1
PVT v2-B2 (ours)	25.4	4.0	82.0
ResNet101 $^{*}$ [24]	44.7	7.9	77.4
ResNeXt101-32x4d $^{*}$ [25]	44.2	8.0	78.8
RegNetY-8G [26]	39.0	8.0	81.7
T2T-ViT $_{t}$ -19 [15]	39.0	9.8	81.4
PVT v1-Medium [3]	44.2	6.7	81.2
CvT-21 [4]	32.0	7.1	82.5
PVT v2-B3 (ours)	45.2	6.9	83.2
ResNet152 $^{*}$ [24]	60.2	11.6	78.3
T2T-ViT $_{t}$ -24 [15]	64.0	15.0	82.2
PVT v1-Large [3]	61.4	9.8	81.7
TNT-B [16]	66.0	14.1	82.8
Swin-S [5]	50.0	8.7	83.0
Twins-SVT-B [8]	56.0	8.3	83.2
PVT v2-B4 (ours)	62.6	10.1	83.6
ResNeXt101-64x4d $^{*}$ [25]	83.5	15.6	79.6
RegNetY-16G [26]	84.0	16.0	82.9
ViT-Base/16 [1]	86.6	17.6	81.8
DeiT-Base/16 [2]	86.6	17.6	81.8
Swin-B [5]	88.0	15.4	83.3
Twins-SVT-L [8]	99.2	14.8	83.7
PVT v2-B5 (ours)	82.0	11.8	83.8

10.1007/s41095-022-0274-8.T003Table 3Object detection and instance segmentation on COCO 2017 val. #P = millions of parameters. AP $^{b}$ = bounding box AP. AP $^{m}$ = mask AP. -Li = PVT v2 with linear SRA

Backbone	RetinaNet 1 $\times$							Mask R-CNN 1 $\times$
Backbone	#P	AP	AP $_{50}$	AP $_{75}$	AP $_{S}$	AP $_{M}$	AP $_{L}$	#P	AP $^{b}$	AP $_{50}^{b}$	AP $_{75}^{b}$	AP $^{m}$	AP $_{50}^{m}$	AP $_{75}^{m}$
PVT v2-B0	13.0	37.2	57.2	39.5	23.1	40.4	49.7	23.5	38.2	60.5	40.7	36.2	57.8	38.6
ResNet18 [24]	21.3	31.8	49.6	33.6	16.3	34.3	43.2	31.2	34.0	54.0	36.7	31.2	51.0	32.7
PVT v1-Tiny [3]	23.0	36.7	56.9	38.9	22.6	38.8	50.0	32.9	36.7	59.2	39.3	35.1	56.7	37.3
PVT v2-B1 (ours)	23.8	41.2	61.9	43.9	25.4	44.5	54.3	33.7	41.8	64.3	45.9	38.8	61.2	41.6
ResNet50 [24]	37.7	36.3	55.3	38.6	19.3	40.0	48.8	44.2	38.0	58.6	41.4	34.4	55.1	36.7
PVT v1-Small [3]	34.2	40.4	61.3	43.0	25.0	42.9	55.7	44.1	40.4	62.9	43.8	37.8	60.1	40.3
PVT v2-B2-Li (ours)	32.3	43.6	64.7	46.8	28.3	47.6	57.4	42.2	44.1	66.3	48.4	40.5	63.2	43.6
PVT v2-B2 (ours)	35.1	44.6	65.6	47.6	27.4	48.8	58.6	45.0	45.3	67.1	49.6	41.2	64.2	44.4
ResNet101 [24]	56.7	38.5	57.8	41.2	21.4	42.6	51.1	63.2	40.4	61.1	44.2	36.4	57.7	38.8
ResNeXt101-32x4d [25]	56.4	39.9	59.6	42.7	22.3	44.2	52.5	62.8	41.9	62.5	45.9	37.5	59.4	40.2
PVT v1-Medium [3]	53.9	41.9	63.1	44.3	25.0	44.9	57.6	63.9	42.0	64.4	45.6	39.0	61.6	42.1
PVT v2-B3 (ours)	55.0	45.9	66.8	49.3	28.6	49.8	61.4	64.9	47.0	68.1	51.7	42.5	65.7	45.7
PVT v1-Large [3]	71.1	42.6	63.7	45.4	25.8	46.0	58.4	81.0	42.9	65.0	46.6	39.5	61.9	42.5
PVT v2-B4 (ours)	72.3	46.1	66.9	49.2	28.4	50.0	62.2	82.2	47.5	68.7	52.0	42.7	66.1	46.1
ResNeXt101-64x4d [25]	95.5	41.0	60.9	44.0	23.9	45.2	54.0	101.9	42.8	63.8	47.3	38.4	60.6	41.3
PVT v2-B5 (ours)	91.7	46.2	67.1	49.5	28.5	50.0	62.5	101.6	47.4	68.6	51.9	42.5	65.7	46.0

10.1007/s41095-022-0274-8.T004Table 4Comparison with Swin transformer on object detection. AP $^{b}$ = bounding box AP. #P = millions of parameters. #G = GFLOPs calculated for an input size $1280 \times 800$ . -Li = PVT v2 with linear SRA

Backbone	Method	AP $^{b}$	AP $_{50}^{b}$	AP $_{75}^{b}$	#P	#G
ResNet50 [24]	Cascade Mask R-CNN	46.3	64.3	50.5	82	739
Swin-T [5]		50.5	69.3	54.9	86	745
PVT v2-B2-Li (ours)		50.9	69.5	55.2	80	725
PVT v2-B2 (ours)		51.1	69.8	55.3	83	788
ResNet50 [24]	ATSS	43.5	61.9	47.0	32	205
Swin-T [5]		47.2	66.5	51.3	36	215
PVT v2-B2-Li (ours)		48.9	68.1	53.4	30	194
PVT v2-B2 (ours)		49.9	69.1	54.1	33	258
ResNet50 [24]	GFL	44.5	63.0	48.3	32	208
Swin-T [5]		47.6	66.8	51.7	36	215
PVT v2-B2-Li (ours)		49.2	68.2	53.7	30	197
PVT v2-B2 (ours)		50.2	69.4	54.7	33	261
ResNet50 [24]	Sparse R-CNN	44.5	63.4	48.2	106	166
Swin-T [5]		47.9	67.3	52.3	110	172
PVT v2-B2-Li (ours)		48.9	68.3	53.4	104	151
PVT v2-B2 (ours)		50.1	69.5	54.9	107	215

10.1007/s41095-022-0274-8.T005Table 5Semantic segmentation results for different backbones using the ADE20K validation set. #P = millions of parameters. #G = GFLOPs with input size $512 \times 512$ . -Li = PVT v2 with linear SRA

Backbone	Semantic FPN
Backbone	#P	#G	mIoU (%)
PVT v2-B0 (ours)	7.6	25.0	37.2
ResNet18 [24]	15.5	32.2	32.9
PVT v1-Tiny [3]	17.0	33.2	35.7
PVT v2-B1 (ours)	17.8	34.2	42.5
ResNet50 [24]	28.5	45.6	36.7
PVT v1-Small [3]	28.2	44.5	39.8
PVT v2-B2-Li (ours)	26.3	41.0	45.1
PVT v2-B2 (ours)	29.1	45.8	45.2
ResNet101 [24]	47.5	65.1	38.8
ResNeXt101-32x4d [25]	47.1	64.7	39.7
PVT v1-Medium [3]	48.0	61.0	41.6
PVT v2-B3 (ours)	49.0	62.4	47.3
PVT v1-Large [3]	65.1	79.6	42.1
PVT v2-B4 (ours)	66.3	81.3	47.9
ResNeXt101-64x4d [25]	86.4	103.9	40.2
PVT v2-B5 (ours)	85.7	91.1	48.7

10.1007/s41095-022-0274-8.T006Table 6Ablation experiments on PVT v2. OPE, CFFN, and LSRA represent overlapping patch embedding, convolutional feed-forward network (PVT v2-B2), and linear SRA (PVT v2-B2-Li), respectively. #P = millions of parameters. #G = GFLOPs. Acc = top-1 accuracy

#	Setting	Acc (%)	RetinaNet 1 $\times$
#	Setting	Acc (%)	#P	#G	AP
1	PVT v1-Small [3]	79.8	34.2	285.8	40.4
2	+ OPE	81.1	34.9	288.6	42.2
3	++ CFFN	82.0	35.1	290.7	44.6
4	+++ LSRA	82.1	32.3	227.4	43.6