Research Article | Open Access

Swin3D: A pretrained transformer backbone for 3D indoor scene understanding

Institute for Advanced Study, Tsinghua University, Beijing 100084, China
Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen 518055, China
Internet Graphics Group, Microsoft Research Asia, Beijing 100080, China
Wangxuan Institute of Computer Technology, Peking University, Beijing 100080, China


Abstract

The use of pretrained backbones with fine-tuning has shown success for 2D vision and natural language processing tasks, with advantages over task-specific networks. In this paper, we introduce a pretrained 3D backbone, called Swin3D, for 3D indoor scene understanding. We designed a 3D Swin Transformer as our backbone network, which enables efficient self-attention on sparse voxels with linear memory complexity, making the backbone scalable to large models and datasets. We also introduce a generalized contextual relative positional embedding scheme to capture various irregularities of point signals for improved network performance. We pretrained a large Swin3D model on a synthetic Structured3D dataset, which is an order of magnitude larger than the ScanNet dataset. Our model pretrained on the synthetic dataset not only generalizes well to downstream segmentation and detection on real 3D point datasets but also outperforms state-of-the-art methods on downstream tasks with +2.3 mIoU and +2.2 mIoU on S3DIS Area5 and 6-fold semantic segmentation, respectively, +1.8 mIoU on ScanNet segmentation (val), +1.9 mAP@0.5 on ScanNet detection, and +8.1 mAP@0.5 on S3DIS detection. A series of extensive ablation studies further validated the scalability, generality, and superior performance enabled by our approach.
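
To make the two key ideas in the abstract concrete, the following PyTorch sketch shows self-attention restricted to the non-empty voxels of a single window, with a contextual relative positional embedding whose bias depends on the query content as well as the quantized relative voxel offset. This is a minimal illustration under assumed names (ContextualWindowAttention, window_size, num_heads, coords), not the authors' released implementation.

    # Hedged sketch, not the authors' code: attention over the non-empty voxels
    # of one window, with a contextual relative positional embedding (cRPE).
    import torch
    import torch.nn as nn

    class ContextualWindowAttention(nn.Module):
        def __init__(self, dim: int, num_heads: int, window_size: int):
            super().__init__()
            self.num_heads = num_heads
            self.head_dim = dim // num_heads
            self.scale = self.head_dim ** -0.5
            self.window_size = window_size
            self.qkv = nn.Linear(dim, dim * 3)
            self.proj = nn.Linear(dim, dim)
            # One learnable embedding per quantized relative offset along each axis;
            # offsets within a window of size W lie in [-(W-1), W-1], i.e., 2W-1 bins.
            num_bins = 2 * window_size - 1
            self.rel_tables = nn.ParameterList(
                [nn.Parameter(torch.zeros(num_bins, num_heads, self.head_dim))
                 for _ in range(3)]  # x, y, z axes
            )

        def forward(self, feats: torch.Tensor, coords: torch.Tensor) -> torch.Tensor:
            # feats:  (N, C) features of the N non-empty voxels in one window
            # coords: (N, 3) integer voxel coordinates of those voxels
            n = feats.shape[0]
            qkv = self.qkv(feats).reshape(n, 3, self.num_heads, self.head_dim)
            q, k, v = qkv.unbind(dim=1)                              # each (N, H, D)
            attn = torch.einsum('nhd,mhd->hnm', q * self.scale, k)   # (H, N, N)

            # Contextual positional bias: dot product of each query with an
            # embedding looked up by the relative offset, summed over the 3 axes.
            rel = (coords[:, None, :] - coords[None, :, :]).long() + self.window_size - 1
            for axis in range(3):
                emb = self.rel_tables[axis][rel[..., axis]]          # (N, N, H, D)
                attn = attn + torch.einsum('nhd,nmhd->hnm', q, emb)

            out = torch.einsum('hnm,mhd->nhd', attn.softmax(dim=-1), v)
            return self.proj(out.reshape(n, -1))

Because attention is confined to the occupied voxels inside each (shifted) window, the cost grows linearly with the number of non-empty voxels rather than quadratically with the whole scene, which is what makes such a backbone scalable to large models and datasets.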

Cite this article:
Yang Y-Q, Guo Y-X, Xiong J-Y, et al. Swin3D: A pretrained transformer backbone for 3D indoor scene understanding. Computational Visual Media, 2025, 11(1): 83-101. https://doi.org/10.26599/CVM.2025.9450383