DTCC: Multi-level dilated convolution with transformer for weakly-supervised crowd counting

Zhuangzhuang Miao; Yong Zhang; Yuan Peng; Haocheng Peng; Baocai Yin

doi:10.1007/s41095-022-0313-5

Research Article | Open Access

Zhuangzhuang Miao¹, Yong Zhang¹, Yuan Peng², Haocheng Peng¹, Baocai Yin¹

¹ Beijing Key Laboratory of Multimedia and Intelligent Software Technology, Beijing Institute of Artificial Intelligence, Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China

² Taiji Computer Corporation Ltd., China

Abstract

Crowd counting provides an important foundation for public security and urban management. It is a challenging task because crowd images contain small targets and large density variations. Mainstream methods usually apply convolutional neural networks (CNNs) to regress a density map, which requires annotations of individual persons and counts. Weakly-supervised methods avoid detailed labeling and require only counts as image annotations, but existing methods fail to achieve satisfactory performance because they usually ignore a global perspective field and multi-level information. We propose a weakly-supervised method, DTCC, which effectively combines multi-level dilated convolution and transformer methods to realize end-to-end crowd counting. Its main components are a recursive Swin transformer and a multi-level dilated convolution regression head. The recursive Swin transformer combines a pyramid visual transformer with a fine-tuned recursive pyramid structure to capture deep multi-level crowd features, including global features. The multi-level dilated convolution regression head includes multi-level dilated convolution and a linear regression head for the feature extraction module. This module can capture both low- and high-level features simultaneously to enhance the receptive field. In addition, two regression head fusion mechanisms realize dynamic and mean fusion counting. Experiments on four well-known benchmark crowd counting datasets (UCF_CC_50, ShanghaiTech, UCF_QNRF, and JHU-Crowd++) show that DTCC achieves results superior to other weakly-supervised methods and comparable to fully-supervised methods.
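The abstract names three ideas without giving formulas: dilated convolution enlarging the receptive field, mean versus dynamic fusion of several regression heads, and supervision from image-level counts alone. The sketch below illustrates each in plain Python. It is not the DTCC implementation. The function names, the softmax weighting used for "dynamic" fusion, and the absolute-error loss are all assumptions made for illustration. Only the effective-kernel-size formula is a standard property of dilated convolution.

```python
import math


def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Span covered by one dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def mean_fusion(counts):
    """Average the counts predicted by several regression heads."""
    return sum(counts) / len(counts)


def dynamic_fusion(counts, scores):
    """Weight each head's count by a softmax over per-head scores.

    Assumption: the scores stand in for learned weights. The abstract
    does not say how the dynamic weights are actually produced.
    """
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return sum(c * e / total for c, e in zip(counts, exps))


def count_l1_loss(predicted: float, target: float) -> float:
    """Count-level loss: only the image's total count is needed as a label.

    Assumption: absolute error is used here purely as an example.
    """
    return abs(predicted - target)


if __name__ == "__main__":
    # A 3x3 kernel spans 3, 5, and 7 pixels at dilation rates 1, 2, and 3.
    for rate in (1, 2, 3):
        print(rate, effective_kernel_size(3, rate))

    head_counts = [118.0, 124.0, 130.0]  # hypothetical per-head predictions
    fused = mean_fusion(head_counts)
    print(fused, count_l1_loss(fused, 121.0))
    print(dynamic_fusion(head_counts, [0.2, 1.5, -0.3]))
```

With equal scores, dynamic fusion reduces to mean fusion. Unequal scores shift the fused count toward the heads that receive higher weights.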

Keywords

pyramid transformer; crowd counting; dilated convolution; global perspective field

Computational Visual Media

Volume 9, Issue 4, December 2023

Pages 859-873

DOI: 10.1007/s41095-022-0313-5

Cite this article:

Miao Z, Zhang Y, Peng Y, et al. DTCC: Multi-level dilated convolution with transformer for weakly-supervised crowd counting. Computational Visual Media, 2023, 9(4): 859-873. https://doi.org/10.1007/s41095-022-0313-5

Received: 22 March 2022

Accepted: 12 September 2022

Published: 02 April 2023

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Other papers from this open access journal are available free of charge from http://www.springer.com/journal/41095. To submit a manuscript, please go to https://www.editorialmanager.com/cvmj.