Research Article | Open Access

A unified multi-view multi-person tracking framework

Fujitsu Research, Japan


Abstract

Despite significant developments in 3D multi-view multi-person (3D MM) tracking, current frameworks target either footprint tracking or pose tracking separately. Frameworks designed for the former cannot be used for the latter because they obtain 3D positions directly on the ground plane via a homography projection, which is inapplicable to 3D poses above the ground. Conversely, frameworks designed for pose tracking generally isolate multi-view association from multi-frame association and may not be sufficiently robust for footprint tracking, which uses fewer keypoints than pose tracking and therefore provides weaker multi-view association cues in a single frame. This study presents a unified multi-view multi-person tracking framework that bridges the gap between footprint tracking and pose tracking. Without additional modification, the framework accepts monocular 2D bounding boxes and 2D poses as input and produces robust 3D trajectories for multiple persons. Importantly, multi-frame and multi-view information are employed jointly to improve both association and triangulation. The framework achieves state-of-the-art performance on the Campus and Shelf datasets for 3D pose tracking, with comparable results on the WILDTRACK and MMPTRACK datasets for 3D footprint tracking.
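To make the footprint-versus-pose distinction concrete, the minimal Python sketch below contrasts the two geometric operations the abstract refers to. It is not the paper's implementation: the function names, the image-to-ground homography H, and the camera projection matrices P1, P2 are assumptions standing in for quantities obtained from camera calibration. A homography maps a bounding-box footprint only onto the Z=0 ground plane, whereas two-view triangulation can recover joints at arbitrary heights.

```python
import numpy as np

def footprint_to_ground(bbox_xyxy, H):
    """Map the bottom-centre of a 2D bounding box to ground-plane (X, Y).

    bbox_xyxy : (x1, y1, x2, y2) monocular detection in pixels.
    H         : 3x3 image-to-ground homography (assumed known from calibration).
    Only valid for points lying on the ground plane (Z = 0).
    """
    x1, y1, x2, y2 = bbox_xyxy
    foot = np.array([(x1 + x2) / 2.0, y2, 1.0])  # bottom-centre, homogeneous
    ground = H @ foot
    return ground[:2] / ground[2]

def triangulate_joint(P1, P2, kp1, kp2):
    """Linear (DLT) triangulation of one 2D keypoint observed in two views.

    P1, P2   : 3x4 camera projection matrices (assumed known from calibration).
    kp1, kp2 : (u, v) pixel coordinates of the same joint in each view.
    Unlike the homography above, the result is not restricted to Z = 0.
    """
    A = np.stack([
        kp1[0] * P1[2] - P1[0],
        kp1[1] * P1[2] - P1[1],
        kp2[0] * P2[2] - P2[0],
        kp2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]
```

The sketch only illustrates the single-frame geometry; the joint multi-frame, multi-view association described in the abstract determines which detections and keypoints are fed into these operations.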

Cite this article:
Yang F, Odashima S, Yamao S, et al. A unified multi-view multi-person tracking framework. Computational Visual Media, 2024, 10(1): 137-160. https://doi.org/10.1007/s41095-023-0334-8