Research Article | Open Access

A unified multi-view multi-person tracking framework

Fujitsu Research, Japan


Abstract

Despite significant developments in 3D multi-view multi-person (3D MM) tracking, current frameworks target either footprint tracking or pose tracking separately. Frameworks designed for the former cannot be used for the latter because they obtain 3D positions directly on the ground plane via a homography projection, which is inapplicable to 3D poses above the ground. Conversely, frameworks designed for pose tracking generally isolate multi-view association from multi-frame association and may not be sufficiently robust for footprint tracking, which uses fewer keypoints than pose tracking and therefore provides weaker multi-view association cues in a single frame. This study presents a unified multi-view multi-person tracking framework that bridges the gap between footprint tracking and pose tracking. Without additional modification, the framework accepts monocular 2D bounding boxes and 2D poses as input and produces robust 3D trajectories for multiple persons. Importantly, multi-frame and multi-view information are employed jointly to improve both association and triangulation. The framework achieves state-of-the-art performance on the Campus and Shelf datasets for 3D pose tracking, with comparable results on the WILDTRACK and MMPTRACK datasets for 3D footprint tracking.
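To make the footprint-versus-pose distinction concrete, the minimal Python sketch below contrasts the two geometric operations the abstract refers to. It is not the paper's implementation: the function names, the image-to-ground homography H, and the camera projection matrices P1, P2 are assumptions standing in for quantities obtained from camera calibration. A homography maps a bounding-box footprint only onto the Z=0 ground plane, whereas two-view triangulation can recover joints at arbitrary heights.

```python
import numpy as np

def footprint_to_ground(bbox_xyxy, H):
    """Map the bottom-centre of a 2D bounding box to ground-plane (X, Y).

    bbox_xyxy : (x1, y1, x2, y2) monocular detection in pixels.
    H         : 3x3 image-to-ground homography (assumed known from calibration).
    Only valid for points lying on the ground plane (Z = 0).
    """
    x1, y1, x2, y2 = bbox_xyxy
    foot = np.array([(x1 + x2) / 2.0, y2, 1.0])  # bottom-centre, homogeneous
    ground = H @ foot
    return ground[:2] / ground[2]

def triangulate_joint(P1, P2, kp1, kp2):
    """Linear (DLT) triangulation of one 2D keypoint observed in two views.

    P1, P2   : 3x4 camera projection matrices (assumed known from calibration).
    kp1, kp2 : (u, v) pixel coordinates of the same joint in each view.
    Unlike the homography above, the result is not restricted to Z = 0.
    """
    A = np.stack([
        kp1[0] * P1[2] - P1[0],
        kp1[1] * P1[2] - P1[1],
        kp2[0] * P2[2] - P2[0],
        kp2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]
```

The sketch only illustrates the single-frame geometry; the joint multi-frame, multi-view association described in the abstract determines which detections and keypoints are fed into these operations.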

Cite this article:
Yang F, Odashima S, Yamao S, et al. A unified multi-view multi-person tracking framework. Computational Visual Media, 2024, 10(1): 137-160. https://doi.org/10.1007/s41095-023-0334-8