Multi-task learning and joint refinement between camera localization and object detection

Junyi Wang; Yue Qi

doi:10.1007/s41095-022-0319-z

| Sign up

PDF (7.5 MB)

Cite

EndNote(RIS) BibTeX

Collect

Submit Manuscript

Show Outline

Figures (7)

Fig. 1

Fig. 2

Fig. 3

Fig. 4

Fig. 5

Fig. 6

Fig. 7

Tables (7)

Table 1

Table 2

Table 3

Table 4

Table 5

Research Article | Open Access

Multi-task learning and joint refinement between camera localization and object detection

Junyi Wang^{¹^,²}, Yue Qi^{¹^,²^,³}()

State Key Laboratory of Virtual Reality Technology and Systems, School of Computer Science and Engineering, Beihang University, Beijing 100191, China

Peng Cheng Laboratory, Shenzhen 518052, China

Qingdao Research Institute of Beihang University, Qingdao 266104, China

^* Junyi Wang’s present address: School of Computer Science and Technology, Shandong University, Qingdao, China.

Show Author Information

Graphical Abstract

View original image Download original image

Abstract

Visual localization and object detection both play important roles in various tasks. In many indoor application scenarios where some detected objects have fixed positions, the two techniques work closely together. However, few researchers consider these two tasks simultaneously, because of a lack of datasets and the little attention paid to such environments. In this paper, we explore multi-task network design and joint refinement of detection and localization. To address the dataset problem, we construct a medium indoor scene of an aviation exhibition hall through a semi-automatic process. The dataset provides localization and detection information, and is publicly available at https://drive.google.com/drive/folders/1U28zkON4_I0dbzkqyIAKlAl5k9oUK0jI?usp=sharing for benchmarking localization and object detection tasks. Targeting this dataset, we have designed a multi-task network, JLDNet, based on YOLO v3, that outputs a target point cloud and object bounding boxes. For dynamic environments, the detection branch also promotes the perception of dynamics. JLDNet includes image feature learning, point feature learning, feature fusion, detection construction, and point cloud regression. Moreover, object-level bundle adjustment is used to further improve localization and detection accuracy. To test JLDNet and compare it to other methods, we have conducted experiments on 7 static scenes, our constructed dataset, and the dynamic TUM RGB-D and Bonn datasets. Our results show state-of-the-art accuracy for both tasks, and the benefit of jointly working on both tasks is demonstrated.

Keywords

visual localization object detection joint optimization multi-task learning

References

[1]

Bao, W.; Wang, W.; Xu, Y. H.; Guo, Y. L.; Hong, S. Y.; Zhang, X. H. InStereo2K: A large real dataset for stereo matching in indoor scenes. Science China Information Sciences Vol. 63, No. 11, 212101, 2020.

Crossref Google Scholar

[2]

Yan, F. H.; Li, Z. X.; Zhou, Z. Robust and efficient edge-based visual odometry. Computational Visual Media Vol. 8, No. 3, 467–481, 2022.

Crossref Google Scholar

[3]

Huang, J. H.; Yang, S.; Zhao, Z. S.; Lai, Y. K.; Hu, S. M. ClusterSLAM: A SLAM backend for simultaneous rigid body clustering and motion estimation. Computational Visual Media Vol. 7, No. 1, 87–101, 2021.

Crossref Google Scholar

[4]

Wang, C.; Guo, X. H. Feature-based RGB-D camera pose optimization for real-time 3D reconstruction. Computational Visual Media Vol. 3, No. 2, 95–106, 2017.

Crossref Google Scholar

[5]

Nakajima, Y.; Saito, H. Robust camera pose estimation by viewpoint classification using deep learning. Computational Visual Media Vol. 3, No. 2, 189–198, 2017.

Crossref Google Scholar

[6]

Liu, S.; Zhang, Y. Q.; Yang, X. S.; Shi, D. M.; Zhang, J. J. Robust facial landmark detection and tracking across poses and expressions for in-the-wild monocular video. Computational Visual Media Vol. 3, No. 1, 33–47, 2017.

Crossref Google Scholar

[7]

Qin, T.; Li, P. L.; Shen, S. J. VINS-mono: A robust and versatile monocular visual-inertial state estimator. IEEE Transactions on Robotics Vol. 34, No. 4, 1004–1020, 2018.

Crossref Google Scholar

[8]

Rublee, E.; Rabaud, V.; Konolige, K.; Bradski, G. ORB: An efficient alternative to SIFT or SURF. In: Proceedings of the International Conference on Computer Vision, 2564–2571, 2011.

	Spatial extent	Training / testing frames	Method	Scene character
Our dataset
Aviation hall	1000 $m^{3}$	4000 / 5000		1. Medium extent
		2000 / 7000	SFM + MVS	2. Occlusion
				3. Strong light
7 Scenes
Chess	6 $m^{3}$	4000 / 2000
Fire	2.5 $m^{3}$	2000 / 2000
Heads	1 $m^{3}$	1000 / 1000
Office	7.5 $m^{3}$	6000 / 4000	Kinect fusion
Pumpkin	5 $m^{3}$	4000 / 2000
Kitchen	18 $m^{3}$	7000 / 5000
Stairs	7.5 $m^{3}$	2000 / 1000

	Split 1 (mean)	Split 1 (median)	Split 2 (mean)	Split 2 (median)
PoseNet	0.28/4.62	0.14/2.98	0.36/5.77	0.17/3.98
VLocNet	0.25/4.26	0.11/2.65	0.32/6.08	0.16/3.37
DSAC^*	0.17/2.72	0.09/1.99	0.21/3.76	0.14/2.65
KFNet	0.16/2.42	0.08/1.69	0.19/3.67	0.14/2.68
Ours (no detection)	0.18/3.27	0.11/2.42	0.24/4.13	0.15/3.12
Ours (no fusion block)	0.13/2.15	0.07/1.28	0.15/2.55	0.11/1.94
Ours (no refinement)	0.15/2.35	0.08/1.59	0.17/3.05	0.13/2.26
Ours	0.12/1.95	0.06/1.19	0.14/2.36	0.09/1.78

	fr3w_xyz	fr3w_static	fr3w_rpy	fr3w_half	fr3w_xyz	fr3s_static	fr3s_rpy	fr3s_half	fr2_desk_with_person
ATE (m)
ORB-SLAM2 (RGB-D)	0.459	0.090	0.662	0.351	0.011	0.009	0.044	0.020	0.074
DynaSLAM	0.015	0.006	0.035	0.025	0.015	—	—	0.017	—
DS-SLAM	0.025	0.008	0.444	0.030	—	0.007	—	—	—
MaskFusion	0.104	0.035	—	0.106	0.031	0.021	—	0.052
LC-CRF SLAM	0.016	0.011	0.046	0.028	0.009	—	—	—	0.069
Ours (no detection)	0.021	0.011	0.032	0.033	0.018	0.012	0.024	0.018	0.066
Ours	0.013	0.007	0.018	0.019	0.014	0.007	0.019	0.013	0.057
RPE (m/s)
ORB-SLAM2 (RGB-D)	0.412	0.216	0.424	0.355	0.016	0.011	0.039	0.024	0.104
DS-SLAM	0.033	0.010	0.150	0.030	—	0.008	—	—	—
MaskFusion	0.097	0.039	—	0.093	0.046	0.017	—	0.041
LC-CRF SLAM	0.021	0.014	0.050	0.035	0.012	—	—	—	0.086
Ours (no detection)	0.022	0.011	0.036	0.042	0.024	0.014	0.029	0.021	0.081
Ours	0.016	0.009	0.027	0.026	0.017	0.010	0.023	0.017	0.069

	ATE (m)				RPE (m/s)
	RF	MF	LC-CRF	Ours	RF	MF	LC-CRF	Ours
balloon	0.175	0.165	0.027	0.021	0.576	0.509	0.612	0.322
balloon2	0.254	0.114	0.024	0.022	0.540	0.499	0.541	0.439
balloon-tracking	0.302	0.194	0.025	0.019	1.031	0.991	0.965	0.856
balloon-tracking2	0.322	0.238	0.045	0.029	1.059	0.937	0.935	0.821
crowd	0.204	0.473	0.019	0.022	0.198	0.633	0.238	0.326
crowd2	0.155	0.653	0.031	0.030	0.315	0.854	0.199	0.231
crowd3	0.137	0.341	0.023	0.019	0.223	0.503	0.194	0.258
kidnapping-box	0.148	0.200	0.023	0.017	0.886	0.840	1.001	0.792
kidnapping-box2	0.161	0.182	0.020	0.031	1.077	1.027	1.184	0.943
moving-no-box	0.071	0.120	0.018	0.024	0.939	0.947	0.936	0.879
moving-no-box2	0.179	0.193	0.038	0.025	1.287	1.252	1.399	1.132
moving-o-box	0.343	0.216	0.253	0.195	1.274	0.847	1.158	0.977
moving-o-box2	0.528	0.298	0.341	0.285	1.523	0.576	1.143	0.866
person-tracking	0.289	0.301	0.035	0.034	1.209	1.312	1.193	1.023
person-tracking2	0.463	0.220	0.040	0.046	1.165	1.267	1.297	0.971
placing-no-box	0.106	0.325	0.014	0.019	0.355	0.598	0.333	0.422
placing-no-box2	0.141	0.153	0.016	0.025	0.282	0.330	0.271	0.215
placing-no-box3	0.174	0.156	0.036	0.029	0.511	0.491	0.482	0.422
placing-o-box	0.571	0.424	0.320	0.264	1.180	0.791	0.505	0.740
removing-no-box	0.041	0.058	0.013	0.025	0.262	0.263	0.240	0.216
Mean	0.238	0.251	0.068	0.058	0.795	0.773	0.741	0.643

	Localization	Detection (mAP)
YOLO	0.026/1.41	84.9%
YOLO v2	0.025/1.23	85.1%
YOLO v3	0.024/1.03	85.7%
YOLO v4	0.024/1.09	85.9%
YOLO v5	0.023/1.13	85.6%

	Chess	Fire	Heads	Office	Pumpkin	Kitchen	Stairs	Mean
Hand-crafted feature
Active search	0.04/1.96	0.03/1.53	0.02/1.45	0.09/3.61	0.08/3.10	0.07/3.37	0.03/2.22	0.051/2.46
Direct learning
PoseNet	0.32/8.12	0.47/14.4	0.29/12.0	0.48/7.68	0.47/8.42	0.59/8.64	0.47/13.8	0.44/10.44
Xue et al.	0.09/3.25	0.26/10.92	0.17/12.7	0.18/5.45	0.20/3.66	0.23/4.92	0.23/11.3	0.194/7.46
MapNet	0.08/3.25	0.27/11.7	0.18/13.3	0.17/5.15	0.22/4.02	0.23/4.93	0.30/12.1	0.207/7.78
Atloc	0.10/3.18	0.26/10.8	0.14/11.4	0.17/5.16	0.20/3.94	0.16/4.90	0.29/10.20	0.189/7.08
VLocNet	0.036/1.71	0.039/5.34	0.046/6.64	0.039/1.95	0.037/2.28	0.039/2.20	0.097/6.48	0.048/3.80
Indirect learning
DSAC++	0.02/0.5	0.02/0.9	0.01/0.8	0.03/0.7	0.04/1.1	0.04/1.1	0.09/2.6	0.036/1.0
DSAC^*	0.019/1.11	0.019/1.24	0.011/1.82	0.026/1.18	0.042/1.41	0.03/1.70	0.041/1.42	0.027/1.41
KFNet	0.018/0.65	0.023/0.90	0.014/0.82	0.025/0.69	0.037/1.02	0.038/1.16	0.033/0.94	0.027/0.88
InLoc	0.03/1.05	0.03/1.07	0.02/1.16	0.03/1.05	0.05/1.55	0.04/1.31	0.09/2.47	0.041/1.38
SANet	0.03/0.88	0.03/1.08	0.02/1.48	0.03/1.00	0.05/1.32	0.04/1.40	0.16/4.59	0.051/1.68
Tang et al.	0.02/0.71	0.02/0.85	0.01/0.85	0.03/0.84	0.04/1.16	0.04/1.17	0.05/1.33	0.03/0.99
Li et al.	0.02/0.7	0.02/0.9	0.01/0.9	0.03/0.8	0.04/1.0	0.04/1.2	0.03/0.8	0.03/0.9
Wang et al. (coarse stage)	0.021/0.85	0.028/1.42	0.028/1.45	0.043/1.37	0.045/1.17	0.042/1.65	0.048/1.21	0.036/1.30
Ours (no detection)	0.025/1.29	0.027/1.54	0.019/1.82	0.033/1.57	0.049/2.03	0.047/1.65	0.056/2.24	0.037/1.73
Ours (no fusion block)	0.016/0.87	0.018/1.22	0.013/1.42	0.024/1.06	0.039/1.32	0.036/1.01	0.037/1.44	0.026/1.19
Ours (no refinement)	0.017/0.91	0.020/1.38	0.014/1.67	0.029/1.46	0.045/1.73	0.041/1.45	0.046/1.79	0.030/1.48
Ours	0.014/0.76	0.018/1.05	0.012/1.17	0.021/0.91	0.036/1.11	0.035/0.95	0.032/1.25	0.024/1.03

7 Scenes
	Chess	Fire	Heads	Office	Pumpkin	Kitchen	Stairs	Mean
YOLO v3	87.3%	82.4%	86.1%	85.4%	71.2%	78.3%	74.7%	80.8%
Ours	91.2%	86.9%	89.4%	90.4%	74.1%	82.3%	78.2%	84.6%
TUM RGB-D
	fr3w_xyz	fr3w_static	fr3w_rpy	fr3w_half	fr3s_xyz	fr3s_static	fr3s_rpy	fr3s_half	Mean
YOLO v3	81.3%	83.2%	80.1%	81.1%	88.7%	87.6%	83.3%	85.1%	83.8%
Ours	85.4%	87.9%	84.2%	85.6%	91.1%	90.3%	87.4%	89.3%	87.7%
Our dataset
	Split 1	Split 2
YOLO v3	90.6%	86.0%
Ours (no fusion block)	91.9%	88.2%
Ours (no refinement)	91.1%	86.8%
Ours	92.3%	88.7%