Visual localization and object detection both play important roles in various tasks. In many indoor application scenarios where some detected objects have fixed positions, the two techniques work closely together. However, few researchers consider these two tasks simultaneously, because datasets are lacking and such environments have received little attention. In this paper, we explore multi-task network design and joint refinement of detection and localization. To address the dataset problem, we construct a dataset of a medium-sized indoor scene, an aviation exhibition hall, through a semi-automatic process. The dataset provides localization and detection information, and is publicly available at https://drive.google.com/drive/folders/1U28zkON4_I0dbzkqyIAKlAl5k9oUK0jI?usp=sharing for benchmarking localization and object detection tasks. For this dataset, we have designed a multi-task network, JLDNet, based on YOLO v3, that outputs a target point cloud and object bounding boxes. In dynamic environments, the detection branch also helps the network perceive dynamic content. JLDNet includes image feature learning, point feature learning, feature fusion, detection construction, and point cloud regression. Moreover, object-level bundle adjustment is used to further improve localization and detection accuracy. To test JLDNet and compare it to other methods, we have conducted experiments on 7 static scenes, our constructed dataset, and the dynamic TUM RGB-D and Bonn datasets. Our results show state-of-the-art accuracy for both tasks and demonstrate the benefit of addressing them jointly.
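The abstract does not state how a camera pose is recovered from the regressed target point cloud. One common option, assumed here purely for illustration and not taken from the paper, is least-squares rigid alignment (the Kabsch algorithm) between camera-frame points and their predicted scene-frame counterparts. The function name `rigid_align` and the synthetic data below are hypothetical.

```python
import numpy as np

def rigid_align(src, dst):
    """Least-squares rigid transform (R, t) such that dst ~= src @ R.T + t.

    Kabsch algorithm; src and dst are (N, 3) arrays of corresponding points.
    This is an illustrative assumption, not the method used by JLDNet.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    src_c = src.mean(axis=0)                  # centroid of source points
    dst_c = dst.mean(axis=0)                  # centroid of target points
    H = (src - src_c).T @ (dst - dst_c)       # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t

# Synthetic check: rotate and translate random points, then recover the pose.
rng = np.random.default_rng(0)
cam_points = rng.normal(size=(50, 3))         # hypothetical camera-frame points
angle = 0.3
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0,            0.0,           1.0]])
t_true = np.array([0.5, -1.0, 2.0])
scene_points = cam_points @ R_true.T + t_true  # stand-in for a regressed point cloud
R_est, t_est = rigid_align(cam_points, scene_points)
```

In practice such an alignment would usually be wrapped in a robust estimator such as RANSAC, since regressed points contain outliers; that step is omitted here for brevity.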
This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Other papers from this open access journal are available free of charge from http://www.springer.com/journal/41095. To submit a manuscript, please go to https://www.editorialmanager.com/cvmj.