The cross-view matching of local image features is a fundamental task in visual localization and 3D reconstruction. This study proposes FilterGNN, a transformer-based graph neural network (GNN) that aims to improve both the efficiency and the accuracy of visual descriptor matching. Exploiting the high sparsity of matches and coarse-to-fine covisible area detection, FilterGNN uses cascaded optimal graph-matching filter modules to dynamically reject outlier matches. Moreover, we adapt linear attention in FilterGNN with post-instance normalization support, which significantly reduces the complexity of complete graph learning from