| Sign up

PDF (8 MB)

Cite

EndNote(RIS) BibTeX

Collect

Collect

Submit Manuscript

Research Article | Open Access

FilterGNN: Image feature matching with cascaded outlier filters and linear attention

Jun-Xiong Cai^¹, Tai-Jiang Mu^¹(), Yu-Kun Lai^²

1

Key Laboratory of Pervasive Computing, Ministry of Education, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China

2

School of Computer Science and Informatics, Cardiff University, Wales CF24 4AG, UK

Show Author Information

Graphical Abstract

View original image Download original image

Abstract

The cross-view matching of local image features is a fundamental task in visual localization and 3D reconstruction. This study proposes FilterGNN, a transformer-based graph neural network (GNN), aiming to improve the matching efficiency and accuracy of visual descriptors. Based on high matching sparseness and coarse-to-fine covisible area detection, FilterGNN utilizes cascaded optimal graph-matching filter modules to dynamically reject outlier matches. Moreover, we successfully adapted linear attention in FilterGNN with post-instance normalization support, which significantly reduces the complexity of complete graph learning from $O (N^{2})$ to $O (N)$ . Experiments show that FilterGNN requires only 6% of the time cost and 33.3% of the memory cost compared with SuperGlue under a large-scale input size and achieves a competitive performance in various tasks, such as pose estimation, visual localization, and sparse 3D reconstruction.

Keywords

image matching transformer linear attention visual localization sparse reconstruction

References

[1]

Schonberger, J. L.; Frahm, J. M. Structure-from-motion revisited. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4104–4113, 2016.

[2]

Mur-Artal, R.; Montiel, J. M. M.; Tardós, J. D. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Transactions on Robotics Vol. 31, No. 5, 1147–1163, 2015.

Crossref Google Scholar

[3]

Huang, J.; Yang, S.; Zhao, Z.; Lai, Y. K.; Hu, S. M. ClusterSLAM: A SLAM backend for simultaneous rigid body clustering and motion estimation. Computational Visual Media Vol. 7, No. 1, 87–101, 2021.

Crossref Google Scholar

[4]

Sarlin, P. E.; Cadena, C.; Siegwart, R.; Dymczyk, M. From coarse to fine: Robust hierarchical localization at large scale. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 12716–12725, 2019.

[5]

Dusmanu, M.; Rocco, I.; Pajdla, T.; Pollefeys, M.; Sivic, J.; Torii, A.; Sattler, T. D2-net: A trainable CNN for joint detection and description of local features. arXiv preprint arXiv:1905.03561, 2019.

[6]

DeTone, D.; Malisiewicz, T.; Rabinovich, A. SuperPoint: Self-supervised interest point detection and description. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 224–236, 2018.

[7]

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In: Proceedings of the 31st Conference on Neural Information Processing Systems, 5998–6008, 2017.

[8]

Sarlin, P. E.; DeTone, D.; Malisiewicz, T.; Rabinovich, A. SuperGlue: Learning feature matching with graph neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4938–4947, 2020.

[9]

Chen, H.; Luo, Z.; Zhang, J.; Zhou, L.; Bai, X.; Hu, Z.; Tai, C. L.; Quan, L. Learning to match features with seeded graph matching network. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 6301–6310, 2021.

[10]

Shi, Y.; Cai, J. X.; Shavit, Y.; Mu, T. J.; Feng, W.; Zhang, K. ClusterGNN: Cluster-based coarse-to-fine graph neural network for efficient feature matching. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 12517–12526, 2022.

[11]

Sun, J.; Shen, Z.; Wang, Y.; Bao, H.; Zhou, X. LoFTR: Detector-free local feature matching with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8922–8931, 2021.

[12]

Suwanwimolkul, S.; Komorita, S. Efficient linear attention for fast and accurate keypoint matching. In: Proceedings of the International Conference on Multimedia Retrieval, 330–341, 2022.

[13]

Guo, M. H.; Xu, T. X.; Liu, J. J.; Liu, Z. N.; Jiang, P. T.; Mu, T. J.; Zhang, S. H.; Martin, R. R.; Cheng, M. M.; Hu, S. M. Attention mechanisms in computer vision: A survey. Computational Visual Media Vol. 8, No. 3, 331–368, 2022.

Crossref Google Scholar

[14]

Thomee, B.; Elizalde, B.; Shamma, D. A.; Ni, K.; Friedland, G.; Poland, D.; Borth, D.; Li, A. L. J. YFCC100M: The new data in multimedia research. Communications of the ACM Vol. 59, No. 2, 64–73, 2016.

Crossref Google Scholar

[15]

Balntas, V.; Lenc, K.; Vedaldi, A.; Mikolajczyk, K. HPatches: A benchmark and evaluation of handcrafted and learned local descriptors. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5173–5182, 2017.

[16]

Sattler, T.; Weyand, T.; Leibe, B.; Kobbelt, L. Image retrieval for image-based localization revisited. In: Proceedings of the British Machine Vision Conference, 2012.

[17]

Zhang, Z.; Sattler, T.; Scaramuzza, D. Reference pose generation for long-term visual localization via learned features and view synthesis. International Journal of Computer Vision Vol. 129, No. 4, 821–844, 2021.

Crossref Google Scholar

[18]

Taira, H.; Okutomi, M.; Sattler, T.; Cimpoi, M.; Pollefeys, M.; Sivic, J.; Pajdla, T.; Torii, A. InLoc: Indoor visual localization with dense matching and view synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7199–7209, 2018.

[19]

Qiu, J.; Ma, H.; Levy, O.; Yih, S. W. T.; Wang, S.; Tang, J. Blockwise self-attention for long document understanding. arXiv preprint arXiv:1911.02972, 2019.

[20]

Parmar, N.; Vaswani, A.; Uszkoreit, J.; Kaiser, Ł.; Shazeer, N.; Ku, A.; Tran, D. Image transformer. arXiv preprint arXiv:1802.05751, 2018.

[21]

Kitaev, N.; Kaiser, Ł.; Levskaya, A. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451, 2020.

[22]

Wang, S.; Li, B. Z.; Khabsa, M.; Fang, H.; Ma, H.; Kolodiazhnyi, K. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020.

[23]

Choromanski, K.; Likhosherstov, V.; Dohan. D.; Song, X.; Gane, A.; Sarlos, T.; Hawkins, P.; Davis, J.; Mohiuddin, A.; Kaiser, L.; et al. Rethinking attention with performers. In: Proceedings of the International Conference on Learning Representations, 2021.

[24]

Guo, M. H.; Liu, Z. N.; Mu, T. J.; Hu, S. M. Beyond self-attention: External attention using two linear layers for visual tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence Vol. 45, No. 5, 5436–5447, 2023.

[25]

Katharopoulos, A.; Vyas, A.; Pappas, N.; Fleuret, F. Transformers are RNNs: Fast autoregressive transformers with linear attention. arXiv preprint arXiv:2006.16236, 2020.

[26]

Gu, Y.; Qin, X.; Peng, Y.; Li, L. Content-augmented feature pyramid network with light linear spatial transformers for object detection. IET Image Processing Vol. 16, No. 13, 3567–3578, 2022.

Crossref Google Scholar

[27]

Ulyanov, D.; Vedaldi, A.; Lempitsky, V. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.

[28]

Ba, J. L.; Kiros, J. R.; Hinton, G. E. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

[29]

Lowe, D. G. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision Vol. 60, No. 2, 91–110, 2004.

Crossref Google Scholar

[30]

Rublee, E.; Rabaud, V.; Konolige, K.; Bradski, G. ORB: An efficient alternative to SIFT or SURF. In: Proceedings of the International Conference on Computer Vision, 2564–2571, 2011.

[31]

Luo, Z.; Zhou, L.; Bai, X.; Chen, H.; Zhang, J.; Yao, Y.; Li, S.; Fang, T.; Quan, L. ASLFeat: Learning local features of accurate shape and localization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6589–6598, 2020.

[32]

Yi, K. M.; Trulls, E.; Lepetit, V.; Fua, P. LIFT: Learned invariant feature transform. In: Computer Vision – ECCV 2016. Lecture Notes in Computer Science, Vol. 9910. Leibe, B.; Matas, J.; Sebe, N.; Welling, M. Eds. Springer Cham, 467–483, 2016.

[33]

Mishkin, D.; Radenović, F.; Matas, J. Repeatability is not enough: Learning affine regions via discriminability. In: Proceedings of the European Conference on Computer Vision, 284–300, 2018.

[34]

Devlin, J.; Chang, M. W.; Lee, K.; Toutanova, K.; Hulburd, E.; Liu, D.; Wang, M.; Catlin, A. G.; Lei, M.; Zhang, J.; et al. BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the NAACL-HLT, 4171–4186, 2018.

[35]

Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

[36]

Child, R.; Gray, S.; Radford, A.; Sutskever, I. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.

[37]

Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 10012–10022, 2021.

[38]

Roy, A.; Saffar, M.; Vaswani, A.; Grangier, D. Efficient content-based sparse attention with routing transformers. Transactions of the Association for Computational Linguistics Vol. 9, 53–68, 2021.

Crossref Google Scholar

[39]

Shen, Z.; Zhang, M.; Zhao, H.; Yi, S.; Li, H. Efficient attention: Attention with linear complexities. In: Proceedings of the IEEE Winter Conference on Applications of Computer Vision, 3531–3539, 2021.

[40]

Rocco, I.; Cimpoi, M.; Arandjelović, R.; Torii, A.; Pajdla, T.; Sivic, J. Neighbourhood consensus networks. In: Proceedings of the 32nd Conference on Neural Information Processing Systems, 1658–1669, 2021.

[41]

Liu, Z.; Hu, H.; Lin, Y.; Yao, Z.; Xie, Z.; Wei, Y.; Ning, J.; Cao, Y.; Zhang, Z.; Dong, L.; et al. Swin transformer V2: Scaling up capacity and resolution. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 12009–12019, 2022.

[42]

Li, Z.; Snavely, N. MegaDepth: Learning single-view depth prediction from Internet photos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2041–2050, 2018.

[43]

Ono, Y.; Trulls, E.; Fua, P.; Yi, K. M. LF-Net: Learning local features from images. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, 6237–6247, 2018.

[44]

Schönberger, J. L.; Zheng, E.; Frahm, J. M.; Pollefeys, M. Pixelwise view selection for unstructured multi-view stereo. In: Computer Vision – ECCV 2016. Lecture Notes in Computer Science, Vol. 9907. Leibe, B.; Matas, J.; Sebe, N.; Welling, M. Eds. Springer Cham, 501–518, 2016.

[45]

Toft, C.; Maddern, W.; Torii, A.; Hammarstrand, L.; Stenborg, E.; Safari, D.; Okutomi, M.; Pollefeys, M.; Sivic, J.; Pajdla, T.; et al. Long-term visual localization revisited. IEEE Transactions on Pattern Analysis and Machine Intelligence Vol. 44, No. 4, 2074–2088, 2022.

Crossref Google Scholar

[46]

Sakai, S.; Ito, K.; Aoki, T.; Watanabe, T.; Unten, H. Phase-based window matching with geometric correction for multi-view stereo. IEICE Transactions on Information and Systems Vol. E98.D, No. 10, 1818–1828, 2015.

Crossref Google Scholar

[47]

Sinkhorn, R.; Knopp, P. Concerning nonnegative matrices and doubly stochastic matrices. Pacific Journal of Mathematics Vol. 21, No. 2, 343–348, 1967.

Crossref Google Scholar

[48]

Guo, J.; Wang, H.; Cheng, Z.; Zhang, X.; Yan, D. M. Learning local shape descriptors for computing non-rigid dense correspondence. Computational Visual Media Vol. 6, No. 1, 95–112, 2020.

Crossref Google Scholar

Computational Visual Media

Volume 10 Issue 5,
October 2024

Pages 873-884

DOI: 10.1007/s41095-023-0363-3

Cite this article:

Cai J-X, Mu T-J, Lai Y-K. FilterGNN: Image feature matching with cascaded outlier filters and linear attention. Computational Visual Media, 2024, 10(5): 873-884. https://doi.org/10.1007/s41095-023-0363-3

About Us

Learn about Open Access

Tsinghua University Press

Publish with Us

Peer Review Policy

Copyright and Licensing

Article Processing Charge

Contact Us

Journal Collaboration: Yao Meng (Ms.)✉️ +86-10-83470574

Technical Support: Kuo Zhao (Mr.)✉️ +86-10-83470507

Media Contact: Hao Jin (Mr.)✉️ +86-10-83470559

Address: Floor 6, Tower B, Xueyan Building, Shuangqing Road, Haidian District, Beijing 100084, China.

SciOpen——中国科技期刊卓越行动计划支持项目

Copyright © 2025 Tsinghua University Press Ltd.

京ICP备 10035462号-42 京公网安备11010802044758号