Research Article | Open Access

FilterGNN: Image feature matching with cascaded outlier filters and linear attention

Key Laboratory of Pervasive Computing, Ministry of Education, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
School of Computer Science and Informatics, Cardiff University, Wales CF24 4AG, UK

Abstract

Cross-view matching of local image features is a fundamental task in visual localization and 3D reconstruction. This study proposes FilterGNN, a transformer-based graph neural network (GNN) that improves the efficiency and accuracy of visual descriptor matching. Exploiting the high sparsity of correct matches and coarse-to-fine detection of covisible areas, FilterGNN uses cascaded optimal graph-matching filter modules to dynamically reject outlier matches. Moreover, we adapt linear attention in FilterGNN with post-instance normalization, which reduces the complexity of complete-graph learning from O(N²) to O(N). Experiments show that FilterGNN requires only 6% of the runtime and 33.3% of the memory of SuperGlue for large input sizes, while achieving competitive performance on tasks such as pose estimation, visual localization, and sparse 3D reconstruction.
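The abstract names two mechanisms: cascaded optimal graph-matching filters that reject outlier matches, and linear attention with post-instance normalization that brings complete-graph learning from O(N²) to O(N). The sketch below is a minimal, hypothetical PyTorch illustration of these two ideas, not the authors' released code; the module names, the elu+1 kernel map (following common linear-attention formulations such as ref. [25]), the Sinkhorn-style normalization (ref. [47]), and the keep_thresh parameter are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


def elu_feature_map(x):
    # Positive kernel feature map commonly used in linear attention (cf. ref. [25]).
    return torch.nn.functional.elu(x) + 1.0


class LinearAttention(nn.Module):
    """O(N) attention: summarize keys/values once, then read the summary per query."""

    def forward(self, q, k, v):
        # q, k, v: (batch, num_keypoints, heads, dim)
        q, k = elu_feature_map(q), elu_feature_map(k)
        kv = torch.einsum("bnhd,bnhe->bhde", k, v)            # key/value summary, O(N)
        z = 1.0 / (torch.einsum("bnhd,bhd->bnh", q, k.sum(dim=1)) + 1e-6)
        return torch.einsum("bnhd,bhde,bnh->bnhe", q, kv, z)  # per-query readout, O(N)


class PostInstanceNorm(nn.Module):
    """Instance normalization applied after attention, over the keypoint dimension."""

    def __init__(self, dim):
        super().__init__()
        self.norm = nn.InstanceNorm1d(dim, affine=True)

    def forward(self, x):
        # x: (batch, num_keypoints, dim); InstanceNorm1d expects (batch, dim, num_keypoints)
        return self.norm(x.transpose(1, 2)).transpose(1, 2)


def sinkhorn_filter(scores, num_iters=20, keep_thresh=0.2):
    """Optimal-matching style filter: normalize a pairwise score matrix with
    Sinkhorn iterations (cf. ref. [47]) and drop low-confidence correspondences."""
    # scores: (M, N) similarity between keypoints of two images
    log_p = scores.clone()
    for _ in range(num_iters):
        log_p = log_p - torch.logsumexp(log_p, dim=1, keepdim=True)  # row normalization
        log_p = log_p - torch.logsumexp(log_p, dim=0, keepdim=True)  # column normalization
    p = log_p.exp()
    keep = p.max(dim=1).values > keep_thresh  # reject keypoints unlikely to be covisible
    return p, keep
```

In this sketch the two einsum contractions scale linearly in the number of keypoints, which is where the claimed O(N²)-to-O(N) reduction would come from; the Sinkhorn-style filter acts on the pairwise score matrix and is shown only to illustrate outlier rejection, while how FilterGNN cascades such filters across layers is described in the paper itself.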

References

[1]
Schönberger, J. L.; Frahm, J. M. Structure-from-motion revisited. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4104–4113, 2016.
[2]
Mur-Artal, R.; Montiel, J. M. M.; Tardós, J. D. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Transactions on Robotics Vol. 31, No. 5, 1147–1163, 2015.
[3]
Huang, J.; Yang, S.; Zhao, Z.; Lai, Y. K.; Hu, S. M. ClusterSLAM: A SLAM backend for simultaneous rigid body clustering and motion estimation. Computational Visual Media Vol. 7, No. 1, 87–101, 2021.
[4]
Sarlin, P. E.; Cadena, C.; Siegwart, R.; Dymczyk, M. From coarse to fine: Robust hierarchical localization at large scale. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 12716–12725, 2019.
[5]
Dusmanu, M.; Rocco, I.; Pajdla, T.; Pollefeys, M.; Sivic, J.; Torii, A.; Sattler, T. D2-net: A trainable CNN for joint detection and description of local features. arXiv preprint arXiv:1905.03561, 2019.
[6]
DeTone, D.; Malisiewicz, T.; Rabinovich, A. SuperPoint: Self-supervised interest point detection and description. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 224–236, 2018.
[7]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In: Proceedings of the 31st Conference on Neural Information Processing Systems, 5998–6008, 2017.
[8]
Sarlin, P. E.; DeTone, D.; Malisiewicz, T.; Rabinovich, A. SuperGlue: Learning feature matching with graph neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4938–4947, 2020.
[9]
Chen, H.; Luo, Z.; Zhang, J.; Zhou, L.; Bai, X.; Hu, Z.; Tai, C. L.; Quan, L. Learning to match features with seeded graph matching network. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 6301–6310, 2021.
[10]
Shi, Y.; Cai, J. X.; Shavit, Y.; Mu, T. J.; Feng, W.; Zhang, K. ClusterGNN: Cluster-based coarse-to-fine graph neural network for efficient feature matching. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 12517–12526, 2022.
[11]
Sun, J.; Shen, Z.; Wang, Y.; Bao, H.; Zhou, X. LoFTR: Detector-free local feature matching with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8922–8931, 2021.
[12]
Suwanwimolkul, S.; Komorita, S. Efficient linear attention for fast and accurate keypoint matching. In: Proceedings of the International Conference on Multimedia Retrieval, 330–341, 2022.
[13]
Guo, M. H.; Xu, T. X.; Liu, J. J.; Liu, Z. N.; Jiang, P. T.; Mu, T. J.; Zhang, S. H.; Martin, R. R.; Cheng, M. M.; Hu, S. M. Attention mechanisms in computer vision: A survey. Computational Visual Media Vol. 8, No. 3, 331–368, 2022.
[14]
Thomee, B.; Elizalde, B.; Shamma, D. A.; Ni, K.; Friedland, G.; Poland, D.; Borth, D.; Li, L. J. YFCC100M: The new data in multimedia research. Communications of the ACM Vol. 59, No. 2, 64–73, 2016.
[15]
Balntas, V.; Lenc, K.; Vedaldi, A.; Mikolajczyk, K. HPatches: A benchmark and evaluation of handcrafted and learned local descriptors. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5173–5182, 2017.
[16]
Sattler, T.; Weyand, T.; Leibe, B.; Kobbelt, L. Image retrieval for image-based localization revisited. In: Proceedings of the British Machine Vision Conference, 2012.
[17]
Zhang, Z.; Sattler, T.; Scaramuzza, D. Reference pose generation for long-term visual localization via learned features and view synthesis. International Journal of Computer Vision Vol. 129, No. 4, 821–844, 2021.
[18]
Taira, H.; Okutomi, M.; Sattler, T.; Cimpoi, M.; Pollefeys, M.; Sivic, J.; Pajdla, T.; Torii, A. InLoc: Indoor visual localization with dense matching and view synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7199–7209, 2018.
[19]
Qiu, J.; Ma, H.; Levy, O.; Yih, S. W. T.; Wang, S.; Tang, J. Blockwise self-attention for long document understanding. arXiv preprint arXiv:1911.02972, 2019.
[20]
Parmar, N.; Vaswani, A.; Uszkoreit, J.; Kaiser, Ł.; Shazeer, N.; Ku, A.; Tran, D. Image transformer. arXiv preprint arXiv:1802.05751, 2018.
[21]
Kitaev, N.; Kaiser, Ł.; Levskaya, A. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451, 2020.
[22]
Wang, S.; Li, B. Z.; Khabsa, M.; Fang, H.; Ma, H.; Kolodiazhnyi, K. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020.
[23]
Choromanski, K.; Likhosherstov, V.; Dohan, D.; Song, X.; Gane, A.; Sarlos, T.; Hawkins, P.; Davis, J.; Mohiuddin, A.; Kaiser, L.; et al. Rethinking attention with performers. In: Proceedings of the International Conference on Learning Representations, 2021.
[24]
Guo, M. H.; Liu, Z. N.; Mu, T. J.; Hu, S. M. Beyond self-attention: External attention using two linear layers for visual tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence Vol. 45, No. 5, 5436–5447, 2023.
[25]
Katharopoulos, A.; Vyas, A.; Pappas, N.; Fleuret, F. Transformers are RNNs: Fast autoregressive transformers with linear attention. arXiv preprint arXiv:2006.16236, 2020.
[26]
Gu, Y.; Qin, X.; Peng, Y.; Li, L. Content-augmented feature pyramid network with light linear spatial transformers for object detection. IET Image Processing Vol. 16, No. 13, 3567–3578, 2022.
[27]
Ulyanov, D.; Vedaldi, A.; Lempitsky, V. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.
[28]
Ba, J. L.; Kiros, J. R.; Hinton, G. E. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[29]
Lowe, D. G. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision Vol. 60, No. 2, 91–110, 2004.
[30]
Rublee, E.; Rabaud, V.; Konolige, K.; Bradski, G. ORB: An efficient alternative to SIFT or SURF. In: Proceedings of the International Conference on Computer Vision, 2564–2571, 2011.
[31]
Luo, Z.; Zhou, L.; Bai, X.; Chen, H.; Zhang, J.; Yao, Y.; Li, S.; Fang, T.; Quan, L. ASLFeat: Learning local features of accurate shape and localization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6589–6598, 2020.
[32]
Yi, K. M.; Trulls, E.; Lepetit, V.; Fua, P. LIFT: Learned invariant feature transform. In: Computer Vision – ECCV 2016. Lecture Notes in Computer Science, Vol. 9910. Leibe, B.; Matas, J.; Sebe, N.; Welling, M. Eds. Springer Cham, 467–483, 2016.
[33]
Mishkin, D.; Radenović, F.; Matas, J. Repeatability is not enough: Learning affine regions via discriminability. In: Proceedings of the European Conference on Computer Vision, 284–300, 2018.
[34]
Devlin, J.; Chang, M. W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL-HLT, 4171–4186, 2019.
[35]
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[36]
Child, R.; Gray, S.; Radford, A.; Sutskever, I. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.
[37]
Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 10012–10022, 2021.
[38]
Roy, A.; Saffar, M.; Vaswani, A.; Grangier, D. Efficient content-based sparse attention with routing transformers. Transactions of the Association for Computational Linguistics Vol. 9, 53–68, 2021.
[39]
Shen, Z.; Zhang, M.; Zhao, H.; Yi, S.; Li, H. Efficient attention: Attention with linear complexities. In: Proceedings of the IEEE Winter Conference on Applications of Computer Vision, 3531–3539, 2021.
[40]
Rocco, I.; Cimpoi, M.; Arandjelović, R.; Torii, A.; Pajdla, T.; Sivic, J. Neighbourhood consensus networks. In: Proceedings of the 32nd Conference on Neural Information Processing Systems, 1658–1669, 2018.
[41]
Liu, Z.; Hu, H.; Lin, Y.; Yao, Z.; Xie, Z.; Wei, Y.; Ning, J.; Cao, Y.; Zhang, Z.; Dong, L.; et al. Swin transformer V2: Scaling up capacity and resolution. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 12009–12019, 2022.
[42]
Li, Z.; Snavely, N. MegaDepth: Learning single-view depth prediction from Internet photos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2041–2050, 2018.
[43]
Ono, Y.; Trulls, E.; Fua, P.; Yi, K. M. LF-Net: Learning local features from images. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, 6237–6247, 2018.
[44]
Schönberger, J. L.; Zheng, E.; Frahm, J. M.; Pollefeys, M. Pixelwise view selection for unstructured multi-view stereo. In: Computer Vision – ECCV 2016. Lecture Notes in Computer Science, Vol. 9907. Leibe, B.; Matas, J.; Sebe, N.; Welling, M. Eds. Springer Cham, 501–518, 2016.
[45]
Toft, C.; Maddern, W.; Torii, A.; Hammarstrand, L.; Stenborg, E.; Safari, D.; Okutomi, M.; Pollefeys, M.; Sivic, J.; Pajdla, T.; et al. Long-term visual localization revisited. IEEE Transactions on Pattern Analysis and Machine Intelligence Vol. 44, No. 4, 2074–2088, 2022.
[46]
Sakai, S.; Ito, K.; Aoki, T.; Watanabe, T.; Unten, H. Phase-based window matching with geometric correction for multi-view stereo. IEICE Transactions on Information and Systems Vol. E98.D, No. 10, 1818–1828, 2015.
[47]
Sinkhorn, R.; Knopp, P. Concerning nonnegative matrices and doubly stochastic matrices. Pacific Journal of Mathematics Vol. 21, No. 2, 343–348, 1967.
[48]
Guo, J.; Wang, H.; Cheng, Z.; Zhang, X.; Yan, D. M. Learning local shape descriptors for computing non-rigid dense correspondence. Computational Visual Media Vol. 6, No. 1, 95–112, 2020.

Computational Visual Media
Pages 873–884
Cite this article:
Cai J-X, Mu T-J, Lai Y-K. FilterGNN: Image feature matching with cascaded outlier filters and linear attention. Computational Visual Media, 2024, 10(5): 873-884. https://doi.org/10.1007/s41095-023-0363-3


Received: 26 February 2023
Accepted: 23 June 2023
Published: 07 October 2024
© The Author(s) 2024.

This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Other papers from this open access journal are available free of charge from http://www.springer.com/journal/41095. To submit a manuscript, please go to https://www.editorialmanager.com/cvmj.
