Research Article | Open Access

Learning accurate template matching with differentiable coarse-to-fine correspondence refinement

College of Computer, National University of Defense Technology, Changsha 410073, China

Abstract

Template matching is a fundamental task in computer vision and has been studied for decades. It plays an essential role in the manufacturing industry for estimating the poses of different parts, facilitating downstream tasks such as robotic grasping. Existing methods fail when the template and source images have different modalities, cluttered backgrounds, or weak textures. They also rarely consider geometric transformations via homographies, which commonly exist even for planar industrial parts. To tackle these challenges, we propose an accurate template matching method based on differentiable coarse-to-fine correspondence refinement. We use an edge-aware module to overcome the domain gap between the mask template and the grayscale image, allowing robust matching. An initial warp is estimated using coarse correspondences based on novel structure-aware information provided by transformers. This initial alignment is passed to a refinement network that uses the reference and aligned images to obtain sub-pixel correspondences, which in turn yield the final geometric transformation. Extensive evaluation shows our method to be significantly better than state-of-the-art methods and baselines, with good generalization ability and visually plausible results even on unseen real data.
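As a rough illustration of the coarse-to-fine pipeline the abstract describes, the sketch below estimates an initial homography from coarse correspondences, warps the template with it, and then refines the estimate from correspondences on the aligned pair. This is a minimal sketch, not the authors' differentiable implementation: it substitutes classical RANSAC homography fitting for their learned estimation, and `coarse_matcher` / `fine_matcher` are hypothetical placeholders for the paper's transformer-based matching modules.

```python
# Minimal sketch of coarse-to-fine warp estimation (assumption-laden:
# classical OpenCV fitting stands in for the paper's differentiable
# estimation; the matcher callables are hypothetical placeholders).
import cv2

def coarse_to_fine_homography(template, image, coarse_matcher, fine_matcher):
    # Stage 1: coarse correspondences give an initial warp H0.
    src, dst = coarse_matcher(template, image)  # each an (N, 2) float array
    H0, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)

    # Stage 2: align the template with H0, then obtain finer
    # correspondences between the aligned template and the image.
    h, w = image.shape[:2]
    aligned = cv2.warpPerspective(template, H0, (w, h))
    src_ref, dst_ref = fine_matcher(aligned, image)

    # The refinement homography H1 maps the aligned template to the image,
    # so the final template-to-image transform composes as H1 @ H0.
    H1, _ = cv2.findHomography(src_ref, dst_ref, cv2.RANSAC, 3.0)
    return H1 @ H0
```

In the paper's setting the matching and estimation stages are differentiable end-to-end; the composition of an initial warp with a refinement warp, however, follows the same structure as above.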

Computational Visual Media
Pages 309-330
Cite this article:
Gao Z, Yi R, Qin Z, et al. Learning accurate template matching with differentiable coarse-to-fine correspondence refinement. Computational Visual Media, 2024, 10(2): 309-330. https://doi.org/10.1007/s41095-023-0333-9

Received: 17 October 2022
Accepted: 02 January 2023
Published: 03 January 2024
© The Author(s) 2023.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Other papers from this open access journal are available free of charge from http://www.springer.com/journal/41095. To submit a manuscript, please go to https://www.editorialmanager.com/cvmj.
