Cross-view matching of local image features is a fundamental task in visual localization and 3D reconstruction. This study proposes FilterGNN, a transformer-based graph neural network (GNN) that aims to improve the matching efficiency and accuracy of visual descriptors. Exploiting the high sparsity of correct matches and a coarse-to-fine detection of covisible areas, FilterGNN uses cascaded optimal graph-matching filter modules to dynamically reject outlier matches. Moreover, we adapt linear attention in FilterGNN with post-instance normalization support, which significantly reduces the complexity of complete graph learning from quadratic to linear in the number of local features.
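The following is a minimal sketch of linear attention followed by post-instance normalization, as mentioned above. The kernel feature map (elu(x) + 1), projection layout, and tensor shapes are assumptions for illustration, not details taken from the paper; the point is how associativity makes the attention cost linear in the number of local features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LinearAttention(nn.Module):
    """Linear attention with post-instance normalization (illustrative sketch)."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.out_norm = nn.InstanceNorm1d(dim)  # normalization applied after attention
        self.eps = eps

    def forward(self, x: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # x, context: (B, N, dim); self-attention when context is x, cross-attention otherwise.
        q = F.elu(self.q_proj(x)) + 1.0        # positive kernel feature map (assumed)
        k = F.elu(self.k_proj(context)) + 1.0
        v = self.v_proj(context)
        # Compute (K^T V) first: O(N * dim^2) instead of the O(N^2 * dim) of softmax attention.
        kv = torch.einsum("bnd,bne->bde", k, v)
        z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + self.eps)
        out = torch.einsum("bnd,bde,bn->bne", q, kv, z)
        # InstanceNorm1d expects (B, C, N), so normalize over channels per instance and transpose back.
        return self.out_norm(out.transpose(1, 2)).transpose(1, 2)
```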


Sparse-view 3D reconstruction has attracted increasing attention with the development of neural implicit 3D representations. Existing methods usually rely only on 2D views and therefore require a dense set of input views for accurate 3D reconstruction. In this paper, we show that accurate 3D reconstruction can be achieved by incorporating geometric priors into neural implicit 3D reconstruction. Our method adopts the signed distance function as the 3D representation and learns a generalizable 3D surface reconstruction model from sparse views. Specifically, we build a sparser and more effective feature volume from the input views by using the corresponding depth maps, which can be provided by depth sensors or predicted directly from the input views. We recover better geometric details by imposing both depth and surface normal constraints in addition to the color loss when training the neural implicit 3D representation. Experiments demonstrate that our method outperforms state-of-the-art approaches and generalizes well.
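Below is a minimal sketch of the kind of training objective described above: a rendered-color loss combined with depth and surface-normal constraints. The L1/cosine forms, the validity mask, and the loss weights are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def reconstruction_loss(pred_rgb, gt_rgb,
                        pred_depth, gt_depth,
                        pred_normal, gt_normal,
                        w_depth: float = 1.0, w_normal: float = 0.1):
    """Color + depth + normal supervision for a neural implicit SDF model (illustrative sketch)."""
    # Photometric loss between volume-rendered and observed colors.
    color_loss = F.l1_loss(pred_rgb, gt_rgb)
    # Depth constraint: supervise rendered depth with sensor-provided or predicted depth,
    # ignoring rays without a valid depth value (assumed convention: invalid depth <= 0).
    valid = gt_depth > 0
    depth_loss = F.l1_loss(pred_depth[valid], gt_depth[valid])
    # Normal constraint: align normals derived from the SDF gradient with reference normals
    # (e.g. estimated from the depth maps).
    normal_loss = (1.0 - F.cosine_similarity(pred_normal, gt_normal, dim=-1)).mean()
    return color_loss + w_depth * depth_loss + w_normal * normal_loss
```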