In recent years, the accuracy of edge detection on several benchmarks has been significantly improved by deep learning based methods. However, the predictions of deep neural networks are usually blurry and need further post-processing, including non-maximum suppression and morphological thinning. In this paper, we demonstrate that the blurry effect arises from the binary cross-entropy loss, and that crisp edges can be obtained directly from deep convolutional neural networks. We propose to learn edge maps as a representation of local contrast with a novel local contrast loss. The local contrast is optimized in a stochastic way so that each update focuses on specific edge directions. Experiments show that the edge detection network trained with the local contrast loss achieves accuracy comparable to previous methods while dramatically improving crispness. We also present several applications of crisp edges, including image completion, image retrieval, sketch generation, and video stylization.
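The core idea of optimizing local contrast stochastically can be illustrated with a minimal sketch. This is not the paper's implementation: the set of sampled directions, the squared-error penalty, and all function names here are assumptions made for illustration only.

```python
import numpy as np

def local_contrast(edge_map, direction):
    """Signed local contrast of a 2-D edge map along one axis direction.

    direction: (dy, dx) pixel offset, e.g. (0, 1) for horizontal contrast.
    Returns the difference between each pixel and its shifted neighbour.
    """
    dy, dx = direction
    shifted = np.roll(edge_map, shift=(dy, dx), axis=(0, 1))
    return edge_map - shifted

def stochastic_contrast_loss(pred, target, rng):
    """Hypothetical stochastic local-contrast loss: sample one direction
    per call and penalise the squared mismatch between the predicted and
    ground-truth contrast along that direction."""
    directions = [(0, 1), (1, 0), (1, 1), (1, -1)]
    d = directions[rng.integers(len(directions))]
    diff = local_contrast(pred, d) - local_contrast(target, d)
    return float(np.mean(diff ** 2))
```

Because each step supervises contrast along a single sampled direction rather than per-pixel probabilities, the loss rewards sharp transitions instead of averaging them into blur.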
This paper presents a novel deep neural network, VideoInNet, for designated point tracking (DPT) in a monocular RGB video. More concretely, the aim is to track four designated points correlated by a local homography on a textureless planar region in the scene. DPT can be applied to augmented reality and video editing, especially in video advertising. Existing methods predict the locations of the four designated points without appropriately considering their correlation. To solve this problem, VideoInNet predicts the motion of the four designated points, correlated by a local homography, within a heatmap prediction framework. Our network refines the heatmaps of the designated points through two stages. In the first stage, we introduce a context-aware and location-aware structure to learn a local homography for the designated plane in a supervised way. In the second stage, we introduce an iterative heatmap refinement module to improve the tracking accuracy. We propose a dataset focusing on textureless planar regions, named ScanDPT, for training and evaluation. We show that the error rate of VideoInNet is about 29% lower than that of the state-of-the-art approach when evaluated on the first 120 frames of the test videos in ScanDPT.
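The correlation constraint at the heart of DPT is that all four designated points move jointly under one planar homography. A minimal sketch of that constraint, assuming standard homogeneous-coordinate warping (this is generic projective geometry, not the network itself):

```python
import numpy as np

def apply_homography(H, pts):
    """Warp an (N, 2) array of points by a 3x3 homography H.

    Points are lifted to homogeneous coordinates, multiplied by H,
    and divided by the resulting third coordinate.
    """
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])  # (N, 3)
    warped = pts_h @ H.T
    return warped[:, :2] / warped[:, 2:3]

# The four designated corners of a planar region: one shared H
# determines all four new positions, so they cannot drift independently.
corners = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
```

Predicting a single H per frame (as opposed to four independent point offsets) is what ties the heatmap predictions together.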
We propose a novel end-to-end deep learning framework, the Joint Matting Network (JMNet), to automatically generate alpha mattes for human images. We utilize the intrinsic structure of the human body in images by introducing a pose estimation module, which provides both global structural guidance and local attention focus for the matting task. Our model includes a pose network, a trimap network, a matting network, and a shared encoder that extracts features for all three networks. We also append a trimap refinement module and utilize a gradient loss to produce sharper alpha mattes. Extensive experiments show that our method outperforms state-of-the-art human matting techniques, and that the shared encoder leads to better performance and lower memory cost. Our model can process real images downloaded from the Internet for use in composition applications.
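The relationship between a trimap and an alpha matte can be sketched with a toy function that derives a trimap (background / unknown / foreground) from an alpha matte. This is an illustrative assumption, not JMNet's trimap network: the thresholds, the band width, and the naive box erosion below are all made up for the sketch.

```python
import numpy as np

def alpha_to_trimap(alpha, band=2, fg_thresh=0.95, bg_thresh=0.05):
    """Derive a trimap (0 = background, 128 = unknown, 255 = foreground)
    from an alpha matte, keeping a `band`-pixel unknown region around
    any partial-opacity transition."""
    trimap = np.full(alpha.shape, 128, dtype=np.uint8)
    fg = alpha >= fg_thresh
    bg = alpha <= bg_thresh

    def erode(mask, r):
        # Naive box erosion: a pixel stays set only if every neighbour
        # within radius r is set (wrap-around at borders, fine for a toy).
        out = mask.copy()
        for dy in range(-r, r + 1):
            for dx in range(-r, r + 1):
                out &= np.roll(mask, (dy, dx), axis=(0, 1))
        return out

    trimap[erode(fg, band)] = 255
    trimap[erode(bg, band)] = 0
    return trimap
```

In a matting pipeline the trimap restricts where alpha must actually be estimated; the "unknown" band is where the matting head does its work, which is why a dedicated trimap (and refinement) stage helps.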