In this paper, we consider salient instance segmentation. As well as producing bounding boxes, our network also outputs high-quality instance-level segments as initial selections to indicate the regions of interest. Taking into account the category-independent property of each target, we design a single stage salient instance segmentation framework, with a novel segmentation branch. Our new branch regards not only local context inside each detection window but also the surrounding context, enabling us to distinguish instances in the same scope even with partial occlusion. Our network is end-to-end trainable and is fast (running at 40 fps for images with resolution
- Article type
- Year
- Co-author
In this paper, we propose a simple but effective framework for lane boundary detection, called SpinNet. Considering that cars or pedestrians often occlude lane boundaries and that the local features of lane boundaries are not distinctive, therefore, analyzing and collecting global context information is crucial for lane boundary detection. To this end, we design a novel spinning convolution layer and a brand-new lane parameterization branch in our network to detect lane boundaries from a global perspective. To extract features in narrow strip-shaped fields, we adopt strip-shaped convolutions with kernels which have
In this paper, we introduce a very large Chinese text dataset in the wild. While optical character recognition (OCR) in document images is well studied and many commercial tools are available, the detection and recognition of text in natural images is still a challenging problem, especially for some more complicated character sets such as Chinese text. Lack of training data has always been a problem, especially for deep learning methods which require massive training data. In this paper, we provide details of a newly created dataset of Chinese text with about 1 million Chinese characters from 3 850 unique ones annotated by experts in over 30 000 street view images. This is a challenging dataset with good diversity containing planar text, raised text, text under poor illumination, distant text, partially occluded text, etc. For each character, the annotation includes its underlying character, bounding box, and six attributes. The attributes indicate the character’s background complexity, appearance, style, etc. Besides the dataset, we give baseline results using state-of-the-art methods for three tasks: character recognition (top-1 accuracy of 80.5%), character detection (AP of 70.9%), and text line detection (AED of 22.1). The dataset, source code, and trained models are publicly available.
We consider the problem of learning a representation of both spatial relations and dependencies between objects for indoor scene design. We propose a novel knowledge graph framework based on the entity-relation model for representation of facts in indoor scene design, and further develop a weakly-supervised algorithm for extracting the knowledge graph representation from a small dataset using both structure and parameter learning. The proposed framework is flexible, transferable, and readable. We present a variety of computer-aided indoor scene design applications using this representation, to show the usefulness and robustness of the proposed framework.
Current image-editing tools do not match up to the demands of personalized image manipulation, one application of which is changing clothes in user-captured images. Previous work can change single color clothes using parametric human warping methods. In this paper, we propose an image-based clothes changing system, exploiting body factor extraction and content-aware image warping. Image segmentation and mask generation are first applied to the user input. Afterwards, we determine joint positions via a neural network. Then, body shape matching is performed and the shape of the model is warped to the user’s shape. Finally, head swapping is performed to produce realistic virtual results. We also provide a supervision and labeling tool for refinement and further assistance when creating a dataset.
Multiview video can provide more immersive perception than traditional single 2-D video. It enables both interactive free navigation applications as well as high-end autostereoscopic displays on which multiple users can perceive genuine 3-D content without glasses. The multiview format also comprises much more visual information than classical 2-D or stereo 3-D content, which makes it possible to perform various interesting editing operations both on pixel-level and object-level. This survey provides a comprehensive review of existing multiview video synthesis and editing algorithms and applications. For each topic, the related technologies in classical 2-D image and video processing are reviewed. We then continue to the discussion of recent advanced techniques for multiview video virtual view synthesis and various interactive editing applications. Due to the ongoing progress on multiview video synthesis and editing, we can foresee more and more immersive 3-D video applications will appear in the future.