This paper surveys the technology used in three-dimensional indoor scene geometry estimation from a single 360° omnidirectional image, which is pivotal for extracting 3D structural information from indoor environments. The technology transforms omnidirectional data into a 3D model that depicts spatial structure, object positions, and scene layout. Its significance spans various domains, including virtual reality (VR), augmented reality (AR), mixed reality (MR), game development, urban planning, and robot navigation. We begin by revisiting foundational concepts of omnidirectional imaging and detailing the problems, applications, and challenges in this field. Our review categorizes the fundamental tasks of structure recovery, depth estimation, and layout recovery. We also review pertinent datasets and evaluation metrics, providing the latest research as a reference. Finally, we summarize the field and discuss potential future trends to inform and guide further research.
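As a minimal illustration of the geometry underlying this task, the sketch below back-projects an equirectangular depth map to a 3D point cloud, assuming one viewing ray per pixel on the unit sphere; the function name and axis conventions are ours for illustration and are not taken from any surveyed method.

import numpy as np

def equirect_depth_to_points(depth):
    # depth: H x W array of metric depths, one sample per pixel
    h, w = depth.shape
    # pixel centers -> longitude in [-pi, pi), latitude in (pi/2, -pi/2)
    lon = (np.arange(w) + 0.5) / w * 2.0 * np.pi - np.pi
    lat = np.pi / 2.0 - (np.arange(h) + 0.5) / h * np.pi
    lon, lat = np.meshgrid(lon, lat)
    # unit viewing ray for every pixel (y up, z forward)
    rays = np.stack([np.cos(lat) * np.sin(lon),
                     np.sin(lat),
                     np.cos(lat) * np.cos(lon)], axis=-1)
    return rays * depth[..., None]  # scale each ray by its depth

points = equirect_depth_to_points(np.ones((256, 512)))  # toy depth map
print(points.shape)  # (256, 512, 3)

A constant depth map, as in this toy call, back-projects to points on a sphere; a real depth estimate yields the room geometry directly.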
3D morphable models (3DMMs) are generative models for face shape and appearance. Recent works impose face recognition constraints on 3DMM shape parameters so that the face shapes of the same person remain consistent. However, the shape parameters of traditional 3DMMs follow a multivariate Gaussian distribution, whereas identity embeddings lie on a hypersphere, and this conflict makes it challenging for face reconstruction models to preserve faithfulness and shape consistency simultaneously. In other words, the recognition loss and the reconstruction loss cannot decrease jointly because of their conflicting distributions. To address this issue, we propose the Sphere Face Model (SFM), a novel 3DMM for monocular face reconstruction that preserves both shape fidelity and identity consistency. The core of our SFM is the basis matrix used to reconstruct 3D face shapes; it is learned with a two-stage training approach in which 3D and 2D training data are used in the first and second stages, respectively. We design a novel loss to resolve the distribution mismatch, enforcing that the shape parameters follow a hyperspherical distribution. Our model accepts both 2D and 3D data for constructing the sphere face models. Extensive experiments show that SFM has high representation ability and clustering performance in its shape parameter space. Moreover, it produces high-fidelity face shapes consistently under challenging conditions in monocular face reconstruction. The code will be released at https://github.com/a686432/SIR
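To make the distribution mismatch concrete, here is a minimal sketch of the general idea of projecting shape parameters onto the unit hypersphere so they can be compared with identity embeddings by cosine similarity; this is our illustrative loss, not SFM's published formulation.

import numpy as np

def hypersphere_consistency_loss(shape_params, id_embeddings):
    # project both codes onto the unit hypersphere
    s = shape_params / np.linalg.norm(shape_params, axis=-1, keepdims=True)
    e = id_embeddings / np.linalg.norm(id_embeddings, axis=-1, keepdims=True)
    # cosine distance between matching (shape, identity) pairs
    return float(np.mean(1.0 - np.sum(s * e, axis=-1)))

rng = np.random.default_rng(0)
loss = hypersphere_consistency_loss(rng.normal(size=(4, 128)),
                                    rng.normal(size=(4, 128)))
print(loss)

Once both codes live on the same hypersphere, pulling matching pairs together no longer fights a Gaussian prior on the shape parameters, which is the conflict the abstract describes.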
In this paper, we introduce a very large Chinese text dataset in the wild. While optical character recognition (OCR) in document images is well studied and many commercial tools are available, the detection and recognition of text in natural images is still a challenging problem, especially for more complicated character sets such as Chinese text. Lack of training data has always been a problem, especially for deep learning methods, which require massive training data. In this paper, we provide details of a newly created dataset of Chinese text with about 1 million Chinese characters from 3,850 unique ones annotated by experts in over 30,000 street view images. This is a challenging dataset with good diversity, containing planar text, raised text, text under poor illumination, distant text, partially occluded text, etc. For each character, the annotation includes its underlying character, bounding box, and six attributes. The attributes indicate the character's background complexity, appearance, style, etc. Besides the dataset, we give baseline results using state-of-the-art methods for three tasks: character recognition (top-1 accuracy of 80.5%), character detection (AP of 70.9%), and text line detection (AED of 22.1). The dataset, source code, and trained models are publicly available.
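For reference, the AED metric reported above is, as we understand it, the edit (Levenshtein) distance between recognized and ground-truth text lines, averaged over the test set; the sketch below assumes predictions are already paired with their ground truths, which simplifies the full evaluation protocol.

def edit_distance(pred, gt):
    # standard Levenshtein dynamic program
    m, n = len(pred), len(gt)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == gt[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def average_edit_distance(preds, gts):
    return sum(edit_distance(p, g) for p, g in zip(preds, gts)) / len(gts)

print(average_edit_distance(["kitten"], ["sitting"]))  # 3.0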
We consider a face-to-face videoconferencing system that uses a Kinect camera at each end of the link for 3D modeling and an ordinary 2D display for output. The Kinect camera allows a 3D model of each participant to be transmitted; the (assumed static) background is sent separately. Furthermore, the Kinect tracks the receiver's head, allowing our system to render a view of the sender that depends on the receiver's viewpoint. The resulting motion parallax gives receivers a strong impression of 3D viewing as they move, yet the system only needs an ordinary 2D display. This is cheaper than a full 3D system and avoids disadvantages such as the need to wear shutter glasses or VR headsets, or to sit in the particular position required by an autostereo display. Perceptual studies show that users experience a greater sensation of depth with our system compared to a typical 2D videoconferencing system.
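A minimal sketch of the motion-parallax idea follows: the transmitted 3D model is re-projected through a pinhole camera placed at the receiver's tracked head position, so lateral head motion shifts nearer points more than farther ones. The focal length, image center, and function name are illustrative assumptions, not the paper's renderer.

import numpy as np

def render_viewpoint(points, head_pos, focal=500.0, center=(320.0, 240.0)):
    # place a virtual pinhole camera at the tracked head position
    rel = points - head_pos
    z = np.clip(rel[:, 2], 1e-6, None)  # guard against division by zero
    u = focal * rel[:, 0] / z + center[0]
    v = focal * rel[:, 1] / z + center[1]
    return np.stack([u, v], axis=-1)

pts = np.array([[0.0, 0.0, 1.0],   # near point
                [0.2, 0.0, 2.0]])  # far point
print(render_viewpoint(pts, np.array([0.0, 0.0, 0.0])))
print(render_viewpoint(pts, np.array([0.05, 0.0, 0.0])))  # head moves right

In the second call the near point shifts by 25 pixels and the far one by 12.5, which is exactly the depth-dependent displacement that produces the parallax depth cue on an ordinary 2D display.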
This paper considers panorama images used for street views. Their viewing angle of 360°