It is challenging to track a target continuously in videos with long-term occlusion, or with objects that leave and then re-enter a scene. Existing tracking algorithms combined with online-trained object detectors perform unreliably in complex conditions, and can only provide discontinuous trajectories with jumps in position when the object is occluded. This paper proposes a novel tracking-by-detection framework that uses selection and completion to solve these problems. It has two components: tracking and trajectory completion. During tracking, an offline-trained object detector, based on a highly accurate deep learning model, localizes objects in the same category as the object being tracked. An object selector then determines which of these objects should be used to re-initialize a traditional tracker. Because the object selector is trained online, the framework remains adaptable. During completion, a predictive non-linear autoregressive neural network completes any discontinuous trajectory. The tracking component is an online real-time algorithm, and the completion component is an after-the-event mechanism. Quantitative experiments show a significant improvement in robustness over prior state-of-the-art methods.
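The control flow of the tracking component can be pictured with a minimal sketch. It is not the authors' implementation: the function name `track_with_selection`, the callables `track`, `detect` and `select`, the confidence score, and the `min_confidence` threshold are all hypothetical stand-ins, and the rule for deciding when the tracker has failed is an assumption, since the abstract does not state it.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height)


def track_with_selection(
    frames: Sequence[object],
    init_box: Box,
    track: Callable[[object, Box], Tuple[Box, float]],
    detect: Callable[[object], List[Box]],
    select: Callable[[object, List[Box]], Optional[Box]],
    min_confidence: float = 0.5,  # assumed failure threshold, not from the paper
) -> List[Optional[Box]]:
    """Return one box per frame, or None where the target is not found."""
    trajectory: List[Optional[Box]] = []
    current: Optional[Box] = init_box
    for frame in frames:
        confidence = 0.0
        if current is not None:
            # Traditional tracker: update the box and report a confidence.
            current, confidence = track(frame, current)
        if current is None or confidence < min_confidence:
            # Tracker lost the target: the detector proposes same-category
            # objects and the selector picks one (or none) to re-initialize.
            current = select(frame, detect(frame))
        trajectory.append(current)
    return trajectory
```

Frames where the selector returns `None` leave gaps in the returned trajectory; in the framework described above, such gaps are what the after-the-event completion component would fill.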
We consider a face-to-face videoconferencing system that uses a Kinect camera at each end of the link for 3D modeling and an ordinary 2D display for output. The Kinect camera allows a 3D model of each participant to be transmitted; the (assumed static) background is sent separately. Furthermore, the Kinect tracks the receiver’s head, allowing our system to render a view of the sender that depends on the receiver’s viewpoint. The resulting motion parallax gives receivers a strong impression of 3D viewing as they move, yet the system only needs an ordinary 2D display. This is cheaper than a full 3D system, and avoids disadvantages such as the need to wear shutter glasses or a VR headset, or to sit in the particular position required by an autostereo display. Perceptual studies show that users experience a greater sensation of depth with our system than with a typical 2D videoconferencing system.
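One common way to obtain viewpoint-dependent rendering of this kind is an off-axis (asymmetric) perspective frustum driven by the tracked head position. The sketch below illustrates that general technique only; the abstract does not say how the system renders its views. The function name `off_axis_frustum`, the coordinate convention (screen centred at the origin in the plane z = 0, head at positive z, all lengths in the same unit), and the parameters are assumptions.

```python
from typing import Tuple


def off_axis_frustum(
    head: Tuple[float, float, float],
    screen_width: float,
    screen_height: float,
    near: float,
) -> Tuple[float, float, float, float]:
    """Return (left, right, bottom, top) of the frustum at the near plane.

    Assumed setup: the screen is centred at the origin in the plane z = 0
    and the viewer's head is at (ex, ey, ez) with ez > 0.
    """
    ex, ey, ez = head
    if ez <= 0:
        raise ValueError("head must be in front of the screen (ez > 0)")
    # Similar triangles: project the screen edges, as seen from the head,
    # onto a near plane at distance `near` from the eye.
    scale = near / ez
    left = (-screen_width / 2 - ex) * scale
    right = (screen_width / 2 - ex) * scale
    bottom = (-screen_height / 2 - ey) * scale
    top = (screen_height / 2 - ey) * scale
    return left, right, bottom, top
```

With the head centred, the frustum is symmetric; as the head moves sideways, the frustum skews in the opposite direction, which is what produces motion parallax on a fixed 2D display.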