Research Article | Open Access

Robust facial landmark detection and tracking across poses and expressions for in-the-wild monocular video

Bournemouth University, Poole, BH12 5BB, UK.
Harbin Institute of Technology, Harbin, 150001, China.

Abstract

We present a novel approach for automatically detecting and tracking facial landmarks across poses and expressions in in-the-wild monocular video, e.g., YouTube videos and smartphone recordings. Our method requires no calibration or manual adjustment for new input videos or actors. First, we propose a robust 2D facial landmark detection method that handles pose variation by combining shape-face canonical-correlation analysis with a global supervised descent method. Because 2D regression-based methods are sensitive to unstable initialization and ignore the temporal and spatial coherence of video, we refine the 2D landmarks with a coarse-to-dense 3D facial expression reconstruction. On the one hand, we use an in-the-wild method to extract a coarse reconstruction and its corresponding texture from the detected sparse facial landmarks, followed by robust estimation of pose, expression, and identity. On the other hand, to obtain a dense reconstruction, we introduce a face tracking flow method that corrects the coarse result and tracks weakly textured regions; it is used to iteratively update the coarse face model, and the dense reconstruction is obtained once this process converges. Extensive experiments on video sequences recorded by ourselves or downloaded from YouTube show facial landmark detection and tracking results under various lighting conditions, head poses, and facial expressions. The overall performance and a comparison with state-of-the-art methods demonstrate the robustness and effectiveness of our method.
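The detection stage described above builds on cascaded regression in the spirit of the supervised descent method. As a rough illustration of that idea only, the following self-contained Python toy fits a short cascade of linear regressors that map local intensity features to landmark-position updates on synthetic 1D signals. The data, features, and helper functions here (features, make_sample, training_data, learn_stage) are illustrative assumptions and are not the paper's shape-face CCA features or its global SDM implementation.

```python
# Toy sketch of the cascaded-regression / supervised-descent idea.
# Everything here is synthetic and illustrative, not the authors' code.
import numpy as np

rng = np.random.default_rng(0)

def features(signal, pos, half_window=5):
    # Sample intensities in a window around the current landmark estimate.
    idx = np.clip(np.arange(pos - half_window, pos + half_window + 1),
                  0, len(signal) - 1)
    return signal[idx]

def make_sample(length=200):
    # Synthetic 1D "image": a smooth bump whose peak is the landmark.
    true_pos = int(rng.integers(40, length - 40))
    x = np.arange(length)
    signal = np.exp(-0.5 * ((x - true_pos) / 8.0) ** 2)
    return signal, true_pos

def training_data(n=500, noise=15):
    # Pairs of (features at a perturbed start, residual to the true position).
    X, y = [], []
    for _ in range(n):
        signal, true_pos = make_sample()
        start = int(np.clip(true_pos + rng.normal(0, noise), 0, len(signal) - 1))
        X.append(features(signal, start))
        y.append(true_pos - start)
    return np.array(X), np.array(y)

def learn_stage(X, y):
    # One cascade stage: a linear regressor (with bias) from features to update.
    A = np.hstack([X, np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w

# Train a short cascade; later stages are trained on smaller perturbations,
# approximating the residuals left by earlier stages.
stages = [learn_stage(*training_data(noise=s)) for s in (15, 8, 4)]

# Apply the cascade to a new synthetic sample from a deliberately bad start.
signal, true_pos = make_sample()
est = float(true_pos + 12)
for w in stages:
    f = np.append(features(signal, int(round(est))), 1.0)
    est += float(f @ w)   # one learned descent step
print(f"true position = {true_pos}, cascade estimate = {est:.1f}")
```

In the actual method, the regressors act on image descriptors sampled around all facial landmarks, and the global supervised descent method learns separate descent maps for different regions of the pose/initialization space; the toy above only conveys the learn-a-descent-direction idea behind that family of methods.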

Electronic Supplementary Material

Video
41095_2016_68_MOESM1_ESM.mp4

Computational Visual Media
Pages 33-47
Cite this article:
Liu S, Zhang Y, Yang X, et al. Robust facial landmark detection and tracking across poses and expressions for in-the-wild monocular video. Computational Visual Media, 2017, 3(1): 33-47. https://doi.org/10.1007/s41095-016-0068-y


Revised: 04 September 2016
Accepted: 20 December 2016
Published: 17 March 2017
© The Author(s) 2016

This article is published with open access at Springerlink.com

The articles published in this journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Other papers from this open access journal are available free of charge from http://www.springer.com/journal/41095. To submit a manuscript, please go to https://www.editorialmanager.com/cvmj.
