Research Article | Open Access

Multi3D: 3D-aware multimodal image synthesis

BNRist, Tsinghua University, Beijing 100084, China
Computer Science Department, Stanford University, California 94305, USA

Abstract

3D-aware image synthesis has attained high image quality and robust 3D consistency. However, existing 3D controllable generative models synthesize 3D-aware images from a single modality, such as a 2D segmentation map or a sketch, and lack fine-grained control over the generated content, such as texture and age. To enhance user-guided controllability, we propose Multi3D, a 3D-aware controllable image synthesis model that supports multimodal input. Our model controls the geometry of the generated image with a 2D label map, such as a segmentation or sketch map, while specifying the appearance of the generated image through a textual description. To demonstrate the effectiveness of our method, we conducted experiments on multiple datasets, including CelebAMask-HQ, AFHQ-cat, and ShapeNet-car. Qualitative and quantitative evaluations show that our method outperforms existing state-of-the-art methods.
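To make the conditioning scheme concrete, the sketch below shows one plausible way the two modalities could be fused: a 2D label map yields a geometry code, a text embedding (e.g., from CLIP) yields an appearance code, and a camera pose selects the rendered view. This is a minimal, hypothetical PyTorch illustration, not the authors' implementation; all module names and dimensions are assumptions, and the toy decoder merely stands in for a 3D-aware (NeRF-style) generator backbone.

```python
# Hypothetical sketch of Multi3D-style multimodal conditioning (not the
# authors' code): geometry from a 2D label map, appearance from a text
# embedding, and a camera pose selecting the rendered view.
import torch
import torch.nn as nn

class Multi3DSketch(nn.Module):
    def __init__(self, num_classes=19, text_dim=512, style_dim=512):
        super().__init__()
        # Encode the 2D label map (segmentation or sketch) into a geometry code.
        self.label_encoder = nn.Sequential(
            nn.Conv2d(num_classes, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, style_dim),
        )
        # Map a (e.g., CLIP) text embedding into an appearance code.
        self.text_mapper = nn.Sequential(
            nn.Linear(text_dim, style_dim), nn.ReLU(),
            nn.Linear(style_dim, style_dim),
        )
        # Stand-in for a 3D-aware generator (e.g., a triplane/NeRF backbone);
        # here just a toy decoder consuming the fused code plus a 4x4 pose.
        self.decoder = nn.Linear(style_dim * 2 + 16, 3 * 64 * 64)

    def forward(self, label_map, text_embedding, cam2world):
        geom = self.label_encoder(label_map)   # geometry code from label map
        app = self.text_mapper(text_embedding) # appearance code from text
        pose = cam2world.flatten(1)            # flattened 4x4 camera matrix
        code = torch.cat([geom, app, pose], dim=1)
        img = self.decoder(code).view(-1, 3, 64, 64)
        return torch.tanh(img)

# Usage: one-hot segmentation map, CLIP-sized text embedding, identity pose.
model = Multi3DSketch()
seg = torch.zeros(1, 19, 128, 128)
txt = torch.randn(1, 512)
pose = torch.eye(4).unsqueeze(0)
print(model(seg, txt, pose).shape)  # torch.Size([1, 3, 64, 64])
```

Keeping the geometry and appearance codes separate until the final fusion step is what allows each modality to be edited independently, which is the disentangled control the abstract describes.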

Computational Visual Media
Pages 1205-1217
Cite this article:
Zhou W, Yuan L, Mu T. Multi3D: 3D-aware multimodal image synthesis. Computational Visual Media, 2024, 10(6): 1205-1217. https://doi.org/10.1007/s41095-024-0422-4