Research Article | Open Access

Multi3D: 3D-aware multimodal image synthesis

BNRist, Tsinghua University, Beijing 100084, China
Computer Science Department, Stanford University, California 94305, USA

Abstract

3D-aware image synthesis has attained high quality and robust 3D consistency. However, existing 3D controllable generative models synthesize 3D-aware images from a single input modality, such as a 2D segmentation map or a sketch, and lack fine-grained control over the generated content, such as its texture or age. To enhance user-guided controllability, we propose Multi3D, a 3D-aware controllable image synthesis model that supports multimodal input. Our model governs the geometry of the generated image with a 2D label map, such as a segmentation or sketch map, while regulating its appearance through a textual description. To demonstrate the effectiveness of our method, we conducted experiments on multiple datasets, including CelebAMask-HQ, AFHQ-cat, and ShapeNet-car. Qualitative and quantitative evaluations show that our method outperforms existing state-of-the-art methods.
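To make the described control scheme concrete, below is a minimal PyTorch sketch of a multimodal conditioning interface in the spirit of the abstract: a 2D label map drives a geometry code, a text embedding (e.g., from a CLIP-style text encoder) drives an appearance code, and a camera pose makes the output 3D-aware. All names (LabelMapEncoder, Multi3DSketch), dimensions, and the toy backbone are illustrative assumptions, not the authors' actual architecture.

```python
# Hypothetical sketch of a multimodal 3D-aware conditioning interface.
# Module names, dimensions, and the camera convention are illustrative
# assumptions, not the paper's actual implementation.
import torch
import torch.nn as nn

class LabelMapEncoder(nn.Module):
    """Encodes a 2D label map (segmentation or sketch) into a geometry code."""
    def __init__(self, in_channels: int, code_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(256, code_dim)

    def forward(self, label_map: torch.Tensor) -> torch.Tensor:
        return self.proj(self.net(label_map).flatten(1))

class Multi3DSketch(nn.Module):
    """Toy generator conditioned on a label map (geometry) and a text
    embedding (appearance), plus a camera pose for 3D awareness."""
    def __init__(self, num_classes: int = 19, text_dim: int = 512, code_dim: int = 512):
        super().__init__()
        self.geo_encoder = LabelMapEncoder(num_classes, code_dim)
        self.app_proj = nn.Linear(text_dim, code_dim)
        # Stand-in for a 3D-aware backbone; here it simply maps the
        # fused conditioning code to a low-resolution RGB image.
        self.backbone = nn.Sequential(
            nn.Linear(2 * code_dim + 16, 1024), nn.ReLU(),
            nn.Linear(1024, 3 * 64 * 64),
        )

    def forward(self, label_map, text_emb, cam_pose):
        geo = self.geo_encoder(label_map)           # geometry from the 2D label map
        app = self.app_proj(text_emb)               # appearance from the text description
        z = torch.cat([geo, app, cam_pose], dim=1)  # camera pose enables view control
        return self.backbone(z).view(-1, 3, 64, 64)

# Usage: a one-hot segmentation map, a placeholder text embedding, and a
# flattened 4x4 camera-to-world matrix produce an image for that viewpoint.
model = Multi3DSketch()
seg = torch.zeros(1, 19, 64, 64)
text_emb = torch.randn(1, 512)   # in practice: a CLIP text encoding of the prompt
cam = torch.randn(1, 16)
img = model(seg, text_emb, cam)  # -> (1, 3, 64, 64)
```

In the actual model, the backbone would be a 3D-aware generator rendered with volume rendering rather than a flat MLP; this sketch only illustrates how the two input modalities can be fused into a single conditioning code.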

Computational Visual Media, Pages 1205–1217
Cite this article:
Zhou W, Yuan L, Mu T. Multi3D: 3D-aware multimodal image synthesis. Computational Visual Media, 2024, 10(6): 1205–1217. https://doi.org/10.1007/s41095-024-0422-4


Received: 09 December 2023
Accepted: 29 February 2024
Published: 03 April 2024
© The Author(s) 2024.

This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Other papers from this open access journal are available free of charge from http://www.springer.com/journal/41095. To submit a manuscript, please go to https://www.editorialmanager.com/cvmj.
