Research Article | Open Access

CLIP-Flow: Decoding images encoded in CLIP space

Visual Computing Research Center, College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China
Department of Computer Science, Tel Aviv University, Tel Aviv 6997801, Israel
School of Computer Science and Engineering, The Hebrew University of Jerusalem, Jerusalem 91904, Israel

Abstract

This study introduces CLIP-Flow, a novel network for generating images from a given image or text. To effectively exploit the rich semantics contained in both modalities, we designed a semantics-guided methodology for image- and text-to-image synthesis. In particular, we adopted Contrastive Language-Image Pretraining (CLIP) as an encoder to extract semantics and StyleGAN as a decoder to generate images from this information. Moreover, to bridge the embedding space of CLIP and the latent space of StyleGAN, real NVP is employed, modified with activation normalization and invertible convolutions. As images and text share the same representation space in CLIP, text prompts can be fed directly into CLIP-Flow to achieve text-to-image synthesis. We conducted extensive experiments on several datasets to validate the effectiveness of the proposed image-to-image synthesis method. In addition, we evaluated text-to-image synthesis on the public Multi-Modal CelebA-HQ dataset. The experiments confirmed that our approach generates high-quality images matching the input text, and is comparable with state-of-the-art methods both qualitatively and quantitatively.
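
To make the described pipeline concrete, the following minimal PyTorch sketch illustrates the kind of Glow-style flow step (activation normalization, invertible linear mixing, and a real NVP affine coupling) that could map a 512-dimensional CLIP embedding toward a StyleGAN latent. The module names, layer sizes, number of flow steps, and the placeholder tensors standing in for the CLIP encoder and StyleGAN generator are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): a Glow-style flow step mapping a
# 512-d CLIP embedding toward a StyleGAN latent of the same dimensionality.
import torch
import torch.nn as nn

class ActNorm(nn.Module):
    """Per-channel learnable scale and bias; Glow initializes these from data,
    which is omitted here for brevity."""
    def __init__(self, dim):
        super().__init__()
        self.log_scale = nn.Parameter(torch.zeros(dim))
        self.bias = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        return (x + self.bias) * torch.exp(self.log_scale)

class Invertible1x1(nn.Module):
    """Invertible linear mixing: the fully connected analogue of Glow's
    invertible 1x1 convolution, initialized as a random orthogonal matrix."""
    def __init__(self, dim):
        super().__init__()
        q, _ = torch.linalg.qr(torch.randn(dim, dim))
        self.weight = nn.Parameter(q)

    def forward(self, x):
        return x @ self.weight

class AffineCoupling(nn.Module):
    """Real NVP affine coupling: one half of the features predicts a
    scale and shift applied to the other half."""
    def __init__(self, dim, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim // 2, hidden), nn.ReLU(),
            nn.Linear(hidden, dim),  # outputs log-scale and shift
        )

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=-1)
        log_s, t = self.net(x1).chunk(2, dim=-1)
        return torch.cat([x1, x2 * torch.exp(torch.tanh(log_s)) + t], dim=-1)

class FlowStep(nn.Module):
    """One step: actnorm -> invertible mixing -> affine coupling."""
    def __init__(self, dim=512):
        super().__init__()
        self.actnorm = ActNorm(dim)
        self.mix = Invertible1x1(dim)
        self.coupling = AffineCoupling(dim)

    def forward(self, x):
        return self.coupling(self.mix(self.actnorm(x)))

# Usage sketch: chain a few steps and map a stand-in CLIP feature vector.
flow = nn.Sequential(*[FlowStep(512) for _ in range(4)])
clip_embedding = torch.randn(1, 512)   # placeholder for CLIP image/text features
w_latent = flow(clip_embedding)        # would be fed to a pretrained StyleGAN generator
```

Each component is invertible in principle (positive scales, an orthogonal mixing matrix, and coupling layers), which is what allows such a flow to bridge the two latent spaces; the inverse passes and the training objective are omitted from this sketch.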

Computational Visual Media
Pages 1157-1168
Cite this article:
Ma H, Li M, Yang J, et al. CLIP-Flow: Decoding images encoded in CLIP space. Computational Visual Media, 2024, 10(6): 1157-1168. https://doi.org/10.1007/s41095-023-0375-z


Received: 16 March 2023
Accepted: 26 August 2023
Published: 28 August 2024
© The Author(s) 2024.

This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

