Research Article | Open Access

CLIP-Flow: Decoding images encoded in CLIP space

Visual Computing Research Center, College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China
Department of Computer Science, Tel Aviv University, Tel Aviv 6997801, Israel
School of Computer Science and Engineering, The Hebrew University of Jerusalem, Jerusalem 91904, Israel

Abstract

This study introduces CLIP-Flow, a novel network for generating images from a given image or text. To effectively exploit the rich semantics contained in both modalities, we designed a semantics-guided methodology for image- and text-to-image synthesis. In particular, we adopted Contrastive Language-Image Pretraining (CLIP) as an encoder to extract semantics and StyleGAN as a decoder to generate images from this information. Moreover, to bridge the embedding space of CLIP and the latent space of StyleGAN, real NVP is employed, modified with activation normalization and invertible convolutions. As images and text share the same representation space in CLIP, text prompts can be fed directly into CLIP-Flow to achieve text-to-image synthesis. We conducted extensive experiments on several datasets to validate the effectiveness of the proposed image-to-image synthesis method. In addition, we evaluated text-to-image synthesis on the public Multi-Modal CelebA-HQ dataset. The experiments confirmed that our approach generates high-quality images matching the input text, and is comparable with state-of-the-art methods both qualitatively and quantitatively.
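
To make the described pipeline concrete, the following minimal PyTorch sketch illustrates the kind of Glow-style flow step (activation normalization, invertible linear mixing, and a real NVP affine coupling) that could map a 512-dimensional CLIP embedding toward a StyleGAN latent. The module names, layer sizes, number of flow steps, and the placeholder tensors standing in for the CLIP encoder and StyleGAN generator are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): a Glow-style flow step mapping a
# 512-d CLIP embedding toward a StyleGAN latent of the same dimensionality.
import torch
import torch.nn as nn

class ActNorm(nn.Module):
    """Per-channel learnable scale and bias; Glow initializes these from data,
    which is omitted here for brevity."""
    def __init__(self, dim):
        super().__init__()
        self.log_scale = nn.Parameter(torch.zeros(dim))
        self.bias = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        return (x + self.bias) * torch.exp(self.log_scale)

class Invertible1x1(nn.Module):
    """Invertible linear mixing: the fully connected analogue of Glow's
    invertible 1x1 convolution, initialized as a random orthogonal matrix."""
    def __init__(self, dim):
        super().__init__()
        q, _ = torch.linalg.qr(torch.randn(dim, dim))
        self.weight = nn.Parameter(q)

    def forward(self, x):
        return x @ self.weight

class AffineCoupling(nn.Module):
    """Real NVP affine coupling: one half of the features predicts a
    scale and shift applied to the other half."""
    def __init__(self, dim, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim // 2, hidden), nn.ReLU(),
            nn.Linear(hidden, dim),  # outputs log-scale and shift
        )

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=-1)
        log_s, t = self.net(x1).chunk(2, dim=-1)
        return torch.cat([x1, x2 * torch.exp(torch.tanh(log_s)) + t], dim=-1)

class FlowStep(nn.Module):
    """One step: actnorm -> invertible mixing -> affine coupling."""
    def __init__(self, dim=512):
        super().__init__()
        self.actnorm = ActNorm(dim)
        self.mix = Invertible1x1(dim)
        self.coupling = AffineCoupling(dim)

    def forward(self, x):
        return self.coupling(self.mix(self.actnorm(x)))

# Usage sketch: chain a few steps and map a stand-in CLIP feature vector.
flow = nn.Sequential(*[FlowStep(512) for _ in range(4)])
clip_embedding = torch.randn(1, 512)   # placeholder for CLIP image/text features
w_latent = flow(clip_embedding)        # would be fed to a pretrained StyleGAN generator
```

Each component is invertible in principle (positive scales, an orthogonal mixing matrix, and coupling layers), which is what allows such a flow to bridge the two latent spaces; the inverse passes and the training objective are omitted from this sketch.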

Computational Visual Media
Pages 1157-1168
Cite this article:
Ma H, Li M, Yang J, et al. CLIP-Flow: Decoding images encoded in CLIP space. Computational Visual Media, 2024, 10(6): 1157-1168. https://doi.org/10.1007/s41095-023-0375-z


Received: 16 March 2023
Accepted: 26 August 2023
Published: 28 August 2024
© The Author(s) 2024.

This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

