Diffusion models have recently emerged as powerful generative models, producing high-fidelity samples across domains. Despite this success, they face two key challenges: accelerating the time-consuming iterative generation process, and controlling and steering what is generated. Existing surveys provide broad overviews of diffusion model advancements, but they lack comprehensive coverage centered specifically on techniques for controllable generation. This survey addresses that gap with a comprehensive and coherent review of controllable generation in diffusion models. We provide a detailed taxonomy that defines controllable generation for diffusion models and categorizes it by formulation, methodology, and evaluation metrics. By enumerating the range of methods researchers have developed for enhanced control, we aim to establish controllable diffusion generation as a distinct subfield warranting dedicated focus. With this survey, we contextualize recent results, provide a dedicated treatment of controllable diffusion model generation, and outline limitations and future directions. To demonstrate applicability, we highlight controllable diffusion techniques for major computer vision tasks. By consolidating methods and applications for controllable diffusion models, we hope to catalyze further innovation in reliable and scalable controllable generation.