Diffusion models have recently emerged as powerful generative models, producing high-fidelity samples across domains. Despite this success, they face two key challenges: accelerating the time-consuming iterative generation process, and controlling and steering what is generated. Existing surveys provide broad overviews of diffusion model advancements, but they lack comprehensive coverage centered specifically on techniques for controllable generation. This survey addresses that gap with a comprehensive and coherent review of controllable generation in diffusion models. We provide a detailed taxonomy that defines controllable generation for diffusion models and categorizes it by formulation, methodology, and evaluation metrics. By enumerating the range of methods researchers have developed for enhanced control, we aim to establish controllable diffusion generation as a distinct subfield warranting dedicated focus. With this survey, we contextualize recent results, provide a dedicated treatment of controllable diffusion model generation, and outline limitations and future directions. To demonstrate applicability, we highlight controllable diffusion techniques for major computer vision tasks. By consolidating methods and applications for controllable diffusion models, we hope to catalyze further innovation in reliable and scalable controllable generation.