The future of image generation with diffusion models

Summary

Explore how diffusion models create high-quality images by transforming random noise into detailed visuals. This article dives into the science behind their success, the challenges of managing unintended outputs, and the innovative techniques, such as concept erasure, that offer greater control over the images these models produce. Discover how applying concept erasure to a LoRA model fine-tuned for automotive photography removes unwanted elements, resulting in the generation of realistic and controlled car images tailored to specific models like the BMW M5 (2024).

This article was written by Laura Meyer, an engineer at a leading consultancy specializing in AI, data science, and DevOps, with extensive experience in GenAI innovation and delivering technical training.

Imagine creating an image from nothing but pure noise, where each step gradually transforms that chaos into a detailed, lifelike creation. By learning to reverse noise-adding steps, these so-called “diffusion models” can create diverse and high-quality images, even from abstract prompts. This power comes with a challenge: managing unintended or harmful content, like generating an image that wasn’t quite what you had in mind. One solution lies in fine-tuning these models to “erase” unwanted concepts, offering a way to control and refine what they generate. Erasing and redefining concepts opens up powerful opportunities, giving creators precise control over how concepts are represented so that outputs align with specific needs or visions. This brings us to a deeper look at how diffusion models work and the techniques that drive their creative potential.

Theoretical foundation of diffusion models

Diffusion models are a type of generative model used extensively for image generation. These models work by iteratively transforming random noise into a coherent image through a learned denoising process. The core idea is to train the model to reverse a sequence of noise-adding steps, effectively learning how to reconstruct data by eliminating noise.

Stay with me — we’ll break it down together.

The process of a diffusion model involves two primary phases:

  • The forward process (noise addition), where noise is progressively added to the training data until it is transformed into pure noise, usually a Gaussian distribution, and
  • The reverse process (denoising), where noise is incrementally removed to recreate the original data by sampling from the learned probability distribution. This denoising process is modeled using neural networks; both transitions are sketched more formally below.
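
To make this concrete, in the standard denoising diffusion probabilistic model (DDPM) formulation both phases are Gaussian transitions, where β_t controls how much noise is added at step t:

```latex
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\; \sqrt{1-\beta_t}\, x_{t-1},\; \beta_t \mathbf{I}\right)
\qquad
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\; \mu_\theta(x_t, t),\; \Sigma_\theta(x_t, t)\right)
```

The forward transition q is fixed, while the mean μ_θ and covariance Σ_θ of the reverse transition p_θ are what the neural network learns.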

As shown in the image below (source), diffusion models gradually corrupt data with noise and subsequently reverse the process to produce new data from the noise. During each denoising step in the reverse process, the model estimates the score function, the gradient of the log data density, which guides the model toward data points with higher likelihood and less noise.

Let’s take a closer look at the structure of the forward and reverse processes during training.

Forward process

The forward process in diffusion models transforms data into pure Gaussian noise through a series of steps. Starting from the original data, the model applies small, fixed noise-adding transformations at each step, so that intermediate states retain progressively less of the original distribution’s structure.

The rate at which this noise grows is controlled by a schedule. This so-called variance schedule can either be set beforehand or learned by the model during training. Typically, an increasing variance schedule is used, ensuring the noise grows gradually while retaining enough structure in intermediate states to maintain a meaningful mapping to the original data.
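
As a minimal sketch (in PyTorch, with illustrative values following the common linear DDPM schedule), here is how such a schedule and the resulting forward step might look:

```python
import torch

# Linear variance schedule (a common, simple choice; values are illustrative)
T = 1000
betas = torch.linspace(1e-4, 0.02, T)      # beta_t for t = 1..T
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # cumulative product ᾱ_t

def forward_diffuse(x0: torch.Tensor, t: int) -> torch.Tensor:
    """Sample x_t directly from x_0 using the closed-form forward process:
    x_t = sqrt(ᾱ_t) * x_0 + sqrt(1 - ᾱ_t) * ε,  with ε ~ N(0, I)."""
    noise = torch.randn_like(x0)
    return alpha_bars[t].sqrt() * x0 + (1.0 - alpha_bars[t]).sqrt() * noise
```

A convenient property of this formulation is that x_t can be sampled in a single step from x_0, without simulating every intermediate state.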

Reverse process

Training a diffusion model involves teaching it how to reverse this noise process for sample generation. As such, the model uses the learned mapping to denoise the data incrementally, moving from pure noise back to a realistic image. Each step in this process involves a neural network predicting the noise present in the current state, which is then subtracted to produce a denoised state. By repeating this process across multiple steps, the model generates an image that closely matches the distribution of the training data.

This is done by maximizing something called likelihood, which, in simple terms, measures how likely the model is to produce data resembling the training set. Because the likelihood itself is intractable to compute directly, the model instead optimizes a simpler surrogate, the variational bound, which is easier to compute and guides the training process. To measure progress during this optimization, we use the Kullback-Leibler (KL) divergence, a mathematical tool that quantifies the difference between two probability distributions. It tells the model how far its current predictions are from the ideal ones, guiding it toward more realistic results.
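
In the standard DDPM setup, this bound simplifies to a noise-prediction objective: the network ε_θ is trained to predict the exact noise ε that was mixed into a training image:

```latex
L_{\text{simple}} = \mathbb{E}_{x_0,\, \epsilon \sim \mathcal{N}(0,\mathbf{I}),\, t}\left[\, \bigl\lVert \epsilon - \epsilon_\theta\bigl(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\; t\bigr) \bigr\rVert^2 \,\right]
```

where ᾱ_t is the cumulative product of (1 − β_t) up to step t, as in the forward-process sketch above.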

Reverse process architecture

The architecture of the reverse process can vary widely, depending on the problem at hand. The only requirement is that the input and output of the reverse process have the same dimensions (i.e., the same size image). This ensures that the noise added during the forward process can be effectively removed, step by step, to recreate the original data. Beyond this basic requirement, however, there’s a lot of room for customization.

Despite this flexibility, many diffusion models use a U-Net-like architecture for the reverse process. U-Net is a special type of Convolutional Neural Network (CNN) designed to capture both fine details and overall structure in an image. In the contracting path, which forms the left side of the “U,” the model progressively reduces the image’s spatial dimensions, capturing important features of the noisy input at multiple scales. In the expanding path, which forms the right side of the “U,” the model gradually upsamples the image back to its original size, reconstructing a clean, high-quality output from the features learned during contraction. At each step of the reverse diffusion process, the U-Net predicts the noise to remove, and this prediction is crucial for guiding the image toward a realistic reconstruction.
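
To illustrate the shape of such an architecture, here is a deliberately tiny, hypothetical U-Net-style noise predictor in PyTorch; real diffusion U-Nets add more resolution levels, attention blocks, and timestep embeddings:

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """A minimal U-Net-style noise predictor: one downsampling stage,
    one upsampling stage, and a skip connection. Note that input and
    output have the same dimensions, as the reverse process requires."""

    def __init__(self, channels: int = 3, hidden: int = 64):
        super().__init__()
        self.down = nn.Sequential(  # contracting path (left side of the "U")
            nn.Conv2d(channels, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.up = nn.Sequential(    # expanding path (right side of the "U")
            nn.ConvTranspose2d(hidden, hidden, 4, stride=2, padding=1), nn.ReLU(),
        )
        # Final conv mixes upsampled features with the skip connection
        self.out = nn.Conv2d(hidden + channels, channels, 3, padding=1)

    def forward(self, x_t: torch.Tensor) -> torch.Tensor:
        h = self.down(x_t)                 # capture features at lower resolution
        h = self.up(h)                     # restore original spatial size
        h = torch.cat([h, x_t], dim=1)     # skip connection preserves detail
        return self.out(h)                 # predicted noise, same shape as x_t
```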

At the very end of the reverse process, the goal is to produce a clear and realistic image where each pixel has a specific value, such as a brightness level between 0 and 255 for grayscale images. However, the model’s predictions are continuous—like a guess that a pixel’s brightness might be 127.85 or 128.3, instead of an exact value. Thus, a discrete decoder is used to convert continuous predictions into discrete pixel values. This step is essential for generating images with realistic details, as it assigns probabilities to pixel values based on the reverse process’s predictions.
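
A full discrete decoder assigns a probability to each possible pixel value; as a simpler stand-in, many implementations just clamp and quantize the continuous output, roughly like this:

```python
import torch

def to_pixels(x: torch.Tensor) -> torch.Tensor:
    """Map continuous model output in [-1, 1] to discrete 8-bit pixel values.
    A common shortcut for the final discretization step."""
    x = x.clamp(-1.0, 1.0)                              # keep predictions in range
    return ((x + 1.0) * 127.5).round().to(torch.uint8)  # [-1, 1] -> {0, ..., 255}
```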

Now that we’ve outlined the core idea of diffusion models, let’s highlight the properties that make them particularly powerful.

Properties of diffusion models

Diffusion models are defined by their probabilistic nature, allowing them to generate diverse outputs from the same input. The reverse process enables the model to sample from the learned data distribution, producing multiple outputs for a given input. For instance, a diffusion model trained on a dataset of human faces can generate realistic new faces with diverse features and expressions, even if those exact faces were not included in the original dataset. By conditioning the model on additional inputs—such as text descriptions or class labels—it can generate images tailored to specific prompts, making it highly flexible for tasks like text-to-image generation. Stable Diffusion models leverage this ability, enabling users to create detailed, coherent images based on text prompts.
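
For example, with the Hugging Face diffusers library, text-conditioned sampling looks roughly like the sketch below (the checkpoint name is an illustrative choice; any Stable Diffusion checkpoint works):

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a pretrained Stable Diffusion pipeline (assumes a CUDA-capable GPU)
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# The same prompt can yield different images across calls: the reverse
# process samples from a learned distribution rather than a fixed mapping.
image = pipe("a red car parked by the sea, golden hour").images[0]
image.save("red_car.png")
```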

Diffusion models also rely on a Markov chain framework to describe the transitions between states in both the forward and reverse processes. A Markov chain is a sequence of states where each state depends only on the previous one. This “memoryless” property is crucial: the model doesn’t need to remember the entire history of the data, just the current state.
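
This memoryless property is what lets the joint distributions over all noisy states factorize into simple per-step transitions:

```latex
q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}),
\qquad
p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)
```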

In summary, the success of diffusion models stems from:

  • Probabilistic Sampling: By learning a distribution rather than a deterministic mapping, diffusion models can generate diverse outputs from the same input, allowing for creative variations.
  • Multi-Step Refinement: The step-by-step nature of the reverse process ensures gradual improvement, enabling the model to correct errors and refine details at each stage.
  • Flexibility with Conditioning: By incorporating additional inputs, such as text descriptions or class labels, diffusion models can be guided to produce outputs that align with specific requirements.

Erasing concepts from diffusion models

Advancements in image generation, particularly with diffusion models like Stable Diffusion, raise ethical and legal concerns due to the potential generation of undesirable content (e.g., nudity, copyrighted objects, or imitation of artistic styles). This has led to misuse, including deepfake pornography, unauthorized replication of artists’ styles, and intellectual property violations. Existing solutions, such as post-generation filtering or inference-time modifications, are often ineffective, especially with open-source models. To address these issues, Gandikota et al. (2024) propose a new approach called Erased Stable Diffusion (ESD), which fine-tunes model weights to suppress specific concepts (e.g., objects or styles) using textual descriptions. ESD offers an efficient solution by embedding restrictions directly into the model, eliminating the need for costly retraining on new datasets while effectively targeting undesired concepts.

Concept erasure in diffusion models involves selectively removing specific learned concepts from a pre-trained model’s parameters. This is achieved by minimizing the influence of features associated with the concept, allowing the model to “forget” while preserving its ability to generate unrelated content. The challenge is ensuring this process does not unintentionally disrupt other learned concepts. The process primarily adjusts the cross-attention and self-attention modules.
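
At its core, ESD works by negative guidance: the fine-tuned model ε_θ is trained so that its prediction for the erased concept c matches the frozen original model ε_θ* steered away from that concept (a simplified restatement of the paper’s objective; η controls the strength of erasure):

```latex
\epsilon_\theta(x_t, c, t) \;\leftarrow\; \epsilon_{\theta^*}(x_t, t) \;-\; \eta \left[\, \epsilon_{\theta^*}(x_t, c, t) - \epsilon_{\theta^*}(x_t, t) \,\right]
```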

Cross-attention modules activate in response to specific tokens in the prompt. For example, in the prompt “a red car,” the cross-attention mechanism ensures that the word “car” affects the structure of the image while “red” influences its color. As such, cross-attention mechanisms are fine-tuned to suppress the influence of the undesired concept. By modifying how the model maps textual descriptions to visual features, the generation of the concept can be effectively blocked.

Self-attention mechanisms, by contrast, focus on internal relationships within the input data, enabling the model to understand how different components (e.g., objects, colors, or textures) interact. Fine-tuning self-attention layers helps refine the model’s understanding of these internal relationships, ensuring that features related to the undesired concept are deprioritized.

Fine-tuning with a LoRA model

Recently, concept erasure as introduced by Gandikota et al. (2024) has been effectively applied in combination with a LoRA model to enhance image generation for automotive photography. LoRA (Low-Rank Adaptation) is a technique used to fine-tune pre-trained generative models efficiently, enabling them to specialize in specific tasks without retraining the entire model. By adapting only a small subset of parameters, LoRA allows for rapid customization while maintaining the integrity of the base model. This approach is especially useful in tasks like image generation, where a model can be fine-tuned for specific subjects or styles without affecting its broader capabilities.
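
As a minimal sketch of the idea (names and hyperparameters are illustrative, not the exact implementation used in this work), a LoRA update wraps a frozen pretrained layer with a small trainable low-rank correction:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper: the pretrained layer is frozen, and only the
    low-rank update B @ A (rank r << min(in, out)) is trained."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.requires_grad_(False)  # freeze all pretrained parameters
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # starts at zero
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Output = frozen base layer + scaled low-rank correction
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```

Because only A and B (a tiny fraction of each layer’s parameters) are trained, fine-tuning is fast and the base model’s weights remain untouched.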

In this work, the LoRA model was trained on a CGI-generated dataset focused on the BMW M5 (2024), specifically optimized for automotive photography. This model can generate realistic and detailed car images from any angle, capturing intricate aspects such as lighting conditions, material options, and customizations, including different paint finishes and number plate details. Before training the LoRA model, concept erasure was used to refine the base model by removing unwanted concepts, such as other car manufacturers, non-photorealistic styles, or irrelevant political themes. This “cleaning” of the base model ensured that the resulting images were highly focused on the subject, allowing for better control over the generated content. Fine-tuning the model for higher-resolution images, up to 2560x2560 px, further improved the quality and detail reproduction.

As shown in the image below, the combination of concept erasure and precise fine-tuning allowed for significant improvements in the model’s adaptability and realism. The model’s ability to generate specific perspectives, lighting, and car details was enhanced by including over 30 integrated trigger words, making it a powerful tool for producing high-quality, production-ready automotive images tailored to specific needs. This approach highlights the importance of refining generative models for more precise and controlled outputs.

Conclusion

The ability to erase and redefine concepts in diffusion models offers vast opportunities for businesses and creatives, enabling industries like advertising, design, and media production to produce highly customized and consistent outputs that align with brand identity. Artists and designers can leverage fine-tuned models to streamline workflows, automate tasks, and explore new creative possibilities. This powerful tool enhances flexibility and control in creative projects, allowing for precise and intentional outcomes. However, with these advancements come ethical responsibilities: businesses must prioritize transparency, accountability, and thoughtful implementation to ensure these technologies are used innovatively and responsibly.
