Diffusion: from noise to image, step by step

“A sculptor doesn't add marble. They start with a rough block and remove everything that isn't the statue.”

Image models start from noise and denoise it toward your prompt.

Generation as denoising

A diffusion model is trained on a simple, almost silly idea. Take a real image, add a little random noise, and teach a model to remove it. Do this across every level of noise, from barely-speckled to pure static. The model becomes an expert at one thing: given a noisy image, predict a slightly cleaner version.
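The training recipe above can be sketched in a few lines. This is a minimal illustration, not any particular model's code: it assumes a DDPM-style linear noise schedule (the schedule values and the helper names make_alpha_bar, add_noise and training_loss are assumptions made for this example), and it uses the common formulation in which the network predicts the noise that was added, which amounts to predicting a cleaner image.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Fraction of the original signal left at each noise level (assumed linear schedule)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(x0, t, alpha_bar, rng):
    """Mix a clean image with Gaussian noise at level t; return both the mix and the noise."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

def training_loss(predict_noise, x0, t, alpha_bar, rng):
    """Mean squared error between the true noise and the model's guess at it."""
    x_t, eps = add_noise(x0, t, alpha_bar, rng)
    return float(np.mean((predict_noise(x_t, t) - eps) ** 2))

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
image = rng.uniform(-1.0, 1.0, size=(8, 8))      # stand-in for a real image

lightly_noised, _ = add_noise(image, 10, alpha_bar, rng)    # barely speckled
heavily_noised, _ = add_noise(image, 999, alpha_bar, rng)   # close to pure static

# Placeholder "model" that always guesses zero noise; a real one is a neural network.
loss = training_loss(lambda x_t, t: np.zeros_like(x_t), image, 500, alpha_bar, rng)
```

Training repeats this for many images and many randomly chosen noise levels, adjusting the network to push the loss down.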

To generate, you start from pure noise (random static) and apply that denoising step over and over. Each pass removes a little noise, and because the model learned what real images look like, the noise resolves into a coherent image. Generation is just denoising, run from nothing.

Start from random noise; each step removes a little, guided by the prompt, until a clean image emerges. Creation as repeated subtraction.
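The generation loop can be sketched the same way. The version below is a deterministic DDIM-style sampler, chosen here only because it is short; the text does not commit to a specific sampler. The oracle_predict_noise function is a hypothetical stand-in for a trained network: it "knows" exactly one image, so the loop has something to converge to. The schedule values are the same assumptions as in any DDPM-style setup.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Fraction of signal left at each noise level (assumed linear schedule)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def sample(predict_noise, shape, alpha_bar, step_indices, rng):
    """Start from static and denoise step by step (deterministic, DDIM-style)."""
    x = rng.standard_normal(shape)                       # pure noise
    for i, t in enumerate(step_indices):
        eps = predict_noise(x, t)
        # Current best guess at the fully clean image.
        x0_guess = (x - np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alpha_bar[t])
        if i + 1 == len(step_indices):
            return x0_guess
        # Move the guess back to the next, lower noise level.
        t_next = step_indices[i + 1]
        x = np.sqrt(alpha_bar[t_next]) * x0_guess + np.sqrt(1.0 - alpha_bar[t_next]) * eps
    return x

alpha_bar = make_alpha_bar()
target = np.linspace(-1.0, 1.0, 64).reshape(8, 8)        # the one image the toy model knows

def oracle_predict_noise(x_t, t):
    """Hypothetical perfect denoiser for a dataset containing only `target`."""
    return (x_t - np.sqrt(alpha_bar[t]) * target) / np.sqrt(1.0 - alpha_bar[t])

steps = np.linspace(999, 0, 50).astype(int)              # 50 steps, noisiest first
result = sample(oracle_predict_noise, target.shape, alpha_bar, steps, np.random.default_rng(0))
```

With a real network the per-step guess is imperfect, which is why the loop takes many small steps instead of one large one.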

How the prompt steers it

Pure denoising would produce some plausible image, but not your image. The prompt enters as guidance: at each denoising step, the model is conditioned on your text, nudging the result toward an image that matches the description. The text is turned into a representation (using the kind of text-image shared space we'll meet in chapter 6) that the denoiser can follow.

This is why the same prompt gives different images each time (you start from different random noise) and why tiny prompt changes can swing the result: you're steering a process, not retrieving a picture. The prompt is a force field over the denoising, not a lookup key.
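One widely used way to apply that steering is classifier-free guidance: run the denoiser with and without the prompt, then push the noise estimate further in the direction the prompt suggests. The text above does not name a technique, so treat this as one illustrative option. The toy_predict_noise function, the image-shaped "embedding" and the default scale of 7.5 are all assumptions made for the sketch.

```python
import numpy as np

def guided_noise(predict_noise, x_t, t, prompt_embedding, guidance_scale=7.5):
    """Classifier-free guidance: exaggerate the difference the prompt makes."""
    eps_uncond = predict_noise(x_t, t, None)               # prompt ignored
    eps_cond = predict_noise(x_t, t, prompt_embedding)     # prompt used
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

def toy_predict_noise(x_t, t, prompt_embedding):
    """Hypothetical stand-in for a trained, text-conditioned denoiser."""
    base = 0.1 * x_t
    if prompt_embedding is None:
        return base
    return base + prompt_embedding

def starting_noise(seed, shape=(4, 4)):
    """Different seeds give different static, hence different final images."""
    return np.random.default_rng(seed).standard_normal(shape)

prompt = np.full((4, 4), 0.5)      # pretend text embedding, already image-shaped
x_a = starting_noise(seed=1)
x_b = starting_noise(seed=2)       # same prompt, different starting point

weak = guided_noise(toy_predict_noise, x_a, 500, prompt, guidance_scale=1.0)
strong = guided_noise(toy_predict_noise, x_a, 500, prompt, guidance_scale=7.5)
```

A scale of 1.0 reduces to plain conditioning; larger values follow the prompt more strictly, usually at some cost in variety. Fixing the seed reproduces the same starting noise.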

Working small: latent diffusion

Denoising a full-resolution image directly is enormously expensive (millions of pixel values, repeated over many steps). The breakthrough that put image generation on ordinary hardware was to work in a compressed space instead. An encoder learns to shrink images to a small "latent" representation, and the denoiser is trained on those latents. To generate, all the expensive denoising happens in that small space, starting from latent-sized noise, and a decoder expands the finished latent back to full resolution.

Compress to a small latent, do the costly generation there, then decode back to full resolution. A comparable result for a fraction of the compute.
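A quick back-of-the-envelope calculation shows the scale of the saving. The numbers are illustrative assumptions: a 512x512 RGB image, and a latent that is 8 times smaller on each side with 4 channels, which is the layout some well-known latent-diffusion models use.

```python
def count_values(height, width, channels):
    """Number of numbers the denoiser has to process per step."""
    return height * width * channels

def latent_shape(height, width, downsample=8, latent_channels=4):
    """Assumed compression: each side shrinks by `downsample`, channels become `latent_channels`."""
    return height // downsample, width // downsample, latent_channels

pixel_values = count_values(512, 512, 3)               # 786432
lat_h, lat_w, lat_c = latent_shape(512, 512)           # (64, 64, 4)
latent_values = count_values(lat_h, lat_w, lat_c)      # 16384
ratio = pixel_values / latent_values                   # 48.0
```

Under these assumptions every denoising step handles 48 times fewer values. The real compute saving depends on the network's architecture, so read the ratio as an indication, not a measurement.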

This latent-diffusion approach, popularised around 2022, is why image generators went from research-lab curiosities to tools running on a gaming GPU. The idea (do the hard work in a compressed space) recurs all over efficient AI.

What this explains about image AI

The denoising picture explains the quirks you've seen. Why images are slow to generate (many steps). Why details like hands and text historically came out garbled (fine, structured detail is hard to recover from noise). Why you can guide, inpaint, and vary an image (you can intervene in the denoising). And why outputs differ from run to run (different starting noise, unless you fix the seed). The weirdness isn't arbitrary: it's the mechanism showing through.
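Inpainting is a good example of intervening in the denoising. One simple approach, sketched here as an assumption and not as the method of any specific tool, is to overwrite the part of the image you want to keep at every step with a suitably noised copy of the original, so the model only gets to invent the masked region. The function name keep_known_region and the mask convention are made up for this example.

```python
import numpy as np

def keep_known_region(x_t, original, mask, alpha_bar_t, rng):
    """Blend one denoising state with the original image.

    mask is 1 where the model may repaint and 0 where the original must stay.
    alpha_bar_t is the fraction of signal left at the current noise level.
    """
    noise = rng.standard_normal(original.shape)
    # Bring the original to the same noise level as the current state.
    noised_original = np.sqrt(alpha_bar_t) * original + np.sqrt(1.0 - alpha_bar_t) * noise
    return mask * x_t + (1.0 - mask) * noised_original

rng = np.random.default_rng(0)
original = np.ones((4, 4))
x_t = rng.standard_normal((4, 4))           # current state of the denoising loop
mask = np.zeros((4, 4))
mask[:, 2:] = 1.0                           # repaint only the right half

# At the final, noise-free level (alpha_bar_t = 1) the kept half is exactly the original.
blended = keep_known_region(x_t, original, mask, alpha_bar_t=1.0, rng=rng)
```

Calling this once per step inside the sampling loop keeps the untouched region anchored to the original while the masked region is generated to fit around it.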

In one line each

Diffusion models generate by starting from pure noise and repeatedly removing it until an image emerges.
They're trained by adding noise to real images and learning to reverse it. Generation runs that reversal from nothing.
The prompt steers each denoising step; different starting noise is why the same prompt gives different images.
Latent diffusion does the expensive work in a compressed space, which is what put image generation on ordinary hardware.

Where to go next

Chapter 3: Controlling images