Controlling images

“A prompt is shouting an order across a noisy room. Control is putting the blueprint in their hands.”

Seeds, masks, and reference images steer what you get.

The limits of words

Text is a low-bandwidth way to specify an image. "A person standing by a red car" leaves countless details unspecified (pose, angle, lighting, exact placement) and the model fills them with whatever the noise suggests. For exploration that's fine. For a specific result, prompting alone is frustrating: you're describing a picture to someone who can't see your intent.

So the techniques that matter are the ones that give the model more than words: an existing image to modify, a structural guide to follow, a reference style to match. These turn generation from "roll the dice on my description" into "execute against my specification."

Editing what's already there

Because generation is a denoising process you can intervene in, you can start from an existing image instead of pure noise. Three workhorse techniques follow:

Image-to-image: start the denoising from your image plus some noise, so the output keeps its overall structure but changes according to the prompt. "Make this photo look like a painting."
Inpainting: regenerate only a masked region, leaving the rest untouched. "Remove the person from this corner" or "change just the sky." The model fills the gap consistently with what surrounds it.
Outpainting: extend an image beyond its borders, inventing plausible continuation.
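
The first two techniques can be sketched on a tiny grayscale grid in plain Python. This is a toy under stated assumptions, not how any particular tool implements them: there is no model here, the "generated" pixels are made up, and the names img2img_start, inpaint_composite, and strength are illustrative rather than a real library's API. Real systems take the mixing weights from a noise schedule and usually work on a compressed representation of the image rather than raw pixels.

```python
import math
import random

def img2img_start(image, strength, seed=0):
    """Mix an image with Gaussian noise to get a denoising start point.

    strength=0.0 returns the image unchanged; strength=1.0 is pure noise.
    The square-root weights are a simplifying assumption for this sketch.
    """
    rng = random.Random(seed)
    keep = math.sqrt(1.0 - strength)
    noise_weight = math.sqrt(strength)
    return [[keep * px + noise_weight * rng.gauss(0.0, 1.0) for px in row]
            for row in image]

def inpaint_composite(original, generated, mask):
    """Take generated pixels where mask is 1, keep original pixels where it is 0."""
    return [[g if m else o for o, g, m in zip(orow, grow, mrow)]
            for orow, grow, mrow in zip(original, generated, mask)]

original = [[0.1, 0.2, 0.3],
            [0.4, 0.5, 0.6]]
generated = [[0.9, 0.9, 0.9],
             [0.9, 0.9, 0.9]]   # hypothetical model output, invented for the example
mask = [[0, 0, 1],
        [0, 0, 1]]              # regenerate only the right-hand column

result = inpaint_composite(original, generated, mask)
```

In this sketch, a low strength keeps the start point close to the source image and a high strength hands more of it over to noise. The mask guarantees that everything outside the marked region is carried over exactly.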

These are the basis of real creative and production workflows, where you rarely want a whole image from scratch. You want to change one thing while holding everything else fixed.

Imposing structure

The biggest leap in control came from conditioning generation on a structural input alongside the prompt (an edge map, a depth map, a human pose skeleton, a rough sketch). The model must produce an image that both matches your words and conforms to that structure. Now you can say "a knight in this exact pose" by handing over a stick-figure skeleton, or "this building with that facade" via an edge outline.

This family of techniques (ControlNet, introduced in 2023, is the best-known) is what makes image models usable for professional work, where you need the composition you intended, not a plausible composition the model preferred. It's the difference between a toy and an instrument.
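
A structural guide is itself just an image. As a rough illustration, the sketch below derives a crude edge map from a tiny grayscale grid by thresholding brightness jumps between neighbouring pixels. The function name edge_map and the threshold value are assumptions made for this example. Production tools typically use a proper edge detector (Canny is a common choice), and the step where a model is conditioned on the map is not shown.

```python
def edge_map(image, threshold=0.5):
    """Mark pixels whose brightness jump to the right or downward neighbour exceeds threshold."""
    height, width = len(image), len(image[0])
    edges = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Differences to the right and downward neighbours; zero at the borders.
            dx = abs(image[y][x] - image[y][x + 1]) if x + 1 < width else 0.0
            dy = abs(image[y][x] - image[y + 1][x]) if y + 1 < height else 0.0
            if max(dx, dy) > threshold:
                edges[y][x] = 1
    return edges

# A dark left half next to a bright right half: one vertical edge.
image = [
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
]
guide = edge_map(image)
```

The resulting guide marks a single vertical line where the dark and bright halves meet. An outline of this kind is what gets handed over alongside the prompt.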

Matching style and subject

Often you want consistency: the same character across many images, or a specific art style throughout. A range of techniques address this, from lightweight personalisation that teaches a model a new subject or style from a few examples to reference-image conditioning that carries a look across generations. The details shift quickly with the tools, but the goal is constant: reproducibility, not one-off luck.

The skill is in the loop, not the prompt

Put together, controlled image generation is iterative: generate, inspect, mask and regenerate a region, adjust the structural guide, vary a seed, refine. The people who get professional results aren't writing magic prompts. They're running a tight loop with the control tools, exactly the way a photographer works the shot rather than expecting one perfect frame.
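
The "vary a seed" step in that loop rests on one property: the seed fixes the starting noise. A minimal sketch using Python's standard random module follows. The function starting_noise is a hypothetical stand-in for whatever noise generator a real tool uses.

```python
import random

def starting_noise(seed, n=4):
    """Return n Gaussian noise values fully determined by the seed."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

same_a = starting_noise(42)
same_b = starting_noise(42)   # same seed, identical noise
other = starting_noise(43)    # different seed, different noise
```

Holding the seed fixed while changing one control (the mask, the guide, the prompt) makes it easier to see what that change did. Changing only the seed explores alternatives under the same controls.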

In one line each

Text is low-bandwidth; prompting alone leaves most of an image to chance. Control means giving the model more than words.
Because generation is interruptible denoising, you can edit: image-to-image, inpainting a masked region, outpainting beyond the borders.
Structural conditioning (edge, depth, pose, e.g. ControlNet) forces the composition you intended, turning a toy into an instrument.
Professional results come from a tight iterative loop with the control tools, not from a single magic prompt. Personalisation raises real consent and copyright issues.

Where to go next

Chapter 4: Audio & music