One idea, many modalities

“Once you know how to learn the shape of a thing, you can learn the shape of anything: words, pictures, sound.”

One idea wearing many coats

Underneath every generative model is a single move: learn the distribution of some kind of data, then sample new examples from it. Learn what English sentences look like, sample a new one: that's a language model. Learn what photographs look like, sample a new one: that's an image generator. The data changes; the core idea doesn't.

The same core idea (learn a distribution, then sample from it) powers generation across text, image, audio, video, 3D, and code.
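To make "learn a distribution, then sample from it" concrete, here is a minimal sketch in Python. It is a toy and not how modern generators are built: the "data" is a handful of made-up words, and the "model" is just a table of how often each character follows another. The names learn_distribution and sample, the corpus, and the ^ and $ boundary markers are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def learn_distribution(words):
    """Estimate P(next char | current char) by counting adjacent pairs."""
    counts = defaultdict(Counter)
    for word in words:
        padded = "^" + word + "$"  # hypothetical markers: ^ = start, $ = end
        for current, nxt in zip(padded, padded[1:]):
            counts[current][nxt] += 1
    return counts


def sample(counts, rng, max_len=20):
    """Draw one new word by sampling one character at a time."""
    out = []
    current = "^"
    for _ in range(max_len):
        options = counts[current]
        # Pick the next character in proportion to how often it was seen.
        nxt = rng.choices(list(options), weights=list(options.values()))[0]
        if nxt == "$":
            break
        out.append(nxt)
        current = nxt
    return "".join(out)


# Toy training data (an assumption; any list of strings would do).
corpus = ["banana", "bandana", "cabana", "canal", "nab"]
model = learn_distribution(corpus)

rng = random.Random(0)  # fixed seed so runs are repeatable
samples = [sample(model, rng) for _ in range(5)]
print(samples)
```

Swap the word list for sentences, pixels, or audio frames, and swap the counting table for a large neural network, and the two steps (fit, then sample) stay the same.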

This is why progress in one modality keeps spilling into the others. The transformer architecture that powered language models turned out to work for images and audio too. The lesson learned with text (scale plus the right architecture beats clever hand-engineering) replayed in every other modality, a few years behind.

Generation vs understanding

Two directions matter and are easy to confuse. Understanding goes from rich input to a compact answer: an image to a caption, audio to a transcript, a video to a summary. Generation goes the other way: a prompt to an image, text to speech, a description to a video. The same underlying models often do both, but the engineering, the cost, and the risks differ sharply between them.
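The two directions can be sketched with one toy model, assuming a character-pair counting model as a stand-in for a real network. Generation samples a rich output from the learned distribution; understanding here means scoring an input under each model and returning a compact label. Everything named below (fit, generate, log_likelihood, classify, the two word lists, the add-one smoothing, and the alphabet size of 27) is a hypothetical choice for illustration, not a description of how production systems work.

```python
import math
import random
from collections import Counter, defaultdict


def fit(words):
    """Learn character-pair counts; ^ and $ are assumed boundary markers."""
    counts = defaultdict(Counter)
    for word in words:
        padded = "^" + word + "$"
        for a, b in zip(padded, padded[1:]):
            counts[a][b] += 1
    return counts


def generate(counts, rng, max_len=12):
    """Generation: almost no input (just randomness) to a rich output."""
    out, current = [], "^"
    for _ in range(max_len):
        row = counts[current]
        nxt = rng.choices(list(row), weights=list(row.values()))[0]
        if nxt == "$":
            break
        out.append(nxt)
        current = nxt
    return "".join(out)


def log_likelihood(counts, word, alphabet_size=27):
    """Score a word under the model, with add-one smoothing (an assumption)."""
    padded = "^" + word + "$"
    total = 0.0
    for a, b in zip(padded, padded[1:]):
        row = counts.get(a, Counter())
        total += math.log((row[b] + 1) / (sum(row.values()) + alphabet_size))
    return total


def classify(models, word):
    """Understanding: rich input to a compact answer (one label)."""
    return max(models, key=lambda name: log_likelihood(models[name], word))


# Two made-up "styles" of word, each with its own learned distribution.
models = {
    "a-words": fit(["banana", "cabana", "bandana"]),
    "o-words": fit(["bonobo", "rococo", "voodoo"]),
}

rng = random.Random(1)
print("generated:", generate(models["o-words"], rng))
print("label:", classify(models, "nabana"))
```

The same counts drive both functions, which mirrors the point above: one learned distribution can serve both directions, even though sampling and scoring are separate pieces of engineering.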

Why it all happened at once

Generative AI across modalities seemed to erupt suddenly in the early 2020s, but the eruption was the meeting of three slow trends: enough data (the internet's text, images, audio, and video), enough compute (GPUs, whose massively parallel arithmetic suits exactly this kind of math), and a couple of architectural unlocks, chiefly the transformer and, for images, diffusion models. None was new magic; together they crossed a usefulness threshold.

Knowing this keeps you grounded. The capabilities are real and improving fast. But each modality is at a different point on the curve (text and image are mature, video and 3D are earlier and rougher), and the gap between a stunning demo and a reliable product is, as always, the whole story.

What this course covers

We'll open the box on image generation first (how diffusion actually works and how you steer it), then move to audio and music, then video and 3D, then the multimodal models that fuse vision and language, and finish with the risks that come with machines that can fabricate convincing media. The fundamentals course is useful background but not required; this one stands on its own.

In one line each

Every generative model does one thing: learn the distribution of some data, then sample new examples from it.
Progress spills across modalities because the same architectures (especially the transformer) keep working on new data types.
Understanding (rich input → compact answer) and generation (prompt → rich output) differ sharply in engineering, cost, and risk.
Each modality is at a different point on the curve (text and image mature, video and 3D earlier), and demos still outrun reliable products.

Where to go next

Chapter 2: How image generation works