Video & 3D

“A flipbook only works if every page agrees with the last. That agreement is the hard part.”

Video and 3D add the hard part: consistency across frames and views.

Why video is so much harder than images

A video is not just many images. It's many images that must agree. The same object has to stay the same shape, colour, and identity across every frame; motion has to be physically plausible; lighting has to stay consistent. This temporal consistency is the core challenge, and it's why a model can nail a single photorealistic frame but produce a video where faces morph, objects flicker, and physics drifts.
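One crude way to make "frames must agree" concrete is to measure how much pixels change from one frame to the next. The sketch below is an illustrative proxy only, not how any particular model or benchmark scores consistency: the names `solid_frame`, `mean_abs_diff`, and `flicker_scores` are hypothetical, and frames are assumed to be small grayscale grids with intensities between 0 and 1. Real motion also changes pixels, so a high score flags candidates for flicker rather than proving it.

```python
from typing import List, Sequence

# Assumption: a frame is a grayscale grid, rows of intensities in [0, 1].
Frame = Sequence[Sequence[float]]


def solid_frame(value: float, size: int = 2) -> List[List[float]]:
    """Build a size x size frame filled with a single intensity."""
    return [[value] * size for _ in range(size)]


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Average absolute per-pixel difference between two same-sized frames."""
    total = 0.0
    count = 0
    for row_a, row_b in zip(a, b):
        for pixel_a, pixel_b in zip(row_a, row_b):
            total += abs(pixel_a - pixel_b)
            count += 1
    return total / count


def flicker_scores(frames: Sequence[Frame]) -> List[float]:
    """One score per consecutive frame pair; 0.0 means the pair is identical."""
    return [mean_abs_diff(frames[i], frames[i + 1]) for i in range(len(frames) - 1)]


# A static shot versus one whose brightness jumps on every other frame.
steady = [solid_frame(0.5) for _ in range(4)]
flickery = [solid_frame(0.5 + 0.2 * (i % 2)) for i in range(4)]

print(flicker_scores(steady))
print([round(s, 3) for s in flicker_scores(flickery)])
```

The static clip scores zero on every pair, while the alternating clip scores the same nonzero value on every pair, which is the signature of flicker rather than of smooth motion.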

It's also vastly more expensive. At typical frame rates of 24 to 30 frames per second, even a few seconds of video runs to a hundred or more frames, each comparable in cost to a single image, plus the work of keeping them coherent. The compute and the consistency problem compound, which is why video generation trails image generation by a few years in maturity.
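The arithmetic behind that cost can be sketched in a few lines. This is a back-of-envelope model under stated assumptions, not a measurement: it treats every frame as costing one "image unit", and `coherence_overhead` is a hypothetical placeholder for the extra work of keeping frames consistent. Real systems may not generate frames independently, so actual costs can differ substantially.

```python
def frame_count(seconds: float, fps: int = 24) -> int:
    """Number of frames in a clip of the given length at the given frame rate."""
    return round(seconds * fps)


def naive_relative_cost(seconds: float, fps: int = 24,
                        coherence_overhead: float = 0.0) -> float:
    """Clip cost in 'single image' units.

    Assumption: each frame costs as much as one image, and coherence adds a
    fractional overhead on top (0.5 means 50% extra). Both are illustrative.
    """
    return frame_count(seconds, fps) * (1.0 + coherence_overhead)


print(frame_count(5))                    # 5 seconds at 24 fps
print(frame_count(10, fps=30))           # 10 seconds at 30 fps
print(naive_relative_cost(5, 24, 0.5))   # 5 seconds with a made-up 50% overhead
```

Even with no overhead at all, a five-second clip at 24 frames per second costs as much as 120 images under this model, which is why short clips became practical first.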

Where video generation actually is

The trajectory is fast and real: text-to-video has gone from a few flickering seconds to clips of impressive coherence and length in a short span. The honest status, though, is that it's strongest for short, self-contained clips and weakest at exactly what professional video needs (precise control, long duration, consistent characters across scenes, and reliable physics).

3D and the worlds beyond flat images

Generating 3D (models, scenes, environments) is earlier still, though it is already genuinely useful in specific niches (games, product visualisation, virtual production). The challenges echo video's: consistency, now across viewpoints rather than across time, plus a scarcity of training data, since the world has far fewer 3D models than 2D images.
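Consistency across viewpoints has a simple geometric meaning: every view must be a projection of the same underlying 3D shape. The sketch below illustrates this with a basic pinhole camera and a turntable setup, where rotating the object stands in for orbiting the camera. The function name `project_turntable` and the default `distance` and `focal` values are assumptions chosen for illustration, not taken from any specific tool.

```python
import math
from typing import Tuple


def project_turntable(point: Tuple[float, float, float],
                      angle_deg: float,
                      distance: float = 4.0,
                      focal: float = 1.0) -> Tuple[float, float]:
    """Project a 3D point to 2D after rotating it about the vertical (Y) axis.

    Assumptions: a pinhole camera sits `distance` units in front of the
    turntable centre and looks straight at it.
    """
    x, y, z = point
    a = math.radians(angle_deg)
    # Rotate the point about the Y axis by the turntable angle.
    x_rot = x * math.cos(a) + z * math.sin(a)
    z_rot = -x * math.sin(a) + z * math.cos(a)
    depth = z_rot + distance
    if depth <= 0:
        raise ValueError("point is behind the camera")
    # Pinhole projection: image position shrinks with depth.
    return (focal * x_rot / depth, focal * y / depth)


# One fixed 3D point fully determines where it appears in every view.
corner = (1.0, 0.5, 0.0)
for angle in (0, 30, 60, 90):
    u, v = project_turntable(corner, angle)
    print(angle, round(u, 3), round(v, 3))
```

A generator that paints each view independently has no such shared point to project, so details can land in positions that no single 3D shape could produce. That mismatch is the viewpoint analogue of flicker in video.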

Approaches range from reconstructing 3D from multiple photos (techniques like neural radiance fields and, more recently, 3D Gaussian splatting, which build a navigable 3D scene from images) to generating 3D assets directly from text or images. It's a fast-moving, specialised area: promising, but not yet a general-purpose push-button tool.

What to expect, and when

For a builder or decision-maker: treat video and 3D as high-potential, early-stage capabilities. There are real uses today for short clips, b-roll, concepting, previsualisation, and specific 3D niches. But anything requiring precise, consistent, controllable, long-form output is still rough, and a demo's spectacle should not be read as production reliability. This is the corner of generative AI where the build-vs-buy-vs-wait judgment most often lands on "wait and watch."

In one line each

Video is harder than images because frames must agree. Temporal consistency of identity, motion, and physics is the core challenge.
It's also far more expensive (hundreds of frames plus coherence), so video generation trails images in maturity.
Status: strong for short clips, weak at long, controllable, consistent professional output. Judge the boring parts (control, consistency, reliability), not the spectacle.
3D is earlier still and niche; treat video and 3D as high-potential early-stage capabilities, often a "wait and watch."

Where to go next

Chapter 6: Multimodal models