Chapter 06 · 11 min

Multimodal models

The most consequential shift isn't better image or audio models in isolation. It's models that handle several modalities at once, connecting what they see to what they read to what they hear. This chapter explains how a model bridges modalities, and why that changes what's buildable.


Teach two languages in the same classroom and they start finishing each other's sentences.

The shared-space idea

The key that unlocks multimodal AI is mapping different kinds of data into the same space. Recall that a language model turns words into vectors so that similar meanings sit close together. Now do the same for images, with one crucial addition: train so that an image and its description land near each other in the same space. The picture of a dog and the words "a dog" become neighbours.

[Figure: Text and images in one shared space. An image of a dog and the words "a dog" are each mapped by a separate encoder (image encoder, text encoder) into the same vector space, where they land close together.]
Separate encoders map an image and matching text into one shared space, where they land close together. That shared geometry is the bridge.
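One common way to train for this "land near each other" behaviour is a contrastive objective: within a batch of matched image and text pairs, pull each true pair together and push every mismatched pairing apart. The sketch below is a minimal, illustrative version in plain Python. The vectors are made-up stand-ins for encoder outputs, and the function names and the temperature value are assumptions for illustration, not taken from any particular model's code.

```python
import math

def normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_entropy(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target]

def contrastive_loss(image_vecs, text_vecs, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched (image, text) pairs.

    Pair i is the positive; every other pairing in the batch is a negative.
    The temperature is an assumed value that sharpens the softmax.
    """
    imgs = [normalize(v) for v in image_vecs]
    txts = [normalize(v) for v in text_vecs]
    n = len(imgs)
    # sim[i][j] = scaled cosine similarity between image i and text j
    sim = [[dot(imgs[i], txts[j]) / temperature for j in range(n)] for i in range(n)]
    image_to_text = sum(cross_entropy(sim[i], i) for i in range(n)) / n
    text_to_image = sum(
        cross_entropy([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (image_to_text + text_to_image) / 2

# Hypothetical 3-d encoder outputs for two pairs:
# (dog photo, "a dog") and (car photo, "a car").
aligned_images = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9]]
aligned_texts = [[1.0, 0.0, 0.1], [0.1, 0.1, 1.0]]
swapped_texts = [aligned_texts[1], aligned_texts[0]]  # deliberately mismatched

loss_aligned = contrastive_loss(aligned_images, aligned_texts)
loss_swapped = contrastive_loss(aligned_images, swapped_texts)
print(loss_aligned < loss_swapped)
```

The loss is low when each image sits closest to its own description and high when the pairs are scrambled. Training nudges both encoders to reduce it, which is what produces the shared geometry.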

This is what models like CLIP (around 2021) demonstrated, and it's the quiet engine behind a huge amount of multimodal AI: text-to-image guidance, image search by description, zero-shot image classification. Once pictures and words share a geometry, you can move between them.
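To make "move between them" concrete, here is a small sketch of zero-shot classification and image search by description, both reduced to nearest-neighbour lookups in the shared space. The embeddings are invented three-dimensional stand-ins for what an image encoder and a text encoder would produce, and the file names, prompts, and function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical text-encoder outputs for candidate label prompts.
label_vecs = {
    "a photo of a dog": [0.9, 0.1, 0.1],
    "a photo of a cat": [0.1, 0.9, 0.1],
    "a photo of a car": [0.1, 0.1, 0.9],
}

# Hypothetical image-encoder outputs, keyed by file name.
image_vecs = {
    "img_001.jpg": [0.8, 0.3, 0.0],
    "img_002.jpg": [0.0, 0.2, 0.9],
    "img_003.jpg": [0.2, 0.9, 0.2],
}

def zero_shot_classify(image_vec, label_vecs):
    """Pick the label whose text embedding is closest to the image embedding."""
    return max(label_vecs, key=lambda label: cosine(image_vec, label_vecs[label]))

def search_images(query_vec, image_vecs, top_k=2):
    """Rank images by similarity to a text query embedding."""
    ranked = sorted(
        image_vecs,
        key=lambda name: cosine(query_vec, image_vecs[name]),
        reverse=True,
    )
    return ranked[:top_k]

predicted = zero_shot_classify(image_vecs["img_001.jpg"], label_vecs)
hits = search_images(label_vecs["a photo of a car"], image_vecs, top_k=1)
print(predicted, hits)
```

Neither function needs a classifier trained on those labels. The labels are just text, embedded into the same space as the images, which is why the approach is called zero-shot.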

[Figure: Embedding arithmetic. Four word-points (man, woman, king, queen) in a 2D projection of embedding space. The vector from "man" to "woman" is parallel to the vector from "king" to "queen", visualising the famous king − man + woman ≈ queen relationship.]
The same geometry-of-meaning idea from language, now spanning modalities: related concepts sit close, whatever form they arrived in.
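The king and queen relationship can be reproduced with a toy example. The 2D vectors below are hand-picked so the offsets line up exactly. Real learned embeddings have hundreds of dimensions and only approximate this, so treat it as an illustration of the geometry rather than of any real model.

```python
import math

# Hand-made 2D vectors (not learned): axis 0 ~ "royalty", axis 1 ~ "gender".
words = {
    "man": [1.0, 1.0],
    "woman": [1.0, 3.0],
    "king": [3.0, 1.0],
    "queen": [3.0, 3.0],
    "apple": [-2.0, 0.0],
}

def analogy(a, b, c, vocab):
    """Return the word nearest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vocab[a], vocab[b], vocab[c])]
    candidates = [w for w in vocab if w not in (a, b, c)]
    return min(candidates, key=lambda w: math.dist(vocab[w], target))

# The man -> woman offset matches the king -> queen offset (parallel vectors).
offset_people = [w - m for m, w in zip(words["man"], words["woman"])]
offset_royals = [q - k for k, q in zip(words["king"], words["queen"])]

result = analogy("king", "man", "woman", words)
print(offset_people == offset_royals, result)
```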

Vision-language models: models that see and talk

Modern frontier models are increasingly multimodal natively: you can show them an image and ask questions about it, hand them a chart and have them read it, point a camera and get a description. Under the hood, the image is encoded into the same representation the language model consumes, so the model reasons over pictures and text together rather than treating them as separate systems.
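One common way to do this, sketched below, is to cut the image into patches, run them through an image encoder, and pass each patch feature through a learned projection so it has the same width as the language model's token embeddings. Image "tokens" and text tokens then sit in one sequence. This is an assumed design for illustration, since architectures differ, and every dimension, name, and weight here is a made-up stand-in.

```python
import random

random.seed(0)

VISION_DIM = 6  # hypothetical width of image-encoder patch features
MODEL_DIM = 4   # hypothetical width of the language model's token embeddings

def random_vector(size):
    return [random.uniform(-1, 1) for _ in range(size)]

def random_matrix(rows, cols):
    return [random_vector(cols) for _ in range(rows)]

def project(vec, weights):
    """Linear map from len(weights) inputs to len(weights[0]) outputs."""
    return [sum(v * w for v, w in zip(vec, column)) for column in zip(*weights)]

# Stand-ins for learned parameters; a real model trains these.
projection = random_matrix(VISION_DIM, MODEL_DIM)
token_table = {tok: random_vector(MODEL_DIM) for tok in ["what", "is", "this", "?"]}

def build_input_sequence(patch_features, prompt_tokens):
    """Project image patches to model width, then prepend them to the text tokens."""
    image_tokens = [project(p, projection) for p in patch_features]
    text_tokens = [token_table[t] for t in prompt_tokens]
    return image_tokens + text_tokens

patch_features = random_matrix(3, VISION_DIM)  # pretend output for three patches
sequence = build_input_sequence(patch_features, ["what", "is", "this", "?"])
print(len(sequence), len(sequence[0]))
```

Once every element of the sequence has the same width, the language model can attend across image and text positions alike, which is what "reasoning over pictures and text together" means mechanically in this kind of design.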

This is genuinely powerful and broadly useful: reading documents and forms, describing images for accessibility, visual question answering, understanding screenshots and diagrams. It's also where understanding (reading an image) and generation (making one) increasingly live in one model — though the same cautions about confident wrongness apply, now to what the model claims to see.

Any-to-any: the direction of travel

The trajectory is toward models that take any modality in and produce any modality out — read a document and answer aloud, watch a video and write a summary, hear a question and draw a diagram. We're partway there: text-plus-vision is common, audio is increasingly integrated, full any-to-any is emerging. The shared-space idea is what makes it conceivable at all.

For builders, the practical upshot is that you can increasingly assume one model can handle mixed input. Many document pipelines no longer need a separate OCR step feeding a text model; a multimodal model can read the page directly. That simplification (fewer brittle stages) is quietly one of the bigger near-term wins of multimodal AI.

In one line each

  • Multimodal AI works by mapping different data types into one shared space where an image and its description land close together.
  • Models like CLIP demonstrated this; it powers text-to-image guidance, image search by description, and zero-shot classification.
  • Vision-language models reason over images and text together — powerful for documents, accessibility, and visual Q&A — with the same confident-error caution.
  • The direction is any-to-any; the near-term win is simpler pipelines (one model reads the page, no separate OCR stage).