Chapter 03 · 13 min

Retrieval done right

Retrieval-augmented generation (RAG) is the most useful, lowest-risk pattern in applied AI, and the one most often built badly. The naive version demos beautifully and fails in production for a handful of predictable reasons. This chapter covers those reasons and their fixes.

Don't make the model memorise the library. Hand it the three pages it needs, opened to the right paragraph.

What RAG actually is

RAG separates what the system knows from what the system says. Knowledge lives in an index you control and can update hourly. The model only supplies language and reasoning over whatever you put in front of it. When a question comes in, you search the index, retrieve the most relevant passages, and put them in the prompt as context.

[Figure: RAG pipeline, left to right. A user query goes to search (vector + BM25) over the index; the top chunks from your docs are added to the prompt (query + context) and sent to the LLM, which produces an answer grounded in the retrieved context.]
The model answers from the retrieved chunks, not from its training data. You update the index, never the model.
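
The last step of that flow, putting retrieved passages in the prompt as context, can be sketched as below. This is a minimal illustration, not a prescribed format: the `Chunk` type, the `build_prompt` helper, the chunk ids, and the instruction wording are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    chunk_id: str  # hypothetical identifier for a passage in the index
    text: str


def build_prompt(query: str, chunks: list[Chunk]) -> str:
    """Place the retrieved passages in front of the model as context."""
    # Label each passage with its id so the source stays traceable.
    context = "\n\n".join(f"[{c.chunk_id}] {c.text}" for c in chunks)
    return (
        "Answer the question using only the context below.\n"
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


# Illustrative data only; in a real system these come from the search step.
retrieved = [
    Chunk("refunds-03", "Refunds are issued within 14 days of a return."),
    Chunk("refunds-07", "Gift cards are not refundable."),
]
prompt = build_prompt("How long do refunds take?", retrieved)
```

Updating what the system knows then means changing what the search step returns; the prompt template and the model stay the same.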

This is the right answer to almost every "chat with our docs," "support assistant," or "internal knowledge" problem. It's cheaper than fine-tuning, updatable in real time, and — crucially — auditable: you can show exactly which source the answer came from.

The naive pipeline, and why it breaks

The demo version: split every document into fixed 500-token chunks, embed each, store in a vector database, embed the query, take the five nearest chunks, stuff them in the prompt. It works on a tidy FAQ and falls apart on real document sets. Here's where:

  • Bad chunking — the answer is split across two chunks, so neither alone is sufficient.
  • Retrieval miss — the relevant passage isn't in the top results because the query and the document use different words.
  • Wrong embedding model — your domain (legal, medical, internal jargon) isn't well represented, so "similar" vectors aren't actually similar.
  • Ignored context — the model has the right passage but answers from its training data anyway.
  • Stale index — the document changed; the index didn't.
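
The first failure, bad chunking, is easy to reproduce. The sketch below splits on a fixed word count (words stand in for tokens, and the tiny chunk size is chosen only to make the effect visible); the `fixed_chunks` helper and the sample sentence are invented for illustration.

```python
def fixed_chunks(text: str, size: int) -> list[str]:
    """Split text into consecutive chunks of `size` words (words stand in for tokens)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


doc = (
    "To reset the router hold the recessed button "
    "for ten seconds until the light blinks amber"
)
chunks = fixed_chunks(doc, size=8)

# The boundary falls between the action and its duration, so a query like
# "how long do I hold the button?" finds no single chunk with the full answer.
for i, chunk in enumerate(chunks):
    print(i, repr(chunk))
```

One chunk mentions the button and the other mentions the duration, which is exactly the "neither alone is sufficient" case.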

The pattern that ships: hybrid + re-rank + cite

Three additions fix most production RAG. First, hybrid search: combine vector similarity with old-fashioned keyword search (BM25). Vector search catches meaning; keyword search catches exact terms (error codes, names, SKUs) that vectors blur. Run both, merge the results.
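
The chapter does not prescribe how to merge the two result lists. One common option is reciprocal rank fusion (RRF), sketched below under the assumption that each search returns an ordered list of chunk ids. The ids are invented, and `k = 60` is a conventional default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked id lists: each list contributes 1 / (k + rank) per id."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


vector_hits = ["c7", "c2", "c9"]   # hypothetical vector-search results
keyword_hits = ["c2", "c4", "c7"]  # hypothetical BM25 results
merged = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

RRF uses only ranks, so it avoids having to put vector similarities and BM25 scores on a common scale. Chunks found by both searches (here `c2` and `c7`) rise above chunks found by only one.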

Second, re-ranking. Cast a wide net, say fifty candidates, then score them with a more precise (and more expensive) re-ranker, and keep only the best five for the prompt. You get the recall of a wide search with the precision of a careful one.

[Figure: Retrieve wide, then re-rank to a few. Three shrinking boxes, left to right: retrieve wide (top 50), re-rank (cross-encoder), keep best (top 5).]
Cast a wide net for recall, then re-rank for precision. The model only sees the few passages most likely to contain the answer.
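
The retrieve-wide, keep-few shape can be sketched as follows. The scoring function here is a deliberately crude word-overlap stand-in, not a real re-ranker; in practice `score` would call a cross-encoder or similar model. All names and sample passages are assumptions for the example.

```python
from typing import Callable


def retrieve_then_rerank(
    query: str,
    candidates: list[str],
    score: Callable[[str, str], float],
    keep: int = 5,
) -> list[str]:
    """Score every candidate against the query and keep the best `keep`."""
    ranked = sorted(candidates, key=lambda passage: score(query, passage), reverse=True)
    return ranked[:keep]


def toy_score(query: str, passage: str) -> float:
    """Stand-in for a real re-ranker: fraction of query words found in the passage."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0


# In a real pipeline this list would hold the wide first-pass results.
candidates = [
    "shipping times vary by region",
    "error E42 means the cartridge is empty",
    "contact support for error codes",
    "the cartridge can be replaced in under a minute",
]
top = retrieve_then_rerank("what does error E42 mean", candidates, toy_score, keep=2)
```

Passing the scorer in as a parameter keeps the pipeline shape independent of whichever re-ranking model is eventually used.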

Third, citations. Have the model cite the id of the chunk it used for each claim. This makes the answer auditable, lets you show sources in the UI, and measurably reduces how far the model drifts from the retrieved text.
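
Citations are only auditable if they are checked. A minimal sketch, assuming the model is instructed to write chunk ids in square brackets (that convention, the helper names, and the sample answer are all invented for illustration):

```python
import re


def cited_ids(answer: str) -> set[str]:
    """Extract chunk ids written as [id] in the model's answer."""
    return set(re.findall(r"\[([A-Za-z0-9_-]+)\]", answer))


def check_citations(answer: str, retrieved_ids: set[str]) -> tuple[set[str], set[str]]:
    """Return (valid, unknown): cited ids that were retrieved, and cited ids that were not."""
    cited = cited_ids(answer)
    return cited & retrieved_ids, cited - retrieved_ids


answer = "Refunds take up to 14 days [refunds-03]. Gift cards are excluded [refunds-99]."
valid, unknown = check_citations(answer, {"refunds-03", "refunds-07"})
```

An id in `unknown` means the model cited something it was never shown, which is a useful signal to log or to block before the answer reaches the UI.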

When retrieval is the wrong tool

RAG answers questions whose answer is written down somewhere. It does not help with questions that require reasoning over the whole corpus ("what are the three biggest themes across these 10,000 tickets?") or computation ("what's our average resolution time?"). Those want aggregation, analytics, or tools (covered in the next chapter), not retrieval.

[Figure: Retrieval versus reasoning. A 2-by-2 quadrant; the horizontal axis runs from retrieval-heavy to reasoning-heavy, the vertical axis from easy to hard. "Name a capital" and "summarise this article" sit in the easy retrieval corner; "long-horizon plan" sits in the hard reasoning corner. Also plotted: "solve this puzzle" and "write a function".]
Retrieval shines on "find the passage that answers this." It struggles as the task moves toward reasoning over everything at once.
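
The computation case makes the distinction concrete. "What's our average resolution time?" is answered by aggregating over every record, not by finding one passage. A minimal sketch, with hypothetical ticket records and field names:

```python
from statistics import mean

# Hypothetical ticket records; the field names are illustrative only.
tickets = [
    {"id": 1, "resolution_hours": 4.0},
    {"id": 2, "resolution_hours": 10.0},
    {"id": 3, "resolution_hours": 1.0},
]

# The answer depends on every ticket, so no top-five retrieval can produce it.
average_resolution = mean(t["resolution_hours"] for t in tickets)
```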

In one line each

  • RAG separates what the system knows (the index) from what it says (the model). Update the index, not the model.
  • Naive RAG breaks on chunking, retrieval misses, wrong embeddings, ignored context, and stale data — all invisible without evals.
  • The pattern that ships: hybrid search + re-rank + citations.
  • RAG answers "find the passage," not "reason over everything" or "compute a number" — those need tools.
Retrieval done right · AI courses · SDEN