RAG: retrieval-augmented generation pipeline

“Don't make the model memorise the library. Hand it the three pages it needs, opened to the right paragraph.”

What RAG actually is

RAG separates what the system knows from what the system says. Knowledge lives in an index you control and can update hourly. The model only supplies language and reasoning over whatever you put in front of it. When a question comes in, you search the index, retrieve the most relevant passages, and put them in the prompt as context.

The model answers from the retrieved chunks, not from its training data. You update the index, never the model.
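
The last step, putting retrieved passages in front of the model, can be sketched in a few lines. This is a minimal illustration, not a prescribed prompt format: the Passage type, the build_prompt helper, the instruction wording, and the sample passage are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    source: str  # where the passage came from, e.g. a file name (hypothetical)
    text: str


def build_prompt(question: str, passages: list[Passage]) -> str:
    """Place the retrieved passages ahead of the question as context."""
    context = "\n\n".join(f"[{p.source}]\n{p.text}" for p in passages)
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Made-up sample data, standing in for whatever the index returned.
passages = [Passage("refund-policy.txt", "Refunds are issued within 14 days.")]
prompt = build_prompt("How long do refunds take?", passages)
```

Everything the model should know about refunds is in the prompt string. Changing the answer means changing the indexed passage, not the model.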

This is the right answer to almost every "chat with our docs," "support assistant," or "internal knowledge" problem. It's cheaper than fine-tuning, updatable in real time, and crucially auditable: you can show exactly which source the answer came from.

The naive pipeline, and why it breaks

The demo version: split every document into fixed 500-token chunks, embed each, store in a vector database, embed the query, take the five nearest chunks, stuff them in the prompt. It works on a tidy FAQ and falls apart on real document sets. Here's where:
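
For reference, the demo version fits in a few lines. This is a sketch under loud assumptions: embed here is a toy bag-of-words counter standing in for a real embedding model, "tokens" are approximated by whitespace-separated words, and a plain list stands in for the vector database.

```python
import math
from collections import Counter


def chunk_fixed(text: str, size: int = 500) -> list[str]:
    """Split text into consecutive chunks of `size` whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, chunks: list[str], k: int = 5) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


# Tiny made-up document; size=7 keeps the example readable.
doc = "the printer is on floor two . the vpn password resets every ninety days ."
chunks = chunk_fixed(doc, size=7)
best = top_k("vpn password resets", chunks, k=1)
```

Note that the chunker cuts purely by count, with no regard for sentence or section boundaries. That is exactly the property that causes trouble on real documents.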

Bad chunking: the answer is split across two chunks, so neither alone is sufficient.
Retrieval miss: the relevant passage isn't in the top results because the query and the document use different words.
Wrong embedding model: your domain (legal, medical, internal jargon) isn't well represented, so "similar" vectors aren't actually similar.
Ignored context: the model has the right passage but answers from its training data anyway.
Stale index: the document changed; the index didn't.
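
The last failure, a stale index, is the easiest to detect mechanically. One possible check, offered as an assumption rather than a standard recipe, is to store a content hash per document at indexing time and compare it against the current text. The function names and the sample documents below are hypothetical.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def stale_documents(current_docs: dict[str, str], indexed_hashes: dict[str, str]) -> list[str]:
    """Return ids of documents that are new or changed since they were indexed."""
    return sorted(
        doc_id
        for doc_id, text in current_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    )


# Hashes recorded when the index was last built (made-up data).
indexed = {
    "faq": content_hash("Refunds take 14 days."),
    "setup": content_hash("Install v1."),
}
# What the source documents say now.
current = {
    "faq": "Refunds take 14 days.",
    "setup": "Install v2.",
    "pricing": "Plans start at $10.",
}
stale = stale_documents(current, indexed)
```

This sketch flags changed and new documents only. Documents deleted from the source would need the reverse comparison.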

The pattern that ships: hybrid + re-rank + cite

Three additions fix most production RAG. First, hybrid search: combine vector similarity with old-fashioned keyword search (BM25). Vector search catches meaning; keyword search catches exact terms (error codes, names, SKUs) that vectors blur. Run both, merge the results.
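
"Merge the results" leaves the method open. One common choice is reciprocal rank fusion (RRF), which only needs each retriever's ranked list of ids, not comparable scores. The sketch below assumes that choice. The chunk ids are made up, and k=60 is the conventional default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked id lists: each list contributes 1 / (k + rank) per id."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


vector_hits = ["c3", "c1", "c7"]   # hypothetical vector-search ranking
keyword_hits = ["c9", "c3", "c1"]  # hypothetical BM25 ranking
merged = reciprocal_rank_fusion([vector_hits, keyword_hits])
# merged == ["c3", "c1", "c9", "c7"]
```

Chunks that appear in both lists rise to the top, while a chunk found by only one retriever still survives into the merged list.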

Second, re-ranking. Retrieve a wide net (fifty candidates), then score them with a more precise (and more expensive) re-ranker, and keep only the best five for the prompt. You get the recall of a wide search with the precision of a careful one.

Cast a wide net for recall, then re-rank for precision. The model only sees the few passages most likely to contain the answer.
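
The two-stage shape can be sketched independently of any particular retriever or re-ranker. In the sketch below, retrieve and score are placeholders you would supply. The toy versions use a three-line corpus and word overlap, purely for illustration, where a real system would plug in its search backend and a re-ranking model.

```python
from typing import Callable


def retrieve_then_rerank(
    query: str,
    retrieve: Callable[[str, int], list[str]],
    score: Callable[[str, str], float],
    wide_k: int = 50,
    final_k: int = 5,
) -> list[str]:
    """Fetch wide_k candidates cheaply, then keep the final_k best by score."""
    candidates = retrieve(query, wide_k)
    ranked = sorted(candidates, key=lambda passage: score(query, passage), reverse=True)
    return ranked[:final_k]


# Toy stand-ins (assumptions for the example, not real components).
corpus = [
    "reset your password in settings",
    "error E42 means the disk is full",
    "contact support by email",
]


def toy_retrieve(query: str, n: int) -> list[str]:
    return corpus[:n]  # a real first stage would search, not slice


def toy_score(query: str, passage: str) -> float:
    shared = set(query.lower().split()) & set(passage.lower().split())
    return float(len(shared))


top = retrieve_then_rerank("what does error e42 mean", toy_retrieve, toy_score, wide_k=3, final_k=1)
```

The expensive scorer runs on at most wide_k passages per query, which is what keeps the second stage affordable.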

Third, citations. Have the model cite the id of the chunk it used for each claim. This makes the answer auditable, lets you show sources in the UI, and helps keep the model from drifting away from the retrieved text.
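
Citations are only useful if you check them. A minimal sketch, assuming you label each chunk with a bracketed id such as [c1] in the prompt and ask the model to repeat those ids in its answer. The id format, helper names, and sample answer are all hypothetical.

```python
import re

# Matches bracketed ids such as [c1] or [doc-42] (an assumed format).
CITATION = re.compile(r"\[([A-Za-z0-9_-]+)\]")


def format_context(chunks: dict[str, str]) -> str:
    """Label each chunk with its id so the model can cite it."""
    return "\n".join(f"[{cid}] {text}" for cid, text in chunks.items())


def check_citations(answer: str, chunks: dict[str, str]) -> tuple[list[str], list[str]]:
    """Split cited ids into those that were retrieved and those that were not."""
    cited = list(dict.fromkeys(CITATION.findall(answer)))  # de-duplicate, keep order
    valid = [cid for cid in cited if cid in chunks]
    unknown = [cid for cid in cited if cid not in chunks]
    return valid, unknown


chunks = {
    "c1": "Refunds are issued within 14 days.",
    "c2": "Shipping is free over $50.",
}
context = format_context(chunks)
answer = "Refunds take up to 14 days [c1]. Returns are free [c9]."  # made-up model output
valid, unknown = check_citations(answer, chunks)
```

Here the second claim cites an id that was never retrieved, which is a cheap signal that the model wandered off the provided text. The valid ids are what you would render as sources in the UI.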

When retrieval is the wrong tool

RAG answers questions whose answer is written down somewhere. It does not help with questions that require reasoning over the whole corpus ("what are the three biggest themes across these 10,000 tickets?") or computation ("what's our average resolution time?"). Those want aggregation, analytics, or tools (covered in the next chapter), not retrieval.

Retrieval shines on "find the passage that answers this." It struggles as the task moves toward reasoning over everything at once.
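
To make the computation case concrete: the average resolution time is not written in any single passage, so there is nothing to retrieve. It has to be computed over every record. A tiny sketch with made-up ticket data and a hypothetical resolution_hours field:

```python
from statistics import mean

# Made-up tickets; in practice these would come from your ticketing system.
tickets = [
    {"id": 1, "resolution_hours": 4.0},
    {"id": 2, "resolution_hours": 10.0},
    {"id": 3, "resolution_hours": 1.0},
]

# The answer depends on all rows, not on the five most similar ones.
average_hours = mean(t["resolution_hours"] for t in tickets)
```

Retrieving the top five tickets and asking the model to average them would give a number, but not the right one.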

In one line each

RAG separates what the system knows (the index) from what it says (the model). Update the index, not the model.
Naive RAG breaks on chunking, retrieval misses, wrong embeddings, ignored context, and stale data, all invisible without evals.
The pattern that ships: hybrid search + re-rank + citations.
RAG answers "find the passage," not "reason over everything" or "compute a number." Those need tools.

Where to go next

Chapter 4: Tools & function calling