Large language models are impressive, but they have a fundamental problem: they only know what was in their training data. Ask a model on its own, with no search or retrieval attached, about events from last week, your company's internal policies, or a document you just wrote, and it can't help: that information wasn't there when the model was trained.

Retrieval-Augmented Generation (RAG) addresses this. It lets AI systems answer questions about up-to-date or private information they were never trained on.

The Core Idea

RAG combines two things:

Retrieval — finding relevant documents from a knowledge base
Generation — using a language model to answer based on those documents

Instead of relying on knowledge baked into the model's parameters, you retrieve the relevant information at query time and hand it to the model as context.

The flow looks like this:

User question
     ↓
Search knowledge base → find relevant documents
     ↓
Inject documents into prompt
     ↓
LLM generates answer grounded in those documents
     ↓
Response to user

Why Not Just Fine-Tune the Model?

You might wonder: why not just train the model on your data? Fine-tuning is expensive and slow, and it has to be redone every time the data changes. Compared with fine-tuning, RAG is:

Cheaper — retrieval is fast; no expensive retraining
Updatable — add new documents to the knowledge base without touching the model
Transparent — you can show users which sources the answer came from
More accurate for factual recall — a fine-tuned model can still hallucinate facts it was trained on, whereas RAG anchors answers to retrieved text

Fine-tuning teaches a model new behaviors and styles. RAG gives a model access to new facts. They solve different problems.

How Retrieval Works: Embeddings and Vector Search

The most common retrieval approach uses embeddings — numerical representations of text that capture semantic meaning.

The idea: sentences with similar meanings should have similar embeddings, even if they use different words. "How do I cancel my subscription?" and "I want to stop paying for this service" should be close together in embedding space.
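
"Close together" is usually measured with cosine similarity. Here's a minimal sketch of that measure. The three-dimensional vectors are made up for illustration (they are not output from a real embedding model, which would return hundreds or thousands of dimensions), and the variable names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made toy vectors standing in for real embeddings (assumption).
cancel_subscription = [0.9, 0.1, 0.0]  # "How do I cancel my subscription?"
stop_paying = [0.8, 0.2, 0.1]          # "I want to stop paying for this service"
weather_today = [0.0, 0.1, 0.9]        # an unrelated sentence

similar = cosine_similarity(cancel_subscription, stop_paying)
unrelated = cosine_similarity(cancel_subscription, weather_today)
print(similar)    # close to 1
print(unrelated)  # close to 0
```

The two paraphrases score near 1 while the unrelated sentence scores near 0, which is the property retrieval relies on.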

Here's the process:

1. Indexing (done once, updated as data changes)

Take your documents (PDFs, web pages, database records, etc.)
Split them into chunks (paragraphs or segments of roughly 500 tokens)
Run each chunk through an embedding model to get a vector
Store the vectors in a vector database (Pinecone, Weaviate, pgvector, Chroma, etc.)

2. Retrieval (done at query time)

Take the user's question
Embed it using the same embedding model
Find the chunks whose vectors are most similar to the question vector (nearest neighbor search)
Return the top-k most relevant chunks
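
The retrieval step can be sketched as a brute-force scan over the stored vectors. This is an illustration only: the vectors are hand-made stand-ins for real embeddings, `top_k` and `index` are hypothetical names, and a production vector database would typically use an approximate index rather than comparing against every chunk.

```python
import heapq
import math

def cosine_similarity(a, b):
    """Cosine similarity, returning 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vector, index, k=3):
    """Return the k (score, chunk) pairs most similar to the query vector.

    `index` is a list of (chunk_text, vector) pairs built at indexing time.
    """
    scored = ((cosine_similarity(query_vector, vec), chunk) for chunk, vec in index)
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])

# Toy index: in a real system these vectors come from the embedding model.
index = [
    ("To cancel, open Settings > Billing and choose Cancel plan.", [0.9, 0.1, 0.0]),
    ("Refunds are issued within 5 business days.", [0.6, 0.4, 0.1]),
    ("Our office is closed on public holidays.", [0.0, 0.2, 0.9]),
]
query_vector = [0.8, 0.2, 0.1]  # pretend embedding of the user's question

results = top_k(query_vector, index, k=2)
for score, chunk in results:
    print(f"{score:.2f}  {chunk}")
```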

3. Generation

Insert the retrieved chunks into a prompt
Ask the LLM to answer the question based on those chunks
Return the answer (optionally with citations)

A typical prompt template for this step looks like this:

You are a helpful assistant. Answer the question using only
the provided context. If the answer isn't in the context,
say so.

Context:
[chunk 1]
[chunk 2]
[chunk 3]

Question: [user's question]
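
Filling in that template is plain string assembly. A minimal sketch, assuming the chunks arrive as a list of strings; `build_prompt` is a hypothetical helper, and numbering the chunks is my own addition so the model has something to cite. The call to the LLM itself is left out because it depends on the provider.

```python
def build_prompt(question, chunks):
    """Assemble a grounded prompt from retrieved chunks and the user's question."""
    # Number each chunk so the answer can refer back to its source (assumption).
    context = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "You are a helpful assistant. Answer the question using only\n"
        "the provided context. If the answer isn't in the context,\n"
        "say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

prompt = build_prompt(
    "How do I cancel my subscription?",
    ["To cancel, open Settings > Billing.", "Refunds take 5 business days."],
)
print(prompt)
```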

What's a Vector Database?

A vector database is designed to store high-dimensional vectors and answer nearest-neighbor queries efficiently: given a query vector, it returns the stored vectors most similar to it.

Popular options:

| Database | Notes |
|---|---|
| Pinecone | Managed, cloud-hosted, easy to start |
| Weaviate | Open source, supports hybrid search |
| Chroma | Lightweight, good for local dev |
| pgvector | PostgreSQL extension, if you already use Postgres |
| Qdrant | Open source, Rust-based, good performance |

For smaller knowledge bases, you don't even need a dedicated vector database — libraries like FAISS let you do similarity search in memory.

Chunking Strategy Matters

How you split documents significantly affects retrieval quality.

Too large — each chunk contains many topics; retrieved chunks will contain irrelevant information that confuses the model

Too small — chunks lack context; the model might get the relevant sentence but not enough surrounding information to answer properly

Common approaches:

Fixed-size chunks — split at N tokens with some overlap (e.g., 500 tokens with a 50-token overlap)
Semantic splitting — split at paragraph or section boundaries
Hierarchical chunking — store both summaries and full chunks; retrieve summaries first, then fetch the relevant full chunks
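
Fixed-size chunking with overlap is the easiest of these to show in code. A minimal sketch, assuming the document is already tokenized: here whitespace-separated words stand in for real tokenizer output, and `chunk_tokens` is a hypothetical helper name.

```python
def chunk_tokens(tokens, chunk_size=500, overlap=50):
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this chunk already reaches the end of the document
    return chunks

# Words stand in for tokens here; a real pipeline would use the model's tokenizer.
tokens = "one two three four five six seven eight nine ten".split()
chunks = chunk_tokens(tokens, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last token of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk when the overlap is large enough.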

Hybrid Search: Combining Semantic and Keyword Search

Pure semantic search has a weakness: exact terms. If a user searches for a specific error code, product ID, or proper noun, semantic similarity might not surface the exact match.

Hybrid search combines:

Dense retrieval (vector/semantic) — finds conceptually related content
Sparse retrieval (BM25/keyword) — finds exact keyword matches

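Results are merged and re-ranked, commonly with a technique called Reciprocal Rank Fusion (RRF), which combines documents by their rank in each result list rather than by raw scores that aren't comparable across retrievers. This is often more robust than either approach alone.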
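
Reciprocal Rank Fusion is short enough to sketch in full: each result list contributes 1 / (k + rank) to a document's score, and documents are sorted by the total. In this sketch the document IDs are made up, `reciprocal_rank_fusion` is a hypothetical helper name, and k = 60 is the commonly used default constant rather than something this article prescribes.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs into one list, best first.

    Each list adds 1 / (k + rank) to a document's score, with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["doc_a", "doc_b", "doc_c"]   # hypothetical vector-search results
sparse = ["doc_c", "doc_a", "doc_d"]  # hypothetical BM25 results
fused = reciprocal_rank_fusion([dense, sparse])
print(fused)
```

Documents that appear in both lists (`doc_a`, `doc_c`) rise above documents that only one retriever found.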

Re-ranking

After retrieving the top-k candidates, a re-ranker model scores each candidate for relevance to the query and re-orders them. Re-rankers are slower than embedding search but more accurate at judging relevance: a cross-encoder reads the query and document together, so it captures nuances that comparing two independently computed embeddings misses.
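
The shape of a re-ranking stage can be sketched with a pluggable scoring function. Note the loud assumption here: `toy_overlap_score` is a crude word-overlap stand-in, not a cross-encoder. In a real system you would pass a function that runs a cross-encoder model over each (query, document) pair. Both function names are hypothetical.

```python
def rerank(query, candidates, score_pair, top_n=3):
    """Re-order candidates by score_pair(query, candidate), highest first."""
    scored = [(score_pair(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]

def toy_overlap_score(query, doc):
    """Stand-in scorer: fraction of query words that appear in the document."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    return len(query_words & doc_words) / len(query_words) if query_words else 0.0

# Candidates as they might come back from first-stage retrieval.
candidates = [
    "Billing runs on the first day of each month",
    "Open billing settings to cancel your plan",
    "Our support team is available on weekdays",
]
best = rerank("cancel my plan", candidates, toy_overlap_score, top_n=2)
print(best)
```

Because the scorer is called once per candidate, this stage is normally applied only to the small top-k set that retrieval returns, not to the whole knowledge base.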

Real-World RAG Applications

Customer support bots that answer questions from documentation and knowledge bases
Internal search tools for company wikis, Notion, Confluence, Google Drive
Legal research assistants that query large document repositories
Code assistants that retrieve relevant function signatures and documentation
Medical assistants that ground responses in clinical guidelines
News chatbots that answer about recent events by searching articles

Common Failure Modes

Retrieval failures — the relevant chunk wasn't retrieved:

Query and document use different vocabulary (fix: improve chunking, add metadata, use hybrid search)
Chunk was too large and diluted the relevance signal

Generation failures — the right chunk was retrieved but the model still got it wrong:

Model ignored the context and relied on its training data (fix: tighten the prompt's instructions)
Context window too full; model didn't effectively process all chunks
Model hallucinated citations that look like the retrieved sources

Indexing failures — the knowledge base is incomplete or stale:

Documents weren't indexed
Data has changed since indexing

Evaluating a RAG System

Key metrics:

| Metric | What it measures |
|---|---|
| Retrieval precision | Of retrieved chunks, how many are actually relevant? |
| Retrieval recall | Of all relevant chunks, how many were retrieved? |
| Answer faithfulness | Is the generated answer grounded in the retrieved context? |
| Answer relevance | Does the answer actually address the question? |

Frameworks like RAGAS and TruLens automate RAG evaluation.
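
The two retrieval metrics reduce to set arithmetic once you have a labeled set of relevant chunks for a question. A minimal sketch, assuming chunks are identified by string IDs; the IDs and the helper name `retrieval_precision_recall` are made up for illustration. Faithfulness and answer relevance need a judge (human or model) and aren't shown.

```python
def retrieval_precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, given chunk IDs."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)  # relevant chunks that were retrieved
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall

precision, recall = retrieval_precision_recall(
    retrieved=["c1", "c2", "c3", "c4"],  # what the retriever returned
    relevant=["c2", "c4", "c7"],         # hand-labeled ground truth
)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the four retrieved chunks are relevant (precision 0.50), and two of the three relevant chunks were found (recall 0.67).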

The Bottom Line

RAG is the standard approach for giving language models access to information outside their training data, whether that's recent news, private company documents, or a knowledge base that changes frequently. By combining retrieval with generation, you get answers grounded in real, citable sources rather than in what the model remembers from training, which may be outdated or hallucinated. For most production AI applications that need to answer factual questions, RAG is the right starting point.