Retrieval Augmented Generation (RAG)
Embeddings, vector databases, document chunking, and building a RAG pipeline from scratch.
Grounding AI in Your Data
Large language models are incredibly capable, but they have a fundamental limitation: they only know what they learned during training. They can't access your company's internal documents, today's news, or any private data. Retrieval Augmented Generation (RAG) solves this by giving LLMs the ability to search through your data before generating a response.
RAG has become the most widely adopted pattern for building production AI applications. If you're building anything that needs to work with proprietary data — customer support bots, internal knowledge bases, document Q&A systems — RAG is almost certainly part of the solution.
What Is RAG?
RAG is a technique that augments an LLM's knowledge by retrieving relevant documents from an external knowledge base and including them in the prompt context. The term was coined in a 2020 paper by Lewis et al. at Meta AI.
The core idea is simple: instead of expecting the model to know everything, you give it the relevant information at query time.
RAG in a nutshell:
User asks: "What is our company's parental leave policy?"

Without RAG: the LLM responds with generic information about parental leave, potentially hallucinating specific details about your company.

With RAG:
1. The system searches your HR document database
2. Retrieves the relevant policy document
3. Includes it in the prompt: "Based on this document: [POLICY TEXT]"
4. The LLM generates an accurate answer grounded in your actual policy
Why RAG Matters
- Reduces hallucination: The model generates answers from retrieved facts rather than relying on its parametric memory.
- Keeps information current: Update your document store anytime — no model retraining needed.
- Works with private data: Your proprietary data never needs to be part of the model's training set.
- Provides citations: You can trace every answer back to its source document, enabling verification and trust.
- Cost-effective: Far cheaper and faster than fine-tuning a model on your data.
Embeddings Explained
At the heart of RAG is a concept called embeddings — numerical representations of text that capture semantic meaning. An embedding model converts text into a high-dimensional vector (a list of numbers, typically 768 to 3072 dimensions) where similar meanings map to nearby points in vector space.
How embeddings capture meaning:
```
"The cat sat on the mat"      → [0.23, -0.45, 0.87, ...]   (1536 dimensions)
"A kitten rested on a rug"    → [0.25, -0.43, 0.85, ...]   (nearby in vector space!)
"Stock market crashed today"  → [-0.67, 0.12, -0.34, ...]  (far away)

Similar meaning = similar vectors = small distance between them
Different meaning = different vectors = large distance between them
```
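The "distance" here is usually measured with cosine similarity: the cosine of the angle between two vectors. A minimal sketch in plain Python, using made-up 3-dimensional vectors as stand-ins for real embeddings:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction,
    near 0.0 = unrelated, negative = pointing apart."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (real ones have hundreds or thousands of dims)
cat = [0.23, -0.45, 0.87]
kitten = [0.25, -0.43, 0.85]
stocks = [-0.67, 0.12, -0.34]

print(cosine_similarity(cat, kitten))  # close to 1.0 — similar meaning
print(cosine_similarity(cat, stocks))  # negative — different meaning
```

In practice you call this through your vector database or a library like NumPy, but the operation is exactly this.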
Popular embedding models include OpenAI's text-embedding-3-large, Cohere's embed-v4, Google's Gecko, and open-source options like BGE and E5 from Hugging Face. The choice of embedding model affects retrieval quality significantly — it's one of the most important decisions in a RAG pipeline.
Vector Databases
Once you've converted your documents into embeddings, you need somewhere to store and efficiently search them. Vector databases are purpose-built for this: they store high-dimensional vectors and support fast similarity searches (finding the nearest neighbors to a query vector).
| Database | Type | Best For | Key Feature |
|---|---|---|---|
| Pinecone | Fully managed cloud | Teams wanting zero ops overhead | Serverless pricing, automatic scaling, easy to start |
| Weaviate | Hybrid (managed or self-hosted) | Hybrid search (vector + keyword) | Built-in vectorizers, GraphQL API, multi-modal support |
| Chroma | Lightweight, open-source | Prototyping, small-to-medium projects | Embeds in Python apps, simple API, low setup cost |
| Qdrant | Open-source, self-hosted or cloud | High-performance production workloads | Written in Rust for speed, advanced filtering, scalar quantization |
| pgvector | PostgreSQL extension | Teams already using PostgreSQL | No new infrastructure needed, ACID compliance, familiar SQL interface |
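Whichever backend you pick, the core operations are the same: store vectors with metadata, then return the nearest neighbors to a query vector, optionally filtered by metadata. A brute-force, in-memory sketch (real databases use approximate indexes such as HNSW to avoid scanning every vector; the class and field names here are illustrative, not any library's API):

```python
import math

class ToyVectorStore:
    """Brute-force nearest-neighbor search; fine for small collections."""

    def __init__(self):
        self.vectors: list[list[float]] = []
        self.metadata: list[dict] = []

    def upsert(self, vectors, metadata):
        self.vectors.extend(vectors)
        self.metadata.extend(metadata)

    def query(self, vector, top_k=5, filter=None):
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            return dot / (math.sqrt(sum(x * x for x in a)) *
                          math.sqrt(sum(x * x for x in b)))

        # Apply the metadata filter, score every remaining vector, keep top_k
        candidates = [
            (cosine(vector, v), meta)
            for v, meta in zip(self.vectors, self.metadata)
            if filter is None or all(meta.get(k) == val for k, val in filter.items())
        ]
        candidates.sort(key=lambda pair: pair[0], reverse=True)
        return candidates[:top_k]

store = ToyVectorStore()
store.upsert(
    vectors=[[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]],
    metadata=[{"text": "leave policy", "dept": "HR"},
              {"text": "benefits FAQ", "dept": "HR"},
              {"text": "Q4 revenue", "dept": "Finance"}],
)
hits = store.query([1.0, 0.05], top_k=2, filter={"dept": "HR"})
print([meta["text"] for score, meta in hits])  # → ['leave policy', 'benefits FAQ']
```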
Document Chunking Strategies
Before embedding, documents must be split into smaller pieces called "chunks." Chunking strategy dramatically affects retrieval quality. Chunks that are too large may contain irrelevant information that dilutes the signal. Chunks that are too small may lose important context.
| Strategy | How It Works | Pros | Cons |
|---|---|---|---|
| Fixed-size | Split every N tokens/characters | Simple, predictable chunk sizes | May split mid-sentence or mid-thought |
| Recursive | Split by paragraphs, then sentences, then words | Respects document structure | Variable chunk sizes |
| Semantic | Use embeddings to detect topic boundaries | Chunks align with meaning | More complex, slower to process |
| Document-aware | Split by headers, sections, or markdown structure | Preserves document hierarchy | Requires structured input documents |
For most applications, recursive chunking with a target size of 512–1024 tokens and an overlap of 50–200 tokens provides good results. The overlap ensures that information at chunk boundaries isn't lost.
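A fixed-size chunker with overlap takes only a few lines. This sketch splits on characters for simplicity; production splitters count tokens and fall back through paragraph, sentence, and word separators (the "recursive" strategy above):

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into fixed-size chunks where each chunk repeats the last
    `overlap` characters of the previous one, so boundary content appears twice."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

doc = "A" * 500
chunks = chunk_text(doc, chunk_size=200, overlap=40)
print(len(chunks))     # 3 chunks, starting at offsets 0, 160, 320
print(len(chunks[0]))  # 200
```

Note that consecutive chunks share exactly `overlap` characters, which is what protects information sitting on a chunk boundary.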
Building a RAG Pipeline: Step by Step
Here's the complete flow for building a RAG system, from raw documents to generated answers.
Phase 1: Indexing (offline, done once per document update)
Python pseudocode for indexing:
```python
# 1. Load documents
docs = load_documents("./data/")  # PDF, DOCX, HTML, Markdown, etc.

# 2. Chunk documents
chunks = recursive_text_splitter(
    docs,
    chunk_size=512,    # tokens per chunk
    chunk_overlap=50,  # overlap between consecutive chunks
)

# 3. Generate embeddings
embeddings = embedding_model.embed(chunks)  # e.g., text-embedding-3-large

# 4. Store in vector database
vector_db.upsert(
    vectors=embeddings,
    metadata=[{
        "text": chunk.text,
        "source": chunk.source_file,
        "page": chunk.page_number,
    } for chunk in chunks],
)
```
Phase 2: Retrieval and Generation (online, per query)
Python pseudocode for querying:
```python
# 1. Embed the user's query
query_embedding = embedding_model.embed(user_query)

# 2. Search the vector database for relevant chunks
results = vector_db.query(
    vector=query_embedding,
    top_k=5,                      # retrieve the 5 most similar chunks
    filter={"department": "HR"},  # optional metadata filtering
)

# 3. Build the augmented prompt
context = "\n\n".join(r.metadata["text"] for r in results)
prompt = f"""Based on the following context, answer the user's question.
If the context doesn't contain enough information, say so.

<context>
{context}
</context>

Question: {user_query}"""

# 4. Generate the response
response = llm.generate(prompt)
```
Production Considerations
Moving from a prototype RAG system to production requires addressing several challenges:
Chunk Overlap
Always include overlap between consecutive chunks (typically 10–20% of chunk size). Without overlap, important information that spans a chunk boundary gets split and may never be retrieved correctly.
Metadata Filtering
Attach metadata to each chunk — source document, date, department, category, author — and use it to filter results before or during retrieval. A query about Q4 2025 financials should not return chunks from 2023 reports.
Reranking
Initial vector search retrieves candidates based on embedding similarity, but this is an approximation. A reranking step uses a more powerful cross-encoder model to re-score the top candidates with the actual query. This significantly improves relevance. Cohere Rerank and bge-reranker-v2 are popular options.
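The reranking step itself is simple once you have a scoring model. In this sketch the scoring function is a toy lexical-overlap stand-in for a real cross-encoder such as Cohere Rerank or bge-reranker-v2, which would read the query and candidate jointly:

```python
def rerank(query: str, candidates: list[str], score_fn, top_n: int = 5) -> list[str]:
    """Re-score every candidate against the query and keep the best top_n.
    In production, score_fn wraps a cross-encoder model call."""
    return sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)[:top_n]

def lexical_overlap(query: str, text: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query words found in the text."""
    q = set(query.lower().split())
    return len(q & set(text.lower().split())) / len(q)

candidates = [
    "Our parental leave policy grants 16 weeks of paid leave.",
    "The cafeteria menu changes weekly.",
    "Leave requests are filed through the HR portal.",
]
best = rerank("parental leave policy", candidates, lexical_overlap, top_n=2)
print(best[0])  # → "Our parental leave policy grants 16 weeks of paid leave."
```

Because cross-encoders are slow, you run them only on the 20–50 candidates the first-stage search returns, never on the whole corpus.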
Hybrid Search
Combine vector (semantic) search with traditional keyword (BM25) search. Vector search excels at finding semantically similar content, but keyword search is better for exact matches (product names, IDs, specific terms). Most production RAG systems use both and merge the results.
Production RAG architecture:
```
User Query
    │
    ├──→ Vector Search (semantic similarity) ──→ Top 20 candidates ──┐
    │                                                                │
    ├──→ Keyword Search (BM25, exact match) ──→ Top 20 candidates ──┤
    │                                                                │
    └──→ Merge & Deduplicate ◄──────────────────────────────────────┘
                │
                ▼
      Reranker (cross-encoder)
                │
                ▼
     Top 5 most relevant chunks
                │
                ▼
  LLM Generation (with context + system prompt)
                │
                ▼
    Answer with source citations
```
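The Merge & Deduplicate step is often implemented with Reciprocal Rank Fusion (RRF), which combines the two rankings without having to normalize their incomparable scores. A sketch (k=60 is the conventional smoothing constant from the RRF literature):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse multiple ranked lists of document IDs. Each document earns
    1 / (k + rank) per list it appears in; higher total score ranks first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]   # semantic search ranking
keyword_hits = ["doc_c", "doc_d", "doc_a"]  # BM25 ranking
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
# doc_a and doc_c appear in both lists, so they rise to the top
```

RRF also deduplicates for free: a document returned by both searches gets a single combined score instead of two entries.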
When to Use RAG vs Fine-Tuning vs Prompt Engineering
Choosing the right approach is one of the most important decisions in building AI applications. Here's a decision framework:
| Approach | Best When | Cost | Time to Deploy |
|---|---|---|---|
| Prompt Engineering | Task can be solved with the right instructions and a few examples. No private data needed. | Free (just prompt iteration) | Hours |
| RAG | Need access to private/current data. Data changes frequently. Need citations. Want to avoid retraining. | Moderate (infra + embeddings) | Days to weeks |
| Fine-Tuning | Need a specific behavior, style, or format the model can't achieve through prompting. Need consistent performance on a narrow task. | High (compute + data prep) | Weeks to months |
| RAG + Fine-Tuning | Need both private data access and specialized model behavior. Highest quality for complex enterprise applications. | Highest | Months |
Advanced RAG Patterns
Multi-Hop RAG
Standard RAG retrieves once and generates. Multi-hop RAG performs multiple retrieval steps, where the results of one retrieval inform the next query. This is essential for questions that require synthesizing information from multiple documents — for example, "Compare our Q3 and Q4 revenue growth rates and explain the factors behind any differences."
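The control flow can be sketched as a loop in which an LLM decides the next retrieval query. Here `retrieve` and `next_query` are hypothetical stand-ins (a toy keyword lookup and a hard-coded planner) for your retriever and an LLM planning call:

```python
def multi_hop_retrieve(question, retrieve, next_query, max_hops=3):
    """Retrieve iteratively: each hop's results can inform the next query.
    `retrieve(query)` returns chunks; `next_query(question, context)` returns
    a follow-up query, or None once enough context has been gathered."""
    context, query = [], question
    for _ in range(max_hops):
        context.extend(retrieve(query))
        query = next_query(question, context)
        if query is None:
            break
    return context

# Toy stand-ins for demonstration only
kb = {"q3": ["Q3 revenue grew 5%"], "q4": ["Q4 revenue grew 8%"]}

def retrieve(query):
    return kb.get(query, [])

def next_query(question, context):
    # Toy planner: fetch Q3, then Q4, then stop (an LLM decides this in practice)
    if not any("Q3" in c for c in context):
        return "q3"
    if not any("Q4" in c for c in context):
        return "q4"
    return None

print(multi_hop_retrieve("Compare Q3 and Q4 revenue growth", retrieve, next_query))
# gathers both the Q3 and Q4 chunks across hops
```

The `max_hops` cap matters in production: without it, a planner that never returns None would loop (and bill you) indefinitely.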
Graph RAG
Graph RAG combines vector search with knowledge graphs. Entities and their relationships are extracted from documents and stored in a graph database. When a query comes in, the system retrieves both relevant text chunks and related graph entities, providing richer context. Microsoft Research published influential work on this approach in 2024, and it's particularly effective for complex domains with many interconnected concepts (legal, medical, financial).
Agentic RAG
Agentic RAG gives an AI agent the ability to decide how and when to search, which sources to query, and whether the retrieved results are sufficient. Instead of a fixed retrieve-then-generate pipeline, the agent can:
- Reformulate queries if initial results are poor
- Search multiple knowledge bases selectively
- Decide to ask clarifying questions before searching
- Verify retrieved information against other sources
This pattern blurs the line between RAG and AI agents (covered in the next module) and represents the cutting edge of production RAG systems in 2026.
Resources
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
Lewis et al. (Meta AI)
The original 2020 paper that introduced the RAG framework, establishing the foundation for combining retrieval with generation.
Anthropic: Contextual Retrieval
Anthropic
Anthropic's guide to contextual retrieval, a technique that improves RAG chunk quality by prepending document-level context to each chunk before embedding.
LangChain RAG Documentation
LangChain
Comprehensive tutorials for building RAG pipelines with LangChain, covering document loading, splitting, embedding, retrieval, and generation.
GraphRAG: Unlocking LLM discovery on narrative private data
Microsoft Research
Microsoft Research's work on combining knowledge graphs with RAG for superior performance on complex questions that require synthesizing information across documents.
Key Takeaways
1. RAG grounds LLM responses in your actual data by retrieving relevant documents at query time, dramatically reducing hallucination and enabling private data access.
2. Embeddings convert text into numerical vectors that capture semantic meaning, enabling similarity-based search across your document collection.
3. Choose your vector database based on your needs: pgvector for simplicity, Chroma for prototyping, Pinecone for managed ease, Qdrant for performance, or Weaviate for hybrid search.
4. Chunking strategy is critical — recursive chunking with 512–1024 token chunks and 10–20% overlap is a strong default for most use cases.
5. Production RAG systems combine vector search with keyword search (hybrid) and add a reranking step for significantly better relevance.
6. Always start with prompt engineering, add RAG for private/current data needs, and reserve fine-tuning for when you need specialized model behavior.
7. Advanced patterns like multi-hop RAG, graph RAG, and agentic RAG handle complex queries that require multiple retrieval steps or dynamic search strategies.