Advanced · 55 min · Module 2 of 7

Retrieval Augmented Generation (RAG)

Embeddings, vector databases, document chunking, and building a RAG pipeline from scratch.

Grounding AI in Your Data

Large language models are incredibly capable, but they have a fundamental limitation: they only know what they learned during training. They can't access your company's internal documents, today's news, or any private data. Retrieval Augmented Generation (RAG) solves this by giving LLMs the ability to search through your data before generating a response.

RAG has become the most widely adopted pattern for building production AI applications. If you're building anything that needs to work with proprietary data — customer support bots, internal knowledge bases, document Q&A systems — RAG is almost certainly part of the solution.

What Is RAG?

RAG is a technique that augments an LLM's knowledge by retrieving relevant documents from an external knowledge base and including them in the prompt context. The term was coined in a 2020 paper by Lewis et al. at Meta AI.

The core idea is simple: instead of expecting the model to know everything, you give it the relevant information at query time.

RAG in a nutshell:

User asks: "What is our company's parental leave policy?"

Without RAG: the LLM responds with generic information about parental leave, potentially hallucinating specific details about your company.

With RAG:
1. The system searches your HR document database
2. Retrieves the relevant policy document
3. Includes it in the prompt: "Based on this document: [POLICY TEXT]"
4. The LLM generates an accurate answer grounded in your actual policy

Why RAG Matters

  • Reduces hallucination: The model generates answers from retrieved facts rather than relying on its parametric memory.
  • Keeps information current: Update your document store anytime — no model retraining needed.
  • Works with private data: Your proprietary data never needs to be part of the model's training set.
  • Provides citations: You can trace every answer back to its source document, enabling verification and trust.
  • Cost-effective: Far cheaper and faster than fine-tuning a model on your data.

Embeddings Explained

At the heart of RAG is a concept called embeddings — numerical representations of text that capture semantic meaning. An embedding model converts text into a high-dimensional vector (a list of numbers, typically 768 to 3072 dimensions) where similar meanings map to nearby points in vector space.

How embeddings capture meaning:

"The cat sat on the mat"     → [0.23, -0.45, 0.87, ...]  (1536 dimensions)
"A kitten rested on a rug"   → [0.25, -0.43, 0.85, ...]  (nearby in vector space!)
"Stock market crashed today" → [-0.67, 0.12, -0.34, ...] (far away)

Similar meaning = similar vectors = small distance between them
Different meaning = different vectors = large distance between them
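The "distance" between embeddings is usually measured with cosine similarity. A minimal sketch, using toy 3-dimensional vectors in place of real 1536-dimensional embeddings:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: near 1.0 = similar direction/meaning."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real high-dimensional embeddings
cat = [0.23, -0.45, 0.87]
kitten = [0.25, -0.43, 0.85]
stocks = [-0.67, 0.12, -0.34]

print(cosine_similarity(cat, kitten))  # close to 1.0 — similar meaning
print(cosine_similarity(cat, stocks))  # negative — unrelated meaning
```

In production you would call an embedding model's API to get the vectors; the similarity math stays the same.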

Popular embedding models include OpenAI's text-embedding-3-large, Cohere's embed-v4, Google's Gecko, and open-source options like BGE and E5 from Hugging Face. The choice of embedding model affects retrieval quality significantly — it's one of the most important decisions in a RAG pipeline.

Embedding Dimensions and Quality
Higher-dimensional embeddings capture more nuance but use more storage and compute. OpenAI's text-embedding-3 models support shortened embeddings: you can request fewer dimensions at the cost of some accuracy, because the models are trained so that the leading dimensions carry the most information (a technique known as Matryoshka representation learning). For most applications, 1024–1536 dimensions hit the sweet spot between quality and efficiency.
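Shortening a Matryoshka-style embedding amounts to keeping the leading components and renormalizing to unit length so cosine similarity still behaves. A sketch of that idea (the toy 4-dimensional vector stands in for a real 3072-dimensional one):

```python
import math

def truncate_embedding(vec: list[float], dims: int) -> list[float]:
    """Keep the first `dims` components, then renormalize to unit length
    so cosine similarity remains meaningful on the shortened vectors."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

full = [0.23, -0.45, 0.87, 0.11]     # stand-in for a full-size embedding
small = truncate_embedding(full, 2)  # keep only the leading "coarse" dimensions
print(small)
```

With OpenAI's API you can request the smaller size directly via a dimensions parameter instead of truncating client-side.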

Vector Databases

Once you've converted your documents into embeddings, you need somewhere to store and efficiently search them. Vector databases are purpose-built for this: they store high-dimensional vectors and support fast similarity searches (finding the nearest neighbors to a query vector).

Database  | Type                              | Best For                              | Key Feature
Pinecone  | Fully managed cloud               | Teams wanting zero ops overhead       | Serverless pricing, automatic scaling, easy to start
Weaviate  | Hybrid (managed or self-hosted)   | Hybrid search (vector + keyword)      | Built-in vectorizers, GraphQL API, multi-modal support
Chroma    | Lightweight, open-source          | Prototyping, small-to-medium projects | Embeds in Python apps, simple API, low setup cost
Qdrant    | Open-source, self-hosted or cloud | High-performance production workloads | Written in Rust for speed, advanced filtering, scalar quantization
pgvector  | PostgreSQL extension              | Teams already using PostgreSQL        | No new infrastructure needed, ACID compliance, familiar SQL interface

Starting Simple
If you're building your first RAG system, start with pgvector (if you already use PostgreSQL) or Chroma (for rapid prototyping). You can always migrate to a dedicated vector database like Pinecone or Qdrant when you need to scale. Don't over-engineer the storage layer before you've proven the RAG pipeline works.
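To demystify what a vector database does, here is a minimal in-memory stand-in: it stores (vector, metadata) pairs and answers queries by brute-force nearest-neighbor search. Real databases add indexing (e.g., HNSW) to make this fast at scale; the names and data below are illustrative only.

```python
import math

class TinyVectorStore:
    """Minimal in-memory stand-in for a vector database:
    brute-force cosine-similarity search over stored vectors."""

    def __init__(self):
        self.items = []  # list of (vector, metadata) tuples

    def upsert(self, vector, metadata):
        self.items.append((vector, metadata))

    def query(self, vector, top_k=3):
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            return dot / (math.sqrt(sum(x * x for x in a)) *
                          math.sqrt(sum(x * x for x in b)))
        scored = [(cosine(vector, v), meta) for v, meta in self.items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]

store = TinyVectorStore()
store.upsert([1.0, 0.0], {"text": "parental leave policy"})
store.upsert([0.9, 0.1], {"text": "vacation policy"})
store.upsert([0.0, 1.0], {"text": "quarterly revenue report"})

results = store.query([1.0, 0.05], top_k=2)
print([meta["text"] for _, meta in results])  # the two HR-related chunks rank first
```

Every production vector database is, conceptually, this plus an approximate index, persistence, and filtering.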

Document Chunking Strategies

Before embedding, documents must be split into smaller pieces called "chunks." Chunking strategy dramatically affects retrieval quality. Chunks that are too large may contain irrelevant information that dilutes the signal. Chunks that are too small may lose important context.

Strategy       | How It Works                                      | Pros                            | Cons
Fixed-size     | Split every N tokens/characters                   | Simple, predictable chunk sizes | May split mid-sentence or mid-thought
Recursive      | Split by paragraphs, then sentences, then words   | Respects document structure     | Variable chunk sizes
Semantic       | Use embeddings to detect topic boundaries         | Chunks align with meaning       | More complex, slower to process
Document-aware | Split by headers, sections, or markdown structure | Preserves document hierarchy    | Requires structured input documents

For most applications, recursive chunking with a target size of 512–1024 tokens and an overlap of 50–200 tokens provides good results. The overlap ensures that information at chunk boundaries isn't lost.
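The core mechanic of chunking with overlap can be sketched in a few lines. This toy version splits on whitespace words rather than tokenizer tokens, and uses tiny sizes so the overlap is visible:

```python
def chunk_text(tokens: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Fixed-size chunking with overlap: each chunk starts
    (chunk_size - overlap) positions after the previous one."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end
    return chunks

words = "the quick brown fox jumps over the lazy dog".split()
for chunk in chunk_text(words, chunk_size=4, overlap=1):
    print(" ".join(chunk))
```

Each chunk repeats the last word of its predecessor, so information at a boundary appears in both chunks. A recursive splitter adds one refinement: it prefers to break at paragraph and sentence boundaries before falling back to this fixed-size behavior.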

Building a RAG Pipeline: Step by Step

Here's the complete flow for building a RAG system, from raw documents to generated answers.

Phase 1: Indexing (offline, done once per document update)

Python pseudocode for indexing:

# 1. Load documents
docs = load_documents("./data/")  # PDF, DOCX, HTML, Markdown, etc.

# 2. Chunk documents
chunks = recursive_text_splitter(
    docs,
    chunk_size=512,    # tokens per chunk
    chunk_overlap=50,  # overlap between consecutive chunks
)

# 3. Generate embeddings
embeddings = embedding_model.embed(chunks)  # e.g., text-embedding-3-large

# 4. Store in vector database
vector_db.upsert(
    vectors=embeddings,
    metadata=[{
        "text": chunk.text,
        "source": chunk.source_file,
        "page": chunk.page_number,
    } for chunk in chunks],
)

Phase 2: Retrieval and Generation (online, per query)

Python pseudocode for querying:

# 1. Embed the user's query
query_embedding = embedding_model.embed(user_query)

# 2. Search the vector database for relevant chunks
results = vector_db.query(
    vector=query_embedding,
    top_k=5,                      # retrieve the top 5 most similar chunks
    filter={"department": "HR"},  # optional metadata filtering
)

# 3. Build the augmented prompt
context = "\n\n".join([r.metadata["text"] for r in results])
prompt = f"""Based on the following context, answer the user's question.
If the context doesn't contain enough information, say so.

<context>
{context}
</context>

Question: {user_query}"""

# 4. Generate the response
response = llm.generate(prompt)

The Garbage In, Garbage Out Principle
The quality of your RAG system is determined primarily by the quality of your data and your chunking strategy — not the LLM. Spend 80% of your optimization effort on retrieval quality (better chunking, better embeddings, metadata enrichment) and 20% on generation (prompt tuning, model selection).

Production Considerations

Moving from a prototype RAG system to production requires addressing several challenges:

Chunk Overlap

Always include overlap between consecutive chunks (typically 10–20% of chunk size). Without overlap, important information that spans a chunk boundary gets split and may never be retrieved correctly.

Metadata Filtering

Attach metadata to each chunk — source document, date, department, category, author — and use it to filter results before or during retrieval. A query about Q4 2025 financials should not return chunks from 2023 reports.
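The filtering step itself is simple; production vector databases apply it inside the index, but the logic is equivalent to this sketch (the chunk records and field names here are illustrative):

```python
def filter_chunks(chunks: list[dict], **criteria) -> list[dict]:
    """Keep only chunks whose metadata matches every criterion."""
    return [
        c for c in chunks
        if all(c["metadata"].get(k) == v for k, v in criteria.items())
    ]

chunks = [
    {"text": "Q4 2025 revenue grew 12%", "metadata": {"year": 2025, "dept": "finance"}},
    {"text": "2023 annual report summary", "metadata": {"year": 2023, "dept": "finance"}},
    {"text": "Parental leave policy", "metadata": {"year": 2025, "dept": "HR"}},
]

hits = filter_chunks(chunks, year=2025, dept="finance")
print([c["text"] for c in hits])  # only the Q4 2025 finance chunk survives
```

Filtering before similarity search (pre-filtering) guarantees no stale chunks reach the LLM; filtering after is cheaper but can leave you with fewer than top_k results.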

Reranking

Initial vector search retrieves candidates based on embedding similarity, but this is an approximation. A reranking step uses a more powerful cross-encoder model to re-score the top candidates with the actual query. This significantly improves relevance. Cohere Rerank and bge-reranker-v2 are popular options.

Hybrid Search

Combine vector (semantic) search with traditional keyword (BM25) search. Vector search excels at finding semantically similar content, but keyword search is better for exact matches (product names, IDs, specific terms). Most production RAG systems use both and merge the results.

Production RAG architecture:

User Query
    │
    ├──→ Vector Search (semantic similarity) ──→ Top 20 candidates
    │                                                    │
    ├──→ Keyword Search (BM25, exact match) ──→ Top 20 candidates
    │                                                    │
    └──→ Merge & Deduplicate ◄───────────────────────────┘
               │
               ▼
        Reranker (cross-encoder)
               │
               ▼
        Top 5 most relevant chunks
               │
               ▼
        LLM Generation (with context + system prompt)
               │
               ▼
        Answer with source citations
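The "Merge & Deduplicate" step is commonly implemented with reciprocal rank fusion (RRF), which combines ranked lists without needing to reconcile their incompatible score scales. A sketch, with illustrative document IDs:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked result lists: each document scores
    1/(k + rank) per list it appears in. k=60 is a common default."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]   # ranking from semantic search
keyword_hits = ["doc_c", "doc_a", "doc_d"]  # ranking from BM25
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
```

Documents that appear high in both lists (doc_a, doc_c) rise to the top, which is exactly the behavior you want before handing candidates to the reranker.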

When to Use RAG vs Fine-Tuning vs Prompt Engineering

Choosing the right approach is one of the most important decisions in building AI applications. Here's a decision framework:

Approach           | Best When                                                                                                                         | Cost                          | Time to Deploy
Prompt Engineering | Task can be solved with the right instructions and a few examples; no private data needed.                                        | Free (just prompt iteration)  | Hours
RAG                | Need access to private/current data; data changes frequently; need citations; want to avoid retraining.                           | Moderate (infra + embeddings) | Days to weeks
Fine-Tuning        | Need a specific behavior, style, or format the model can't achieve through prompting; need consistent performance on a narrow task. | High (compute + data prep)  | Weeks to months
RAG + Fine-Tuning  | Need both private data access and specialized model behavior; highest quality for complex enterprise applications.                | Highest                       | Months

Start With Prompt Engineering
Always start with prompt engineering. If that's not enough, add RAG. If RAG doesn't fully solve the problem (usually because you need a specific output style or domain behavior), then consider fine-tuning. This "ladder" approach avoids unnecessary complexity and cost.

Advanced RAG Patterns

Multi-Hop RAG

Standard RAG retrieves once and generates. Multi-hop RAG performs multiple retrieval steps, where the results of one retrieval inform the next query. This is essential for questions that require synthesizing information from multiple documents — for example, "Compare our Q3 and Q4 revenue growth rates and explain the factors behind any differences."

Graph RAG

Graph RAG combines vector search with knowledge graphs. Entities and their relationships are extracted from documents and stored in a graph database. When a query comes in, the system retrieves both relevant text chunks and related graph entities, providing richer context. Microsoft Research published influential work on this approach in 2024, and it's particularly effective for complex domains with many interconnected concepts (legal, medical, financial).

Agentic RAG

Agentic RAG gives an AI agent the ability to decide how and when to search, which sources to query, and whether the retrieved results are sufficient. Instead of a fixed retrieve-then-generate pipeline, the agent can:

  • Reformulate queries if initial results are poor
  • Search multiple knowledge bases selectively
  • Decide to ask clarifying questions before searching
  • Verify retrieved information against other sources

This pattern blurs the line between RAG and AI agents (covered in the next module) and represents the cutting edge of production RAG systems in 2026.

Resources

  • Article — Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (Lewis et al., Meta AI): The original 2020 paper that introduced the RAG framework, establishing the foundation for combining retrieval with generation.
  • Article — Anthropic: Contextual Retrieval (Anthropic): Anthropic's guide to contextual retrieval, a technique that improves RAG chunk quality by prepending document-level context to each chunk before embedding.
  • Tool — LangChain RAG Documentation (LangChain): Comprehensive tutorials for building RAG pipelines with LangChain, covering document loading, splitting, embedding, retrieval, and generation.
  • Article — GraphRAG: Unlocking LLM discovery on narrative private data (Microsoft Research): Microsoft Research's work on combining knowledge graphs with RAG for superior performance on complex questions that require synthesizing information across documents.

Key Takeaways

  1. RAG grounds LLM responses in your actual data by retrieving relevant documents at query time, dramatically reducing hallucination and enabling private data access.
  2. Embeddings convert text into numerical vectors that capture semantic meaning, enabling similarity-based search across your document collection.
  3. Choose your vector database based on your needs: pgvector for simplicity, Chroma for prototyping, Pinecone for managed ease, Qdrant for performance, or Weaviate for hybrid search.
  4. Chunking strategy is critical — recursive chunking with 512–1024 token chunks and 10–20% overlap is a strong default for most use cases.
  5. Production RAG systems combine vector search with keyword search (hybrid) and add a reranking step for significantly better relevance.
  6. Always start with prompt engineering, add RAG for private/current data needs, and reserve fine-tuning for when you need specialized model behavior.
  7. Advanced patterns like multi-hop RAG, graph RAG, and agentic RAG handle complex queries that require multiple retrieval steps or dynamic search strategies.

Test Your Understanding

Module Assessment

5 questions · Score 70% or higher to complete this module