AI Development

RAG in Production: Chunking Strategies, Retrieval Tuning, and the Grounding Problem

Clarvia Team
Mar 24, 2026

Every RAG tutorial ends at the same place: a notebook that answers three questions correctly. The author declares success, links to a GitHub repo, and moves on. Meanwhile, production RAG systems face challenges that notebooks never encounter: documents with tables and images, queries that span multiple topics, retrieval that returns plausible-but-wrong context, and answers that sound correct but cite nothing.

This article is about the gap between the demo and the deployment. It covers the three hardest problems in production RAG -- chunking, retrieval quality, and grounding -- with specific techniques, benchmarks, and evaluation frameworks from real deployments in 2025 and 2026.


The Production RAG Stack

Before diving into specifics, here is the full stack for a production RAG system:

Documents --> [Ingestion Pipeline] --> [Chunking] --> [Embedding] --> [Vector Store]
                                                                          |
User Query --> [Query Processing] --> [Embedding] --> [Retrieval] --------+
                                                          |
                                                    [Reranking]
                                                          |
                                                    [Context Assembly]
                                                          |
                                              [Prompt Construction]
                                                          |
                                                       [LLM]
                                                          |
                                                    [Response + Citations]

Each box is a decision point. Each decision affects the quality of the final answer. Let's walk through the three that matter most.


Part 1: Chunking Strategies

Chunking is how you split documents into pieces for embedding and retrieval. It is the most unglamorous part of RAG and also the most consequential. A 2025 NAACL paper found that chunking configuration has at least as much influence on retrieval quality as the choice of embedding model.

Strategy 1: Recursive Character Splitting

The default. Split text at paragraph boundaries, then sentence boundaries, then character boundaries, up to a maximum chunk size. This is what LangChain's RecursiveCharacterTextSplitter does.

Settings: 400-512 tokens per chunk, 10-20% overlap (40-100 tokens).

Results: 69% accuracy on the 2026 real-document benchmark. Acceptable for most use cases. Start here.

Strengths: Simple, fast, predictable behavior. Works on any text.

Weaknesses: Splits mid-thought. A paragraph about "pricing" might land in one chunk while the actual price is in the next chunk. Overlap mitigates this but does not eliminate it.
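The idea behind recursive splitting can be sketched in a few lines. This is a simplified illustration of the approach, not LangChain's actual implementation, and it counts characters rather than tokens (multiply token limits by roughly 4 to approximate a character budget):

```python
def recursive_split(text, max_chars=2000, separators=("\n\n", "\n", ". ", " ")):
    """Split text at the coarsest separator that yields chunks under max_chars."""
    if len(text) <= max_chars:
        return [text]
    for sep in separators:
        parts = text.split(sep)
        if len(parts) > 1:
            chunks, current = [], ""
            for part in parts:
                candidate = current + sep + part if current else part
                if len(candidate) <= max_chars:
                    current = candidate
                else:
                    if current:
                        chunks.append(current)
                    current = part
            if current:
                chunks.append(current)
            # Recurse on any chunk still over the limit using finer separators
            return [c for chunk in chunks for c in recursive_split(chunk, max_chars, separators)]
    # No separator helped: fall back to a hard character cut
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
```

Overlap is omitted here for brevity; production splitters carry the last N tokens of each chunk into the next one.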

Strategy 2: Semantic Chunking

Split at natural semantic boundaries. Instead of cutting at a fixed token count, monitor the embedding similarity between consecutive sentences. When similarity drops below a threshold, insert a chunk boundary.

Results: Up to ~70% accuracy improvement over naive baselines in benchmarks. A peer-reviewed clinical decision support study (MDPI Bioengineering, November 2025) found that adaptive chunking aligned to logical topic boundaries hit 87% accuracy versus 13% for fixed-size baselines.

Strengths: Preserves complete thoughts. Chunks are self-contained semantic units.

Weaknesses: More computationally expensive (requires embedding every sentence). Chunk sizes are variable, which complicates indexing and cost estimation.
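A minimal sketch of the boundary-detection loop, assuming an injected `embed` callable (in practice an embedding model such as a sentence-transformer; here it is just a function from string to vector):

```python
import math

def semantic_chunks(sentences, embed, threshold=0.75):
    """Group consecutive sentences; start a new chunk when the embedding
    similarity between neighbouring sentences drops below the threshold."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev, cur, sent in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks
```

The threshold is the main tuning knob: lower values produce fewer, larger chunks; some implementations use a rolling window of sentences instead of single neighbours to smooth out noise.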

Strategy 3: Structural Chunking

Use document structure -- headings, sections, HTML tags, Markdown headers -- to define chunk boundaries. Each section becomes a chunk. If a section exceeds the size limit, apply recursive splitting within it.

Best for: Technical documentation, legal contracts, structured reports.

Strengths: Chunks align with how humans organized the information. Section headings provide natural metadata for filtering.

Weaknesses: Only works on well-structured documents. Falls apart on unstructured PDFs, chat transcripts, or freeform text.
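For Markdown input, the core of the strategy is a header-driven split that keeps the heading as metadata. A minimal sketch (the recursive fallback for oversized sections is left out):

```python
import re

def structural_chunks(markdown_text):
    """Split a Markdown document into one chunk per section, keeping the
    section heading as metadata for filtering at retrieval time."""
    chunks = []
    heading, body = None, []
    for line in markdown_text.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)", line)
        if m:
            if body and "".join(body).strip():
                chunks.append({"heading": heading, "text": "\n".join(body).strip()})
            heading, body = m.group(2), []
        else:
            body.append(line)
    if body and "".join(body).strip():
        chunks.append({"heading": heading, "text": "\n".join(body).strip()})
    return chunks
```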

Strategy 4: Contextual Retrieval (Anthropic-Style)

Before embedding each chunk, prepend a short contextual summary: the document title, the heading hierarchy, and a 1-2 sentence description of what the chunk contains. This makes each chunk self-contained, eliminating the "orphan chunk" problem where a chunk makes no sense without its surrounding context.

Results: Anthropic reported a 49% reduction in retrieval failure rate when combining contextual retrieval with a BM25 hybrid approach.

Strengths: Chunks can be understood in isolation. Retrieval is more reliable because the embedding captures the chunk's meaning within its broader document context.

Weaknesses: Requires an LLM call per chunk during ingestion (adds cost and latency to the ingestion pipeline). For a 100,000-document corpus, this adds meaningful cost.
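The ingestion-time step can be sketched as follows. The prompt wording is illustrative (not Anthropic's published prompt), and `llm` is any callable that maps a prompt string to a completion string:

```python
def contextualize_chunk(document_title, heading_path, chunk_text, llm):
    """Prepend a short LLM-generated context line to a chunk before embedding,
    so the chunk can be understood and retrieved in isolation."""
    prompt = (
        f"Document: {document_title}\n"
        f"Section: {' > '.join(heading_path)}\n"
        f"Chunk:\n{chunk_text}\n\n"
        "In 1-2 sentences, state what this chunk covers and how it fits "
        "into the document."
    )
    summary = llm(prompt)
    # The contextualized text (header + summary + original chunk) is what gets embedded
    return f"{document_title} | {' > '.join(heading_path)}\n{summary}\n\n{chunk_text}"
```

Because the summaries are generated once at ingestion, prompt caching on the full document can cut the per-chunk cost substantially.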

Practical Recommendation

Start with recursive character splitting at 400-512 tokens. Measure retrieval precision. If precision is below your target, try contextual retrieval. Only move to semantic chunking if your documents are poorly structured and contextual retrieval is too expensive at your scale.


Part 2: Retrieval Tuning

Retrieval determines which chunks the model sees. Bad retrieval is the root cause of most RAG failures -- more often than bad chunking, bad prompting, or bad models.

Metric: Retrieval Precision and Recall

  • Precision: Of the chunks retrieved, what fraction are actually relevant?
  • Recall: Of the relevant chunks in the corpus, what fraction were retrieved?

For production RAG, precision matters more than recall. An irrelevant chunk in the context can actively mislead the model. A missing chunk means the model might say "I don't have enough information" -- which is the correct behavior.

Technique 1: Hybrid Search (Vector + BM25)

Pure vector search finds semantically similar content but misses exact keyword matches. BM25 (keyword search) finds exact matches but misses semantic equivalents. Combine both.

Implementation: Run both searches in parallel, normalize scores, and merge results. Pinecone, Weaviate, and Qdrant all support hybrid search natively. For pgvector, combine with PostgreSQL's full-text search.

Typical weighting: 70% vector, 30% BM25. Adjust based on your query patterns.
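The merge step can be sketched as weighted score fusion over two `{doc_id: score}` result sets. This assumes min-max normalization; many systems use reciprocal rank fusion instead, which avoids normalizing raw scores entirely:

```python
def hybrid_merge(vector_hits, bm25_hits, vector_weight=0.7):
    """Merge two ranked result sets ({doc_id: score}) by min-max normalizing
    each score set, then combining with a weighted sum."""
    def normalize(scores):
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = hi - lo or 1.0
        return {doc: (s - lo) / span for doc, s in scores.items()}

    v, b = normalize(vector_hits), normalize(bm25_hits)
    combined = {
        doc: vector_weight * v.get(doc, 0.0) + (1 - vector_weight) * b.get(doc, 0.0)
        for doc in set(v) | set(b)
    }
    # Return doc ids ranked by combined score, best first
    return sorted(combined, key=combined.get, reverse=True)
```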

Technique 2: Reranking

Initial retrieval (vector or hybrid) returns top-K candidates quickly but imprecisely. A cross-encoder reranker then scores each candidate against the query with much higher accuracy.

Tools: Cohere Rerank, BGE Reranker v2, Jina Reranker. These add 50-200ms of latency but dramatically improve precision.

Pattern: Retrieve top-20 with vector search, rerank to top-5 with a cross-encoder, and inject top-5 into the prompt.
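The retrieve-then-rerank pattern reduces to a few lines once the two stages are treated as injected callables (`vector_search` returns candidate chunks; `cross_encoder` scores a query-chunk pair):

```python
def retrieve_and_rerank(query, vector_search, cross_encoder, k_retrieve=20, k_final=5):
    """Fetch a broad candidate set cheaply, then keep only the chunks the
    cross-encoder scores highest against the query."""
    candidates = vector_search(query, k_retrieve)
    scored = [(cross_encoder(query, chunk), chunk) for chunk in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k_final]]
```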

Technique 3: Query Transformation

User queries are often poorly formed for retrieval. Transform them before embedding:

  • Query expansion: Generate 3-5 alternative phrasings of the query and retrieve for all of them. Merge and deduplicate results.
  • HyDE (Hypothetical Document Embeddings): Ask the LLM to generate a hypothetical answer, then use that answer's embedding for retrieval. This bridges the query-document semantic gap.
  • Step-back prompting: For specific questions, generate a more general question first, retrieve for both, and combine context.
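Query expansion is the simplest of the three to implement. A sketch, assuming a `rephrase` callable (an LLM call that returns alternative phrasings) and a `search` callable:

```python
def expanded_retrieve(query, rephrase, search, n_variants=3, k=5):
    """Retrieve for the original query plus LLM-generated rephrasings,
    then merge and deduplicate while preserving rank order."""
    queries = [query] + rephrase(query, n_variants)
    seen, merged = set(), []
    for q in queries:
        for chunk in search(q, k):
            if chunk not in seen:
                seen.add(chunk)
                merged.append(chunk)
    return merged
```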

Technique 4: Metadata Filtering

Add metadata to chunks during ingestion (document type, date, department, product) and filter during retrieval. If a user asks about "Q3 2025 revenue," filtering to documents from Q3 2025 before vector search eliminates most noise.

This is simple, often overlooked, and in many cases more effective than any embedding optimization.
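A pre-filter can be sketched as narrowing the candidate pool before vector search runs. Most vector stores push this filter down into the index itself; the in-memory version below just shows the shape of the operation:

```python
def filtered_search(query, chunks, vector_search, **filters):
    """Narrow the candidate pool by exact-match metadata before vector search.
    Each chunk is a dict with a 'meta' mapping of metadata fields."""
    pool = [
        c for c in chunks
        if all(c["meta"].get(key) == value for key, value in filters.items())
    ]
    return vector_search(query, pool)
```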


Part 3: The Grounding Problem

Grounding is the fundamental challenge of RAG: ensuring the model's response is faithful to the retrieved context and not to its parametric knowledge or its tendency to confabulate.

Why Models Hallucinate Even With Context

A model with relevant context in its prompt can still hallucinate for three reasons:

  1. Distraction: Too much context dilutes the relevant signal. The model loses focus.
  2. Override: The model's parametric knowledge conflicts with the context, and it trusts its own training more.
  3. Extrapolation: The context partially answers the question, and the model fills in gaps instead of saying "I don't know."

Grounding Technique 1: Keep Context Short and Precise

Counterintuitively, less context often produces better answers. Keep assembled context under 8,000 tokens for most queries. If your retriever returns 15 chunks, that is too many. Rerank aggressively and include only the top 3-5.
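Context assembly then becomes a budgeted take-while over the reranked list. A sketch, assuming an injected `count_tokens` callable (in practice a tokenizer such as tiktoken):

```python
def assemble_context(ranked_chunks, count_tokens, budget=8000, max_chunks=5):
    """Take reranked chunks in order until either the chunk cap or the token
    budget is hit; drop the rest rather than truncating mid-chunk."""
    selected, used = [], 0
    for chunk in ranked_chunks[:max_chunks]:
        cost = count_tokens(chunk)
        if used + cost > budget:
            break
        selected.append(chunk)
        used += cost
    return "\n\n".join(selected)
```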

Grounding Technique 2: Instruct Explicitly

Your system prompt must include explicit grounding instructions:

Answer the user's question using ONLY the information in the provided context.
If the context does not contain enough information to answer the question,
say "I don't have enough information to answer this question."
Do not use your general knowledge. Do not speculate. Do not extrapolate.
For every factual claim in your answer, cite the source document.

This does not guarantee compliance, but it significantly reduces hallucination rates in practice.

Grounding Technique 3: Citation Verification

After generating a response with citations, verify that each citation actually supports the claim it is attached to. This can be automated with an LLM-as-judge step:

For each claim-citation pair:
  - Is the claim supported by the cited context? (Yes/No)
  - If No, flag for review or remove the claim
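The loop above can be made concrete with an injected `judge` callable (an LLM call returning "Yes" or "No"; the prompt wording is illustrative):

```python
def verify_citations(claim_citation_pairs, judge):
    """Run an LLM-as-judge over each (claim, cited_context) pair and split
    the results into supported claims and claims flagged for review."""
    supported, flagged = [], []
    for claim, context in claim_citation_pairs:
        verdict = judge(
            f"Context:\n{context}\n\nClaim: {claim}\n\n"
            "Is the claim fully supported by the context? Answer Yes or No."
        )
        (supported if verdict.strip().lower().startswith("yes") else flagged).append(claim)
    return supported, flagged
```

An NLI model can stand in for the LLM judge here at much lower cost, at some loss of accuracy on multi-sentence claims.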

Grounding Technique 4: Abstention Training

Fine-tune or prompt the model to abstain when uncertain. A model that says "I don't know" correctly is more valuable than a model that answers every question confidently and wrong.


Evaluation Framework

60% of RAG deployments in 2026 include systematic evaluation from day one, up from under 30% in early 2025. If you are not evaluating, you are guessing.

Metrics to Track

Metric | What It Measures | How to Compute
------ | ---------------- | --------------
Faithfulness | Are generated claims supported by context? | LLM-as-judge or NLI model
Answer Relevance | Does the answer address the query? | LLM-as-judge
Context Precision | Are retrieved chunks relevant? | Human annotation or LLM-as-judge
Context Recall | Are all necessary chunks retrieved? | Compare against gold-standard context
Hallucination Rate | Fraction of claims not supported by any source | Automated claim-context matching

Tools
  • Ragas -- Open-source RAG evaluation framework. Computes faithfulness, relevance, precision, and recall automatically.
  • DeepEval -- LLM evaluation framework with RAG-specific metrics.
  • Custom eval pipelines -- For production systems, build a pipeline that samples N queries per day, evaluates them against the metrics above, and alerts on degradation.
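The custom-pipeline option can be sketched as a daily sampling loop. `run_rag` and `faithfulness_judge` are injected callables standing in for your pipeline and your judge (names are illustrative, not from any specific framework):

```python
import random

def daily_eval(query_log, run_rag, faithfulness_judge, sample_size=50, alert_threshold=0.9):
    """Sample recent queries, score each answer's faithfulness against its
    retrieved context, and flag an alert when the average drops too low."""
    sample = random.sample(query_log, min(sample_size, len(query_log)))
    scores = []
    for query in sample:
        answer, context = run_rag(query)
        scores.append(faithfulness_judge(answer, context))
    avg = sum(scores) / len(scores) if scores else 0.0
    return avg, avg < alert_threshold
```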

The Eval Set

Build a golden evaluation set of 100-200 query-answer-context triples. Include:

  • Questions that have a clear answer in the corpus
  • Questions that span multiple documents
  • Questions that have no answer in the corpus (the model should abstain)
  • Adversarial questions designed to trigger hallucination

Run this eval set on every change to the pipeline: new embedding model, new chunking strategy, new system prompt, new LLM version.


Summary: The Production RAG Playbook

1. Start with recursive character splitting at 400-512 tokens
2. Use hybrid search (vector + BM25) from day one
3. Add a reranker (Cohere Rerank or BGE Reranker v2)
4. Keep assembled context under 8,000 tokens
5. Include explicit grounding instructions in your system prompt
6. Build an eval set of 100+ golden examples before going live
7. Monitor faithfulness and context precision in production
8. Iterate on chunking and retrieval, not on prompt hacks

RAG is not a "set it and forget it" system. It is a pipeline with multiple tuning points. The teams that treat it like an engineering system -- with metrics, evaluation, and continuous improvement -- build RAG that works. The teams that treat it like a prompt engineering exercise build RAG that demos.
