The three approaches at a glance
Retrieval-Augmented Generation (RAG): the model receives relevant content alongside the query and uses it to answer. The model itself does not change; the context the model sees does. Cheap, fast, easy to update (change the documents, the answers change).
Fine-tuning: the model itself is retrained on a dataset of examples to learn a behaviour or style. The base model becomes a customised model. More expensive, slower to iterate, but capable of teaching the model patterns that prompting cannot reliably elicit.
Tool use (sometimes called function calling or agents): the model can call external functions to fetch live data, run computations, or take actions. The knowledge or capability is not in the model; it is in the tools the model can invoke.
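A rough way to see where the customisation lives in each approach is to look at the shape of the request each one produces. This is an illustrative sketch, not any real API: the model names and field names are invented.

```python
# Where the customisation lives in each approach (all names illustrative):
#  - RAG changes the *input* the model sees
#  - fine-tuning changes the *model* itself
#  - tool use changes what the model can *do*

def rag_request(query, retrieved_docs):
    # Same base model; the context it sees is what changed.
    return {"model": "base", "prompt": f"{retrieved_docs}\n\nQ: {query}"}

def finetuned_request(query):
    # Short prompt; the behaviour lives in the retrained weights.
    return {"model": "base-ft-support-v1", "prompt": query}

def tool_request(query, tool_schemas):
    # Same base model; the tools it may invoke are what changed.
    return {"model": "base", "prompt": query, "tools": tool_schemas}
```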
Each approach is good at something the others are bad at. Confusing them is the source of most architecture mistakes we audit.
When RAG wins
When the task is grounding answers in proprietary content the model has not seen. Knowledge bases, internal documentation, support content, product specifications, legal or compliance documents. If the answer should come from your content, RAG is almost always the right approach.
When the content changes. RAG updates by re-indexing documents. Fine-tuning updates by retraining, which is expensive and slow. If the content updates more than every quarter, RAG is structurally better.
When you need citations. RAG can return the source documents alongside the answer, which makes verification possible. Fine-tuning produces answers without provenance. In regulated or high-trust contexts, citations are not optional.
When you want fast iteration. Adding a new document to a RAG system takes minutes. Adding new behaviour to a fine-tuned model takes a training run.
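The citation point above can be made concrete with a minimal RAG sketch. A toy keyword-overlap scorer stands in for a real vector index, and all the names and document contents here are illustrative:

```python
# Toy RAG sketch: keyword-overlap retrieval standing in for a real
# vector index. Source ids are carried through so answers can cite them.

def retrieve(query, docs, k=2):
    """Score each doc by word overlap with the query; return top-k (id, text) pairs."""
    q_words = set(query.lower().split())
    scored = sorted(
        docs.items(),
        key=lambda item: len(q_words & set(item[1].lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query, retrieved):
    """Ground the model in retrieved content, keeping ids for citation."""
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in retrieved)
    return f"Answer using only the sources below and cite their ids.\n{context}\nQ: {query}"

docs = {
    "kb-1": "Refunds are processed within 5 business days",
    "kb-2": "Our office is closed on public holidays",
}
retrieved = retrieve("how long do refunds take", docs)
sources = [doc_id for doc_id, _ in retrieved]  # returned alongside the answer
```

Updating the system is just updating `docs` and re-indexing, which is the structural advantage over retraining.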
When fine-tuning wins
When you need a specific output format or style that prompting cannot reliably produce. Specialised summarisation styles, domain-specific terminology, structured output that has to match an exact schema even on edge cases.
When latency matters and prompting your way to the right behaviour requires a long prompt. Fine-tuning can move the behaviour into the model, allowing shorter prompts and faster responses.
When the task is narrow, well-defined, and high-volume. The economics favour fine-tuning when the behaviour will be invoked many times and the cost of teaching the model the behaviour amortises across all those invocations.
When you want to deploy a smaller, cheaper model on a task a larger model handles via prompting. A fine-tuned smaller model can match or beat a generic larger model on a narrow task at a fraction of the inference cost.
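Fine-tuning data for this kind of narrow, high-volume task is typically a file of example conversations demonstrating the target behaviour. The sketch below uses the common chat-style JSONL shape; the exact schema varies by provider, and the content here is invented:

```python
import json

# Illustrative training examples teaching a fixed summarisation style.
# The schema mirrors the widely used chat-format JSONL convention, but
# check your provider's documentation for the exact required fields.
examples = [
    {"messages": [
        {"role": "system", "content": "Summarise in exactly one bullet point."},
        {"role": "user", "content": "Long incident report text..."},
        {"role": "assistant", "content": "- Database failover caused a 12 min outage."},
    ]},
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```

The cost of assembling and maintaining a file like this, at the hundreds-of-examples scale real fine-tunes need, is exactly the cost that has to amortise across invocations.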
When tools and agents win
When the AI needs to take actions in the world. Sending an email, updating a record, scheduling a meeting, processing a payment. These are not knowledge problems; they are capability problems, and tools are how LLMs get capabilities.
When the answer requires fresh data the model cannot have memorised. Today's stock price, current inventory levels, the user's most recent order. RAG can handle this if the data is in a document store, but a tool call is more direct.
When the task requires multi-step reasoning across systems. An agent that researches a question by querying multiple data sources, synthesises the answers, and produces a result. This is the hardest pattern to get right but the most powerful when it works.
When determinism matters for parts of the task. Calculations, data lookups, and rule applications should be tools, not prompts. Models are bad at arithmetic; calculators are good at arithmetic. Use the right tool for each step.
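The determinism point can be sketched as a tool registry where arithmetic is delegated to code rather than to the model. The dispatch structure below is illustrative; the structured call is what a function-calling model would emit:

```python
import ast
import operator

def calculate(expression: str) -> float:
    """Deterministically evaluate basic arithmetic; no model involvement."""
    ops = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv}
    def ev(node):
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in ops:
            return ops[type(node.op)](ev(node.left), ev(node.right))
        raise ValueError("unsupported expression")
    return ev(ast.parse(expression, mode="eval").body)

TOOLS = {"calculate": calculate}

# The model would emit a structured call like this; application code dispatches it.
call = {"name": "calculate", "arguments": {"expression": "1299 * 1.2"}}
result = TOOLS[call["name"]](**call["arguments"])
```

The model picks the tool and the arguments; the answer itself comes from code that is right every time.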
The decision tree
Start with one question: what does the AI need to do that it cannot already do?
If the gap is knowledge (it does not know your content), start with RAG. RAG handles most knowledge gaps and is cheap to iterate on. Reach for fine-tuning only if RAG is failing despite good retrieval.
If the gap is behaviour (it does not produce the right format, style, or output structure consistently), try sophisticated prompting first, then RAG with examples in the context, and only then fine-tuning. Fine-tuning is rarely the first answer; it is the answer when the cheaper options have been exhausted.
If the gap is capability (it cannot take actions or fetch live data), use tools. There is no way around this. RAG and fine-tuning will not help.
If the gap is multi-step reasoning across systems, you are looking at agents. Agents combine tools with multi-turn LLM reasoning. They are the most powerful and the hardest to make reliable.
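The decision tree above fits in a few lines of code. The gap categories and recommendations mirror the text; the function name is illustrative:

```python
# The decision tree as a lookup: map the gap to the first approach to try.
def first_approach(gap: str) -> str:
    return {
        "knowledge": "RAG (reach for fine-tuning only if RAG fails despite good retrieval)",
        "behaviour": "prompting, then RAG with examples in context, then fine-tuning",
        "capability": "tools (RAG and fine-tuning will not help)",
        "multi-step": "agents (tools plus multi-turn LLM reasoning)",
    }.get(gap, "clarify the gap before choosing an approach")
```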
Hybrid patterns that work
RAG plus tools. The model retrieves knowledge from documents and also calls tools for live data. Most production knowledge-base applications work this way: documents for stable knowledge, tools for live signals.
Fine-tuned model plus RAG. A model fine-tuned for a specific output style or task, layered with retrieval for the content that informs each answer. Common in production support and sales applications where both behaviour and knowledge need to be customised.
Multi-stage pipelines with each stage choosing its own approach. A classifier (fine-tuned for speed) decides what kind of query came in. A retrieval stage (RAG) gathers relevant content. A generator (general model with tools) produces the answer. Each stage uses the right approach for its job.
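That pipeline shape can be sketched with stage stubs. In production each stage would be a model call (classifier, retriever, generator); everything here, including the toy knowledge base, is illustrative:

```python
# Multi-stage pipeline sketch: each stage uses the approach that fits it.

def classify(query):
    # Stage 1: a fine-tuned small model would route the query quickly.
    return "billing" if "refund" in query.lower() else "general"

def retrieve(query, category):
    # Stage 2: a RAG stage gathers grounding content for the category.
    kb = {"billing": ["Refunds take 5 business days."],
          "general": ["See our help centre."]}
    return kb[category]

def generate(query, context):
    # Stage 3: a general model (with tools, if needed) produces the answer.
    return f"Based on: {context[0]}"

def pipeline(query):
    category = classify(query)
    context = retrieve(query, category)
    return generate(query, context)
```

The value of the decomposition is that each stage can be swapped, measured, and scaled independently.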
The pattern to avoid is using all three on a problem that only needs one. Complexity has a cost; reach for hybrid only when the simpler approach is genuinely insufficient.
Common mistakes
Reaching for fine-tuning first because it sounds powerful. Fine-tuning has a high build and maintenance cost, and most behaviours that teams want to fine-tune for can be achieved with better prompting plus RAG.
Over-investing in agents before the foundation is solid. Multi-step agentic systems are seductive in demos and brittle in production. Build a working RAG-plus-tools system first, validate that it solves your problem, then consider whether agents add enough value to justify the additional complexity.
Using RAG as a search engine. RAG is generation grounded in retrieval. If your query is 'find me documents about X', what you want is search, not RAG. Putting an LLM in front of search adds latency and cost without adding value.
Confusing fine-tuning with custom training. Fine-tuning adapts a pre-trained model to a narrow task. Training a model from scratch is a different (much more expensive) thing that almost no production teams should be doing.