AI Development

Reference Architectures for LLM Applications: RAG vs Tools vs Fine-Tuning

Clarvia Team
Mar 25, 2026

Every LLM application is an answer to the same question: how do you get a general-purpose language model to do something specific and useful?

There are three fundamental approaches, and they solve different problems:

  • RAG (Retrieval-Augmented Generation) changes what the model can see right now
  • Fine-tuning changes how the model tends to behave every time
  • Tool use changes what the model can do beyond generating text

This distinction is the single most useful mental model for LLM architecture decisions. Get it right and you build the simplest system that solves your problem. Get it wrong and you build a complex system that still does not work.

This article provides reference architectures for each approach, a decision framework for choosing between them, and guidance on hybrid patterns for production systems.


Architecture 1: Retrieval-Augmented Generation (RAG)

What It Solves

RAG solves the knowledge problem. Foundation models have a training cutoff and no access to your proprietary data. RAG gives them that access at inference time by retrieving relevant context and injecting it into the prompt.

Reference Architecture

User Query
    |
    v
[Query Embedding] --> [Vector Store] --> [Top-K Retrieval]
                                              |
                                              v
                                    [Context Assembly]
                                              |
                                              v
                        [System Prompt + Context + Query] --> [LLM] --> Response

Core Components

  1. Document Ingestion Pipeline -- Ingest, chunk, embed, and index your documents into a vector store (Pinecone, Weaviate, Qdrant, pgvector, or similar)
  2. Query Pipeline -- Embed the user query, retrieve top-K similar chunks, assemble them into a prompt, and call the LLM
  3. Reranking Layer (optional but recommended) -- After initial retrieval, use a cross-encoder reranker (Cohere Rerank, BGE Reranker, or a ColBERT model) to re-score results by relevance
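The query pipeline can be sketched end to end in a few lines. Everything here is a stand-in so the sketch runs anywhere: `embed` is a toy bag-of-characters embedder and a sorted list plays the role of the vector store; a real system would swap in an embedding model and one of the stores listed above.

```python
import math

def embed(text: str) -> list[float]:
    # Toy stand-in for a real embedding model: a normalized
    # bag-of-characters vector. Good enough to show the data flow.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - 97] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def top_k(query: str, chunks: list[str], k: int = 3) -> list[str]:
    # In production this is a vector-store query; here, brute-force
    # cosine similarity over an in-memory list.
    q = embed(query)
    score = lambda c: -sum(a * b for a, b in zip(q, embed(c)))
    return sorted(chunks, key=score)[:k]

def assemble_prompt(query: str, context: list[str]) -> str:
    joined = "\n---\n".join(context)
    return f"Answer using only the context below.\n\n{joined}\n\nQuestion: {query}"

chunks = [
    "Refunds are processed within 5 business days.",
    "Our headquarters are in Berlin.",
    "Support is available 24/7 via chat.",
]
query = "How long do refunds take?"
prompt = assemble_prompt(query, top_k(query, chunks, k=2))
```

The assembled `prompt` is what finally reaches the LLM, which is why the retrieval and assembly steps, not the model call, dominate answer quality.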

Chunking: The Decision That Matters Most

A 2025 NAACL paper found that chunking configuration influences retrieval quality as much as, or more than, the choice of embedding model. The 2026 benchmark data reinforces this:

| Strategy | Best For | Typical Chunk Size |
| --- | --- | --- |
| Recursive character splitting | General text, articles | 400-512 tokens, 10-20% overlap |
| Semantic chunking | Technical docs, contracts | Variable, boundary-aligned |
| Adaptive/topic-based chunking | Long documents with sections | Variable, heading-aligned |

Recursive splitting at 400-512 tokens with 10-20% overlap is the default starting point. It scored 69% accuracy in the largest real-document benchmark of 2026. Semantic chunking can push that to ~70% accuracy or higher, but requires more compute and tuning.

Critical rule: Keep assembled context under 8,000 tokens for most queries. Counterintuitively, shorter and more precise context produces better answers than dumping 50K tokens of retrieved text into the prompt.
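A minimal recursive character splitter looks like the sketch below. It is deliberately simplified: sizes are measured in characters as a stand-in for tokens (swap in a real tokenizer in production), and it does not recurse into single parts that exceed the chunk size.

```python
def split_text(text: str, chunk_size: int = 400, overlap: int = 60,
               separators=("\n\n", "\n", ". ", " ")) -> list[str]:
    # Try the coarsest separator first and fall back to finer ones,
    # greedily packing parts into chunks and carrying a tail of each
    # chunk into the next as overlap.
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    for sep in separators:
        parts = [p for p in text.split(sep) if p]
        if len(parts) > 1:
            chunks, current = [], ""
            for part in parts:
                candidate = current + sep + part if current else part
                if len(candidate) > chunk_size and current:
                    chunks.append(current)
                    current = current[-overlap:] + sep + part  # carry overlap
                else:
                    current = candidate
            if current:
                chunks.append(current)
            return chunks
    # No separator present at all: hard character split.
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = split_text("The quick brown fox jumps. " * 40,
                    chunk_size=200, overlap=30)
```

Tuning `chunk_size` and `overlap` against your own retrieval evaluation set is usually higher-leverage than swapping embedding models.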

When to Use RAG

  • Your data changes frequently (daily or weekly updates)
  • You need source attribution and citations
  • Your knowledge base is too large to fit in a prompt
  • You need to answer questions over proprietary or internal data
  • Accuracy and grounding matter more than response style

When RAG Is Not Enough

RAG gives the model information, but it does not change how the model formats, reasons, or stylizes its output. If your problem is behavioral -- the model gives correct answers but in the wrong format, or with the wrong tone, or with inconsistent structure -- RAG will not fix it. That is a fine-tuning problem.


Architecture 2: Fine-Tuning

What It Solves

Fine-tuning solves the behavior problem. It modifies the model's weights so it naturally produces outputs in a specific style, format, or domain without needing extensive prompt engineering.

Reference Architecture

Training Data (input-output pairs)
    |
    v
[Base Model] + [LoRA Adapter Training]
    |
    v
[Fine-Tuned Model / Adapter Weights]
    |
    v
[Inference: Prompt --> Fine-Tuned Model --> Response]

Approaches and Costs (2026)

| Method | GPU Memory | Cost (7B model) | Time | Quality |
| --- | --- | --- | --- | --- |
| Full fine-tuning | 100-120 GB | $12,000+ | Days | Highest |
| LoRA | 16-24 GB | $5-$3,000 | Hours | Near-full |
| QLoRA | 6-10 GB | $5-$500 | Hours | Comparable to LoRA |

LoRA (Low-Rank Adaptation) trains only 0.1-1% of model parameters. In most production scenarios, the results are nearly indistinguishable from full fine-tuning. QLoRA adds 4-bit quantization, reducing GPU memory by another 80% with minimal quality loss.

The practical implication: You can fine-tune a 7B model on a single consumer GPU in hours for under $20 in cloud compute. This was prohibitively expensive 18 months ago.
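The parameter arithmetic behind the "0.1-1%" claim is easy to verify. The numbers below assume a single 4096x4096 projection matrix (a typical layer size in a 7B-class model) and LoRA rank 8; both figures are illustrative, not drawn from any specific model card.

```python
# LoRA in one picture: instead of updating a full d_out x d_in weight
# matrix W, train two small factors A (r x d_in) and B (d_out x r) and
# add their scaled product to the frozen weights: W' = W + (alpha/r) * B @ A.
d_in, d_out, r = 4096, 4096, 8

full_params = d_out * d_in           # trainable params in full fine-tuning
lora_params = r * d_in + d_out * r   # trainable params in LoRA (A and B)
ratio = lora_params / full_params    # ~0.4% of the layer's parameters
```

At rank 8 this layer trains about 0.4% of its parameters, and the saving compounds across every adapted layer, which is what collapses the GPU memory and cost columns in the table above.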

When to Use Fine-Tuning

  • You need consistent output format (JSON schema compliance, specific XML structures)
  • You need domain-specific tone or style (medical, legal, brand voice)
  • You need to improve accuracy on a narrow task where the base model underperforms
  • You have high-volume, low-latency requirements (smaller fine-tuned models can replace larger general models)
  • The base model understands your domain but needs behavioral nudging

When Fine-Tuning Is Not Enough

Fine-tuning changes how the model behaves, but it does not give it new knowledge. If your problem requires access to data the model was not trained on, fine-tuning alone will not solve it. A fine-tuned model will hallucinate with better formatting.


Architecture 3: Tool Use (Function Calling / Agents)

What It Solves

Tool use solves the capability problem. Language models generate text. Tools let them take actions: query databases, call APIs, perform calculations, send emails, or interact with external systems.

Reference Architecture

User Query
    |
    v
[LLM with Tool Definitions]
    |
    v
[Decision: Generate Text OR Call Tool]
    |                    |
    v                    v
[Text Response]    [Tool Call]
                       |
                       v
                 [External System]
                       |
                       v
                 [Tool Result]
                       |
                       v
            [LLM Synthesizes Final Response]

Core Components

  1. Tool Definitions -- JSON schema describing available tools, their parameters, and expected return types. Every major LLM API (OpenAI, Anthropic, Google) supports a tool/function-calling schema.
  2. Tool Execution Layer -- Application code that receives the LLM's tool call request, validates parameters, executes the call, and returns results.
  3. Orchestration Loop -- For multi-step tasks, the agent iterates: call a tool, observe the result, decide the next step, call another tool, and so on until the task is complete.
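A minimal version of that loop is sketched below with a stubbed model and a stubbed tool so the control flow is visible. `fake_llm` stands in for a real chat-completions call; in production the reply format follows your provider's tool-calling schema.

```python
import json

def get_weather(city: str) -> str:
    # Stub tool; in production this would call a real weather API.
    return json.dumps({"city": city, "temp_c": 18})

TOOLS = {"get_weather": get_weather}

def fake_llm(messages: list[dict]) -> dict:
    # Stand-in for a real LLM call: it "decides" to call the tool once,
    # then synthesizes an answer from the tool result.
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_call": {"name": "get_weather",
                              "arguments": {"city": "Berlin"}}}
    result = json.loads(messages[-1]["content"])
    return {"text": f"It is {result['temp_c']} C in {result['city']}."}

def run_agent(query: str, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": query}]
    for _ in range(max_steps):          # bounded loop: never iterate forever
        reply = fake_llm(messages)
        if "text" in reply:             # model chose to answer in text
            return reply["text"]
        call = reply["tool_call"]       # model chose to call a tool
        output = TOOLS[call["name"]](**call["arguments"])
        messages.append({"role": "tool", "content": output})
    return "Step limit reached."

answer = run_agent("What's the weather in Berlin?")
```

The `max_steps` bound matters: without it, a model that keeps requesting tools can loop indefinitely.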

When to Use Tool Use

  • The task requires real-time data (stock prices, weather, database queries)
  • The task involves computation (math, aggregation, date calculations)
  • The task requires taking action (sending emails, creating records, updating systems)
  • The task requires multi-step reasoning with external data at each step

When Tool Use Is Not Enough

Tools expand what the model can do, but they do not improve how well it reasons about when to use them or how to interpret results. If your agent consistently chooses the wrong tool or misinterprets results, the problem may require better prompting, fine-tuning, or a different model.


The Decision Framework

Use this flowchart:

1. What is the core problem?

  • The model lacks knowledge → RAG
  • The model behaves wrong (format, tone, style) → Fine-tuning
  • The model needs to interact with external systems → Tool use

2. How often does the underlying data change?

  • Daily or more → RAG (fine-tuning cannot keep up)
  • Weekly to monthly → RAG or periodic re-fine-tuning
  • Rarely → Fine-tuning is viable

3. What is the query volume?

  • High volume, low latency → Fine-tuning a smaller model (reduces per-request cost)
  • Moderate volume → RAG with caching
  • Low volume, high complexity → Tool use with larger models

4. What is your knowledge base size?

  • Under ~200K tokens → Consider full-context prompting with prompt caching (simpler than RAG)
  • 200K to 10M tokens → RAG
  • Over 10M tokens → RAG with hierarchical retrieval

Hybrid Architectures: The Production Reality

Most production systems combine two or three approaches. The 2025 LaRA benchmark (ICML) found no silver bullet -- the best choice depends on task type, model behavior, and retrieval setup. Here are the most common hybrid patterns:

Pattern 1: RAG + Fine-Tuning (RAFT)

Fine-tune the model to work well with retrieved context. The model learns to extract information from provided documents rather than relying on its parametric knowledge. This produces better citation behavior and reduces hallucination when context is available.

Use when: You need both domain knowledge and specific output behavior.

Pattern 2: RAG + Tool Use

The model retrieves context for knowledge questions and calls tools for action queries. A router (rule-based or LLM-powered) determines which path a query takes.

Use when: Your application handles both informational queries and transactional requests.
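A router can start as simple as keyword rules. The sketch below is a deliberately naive rule-based router (the verb list is illustrative, not exhaustive); many teams later replace it with a small LLM classification call once the rule set stops scaling.

```python
# Action verbs that suggest a transactional request -> tool path.
# Everything else defaults to the retrieval (RAG) path.
ACTION_VERBS = ("send", "create", "update", "delete", "schedule", "cancel")

def route(query: str) -> str:
    first_word = query.lower().split()[0]
    return "tool" if first_word in ACTION_VERBS else "rag"
```

In practice, routing accuracy should be measured like any other component: log the chosen path alongside user feedback and review the misroutes.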

Pattern 3: Fine-Tuning + Tool Use

Fine-tune the model to use tools more reliably. The training data includes examples of correct tool selection, parameter formatting, and result interpretation.

Use when: Your agent needs to use domain-specific tools with high reliability.

Pattern 4: All Three (Full Stack)

A fine-tuned model with RAG access and tool capabilities. This is the most complex architecture and the hardest to debug. Only use this when simpler approaches have been tried and failed.

Use when: Enterprise applications with diverse requirements, but only after validating each component independently.


Practical Recommendations

  1. Start with RAG. For 80% of enterprise LLM applications, RAG with a strong base model is sufficient. Add complexity only when you hit a specific limitation.
  2. Use prompt caching before building RAG. If your knowledge base fits in 200K tokens, full-context prompting with prompt caching (available from Anthropic, Google, and OpenAI) is simpler and often more accurate.
  3. Fine-tune for behavior, not knowledge. If you are fine-tuning to inject knowledge, you are probably better off with RAG. Fine-tune for output format, tone, and task-specific reasoning patterns.
  4. Prototype with tools, productionize with guardrails. Tool use is powerful but introduces reliability risks. Every tool call needs input validation, error handling, and a timeout.
  5. Measure before you optimize. Build the simplest architecture that might work, instrument it thoroughly, and let production data tell you where the bottleneck is.
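For the guardrails recommendation, a guarded execution wrapper is a reasonable shape. This sketch is illustrative: it validates arguments against a minimal name-to-type schema and bounds execution time with a thread pool (note that a timed-out worker thread is abandoned, not killed; systems running untrusted or long-running tools often need process isolation instead).

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as ToolTimeout

def guarded_call(tool, args: dict, schema: dict, timeout_s: float = 5.0) -> dict:
    # 1. Validate: reject unknown or missing parameters before executing.
    if set(args) != set(schema):
        return {"error": f"expected parameters {sorted(schema)}, got {sorted(args)}"}
    for name, typ in schema.items():
        if not isinstance(args[name], typ):
            return {"error": f"parameter '{name}' must be {typ.__name__}"}
    # 2. Execute with a timeout, catching failures instead of crashing the agent.
    with ThreadPoolExecutor(max_workers=1) as pool:
        try:
            return {"result": pool.submit(tool, **args).result(timeout=timeout_s)}
        except ToolTimeout:
            return {"error": f"tool timed out after {timeout_s}s"}
        except Exception as exc:
            return {"error": str(exc)}

def lookup_order(order_id: int) -> str:
    # Hypothetical example tool.
    return f"order {order_id}: shipped"

ok = guarded_call(lookup_order, {"order_id": 42}, {"order_id": int})
bad = guarded_call(lookup_order, {"order": "42"}, {"order_id": int})
```

Returning errors as structured results, rather than raising, lets the orchestration loop feed the failure back to the model so it can retry or apologize gracefully.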

The right architecture is not the most sophisticated one. It is the simplest one that meets your accuracy, latency, and cost requirements.

Tags: LLM architecture, RAG vs fine-tuning, LLM tool use, AI reference architecture
