The four decisions every LLM architecture answers
After dozens of production builds, we find that LLM application architectures rarely fail because of obscure technical choices. They fail because one of four foundational decisions was made by default rather than by design. The decisions are: which model, what retrieval pattern, whether to use tools or agents, and how the system will be evaluated.
Each decision interacts with the others. The cheapest model might force a more aggressive retrieval pattern. A heavy retrieval setup might invalidate the case for fine-tuning. Agentic flows multiply the eval surface area in ways that single-shot generations do not. The architecture is the set of trade-offs, not any one choice.
Decision one: which model
The naive answer is to use the most capable proprietary model available. The answer that survives a CFO review is more nuanced. Production LLM applications usually run a tiered model strategy: a strong model for the cases that justify the cost, a cheaper model for the bulk of traffic, and a small open-weight model on the edges where latency or privacy demands it.
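As a rough illustration, the routing logic for a tiered strategy can be as small as a function that maps request properties to a tier. The model names, prices, and the difficulty signal below are placeholders, not recommendations; a sketch only.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class ModelTier:
        name: str
        cost_per_1k_tokens_usd: float

    FRONTIER = ModelTier("frontier-large", 0.015)    # hard cases that justify the cost
    WORKHORSE = ModelTier("mid-tier", 0.002)         # bulk of traffic
    EDGE = ModelTier("small-open-weight", 0.0)       # latency- or privacy-constrained paths

    def pick_tier(request: dict) -> ModelTier:
        if request.get("requires_on_prem"):              # data residency / privacy constraint
            return EDGE
        if request.get("estimated_difficulty", 0.0) > 0.8:   # your own difficulty signal
            return FRONTIER
        return WORKHORSE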
The factors that should drive the choice are: accuracy on your eval set, cost per request at production volume, latency at the 95th percentile, data residency and privacy constraints, and the rate of model deprecation in your chosen provider. The factor that should not drive the choice is which model the engineering team finds most exciting.
Fine-tuning is rarely the right first choice. Most production teams reach for it after they have exhausted prompt engineering, retrieval, and structured output, and only when there is a stylistic or behavioural pattern that prompting cannot capture reliably. Starting with fine-tuning before exhausting cheaper options usually means you are paying for fine-tuning runs while still iterating on the wrong prompt.
Decision two: retrieval, RAG, or none
Some LLM applications do not need retrieval. If the task is bounded by the model's training data and the inputs are short, you can ship without it. Most useful business applications, however, are grounded in proprietary content the model has not seen, which means retrieval is the load-bearing layer.
Standard RAG (vector search over chunked documents) is the default. It works well for question answering over a stable corpus where queries are well-formed and answers are present in the source documents. It struggles when documents are highly structured (tables, forms), when the corpus is constantly changing, or when the answer requires synthesising across many documents rather than finding one.
Hybrid retrieval (vector search plus keyword search plus metadata filtering) handles a wider range of queries reliably and is what we ship for most production knowledge-base applications. Agentic retrieval (the model decides what to search for and re-queries based on what it finds) is appropriate for genuinely open-ended questions but adds substantial latency and complexity. Reach for it when you have evidence simpler retrieval is failing, not by default.
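A minimal sketch of the hybrid idea, merging the two rankings with reciprocal rank fusion. The two search functions are stand-ins for your own vector and keyword indexes, and the metadata filter is assumed to be applied inside them.

    from collections import defaultdict

    def vector_search(query: str, filters: dict) -> list[str]:
        # Placeholder: query your vector index here, pre-filtered by metadata.
        return ["doc-3", "doc-1", "doc-7"]

    def keyword_search(query: str, filters: dict) -> list[str]:
        # Placeholder: query your keyword/BM25 index here.
        return ["doc-1", "doc-9", "doc-3"]

    def hybrid_retrieve(query: str, filters: dict, k: int = 10, rrf_k: int = 60) -> list[str]:
        scores = defaultdict(float)
        for ranking in (vector_search(query, filters), keyword_search(query, filters)):
            for rank, doc_id in enumerate(ranking):
                scores[doc_id] += 1.0 / (rrf_k + rank + 1)   # reciprocal rank fusion
        return sorted(scores, key=scores.get, reverse=True)[:k]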
Decision three: tools and agents
Tool use (the model calls a function to fetch data, run a calculation, or write to a system) is well-understood and stable. Use it whenever the model needs information or capabilities that are not in its context window. Most production LLM applications use tools.
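The mechanics are simple enough to sketch in a few lines. The tool-call shape here (a name plus JSON arguments) mirrors most function-calling APIs, but the lookup function is a placeholder rather than any specific provider's client.

    import json

    def get_order_status(order_id: str) -> dict:
        # Placeholder: the real implementation would hit an internal system.
        return {"order_id": order_id, "status": "shipped"}

    TOOLS = {"get_order_status": get_order_status}

    def dispatch(tool_call: dict) -> str:
        fn = TOOLS[tool_call["name"]]
        args = json.loads(tool_call["arguments"])
        return json.dumps(fn(**args))   # result goes back into the model's context

    print(dispatch({"name": "get_order_status",
                    "arguments": json.dumps({"order_id": "A-1042"})}))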
Multi-step agentic workflows are a different conversation. They unlock genuinely new capabilities (an agent that researches across many systems, an agent that performs a multi-step task with checkpoints) but they expand the failure surface dramatically. The system fails in more ways, the eval methodology has to grade trajectories rather than single outputs, and debugging requires tools most teams do not yet have.
Our default position: tools, yes; agents, only with an explicit eval methodology and graceful degradation. If the agent gets stuck or wanders, what happens? If the answer is 'we surface the question to a human', you have a defensible architecture. If the answer is 'we hope it does not', you have a demo, not a production system.
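A sketch of what graceful degradation means in practice: a bounded loop with an explicit escape hatch. plan_next_step and execute_step are placeholders for whatever agent framework or model calls you use.

    MAX_STEPS = 8

    def plan_next_step(task: str, history: list) -> dict:
        # Placeholder: one model call returning either an action or a final answer.
        return {"type": "final_answer", "content": f"(answer for: {task})"}

    def execute_step(step: dict) -> dict:
        # Placeholder: run the tool call or sub-task the planner asked for.
        return {"step": step, "result": None}

    def run_agent(task: str) -> dict:
        history = []
        for _ in range(MAX_STEPS):
            step = plan_next_step(task, history)
            if step["type"] == "final_answer":
                return {"status": "done", "answer": step["content"]}
            history.append(execute_step(step))
        # Escape hatch: surface the task to a human instead of looping forever.
        return {"status": "escalated", "reason": "step budget exhausted", "history": history}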
Decision four: the evaluation layer
Every LLM application needs an evaluation layer. The mistake is treating evaluation as a step before launch rather than a continuous component of the system. Models change, prompts drift, retrieval indexes update, the world changes. Without continuous evaluation you discover regressions in production from customer complaints.
The eval layer has three pieces. A static test set (your week-1 eval set, kept current) that runs on every change. A production sampling system that grades a slice of real traffic continuously. A feedback loop that captures user signals (thumbs up/down, escalations to humans, downstream outcomes) and feeds them back into the eval set over time.
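A sketch of the first piece, the static test set run on every change. Exact-match grading is used only for brevity (most teams use a rubric or an LLM judge), and generate_answer is a placeholder for the full pipeline under test.

    def generate_answer(question: str) -> str:
        # Placeholder: call the real pipeline (retrieval + generation) here.
        return "Paris"

    def run_eval(cases: list[dict], accuracy_floor: float = 0.9) -> bool:
        passed = sum(1 for c in cases
                     if generate_answer(c["question"]).strip() == c["expected"])
        score = passed / len(cases)
        print(f"eval accuracy: {score:.0%} on {len(cases)} cases")
        return score >= accuracy_floor   # gate the change on the floor, not on vibes

    cases = [{"question": "What is the capital of France?", "expected": "Paris"}]
    assert run_eval(cases), "regression: block the change"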
Off-the-shelf eval frameworks (LangSmith, Phoenix, Arize, etc.) handle the plumbing. The discipline is in the test cases. We treat the eval set as a first-class engineering artifact: versioned, reviewed, and owned by a named person on the team.
Latency, cost, and reliability: the tradeoff space
Every architecture decision moves you along three axes. Cheaper models are usually less accurate. Smaller models are usually faster. Heavier retrieval improves accuracy and adds latency. Agents add capability and reduce reliability. Where you sit on each axis should be determined by the use case, not by generic best practice.
A useful exercise: write down the latency budget, the per-request cost ceiling, and the accuracy floor before you choose components. Then evaluate every component against those constraints. Most architectures we audit have at least one component that violates one of the three, usually because the constraint was never written down.
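One way to make the exercise stick is to encode the three constraints as data and check measurements against them in CI. The numbers below are illustrative placeholders, not recommendations.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Budget:
        p95_latency_ms: int = 2500
        max_cost_per_request_usd: float = 0.03
        min_accuracy: float = 0.85

    def violations(budget: Budget, measured: dict) -> list[str]:
        problems = []
        if measured["p95_latency_ms"] > budget.p95_latency_ms:
            problems.append("latency budget exceeded")
        if measured["cost_per_request_usd"] > budget.max_cost_per_request_usd:
            problems.append("cost ceiling exceeded")
        if measured["accuracy"] < budget.min_accuracy:
            problems.append("accuracy below floor")
        return problems

    print(violations(Budget(), {"p95_latency_ms": 3100,
                                "cost_per_request_usd": 0.02,
                                "accuracy": 0.88}))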
Patterns that work in production
The single-shot RAG pattern: query, retrieve, generate, done. Cheap, fast, well-understood. Default for most knowledge-base and Q&A applications.
Tiered routing: a fast classifier picks one of several specialised pipelines (one for FAQ-type queries, one for analytical queries, one for action-taking queries). Reduces average cost and latency by sending easy queries through cheap paths.
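A sketch of that routing layer, assuming the classifier is a cheap model call (or even a rules pass) that returns one of a fixed set of labels. classify_query and the pipeline stubs are placeholders.

    def classify_query(query: str) -> str:
        # Placeholder: a small, fast model (or rules) returning one of the labels below.
        return "faq"

    PIPELINES = {
        "faq": lambda q: f"(cheap single-shot answer to: {q})",
        "analytical": lambda q: f"(hybrid retrieval + strong model for: {q})",
        "action": lambda q: f"(guardrailed tool-using pipeline for: {q})",
    }

    def route(query: str) -> str:
        label = classify_query(query)
        return PIPELINES.get(label, PIPELINES["faq"])(query)   # unknown labels fall back to the cheap path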
Re-ranked hybrid retrieval: combine vector and keyword retrieval, then re-rank with a small model before passing to the generator. Substantially improves retrieval precision over plain vector search at modest extra cost.
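A sketch of the re-ranking step on top of a hybrid retriever. rerank_score stands in for a small cross-encoder or a cheap LLM scoring call; the word-overlap heuristic here is only a placeholder.

    def rerank_score(query: str, doc: str) -> float:
        # Placeholder: a small cross-encoder or cheap LLM call scoring relevance.
        return float(len(set(query.lower().split()) & set(doc.lower().split())))

    def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
        ranked = sorted(candidates, key=lambda d: rerank_score(query, d), reverse=True)
        return ranked[:top_n]   # only the top few reach the generator's context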
Constrained tool use with guardrails: the model can call a fixed set of tools, every tool call is validated, and the system fails closed (refuses to act) on validation failure. The default for any LLM that takes actions in production systems.
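A sketch of the fail-closed posture: every call is checked against an allow-list and an argument schema before anything executes, and any validation failure refuses the action rather than guessing. The refund tool and its schema are hypothetical.

    ALLOWED_TOOLS = {
        "refund_order": {"order_id": str, "amount_usd": float},
    }

    def refund_order(order_id: str, amount_usd: float) -> str:
        # Placeholder: the real side-effecting call.
        return f"refunded {amount_usd} on {order_id}"

    TOOL_IMPLS = {"refund_order": refund_order}

    def guarded_call(tool_call: dict) -> str:
        schema = ALLOWED_TOOLS.get(tool_call["name"])
        if schema is None:
            return "refused: unknown tool"                     # fail closed
        args = tool_call.get("arguments", {})
        if set(args) != set(schema) or not all(isinstance(args[k], t) for k, t in schema.items()):
            return "refused: arguments failed validation"      # fail closed
        return TOOL_IMPLS[tool_call["name"]](**args)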
Patterns that fail
Single-prompt monoliths that try to do everything in one call. They look elegant in a demo and fail unpredictably as scope grows. Decompose into named steps with eval at each step.
Unconstrained agentic loops with no termination criteria. They will eventually find a way to spin forever or hallucinate themselves into a corner. Always cap iterations and always have an escape hatch.
RAG over an index that is updated manually. The corpus drifts from production reality and the system silently degrades. Build incremental indexing or do not bother with retrieval at all.
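A sketch of the incremental alternative: hash each document and re-embed only the ones whose content changed since the last run. upsert_embedding is a placeholder for your vector store client, and deletions would be handled by diffing ids the same way.

    import hashlib

    seen_hashes: dict[str, str] = {}   # doc_id -> content hash from the last sync

    def upsert_embedding(doc_id: str, text: str) -> None:
        # Placeholder: embed `text` and upsert it into your vector store.
        print(f"re-indexed {doc_id}")

    def sync(corpus: dict[str, str]) -> None:
        for doc_id, text in corpus.items():
            digest = hashlib.sha256(text.encode()).hexdigest()
            if seen_hashes.get(doc_id) != digest:   # new or changed document
                upsert_embedding(doc_id, text)
                seen_hashes[doc_id] = digest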
Fine-tuning a model and then changing the underlying base model six months later. The fine-tune does not transfer cleanly. If you are going to fine-tune, plan for the version-pinning, retraining, and migration costs.