The four decisions every LLM architecture answers
After dozens of production builds, we find that LLM application architectures rarely fail because of obscure technical choices. They fail because one of four foundational decisions was made by default rather than by design. The decisions are: which model, what retrieval pattern, whether to use tools or agents, and how the system will be evaluated.
Each decision interacts with the others. The cheapest model might force a more aggressive retrieval pattern. A heavy retrieval setup might invalidate the case for fine-tuning. Agentic flows multiply the eval surface area in ways that single-shot generations do not. The architecture is the set of trade-offs, not any one choice.
Decision one: which model
The naive answer is to use the most capable proprietary model available. The answer that survives a CFO review is more nuanced. Production LLM applications usually run a tiered model strategy: a strong model for the cases that justify the cost, a cheaper model for the bulk of traffic, and a small open-weight model on the edges where latency or privacy demands it.
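As a rough illustration, the routing logic for a tiered strategy can be as small as a function that maps request properties to a tier. The model names, prices, and the difficulty signal below are placeholders, not recommendations; a sketch only.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class ModelTier:
        name: str
        cost_per_1k_tokens_usd: float

    FRONTIER = ModelTier("frontier-large", 0.015)    # hard cases that justify the cost
    WORKHORSE = ModelTier("mid-tier", 0.002)         # bulk of traffic
    EDGE = ModelTier("small-open-weight", 0.0)       # latency- or privacy-constrained paths

    def pick_tier(request: dict) -> ModelTier:
        if request.get("requires_on_prem"):              # data residency / privacy constraint
            return EDGE
        if request.get("estimated_difficulty", 0.0) > 0.8:   # your own difficulty signal
            return FRONTIER
        return WORKHORSE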
The factors that should drive the choice are: accuracy on your eval set, cost per request at production volume, latency at the 95th percentile, data residency and privacy constraints, and the rate of model deprecation in your chosen provider. The factor that should not drive the choice is which model the engineering team finds most exciting.
Fine-tuning is rarely the right first choice. Most production teams reach for it after they have exhausted prompt engineering, retrieval, and structured output, and only when there is a stylistic or behavioural pattern that prompting cannot capture reliably. Starting with fine-tuning before exhausting cheaper options usually means you are paying for fine-tuning runs while still iterating on the wrong prompt.
Decision two: retrieval, RAG, or none
Some LLM applications do not need retrieval. If the task is bounded by the model's training data and the inputs are short, you can ship without it. Most useful business applications, however, are grounded in proprietary content the model has not seen, which means retrieval is the load-bearing layer.
Standard RAG (vector search over chunked documents) is the default. It works well for question answering over a stable corpus where queries are well-formed and answers are present in the source documents. It struggles when documents are highly structured (tables, forms), when the corpus is constantly changing, or when the answer requires synthesising across many documents rather than finding one.
Hybrid retrieval (vector search plus keyword search plus metadata filtering) handles a wider range of queries reliably and is what we ship for most production knowledge-base applications. Agentic retrieval (the model decides what to search for and re-queries based on what it finds) is appropriate for genuinely open-ended questions but adds substantial latency and complexity. Reach for it when you have evidence simpler retrieval is failing, not by default.
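A minimal sketch of the hybrid idea, merging the two rankings with reciprocal rank fusion. The two search functions are stand-ins for your own vector and keyword indexes, and the metadata filter is assumed to be applied inside them.

    from collections import defaultdict

    def vector_search(query: str, filters: dict) -> list[str]:
        # Placeholder: query your vector index here, pre-filtered by metadata.
        return ["doc-3", "doc-1", "doc-7"]

    def keyword_search(query: str, filters: dict) -> list[str]:
        # Placeholder: query your keyword/BM25 index here.
        return ["doc-1", "doc-9", "doc-3"]

    def hybrid_retrieve(query: str, filters: dict, k: int = 10, rrf_k: int = 60) -> list[str]:
        scores = defaultdict(float)
        for ranking in (vector_search(query, filters), keyword_search(query, filters)):
            for rank, doc_id in enumerate(ranking):
                scores[doc_id] += 1.0 / (rrf_k + rank + 1)   # reciprocal rank fusion
        return sorted(scores, key=scores.get, reverse=True)[:k]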
Decision three: tools and agents
Tool use (the model calls a function to fetch data, run a calculation, or write to a system) is well-understood and stable. Use it whenever the model needs information or capabilities that are not in its context window. Most production LLM applications use tools.
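The mechanics are simple enough to sketch in a few lines. The tool-call shape here (a name plus JSON arguments) mirrors most function-calling APIs, but the lookup function is a placeholder rather than any specific provider's client.

    import json

    def get_order_status(order_id: str) -> dict:
        # Placeholder: the real implementation would hit an internal system.
        return {"order_id": order_id, "status": "shipped"}

    TOOLS = {"get_order_status": get_order_status}

    def dispatch(tool_call: dict) -> str:
        fn = TOOLS[tool_call["name"]]
        args = json.loads(tool_call["arguments"])
        return json.dumps(fn(**args))   # result goes back into the model's context

    print(dispatch({"name": "get_order_status",
                    "arguments": json.dumps({"order_id": "A-1042"})}))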
Multi-step agentic workflows are a different conversation. They unlock genuinely new capabilities (an agent that researches across many systems, an agent that performs a multi-step task with checkpoints) but they expand the failure surface dramatically. The system fails in more ways, the eval methodology has to grade trajectories rather than single outputs, and debugging requires tools most teams do not yet have.
Our default position: tools, yes; agents, only with an explicit eval methodology and graceful degradation. If the agent gets stuck or wanders, what happens? If the answer is 'we surface the question to a human', you have a defensible architecture. If the answer is 'we hope it does not', you have a demo, not a production system.
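A sketch of what graceful degradation means in practice: a bounded loop with an explicit escape hatch. plan_next_step and execute_step are placeholders for whatever agent framework or model calls you use.

    MAX_STEPS = 8

    def plan_next_step(task: str, history: list) -> dict:
        # Placeholder: one model call returning either an action or a final answer.
        return {"type": "final_answer", "content": f"(answer for: {task})"}

    def execute_step(step: dict) -> dict:
        # Placeholder: run the tool call or sub-task the planner asked for.
        return {"step": step, "result": None}

    def run_agent(task: str) -> dict:
        history = []
        for _ in range(MAX_STEPS):
            step = plan_next_step(task, history)
            if step["type"] == "final_answer":
                return {"status": "done", "answer": step["content"]}
            history.append(execute_step(step))
        # Escape hatch: surface the task to a human instead of looping forever.
        return {"status": "escalated", "reason": "step budget exhausted", "history": history}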
Decision four: the evaluation layer
Every LLM application needs an evaluation layer. The mistake is treating evaluation as a step before launch rather than a continuous component of the system. Models change, prompts drift, retrieval indexes update, the world changes. Without continuous evaluation you discover regressions in production from customer complaints.
The eval layer has three pieces. A static test set (your week-1 eval set, kept current) that runs on every change. A production sampling system that grades a slice of real traffic continuously. A feedback loop that captures user signals (thumbs up/down, escalations to humans, downstream outcomes) and feeds them back into the eval set over time.
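A sketch of the first piece, the static test set run on every change. Exact-match grading is used only for brevity (most teams use a rubric or an LLM judge), and generate_answer is a placeholder for the full pipeline under test.

    def generate_answer(question: str) -> str:
        # Placeholder: call the real pipeline (retrieval + generation) here.
        return "Paris"

    def run_eval(cases: list[dict], accuracy_floor: float = 0.9) -> bool:
        passed = sum(1 for c in cases
                     if generate_answer(c["question"]).strip() == c["expected"])
        score = passed / len(cases)
        print(f"eval accuracy: {score:.0%} on {len(cases)} cases")
        return score >= accuracy_floor   # gate the change on the floor, not on vibes

    cases = [{"question": "What is the capital of France?", "expected": "Paris"}]
    assert run_eval(cases), "regression: block the change"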
Off-the-shelf eval frameworks (LangSmith, Phoenix, Arize, etc.) handle the plumbing. The discipline is in the test cases. We treat the eval set as a first-class engineering artifact: versioned, reviewed, and owned by a named person on the team.
Latency, cost, and reliability: the tradeoff space
Every architecture decision moves you along three axes. Cheaper models are usually less accurate. Smaller models are usually faster. Heavier retrieval improves accuracy and adds latency. Agents add capability and reduce reliability. Where you sit on each axis should be determined by the use case, not by generic best practice.
A useful exercise: write down the latency budget, the per-request cost ceiling, and the accuracy floor before you choose components. Then evaluate every component against those constraints. Most architectures we audit have at least one component that violates one of the three, usually because the constraint was never written down.
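One way to make the exercise stick is to encode the three constraints as data and check measurements against them in CI. The numbers below are illustrative placeholders, not recommendations.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Budget:
        p95_latency_ms: int = 2500
        max_cost_per_request_usd: float = 0.03
        min_accuracy: float = 0.85

    def violations(budget: Budget, measured: dict) -> list[str]:
        problems = []
        if measured["p95_latency_ms"] > budget.p95_latency_ms:
            problems.append("latency budget exceeded")
        if measured["cost_per_request_usd"] > budget.max_cost_per_request_usd:
            problems.append("cost ceiling exceeded")
        if measured["accuracy"] < budget.min_accuracy:
            problems.append("accuracy below floor")
        return problems

    print(violations(Budget(), {"p95_latency_ms": 3100,
                                "cost_per_request_usd": 0.02,
                                "accuracy": 0.88}))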
Patterns that work in production
The single-shot RAG pattern: query, retrieve, generate, done. Cheap, fast, well-understood. Default for most knowledge-base and Q&A applications.
Tiered routing: a fast classifier picks one of several specialised pipelines (one for FAQ-type queries, one for analytical queries, one for action-taking queries). Reduces average cost and latency by sending easy queries through cheap paths.
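A sketch of that routing layer, assuming the classifier is a cheap model call (or even a rules pass) that returns one of a fixed set of labels. classify_query and the pipeline stubs are placeholders.

    def classify_query(query: str) -> str:
        # Placeholder: a small, fast model (or rules) returning one of the labels below.
        return "faq"

    PIPELINES = {
        "faq": lambda q: f"(cheap single-shot answer to: {q})",
        "analytical": lambda q: f"(hybrid retrieval + strong model for: {q})",
        "action": lambda q: f"(guardrailed tool-using pipeline for: {q})",
    }

    def route(query: str) -> str:
        label = classify_query(query)
        return PIPELINES.get(label, PIPELINES["faq"])(query)   # unknown labels fall back to the cheap path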
Re-ranked hybrid retrieval: combine vector and keyword retrieval, then re-rank with a small model before passing to the generator. Substantially improves retrieval precision over plain vector search at modest extra cost.
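A sketch of the re-ranking step on top of a hybrid retriever. rerank_score stands in for a small cross-encoder or a cheap LLM scoring call; the word-overlap heuristic here is only a placeholder.

    def rerank_score(query: str, doc: str) -> float:
        # Placeholder: a small cross-encoder or cheap LLM call scoring relevance.
        return float(len(set(query.lower().split()) & set(doc.lower().split())))

    def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
        ranked = sorted(candidates, key=lambda d: rerank_score(query, d), reverse=True)
        return ranked[:top_n]   # only the top few reach the generator's context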
Constrained tool use with guardrails: the model can call a fixed set of tools, every tool call is validated, and the system fails closed (refuses to act) on validation failure. The default for any LLM that takes actions in production systems.
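A sketch of the fail-closed posture: every call is checked against an allow-list and an argument schema before anything executes, and any validation failure refuses the action rather than guessing. The refund tool and its schema are hypothetical.

    ALLOWED_TOOLS = {
        "refund_order": {"order_id": str, "amount_usd": float},
    }

    def refund_order(order_id: str, amount_usd: float) -> str:
        # Placeholder: the real side-effecting call.
        return f"refunded {amount_usd} on {order_id}"

    TOOL_IMPLS = {"refund_order": refund_order}

    def guarded_call(tool_call: dict) -> str:
        schema = ALLOWED_TOOLS.get(tool_call["name"])
        if schema is None:
            return "refused: unknown tool"                     # fail closed
        args = tool_call.get("arguments", {})
        if set(args) != set(schema) or not all(isinstance(args[k], t) for k, t in schema.items()):
            return "refused: arguments failed validation"      # fail closed
        return TOOL_IMPLS[tool_call["name"]](**args)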
Patterns that fail
Single-prompt monoliths that try to do everything in one call. They look elegant in a demo and fail unpredictably as scope grows. Decompose into named steps with eval at each step.
Unconstrained agentic loops with no termination criteria. They will eventually find a way to spin forever or hallucinate themselves into a corner. Always cap iterations and always have an escape hatch.
RAG over an index that is updated manually. The corpus drifts from production reality and the system silently degrades. Build incremental indexing or do not bother with retrieval at all.
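A sketch of the incremental alternative: hash each document and re-embed only the ones whose content changed since the last run. upsert_embedding is a placeholder for your vector store client, and deletions would be handled by diffing ids the same way.

    import hashlib

    seen_hashes: dict[str, str] = {}   # doc_id -> content hash from the last sync

    def upsert_embedding(doc_id: str, text: str) -> None:
        # Placeholder: embed `text` and upsert it into your vector store.
        print(f"re-indexed {doc_id}")

    def sync(corpus: dict[str, str]) -> None:
        for doc_id, text in corpus.items():
            digest = hashlib.sha256(text.encode()).hexdigest()
            if seen_hashes.get(doc_id) != digest:   # new or changed document
                upsert_embedding(doc_id, text)
                seen_hashes[doc_id] = digest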
Fine-tuning a model and then changing the underlying base model six months later. The fine-tune does not transfer cleanly. If you are going to fine-tune, plan for the version-pinning, retraining, and migration costs.