On March 3, 2026, OpenAI published the Agents SDK and Responses API updates -- cleaner tool calling, better primitives, more "agentic" defaults. In the same 72-hour window, a fresh "agent OS" repo rocketed to the top of Hacker News with a familiar promise: drop-in autonomy, production vibes, minimal glue code.
And the comment threads read the same as they did for LangChain and AutoGen a year ago, and for every "CrewAI but more reliable" demo since:
- "Where's the retry policy?"
- "How do you replay step 7?"
- "How do you stop spend when it loops at 3 a.m.?"
- "How do you prove it didn't write to the wrong tenant?"
If your answer is "we'll add that later," you're not behind. You're about to learn the difference between scaffolding and infrastructure.
Frameworks like LangChain, CrewAI, and AutoGen are genuinely useful. They accelerate iteration. They help you discover the shape of a workflow, the boundary of a tool, the minimum prompt that doesn't collapse.
But production doesn't reward iteration speed. Production charges a reliability tax -- in duplicates, drift, and incidents that don't show up as exceptions.
Thesis: frameworks help you build agents; orchestration helps you operate them. Once you have long-lived workflows, real external dependencies, and tens (or hundreds) of concurrent runs, you're no longer "building an agent." You're operating a distributed system with LLMs inside it.
Operator proof (so you can calibrate the claims)
This checklist is distilled from building Bureau -- a multi-agent orchestration platform with 20+ specialized agents running across 4 business verticals (plumbing ops, marketing, signals intelligence, recruiting). Each vertical has its own agent team, its own workflows, and its own set of tools that touch real business data.
One representative deployment profile:
- Industry: B2B SaaS + services (support + outbound + internal ops)
- Tenants: ~120
- Workflows in production: 9 (support triage, refund intake, outbound follow-ups, enrichment, internal IT automation, etc.)
- Tool calls/day: ~18k median, ~47k peak (weekdays)
- On-call: 1 primary engineer/week + a rotation of "workflow owners" who get paged when agents enter quarantine
One quantified incident impact (the kind that doesn't show up in demos):
- A retry/commit bug caused 312 duplicate outbound sends over ~40 minutes
- Triggered 27 customer replies complaining about spam, 9 escalations to CSMs, and two accounts temporarily paused outbound permissions
- Time-to-detect: 52 minutes (no durable commit evidence; only transcript logs)
- Time-to-fix: 1 day to patch, 1 week to backfill guardrails across all side-effect tools
That's what "agent reliability" looks like in the real world: not one spectacular failure, but a bunch of boring mechanics you didn't instrument.
The three boring invariants (name them now, or pay for them later)
If your agent touches the outside world -- email, tickets, CRM, payments, deploys -- you need three invariants before you scale tool calls:
- Idempotency keys for every side effect
- Commit points (a durable "this is now true" moment per step)
- Event logs (evidence you can replay, not transcripts you can narrate)
Everything else -- budgets, retries, quarantine, tracing, tenancy -- either supports these invariants or makes them enforceable.
Transcripts are theater. Events are evidence.
The crisp definitions (with pass/fail tests)
1. Idempotency (side effects are repeat-safe)
Definition: Every write tool call can be repeated without creating a second real-world effect.
Test: If the worker crashes after calling send_email, you can retry the step and guarantee at most one message is actually sent.
2. Commit points (you know what's true even after failure)
Definition: Every step has a durable "done" boundary that is persisted independently of the LLM's memory.
Test: If the process dies at any moment, you can answer: "Did step X complete?" without guessing from logs.
3. Event logs (you can reconstruct reality)
Definition: You persist a replayable timeline of run/step/tool events with inputs/outputs referenced immutably.
Test: An on-call engineer can replay step 7 and get the same tool inputs, the same attempt history, and a reason the system chose to retry.
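The three invariants hinge on one mechanical detail: the idempotency key must be stable across retries. A minimal sketch of deriving one (the `tenant|workflow|step|entity|hash(payload)` format mirrors the key shown later in this article; the function name is illustrative):

```python
import hashlib
import json

def idempotency_key(tenant: str, workflow: str, step: str,
                    entity: str, payload: dict) -> str:
    """Derive a stable key: the same logical action yields the same key,
    even across process crashes and retries."""
    digest = hashlib.sha256(
        json.dumps(payload, sort_keys=True).encode()
    ).hexdigest()
    return f"{tenant}|{workflow}|{step}|{entity}|sha256:{digest}"

# A retry of the same step with the same payload produces the same key,
# so the provider (or your dedupe table) recognizes it as "already done".
k1 = idempotency_key("acme", "outbound_email_v2", "send_email",
                     "lead_883", {"to": "a@example.com"})
k2 = idempotency_key("acme", "outbound_email_v2", "send_email",
                     "lead_883", {"to": "a@example.com"})
assert k1 == k2
```

The `sort_keys=True` matters: two payloads with the same fields in a different order must hash identically, or "the same action" quietly becomes two keys.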
What this is based on (the real breakpoint)
The repeating breakpoint isn't "prompt quality." It's failure semantics at the tool boundary:
- Retries meeting side effects
- Schema drift at tool boundaries
- Missing commit points
- Missing budgets
- No replayable evidence when something goes wrong
Once you cross a few hundred tool calls/day, "agent bugs" stop being prompt bugs. They become systems bugs.
Where frameworks optimize vs what production demands
Frameworks tend to optimize for developer velocity: compose prompts, wire tools, add memory, run a loop.
Production demands the things demos hide:
| Demo-layer (build speed) | Production-layer (operability) |
|---|---|
| Agent roles + prompts | Durable workflow execution (step-bounded runs) |
| Tool wrappers | Retries + backoff + idempotency keys |
| "Memory" helpers | Tiered state: session / durable / audit |
| Conversation logs | Tracing + replayable event timelines |
| Single-model defaults | Multi-model routing + policy + budgets |
| "Just run async" | Queues, fan-out/fan-in, dead letters, recovery |
| Developer ergonomics | Tenancy, permissions, isolation |
If you're trying to force the left column to guarantee the right column, you didn't pick "the wrong framework." You graduated.
Incident 1: the triple-send (retries + side effects)
The workflow was simple: 1) draft email, 2) log to CRM, 3) send email.
A transient network failure happened in the worst place: after the provider accepted the send but before your app recorded completion.
The framework saw "unknown outcome" and retried.
What broke (mechanically):
- The send tool call was not idempotent
- The retry policy treated timeouts as safe-to-repeat
- There was no durable commit point proving the side effect happened
Result: the same email went out three times.
Not a catastrophe. Something worse: a trust event. Now you're writing apology emails about your apology emails. Someone pastes a screenshot into Slack. Your sales team asks if they should turn it off "until it's stable." On-call is awake, but blind.
In Bureau's plumbing vertical, this would mean a customer getting three identical booking confirmations. In the marketing vertical, it meant triplicate outbound content.
The fix wasn't "smarter prompts." It was the three boring primitives:
- Idempotency key on every side-effect tool call (so retries return "already done")
- Commit point only after you persist the provider's authoritative ID (e.g., `message_id`)
- Budgets that trip a hard stop when a run starts looping
If you can't point to step_committed, you can't reason about retries.
Worked example: retry + idempotency + commit (email send)
Here's the timeline you actually need. Goal: exactly-once effect even with at-least-once execution.
1. `step_started(send_email, attempt=1)`
2. Call provider with `Idempotency-Key: tenant|workflow|lead|step|hash(payload)`
3. Provider returns `message_id=abc123` (or "already processed" with the same `message_id`)
4. You write: `commit(send_email, message_id=abc123, committed_at=...)`
5. Only then do you mark the step complete
If the worker dies after step 3 but before step 4, the retry reuses the same idempotency key, obtains the same message_id, and can safely commit. You stop guessing. You stop hoping.
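The timeline above can be sketched end to end. This is a toy, not a real SDK: the provider, the in-memory commit store, and the function names are illustrative, but the mechanics (provider-side dedupe on the key, commit only after the authoritative ID is persisted) are the same shape you need in production:

```python
class EmailProvider:
    """Pretend provider that deduplicates on Idempotency-Key,
    as many real send APIs do."""
    def __init__(self):
        self.sent = {}       # idempotency_key -> message_id
        self.deliveries = 0  # actual real-world side effects

    def send(self, idem_key: str, payload: dict) -> str:
        if idem_key in self.sent:          # "already processed"
            return self.sent[idem_key]
        self.deliveries += 1               # the real email goes out exactly here
        message_id = f"msg_{len(self.sent) + 1}"
        self.sent[idem_key] = message_id
        return message_id

commits = {}  # stand-in for a durable commit store: step -> authoritative ID

def run_step(provider, idem_key, payload, crash_before_commit=False):
    if "send_email" in commits:            # commit point says we're already done
        return commits["send_email"]
    message_id = provider.send(idem_key, payload)
    if crash_before_commit:                # worker dies after step 3, before step 4
        raise RuntimeError("worker crashed")
    commits["send_email"] = message_id     # durable commit: "this is now true"
    return message_id

provider = EmailProvider()
key = "acme|outbound_email_v2|send_email|lead_883|sha256:9c2e"
try:
    run_step(provider, key, {"to": "x@example.com"}, crash_before_commit=True)
except RuntimeError:
    pass                                   # orchestrator sees "unknown outcome"

msg = run_step(provider, key, {"to": "x@example.com"})  # retry, SAME key
assert provider.deliveries == 1            # exactly one real send, despite two attempts
```

Remove the idempotency key from this sketch and the retry produces `deliveries == 2`: that is the triple-send, in miniature.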
Incident 2: the schema-drift week (tools + contracts)
A vendor shipped a "non-breaking" change: a field that was always present started arriving as null in some responses.
Your agent didn't crash -- worse -- it improvised.
Result: wrong CRM records updated. Not catastrophic, but slow poison: people stop trusting automation. They start double-checking. Then they stop using it. The workflow "works," but adoption dies.
The fix wasn't "make the model more careful." It was:
- Schema validation at the boundary (reject malformed tool outputs)
- Versioned tool adapters (contract changes become explicit)
- Quarantine mode (route to human review when constraints are violated)
In production, tools aren't functions. They're vendors with latency, drift, quotas, and partial failures.
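A minimal sketch of boundary validation plus quarantine, using only the standard library. The field names (`company_id`, `name`), the adapter name, and the quarantine hook are illustrative, not part of any real vendor contract:

```python
REQUIRED_FIELDS = {"company_id": str, "name": str}

class SchemaViolation(Exception):
    pass

def validate_enrichment_v2(response: dict) -> dict:
    """Versioned adapter for a hypothetical vendor contract.
    Null-or-missing required fields are rejected, not improvised around."""
    for field, expected_type in REQUIRED_FIELDS.items():
        value = response.get(field)
        if value is None:
            raise SchemaViolation(f"{field} is null/missing (vendor drift?)")
        if not isinstance(value, expected_type):
            raise SchemaViolation(f"{field}: expected {expected_type.__name__}")
    return response

def call_tool(response: dict):
    try:
        return ("ok", validate_enrichment_v2(response))
    except SchemaViolation as reason:
        # Route to human review instead of a silent wrong-record CRM write.
        return ("quarantine", str(reason))

# The "non-breaking" vendor change: a field that was always present goes null.
status, detail = call_tool({"company_id": None, "name": "Acme"})
assert status == "quarantine"
```

The point is where the check lives: at the tool boundary, before the model ever sees the payload, so drift becomes a loud quarantine event instead of a quiet wrong write.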
Minimum Viable Orchestration (MVO): the smallest thing that produces evidence
You don't need an "agent OS" to get 80% of operability. You need a durable run/step lifecycle, with events you can replay.
Here's a minimal event record that makes the system debuggable and makes cost tracking real:
```json
{
  "run_id": "run_2026_03_09_8f12",
  "task_id": "task_01492",
  "agent_id": "prospecting_writer_v3",
  "tenant_id": "acme",
  "workflow": "outbound_email_v2",
  "step": "send_email",
  "event": "tool_called",
  "ts": "2026-03-09T12:34:56Z",
  "attempt": 1,
  "idempotency_key": "acme|outbound_email_v2|send_email|lead_883|sha256:9c2e...",
  "model": "provider/model",
  "cost_usd_est": 0.038,
  "input_ref": "blob://inputs/...",
  "output_ref": "blob://outputs/...",
  "error": null
}
```
A minimal lifecycle:
```
created -> scheduled -> started
  -> step_started -> tool_called -> tool_succeeded/tool_failed
  -> step_committed
  -> completed/failed/canceled
```
This is the point where "cost tracking per agent per task" stops being a spreadsheet and becomes a query.
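"Becomes a query" is literal. Once every event carries `tenant_id`, `agent_id`, and `cost_usd_est`, attribution is a fold over the event log (the sample events below are made up for illustration; in practice this is a `GROUP BY` against your event table):

```python
from collections import defaultdict

# Illustrative event-log rows in the shape shown above.
events = [
    {"tenant_id": "acme",   "agent_id": "prospecting_writer_v3",
     "event": "tool_called", "cost_usd_est": 0.038},
    {"tenant_id": "acme",   "agent_id": "prospecting_writer_v3",
     "event": "tool_called", "cost_usd_est": 0.041},
    {"tenant_id": "globex", "agent_id": "triage_v1",
     "event": "tool_called", "cost_usd_est": 0.012},
]

# Cost per (tenant, agent): the query the spreadsheet used to fake.
cost = defaultdict(float)
for event in events:
    cost[(event["tenant_id"], event["agent_id"])] += event["cost_usd_est"]

assert round(cost[("acme", "prospecting_writer_v3")], 3) == 0.079
```

The same fold, keyed on `(tenant_id, workflow, step)` instead, gives you per-step cost envelopes, which is exactly what budget enforcement needs as input.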
The gaps that matter (and why teams build orchestration)
When teams say they "outgrew" LangChain/CrewAI/AutoGen, they usually mean they hit requirements that general-purpose frameworks can't guarantee end-to-end:
- Reliable task queueing (not just async): durable queues, retries by failure class, dead-letter inspection, fan-out/fan-in barriers
- Multi-model routing: different models for extraction vs planning vs code vs constrained "write" steps
- Persistent memory architecture: durable facts + retrieval (not just in-context)
- Cost tracking: per agent, per task, per tenant, per workflow step -- plus budget enforcement
- Vertical tool integration: adapters that speak your domain's contracts (and survive vendor drift)
This isn't NIH syndrome. It's requirements diverging.
A useful litmus test: if you have customers, you're not shipping "an agent." You're shipping behavior -- and you're on the hook for that behavior at 2 a.m.
A boring reference architecture (because boring survives incidents)
A common "graduated" shape looks like:
- FastAPI for the API boundary (auth, versioning, tenancy)
- Celery + Redis for orchestration (durable tasks; chord patterns for parallel fan-out/fan-in)
- PostgreSQL + pgvector for durable memory + retrieval
- A custom LLM gateway for multi-provider routing, budgets, logging, and policy
- Event log + tracing as the debugging substrate (replayable runs)
This is the exact stack Bureau runs on. FastAPI handles our API layer with JWT auth, Celery orchestrates 20+ agents across queues (celery, default, orchestrator, agents), PostgreSQL + pgvector stores agent memories with cosine similarity retrieval, and our custom OpenClaw gateway routes between Claude, GPT, and DALL-E based on each agent's assigned model.
If you want adjacent mental models for durable execution and recovery semantics, study Temporal's durability + replay model (even if you don't adopt it). For concrete fan-out/fan-in primitives, see Celery Canvas chord patterns.
The graduation checklist (compact, pass/fail)
If you can't say "yes" to these, you don't have a production agent system -- you have a demo with uptime.
- Side effects have idempotency keys
- Runs have durable state (run_id, step_id, attempt, status)
- You persist a replayable event log (tool_called/tool_succeeded/step_committed)
- Steps are deterministic boundaries (you can draw the workflow)
- Retry policies vary by failure class (timeouts != validation != auth)
- Budgets + hard stops exist (per run/step/tenant)
- Tool inputs/outputs are schema-validated and versioned
- Tenancy is isolated (credentials, data, logs, vector namespaces)
- There's an audit trail (who/what invoked it; what it read/wrote)
- There's a quarantine path (human review or safe fallback)
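The "retry policies vary by failure class" item deserves a concrete shape, because it's the one teams most often flatten into "retry 3 times." A sketch with illustrative classes and limits (not prescriptions):

```python
# Retry policy keyed by failure class, not a blanket retry count.
RETRY_POLICY = {
    "timeout":    {"max_attempts": 4, "needs_idempotency": True},
    "rate_limit": {"max_attempts": 6, "needs_idempotency": True},
    "validation": {"max_attempts": 1},  # the same bad payload fails the same way
    "auth":       {"max_attempts": 1},  # page a human; credentials don't self-heal
}

def should_retry(failure_class: str, attempt: int,
                 tool_is_idempotent: bool) -> bool:
    policy = RETRY_POLICY.get(failure_class, {"max_attempts": 1})
    if attempt >= policy["max_attempts"]:
        return False
    if policy.get("needs_idempotency") and not tool_is_idempotent:
        # An unknown-outcome retry without an idempotency key
        # is exactly how triple-sends happen.
        return False
    return True

assert should_retry("timeout", 1, tool_is_idempotent=True)
assert not should_retry("timeout", 1, tool_is_idempotent=False)
assert not should_retry("validation", 1, tool_is_idempotent=True)
```

Note how the policy couples two checklist items: a timeout is only safe to retry because idempotency makes the repeat harmless.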
Practical next steps (do this this week)
1. Write down your top 5 notebook-to-staging failures. Tag each: state, concurrency, tool reliability, memory, cost, observability, permissions.
2. Implement idempotency keys for side-effect tools first. Email sends, payments, CRM writes, ticket creation. Start with the "oops list": the tools that create external artifacts you can't un-send.
3. Add budgets with hard stops. Start with max tokens/dollars per run plus alerting. Then add per-step envelopes for risky actions (especially write tools).
4. Define memory tiers and a "facts table." Decide what must be durable (facts), what's ephemeral (scratch), and what must be auditable (events). Don't let "memory" be a blob of vibes.
5. Make cost queryable. Ensure every event ties to tenant_id, task_id, and agent_id. Budgeting without attribution is theater.
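Step 3's "hard stop" can be as small as this. A per-run budget envelope sketch; the thresholds and class name are illustrative, and in production the counters would live in durable state, not an object:

```python
class BudgetExceeded(Exception):
    pass

class RunBudget:
    """Per-run spending envelope with a hard stop."""
    def __init__(self, max_usd: float, max_steps: int):
        self.max_usd, self.max_steps = max_usd, max_steps
        self.spent_usd, self.steps = 0.0, 0

    def charge(self, usd: float) -> None:
        self.spent_usd += usd
        self.steps += 1
        if self.spent_usd > self.max_usd or self.steps > self.max_steps:
            raise BudgetExceeded(
                f"run halted at ${self.spent_usd:.2f} after {self.steps} steps"
            )

budget = RunBudget(max_usd=0.10, max_steps=50)
tripped = False
try:
    for _ in range(10):       # a looping run burns through the envelope...
        budget.charge(0.03)
except BudgetExceeded:
    tripped = True            # ...and halts instead of running until 3 a.m.
assert tripped and budget.steps == 4
```

The raise is the alerting hook: catch `BudgetExceeded` in the orchestrator, mark the run `failed`, and page the workflow owner, rather than letting the loop discover your credit card limit for you.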
What's your most painful "worked in the notebook" moment?
If you're building agents in production and you've felt the operational gap, reach out to us with your worst notebook-to-staging failure -- especially if it involved retries, side effects, schema drift, tenancy issues, or runaway spend.
We're collecting these stories at Clarvia -- where we're building the orchestration layer we wish existed when we started. If you're hitting the same walls, let's talk.
