Why six weeks (not six days, not six months)
Six days is enough time to ship a demo that works on cherry-picked inputs. It is not enough to build evaluation, monitoring, fallback behaviour, and the runbook your support team will use when something breaks at 3am. Teams that try to compress AI delivery into a one-week sprint typically ship the demo and then spend three months patching production issues that surface after launch.
Six months, on the other hand, is the default consulting cadence and it is too slow for AI work. AI features iterate faster than traditional software because the underlying models, data, and prompts change continuously. A six-month build loop usually ships a feature against a model that has been superseded twice and a prompt strategy that no longer reflects best practice.
Six weeks is the cadence where you can do the full loop, discovery through production, with all the artifacts a real operating system needs, while still being fast enough to keep the work tied to a single business goal that has not shifted underneath it.
The weekly cadence at a glance
Each week has a defined output. Skipping the output, even when the work feels almost done, is the most common failure mode. Each week's deliverable is also the input the following week depends on, so a missed milestone compounds.
- Week 1: Discovery, scope lock, evaluation criteria written down
- Week 2: Architecture decision, evaluation harness running on baseline
- Week 3: First end-to-end working flow, eval harness integrated into the build loop
- Week 4: Feature-complete build with monitoring instrumentation in place
- Week 5: Internal pilot with operations, runbook, and rollback plan
- Week 6: Controlled production rollout with monitoring and human review queue
Week 1: Discovery and scope lock
The week begins with a working session that maps three things: the data the feature will use, the decisions the feature will make, and the failure modes that would matter to the business. By Friday, you should have a written scope document that names the AI capability, the in-scope inputs and outputs, the evaluation criteria, and the things explicitly out of scope.
The single most useful artifact from week 1 is the evaluation set. Pull a hundred real examples from existing data, with the answer or label your team would expect, before any model is wired up. This evaluation set becomes the goalposts. If you skip this step, you will spend weeks 5 and 6 arguing about whether the feature is good enough, with no objective measure to settle the question.
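The eval set can be as simple as a JSONL file of input/expected pairs. A minimal sketch, assuming a hypothetical support-ticket intent-classification feature (the file name, ids, and labels here are illustrative, not prescribed by the framework):

```python
import json

# Hypothetical week-1 eval set: each record pairs a real input with the
# label the team expects, written down before any model is wired up.
examples = [
    {"id": "t-001", "input": "Where is my refund?", "expected": "billing"},
    {"id": "t-002", "input": "App crashes on login", "expected": "technical"},
    {"id": "t-003", "input": "Cancel my subscription", "expected": "account"},
]

with open("eval_set.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Reading it back is the harness's job in week 2.
with open("eval_set.jsonl") as f:
    loaded = [json.loads(line) for line in f]
print(len(loaded))  # three here; the real set should have ~100
```

The format matters less than the discipline: every example comes from real data, and the expected answer is agreed before a model ever sees it.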
The mistake teams make in week 1 is treating it as a kickoff meeting rather than a build week. Discovery is build work; the artifacts produced (scope doc, eval set, data access plan, success metrics) are the foundation that everything else depends on.
Weeks 2 to 4: Build with the eval harness running
Week 2 picks the architecture. For most AI features the question collapses to: which model, which retrieval pattern (none, RAG, agentic), and which evaluation methodology. Decisions get written down with rationale. The evaluation harness from week 1 gets wired up so every change can be graded.
Weeks 3 and 4 are the core build. The discipline that distinguishes production AI from a demo is that the evaluation harness runs continuously. Every prompt change, every retrieval tweak, every model swap is graded against the eval set before being merged. Regressions get caught in minutes, not in production.
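The grading loop itself is small. A sketch of the merge gate, with a keyword-matching stub standing in for the real prompt-plus-model call and an assumed 90% accuracy threshold:

```python
# Minimal eval-harness sketch: grade a candidate against the eval set and
# fail the merge if accuracy drops below a threshold. The candidate here
# is a stub standing in for the actual model call under test.
def candidate(text: str) -> str:
    keywords = {"refund": "billing", "crash": "technical", "cancel": "account"}
    for kw, label in keywords.items():
        if kw in text.lower():
            return label
    return "unknown"

def run_eval(fn, eval_set, threshold=0.9):
    correct = sum(1 for ex in eval_set if fn(ex["input"]) == ex["expected"])
    accuracy = correct / len(eval_set)
    return accuracy, accuracy >= threshold

eval_set = [
    {"input": "Where is my refund?", "expected": "billing"},
    {"input": "App crashes on login", "expected": "technical"},
    {"input": "Cancel my subscription", "expected": "account"},
]

accuracy, passed = run_eval(candidate, eval_set)
print(f"accuracy={accuracy:.2f} passed={passed}")
```

Wiring `run_eval` into CI so it runs on every pull request is what turns the eval set from a document into a regression gate.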
By the end of week 4, the feature works end-to-end on the eval set with measurable accuracy, monitoring is wired in, and the cost-per-request is known. If you do not have these three things by Friday of week 4, weeks 5 and 6 will not save you.
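Knowing cost-per-request is simple arithmetic over logged token counts. A sketch, where the per-million-token prices are illustrative placeholders rather than any real model's pricing:

```python
# Cost-per-request accounting over a sample of logged requests.
# Prices are assumed placeholders, not real model pricing.
PRICE_IN_PER_M = 3.00    # assumed input price, USD per 1M tokens
PRICE_OUT_PER_M = 15.00  # assumed output price, USD per 1M tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * PRICE_IN_PER_M
            + output_tokens * PRICE_OUT_PER_M) / 1_000_000

# (input_tokens, output_tokens) pairs logged during eval runs.
logged = [(1200, 300), (900, 250), (2000, 400)]
avg_cost = sum(request_cost(i, o) for i, o in logged) / len(logged)
print(f"avg cost per request: ${avg_cost:.4f}")
```

Multiplying that average by projected traffic gives the monthly run-rate number leadership will ask for in week 5.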
Weeks 5 and 6: Production rollout with controls
Week 5 is an internal pilot. The feature runs against real production traffic but the outputs are reviewed by a human before reaching the customer. This is where you find the failure modes the eval set missed: the unusual phrasing, the edge case data, the integration glitch. You also use this week to write the runbook, the on-call playbook, and the rollback procedure.
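The review gate can be sketched as a simple queue that surfaces low-confidence outputs first. The confidence threshold and field names below are assumptions for illustration, not part of the framework:

```python
from dataclasses import dataclass, field

# Pilot-week review gate: every output waits for a human before reaching
# the customer; low-confidence outputs are flagged for priority review.
@dataclass
class ReviewQueue:
    threshold: float = 0.8  # assumed confidence cutoff, a tuning knob
    items: list = field(default_factory=list)

    def submit(self, request_id: str, output: str, confidence: float):
        self.items.append({
            "id": request_id,
            "output": output,
            "priority": confidence < self.threshold,
        })

    def pending(self):
        # Priority (low-confidence) items surface first for the reviewer.
        return sorted(self.items, key=lambda item: not item["priority"])

queue = ReviewQueue()
queue.submit("r-1", "Refund issued", confidence=0.95)
queue.submit("r-2", "Escalate to agent", confidence=0.55)
print([item["id"] for item in queue.pending()])  # r-2 reviewed first
```

In week 5 everything flows through this gate; in week 6 only low-confidence cases do, and the queue stays in place after launch.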
Week 6 is the controlled rollout. Most teams ramp from 1% of traffic to 10% to 50% to 100% over the week, watching the monitoring dashboards at each step. A human-review queue stays in place for low-confidence cases indefinitely. The feature is shipped when the metrics on real production traffic match the metrics on the eval set, which validates that the eval set was representative.
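The ramp is usually implemented with deterministic bucketing, so a user who enters the rollout at 1% stays in it at 10% and beyond. A sketch, assuming hash-based assignment (the feature name and user ids are illustrative):

```python
import hashlib

# Deterministic percentage rollout: hash the user id into a stable bucket
# 0-99, so ramping the percentage only ever adds users, never flips one out.
def in_rollout(user_id: str, percent: int, feature: str = "ai-feature") -> bool:
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent

users = [f"user-{i}" for i in range(1000)]
for pct in (1, 10, 50, 100):
    enrolled = sum(in_rollout(u, pct) for u in users)
    print(f"{pct}%: {enrolled} of {len(users)} users")
```

Because assignment is a pure function of user id and feature name, the ramp is reproducible across services and the rollback procedure is just lowering the percentage.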
The handoff at the end of week 6 includes the runbook, the eval harness, the monitoring dashboard, the audit logs, and the source code. The receiving team should be able to run the feature without you on day one of week 7.
Common ways teams blow the timeline
Skipping the eval set in week 1 is the most expensive mistake, because it is invisible until weeks 5 and 6. Teams that skip it tell themselves they will write the eval set later; they never do.
Trying to ship a finished feature in week 4 (rather than a feature-complete build that the pilot in week 5 will validate) typically results in a feature that demos well to leadership but breaks on real traffic. The pilot week exists for a reason.
Treating week 6 as a launch party rather than a rollout is the third failure mode. Production rollouts for AI features are gradual on purpose. You ramp, you watch, you ramp again. Going from 0 to 100 percent on day one of week 6 is how teams ship features that have to be reverted by day three.
When this framework fits, and when it does not
It fits a scoped AI feature with clear inputs, clear outputs, and a measurable definition of done. Examples: a support deflection bot for a defined set of intents, an invoice extraction pipeline for a known document set, an internal search copilot scoped to one team. The constraint that makes the framework work is that the scope is small enough to evaluate.
It does not fit research projects where the goal is to discover whether something is possible, multi-quarter platform builds, or features where the evaluation criteria genuinely cannot be defined before build. Forcing any of those projects into a six-week shape produces bad work.
If your project does not fit, the right move is not to stretch the framework. The right move is to scope down to a six-week feature inside the larger project and deliver that first.
What you walk away with
A working AI feature in production. An evaluation harness that catches regressions on every change. A monitoring dashboard that shows what the AI is doing in production. A runbook your operations team can use without asking the build team for help. An audit trail that satisfies your security and compliance review. A documented decision log that records why the architecture is what it is.
The artifacts are as important as the feature itself. They are how the feature stays running after the build team moves on, and they are how a regulator, an auditor, or a future engineer reconstructs why the system behaves the way it does.