Why six weeks (not six days, not six months)
Six days is enough time to ship a demo that works on cherry-picked inputs. It is not enough to build evaluation, monitoring, fallback behaviour, and the runbook your support team will use when something breaks at 3am. Teams that try to compress AI delivery into a one-week sprint typically ship the demo and then spend three months patching production issues that surface after launch.
Six months, on the other hand, is the default consulting cadence and it is too slow for AI work. AI features iterate faster than traditional software because the underlying models, data, and prompts change continuously. A six-month build loop usually ships a feature against a model that has been superseded twice and a prompt strategy that no longer reflects best practice.
Six weeks is the cadence where you can do the full loop, discovery through production, with all the artifacts a real operating system needs, while still being fast enough to keep the work tied to a single business goal that has not shifted underneath it.
The weekly cadence at a glance
Each week has a defined output. Skipping the output, even when the work feels almost done, is the most common failure mode. Each week's deliverable is also the input the following week depends on, so a missed milestone compounds.
- Week 1: Discovery, scope lock, evaluation criteria written down
- Week 2: Architecture decision, evaluation harness running on baseline
- Week 3: First end-to-end working flow, eval harness integrated into the build loop
- Week 4: Feature-complete build with monitoring instrumentation in place
- Week 5: Internal pilot with operations, runbook, and rollback plan
- Week 6: Controlled production rollout with monitoring and human review queue
Week 1: Discovery and scope lock
The week begins with a working session that maps three things: the data the feature will use, the decisions the feature will make, and the failure modes that would matter to the business. By Friday, you should have a written scope document that names the AI capability, the in-scope inputs and outputs, the evaluation criteria, and the things explicitly out of scope.
The single most useful artifact from week 1 is the evaluation set. Pull a hundred real examples from existing data, with the answer or label your team would expect, before any model is wired up. This evaluation set becomes the goalposts. If you skip this step, you will spend weeks 5 and 6 arguing about whether the feature is good enough, with no objective measure to settle the question.
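The eval set can be as simple as a JSONL file of input/expected pairs. A minimal sketch, assuming a hypothetical support-ticket intent-classification feature (the file name, ids, and labels here are illustrative, not prescribed by the framework):

```python
import json

# Hypothetical week-1 eval set: each record pairs a real input with the
# label the team expects, written down before any model is wired up.
examples = [
    {"id": "t-001", "input": "Where is my refund?", "expected": "billing"},
    {"id": "t-002", "input": "App crashes on login", "expected": "technical"},
    {"id": "t-003", "input": "Cancel my subscription", "expected": "account"},
]

with open("eval_set.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Reading it back is the harness's job in week 2.
with open("eval_set.jsonl") as f:
    loaded = [json.loads(line) for line in f]
print(len(loaded))  # three here; the real set should have ~100
```

The format matters less than the discipline: every example comes from real data, and the expected answer is agreed before a model ever sees it.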
The mistake teams make in week 1 is treating it as a kickoff meeting rather than a build week. Discovery is build work; the artifacts produced (scope doc, eval set, data access plan, success metrics) are the foundation that everything else depends on.
Weeks 2 to 4: Build with the eval harness running
Week 2 picks the architecture. For most AI features the question collapses to: which model, which retrieval pattern (none, RAG, agentic), and which evaluation methodology. Decisions get written down with rationale. The evaluation harness from week 1 gets wired up so every change can be graded.
Weeks 3 and 4 are the core build. The discipline that distinguishes production AI from a demo is that the evaluation harness runs continuously. Every prompt change, every retrieval tweak, every model swap is graded against the eval set before being merged. Regressions get caught in minutes, not in production.
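The grading loop itself is small. A sketch of the merge gate, with a keyword-matching stub standing in for the real prompt-plus-model call and an assumed 90% accuracy threshold:

```python
# Minimal eval-harness sketch: grade a candidate against the eval set and
# fail the merge if accuracy drops below a threshold. The candidate here
# is a stub standing in for the actual model call under test.
def candidate(text: str) -> str:
    keywords = {"refund": "billing", "crash": "technical", "cancel": "account"}
    for kw, label in keywords.items():
        if kw in text.lower():
            return label
    return "unknown"

def run_eval(fn, eval_set, threshold=0.9):
    correct = sum(1 for ex in eval_set if fn(ex["input"]) == ex["expected"])
    accuracy = correct / len(eval_set)
    return accuracy, accuracy >= threshold

eval_set = [
    {"input": "Where is my refund?", "expected": "billing"},
    {"input": "App crashes on login", "expected": "technical"},
    {"input": "Cancel my subscription", "expected": "account"},
]

accuracy, passed = run_eval(candidate, eval_set)
print(f"accuracy={accuracy:.2f} passed={passed}")
```

Wiring `run_eval` into CI so it runs on every pull request is what turns the eval set from a document into a regression gate.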
By the end of week 4, the feature works end-to-end on the eval set with measurable accuracy, monitoring is wired in, and the cost-per-request is known. If you do not have these three things by Friday of week 4, weeks 5 and 6 will not save you.
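Knowing cost-per-request is simple arithmetic over logged token counts. A sketch, where the per-million-token prices are illustrative placeholders rather than any real model's pricing:

```python
# Cost-per-request accounting over a sample of logged requests.
# Prices are assumed placeholders, not real model pricing.
PRICE_IN_PER_M = 3.00    # assumed input price, USD per 1M tokens
PRICE_OUT_PER_M = 15.00  # assumed output price, USD per 1M tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * PRICE_IN_PER_M
            + output_tokens * PRICE_OUT_PER_M) / 1_000_000

# (input_tokens, output_tokens) pairs logged during eval runs.
logged = [(1200, 300), (900, 250), (2000, 400)]
avg_cost = sum(request_cost(i, o) for i, o in logged) / len(logged)
print(f"avg cost per request: ${avg_cost:.4f}")
```

Multiplying that average by projected traffic gives the monthly run-rate number leadership will ask for in week 5.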
Weeks 5 and 6: Production rollout with controls
Week 5 is an internal pilot. The feature runs against real production traffic but the outputs are reviewed by a human before reaching the customer. This is where you find the failure modes the eval set missed: the unusual phrasing, the edge case data, the integration glitch. You also use this week to write the runbook, the on-call playbook, and the rollback procedure.
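The review gate can be sketched as a simple queue that surfaces low-confidence outputs first. The confidence threshold and field names below are assumptions for illustration, not part of the framework:

```python
from dataclasses import dataclass, field

# Pilot-week review gate: every output waits for a human before reaching
# the customer; low-confidence outputs are flagged for priority review.
@dataclass
class ReviewQueue:
    threshold: float = 0.8  # assumed confidence cutoff, a tuning knob
    items: list = field(default_factory=list)

    def submit(self, request_id: str, output: str, confidence: float):
        self.items.append({
            "id": request_id,
            "output": output,
            "priority": confidence < self.threshold,
        })

    def pending(self):
        # Priority (low-confidence) items surface first for the reviewer.
        return sorted(self.items, key=lambda item: not item["priority"])

queue = ReviewQueue()
queue.submit("r-1", "Refund issued", confidence=0.95)
queue.submit("r-2", "Escalate to agent", confidence=0.55)
print([item["id"] for item in queue.pending()])  # r-2 reviewed first
```

In week 5 everything flows through this gate; in week 6 only low-confidence cases do, and the queue stays in place after launch.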
Week 6 is the controlled rollout. Most teams ramp from 1% of traffic to 10% to 50% to 100% over the week, watching the monitoring dashboards at each step. A human-review queue stays in place for low-confidence cases indefinitely. The feature is shipped when the metrics on real production traffic match the metrics on the eval set, which validates that the eval set was representative.
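The ramp is usually implemented with deterministic bucketing, so a user who enters the rollout at 1% stays in it at 10% and beyond. A sketch, assuming hash-based assignment (the feature name and user ids are illustrative):

```python
import hashlib

# Deterministic percentage rollout: hash the user id into a stable bucket
# 0-99, so ramping the percentage only ever adds users, never flips one out.
def in_rollout(user_id: str, percent: int, feature: str = "ai-feature") -> bool:
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent

users = [f"user-{i}" for i in range(1000)]
for pct in (1, 10, 50, 100):
    enrolled = sum(in_rollout(u, pct) for u in users)
    print(f"{pct}%: {enrolled} of {len(users)} users")
```

Because assignment is a pure function of user id and feature name, the ramp is reproducible across services and the rollback procedure is just lowering the percentage.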
The handoff at the end of week 6 includes the runbook, the eval harness, the monitoring dashboard, the audit logs, and the source code. The receiving team should be able to run the feature without you on day one of week 7.
Common ways teams blow the timeline
Skipping the eval set in week 1 is the most expensive mistake, because it is invisible until weeks 5 and 6. Teams that skip it tell themselves they will write the eval set later; they never do.
Trying to ship a finished feature in week 4 (rather than a feature-complete build that the pilot in week 5 will validate) typically results in a feature that demos well to leadership but breaks on real traffic. The pilot week exists for a reason.
Treating week 6 as a launch party rather than a rollout is the third failure mode. Production rollouts for AI features are gradual on purpose. You ramp, you watch, you ramp again. Going from 0 to 100 percent on day one of week 6 is how teams ship features that have to be reverted by day three.
When this framework fits, and when it does not
It fits a scoped AI feature with clear inputs, clear outputs, and a measurable definition of done. Examples: a support deflection bot for a defined set of intents, an invoice extraction pipeline for a known document set, an internal search copilot scoped to one team. The constraint that makes the framework work is that the scope is small enough to evaluate.
It does not fit research projects where the goal is to discover whether something is possible, multi-quarter platform builds, or features where the evaluation criteria genuinely cannot be defined before build. Forcing any of those projects into a six-week shape produces bad work.
If your project does not fit, the right move is not to stretch the framework. The right move is to scope down to a six-week feature inside the larger project and deliver that first.
What you walk away with
A working AI feature in production. An evaluation harness that catches regressions on every change. A monitoring dashboard that shows what the AI is doing in production. A runbook your operations team can use without asking the build team for help. An audit trail that satisfies your security and compliance review. A documented decision log that records why the architecture is what it is.
The artifacts are as important as the feature itself. They are how the feature stays running after the build team moves on, and they are how a regulator, an auditor, or a future engineer reconstructs why the system behaves the way it does.