A demo is not a product. The model works in a notebook, the stakeholders are impressed, and someone says "let's ship it." Then reality arrives: latency spikes, costs spiral, the model hallucinates in production, and nobody built the monitoring to catch it.
The distance between a working AI demo and a production-ready AI feature is not a few deploys. It is a checklist of 47 discrete items spanning model quality, infrastructure, security, cost management, user experience, and operational readiness. Miss one and you get an outage, a data leak, or a slow degradation that nobody notices until a customer complains on Twitter.
This is the checklist we run before every AI feature goes live. It is organized into seven categories. Use it as-is or adapt it to your stack.
Category 1: Model Quality and Evaluation (Items 1-8)
Before anything else, verify that the model actually works for your use case.
1. Evaluation dataset exists and is versioned. You have a held-out test set that represents real production queries. It is stored in version control or a data versioning system (DVC, LakeFS, or equivalent). It is never used for training or fine-tuning.
2. Accuracy meets the defined threshold. You have a specific metric (F1, BLEU, faithfulness score, whatever fits your task) and a specific number that constitutes "good enough." This was agreed on during discovery, not made up after looking at results.
3. Edge cases are documented and tested. You have identified the inputs most likely to cause failures -- empty inputs, adversarial inputs, out-of-distribution queries, multilingual text, extremely long inputs -- and tested the model against each one.
4. Hallucination rate is measured. For generative models, you have quantified the rate at which the model produces false statements. If you are using RAG, you have measured faithfulness: what percentage of generated claims are supported by the retrieved context?
5. Bias testing is complete. You have tested model outputs across relevant demographic categories. For classification models, check accuracy parity. For generation models, check for stereotyping, exclusion, or differential treatment.
6. Model version is pinned. You are deploying a specific model version, not "latest." If you are calling a third-party API (OpenAI, Anthropic, Google), you are using a date-stamped model version, not the default alias.
7. Prompt/configuration is version-controlled. System prompts, few-shot examples, temperature settings, and any other configuration that affects model behavior are stored in version control with the same rigor as application code.
8. Regression test suite exists. You have a suite of golden input-output pairs that runs automatically on every deployment. If a new model version or prompt change breaks a previously working case, the deploy is blocked.
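Item 8 can be sketched as a small harness that runs golden pairs and blocks the deploy on any failure. This is a minimal illustration: `call_model` is a stub standing in for your real inference call, and the golden cases are invented examples.

```python
def call_model(prompt: str) -> str:
    # Stub -- replace with your actual model or API call.
    if not prompt.strip():
        return "Please provide a question."
    return "Refunds are accepted within 30 days of purchase."

GOLDEN_CASES = [
    {"input": "What is the refund window?", "expect": "30 days"},
    {"input": "", "expect": "Please provide"},
]

def run_regression(cases):
    """Return the failing cases; an empty list means the deploy may proceed."""
    failures = []
    for case in cases:
        output = call_model(case["input"])
        if case["expect"] not in output:
            failures.append({**case, "got": output})
    return failures
```

Wire `run_regression` into CI so a non-empty result fails the pipeline; in practice the assertions are often semantic (substring, regex, or an LLM-as-judge score) rather than exact string matches.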
Category 2: Infrastructure and Performance (Items 9-17)
The model is only useful if it responds fast enough at acceptable cost.
9. Latency budget is defined and met. You have a P50 and P99 latency target for the AI feature. You have measured actual latency under realistic conditions and it meets both targets.
10. Load testing is complete. You have simulated production traffic levels and verified that the system handles them without degradation. Include burst patterns, not just steady-state load.
11. Auto-scaling is configured. If traffic exceeds the baseline, the system scales up automatically. If you are using a managed API, verify that rate limits accommodate your expected peak traffic.
12. Fallback behavior is implemented. When the AI service is unavailable or slow, the feature degrades gracefully. This might mean returning a cached response, falling back to a rule-based system, or showing a meaningful error state.
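A graceful-degradation wrapper for item 12 might look like this minimal sketch; `is_healthy` stands in for whatever circuit breaker or health check your stack provides.

```python
def with_fallback(primary, fallback, is_healthy=lambda: True):
    """Run `primary`; if the AI path is known-degraded or raises, serve
    the fallback (a cached response, a rule-based answer, or a clear
    error state) instead of surfacing a raw failure to the user."""
    if not is_healthy():
        return fallback()
    try:
        return primary()
    except Exception:
        # In production, log the exception before falling back.
        return fallback()
```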
13. Timeout is configured. Every AI API call has a timeout. For synchronous flows, this is typically 10-30 seconds. For async flows, define the maximum wait time and the notification mechanism.
14. Retry logic is appropriate. Transient failures trigger retries with exponential backoff. Non-transient failures (4xx errors, validation failures) do not trigger retries.
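Items 13 and 14 together can be sketched as a retry wrapper that backs off exponentially on transient failures and gives up immediately on permanent ones. The two error classes are placeholders for however your client library surfaces timeouts, 429s, and 4xx responses.

```python
import random
import time

class TransientError(Exception):
    """Timeouts, 429s, 5xx -- worth retrying."""

class PermanentError(Exception):
    """4xx validation failures -- retrying will not help."""

def call_with_retries(fn, max_attempts=4, base_delay=0.5, max_delay=8.0,
                      sleep=None):
    """Retry transient failures with exponential backoff plus jitter;
    surface permanent failures immediately."""
    sleep = sleep or time.sleep
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except PermanentError:
            raise
        except TransientError:
            if attempt == max_attempts:
                raise
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay * 0.1))
```

The injectable `sleep` keeps the wrapper testable; the per-call timeout itself is configured on the HTTP client or SDK, not here.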
15. Caching strategy is implemented. For deterministic or semi-deterministic queries, responses are cached. Semantic caching (using embedding similarity) can reduce costs by 40-60% for applications with repetitive query patterns.
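A toy semantic cache, to make item 15 concrete. The bag-of-words `embed` function is a deliberate stand-in for a real embedding model; only the lookup-by-similarity structure is the point.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy embedding: bag-of-words counts. Swap in a real embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    """Return a cached response when a new query is similar enough
    to one already answered."""
    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.entries = []  # (embedding, response)

    def get(self, query):
        q = embed(query)
        for emb, resp in self.entries:
            if cosine(q, emb) >= self.threshold:
                return resp
        return None

    def put(self, query, response):
        self.entries.append((embed(query), response))
```

A production version would use a vector index rather than a linear scan, and the threshold needs tuning against your own traffic: too low and users get stale or wrong answers, too high and the hit rate collapses.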
16. Cost per request is calculated. You know the exact cost per AI inference at current pricing. You have projected monthly cost at expected traffic levels. You have a cost alert threshold.
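Item 16 is simple arithmetic, but writing it down catches surprises early. The per-token prices below are illustrative placeholders, not any provider's actual rate card.

```python
# Hypothetical per-1K-token prices in USD -- substitute your provider's
# current rate card before trusting any projection.
PRICE_PER_1K = {"input": 0.003, "output": 0.015}

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of a single inference at the configured prices."""
    return (input_tokens / 1000) * PRICE_PER_1K["input"] + \
           (output_tokens / 1000) * PRICE_PER_1K["output"]

def monthly_projection(requests_per_day: int, avg_in: int, avg_out: int) -> float:
    """Rough 30-day cost at expected traffic; use for alert thresholds."""
    return requests_per_day * 30 * request_cost(avg_in, avg_out)
```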
17. Rate limiting is in place. Both upstream (protecting the AI provider from overuse) and downstream (protecting your API from abuse). Rate limits are enforced per-user, not just globally.
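Per-user rate limiting (item 17) is often a token bucket keyed by user ID. A minimal, clock-injected sketch, with the caveat that a real deployment needs shared state (e.g. Redis) across instances:

```python
class PerUserRateLimiter:
    """Token bucket per user: up to `burst` requests at once,
    refilled at `rate` tokens per second."""
    def __init__(self, rate: float, burst: int):
        self.rate = rate
        self.burst = burst
        self.state = {}  # user_id -> (tokens, last_timestamp)

    def allow(self, user_id: str, now: float) -> bool:
        tokens, last = self.state.get(user_id, (float(self.burst), now))
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.state[user_id] = (tokens - 1, now)
            return True
        self.state[user_id] = (tokens, now)
        return False
```

Passing `now` explicitly (rather than reading the clock inside) makes the limiter deterministic to test; in production you would pass `time.monotonic()`.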
Category 3: Data Pipeline and Privacy (Items 18-25)
AI features consume and produce data. Both paths need protection.
18. Input data is validated. All user inputs are validated for type, length, and format before reaching the model. Injection attempts (prompt injection, indirect prompt injection) are filtered.
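A first-pass validator for item 18. The injection patterns are illustrative only; pattern matching is one layer of defense against prompt injection, not a complete one.

```python
import re

MAX_INPUT_CHARS = 4000

# Illustrative patterns -- a real deny-list is maintained and tested
# continuously, and combined with structural defenses (see item 26).
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |the )?previous instructions", re.IGNORECASE),
    re.compile(r"you are now", re.IGNORECASE),
]

def validate_input(text):
    """Return (ok, reason). Reject before the text ever reaches the model."""
    if not isinstance(text, str):
        return False, "input must be a string"
    if not text.strip():
        return False, "input is empty"
    if len(text) > MAX_INPUT_CHARS:
        return False, "input exceeds length limit"
    for pat in INJECTION_PATTERNS:
        if pat.search(text):
            return False, "input matches a known injection pattern"
    return True, "ok"
```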
19. PII handling is defined. You have documented which personally identifiable information enters the model, whether it is stored, and how long it is retained. If PII should not reach the model, you have a scrubbing step.
20. Data retention policy exists. You have defined how long AI inputs, outputs, and intermediate data are stored. The retention period complies with your privacy policy and applicable regulations (GDPR, CCPA, etc.).
21. Logging does not capture sensitive data. Application logs do not inadvertently capture user prompts, model responses, or PII. If you need to log AI interactions for debugging, use a separate, access-controlled logging pipeline.
22. Third-party data processing agreements are in place. If you are sending data to a third-party AI provider, you have a Data Processing Agreement (DPA) that covers data handling, retention, and deletion.
23. Training data is not leaking into outputs. For fine-tuned models, verify that the model does not memorize and regurgitate training data, especially sensitive data from the training set.
24. Context window does not leak between users. In multi-tenant systems, verify that one user's data never appears in another user's model context. This is particularly critical for RAG systems with shared vector stores.
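For item 24, the key property is that tenant filtering happens before ranking, never after. A sketch against a list-of-dicts stand-in for a vector store that supports metadata filters:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def search_chunks(index, query_embedding, tenant_id, top_k=5):
    """Filter by tenant BEFORE ranking, so another tenant's text can
    never enter the context window -- even if it scores higher."""
    candidates = [c for c in index if c["tenant_id"] == tenant_id]
    candidates.sort(key=lambda c: cosine(query_embedding, c["embedding"]),
                    reverse=True)
    return candidates[:top_k]
```

With a managed vector store, the same property is expressed as a mandatory metadata filter on every query; the isolation test is the same either way: query as tenant A, assert no tenant B text appears.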
25. Audit trail exists. Every AI decision that affects a user (content moderation, recommendation, classification) is logged with the input, output, model version, and timestamp. This trail is queryable.
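Item 25 can be as simple as one JSON line per decision, appended to an access-controlled sink. The `write` callable here stands in for a file opened in append mode or a log shipper:

```python
import json
import time
import uuid

def append_audit_record(write, *, user_id, input_text, output_text,
                        model_version, decision):
    """Append one queryable JSON line per AI decision that affects a user."""
    record = {
        "id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "user_id": user_id,
        "input": input_text,
        "output": output_text,
        "model_version": model_version,
        "decision": decision,
    }
    write(json.dumps(record) + "\n")
    return record
```

Remember item 21: if inputs and outputs contain PII, this trail belongs in a separate, access-controlled pipeline, not in general application logs.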
Category 4: Security (Items 26-32)
AI features introduce new attack surfaces.
26. Prompt injection defenses are in place. User inputs are treated as data, not instructions. System prompts and user messages are clearly separated in the API call. Input sanitization removes known injection patterns.
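The role separation in item 26, sketched in the chat-completions message shape many providers use. The structure alone does not defeat injection, but concatenating user text into the system prompt guarantees vulnerability:

```python
def build_messages(system_prompt: str, user_input: str) -> list:
    """Keep instructions and user data in separate message roles.
    User text goes in its own message, never appended to the
    instruction channel."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_input},
    ]
```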
27. Output filtering is active. Model outputs are filtered for harmful content, off-topic responses, and policy violations before reaching the user. This is a separate layer from the model's own safety training.
28. System prompt is not extractable. An attacker cannot retrieve your system prompt through crafted inputs. Test this with known extraction techniques before launch.
29. API keys and credentials are secured. AI provider API keys are stored in a secrets manager (AWS Secrets Manager, HashiCorp Vault, etc.), not in environment variables, not in code, and definitely not in a .env file committed to git.
30. Access control is enforced. The AI feature respects existing authorization. A user who cannot access a document through the normal UI also cannot access it through an AI-powered search or summary.
31. Abuse patterns are monitored. You watch for anomalous usage: high-volume automated queries, systematic prompt injection attempts, and data extraction patterns.
32. Model access is authenticated. If you are hosting your own model, the inference endpoint requires authentication. It is not exposed to the public internet without auth.
Category 5: Monitoring and Observability (Items 33-40)
You cannot fix what you cannot see.
33. Latency is tracked per-request. Every AI API call records its duration. You have dashboards showing P50, P95, and P99 latency over time.
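Item 33 needs little more than a timing wrapper and a percentile function. A dependency-free sketch (nearest-rank percentiles; in production you would emit to your metrics system rather than an in-process list):

```python
import math
import time
from functools import wraps

LATENCIES_MS = []  # stand-in for a metrics backend

def track_latency(fn):
    """Record the wall-clock duration (ms) of every wrapped AI call."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        try:
            return fn(*args, **kwargs)
        finally:
            LATENCIES_MS.append((time.monotonic() - start) * 1000)
    return wrapper

def percentile(samples, p):
    """Nearest-rank percentile: deterministic and dependency-free."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[k]
```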
34. Error rates are tracked. Failed AI calls, timeout errors, and malformed responses are counted and alerted on. You have separate error rate metrics for the AI provider vs. your application logic.
35. Cost is tracked in real-time. Token usage and associated cost are recorded per request. Daily and monthly cost dashboards exist. Cost anomaly alerts are configured.
36. Model quality is monitored continuously. A subset of production traffic is evaluated against quality metrics on an ongoing basis. Quality degradation triggers alerts, not just periodic manual reviews.
37. Drift detection is active. For classification models, monitor input distribution and prediction distribution for shifts. For generative models, monitor output characteristics (length, sentiment, topic distribution).
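One common drift signal for item 37 is the population stability index (PSI) between a baseline window and a recent window of binned inputs or predictions:

```python
import math

def population_stability_index(expected, actual, eps=1e-6):
    """PSI between two frequency distributions over the same bins.
    Common rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift,
    > 0.25 significant shift worth alerting on."""
    total_e, total_a = sum(expected), sum(actual)
    psi = 0.0
    for e, a in zip(expected, actual):
        pe = max(e / total_e, eps)  # clamp to avoid log(0)
        pa = max(a / total_a, eps)
        psi += (pa - pe) * math.log(pa / pe)
    return psi
```

For generative outputs, the same computation applies to binned output characteristics (length buckets, topic clusters, sentiment bands) rather than class labels.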
38. User feedback is captured. Thumbs up/down, report buttons, or other feedback mechanisms exist. This feedback feeds into the evaluation pipeline.
39. Dashboards are built. A single dashboard shows: request volume, latency, error rate, cost, quality metrics, and user satisfaction. The on-call team can diagnose issues from this dashboard alone.
40. Alerting is configured. Latency exceeds P99 target? Alert. Error rate exceeds threshold? Alert. Cost exceeds daily budget? Alert. Quality score drops below threshold? Alert. Every metric has an alert, and every alert has an owner.
Category 6: User Experience (Items 41-44)
An AI feature that confuses users is worse than no AI feature at all.
41. Loading states communicate progress. If the AI takes more than 200ms, the user sees a loading indicator. For streaming responses, tokens appear progressively. The user is never staring at a frozen screen.
42. Confidence is communicated appropriately. Where relevant, the UI indicates how confident the AI is in its output. This could be a confidence score, a source citation, or a caveat like "Based on documents from Q3 2025."
43. Error states are helpful. When the AI fails, the user sees a clear message and a path forward -- not a generic "Something went wrong." If fallback content is available, show it.
44. Users can override or correct AI outputs. Every AI-generated result has an escape hatch. Users can edit, dismiss, report, or bypass the AI suggestion. The AI assists; it does not dictate.
Category 7: Operational Readiness (Items 45-47)
Shipping is not the end. It is the beginning.
45. Runbook exists. There is a document that describes: how to restart the AI service, how to roll back to the previous model version, how to disable the AI feature without a full deploy, and how to investigate common failure modes.
46. On-call rotation is staffed. Someone is responsible for AI-specific incidents. That person knows the system architecture, has access to the monitoring dashboards, and can execute the runbook.
47. Post-launch review is scheduled. A review meeting is on the calendar for one week and four weeks after launch. The agenda: latency, cost, quality, user feedback, and incidents. Decisions are documented.
How to Use This Checklist
Do not try to check all 47 items in a single sprint. Instead:
- During discovery (2-week sprint): Address items 1-5 and 16.
- During build (weeks 1-4): Address items 6-8, 9-15, 18-25, 26-32, and 41-44.
- Before launch (final week): Address items 17, 33-40, and 45-47.
- After launch (ongoing): Continuously monitor items 33-40 and run item 8 (regression tests) on every change.
Print this list. Pin it to your project board. Check items off as you go. The item you skip is the one that causes the incident.
The Uncomfortable Truth
Most AI features ship with fewer than 20 of these 47 items addressed. Teams under pressure skip monitoring, defer security testing, and launch without fallback behavior. It works until it does not.
The organizations that treat AI production readiness with the same rigor as traditional production readiness -- load testing, security review, runbooks, on-call -- are the ones whose AI features survive first contact with real users. The rest get a demo that works in staging and an incident in production.
Production readiness is not a phase. It is a discipline.
