Evaluation & Testing
Benchmarks, evals, and metrics for LLMs; human evaluation, A/B testing, and regression testing.
You can't improve what you can't measure. Evaluation and testing are the foundation of reliable AI systems — they tell you whether your model is good enough to ship, whether a change improved or degraded quality, and whether your system is drifting over time. This module covers industry-standard benchmarks, building custom evaluation suites, human evaluation frameworks, A/B testing, regression testing for prompts, and the tools that make eval-driven development practical.
Why AI Evaluation Is Hard
Traditional software has deterministic outputs: given the same input, you get the same output, and you can check it with an assertion. LLMs are stochastic — the same prompt can produce different (but equally valid) responses every time. This means evaluation shifts from "is the output exactly right?" to "is the output good enough along multiple dimensions?"
Evaluation dimensions for AI systems include correctness, helpfulness, safety, format compliance, latency, cost, and consistency. A change that improves correctness might degrade safety. A prompt tweak that fixes one edge case might break five others. Systematic evaluation is the only way to navigate these trade-offs confidently.
Industry Benchmarks for LLMs
Benchmarks provide standardized tests that allow comparison across models and over time. While no single benchmark captures real-world performance, together they paint a useful picture of model capabilities.
| Benchmark | What It Measures | Format | Key Notes |
|---|---|---|---|
| MMLU / MMLU-Pro | Broad knowledge across 57 subjects | Multiple choice | The standard general-knowledge benchmark; MMLU-Pro uses harder, 10-option questions to reduce saturation |
| HumanEval / HumanEval+ | Code generation correctness | Write Python functions, verified by test cases | HumanEval+ adds 80x more tests to catch false positives; SWE-bench tests real-world repo-level coding |
| ARC-AGI / ARC-AGI-2 | Novel reasoning and abstraction | Visual pattern puzzles requiring generalization | Designed to resist memorization; tests true reasoning ability rather than pattern matching on training data |
| GPQA (Diamond) | Graduate-level expert knowledge | PhD-level science questions | Questions so hard that PhD holders outside the specialty score ~30%; tests deep domain expertise |
| MATH / GSM8K | Mathematical reasoning | Multi-step math problems | GSM8K covers grade-school math; MATH covers competition-level problems; both are largely saturated by top models |
| Chatbot Arena (LMSYS) | Overall human preference | Head-to-head blind comparisons by humans | Elo-based ranking from real user votes; the most trusted indicator of general chat quality |
Building Custom Eval Suites
Custom evals are tests designed specifically for your application's use case. They are the single most important tool for making confident changes to your AI system.
Anatomy of an Eval
Every eval has three components: an input (the test case), an expected behavior (what "good" looks like), and a scoring function (how to measure success).
Example eval structure:
```jsonc
// Eval case for a customer support bot
{
  "input": "I want to return the shoes I bought last week",
  "context": {
    "order_id": "ORD-12345",
    "purchase_date": "2026-03-12",
    "return_policy": "30-day returns for unworn items"
  },
  "expected": {
    "intent_detected": "return_request",
    "mentions_return_policy": true,
    "asks_about_item_condition": true,
    "provides_return_instructions": true,
    "tone": "helpful_and_empathetic"
  },
  "scoring": {
    "intent_accuracy": "exact_match",   // binary: correct or not
    "content_checks": "checklist",      // % of expected elements present
    "tone_assessment": "llm_as_judge",  // use a grader model
    "overall": "weighted_average"       // 40% intent, 40% content, 20% tone
  }
}
```
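As a sketch of how the `weighted_average` in a spec like the one above might combine the three dimensions — assuming a binary intent check, a checklist of content elements, and a 0.0–1.0 tone score from a grader model:

```python
def overall_score(intent_correct: bool,
                  content_checks: dict[str, bool],
                  tone_score: float) -> float:
    """Combine per-dimension scores with 40/40/20 weights.

    intent_correct: exact-match result for intent detection (binary).
    content_checks: checklist of expected elements -> present or not.
    tone_score: 0.0-1.0 rating from an LLM-as-judge grader (assumed scale).
    """
    intent = 1.0 if intent_correct else 0.0
    content = sum(content_checks.values()) / len(content_checks)
    return 0.4 * intent + 0.4 * content + 0.2 * tone_score

# Example: intent correct, 3 of 4 content checks pass, tone judged 0.9
score = overall_score(
    True,
    {"mentions_return_policy": True,
     "asks_about_item_condition": True,
     "provides_return_instructions": True,
     "greets_customer": False},
    0.9,
)
print(round(score, 2))  # 0.4 + 0.4*0.75 + 0.2*0.9 = 0.88
```

The weights are a product decision, not a constant — tune them to match how your human evaluators actually trade off the dimensions.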
Types of Scoring Functions
- Exact match: The output must match exactly. Good for classifications, structured outputs, and extracted fields (dates, IDs, amounts).
- Contains/regex: Check whether the output contains specific strings or matches a pattern. Good for verifying citations, required disclaimers, or format compliance.
- Semantic similarity: Compare the output embedding to a reference answer embedding. Good for open-ended responses where wording can vary but meaning should be similar.
- LLM-as-judge: Use a powerful model to evaluate the output against a rubric. The most flexible approach and essential for subjective qualities like helpfulness, tone, and completeness.
- Code execution: For code generation tasks, run the generated code against test cases and check whether it produces correct outputs.
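The first three scorers above can be sketched in a few lines of Python. `cosine_similarity` here takes plain lists standing in for embedding-model vectors, since calling a real embedding API is beyond a sketch:

```python
import math
import re

def exact_match(output: str, expected: str) -> bool:
    """Binary check for classifications and extracted fields."""
    return output.strip() == expected.strip()

def contains_all(output: str, patterns: list[str]) -> float:
    """Fraction of required regex patterns found in the output."""
    hits = sum(bool(re.search(p, output, re.IGNORECASE)) for p in patterns)
    return hits / len(patterns)

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Semantic similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

print(exact_match("return_request", "return_request"))  # True
print(contains_all("See our 30-day return policy.",
                   [r"30-day", r"return policy"]))       # 1.0
```

LLM-as-judge scorers follow the same shape — a function from output to score — but wrap a model call with a rubric prompt instead of string logic.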
Building Your Eval Dataset
- Start with 50–100 representative cases: Cover the main paths your application handles. Include happy paths, edge cases, and known failure modes.
- Mine production logs: Real user queries are the best source of eval cases. Flag interactions where users gave negative feedback, retried their query, or escalated to a human.
- Bug-driven additions: Every bug report becomes an eval case. This ensures the same failure never ships twice.
- Adversarial cases: Include prompt injection attempts, off-topic queries, ambiguous inputs, and multi-language inputs to test robustness.
- Stratify by category: Ensure balanced representation across different query types, difficulty levels, and user segments. A suite dominated by easy cases will give inflated scores.
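A quick way to check stratification is to count cases per category and flag anything below a minimum share. `coverage_report` and the sample cases below are illustrative, assuming each eval case carries a `category` field:

```python
from collections import Counter

# Hypothetical eval cases; in practice these come from your stored suite
cases = [
    {"input": "I want to return my shoes", "category": "return_request"},
    {"input": "Where is my package?", "category": "shipping_status"},
    {"input": "Can I exchange for a larger size?", "category": "return_request"},
    {"input": "The app crashed at checkout", "category": "bug_report"},
    {"input": "I never got my refund", "category": "return_request"},
]

def coverage_report(cases, field="category", min_share=0.1):
    """Print per-category counts and flag underrepresented categories."""
    counts = Counter(c[field] for c in cases)
    total = len(cases)
    for value, n in counts.most_common():
        flag = "" if n / total >= min_share else "  <- underrepresented"
        print(f"{value}: {n}/{total} ({n / total:.0%}){flag}")
    return counts

coverage_report(cases, min_share=0.25)
```

Run the same report over `difficulty` or user segment to catch the easy-case skew that inflates scores.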
Human Evaluation Frameworks
Automated evals catch most issues, but human evaluation remains the gold standard for subjective quality — especially for tone, helpfulness, and naturalness. The challenge is making human eval scalable and consistent.
Designing Human Eval Protocols
- Clear rubrics: Define exactly what each score means. A 5-point scale with vague labels like "good" and "excellent" produces inconsistent ratings. Use behavioral anchors: "Score 5: Answer is factually correct, directly addresses the question, cites relevant sources, and is easy to understand."
- Blind evaluation: Evaluators should not know which model or prompt version generated the output. This eliminates bias toward the known "current" version.
- Inter-rater reliability: Have multiple evaluators rate the same samples and measure agreement (Cohen's Kappa or Krippendorff's Alpha). If agreement is low, your rubric needs refinement.
- Side-by-side comparison: Instead of absolute ratings, show evaluators two outputs and ask which is better. Humans are more consistent at relative comparison than absolute scoring. This is the approach behind LMSYS Chatbot Arena.
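For two raters, Cohen's Kappa is simple enough to compute directly — observed agreement corrected for the agreement expected by chance. A minimal sketch:

```python
from collections import Counter

def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Agreement between two raters, corrected for chance agreement."""
    n = len(rater_a)
    # Observed agreement: fraction of items where the raters agree
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement: product of each rater's marginal frequencies
    ca, cb = Counter(rater_a), Counter(rater_b)
    p_e = sum((ca[c] / n) * (cb[c] / n) for c in set(ca) | set(cb))
    return (p_o - p_e) / (1 - p_e)

# Two evaluators rating 8 outputs on a pass/fail rubric
a = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail"]
b = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]
print(round(cohens_kappa(a, b), 2))  # 0.47
```

A kappa below roughly 0.6 is usually read as a sign the rubric needs tighter behavioral anchors; for more than two raters, use Krippendorff's Alpha instead.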
A/B Testing AI Features
A/B testing for AI features follows the same principles as traditional product A/B tests, with additional considerations for the stochastic nature of AI outputs.
What to Measure
| Metric Category | Examples | Why It Matters |
|---|---|---|
| Task success | Resolution rate, task completion, accuracy | Did the AI actually help the user accomplish their goal? |
| User satisfaction | Thumbs up/down, CSAT, NPS | Subjective quality that automated evals may miss |
| Engagement | Retry rate, conversation length, feature adoption | Low retry rates and shorter conversations often signal that the AI resolved the request efficiently |
| Operational | Latency (p50/p95/p99), cost per request, error rate | A higher-quality variant that costs 5x more may not be viable |
| Safety | Guardrail trigger rate, escalation rate, content flags | Improved quality should not come at the expense of safety |
A/B Testing Pitfalls for AI
- Sample size: Because AI outputs are variable, you need larger sample sizes than typical product A/B tests to detect meaningful differences. Plan for at least 1,000–5,000 interactions per variant.
- Confounding by topic: If variant A gets easy queries and variant B gets hard ones (by chance), the comparison is invalid. Ensure random assignment is truly random and stratified by query type when possible.
- Measuring long-term effects: A prompt that produces slightly lower satisfaction per-interaction might lead to higher overall task completion. Run tests long enough to capture downstream effects.
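A rough power calculation shows why thousands of interactions per variant are needed. This sketch uses the standard two-proportion sample-size formula, with z-values hardcoded for a two-sided α=0.05 and 80% power; the 70%→74% lift is an illustrative scenario:

```python
import math

def sample_size_per_variant(p1: float, p2: float,
                            z_alpha: float = 1.96,
                            z_beta: float = 0.84) -> int:
    """Approximate n per variant to detect a shift from p1 to p2.

    Defaults correspond to two-sided alpha=0.05 and 80% power.
    """
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)

# Detecting a task-success lift from 70% to 74%
print(sample_size_per_variant(0.70, 0.74))
```

A 4-point lift on a 70% baseline already lands in the ~2,000-per-variant range; smaller lifts, or the extra variance AI outputs add to downstream metrics, push the requirement higher still.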
Regression Testing for Prompts
Prompt drift is one of the most insidious problems in AI applications: a small prompt tweak made to fix one issue can silently break five other behaviors. Regression testing catches this by running your full eval suite on every prompt change.
The Prompt Regression Workflow
Eval-driven prompt development cycle:
```
1. Identify Issue
   └─ Bug report, user feedback, or observed failure
2. Add Eval Case
   └─ Create a test case that reproduces the failure
   └─ Verify: current prompt FAILS this new case
3. Modify Prompt
   └─ Make the targeted change to fix the issue
4. Run Full Eval Suite
   └─ New case passes? ✓
   └─ All existing cases still pass? ✓ → Ship it
   └─ Existing cases regressed? ✗ → Iterate on prompt
5. Deploy with Monitoring
   └─ Watch production metrics for unexpected changes
```
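The "run full eval suite" gate can be sketched as a small harness. `run_and_score` and the fake bot below are hypothetical stand-ins for your real system plus each case's scoring function:

```python
def run_suite(cases, run_and_score, threshold=1.0):
    """Run every eval case; gate shipping on the pass rate.

    run_and_score(case) -> bool wraps the bot call and the case's scorer.
    threshold=1.0 means any regression blocks the ship.
    """
    failures = [c["name"] for c in cases if not run_and_score(c)]
    pass_rate = 1 - len(failures) / len(cases)
    return {"pass_rate": pass_rate,
            "failures": failures,
            "ship": pass_rate >= threshold}

cases = [
    {"name": "return_intent", "expect": "return_request"},
    {"name": "refund_intent", "expect": "refund_request"},
]

# Fake bot standing in for the real system, for illustration only:
# it handles returns but not refunds, so the suite should block the ship
def fake_run(case):
    return case["expect"] == "return_request"

report = run_suite(cases, fake_run)
print(report["ship"], report["failures"])  # False ['refund_intent']
```

Wired into CI (both LangSmith and Braintrust support this), the same harness turns every prompt change into a reviewable, gated deployment.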
Eval Tooling
LangSmith
LangSmith, built by LangChain, is a platform for tracing, evaluating, and monitoring LLM applications. On the evaluation side, it provides:
- Dataset management: Create and version eval datasets with inputs, expected outputs, and metadata. Import from CSV, JSON, or production traces.
- Custom evaluators: Define scoring functions in Python — from simple string matching to LLM-as-judge chains. Evaluators run automatically against datasets.
- Experiment tracking: Compare multiple prompt versions side-by-side with full scoring breakdowns. See exactly which cases improved and which regressed.
- CI integration: Run evals in your CI pipeline and gate deployments on eval scores.
Braintrust
Braintrust is purpose-built for AI evaluation with a focus on developer experience and speed. Key features include:
- Fast iteration: Evaluations run in parallel with real-time streaming results. See scores updating live as each case completes instead of waiting for the full suite.
- Built-in scorers: Pre-built scoring functions for common patterns — factuality, relevance, summarization quality, SQL correctness, and more.
- Diff view: Side-by-side comparison of outputs between experiments, highlighting exactly what changed and how scores shifted.
- GitHub integration: Automatically posts eval results as comments on pull requests, making prompt changes reviewable by the team.
- Online scoring: Run evaluations against production traffic in real-time to catch quality issues as they happen.
Eval-Driven Development Workflow
The best AI engineering teams practice eval-driven development — the AI equivalent of test-driven development (TDD). The principle is simple: write the eval first, then make it pass.
Eval-driven development in practice:
```
Sprint Planning: "We need to handle return requests for international orders"

Step 1: Write eval cases FIRST
  - 20 international return scenarios
  - Edge cases: different countries, currencies, shipping methods
  - Expected behaviors defined with rubrics

Step 2: Run evals against current system
  - Baseline score: 35% (system doesn't handle these well yet)

Step 3: Iterate on implementation
  - Add international return logic to system prompt
  - Add country-specific return policy to knowledge base
  - Add tool for looking up international shipping options

Step 4: Run evals after each change
  - After prompt change: 60%
  - After knowledge base update: 78%
  - After tool addition: 94%

Step 5: Run full regression suite
  - International returns: 94% ✓
  - Domestic returns: 97% (no regression) ✓
  - General queries: 95% (no regression) ✓

Step 6: Ship with confidence
```
Resources
Braintrust — AI Evaluation Platform
Braintrust
Purpose-built eval platform with fast iteration, built-in scorers, diff views, and GitHub integration. Excellent for teams practicing eval-driven development.
LangSmith — Tracing & Evaluation
LangChain
Full-featured platform for tracing, evaluating, and monitoring LLM applications. Includes dataset management, custom evaluators, experiment comparison, and CI integration.
Chatbot Arena Leaderboard
LMSYS
Live leaderboard ranking LLMs by human preference through blind side-by-side comparisons. The most trusted community benchmark for conversational AI quality.
ARC Prize — ARC-AGI Benchmark
ARC Prize Foundation
The ARC-AGI benchmark and competition, designed to measure genuine reasoning and abstraction abilities that resist memorization. A key benchmark for tracking progress toward general intelligence.
Key Takeaways
1. AI evaluation is harder than traditional testing because outputs are stochastic — focus on measuring quality dimensions (correctness, safety, tone) rather than exact string matching.
2. Industry benchmarks (MMLU, HumanEval, ARC-AGI, Chatbot Arena) measure general capabilities, but custom evals specific to your use case are what actually predict product quality.
3. Build eval suites with 50–100 representative cases covering happy paths, edge cases, and known failures. Every bug becomes a new eval case to prevent regression.
4. LLM-as-judge is the most flexible scoring method for subjective qualities — use a stronger model as judge with a detailed rubric, and validate against human ratings.
5. Prompt regression testing is essential: run the full eval suite on every prompt change to catch the silent breakage that small tweaks can introduce.
6. A/B test AI features with larger sample sizes than traditional product tests (1,000–5,000+ interactions per variant) to account for output variability.
7. Practice eval-driven development: write eval cases first, establish a baseline, iterate until scores pass, confirm no regressions, then ship with confidence.