Evaluation & Testing
Benchmarks, evals, and metrics for LLMs; human evaluation, A/B testing, and regression testing.
You can't improve what you can't measure. Evaluation and testing are the foundation of reliable AI systems — they tell you whether your model is good enough to ship, whether a change improved or degraded quality, and whether your system is drifting over time. This module covers industry-standard benchmarks, building custom evaluation suites, human evaluation frameworks, A/B testing, regression testing for prompts, and the tools that make eval-driven development practical.
Why AI Evaluation Is Hard
Traditional software has deterministic outputs: given the same input, you get the same output, and you can check it with an assertion. LLMs are stochastic — the same prompt can produce different (but equally valid) responses every time. This means evaluation shifts from "is the output exactly right?" to "is the output good enough along multiple dimensions?"
Evaluation dimensions for AI systems include correctness, helpfulness, safety, format compliance, latency, cost, and consistency. A change that improves correctness might degrade safety. A prompt tweak that fixes one edge case might break five others. Systematic evaluation is the only way to navigate these trade-offs confidently.
Industry Benchmarks for LLMs
Benchmarks provide standardized tests that allow comparison across models and over time. While no single benchmark captures real-world performance, together they paint a useful picture of model capabilities.
| Benchmark | What It Measures | Format | Key Notes |
|---|---|---|---|
| MMLU / MMLU-Pro | Broad knowledge across 57 subjects | Multiple choice | The standard general-knowledge benchmark; MMLU-Pro uses harder, 10-option questions to reduce saturation |
| HumanEval / HumanEval+ | Code generation correctness | Write Python functions, verified by test cases | HumanEval+ adds 80x more tests to catch false positives; SWE-bench tests real-world repo-level coding |
| ARC-AGI / ARC-AGI-2 | Novel reasoning and abstraction | Visual pattern puzzles requiring generalization | Designed to resist memorization; tests true reasoning ability rather than pattern matching on training data |
| GPQA (Diamond) | Graduate-level expert knowledge | PhD-level science questions | Questions so hard that PhD holders outside the specialty score ~30%; tests deep domain expertise |
| MATH / GSM8K | Mathematical reasoning | Multi-step math problems | GSM8K covers grade-school math; MATH covers competition-level problems; both are largely saturated by top models |
| Chatbot Arena (LMSYS) | Overall human preference | Head-to-head blind comparisons by humans | Elo-based ranking from real user votes; the most trusted indicator of general chat quality |
Building Custom Eval Suites
Custom evals are tests designed specifically for your application's use case. They are the single most important tool for making confident changes to your AI system.
Anatomy of an Eval
Every eval has three components: an input (the test case), an expected behavior (what "good" looks like), and a scoring function (how to measure success).
Example eval structure:
```jsonc
// Eval case for a customer support bot
{
  "input": "I want to return the shoes I bought last week",
  "context": {
    "order_id": "ORD-12345",
    "purchase_date": "2026-03-12",
    "return_policy": "30-day returns for unworn items"
  },
  "expected": {
    "intent_detected": "return_request",
    "mentions_return_policy": true,
    "asks_about_item_condition": true,
    "provides_return_instructions": true,
    "tone": "helpful_and_empathetic"
  },
  "scoring": {
    "intent_accuracy": "exact_match",   // binary: correct or not
    "content_checks": "checklist",      // % of expected elements present
    "tone_assessment": "llm_as_judge",  // use a grader model
    "overall": "weighted_average"       // 40% intent, 40% content, 20% tone
  }
}
```
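As a sketch of how the `weighted_average` in a spec like the one above might combine the three dimensions — assuming a binary intent check, a checklist of content elements, and a 0.0–1.0 tone score from a grader model:

```python
def overall_score(intent_correct: bool,
                  content_checks: dict[str, bool],
                  tone_score: float) -> float:
    """Combine per-dimension scores with 40/40/20 weights.

    intent_correct: exact-match result for intent detection (binary).
    content_checks: checklist of expected elements -> present or not.
    tone_score: 0.0-1.0 rating from an LLM-as-judge grader (assumed scale).
    """
    intent = 1.0 if intent_correct else 0.0
    content = sum(content_checks.values()) / len(content_checks)
    return 0.4 * intent + 0.4 * content + 0.2 * tone_score

# Example: intent correct, 3 of 4 content checks pass, tone judged 0.9
score = overall_score(
    True,
    {"mentions_return_policy": True,
     "asks_about_item_condition": True,
     "provides_return_instructions": True,
     "greets_customer": False},
    0.9,
)
print(round(score, 2))  # 0.4 + 0.4*0.75 + 0.2*0.9 = 0.88
```

The weights are a product decision, not a constant — tune them to match how your human evaluators actually trade off the dimensions.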
Types of Scoring Functions
- Exact match: The output must match exactly. Good for classifications, structured outputs, and extracted fields (dates, IDs, amounts).
- Contains/regex: Check whether the output contains specific strings or matches a pattern. Good for verifying citations, required disclaimers, or format compliance.
- Semantic similarity: Compare the output embedding to a reference answer embedding. Good for open-ended responses where wording can vary but meaning should be similar.
- LLM-as-judge: Use a powerful model to evaluate the output against a rubric. The most flexible approach and essential for subjective qualities like helpfulness, tone, and completeness.
- Code execution: For code generation tasks, run the generated code against test cases and check whether it produces correct outputs.
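The first three scorers above can be sketched in a few lines of Python. `cosine_similarity` here takes plain lists standing in for embedding-model vectors, since calling a real embedding API is beyond a sketch:

```python
import math
import re

def exact_match(output: str, expected: str) -> bool:
    """Binary check for classifications and extracted fields."""
    return output.strip() == expected.strip()

def contains_all(output: str, patterns: list[str]) -> float:
    """Fraction of required regex patterns found in the output."""
    hits = sum(bool(re.search(p, output, re.IGNORECASE)) for p in patterns)
    return hits / len(patterns)

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Semantic similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

print(exact_match("return_request", "return_request"))  # True
print(contains_all("See our 30-day return policy.",
                   [r"30-day", r"return policy"]))       # 1.0
```

LLM-as-judge scorers follow the same shape — a function from output to score — but wrap a model call with a rubric prompt instead of string logic.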
Building Your Eval Dataset
- Start with 50–100 representative cases: Cover the main paths your application handles. Include happy paths, edge cases, and known failure modes.
- Mine production logs: Real user queries are the best source of eval cases. Flag interactions where users gave negative feedback, retried their query, or escalated to a human.
- Bug-driven additions: Every bug report becomes an eval case. This ensures the same failure never ships twice.
- Adversarial cases: Include prompt injection attempts, off-topic queries, ambiguous inputs, and multi-language inputs to test robustness.
- Stratify by category: Ensure balanced representation across different query types, difficulty levels, and user segments. A suite dominated by easy cases will give inflated scores.
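A quick way to check stratification is to count cases per category and flag anything below a minimum share. `coverage_report` and the sample cases below are illustrative, assuming each eval case carries a `category` field:

```python
from collections import Counter

# Hypothetical eval cases; in practice these come from your stored suite
cases = [
    {"input": "I want to return my shoes", "category": "return_request"},
    {"input": "Where is my package?", "category": "shipping_status"},
    {"input": "Can I exchange for a larger size?", "category": "return_request"},
    {"input": "The app crashed at checkout", "category": "bug_report"},
    {"input": "I never got my refund", "category": "return_request"},
]

def coverage_report(cases, field="category", min_share=0.1):
    """Print per-category counts and flag underrepresented categories."""
    counts = Counter(c[field] for c in cases)
    total = len(cases)
    for value, n in counts.most_common():
        flag = "" if n / total >= min_share else "  <- underrepresented"
        print(f"{value}: {n}/{total} ({n / total:.0%}){flag}")
    return counts

coverage_report(cases, min_share=0.25)
```

Run the same report over `difficulty` or user segment to catch the easy-case skew that inflates scores.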
Human Evaluation Frameworks
Automated evals catch most issues, but human evaluation remains the gold standard for subjective quality — especially for tone, helpfulness, and naturalness. The challenge is making human eval scalable and consistent.
Designing Human Eval Protocols
- Clear rubrics: Define exactly what each score means. A 5-point scale with vague labels like "good" and "excellent" produces inconsistent ratings. Use behavioral anchors: "Score 5: Answer is factually correct, directly addresses the question, cites relevant sources, and is easy to understand."
- Blind evaluation: Evaluators should not know which model or prompt version generated the output. This eliminates bias toward the known "current" version.
- Inter-rater reliability: Have multiple evaluators rate the same samples and measure agreement (Cohen's Kappa or Krippendorff's Alpha). If agreement is low, your rubric needs refinement.
- Side-by-side comparison: Instead of absolute ratings, show evaluators two outputs and ask which is better. Humans are more consistent at relative comparison than absolute scoring. This is the approach behind LMSYS Chatbot Arena.
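For two raters, Cohen's Kappa is simple enough to compute directly — observed agreement corrected for the agreement expected by chance. A minimal sketch:

```python
from collections import Counter

def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Agreement between two raters, corrected for chance agreement."""
    n = len(rater_a)
    # Observed agreement: fraction of items where the raters agree
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement: product of each rater's marginal frequencies
    ca, cb = Counter(rater_a), Counter(rater_b)
    p_e = sum((ca[c] / n) * (cb[c] / n) for c in set(ca) | set(cb))
    return (p_o - p_e) / (1 - p_e)

# Two evaluators rating 8 outputs on a pass/fail rubric
a = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail"]
b = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]
print(round(cohens_kappa(a, b), 2))  # 0.47
```

A kappa below roughly 0.6 is usually read as a sign the rubric needs tighter behavioral anchors; for more than two raters, use Krippendorff's Alpha instead.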
A/B Testing AI Features
A/B testing for AI features follows the same principles as traditional product A/B tests, with additional considerations for the stochastic nature of AI outputs.
What to Measure
| Metric Category | Examples | Why It Matters |
|---|---|---|
| Task success | Resolution rate, task completion, accuracy | Did the AI actually help the user accomplish their goal? |
| User satisfaction | Thumbs up/down, CSAT, NPS | Subjective quality that automated evals may miss |
| Engagement | Retry rate, conversation length, feature adoption | Low retry rates and shorter conversations often signal that the AI resolved the request efficiently |
| Operational | Latency (p50/p95/p99), cost per request, error rate | A higher-quality variant that costs 5x more may not be viable |
| Safety | Guardrail trigger rate, escalation rate, content flags | Improved quality should not come at the expense of safety |
A/B Testing Pitfalls for AI
- Sample size: Because AI outputs are variable, you need larger sample sizes than typical product A/B tests to detect meaningful differences. Plan for at least 1,000–5,000 interactions per variant.
- Confounding by topic: If variant A gets easy queries and variant B gets hard ones (by chance), the comparison is invalid. Ensure random assignment is truly random and stratified by query type when possible.
- Measuring long-term effects: A prompt that produces slightly lower satisfaction per-interaction might lead to higher overall task completion. Run tests long enough to capture downstream effects.
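A rough power calculation shows why thousands of interactions per variant are needed. This sketch uses the standard two-proportion sample-size formula, with z-values hardcoded for a two-sided α=0.05 and 80% power; the 70%→74% lift is an illustrative scenario:

```python
import math

def sample_size_per_variant(p1: float, p2: float,
                            z_alpha: float = 1.96,
                            z_beta: float = 0.84) -> int:
    """Approximate n per variant to detect a shift from p1 to p2.

    Defaults correspond to two-sided alpha=0.05 and 80% power.
    """
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)

# Detecting a task-success lift from 70% to 74%
print(sample_size_per_variant(0.70, 0.74))
```

A 4-point lift on a 70% baseline already lands in the ~2,000-per-variant range; smaller lifts, or the extra variance AI outputs add to downstream metrics, push the requirement higher still.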
Regression Testing for Prompts
Prompt drift is one of the most insidious problems in AI applications: a small prompt tweak made to fix one issue can silently break five other behaviors. Regression testing catches this by running your full eval suite on every prompt change.
The Prompt Regression Workflow
Eval-driven prompt development cycle:
```
1. Identify Issue
   └─ Bug report, user feedback, or observed failure
2. Add Eval Case
   └─ Create a test case that reproduces the failure
   └─ Verify: current prompt FAILS this new case
3. Modify Prompt
   └─ Make the targeted change to fix the issue
4. Run Full Eval Suite
   └─ New case passes? ✓
   └─ All existing cases still pass? ✓ → Ship it
   └─ Existing cases regressed? ✗ → Iterate on prompt
5. Deploy with Monitoring
   └─ Watch production metrics for unexpected changes
```
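The "run full eval suite" gate can be sketched as a small harness. `run_and_score` and the fake bot below are hypothetical stand-ins for your real system plus each case's scoring function:

```python
def run_suite(cases, run_and_score, threshold=1.0):
    """Run every eval case; gate shipping on the pass rate.

    run_and_score(case) -> bool wraps the bot call and the case's scorer.
    threshold=1.0 means any regression blocks the ship.
    """
    failures = [c["name"] for c in cases if not run_and_score(c)]
    pass_rate = 1 - len(failures) / len(cases)
    return {"pass_rate": pass_rate,
            "failures": failures,
            "ship": pass_rate >= threshold}

cases = [
    {"name": "return_intent", "expect": "return_request"},
    {"name": "refund_intent", "expect": "refund_request"},
]

# Fake bot standing in for the real system, for illustration only:
# it handles returns but not refunds, so the suite should block the ship
def fake_run(case):
    return case["expect"] == "return_request"

report = run_suite(cases, fake_run)
print(report["ship"], report["failures"])  # False ['refund_intent']
```

Wired into CI (both LangSmith and Braintrust support this), the same harness turns every prompt change into a reviewable, gated deployment.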
Eval Tooling
LangSmith
LangSmith, built by LangChain, is a platform for tracing, evaluating, and monitoring LLM applications. On the evaluation side, it provides:
- Dataset management: Create and version eval datasets with inputs, expected outputs, and metadata. Import from CSV, JSON, or production traces.
- Custom evaluators: Define scoring functions in Python — from simple string matching to LLM-as-judge chains. Evaluators run automatically against datasets.
- Experiment tracking: Compare multiple prompt versions side-by-side with full scoring breakdowns. See exactly which cases improved and which regressed.
- CI integration: Run evals in your CI pipeline and gate deployments on eval scores.
Braintrust
Braintrust is purpose-built for AI evaluation with a focus on developer experience and speed. Key features include:
- Fast iteration: Evaluations run in parallel with real-time streaming results. See scores updating live as each case completes instead of waiting for the full suite.
- Built-in scorers: Pre-built scoring functions for common patterns — factuality, relevance, summarization quality, SQL correctness, and more.
- Diff view: Side-by-side comparison of outputs between experiments, highlighting exactly what changed and how scores shifted.
- GitHub integration: Automatically posts eval results as comments on pull requests, making prompt changes reviewable by the team.
- Online scoring: Run evaluations against production traffic in real-time to catch quality issues as they happen.
Eval-Driven Development Workflow
The best AI engineering teams practice eval-driven development — the AI equivalent of test-driven development (TDD). The principle is simple: write the eval first, then make it pass.
Eval-driven development in practice:
```
Sprint Planning: "We need to handle return requests for international orders"

Step 1: Write eval cases FIRST
  - 20 international return scenarios
  - Edge cases: different countries, currencies, shipping methods
  - Expected behaviors defined with rubrics

Step 2: Run evals against current system
  - Baseline score: 35% (system doesn't handle these well yet)

Step 3: Iterate on implementation
  - Add international return logic to system prompt
  - Add country-specific return policy to knowledge base
  - Add tool for looking up international shipping options

Step 4: Run evals after each change
  - After prompt change: 60%
  - After knowledge base update: 78%
  - After tool addition: 94%

Step 5: Run full regression suite
  - International returns: 94% ✓
  - Domestic returns: 97% (no regression) ✓
  - General queries: 95% (no regression) ✓

Step 6: Ship with confidence
```
Resources
Braintrust — AI Evaluation Platform
Braintrust
Purpose-built eval platform with fast iteration, built-in scorers, diff views, and GitHub integration. Excellent for teams practicing eval-driven development.
LangSmith — Tracing & Evaluation
LangChain
Full-featured platform for tracing, evaluating, and monitoring LLM applications. Includes dataset management, custom evaluators, experiment comparison, and CI integration.
Chatbot Arena Leaderboard
LMSYS
Live leaderboard ranking LLMs by human preference through blind side-by-side comparisons. The most trusted community benchmark for conversational AI quality.
ARC Prize — ARC-AGI Benchmark
ARC Prize Foundation
The ARC-AGI benchmark and competition, designed to measure genuine reasoning and abstraction abilities that resist memorization. A key benchmark for tracking progress toward general intelligence.
Key Takeaways
1. AI evaluation is harder than traditional testing because outputs are stochastic — focus on measuring quality dimensions (correctness, safety, tone) rather than exact string matching.
2. Industry benchmarks (MMLU, HumanEval, ARC-AGI, Chatbot Arena) measure general capabilities, but custom evals specific to your use case are what actually predict product quality.
3. Build eval suites with 50–100 representative cases covering happy paths, edge cases, and known failures. Every bug becomes a new eval case to prevent regression.
4. LLM-as-judge is the most flexible scoring method for subjective qualities — use a stronger model as judge with a detailed rubric, and validate against human ratings.
5. Prompt regression testing is essential: run the full eval suite on every prompt change to catch the silent breakage that small tweaks can introduce.
6. A/B test AI features with larger sample sizes than traditional product tests (1,000–5,000+ interactions per variant) to account for output variability.
7. Practice eval-driven development: write eval cases first, establish a baseline, iterate until scores pass, confirm no regressions, then ship with confidence.