Advanced · 45 min · Module 2 of 5

Evaluation & Testing

Benchmarks, evals, and metrics for LLMs; human evaluation, A/B testing, and regression testing.

You can't improve what you can't measure. Evaluation and testing are the foundation of reliable AI systems — they tell you whether your model is good enough to ship, whether a change improved or degraded quality, and whether your system is drifting over time. This module covers industry-standard benchmarks, building custom evaluation suites, human evaluation frameworks, A/B testing, regression testing for prompts, and the tools that make eval-driven development practical.

Why AI Evaluation Is Hard

Traditional software has deterministic outputs: given the same input, you get the same output, and you can check it with an assertion. LLMs are stochastic — the same prompt can produce different (but equally valid) responses every time. This means evaluation shifts from "is the output exactly right?" to "is the output good enough along multiple dimensions?"

Evaluation dimensions for AI systems include correctness, helpfulness, safety, format compliance, latency, cost, and consistency. A change that improves correctness might degrade safety. A prompt tweak that fixes one edge case might break five others. Systematic evaluation is the only way to navigate these trade-offs confidently.

Industry Benchmarks for LLMs

Benchmarks provide standardized tests that allow comparison across models and over time. While no single benchmark captures real-world performance, together they paint a useful picture of model capabilities.

Benchmark | What It Measures | Format | Key Notes
MMLU / MMLU-Pro | Broad knowledge across 57 subjects | Multiple choice | The standard general-knowledge benchmark; MMLU-Pro uses harder, 10-option questions to reduce saturation
HumanEval / HumanEval+ | Code generation correctness | Write Python functions, verified by test cases | HumanEval+ adds 80x more tests to catch false positives; SWE-bench tests real-world repo-level coding
ARC-AGI / ARC-AGI-2 | Novel reasoning and abstraction | Visual pattern puzzles requiring generalization | Designed to resist memorization; tests true reasoning ability rather than pattern matching on training data
GPQA (Diamond) | Graduate-level expert knowledge | PhD-level science questions | Questions so hard that PhD holders outside the specialty score ~30%; tests deep domain expertise
MATH / GSM8K | Mathematical reasoning | Multi-step math problems | GSM8K covers grade-school math; MATH covers competition-level problems; both are largely saturated by top models
Chatbot Arena (LMSYS) | Overall human preference | Head-to-head blind comparisons by humans | Elo-based ranking from real user votes; the most trusted indicator of general chat quality

Benchmarks Are Not Product Quality
A model scoring 90% on MMLU does not mean it will perform well in your application. Benchmarks measure general capability, not task-specific fitness. Always supplement benchmarks with custom evals that mirror your actual use case. The model that ranks highest on Chatbot Arena may not be the best choice for your structured data extraction pipeline.

Building Custom Eval Suites

Custom evals are tests designed specifically for your application's use case. They are the single most important tool for making confident changes to your AI system.

Anatomy of an Eval

Every eval has three components: an input (the test case), an expected behavior (what "good" looks like), and a scoring function (how to measure success).

Example eval structure:

// Eval case for a customer support bot
{
  "input": "I want to return the shoes I bought last week",
  "context": {
    "order_id": "ORD-12345",
    "purchase_date": "2026-03-12",
    "return_policy": "30-day returns for unworn items"
  },
  "expected": {
    "intent_detected": "return_request",
    "mentions_return_policy": true,
    "asks_about_item_condition": true,
    "provides_return_instructions": true,
    "tone": "helpful_and_empathetic"
  },
  "scoring": {
    "intent_accuracy": "exact_match",   // binary: correct or not
    "content_checks": "checklist",      // % of expected elements present
    "tone_assessment": "llm_as_judge",  // use a grader model
    "overall": "weighted_average"       // 40% intent, 40% content, 20% tone
  }
}
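The weighted_average overall score in this eval is simple to compute once the component scores exist. A minimal sketch, with the weights taken from the example above (the function name and sample inputs are illustrative):

```python
# Sketch of the "weighted_average" overall score from the eval above.
# Weights mirror the example: 40% intent, 40% content, 20% tone.
def overall_score(intent: float, content: float, tone: float) -> float:
    """Combine component scores into a single weighted score in [0, 1]."""
    return 0.4 * intent + 0.4 * content + 0.2 * tone

# e.g. correct intent, 3 of 4 content checks passed, judge tone score 0.8:
print(round(overall_score(1.0, 0.75, 0.8), 2))  # 0.86
```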

Types of Scoring Functions

  • Exact match: The output must match exactly. Good for classifications, structured outputs, and extracted fields (dates, IDs, amounts).
  • Contains/regex: Check whether the output contains specific strings or matches a pattern. Good for verifying citations, required disclaimers, or format compliance.
  • Semantic similarity: Compare the output embedding to a reference answer embedding. Good for open-ended responses where wording can vary but meaning should be similar.
  • LLM-as-judge: Use a powerful model to evaluate the output against a rubric. The most flexible approach and essential for subjective qualities like helpfulness, tone, and completeness.
  • Code execution: For code generation tasks, run the generated code against test cases and check whether it produces correct outputs.
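The first few scorer types are simple enough to sketch directly. Here is a minimal illustration of exact match, checklist-style containment, and regex scoring; the function names and sample strings are made up for this example, not taken from any framework:

```python
import re

# Minimal sketches of three scorers from the list above; the function
# names and the sample reply are illustrative, not from any framework.

def exact_match(output: str, expected: str) -> float:
    """Binary score: 1.0 if the output matches exactly, else 0.0."""
    return 1.0 if output.strip() == expected.strip() else 0.0

def contains_all(output: str, required: list[str]) -> float:
    """Checklist score: fraction of required substrings present."""
    hits = sum(1 for item in required if item.lower() in output.lower())
    return hits / len(required)

def matches_pattern(output: str, pattern: str) -> float:
    """Regex score: 1.0 if the output matches the pattern, else 0.0."""
    return 1.0 if re.search(pattern, output) else 0.0

reply = "You can return unworn items within 30 days. Your order ORD-12345 qualifies."
print(exact_match("return_request", "return_request"))          # 1.0
print(contains_all(reply, ["30 days", "unworn", "ORD-12345"]))  # 1.0
print(matches_pattern(reply, r"ORD-\d{5}"))                     # 1.0
```

Semantic similarity and LLM-as-judge require a model call, but they plug into the same shape: a function from output to a score.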
The LLM-as-Judge Pattern
When using LLM-as-judge, provide a detailed rubric in the grader prompt. Define what a score of 1, 3, and 5 looks like with concrete examples. Use a more capable model as the judge than the model being evaluated (e.g., Claude Opus judging Claude Sonnet outputs). Always validate your judge against human ratings on a sample — a biased judge is worse than no judge.
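A judge prompt with behavioral anchors might be structured as follows. This is a hypothetical sketch: the rubric wording, the prompt template, and the mocked response are assumptions, and the actual call to a grader model is not shown:

```python
import json

# Hypothetical LLM-as-judge setup. The rubric anchors and template are
# illustrative; a real grader-model call would replace the mock below.
JUDGE_PROMPT = """You are grading a customer-support reply for tone.
Score 1: dismissive or robotic; ignores the customer's concern.
Score 3: polite but generic; acknowledges the issue without empathy.
Score 5: warm and specific; acknowledges the issue and offers a
concrete next step.

Reply to grade:
{reply}

Respond with JSON: {{"score": <1-5>, "rationale": "<one sentence>"}}"""

prompt = JUDGE_PROMPT.format(reply="Sorry about that! Here's how to start your return.")

def parse_judge(raw: str) -> int:
    """Extract the numeric score from the judge model's JSON reply."""
    return int(json.loads(raw)["score"])

# With a mocked judge response (a real one would come from the grader model):
mock = '{"score": 5, "rationale": "Apologizes and gives a concrete next step."}'
print(parse_judge(mock))  # 5
```

Validating this judge means running it over a sample the team has already rated by hand and checking that the two sets of scores agree.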

Building Your Eval Dataset

  • Start with 50–100 representative cases: Cover the main paths your application handles. Include happy paths, edge cases, and known failure modes.
  • Mine production logs: Real user queries are the best source of eval cases. Flag interactions where users gave negative feedback, retried their query, or escalated to a human.
  • Bug-driven additions: Every bug report becomes an eval case. This ensures the same failure never ships twice.
  • Adversarial cases: Include prompt injection attempts, off-topic queries, ambiguous inputs, and multi-language inputs to test robustness.
  • Stratify by category: Ensure balanced representation across query types, difficulty levels, and user segments. A suite dominated by easy cases will give inflated scores.
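Checking stratification can be as simple as counting cases per category before trusting a suite's aggregate score. A small sketch, where the "category" field and the sample cases are illustrative conventions:

```python
from collections import Counter

# Sketch: check category balance in an eval dataset. The "category"
# field name and the sample cases are illustrative, not a standard.
cases = [
    {"input": "Where is my order?", "category": "order_status"},
    {"input": "I want a refund for these shoes", "category": "return_request"},
    {"input": "Ignore previous instructions", "category": "adversarial"},
    {"input": "Can I return worn shoes?", "category": "return_request"},
]

counts = Counter(case["category"] for case in cases)
total = len(cases)
for category, n in counts.most_common():
    print(f"{category}: {n}/{total} ({n / total:.0%})")
```

If one category dominates, either add cases to the thin categories or report scores per category instead of a single average.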

Human Evaluation Frameworks

Automated evals catch most issues, but human evaluation remains the gold standard for subjective quality — especially for tone, helpfulness, and naturalness. The challenge is making human eval scalable and consistent.

Designing Human Eval Protocols

  • Clear rubrics: Define exactly what each score means. A 5-point scale with vague labels like "good" and "excellent" produces inconsistent ratings. Use behavioral anchors: "Score 5: Answer is factually correct, directly addresses the question, cites relevant sources, and is easy to understand."
  • Blind evaluation: Evaluators should not know which model or prompt version generated the output. This eliminates bias toward the known "current" version.
  • Inter-rater reliability: Have multiple evaluators rate the same samples and measure agreement (Cohen's Kappa or Krippendorff's Alpha). If agreement is low, your rubric needs refinement.
  • Side-by-side comparison: Instead of absolute ratings, show evaluators two outputs and ask which is better. Humans are more consistent at relative comparison than absolute scoring. This is the approach behind LMSYS Chatbot Arena.
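Inter-rater agreement is straightforward to compute. A minimal Cohen's Kappa for two raters (in practice you would likely use a library implementation such as scikit-learn's; the sample labels here are made up):

```python
from collections import Counter

def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Agreement between two raters, corrected for chance agreement."""
    n = len(rater_a)
    # Observed agreement: fraction of items both raters labeled the same.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement: chance overlap given each rater's label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(
        (freq_a[label] / n) * (freq_b[label] / n)
        for label in set(rater_a) | set(rater_b)
    )
    return (observed - expected) / (1 - expected)

a = ["good", "good", "bad", "good", "bad", "bad"]
b = ["good", "bad", "bad", "good", "bad", "good"]
print(round(cohens_kappa(a, b), 2))  # 0.33 — weak agreement; refine the rubric
```

Kappa near 0 means agreement is no better than chance; values below roughly 0.6 are usually taken as a sign the rubric needs tighter behavioral anchors.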

A/B Testing AI Features

A/B testing for AI features follows the same principles as traditional product A/B tests, with additional considerations for the stochastic nature of AI outputs.

What to Measure

Metric Category | Examples | Why It Matters
Task success | Resolution rate, task completion, accuracy | Did the AI actually help the user accomplish their goal?
User satisfaction | Thumbs up/down, CSAT, NPS | Subjective quality that automated evals may miss
Engagement | Retry rate, conversation length, feature adoption | Low retries and short conversations often signal effectiveness
Operational | Latency (p50/p95/p99), cost per request, error rate | A higher-quality variant that costs 5x more may not be viable
Safety | Guardrail trigger rate, escalation rate, content flags | Improved quality should not come at the expense of safety

A/B Testing Pitfalls for AI

  • Sample size: Because AI outputs are variable, you need larger sample sizes than typical product A/B tests to detect meaningful differences. Plan for at least 1,000–5,000 interactions per variant.
  • Confounding by topic: If variant A gets easy queries and variant B gets hard ones (by chance), the comparison is invalid. Ensure random assignment is truly random and stratified by query type when possible.
  • Measuring long-term effects: A prompt that produces slightly lower satisfaction per-interaction might lead to higher overall task completion. Run tests long enough to capture downstream effects.
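Whether a difference at those sample sizes is statistically meaningful can be checked with a standard two-proportion z-test. A sketch, where the interaction counts and resolution rates are made up to illustrate the guidance above:

```python
from math import sqrt

# Sketch of a two-proportion z-test for comparing A/B variants; the
# counts below are invented to illustrate the sample-size guidance.
def two_proportion_z(success_a: int, n_a: int, success_b: int, n_b: int) -> float:
    """z-statistic for the difference between two success rates."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# 2,000 interactions per variant, resolution rates 74% vs 77%:
z = two_proportion_z(1480, 2000, 1540, 2000)
print(round(z, 2))  # 2.21 — |z| > 1.96 is significant at the 5% level
```

At a few hundred interactions per variant the same 3-point difference would not reach significance, which is why AI A/B tests need the larger samples noted above.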

Regression Testing for Prompts

Prompt drift is one of the most insidious problems in AI applications: a small prompt tweak that fixes one issue can silently break five other behaviors. Regression testing catches this by running your full eval suite on every prompt change.

The Prompt Regression Workflow

Eval-driven prompt development cycle:

1. Identify Issue
   └─ Bug report, user feedback, or observed failure
2. Add Eval Case
   └─ Create a test case that reproduces the failure
   └─ Verify: current prompt FAILS this new case
3. Modify Prompt
   └─ Make the targeted change to fix the issue
4. Run Full Eval Suite
   └─ New case passes? ✓
   └─ All existing cases still pass? ✓ → Ship it
   └─ Existing cases regressed? ✗ → Iterate on prompt
5. Deploy with Monitoring
   └─ Watch production metrics for unexpected changes
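The "run full eval suite" step reduces to comparing current results against a stored baseline. A minimal sketch of that regression gate; the case layout, baseline format, and the toy classifier are assumptions for illustration:

```python
# Sketch of a regression gate over an eval suite. The case schema,
# baseline format, and toy classifier are illustrative assumptions.

def run_suite(cases: list[dict], system) -> dict:
    """Run every eval case; return {case_id: passed} for the system under test."""
    return {c["id"]: system(c["input"]) == c["expected"] for c in cases}

def regressions(results: dict, baseline: dict) -> list[str]:
    """Case ids that passed at baseline but fail now."""
    return [cid for cid, ok in results.items()
            if baseline.get(cid, False) and not ok]

cases = [
    {"id": "return-1", "input": "return shoes", "expected": "return_request"},
    {"id": "status-1", "input": "where is my order", "expected": "order_status"},
]
baseline = {"return-1": True, "status-1": True}

# A hypothetical "fixed" classifier that silently broke status detection:
def classifier(text: str) -> str:
    return "return_request" if "return" in text else "unknown"

print(regressions(run_suite(cases, classifier), baseline))  # ['status-1']
```

In CI, a non-empty regression list fails the build, forcing the prompt change to be iterated on before it ships.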

Version Control for Prompts
Treat prompts like code. Store them in version control, require pull requests for changes, and run eval suites in CI. Tools like Braintrust and LangSmith integrate with Git workflows to provide eval results directly on pull requests, showing exactly which test cases improved, regressed, or remained stable.

Eval Tooling

LangSmith

LangSmith, built by LangChain, is a platform for tracing, evaluating, and monitoring LLM applications. On the evaluation side, it provides:

  • Dataset management: Create and version eval datasets with inputs, expected outputs, and metadata. Import from CSV, JSON, or production traces.
  • Custom evaluators: Define scoring functions in Python — from simple string matching to LLM-as-judge chains. Evaluators run automatically against datasets.
  • Experiment tracking: Compare multiple prompt versions side-by-side with full scoring breakdowns. See exactly which cases improved and which regressed.
  • CI integration: Run evals in your CI pipeline and gate deployments on eval scores.

Braintrust

Braintrust is purpose-built for AI evaluation with a focus on developer experience and speed. Key features include:

  • Fast iteration: Evaluations run in parallel with real-time streaming results. See scores updating live as each case completes instead of waiting for the full suite.
  • Built-in scorers: Pre-built scoring functions for common patterns — factuality, relevance, summarization quality, SQL correctness, and more.
  • Diff view: Side-by-side comparison of outputs between experiments, highlighting exactly what changed and how scores shifted.
  • GitHub integration: Automatically posts eval results as comments on pull requests, making prompt changes reviewable by the team.
  • Online scoring: Run evaluations against production traffic in real-time to catch quality issues as they happen.

Eval-Driven Development Workflow

The best AI engineering teams practice eval-driven development — the AI equivalent of test-driven development (TDD). The principle is simple: write the eval first, then make it pass.

Eval-driven development in practice:

Sprint Planning: "We need to handle return requests for international orders"

Step 1: Write eval cases FIRST
  - 20 international return scenarios
  - Edge cases: different countries, currencies, shipping methods
  - Expected behaviors defined with rubrics
Step 2: Run evals against current system
  - Baseline score: 35% (system doesn't handle these well yet)
Step 3: Iterate on implementation
  - Add international return logic to system prompt
  - Add country-specific return policy to knowledge base
  - Add tool for looking up international shipping options
Step 4: Run evals after each change
  - After prompt change: 60%
  - After knowledge base update: 78%
  - After tool addition: 94%
Step 5: Run full regression suite
  - International returns: 94% ✓
  - Domestic returns: 97% (no regression) ✓
  - General queries: 95% (no regression) ✓
Step 6: Ship with confidence

Start Your Eval Suite Today
The hardest part of evaluation is starting. Begin with just 20 test cases that cover your application's most important behaviors. Run them manually if you have to. As you encounter bugs and edge cases, add them to the suite. Within a few weeks, you'll have a comprehensive eval suite that makes every prompt change a confident one. There is no more high-leverage activity in AI engineering than building good evals.

Key Takeaways

  1. AI evaluation is harder than traditional testing because outputs are stochastic — focus on measuring quality dimensions (correctness, safety, tone) rather than exact string matching.
  2. Industry benchmarks (MMLU, HumanEval, ARC-AGI, Chatbot Arena) measure general capabilities, but custom evals specific to your use case are what actually predict product quality.
  3. Build eval suites with 50–100 representative cases covering happy paths, edge cases, and known failures. Every bug becomes a new eval case to prevent regression.
  4. LLM-as-judge is the most flexible scoring method for subjective qualities — use a stronger model as judge with a detailed rubric, and validate against human ratings.
  5. Prompt regression testing is essential: run the full eval suite on every prompt change to catch the silent breakage that small tweaks can introduce.
  6. A/B test AI features with larger sample sizes than traditional product tests (1,000–5,000+ interactions per variant) to account for output variability.
  7. Practice eval-driven development: write eval cases first, establish a baseline, iterate until scores pass, confirm no regressions, then ship with confidence.

