Advanced · 40 min · Module 3 of 5

Observability & Monitoring

Logging, tracing agent workflows, monitoring latency/cost/quality. LangSmith, W&B, Arize.

Traditional software observability tells you whether your system is up and fast. AI observability must also tell you whether your system is good — whether the outputs are correct, safe, and useful. This module covers why AI monitoring differs from traditional monitoring, how to log and trace AI interactions, what metrics to track, which tools to use, and how to build dashboards and alerts that catch quality degradation before your users do.

Why AI Observability Is Different

In traditional software, if the API returns a 200 status code with a well-formed response, the request was successful. In AI systems, the API can return a 200 with a perfectly structured response that is completely wrong, hallucinated, or harmful. Success is no longer a binary — it's a spectrum of quality that requires new monitoring approaches.

| Dimension | Traditional Software | AI Systems |
|---|---|---|
| Correctness | Deterministic — same input produces same output | Stochastic — same input can produce different outputs, all potentially valid |
| Failure modes | Errors, timeouts, crashes — visible and countable | Hallucinations, drift, subtle quality degradation — invisible without measurement |
| Cost model | Relatively fixed per request (compute, bandwidth) | Highly variable — depends on input/output token count, model choice, retries |
| Latency profile | Mostly consistent, varies by data size | Varies dramatically by prompt complexity and output length (100ms–60s+) |
| Dependencies | Your code, your database, your infrastructure | External model providers whose behavior can change without notice |
Silent Failures Are the Real Threat
The most dangerous failures in AI systems are silent. The model confidently produces incorrect answers. A prompt change subtly shifts behavior. A model provider updates their model and your application's quality degrades. Without proactive monitoring for output quality — not just system health — you won't know until users complain.

Logging AI Interactions

Every AI interaction should be logged with enough detail to reconstruct what happened, why it happened, and how much it cost. This data powers debugging, evaluation, cost analysis, and compliance.

Structured Logging Schema

Essential fields for AI interaction logs:

{
  // Request metadata
  "trace_id": "abc-123-def",
  "span_id": "span-456",
  "timestamp": "2026-03-19T14:30:00Z",
  "user_id": "user_789",
  "session_id": "session_012",

  // Model configuration
  "model": "claude-sonnet-4-20260514",
  "temperature": 0.3,
  "max_tokens": 1024,
  "system_prompt_version": "v2.4.1",

  // Input
  "input_tokens": 1847,
  "prompt_template": "customer_support_v3",
  "has_tool_calls": true,
  "retrieved_context_chunks": 4,

  // Output
  "output_tokens": 342,
  "finish_reason": "end_turn",
  "tool_calls": ["lookup_order", "check_return_policy"],
  "latency_ms": 2340,
  "time_to_first_token_ms": 480,

  // Cost
  "input_cost_usd": 0.00554,
  "output_cost_usd": 0.00513,
  "total_cost_usd": 0.01067,

  // Quality signals
  "user_feedback": "thumbs_up",
  "guardrail_triggered": false,
  "error": null
}

Cost Tracking

Cost tracking deserves special attention because AI API costs are per-token and can vary dramatically across requests. A single request with a large context window can cost 100x more than a simple query.

  • Per-request cost: Calculate input tokens multiplied by the model's input price, plus output tokens multiplied by the output price. Log this on every request.
  • Per-user cost: Aggregate request costs by user to identify heavy users, detect abuse, and inform pricing decisions.
  • Per-feature cost: Tag requests by feature (search, chat, summarization) to understand which features drive cost and where optimization will have the most impact.
  • Cost budgets: Set daily and monthly cost budgets with alerts at 50%, 80%, and 100% thresholds. Consider automatic rate limiting or model downgrading when approaching budget limits.
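The per-request arithmetic above is simple enough to sketch directly. In the example below the price table, model names, and rates are illustrative placeholders, not real provider pricing; look up current rates for the models you actually use.

```python
# Per-request cost: input tokens * input price + output tokens * output price.
# Prices are in USD per million tokens (MTok) and are placeholders.
PRICES_PER_MTOK = {
    "example-small": {"input": 0.25, "output": 1.25},
    "example-large": {"input": 3.00, "output": 15.00},
}

def request_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Compute the cost of a single request from its token counts."""
    p = PRICES_PER_MTOK[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# The log schema above recorded 1,847 input / 342 output tokens;
# at $3/$15 per MTok that works out to roughly $0.0107.
cost = request_cost_usd("example-large", input_tokens=1847, output_tokens=342)
```

Logging this value on every request (and tagging it with user and feature IDs) is what makes the per-user and per-feature aggregations above a simple GROUP BY rather than a reconstruction project.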

Tracing Agent Workflows

Modern AI applications — especially agents and multi-step RAG pipelines — involve chains of operations: retrieval, multiple LLM calls, tool executions, and conditional branching. Tracing provides visibility into these complex workflows by linking related operations into a tree of spans.

Spans and Traces

Borrowing from distributed tracing (OpenTelemetry), AI observability uses the same concepts:

  • Trace: The complete lifecycle of a user request — from initial query to final response. A trace contains one or more spans.
  • Span: A single operation within a trace — an LLM call, a vector search, a tool execution, or a guardrail check. Each span records its start time, duration, inputs, outputs, and status.
  • Parent-child relationships: Spans are nested. An agent trace might contain a top-level "agent_run" span with child spans for "planning", "retrieval", "tool_call", and "generation" — each revealing where time and tokens are spent.

Example trace for a RAG agent query:

Trace: "What was our Q4 revenue growth?"
│
├─ Span: query_understanding (LLM call)
│  ├─ Duration: 340ms
│  ├─ Tokens: 120 in / 45 out
│  └─ Output: intent=financial_query, period=Q4_2025
│
├─ Span: document_retrieval (vector search)
│  ├─ Duration: 85ms
│  ├─ Documents retrieved: 6
│  └─ Relevance scores: [0.94, 0.91, 0.87, 0.82, 0.76, 0.71]
│
├─ Span: reranking (cross-encoder)
│  ├─ Duration: 120ms
│  └─ Top 3 after rerank: [doc_4, doc_1, doc_6]
│
├─ Span: answer_generation (LLM call)
│  ├─ Duration: 1,850ms
│  ├─ Tokens: 2,400 in / 280 out
│  └─ Cost: $0.028
│
└─ Span: guardrail_check
   ├─ Duration: 45ms
   └─ Status: passed

Total trace duration: 2,440ms
Total cost: $0.031
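In production you would emit spans through an SDK (OpenTelemetry, LangSmith, etc.), but the trace/span model itself is small enough to sketch by hand. The following dependency-free sketch shows the core mechanics: a stack of open spans, parent-child nesting via context managers, and per-span attributes and durations. All names are illustrative.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str
    attributes: dict = field(default_factory=dict)
    children: list = field(default_factory=list)
    duration_ms: float = 0.0

@dataclass
class Tracer:
    _stack: list = field(default_factory=list)  # currently-open spans
    root: Span = None                           # the trace's top-level span

    @contextmanager
    def span(self, name, **attributes):
        s = Span(name, dict(attributes))
        if self._stack:
            self._stack[-1].children.append(s)  # nest under the open span
        else:
            self.root = s                       # first span becomes the root
        self._stack.append(s)
        start = time.perf_counter()
        try:
            yield s
        finally:
            s.duration_ms = (time.perf_counter() - start) * 1000
            self._stack.pop()

tracer = Tracer()
with tracer.span("agent_run", query="What was our Q4 revenue growth?"):
    with tracer.span("document_retrieval") as s:
        s.attributes["documents_retrieved"] = 6
    with tracer.span("answer_generation") as s:
        s.attributes["tokens_in"], s.attributes["tokens_out"] = 2400, 280
```

The `with`-block nesting is what produces the parent-child tree: any span opened while another is still open becomes its child, which is exactly how the indented trace above is built.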

OpenTelemetry for AI
The AI observability ecosystem is converging on OpenTelemetry (OTel) as the standard instrumentation layer. Libraries like OpenLLMetry (by Traceloop) and OpenInference (by Arize) extend OTel with AI-specific semantic conventions for LLM calls, embeddings, and retrievals. This means your AI traces can live alongside your traditional application traces in the same observability backend.

Key Metrics to Monitor

Latency Metrics

  • Time to first token (TTFT): How long until the user sees the first token of the response. Critical for streaming UIs — users perceive responsiveness by when output starts, not when it finishes.
  • Total latency: End-to-end time for the full response. Track p50, p95, and p99 percentiles — averages hide tail latency problems.
  • Tokens per second: Generation speed. Important for self-hosted models where throughput is a function of GPU utilization and batching.
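A quick illustration of why percentiles matter more than averages. The nearest-rank percentile below is a simple exact method; monitoring backends use equivalent logic or streaming approximations (e.g. t-digest) at scale. The latency values are made up.

```python
def percentile(values, p):
    """Nearest-rank percentile: the ceil(p/100 * N)-th smallest value."""
    ordered = sorted(values)
    k = -(-len(ordered) * p // 100) - 1  # ceil via negated floor division
    return ordered[max(0, int(k))]

# Nine ordinary requests and one slow outlier (ms):
latencies_ms = [820, 910, 1050, 1200, 1340, 1500, 2100, 2400, 5600, 14800]

p50 = percentile(latencies_ms, 50)            # 1340 -- typical experience
p95 = percentile(latencies_ms, 95)            # 14800 -- the tail users hit
mean = sum(latencies_ms) / len(latencies_ms)  # 3172 -- hides both stories
```

The mean (3,172ms) suggests a system far slower than most users experience, while p95 exposes the outlier the mean smooths over; that is why the text recommends tracking p50/p95/p99 rather than averages.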

Cost Metrics

  • Cost per request: Average and p95 cost. High variance indicates some requests are disproportionately expensive.
  • Daily/weekly spend: Track trends and detect spikes early. A sudden 3x increase in daily spend might indicate a bug causing unnecessary retries or inflated prompts.
  • Cost per user action: The business-level metric — how much does it cost to serve one customer support resolution, one document summary, or one search query?

Quality Metrics

  • User feedback scores: Thumbs up/down ratios, CSAT, or custom rating scales. The most direct signal of output quality.
  • Guardrail trigger rate: How often do safety filters or content policies activate? A sudden increase may indicate the model is producing more problematic outputs.
  • Retry and regeneration rate: How often do users retry or regenerate? High retry rates signal that outputs aren't meeting user expectations.
  • Online eval scores: Run automated evaluations (e.g., LLM-as-judge) on a sample of production traffic to continuously measure quality dimensions like correctness and relevance.
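Online evals are usually run on a sampled fraction of traffic to bound cost. A minimal sketch of the sampling gate, where `judge` is a hypothetical placeholder for an LLM-as-judge call returning a score in [0, 1]; in production the scoring would run as an async job, not inline on the request path.

```python
import random

def maybe_score(trace: dict, judge, sample_rate: float = 0.05, rng=random):
    """Score a fraction of production traces with an automated eval.

    `judge(input, output)` is a placeholder for an LLM-as-judge call;
    `sample_rate=0.05` means roughly 5% of traffic incurs eval cost.
    """
    if rng.random() >= sample_rate:
        return None  # not sampled: no eval performed, no cost incurred
    return {
        "trace_id": trace["trace_id"],
        "correctness": judge(trace["input"], trace["output"]),
    }
```

Aggregating these sampled scores over time gives the "online eval scores over time" series the quality dashboard below relies on.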

Observability Tools

| Tool | Primary Focus | Key Strengths | Best For |
|---|---|---|---|
| LangSmith | Tracing + evaluation | Deep LangChain integration, conversation threading, annotation queues, dataset management | Teams using LangChain/LangGraph; combined tracing and eval workflows |
| Weights & Biases (W&B) | Experiment tracking + ML lifecycle | Model training tracking, prompt versioning, artifact management, collaborative dashboards | Teams doing model training/fine-tuning alongside application development |
| Arize AI | Production monitoring + troubleshooting | Embedding drift detection, automated monitors, root cause analysis, LLM guardrails | Production-heavy teams needing advanced drift detection and automated alerting |
| Braintrust | Evaluation + logging | Fast eval iteration, online scoring, production logging, GitHub CI integration | Teams prioritizing eval-driven development with production monitoring |
| Helicone | Cost + usage analytics | One-line proxy integration, detailed cost breakdowns, rate limiting, caching analytics | Teams needing quick cost visibility with minimal integration effort |
Start with One Tool, Not Five
The observability tool landscape is crowded and overlapping. Pick one platform that covers your most urgent need — usually tracing and cost tracking — and expand from there. LangSmith is a strong default if you're using LangChain. Braintrust if eval-driven development is your priority. Helicone if you need cost visibility immediately with minimal setup.

Building Dashboards

A well-designed AI dashboard gives your team at-a-glance visibility into system health, quality, and cost. Structure your dashboard around three levels of detail.

Executive Dashboard

High-level metrics for leadership and product managers:

  • Total AI requests per day/week (trend line)
  • Total spend and cost per user action (trend line)
  • Overall user satisfaction score (trend line)
  • Error rate and escalation rate

Engineering Dashboard

Operational metrics for the engineering team:

  • Latency percentiles (p50, p95, p99) by endpoint
  • Token usage by model and feature
  • Guardrail trigger rate and types
  • Request volume heatmap by hour
  • Model provider status and error rates

Quality Dashboard

Output quality metrics for AI/ML engineers:

  • Online eval scores over time (by eval dimension)
  • User feedback distribution (thumbs up/down, ratings)
  • Retry and regeneration rates
  • Top failure categories from user reports
  • Embedding drift scores (for RAG systems)

Alerting on Quality Degradation

Alerts for AI systems must go beyond uptime and latency. Quality degradation often happens gradually — a few percentage points per week — making it invisible without statistical alerting.

Alert Categories

  • Availability alerts: Model API errors, timeout rates, and circuit breaker activations. Threshold: error rate > 1% over 5 minutes.
  • Latency alerts: p95 latency exceeds SLA. Threshold: p95 > 5 seconds for 10 minutes (adjust based on your application).
  • Cost alerts: Hourly or daily spend exceeds budget. Use anomaly detection rather than fixed thresholds — a 3x spike on a Tuesday when traffic normally dips is more alarming than steady high spend on a Monday.
  • Quality alerts: User feedback score drops below threshold. Online eval scores decline by more than 5% week-over-week. Guardrail trigger rate increases significantly.
  • Drift alerts: For RAG systems, monitor whether the distribution of retrieved documents or embedding similarities is shifting. This can indicate stale data or degraded retrieval.
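The quality rules above translate directly into code. A minimal sketch of a quality alert check; the function name, floor, and week-over-week threshold are illustrative and should be tuned to your own baselines.

```python
def quality_alerts(feedback_score: float,
                   eval_score_this_week: float,
                   eval_score_last_week: float,
                   feedback_floor: float = 0.80,
                   max_weekly_drop: float = 0.05) -> list:
    """Evaluate the quality alert rules described above.

    Returns a list of fired alerts (empty means healthy).
    """
    alerts = []
    if feedback_score < feedback_floor:
        alerts.append("user feedback below floor")
    # Relative week-over-week decline in online eval scores
    drop = (eval_score_last_week - eval_score_this_week) / eval_score_last_week
    if drop > max_weekly_drop:
        alerts.append(f"eval score down {drop:.0%} week-over-week")
    return alerts
```

Checking a relative week-over-week decline, rather than an absolute score floor alone, is what catches the gradual few-percent-per-week degradation described above before it becomes visible to users.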

Cost Anomaly Detection

Cost anomalies deserve dedicated attention because they can be both a symptom of bugs and a direct financial risk:

  • Infinite loop detection: An agent stuck in a retry loop can burn through thousands of dollars in minutes. Alert on any single trace exceeding a cost threshold (e.g., $1 per trace).
  • Prompt inflation: A code change that accidentally includes too much context can multiply costs. Monitor average input tokens per request and alert on sudden increases.
  • Model routing failures: If your routing logic fails and all requests go to the most expensive model, costs spike. Monitor the distribution of requests across models.
The $10,000 Agent Loop
One of the most common production incidents in AI systems is an agent entering an infinite loop — calling tools, failing, retrying, and burning through tokens. Always implement per-trace cost caps and maximum step limits. A single runaway agent trace should not be able to spend more than a predefined budget (e.g., $5) before being forcefully terminated.
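The two caps recommended above, a per-trace cost budget and a maximum step count, fit naturally around the agent loop itself. A minimal sketch, where `step_fn` is a hypothetical callable that runs one agent step and reports `(done, cost_usd)`; the names and limits are illustrative.

```python
class TraceLimitExceeded(Exception):
    """Raised when a trace blows its cost budget or step limit."""

def run_agent(step_fn, max_steps: int = 20, max_cost_usd: float = 5.00) -> float:
    """Run an agent loop with hard per-trace caps.

    Returns total spend on success; raises before a runaway trace can
    exceed its budget by more than one step's worth of cost.
    """
    total_cost = 0.0
    for _ in range(max_steps):            # step limit: the loop cannot run forever
        done, cost_usd = step_fn()
        total_cost += cost_usd
        if total_cost > max_cost_usd:     # cost cap: terminate runaway traces
            raise TraceLimitExceeded(f"trace spent ${total_cost:.2f}")
        if done:
            return total_cost
    raise TraceLimitExceeded(f"hit {max_steps}-step limit at ${total_cost:.2f}")
```

The key property is that both checks run inside the loop: a retry storm is stopped after at most `max_steps` iterations or `max_cost_usd` of spend, whichever comes first, rather than being discovered on next month's invoice.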


Key Takeaways

  1. AI observability must monitor output quality (correctness, safety, helpfulness), not just system health (uptime, latency) — a 200 status code doesn't mean the response was good.
  2. Log every AI interaction with structured fields covering model config, token counts, cost, latency, tool calls, and user feedback to enable debugging and cost analysis.
  3. Tracing with spans and parent-child relationships provides visibility into multi-step agent workflows — showing exactly where time, tokens, and cost are spent.
  4. Track three metric categories: latency (TTFT, p95 total), cost (per request, per user action, daily trend), and quality (feedback scores, eval scores, guardrail rates).
  5. Build tiered dashboards: executive (cost and satisfaction trends), engineering (latency and errors), and quality (eval scores and drift).
  6. Alert on cost anomalies aggressively — infinite agent loops, prompt inflation, and routing failures can burn through thousands of dollars in minutes.
  7. Implement per-trace cost caps and maximum step limits for agents to prevent runaway traces from causing financial damage.

