Advanced · 40 min · Module 3 of 5

Observability & Monitoring

Logging, tracing agent workflows, monitoring latency/cost/quality. LangSmith, W&B, Arize.

Traditional software observability tells you whether your system is up and fast. AI observability must also tell you whether your system is good — whether the outputs are correct, safe, and useful. This module covers why AI monitoring differs from traditional monitoring, how to log and trace AI interactions, what metrics to track, which tools to use, and how to build dashboards and alerts that catch quality degradation before your users do.

Why AI Observability Is Different

In traditional software, if the API returns a 200 status code with a well-formed response, the request was successful. In AI systems, the API can return a 200 with a perfectly structured response that is completely wrong, hallucinated, or harmful. Success is no longer a binary — it's a spectrum of quality that requires new monitoring approaches.

| Dimension | Traditional Software | AI Systems |
|---|---|---|
| Correctness | Deterministic — same input produces same output | Stochastic — same input can produce different outputs, all potentially valid |
| Failure modes | Errors, timeouts, crashes — visible and countable | Hallucinations, drift, subtle quality degradation — invisible without measurement |
| Cost model | Relatively fixed per request (compute, bandwidth) | Highly variable — depends on input/output token count, model choice, retries |
| Latency profile | Mostly consistent, varies by data size | Varies dramatically by prompt complexity and output length (100ms–60s+) |
| Dependencies | Your code, your database, your infrastructure | External model providers whose behavior can change without notice |
Silent Failures Are the Real Threat
The most dangerous failures in AI systems are silent. The model confidently produces incorrect answers. A prompt change subtly shifts behavior. A model provider updates their model and your application's quality degrades. Without proactive monitoring for output quality — not just system health — you won't know until users complain.

Logging AI Interactions

Every AI interaction should be logged with enough detail to reconstruct what happened, why it happened, and how much it cost. This data powers debugging, evaluation, cost analysis, and compliance.

Structured Logging Schema

Essential fields for AI interaction logs:

{
  // Request metadata
  "trace_id": "abc-123-def",
  "span_id": "span-456",
  "timestamp": "2026-03-19T14:30:00Z",
  "user_id": "user_789",
  "session_id": "session_012",

  // Model configuration
  "model": "claude-sonnet-4-20260514",
  "temperature": 0.3,
  "max_tokens": 1024,
  "system_prompt_version": "v2.4.1",

  // Input
  "input_tokens": 1847,
  "prompt_template": "customer_support_v3",
  "has_tool_calls": true,
  "retrieved_context_chunks": 4,

  // Output
  "output_tokens": 342,
  "finish_reason": "end_turn",
  "tool_calls": ["lookup_order", "check_return_policy"],
  "latency_ms": 2340,
  "time_to_first_token_ms": 480,

  // Cost
  "input_cost_usd": 0.00554,
  "output_cost_usd": 0.00513,
  "total_cost_usd": 0.01067,

  // Quality signals
  "user_feedback": "thumbs_up",
  "guardrail_triggered": false,
  "error": null
}

Cost Tracking

Cost tracking deserves special attention because AI API costs are per-token and can vary dramatically across requests. A single request with a large context window can cost 100x more than a simple query.

  • Per-request cost: Calculate input tokens multiplied by the model's input price, plus output tokens multiplied by the output price. Log this on every request.
  • Per-user cost: Aggregate request costs by user to identify heavy users, detect abuse, and inform pricing decisions.
  • Per-feature cost: Tag requests by feature (search, chat, summarization) to understand which features drive cost and where optimization will have the most impact.
  • Cost budgets: Set daily and monthly cost budgets with alerts at 50%, 80%, and 100% thresholds. Consider automatic rate limiting or model downgrading when approaching budget limits.
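The per-request arithmetic above is simple enough to sketch directly. In the example below the price table, model names, and rates are illustrative placeholders, not real provider pricing; look up current rates for the models you actually use.

```python
# Per-request cost: input tokens * input price + output tokens * output price.
# Prices are in USD per million tokens (MTok) and are placeholders.
PRICES_PER_MTOK = {
    "example-small": {"input": 0.25, "output": 1.25},
    "example-large": {"input": 3.00, "output": 15.00},
}

def request_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Compute the cost of a single request from its token counts."""
    p = PRICES_PER_MTOK[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# The log schema above recorded 1,847 input / 342 output tokens;
# at $3/$15 per MTok that works out to roughly $0.0107.
cost = request_cost_usd("example-large", input_tokens=1847, output_tokens=342)
```

Logging this value on every request (and tagging it with user and feature IDs) is what makes the per-user and per-feature aggregations above a simple GROUP BY rather than a reconstruction project.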

Tracing Agent Workflows

Modern AI applications — especially agents and multi-step RAG pipelines — involve chains of operations: retrieval, multiple LLM calls, tool executions, and conditional branching. Tracing provides visibility into these complex workflows by linking related operations into a tree of spans.

Spans and Traces

Borrowing from distributed tracing (OpenTelemetry), AI observability uses the same concepts:

  • Trace: The complete lifecycle of a user request — from initial query to final response. A trace contains one or more spans.
  • Span: A single operation within a trace — an LLM call, a vector search, a tool execution, or a guardrail check. Each span records its start time, duration, inputs, outputs, and status.
  • Parent-child relationships: Spans are nested. An agent trace might contain a top-level "agent_run" span with child spans for "planning", "retrieval", "tool_call", and "generation" — each revealing where time and tokens are spent.

Example trace for a RAG agent query:

Trace: "What was our Q4 revenue growth?"
│
├─ Span: query_understanding (LLM call)
│  ├─ Duration: 340ms
│  ├─ Tokens: 120 in / 45 out
│  └─ Output: intent=financial_query, period=Q4_2025
│
├─ Span: document_retrieval (vector search)
│  ├─ Duration: 85ms
│  ├─ Documents retrieved: 6
│  └─ Relevance scores: [0.94, 0.91, 0.87, 0.82, 0.76, 0.71]
│
├─ Span: reranking (cross-encoder)
│  ├─ Duration: 120ms
│  └─ Top 3 after rerank: [doc_4, doc_1, doc_6]
│
├─ Span: answer_generation (LLM call)
│  ├─ Duration: 1,850ms
│  ├─ Tokens: 2,400 in / 280 out
│  └─ Cost: $0.028
│
└─ Span: guardrail_check
   ├─ Duration: 45ms
   └─ Status: passed

Total trace duration: 2,440ms
Total cost: $0.031
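In production you would emit spans through an SDK (OpenTelemetry, LangSmith, etc.), but the trace/span model itself is small enough to sketch by hand. The following dependency-free sketch shows the core mechanics: a stack of open spans, parent-child nesting via context managers, and per-span attributes and durations. All names are illustrative.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str
    attributes: dict = field(default_factory=dict)
    children: list = field(default_factory=list)
    duration_ms: float = 0.0

@dataclass
class Tracer:
    _stack: list = field(default_factory=list)  # currently-open spans
    root: Span = None                           # the trace's top-level span

    @contextmanager
    def span(self, name, **attributes):
        s = Span(name, dict(attributes))
        if self._stack:
            self._stack[-1].children.append(s)  # nest under the open span
        else:
            self.root = s                       # first span becomes the root
        self._stack.append(s)
        start = time.perf_counter()
        try:
            yield s
        finally:
            s.duration_ms = (time.perf_counter() - start) * 1000
            self._stack.pop()

tracer = Tracer()
with tracer.span("agent_run", query="What was our Q4 revenue growth?"):
    with tracer.span("document_retrieval") as s:
        s.attributes["documents_retrieved"] = 6
    with tracer.span("answer_generation") as s:
        s.attributes["tokens_in"], s.attributes["tokens_out"] = 2400, 280
```

The `with`-block nesting is what produces the parent-child tree: any span opened while another is still open becomes its child, which is exactly how the indented trace above is built.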

OpenTelemetry for AI
The AI observability ecosystem is converging on OpenTelemetry (OTel) as the standard instrumentation layer. Libraries like OpenLLMetry (by Traceloop) and OpenInference (by Arize) extend OTel with AI-specific semantic conventions for LLM calls, embeddings, and retrievals. This means your AI traces can live alongside your traditional application traces in the same observability backend.

Key Metrics to Monitor

Latency Metrics

  • Time to first token (TTFT): How long until the user sees the first token of the response. Critical for streaming UIs — users perceive responsiveness by when output starts, not when it finishes.
  • Total latency: End-to-end time for the full response. Track p50, p95, and p99 percentiles — averages hide tail latency problems.
  • Tokens per second: Generation speed. Important for self-hosted models where throughput is a function of GPU utilization and batching.
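A quick illustration of why percentiles matter more than averages. The nearest-rank percentile below is a simple exact method; monitoring backends use equivalent logic or streaming approximations (e.g. t-digest) at scale. The latency values are made up.

```python
def percentile(values, p):
    """Nearest-rank percentile: the ceil(p/100 * N)-th smallest value."""
    ordered = sorted(values)
    k = -(-len(ordered) * p // 100) - 1  # ceil via negated floor division
    return ordered[max(0, int(k))]

# Nine ordinary requests and one slow outlier (ms):
latencies_ms = [820, 910, 1050, 1200, 1340, 1500, 2100, 2400, 5600, 14800]

p50 = percentile(latencies_ms, 50)            # 1340 -- typical experience
p95 = percentile(latencies_ms, 95)            # 14800 -- the tail users hit
mean = sum(latencies_ms) / len(latencies_ms)  # 3172 -- hides both stories
```

The mean (3,172ms) suggests a system far slower than most users experience, while p95 exposes the outlier the mean smooths over; that is why the text recommends tracking p50/p95/p99 rather than averages.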

Cost Metrics

  • Cost per request: Average and p95 cost. High variance indicates some requests are disproportionately expensive.
  • Daily/weekly spend: Track trends and detect spikes early. A sudden 3x increase in daily spend might indicate a bug causing unnecessary retries or inflated prompts.
  • Cost per user action: The business-level metric — how much does it cost to serve one customer support resolution, one document summary, or one search query?

Quality Metrics

  • User feedback scores: Thumbs up/down ratios, CSAT, or custom rating scales. The most direct signal of output quality.
  • Guardrail trigger rate: How often do safety filters or content policies activate? A sudden increase may indicate the model is producing more problematic outputs.
  • Retry and regeneration rate: How often do users retry or regenerate? High retry rates signal that outputs aren't meeting user expectations.
  • Online eval scores: Run automated evaluations (e.g., LLM-as-judge) on a sample of production traffic to continuously measure quality dimensions like correctness and relevance.
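Online evals are usually run on a sampled fraction of traffic to bound cost. A minimal sketch of the sampling gate, where `judge` is a hypothetical placeholder for an LLM-as-judge call returning a score in [0, 1]; in production the scoring would run as an async job, not inline on the request path.

```python
import random

def maybe_score(trace: dict, judge, sample_rate: float = 0.05, rng=random):
    """Score a fraction of production traces with an automated eval.

    `judge(input, output)` is a placeholder for an LLM-as-judge call;
    `sample_rate=0.05` means roughly 5% of traffic incurs eval cost.
    """
    if rng.random() >= sample_rate:
        return None  # not sampled: no eval performed, no cost incurred
    return {
        "trace_id": trace["trace_id"],
        "correctness": judge(trace["input"], trace["output"]),
    }
```

Aggregating these sampled scores over time gives the "online eval scores over time" series the quality dashboard below relies on.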

Observability Tools

| Tool | Primary Focus | Key Strengths | Best For |
|---|---|---|---|
| LangSmith | Tracing + evaluation | Deep LangChain integration, conversation threading, annotation queues, dataset management | Teams using LangChain/LangGraph; combined tracing and eval workflows |
| Weights & Biases (W&B) | Experiment tracking + ML lifecycle | Model training tracking, prompt versioning, artifact management, collaborative dashboards | Teams doing model training/fine-tuning alongside application development |
| Arize AI | Production monitoring + troubleshooting | Embedding drift detection, automated monitors, root cause analysis, LLM guardrails | Production-heavy teams needing advanced drift detection and automated alerting |
| Braintrust | Evaluation + logging | Fast eval iteration, online scoring, production logging, GitHub CI integration | Teams prioritizing eval-driven development with production monitoring |
| Helicone | Cost + usage analytics | One-line proxy integration, detailed cost breakdowns, rate limiting, caching analytics | Teams needing quick cost visibility with minimal integration effort |
Start with One Tool, Not Five
The observability tool landscape is crowded and overlapping. Pick one platform that covers your most urgent need — usually tracing and cost tracking — and expand from there. LangSmith is a strong default if you're using LangChain. Braintrust if eval-driven development is your priority. Helicone if you need cost visibility immediately with minimal setup.

Building Dashboards

A well-designed AI dashboard gives your team at-a-glance visibility into system health, quality, and cost. Structure your dashboard around three levels of detail.

Executive Dashboard

High-level metrics for leadership and product managers:

  • Total AI requests per day/week (trend line)
  • Total spend and cost per user action (trend line)
  • Overall user satisfaction score (trend line)
  • Error rate and escalation rate

Engineering Dashboard

Operational metrics for the engineering team:

  • Latency percentiles (p50, p95, p99) by endpoint
  • Token usage by model and feature
  • Guardrail trigger rate and types
  • Request volume heatmap by hour
  • Model provider status and error rates

Quality Dashboard

Output quality metrics for AI/ML engineers:

  • Online eval scores over time (by eval dimension)
  • User feedback distribution (thumbs up/down, ratings)
  • Retry and regeneration rates
  • Top failure categories from user reports
  • Embedding drift scores (for RAG systems)

Alerting on Quality Degradation

Alerts for AI systems must go beyond uptime and latency. Quality degradation often happens gradually — a few percentage points per week — making it invisible without statistical alerting.

Alert Categories

  • Availability alerts: Model API errors, timeout rates, and circuit breaker activations. Threshold: error rate > 1% over 5 minutes.
  • Latency alerts: p95 latency exceeds SLA. Threshold: p95 > 5 seconds for 10 minutes (adjust based on your application).
  • Cost alerts: Hourly or daily spend exceeds budget. Use anomaly detection rather than fixed thresholds — a 3x spike on a Tuesday when traffic normally dips is more alarming than steady high spend on a Monday.
  • Quality alerts: User feedback score drops below threshold. Online eval scores decline by more than 5% week-over-week. Guardrail trigger rate increases significantly.
  • Drift alerts: For RAG systems, monitor whether the distribution of retrieved documents or embedding similarities is shifting. This can indicate stale data or degraded retrieval.
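The quality rules above translate directly into code. A minimal sketch of a quality alert check; the function name, floor, and week-over-week threshold are illustrative and should be tuned to your own baselines.

```python
def quality_alerts(feedback_score: float,
                   eval_score_this_week: float,
                   eval_score_last_week: float,
                   feedback_floor: float = 0.80,
                   max_weekly_drop: float = 0.05) -> list:
    """Evaluate the quality alert rules described above.

    Returns a list of fired alerts (empty means healthy).
    """
    alerts = []
    if feedback_score < feedback_floor:
        alerts.append("user feedback below floor")
    # Relative week-over-week decline in online eval scores
    drop = (eval_score_last_week - eval_score_this_week) / eval_score_last_week
    if drop > max_weekly_drop:
        alerts.append(f"eval score down {drop:.0%} week-over-week")
    return alerts
```

Checking a relative week-over-week decline, rather than an absolute score floor alone, is what catches the gradual few-percent-per-week degradation described above before it becomes visible to users.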

Cost Anomaly Detection

Cost anomalies deserve dedicated attention because they can be both a symptom of bugs and a direct financial risk:

  • Infinite loop detection: An agent stuck in a retry loop can burn through thousands of dollars in minutes. Alert on any single trace exceeding a cost threshold (e.g., $1 per trace).
  • Prompt inflation: A code change that accidentally includes too much context can multiply costs. Monitor average input tokens per request and alert on sudden increases.
  • Model routing failures: If your routing logic fails and all requests go to the most expensive model, costs spike. Monitor the distribution of requests across models.
The $10,000 Agent Loop
One of the most common production incidents in AI systems is an agent entering an infinite loop — calling tools, failing, retrying, and burning through tokens. Always implement per-trace cost caps and maximum step limits. A single runaway agent trace should not be able to spend more than a predefined budget (e.g., $5) before being forcefully terminated.
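The two caps recommended above, a per-trace cost budget and a maximum step count, fit naturally around the agent loop itself. A minimal sketch, where `step_fn` is a hypothetical callable that runs one agent step and reports `(done, cost_usd)`; the names and limits are illustrative.

```python
class TraceLimitExceeded(Exception):
    """Raised when a trace blows its cost budget or step limit."""

def run_agent(step_fn, max_steps: int = 20, max_cost_usd: float = 5.00) -> float:
    """Run an agent loop with hard per-trace caps.

    Returns total spend on success; raises before a runaway trace can
    exceed its budget by more than one step's worth of cost.
    """
    total_cost = 0.0
    for _ in range(max_steps):            # step limit: the loop cannot run forever
        done, cost_usd = step_fn()
        total_cost += cost_usd
        if total_cost > max_cost_usd:     # cost cap: terminate runaway traces
            raise TraceLimitExceeded(f"trace spent ${total_cost:.2f}")
        if done:
            return total_cost
    raise TraceLimitExceeded(f"hit {max_steps}-step limit at ${total_cost:.2f}")
```

The key property is that both checks run inside the loop: a retry storm is stopped after at most `max_steps` iterations or `max_cost_usd` of spend, whichever comes first, rather than being discovered on next month's invoice.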


Key Takeaways

  1. AI observability must monitor output quality (correctness, safety, helpfulness), not just system health (uptime, latency) — a 200 status code doesn't mean the response was good.
  2. Log every AI interaction with structured fields covering model config, token counts, cost, latency, tool calls, and user feedback to enable debugging and cost analysis.
  3. Tracing with spans and parent-child relationships provides visibility into multi-step agent workflows — showing exactly where time, tokens, and cost are spent.
  4. Track three metric categories: latency (TTFT, p95 total), cost (per request, per user action, daily trend), and quality (feedback scores, eval scores, guardrail rates).
  5. Build tiered dashboards: executive (cost and satisfaction trends), engineering (latency and errors), and quality (eval scores and drift).
  6. Alert on cost anomalies aggressively — infinite agent loops, prompt inflation, and routing failures can burn through thousands of dollars in minutes.
  7. Implement per-trace cost caps and maximum step limits for agents to prevent runaway traces from causing financial damage.

