Module 5 of 5 · Expert · 40 min

Scaling AI Products

Caching strategies, rate limiting, multi-model routing, cost vs quality tradeoffs.

Scaling an AI application from hundreds to millions of users introduces challenges that don't exist at prototype scale — cost grows linearly with usage, latency becomes inconsistent under load, and quality must remain high across a diverse range of inputs. This module covers the engineering strategies that make AI products viable at scale: caching, rate limiting, multi-model routing, cost optimization, edge deployment, horizontal scaling, and load testing.

Caching Strategies

Caching is one of the highest-leverage optimizations for AI applications. Unlike traditional web caching, where cache hit rates of 90%+ are common, AI caching requires more sophisticated matching because user queries are rarely identical — but they are often semantically similar.

Exact Match Caching

The simplest form: hash the input (prompt + relevant parameters) and check if an identical request has been served before. If so, return the cached response immediately without calling the model.

  • When it works: Autocomplete suggestions, structured data extraction from identical templates, classification of repeated inputs (e.g., the same email gets classified the same way).
  • Cache key design: Include the model name, system prompt version, temperature, and the user message in the hash. Two requests that differ only in temperature should not share a cache entry.
  • Hit rates: Typically 5–20% for conversational applications, but can reach 60–80% for structured applications where inputs follow predictable patterns.
  • TTL strategy: Set time-to-live based on how quickly the underlying data changes. A knowledge base that updates weekly can cache for days; real-time data needs short TTLs.
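A minimal sketch of exact-match caching under the cache-key design above (an in-memory dict stands in for Redis or Memcached; function names are illustrative):

```python
import hashlib
import json

def cache_key(model: str, system_prompt_version: str,
              temperature: float, user_message: str) -> str:
    """Hash every parameter that affects the output, so requests that
    differ only in temperature (or prompt version) get separate entries."""
    payload = json.dumps(
        {
            "model": model,
            "system_prompt_version": system_prompt_version,
            "temperature": temperature,
            "user_message": user_message,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# An in-memory dict stands in for a real cache backend here.
cache: dict[str, str] = {}

def cached_completion(model, version, temperature, message, call_model):
    key = cache_key(model, version, temperature, message)
    if key in cache:
        return cache[key]            # cache hit: skip the model call
    response = call_model(message)   # cache miss: call the model
    cache[key] = response
    return response
```

A production version would add a TTL on each entry, as discussed above, and store entries in a shared cache rather than process memory.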

Semantic Caching

Semantic caching matches queries by meaning rather than exact string match. It embeds the incoming query, searches a cache of previous query embeddings, and returns a cached response if a sufficiently similar query has been answered before.

  • How it works: Embed each new query using the same embedding model used for your application. Search the cache using cosine similarity. If the nearest neighbor exceeds a similarity threshold (e.g., 0.95), return the cached response.
  • Threshold tuning: Too low (0.85) returns stale or inaccurate responses for distinct queries. Too high (0.99) rarely produces hits. Start at 0.95 and tune based on quality monitoring — track cases where the cached response didn't match user expectations.
  • Hit rates: Typically 15–40% depending on query diversity and threshold. Customer support applications — where many users ask variations of the same questions — see the highest rates.
  • Implementation: Use a vector database (Qdrant, Pinecone, Redis with vector search) as the cache backend. Tools like GPTCache and LangChain's caching module provide ready-made implementations.

Semantic caching flow:

User: "How do I reset my password?"
│
├─ Embed query → [0.23, -0.45, 0.87, ...]
├─ Search cache → nearest match: "How can I change my password?"
│                 similarity: 0.97 (above threshold 0.95)
│
└─ Return cached response (no LLM call needed!)
   Latency: ~50ms instead of ~2,000ms
   Cost: ~$0.0001 instead of ~$0.01

User: "How do I reset my password for the mobile app?"
│
├─ Embed query → [0.21, -0.43, 0.84, ...]
├─ Search cache → nearest match: "How can I change my password?"
│                 similarity: 0.88 (below threshold 0.95)
│
└─ Cache miss — call LLM, cache the new response
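The hit/miss logic above can be sketched in a few lines. This is a linear-scan toy (a real deployment would use a vector database), and the embedding function is assumed to be supplied by your application:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

class SemanticCache:
    """Linear-scan semantic cache; illustrative only."""

    def __init__(self, embed, threshold=0.95):
        self.embed = embed        # embedding function (assumed supplied)
        self.threshold = threshold
        self.entries = []         # list of (embedding, cached_response)

    def lookup(self, query):
        vec = self.embed(query)
        best = max(self.entries, key=lambda e: cosine(vec, e[0]), default=None)
        if best and cosine(vec, best[0]) >= self.threshold:
            return best[1]        # hit: reuse the cached response
        return None               # miss: caller should invoke the LLM

    def store(self, query, response):
        self.entries.append((self.embed(query), response))
```

The caller checks `lookup()` first, and on a miss calls the model and then `store()`s the new response, exactly as in the flow diagram.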

Prompt Caching from Providers
Major model providers now offer server-side prompt caching. Anthropic's prompt caching lets you mark static parts of your prompts (system instructions, few-shot examples, document context) for caching using cache control breakpoints, and charges 90% less for cached input tokens on subsequent requests. The main change is structuring your prompts so that cacheable content comes first and adding cache_control markers in API calls. At scale, this alone can reduce costs by 50–80% for applications with long system prompts.
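A sketch of the request structure this implies, following Anthropic-style cache-control breakpoints (the model name and prompt text are placeholders; check the current Anthropic documentation for exact field names and pricing):

```python
# Static content (long system prompt, reference docs) comes first and is
# marked cacheable; the per-request user message follows it.
LONG_SYSTEM_PROMPT = "You are a support assistant. <long knowledge-base text>"
user_message = "How do I reset my password?"

request_body = {
    "model": "claude-sonnet-4-5",   # placeholder model name
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,              # stable across requests
            "cache_control": {"type": "ephemeral"},  # cache breakpoint
        }
    ],
    "messages": [
        {"role": "user", "content": user_message}    # varies per request
    ],
}
```

Everything before the breakpoint is eligible for the cached-input discount on subsequent requests; content after it is billed normally.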

Rate Limiting and Queue Management

AI APIs have rate limits imposed by providers (tokens per minute, requests per minute), and your own system needs rate limits to control cost and ensure fair access. Effective queue management prevents request failures during traffic spikes.

Rate Limiting Patterns

  • Per-user rate limits: Prevent any single user from consuming disproportionate resources. Example: 20 requests per minute for free tier, 100 per minute for paid tier.
  • Per-feature rate limits: Expensive features (document analysis, long-form generation) get tighter limits than cheap features (classification, short answers).
  • Token-based rate limits: Rather than counting requests, count tokens consumed. A user who sends 100 short queries uses fewer resources than one who sends 10 massive documents.
  • Graceful degradation under limits: When a user hits a rate limit, don't just return a 429 error. Offer alternatives: switch to a faster/cheaper model, return a cached response, or queue the request for later processing.
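A minimal per-user token-bucket limiter illustrating the patterns above (the tier numbers follow the example; the same structure works per-feature, or with `cost` set to tokens consumed for token-based limits):

```python
import time

class TokenBucket:
    """Token bucket: `capacity` tokens, refilled at `rate` tokens/second."""

    def __init__(self, capacity: float, rate: float):
        self.capacity = capacity
        self.rate = rate
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        # Refill based on elapsed time, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller can degrade: cheaper model, cached response, or queue

# One bucket per user; free tier ≈ 20 requests/minute.
buckets: dict[str, TokenBucket] = {}

def check_limit(user_id: str) -> bool:
    bucket = buckets.setdefault(user_id, TokenBucket(capacity=20, rate=20 / 60))
    return bucket.allow()
```

On a `False` result, the graceful-degradation options above apply instead of a bare 429.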

Queue Architecture

For non-real-time workloads, a queue-based architecture absorbs traffic spikes and ensures reliable processing:

Request queue architecture:

Incoming Requests
│
├─ Priority Queue (Redis, SQS, RabbitMQ)
│  ├─ P0: Real-time user-facing requests → direct to model
│  ├─ P1: Near-real-time (webhooks, notifications) → <30s queue
│  └─ P2: Batch processing (reports, bulk analysis) → background
│
├─ Worker Pool (auto-scaled)
│  ├─ Workers pull from queue based on priority
│  ├─ Respect provider rate limits (token bucket algorithm)
│  └─ Retry with exponential backoff on failures
│
└─ Result Store (Redis, DynamoDB)
   ├─ Short-lived results for synchronous callers (polling/SSE)
   └─ Persistent results for batch jobs (S3, database)
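The priority-ordering behavior at the heart of this architecture can be sketched with an in-process heap (a production system would back this with Redis, SQS, or RabbitMQ and an auto-scaled worker pool):

```python
import heapq

class RequestQueue:
    """P0 before P1 before P2; FIFO within a priority level."""

    def __init__(self):
        self._heap = []
        self._seq = 0  # tie-breaker preserving FIFO order within a priority

    def enqueue(self, priority: int, request: str):
        heapq.heappush(self._heap, (priority, self._seq, request))
        self._seq += 1

    def dequeue(self):
        """Workers call this: pops the highest-priority (lowest number) request."""
        if not self._heap:
            return None
        _, _, request = heapq.heappop(self._heap)
        return request

q = RequestQueue()
q.enqueue(2, "nightly report")   # P2: batch
q.enqueue(0, "chat message")     # P0: real-time
q.enqueue(1, "webhook")          # P1: near-real-time
```

Workers pulling from this queue would additionally apply the token-bucket rate limiting and exponential-backoff retries noted in the diagram.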

Multi-Model Routing

Not every request deserves the most powerful (and expensive) model. Multi-model routing directs each request to the most appropriate model based on complexity, cost constraints, and latency requirements. This is the single most effective cost optimization strategy for production AI systems.

Routing Strategies

| Strategy | How It Works | Cost Savings | Trade-offs |
| --- | --- | --- | --- |
| Complexity-based | Classifier model estimates query difficulty; simple queries go to fast/cheap model, complex to powerful model | 40–70% | Classifier adds latency; misrouting causes quality drops |
| Cascading | Start with cheapest model; if confidence is low, escalate to next tier | 30–60% | Adds latency for escalated requests (double LLM call) |
| Task-based | Different features hardcoded to different models (classification → Haiku, analysis → Opus) | 30–50% | Simple but static; doesn't adapt to query variability within a feature |
| User-tier-based | Free users get fast/cheap models; premium users get powerful models | 20–40% | Simple to implement; creates visible quality tiers |

Example multi-model routing implementation:

# Complexity-based routing with cascading fallback

def route_request(query, user_tier):
    # Step 1: Classify complexity (using a fast, cheap model)
    complexity = classify_complexity(query)  # "simple", "medium", "complex"

    # Step 2: Select model based on complexity and user tier
    if complexity == "simple":
        model = "claude-haiku"   # ~$0.001 per request
    elif complexity == "medium":
        model = "claude-sonnet"  # ~$0.01 per request
    else:  # complex
        model = "claude-opus"    # ~$0.05 per request

    # Step 3: Generate response
    response = call_model(model, query)

    # Step 4: Confidence check — escalate if needed
    if response.confidence < 0.7 and model != "claude-opus":
        response = call_model("claude-opus", query)

    return response

Router Model Economics
The classification step itself costs tokens. A good routing classifier uses a small, fast model (Claude Haiku or similar) and adds ~50ms and less than $0.0005 per request. If your routing saves 50% on average model costs, the router pays for itself when the average request costs more than $0.001 — which is virtually always.
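The breakeven arithmetic above, as a one-line check (numbers are the illustrative figures from this section):

```python
# Routing is worth it when the expected savings exceed the router's own cost.
router_cost = 0.0005      # $ per request for the classifier call
savings_fraction = 0.5    # routing cuts average model cost by ~50%

def routing_worthwhile(avg_request_cost: float) -> bool:
    return savings_fraction * avg_request_cost > router_cost

# Breakeven: avg cost must exceed router_cost / savings_fraction = $0.001
```

At a typical $0.01 average request cost, routing saves ten times what the classifier costs.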

Cost vs. Quality Trade-offs at Scale

At scale, small per-request cost differences compound into massive budget impacts. Understanding where quality is essential and where "good enough" suffices is a core product decision.

Cost impact at scale (1 million requests/day):

                  Per Request   Daily     Monthly
All Opus:         $0.05         $50,000   $1,500,000
All Sonnet:       $0.01         $10,000   $300,000
All Haiku:        $0.001        $1,000    $30,000
Smart routing:    $0.008 avg    $8,000    $240,000
                  (70% Haiku, 20% Sonnet, 10% Opus)

Savings from routing: $42,000/day vs. all-Opus
With semantic caching (30% hit rate): $5,600/day → $168,000/month

These numbers are illustrative — actual costs depend on token counts, prompt lengths, and specific model pricing.
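The blended-cost arithmetic behind the routing row can be reproduced directly (the table rounds the blended $0.0077/request up to $0.008):

```python
# Illustrative per-request prices and routing mix from the table above.
prices = {"haiku": 0.001, "sonnet": 0.01, "opus": 0.05}  # $ per request
mix = {"haiku": 0.70, "sonnet": 0.20, "opus": 0.10}      # routing fractions
requests_per_day = 1_000_000

blended = sum(prices[m] * mix[m] for m in prices)  # $0.0077/request
daily = blended * requests_per_day                 # $7,700/day (table: ~$8,000)
daily_with_cache = daily * (1 - 0.30)              # 30% semantic-cache hit rate
```

Plugging your own measured prices, mix, and cache hit rate into this model is a quick way to project spend before committing to a routing design.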

Optimization Levers (in order of impact)

  • 1. Multi-model routing (40–70% savings): Route simple requests to cheap models. This has the highest impact because most requests in most applications are simple.
  • 2. Caching — provider and semantic (20–50% savings): Eliminate redundant LLM calls entirely. Provider prompt caching reduces input costs; semantic caching eliminates entire requests.
  • 3. Prompt optimization (10–30% savings): Shorter prompts cost less. Remove unnecessary instructions, compress examples, and use references instead of inline documents.
  • 4. Batch APIs (50% savings on eligible workloads): For non-time-sensitive processing, batch APIs from Anthropic and OpenAI offer 50% discounts. Schedule bulk analysis, eval runs, and content generation for batch processing.
  • 5. Output length control (5–15% savings): Set appropriate max_tokens limits. A classification task doesn't need 1,024 output tokens. Use structured outputs (JSON mode) to keep responses concise.

Edge Deployment

Edge deployment runs AI models closer to the user — on edge servers, CDN nodes, or even on the user's device. This reduces latency and can improve privacy by keeping data local.

Edge AI Use Cases

  • Small language models on-device: Models like Gemma 3, Phi-4-mini, and quantized Llama variants can run on modern smartphones and laptops. Good for offline-capable features, privacy-sensitive tasks, and ultra-low-latency needs.
  • Edge classification: Run lightweight classification models at the edge to pre-filter or route requests, sending only complex queries to cloud-hosted large models.
  • Embedding generation: Generate embeddings on-device for local semantic search without sending data to the cloud. ONNX Runtime and Core ML support efficient on-device embedding models.
  • Hybrid edge-cloud: The most practical pattern — run small models at the edge for latency-sensitive or privacy-sensitive operations, and call cloud models for complex tasks. The edge model handles 70–80% of requests; the cloud handles the rest.

Edge Model Limitations
Edge models (1B–7B parameters) are significantly less capable than cloud models (70B–400B+ parameters). They work well for focused tasks (classification, extraction, simple Q&A) but struggle with complex reasoning, nuanced generation, and multi-step analysis. Always benchmark your specific use case on the edge model before committing — don't assume cloud-level quality.

Horizontal Scaling Patterns

Horizontal scaling adds more instances of your serving infrastructure to handle increased load. For AI applications, this means scaling both the application layer (API servers, queue workers) and the model serving layer (GPU instances).

Stateless API Design

Design your AI application layer to be completely stateless. Store conversation history, user preferences, and session state in external stores (Redis, DynamoDB, PostgreSQL). This allows any API instance to handle any request, enabling simple horizontal scaling.
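A sketch of what stateless handling looks like in practice (a dict-backed store stands in for Redis or DynamoDB; the handler itself holds no state, so any instance can serve any session):

```python
import json

class SessionStore:
    """External conversation store keyed by session ID."""

    def __init__(self):
        self._backend = {}  # swap for Redis/DynamoDB with a TTL in production

    def load_history(self, session_id: str) -> list:
        raw = self._backend.get(session_id)
        return json.loads(raw) if raw else []

    def append(self, session_id: str, role: str, content: str):
        history = self.load_history(session_id)
        history.append({"role": role, "content": content})
        self._backend[session_id] = json.dumps(history)

def handle_request(store: SessionStore, session_id: str, message: str, call_model):
    # Fetch all state externally — nothing lives on this instance.
    history = store.load_history(session_id)
    reply = call_model(history + [{"role": "user", "content": message}])
    store.append(session_id, "user", message)
    store.append(session_id, "assistant", reply)
    return reply
```

Because `handle_request` reads and writes only the external store, a load balancer can send each request in a session to a different instance with no loss of context.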

Connection Pooling for Model APIs

When calling external model APIs (Anthropic, OpenAI), use HTTP connection pooling to reuse connections across requests. Creating a new TLS connection for every LLM call adds 50–100ms of unnecessary latency. Most HTTP client libraries support connection pooling with configurable pool sizes.

Load Balancing AI Workloads

  • Application layer: Standard load balancing (round robin, least connections) works well for stateless API servers.
  • Model serving layer: For self-hosted models, use least-tokens-in-queue or least-GPU-utilization routing to balance load across GPU instances. Standard round robin doesn't account for variable request complexity.
  • Sticky sessions for streaming: If using server-sent events (SSE) for token streaming, ensure the load balancer supports long-lived connections and doesn't prematurely close streams.

Load Testing AI Applications

Load testing AI applications requires different approaches than traditional load testing because response times are 10–100x longer and highly variable.

What to Test

  • Throughput under load: How many concurrent requests can your system handle before latency degrades? For self-hosted models, this directly relates to GPU count and batch size.
  • Latency distribution under load: Track p50, p95, and p99 latency as concurrency increases. AI latency tails can be extreme — p99 may be 10x p50 due to long generations.
  • Queue behavior: Test what happens when incoming request rate exceeds processing capacity. Verify that queues backfill correctly, priorities are respected, and timeouts trigger appropriately.
  • Autoscaling response: Trigger a load spike and measure how quickly new instances spin up and become ready. Factor in model loading time for GPU instances — a Kubernetes pod can be "running" for 2 minutes before the model is actually loaded.
  • Failure modes: Test what happens when a model API is down, rate limited, or responding slowly. Verify fallbacks activate correctly.

Load Testing Tools

  • k6: Scriptable load testing with good support for WebSocket and SSE (needed for streaming AI responses). Write test scenarios that simulate realistic user behavior.
  • Locust: Python-based load testing that lets you define user behavior as Python code. Easy to create AI-specific test patterns with variable prompt lengths and thinking time.
  • Custom harness: For AI-specific testing, you often need a custom harness that tracks AI-specific metrics (tokens per second, time to first token, cost per request) alongside standard HTTP metrics.

Load test checklist for AI applications:

  □ Test with realistic prompt lengths (not just "Hello")
  □ Include variable output lengths in test scenarios
  □ Simulate concurrent users, not just sequential requests
  □ Test streaming endpoints separately from batch endpoints
  □ Measure time-to-first-token, not just total latency
  □ Test cache hit scenarios (repeated queries) and cache miss scenarios
  □ Simulate traffic patterns (bursts, ramp-ups, sustained load)
  □ Test with rate limits enabled to verify graceful degradation
  □ Monitor GPU utilization and memory during tests (self-hosted)
  □ Track cost during load tests to project production spend

Load Test Against Real Models Sparingly
Load testing against real LLM APIs costs money and consumes rate limit quotas. Use a mock server that simulates realistic latency distributions for most testing, and run real API load tests sparingly to validate assumptions. A mock that returns a response after a random 1–5 second delay captures most scaling behavior without API costs.
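A minimal mock of the kind described above, reproducing a realistic latency shape (time-to-first-token plus a per-token streaming delay) without spending API quota. The specific numbers are illustrative; tune them to match your measured distributions:

```python
import random
import time

def mock_llm_call(prompt: str, out_tokens: int = 200, sleep=time.sleep):
    """Simulate an LLM call: TTFT plus per-token generation delay."""
    ttft = random.uniform(0.3, 1.0)         # time to first token
    per_token = random.uniform(0.01, 0.02)  # streaming rate per output token
    sleep(ttft + out_tokens * per_token)    # ~2.3–5.0s total for 200 tokens
    return "x " * out_tokens                # dummy response body
```

Pointing your load-test harness at an endpoint backed by this function captures most queueing and autoscaling behavior; reserve real-API runs for final validation.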

Key Takeaways

  1. Caching is among the highest-leverage optimizations: exact match caching handles 5–20% of requests; semantic caching handles 15–40%; provider prompt caching reduces input costs by up to 90%.
  2. Multi-model routing — sending simple requests to cheap models and complex requests to powerful models — delivers 40–70% cost savings and is the most impactful strategy at scale.
  3. Rate limiting should be multi-dimensional: per-user, per-feature, and token-based. Always provide graceful degradation (cheaper model, cached response) rather than just returning errors.
  4. At 1M requests/day, the difference between all-Opus and smart routing can be $1.26M per month. Every per-request cost decision compounds dramatically at scale.
  5. Edge deployment with small models handles 70–80% of focused tasks (classification, extraction) locally, reducing latency and cloud costs, but cannot match cloud model quality for complex reasoning.
  6. Design AI application layers to be stateless, use connection pooling for model APIs, and implement AI-aware load balancing (least-tokens-in-queue) for self-hosted models.
  7. Load test with realistic prompts, variable output lengths, and streaming endpoints. Use mock servers for most testing to avoid API costs, validating against real models sparingly.

Test Your Understanding

Module Assessment

5 questions · Score 70% or higher to complete this module

You can retake the quiz as many times as you need. Your best score is saved.
