An LLM API call takes between 1 and 30 seconds. A user's patience runs out in about 3. This is the fundamental tension in AI product engineering, and the reason most AI features feel slow compared to the traditional features they sit next to.
You cannot make the model faster. What you can do is build architecture patterns that hide, reduce, and redistribute latency so the user never feels the gap. This article covers three approaches -- caching, streaming, and async patterns -- with specific implementations, latency numbers, and decision frameworks.
The Latency Budget
Before optimizing anything, define your latency budget. A latency budget allocates the total acceptable response time across each component in the request path.
Here is a typical budget for a RAG-powered AI feature:
| Component | Target P50 | Target P99 |
|---|---|---|
| Query processing + embedding | 50ms | 150ms |
| Vector search + reranking | 100ms | 300ms |
| LLM inference (streaming, TTFT) | 500ms | 1,500ms |
| Post-processing + delivery | 50ms | 100ms |
| Total to first visible content | 700ms | 2,050ms |
Research on perceived responsiveness supports weighting the budget toward time-to-first-token: users perceive streaming interfaces as roughly 40% faster than buffered responses, even when total time is identical.
Strategy 1: Caching
Caching is the single most effective optimization for AI features. A cache hit eliminates the LLM call entirely, reducing response time from seconds to milliseconds.
Layer 1: Exact-Match Caching
Cache LLM responses keyed on the exact input (prompt hash). This works for deterministic queries where the same input always produces an acceptable output.
Good for: Classification, entity extraction, structured data generation, FAQ-style queries.
Not good for: Conversational AI, personalized responses, queries that depend on fresh data.
Implementation: Redis or Memcached with a TTL appropriate to your data freshness requirements. Hash the full prompt (system prompt + user message + retrieved context) as the cache key.
```python
import hashlib

import redis

r = redis.Redis()

def get_cached_response(prompt: str):
    key = hashlib.sha256(prompt.encode()).hexdigest()
    cached = r.get(key)
    if cached:
        return cached.decode()
    return None

def cache_response(prompt: str, response: str, ttl: int = 3600):
    key = hashlib.sha256(prompt.encode()).hexdigest()
    r.setex(key, ttl, response)
```
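The two helpers combine into a single check-then-store wrapper around the model call. A minimal in-memory sketch (a plain dict stands in for Redis here, and `llm_call` is a hypothetical stand-in for your provider SDK):

```python
import hashlib

# In-memory stand-in for the Redis helpers above (same keying scheme).
_cache: dict = {}

def cached_llm_call(prompt: str, llm_call) -> str:
    """Return a cached response if present; otherwise call the model and cache it."""
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key in _cache:
        return _cache[key]       # cache hit: milliseconds, no LLM call
    response = llm_call(prompt)  # cache miss: pay the 1-30s inference cost
    _cache[key] = response
    return response
```

The second identical query never reaches the model, which is the entire point: the savings come from the calls you do not make.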
Layer 2: Semantic Caching
Semantic caching stores responses alongside vector embeddings of the queries. When a new query arrives, it is embedded and compared against cached query embeddings. If a cached query is semantically similar (above a cosine similarity threshold), the cached response is returned.
Performance: Semantic caching adds 5-20ms for the vector similarity search but saves 1-5 seconds by skipping the LLM call. Production deployments report 40-60% cost reduction on applications with repetitive query patterns.
Critical parameter: Similarity threshold. Too high (0.98+) and you rarely hit the cache. Too low (below ~0.85) and you return irrelevant cached responses. Start at 0.92 (a distance threshold of 0.08 in tools that use cosine distance) and tune based on your query distribution.
```python
from redisvl.extensions.llmcache import SemanticCache

cache = SemanticCache(
    name="llm_cache",
    redis_url="redis://localhost:6379",
    distance_threshold=0.08,  # lower = stricter matching
)

# Check cache before calling the LLM
result = cache.check(prompt=user_query)
if result:
    return result[0]["response"]
```
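Under the hood, the check is just a cosine comparison: `distance_threshold=0.08` corresponds to a similarity threshold of 0.92, because cosine distance is one minus cosine similarity. A dependency-free sketch with a hypothetical `check_semantic_cache` helper (a linear scan for illustration, where a real store would use a vector index):

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

SIMILARITY_THRESHOLD = 0.92  # equivalent to distance_threshold = 0.08

def check_semantic_cache(query_emb, cache):
    """cache: list of (embedding, response). Return a response only if the
    closest cached query clears the similarity threshold."""
    best = max(cache, key=lambda e: cosine_similarity(query_emb, e[0]), default=None)
    if best is not None and cosine_similarity(query_emb, best[0]) >= SIMILARITY_THRESHOLD:
        return best[1]
    return None
```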
Layer 3: Prompt Caching (Provider-Level)
Anthropic, Google, and OpenAI all offer prompt caching: if the first N tokens of your prompt match a recent request, those tokens are served from cache at reduced cost and latency.
Best for: Systems where the system prompt and context are long but repeated across queries (same document set, same instructions, different user questions).
Impact: Up to 90% reduction in input token costs and 85% reduction in time-to-first-token for the cached prefix.
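With Anthropic's API, for example, you opt in by marking the long shared prefix with a `cache_control` block. A sketch of the request shape, not a definitive implementation; verify the field names against your provider's current API reference:

```python
def build_request(system_text: str, user_question: str) -> dict:
    """Messages API request that marks a long, reused system prompt as cacheable."""
    return {
        "model": "claude-sonnet-4-20250514",
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": system_text,                     # long prefix, shared across queries
                "cache_control": {"type": "ephemeral"},  # cache everything up to here
            }
        ],
        "messages": [{"role": "user", "content": user_question}],
    }
```

Only the user question varies between requests, so the expensive prefix is billed and processed at the cached rate.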
Strategy 2: Streaming
Streaming does not reduce total response time. It eliminates perceived wait time by showing results progressively.
Server-Sent Events (SSE)
SSE is the standard for LLM streaming. Every major LLM API (OpenAI, Anthropic, Google) supports streaming via SSE.
Architecture:
```
Client --> [Your API] --> [LLM API (streaming)]
  ^             |
  |             v
  +---- SSE connection <-- Token-by-token relay
```
Implementation (Node.js/Express):
```javascript
app.get('/api/chat', async (req, res) => {
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('Connection', 'keep-alive');

  const stream = await anthropic.messages.create({
    model: 'claude-sonnet-4-20250514',
    stream: true,
    messages: [{ role: 'user', content: req.query.q }],
    max_tokens: 1024,
  });

  for await (const event of stream) {
    if (event.type === 'content_block_delta') {
      res.write(`data: ${JSON.stringify({ text: event.delta.text })}\n\n`);
    }
  }

  res.write('data: [DONE]\n\n');
  res.end();
});
```
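The wire format itself is trivial and portable: each chunk is a `data:` line followed by a blank line. A Python version of the same framing, if your backend is not Node (`sse_event` is an illustrative helper, not a library function):

```python
import json

def sse_event(text: str) -> str:
    """Format one token delta as a Server-Sent Events data frame."""
    return f"data: {json.dumps({'text': text})}\n\n"

# Conventional sentinel frame signalling end of stream to the client.
DONE = "data: [DONE]\n\n"
```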
UI Patterns for Streaming
Token-by-token rendering: The classic "typewriter" effect. Text appears word by word. TTFT is the time to show the first word.
Block rendering: Wait until a complete sentence or paragraph is available, then render the entire block. Slightly higher perceived TTFT but avoids the visual noise of individual tokens appearing.
Progressive disclosure: Show a structured skeleton (headings, bullet points) first, then fill in the content. Works well for structured outputs like reports or summaries.
Streaming with citations: Stream the text first, then append citations as a batch after the response is complete. This avoids the jarring UX of citation numbers appearing before the referenced text is visible.
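Block rendering, for instance, reduces to buffering tokens until a sentence boundary appears. A minimal sketch, assuming sentences end in `.`, `!`, or `?`:

```python
import re

def sentence_blocks(token_stream):
    """Group a raw token stream into complete sentences for block rendering."""
    buffer = ""
    for token in token_stream:
        buffer += token
        # Emit everything up to the last sentence-ending punctuation mark.
        match = re.search(r"^(.*[.!?])\s*", buffer, re.DOTALL)
        if match:
            yield match.group(1)
            buffer = buffer[match.end():]
    if buffer.strip():
        yield buffer  # trailing partial sentence
```

The UI renders each yielded block at once instead of animating individual tokens.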
Strategy 3: Async Patterns
For AI tasks that take 10+ seconds (complex analysis, multi-step workflows, batch processing), synchronous request-response breaks down. Move to async.
Pattern 1: Background Job + Polling
Submit the AI task, return a job ID immediately, and poll for completion.
```
POST /api/analyze      --> 202 Accepted { "jobId": "abc123" }
GET  /api/jobs/abc123  --> 200 { "status": "processing", "progress": 45 }
GET  /api/jobs/abc123  --> 200 { "status": "complete", "result": {...} }
```
Best for: Report generation, document analysis, batch classification. Tasks where the user can do other things while waiting.
Implementation: Use a task queue (Celery, Bull, or AWS SQS + Lambda) to process AI jobs asynchronously. Store results in a database or object store.
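A minimal in-process sketch of the submit/poll contract, using `concurrent.futures` and an in-memory job table where production code would use a task queue and durable store (`submit_job` and `job_status` are illustrative names):

```python
import uuid
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=4)
jobs = {}  # job_id -> Future (use a durable store, not a dict, in production)

def submit_job(task, *args) -> str:
    """POST /api/analyze: start the task, return a job ID immediately."""
    job_id = uuid.uuid4().hex
    jobs[job_id] = executor.submit(task, *args)
    return job_id

def job_status(job_id: str) -> dict:
    """GET /api/jobs/{id}: report status without blocking on the task."""
    future = jobs[job_id]
    if future.done():
        return {"status": "complete", "result": future.result()}
    return {"status": "processing"}
```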
Pattern 2: WebSocket Progress Updates
For tasks where the user is actively waiting, use WebSockets to push progress updates in real-time.
```
Client connects via WebSocket
Server sends: { "stage": "Retrieving documents", "progress": 20 }
Server sends: { "stage": "Analyzing content",    "progress": 60 }
Server sends: { "stage": "Generating summary",   "progress": 90 }
Server sends: { "stage": "Complete", "result": {...} }
```
Best for: Multi-step AI workflows where the user cares about intermediate progress (e.g., "Searching 500 documents... Found 12 relevant results... Generating analysis...").
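On the server, this reduces to a workflow that pushes stage messages through a send callback. A sketch where `send` stands in for the WebSocket connection's send method and the three step functions are hypothetical:

```python
def run_workflow(send, retrieve, analyze, summarize):
    """Multi-step AI workflow that reports progress after each stage."""
    send({"stage": "Retrieving documents", "progress": 20})
    docs = retrieve()
    send({"stage": "Analyzing content", "progress": 60})
    analysis = analyze(docs)
    send({"stage": "Generating summary", "progress": 90})
    result = summarize(analysis)
    send({"stage": "Complete", "result": result})
```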
Pattern 3: Optimistic UI + Background Sync
Show a predicted or placeholder result immediately, then replace it with the actual AI result when available.
Best for: AI suggestions, auto-complete, smart defaults. Cases where a "close enough" initial response is better than a loading spinner.
Example: In an email composer, immediately show a draft subject line based on a template, then silently replace it with an AI-generated subject line when the model responds. If the user has already edited the subject line, do not overwrite.
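The overwrite guard in that example is a one-line check: the AI result replaces the field only if the user has not touched it. A sketch (`resolve_subject` is an illustrative name):

```python
def resolve_subject(placeholder: str, current_value: str, ai_suggestion: str) -> str:
    """Swap in the AI suggestion only if the user hasn't edited the placeholder."""
    if current_value != placeholder:
        return current_value  # the user's edit always wins
    return ai_suggestion
```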
Pattern 4: Pre-computation
Run AI inference before the user needs it. When a new document is uploaded, immediately generate a summary, extract entities, and classify the document. When the user opens the document, the AI results are already waiting.
Best for: Document management systems, content platforms, any context where you can predict what AI features the user will need.
Tradeoff: Higher compute cost (you process documents users may never look at) in exchange for zero perceived latency.
Combining Strategies: The Layered Approach
The best AI features use all three strategies in combination:
- Check the exact-match cache (~5ms). If hit, return immediately.
- On a miss, check the semantic cache (5-20ms). The hash lookup runs first because it is cheaper and needs no embedding call; an exact match is also a perfect semantic match, so the reverse order makes the exact check dead code.
- If both caches miss, start streaming the LLM response via SSE (~500ms to first token).
- For complex queries, fall back to async with progress updates.
- Pre-compute common results during off-peak hours.

```
User Query
    |
    v
[Exact Cache] ------hit------> Return cached response (5ms)
    |
   miss
    |
    v
[Semantic Cache] ----hit----> Return cached response (15ms)
    |
   miss
    |
    v
[Stream LLM Response] --> TTFT ~500ms, full response 2-8s
    |
    v
[Cache the response] --> Available for the next similar query
```
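The whole flow fits in one dispatcher. A sketch with the lookups, streamer, and store passed in as callables (all names illustrative); the cheap exact-hash lookup runs before the embedding-based one:

```python
def layered_answer(query, exact_lookup, semantic_lookup, stream_llm, store):
    """Layered flow: exact cache, then semantic cache, then streamed LLM call."""
    hit = exact_lookup(query)            # ~5ms hash lookup
    if hit is not None:
        return hit
    hit = semantic_lookup(query)         # ~5-20ms embedding lookup
    if hit is not None:
        return hit
    tokens = stream_llm(query)           # stream tokens to the user as they arrive
    response = "".join(tokens)           # joined here only so we can cache it
    store(query, response)               # available for the next similar query
    return response
```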
Measuring What Matters
Track these latency metrics in production:
| Metric | What It Tells You | Target |
|---|---|---|
| TTFT (Time to First Token) | Perceived responsiveness | < 1 second |
| Total response time | End-to-end latency | < 10 seconds |
| Cache hit rate | Caching effectiveness | > 30% |
| P99 latency | Worst-case user experience | < 3x P50 |
| Timeout rate | Infrastructure reliability | < 0.1% |
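The P50 and P99 columns fall out of recorded samples directly, e.g. with the standard library's `statistics.quantiles`:

```python
import statistics

def latency_report(ttft_samples_ms):
    """Compute P50/P99 and their ratio from recorded TTFT samples (ms)."""
    qs = statistics.quantiles(ttft_samples_ms, n=100)
    p50, p99 = qs[49], qs[98]
    return {"p50": p50, "p99": p99, "p99_over_p50": p99 / p50}
```

Alert when `p99_over_p50` drifts above 3: the median can look healthy while the tail quietly degrades.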
The Bottom Line
Users do not care how long the model takes. They care how long they wait. Caching eliminates the wait entirely for repeated queries. Streaming turns a wait into a progressive reveal. Async patterns convert blocking waits into background processes.
The AI model is slow. Your feature does not have to be.
