An LLM API call takes between 1 and 30 seconds. A user's patience runs out in about 3. This is the fundamental tension in AI product engineering, and the reason most AI features feel slow compared to the traditional features they sit next to.
You cannot make the model faster. What you can do is build architecture patterns that hide, reduce, and redistribute latency so the user never feels the gap. This article covers three approaches -- caching, streaming, and async patterns -- with specific implementations, latency numbers, and decision frameworks.
The Latency Budget
Before optimizing anything, define your latency budget. A latency budget allocates the total acceptable response time across each component in the request path.
Here is a typical budget for a RAG-powered AI feature:
| Component | Target P50 | Target P99 |
|---|---|---|
| Query processing + embedding | 50ms | 150ms |
| Vector search + reranking | 100ms | 300ms |
| LLM inference (streaming, TTFT) | 500ms | 1,500ms |
| Post-processing + delivery | 50ms | 100ms |
| Total to first visible content | 700ms | 2,050ms |
Research on perceived responsiveness supports weighting the budget toward time-to-first-token: users perceive streaming interfaces as roughly 40% faster than buffered responses, even when total time is identical.
Strategy 1: Caching
Caching is the single most effective optimization for AI features. A cache hit eliminates the LLM call entirely, reducing response time from seconds to milliseconds.
Layer 1: Exact-Match Caching
Cache LLM responses keyed on the exact input (prompt hash). This works for deterministic queries where the same input always produces an acceptable output.
Good for: Classification, entity extraction, structured data generation, FAQ-style queries.
Not good for: Conversational AI, personalized responses, queries that depend on fresh data.
Implementation: Redis or Memcached with a TTL appropriate to your data freshness requirements. Hash the full prompt (system prompt + user message + retrieved context) as the cache key.
```python
import hashlib

import redis

r = redis.Redis()

def get_cached_response(prompt: str):
    key = hashlib.sha256(prompt.encode()).hexdigest()
    cached = r.get(key)
    if cached:
        return cached.decode()
    return None

def cache_response(prompt: str, response: str, ttl: int = 3600):
    key = hashlib.sha256(prompt.encode()).hexdigest()
    r.setex(key, ttl, response)
```
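The two helpers combine into a single check-then-store wrapper around the model call. A minimal in-memory sketch (a plain dict stands in for Redis here, and `llm_call` is a hypothetical stand-in for your provider SDK):

```python
import hashlib

# In-memory stand-in for the Redis helpers above (same keying scheme).
_cache: dict = {}

def cached_llm_call(prompt: str, llm_call) -> str:
    """Return a cached response if present; otherwise call the model and cache it."""
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key in _cache:
        return _cache[key]       # cache hit: milliseconds, no LLM call
    response = llm_call(prompt)  # cache miss: pay the 1-30s inference cost
    _cache[key] = response
    return response
```

The second identical query never reaches the model, which is the entire point: the savings come from the calls you do not make.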
Layer 2: Semantic Caching
Semantic caching stores responses alongside vector embeddings of the queries. When a new query arrives, it is embedded and compared against cached query embeddings. If a cached query is semantically similar (above a cosine similarity threshold), the cached response is returned.
Performance: Semantic caching adds 5-20ms for the vector similarity search but saves 1-5 seconds by skipping the LLM call. Production deployments report 40-60% cost reduction on applications with repetitive query patterns.
Critical parameter: Similarity threshold. Too high (0.98+) and you rarely hit the cache. Too low (below ~0.85) and you return irrelevant cached responses. Start at 0.92 (a distance threshold of 0.08 in tools that use cosine distance) and tune based on your query distribution.
```python
from redisvl.extensions.llmcache import SemanticCache

cache = SemanticCache(
    name="llm_cache",
    redis_url="redis://localhost:6379",
    distance_threshold=0.08,  # lower = stricter matching
)

# Check cache before calling the LLM
result = cache.check(prompt=user_query)
if result:
    return result[0]["response"]
```
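Under the hood, the check is just a cosine comparison: `distance_threshold=0.08` corresponds to a similarity threshold of 0.92, because cosine distance is one minus cosine similarity. A dependency-free sketch with a hypothetical `check_semantic_cache` helper (a linear scan for illustration, where a real store would use a vector index):

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

SIMILARITY_THRESHOLD = 0.92  # equivalent to distance_threshold = 0.08

def check_semantic_cache(query_emb, cache):
    """cache: list of (embedding, response). Return a response only if the
    closest cached query clears the similarity threshold."""
    best = max(cache, key=lambda e: cosine_similarity(query_emb, e[0]), default=None)
    if best is not None and cosine_similarity(query_emb, best[0]) >= SIMILARITY_THRESHOLD:
        return best[1]
    return None
```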
Layer 3: Prompt Caching (Provider-Level)
Anthropic, Google, and OpenAI all offer prompt caching: if the first N tokens of your prompt match a recent request, those tokens are served from cache at reduced cost and latency.
Best for: Systems where the system prompt and context are long but repeated across queries (same document set, same instructions, different user questions).
Impact: Up to 90% reduction in input token costs and 85% reduction in time-to-first-token for the cached prefix.
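With Anthropic's API, for example, you opt in by marking the long shared prefix with a `cache_control` block. A sketch of the request shape, not a definitive implementation; verify the field names against your provider's current API reference:

```python
def build_request(system_text: str, user_question: str) -> dict:
    """Messages API request that marks a long, reused system prompt as cacheable."""
    return {
        "model": "claude-sonnet-4-20250514",
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": system_text,                     # long prefix, shared across queries
                "cache_control": {"type": "ephemeral"},  # cache everything up to here
            }
        ],
        "messages": [{"role": "user", "content": user_question}],
    }
```

Only the user question varies between requests, so the expensive prefix is billed and processed at the cached rate.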
Strategy 2: Streaming
Streaming does not reduce total response time. It eliminates perceived wait time by showing results progressively.
Server-Sent Events (SSE)
SSE is the standard for LLM streaming. Every major LLM API (OpenAI, Anthropic, Google) supports streaming via SSE.
Architecture:
```
Client --> [Your API] --> [LLM API (streaming)]
  ^             |
  |             v
  +---- SSE connection <-- Token-by-token relay
```
Implementation (Node.js/Express):
```javascript
app.get('/api/chat', async (req, res) => {
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('Connection', 'keep-alive');

  const stream = await anthropic.messages.create({
    model: 'claude-sonnet-4-20250514',
    stream: true,
    messages: [{ role: 'user', content: req.query.q }],
    max_tokens: 1024,
  });

  for await (const event of stream) {
    if (event.type === 'content_block_delta') {
      res.write(`data: ${JSON.stringify({ text: event.delta.text })}\n\n`);
    }
  }

  res.write('data: [DONE]\n\n');
  res.end();
});
```
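The wire format itself is trivial and portable: each chunk is a `data:` line followed by a blank line. A Python version of the same framing, if your backend is not Node (`sse_event` is an illustrative helper, not a library function):

```python
import json

def sse_event(text: str) -> str:
    """Format one token delta as a Server-Sent Events data frame."""
    return f"data: {json.dumps({'text': text})}\n\n"

# Conventional sentinel frame signalling end of stream to the client.
DONE = "data: [DONE]\n\n"
```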
UI Patterns for Streaming
Token-by-token rendering: The classic "typewriter" effect. Text appears word by word. TTFT is the time to show the first word.
Block rendering: Wait until a complete sentence or paragraph is available, then render the entire block. Slightly higher perceived TTFT but avoids the visual noise of individual tokens appearing.
Progressive disclosure: Show a structured skeleton (headings, bullet points) first, then fill in the content. Works well for structured outputs like reports or summaries.
Streaming with citations: Stream the text first, then append citations as a batch after the response is complete. This avoids the jarring UX of citation numbers appearing before the referenced text is visible.
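Block rendering, for instance, reduces to buffering tokens until a sentence boundary appears. A minimal sketch, assuming sentences end in `.`, `!`, or `?`:

```python
import re

def sentence_blocks(token_stream):
    """Group a raw token stream into complete sentences for block rendering."""
    buffer = ""
    for token in token_stream:
        buffer += token
        # Emit everything up to the last sentence-ending punctuation mark.
        match = re.search(r"^(.*[.!?])\s*", buffer, re.DOTALL)
        if match:
            yield match.group(1)
            buffer = buffer[match.end():]
    if buffer.strip():
        yield buffer  # trailing partial sentence
```

The UI renders each yielded block at once instead of animating individual tokens.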
Strategy 3: Async Patterns
For AI tasks that take 10+ seconds (complex analysis, multi-step workflows, batch processing), synchronous request-response breaks down. Move to async.
Pattern 1: Background Job + Polling
Submit the AI task, return a job ID immediately, and poll for completion.
```
POST /api/analyze      --> 202 Accepted { "jobId": "abc123" }
GET  /api/jobs/abc123  --> 200 { "status": "processing", "progress": 45 }
GET  /api/jobs/abc123  --> 200 { "status": "complete", "result": {...} }
```
Best for: Report generation, document analysis, batch classification. Tasks where the user can do other things while waiting.
Implementation: Use a task queue (Celery, Bull, or AWS SQS + Lambda) to process AI jobs asynchronously. Store results in a database or object store.
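A minimal in-process sketch of the submit/poll contract, using `concurrent.futures` and an in-memory job table where production code would use a task queue and durable store (`submit_job` and `job_status` are illustrative names):

```python
import uuid
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=4)
jobs = {}  # job_id -> Future (use a durable store, not a dict, in production)

def submit_job(task, *args) -> str:
    """POST /api/analyze: start the task, return a job ID immediately."""
    job_id = uuid.uuid4().hex
    jobs[job_id] = executor.submit(task, *args)
    return job_id

def job_status(job_id: str) -> dict:
    """GET /api/jobs/{id}: report status without blocking on the task."""
    future = jobs[job_id]
    if future.done():
        return {"status": "complete", "result": future.result()}
    return {"status": "processing"}
```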
Pattern 2: WebSocket Progress Updates
For tasks where the user is actively waiting, use WebSockets to push progress updates in real-time.
```
Client connects via WebSocket
Server sends: { "stage": "Retrieving documents", "progress": 20 }
Server sends: { "stage": "Analyzing content",    "progress": 60 }
Server sends: { "stage": "Generating summary",   "progress": 90 }
Server sends: { "stage": "Complete", "result": {...} }
```
Best for: Multi-step AI workflows where the user cares about intermediate progress (e.g., "Searching 500 documents... Found 12 relevant results... Generating analysis...").
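On the server, this reduces to a workflow that pushes stage messages through a send callback. A sketch where `send` stands in for the WebSocket connection's send method and the three step functions are hypothetical:

```python
def run_workflow(send, retrieve, analyze, summarize):
    """Multi-step AI workflow that reports progress after each stage."""
    send({"stage": "Retrieving documents", "progress": 20})
    docs = retrieve()
    send({"stage": "Analyzing content", "progress": 60})
    analysis = analyze(docs)
    send({"stage": "Generating summary", "progress": 90})
    result = summarize(analysis)
    send({"stage": "Complete", "result": result})
```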
Pattern 3: Optimistic UI + Background Sync
Show a predicted or placeholder result immediately, then replace it with the actual AI result when available.
Best for: AI suggestions, auto-complete, smart defaults. Cases where a "close enough" initial response is better than a loading spinner.
Example: In an email composer, immediately show a draft subject line based on a template, then silently replace it with an AI-generated subject line when the model responds. If the user has already edited the subject line, do not overwrite.
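The overwrite guard in that example is a one-line check: the AI result replaces the field only if the user has not touched it. A sketch (`resolve_subject` is an illustrative name):

```python
def resolve_subject(placeholder: str, current_value: str, ai_suggestion: str) -> str:
    """Swap in the AI suggestion only if the user hasn't edited the placeholder."""
    if current_value != placeholder:
        return current_value  # the user's edit always wins
    return ai_suggestion
```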
Pattern 4: Pre-computation
Run AI inference before the user needs it. When a new document is uploaded, immediately generate a summary, extract entities, and classify the document. When the user opens the document, the AI results are already waiting.
Best for: Document management systems, content platforms, any context where you can predict what AI features the user will need.
Tradeoff: Higher compute cost (you process documents users may never look at) in exchange for zero perceived latency.
Combining Strategies: The Layered Approach
The best AI features use all three strategies in combination:
- Check the exact-match cache (~5ms). If hit, return immediately.
- On a miss, check the semantic cache (5-20ms). The hash lookup runs first because it is cheaper and needs no embedding call; an exact match is also a perfect semantic match, so the reverse order makes the exact check dead code.
- If both caches miss, start streaming the LLM response via SSE (~500ms to first token).
- For complex queries, fall back to async with progress updates.
- Pre-compute common results during off-peak hours.

```
User Query
    |
    v
[Exact Cache] ------hit------> Return cached response (5ms)
    |
   miss
    |
    v
[Semantic Cache] ----hit----> Return cached response (15ms)
    |
   miss
    |
    v
[Stream LLM Response] --> TTFT ~500ms, full response 2-8s
    |
    v
[Cache the response] --> Available for the next similar query
```
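The whole flow fits in one dispatcher. A sketch with the lookups, streamer, and store passed in as callables (all names illustrative); the cheap exact-hash lookup runs before the embedding-based one:

```python
def layered_answer(query, exact_lookup, semantic_lookup, stream_llm, store):
    """Layered flow: exact cache, then semantic cache, then streamed LLM call."""
    hit = exact_lookup(query)            # ~5ms hash lookup
    if hit is not None:
        return hit
    hit = semantic_lookup(query)         # ~5-20ms embedding lookup
    if hit is not None:
        return hit
    tokens = stream_llm(query)           # stream tokens to the user as they arrive
    response = "".join(tokens)           # joined here only so we can cache it
    store(query, response)               # available for the next similar query
    return response
```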
Measuring What Matters
Track these latency metrics in production:
| Metric | What It Tells You | Target |
|---|---|---|
| TTFT (Time to First Token) | Perceived responsiveness | < 1 second |
| Total response time | End-to-end latency | < 10 seconds |
| Cache hit rate | Caching effectiveness | > 30% |
| P99 latency | Worst-case user experience | < 3x P50 |
| Timeout rate | Infrastructure reliability | < 0.1% |
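The P50 and P99 columns fall out of recorded samples directly, e.g. with the standard library's `statistics.quantiles`:

```python
import statistics

def latency_report(ttft_samples_ms):
    """Compute P50/P99 and their ratio from recorded TTFT samples (ms)."""
    qs = statistics.quantiles(ttft_samples_ms, n=100)
    p50, p99 = qs[49], qs[98]
    return {"p50": p50, "p99": p99, "p99_over_p50": p99 / p50}
```

Alert when `p99_over_p50` drifts above 3: the median can look healthy while the tail quietly degrades.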
The Bottom Line
Users do not care how long the model takes. They care how long they wait. Caching eliminates the wait entirely for repeated queries. Streaming turns a wait into a progressive reveal. Async patterns convert blocking waits into background processes.
The AI model is slow. Your feature does not have to be.
