
Latency Budgeting for AI Features: Caching, Streaming, and Async Patterns

Clarvia Team
Mar 22, 2026
12 min read

An LLM API call takes between 1 and 30 seconds. A user's patience runs out in about 3. This is the fundamental tension in AI product engineering, and the reason most AI features feel slow compared to the traditional features they sit next to.

You cannot make the model faster. What you can do is build architecture patterns that hide, reduce, and redistribute latency so the user never feels the gap. This article covers three approaches -- caching, streaming, and async patterns -- with specific implementations, latency numbers, and decision frameworks.


The Latency Budget

Before optimizing anything, define your latency budget. A latency budget allocates the total acceptable response time across each component in the request path.

Here is a typical budget for a RAG-powered AI feature:

Component                       | Target P50 | Target P99
Query processing + embedding    | 50ms       | 150ms
Vector search + reranking       | 100ms      | 300ms
LLM inference (streaming, TTFT) | 500ms      | 1,500ms
Post-processing + delivery      | 50ms       | 100ms
Total to first visible content  | 700ms      | 2,050ms
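A budget like this can double as a runtime check. Here is a minimal sketch; the component names mirror the table, and `over_budget` is an illustrative helper, not a library API:

```python
# Minimal latency-budget checker: compare measured latencies against P99 targets.
BUDGET_P99_MS = {
    "embedding": 150,
    "vector_search": 300,
    "llm_ttft": 1500,
    "post_processing": 100,
}

def over_budget(measured_ms: dict) -> dict:
    """Return components whose measured latency exceeds the P99 budget,
    mapped to (measured, budget) pairs."""
    return {
        name: (ms, BUDGET_P99_MS[name])
        for name, ms in measured_ms.items()
        if ms > BUDGET_P99_MS.get(name, float("inf"))
    }

violations = over_budget({"embedding": 90, "vector_search": 420, "llm_ttft": 1200})
# vector_search exceeds its 300ms target
```

Wiring a check like this into your tracing pipeline turns the budget from a design document into an alert.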
The critical metric is Time to First Token (TTFT) -- how long until the user sees the first character of the response. With streaming, TTFT is independent of total response time. A response that takes 8 seconds total but shows the first word in 600ms feels faster than a 4-second response that appears all at once.

Research on perceived responsiveness confirms this: users perceive streaming interfaces as 40% faster than buffered responses, even when total time is identical.


Strategy 1: Caching

Caching is the single most effective optimization for AI features. A cache hit eliminates the LLM call entirely, reducing response time from seconds to milliseconds.

Layer 1: Exact-Match Caching

Cache LLM responses keyed on the exact input (prompt hash). This works for deterministic queries where the same input always produces an acceptable output.

Good for: Classification, entity extraction, structured data generation, FAQ-style queries.

Not good for: Conversational AI, personalized responses, queries that depend on fresh data.

Implementation: Redis or Memcached with a TTL appropriate to your data freshness requirements. Hash the full prompt (system prompt + user message + retrieved context) as the cache key.

import hashlib
import redis

r = redis.Redis()

def _prompt_key(prompt: str) -> str:
    # Hash the full prompt (system prompt + user message + retrieved context)
    return hashlib.sha256(prompt.encode()).hexdigest()

def get_cached_response(prompt: str):
    cached = r.get(_prompt_key(prompt))
    return cached.decode() if cached else None

def cache_response(prompt: str, response: str, ttl: int = 3600):
    # setex stores the value with a TTL matched to your freshness window
    r.setex(_prompt_key(prompt), ttl, response)

Layer 2: Semantic Caching

Semantic caching stores responses alongside vector embeddings of the queries. When a new query arrives, it is embedded and compared against cached query embeddings. If a cached query is semantically similar (above a cosine similarity threshold), the cached response is returned.

Performance: Semantic caching adds 5-20ms for the vector similarity search but saves 1-5 seconds by skipping the LLM call. Production deployments report 40-60% cost reduction on applications with repetitive query patterns.
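The mechanism can be sketched without any framework. This is an illustrative in-memory version (the `embed` function is a stand-in for a real embedding model); the tools below do the same thing with proper vector indexes:

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

class TinySemanticCache:
    """Illustrative semantic cache: store (embedding, response) pairs and
    return a cached response when a new query is similar enough."""

    def __init__(self, embed, threshold=0.92):
        self.embed = embed        # embedding function (stand-in for a real model)
        self.threshold = threshold
        self.entries = []         # list of (embedding, response) pairs

    def check(self, query):
        q = self.embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best and cosine(q, best[0]) >= self.threshold:
            return best[1]        # semantic hit: skip the LLM call
        return None

    def store(self, query, response):
        self.entries.append((self.embed(query), response))
```

A real deployment replaces the linear scan with an approximate-nearest-neighbor index, which is exactly what the tools below provide.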

Tools:

  • Redis with RedisVL -- Built-in semantic cache interface with vector search
  • GPTCache -- Open-source library from Zilliz. Supports Milvus, Faiss, Redis, and Qdrant as vector backends. Production benchmarks show up to a 68.8% cache hit rate with positive hit accuracy exceeding 97%.
  • LangChain CacheBackedEmbeddings -- Caches embeddings to avoid recomputation

Critical parameter: the similarity threshold. Too high (0.98+) and you rarely hit the cache; too low (below ~0.85) and you return irrelevant cached responses. Start at 0.92 and tune based on your query distribution.

from redisvl.extensions.llmcache import SemanticCache

cache = SemanticCache(
    name="llm_cache",
    redis_url="redis://localhost:6379",
    distance_threshold=0.08,  # Lower = stricter matching
)

# Check the cache before calling the LLM
result = cache.check(prompt=user_query)
if result:
    response = result[0]["response"]  # cache hit: skip the LLM call

Layer 3: Prompt Caching (Provider-Level)

Anthropic, Google, and OpenAI all offer prompt caching: if the first N tokens of your prompt match a recent request, those tokens are served from cache at reduced cost and latency.

Best for: Systems where the system prompt and context are long but repeated across queries (same document set, same instructions, different user questions).

Impact: Up to 90% reduction in input token costs and 85% reduction in time-to-first-token for the cached prefix.
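As one concrete shape, Anthropic's Messages API marks the cacheable prefix with a `cache_control` field on a system content block. The sketch below only builds the request payload (sending it requires the SDK and an API key); other providers use different mechanisms, so check your provider's docs:

```python
# Sketch of a prompt-caching request in the Anthropic style: the long, stable
# system prompt is marked with cache_control so the provider can reuse it
# across requests that share the same prefix.
LONG_SYSTEM_PROMPT = "You are a contract-analysis assistant. ..."  # repeated across queries

def build_request(user_question: str) -> dict:
    return {
        "model": "claude-sonnet-4-20250514",
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": LONG_SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},  # cache this prefix
            }
        ],
        # Only the short user question varies between requests
        "messages": [{"role": "user", "content": user_question}],
    }
```

The key design point: keep the stable content (instructions, document context) first and the variable content (the user question) last, so the cached prefix is as long as possible.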


Strategy 2: Streaming

Streaming does not reduce total response time. It eliminates perceived wait time by showing results progressively.

Server-Sent Events (SSE)

SSE is the standard for LLM streaming. Every major LLM API (OpenAI, Anthropic, Google) supports streaming via SSE.

Architecture:

Client --> [Your API] --> [LLM API (streaming)]
   ^                            |
   |                            v
   +--- SSE connection <-- Token-by-token relay
    

Implementation (Node.js/Express):

app.get('/api/chat', async (req, res) => {
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('Connection', 'keep-alive');

  const stream = await anthropic.messages.create({
    model: 'claude-sonnet-4-20250514',
    stream: true,
    messages: [{ role: 'user', content: req.query.q }],
    max_tokens: 1024,
  });

  for await (const event of stream) {
    if (event.type === 'content_block_delta') {
      // Template literal preserves the SSE "data: ..." framing
      res.write(`data: ${JSON.stringify({ text: event.delta.text })}\n\n`);
    }
  }

  res.write('data: [DONE]\n\n');
  res.end();
});

UI Patterns for Streaming

Token-by-token rendering: The classic "typewriter" effect. Text appears word by word. TTFT is the time to show the first word.

Block rendering: Wait until a complete sentence or paragraph is available, then render the entire block. Slightly higher perceived TTFT but avoids the visual noise of individual tokens appearing.

Progressive disclosure: Show a structured skeleton (headings, bullet points) first, then fill in the content. Works well for structured outputs like reports or summaries.

Streaming with citations: Stream the text first, then append citations as a batch after the response is complete. This avoids the jarring UX of citation numbers appearing before the referenced text is visible.
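The block-rendering pattern above amounts to a sentence buffer between the token stream and the UI. A minimal sketch (the regex-based sentence split is deliberately naive):

```python
import re

def sentence_blocks(token_stream):
    """Buffer streamed tokens and yield complete sentences, trading a
    little TTFT for calmer rendering (the block-rendering pattern)."""
    buffer = ""
    for token in token_stream:
        buffer += token
        # Emit every complete sentence currently in the buffer:
        # sentence-ending punctuation followed by whitespace.
        while True:
            m = re.search(r"[.!?]\s+", buffer)
            if not m:
                break
            yield buffer[: m.end()].strip()
            buffer = buffer[m.end():]
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains at end of stream
```

The same structure works for paragraph-level blocks; only the boundary regex changes.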


Strategy 3: Async Patterns

For AI tasks that take 10+ seconds (complex analysis, multi-step workflows, batch processing), synchronous request-response breaks down. Move to async.

Pattern 1: Background Job + Polling

Submit the AI task, return a job ID immediately, and poll for completion.

POST /api/analyze --> 202 Accepted { "jobId": "abc123" }

GET /api/jobs/abc123 --> 200 { "status": "processing", "progress": 45 }
GET /api/jobs/abc123 --> 200 { "status": "complete", "result": {...} }

Best for: Report generation, document analysis, batch classification. Tasks where the user can do other things while waiting.

Implementation: Use a task queue (Celery, Bull, or AWS SQS + Lambda) to process AI jobs asynchronously. Store results in a database or object store.
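On the client side, the polling loop is worth getting right: back off between attempts so you are not hammering the status endpoint. A sketch, where `fetch_status` stands in for a GET against the jobs endpoint:

```python
import time

def poll_job(fetch_status, interval=0.5, backoff=1.5, max_wait=60.0):
    """Poll a job-status endpoint until it reports completion or failure.
    `fetch_status` is a callable standing in for GET /api/jobs/{id}."""
    waited = 0.0
    while waited < max_wait:
        job = fetch_status()
        if job["status"] in ("complete", "failed"):
            return job
        time.sleep(interval)
        waited += interval
        interval *= backoff  # exponential backoff keeps polling cheap
    raise TimeoutError("job did not finish within max_wait")
```

In production you would also honor a Retry-After header if the API provides one, and surface the `progress` field to the UI between polls.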

Pattern 2: WebSocket Progress Updates

For tasks where the user is actively waiting, use WebSockets to push progress updates in real time.

Client connects via WebSocket
Server sends: { "stage": "Retrieving documents", "progress": 20 }
Server sends: { "stage": "Analyzing content", "progress": 60 }
Server sends: { "stage": "Generating summary", "progress": 90 }
Server sends: { "stage": "Complete", "result": {...} }

Best for: Multi-step AI workflows where the user cares about intermediate progress (e.g., "Searching 500 documents... Found 12 relevant results... Generating analysis...").

Pattern 3: Optimistic UI + Background Sync

Show a predicted or placeholder result immediately, then replace it with the actual AI result when available.

Best for: AI suggestions, auto-complete, smart defaults. Cases where a "close enough" initial response is better than a loading spinner.

Example: In an email composer, immediately show a draft subject line based on a template, then silently replace it with an AI-generated subject line when the model responds. If the user has already edited the subject line, do not overwrite.
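The do-not-overwrite rule is small but easy to get wrong. A sketch of the merge logic (the `field` dict shape is illustrative):

```python
def apply_ai_suggestion(field: dict, ai_value: str) -> dict:
    """Replace the template placeholder with the AI result only if the
    user has not edited the field in the meantime."""
    if not field["user_edited"]:
        field["value"] = ai_value
    return field
```

The `user_edited` flag should be set by the UI on the first keystroke in the field, not inferred by diffing values, so that a user who retypes the placeholder verbatim is still respected.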

Pattern 4: Pre-computation

Run AI inference before the user needs it. When a new document is uploaded, immediately generate a summary, extract entities, and classify the document. When the user opens the document, the AI results are already waiting.

Best for: Document management systems, content platforms, any context where you can predict what AI features the user will need.

Tradeoff: Higher compute cost (you process documents users may never look at) in exchange for zero perceived latency.


Combining Strategies: The Layered Approach

The best AI features use all three strategies in combination:

1. Check the semantic cache (5-20ms). If hit, return immediately.
2. If cache miss, start streaming the LLM response via SSE (~500ms to first token).
3. For complex queries, fall back to async with progress updates.
4. Pre-compute common results during off-peak hours.

User Query
    |
    v
[Semantic Cache] --hit--> Return cached response (15ms)
    |
    miss
    |
    v
[Exact Cache] --hit--> Return cached response (5ms)
    |
    miss
    |
    v
[Stream LLM Response] --> TTFT ~500ms, full response 2-8s
    |
    v
[Cache the response] --> Available for next similar query

Measuring What Matters

Track these latency metrics in production:

Metric                     | What It Tells You           | Target
TTFT (Time to First Token) | Perceived responsiveness    | < 1 second
Total response time        | End-to-end latency          | < 10 seconds
Cache hit rate             | Caching effectiveness       | > 30%
P99 latency                | Worst-case user experience  | < 3x P50
Timeout rate               | Infrastructure reliability  | < 0.1%

If your TTFT exceeds 2 seconds, users will perceive the feature as slow regardless of total quality. If your cache hit rate is below 20%, your caching strategy needs tuning. If your P99 is more than 5x your P50, you have a tail latency problem that streaming alone cannot fix.
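Computing these from raw latency samples is a one-liner with the standard library. A sketch (`latency_summary` is an illustrative helper):

```python
import statistics

def latency_summary(samples_ms):
    """P50, P99, and the tail ratio discussed above from raw samples."""
    # quantiles(n=100) returns the 99 percentile cut points
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    p50, p99 = cuts[49], cuts[98]
    return {"p50": p50, "p99": p99, "tail_ratio": p99 / p50}
```

In a real service you would compute these over a sliding window per endpoint, and alert when `tail_ratio` crosses the 5x threshold.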

The Bottom Line

Users do not care how long the model takes. They care how long they wait. Caching eliminates the wait entirely for repeated queries. Streaming turns a wait into a progressive reveal. Async patterns convert blocking waits into background processes.

The AI model is slow. Your feature does not have to be.

