MLOps & AI Infrastructure
Model serving, containerization, Kubernetes for ML, CI/CD, and cost optimization.
MLOps — machine learning operations — is the discipline of deploying, managing, and scaling AI models in production. It bridges the gap between a working prototype and a reliable, cost-effective system serving millions of requests. This module covers model serving infrastructure, containerization, orchestration with Kubernetes, CI/CD pipelines for AI, and the cost optimization strategies that make production AI financially viable.
Model Serving Infrastructure
Model serving is the process of making a trained model available to handle inference requests. For large language models, serving is particularly challenging because of the sheer computational requirements — a single LLM can require tens of gigabytes of GPU memory and generate tokens sequentially, making throughput optimization critical.
vLLM — High-Throughput LLM Serving
vLLM is an open-source library designed specifically for fast LLM inference and serving. Originally developed at UC Berkeley, it has become the de facto standard for self-hosted LLM serving due to its breakthrough throughput improvements.
- PagedAttention: vLLM's core innovation. It manages the key-value (KV) cache — the memory that stores attention state during generation — using virtual memory paging techniques borrowed from operating systems. This eliminates memory waste from fragmentation, enabling 2–4x higher throughput compared to naive serving.
- Continuous batching: Instead of waiting for an entire batch to finish before starting the next, vLLM dynamically adds and removes requests from the batch as they complete. This maximizes GPU utilization and reduces individual request latency.
- Speculative decoding: Uses a smaller, faster draft model to predict multiple tokens ahead, then verifies them with the main model in a single forward pass. This can double generation speed for many workloads.
- OpenAI-compatible API: vLLM exposes an API that matches the OpenAI specification, making it a drop-in replacement for OpenAI-hosted models in existing applications.
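To make the drop-in compatibility concrete, here is a minimal sketch of the request body a client would POST to a vLLM server's `/v1/chat/completions` endpoint. The helper function, model name, and parameter values are illustrative placeholders, not part of vLLM itself:

```python
import json

# Hypothetical helper: builds an OpenAI-style chat-completions payload.
# vLLM serves this same schema, so existing OpenAI client code can point
# at a self-hosted endpoint (e.g., http://localhost:8000/v1) unchanged.
def build_chat_request(model: str, user_message: str,
                       temperature: float = 0.2, max_tokens: int = 256) -> dict:
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }

payload = build_chat_request("meta-llama/Llama-3.1-8B-Instruct", "Ping?")
body = json.dumps(payload)  # this JSON body is what gets POSTed to the server
```

Because the schema matches, swapping between a hosted provider and a self-hosted vLLM deployment is usually just a base-URL change in the client configuration.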
Text Generation Inference (TGI) by Hugging Face
TGI is Hugging Face's production-ready inference server for LLMs. It's tightly integrated with the Hugging Face ecosystem, making it the natural choice if you're already using Hugging Face models and libraries.
- Flash Attention 2: Implements the Flash Attention algorithm for memory-efficient attention computation, supporting longer context windows with less GPU memory.
- Tensor parallelism: Splits model weights across multiple GPUs, enabling you to serve models too large for a single GPU.
- Quantization support: Built-in support for GPTQ, AWQ, and GGUF quantized models, reducing memory requirements by 50–75% with minimal quality loss.
- Token streaming: Native server-sent events (SSE) for real-time token streaming to clients.
NVIDIA Triton Inference Server
Triton is NVIDIA's enterprise-grade inference server designed for multi-model, multi-framework deployments. It excels in environments running diverse model types — not just LLMs but also vision models, embedding models, and traditional ML models.
- Multi-framework support: Serves models from PyTorch, TensorFlow, TensorRT, ONNX, and custom Python backends — all from a single server instance.
- Dynamic batching: Automatically batches incoming requests to maximize GPU throughput, with configurable maximum batch size and wait times.
- Model ensembles: Chain multiple models together into pipelines (e.g., tokenizer → LLM → post-processor) with zero-copy data passing between stages.
- Model versioning: Serve multiple versions of a model simultaneously and route traffic between them — essential for A/B testing and gradual rollouts.
| Feature | vLLM | TGI | Triton |
|---|---|---|---|
| Primary focus | LLM throughput | Hugging Face integration | Multi-model enterprise |
| Best for | Max tokens/second on LLMs | HF model ecosystem | Mixed model workloads |
| Quantization | AWQ, GPTQ, FP8, INT8 | GPTQ, AWQ, GGUF | TensorRT quantization |
| Multi-GPU | Tensor + pipeline parallel | Tensor parallelism | Full MIG support |
| API compatibility | OpenAI-compatible | HF Messages API | gRPC + HTTP/REST |
| Setup complexity | Low | Low | Medium–High |
Containerization with Docker for ML
Docker containers package your model, its dependencies, and the serving runtime into a single reproducible unit. This eliminates the "it works on my machine" problem and ensures consistent behavior across development, staging, and production environments.
Key Practices for ML Containers
- Use NVIDIA base images: Start from `nvidia/cuda` or `nvcr.io/nvidia/pytorch` images that include CUDA drivers and GPU libraries pre-configured. Building CUDA from scratch is error-prone and wastes hours.
- Layer caching strategy: Place rarely-changing layers (OS packages, CUDA libraries) early in the Dockerfile and frequently-changing layers (model code, configs) last. This dramatically speeds up rebuild times.
- Separate model weights from code: Don't bake multi-gigabyte model weights into the container image. Instead, mount them as volumes or download them at startup from a model registry (Hugging Face Hub, S3, GCS). This keeps images small and deployable.
- Health checks: Include health and readiness endpoints so orchestrators know when the container is ready to serve traffic. Model loading can take minutes — without readiness probes, requests will hit a container that hasn't finished loading.
Example Dockerfile for vLLM serving:
```dockerfile
FROM vllm/vllm-openai:latest

# Set environment variables
ENV MODEL_NAME="meta-llama/Llama-3.1-70B-Instruct"
ENV TENSOR_PARALLEL_SIZE=4
ENV MAX_MODEL_LEN=8192
ENV GPU_MEMORY_UTILIZATION=0.90

# Copy custom configuration
COPY serving_config.yaml /app/config.yaml

# Expose the serving port
EXPOSE 8000

# Health check endpoint
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
  CMD curl -f http://localhost:8000/health || exit 1

# Start the server. Shell form is used (not the JSON exec form) so that
# the ENV variables above are actually expanded at runtime.
CMD python -m vllm.entrypoints.openai.api_server \
    --model "$MODEL_NAME" \
    --tensor-parallel-size "$TENSOR_PARALLEL_SIZE" \
    --max-model-len "$MAX_MODEL_LEN" \
    --gpu-memory-utilization "$GPU_MEMORY_UTILIZATION"
```
Kubernetes for ML Workloads
Kubernetes orchestrates containerized ML workloads across clusters of machines, handling scaling, failover, and resource allocation. For ML specifically, Kubernetes introduces unique challenges around GPU scheduling, long startup times, and resource-intensive workloads.
GPU Scheduling and Resource Management
- NVIDIA device plugin: Install the NVIDIA device plugin for Kubernetes to expose GPUs as schedulable resources. Pods request GPUs via `nvidia.com/gpu: 1` in their resource spec.
- Node pools: Create separate node pools for GPU and CPU workloads. GPU nodes are expensive — don't run non-GPU workloads on them. Use taints and tolerations to enforce this separation.
- Multi-Instance GPU (MIG): NVIDIA A100 and H100 GPUs support MIG, which partitions a single GPU into multiple isolated instances. This lets you serve smaller models on fractions of a GPU, improving utilization and reducing cost.
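The utilization argument behind MIG and fractional-GPU serving comes down to simple capacity arithmetic. The sketch below estimates how many model replicas fit on one GPU; the memory figures are illustrative examples, not vendor specifications:

```python
# Rough capacity check: how many serving replicas fit on one GPU,
# reserving a fraction of memory as headroom (utilization < 1.0).
def replicas_per_gpu(gpu_memory_gb: float, per_replica_gb: float,
                     utilization: float = 0.9) -> int:
    usable = gpu_memory_gb * utilization
    return int(usable // per_replica_gb)

# Example: an 80 GB GPU, ~18 GB per replica (weights + KV-cache headroom)
print(replicas_per_gpu(80, 18))  # 4
```

If each replica needs only a fraction of the card, carving the GPU into MIG instances (or packing multiple replicas per node) recovers capacity that would otherwise sit idle.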
Scaling Strategies
- Horizontal Pod Autoscaler (HPA): Scale the number of serving pods based on custom metrics like requests per second, queue depth, or GPU utilization. Standard CPU-based autoscaling doesn't work well for GPU workloads.
- KEDA (Kubernetes Event-Driven Autoscaling): Scale based on external event sources — queue length, HTTP request rate, or custom Prometheus metrics. More flexible than HPA for AI workloads.
- Scale-to-zero: For infrequently used models, scale down to zero replicas when idle and spin up on demand. Tools like Knative or KEDA support this pattern, though cold start times for large models (30–120 seconds) must be acceptable for your use case.
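The HPA control loop scales replicas in proportion to how far the observed metric is from its target: `desired = ceil(current * currentMetric / targetMetric)`. A minimal sketch of that formula, with min/max clamping added for completeness:

```python
import math

# Kubernetes HPA scaling formula with replica-count clamping.
def desired_replicas(current: int, current_metric: float, target_metric: float,
                     min_replicas: int = 1, max_replicas: int = 20) -> int:
    desired = math.ceil(current * (current_metric / target_metric))
    return max(min_replicas, min(max_replicas, desired))

# 4 pods each seeing 30 req/s against a 20 req/s target -> scale to 6
print(desired_replicas(4, 30, 20))  # 6
```

Setting `min_replicas=0` models the scale-to-zero pattern: when the driving metric drops to zero, the deployment drains entirely, trading idle cost for cold-start latency on the next request.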
CI/CD for AI Applications
Traditional CI/CD pipelines test code. AI applications need to test code and model behavior — prompts, outputs, latency, cost, and quality. This requires new tools and practices beyond standard software engineering CI/CD.
Testing Prompts
- Golden dataset testing: Maintain a curated set of input-output pairs that represent expected behavior. Run these against every prompt change. If accuracy on the golden dataset drops below your threshold, the change is rejected.
- Regression suites: Every time you fix a prompt bug, add that case to your regression suite. This prevents the common problem of fixing one failure mode while breaking another.
- LLM-as-judge: Use a powerful frontier model (e.g., Claude Opus) to evaluate the outputs of your production model. Define rubrics for correctness, helpfulness, safety, and format compliance. This scales evaluation far beyond what manual review can handle.
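A golden-dataset gate like the one described above reduces to a few lines: score the candidate prompt's outputs against curated expectations and fail the build below a threshold. Everything here (function name, exact-match scoring, the 95% threshold) is an illustrative simplification; real suites often use fuzzier matching or judge scores:

```python
# Minimal golden-dataset gate: compare predictions to curated expectations
# and pass only if accuracy meets the configured threshold.
def golden_gate(predictions, expectations, threshold=0.95):
    correct = sum(p == e for p, e in zip(predictions, expectations))
    accuracy = correct / len(expectations)
    return accuracy >= threshold, accuracy

ok, acc = golden_gate(["4", "Paris"], ["4", "Paris"])
print(ok, acc)  # True 1.0
```

In CI, the boolean result becomes the pass/fail signal for the pipeline stage, and the accuracy value is logged so reviewers can see how close a failing change came to the bar.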
Model Versioning
- Version everything: Track model weights, prompt templates, system prompts, hyperparameters, and configuration as versioned artifacts. Use tools like DVC (Data Version Control) or MLflow for model artifacts and Git for prompts and configs.
- Canary deployments: Roll out model changes to a small percentage of traffic first (1–5%). Monitor quality metrics, latency, and error rates. Only promote to full traffic after the canary period confirms no regressions.
- Rollback strategy: Keep previous model versions deployed and ready to receive traffic. If a new version degrades quality, route traffic back to the previous version in seconds, not minutes.
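Canary routing is often implemented as sticky, hash-based bucketing so the same caller consistently lands on the same version during the canary window. A small sketch (the bucket count and function name are arbitrary choices for illustration):

```python
import hashlib

# Sticky canary routing: hash the request/user id into 10,000 buckets so
# a given caller deterministically lands on one version for the whole canary.
def route(request_id: str, canary_fraction: float = 0.05) -> str:
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 10_000
    return "canary" if bucket < canary_fraction * 10_000 else "stable"

# Deterministic: retries from the same user stay on the same version
print(route("user-42") == route("user-42"))  # True
```

Rollback then becomes a configuration change: setting `canary_fraction` to 0 instantly routes all traffic back to the stable version, with no redeploy.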
AI-aware CI/CD pipeline stages:
```
1. Code & Prompt Linting
   └─ Static analysis, prompt format validation, schema checks
2. Unit Tests
   └─ Code logic, data transformations, API contracts
3. Integration Tests (with mocked LLM)
   └─ End-to-end flow testing with deterministic mock responses
4. Eval Suite (with real LLM)
   └─ Golden dataset tests, regression checks, LLM-as-judge scoring
   └─ Gate: accuracy >= 95%, no regressions on golden set
5. Cost & Latency Checks
   └─ Estimate per-request cost, verify p95 latency within budget
6. Canary Deployment (1-5% traffic)
   └─ Monitor quality metrics for 1-4 hours
7. Progressive Rollout (25% → 50% → 100%)
   └─ Automated rollback if error rate exceeds threshold
```
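The cost and latency stage can be sketched as a simple gate function: compute p95 from a latency sample and compare both numbers against budgets. The budget values below are placeholders, not recommendations:

```python
import math

# p95 via the nearest-rank method on a sorted sample.
def p95(latencies_ms):
    ordered = sorted(latencies_ms)
    idx = math.ceil(0.95 * len(ordered)) - 1
    return ordered[idx]

# Pass only if both the latency and the per-request cost are within budget.
def cost_latency_gate(latencies_ms, cost_per_request,
                      latency_budget_ms=2000, cost_budget=0.02):
    return p95(latencies_ms) <= latency_budget_ms and cost_per_request <= cost_budget

sample = list(range(100, 2100, 20))  # 100 simulated latencies: 100..2080 ms
print(p95(sample))  # 1980
```

Wiring this into CI makes cost regressions a blocking failure, the same way a broken unit test would be, rather than a surprise on the next cloud bill.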
Cost Optimization Strategies
GPU compute is the dominant cost in AI infrastructure. A single NVIDIA H100 GPU costs $2–4 per hour in the cloud, and serving large models often requires 4–8 GPUs. Without deliberate cost optimization, AI infrastructure bills can grow to six or seven figures monthly.
Spot/Preemptible Instances
Cloud providers offer unused GPU capacity at 60–90% discounts through spot instances (AWS), preemptible VMs (GCP), or spot VMs (Azure). The trade-off: these instances can be reclaimed with short notice (typically 30 seconds to 2 minutes).
- Good for: Batch inference, eval suites, fine-tuning jobs, and non-latency-sensitive workloads
- Risky for: Real-time serving (unless you have enough on-demand capacity as fallback)
- Strategy: Run a base layer of on-demand instances for guaranteed capacity, then burst with spot instances during peak demand
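The base-plus-burst strategy is easy to reason about as blended-cost arithmetic. The rates and discount below are example numbers from the ranges quoted above, not current cloud prices:

```python
# Illustrative blended hourly cost: a guaranteed on-demand base layer
# plus spot burst capacity at a discount off the on-demand rate.
def blended_hourly_cost(on_demand_gpus: int, spot_gpus: int,
                        on_demand_rate: float = 3.0,
                        spot_discount: float = 0.70) -> float:
    spot_rate = on_demand_rate * (1 - spot_discount)
    return on_demand_gpus * on_demand_rate + spot_gpus * spot_rate

# 4 guaranteed GPUs plus 8 spot GPUs at a 70% discount: ~$19.20/hour,
# versus $36/hour if all 12 were on-demand.
print(blended_hourly_cost(4, 8))
```

The gap between the blended rate and the all-on-demand rate is the payoff for tolerating preemption on the burst layer.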
Model Quantization
Quantization reduces model precision from 16-bit floating point (FP16) to lower bit-widths (INT8, INT4), shrinking memory footprint and increasing throughput with modest quality trade-offs.
| Precision | Memory Reduction | Quality Impact | Best Method |
|---|---|---|---|
| FP16 (baseline) | — | Full quality | Default for most models |
| FP8 | ~50% | Negligible loss | Native on H100/H200 GPUs |
| INT8 | ~50% | Minimal loss (<1%) | GPTQ, SmoothQuant |
| INT4 | ~75% | Noticeable on complex tasks | AWQ, GPTQ-4bit |
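The memory-reduction column in the table follows directly from bytes-per-parameter arithmetic. A back-of-envelope estimator (weights only; real deployments also need KV cache and activation memory on top):

```python
# Bytes per parameter at each precision level.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}

# Weight memory in GB, approximating 1 GB as 1e9 bytes for round numbers.
def weight_memory_gb(params_billions: float, precision: str) -> float:
    return params_billions * BYTES_PER_PARAM[precision]

# A 70B-parameter model: 140 GB at FP16 vs 35 GB at INT4
print(weight_memory_gb(70, "fp16"), weight_memory_gb(70, "int4"))  # 140.0 35.0
```

This is why a 70B model that needs multiple GPUs at FP16 can fit on a single 40-80 GB card at INT4, provided the quality trade-off is acceptable for the task.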
Request Batching
Batching multiple inference requests together dramatically improves GPU utilization. Instead of processing one request at a time (where the GPU sits idle during memory transfers), batching fills the GPU's compute capacity with parallel work.
- Static batching: Collect requests until a batch is full or a timeout expires, then process them together. Simple but adds latency for the first requests in the batch.
- Continuous batching: vLLM and TGI support this — requests are dynamically added to and removed from the active batch as they arrive and complete. This minimizes both latency and idle time.
- Impact: Continuous batching can improve throughput by 10–20x compared to single-request processing, making the economics of self-hosted models viable at scale.
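To make the static-batching trade-off concrete, here is a toy batcher that flushes when the batch is full or when the oldest request has waited past a deadline. It is a teaching sketch, not production code; time is passed in explicitly so the logic is easy to test:

```python
# Toy static batcher: flush on "batch full" or "oldest request timed out".
class StaticBatcher:
    def __init__(self, max_batch: int = 8, max_wait_s: float = 0.05):
        self.max_batch = max_batch
        self.max_wait_s = max_wait_s
        self.pending = []
        self.first_arrival = None

    def add(self, request, now: float):
        """Enqueue a request; return a full batch if the size limit is hit."""
        if self.first_arrival is None:
            self.first_arrival = now
        self.pending.append(request)
        if len(self.pending) >= self.max_batch:
            return self._flush()
        return None

    def maybe_flush(self, now: float):
        """Return a partial batch if the oldest request has waited too long."""
        if self.pending and now - self.first_arrival >= self.max_wait_s:
            return self._flush()
        return None

    def _flush(self):
        batch, self.pending, self.first_arrival = self.pending, [], None
        return batch
```

The timeout path is exactly where static batching adds latency: an early request can wait up to `max_wait_s` for stragglers. Continuous batching removes that wait by admitting requests into the in-flight batch between decoding steps instead of at fixed batch boundaries.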
Resources
vLLM: Easy, Fast, and Cheap LLM Serving
vLLM Project
Official documentation for vLLM, the high-throughput LLM serving engine with PagedAttention. Covers installation, configuration, supported models, and deployment guides.
Text Generation Inference
Hugging Face
Hugging Face's production inference server documentation. Covers deployment, quantization, tensor parallelism, and integration with the Hugging Face ecosystem.
MLOps Principles
Google Cloud
Google Cloud's comprehensive guide to MLOps maturity levels, from manual ML workflows to fully automated CI/CD/CT pipelines.
NVIDIA Triton Inference Server
NVIDIA
Enterprise-grade multi-framework inference server supporting PyTorch, TensorFlow, TensorRT, ONNX, and custom backends with dynamic batching and model management.
Key Takeaways
1. vLLM offers the highest LLM throughput via PagedAttention and continuous batching; TGI integrates tightly with Hugging Face; Triton handles multi-model enterprise workloads.
2. Docker containers for ML should use NVIDIA base images, separate model weights from code, and include health/readiness checks for orchestrator compatibility.
3. Kubernetes GPU workloads require the NVIDIA device plugin, separate GPU node pools with taints/tolerations, and custom metrics for autoscaling.
4. AI CI/CD pipelines must test both code and model behavior — golden dataset evals, regression suites, LLM-as-judge scoring, and cost/latency gates.
5. Spot instances offer 60–90% savings for batch workloads; pair them with on-demand capacity for real-time serving to balance cost and reliability.
6. Model quantization (FP8/INT8) reduces memory by 50% with minimal quality loss — INT4 saves 75% but shows degradation on complex reasoning tasks.
7. Continuous batching improves throughput by 10–20x over single-request processing and is essential for making self-hosted model economics work.