MLOps & AI Infrastructure
Model serving, containerization, Kubernetes for ML, CI/CD, and cost optimization.
MLOps — machine learning operations — is the discipline of deploying, managing, and scaling AI models in production. It bridges the gap between a working prototype and a reliable, cost-effective system serving millions of requests. This module covers model serving infrastructure, containerization, orchestration with Kubernetes, CI/CD pipelines for AI, and the cost optimization strategies that make production AI financially viable.
Model Serving Infrastructure
Model serving is the process of making a trained model available to handle inference requests. For large language models, serving is particularly challenging because of the sheer computational requirements — a single LLM can require tens of gigabytes of GPU memory and generate tokens sequentially, making throughput optimization critical.
vLLM — High-Throughput LLM Serving
vLLM is an open-source library designed specifically for fast LLM inference and serving. Originally developed at UC Berkeley, it has become the de facto standard for self-hosted LLM serving due to its breakthrough throughput improvements.
- PagedAttention: vLLM's core innovation. It manages the key-value (KV) cache — the memory that stores attention state during generation — using virtual memory paging techniques borrowed from operating systems. This eliminates memory waste from fragmentation, enabling 2–4x higher throughput compared to naive serving.
- Continuous batching: Instead of waiting for an entire batch to finish before starting the next, vLLM dynamically adds and removes requests from the batch as they complete. This maximizes GPU utilization and reduces individual request latency.
- Speculative decoding: Uses a smaller, faster draft model to predict multiple tokens ahead, then verifies them with the main model in a single forward pass. This can double generation speed for many workloads.
- OpenAI-compatible API: vLLM exposes an API that matches the OpenAI specification, making it a drop-in replacement for OpenAI-hosted models in existing applications.
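To make the drop-in compatibility concrete, here is a minimal sketch of the request body a client would POST to a vLLM server's `/v1/chat/completions` endpoint. The helper function, model name, and parameter values are illustrative placeholders, not part of vLLM itself:

```python
import json

# Hypothetical helper: builds an OpenAI-style chat-completions payload.
# vLLM serves this same schema, so existing OpenAI client code can point
# at a self-hosted endpoint (e.g., http://localhost:8000/v1) unchanged.
def build_chat_request(model: str, user_message: str,
                       temperature: float = 0.2, max_tokens: int = 256) -> dict:
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }

payload = build_chat_request("meta-llama/Llama-3.1-8B-Instruct", "Ping?")
body = json.dumps(payload)  # this JSON body is what gets POSTed to the server
```

Because the schema matches, swapping between a hosted provider and a self-hosted vLLM deployment is usually just a base-URL change in the client configuration.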
Text Generation Inference (TGI) by Hugging Face
TGI is Hugging Face's production-ready inference server for LLMs. It's tightly integrated with the Hugging Face ecosystem, making it the natural choice if you're already using Hugging Face models and libraries.
- Flash Attention 2: Implements the Flash Attention algorithm for memory-efficient attention computation, supporting longer context windows with less GPU memory.
- Tensor parallelism: Splits model weights across multiple GPUs, enabling you to serve models too large for a single GPU.
- Quantization support: Built-in support for GPTQ, AWQ, and GGUF quantized models, reducing memory requirements by 50–75% with minimal quality loss.
- Token streaming: Native server-sent events (SSE) for real-time token streaming to clients.
NVIDIA Triton Inference Server
Triton is NVIDIA's enterprise-grade inference server designed for multi-model, multi-framework deployments. It excels in environments running diverse model types — not just LLMs but also vision models, embedding models, and traditional ML models.
- Multi-framework support: Serves models from PyTorch, TensorFlow, TensorRT, ONNX, and custom Python backends — all from a single server instance.
- Dynamic batching: Automatically batches incoming requests to maximize GPU throughput, with configurable maximum batch size and wait times.
- Model ensembles: Chain multiple models together into pipelines (e.g., tokenizer → LLM → post-processor) with zero-copy data passing between stages.
- Model versioning: Serve multiple versions of a model simultaneously and route traffic between them — essential for A/B testing and gradual rollouts.
| Feature | vLLM | TGI | Triton |
|---|---|---|---|
| Primary focus | LLM throughput | Hugging Face integration | Multi-model enterprise |
| Best for | Max tokens/second on LLMs | HF model ecosystem | Mixed model workloads |
| Quantization | AWQ, GPTQ, FP8, INT8 | GPTQ, AWQ, GGUF | TensorRT quantization |
| Multi-GPU | Tensor + pipeline parallel | Tensor parallelism | Full MIG support |
| API compatibility | OpenAI-compatible | HF Messages API | gRPC + HTTP/REST |
| Setup complexity | Low | Low | Medium–High |
Containerization with Docker for ML
Docker containers package your model, its dependencies, and the serving runtime into a single reproducible unit. This eliminates the "it works on my machine" problem and ensures consistent behavior across development, staging, and production environments.
Key Practices for ML Containers
- Use NVIDIA base images: Start from `nvidia/cuda` or `nvcr.io/nvidia/pytorch` images that include CUDA drivers and GPU libraries pre-configured. Building CUDA from scratch is error-prone and wastes hours.
- Layer caching strategy: Place rarely-changing layers (OS packages, CUDA libraries) early in the Dockerfile and frequently-changing layers (model code, configs) last. This dramatically speeds up rebuild times.
- Separate model weights from code: Don't bake multi-gigabyte model weights into the container image. Instead, mount them as volumes or download them at startup from a model registry (Hugging Face Hub, S3, GCS). This keeps images small and deployable.
- Health checks: Include health and readiness endpoints so orchestrators know when the container is ready to serve traffic. Model loading can take minutes — without readiness probes, requests will hit a container that hasn't finished loading.
Example Dockerfile for vLLM serving:
```dockerfile
FROM vllm/vllm-openai:latest

# Set environment variables
ENV MODEL_NAME="meta-llama/Llama-3.1-70B-Instruct"
ENV TENSOR_PARALLEL_SIZE=4
ENV MAX_MODEL_LEN=8192
ENV GPU_MEMORY_UTILIZATION=0.90

# Copy custom configuration
COPY serving_config.yaml /app/config.yaml

# Expose the serving port
EXPOSE 8000

# Health check endpoint
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
  CMD curl -f http://localhost:8000/health || exit 1

# Start the server. Shell form is used (not the JSON exec form) so that
# the ENV variables above are actually expanded at runtime.
CMD python -m vllm.entrypoints.openai.api_server \
    --model "$MODEL_NAME" \
    --tensor-parallel-size "$TENSOR_PARALLEL_SIZE" \
    --max-model-len "$MAX_MODEL_LEN" \
    --gpu-memory-utilization "$GPU_MEMORY_UTILIZATION"
```
Kubernetes for ML Workloads
Kubernetes orchestrates containerized ML workloads across clusters of machines, handling scaling, failover, and resource allocation. For ML specifically, Kubernetes introduces unique challenges around GPU scheduling, long startup times, and resource-intensive workloads.
GPU Scheduling and Resource Management
- NVIDIA device plugin: Install the NVIDIA device plugin for Kubernetes to expose GPUs as schedulable resources. Pods request GPUs via `nvidia.com/gpu: 1` in their resource spec.
- Node pools: Create separate node pools for GPU and CPU workloads. GPU nodes are expensive — don't run non-GPU workloads on them. Use taints and tolerations to enforce this separation.
- Multi-Instance GPU (MIG): NVIDIA A100 and H100 GPUs support MIG, which partitions a single GPU into multiple isolated instances. This lets you serve smaller models on fractions of a GPU, improving utilization and reducing cost.
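The utilization argument behind MIG and fractional-GPU serving comes down to simple capacity arithmetic. The sketch below estimates how many model replicas fit on one GPU; the memory figures are illustrative examples, not vendor specifications:

```python
# Rough capacity check: how many serving replicas fit on one GPU,
# reserving a fraction of memory as headroom (utilization < 1.0).
def replicas_per_gpu(gpu_memory_gb: float, per_replica_gb: float,
                     utilization: float = 0.9) -> int:
    usable = gpu_memory_gb * utilization
    return int(usable // per_replica_gb)

# Example: an 80 GB GPU, ~18 GB per replica (weights + KV-cache headroom)
print(replicas_per_gpu(80, 18))  # 4
```

If each replica needs only a fraction of the card, carving the GPU into MIG instances (or packing multiple replicas per node) recovers capacity that would otherwise sit idle.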
Scaling Strategies
- Horizontal Pod Autoscaler (HPA): Scale the number of serving pods based on custom metrics like requests per second, queue depth, or GPU utilization. Standard CPU-based autoscaling doesn't work well for GPU workloads.
- KEDA (Kubernetes Event-Driven Autoscaling): Scale based on external event sources — queue length, HTTP request rate, or custom Prometheus metrics. More flexible than HPA for AI workloads.
- Scale-to-zero: For infrequently used models, scale down to zero replicas when idle and spin up on demand. Tools like Knative or KEDA support this pattern, though cold start times for large models (30–120 seconds) must be acceptable for your use case.
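The HPA control loop scales replicas in proportion to how far the observed metric is from its target: `desired = ceil(current * currentMetric / targetMetric)`. A minimal sketch of that formula, with min/max clamping added for completeness:

```python
import math

# Kubernetes HPA scaling formula with replica-count clamping.
def desired_replicas(current: int, current_metric: float, target_metric: float,
                     min_replicas: int = 1, max_replicas: int = 20) -> int:
    desired = math.ceil(current * (current_metric / target_metric))
    return max(min_replicas, min(max_replicas, desired))

# 4 pods each seeing 30 req/s against a 20 req/s target -> scale to 6
print(desired_replicas(4, 30, 20))  # 6
```

Setting `min_replicas=0` models the scale-to-zero pattern: when the driving metric drops to zero, the deployment drains entirely, trading idle cost for cold-start latency on the next request.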
CI/CD for AI Applications
Traditional CI/CD pipelines test code. AI applications need to test code and model behavior — prompts, outputs, latency, cost, and quality. This requires new tools and practices beyond standard software engineering CI/CD.
Testing Prompts
- Golden dataset testing: Maintain a curated set of input-output pairs that represent expected behavior. Run these against every prompt change. If accuracy on the golden dataset drops below your threshold, the change is rejected.
- Regression suites: Every time you fix a prompt bug, add that case to your regression suite. This prevents the common problem of fixing one failure mode while breaking another.
- LLM-as-judge: Use a powerful frontier model (e.g., Claude Opus) to evaluate the outputs of your production model. Define rubrics for correctness, helpfulness, safety, and format compliance. This scales evaluation far beyond what manual review can handle.
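A golden-dataset gate like the one described above reduces to a few lines: score the candidate prompt's outputs against curated expectations and fail the build below a threshold. Everything here (function name, exact-match scoring, the 95% threshold) is an illustrative simplification; real suites often use fuzzier matching or judge scores:

```python
# Minimal golden-dataset gate: compare predictions to curated expectations
# and pass only if accuracy meets the configured threshold.
def golden_gate(predictions, expectations, threshold=0.95):
    correct = sum(p == e for p, e in zip(predictions, expectations))
    accuracy = correct / len(expectations)
    return accuracy >= threshold, accuracy

ok, acc = golden_gate(["4", "Paris"], ["4", "Paris"])
print(ok, acc)  # True 1.0
```

In CI, the boolean result becomes the pass/fail signal for the pipeline stage, and the accuracy value is logged so reviewers can see how close a failing change came to the bar.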
Model Versioning
- Version everything: Track model weights, prompt templates, system prompts, hyperparameters, and configuration as versioned artifacts. Use tools like DVC (Data Version Control) or MLflow for model artifacts and Git for prompts and configs.
- Canary deployments: Roll out model changes to a small percentage of traffic first (1–5%). Monitor quality metrics, latency, and error rates. Only promote to full traffic after the canary period confirms no regressions.
- Rollback strategy: Keep previous model versions deployed and ready to receive traffic. If a new version degrades quality, route traffic back to the previous version in seconds, not minutes.
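Canary routing is often implemented as sticky, hash-based bucketing so the same caller consistently lands on the same version during the canary window. A small sketch (the bucket count and function name are arbitrary choices for illustration):

```python
import hashlib

# Sticky canary routing: hash the request/user id into 10,000 buckets so
# a given caller deterministically lands on one version for the whole canary.
def route(request_id: str, canary_fraction: float = 0.05) -> str:
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 10_000
    return "canary" if bucket < canary_fraction * 10_000 else "stable"

# Deterministic: retries from the same user stay on the same version
print(route("user-42") == route("user-42"))  # True
```

Rollback then becomes a configuration change: setting `canary_fraction` to 0 instantly routes all traffic back to the stable version, with no redeploy.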
AI-aware CI/CD pipeline stages:
```
1. Code & Prompt Linting
   └─ Static analysis, prompt format validation, schema checks
2. Unit Tests
   └─ Code logic, data transformations, API contracts
3. Integration Tests (with mocked LLM)
   └─ End-to-end flow testing with deterministic mock responses
4. Eval Suite (with real LLM)
   └─ Golden dataset tests, regression checks, LLM-as-judge scoring
   └─ Gate: accuracy >= 95%, no regressions on golden set
5. Cost & Latency Checks
   └─ Estimate per-request cost, verify p95 latency within budget
6. Canary Deployment (1-5% traffic)
   └─ Monitor quality metrics for 1-4 hours
7. Progressive Rollout (25% → 50% → 100%)
   └─ Automated rollback if error rate exceeds threshold
```
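The cost and latency stage can be sketched as a simple gate function: compute p95 from a latency sample and compare both numbers against budgets. The budget values below are placeholders, not recommendations:

```python
import math

# p95 via the nearest-rank method on a sorted sample.
def p95(latencies_ms):
    ordered = sorted(latencies_ms)
    idx = math.ceil(0.95 * len(ordered)) - 1
    return ordered[idx]

# Pass only if both the latency and the per-request cost are within budget.
def cost_latency_gate(latencies_ms, cost_per_request,
                      latency_budget_ms=2000, cost_budget=0.02):
    return p95(latencies_ms) <= latency_budget_ms and cost_per_request <= cost_budget

sample = list(range(100, 2100, 20))  # 100 simulated latencies: 100..2080 ms
print(p95(sample))  # 1980
```

Wiring this into CI makes cost regressions a blocking failure, the same way a broken unit test would be, rather than a surprise on the next cloud bill.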
Cost Optimization Strategies
GPU compute is the dominant cost in AI infrastructure. A single NVIDIA H100 GPU costs $2–4 per hour in the cloud, and serving large models often requires 4–8 GPUs. Without deliberate cost optimization, AI infrastructure bills can grow to six or seven figures monthly.
Spot/Preemptible Instances
Cloud providers offer unused GPU capacity at 60–90% discounts through spot instances (AWS), preemptible VMs (GCP), or spot VMs (Azure). The trade-off: these instances can be reclaimed with short notice (typically 30 seconds to 2 minutes).
- Good for: Batch inference, eval suites, fine-tuning jobs, and non-latency-sensitive workloads
- Risky for: Real-time serving (unless you have enough on-demand capacity as fallback)
- Strategy: Run a base layer of on-demand instances for guaranteed capacity, then burst with spot instances during peak demand
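The base-plus-burst strategy is easy to reason about as blended-cost arithmetic. The rates and discount below are example numbers from the ranges quoted above, not current cloud prices:

```python
# Illustrative blended hourly cost: a guaranteed on-demand base layer
# plus spot burst capacity at a discount off the on-demand rate.
def blended_hourly_cost(on_demand_gpus: int, spot_gpus: int,
                        on_demand_rate: float = 3.0,
                        spot_discount: float = 0.70) -> float:
    spot_rate = on_demand_rate * (1 - spot_discount)
    return on_demand_gpus * on_demand_rate + spot_gpus * spot_rate

# 4 guaranteed GPUs plus 8 spot GPUs at a 70% discount: ~$19.20/hour,
# versus $36/hour if all 12 were on-demand.
print(blended_hourly_cost(4, 8))
```

The gap between the blended rate and the all-on-demand rate is the payoff for tolerating preemption on the burst layer.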
Model Quantization
Quantization reduces model precision from 16-bit floating point (FP16) to lower bit-widths (INT8, INT4), shrinking memory footprint and increasing throughput with modest quality trade-offs.
| Precision | Memory Reduction | Quality Impact | Best Method |
|---|---|---|---|
| FP16 (baseline) | — | Full quality | Default for most models |
| FP8 | ~50% | Negligible loss | Native on H100/H200 GPUs |
| INT8 | ~50% | Minimal loss (<1%) | GPTQ, SmoothQuant |
| INT4 | ~75% | Noticeable on complex tasks | AWQ, GPTQ-4bit |
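The memory-reduction column in the table follows directly from bytes-per-parameter arithmetic. A back-of-envelope estimator (weights only; real deployments also need KV cache and activation memory on top):

```python
# Bytes per parameter at each precision level.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}

# Weight memory in GB, approximating 1 GB as 1e9 bytes for round numbers.
def weight_memory_gb(params_billions: float, precision: str) -> float:
    return params_billions * BYTES_PER_PARAM[precision]

# A 70B-parameter model: 140 GB at FP16 vs 35 GB at INT4
print(weight_memory_gb(70, "fp16"), weight_memory_gb(70, "int4"))  # 140.0 35.0
```

This is why a 70B model that needs multiple GPUs at FP16 can fit on a single 40-80 GB card at INT4, provided the quality trade-off is acceptable for the task.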
Request Batching
Batching multiple inference requests together dramatically improves GPU utilization. Instead of processing one request at a time (where the GPU sits idle during memory transfers), batching fills the GPU's compute capacity with parallel work.
- Static batching: Collect requests until a batch is full or a timeout expires, then process them together. Simple but adds latency for the first requests in the batch.
- Continuous batching: vLLM and TGI support this — requests are dynamically added to and removed from the active batch as they arrive and complete. This minimizes both latency and idle time.
- Impact: Continuous batching can improve throughput by 10–20x compared to single-request processing, making the economics of self-hosted models viable at scale.
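To make the static-batching trade-off concrete, here is a toy batcher that flushes when the batch is full or when the oldest request has waited past a deadline. It is a teaching sketch, not production code; time is passed in explicitly so the logic is easy to test:

```python
# Toy static batcher: flush on "batch full" or "oldest request timed out".
class StaticBatcher:
    def __init__(self, max_batch: int = 8, max_wait_s: float = 0.05):
        self.max_batch = max_batch
        self.max_wait_s = max_wait_s
        self.pending = []
        self.first_arrival = None

    def add(self, request, now: float):
        """Enqueue a request; return a full batch if the size limit is hit."""
        if self.first_arrival is None:
            self.first_arrival = now
        self.pending.append(request)
        if len(self.pending) >= self.max_batch:
            return self._flush()
        return None

    def maybe_flush(self, now: float):
        """Return a partial batch if the oldest request has waited too long."""
        if self.pending and now - self.first_arrival >= self.max_wait_s:
            return self._flush()
        return None

    def _flush(self):
        batch, self.pending, self.first_arrival = self.pending, [], None
        return batch
```

The timeout path is exactly where static batching adds latency: an early request can wait up to `max_wait_s` for stragglers. Continuous batching removes that wait by admitting requests into the in-flight batch between decoding steps instead of at fixed batch boundaries.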
Resources
vLLM: Easy, Fast, and Cheap LLM Serving
vLLM Project
Official documentation for vLLM, the high-throughput LLM serving engine with PagedAttention. Covers installation, configuration, supported models, and deployment guides.
Text Generation Inference
Hugging Face
Hugging Face's production inference server documentation. Covers deployment, quantization, tensor parallelism, and integration with the Hugging Face ecosystem.
MLOps Principles
Google Cloud
Google Cloud's comprehensive guide to MLOps maturity levels, from manual ML workflows to fully automated CI/CD/CT pipelines.
NVIDIA Triton Inference Server
NVIDIA
Enterprise-grade multi-framework inference server supporting PyTorch, TensorFlow, TensorRT, ONNX, and custom backends with dynamic batching and model management.
Key Takeaways
1. vLLM offers the highest LLM throughput via PagedAttention and continuous batching; TGI integrates tightly with Hugging Face; Triton handles multi-model enterprise workloads.
2. Docker containers for ML should use NVIDIA base images, separate model weights from code, and include health/readiness checks for orchestrator compatibility.
3. Kubernetes GPU workloads require the NVIDIA device plugin, separate GPU node pools with taints/tolerations, and custom metrics for autoscaling.
4. AI CI/CD pipelines must test both code and model behavior — golden dataset evals, regression suites, LLM-as-judge scoring, and cost/latency gates.
5. Spot instances offer 60–90% savings for batch workloads; pair them with on-demand capacity for real-time serving to balance cost and reliability.
6. Model quantization (FP8/INT8) reduces memory by 50% with minimal quality loss — INT4 saves 75% but shows degradation on complex reasoning tasks.
7. Continuous batching improves throughput by 10–20x over single-request processing and is essential for making self-hosted model economics work.