Advanced · 35 min · Module 4 of 5

Open Source AI

Llama, Mistral, Qwen. Running local models, quantization, open source vs proprietary.

The Open Source AI Ecosystem

Open source AI has transformed from a distant follower of proprietary models into a genuine competitive force. In 2026, the best open-weight models match or exceed the proprietary frontier from just a year ago, and in many practical applications, the gap is negligible. For organizations that need data sovereignty, cost control, customization, or low-latency inference, open source models are not just viable — they are often the best choice.

This module covers the leading open source model families, how to run models locally, the trade-offs between self-hosting and APIs, and how to evaluate what makes sense for your use case.

The Major Open-Weight Model Families

Meta Llama 4

Llama 4 is the most significant open-weight release to date. It introduced two models that redefined what open models could achieve:

  • Llama 4 Scout: A Mixture of Experts model with 109B total parameters but only 17B active per token (16 experts, routing to a subset per token). Its headline feature is a 10 million token context window — longer than any other model, open or proprietary. This makes it uniquely suited for tasks requiring massive context: entire codebases, long document collections, or extended conversation histories.
  • Llama 4 Maverick: A larger MoE model with approximately 400B total parameters and 17B active, using 128 experts. Maverick targets frontier-level quality on reasoning, coding, and multilingual tasks, while remaining efficient enough to serve at reasonable cost.

Crucially, Llama 4 is the first natively multimodal open-weight model family — both Scout and Maverick process images and text jointly from pretraining, not through bolted-on vision adapters. This is a significant architectural advantage.

Mistral AI

The Paris-based lab has established itself as a leading open-weight provider with a portfolio of models across the capability spectrum:

| Model | Target | Key Strengths |
| --- | --- | --- |
| Mistral Large 3 | Frontier capability | Competes with proprietary frontier models on reasoning and multilingual tasks. Particularly strong in European languages. |
| Mistral Small 4 | Efficient general-purpose | Excellent quality-to-cost ratio for production deployments. Fast inference, reasonable hardware requirements. |
| Devstral | Code generation | Specialized for software development tasks: code completion, debugging, refactoring, and code review. |
| Magistral | Reasoning | Mistral's reasoning-focused model. Extended thinking for math, logic, and complex analysis tasks. |

Qwen (Alibaba Cloud)

Alibaba's Qwen family is notable for consistently delivering strong performance across a broad size range, from small efficient models to large capable ones. Qwen models are popular for fine-tuning and production deployment, and their multilingual capabilities, particularly for Chinese and other Asian languages, are industry-leading among open models.

Open-Weight vs. Open Source
A precise terminology note: most "open source" AI models are technically "open-weight" — the model weights are released, but the training data, full training code, and data curation pipeline are not. True open source would include everything needed to reproduce the model from scratch. Llama 4, Mistral, and Qwen models are open-weight. Some projects like OLMo from AI2 aim for full open source, releasing training data and code alongside weights.

Running Models Locally

One of the most powerful aspects of open-weight models is the ability to run them on your own hardware — for privacy, cost control, offline access, or just the satisfaction of self-sufficiency.

Ollama

Ollama is the easiest way to get started with local models. It provides a simple command-line interface for downloading and running models, handling all the complexity of model loading, quantization, and GPU management behind a clean API. If you can install a program and type a command, you can run a local LLM with Ollama.

Getting started with Ollama:

# Install Ollama (macOS, Linux, or Windows)
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run a model
ollama run llama4-scout

# Use the API programmatically
curl http://localhost:11434/api/generate -d '{
  "model": "llama4-scout",
  "prompt": "Explain quantum computing in simple terms"
}'
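The same endpoint can be called from Python using only the standard library. This is a minimal sketch against Ollama's `/api/generate` endpoint, mirroring the curl example above; `build_generate_request` is a helper introduced here for illustration, not part of any official client.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_generate_request(model: str, prompt: str, stream: bool = False) -> bytes:
    """Serialize a request body for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": stream}
    return json.dumps(payload).encode("utf-8")

def generate(model: str, prompt: str) -> str:
    """Send a non-streaming generation request and return the response text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_generate_request(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Example (requires a running Ollama server with the model pulled):
#   print(generate("llama4-scout", "Explain quantum computing in simple terms"))
```

With `"stream": false`, the server returns a single JSON object whose `response` field holds the full completion; streaming mode instead returns one JSON object per generated chunk.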

vLLM

For production deployments, vLLM is the standard inference engine. It implements PagedAttention and continuous batching to maximize throughput, achieving 2-4x higher throughput than naive implementations. vLLM supports tensor parallelism across multiple GPUs, making it the go-to choice for serving large models at scale.
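PagedAttention exists because at high concurrency it is the KV cache, not the weights, that dominates GPU memory. A back-of-the-envelope estimate shows why; the 80-layer, 8-KV-head, 128-dim configuration below is an illustrative Llama-70B-style assumption, not a quoted spec.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, dtype_bytes: int = 2) -> int:
    """Memory one token's K and V entries occupy across all layers (FP16 by default)."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Illustrative Llama-70B-style config: 80 layers, 8 KV heads (GQA), head_dim 128.
per_token = kv_cache_bytes_per_token(80, 8, 128)
per_seq_8k = per_token * 8192  # one 8K-token sequence

print(f"{per_token / 1024:.0f} KiB per token, "
      f"{per_seq_8k / 1024**3:.2f} GiB per 8K sequence")
# → 320 KiB per token, 2.50 GiB per 8K sequence
```

A few dozen concurrent long sequences therefore consume tens of gigabytes of cache; PagedAttention's block-based allocation avoids reserving that memory up front for every request.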

Quantization: Making Models Smaller

Full-precision models are enormous. A 70B parameter model in FP16 (16-bit floating point) requires approximately 140 GB of memory — far more than most consumer GPUs. Quantization reduces the precision of model weights, dramatically shrinking memory requirements at the cost of some quality.
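The arithmetic behind these sizes is straightforward: weight memory is parameter count times bits per weight, divided by 8 bits per byte.

```python
def model_weight_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

# Reproduce the table's figures for a 70B-parameter model.
for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{label}: {model_weight_gb(70e9, bits):.0f} GB")
# → FP16: 140 GB / INT8: 70 GB / INT4: 35 GB
```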

| Format | Bits per Weight | 70B Model Size | Quality Impact |
| --- | --- | --- | --- |
| FP16 / BF16 | 16 | ~140 GB | Baseline (full precision) |
| INT8 | 8 | ~70 GB | Minimal degradation, nearly lossless |
| INT4 / GPTQ | 4 | ~35 GB | Slight quality loss, usually acceptable |
| GGUF (Q4_K_M) | ~4.5 avg | ~40 GB | Optimized for CPU+GPU, good quality balance |
| GGUF (Q2_K) | ~2.5 avg | ~25 GB | Noticeable degradation, use for testing |

GGUF deserves special mention. Developed by the llama.cpp community, GGUF is a quantization format specifically designed for efficient CPU and mixed CPU/GPU inference. It uses mixed-precision quantization — keeping more important layers at higher precision while aggressively quantizing less critical ones. This makes it the best format for running models on consumer hardware with limited GPU memory.

Choosing a Quantization Level
For most use cases, Q4_K_M (roughly 4-bit with important layers kept at higher precision) offers the best trade-off between quality and size. INT8 is nearly lossless and should be your default for production serving on capable GPUs. Only go below 4-bit for experimentation or when hardware constraints are severe.

Hardware Requirements

Understanding hardware requirements helps you plan what models you can realistically run:

| Hardware | VRAM / RAM | What You Can Run |
| --- | --- | --- |
| MacBook Pro M3/M4 (36GB) | 36 GB unified | Up to ~30B params (Q4), smaller models at higher quality |
| NVIDIA RTX 4090 | 24 GB VRAM | Up to ~20B params (Q4), great for fine-tuning small models |
| 2x NVIDIA A100 (80GB) | 160 GB VRAM | 70B params at FP16, larger models quantized |
| 8x NVIDIA H100 | 640 GB VRAM | Full frontier models, production serving at scale |
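A rough way to reproduce the table's ballpark figures: reserve about half of memory for KV cache, activations, and runtime overhead, and spend the rest on Q4 weights. The 50% reservation is a heuristic assumption, not a measured figure.

```python
def max_params_q4(vram_gb: float, bits_per_weight: float = 4.5,
                  weight_fraction: float = 0.5) -> float:
    """Largest parameter count that fits, reserving ~half of memory
    for KV cache, activations, and runtime overhead (rough heuristic)."""
    return vram_gb * weight_fraction * 1e9 * 8 / bits_per_weight

for name, vram in [("RTX 4090", 24), ("36GB MacBook Pro", 36), ("A100 80GB", 80)]:
    print(f"{name}: ~{max_params_q4(vram) / 1e9:.0f}B params at Q4")
# → RTX 4090: ~21B / 36GB MacBook Pro: ~32B / A100 80GB: ~71B
```

The estimates land close to the table's "~20B" and "~30B" rows; real headroom varies with context length and batch size.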

When to Self-Host vs. Use APIs

The decision to self-host models or use proprietary APIs depends on your specific requirements:

| Factor | Self-Hosting | API |
| --- | --- | --- |
| Data privacy | Full control, data never leaves your infra | Depends on provider data policies |
| Peak capability | Strong but usually behind the frontier | Frontier models (GPT-5.4, Claude 4.6) |
| Cost at scale | Lower marginal cost at high volume | Pay-per-token, can get expensive at volume |
| Customization | Full control: fine-tune, modify, specialize | Limited to what the API offers |
| Operational burden | You manage infrastructure and reliability | Provider handles everything |
| Latency | Control over network hops, can co-locate | Network dependent, may have cold starts |

The Hidden Costs of Self-Hosting
Self-hosting is not free just because the model weights are free. Factor in GPU costs (purchase or rental), electricity, cooling, engineering time for setup and maintenance, monitoring, failover, and the opportunity cost of your team's attention. For many teams, the total cost of ownership for self-hosting exceeds API costs until they hit significant scale.
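A quick break-even sketch makes the trade-off concrete. All three inputs (GPU rental price, sustained throughput, API price) are illustrative assumptions, not quotes from any provider, and engineering time is ignored entirely.

```python
def breakeven_tokens_per_month(gpu_cost_per_hour: float,
                               api_cost_per_mtok: float) -> float:
    """Monthly token volume above which a dedicated GPU beats pay-per-token.

    Assumes the GPU is billed 24/7 whether or not it is busy.
    """
    monthly_gpu_cost = gpu_cost_per_hour * 24 * 30
    api_cost_per_token = api_cost_per_mtok / 1e6
    return monthly_gpu_cost / api_cost_per_token

# Illustrative numbers: $2/hr rented GPU vs. $1 per million API tokens,
# with the GPU sustaining 1000 tokens/second.
be = breakeven_tokens_per_month(2.0, 1.0)
capacity = 1000 * 3600 * 24 * 30  # tokens the GPU could actually produce monthly
print(f"Break-even: {be / 1e6:.0f}M tokens/month "
      f"(GPU capacity: {capacity / 1e6:.0f}M)")
```

Under these assumptions the GPU only pays for itself past roughly 1.4B tokens a month, and only if utilization stays high; idle capacity is the silent cost.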

The Open Source vs. Proprietary Debate

The capability gap between open and proprietary models continues to narrow. The pattern is consistent: proprietary labs push the frontier, and open-weight models catch up within months. Llama 4 Maverick competes with proprietary models that were frontier-level just six months before its release.

This narrowing gap has important implications:

  • For most applications, open models are sufficient. If you don't need the absolute cutting edge, an open model likely meets your quality requirements.
  • The proprietary advantage is in the last mile. Where proprietary models maintain an edge is on the hardest reasoning tasks, the newest capabilities, and the most polished user experiences.
  • Hybrid strategies work best. Many production systems use open models for the majority of requests (cost efficiency) and route complex queries to proprietary APIs (capability).
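A hybrid router can start as nothing more than a heuristic. The function below is a toy sketch (the keyword list and length threshold are made-up values), standing in for the small classifier models production systems typically use for this decision.

```python
def route(query: str, needs_tools: bool = False) -> str:
    """Toy router: send long or reasoning-heavy queries to the frontier API,
    everything else to the self-hosted open model."""
    hard_markers = ("prove", "step by step", "analyze", "derive")
    if needs_tools or len(query) > 2000:
        return "api"
    if any(m in query.lower() for m in hard_markers):
        return "api"
    return "local"

print(route("Summarize this paragraph in one sentence."))           # → local
print(route("Prove that the algorithm terminates, step by step."))  # → api
```

Even a crude router like this captures the economics: if most traffic is simple, most tokens are served at open-model prices.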


Key Takeaways

  1. Open-weight AI is a genuine competitive force in 2026. Llama 4 (Scout/Maverick), Mistral (Large 3, Small 4, Devstral, Magistral), and Qwen offer strong capabilities across the spectrum.
  2. Llama 4 Scout's 10M token context and native multimodality represent a breakthrough for open models, while Maverick targets frontier-level quality with 128 experts.
  3. Quantization (FP16 → INT8 → INT4 → GGUF) is essential for running models on practical hardware. Q4_K_M offers the best quality-to-size trade-off for most use cases.
  4. Ollama makes local model running accessible to anyone, while vLLM is the standard for production-scale serving with its PagedAttention and continuous batching.
  5. Self-hosting makes sense when you need data privacy, customization, or have high-volume workloads. APIs are better for peak capability, operational simplicity, and lower initial investment.
  6. The capability gap between open and proprietary models is narrowing steadily. Hybrid strategies — open models for volume, proprietary APIs for complexity — often work best.

