Open Source AI
Llama, Mistral, Qwen. Running local models, quantization, open source vs proprietary.
The Open Source AI Ecosystem
Open source AI has transformed from a distant follower of proprietary models into a genuine competitive force. In 2026, the best open-weight models match or exceed the proprietary frontier from just a year ago, and in many practical applications, the gap is negligible. For organizations that need data sovereignty, cost control, customization, or low-latency inference, open source models are not just viable — they are often the best choice.
This module covers the leading open source model families, how to run models locally, the trade-offs between self-hosting and APIs, and how to evaluate what makes sense for your use case.
The Major Open-Weight Model Families
Meta Llama 4
Llama 4 is the most significant open-weight release to date. It introduced two models that redefined what open models could achieve:
- Llama 4 Scout: A Mixture of Experts model with 109B total parameters but only 17B active per token (16 experts, routing to a subset per token). Its headline feature is a 10 million token context window — longer than any other model, open or proprietary. This makes it uniquely suited for tasks requiring massive context: entire codebases, long document collections, or extended conversation histories.
- Llama 4 Maverick: A larger MoE model with approximately 400B total parameters and 17B active, using 128 experts. Maverick targets frontier-level quality on reasoning, coding, and multilingual tasks, while remaining efficient enough to serve at reasonable cost.
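The MoE trade-off described above can be sketched numerically: memory cost scales with *total* parameters (every expert must be resident), while per-token compute scales with *active* parameters. The figures below are the ones quoted in this section; the 2-bytes-per-weight and 2-FLOPs-per-parameter constants are common rough approximations, not exact measurements.

```python
def moe_costs(total_params_b: float, active_params_b: float,
              bytes_per_weight: int = 2) -> dict:
    """Rough MoE cost model: memory follows total parameters,
    per-token compute follows active parameters (approximation)."""
    return {
        "memory_gb": total_params_b * bytes_per_weight,       # all experts loaded
        "gflops_per_token": 2 * active_params_b,              # ~2 FLOPs per active weight
        "active_fraction": active_params_b / total_params_b,  # share of weights used per token
    }

scout = moe_costs(109, 17)     # Llama 4 Scout: 109B total, 17B active
maverick = moe_costs(400, 17)  # Llama 4 Maverick: ~400B total, 17B active

# Maverick needs far more memory than Scout, but costs roughly the
# same compute per token, since both activate 17B parameters.
print(scout["memory_gb"], maverick["memory_gb"])  # 218 800
```

This is why Maverick can target frontier quality while remaining "efficient enough to serve": serving cost per token is dominated by the 17B active parameters, not the ~400B stored ones.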
Crucially, Llama 4 is the first natively multimodal open-weight model family — both Scout and Maverick process images and text jointly from pretraining, not through bolted-on vision adapters. This is a significant architectural advantage.
Mistral AI
The Paris-based lab has established itself as a leading open-weight provider with a portfolio of models across the capability spectrum:
| Model | Target | Key Strengths |
|---|---|---|
| Mistral Large 3 | Frontier capability | Competes with proprietary frontier models on reasoning and multilingual tasks. Particularly strong in European languages. |
| Mistral Small 4 | Efficient general-purpose | Excellent quality-to-cost ratio for production deployments. Fast inference, reasonable hardware requirements. |
| Devstral | Code generation | Specialized for software development tasks. Code completion, debugging, refactoring, and code review. |
| Magistral | Reasoning | Mistral's reasoning-focused model. Extended thinking for math, logic, and complex analysis tasks. |
Qwen (Alibaba Cloud)
Alibaba's Qwen family is notable for consistently delivering strong performance across a wide range of sizes. Qwen models are popular for fine-tuning and deployment in production systems, with a broad size range from small efficient models to large capable ones. Their multilingual capabilities, particularly for Chinese and other Asian languages, are industry-leading among open models.
Running Models Locally
One of the most powerful aspects of open-weight models is the ability to run them on your own hardware — for privacy, cost control, offline access, or just the satisfaction of self-sufficiency.
Ollama
Ollama is the easiest way to get started with local models. It provides a simple command-line interface for downloading and running models, handling all the complexity of model loading, quantization, and GPU management behind a clean API. If you can install a program and type a command, you can run a local LLM with Ollama.
Getting started with Ollama:
```shell
# Install Ollama (macOS, Linux, or Windows)
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run a model
ollama run llama4-scout

# Use the API programmatically
curl http://localhost:11434/api/generate -d '{
  "model": "llama4-scout",
  "prompt": "Explain quantum computing in simple terms"
}'
```
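By default, Ollama's `/api/generate` endpoint streams its reply as newline-delimited JSON objects, each carrying a `response` text chunk and a final `done: true` marker. A minimal stdlib-only helper to reassemble those chunks might look like this; the sample payload below is fabricated for illustration, not captured from a real server:

```python
import json

def collect_stream(ndjson_text: str) -> str:
    """Join the incremental 'response' chunks from an Ollama-style
    newline-delimited JSON stream into one string."""
    parts = []
    for line in ndjson_text.strip().splitlines():
        obj = json.loads(line)
        parts.append(obj.get("response", ""))
        if obj.get("done"):  # final object signals end of generation
            break
    return "".join(parts)

# Fabricated example shaped like Ollama's streaming output:
sample = (
    '{"model":"llama4-scout","response":"Quantum ","done":false}\n'
    '{"model":"llama4-scout","response":"computing...","done":true}\n'
)
print(collect_stream(sample))  # Quantum computing...
```

In a real client you would read the HTTP response line by line and feed each line to the same parsing logic, displaying chunks as they arrive.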
vLLM
For production deployments, vLLM is the standard inference engine. It implements PagedAttention and continuous batching to maximize throughput, achieving 2-4x higher throughput than naive implementations. vLLM supports tensor parallelism across multiple GPUs, making it the go-to choice for serving large models at scale.
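PagedAttention's core idea can be sketched in a few lines: instead of reserving one large contiguous KV-cache region per request (wasting memory on unused capacity), the cache is a pool of fixed-size blocks, and each sequence holds a list of block ids that grows on demand. The toy allocator below is illustrative only and is not vLLM's actual implementation:

```python
class PagedKVCache:
    """Toy sketch of PagedAttention-style allocation: a shared pool of
    fixed-size blocks, with a per-sequence block table."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # pool of unused block ids
        self.tables = {}                     # sequence id -> list of block ids
        self.lengths = {}                    # sequence id -> tokens cached

    def append_token(self, seq: str) -> None:
        n = self.lengths.get(seq, 0)
        if n % self.block_size == 0:         # current block full: grab a new one
            if not self.free:
                raise MemoryError("cache exhausted; a scheduler would preempt")
            self.tables.setdefault(seq, []).append(self.free.pop())
        self.lengths[seq] = n + 1

    def release(self, seq: str) -> None:
        """Return a finished sequence's blocks to the pool."""
        self.free.extend(self.tables.pop(seq, []))
        self.lengths.pop(seq, None)

cache = PagedKVCache(num_blocks=4, block_size=16)
for _ in range(20):                          # 20 tokens -> ceil(20/16) = 2 blocks
    cache.append_token("req-1")
print(len(cache.tables["req-1"]))            # 2
```

Because blocks return to the pool the moment a request finishes, new requests can be admitted continuously, which is what makes continuous batching effective.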
Quantization: Making Models Smaller
Full-precision models are enormous. A 70B parameter model in FP16 (16-bit floating point) requires approximately 140 GB of memory — far more than most consumer GPUs. Quantization reduces the precision of model weights, dramatically shrinking memory requirements at the cost of some quality.
| Format | Bits per Weight | 70B Model Size | Quality Impact |
|---|---|---|---|
| FP16 / BF16 | 16 | ~140 GB | Baseline (full precision) |
| INT8 | 8 | ~70 GB | Minimal degradation, nearly lossless |
| INT4 / GPTQ | 4 | ~35 GB | Slight quality loss, usually acceptable |
| GGUF (Q4_K_M) | ~4.5 avg | ~40 GB | Optimized for CPU+GPU, good quality balance |
| GGUF (Q2_K) | ~2.5 avg | ~25 GB | Noticeable degradation, use for testing |
GGUF deserves special mention. Developed by the llama.cpp community, GGUF is a quantization format specifically designed for efficient CPU and mixed CPU/GPU inference. It uses mixed-precision quantization — keeping more important layers at higher precision while aggressively quantizing less critical ones. This makes it the best format for running models on consumer hardware with limited GPU memory.
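The sizes in the table above follow directly from parameter count times bits per weight. A quick back-of-the-envelope helper (weights only; it deliberately ignores KV cache, activations, and runtime overhead, which add meaningfully on top):

```python
def model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB: parameters x bits / 8."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

# Reproduce the 70B column of the quantization table:
for fmt, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4), ("Q4_K_M", 4.5)]:
    print(f"70B @ {fmt:7s}: ~{model_size_gb(70, bits):.0f} GB")
# 70B @ FP16   : ~140 GB
# 70B @ INT8   : ~70 GB
# 70B @ INT4   : ~35 GB
# 70B @ Q4_K_M : ~39 GB
```

The same arithmetic explains the hardware table that follows: whatever fits in your VRAM (plus headroom for the KV cache) is what you can run.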
Hardware Requirements
Understanding hardware requirements helps you plan what models you can realistically run:
| Hardware | VRAM / RAM | What You Can Run |
|---|---|---|
| MacBook Pro M3/M4 (36GB) | 36 GB unified | Up to ~30B params (Q4), smaller models at higher quality |
| NVIDIA RTX 4090 | 24 GB VRAM | Up to ~20B params (Q4), great for fine-tuning small models |
| 2x NVIDIA A100 (80GB) | 160 GB VRAM | 70B params at FP16, larger models quantized |
| 8x NVIDIA H100 | 640 GB VRAM | Full frontier models, production serving at scale |
When to Self-Host vs. Use APIs
The decision to self-host models or use proprietary APIs depends on your specific requirements:
| Factor | Self-Hosting | API |
|---|---|---|
| Data privacy | Full control, data never leaves your infra | Depends on provider data policies |
| Peak capability | Strong but usually behind the frontier | Frontier models (GPT-5.4, Claude 4.6) |
| Cost at scale | Lower marginal cost at high volume | Pay-per-token, can get expensive at volume |
| Customization | Full control: fine-tune, modify, specialize | Limited to what the API offers |
| Operational burden | You manage infrastructure and reliability | Provider handles everything |
| Latency | Control over network hops, can co-locate | Network dependent, may have cold starts |
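The "cost at scale" row lends itself to a simple break-even calculation: self-hosting carries a fixed monthly cost, APIs charge per token, so there is a monthly volume above which self-hosting wins. All prices below are hypothetical placeholders, not quotes from any real provider:

```python
def breakeven_tokens_per_month(monthly_hosting_cost: float,
                               api_price_per_mtok: float,
                               self_host_price_per_mtok: float = 0.0) -> float:
    """Monthly token volume above which self-hosting is cheaper than an
    API. Prices are per million tokens; all figures are illustrative."""
    saving_per_mtok = api_price_per_mtok - self_host_price_per_mtok
    return monthly_hosting_cost / saving_per_mtok * 1_000_000

# e.g. a hypothetical $3,000/month GPU server vs a $5.00-per-Mtok API:
volume = breakeven_tokens_per_month(3000, 5.00)
print(f"break-even: {volume:,.0f} tokens/month")  # break-even: 600,000,000 tokens/month
```

Below that volume the API's pay-per-token model is cheaper; above it, the fixed hosting cost amortizes in your favor. Real comparisons should also price in engineering time for the "operational burden" row.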
The Open Source vs. Proprietary Debate
The capability gap between open and proprietary models continues to narrow. The pattern is consistent: proprietary labs push the frontier, and open-weight models catch up within months. Llama 4 Maverick competes with proprietary models that were frontier-level just six months before its release.
This narrowing gap has important implications:
- For most applications, open models are sufficient. If you don't need the absolute cutting edge, an open model likely meets your quality requirements.
- The proprietary advantage is in the last mile. Where proprietary models maintain an edge is on the hardest reasoning tasks, the newest capabilities, and the most polished user experiences.
- Hybrid strategies work best. Many production systems use open models for the majority of requests (cost efficiency) and route complex queries to proprietary APIs (capability).
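A hybrid router can be as simple as a heuristic that scores query complexity and escalates hard cases. The sketch below is a deliberately naive illustration; production routers typically use a trained classifier or a cheap model's self-assessment, and the keyword list here is invented for the example:

```python
def route(query: str, threshold: int = 2) -> str:
    """Toy complexity router: count signals suggesting hard reasoning
    and escalate to the proprietary API when the score is high enough.
    Keywords and threshold are illustrative, not empirically tuned."""
    signals = ["prove", "step by step", "optimize", "trade-off", "why"]
    score = sum(kw in query.lower() for kw in signals)
    score += len(query) // 500  # very long prompts get extra weight
    return "proprietary-api" if score >= threshold else "open-model"

print(route("Summarize this paragraph."))                      # open-model
print(route("Prove why this trade-off holds, step by step."))  # proprietary-api
```

The economics follow the break-even logic above: if most traffic stays on the open model, the expensive API is paid for only where its extra capability actually matters.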
Key Takeaways
1. Open-weight AI is a genuine competitive force in 2026. Llama 4 (Scout/Maverick), Mistral (Large 3, Small 4, Devstral, Magistral), and Qwen offer strong capabilities across the spectrum.
2. Llama 4 Scout's 10M token context and native multimodality represent a breakthrough for open models, while Maverick targets frontier-level quality with 128 experts.
3. Quantization (FP16 → INT8 → INT4 → GGUF) is essential for running models on practical hardware. Q4_K_M offers the best quality-to-size trade-off for most use cases.
4. Ollama makes local model running accessible to anyone, while vLLM is the standard for production-scale serving with its PagedAttention and continuous batching.
5. Self-hosting makes sense when you need data privacy, customization, or have high-volume workloads. APIs are better for peak capability, operational simplicity, and lower initial investment.
6. The capability gap between open and proprietary models is narrowing steadily. Hybrid strategies — open models for volume, proprietary APIs for complexity — often work best.