Advanced · 35 min · Module 4 of 5

Open Source AI

Llama, Mistral, Qwen. Running local models, quantization, open source vs proprietary.

The Open Source AI Ecosystem

Open source AI has transformed from a distant follower of proprietary models into a genuine competitive force. In 2026, the best open-weight models match or exceed the proprietary frontier from just a year ago, and in many practical applications, the gap is negligible. For organizations that need data sovereignty, cost control, customization, or low-latency inference, open source models are not just viable — they are often the best choice.

This module covers the leading open source model families, how to run models locally, the trade-offs between self-hosting and APIs, and how to evaluate what makes sense for your use case.

The Major Open-Weight Model Families

Meta Llama 4

Llama 4 is the most significant open-weight release to date. It introduced two models that redefined what open models could achieve:

  • Llama 4 Scout: A Mixture of Experts model with 109B total parameters but only 17B active per token (16 experts, routing to a subset per token). Its headline feature is a 10 million token context window — longer than any other model, open or proprietary. This makes it uniquely suited for tasks requiring massive context: entire codebases, long document collections, or extended conversation histories.
  • Llama 4 Maverick: A larger MoE model with approximately 400B total parameters and 17B active, using 128 experts. Maverick targets frontier-level quality on reasoning, coding, and multilingual tasks, while remaining efficient enough to serve at reasonable cost.

Crucially, Llama 4 is the first natively multimodal open-weight model family — both Scout and Maverick process images and text jointly from pretraining, not through bolted-on vision adapters. This is a significant architectural advantage.

Mistral AI

The Paris-based lab has established itself as a leading open-weight provider with a portfolio of models across the capability spectrum:

| Model | Target | Key Strengths |
| --- | --- | --- |
| Mistral Large 3 | Frontier capability | Competes with proprietary frontier models on reasoning and multilingual tasks. Particularly strong in European languages. |
| Mistral Small 4 | Efficient general-purpose | Excellent quality-to-cost ratio for production deployments. Fast inference, reasonable hardware requirements. |
| Devstral | Code generation | Specialized for software development tasks: code completion, debugging, refactoring, and code review. |
| Magistral | Reasoning | Mistral's reasoning-focused model. Extended thinking for math, logic, and complex analysis tasks. |

Qwen (Alibaba Cloud)

Alibaba's Qwen family is notable for consistently delivering strong performance across a broad size range, from small efficient models to large capable ones. Qwen models are popular for fine-tuning and production deployment, and their multilingual capabilities, particularly for Chinese and other Asian languages, are industry-leading among open models.

Open-Weight vs. Open Source
A precise terminology note: most "open source" AI models are technically "open-weight" — the model weights are released, but the training data, full training code, and data curation pipeline are not. True open source would include everything needed to reproduce the model from scratch. Llama 4, Mistral, and Qwen models are open-weight. Some projects like OLMo from AI2 aim for full open source, releasing training data and code alongside weights.

Running Models Locally

One of the most powerful aspects of open-weight models is the ability to run them on your own hardware — for privacy, cost control, offline access, or just the satisfaction of self-sufficiency.

Ollama

Ollama is the easiest way to get started with local models. It provides a simple command-line interface for downloading and running models, handling all the complexity of model loading, quantization, and GPU management behind a clean API. If you can install a program and type a command, you can run a local LLM with Ollama.

Getting started with Ollama:

# Install Ollama (macOS, Linux, or Windows)
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run a model
ollama run llama4-scout

# Use the API programmatically
curl http://localhost:11434/api/generate -d '{
  "model": "llama4-scout",
  "prompt": "Explain quantum computing in simple terms"
}'
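The same endpoint can be called from Python using only the standard library. This is a minimal sketch against Ollama's `/api/generate` endpoint, mirroring the curl example above; `build_generate_request` is a helper introduced here for illustration, not part of any official client.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_generate_request(model: str, prompt: str, stream: bool = False) -> bytes:
    """Serialize a request body for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": stream}
    return json.dumps(payload).encode("utf-8")

def generate(model: str, prompt: str) -> str:
    """Send a non-streaming generation request and return the response text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_generate_request(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Example (requires a running Ollama server with the model pulled):
#   print(generate("llama4-scout", "Explain quantum computing in simple terms"))
```

With `"stream": false`, the server returns a single JSON object whose `response` field holds the full completion; streaming mode instead returns one JSON object per generated chunk.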

vLLM

For production deployments, vLLM is the standard inference engine. It implements PagedAttention and continuous batching to maximize throughput, achieving 2-4x higher throughput than naive implementations. vLLM supports tensor parallelism across multiple GPUs, making it the go-to choice for serving large models at scale.
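PagedAttention exists because at high concurrency it is the KV cache, not the weights, that dominates GPU memory. A back-of-the-envelope estimate shows why; the 80-layer, 8-KV-head, 128-dim configuration below is an illustrative Llama-70B-style assumption, not a quoted spec.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, dtype_bytes: int = 2) -> int:
    """Memory one token's K and V entries occupy across all layers (FP16 by default)."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Illustrative Llama-70B-style config: 80 layers, 8 KV heads (GQA), head_dim 128.
per_token = kv_cache_bytes_per_token(80, 8, 128)
per_seq_8k = per_token * 8192  # one 8K-token sequence

print(f"{per_token / 1024:.0f} KiB per token, "
      f"{per_seq_8k / 1024**3:.2f} GiB per 8K sequence")
# → 320 KiB per token, 2.50 GiB per 8K sequence
```

A few dozen concurrent long sequences therefore consume tens of gigabytes of cache; PagedAttention's block-based allocation avoids reserving that memory up front for every request.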

Quantization: Making Models Smaller

Full-precision models are enormous. A 70B parameter model in FP16 (16-bit floating point) requires approximately 140 GB of memory — far more than most consumer GPUs. Quantization reduces the precision of model weights, dramatically shrinking memory requirements at the cost of some quality.
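The arithmetic behind these sizes is straightforward: weight memory is parameter count times bits per weight, divided by 8 bits per byte.

```python
def model_weight_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

# Reproduce the table's figures for a 70B-parameter model.
for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{label}: {model_weight_gb(70e9, bits):.0f} GB")
# → FP16: 140 GB / INT8: 70 GB / INT4: 35 GB
```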

| Format | Bits per Weight | 70B Model Size | Quality Impact |
| --- | --- | --- | --- |
| FP16 / BF16 | 16 | ~140 GB | Baseline (full precision) |
| INT8 | 8 | ~70 GB | Minimal degradation, nearly lossless |
| INT4 / GPTQ | 4 | ~35 GB | Slight quality loss, usually acceptable |
| GGUF (Q4_K_M) | ~4.5 avg | ~40 GB | Optimized for CPU+GPU, good quality balance |
| GGUF (Q2_K) | ~2.5 avg | ~25 GB | Noticeable degradation, use for testing |

GGUF deserves special mention. Developed by the llama.cpp community, GGUF is a quantization format specifically designed for efficient CPU and mixed CPU/GPU inference. It uses mixed-precision quantization — keeping more important layers at higher precision while aggressively quantizing less critical ones. This makes it the best format for running models on consumer hardware with limited GPU memory.

Choosing a Quantization Level
For most use cases, Q4_K_M (roughly 4-bit with important layers kept at higher precision) offers the best trade-off between quality and size. INT8 is nearly lossless and should be your default for production serving on capable GPUs. Only go below 4-bit for experimentation or when hardware constraints are severe.

Hardware Requirements

Understanding hardware requirements helps you plan what models you can realistically run:

| Hardware | VRAM / RAM | What You Can Run |
| --- | --- | --- |
| MacBook Pro M3/M4 (36GB) | 36 GB unified | Up to ~30B params (Q4), smaller models at higher quality |
| NVIDIA RTX 4090 | 24 GB VRAM | Up to ~20B params (Q4), great for fine-tuning small models |
| 2x NVIDIA A100 (80GB) | 160 GB VRAM | 70B params at FP16, larger models quantized |
| 8x NVIDIA H100 | 640 GB VRAM | Full frontier models, production serving at scale |
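A rough way to reproduce the table's ballpark figures: reserve about half of memory for KV cache, activations, and runtime overhead, and spend the rest on Q4 weights. The 50% reservation is a heuristic assumption, not a measured figure.

```python
def max_params_q4(vram_gb: float, bits_per_weight: float = 4.5,
                  weight_fraction: float = 0.5) -> float:
    """Largest parameter count that fits, reserving ~half of memory
    for KV cache, activations, and runtime overhead (rough heuristic)."""
    return vram_gb * weight_fraction * 1e9 * 8 / bits_per_weight

for name, vram in [("RTX 4090", 24), ("36GB MacBook Pro", 36), ("A100 80GB", 80)]:
    print(f"{name}: ~{max_params_q4(vram) / 1e9:.0f}B params at Q4")
# → RTX 4090: ~21B / 36GB MacBook Pro: ~32B / A100 80GB: ~71B
```

The estimates land close to the table's "~20B" and "~30B" rows; real headroom varies with context length and batch size.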

When to Self-Host vs. Use APIs

The decision to self-host models or use proprietary APIs depends on your specific requirements:

| Factor | Self-Hosting | API |
| --- | --- | --- |
| Data privacy | Full control, data never leaves your infra | Depends on provider data policies |
| Peak capability | Strong but usually behind the frontier | Frontier models (GPT-5.4, Claude 4.6) |
| Cost at scale | Lower marginal cost at high volume | Pay-per-token, can get expensive at volume |
| Customization | Full control: fine-tune, modify, specialize | Limited to what the API offers |
| Operational burden | You manage infrastructure and reliability | Provider handles everything |
| Latency | Control over network hops, can co-locate | Network dependent, may have cold starts |

The Hidden Costs of Self-Hosting
Self-hosting is not free just because the model weights are free. Factor in GPU costs (purchase or rental), electricity, cooling, engineering time for setup and maintenance, monitoring, failover, and the opportunity cost of your team's attention. For many teams, the total cost of ownership for self-hosting exceeds API costs until they hit significant scale.
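A quick break-even sketch makes the trade-off concrete. All three inputs (GPU rental price, sustained throughput, API price) are illustrative assumptions, not quotes from any provider, and engineering time is ignored entirely.

```python
def breakeven_tokens_per_month(gpu_cost_per_hour: float,
                               api_cost_per_mtok: float) -> float:
    """Monthly token volume above which a dedicated GPU beats pay-per-token.

    Assumes the GPU is billed 24/7 whether or not it is busy.
    """
    monthly_gpu_cost = gpu_cost_per_hour * 24 * 30
    api_cost_per_token = api_cost_per_mtok / 1e6
    return monthly_gpu_cost / api_cost_per_token

# Illustrative numbers: $2/hr rented GPU vs. $1 per million API tokens,
# with the GPU sustaining 1000 tokens/second.
be = breakeven_tokens_per_month(2.0, 1.0)
capacity = 1000 * 3600 * 24 * 30  # tokens the GPU could actually produce monthly
print(f"Break-even: {be / 1e6:.0f}M tokens/month "
      f"(GPU capacity: {capacity / 1e6:.0f}M)")
```

Under these assumptions the GPU only pays for itself past roughly 1.4B tokens a month, and only if utilization stays high; idle capacity is the silent cost.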

The Open Source vs. Proprietary Debate

The capability gap between open and proprietary models continues to narrow. The pattern is consistent: proprietary labs push the frontier, and open-weight models catch up within months. Llama 4 Maverick competes with proprietary models that were frontier-level just six months before its release.

This narrowing gap has important implications:

  • For most applications, open models are sufficient. If you don't need the absolute cutting edge, an open model likely meets your quality requirements.
  • The proprietary advantage is in the last mile. Where proprietary models maintain an edge is on the hardest reasoning tasks, the newest capabilities, and the most polished user experiences.
  • Hybrid strategies work best. Many production systems use open models for the majority of requests (cost efficiency) and route complex queries to proprietary APIs (capability).
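A hybrid router can start as nothing more than a heuristic. The function below is a toy sketch (the keyword list and length threshold are made-up values), standing in for the small classifier models production systems typically use for this decision.

```python
def route(query: str, needs_tools: bool = False) -> str:
    """Toy router: send long or reasoning-heavy queries to the frontier API,
    everything else to the self-hosted open model."""
    hard_markers = ("prove", "step by step", "analyze", "derive")
    if needs_tools or len(query) > 2000:
        return "api"
    if any(m in query.lower() for m in hard_markers):
        return "api"
    return "local"

print(route("Summarize this paragraph in one sentence."))           # → local
print(route("Prove that the algorithm terminates, step by step."))  # → api
```

Even a crude router like this captures the economics: if most traffic is simple, most tokens are served at open-model prices.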


Key Takeaways

  1. Open-weight AI is a genuine competitive force in 2026. Llama 4 (Scout/Maverick), Mistral (Large 3, Small 4, Devstral, Magistral), and Qwen offer strong capabilities across the spectrum.
  2. Llama 4 Scout's 10M token context and native multimodality represent a breakthrough for open models, while Maverick targets frontier-level quality with 128 experts.
  3. Quantization (FP16 → INT8 → INT4 → GGUF) is essential for running models on practical hardware. Q4_K_M offers the best quality-to-size trade-off for most use cases.
  4. Ollama makes local model running accessible to anyone, while vLLM is the standard for production-scale serving with its PagedAttention and continuous batching.
  5. Self-hosting makes sense when you need data privacy, customization, or have high-volume workloads. APIs are better for peak capability, operational simplicity, and lower initial investment.
  6. The capability gap between open and proprietary models is narrowing steadily. Hybrid strategies — open models for volume, proprietary APIs for complexity — often work best.

