LLM Architecture Deep Dive
Transformer architecture, attention mechanisms, tokenization, context windows, and scaling laws.
Understanding the Engine Behind AI
Large language models (LLMs) are the technology powering ChatGPT, Claude, Gemini, and virtually every generative AI application you use today. But how do they actually work? Understanding LLM architecture isn't just academic — it's practical knowledge that helps you make better decisions about which models to use, why they behave the way they do, and how to get the most out of them.
This module takes you inside the transformer architecture, explains how models are trained, and covers the key concepts — from attention mechanisms to scaling laws — that define the current generation of AI.
The Transformer Architecture
Every modern LLM is built on the transformer architecture, introduced in the landmark 2017 paper "Attention Is All You Need" by Vaswani et al. at Google. Before transformers, language models relied on recurrent neural networks (RNNs) such as LSTMs, which processed text sequentially — one token at a time. This made them slow to train and limited in their ability to capture long-range relationships in text.
The transformer solved this by introducing a mechanism called self-attention, which allows the model to look at all parts of the input simultaneously. This parallelism made transformers dramatically faster to train and far more capable at understanding context.
Encoder-Decoder Structure
The original transformer has two main components:
- Encoder: Reads and processes the input text, building a rich representation of its meaning. Each layer of the encoder refines this understanding by attending to different parts of the input.
- Decoder: Generates output text token by token, using both the encoder's representation and the tokens it has already generated.
Modern LLMs typically use variations of this structure:
| Architecture | How It Works | Examples |
|---|---|---|
| Encoder-only | Processes input to create embeddings and representations. Great for understanding and classification tasks. | BERT, RoBERTa |
| Decoder-only | Generates text autoregressively (one token at a time). The dominant architecture for generative AI. | GPT-4o, Claude 4.6, Llama 4, Gemini 3.1 |
| Encoder-decoder | Uses the full original architecture. Good for tasks that transform input to output (translation, summarization). | T5, BART, original Transformer |
Self-Attention: The Core Innovation
Self-attention is the mechanism that allows transformers to understand context. For every token in the input, self-attention computes a weighted relationship to every other token. This means the model can understand that in the sentence "The bank by the river was steep," the word "bank" relates to "river" and "steep," not to finance.
How Self-Attention Works (Simplified)
For each token, the model creates three vectors:
- Query (Q): "What am I looking for?"
- Key (K): "What do I contain?"
- Value (V): "What information do I provide?"
The model computes attention scores by comparing each token's query against every other token's key. High scores mean strong relationships. These scores are then used to create a weighted sum of the values, producing a context-aware representation of each token.
The attention formula:
Attention(Q, K, V) = softmax(QKᵀ / √d_k) V
The division by √d_k (square root of the key dimension) prevents the dot products from growing too large, which would push the softmax into regions with tiny gradients.
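The formula above fits in a few lines of NumPy. This is a toy sketch for intuition — random vectors, tiny dimensions, no masking or batching — not a production implementation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for a single attention head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (seq, seq) pairwise similarities
    scores -= scores.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ V, weights                     # weighted sum of values

# Three toy "tokens", each with a 4-dimensional Q, K, and V vector
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out, attn_weights = scaled_dot_product_attention(Q, K, V)
```

Each row of `attn_weights` is one token's distribution of attention over all tokens, and `out` is the resulting context-aware representation.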
Multi-Head Attention
Rather than computing attention once, transformers use multi-head attention — they run several attention computations in parallel, each with different learned weight matrices. This allows the model to attend to different types of relationships simultaneously. One head might focus on syntactic relationships (subject-verb agreement), another on semantic relationships (topic coherence), and another on positional patterns.
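Multi-head attention is just the single-head computation repeated in parallel on slices of the model dimension, then concatenated and mixed by an output projection. A minimal sketch with made-up toy dimensions (8-dimensional model, 2 heads) and random weights:

```python
import numpy as np

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads):
    """Split d_model into n_heads slices, attend independently in each, recombine."""
    seq, d_model = x.shape
    d_head = d_model // n_heads
    Q, K, V = x @ Wq, x @ Wk, x @ Wv

    def split_heads(t):
        # (seq, d_model) -> (n_heads, seq, d_head)
        return t.reshape(seq, n_heads, d_head).transpose(1, 0, 2)

    Qh, Kh, Vh = split_heads(Q), split_heads(K), split_heads(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)  # per-head attention scores
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    heads = w @ Vh                                          # (n_heads, seq, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq, d_model) # concatenate heads
    return concat @ Wo                                      # output projection

rng = np.random.default_rng(1)
d_model, n_heads, seq = 8, 2, 5
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
x = rng.normal(size=(seq, d_model))
y = multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads)
```

Because each head sees a different learned projection of the input, the heads are free to specialize in different relationships, as described above.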
Tokenization
Before text reaches the transformer, it must be converted into numerical tokens. Tokenization is the process of breaking text into sub-word units that the model can process. Understanding tokenization helps explain many LLM behaviors, including why they sometimes struggle with spelling, counting, or character-level operations.
Common Tokenization Methods
- Byte Pair Encoding (BPE): The most widely used method. Starts with individual characters and iteratively merges the most frequent pairs. Used by GPT models and many others. The word "unhappiness" might become ["un", "happiness"] or ["un", "happ", "iness"] depending on the vocabulary.
- SentencePiece: A language-agnostic tokenizer that works directly on raw text (including spaces). Used by Llama, T5, and many multilingual models. It handles languages without whitespace-separated words (like Japanese and Chinese) more naturally.
- WordPiece: Similar to BPE but uses a slightly different merging criterion. Used by BERT and some Google models.
Tokenization example:
Input: "Tokenization is fascinating!"
Tokens: ["Token", "ization", " is", " fascinating", "!"]
IDs: [5765, 2065, 318, 22177, 0]
Common English words often map to a single token. Rare or compound words get split into sub-words. A rough rule of thumb: 1 token ≈ 0.75 English words.
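The core of BPE training can be sketched in a few lines of plain Python: repeatedly find the most frequent adjacent pair and merge it into one token. This is an illustrative toy (real tokenizers train over a large corpus and keep a learned merge table), using the "unhappiness" example from above:

```python
from collections import Counter

def most_frequent_pair(tokens):
    """One BPE training step: find the most common adjacent token pair."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# BPE starts from individual characters and merges upward
tokens = list("unhappiness")          # ['u', 'n', 'h', 'a', 'p', ...]
for _ in range(3):                    # apply three merge steps
    tokens = merge_pair(tokens, most_frequent_pair(tokens))
```

After each merge the sequence gets shorter; with a full merge table learned from a corpus, frequent words collapse to one token while rare words stay split into sub-words.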
Context Windows
The context window is the maximum number of tokens a model can process in a single interaction — both your input and the model's output combined. It's one of the most important practical constraints when working with LLMs.
| Model | Context Window | Approximate Equivalent |
|---|---|---|
| GPT-4o | 128K tokens | ~96K words / a 300-page book |
| Claude 4.6 (Opus) | 1M tokens | ~750K words / several novels |
| Gemini 3.1 Pro | 1M tokens | ~750K words / several novels |
| Llama 4 Scout | 10M tokens | ~7.5M words / an entire library shelf |
Larger context windows enable use cases like analyzing entire codebases, processing lengthy legal documents, or maintaining very long conversations. However, there are trade-offs: larger context windows increase compute costs (attention scales quadratically with sequence length), and models can sometimes lose focus on information in the middle of very long contexts — a phenomenon known as "lost in the middle."
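The quadratic cost mentioned above is easy to see with back-of-the-envelope arithmetic: the attention score matrix has one entry per pair of tokens, so doubling the context quadruples the attention work. A toy calculation (the token counts are arbitrary examples):

```python
def attention_matrix_cells(context_tokens):
    """Attention computes a score for every (query, key) token pair,
    so the score matrix grows quadratically with context length."""
    return context_tokens ** 2

# Doubling the context from 4K to 8K tokens quadruples the attention work
short_ctx, long_ctx = 4_000, 8_000
ratio = attention_matrix_cells(long_ctx) / attention_matrix_cells(short_ctx)
```

This is why a 1M-token context is not simply "8× more expensive" than a 128K one in the attention layers — naively, it is closer to 64× — and why efficiency techniques like Flash Attention (covered later in this module) matter so much.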
How LLMs Are Trained
Training a large language model is a multi-stage process that transforms a randomly initialized neural network into a capable AI system. Each stage has a distinct purpose.
Stage 1: Pretraining (Next Token Prediction)
The foundational stage. The model is trained on massive amounts of text data — web pages, books, code repositories, academic papers — with a deceptively simple objective: predict the next token. Given a sequence of tokens, the model learns to predict what comes next.
This simple objective, at scale, produces remarkable emergent abilities. By learning to predict text well, the model implicitly learns grammar, facts, reasoning patterns, coding conventions, and much more. Pretraining typically requires thousands of GPUs running for weeks or months and costs tens to hundreds of millions of dollars for frontier models.
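The pretraining objective itself is just cross-entropy on the next token: the loss is the negative log of the probability the model assigned to the token that actually came next. A minimal sketch with a made-up four-token vocabulary:

```python
import math

def next_token_loss(probs, target_id):
    """Cross-entropy for one prediction: -log p(correct next token)."""
    return -math.log(probs[target_id])

# Toy model output: a probability distribution over a 4-token vocabulary
probs = [0.1, 0.7, 0.1, 0.1]

loss_confident = next_token_loss(probs, target_id=1)  # true token got p = 0.7
loss_surprised = next_token_loss(probs, target_id=0)  # true token got p = 0.1
```

Training pushes the model to assign high probability to the actual continuation, so `loss_confident` is much lower than `loss_surprised`. Summed over trillions of tokens, minimizing this one number is the entire pretraining signal.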
Stage 2: Supervised Fine-Tuning (SFT)
After pretraining, the model is a powerful text predictor but not a good assistant. It might continue a question with another question instead of answering it. Supervised fine-tuning trains the model on curated examples of desired input-output behavior — questions paired with high-quality answers, instructions paired with correct completions. This teaches the model to be helpful and follow instructions.
Stage 3: Alignment (RLHF and Beyond)
The final stage aligns the model with human values and preferences. The most common approach is Reinforcement Learning from Human Feedback (RLHF):
- Human raters compare multiple model outputs and rank them by quality.
- A reward model is trained on these rankings to predict human preferences.
- The LLM is then fine-tuned using reinforcement learning (typically PPO) to maximize the reward model's score. A popular alternative, Direct Preference Optimization (DPO), skips the explicit reward model and optimizes directly on the ranked preference pairs.
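The reward model in step 2 is typically trained with a Bradley-Terry pairwise loss: it is penalized whenever it scores the human-preferred response lower than the rejected one. A minimal sketch with made-up reward scores:

```python
import math

def preference_loss(r_chosen, r_rejected):
    """Bradley-Terry pairwise loss for reward-model training:
    -log sigmoid(r_chosen - r_rejected).
    Small when the preferred response scores higher, large when it doesn't."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

loss_agrees = preference_loss(2.0, 0.5)     # reward model agrees with human raters
loss_disagrees = preference_loss(0.5, 2.0)  # reward model contradicts the ranking
```

Minimizing this loss over many ranked pairs teaches the reward model to predict human preferences, which then serves as the training signal for the RL step.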
Anthropic pioneered an alternative approach called Constitutional AI (CAI), where the model is trained to follow a set of principles (a "constitution") and uses AI-generated feedback rather than relying entirely on human labelers. This makes the alignment process more scalable and transparent — you can read and debate the principles the model is trained to follow.
Scaling Laws
One of the most important discoveries in AI research is that LLM performance follows predictable scaling laws. A series of papers — most notably from OpenAI (Kaplan et al., 2020) and DeepMind's Chinchilla paper (Hoffmann et al., 2022) — showed that model performance improves as a power law function of three variables:
- Model size (number of parameters)
- Dataset size (amount of training data)
- Compute budget (training FLOPs)
The Chinchilla paper's key insight was that many models were "undertrained" — using too many parameters for the amount of data they saw. The optimal approach is to scale data and parameters together. A well-trained smaller model can outperform a poorly trained larger one.
However, scaling comes with diminishing returns. Each doubling of compute yields a smaller improvement. This has driven the industry toward finding efficiency gains — better architectures, better training data, and techniques like mixture of experts — rather than relying solely on scale.
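Two widely cited rules of thumb make the Chinchilla result concrete: train on roughly 20 tokens per parameter, and estimate training compute as about 6 FLOPs per parameter per token. Sketched as arithmetic (the 70B example is illustrative):

```python
def chinchilla_optimal_tokens(n_params):
    """Chinchilla rule of thumb: train on roughly 20 tokens per parameter."""
    return 20 * n_params

def training_flops(n_params, n_tokens):
    """Common estimate: ~6 FLOPs per parameter per training token."""
    return 6 * n_params * n_tokens

n_params = 70e9                                    # a 70B-parameter model
n_tokens = chinchilla_optimal_tokens(n_params)     # ~1.4 trillion tokens
flops = training_flops(n_params, n_tokens)         # total training compute
```

By this rule, a 70B-parameter model "wants" about 1.4 trillion training tokens — which is exactly why a smaller model trained on enough data can beat a larger, undertrained one.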
Model Sizes and Parameters
A model's parameter count is a rough proxy for its capacity. Parameters are the learned numerical weights in the neural network.
| Size Category | Parameters | Use Cases | Examples |
|---|---|---|---|
| Small | 1B – 8B | On-device, edge, simple tasks | Gemma 3 (1B/4B), Phi-4-mini (3.8B) |
| Medium | 8B – 70B | General-purpose, good quality/cost ratio | Llama 4 Scout (17B active), Llama 4 Maverick (17B active), Qwen 2.5 72B |
| Large / Frontier | 100B+ | Maximum capability, complex reasoning | GPT-4o, Claude 4.6 Opus, Gemini 3.1 Pro |
Mixture of Experts (MoE)
Mixture of Experts is an architecture that has become increasingly prominent. Instead of activating the entire neural network for every token, MoE models have multiple "expert" sub-networks and a routing mechanism that selects only a subset of experts for each token.
How MoE Works
- The model contains many expert feed-forward networks (e.g., 16, 128, or more).
- A gating/router network decides which experts to activate for each token (typically 2–8 experts out of the total pool).
- Only the selected experts process the token, so the active parameter count per token is much smaller than the total parameter count.
MoE in practice:
Llama 4 Maverick:
- Total parameters: ~400B
- Active parameters per token: ~17B (using 2 of 128 experts)
- Result: near-frontier quality at a fraction of the compute cost

Llama 4 Scout:
- Total parameters: ~109B
- Active parameters per token: ~17B (16 experts, only a few active per token)
- Context window: 10M tokens

Mistral Large (2025):
- Also uses an MoE architecture, efficiently routing tokens to specialized expert networks
The advantage of MoE is efficiency: you get the knowledge capacity of a very large model with the inference cost of a much smaller one. The trade-off is that total memory requirements remain high (all experts must be loaded), which can make deployment more challenging.
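The routing step described above — score every expert, keep the top-k, renormalize their gate weights — can be sketched in NumPy. The dimensions and random router weights here are toy values for illustration:

```python
import numpy as np

def moe_route(token_vec, router_W, k=2):
    """Score every expert for this token, keep the top-k,
    and renormalize their gate weights to sum to 1."""
    logits = token_vec @ router_W            # one score per expert
    top_k = np.argsort(logits)[-k:]          # indices of the k highest-scoring experts
    gates = np.exp(logits[top_k])            # softmax over the selected experts only
    gates /= gates.sum()
    return top_k, gates

rng = np.random.default_rng(2)
n_experts, d_model = 8, 16
router_W = rng.normal(size=(d_model, n_experts))
token_vec = rng.normal(size=d_model)

chosen_experts, gate_weights = moe_route(token_vec, router_W, k=2)
```

Only the `chosen_experts` then process the token, and their outputs are combined using `gate_weights` — which is how a model with hundreds of billions of total parameters can run each token through only ~17B of them.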
Key Architectural Innovations Beyond the Original Transformer
Since 2017, researchers have introduced many improvements to the base transformer:
- Rotary Position Embeddings (RoPE): A more effective way to encode token positions, enabling better extrapolation to longer sequences. Used by Llama, Qwen, and many modern models.
- Flash Attention: An exact attention algorithm that is significantly faster and more memory-efficient by optimizing GPU memory access patterns. Now standard in most training and inference stacks.
- Grouped Query Attention (GQA): Reduces memory overhead by sharing key-value heads across groups of query heads, enabling faster inference with minimal quality loss.
- SwiGLU activation: A non-linearity that consistently outperforms the original ReLU activation in transformers.
- RMSNorm: A simpler, faster normalization layer that replaces the original LayerNorm in most modern LLMs.
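RMSNorm's simplicity is easy to see in code: unlike LayerNorm, it skips mean-centering and the bias term, rescaling activations only by their root-mean-square. A minimal sketch (toy input values):

```python
import numpy as np

def rmsnorm(x, gain, eps=1e-6):
    """RMSNorm: rescale by the root-mean-square of the activations.
    No mean subtraction and no bias, unlike LayerNorm."""
    rms = np.sqrt(np.mean(x ** 2) + eps)
    return (x / rms) * gain

x = np.array([1.0, -2.0, 3.0, -4.0])
y = rmsnorm(x, gain=np.ones(4))   # after normalization, RMS of y is ~1
```

Dropping the mean and bias computations saves a little work per layer, which adds up across the dozens of normalization layers in a deep transformer.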
Resources
Attention Is All You Need
Vaswani et al. (Google Brain)
The foundational 2017 paper that introduced the transformer architecture. One of the most cited papers in AI history.
3Blue1Brown: But what is a GPT? Visual intro to Transformers
3Blue1Brown
An excellent visual explanation of how transformers and attention work, starting from first principles. Perfect for building intuition.
Training Compute-Optimal Large Language Models (Chinchilla)
Hoffmann et al. (Google DeepMind)
The influential 2022 paper that reshaped how the industry thinks about scaling by demonstrating the importance of balancing model size with training data.
The Llama 4 Collection of Models
Meta AI
Meta's technical blog post on the Llama 4 model family, including details on their Mixture of Experts architecture and 10M token context windows.
Key Takeaways
1. The transformer architecture, introduced in 2017, is the foundation of all modern LLMs. Its key innovation — self-attention — allows models to process all tokens in parallel and capture long-range context.
2. Most frontier LLMs use decoder-only architectures, generating text one token at a time via next-token prediction.
3. Tokenization (BPE, SentencePiece) converts text into sub-word units. Understanding tokenization explains many LLM quirks and why costs are measured in tokens.
4. Context windows define how much text a model can process at once — ranging from 128K tokens (GPT-4o) to 10M tokens (Llama 4 Scout).
5. LLM training is a three-stage process: pretraining on internet-scale data, supervised fine-tuning for instruction following, and alignment via RLHF or Constitutional AI.
6. Scaling laws show that performance improves predictably with more data, parameters, and compute — but with diminishing returns, driving the industry toward efficiency innovations.
7. Mixture of Experts (MoE) is the dominant architectural trend, offering large model capacity with efficient inference by activating only a subset of parameters per token.