LLM Architecture Deep Dive
Transformer architecture, attention mechanisms, tokenization, context windows, and scaling laws.
Understanding the Engine Behind AI
Large language models (LLMs) are the technology powering ChatGPT, Claude, Gemini, and virtually every generative AI application you use today. But how do they actually work? Understanding LLM architecture isn't just academic — it's practical knowledge that helps you make better decisions about which models to use, why they behave the way they do, and how to get the most out of them.
This module takes you inside the transformer architecture, explains how models are trained, and covers the key concepts — from attention mechanisms to scaling laws — that define the current generation of AI.
The Transformer Architecture
Every modern LLM is built on the transformer architecture, introduced in the landmark 2017 paper "Attention Is All You Need" by Vaswani et al. at Google. Before transformers, language models relied on recurrent neural networks (RNNs) such as LSTMs, which processed text sequentially — one token at a time. This made them slow to train and limited in their ability to capture long-range relationships in text.
The transformer solved this by introducing a mechanism called self-attention, which allows the model to look at all parts of the input simultaneously. This parallelism made transformers dramatically faster to train and far more capable at understanding context.
Encoder-Decoder Structure
The original transformer has two main components:
- Encoder: Reads and processes the input text, building a rich representation of its meaning. Each layer of the encoder refines this understanding by attending to different parts of the input.
- Decoder: Generates output text token by token, using both the encoder's representation and the tokens it has already generated.
Modern LLMs typically use variations of this structure:
| Architecture | How It Works | Examples |
|---|---|---|
| Encoder-only | Processes input to create embeddings and representations. Great for understanding and classification tasks. | BERT, RoBERTa |
| Decoder-only | Generates text autoregressively (one token at a time). The dominant architecture for generative AI. | GPT-4o, Claude 4.6, Llama 4, Gemini 3.1 |
| Encoder-decoder | Uses the full original architecture. Good for tasks that transform input to output (translation, summarization). | T5, BART, original Transformer |
Self-Attention: The Core Innovation
Self-attention is the mechanism that allows transformers to understand context. For every token in the input, self-attention computes a weighted relationship to every other token. This means the model can understand that in the sentence "The bank by the river was steep," the word "bank" relates to "river" and "steep," not to finance.
How Self-Attention Works (Simplified)
For each token, the model creates three vectors:
- Query (Q): "What am I looking for?"
- Key (K): "What do I contain?"
- Value (V): "What information do I provide?"
The model computes attention scores by comparing each token's query against every other token's key. High scores mean strong relationships. These scores are then used to create a weighted sum of the values, producing a context-aware representation of each token.
The attention formula:
Attention(Q, K, V) = softmax(QKᵀ / √d_k) V
The division by √d_k (square root of the key dimension) prevents the dot products from growing too large, which would push the softmax into regions with tiny gradients.
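The formula above fits in a few lines of NumPy. This is a toy sketch for intuition — random vectors, tiny dimensions, no masking or batching — not a production implementation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for a single attention head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (seq, seq) pairwise similarities
    scores -= scores.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ V, weights                     # weighted sum of values

# Three toy "tokens", each with a 4-dimensional Q, K, and V vector
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out, attn_weights = scaled_dot_product_attention(Q, K, V)
```

Each row of `attn_weights` is one token's distribution of attention over all tokens, and `out` is the resulting context-aware representation.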
Multi-Head Attention
Rather than computing attention once, transformers use multi-head attention — they run several attention computations in parallel, each with different learned weight matrices. This allows the model to attend to different types of relationships simultaneously. One head might focus on syntactic relationships (subject-verb agreement), another on semantic relationships (topic coherence), and another on positional patterns.
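Multi-head attention is just the single-head computation repeated in parallel on slices of the model dimension, then concatenated and mixed by an output projection. A minimal sketch with made-up toy dimensions (8-dimensional model, 2 heads) and random weights:

```python
import numpy as np

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads):
    """Split d_model into n_heads slices, attend independently in each, recombine."""
    seq, d_model = x.shape
    d_head = d_model // n_heads
    Q, K, V = x @ Wq, x @ Wk, x @ Wv

    def split_heads(t):
        # (seq, d_model) -> (n_heads, seq, d_head)
        return t.reshape(seq, n_heads, d_head).transpose(1, 0, 2)

    Qh, Kh, Vh = split_heads(Q), split_heads(K), split_heads(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)  # per-head attention scores
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    heads = w @ Vh                                          # (n_heads, seq, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq, d_model) # concatenate heads
    return concat @ Wo                                      # output projection

rng = np.random.default_rng(1)
d_model, n_heads, seq = 8, 2, 5
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
x = rng.normal(size=(seq, d_model))
y = multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads)
```

Because each head sees a different learned projection of the input, the heads are free to specialize in different relationships, as described above.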
Tokenization
Before text reaches the transformer, it must be converted into numerical tokens. Tokenization is the process of breaking text into sub-word units that the model can process. Understanding tokenization helps explain many LLM behaviors, including why they sometimes struggle with spelling, counting, or character-level operations.
Common Tokenization Methods
- Byte Pair Encoding (BPE): The most widely used method. Starts with individual characters and iteratively merges the most frequent pairs. Used by GPT models and many others. The word "unhappiness" might become ["un", "happiness"] or ["un", "happ", "iness"] depending on the vocabulary.
- SentencePiece: A language-agnostic tokenizer that works directly on raw text (including spaces). Used by Llama, T5, and many multilingual models. It handles languages without whitespace-separated words (like Japanese and Chinese) more naturally.
- WordPiece: Similar to BPE but uses a slightly different merging criterion. Used by BERT and some Google models.
Tokenization example:
Input: "Tokenization is fascinating!"
Tokens: ["Token", "ization", " is", " fascinating", "!"]
IDs: [5765, 2065, 318, 22177, 0]
Common English words often map to a single token. Rare or compound words get split into sub-words. A rough rule of thumb: 1 token ≈ 0.75 English words.
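The core of BPE training can be sketched in a few lines of plain Python: repeatedly find the most frequent adjacent pair and merge it into one token. This is an illustrative toy (real tokenizers train over a large corpus and keep a learned merge table), using the "unhappiness" example from above:

```python
from collections import Counter

def most_frequent_pair(tokens):
    """One BPE training step: find the most common adjacent token pair."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# BPE starts from individual characters and merges upward
tokens = list("unhappiness")          # ['u', 'n', 'h', 'a', 'p', ...]
for _ in range(3):                    # apply three merge steps
    tokens = merge_pair(tokens, most_frequent_pair(tokens))
```

After each merge the sequence gets shorter; with a full merge table learned from a corpus, frequent words collapse to one token while rare words stay split into sub-words.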
Context Windows
The context window is the maximum number of tokens a model can process in a single interaction — both your input and the model's output combined. It's one of the most important practical constraints when working with LLMs.
| Model | Context Window | Approximate Equivalent |
|---|---|---|
| GPT-4o | 128K tokens | ~96K words / a 300-page book |
| Claude 4.6 (Opus) | 1M tokens | ~750K words / several novels |
| Gemini 3.1 Pro | 1M tokens | ~750K words / several novels |
| Llama 4 Scout | 10M tokens | ~7.5M words / an entire library shelf |
Larger context windows enable use cases like analyzing entire codebases, processing lengthy legal documents, or maintaining very long conversations. However, there are trade-offs: larger context windows increase compute costs (attention scales quadratically with sequence length), and models can sometimes lose focus on information in the middle of very long contexts — a phenomenon known as "lost in the middle."
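The quadratic cost mentioned above is easy to see with back-of-the-envelope arithmetic: the attention score matrix has one entry per pair of tokens, so doubling the context quadruples the attention work. A toy calculation (the token counts are arbitrary examples):

```python
def attention_matrix_cells(context_tokens):
    """Attention computes a score for every (query, key) token pair,
    so the score matrix grows quadratically with context length."""
    return context_tokens ** 2

# Doubling the context from 4K to 8K tokens quadruples the attention work
short_ctx, long_ctx = 4_000, 8_000
ratio = attention_matrix_cells(long_ctx) / attention_matrix_cells(short_ctx)
```

This is why a 1M-token context is not simply "8× more expensive" than a 128K one in the attention layers — naively, it is closer to 64× — and why efficiency techniques like Flash Attention (covered later in this module) matter so much.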
How LLMs Are Trained
Training a large language model is a multi-stage process that transforms a randomly initialized neural network into a capable AI system. Each stage has a distinct purpose.
Stage 1: Pretraining (Next Token Prediction)
The foundational stage. The model is trained on massive amounts of text data — web pages, books, code repositories, academic papers — with a deceptively simple objective: predict the next token. Given a sequence of tokens, the model learns to predict what comes next.
This simple objective, at scale, produces remarkable emergent abilities. By learning to predict text well, the model implicitly learns grammar, facts, reasoning patterns, coding conventions, and much more. Pretraining typically requires thousands of GPUs running for weeks or months and costs tens to hundreds of millions of dollars for frontier models.
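The pretraining objective itself is just cross-entropy on the next token: the loss is the negative log of the probability the model assigned to the token that actually came next. A minimal sketch with a made-up four-token vocabulary:

```python
import math

def next_token_loss(probs, target_id):
    """Cross-entropy for one prediction: -log p(correct next token)."""
    return -math.log(probs[target_id])

# Toy model output: a probability distribution over a 4-token vocabulary
probs = [0.1, 0.7, 0.1, 0.1]

loss_confident = next_token_loss(probs, target_id=1)  # true token got p = 0.7
loss_surprised = next_token_loss(probs, target_id=0)  # true token got p = 0.1
```

Training pushes the model to assign high probability to the actual continuation, so `loss_confident` is much lower than `loss_surprised`. Summed over trillions of tokens, minimizing this one number is the entire pretraining signal.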
Stage 2: Supervised Fine-Tuning (SFT)
After pretraining, the model is a powerful text predictor but not a good assistant. It might continue a question with another question instead of answering it. Supervised fine-tuning trains the model on curated examples of desired input-output behavior — questions paired with high-quality answers, instructions paired with correct completions. This teaches the model to be helpful and follow instructions.
Stage 3: Alignment (RLHF and Beyond)
The final stage aligns the model with human values and preferences. The most common approach is Reinforcement Learning from Human Feedback (RLHF):
- Human raters compare multiple model outputs and rank them by quality.
- A reward model is trained on these rankings to predict human preferences.
- The LLM is then fine-tuned using reinforcement learning (typically PPO) to maximize the reward model's score. A popular alternative, Direct Preference Optimization (DPO), skips the explicit reward model and optimizes directly on the ranked preference pairs.
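The reward model in step 2 is typically trained with a Bradley-Terry pairwise loss: it is penalized whenever it scores the human-preferred response lower than the rejected one. A minimal sketch with made-up reward scores:

```python
import math

def preference_loss(r_chosen, r_rejected):
    """Bradley-Terry pairwise loss for reward-model training:
    -log sigmoid(r_chosen - r_rejected).
    Small when the preferred response scores higher, large when it doesn't."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

loss_agrees = preference_loss(2.0, 0.5)     # reward model agrees with human raters
loss_disagrees = preference_loss(0.5, 2.0)  # reward model contradicts the ranking
```

Minimizing this loss over many ranked pairs teaches the reward model to predict human preferences, which then serves as the training signal for the RL step.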
Anthropic pioneered an alternative approach called Constitutional AI (CAI), where the model is trained to follow a set of principles (a "constitution") and uses AI-generated feedback rather than relying entirely on human labelers. This makes the alignment process more scalable and transparent — you can read and debate the principles the model is trained to follow.
Scaling Laws
One of the most important discoveries in AI research is that LLM performance follows predictable scaling laws. A series of papers — most notably from OpenAI (Kaplan et al., 2020) and DeepMind's Chinchilla paper (Hoffmann et al., 2022) — showed that model performance improves as a power law function of three variables:
- Model size (number of parameters)
- Dataset size (amount of training data)
- Compute budget (training FLOPs)
The Chinchilla paper's key insight was that many models were "undertrained" — using too many parameters for the amount of data they saw. The optimal approach is to scale data and parameters together. A well-trained smaller model can outperform a poorly trained larger one.
However, scaling comes with diminishing returns. Each doubling of compute yields a smaller improvement. This has driven the industry toward finding efficiency gains — better architectures, better training data, and techniques like mixture of experts — rather than relying solely on scale.
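Two widely cited rules of thumb make the Chinchilla result concrete: train on roughly 20 tokens per parameter, and estimate training compute as about 6 FLOPs per parameter per token. Sketched as arithmetic (the 70B example is illustrative):

```python
def chinchilla_optimal_tokens(n_params):
    """Chinchilla rule of thumb: train on roughly 20 tokens per parameter."""
    return 20 * n_params

def training_flops(n_params, n_tokens):
    """Common estimate: ~6 FLOPs per parameter per training token."""
    return 6 * n_params * n_tokens

n_params = 70e9                                    # a 70B-parameter model
n_tokens = chinchilla_optimal_tokens(n_params)     # ~1.4 trillion tokens
flops = training_flops(n_params, n_tokens)         # total training compute
```

By this rule, a 70B-parameter model "wants" about 1.4 trillion training tokens — which is exactly why a smaller model trained on enough data can beat a larger, undertrained one.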
Model Sizes and Parameters
A model's parameter count is a rough proxy for its capacity. Parameters are the learned numerical weights in the neural network.
| Size Category | Parameters | Use Cases | Examples |
|---|---|---|---|
| Small | 1B – 8B | On-device, edge, simple tasks | Gemma 3 (1B/4B), Phi-4-mini (3.8B) |
| Medium | 8B – 70B | General-purpose, good quality/cost ratio | Llama 4 Scout (17B active), Llama 4 Maverick (17B active), Qwen 2.5 72B |
| Large / Frontier | 100B+ | Maximum capability, complex reasoning | GPT-4o, Claude 4.6 Opus, Gemini 3.1 Pro |
Mixture of Experts (MoE)
Mixture of Experts is an architecture that has become increasingly prominent. Instead of activating the entire neural network for every token, MoE models have multiple "expert" sub-networks and a routing mechanism that selects only a subset of experts for each token.
How MoE Works
- The model contains many expert feed-forward networks (e.g., 16, 128, or more).
- A gating/router network decides which experts to activate for each token (typically 2–8 experts out of the total pool).
- Only the selected experts process the token, so the active parameter count per token is much smaller than the total parameter count.
MoE in practice:
Llama 4 Maverick:
- Total parameters: ~400B
- Active parameters per token: ~17B (using 2 of 128 experts)
- Result: near-frontier quality at a fraction of the compute cost

Llama 4 Scout:
- Total parameters: ~109B
- Active parameters per token: ~17B (16 experts, only a few active per token)
- Context window: 10M tokens

Mistral Large (2025):
- Also uses an MoE architecture, efficiently routing tokens to specialized expert networks
The advantage of MoE is efficiency: you get the knowledge capacity of a very large model with the inference cost of a much smaller one. The trade-off is that total memory requirements remain high (all experts must be loaded), which can make deployment more challenging.
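The routing step described above — score every expert, keep the top-k, renormalize their gate weights — can be sketched in NumPy. The dimensions and random router weights here are toy values for illustration:

```python
import numpy as np

def moe_route(token_vec, router_W, k=2):
    """Score every expert for this token, keep the top-k,
    and renormalize their gate weights to sum to 1."""
    logits = token_vec @ router_W            # one score per expert
    top_k = np.argsort(logits)[-k:]          # indices of the k highest-scoring experts
    gates = np.exp(logits[top_k])            # softmax over the selected experts only
    gates /= gates.sum()
    return top_k, gates

rng = np.random.default_rng(2)
n_experts, d_model = 8, 16
router_W = rng.normal(size=(d_model, n_experts))
token_vec = rng.normal(size=d_model)

chosen_experts, gate_weights = moe_route(token_vec, router_W, k=2)
```

Only the `chosen_experts` then process the token, and their outputs are combined using `gate_weights` — which is how a model with hundreds of billions of total parameters can run each token through only ~17B of them.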
Key Architectural Innovations Beyond the Original Transformer
Since 2017, researchers have introduced many improvements to the base transformer:
- Rotary Position Embeddings (RoPE): A more effective way to encode token positions, enabling better extrapolation to longer sequences. Used by Llama, Qwen, and many modern models.
- Flash Attention: An exact attention algorithm that is significantly faster and more memory-efficient by optimizing GPU memory access patterns. Now standard in most training and inference stacks.
- Grouped Query Attention (GQA): Reduces memory overhead by sharing key-value heads across groups of query heads, enabling faster inference with minimal quality loss.
- SwiGLU activation: A non-linearity that consistently outperforms the original ReLU activation in transformers.
- RMSNorm: A simpler, faster normalization layer that replaces the original LayerNorm in most modern LLMs.
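RMSNorm's simplicity is easy to see in code: unlike LayerNorm, it skips mean-centering and the bias term, rescaling activations only by their root-mean-square. A minimal sketch (toy input values):

```python
import numpy as np

def rmsnorm(x, gain, eps=1e-6):
    """RMSNorm: rescale by the root-mean-square of the activations.
    No mean subtraction and no bias, unlike LayerNorm."""
    rms = np.sqrt(np.mean(x ** 2) + eps)
    return (x / rms) * gain

x = np.array([1.0, -2.0, 3.0, -4.0])
y = rmsnorm(x, gain=np.ones(4))   # after normalization, RMS of y is ~1
```

Dropping the mean and bias computations saves a little work per layer, which adds up across the dozens of normalization layers in a deep transformer.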
Resources
Attention Is All You Need
Vaswani et al. (Google Brain)
The foundational 2017 paper that introduced the transformer architecture. One of the most cited papers in AI history.
3Blue1Brown: But what is a GPT? Visual intro to Transformers
3Blue1Brown
An excellent visual explanation of how transformers and attention work, starting from first principles. Perfect for building intuition.
Training Compute-Optimal Large Language Models (Chinchilla)
Hoffmann et al. (Google DeepMind)
The influential 2022 paper that reshaped how the industry thinks about scaling by demonstrating the importance of balancing model size with training data.
The Llama 4 Collection of Models
Meta AI
Meta's technical blog post on the Llama 4 model family, including details on their Mixture of Experts architecture and 10M token context windows.
Key Takeaways
1. The transformer architecture, introduced in 2017, is the foundation of all modern LLMs. Its key innovation — self-attention — allows models to process all tokens in parallel and capture long-range context.
2. Most frontier LLMs use decoder-only architectures, generating text one token at a time via next-token prediction.
3. Tokenization (BPE, SentencePiece) converts text into sub-word units. Understanding tokenization explains many LLM quirks and why costs are measured in tokens.
4. Context windows define how much text a model can process at once — ranging from 128K tokens (GPT-4o) to 10M tokens (Llama 4 Scout).
5. LLM training is a three-stage process: pretraining on internet-scale data, supervised fine-tuning for instruction following, and alignment via RLHF or Constitutional AI.
6. Scaling laws show that performance improves predictably with more data, parameters, and compute — but with diminishing returns, driving the industry toward efficiency innovations.
7. Mixture of Experts (MoE) is the dominant architectural trend, offering large model capacity with efficient inference by activating only a subset of parameters per token.