Module 3 of 5 · Expert · 40 min

Reasoning & Planning

Chain-of-thought, reasoning models, Monte Carlo Tree Search, the frontier of AI reasoning.

The Quest for Deeper Thinking in AI

Early large language models were impressive pattern matchers, but they struggled with tasks that required genuine multi-step reasoning — mathematical proofs, complex code debugging, strategic planning, and scientific analysis. The development of reasoning models represents a fundamental shift: instead of generating answers in a single forward pass, these models are trained to "think" — to break problems into steps, explore solution paths, and verify their own work before producing a final answer.

This module explores how reasoning models work, the techniques behind them, and where the frontier of AI reasoning stands in 2026.

Chain-of-Thought Prompting: Where It Started

The reasoning revolution began with a deceptively simple observation. In 2022, Wei et al. at Google showed that prompting a model with worked examples of step-by-step reasoning dramatically improved LLM performance on math and logic problems; Kojima et al. showed soon after that simply appending "Let's think step by step" recovers much of the benefit with no examples at all. This technique — chain-of-thought (CoT) prompting — changes nothing about the model itself. It simply encourages the model to generate intermediate reasoning steps before committing to an answer.

Why does this work? When a model generates tokens step by step, each step becomes part of the context for the next step. The model essentially gets to "use paper" — externalizing its reasoning into text that it can then reference. Without chain-of-thought, the model must solve the entire problem in the latent computation of a single forward pass, which has limited depth.

Zero-Shot vs. Few-Shot CoT
Zero-shot CoT simply appends "Let's think step by step" to the prompt. Few-shot CoT provides examples of step-by-step reasoning for similar problems. Few-shot CoT generally performs better because it demonstrates the reasoning style you want, but zero-shot CoT is surprisingly effective and requires no example engineering.
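The two prompt styles above differ only in how the prompt is assembled. A minimal sketch (the helper names and the exemplar are invented for illustration; `ask_llm`-style client calls are deliberately left out):

```python
# Sketch of zero-shot vs. few-shot CoT prompt construction.
# Only the prompt text is built here; plug it into whatever client you use.

COT_TRIGGER = "Let's think step by step."

def zero_shot_cot(question: str) -> str:
    """Zero-shot CoT: append the trigger phrase to a bare question."""
    return f"Q: {question}\nA: {COT_TRIGGER}"

def few_shot_cot(question: str, exemplars: list[tuple[str, str]]) -> str:
    """Few-shot CoT: prepend worked examples whose answers show the
    step-by-step reasoning style you want the model to imitate."""
    parts = [f"Q: {q}\nA: {a}" for q, a in exemplars]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

exemplar = (
    "Roger has 5 tennis balls. He buys 2 cans of 3 balls each. How many now?",
    "Roger starts with 5 balls. 2 cans of 3 balls is 6 balls. 5 + 6 = 11. "
    "The answer is 11.",
)
prompt = few_shot_cot("A baker has 3 trays of 12 rolls. How many rolls?", [exemplar])
print(prompt.count("Q:"))  # → 2 (one exemplar + the target question)
```

Note the asymmetry in effort: zero-shot CoT costs one fixed phrase, while few-shot CoT requires curating exemplars whose reasoning style matches your task.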

Beyond Linear Reasoning: Tree-of-Thought

Chain-of-thought is linear — the model follows one reasoning path. But many problems benefit from exploring multiple paths and backtracking when a path leads to a dead end. Tree-of-Thought (ToT), introduced by Yao et al. in 2023, addresses this by having the model generate multiple possible next steps at each point, evaluate them, and pursue the most promising paths.

Think of it as the difference between walking down a single corridor and exploring a maze. ToT enables the model to:

  • Generate several candidate reasoning steps
  • Evaluate which steps are most promising
  • Backtrack from dead ends and try alternatives
  • Combine insights from different branches

ToT dramatically improves performance on problems that require search and planning, such as puzzle-solving, game strategy, and creative problem-solving. However, it is significantly more computationally expensive than linear chain-of-thought.
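The generate–evaluate–backtrack loop can be sketched as a best-first search: a proposer suggests candidate next steps, a cheap heuristic scores them, and a priority queue lets the search abandon a branch and return to the most promising unexplored one. This toy (reaching a target number via a few operations, with invented function names) stands in for reasoning steps; it is not the original ToT implementation:

```python
import heapq

def propose_steps(value):
    """Candidate next 'reasoning steps' from the current state."""
    return [("+3", value + 3), ("*2", value * 2), ("-1", value - 1)]

def score(value, target):
    """Cheap heuristic evaluator: closer to the target is more promising."""
    return -abs(target - value)

def tree_of_thought(start, target, max_expansions=200):
    # Frontier entries: (cost, value, path). Popping the lowest cost
    # naturally backtracks to the best unexplored branch anywhere in the tree.
    frontier = [(-score(start, target), start, [])]
    seen = {start}
    for _ in range(max_expansions):
        if not frontier:
            break
        _, value, path = heapq.heappop(frontier)
        if value == target:
            return path  # sequence of steps that reaches the goal
        for op, nxt in propose_steps(value):
            if nxt not in seen and 0 <= nxt <= 10 * target:  # prune the space
                seen.add(nxt)
                heapq.heappush(frontier, (-score(nxt, target), nxt, path + [op]))
    return None

print(tree_of_thought(2, 14))  # a list of operations reaching 14 from 2
```

In a real ToT system the proposer and the evaluator are both LLM calls (sample several continuations, ask the model to rate them), which is exactly why ToT costs so much more than a single linear chain.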

Reasoning Models: The Test-Time Compute Revolution

The biggest leap in AI reasoning came with the realization that you can trade inference-time compute for better answers. Instead of making a model smarter by making it larger (more parameters, more training), you make it smarter by letting it think longer at inference time. This is the core idea behind reasoning models.

The Frontier Reasoning Models

Model | Provider | Key Characteristics
o3 / o3-pro | OpenAI | OpenAI's dedicated reasoning series. Uses extended internal chain-of-thought with verification loops. o3-pro trades additional latency for higher accuracy on the hardest problems.
GPT-5.4 Thinking | OpenAI | Integrated reasoning mode within GPT-5.4. Can dynamically allocate more compute to harder sub-problems within a conversation.
Claude Extended Thinking | Anthropic | Anthropic's approach to reasoning, where Claude generates an extended thinking trace before responding. Users can see the summarized reasoning process for transparency.
Magistral | Mistral AI | Mistral's reasoning-focused model. Competitive on math and coding benchmarks with strong multilingual reasoning capabilities.

How Reasoning Models Work

Reasoning models differ from standard LLMs in several fundamental ways:

  1. Extended internal chain-of-thought: When given a problem, the model generates a long internal reasoning trace — often thousands of tokens — before producing its final answer. This trace includes hypothesis formation, step-by-step work, self-correction, and verification.
  2. Test-time compute scaling: Harder problems automatically trigger longer reasoning chains. The model learns to allocate more "thinking time" to problems that require it. This is fundamentally different from standard models, which spend the same compute on easy and hard problems.
  3. Training with process supervision: Rather than only rewarding the final answer (outcome supervision), reasoning models are trained with process reward models (PRMs) that evaluate each step of the reasoning chain. This teaches the model to reason correctly, not just get lucky with the right answer.
  4. Self-verification: The best reasoning models learn to check their own work — re-reading the problem, verifying intermediate steps, and catching errors before committing to a final answer.

When to Use Reasoning Models
Reasoning models excel at math, coding, logic puzzles, scientific analysis, and any task requiring multi-step deduction. But they are slower and more expensive than standard models. For simple tasks like summarization, translation, or casual conversation, a standard model is faster and equally capable. Match the model to the task.
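The core tradeoff — spend more inference compute, get more accuracy — can be seen even in a toy setting. Here a deliberately noisy "solver" (invented for this sketch) gets a budget of attempts, and a cheap verifier keeps the first attempt that checks out; accuracy climbs with budget, mimicking how reasoning models convert extra thinking tokens into reliability:

```python
import random

def noisy_solver(a, b, rng):
    """Stand-in for a fallible model: returns a*b, but slips 40% of the time."""
    answer = a * b
    return answer + rng.choice([-1, 1]) if rng.random() < 0.4 else answer

def verify(a, b, answer):
    """Cheap verifier. Here we can check exactly; in practice this role is
    played by test suites, proof checkers, or self-verification."""
    return answer == a * b

def solve(a, b, budget, seed=0):
    """Spend up to `budget` samples on the problem; more budget -> more accuracy."""
    rng = random.Random(seed)
    for _ in range(budget):
        candidate = noisy_solver(a, b, rng)
        if verify(a, b, candidate):
            return candidate
    return candidate  # best effort: last attempt

def success_rate(budget, trials=1000):
    return sum(solve(7, 8, budget, seed=i) == 56 for i in range(trials)) / trials

print(success_rate(1), success_rate(5))  # accuracy climbs with budget
```

With one sample the solver succeeds about 60% of the time; with a budget of five it fails only when all five attempts slip, so success approaches 99%. Real reasoning models internalize this tradeoff instead of exposing it as resampling, but the economics are the same.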

Process Reward Models (PRMs)

Process reward models are a critical training innovation for reasoning systems. Traditional reward models in RLHF evaluate only the final output — did the model get the right answer? PRMs instead evaluate each step of the reasoning process.

Consider a math problem where the correct answer is 42. A model might arrive at 42 through correct reasoning or through a series of errors that coincidentally cancel out. Outcome-based reward would give both approaches the same score. A PRM would correctly reward the sound reasoning and penalize the lucky errors.

Training with PRMs produces models that:

  • Make fewer reasoning errors at each step
  • Are more reliable on problems outside their training distribution
  • Generate reasoning chains that humans can actually follow and verify
  • Self-correct more effectively, because they have learned what correct reasoning looks like
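The "right answer for the wrong reasons" case above can be made concrete. In this sketch each step is pre-labeled as sound or not (a real PRM would *predict* those labels); outcome reward cannot tell the two chains apart, while the process reward penalizes the chain whose errors merely cancel:

```python
# Two reasoning chains that both end in 42: one sound, one lucky.
sound_chain = [
    ("6 * 7 = 42", True),
    ("answer: 42", True),
]
lucky_chain = [
    ("6 * 7 = 40", False),   # arithmetic slip
    ("40 + 2 = 42", False),  # unjustified fix that happens to cancel it
    ("answer: 42", True),
]

def outcome_reward(chain, correct_answer="42"):
    """Outcome supervision: only the final line matters."""
    return 1.0 if chain[-1][0].endswith(correct_answer) else 0.0

def process_reward(chain):
    """Process supervision: score every step, so one bad step drags
    the whole chain down. (Here step labels are given; a trained PRM
    would predict them.)"""
    step_scores = [1.0 if ok else 0.0 for _, ok in chain]
    return sum(step_scores) / len(step_scores)

print(outcome_reward(sound_chain), outcome_reward(lucky_chain))  # → 1.0 1.0
print(process_reward(sound_chain), process_reward(lucky_chain))
```

Under outcome reward both chains score 1.0; under the process reward the lucky chain scores 1/3, which is exactly the training signal that teaches models to reason correctly rather than get lucky.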

Monte Carlo Tree Search Applied to LLMs

Monte Carlo Tree Search (MCTS) — the technique that powered AlphaGo's victory over world champion Go players — is being adapted for language model reasoning. The idea is to treat reasoning as a search problem:

  1. Selection: Choose a promising reasoning path to explore further, guided by a value function that estimates how likely the current partial solution is to lead to a correct answer.
  2. Expansion: Generate the next reasoning step, creating new nodes in the search tree.
  3. Simulation: Roll out the reasoning to completion (potentially with a cheaper/faster model) to estimate the quality of this path.
  4. Backpropagation: Update the value estimates of all nodes along the path based on the outcome.

This approach allows reasoning models to systematically explore the space of possible solutions rather than committing to a single reasoning chain. It is particularly powerful for problems with clear verification criteria — such as math (check the answer), code (run the tests), and formal logic (verify the proof).
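The four phases above can be sketched in a compact UCT-style loop. The toy task (choose four digits from 1–3 summing to 9, with an exact verifier on complete sequences) stands in for a reasoning problem with a hard verification criterion; all names are invented for this sketch, and a real system would use LLM calls for expansion and rollout:

```python
import math, random

DIGITS, LENGTH, TARGET = (1, 2, 3), 4, 9
rng = random.Random(0)

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = [], 0, 0.0

def verified(state):
    """The verifier: only complete, correct sequences earn reward."""
    return len(state) == LENGTH and sum(state) == TARGET

def uct(node, c=1.4):
    """Upper-confidence score balancing exploitation and exploration."""
    return node.value / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits)

def rollout(state):
    """Simulation: complete the sequence cheaply, score with the verifier."""
    while len(state) < LENGTH:
        state = state + (rng.choice(DIGITS),)
    return 1.0 if verified(state) else 0.0

def mcts(iterations=300):
    root = Node(())
    for _ in range(iterations):
        node = root
        # 1. Selection: descend through fully-visited nodes via UCT.
        while node.children and all(ch.visits for ch in node.children):
            node = max(node.children, key=uct)
        # 2. Expansion: add a child per candidate next step.
        if not node.children and len(node.state) < LENGTH:
            node.children = [Node(node.state + (d,), node) for d in DIGITS]
        if node.children:
            unvisited = [ch for ch in node.children if ch.visits == 0]
            node = unvisited[0] if unvisited else max(node.children, key=uct)
        # 3. Simulation.
        reward = rollout(node.state)
        # 4. Backpropagation: update values along the path to the root.
        while node:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Read off the most-visited path as the chosen "reasoning chain".
    best, node = [], root
    while node.children:
        node = max(node.children, key=lambda ch: ch.visits)
        best.append(node.state[-1])
    return best

print(mcts())  # a 4-digit sequence; with enough iterations it sums to 9
```

Because the verifier gives a clean 0/1 reward, the search concentrates visits on branches that keep a valid total reachable — the same mechanism that makes test suites and proof checkers so valuable for search-based reasoning.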

The Verifier Advantage
MCTS-style reasoning is most powerful when you have a reliable verifier. In code generation, you can run tests. In math, you can check the answer. In formal proofs, you can verify logical validity. These verification signals allow the search process to efficiently prune wrong paths. For more open-ended tasks where verification is harder, the advantage of search-based reasoning is more limited.

The Frontier of AI Reasoning

Reasoning models are pushing into domains previously considered out of reach for AI:

Mathematical Proofs

AI systems are increasingly capable of solving competition-level mathematics and assisting with research-level proof development. Systems like AlphaProof (Google DeepMind) have made headway on International Mathematical Olympiad problems. The combination of language models for intuition with formal proof assistants like Lean for verification is a particularly promising direction.
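What makes Lean so useful as a verification partner is that acceptance by its kernel is a binary, trustworthy signal: a candidate proof either type-checks or it doesn't, so a search process gets exactly the kind of reward discussed above. A minimal Lean 4 sketch (both lemmas are from Lean's core library):

```lean
-- A machine-checkable statement: once the proof term is accepted by the
-- kernel, correctness is guaranteed.
theorem sum_comm (a b : Nat) : a + b = b + a := Nat.add_comm a b

-- A wrong candidate proof is simply rejected by the checker, giving the
-- search a clean pass/fail reward.
example (n : Nat) : n + 0 = n := Nat.add_zero n
```

In the LLM-plus-Lean loop, the language model proposes proof steps from intuition and the checker prunes anything unsound — division of labor between fuzzy generation and exact verification.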

Code Generation and Debugging

Reasoning models have dramatically improved code generation quality. By thinking through the problem, planning the approach, writing the code, and then mentally tracing through test cases, reasoning models achieve pass rates on coding benchmarks that far exceed standard models. They are particularly strong at debugging — reasoning about why code fails and systematically identifying the root cause.

Scientific Discovery

Reasoning models are being applied to scientific tasks: analyzing experimental data, generating and evaluating hypotheses, and synthesizing findings across papers. While they are not yet making independent discoveries, they are increasingly valuable as research assistants that can process and reason about vast amounts of scientific literature.

Limitations of Current Reasoning Approaches

Despite impressive progress, important limitations remain:

  • Faithfulness of reasoning chains: The reasoning trace a model generates may not accurately reflect its actual internal computation. Models can produce plausible-sounding reasoning that leads to a conclusion they were going to reach anyway — a form of post-hoc rationalization rather than genuine reasoning.
  • Compositional generalization: Reasoning models struggle with problems that require combining known concepts in truly novel ways. They can solve problems that resemble their training data, but novel problem structures remain challenging.
  • Cost and latency: Reasoning models can be 10-100x more expensive than standard models due to the extended thinking tokens. For applications requiring real-time responses, this is a significant constraint.
  • Planning horizon: Current models struggle with tasks requiring very long-horizon planning — sequences of dozens or hundreds of dependent steps. They can plan a few steps ahead reliably but degrade over longer horizons.
  • Brittleness: Small changes in problem phrasing can dramatically affect reasoning performance. Models can solve a problem presented one way but fail when the same problem is rephrased.

Key Takeaways

  1. Chain-of-thought prompting was the breakthrough that showed LLMs can reason better when they 'show their work' — generating intermediate steps rather than jumping to answers.
  2. Reasoning models (o3, GPT-5.4 Thinking, Claude Extended Thinking, Magistral) trade test-time compute for accuracy — they think longer on harder problems.
  3. Process reward models (PRMs) evaluate each reasoning step rather than just the final answer, producing more reliable and transparent reasoning.
  4. Monte Carlo Tree Search applied to LLMs enables systematic exploration of solution spaces, and is most powerful when combined with verifiers (test suites, proof checkers).
  5. Current limitations include unfaithful reasoning chains, poor compositional generalization, high cost/latency, and brittleness to problem rephrasing.
  6. The frontier of AI reasoning extends to mathematical proofs, complex code generation, and scientific discovery, but genuine long-horizon planning remains an open challenge.

Test Your Understanding

Module Assessment

5 questions · Score 70% or higher to complete this module

You can retake the quiz as many times as you need. Your best score is saved.
