Module 3 of 5 · Expert · 40 min

Reasoning & Planning

Chain-of-thought, reasoning models, Monte Carlo Tree Search, the frontier of AI reasoning.

The Quest for Deeper Thinking in AI

Early large language models were impressive pattern matchers, but they struggled with tasks that required genuine multi-step reasoning — mathematical proofs, complex code debugging, strategic planning, and scientific analysis. The development of reasoning models represents a fundamental shift: instead of generating answers in a single forward pass, these models are trained to "think" — to break problems into steps, explore solution paths, and verify their own work before producing a final answer.

This module explores how reasoning models work, the techniques behind them, and where the frontier of AI reasoning stands in 2026.

Chain-of-Thought Prompting: Where It Started

The reasoning revolution began with a deceptively simple observation. In 2022, Wei et al. at Google showed that prompting a model with worked examples of step-by-step reasoning dramatically improved LLM performance on math and logic problems; Kojima et al. showed soon after that simply appending "Let's think step by step" recovers much of the benefit with no examples at all. This technique — chain-of-thought (CoT) prompting — changes nothing about the model itself. It simply encourages the model to generate intermediate reasoning steps before committing to an answer.

Why does this work? When a model generates tokens step by step, each step becomes part of the context for the next step. The model essentially gets to "use paper" — externalizing its reasoning into text that it can then reference. Without chain-of-thought, the model must solve the entire problem in the latent computation of a single forward pass, which has limited depth.

Zero-Shot vs. Few-Shot CoT
Zero-shot CoT simply appends "Let's think step by step" to the prompt. Few-shot CoT provides examples of step-by-step reasoning for similar problems. Few-shot CoT generally performs better because it demonstrates the reasoning style you want, but zero-shot CoT is surprisingly effective and requires no example engineering.
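The two prompt styles above differ only in how the prompt is assembled. A minimal sketch (the helper names and the exemplar are invented for illustration; `ask_llm`-style client calls are deliberately left out):

```python
# Sketch of zero-shot vs. few-shot CoT prompt construction.
# Only the prompt text is built here; plug it into whatever client you use.

COT_TRIGGER = "Let's think step by step."

def zero_shot_cot(question: str) -> str:
    """Zero-shot CoT: append the trigger phrase to a bare question."""
    return f"Q: {question}\nA: {COT_TRIGGER}"

def few_shot_cot(question: str, exemplars: list[tuple[str, str]]) -> str:
    """Few-shot CoT: prepend worked examples whose answers show the
    step-by-step reasoning style you want the model to imitate."""
    parts = [f"Q: {q}\nA: {a}" for q, a in exemplars]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

exemplar = (
    "Roger has 5 tennis balls. He buys 2 cans of 3 balls each. How many now?",
    "Roger starts with 5 balls. 2 cans of 3 balls is 6 balls. 5 + 6 = 11. "
    "The answer is 11.",
)
prompt = few_shot_cot("A baker has 3 trays of 12 rolls. How many rolls?", [exemplar])
print(prompt.count("Q:"))  # → 2 (one exemplar + the target question)
```

Note the asymmetry in effort: zero-shot CoT costs one fixed phrase, while few-shot CoT requires curating exemplars whose reasoning style matches your task.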

Beyond Linear Reasoning: Tree-of-Thought

Chain-of-thought is linear — the model follows one reasoning path. But many problems benefit from exploring multiple paths and backtracking when a path leads to a dead end. Tree-of-Thought (ToT), introduced by Yao et al. in 2023, addresses this by having the model generate multiple possible next steps at each point, evaluate them, and pursue the most promising paths.

Think of it as the difference between walking down a single corridor and exploring a maze. ToT enables the model to:

  • Generate several candidate reasoning steps
  • Evaluate which steps are most promising
  • Backtrack from dead ends and try alternatives
  • Combine insights from different branches

ToT dramatically improves performance on problems that require search and planning, such as puzzle-solving, game strategy, and creative problem-solving. However, it is significantly more computationally expensive than linear chain-of-thought.
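The generate–evaluate–backtrack loop can be sketched as a best-first search: a proposer suggests candidate next steps, a cheap heuristic scores them, and a priority queue lets the search abandon a branch and return to the most promising unexplored one. This toy (reaching a target number via a few operations, with invented function names) stands in for reasoning steps; it is not the original ToT implementation:

```python
import heapq

def propose_steps(value):
    """Candidate next 'reasoning steps' from the current state."""
    return [("+3", value + 3), ("*2", value * 2), ("-1", value - 1)]

def score(value, target):
    """Cheap heuristic evaluator: closer to the target is more promising."""
    return -abs(target - value)

def tree_of_thought(start, target, max_expansions=200):
    # Frontier entries: (cost, value, path). Popping the lowest cost
    # naturally backtracks to the best unexplored branch anywhere in the tree.
    frontier = [(-score(start, target), start, [])]
    seen = {start}
    for _ in range(max_expansions):
        if not frontier:
            break
        _, value, path = heapq.heappop(frontier)
        if value == target:
            return path  # sequence of steps that reaches the goal
        for op, nxt in propose_steps(value):
            if nxt not in seen and 0 <= nxt <= 10 * target:  # prune the space
                seen.add(nxt)
                heapq.heappush(frontier, (-score(nxt, target), nxt, path + [op]))
    return None

print(tree_of_thought(2, 14))  # a list of operations reaching 14 from 2
```

In a real ToT system the proposer and the evaluator are both LLM calls (sample several continuations, ask the model to rate them), which is exactly why ToT costs so much more than a single linear chain.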

Reasoning Models: The Test-Time Compute Revolution

The biggest leap in AI reasoning came with the realization that you can trade inference-time compute for better answers. Instead of making a model smarter by making it larger (more parameters, more training), you make it smarter by letting it think longer at inference time. This is the core idea behind reasoning models.

The Frontier Reasoning Models

Model | Provider | Key Characteristics
o3 / o3-pro | OpenAI | OpenAI's dedicated reasoning series. Uses extended internal chain-of-thought with verification loops. o3-pro trades additional latency for higher accuracy on the hardest problems.
GPT-5.4 Thinking | OpenAI | Integrated reasoning mode within GPT-5.4. Can dynamically allocate more compute to harder sub-problems within a conversation.
Claude Extended Thinking | Anthropic | Anthropic's approach to reasoning, where Claude generates an extended thinking trace before responding. Users can see the summarized reasoning process for transparency.
Magistral | Mistral AI | Mistral's reasoning-focused model. Competitive on math and coding benchmarks with strong multilingual reasoning capabilities.

How Reasoning Models Work

Reasoning models differ from standard LLMs in several fundamental ways:

  1. Extended internal chain-of-thought: When given a problem, the model generates a long internal reasoning trace — often thousands of tokens — before producing its final answer. This trace includes hypothesis formation, step-by-step work, self-correction, and verification.
  2. Test-time compute scaling: Harder problems automatically trigger longer reasoning chains. The model learns to allocate more "thinking time" to problems that require it. This is fundamentally different from standard models, which spend the same compute on easy and hard problems.
  3. Training with process supervision: Rather than only rewarding the final answer (outcome supervision), reasoning models are trained with process reward models (PRMs) that evaluate each step of the reasoning chain. This teaches the model to reason correctly, not just get lucky with the right answer.
  4. Self-verification: The best reasoning models learn to check their own work — re-reading the problem, verifying intermediate steps, and catching errors before committing to a final answer.

When to Use Reasoning Models
Reasoning models excel at math, coding, logic puzzles, scientific analysis, and any task requiring multi-step deduction. But they are slower and more expensive than standard models. For simple tasks like summarization, translation, or casual conversation, a standard model is faster and equally capable. Match the model to the task.
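The core tradeoff — spend more inference compute, get more accuracy — can be seen even in a toy setting. Here a deliberately noisy "solver" (invented for this sketch) gets a budget of attempts, and a cheap verifier keeps the first attempt that checks out; accuracy climbs with budget, mimicking how reasoning models convert extra thinking tokens into reliability:

```python
import random

def noisy_solver(a, b, rng):
    """Stand-in for a fallible model: returns a*b, but slips 40% of the time."""
    answer = a * b
    return answer + rng.choice([-1, 1]) if rng.random() < 0.4 else answer

def verify(a, b, answer):
    """Cheap verifier. Here we can check exactly; in practice this role is
    played by test suites, proof checkers, or self-verification."""
    return answer == a * b

def solve(a, b, budget, seed=0):
    """Spend up to `budget` samples on the problem; more budget -> more accuracy."""
    rng = random.Random(seed)
    for _ in range(budget):
        candidate = noisy_solver(a, b, rng)
        if verify(a, b, candidate):
            return candidate
    return candidate  # best effort: last attempt

def success_rate(budget, trials=1000):
    return sum(solve(7, 8, budget, seed=i) == 56 for i in range(trials)) / trials

print(success_rate(1), success_rate(5))  # accuracy climbs with budget
```

With one sample the solver succeeds about 60% of the time; with a budget of five it fails only when all five attempts slip, so success approaches 99%. Real reasoning models internalize this tradeoff instead of exposing it as resampling, but the economics are the same.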

Process Reward Models (PRMs)

Process reward models are a critical training innovation for reasoning systems. Traditional reward models in RLHF evaluate only the final output — did the model get the right answer? PRMs instead evaluate each step of the reasoning process.

Consider a math problem where the correct answer is 42. A model might arrive at 42 through correct reasoning or through a series of errors that coincidentally cancel out. Outcome-based reward would give both approaches the same score. A PRM would correctly reward the sound reasoning and penalize the lucky errors.

Training with PRMs produces models that:

  • Make fewer reasoning errors at each step
  • Are more reliable on problems outside their training distribution
  • Generate reasoning chains that humans can actually follow and verify
  • Self-correct more effectively, because they have learned what correct reasoning looks like
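The "right answer for the wrong reasons" case above can be made concrete. In this sketch each step is pre-labeled as sound or not (a real PRM would *predict* those labels); outcome reward cannot tell the two chains apart, while the process reward penalizes the chain whose errors merely cancel:

```python
# Two reasoning chains that both end in 42: one sound, one lucky.
sound_chain = [
    ("6 * 7 = 42", True),
    ("answer: 42", True),
]
lucky_chain = [
    ("6 * 7 = 40", False),   # arithmetic slip
    ("40 + 2 = 42", False),  # unjustified fix that happens to cancel it
    ("answer: 42", True),
]

def outcome_reward(chain, correct_answer="42"):
    """Outcome supervision: only the final line matters."""
    return 1.0 if chain[-1][0].endswith(correct_answer) else 0.0

def process_reward(chain):
    """Process supervision: score every step, so one bad step drags
    the whole chain down. (Here step labels are given; a trained PRM
    would predict them.)"""
    step_scores = [1.0 if ok else 0.0 for _, ok in chain]
    return sum(step_scores) / len(step_scores)

print(outcome_reward(sound_chain), outcome_reward(lucky_chain))  # → 1.0 1.0
print(process_reward(sound_chain), process_reward(lucky_chain))
```

Under outcome reward both chains score 1.0; under the process reward the lucky chain scores 1/3, which is exactly the training signal that teaches models to reason correctly rather than get lucky.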

Monte Carlo Tree Search Applied to LLMs

Monte Carlo Tree Search (MCTS) — the technique that powered AlphaGo's victory over world champion Go players — is being adapted for language model reasoning. The idea is to treat reasoning as a search problem:

  1. Selection: Choose a promising reasoning path to explore further, guided by a value function that estimates how likely the current partial solution is to lead to a correct answer.
  2. Expansion: Generate the next reasoning step, creating new nodes in the search tree.
  3. Simulation: Roll out the reasoning to completion (potentially with a cheaper/faster model) to estimate the quality of this path.
  4. Backpropagation: Update the value estimates of all nodes along the path based on the outcome.

This approach allows reasoning models to systematically explore the space of possible solutions rather than committing to a single reasoning chain. It is particularly powerful for problems with clear verification criteria — such as math (check the answer), code (run the tests), and formal logic (verify the proof).
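The four phases above can be sketched in a compact UCT-style loop. The toy task (choose four digits from 1–3 summing to 9, with an exact verifier on complete sequences) stands in for a reasoning problem with a hard verification criterion; all names are invented for this sketch, and a real system would use LLM calls for expansion and rollout:

```python
import math, random

DIGITS, LENGTH, TARGET = (1, 2, 3), 4, 9
rng = random.Random(0)

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = [], 0, 0.0

def verified(state):
    """The verifier: only complete, correct sequences earn reward."""
    return len(state) == LENGTH and sum(state) == TARGET

def uct(node, c=1.4):
    """Upper-confidence score balancing exploitation and exploration."""
    return node.value / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits)

def rollout(state):
    """Simulation: complete the sequence cheaply, score with the verifier."""
    while len(state) < LENGTH:
        state = state + (rng.choice(DIGITS),)
    return 1.0 if verified(state) else 0.0

def mcts(iterations=300):
    root = Node(())
    for _ in range(iterations):
        node = root
        # 1. Selection: descend through fully-visited nodes via UCT.
        while node.children and all(ch.visits for ch in node.children):
            node = max(node.children, key=uct)
        # 2. Expansion: add a child per candidate next step.
        if not node.children and len(node.state) < LENGTH:
            node.children = [Node(node.state + (d,), node) for d in DIGITS]
        if node.children:
            unvisited = [ch for ch in node.children if ch.visits == 0]
            node = unvisited[0] if unvisited else max(node.children, key=uct)
        # 3. Simulation.
        reward = rollout(node.state)
        # 4. Backpropagation: update values along the path to the root.
        while node:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Read off the most-visited path as the chosen "reasoning chain".
    best, node = [], root
    while node.children:
        node = max(node.children, key=lambda ch: ch.visits)
        best.append(node.state[-1])
    return best

print(mcts())  # a 4-digit sequence; with enough iterations it sums to 9
```

Because the verifier gives a clean 0/1 reward, the search concentrates visits on branches that keep a valid total reachable — the same mechanism that makes test suites and proof checkers so valuable for search-based reasoning.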

The Verifier Advantage
MCTS-style reasoning is most powerful when you have a reliable verifier. In code generation, you can run tests. In math, you can check the answer. In formal proofs, you can verify logical validity. These verification signals allow the search process to efficiently prune wrong paths. For more open-ended tasks where verification is harder, the advantage of search-based reasoning is more limited.

The Frontier of AI Reasoning

Reasoning models are pushing into domains previously considered out of reach for AI:

Mathematical Proofs

AI systems are increasingly capable of solving competition-level mathematics and assisting with research-level proof development. Systems like AlphaProof (Google DeepMind) have made headway on International Mathematical Olympiad problems. The combination of language models for intuition with formal proof assistants like Lean for verification is a particularly promising direction.
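What makes Lean so useful as a verification partner is that acceptance by its kernel is a binary, trustworthy signal: a candidate proof either type-checks or it doesn't, so a search process gets exactly the kind of reward discussed above. A minimal Lean 4 sketch (both lemmas are from Lean's core library):

```lean
-- A machine-checkable statement: once the proof term is accepted by the
-- kernel, correctness is guaranteed.
theorem sum_comm (a b : Nat) : a + b = b + a := Nat.add_comm a b

-- A wrong candidate proof is simply rejected by the checker, giving the
-- search a clean pass/fail reward.
example (n : Nat) : n + 0 = n := Nat.add_zero n
```

In the LLM-plus-Lean loop, the language model proposes proof steps from intuition and the checker prunes anything unsound — division of labor between fuzzy generation and exact verification.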

Code Generation and Debugging

Reasoning models have dramatically improved code generation quality. By thinking through the problem, planning the approach, writing the code, and then mentally tracing through test cases, reasoning models achieve pass rates on coding benchmarks that far exceed standard models. They are particularly strong at debugging — reasoning about why code fails and systematically identifying the root cause.

Scientific Discovery

Reasoning models are being applied to scientific tasks: analyzing experimental data, generating and evaluating hypotheses, and synthesizing findings across papers. While they are not yet making independent discoveries, they are increasingly valuable as research assistants that can process and reason about vast amounts of scientific literature.

Limitations of Current Reasoning Approaches

Despite impressive progress, important limitations remain:

  • Faithfulness of reasoning chains: The reasoning trace a model generates may not accurately reflect its actual internal computation. Models can produce plausible-sounding reasoning that leads to a conclusion they were going to reach anyway — a form of post-hoc rationalization rather than genuine reasoning.
  • Compositional generalization: Reasoning models struggle with problems that require combining known concepts in truly novel ways. They can solve problems that resemble their training data, but novel problem structures remain challenging.
  • Cost and latency: Reasoning models can be 10-100x more expensive than standard models due to the extended thinking tokens. For applications requiring real-time responses, this is a significant constraint.
  • Planning horizon: Current models struggle with tasks requiring very long-horizon planning — sequences of dozens or hundreds of dependent steps. They can plan a few steps ahead reliably but degrade over longer horizons.
  • Brittleness: Small changes in problem phrasing can dramatically affect reasoning performance. Models can solve a problem presented one way but fail when the same problem is rephrased.

Key Takeaways

  1. Chain-of-thought prompting was the breakthrough that showed LLMs can reason better when they 'show their work' — generating intermediate steps rather than jumping to answers.
  2. Reasoning models (o3, GPT-5.4 Thinking, Claude Extended Thinking, Magistral) trade test-time compute for accuracy — they think longer on harder problems.
  3. Process reward models (PRMs) evaluate each reasoning step rather than just the final answer, producing more reliable and transparent reasoning.
  4. Monte Carlo Tree Search applied to LLMs enables systematic exploration of solution spaces, and is most powerful when combined with verifiers (test suites, proof checkers).
  5. Current limitations include unfaithful reasoning chains, poor compositional generalization, high cost/latency, and brittleness to problem rephrasing.
  6. The frontier of AI reasoning extends to mathematical proofs, complex code generation, and scientific discovery, but genuine long-horizon planning remains an open challenge.

Test Your Understanding

Module Assessment

5 questions · Score 70% or higher to complete this module

You can retake the quiz as many times as you need. Your best score is saved.
