Does Modern AI Truly Think?

Are today's AI models sentient "thinkers" or just a dressed-up, billion-dollar version of predictive text? This newsletter unpacks Apple's recent study to find the answer.
8 mins read
Jun 23, 2025

For years, the promise of artificial intelligence has captivated researchers and the public alike.

Recent advancements in large language models (LLMs) have brought us closer than ever, with large reasoning models (LRMs) now generating detailed “thinking processes” before providing answers. These models exhibit impressive performance on complex reasoning tasks, leading some to believe they are on the cusp of true artificial general intelligence.

But what if this thinking is merely an illusion? What if, despite their sophisticated self-reflection mechanisms, these models are still fundamentally limited in their reasoning ability, particularly as complex problems grow? For example, if an LRM encountered a novel sorting problem, could it devise a new, more efficient algorithm or simply rearrange known sorting methods based on patterns from its training data?

This idea raises the following questions:

Q1. Do these models engage in flexible, general reasoning, or do they primarily rely on recognizing familiar patterns?

Q2. How does their effectiveness change as problems become more intricate? 

Q3. When given comparable computational resources, how do these reasoning models stack up against regular language models that don’t employ thinking mechanisms?

Q4. What are the fundamental constraints of existing reasoning methods, and what advancements are needed to foster more dependable reasoning skills?

To systematically investigate these questions and bridge existing gaps in understanding, Apple conducted a study in June 2025 (https://machinelearning.apple.com/research/illusion-of-thinking).

This research notably moved beyond traditional evaluation paradigms that primarily emphasized final answer accuracy on established mathematical and coding benchmarks, which often suffered from data contamination and lacked insights into the reasoning traces’ structure and quality.

Instead, the researchers leveraged controllable puzzle environments designed to allow precise manipulation of compositional complexity while maintaining consistent logical structures. This innovative setup enabled a deeper analysis of final answers and the internal reasoning traces, offering unprecedented insights into how large reasoning models (LRMs) think.

Data contamination refers to the issue where evaluation benchmarks used for models overlap with the data they were trained on. This can lead to inflated performance metrics because models might recall memorized solutions rather than genuinely reasoning.

Today, we'll walk you through the scientific ins and outs of Apple's recent study, and we'll also include the SparkNotes version at the end.

Experimental setup

For their experimental setup, the researchers at Apple chose a unique approach to assessing large reasoning models. Unlike prior studies that predominantly relied on established mathematical and coding benchmarks (which often suffered from data contamination and lacked insights into reasoning traces), this work introduced controllable puzzle environments. This choice enabled precise manipulation of compositional complexity while maintaining consistent logical structures.

The puzzles were selected for several key reasons:

  • They offer fine-grained control over complexity.

  • They help avoid data contamination common in established benchmarks.

  • They require only explicitly provided rules, emphasizing algorithmic reasoning.

  • They support rigorous, simulator-based evaluation, allowing for precise solution checks and detailed failure analyses (a minimal sketch of such a check follows this list).
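The paper does not publish its evaluation harness, but the idea of a simulator-based check is easy to picture. The sketch below is our own illustration in Python (function and variable names are ours, not Apple's): it replays a proposed Tower of Hanoi move sequence, rejects the first illegal move, and confirms whether the goal state is reached, which is the kind of fine-grained verification the researchers describe.

```python
# Hypothetical sketch of a simulator-based check (not Apple's actual evaluation code).
# It replays a proposed Tower of Hanoi solution and reports the first rule violation.

def check_hanoi_solution(n_disks, moves):
    """Validate a list of (from_peg, to_peg) moves for an n-disk Tower of Hanoi.

    Pegs are numbered 0, 1, 2; all disks start on peg 0 and must end on peg 2.
    Returns (True, None) if the sequence is legal and reaches the goal;
    otherwise (False, index of the first illegal move), or (False, None) if the
    moves are all legal but the goal is never reached.
    """
    pegs = [list(range(n_disks, 0, -1)), [], []]  # peg 0 holds disks n..1, largest at the bottom
    for i, (src, dst) in enumerate(moves):
        if not pegs[src]:
            return False, i                      # illegal: moving from an empty peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False, i                      # illegal: larger disk placed on a smaller one
        pegs[dst].append(pegs[src].pop())
    goal_reached = pegs[2] == list(range(n_disks, 0, -1))
    return (True, None) if goal_reached else (False, None)


# Example: the optimal 3-disk solution (2^3 - 1 = 7 moves) passes the check.
optimal_3 = [(0, 2), (0, 1), (2, 1), (0, 2), (1, 0), (1, 2), (0, 2)]
print(check_hanoi_solution(3, optimal_3))  # (True, None)
```

Because every move is checked against the rules rather than string-matched against a reference answer, this kind of harness can score correct-but-non-optimal solutions and pinpoint exactly where a model's trace goes wrong.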

The study evaluated LRMs on four distinct puzzle environments:

  • Tower of Hanoi: This puzzle involves moving a stack of different-sized disks from one peg to another, one disk at a time, without placing a larger disk on a smaller one. Its difficulty is precisely manipulated by adjusting the number of initial disks, which makes the minimum number of required moves scale exponentially as 2^n - 1 (a short demonstration of this scaling follows the puzzle list). However, the study measures the correctness of each move and reaching the target state, not optimality.

  • Checker Jumping: This one-dimensional puzzle aims to swap the positions of red and blue checkers arranged in a row with a single empty space between them. Checkers can slide into an adjacent space or jump over one opposite-colored checker, but cannot move backward. The complexity is controlled by the number of checkers (2n), and the minimum number of moves required scales quadratically as (n+1)^2 - 1.

  • River Crossing: This puzzle requires n actors and their n agents to cross a river in a limited-capacity boat, ensuring that no actor is ever in the presence of another actor's agent unless their own agent is also present. The task's complexity is controlled by adjusting the number of actor/agent pairs (n) and the boat capacity (k); specifically, for n = 2 and n = 3 pairs a boat capacity of k = 2 is used, and for larger numbers of pairs, k = 3 is used.

  • Blocks World: This block-stacking puzzle aims to rearrange blocks from an initial configuration to a specified goal configuration. Only the topmost block of any stack can be moved and placed on an empty stack or another block. The difficulty is controlled by the number of blocks while maintaining clear structural patterns for initial and goal configurations.
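To make those complexity knobs concrete, the short Python snippet below (ours, not from the paper) prints how the minimum move counts grow with problem size n for the two puzzles whose formulas are given above: Tower of Hanoi grows exponentially, while Checker Jumping grows only quadratically.

```python
# Minimum move counts as a function of problem size n, using the formulas above.
# Illustrative only; the study controls difficulty by sweeping n for each puzzle.

def hanoi_min_moves(n_disks: int) -> int:
    return 2 ** n_disks - 1            # exponential growth

def checker_min_moves(n_per_color: int) -> int:
    return (n_per_color + 1) ** 2 - 1  # quadratic growth

for n in range(1, 11):
    print(f"n={n:2d}  Tower of Hanoi: {hanoi_min_moves(n):5d} moves   "
          f"Checker Jumping: {checker_min_moves(n):4d} moves")
```

At n = 10, an optimal Tower of Hanoi solution already needs 1,023 moves versus 120 for Checker Jumping, so sweeping n stresses the puzzles at very different rates.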

Four puzzles used in experimentation: Tower of Hanoi, Checker Jumping, River Crossing, and Blocks World—each shown in three phases (initial, intermediate, target), depicting disk moves, token swaps, river crossings, and stack reconfigurations

Twenty-five samples were generated for each puzzle instance, and each model's average performance across those samples was reported.

Assessing thinking vs. non-thinking models

Drawing insights from observations on existing mathematical benchmarks, the researchers thoroughly investigated how problem complexity influences reasoning capabilities. They did this by contrasting pairs of thinking and non-thinking models within controlled puzzle environments. The analysis concentrated on pairs of models sharing the same core architecture: Claude-3.7-Sonnet (with and without reasoning) and DeepSeek (the R1 and V3 variants).

In each puzzle scenario, the level of complexity was adjusted by changing the problem size n, which might represent the number of disks, checkers, blocks, or actor/agent pairs to cross. The outcomes from these experiments demonstrated that, unlike findings from pure math problems, the models’ behavior showed three distinct phases depending on complexity:

In the initial phase, where problem difficulty was low, the standard, non-thinking models could achieve performance on par with — or even superior to — their thinking counterparts. And they did so more efficiently in terms of token usage.

The middle phase, involving tasks of moderate complexity, is where the benefits of reasoning models (those capable of generating detailed thought processes like a chain of thought) become evident, leading to a wider performance gap between the model types.

The most notable is the final phase, associated with high problem complexity. In this stage, the performance of both model types completely dropped to zero. Although thinking models managed to postpone this failure point, they ultimately faced the same inherent constraints as the non-thinking versions.

Pass@k performance of thinking vs. non-thinking models across different compute budgets and puzzle complexities

Pass@k is an evaluation metric frequently used in mathematical and coding benchmarks. It assesses the “upper-bound capabilities” of models by calculating the percentage of problems for which at least one of the k generated solutions is correct. In this study it is used to compare reasoning and non-reasoning models under equivalent computational budgets.
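The paper does not spell out how it computes pass@k; a common choice, and the one we would assume here, is the unbiased estimator popularized by code-generation benchmarks: given n sampled solutions of which c are correct, the probability that at least one of k random draws is correct is 1 - C(n-c, k) / C(n, k). A minimal Python version:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (assumed here; the paper does not show its code).

    n: total samples generated per problem, c: number of correct samples,
    k: draw size. Returns the probability that at least one of k samples
    drawn without replacement from the n generations is correct.
    """
    if n - c < k:
        return 1.0  # too few incorrect samples to fill a draw, so a correct one is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example with the study's 25 samples per puzzle instance, 5 of them correct:
print(round(pass_at_k(25, 5, 1), 3))   # 0.2   (pass@1 equals plain accuracy)
print(round(pass_at_k(25, 5, 10), 3))  # 0.943 (pass@k rises toward the "upper bound")
```

Averaged over problems, pass@k rewards a model that ever finds a correct solution, which is what makes it useful for upper-bound comparisons at matched compute budgets.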

Reasoning model breakdown

Researchers also investigated how specialized reasoning models, equipped with “thinking tokens,” perform as problem complexity rises. Apple's experiments evaluated five state-of-the-art thinking models:

  1. o3-mini (medium)

  2. o3-mini (high)

  3. DeepSeek-R1

  4. DeepSeek-R1-Qwen-32B

  5. Claude-3.7-Sonnet (thinking)

Results indicate that all reasoning models follow a similar pattern: their accuracy steadily declines as problem complexity increases, eventually collapsing to zero beyond a specific threshold unique to each model.

A closer look at how these models use their “thinking tokens,” which represent the amount of effort, or internal processing, they devote to reasoning, reveals a puzzling trend: as problems become more complex, the models initially use more thinking tokens, as if trying harder on the tougher challenges. However, once they pass a certain difficulty threshold (which is also where their accuracy collapses), they actually reduce their reasoning effort, even though the problems keep getting harder.

This unusual behavior is most pronounced in the o3-mini models and less evident in the Claude-3.7-Sonnet (thinking) model. It’s important to note that these models still have plenty of budget left to generate more thoughts, yet they stop doing so as problems get harder. This suggests a limit to how well current reasoning models can scale their thinking to increasingly complex problems.

Reasoning models demonstrating a distinct failure pattern linked to problem complexity

Would training on these puzzles solve the problem?

Training LLMs or LRMs on these specific puzzles could enhance their performance on those particular tasks, especially if such problem types were underrepresented in their initial training data. For example, the paper suggests that models struggled with higher-complexity instances of the “River Crossing” puzzle, possibly due to a scarcity of similar examples on the web for n > 2, meaning LRMs might not have frequently encountered or memorized such instances during training. Therefore, increased exposure through training could lead to higher accuracy within the seen complexity ranges for these specific puzzles.

However, this approach is unlikely to overcome the fundamental limitations in generalizable reasoning observed in the study. The research indicates that even with ample inference compute available, LRMs exhibit a counterintuitive scaling limit: their reasoning effort declines at high complexity even as their accuracy collapses completely. They also demonstrate limitations in exact computation and inconsistent reasoning across puzzles, even failing to effectively execute explicit algorithms provided to them.
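On that last point: for the Tower of Hanoi, the kind of explicit algorithm in question is essentially the textbook recursive procedure. A minimal Python rendering (ours, for illustration; the paper's prompt wording is not reproduced here) shows how mechanical the prescribed steps are, which makes the models' failure to execute them all the more striking.

```python
# The textbook recursive Tower of Hanoi procedure, shown here for illustration.
# This is the sort of explicit algorithm the study reports supplying to models
# without seeing improved performance; it is not the paper's exact prompt.

def solve_hanoi(n, source=0, target=2, spare=1, moves=None):
    """Append the optimal (from_peg, to_peg) moves for n disks to `moves`."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    solve_hanoi(n - 1, source, spare, target, moves)  # park the n-1 smaller disks on the spare peg
    moves.append((source, target))                    # move the largest disk to the target peg
    solve_hanoi(n - 1, spare, target, source, moves)  # stack the smaller disks back on top of it
    return moves


moves = solve_hanoi(4)
print(len(moves))  # 15, i.e. 2^4 - 1, matching the exponential move count discussed earlier
```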

These findings suggest that the challenge goes beyond a simple lack of exposure to specific puzzle instances; it points toward “fundamental barriers to generalizable reasoning.” If these puzzles were extensively incorporated into training data, they might merely become another source of data contamination, potentially masking the models’ true generalized reasoning capabilities rather than genuinely improving them.

Wrapping up

These findings underscore that while large reasoning models show promise, their current thinking mechanisms have clear limitations. They highlight the need for future research to address fundamental barriers to truly generalizable AI reasoning.

Here's the TL;DR version of what we covered:

  • What Apple studied: Apple looked at “large reasoning models” (LRMs) — AI systems that write out their own step-by-step thought process — versus standard large language models (LLMs) that jump straight to an answer.

  • The test bed: Instead of reusing the usual math or coding quizzes (which many models have already seen), Apple leaned on four classic puzzles — Tower of Hanoi, Checker Jumping, River Crossing, and Blocks World — that let them crank problem difficulty up or down while keeping the rules clear.

  • Three performance phases:

    • Easy problems: Regular models did just as well as, and sometimes better than, their “thinking” counterparts while using fewer tokens.

    • Moderate problems: LRMs pulled ahead; their written reasoning helped them solve tougher tasks.

    • Hard problems: Both types eventually failed completely. LRMs delayed the crash, but couldn’t avoid it.

  • Reasoning effort hits a wall: As puzzles grew harder, LRMs initially spent more “thinking tokens,” then suddenly cut back right when accuracy collapsed. They still had room in their token budget, suggesting today’s reasoning approach doesn’t scale smoothly with complexity.

  • Training on the puzzles isn’t a silver bullet: Feeding models more examples of these exact puzzles might boost scores on those tasks, but it would mask — not fix — the deeper limits Apple observed in general reasoning.

  • Big picture: LRMs show promise for middle-tier challenges, yet they share the same upper limits as standard models once complexity spikes. True, broadly reliable reasoning will need breakthroughs beyond the current “write your thoughts out” method.

Written By:
Fahim ul Haq