Does Modern AI Truly Think?

Are today's AI models sentient "thinkers" or just a dressed-up, billion-dollar version of predictive text? This newsletter unpacks Apple's recent study to find the answer.
8 mins read
Jun 23, 2025

For years, the promise of artificial intelligence has captivated researchers and the public alike.

Recent advancements in large language models (LLMs) have brought us closer than ever, with large reasoning models (LRMs) now generating detailed “thinking processes” before providing answers. These models exhibit impressive performance on complex reasoning tasks, leading some to believe they are on the cusp of true artificial general intelligence.

But what if this thinking is merely an illusion? What if, despite their sophisticated self-reflection mechanisms, these models are still fundamentally limited in their reasoning ability, particularly as complex problems grow? For example, if an LRM encountered a novel sorting problem, could it devise a new, more efficient algorithm or simply rearrange known sorting methods based on patterns from its training data?

This idea raises the following questions:

Q1. Do these models engage in flexible, general reasoning, or do they primarily rely on recognizing familiar patterns?

Q2. How does their effectiveness change as problems become more intricate? 

Q3. When given comparable computational resources, how do these reasoning models stack up against regular language models that don’t employ thinking mechanisms?

Q4. What are the fundamental constraints of existing reasoning methods, and what advancements are needed to foster more dependable reasoning skills?

To systematically investigate these questions and bridge existing gaps in understanding, Apple conducted a study in June 2025 (https://machinelearning.apple.com/research/illusion-of-thinking).

This research notably moved beyond traditional evaluation paradigms that primarily emphasized final answer accuracy on established mathematical and coding benchmarks, which often suffered from data contamination and lacked insights into the reasoning traces’ structure and quality.

Instead, the researchers leveraged controllable puzzle environments designed to allow precise manipulation of compositional complexity while maintaining consistent logical structures. This innovative setup enabled a deeper analysis of final answers and the internal reasoning traces, offering unprecedented insights into how large reasoning models (LRMs) think.

Data contamination refers to the issue where evaluation benchmarks used for models overlap with the data they were trained on. This can lead to inflated performance metrics because models might recall memorized solutions rather than genuinely reasoning.

Today, we'll walk you through the scientific ins and outs of Apple's recent study, and we'll also include the SparkNotes version at the end.

Experimental setup

For their experimental setup, the researchers at Apple chose a unique approach to assessing large reasoning models. Unlike prior studies that predominantly relied on established mathematical and coding benchmarks (which often suffered from data contamination and lacked insights into reasoning traces), this work introduced controllable puzzle environments. This choice enabled precise manipulation of compositional complexity while maintaining consistent logical structures.

The puzzles were selected for several key reasons:

  • They offer fine-grained control over complexity.

  • They help avoid data contamination common in established benchmarks.

  • They require only explicitly provided rules, emphasizing algorithmic reasoning.

  • They support rigorous, simulator-based evaluation, allowing for precise solution checks and detailed failure analyses (a minimal sketch of such a check follows this list).
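The paper does not publish its evaluation harness, but the idea of a simulator-based check is easy to picture. The sketch below is our own illustration in Python (function and variable names are ours, not Apple's): it replays a proposed Tower of Hanoi move sequence, rejects the first illegal move, and confirms whether the goal state is reached, which is the kind of fine-grained verification the researchers describe.

```python
# Hypothetical sketch of a simulator-based check (not Apple's actual evaluation code).
# It replays a proposed Tower of Hanoi solution and reports the first rule violation.

def check_hanoi_solution(n_disks, moves):
    """Validate a list of (from_peg, to_peg) moves for an n-disk Tower of Hanoi.

    Pegs are numbered 0, 1, 2; all disks start on peg 0 and must end on peg 2.
    Returns (True, None) if the sequence is legal and reaches the goal;
    otherwise (False, index of the first illegal move), or (False, None) if the
    moves are all legal but the goal is never reached.
    """
    pegs = [list(range(n_disks, 0, -1)), [], []]  # peg 0 holds disks n..1, largest at the bottom
    for i, (src, dst) in enumerate(moves):
        if not pegs[src]:
            return False, i                      # illegal: moving from an empty peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False, i                      # illegal: larger disk placed on a smaller one
        pegs[dst].append(pegs[src].pop())
    goal_reached = pegs[2] == list(range(n_disks, 0, -1))
    return (True, None) if goal_reached else (False, None)


# Example: the optimal 3-disk solution (2^3 - 1 = 7 moves) passes the check.
optimal_3 = [(0, 2), (0, 1), (2, 1), (0, 2), (1, 0), (1, 2), (0, 2)]
print(check_hanoi_solution(3, optimal_3))  # (True, None)
```

Because every move is checked against the rules rather than string-matched against a reference answer, this kind of harness can score correct-but-non-optimal solutions and pinpoint exactly where a model's trace goes wrong.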

The study evaluated LRMs on four distinct puzzle environments:

  • Tower of Hanoi: This puzzle involves moving a stack of different-sized disks from one peg to another, one disk at a time, without placing a larger disk on a smaller one. Its difficulty is precisely manipulated by adjusting the number of initial disks, which makes the minimum number of required moves scale exponentially as 2^n - 1 (a short demonstration of this scaling follows the puzzle list). However, the study measures the correctness of each move and reaching the target state, not optimality.

  • Checker Jumping: This one-dimensional puzzle aims to swap the positions of red and blue checkers arranged in a row with a single empty space between them. Checkers can slide into an adjacent space or jump over one opposite-colored checker, but cannot move backward. The complexity is controlled by the number of checkers (2n), and the minimum number of moves required scales quadratically as (n+1)^2 - 1.

  • River Crossing: This puzzle requires n actors and their n agents to cross a river in a limited-capacity boat, ensuring that no actor is ever in the presence of another actor's agent unless their own agent is also present. The task's complexity is controlled by adjusting the number of actor/agent pairs (n) and the boat capacity (k); specifically, for n = 2 and n = 3 pairs a boat capacity of k = 2 is used, and for larger numbers of pairs, k = 3 is used.

  • Blocks World: This block-stacking puzzle aims to rearrange blocks from an initial configuration to a specified goal configuration. Only the topmost block of any stack can be moved and placed on an empty stack or another block. The difficulty is controlled by the number of blocks while maintaining clear structural patterns for initial and goal configurations.
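To make those complexity knobs concrete, the short Python snippet below (ours, not from the paper) prints how the minimum move counts grow with problem size n for the two puzzles whose formulas are given above: Tower of Hanoi grows exponentially, while Checker Jumping grows only quadratically.

```python
# Minimum move counts as a function of problem size n, using the formulas above.
# Illustrative only; the study controls difficulty by sweeping n for each puzzle.

def hanoi_min_moves(n_disks: int) -> int:
    return 2 ** n_disks - 1            # exponential growth

def checker_min_moves(n_per_color: int) -> int:
    return (n_per_color + 1) ** 2 - 1  # quadratic growth

for n in range(1, 11):
    print(f"n={n:2d}  Tower of Hanoi: {hanoi_min_moves(n):5d} moves   "
          f"Checker Jumping: {checker_min_moves(n):4d} moves")
```

At n = 10, an optimal Tower of Hanoi solution already needs 1,023 moves versus 120 for Checker Jumping, so sweeping n stresses the puzzles at very different rates.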

Four puzzles used in experimentation: Tower of Hanoi, Checker Jumping, River Crossing, and Blocks World—each shown in three phases (initial, intermediate, target), depicting disk moves, token swaps, river crossings, and stack reconfigurations

Twenty-five samples were generated for each puzzle instance, and each model's average performance across those samples was reported.

Assessing thinking vs. non-thinking models

Drawing insights from observations on existing mathematical benchmarks, the researchers thoroughly investigated how problem complexity influences reasoning capabilities. They did this by contrasting pairs of thinking and non-thinking models within controlled puzzle environments. The analysis concentrated on pairs of models sharing the same core architecture: Claude-3.7-Sonnet (with and without reasoning) and DeepSeek (the R1 and V3 variants).

In each puzzle scenario, the level of complexity was adjusted by changing the problem size n, which might represent the number of disks, checkers, blocks, or actor/agent pairs to cross. The outcomes from these experiments demonstrated that, unlike findings from pure math problems, the models’ behavior showed three distinct phases depending on complexity:

In the initial phase, where problem difficulty was low, the standard, non-thinking models could achieve performance on par with — or even superior to — their thinking counterparts. And they did so more efficiently in terms of token usage.

The middle phase, involving tasks of moderate complexity, is where the benefits of reasoning models (those capable of generating detailed thought processes like a chain of thought) become evident, leading to a wider performance gap between the model types.

The most notable is the final phase, associated with high problem complexity. In this stage, the performance of both model types completely dropped to zero. Although thinking models managed to postpone this failure point, they ultimately faced the same inherent constraints as the non-thinking versions.

Pass@k performance of thinking vs. non-thinking models across different compute budgets and puzzle complexities

Pass@k is an evaluation metric frequently used in mathematical and coding benchmarks. It assesses the “upper-bound capabilities” of models by calculating the percentage of problems for which at least one of the k generated solutions is correct. In this study it is used to compare reasoning and non-reasoning models under equivalent computational budgets.
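The paper does not spell out how it computes pass@k; a common choice, and the one we would assume here, is the unbiased estimator popularized by code-generation benchmarks: given n sampled solutions of which c are correct, the probability that at least one of k random draws is correct is 1 - C(n-c, k) / C(n, k). A minimal Python version:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (assumed here; the paper does not show its code).

    n: total samples generated per problem, c: number of correct samples,
    k: draw size. Returns the probability that at least one of k samples
    drawn without replacement from the n generations is correct.
    """
    if n - c < k:
        return 1.0  # too few incorrect samples to fill a draw, so a correct one is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example with the study's 25 samples per puzzle instance, 5 of them correct:
print(round(pass_at_k(25, 5, 1), 3))   # 0.2   (pass@1 equals plain accuracy)
print(round(pass_at_k(25, 5, 10), 3))  # 0.943 (pass@k rises toward the "upper bound")
```

Averaged over problems, pass@k rewards a model that ever finds a correct solution, which is what makes it useful for upper-bound comparisons at matched compute budgets.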

Reasoning model breakdown

Researchers also investigated how specialized reasoning models, equipped with “thinking tokens,” perform as problem complexity rises. Apple's experiments evaluated five state-of-the-art thinking models:

  1. o3-mini (medium)

  2. o3-mini (high)

  3. DeepSeek-R1

  4. DeepSeek-R1-Qwen-32B

  5. Claude-3.7-Sonnet (thinking)

Results indicate that all reasoning models follow a similar pattern: their accuracy steadily declines as problem complexity increases, eventually collapsing to zero beyond a specific threshold unique to each model.

A closer look at how these models use their “thinking tokens,” which represent the amount of effort, or internal processing, they devote to reasoning, reveals a puzzling trend: as problems become more complex, the models initially use more thinking tokens, as if trying harder on the tougher challenges. However, once they pass a certain difficulty threshold (which is also where their accuracy collapses), they actually reduce their reasoning effort, even though the problems keep getting harder.

This unusual behavior is most pronounced in the o3-mini models and less evident in the Claude-3.7-Sonnet (thinking) model. It’s important to note that these models still have plenty of budget left to generate more thoughts, yet they stop doing so as problems get harder. This suggests a limit to how well current reasoning models can scale their thinking to increasingly complex problems.

Reasoning models demonstrating a distinct failure pattern linked to problem complexity

Would training on these puzzles solve the problem?

Training LLMs or LRMs on these specific puzzles could enhance their performance on those particular tasks, especially if such problem types were underrepresented in their initial training data. For example, the paper suggests that models struggled with higher-complexity instances of the “River Crossing” puzzle, possibly due to a scarcity of similar examples on the web for n > 2, meaning LRMs might not have frequently encountered or memorized such instances during training. Therefore, increased exposure through training could lead to higher accuracy within the seen complexity ranges for these specific puzzles.

However, this approach is unlikely to overcome the fundamental limitations in generalizable reasoning observed in the study. The research indicates that even with ample inference compute available, LRMs exhibit a counterintuitive scaling limit: their reasoning effort declines at high complexity even as their accuracy collapses completely. They also demonstrate limitations in exact computation and inconsistent reasoning across puzzles, even failing to effectively execute explicit algorithms provided to them.
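On that last point: for the Tower of Hanoi, the kind of explicit algorithm in question is essentially the textbook recursive procedure. A minimal Python rendering (ours, for illustration; the paper's prompt wording is not reproduced here) shows how mechanical the prescribed steps are, which makes the models' failure to execute them all the more striking.

```python
# The textbook recursive Tower of Hanoi procedure, shown here for illustration.
# This is the sort of explicit algorithm the study reports supplying to models
# without seeing improved performance; it is not the paper's exact prompt.

def solve_hanoi(n, source=0, target=2, spare=1, moves=None):
    """Append the optimal (from_peg, to_peg) moves for n disks to `moves`."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    solve_hanoi(n - 1, source, spare, target, moves)  # park the n-1 smaller disks on the spare peg
    moves.append((source, target))                    # move the largest disk to the target peg
    solve_hanoi(n - 1, spare, target, source, moves)  # stack the smaller disks back on top of it
    return moves


moves = solve_hanoi(4)
print(len(moves))  # 15, i.e. 2^4 - 1, matching the exponential move count discussed earlier
```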

These findings suggest that the challenge goes beyond a simple lack of exposure to specific puzzle instances; it points toward “fundamental barriers to generalizable reasoning.” If these puzzles were extensively incorporated into training data, they might merely become another source of data contamination, potentially masking the models’ true generalized reasoning capabilities rather than genuinely improving them.

Wrapping up

These findings underscore that while large reasoning models show promise, their current thinking mechanisms have clear limitations. They highlight the need for future research to address fundamental barriers to truly generalizable AI reasoning.

Here's the TL;DR version of what we covered:

  • What Apple studied: Apple looked at “large reasoning models” (LRMs) — AI systems that write out their own step-by-step thought process — versus standard large language models (LLMs) that jump straight to an answer.

  • The test bed: Instead of reusing the usual math or coding quizzes (which many models have already seen), Apple leaned on four classic puzzles — Tower of Hanoi, Checker Jumping, River Crossing, and Blocks World — that let them crank problem difficulty up or down while keeping the rules clear.

  • Three performance phases:

    • Easy problems: Regular models did just as well as, and sometimes better than, their “thinking” counterparts while using fewer tokens.

    • Moderate problems: LRMs pulled ahead; their written reasoning helped them solve tougher tasks.

    • Hard problems: Both types eventually failed completely. LRMs delayed the crash, but couldn’t avoid it.

  • Reasoning effort hits a wall: As puzzles grew harder, LRMs initially spent more “thinking tokens,” then suddenly cut back right when accuracy collapsed. They still had room in their token budget, suggesting today’s reasoning approach doesn’t scale smoothly with complexity.

  • Training on the puzzles isn’t a silver bullet: Feeding models more examples of these exact puzzles might boost scores on those tasks, but it would mask — not fix — the deeper limits Apple observed in general reasoning.

  • Big picture: LRMs show promise for middle-tier challenges, yet they share the same upper limits as standard models once complexity spikes. True, broadly reliable reasoning will need breakthroughs beyond the current “write your thoughts out” method.

Written By:
Fahim ul Haq