Did Apple Just Burst the AI Reasoning Bubble? I don't think so
A shallow dive into Apple's latest research that challenges everything we think we know about reasoning models
Just when I was getting excited about reasoning models, Apple researchers decided to play party pooper and published a paper that's making waves across the AI community. "The Illusion of Thinking" - even the title is provocative!
As someone who's been very bullish on reasoning models (and honestly, still am), this paper hit close to home. But here's the thing - we needed exactly this kind of study to understand where we actually stand with AI reasoning.
What Apple Actually Did
So, instead of just throwing another math benchmark at these models, Apple's team got a bit creative. They built controllable puzzle environments like Tower of Hanoi, but with a scientific twist. Why did they use puzzles? Because:
Complexity Control: Want to make a problem 2x harder? Add a single disk - the optimal solution takes 2^n - 1 moves, so every extra disk roughly doubles the work.
No data contamination: Fresh problems that models haven't memorized.
Easy to evaluate reasoning: They could peek inside the "thinking" process, not just final answers.
Objective truth: Puzzle simulators don't lie about correctness (a minimal checker is sketched below).
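To make that "objective truth" point concrete, here's a minimal sketch of what such a checker could look like. This is my own illustration, not Apple's actual evaluation code, and the (disk, from_peg, to_peg) move format is just an assumption for the example.

```python
# A minimal Tower of Hanoi checker - my own sketch, not Apple's code.
# Complexity is a single dial (n_disks), and a move sequence either
# solves the puzzle legally or it doesn't: no judgment calls.

def solves_hanoi(n_disks: int, moves: list[tuple[int, str, str]]) -> bool:
    """Check whether `moves` (disk, from_peg, to_peg) legally solves an
    n_disks Tower of Hanoi that starts on peg 'A' and must end on 'C'."""
    pegs = {"A": list(range(n_disks, 0, -1)), "B": [], "C": []}
    for disk, src, dst in moves:
        if not pegs[src] or pegs[src][-1] != disk:
            return False  # moved disk isn't on top of the source peg
        if pegs[dst] and pegs[dst][-1] < disk:
            return False  # can't place a disk on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs["C"] == list(range(n_disks, 0, -1))

# 2 disks: the optimal solution is 2**2 - 1 = 3 moves.
assert solves_hanoi(2, [(1, "A", "B"), (2, "A", "C"), (1, "B", "C")])
```

That single n_disks parameter is exactly the kind of complexity dial the researchers wanted, and the checker's verdict is binary, so there's nothing for a model to talk its way around.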
The Three Regimes of Testing
Apple's researchers found that reasoning models don't just "get better" as problems get harder - they operate in three distinct regimes:
1. The Embarrassing Zone (Low Complexity)
This is where regular LLMs actually outperform reasoning models - something I'd already suspected from my own experience. For simple problems, giving models more "thinking time" made them worse: they would often land on the right answer early, then overthink it and spiral into a planning loop. It's like watching a chess grandmaster overthink a basic move and blunder.
2. The Sweet Spot (Medium Complexity)
This is where reasoning models shine. Models genuinely benefit from that internal monologue, working through problems step by step.
3. The Cliff (High Complexity)
This is where everything falls apart. Both reasoning and regular models just... stop working - a complete accuracy collapse, in the paper's words.
But There's an Interesting Twist
Up to this point, everything matched my expectations. But here's what really baffled me: Apple found that as problems get harder, reasoning models actually think LESS, not MORE.
You'd expect models to use more computational juice when facing tougher problems, right? Nope! They hit some internal complexity threshold and basically give up, reducing their reasoning effort despite having plenty of tokens left in the budget.
It's like watching a student encounter a hard math problem and immediately writing "I don't know" instead of trying harder. Except these are supposed to be our most advanced AI systems.
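If you want to see this effect for yourself, the experiment is roughly the following. Note that generate_with_reasoning and the thinking_tokens_used field are hypothetical stand-ins; every provider exposes reasoning-token usage a bit differently.

```python
# Sketch: how reasoning-token usage could be tracked as Tower of Hanoi
# instances get harder. `generate_with_reasoning` is a hypothetical
# helper - adapt it to whichever reasoning API you actually use.

def hanoi_prompt(n_disks: int) -> str:
    return (f"Solve Tower of Hanoi with {n_disks} disks on pegs A, B, C. "
            "List every move as 'move disk X from A to C'.")

def measure_thinking_effort(generate_with_reasoning, max_disks=12,
                            thinking_budget=32_000):
    """Return (n_disks, thinking_tokens_used) pairs."""
    results = []
    for n in range(3, max_disks + 1):
        reply = generate_with_reasoning(
            prompt=hanoi_prompt(n),
            max_thinking_tokens=thinking_budget,  # deliberately generous
        )
        results.append((n, reply["thinking_tokens_used"]))
    return results

# The paper's counterintuitive finding: past some complexity threshold,
# the second number in each pair starts *falling*, long before the
# thinking budget is exhausted.
```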
Why I'm Still Bullish (Despite Everything)
Now, you might think this paper is enough to crush my faith in reasoning models. Quite the opposite! Here's why I still think reasoning is the next big thing in GenAI:
1. We're Still in the Stone Age
The reasoning models tested (o1, o3-mini, DeepSeek-R1, Claude 3.7 Sonnet) are first-generation attempts. Remember how primitive GPT-1 looks next to today's models? That's roughly where we are with reasoning.
2. The Sweet Spot is Real
That medium-complexity regime isn't a fluke - it's a proof of concept. When reasoning works, it REALLY works. We just need to expand that sweet spot, and you can't expect researchers to ship a fully baked solution this early.
3. We Now Know What to Fix
While Apple identified problems, it also handed researchers a roadmap:
Fix the overthinking issue in simple cases
Make thinking effort scale properly with complexity
The Real Lessons for Practitioners
But what about applied ML folks? If you're working with reasoning models right now, here are my takeaways:
A. Know Your Complexity Zone
Simple tasks? Maybe stick with regular models for efficiency. Don’t click that reasoning button in Claude until you need to.
Medium complexity? Reasoning models are your friend.
Ultra-complex? Manage expectations and push back (for now). A rough routing sketch follows this list.
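In practice that can be as simple as a routing layer in front of your models. Here's a rough sketch; the complexity heuristic, thresholds, and model labels are all placeholders you'd replace with whatever fits your workload.

```python
# Rough sketch of complexity-zone routing. The heuristic, thresholds,
# and model labels are placeholders - tune them for your own tasks.

def estimate_complexity(task: str) -> int:
    """Toy 0-10 score based on length and a few structural keywords."""
    score = min(len(task) // 200, 5)
    score += sum(kw in task.lower()
                 for kw in ("prove", "multi-step", "plan", "optimize", "schedule"))
    return min(score, 10)

def route(task: str) -> str:
    c = estimate_complexity(task)
    if c <= 2:
        return "standard-llm"        # low complexity: skip the reasoning tax
    if c <= 7:
        return "reasoning-model"     # the sweet spot
    return "decompose-or-escalate"   # high complexity: don't trust either blindly
```

The exact scorer matters less than the habit of deciding, per request, whether reasoning is actually worth paying for.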
B. Don't Just Scale Compute
Throwing more tokens at reasoning won't magically solve hard problems. We need architectural breakthroughs, and they will come.
C. Test Beyond Accuracy
Apple's approach of analyzing intermediate reasoning steps should be standard practice. Final accuracy is just the tip of the iceberg.
D. Build Proper Benchmarks
If you're evaluating reasoning models, create controlled environments like Apple did. Math benchmarks are contaminated anyway. (A small trace-checking sketch follows.)
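To illustrate what "beyond accuracy" can look like, here's a small sketch that walks through a model's trace move by move and reports where it first goes off the rails, instead of only grading the final answer. The "move disk X from A to C" format is an assumption; parse whatever your model actually emits.

```python
import re

# Sketch: grade a reasoning trace step by step, not just its final answer.
# Assumes moves are written like "move disk 3 from A to C" - adjust the
# regex to whatever format your model actually produces.
MOVE_RE = re.compile(r"move disk (\d+) from ([ABC]) to ([ABC])", re.IGNORECASE)

def first_illegal_step(n_disks: int, trace: str):
    """Return the index of the first illegal move in `trace`, or None."""
    pegs = {"A": list(range(n_disks, 0, -1)), "B": [], "C": []}
    for i, (d, src, dst) in enumerate(MOVE_RE.findall(trace)):
        disk, src, dst = int(d), src.upper(), dst.upper()
        if not pegs[src] or pegs[src][-1] != disk or (pegs[dst] and pegs[dst][-1] < disk):
            return i  # the trace breaks here, even if the final answer looks right
        pegs[dst].append(pegs[src].pop())
    return None
```

Knowing where a trace breaks down is far more useful for debugging than a single pass/fail number.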
The Bigger Picture: What This Means for AGI
This research touches on something fundamental: the difference between pattern matching and genuine reasoning.
Current reasoning models might be sophisticated pattern matchers that happen to work well in a specific complexity range. True reasoning would scale consistently with problem difficulty - something we clearly haven't achieved yet.
But here's my contrarian take: this might be exactly how human reasoning works too. We all have complexity limits where our thinking breaks down. Maybe the path to AGI isn't about eliminating these limitations but understanding and working with them.
The Road Ahead
The next generation of reasoning models will most likely have:
Dynamic thinking allocation: More effort for harder problems
Better calibration: Knowing when to think vs. when to answer quickly
Architectural improvements: Maybe transformer variants specifically designed for reasoning
Hybrid approaches: Combining different reasoning strategies
My Take
This paper is the best thing that could have happened to the reasoning model field. It's forcing us to be honest about current limitations while providing a clear path forward.
Yes, current reasoning models are "illusions" in some sense. But so was GPT-1 compared to modern language models. The question isn't whether current models are perfect - it's whether we're on the right track.
And based on that medium-complexity sweet spot? We absolutely are.
What do you think? Are you still bullish on reasoning models, or has Apple's research changed your perspective? Drop your thoughts in the comments - I'm genuinely curious about the community's reaction to this research.