Apple Exposes Reasoning Flaws in o3, Claude, and DeepSeek-R1

Riya Bansal | Last Updated: 12 Jun, 2025

A rather brutal truth has emerged in the AI industry, one that redefines what we consider the true capabilities of AI. A research paper titled “The Illusion of Thinking” has sent ripples across the tech world, exposing reasoning flaws in prominent “reasoning” models – Claude 3.7 Sonnet (thinking), DeepSeek-R1, and OpenAI’s o3-mini (high). The research argues that these advanced models don’t really reason the way we’ve been led to believe. So what are they actually doing? Let’s find out by diving into this research paper from Apple that examines the reality of AI thinking models.

The Great Myth of AI Reasoning

For months, tech companies have been pitching their newest models as “reasoning” systems that follow a human-like, step-by-step thinking process to solve complex problems. These large reasoning models generate elaborate “thinking” traces before the actual answer is given, seemingly showing genuine cognitive work happening behind the scenes.

But Apple’s researchers have lifted the curtain on this technological drama, and the true capabilities of these chatbots look rather dull. The models appear to be far closer to pattern matchers that cannot cope when faced with genuinely complex problems.

Figure: The Illusion of Thinking (Source: Apple Research)

The Devastating Discovery

The observations in ‘The Illusion of Thinking’ would bother anyone already placing a wager on the reasoning capabilities of current AI systems. Apple’s research team, working with carefully designed, controllable puzzle environments, made three major discoveries:

1. The Complexity Cliff

One of the major findings is that these supposedly advanced reasoning models suffer from what the researchers term “complete accuracy collapse” beyond certain complexity thresholds. Rather than degrading gradually as problems get harder, their performance falls off a cliff, an observation that exposes just how shallow the so-called “reasoning” really is.

Imagine a chess grandmaster who suddenly forgets how a piece moves just because you added an extra row to the board. That is roughly how these models behaved during the research. Models that seemed extremely capable on problem sets they were acquainted with became completely lost the moment they were nudged even an inch out of their comfort zone.

2. The Effort Paradox

What is more baffling is that Apple found these models scale their effort in a way that defies logic. As problems became more demanding, the models initially increased their reasoning effort, producing longer thinking traces with more detail in each step. Beyond a certain difficulty, however, they simply started putting in less effort, despite having ample computational budget left.

It is as if a student, when presented with increasingly difficult math problems, tries harder at first but loses interest at some point and just starts guessing answers at random, despite having ample time to work on the problems.

3. The Three Zones of Performance

The third finding is that Apple identified three distinct performance regimes, which reveal the true nature of these systems:

  • Low-complexity tasks: Standard AI models outperform their “reasoning” counterparts on these tasks, suggesting the extra reasoning steps may just be an expensive show.
  • Medium-complexity tasks: This is the sweet spot where reasoning models genuinely shine.
  • High-complexity tasks: Both standard and reasoning models fail spectacularly on these tasks, hinting at inherent limitations.
Figure: The Illusion of Thinking (Source: Apple Research)

The Benchmark Problem and Apple’s Solution

‘The Illusion of Thinking’ also reveals an uncomfortable truth about how AI is evaluated. Most benchmarks overlap with training data, which makes models appear more capable than they actually are; to a large extent, these tests measure memorization. Apple, on the other hand, created a much more revealing evaluation process. The research team tested the models on the following four logical puzzles with systematically scalable complexity:

  1. Tower of Hanoi: Disks of different sizes must be moved from one peg to another, following two simple rules: only one disk can be moved at a time, and a disk may never be placed on top of a smaller one. This classic problem requires planning several moves ahead, and the difficulty grows quickly as disks are added (see the short sketch after this list).
  2. Checker Jumping: A line of red and blue checkers, separated by a single empty space, must swap sides. A checker can either slide into the adjacent empty space or jump over exactly one opposite-colored checker into an empty space, which demands spatial reasoning and careful sequential planning.
  3. River Crossing: A logic puzzle about ferrying multiple entities across a river under constraints. It involves actors and their agents, who must all cross while ensuring that no actor is ever left in the presence of another agent without their own agent.
  4. Block Stacking: A 3D reasoning puzzle that requires understanding physical relationships. Only one block can be moved at a time, and blocks must be rearranged in the right order to reach the target configuration. Complexity increases with the number of blocks.
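To make the idea of systematically scalable complexity concrete, here is a minimal Python sketch (illustrative only, not Apple’s code) of the Tower of Hanoi. The recursive solution shows why the puzzle works so well as a complexity dial: the optimal move count for n disks is 2ⁿ − 1, so each added disk roughly doubles the work.

```python
# Minimal illustrative sketch (not Apple's code): the Tower of Hanoi scales
# cleanly with the number of disks, which is what makes it a good
# "complexity dial". The optimal solution for n disks takes 2**n - 1 moves.

def hanoi_moves(n, source="A", target="C", spare="B"):
    """Return the optimal move sequence for n disks as (disk, from_peg, to_peg) tuples."""
    if n == 0:
        return []
    return (
        hanoi_moves(n - 1, source, spare, target)    # park the n-1 smaller disks on the spare peg
        + [(n, source, target)]                      # move the largest disk to the target
        + hanoi_moves(n - 1, spare, target, source)  # bring the smaller disks back on top
    )

if __name__ == "__main__":
    for n in range(3, 11):
        print(f"{n} disks -> {len(hanoi_moves(n))} moves")  # 7, 15, 31, ..., 1023
```

A 10-disk instance already needs 1,023 moves, which is the kind of long, exact sequence discussed later in this article.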

The selection of these tasks was by no means random. Each problem can be scaled precisely from trivial to mind-boggling, so the researchers could pinpoint exactly where AI reasoning gives out.

Watching AI “Think”: The Actual Truth

Unlike most traditional benchmarks, these puzzles did not limit the researchers to looking at final answers alone. They exposed the models’ entire chains of reasoning for evaluation. Researchers could watch the models solve problems step by step and see whether the machines were following logical principles or just pattern-matching from memory.
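To illustrate what this trace-level evaluation makes possible, here is a hedged sketch of a move checker for the Tower of Hanoi. The function name and structure are illustrative assumptions, not Apple’s published harness; the point is simply that a puzzle simulator can grade every intermediate step a model proposes, not just the final answer.

```python
# Illustrative sketch only: replay a model's proposed Tower of Hanoi moves
# against the rules and report the first illegal step, instead of grading
# only the final answer. (Apple's actual simulators are not reproduced here.)

def first_illegal_move(n_disks, moves):
    """Return the index of the first rule-violating move, or None if all moves are legal.

    moves: list of (from_peg, to_peg) pairs, with pegs labeled "A", "B", "C".
    """
    pegs = {"A": list(range(n_disks, 0, -1)), "B": [], "C": []}  # bottom of peg -> top
    for i, (src, dst) in enumerate(moves):
        if not pegs[src]:                        # nothing to pick up on the source peg
            return i
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:   # would place a larger disk on a smaller one
            return i
        pegs[dst].append(pegs[src].pop())
    return None

# Example: the second move below illegally puts disk 2 on top of disk 1.
print(first_illegal_move(3, [("A", "C"), ("A", "C")]))  # -> 1
```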

The results were eye-opening. Models that appeared to be “reasoning” through a problem beautifully would suddenly turn illogical, abandon systematic approaches, or simply give up once complexity increased, even though moments earlier they had demonstrated exactly the required skills.

By building new, controllable puzzle environments, Apple circumvented the contamination problem and exposed the full scale of the models’ limitations. The outcome was sobering: on genuinely new challenges that could not be memorized, even the most advanced reasoning models struggled in ways that highlight their real limits.

Results and Analysis

Across all four types of puzzles, Apple’s researchers documented consistent failure modes that paint a grim picture of today’s AI capabilities.

  • Accuracy collapse: Models that reached near-perfect performance on the simplest versions of a puzzle suffered an astonishing drop in accuracy, sometimes falling from roughly 90% success to near-total failure after only a small increase in complexity. This was never a gradual degradation but a sudden, catastrophic failure.
  • Inconsistent logic application: The models sometimes failed to apply algorithms consistently even while demonstrating knowledge of the correct approach. For example, a model might execute a systematic strategy successfully on one Tower of Hanoi instance, then abandon that very strategy on a similar but slightly more complex one.
  • The effort paradox: The researchers measured how much “thinking” the models did as a function of problem difficulty, using the length and granularity of the reasoning traces. Initially, thinking effort increased with complexity. But as problems became tougher, the models abnormally began scaling their effort back, even when ample computational budget remained (a toy version of this measurement is sketched after this list).
  • Computational shortcuts: The models tended to take shortcuts that worked well on simple problems but led to catastrophic failures on harder ones. Rather than recognizing the pattern and compensating, a model would either keep grinding with a bad strategy or simply give up.
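The bullet points above describe a measurement protocol: sweep puzzle size, grade each attempt, and track both accuracy and reasoning effort. Below is a toy, self-contained sketch of what such a sweep could look like. The `sweep` helper and the `fake_reasoning_model` stand-in are hypothetical illustrations (Apple’s evaluation code and the real model APIs are not shown here); the fake model only mimics the reported pattern so the snippet runs on its own.

```python
# Hypothetical sketch of an evaluation sweep in the spirit of the paper's
# protocol: scale puzzle size, record accuracy and a rough "thinking effort"
# measure at each size, and look for the collapse point. The model call is
# faked so the example is runnable; swap in a real model client to use it.
import random

def sweep(attempt_fn, sizes=range(3, 13), trials=5):
    """attempt_fn(size) -> (solved: bool, thinking_tokens: int)."""
    rows = []
    for n in sizes:
        outcomes = [attempt_fn(n) for _ in range(trials)]
        rows.append({
            "size": n,
            "accuracy": sum(ok for ok, _ in outcomes) / trials,
            "avg_thinking_tokens": sum(t for _, t in outcomes) / trials,
        })
    return rows

def fake_reasoning_model(n):
    """Toy stand-in that mimics the reported pattern: effort grows with size,
    then both effort and accuracy fall off past a collapse threshold."""
    collapse_at = 9
    if n < collapse_at:
        effort = 200 * n                                      # effort grows with difficulty
        solved = random.random() < 0.95                       # near-perfect on easier sizes
    else:
        effort = 200 * collapse_at - 150 * (n - collapse_at)  # effort shrinks past the cliff
        solved = random.random() < 0.05                       # near-total failure
    return solved, max(effort, 50)

for row in sweep(fake_reasoning_model):
    print(row)
```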

These findings establish that current AI reasoning is more brittle and limited than public demonstrations have led us to believe. The models have not yet learned to reason; for now, they recognize the shape of reasoning and replicate it when they have seen something similar before.

Figure: The Illusion of Thinking (Source: Apple Research)

A Critical Perspective: Are These Findings Complete?

Before declaring the death of AI reasoning models, these findings deserve careful scrutiny. The research does not necessarily say that LRMs cannot reason; it may only say that they cannot reason superhumanly. As Gary Marcus put it, “Humans actually have a bunch of well-known limits that parallel what the Apple team discovered. Many humans screw up versions of the Tower of Hanoi with 8 discs.” Importantly, Apple chose to test with classic puzzles such as the Tower of Hanoi, whose solutions exist in ample supply in the training data, which makes one wonder why the researchers expected algorithmic hints to help when the models already know the algorithm.

These reasoning models were trained and optimized for math and coding, not puzzles, which is rather like judging a language model’s progress by how well it performs at poetry versus prose. And even if we accept the complexity-threshold argument, does struggling with a Tower of Hanoi that requires around a thousand steps (a 10-disk instance already takes 2¹⁰ − 1 = 1,023 moves) prove that models cannot reason? How many humans can perform a thousand algorithmic steps flawlessly?

Reasoning up to the tenth step and failing on the eleventh is still reasoning. It may not be superhuman-level reasoning, but it lies within the range of human cognition, which is already a tremendous stride. Crucially, without a human baseline on these same scaled puzzles, we are judging the AI in a vacuum: we cannot tell whether we are witnessing fundamental AI limitations or simply discovering that present-day systems reason in roughly human terms.

Why Does This Matter for the Future of AI?

‘The Illusion of Thinking’ is far from academic nitpicking; it touches very deeply upon the implications of AI. Its findings affect the entire AI industry and anyone who makes decisions based on claimed AI capabilities.

Apple’s findings indicate that so-called ‘reasoning’ is largely a very sophisticated kind of memorization and pattern matching. The models excel at recognizing problem patterns they have seen before and applying solutions they have previously learned. But they tend to fail when asked to logically work through a problem that is genuinely new to them.

For the past few months, the AI community has been awestruck by the advancements in reasoning models showcased by their parent companies. Industry leaders have even promised that Artificial General Intelligence (AGI) is right around the corner. ‘The Illusion of Thinking’ suggests that this assessment is wildly optimistic. If present ‘reasoning’ models cannot handle complexity beyond current benchmarks, and if they are indeed just dressed-up pattern-matching systems, then the path toward true AGI may be longer and tougher than Silicon Valley’s most optimistic projections.

Despite these sobering observations, Apple’s study is not entirely pessimistic. The performance of AI models in the medium-complexity regime shows real progress in their reasoning capabilities. In this band, these systems can execute genuinely complicated tasks that were deemed impossible only a few years ago.

Conclusion

Apple’s research marks a turning point from breathless hype toward precise, scientific measurement of what AI systems can do. This is where the AI industry faces its next choice: keep chasing benchmark scores and marketing claims, or focus on building systems that can actually perform some level of reasoning. The companies that choose the latter may end up building the AI systems we really need.

It is clear, however, that future paths to AGI will require more than scaled-up pattern matchers. They will need fundamentally new approaches to reasoning, understanding, and genuine intelligence. Illusions of thinking can be convincing, but as Apple has shown, that is all they are: illusions. The real work of engineering truly intelligent systems is just beginning.

