The average human IQ is 100. Statistical fact – not an insult. For decades, that number has quietly defined what we meant by “normal intelligence.” But in 2025, something strange is happening. Machines with no consciousness, no emotions, and no lived experience are now scoring higher than humans on the very tests designed to measure human intelligence. Does that mean AI models, especially the latest ones like Gemini 3 and GPT-5.2, are smarter than most of us humans?
Several large language models have been tested on IQ-style benchmarks over the past year. These include logic puzzles, abstract reasoning tests, pattern recognition tasks, and problem-solving challenges. The results are hard to ignore. Model after model is matching, and in many cases surpassing, the performance of an average human. Not on any single task, but across multiple dimensions of reasoning that IQ tests care about.
This article looks at 15+ AI models that are smarter than you, at least by IQ-style standards. We will break down what “smart” really means here, how these models are evaluated, and why this shift matters.
First, let’s figure out whether we can even assign an IQ to an AI model.
Strictly speaking, we cannot. IQ was designed to measure human intelligence, shaped by biology, experience, and consciousness. An AI does not think, feel, or understand the world the way humans do. So, assigning it a literal IQ score would be scientifically incorrect.
But in practice, these comparisons are made a tad differently.
Basically, instead of asking whether an AI has an IQ, researchers check how the model performs on IQ-style tasks. Imagine a system consistently solving logic puzzles, pattern-recognition tasks, and reasoning problems that humans with an IQ of 120 or 130 typically solve. If the AI model does so reliably, it becomes reasonable to map its performance to an equivalent IQ range, right?
And that is exactly how we associate IQ with an AI model.
This is not a psychological diagnosis. Think of it as a performance benchmark. IQ here acts as a shared language, or a way to compare how well different systems reason under controlled conditions. And by that yardstick, several modern LLMs are already operating well above the human average.
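To make that mapping concrete, here is a minimal Python sketch – purely illustrative, and assuming the conventional normal model of IQ (mean 100, standard deviation 15) rather than any specific test provider’s norming – that converts a score into a percentile and a “1 in N” rarity.

from math import erf, sqrt

def iq_rarity(iq: float, mean: float = 100.0, sd: float = 15.0):
    """Percentile and '1 in N' rarity for an IQ score, assuming scores
    follow a normal distribution (the standard model behind IQ scales)."""
    # Standard normal CDF computed via the error function.
    percentile = 0.5 * (1.0 + erf((iq - mean) / (sd * sqrt(2.0))))
    one_in_n = 1.0 / (1.0 - percentile)  # people tested per one scoring at or above this level
    return percentile, one_in_n

# Scores that appear later in this list.
for score in (100, 109, 124, 135, 141, 147):
    p, n = iq_rarity(score)
    print(f"IQ {score}: top {100 * (1 - p):.2f}% of humans, about 1 in {n:,.0f}")

Under those assumptions, a score of 147 lands in roughly the top 0.09% – about 1 in 1,150 people – which is exactly the kind of rarity the scores below are claiming.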
So which tests are being used? Classic IQ tests, or at least online versions of them. The tasks within these challenges measure reasoning, abstraction, and problem-solving rather than memorisation, and the tests are either directly adapted from human IQ exams or closely mimic the same cognitive skills.
For instance, one of the most common IQ tests is Raven’s Progressive Matrices, a visual pattern-recognition test that has long been considered culture-fair. Several LLMs now solve these puzzles at or above the level of high-IQ humans. Then there are Mensa-style logic tests, which include sequence completion, symbol reasoning, and deductive logic. Modern AI models have shown consistently strong performance on these.
However, language-heavy sections of IQ tests are where LLMs really shine. Verbal reasoning, analogies, and arithmetic problems, similar to WAIS subtests, play directly to their strengths. On top of that, modern benchmarks like BIG-Bench Hard, ARC-style reasoning tasks, and academic evaluations such as MMLU and Humanity’s Last Exam serve as practical stand-ins for IQ testing. While they are not labelled as “IQ tests,” they measure the same underlying abilities. The important part – LLMs are increasingly outperforming the majority of humans on these tests.
See for yourself.
For this particular list, we will focus specifically on the Mensa Norway test and rank the AI models by their scores.
1. GPT-5.2 Pro
Mensa Norway IQ: 147
This is the root of this entire discussion of AI models and their IQs. Fresh off its debut, GPT-5.2 Pro has set a new all-time IQ record for LLMs: 147. As Derya Unutmaz mentions in his tweet, this kind of intelligence is found in “only less than 1 in 1000 people.”
GPT-5.2 Pro consistently demonstrates this supremacy over humans, especially in multi-step logic, abstract reasoning, and professional-grade problem solving. While that does not necessarily mean it is smarter than humans in all respects, it does indicate a powerful shift in where the upper bounds of test-measured intelligence now sit.
2. GPT-5.2 Thinking
Mensa Norway IQ: 141
Next up is the thinking sibling of the newly introduced GPT-5.2. On the Mensa Norway IQ test, GPT-5.2 Thinking scores around 141, placing it well beyond the human average of 100 and comfortably above the typical Mensa qualification threshold. In human terms, and assuming the standard scale (mean 100, SD 15), a score of 141 sits in roughly the top 0.3% of the population, purely on abstract reasoning and pattern recognition.
What this result actually tells us is very specific. GPT-5.2 Thinking performs exceptionally well on tasks that require identifying relationships, spotting visual or logical patterns, and applying consistent rules across multiple steps. These are the exact abilities IQ tests are designed to isolate, independent of language, emotion, or domain knowledge.
This basically means that, as far as structured reasoning under controlled conditions is concerned, GPT-5.2 Thinking operates at a level most humans never reach.
3. Gemini 3 Pro Preview
Mensa Norway IQ: 141
Right alongside GPT-5.2 Thinking sits Gemini 3 Pro Preview, matching its Mensa Norway IQ score perfectly. This places Google’s flagship reasoning model firmly in elite territory, far above the human baseline and well past the threshold typically associated with high intellectual ability.
In practical terms, it means Gemini 3 Pro Preview performs reliably on abstract reasoning challenges. Such tests usually require rule discovery, pattern continuation, and logical elimination. These are problems where guessing fails quickly. You can only score this high with structured inference.
This score thus reflects Gemini 3 Pro Preview’s strength in controlled reasoning environments.
4. Grok 4 Expert Mode
Mensa Norway IQ: 137
Of course, you can’t speak of intelligence and keep an Elon Musk-backed product off the list. Close behind the top scorers sits Grok 4 Expert Mode. While its score is slightly below the very top tier, it is well within the range of exceptional human intelligence and comfortably above the average benchmark of 100.
The score highlights Grok 4 Expert Mode’s ability to handle logic-driven tasks with clarity and control. It performs well on pattern recognition, abstract relationships, and elimination-based reasoning – the core components of IQ-style tests.
In simple terms, Grok 4 Expert Mode demonstrates strong analytical reasoning under test conditions. While it may not top the chart, its performance confirms that it operates far beyond human-average reasoning levels when evaluated purely on logic and pattern-based intelligence.
5. GPT-5.2 Pro Vision
Mensa Norway IQ: 135
Not far behind its text-only counterpart is GPT-5.2 Pro Vision, scoring 135 on the Mensa Norway test. That still places it firmly within the range of very high human intelligence, well above both the global average and the typical threshold associated with advanced reasoning ability.
Note that this score comes from a vision-enabled model – an AI model that can process and reason over visual information (like input images), and not just text. This means GPT-5.2 Pro Vision performs strongly on abstract reasoning tasks even when visual interpretation is required.
Now imagine an AI so intelligent that it scores 135 on an IQ test even after deciphering complex images and visual patterns. Until a couple of years ago, we would have thought that possible only in a sci-fi movie.
6. GPT-5.2
Mensa Norway IQ: 126
With the Pro and Thinking variants covered, OpenAI’s latest standard model takes the stage. But make no mistake: it is still formidable, especially in comparison with humans. A score of 126 places it above roughly 96% of the human population, firmly separating it from what we consider average human reasoning ability.
This score reflects GPT-5.2’s strength in handling classic IQ-style tasks such as pattern recognition, logical sequencing, and rule-based problem solving. While it does not push into the extreme upper ranges like its Pro or Thinking variants, it remains consistently strong across structured reasoning challenges.
In practical terms, GPT-5.2 represents the point where AI reasoning clearly crosses into elite human territory. It may not top the charts, but even at this level, it outperforms the vast majority of people on controlled intelligence tests.
7. Kimi K2 Thinking
Mensa Norway IQ: 124
Next up is Kimi K2 Thinking, a model that may not grab headlines as loudly as some Western counterparts. Yet it resonates among AI enthusiasts globally, and for good reason. A score of 124 places it clearly above the human average and well into the range associated with strong analytical ability.
This result highlights Kimi K2 Thinking’s capability on structured reasoning tasks. In practical terms, Kimi K2 Thinking demonstrates that high-level abstract reasoning is no longer limited to a small group of flagship models. Even outside the absolute top scorers, modern LLMs are now consistently operating above average human intelligence on standardised tests. Is it a trend? Or a fact waiting to be established? We shall find out in time.
8. Claude Opus 4.5
Mensa Norway IQ: 124
Matching Kimi K2 Thinking is Claude Opus 4.5, Anthropic’s flagship reasoning model, with a Mensa Norway IQ score of 124. That is well above the human average, and a firm indicator of strong analytical and problem-solving ability.
The score reflects Claude Opus 4.5’s competence on abstract reasoning tasks that demand consistency and logical control. In other words, Claude Opus 4.5 shows that robust, above-average reasoning is found even outside the top tier of LLMs.
9. Gemini 3 Pro Preview Vision
Mensa Norway IQ: 123
Just a step below its text-only counterpart sits Gemini 3 Pro Preview Vision, with a Mensa Norway IQ score of 123. The score is all the more notable because it comes from a vision-enabled model, which means Gemini 3 Pro Preview Vision must interpret visual patterns and relationships before applying any logic.
In other words, the shift from text-only to vision-based inputs does not lower its reasoning performance. Even under tougher-than-usual conditions, it continues to perform at a level most humans do not reach on standardised intelligence tests.
10. Claude Sonnet 4.5
Mensa Norway IQ: 123
Sharing the same Mensa Norway IQ score of 123 is Claude Sonnet 4.5, Anthropic’s more balanced reasoning model. While not positioned as the most extreme thinker in the lineup, it comfortably outperforms the human baseline in terms of logical reasoning ability.
The result reflects Claude Sonnet 4.5’s steady performance on structured problem-solving tasks. You may want to note that even in a more efficient form, Sonnet 4.5 exceeds the reasoning capabilities of most humans.
11. GPT-5.2 Thinking Vision
Mensa Norway IQ: 111
Let me be clear here: an IQ-style test is unforgiving to vision-enabled systems. Before a model can reason its way to a solution and earn a high score, it must first correctly interpret shapes, patterns, and spatial relationships. This is essentially how we humans process information – we see, interpret, and then reason. For an AI, however, that is a whole task in itself.
So do not mistake GPT-5.2 Thinking Vision’s IQ score of 111 for a merely average result. The model is doing something harder: thinking while seeing. A single mistake in interpretation trickles all the way down to the solution.
GPT-5.2 Thinking Vision thus does not chase elite abstract scores. What it demonstrates instead is something far more important: usable intelligence in messy, multimodal environments. And as AI moves closer to the real world, that may prove to be the most desirable trait in an AI model, if it is not already.
12. Manus
Mensa Norway IQ: 111
Sitting at an IQ score of 111 is Manus, a model that proves intelligence does not always mean “extreme.” A score like this already places Manus above the human average, but more importantly, it signals dependable reasoning and consistency.
That means it may not solve the hardest puzzles at record speed, but it avoids the kinds of breakdowns that often plague weaker models. This is usable intelligence at its best.
13. GPT-4o
Mensa Norway IQ: 109
With a Mensa Norway IQ score of 109, GPT-4o sits just above the human average. While this may seem modest compared to the models higher up the list, it still marks a clear departure from what was considered “capable” AI not too long ago.
This score reflects GPT-4o’s ability to handle basic abstract reasoning and pattern recognition without falling apart. It may not excel at complex multi-step puzzles, but it performs reliably on simpler logic tasks – which is exactly what most of us, myself included, need for everyday problem solving.
In a way, this represents accessible intelligence. While it is not built to dominate IQ charts, it shows how AI models can slightly exceed average human reasoning and be helpful with our daily tasks.
14. DeepSeek R1
Mensa Norway IQ: 109
Matching GPT-4o is DeepSeek R1, with a Mensa Norway IQ score of 109. Like GPT-4o, it offers competent reasoning that is accessible to users around the globe, without the sharp drop-offs seen in less capable systems.
In simple terms, you may consider DeepSeek R1 as dependable baseline intelligence. It shows that even models not designed for peak reasoning performance can still meet, and slightly exceed, average human reasoning on standardised IQ-style tests.
15. Llama 4 Maverick
Mensa Norway IQ: 107
With a Mensa Norway IQ score of 107, Llama 4 Maverick sits slightly above the average human baseline. At the very least, it demonstrates a level of intelligence that is meaningfully better than chance or shallow pattern matching.
Think of Llama 4 Maverick as entry-level reasoning competence among modern LLMs. It shows that even models not designed for advanced problem-solving can still help with tasks beyond the capabilities of the average human.
16. DeepSeek V3
Mensa Norway IQ: 103
Closing the list is DeepSeek V3, with a Mensa Norway IQ score of 103. This places the model only just above the human average IQ. It also means that DeepSeek V3 can handle elementary pattern recognition and simple logical relationships without major errors.
This is the lower bound of what modern LLMs can now achieve on intelligence benchmarks. Even at this level, the takeaway is clear: AI systems have crossed the threshold where average human reasoning is no longer the bar to clear – it is the baseline.
Do not think of this list as a leaderboard of the smartest AI models. While it works as one in a loose sense, the scores are not an absolute measure of smartness.
Its real value lies elsewhere – it makes a strong point that structured reasoning is no longer limited to humans. Across models, architectures, and organisations, AI systems are now matching, and often exceeding, human performance on IQ tests that were once considered difficult even for trained individuals.
That said, the context here will always be limited. This score does not imply creativity, consciousness, or human-like understanding. These models do not possess intent, emotions, or self-awareness. They do not “think” in the way humans do. What they prove with their respective scores instead is something far narrower, yet profound. AI can now solve abstract, logic-driven problems just as well, if not better, than humans.
This article is not meant to weigh in on an AI-versus-human intelligence war. It simply makes one point – human-level reasoning is no longer the ceiling. This list shows how quickly large language models have crossed thresholds that once defined exceptional intelligence, at least in test-measured terms.
At the same time, these scores remind us what intelligence isn’t. They do not imply creativity, consciousness, or understanding. What they do show is that structured reasoning has become cheap, fast, and scalable. And thanks to that, the real differentiator shifts back to humans. We can now decide what problems to solve, instead of how to solve them.