“It’s smarter than almost all graduate students in all disciplines.” – Elon Musk
Elon Musk and his Grok team are back with their latest and best model to date: Grok 4. It was only a few months ago that this team of experts launched Grok 3, a model that still competes with the giants from OpenAI, Google, and Anthropic. But with Grok 4, Elon Musk is giving these companies a run for their money. Grok 4 comes with superhuman-level thinking and reasoning capabilities, and with tools and agents in its arsenal, it brings a better understanding of the world, both personal and professional. In this blog, we’ll explore everything about Grok 4: its features, capabilities, and benchmarks, and finally, we’ll put it to the test.
Let’s Grok it!
Grok 4 is the latest multimodal large language model (LLM) from Elon Musk’s company, xAI. It was trained on 100 times more data than Grok 2 (xAI’s first public model) and with 10 times more reinforcement learning compute than any other available model. Grok 4 features a 256K-token context window, real-time data search, advanced voice capabilities, agentic abilities, and reasoning that closely mimics how humans work through problems.
Grok 4 has two versions: the standard Grok 4, and Grok 4 Heavy, a multi-agent variant that runs several agents in parallel to tackle harder problems.
When it comes to multimodal capabilities, especially image analysis and generation, the Grok 4 models currently perform worse than top models like o3, Gemini 2.5 Pro, and Claude 4, although this may improve significantly in the coming weeks.

To access Grok 4 on chat, head to grok.com or the Grok tab in the X app, sign in, and select Grok 4 from the model picker (a SuperGrok or X Premium+ subscription is required).

To access Grok 4 via the API, you can use xAI’s official Python SDK:
from xai_sdk import Client
from xai_sdk.chat import user, system

# Initialize the client with your xAI API key
client = Client(
    api_host="api.x.ai",
    api_key="<YOUR_XAI_API_KEY_HERE>",
)

# Start a chat session with Grok 4 (temperature=0 for deterministic output)
chat = client.chat.create(model="grok-4-0709", temperature=0)
chat.append(system("You are a PhD-level mathematician."))
chat.append(user("What is 2 + 2?"))

# Sample a response from the model and print it
response = chat.sample()
print(response.content)
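If you’d rather not install the xai_sdk package, the xAI API is OpenAI-compatible, so the familiar openai client works as well. Here is a minimal sketch, assuming your key is stored in an XAI_API_KEY environment variable:

import os
from openai import OpenAI

# Point the standard OpenAI client at xAI's OpenAI-compatible endpoint
client = OpenAI(
    base_url="https://api.x.ai/v1",
    api_key=os.environ["XAI_API_KEY"],
)

response = client.chat.completions.create(
    model="grok-4-0709",
    messages=[
        {"role": "system", "content": "You are a PhD-level mathematician."},
        {"role": "user", "content": "What is 2 + 2?"},
    ],
    temperature=0,
)
print(response.choices[0].message.content)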
Now that we’ve read all about Grok 4, it’s time to see if it packs the punch it claims. To do this, we will test Grok 4 on the following tasks:
1. A multi-part math problem with graph plotting
2. Real-time search across X and the web
3. Merging PDFs into a single JSON file
4. Converting code into Python and React
Let’s start.

Result:
Analysis:
Grok 4 approached the problem step-by-step, addressing each question in order. It correctly interpreted the prompt, reasoned through the solution, and even generated code for the graphs when asked. The visualizations were accurate and aligned with the explanation.
Prompt: “Tell me about Analytics Vidhya’s latest post on X and find the latest blog on their website – summarise information on them in 5 lines each.”
Result:
Analysis:
Grok 4 performed this task better than I had imagined. The task itself is not difficult, but I have seen so many models struggle with dates and fail to fetch the latest information accurately. Grok 4 took only a few seconds: it went through the website and the X page, found the latest information, and then reasoned through it to give me 5 concrete lines on each.
You can check it yourself on our blog page or X page.
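As an aside, this kind of real-time retrieval is also exposed through the API via xAI’s Live Search feature. Below is a minimal sketch; note that the search_parameters field and its “auto” mode are assumptions based on that feature, so verify them against the current API docs before relying on this:

import os
import requests

# Ask Grok 4 a question with live web/X search enabled.
# NOTE: "search_parameters" and its "auto" mode are assumptions based on
# xAI's Live Search feature; check the current API docs to confirm.
resp = requests.post(
    "https://api.x.ai/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['XAI_API_KEY']}"},
    json={
        "model": "grok-4-0709",
        "messages": [
            {"role": "user", "content": "What is Analytics Vidhya's latest blog post about?"}
        ],
        "search_parameters": {"mode": "auto"},
    },
)
print(resp.json()["choices"][0]["message"]["content"])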
Prompt: “Merge all these PDFs and create a single JSON file.”
Result:

Analysis:
It started well, listing the content from a few files, but then the hallucinations began: all I got in the result was a stream of “#” characters. So this was disappointing.
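For comparison, here is what the same task looks like done deterministically in Python. This is a minimal sketch using the pypdf library, assuming the PDFs are text-based rather than scanned images, and that they sit in a hypothetical pdfs/ folder:

import json
from pathlib import Path

from pypdf import PdfReader  # pip install pypdf

# Extract the text of every PDF in a folder into one JSON file,
# keyed by filename, with one list entry per page.
merged = {}
for pdf_path in sorted(Path("pdfs").glob("*.pdf")):
    reader = PdfReader(pdf_path)
    merged[pdf_path.name] = [page.extract_text() or "" for page in reader.pages]

with open("merged.json", "w", encoding="utf-8") as f:
    json.dump(merged, f, ensure_ascii=False, indent=2)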
Prompt 2: “Convert the following code into Python and React”
Result:
Analysis:
Grok 4 was quick and efficient: it generated the Python code promptly and correctly inferred from the word “React” in my prompt that I wanted code for my app’s frontend. It also presented the code section by section, making it simple for me to copy the required part as and when needed.
Grok 4 aced almost all of the benchmarks that we usually look at. Here is a summary:

These are the usual benchmarks for testing any new LLM. Grok 4 also came with a scorecard on two newer benchmarks: ARC-AGI and Vending-Bench.
This benchmark measures how close models are to artificial general intelligence (AGI) by scoring their performance on ARC-style tasks: a collection of challenging abstract-reasoning puzzles.

Grok 4 takes the top spot, breaking the 10% barrier, which suggests the model has taken its first steps into general reasoning. The Claude Opus 4 models follow, and then come o3 (high), o4-mini (high), and others. By this measure, Grok 4 is closer to AGI than the rest of its peers.
This benchmark tests agentic AI systems by measuring how well an agent can run a simulated vending machine business over a long horizon: ordering stock, setting prices, and managing cash. It’s designed to stress-test real-world decision-making, planning, and long-term coherence.
Grok 4 excels here too, beating the human baseline as well as Claude Opus 4, Gemini 2.5 Pro, and o3.

In fact, Grok 4 was tested on running a vending machine business in this benchmark’s simulation, and it turned a sizeable profit while doing so. Anthropic had released something similar a few days earlier about Claude running an actual vending machine, and in that experiment the machine ran at a loss!
Grok 4 comes with a great set of features and performance benchmarks, based on which it can be pretty useful for:
- Research that needs real-time information from the web and X
- Math-heavy reasoning and problem solving
- Code generation across languages and frameworks
- Agentic, multi-step tasks that involve planning and tool use
Although Grok 3 was recently in the spotlight for its racist comments, with Grok 4 the team is looking to do more than just damage control. Grok 4 comes with tool use integrated from the start, and the Grok team plans to upgrade this to “commercial grade” capabilities, helping you solve actual, real-world problems. Along with this, we can expect Grok 4 to master video and image analysis and generation very soon, bringing us closer to playable AI-generated video games and fully AI-generated shows.
Is Grok 4 a big deal? Definitely. In a market that feels increasingly saturated, it stands out as a breath of fresh air, offering real improvements over its predecessors. With actual use cases emerging, it seems poised to help solve many everyday problems. Both the standard and Heavy variants are agentic, fast, and significantly better at reasoning. While some suggest it’s built for AGI, I believe there’s still time and room for growth. Grok 3 also launched with great promise but later went off track. This new release is just the beginning; much more testing is needed to understand its true potential.