Gemini 3.1 Pro: A Hands-On Test of Google’s Newest AI

Vasu Deo Sankrityayan · Last Updated: 20 Feb, 2026
7 min read

Just three months after releasing its state-of-the-art Gemini 3 Pro model, Google DeepMind is back with its latest iteration: Gemini 3.1 Pro.

A substantial upgrade in both capability and safety, Gemini 3.1 Pro aims to be accessible and usable by everyone. Regardless of your preferred platform or budget, the model has something to offer.

In this article, I test the capabilities of Gemini 3.1 Pro and walk through its key features, from how to access the model to how it performs on benchmarks.

Gemini 3.1 Pro: What’s new?

Gemini 3.1 Pro is the latest member of the Gemini model family. As usual, the model ships with a long list of features and improvements over its predecessor. Some of the most notable are:

  • 1 Million Context Window: Maintains the industry-leading 1 million token input capacity, allowing it to process over 1,500 pages of text or entire code repositories in a single prompt.
  • Advanced Reasoning Performance: It delivers more than double the reasoning performance of Gemini 3 Pro, scoring 77.1% on the ARC-AGI-2 benchmark. 
  • Enhanced Agentic Reliability: Specifically optimized for autonomous workflows, including a dedicated API endpoint (gemini-3.1-pro-preview-customtools) for high-precision tool orchestration and bash execution.
  • Pricing: Per-token cost is unchanged from its predecessor, so existing Pro users are effectively getting a free upgrade.
  • Advanced Vibe Coding: The model handles visual coding exceptionally well. It can generate website-ready, animated SVGs purely through code, meaning crisp scaling and tiny file sizes.
  • Hallucinations: Gemini 3.1 Pro tackles the hallucination problem head on, cutting its hallucination rate from 88% to 50% on the AA-Omniscience (Knowledge and Hallucination) benchmark.
  • Granular Thinking: The model adds more granularity to the thinking control offered by its predecessor. Users can now choose between high, medium, and low thinking levels; the table below compares support across models, and a small API sketch follows it.
| Thinking Level | Gemini 3.1 Pro | Gemini 3 Pro | Gemini 3 Flash | Description |
|---|---|---|---|---|
| Minimal | Not supported | Not supported | Supported | Matches the no-thinking setting for most queries; the model may think minimally for complex coding tasks. Minimizes latency for chat or high-throughput applications. |
| Low | Supported | Supported | Supported | Minimizes latency and cost. Best for simple instruction following or high-throughput applications. |
| Medium | Supported | Not supported | Supported | Balanced reasoning for most tasks. |
| High | Supported (Default, Dynamic) | Supported (Default, Dynamic) | Supported (Default, Dynamic) | Maximizes reasoning depth. May increase latency, but outputs are more carefully reasoned. |
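To show how these levels might be selected in practice, here is a minimal sketch using the google-genai Python SDK. The thinking_level parameter name and the gemini-3.1-pro model ID are inferred from the table and article above rather than taken from official SDK documentation, so treat both as assumptions:

from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

# "low", "medium", and "high" mirror the thinking levels in the table above.
# The thinking_level field name is an assumption based on this article, not
# verified against the current SDK reference.
response = client.models.generate_content(
    model="gemini-3.1-pro",
    contents="Outline a test plan for a payments API in five steps.",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_level="low"),
    ),
)
print(response.text)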

Hands-On: Let’s have some fun

All the talk in the world wouldn't amount to much if performance fell flat in practice. To evaluate Gemini 3.1 Pro properly, I tested it across three categories:

  1. Complex reasoning
  2. Code generation & debugging
  3. Long-context synthesis

Task 1: Multi-Step Logical Reasoning

What this tests: Chain-of-thought reasoning, constraint handling, and hallucination resistance.

Prompt: 

“You are given the following scenario:

Five analysts — A, B, C, D, and E — are assigned to three projects: Alpha, Beta, and Gamma.

Rules:

1. Each project must have at least one analyst.
2. A cannot work with C.
3. B must be assigned to the same project as D.
4. E cannot be on Alpha.
5. No project can have more than three analysts.

Question: List all valid assignment combinations. Show your reasoning clearly and ensure no rule is violated.”

Response:

Gemini 3.1 Pro handled constraint-heavy logic without collapsing into contradictions, which is where most models stumble. The consistency and clarity in enumerating valid combinations showed serious reasoning depth.
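
If you would rather verify the enumeration than take the model's word for it, a brute-force check takes only a few lines of Python. This is my own sketch, not part of the model's response; it simply tests every possible assignment against the five rules:

from itertools import product

ANALYSTS = ["A", "B", "C", "D", "E"]
PROJECTS = ["Alpha", "Beta", "Gamma"]

def is_valid(assignment):
    """Check one analyst-to-project mapping against the five rules."""
    groups = {p: [a for a, proj in assignment.items() if proj == p] for p in PROJECTS}
    return (
        all(groups[p] for p in PROJECTS)                  # Rule 1: every project staffed
        and assignment["A"] != assignment["C"]            # Rule 2: A cannot work with C
        and assignment["B"] == assignment["D"]            # Rule 3: B and D share a project
        and assignment["E"] != "Alpha"                    # Rule 4: E is not on Alpha
        and all(len(groups[p]) <= 3 for p in PROJECTS)    # Rule 5: at most three per project
    )

valid = []
for combo in product(PROJECTS, repeat=len(ANALYSTS)):
    assignment = dict(zip(ANALYSTS, combo))
    if is_valid(assignment):
        valid.append(assignment)

print(f"{len(valid)} valid assignments found")
for a in valid:
    print(a)

Comparing the script's output against the model's list is a quick way to spot missed or invalid combinations.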

Task 2: Code Generation & Refactoring

What this tests: Structured thinking, edge-case awareness, code correctness, and production realism.

Prompt:

“You are given the following Python function:

def process_orders(orders):
    total = 0
    for order in orders:
        if order["status"] == "completed":
            total += order["amount"]
    return total

Refactor this function to:

1. Handle missing keys safely.
2. Ignore negative amounts.
3. Log skipped invalid entries.
4. Be production-ready with type hints and docstrings.
5. Include unit tests for edge cases.

Return clean, well-structured Python code only.”

Response:

import logging
import unittest
from typing import List, Dict, Any

logger = logging.getLogger(__name__)

def process_orders(orders: List[Dict[str, Any]]) -> float:
    """
    Process a list of order dictionaries and calculate
    the total amount of completed orders.
    """
    total: float = 0.0

    if not isinstance(orders, list):
        logger.error("'orders' must be a list.")
        return total

    for index, order in enumerate(orders):
        if not isinstance(order, dict):
            logger.warning("Skipping index %d: Not a dictionary.", index)
            continue

        if order.get("status") != "completed":
            continue

        if "amount" not in order:
            logger.warning("Skipping index %d: Missing 'amount'.", index)
            continue

        amount = order["amount"]

        if not isinstance(amount, (int, float)) or isinstance(amount, bool):
            logger.warning("Skipping index %d: 'amount' must be numeric.", index)
            continue

        if amount < 0:
            logger.warning("Skipping index %d: Negative amount (%.2f).", index, amount)
            continue

        total += float(amount)

    return total


class TestProcessOrders(unittest.TestCase):

    def setUp(self):
        logging.disable(logging.WARNING)

    def tearDown(self):
        logging.disable(logging.NOTSET)

    def test_valid_completed_orders(self):
        orders = [
            {"status": "completed", "amount": 100},
            {"status": "completed", "amount": 50.5}
        ]
        self.assertEqual(process_orders(orders), 150.5)

    def test_ignore_pending_orders(self):
        orders = [
            {"status": "pending", "amount": 100},
            {"status": "completed", "amount": 50}
        ]
        self.assertEqual(process_orders(orders), 50.0)

    def test_missing_keys_skipped(self):
        orders = [
            {"amount": 100},
            {"status": "completed"},
            {"status": "completed", "amount": 20}
        ]
        self.assertEqual(process_orders(orders), 20.0)

    def test_negative_amounts_ignored(self):
        orders = [
            {"status": "completed", "amount": -10},
            {"status": "completed", "amount": 30}
        ]
        self.assertEqual(process_orders(orders), 30.0)


if __name__ == "__main__":
    unittest.main()

The refactored code felt production-aware, not toy-level. It anticipated edge cases, enforced type safety, and included meaningful tests. This is the kind of output that actually respects real-world development standards.

Task 3: Long-Context Analytical Synthesis

What this tests: Information compression, structured summarization, and reasoning across context.

Prompt:

“Below is a synthetic business report:

Company: NovaGrid AI

2022 Revenue: $12M
2023 Revenue: $28M
2024 Revenue: $46M

Customer churn increased from 4% to 11% in 2024.
R&D spending increased by 70% in 2024.
Operating margin dropped from 18% to 9%.
Enterprise customers grew by 40%.
SMB customers declined by 22%.
Cloud infrastructure costs doubled.

Task:

1. Diagnose the most likely root causes of margin decline.
2. Identify strategic risks.
3. Recommend 3 data-backed actions.
4. Present your answer in a structured executive memo format.”

Response:

It connected financial signals, operational shifts, and strategic risks into a coherent executive narrative. The ability to diagnose margin pressure while balancing growth signals shows strong business reasoning. It read like something a sharp strategy consultant would draft, not a generic summary.
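
One way to see why the margin story dominates the memo is to run the report's own numbers. The snippet below is my back-of-the-envelope check, assuming the 18% to 9% operating-margin figures refer to 2023 and 2024 respectively:

# Back-of-the-envelope check on the NovaGrid figures from the prompt.
# Assumption: the 18% -> 9% operating margin refers to 2023 and 2024.
revenue = {"2023": 28_000_000, "2024": 46_000_000}
margin = {"2023": 0.18, "2024": 0.09}

operating_income = {year: revenue[year] * margin[year] for year in revenue}
revenue_growth = revenue["2024"] / revenue["2023"] - 1

print(operating_income)                         # {'2023': 5040000.0, '2024': 4140000.0}
print(f"Revenue growth: {revenue_growth:.0%}")  # Revenue growth: 64%

Revenue grew roughly 64% while operating income actually fell from about $5.0M to $4.1M, which is exactly the tension the model's memo called out.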

Note: I didn’t use the standard “create a dashboard” task, as most recent models, such as Sonnet 4.6 and Kimi K 2.5, can build one easily, so it wouldn’t offer much of a challenge to a model this capable.

How to access Gemini 3.1 Pro? 

Unlike previous Pro models, Gemini 3.1 Pro is freely accessible to all users on the platform of their choice.

Now that you’ve made up your mind about using Gemini 3.1 Pro, let’s see how you can access the model. 

  1. Gemini Web UI: Both free and Gemini Advanced users now have 3.1 Pro available in the model selector.
  2. API: Available to developers via Google AI Studio (models/Gemini-3.1-pro). Pricing is summarized in the table below, and a quick cost estimate follows this list.
| Model | Base Input Tokens | 5m Cache Writes | 1h Cache Writes | Cache Hits & Refreshes | Output Tokens |
|---|---|---|---|---|---|
| Gemini 3.1 Pro (≤200K tokens) | $2 / 1M tokens | ~$0.20–$0.40 / 1M tokens | ~$4.50 / 1M tokens per hour of storage | Not formally documented | $12 / 1M tokens |
| Gemini 3.1 Pro (>200K tokens) | $4 / 1M tokens | ~$0.20–$0.40 / 1M tokens | ~$4.50 / 1M tokens per hour of storage | Not formally documented | $18 / 1M tokens |
  3. Cloud Platforms: Being rolled out to NotebookLM, Google Cloud’s Vertex AI, and Microsoft Foundry.
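
To put the pricing table into concrete terms, here is a small cost estimator based only on the base input and output rates listed above. Cache pricing is ignored, so an actual bill may differ once caching and other discounts apply:

# Rough per-request cost estimate using the base rates from the table above.
# The >200K-token surcharge is applied from the input size alone; caching is ignored.
def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    long_context = input_tokens > 200_000
    input_rate = 4.0 if long_context else 2.0     # USD per 1M input tokens
    output_rate = 18.0 if long_context else 12.0  # USD per 1M output tokens
    return (input_tokens / 1e6) * input_rate + (output_tokens / 1e6) * output_rate

print(f"${estimate_cost(100_000, 2_000):.3f}")    # $0.224
print(f"${estimate_cost(500_000, 4_000):.3f}")    # $2.072

In other words, a typical 100K-token prompt with a 2K-token answer costs around 22 cents at the base rates, while a long-context request with 500K input tokens comes closer to two dollars.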

Benchmarks

To quantify how good this model is, let's look at the benchmarks.

Source: DeepMind

There is a lot to unpack here, but the most striking improvement of all is certainly in abstract reasoning puzzles (ARC-AGI-2).

Let me put things into perspective: Gemini 3 Pro launched with an ARC-AGI-2 score of 31.1%. That was the highest at the time and considered a breakthrough by LLM standards. Fast forward just three months, and its own successor has more than doubled that score at 77.1%.

This is the rapid pace at which AI models are improving. 

If you’re unfamiliar with what these benchmarks test, read this article: AI Benchmarks

Conclusion: Powerful and Accessible

Gemini 3.1 Pro proves it’s more than a flashy multimodal model. Across reasoning, code, and analytical synthesis, it demonstrates real capability with production relevance. It’s not flawless and still demands structured prompting and human oversight. But as a frontier model embedded in Google’s ecosystem, it’s powerful, competitive, and absolutely worth serious evaluation.

Frequently Asked Questions

Q1. What is Gemini 3.1 Pro designed for?

A. It is built for advanced reasoning, long-context processing, multimodal understanding, and production-grade AI applications.

Q2. How can developers access Gemini 3.1 Pro?

A. Developers can access it via Google AI Studio for prototyping or Vertex AI for scalable, enterprise deployments.

Q3. Is Gemini 3.1 Pro reliable for high-stakes tasks?

A. It performs strongly but still requires structured prompting and human oversight to ensure accuracy and reduce hallucinations.

