Gemini 3 'Deep Think' Hits 84.6% on ARC-AGI-2 and Claims Math Olympiad Gold

Google Just Raised the Reasoning Bar
Google has shipped a major upgrade to Gemini 3's Deep Think mode — and the benchmark numbers are hard to ignore. The updated reasoning system scored 84.6% on ARC-AGI-2, one of the most demanding tests of general intelligence and novel problem-solving in AI evaluation. More striking still: Deep Think achieved gold-medal-level performance on both the 2025 International Mathematical Olympiad (IMO) and the 2025 International Physics Olympiad (IPhO) — competitions where even the sharpest human minds in the world often fall short of the top tier.
This isn't a narrow benchmark win. ARC-AGI-2 is specifically designed to resist pattern-matching and memorization, requiring models to generalize to genuinely new problem types. Gold-medal performance on the Olympiads requires multi-step deductive reasoning, creative problem construction, and the ability to verify solutions rigorously — all areas where previous-generation models struggled badly.
Deep Think represents Google's answer to the "thinking" model category that OpenAI pioneered with o1 and o3. Where standard Gemini 3 responds quickly, Deep Think allocates extended compute time to reason through problems step-by-step before committing to an answer.
5 Use Cases Where Deep Think Changes Everything
1. Advanced Mathematical Research and Proof Verification
Deep Think's Olympiad-level math performance opens the door to real research utility. Mathematicians can use it to check proof sketches, explore edge cases in conjectures, and generate candidate solutions to open problems. At gold-medal reasoning level, the model isn't just calculating — it's constructing logical arguments with the rigor that formal mathematics demands. For fields like number theory, combinatorics, and topology where human intuition bottlenecks progress, an AI collaborator that can hold and reason about complex mathematical structures is genuinely useful.
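The "explore edge cases in conjectures" workflow often starts with exhaustive checking over small instances — something a model can draft and a mathematician can audit. A minimal illustrative sketch (using Goldbach's conjecture as a stand-in example of my own choosing, not anything from Google's announcement):

```python
def is_prime(n: int) -> bool:
    """Trial-division primality test, adequate for small n."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def goldbach_witness(n: int):
    """Return a prime pair (p, n - p) summing to even n, or None if none exists."""
    for p in range(2, n // 2 + 1):
        if is_prime(p) and is_prime(n - p):
            return (p, n - p)
    return None

# Exhaustively confirm the conjecture holds for small even numbers.
counterexamples = [n for n in range(4, 1000, 2) if goldbach_witness(n) is None]
print(counterexamples)  # → []
```

Brute-force checks like this don't prove anything in general, but they are exactly the kind of tedious verification work worth delegating while the human focuses on the actual proof.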
2. Physics Simulation and Theoretical Modeling
Gold-medal IPhO performance signals that Deep Think can handle the kind of multi-body, multi-constraint reasoning that theoretical and computational physics requires. Use cases include deriving equations of motion for novel physical systems, cross-checking theoretical predictions against boundary conditions, and helping researchers identify errors in complex derivations. For engineering applications — spacecraft trajectory optimization, quantum circuit design, advanced materials modeling — this level of physics reasoning is directly applicable.
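"Cross-checking theoretical predictions" is, at its core, numerical verification of a derivation. A toy sketch of that workflow — my own example, using a simple harmonic oscillator whose analytic solution is known:

```python
import math

def simulate_sho(omega: float, dt: float, steps: int) -> float:
    """Velocity-Verlet integration of x'' = -omega^2 * x, with x(0)=1, v(0)=0."""
    x, v = 1.0, 0.0
    for _ in range(steps):
        a = -omega**2 * x
        x += v * dt + 0.5 * a * dt**2
        a_new = -omega**2 * x
        v += 0.5 * (a + a_new) * dt
    return x

# The analytic solution is x(t) = cos(omega * t); compare after one full period.
omega, dt = 1.0, 1e-3
steps = int(2 * math.pi / dt)
x_num = simulate_sho(omega, dt, steps)
x_exact = math.cos(omega * steps * dt)
print(abs(x_num - x_exact))  # small residual: integrator matches the derivation
```

If a derived equation of motion were wrong, the simulated and analytic trajectories would diverge — a cheap sanity check on the symbolic work.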
3. Competitive Programming and Algorithm Design
ARC-AGI-2's emphasis on novel problem-solving maps directly to competitive programming: problems are unseen, solutions require original thinking, and correctness is binary. Deep Think can approach algorithmic challenges with the kind of structured exploration that top competitive programmers use — trying multiple approaches, reasoning about complexity, and backtracking when a path fails. For software teams, this translates to help with particularly thorny algorithmic design problems, not just boilerplate code generation.
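The "structured exploration with backtracking" pattern described above is concrete and familiar: depth-first search that prunes on constraint violations and undoes choices when a branch dies. A classic self-contained illustration (N-Queens, chosen here only as a textbook example):

```python
def n_queens(n: int) -> int:
    """Count N-Queens placements via depth-first search with backtracking."""
    count = 0
    cols, diag1, diag2 = set(), set(), set()

    def place(row: int) -> None:
        nonlocal count
        if row == n:
            count += 1
            return
        for col in range(n):
            if col in cols or row + col in diag1 or row - col in diag2:
                continue  # constraint violated: prune this branch early
            cols.add(col); diag1.add(row + col); diag2.add(row - col)
            place(row + 1)
            cols.discard(col); diag1.discard(row + col); diag2.discard(row - col)  # backtrack

    place(0)
    return count

print(n_queens(8))  # → 92
```

Trying a placement, detecting failure, and retracting it is the same try/verify/retract loop that extended-thinking models run over candidate solution paths.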
4. Legal and Financial Scenario Analysis
Complex legal reasoning and financial modeling both require the same core skill Deep Think excels at: holding many conditional rules in working memory and reasoning about their interactions without losing track of constraints. Deep Think can analyze multi-jurisdiction regulatory questions, model the logical consequences of contract clauses, stress-test financial models against edge-case assumptions, and reason through adversarial scenarios. For law firms and quant teams, the upgrade from "smart assistant" to "gold-medal reasoner" changes what's actually delegatable.
5. Medical Differential Diagnosis and Clinical Decision Support
Diagnostic medicine is fundamentally a reasoning problem under uncertainty — weighing symptom combinations, ruling out alternatives, and reasoning about rare presentations that don't fit standard patterns. Deep Think's ability to reason through novel, constrained problems without relying on surface-pattern matching makes it significantly more useful for clinical decision support than standard models. It can reason through unusual symptom clusters, cross-reference drug interaction constraints, and walk through differential diagnoses with the rigor that medical accuracy demands.
What the ARC-AGI-2 Score Actually Means
ARC-AGI-2 was designed by François Chollet specifically to be hard for AI systems to game. The tasks require understanding abstract visual patterns and applying novel transformations — the kind of fluid intelligence that IQ tests approximate. An 84.6% score is a landmark result. For context, earlier frontier models were scoring in the 30–50% range on ARC-AGI-1, and ARC-AGI-2 is meaningfully harder.
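Real ARC-AGI-2 tasks are far richer than anything a few lines can capture, but the flavor is program induction: infer a transformation from a handful of input/output grid pairs, then apply it to a novel grid. A deliberately toy sketch with a hand-picked hypothesis space (my own illustration, not an actual ARC task):

```python
# Candidate transformations a naive solver might search over.
CANDIDATES = {
    "identity": lambda g: g,
    "flip_horizontal": lambda g: [row[::-1] for row in g],
    "flip_vertical": lambda g: g[::-1],
    "transpose": lambda g: [list(r) for r in zip(*g)],
}

def infer_rule(examples):
    """Return the name of the first candidate consistent with every example pair."""
    for name, fn in CANDIDATES.items():
        if all(fn(inp) == out for inp, out in examples):
            return name
    return None

examples = [
    ([[1, 0], [0, 2]], [[0, 1], [2, 0]]),
    ([[3, 3, 0], [0, 0, 5]], [[0, 3, 3], [5, 0, 0]]),
]
print(infer_rule(examples))  # → "flip_horizontal"
```

What makes ARC-AGI-2 hard is precisely that no fixed candidate list like this suffices — the transformations are novel, so the solver must construct hypotheses it has never seen, which is the fluid-intelligence property the benchmark measures.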
This doesn't mean Gemini 3 Deep Think is AGI. But it does mean the gap between what AI can reason about and what humans can reason about is closing faster than most predictions anticipated.
The Competitive Landscape Just Shifted
With Deep Think, Google has a direct answer to OpenAI's o3 and the emerging class of "extended thinking" models. The Olympiad benchmarks in particular are a pointed competitive signal — these are the exact evaluations the AI industry uses to calibrate frontier reasoning capability, and gold-medal performance is the ceiling, not just a passing grade.
Expect the other labs to respond quickly. The reasoning model race in 2026 is moving at the same pace as the base model race did in 2023 — and the stakes are considerably higher.