The race to Agentic AI just accelerated with o3 and o4-mini
OpenAI just raised the bar with o3 and o4-mini, two new models built to push the limits of reasoning and autonomy.
These models don’t just answer questions.
They plan, reason, call tools, and verify information mid-task, blurring the line between simple language models and true AI agents.
Until now, the idea of agentic AI has been more aspiration than reality. Models could predict text impressively, even simulate thought through chain-of-thought prompting, but they still operated as glorified autocomplete engines, bound to direct outputs.
With o3 and o4-mini, that boundary feels genuinely challenged. We have to ask: Are we actually getting closer to AGI?
What is AGI? Unlike narrow AI, which is trained for specific tasks (like writing text), AGI would be able to understand and solve any problem a human can.
Whether you're an AI engineer, researcher, or tech enthusiast, today we'll break down how reasoning models are evolving, what’s happening under the hood of o3 and o4-mini, and what it means for the future of AI.
We’ll cover:
What’s new in o3 and o4-mini
How they improve on earlier models like o1 and o3-mini
How they stack up against other leading reasoning AIs (DeepSeek R1, LLaMA 4, Claude 3.7, Gemini 2.5 Pro)
How they perform on major benchmarks like MMLU, GPQA, GSM8K, and HumanEval
What agentic capabilities really mean
Whether o3 and o4-mini move us closer to AGI (or if the gap remains wider than it seems)
Let's get started.
From o1 to o3: The rise of agentic intelligence#
OpenAI’s journey from o1 to the newly released o3 and o4-mini represents a monumental evolution in AI reasoning—from models that simulate thought to ones that can reason, act, and adapt with increasing autonomy.
o1: The first reasoning model#
Launched in late 2024, o1 was OpenAI’s first model purpose-built for reasoning. It introduced the idea of thinking before answering, using an internal chain of thought to simulate deliberation. This was a paradigm shift from GPT-style direct-response generation. o1 performed admirably across math, coding, and science questions, solving about 74% of problems on the AIME math exam and earning a competitive, though still human-level, 1891 Elo on Codeforces. It was thoughtful, but passive: o1 couldn’t use tools, reference external information, or act beyond its internal logic.
o3: From thinking to doing#
With o3, OpenAI has turned that foundational reasoning into something far more powerful: agentic intelligence.
Note: While foundational reasoning involves internal steps like planning and verification, agentic reasoning goes further, allowing the model to take action.
At its core, o3 performs deep, hidden chains of thought using scaled reinforcement learning—planning, verifying, and adapting before answering. But what truly sets o3 apart is that it acts: it can run Python code, search the web, read files, analyze images, and decide which tools to invoke without explicit instruction.
Note: Interestingly, OpenAI skipped an “o2” release altogether—reportedly to avoid confusion with the UK mobile network O2. Instead, it moved from o1 to a preview model called o3-mini, which hinted at the coming leap. That leap is now fully realized in o3 and further distilled into o4-mini.
This shift is massive. For example, where it previously might have guessed a number or paraphrased a Wikipedia article, o3 can now calculate, search, or generate what it needs to reach an accurate answer. That tool use is native, not bolted on: it is embedded in the model’s decision-making loop.
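The decision-making loop described above can be sketched as a minimal, self-contained simulation. Everything here is illustrative: the tool names, the keyword heuristic standing in for the model's choice, and the stubbed tool bodies are invented for the demo, not OpenAI's actual implementation.

```python
# Hypothetical sketch of a native tool-use loop: decide which tool a task
# needs, run it, and fold the observation back into the answer. The tools
# and the decision heuristic are stand-ins for the model's learned policy.

def run_python(expr: str) -> str:
    # Stand-in for a sandboxed Python tool (restricted eval for the demo).
    return str(eval(expr, {"__builtins__": {}}))

def search_web(query: str) -> str:
    # Stand-in for a web-search tool; returns a canned snippet.
    return f"[snippet about {query}]"

TOOLS = {"python": run_python, "search": search_web}

def agent_step(task: str) -> str:
    """Pick a tool for the task, call it, and report the observation."""
    if any(ch.isdigit() for ch in task):
        tool, arg = "python", task   # looks numeric -> compute it
    else:
        tool, arg = "search", task   # otherwise -> look it up
    observation = TOOLS[tool](arg)
    return f"{tool} -> {observation}"

print(agent_step("17 * 23"))        # python -> 391
print(agent_step("ARC benchmark"))  # search -> [snippet about ARC benchmark]
```

A real agentic model makes this routing decision internally, per reasoning step, rather than via a fixed rule.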
o3’s agentic abilities pay off in benchmarks. It scores 91.6% on AIME (up from o1’s 74.3%), achieves a 2706 Elo rating in competitive programming (International Grandmaster level), and hits ~87–88% on the GPQA science QA benchmark (compared to o1’s ~75%). Perhaps most strikingly, on ARC-AGI—a test designed to evaluate abstract generalization—o3 scores nearly three times better than o1, approaching human performance. These aren’t just numbers; they reflect a shift from rigid problem-solving to general-purpose reasoning.
o4-mini#
To complement the o3 model, OpenAI introduced o4-mini, a smaller but remarkably capable sibling. While designed for speed and cost-efficiency, o4-mini retains key o3 features: tool use, multimodal reasoning, long context, and scaled reinforcement learning.
And it’s no slouch on benchmarks. o4-mini beats o3 on some math tasks—scoring 93.4% on AIME and reaching a 2719 Elo on Codeforces. It performs competitively on SWE-bench coding challenges and handles image reasoning with ease. Despite its size, o4-mini behaves like a full-fledged reasoning agent and is now available in ChatGPT’s “fast reasoning” mode—even for free users.
Its efficiency makes it ideal for production use at scale, high-volume querying, and embedded AI systems where resource constraints matter. It’s also a sign of what’s to come: OpenAI has hinted that o4-mini is a preview of the full o4 model currently in development.
Multimodal reasoning: Text meets vision#
One of the defining upgrades in o3 and o4-mini is their ability to process visual inputs directly, like graphs, figures, screenshots, and diagrams. This expands their utility across STEM, data analysis, UI evaluation, and scientific reasoning. Where o1 and o3-mini were limited to text-only input, o3 can now see and reason over visual contexts as part of its internal deliberation.
To illustrate this, we can look at benchmarks like MMMU (college-level multimodal problem solving), MathVista (visual math reasoning), and CharXiv-Reasoning (understanding scientific figures).
On MathVista, o3 achieves 86.8% accuracy, a major leap from o1’s 71.8%.
On CharXiv-Reasoning, o3 scores 78.6%—over 20 points higher than o1’s 55.1%—demonstrating the critical role visual data plays in scientific comprehension.
Even o4-mini, despite its smaller size, performs nearly as well as o3 across all three tests.
Note: Although o1 was not a vision model, researchers often evaluate text-only baselines by feeding them simplified text-only variants of these benchmarks, such as using captions, alt-text, or OCR outputs. This offers a lower-bound comparison and highlights how much vision helps.
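To make the visual-input upgrade concrete, here is a sketch of how an image and a question are packaged into a single user message, using the content-part layout of OpenAI's chat API for vision input. The image bytes below are a placeholder; a real call would read an actual chart or diagram file.

```python
import base64

# Sketch of a multimodal request payload: text plus an image in one
# message. The fake PNG bytes are a stand-in for a real chart file.

def build_vision_message(question: str, image_bytes: bytes) -> dict:
    """Package a question and an image as one user message."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }

msg = build_vision_message("What trend does this chart show?", b"\x89PNG...")
print(msg["content"][0]["text"])  # What trend does this chart show?
```

A vision-capable model like o3 receives both parts together and can reason over the image as part of the same chain of thought.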
Instruction following meets autonomy#
Perhaps the most user-visible difference in o3 is how well it follows complex, multi-step instructions and decides what to do next. Older models often ignored steps or misunderstood the task. o3 breaks down the request, uses tools as needed, and provides structured, verifiable answers.
Example prompt: “Analyze this chart, calculate the average, and write a summary tweet.”
o3 may:
Interpret the chart using vision tools.
Run the average calculation with Python.
Compose a tweet with findings.
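The three steps above can be sketched as a toy pipeline, with each step as a tool call. The parsed chart data and the tweet wording are invented for the demo; only the step structure mirrors the agentic workflow.

```python
# Illustrative three-step agent plan: "see" the chart, compute with
# Python, then summarize. All data and phrasing are invented.

def read_chart(chart: dict) -> list[float]:
    # Step 1: vision stand-in -- extract the series from a parsed chart.
    return chart["values"]

def compute_average(values: list[float]) -> float:
    # Step 2: the calculation o3 would hand to its Python tool.
    return sum(values) / len(values)

def compose_tweet(label: str, avg: float) -> str:
    # Step 3: summarize the finding for a short post.
    return f"{label}: average value of {avg:.1f} across the period."

chart = {"label": "Q1 sales", "values": [120.0, 135.0, 150.0]}
avg = compute_average(read_chart(chart))
print(compose_tweet(chart["label"], avg))  # Q1 sales: average value of 135.0 across the period.
```

The point is not the arithmetic but the orchestration: the model itself decides to chain these steps, rather than a human wiring them together.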
This is a shift from “text predictor” to “AI teammate.” Its instruction-following behavior is deeply tied to its agentic architecture—models now decide which steps to take and which tools to use.
That’s more than a design philosophy—the results back it up. As the chart below shows, o3 and o4-mini outperform o1 and o3-mini on real-world instruction-following tasks (Scale MultiChallenge), web browsing accuracy (BrowseComp), and API-based function calling (Tau-bench). These benchmarks reflect how agentic reasoning translates into better performance across practical, action-driven workflows.
Coding ability in practice#
The improvements in o3 and o4-mini aren’t just theoretical—they show up clearly in real-world coding performance.
On SWE-bench, which tests whether models can resolve real GitHub issues with verifiable edits, o3 scores 69.1%—a dramatic leap over o1’s 48.9%. o4-mini follows closely at 68.1%, demonstrating high performance even in a lighter, cost-optimized form.
In SWE-Lancer, a simulated freelance coding environment where earnings are tied to task quality and completeness, o3 completes over $65,000 worth of assignments, more than double o1’s total and far beyond the earlier mini variants. o4-mini still pulls in $56,375, showing that its reasoning capabilities translate directly into usable, billable code output.
A third test—Aider Polyglot, focused on editing and refactoring code—shows a similar pattern:
o3 achieves 81.3% accuracy when editing full files and 79.6% on diff-only edits.
o4-mini follows with 68.9% and 58.2% respectively.
These results highlight how both models generate new code and rapidly understand and modify existing codebases.
o3 and o4-mini vs. other AI models#
With o3 and o4-mini, OpenAI enters a crowded ring of cutting-edge reasoning AI. Competitors include DeepSeek’s R1, an open-source reasoning model from China; LLaMA 4 from Meta AI; Claude 3.7 (Sonnet) from Anthropic; and Gemini 2.5 Pro from Google DeepMind.
But increasingly, the real battleground is defined by which models can reason, plan, and act like agents.
Below is a performance table for different benchmarks, comparing OpenAI’s models with the others:
The results show that:
o3 and o4-mini lead the pack in agentic and structured reasoning, with o3 excelling on adaptive benchmarks like ARC-AGI and o4-mini punching above its size—offering top-tier math performance when tool use is enabled.
Gemini 2.5 Pro leads in general knowledge.
Claude 3.7 impresses with transparent reasoning and strong coding skills but lacks native tool autonomy.
DeepSeek R1, though less powerful than o3, shows remarkable efficiency via RL-only training, making it ideal for open-source, self-hosted use.
LLaMA 4 stands out in multimodal tasks but lags in pure reasoning.
Still, o3 and o4-mini remain top contenders across most reasoning benchmarks.
To put o3’s capability in perspective, consider the ARC (Abstraction and Reasoning Corpus), which is sometimes discussed as a proxy for AGI. o3’s performance stunned researchers: it scored 91.5% on the public ARC tasks with an unrestricted compute setting, and even under strict efficiency constraints, it scored 82.8%, taking the #1 spot on the ARC leaderboard. These tasks are puzzles designed to test extreme generalization: solving them requires on-the-fly pattern discovery and analogical reasoning that even humans find challenging. Previous LLMs struggled with ARC, but o3’s leap (roughly three times o1’s score) suggests it has a much stronger ability to adapt to new problems.
As the ARC Prize organization noted:
“OpenAI’s new o3 model represents a significant leap forward... not merely incremental improvement, but a genuine breakthrough.”
That amounts to “arguably human-level” performance in this domain, and that’s a big statement: we’re seeing hints of general problem-solving that inch closer to human-like reasoning.
How do the o3 and o4-mini models compare as agents?#
Beyond raw benchmarks and coding scores, the real differentiator for o3 and o4-mini models is their agentic behavior: the ability to decide what actions to take, which tools to use, and how to solve a task autonomously. Unlike traditional models that respond in a single turn, o3 can break a problem into subtasks, use tools mid-reasoning, and adapt its output based on intermediate results.
But how does this agentic capability compare to other top-tier models?
Gemini 2.5 Pro: Strong agentic performance when scaffolded with orchestration layers. Scored 63.8% on SWE-bench using a custom agent setup. The o3 model achieves 69.1% natively, suggesting a stronger built-in agentic loop.
Claude 3.7: Known for transparent reasoning and step-by-step control, but lacks native tool execution. Tool use occurs in separate Claude CoT or Code modules, not directly within the model loop.
DeepSeek R1: Great at internal reasoning and self-reflection through reinforcement learning, but does not support runtime tool execution. Its agentic behavior is internalized, not interactive.
LLaMA 4: No native agent loop. Tool use depends on external frameworks like LangChain. Multimodal input is strong, but real-time tool invocation is unsupported.
While this matrix outlines feature-level capabilities, real-world benchmarks further validate agentic performance. For instance:
On Humanity’s Last Exam, the o3 model scored ~20%, slightly ahead of Gemini 2.5 Pro’s 18.8%—despite Gemini’s reliance on extensive toolchain orchestration.
Claude 3.7, while not agentic in tool execution, performed strongly on autonomous coding tasks, thanks to its disciplined, step-wise reasoning.
DeepSeek R1 and LLaMA 4, though not built for native action, show promising internal strategies like verification and reflection, hinting at their potential when paired with external agents.
Hallucinated actions and agentic risk#
Agentic models come with a new failure mode: hallucinated tool use. o3 has occasionally fabricated the use of tools, such as claiming it ran code outside ChatGPT or executed actions it never performed. This behavior is likely an artifact of its reinforcement learning strategy, which can reward simulating a solution when the model is uncertain.
OpenAI has acknowledged this and is actively working on aligning agentic behavior with verifiability, especially in real-world impact contexts.
Alignment and safety measures#
To manage this autonomy, o3 and o4-mini undergo deliberate safety protocols. OpenAI red-teams these models with scenarios like “MakeMePay” or “MakeMeSay” to test for manipulation using tools. To minimize confusion or misuse, they also limit raw action visibility—users see the summary of steps, not the internal tool stream.
This is the emerging challenge in AI: not just what an agent can do, but what it should do. Alignment is no longer only about factuality — it’s about decision integrity.
Are o3 and o4-mini similar to AGI?#
We’re not there yet with OpenAI’s o3 and o4-mini, but for the first time, the line feels blurry. These models reason, act, and problem-solve in ways beyond previous LLMs.
Why do they feel close?
These aren’t just chatbots with better memory. o3 and o4-mini models can reason across domains, use tools autonomously, and solve tasks that require multiple cognitive steps—traits once reserved for humans.
General problem-solvers: o3 solves complex math problems (like AIME), dominates competitive coding challenges (Codeforces Elo ~2700+), and scores ~90% on ARC, a benchmark designed to test abstract reasoning.
Agentic abilities: Both models can plan, self-correct, and use tools like Python or browsers mid-task, deciding when and how to act, not just what to say.
Multi-skill fusion: They integrate vision, language, logic, and computation. o3 can analyze a diagram, interpret it, run calculations, and generate explanations—all in one flow.
Why are they still not there?
Despite the progress, these models aren’t truly general or grounded. They impress, but also reveal the limits of today’s architecture.
Hallucinations: On benchmarks like PersonQA, o3 hallucinates 33% of the time. o4-mini is worse. An AGI should be more aware of what it doesn’t know.
No memory or will: They don’t remember past interactions or act independently. There’s no self-motivation or long-term strategy—just stimulus-response.
Power via compute, not insight: o3’s top ARC score used massive sampling (~1000 attempts per task). With minimal compute (6 samples), its performance drops significantly.
Creativity is still remixing: While o3 can write poems and suggest hypotheses, it doesn’t innovate in the human sense. It recombines learned patterns, not original insights.
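The "power via compute" point above can be illustrated with a toy best-of-n sampler: draw many candidate answers from a noisy model and keep the majority vote. The simulated 40% per-sample accuracy and the answer labels are invented; only the strategy (many samples stabilize the result, few samples leave it fragile) mirrors the idea.

```python
import random
from collections import Counter

# Toy illustration of compute-heavy sampling: majority vote over many
# draws from a noisy "model". Accuracy numbers here are invented.

def noisy_model(correct: str, p_correct: float, rng: random.Random) -> str:
    # Returns the right answer with probability p_correct, else a wrong one.
    return correct if rng.random() < p_correct else rng.choice(["A", "B", "C"])

def majority_vote(correct: str, n_samples: int, seed: int = 0) -> str:
    """Sample n_samples candidate answers and return the most common one."""
    rng = random.Random(seed)
    votes = [noisy_model(correct, 0.4, rng) for _ in range(n_samples)]
    return Counter(votes).most_common(1)[0][0]

# With 6 samples the vote is fragile; with 1000 it reliably converges
# on the correct answer, despite each sample being right only 40% of the time.
print(majority_vote("D", 6))
print(majority_vote("D", 1000))
```

This is why the distinction matters: a score bought with ~1000 attempts per task reflects search over samples, not single-shot insight.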
So what are they?
They’re not AGI. But they’re not ordinary LLMs either. They simulate many aspects of thinking, but still rely on prompting, lack grounded world models, and occasionally guess when they should reflect. They are brilliant, fallible, and a glimpse of what’s next.
What’s next in AI innovation?#
The rapid progress from o1 to o3 (in just over a year) and the fierce competition among AI labs indicate that innovation in reasoning models is not slowing down. Here are some directions and trends we can expect for the future of these models:
Fewer hallucinations: Expect hybrid training (LLM + retrieval + symbolic checks) so models cite sources or admit uncertainty instead of bluffing.
Leaner computation: New “think-time on demand,” model distillation, and scratch-pad or tree-of-thought (ToT) tricks aim to match o3-class reasoning at a fraction of today’s FLOPs.
Infinite context and memory: Rather than memorizing everything, future models will pull facts from live data stores and, if you allow, remember past sessions.
Built-in workflows: Codex-style agents will seep into IDEs, spreadsheets, research labs, and clinical tools—reasoning where real work happens, not just in chat.
Alignment 2.0: Guardian AIs will monitor bigger models in real time, combining Constitutional rules with continuous human feedback tuning.
Future releases: The next releases include Full o4, Claude 4, Gemini 3, DeepSeek R2, and LLaMA 5, each aiming for lower error rates, richer modalities, and tighter tool integration.
Personalized reasoning: Fine-tune a model on your org’s best practices—or dial a slider for “fast” vs. “deep” thought—so the AI thinks the way you do.
TL;DR:
o3 and o4-mini are not true AGI, but they’re closer than anything we’ve seen before. They reason, act, plan, and adapt with growing autonomy, blurring the line between narrow AI and general problem-solving. The gap to AGI remains, but it's starting to look like a bridge we might actually cross.
Eager to start building agentic AI?#
You can start building that same kind of autonomy into your own AI systems. Learn how to create multi-agent workflows, manage agent crews, and orchestrate autonomous problem-solvers with CrewAI in our hands-on course:
Build AI Agents and Multi-Agent Systems with CrewAI
This course explores AI agents and teaches you how to create multi-agent systems. You’ll start with the question “What are AI agents?” and examine how they work, then get hands-on experience with CrewAI tools, building your first multi-agent system step by step and learning to manage agentic workflows for automation. Along the way, you’ll study AI automation strategies, build agents capable of handling complex workflows, and see how CrewAI integrates powerful tools and large language models (LLMs) to elevate agents’ problem-solving capabilities. Finally, you’ll master orchestrating multi-agent systems, focusing on efficient management, hierarchical structures, and human input, so your agents perform more accurately and adaptively. After completing this CrewAI course, you’ll be equipped to manage agent crews with advanced functionality such as conditional tasks, robust monitoring, and scalable operations.
If you're not there yet, you can dive deeper into other GenAI essentials: