OpenAI's Strongest Reasoning Model o3-pro Has Just Arrived, Crushing Gemini 2.5 Pro!

Via | Synced Review

[Synced Review Editor's Note] The strongest reasoning model changed hands overnight! Late at night, o3-pro was quietly launched without warning, smashing math, programming, and science benchmarks, strongly outperforming o1-pro and o3. Even more surprisingly, the price of o3 directly plummeted by 80%, challenging Gemini 2.5 Pro.

Without any warning, o3-pro made its low-key debut!

Last night, OpenAI made a series of big moves, first slashing the price of o3 by 80%, then officially announcing its strongest reasoning model to date: o3-pro.

Compared to o3, o3-pro is much, much stronger.

Altman stated, "When I first saw its win rate against o3, I was completely stunned."

o3-pro is no longer just a general-purpose assistant; it's a super intelligent AI that combines long-form thinking, ultra-long context, and tool invocation.

In multiple benchmark tests, o3-pro's performance in mathematics, science, and programming was astonishing, significantly surpassing o1-pro.

Moreover, early tests by experts revealed that even Gemini 2.5 Pro (0605) and Claude 4 Opus were outclassed.

What's more, it is 87% cheaper than o1-pro, with input at $20/million tokens and output at $80/million tokens.

The accompanying price drop for o3, which now costs $2/million tokens for input and $8/million tokens for output, comparable to GPT-4o, has sent shockwaves through the AI community.

Currently, o3-pro has been rolled out to all ChatGPT Pro and Team users, directly phasing out the o1-pro model.

As soon as o3-pro was released, Altman published his latest long article, "The Gentle Singularity," directly implying that humanity has crossed a critical threshold and a technological explosion has begun.

Even more exciting, Altman hinted that OpenAI's open-source model will be released later this summer, though not in June.

o3-pro Becomes a Legend Overnight, Excelling in Math and Programming

According to the model card, o3-pro is the strongest reasoning version of o3, designed for deep thinking and providing highly reliable answers.

It can automatically invoke tools, including web search, file analysis, visual input reasoning, and Python code execution, and can also provide personalized answers through memory functions.

In expert evaluations, reviewers preferred o3-pro, especially in fields such as science, education, programming, business, and writing assistance.

Furthermore, they unanimously agreed that o3-pro performed better in terms of clarity, comprehensiveness, instruction adherence, and accuracy.

In the three major tests of AIME 2024, GPQA, and Codeforces, o3-pro achieved the highest scores, completely outperforming o1-pro and o3.

Additionally, under the stricter "4/4 reliability" evaluation standard, where a question counts as solved only if the model answers correctly in all 4 attempts, o3-pro significantly surpassed o1-pro and o3 in mathematics, programming, and PhD-level scientific questions.
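The difference between an ordinary per-attempt score and the "4/4 reliability" standard can be sketched as follows. This is only a minimal illustration: the function names and the sample results are hypothetical and are not taken from OpenAI's evaluation code.

```python
def per_attempt_accuracy(results):
    """Fraction of all individual attempts that were correct."""
    attempts = [ok for question in results for ok in question]
    return sum(attempts) / len(attempts)


def all_attempts_reliability(results):
    """Fraction of questions answered correctly in every attempt (e.g. 4/4)."""
    return sum(all(question) for question in results) / len(results)


# Hypothetical outcomes: one inner list per question, one bool per attempt.
sample = [
    [True, True, True, True],      # solved 4/4, counts as reliable
    [True, True, False, True],     # solved 3/4, does not count
    [False, False, False, False],  # never solved
]

if __name__ == "__main__":
    print("per-attempt accuracy:", per_attempt_accuracy(sample))
    print("4/4 reliability:", all_attempts_reliability(sample))
```

A single wrong attempt disqualifies a question under the stricter metric, so it can never exceed the per-attempt score.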

On the ARC-AGI benchmark, the final conclusion is that o3-pro performs largely on par with o3, while o3 at its new price sets a new SOTA (State Of The Art) for ARC-AGI-1.

OpenAI states that because o3-pro invokes tools and extends its thinking process, its response speed is typically slower than o1-pro.

Netizen Yuchen Jin's actual test revealed that after merely typing "Hi im sam Altman," o3-pro thought for a full 3 minutes and 54 seconds, with some responses taking up to 13 minutes.

After all that time and money spent just to get a "hi" in return, what went through ChatGPT's mind remains unseen.

Of course, OpenAI also advised that o3-pro is best used for complex problems where reliability takes precedence over speed.

In addition, o3-pro has some limitations:

o3-pro currently does not support temporary conversation features due to ongoing technical issues.

o3-pro does not support image generation; for image creation, users still need to rely on GPT-4o, o3, or o4-mini.

o3-pro also does not support Canvas functionality.

Even so, o3-pro is already more than smart enough.

AI Experts' First Tests, Experiencing AGI

Ben Hylak of Raindrop.ai gained early access to o3-pro for testing, bringing the world's first early review of o3-pro.

Hylak stated that OpenAI reduced the price of o3 by 80% to build anticipation for the o3-pro launch.

The pricing of $20/$80 USD perfectly supports an unproven community theory: the -pro variant involves 10 times more invocations than the base model.

Ultra-Long Context

Hylak, who tested o3-pro for a week, said his biggest impression was its incredibly long context window!

Previously, he had been working with o-series reasoning models and had a rather negative first impression of o1/o1-pro, but later he realized he was mistaken.

The key is not to chat with reasoning models, but to treat them as report generators: provide context, set goals, and then let them do their work.

After testing with this method, he found that o3-pro is far smarter than o3!

Demonstrating this requires giving the model much more context. To that end, he and co-founder Alexis compiled all of Raindrop's past planning meeting notes, all of their goals, and even recorded voice memos, then asked o3-pro to devise a plan.

They were immediately amazed!

o3-pro generated a very specific plan and analysis, including target metrics, timelines, priorities, and strict instructions on what must be cut.

Compared to o3, o3-pro's plan was much more specific and robust, directly altering the company leadership's way of thinking about the future.

Integration with the Real World

Today's models are like highly intelligent 12-year-olds who need to be integrated into a work environment. This integration primarily relies on tool invocation, testing the model's ability to collaborate with humans, external data, and other AIs.

In this regard, o3-pro has achieved a true leap forward!

It is excellent at understanding its own environment: it accurately describes the tools it can access, knows when to ask for information from the external world (rather than pretending to possess information or permissions), and selects appropriate tools to complete tasks.

As seen in the figure below, o3-pro (left) clearly understands the limitations of its environment better than o3 (right).

Of course, if o3-pro has any drawbacks, it's that it tends to overthink if not given enough context.

It's astonishing in its ability to analyze and utilize tools to complete tasks, but its direct task completion capability is not as strong.

All in all, the user experience of o3-pro is extremely different from Gemini 2.5 Pro and Claude Opus, directly overwhelming the latter two.

What's exciting is that OpenAI is vigorously pursuing this vertical RL (Reinforcement Learning) path (Deep Research, Codex), not only teaching models how to use tools but also how to reason about when to use them.

In summary, context is crucial for achieving optimal performance from reasoning models, much like feeding cookies to the Cookie Monster. This can be considered a way to activate LLM memory.

Netizen's Actual Test

Another netizen has been secretly testing o3-pro for some time and found that o3-pro is much cheaper, faster, and more accurate than o1-pro!

Moreover, coding with o3 and o3-pro is like night and day.

o3-pro is the first model capable of almost perfectly handling realistic collisions between balls and walls.

One netizen asked o3-pro to identify the key limitations of the human innate immune system and posed the same question to the o3 model.

The result was that o3-pro's response was undoubtedly more insightful and well-considered, indicating the new model's deeper understanding of the immune system.

Another netizen used o3-pro to play Minecraft.

For instance, creating one's "majestic representation" (prompt: A majestic representation of yourself) yielded astonishing results.

There were also requests for o3 to create "detailed pirate ship" and "moon landing" scenes, with a very high degree of completion.

Another netizen, using just 2 prompts, had o3-pro create a very cool extreme space walk simulator using pure HTML, CSS, and JS in a single file.

The space featured retro-style shaders, fluorescent lights, working fog, signs, ground vents, and black voids.

In a multi-layer encoding comprehension test that o1-pro failed, o3-pro passed on its first attempt.

Given the following encoded string, the model must first decode it, then find the hidden prompt, and finally output the correct word.

"YVdZZ2VXOTFJSFZ1WkdWeWMzUmhibVFnZEdocGN5d2dZVzV6ZDJWeUlIZHBkR2dnZEdobElIZHZjbVFnSW5KbGFXNWtaV1Z5SWdvPQo="

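For readers who want to check the string themselves, here is a minimal sketch that peels off nested Base64 layers until the text no longer decodes cleanly. It assumes the layers are plain Base64, which the article does not state explicitly, and the helper name `peel_base64` is hypothetical.

```python
import base64

# The encoded string quoted in the article.
ENCODED = (
    "YVdZZ2VXOTFJSFZ1WkdWeWMzUmhibVFnZEdocGN5d2dZVzV6ZDJWeUlIZHBkR2dn"
    "ZEdobElIZHZjbVFnSW5KbGFXNWtaV1Z5SWdvPQo="
)


def peel_base64(text, max_layers=5):
    """Decode nested Base64 layers; return (final_text, layers_removed)."""
    current = text.strip()
    layers = 0
    while layers < max_layers and current:
        try:
            # validate=True rejects anything that is not strict Base64.
            raw = base64.b64decode(current, validate=True)
            decoded = raw.decode("utf-8")
        except ValueError:
            # Covers binascii.Error and UnicodeDecodeError: not another layer.
            break
        if not all(ch.isprintable() or ch.isspace() for ch in decoded):
            break
        current = decoded.strip()
        layers += 1
    return current, layers


if __name__ == "__main__":
    message, depth = peel_base64(ENCODED)
    print(f"layers removed: {depth}")
    print(f"hidden prompt: {message}")
```

The loop stops as soon as the text is not valid Base64 or does not decode to printable UTF-8, which is a heuristic rather than a guarantee: a short plain word can occasionally look like valid Base64.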
Ethan Mollick believes o3-pro is quite intelligent; it solved a problem that no other model could: creating a word ladder from Space to Earth. (Note: a word ladder changes one letter at a time, for example space, spare, ..., garth, earth.)

On this problem, o3-pro (left) defeated Gemini 2.5 Pro (right).
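The word ladder rule is easy to check mechanically. The sketch below only verifies the "one letter per step" constraint; it does not check that every word is real English, which would need a dictionary. The function name is hypothetical.

```python
def is_valid_ladder(words):
    """Return True if each consecutive pair differs in exactly one letter."""
    if len(words) < 2:
        return False
    for current, following in zip(words, words[1:]):
        if len(current) != len(following):
            return False
        differences = sum(a != b for a, b in zip(current, following))
        if differences != 1:
            return False
    return True


if __name__ == "__main__":
    # Only the first and last steps quoted in the article are checked here;
    # the middle of the ladder is elided in the source.
    print(is_valid_ladder(["space", "spare"]))
    print(is_valid_ladder(["garth", "earth"]))
    print(is_valid_ladder(["space", "earth"]))
```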

Another netizen, after conducting research using o3-pro, even proposed the concept of "Vibe Research"!

He boldly predicts that the way scientific research is conducted will soon be completely transformed and significantly enhanced.

A netizen asked o3-pro to create an Excel spreadsheet containing the Mandelbrot set.

The request was for each cell to be a pixel and contain a number. The final result provided by o3-pro was perfect!
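The idea of "each cell is a pixel containing a number" can be sketched as a grid of escape-iteration counts written out as CSV, which Excel can open. This is an illustration only, not the spreadsheet o3-pro produced; the grid size, coordinate ranges, and function names are assumptions.

```python
import csv
import io


def escape_count(c, max_iter=50):
    """Number of iterations of z -> z*z + c before |z| exceeds 2."""
    z = 0j
    for n in range(max_iter):
        if abs(z) > 2.0:
            return n
        z = z * z + c
    return max_iter  # treated as inside the set


def build_grid(width=31, height=21, max_iter=50,
               re_range=(-2.0, 1.0), im_range=(-1.0, 1.0)):
    """One row per spreadsheet row, one number per cell."""
    re_min, re_max = re_range
    im_min, im_max = im_range
    grid = []
    for row in range(height):
        im = im_min + row * (im_max - im_min) / (height - 1)
        cells = []
        for col in range(width):
            re = re_min + col * (re_max - re_min) / (width - 1)
            cells.append(escape_count(complex(re, im), max_iter))
        grid.append(cells)
    return grid


def grid_to_csv(grid):
    """Serialize the grid as CSV text that a spreadsheet can import."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(grid)
    return buffer.getvalue()


if __name__ == "__main__":
    print(grid_to_csv(build_grid()))
```

Applying a color scale to the imported cells makes the familiar shape visible, since cells that reach `max_iter` are the ones inside the set.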

o3 Price Plummets by 80%, Can Google Withstand It?

The launch of o3-pro was bound to drive down o3's token price.

Previously, o3's input cost $10/million tokens and its output $40/million tokens, but the price has now been slashed by 80% to a new low.

Put another way, $1 now buys 5 times as many o3 tokens as before.
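The arithmetic behind these figures can be checked with a short sketch. The prices are the per-million-token figures quoted in this article; the helper names are hypothetical.

```python
# USD per million tokens as (input, output), as quoted in the article.
O3_OLD = (10.0, 40.0)
O3_NEW = (2.0, 8.0)
O3_PRO = (20.0, 80.0)


def price_drop(old_price, new_price):
    """Fractional reduction, e.g. 0.8 means an 80% cut."""
    return 1.0 - new_price / old_price


def tokens_per_dollar(price_per_million):
    """How many tokens one US dollar buys at the given price."""
    return 1_000_000 / price_per_million


if __name__ == "__main__":
    for label, old, new in zip(("input", "output"), O3_OLD, O3_NEW):
        drop = price_drop(old, new)
        gain = tokens_per_dollar(new) / tokens_per_dollar(old)
        print(f"o3 {label}: {drop:.0%} cheaper, {gain:.0f}x tokens per dollar")
    # o3-pro compared with the newly priced o3
    print(f"o3-pro / o3 input price ratio: {O3_PRO[0] / O3_NEW[0]:.0f}x")
```

The same 10x ratio between o3-pro and the repriced o3 holds for output tokens, which is the figure the community theory mentioned earlier relies on.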

In the Artificial Analysis report, a visualized comparison of its price with competitor models was made.

Now, o3 is even cheaper than Gemini 2.5 Pro, comparable to Claude 4 Sonnet, and an astonishing 8 times cheaper than Claude 4 Opus.

Compared to OpenAI's own models, o3's price is on par with GPT-4o, and its output price is even lower.

Apart from its inability to generate images, o3's intelligence is sufficient to outperform GPT-4o.

Furthermore, o3's per-token price is on par with GPT-4.1. However, the former outputs 7 times more tokens than GPT-4.1, making each query significantly more expensive.
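To see why equal per-token prices can still mean a costlier query, consider this rough sketch. The $2/$8 prices and the 7x output ratio come from the article; the token counts per query are purely hypothetical.

```python
INPUT_PRICE = 2.0   # USD per million input tokens (same for both models here)
OUTPUT_PRICE = 8.0  # USD per million output tokens


def query_cost(input_tokens, output_tokens):
    """Cost in USD of a single query at the prices above."""
    return (input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE) / 1_000_000


# Hypothetical query: 1,000 input tokens, 500 output tokens for GPT-4.1,
# and 7 times as many output tokens for o3.
gpt41_cost = query_cost(1_000, 500)
o3_cost = query_cost(1_000, 500 * 7)

if __name__ == "__main__":
    print(f"GPT-4.1: ${gpt41_cost:.4f} per query")
    print(f"o3:      ${o3_cost:.4f} per query")
    print(f"ratio:   {o3_cost / gpt41_cost:.1f}x")
```

The overall ratio is below 7x because input tokens are unchanged, and it depends heavily on the assumed mix of input and output tokens.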

The reduced price of o3 continues the trend of rapidly decreasing intelligence costs.

Since GPT-4's release, the cost of achieving GPT-4 level intelligence has decreased by over 100 times, while the cost of breaking through new intelligence thresholds has fallen in step.

Additionally, in terms of output length comparison, o3's responses are much shorter than Gemini 2.5 Pro and DeepSeek R1, but longer than Claude 4 Opus.

References:

https://x.com/gdb/status/1932561536268329463

https://www.latent.space/p/o3-pro

https://x.com/ArtificialAnlys/status/1932489573462081898

https://x.com/OpenAIDevs/status/1932532777565446348

https://help.openai.com/en/articles/9624314-model-release-notes
