Crushing Humans on Coding Tests! Claude Opus 4.5 Drops as a Midnight Surprise, and AI Coding Enters the 'Superhuman Era'

Lately, large-model releases have been coming thick and fast, one after another.

Gemini 3 Pro had barely held the spotlight for two weeks when Claude Opus 4.5 was officially released on its heels, once again focused on coding, with that familiar flavor.

Anthropic officially claims that Opus 4.5 is smarter and more efficient overall. On 'system-level tasks' such as coding, building agents, and operating computers, it remains among the best in the world, and everyday tasks like research, making slides, and handling spreadsheets have also improved significantly.

Starting today, Opus 4.5 is fully available via apps, API, and three major mainstream cloud platforms. Developers just need to call claude-opus-4-5-20251101 in the Claude API.
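For developers who want to try it right away, a minimal call with the official Python SDK looks roughly like the sketch below. It assumes the anthropic package is installed and an ANTHROPIC_API_KEY is set in the environment; the model ID is the one quoted above.

```python
import anthropic

# Assumes the `anthropic` package is installed and ANTHROPIC_API_KEY is set
# in the environment. The model ID is the one from the announcement.
client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-opus-4-5-20251101",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Summarize what changed in Claude Opus 4.5."},
    ],
)

# Print the text of the first content block in the reply.
print(message.content[0].text)
```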

Accompanying the release is a full toolchain upgrade: the developer platform, Claude Code, the Chrome extension, Claude for Excel, an overhauled desktop app, and long conversations that no longer stall. From apps to API to cloud platforms, this is a complete rollout.

Headline from The New Stack: "Anthropic's New Claude Opus 4.5 Reclaims the Coding Crown"

A Collective 'New Season' for Large Models, and Opus 4.5 Closes the Show in Style

According to official materials and tester feedback, Claude Opus 4.5 is significantly better at understanding vague requirements and more reliable at independently tracking down complex bugs; many early-access customers feel Opus 4.5 truly 'understands' what they want.

On SWE-bench Verified, a benchmark built from real-world software engineering tasks, it is the first model to score above 80%.

Chart comparing frontier models on SWE-bench Verified where Opus 4.5 scores highest

Opus 4.5's code quality has been upgraded across the board: it tops seven of the eight programming languages in SWE-bench Multilingual, an impressive showing.

For example, the Anthropic team put Opus 4.5 through the high-difficulty exam it uses when recruiting performance engineers, and within the allotted two hours, Claude Opus 4.5 scored higher than every human candidate.

Of course, such exams only measure technical skill and judgment under time pressure; the instincts built up over years of experience and the communication and collaboration skills that matter just as much fall outside their scope.

Beyond software engineering, Claude Opus 4.5's overall capabilities have blossomed: it is stronger than previous models in vision, reasoning, and math, reaching industry-leading levels in multiple key areas:

Comparison table showing frontier model performance across popular benchmarks

More crucially, the model's capabilities are even starting to surpass some existing evaluation standards.

The agent-capability benchmark τ²-bench includes a scenario in which the model plays an airline customer-service rep helping an anxious passenger.

Under the rules, basic-economy tickets cannot be changed, so the test expects the model to refuse the request. Instead, Opus 4.5 came up with a clever workaround: first upgrade the ticket from basic economy to standard economy, then change the flight.

This method fully complies with airline policy but wasn't in the test's expected answers. Technically, it's a test failure, but this creative problem-solving demonstrates Opus 4.5's uniqueness.

Of course, in other scenarios this kind of rule-loophole exploitation might not be welcome, and preventing the model from deviating from its goals in unexpected ways is a key focus of Anthropic's safety testing.

Claude Everywhere: Desktop, Browser, Excel All Integrated

With the launch of Opus 4.5, Claude Code gets two major updates.

Plan Mode now generates more precise execution plans: Claude proactively asks clarifying questions before acting, produces an editable plan.md file, and then executes against that plan.

Additionally, Claude Code is now available in the desktop app. You can run multiple local or remote sessions simultaneously, e.g., one agent fixes code errors while another searches GitHub and a third updates the project docs.

For Claude app users, long conversations will no longer be cut off. Claude automatically summarizes earlier context when needed to keep the dialogue going.

Anthropic research product manager Dianne Na Penn said in an interview:

"During Opus 4.5 training, we improved long-context handling overall, but a longer context window alone isn't enough. Knowing which info is worth remembering is equally critical."

These improvements realize a long-requested Claude feature: 'endless conversations.' Paid users are no longer interrupted even past the context limit; the model compresses earlier context automatically and without prompting.

Claude for Chrome is now open to all Max users, allowing Claude to execute tasks across multiple browser tabs directly.

Claude for Excel's Beta has expanded to Max, Team, and Enterprise users.

For Claude and Claude Code users who can access Opus 4.5, Anthropic has removed Opus-related usage limits.

For Max and Team Premium users, Anthropic has raised overall usage quotas, so Opus token usage now counts roughly the way Sonnet usage did before. Quotas will be updated as stronger models ship.

Making Models 'Smarter and More Efficient': Opus 4.5 Gets a Major Under-the-Hood Upgrade

As models get smarter, they solve problems in fewer steps: less trial-and-error, reduced redundant reasoning, shorter thinking processes.

Compared to previous models, Claude Opus 4.5 uses significantly fewer tokens for the same or better results.

Of course, different tasks need different balances.

Sometimes developers want deep sustained thinking, other times quick flexible responses.

So the API now has a new 'effort' parameter: you can choose to prioritize time and cost savings or to maximize model capability.

At medium effort, Opus 4.5 matches Sonnet 4.5's best SWE-bench Verified score while using 76% fewer output tokens.

At maximum effort, Opus 4.5 outperforms Sonnet 4.5 by 4.3 percentage points while still reducing output tokens by 48%.
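The announcement quoted here doesn't spell out the exact request shape, so the snippet below is only a sketch: it assumes 'effort' is a top-level request field taking values such as "medium" or "high", and passes it through the Python SDK's generic extra_body option.

```python
import anthropic

client = anthropic.Anthropic()

# Sketch only: the article describes a new 'effort' parameter but not its exact
# request shape. We assume a top-level "effort" field with values like "medium"
# or "high" and pass it via the SDK's generic extra_body escape hatch.
response = client.messages.create(
    model="claude-opus-4-5-20251101",
    max_tokens=2048,
    messages=[{"role": "user", "content": "Fix the failing test in this diff: ..."}],
    extra_body={"effort": "medium"},  # assumed field name and values
)

# At medium effort the same task should consume noticeably fewer output tokens.
print(response.usage.output_tokens)
```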

With effort control, context compaction, and advanced tool calling, Claude Opus 4.5 runs longer, completes more tasks, and needs less human intervention.

Moreover, true AI agents need seamless collaboration across hundreds or thousands of tools.

Imagine an IDE assistant that integrates Git, file management, testing frameworks, and deployment, or an ops agent connected to Slack, GitHub, Google Drive, Jira, and dozens of MCP servers.

The problem is that the traditional approach stuffs every tool definition into context up front. For a five-server setup, GitHub alone takes about 26K tokens, Slack 21K, and Sentry, Grafana, and Splunk another 8K.

The conversation hasn't even started and you're already at roughly 55K tokens; add Jira and you easily pass 100K. Worse, similarly named tools lead the model to pick the wrong tool or the wrong parameters.

Tool Search Tool diagram

Anthropic launched three new features to solve this.

Tool Search Tool lets Claude discover tools dynamically and on demand, loading only the definitions it actually needs and cutting token use by roughly 85%.
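The exact API surface isn't shown here, so the snippet below is only an illustrative sketch of the idea, with hypothetical names throughout: keep the full tool catalogue out of the prompt, and load a definition into context only when a search for it matches.

```python
# Illustrative sketch of on-demand tool discovery (not Anthropic's actual API).
# The full catalogue lives outside the prompt; only matching definitions are
# surfaced into context when the model asks for them.

TOOL_CATALOG = {
    "github_create_issue": {"description": "Create an issue in a GitHub repo", "tokens": 350},
    "slack_post_message": {"description": "Post a message to a Slack channel", "tokens": 280},
    "jira_transition_ticket": {"description": "Move a Jira ticket to a new status", "tokens": 300},
    # ... hundreds more definitions that never enter context unless needed
}

def search_tools(query: str, limit: int = 3) -> list[dict]:
    """Return only the tool definitions whose description matches the query."""
    hits = [
        {"name": name, **spec}
        for name, spec in TOOL_CATALOG.items()
        if query.lower() in spec["description"].lower()
    ]
    return hits[:limit]

# The model asks "what can post to Slack?" and only ~280 tokens of definition
# are loaded, instead of the entire multi-server catalogue.
print(search_tools("slack"))
```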

Programmatic Tool Calling lets Claude invoke tools from code it writes, so it doesn't need a full reasoning pass for every single call.
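Again a conceptual sketch with hypothetical names rather than the real interface: instead of one model round trip per tool call, the model emits a small program that performs the calls itself and returns only the aggregate.

```python
# Conceptual sketch (hypothetical helper names): rather than one model round
# trip per row, the model writes a short program like this, the program runs
# in a sandbox, and only the final aggregate re-enters the context window.

def fetch_order_total(order_id: str) -> float:
    """Stand-in for a real tool call, e.g. an ERP or spreadsheet lookup."""
    return 19.99  # placeholder value for the sketch

order_ids = [f"ORD-{i:04d}" for i in range(1, 2001)]

# 2,000 tool invocations happen inside the sandbox; none of the intermediate
# results are pushed back through the model's context.
grand_total = sum(fetch_order_total(oid) for oid in order_ids)

print(f"{len(order_ids)} orders, total = {grand_total:.2f}")
```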

Tool Use Examples conveys correct usage through concrete example calls, communicating conventions that JSON schemas alone cannot express.
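The wire format isn't spelled out in the article, so the field name below is an assumption; the point is that a concrete example call can encode conventions, such as date formats, that a schema by itself cannot.

```python
# Sketch of a tool definition enriched with example calls. The input_schema is
# the standard JSON-schema style used for tool definitions; the "input_examples"
# field name is an assumption, used only to illustrate teaching usage by example.
create_ticket_tool = {
    "name": "create_ticket",
    "description": "Create a support ticket in the internal tracker.",
    "input_schema": {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "priority": {"type": "string", "enum": ["low", "medium", "high"]},
            "due_date": {"type": "string"},
        },
        "required": ["title", "priority"],
    },
    # Assumed field: a concrete example conveys conventions the schema can't,
    # e.g. that due_date is written as YYYY-MM-DD.
    "input_examples": [
        {"title": "Checkout page 500s on submit", "priority": "high", "due_date": "2025-12-01"},
    ],
}
```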

Internal tests show that with Tool Search Tool, Opus 4's accuracy on MCP evaluations rose from 49% to 74%, and Opus 4.5's from 79.5% to 88.1%.

Claude for Excel uses Programmatic Tool Calling to work through thousands of rows of data without overloading the context.

Anthropic's improvements to context and memory management have markedly lifted performance on agentic tasks.

Opus 4.5 can efficiently manage multiple subagents to build complex, coordinated multi-agent systems. In tests, combining these techniques boosted a deep-research evaluation by nearly 15 points.

The Developer Platform is becoming more composable, offering flexible, modular building blocks so you can tune efficiency, tools, and context to your needs and assemble the intelligent system you want.

Dazzling as Opus 4.5's upgrade is, the clearer trend is that model 'personalities' are diverging further.

Within Claude's lineup, the extra-large Opus excels at coding, system operations, and structured reasoning, while for copywriting, Sonnet's balance of performance and cost-effectiveness is often the better fit.

This release confirms it again.

Future model selection won't just be about benchmarks; it will be about whether a model's 'way of doing things' matches yours.

In other words, picking models is increasingly like picking colleagues.

Official blog link: https://www.anthropic.com/news/claude-opus-4-5

Main Tag: Claude Opus 4.5

Sub Tags: AI Coding, Toolchain Upgrades, Agents, SWE-Bench

