These past few days, the AI community has been buzzing as if it were celebrating the New Year.
Just now, Anthropic officially released the Claude 4 series models: Claude Opus 4 and Claude Sonnet 4.
No slogans, no lengthy papers. This time, Claude's upgrade comes down to a single keyword: getting work done.
According to Anthropic, Opus 4 is currently the world's most powerful programming model, capable of reliably handling complex and long-duration tasks and Agent workflows. Sonnet 4, on the other hand, focuses on enhancing programming and reasoning capabilities, responding to user instructions more precisely.
In addition, Anthropic also simultaneously launched the following new features:
Extended thinking with tool use (beta): during extended thinking, Claude can alternate between reasoning and tools such as web search, improving both its reasoning process and the quality of its responses (see the API sketch after this list).
New model capabilities: both models can use tools in parallel and follow instructions more precisely; with developer authorization, they can also extract and save key information to memory, keeping conversations coherent over time.
Claude Code officially released: Claude Code now supports GitHub Actions, VS Code, and JetBrains.
New API features: the Anthropic API gains four new capabilities, including a code execution tool, an MCP connector, a Files API, and prompt caching with a TTL of up to 1 hour.
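For developers who want to try these features from the API side, here is a minimal sketch using the Anthropic Python SDK. The model ID, the token budgets, and the beta header for interleaved thinking are assumptions based on launch-day documentation; check the current docs before depending on them.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-20250514",  # launch-day model ID (assumption)
    max_tokens=16000,                # must exceed the thinking budget
    # Enable extended thinking with an explicit token budget.
    thinking={"type": "enabled", "budget_tokens": 8000},
    # Beta header letting Claude interleave tool calls with its thinking;
    # header name taken from the launch announcement (assumption).
    extra_headers={"anthropic-beta": "interleaved-thinking-2025-05-14"},
    # Server-side web search tool, so Claude can look things up mid-reasoning.
    tools=[{"type": "web_search_20250305", "name": "web_search", "max_uses": 3}],
    messages=[{"role": "user", "content": "What changed in the latest Node.js LTS?"}],
)

for block in response.content:
    if block.type == "thinking":
        print("[thinking]", block.thinking[:200], "...")
    elif block.type == "text":
        print(block.text)
```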
Claude 4 Released: Has the Crown for Strongest Programming AI Changed Hands Again?
As Anthropic's most powerful model to date, Opus 4 scored a high 72.5% on the SWE-bench programming benchmark and led the field on Terminal-bench with 43.2%, making it the strongest coding model currently available.
Claude Opus 4 excels at programming and solving reasoning problems. It can deconstruct problems, patch logic, debug precisely, and even execute complex tasks that require several hours, just like an experienced programmer.
Anthropic allowed some customers to preview Opus 4, and in Replit's practical tests, Opus 4 showed higher accuracy in multi-file, large-modification projects.
Block said that in its agent, codenamed Goose, this was the first model to noticeably improve code quality during editing and debugging while maintaining stability and performance.
Rakuten used the model for a demanding open-source refactoring task, running stably for 7 consecutive hours, performing remarkably well. Cognition directly pointed out that Opus 4 can solve complex tasks that other models cannot, successfully handling multiple critical operations that previous models failed to complete.
I had Opus 4 create animated weather cards, requiring it to display four different weather states, each with its own animation effects. It generated them in a single attempt, with stunning results.
Compared to Opus 4, Sonnet 4 may not be the strongest, but it might be the most suitable for most developers.
Compared to its predecessor Sonnet 3.7, its programming capabilities, logical reasoning, and response controllability have significantly improved. Its SWE-bench score directly jumped to 72.7%, almost on par with Opus 4.
Although Sonnet 4 doesn't outperform Opus 4 in most benchmarks, it is generally lighter and more flexible, with a clearer focus.
I asked Sonnet 4 to "create an 8-bit style Snake game, including an automated AI demo feature, implemented as a single HTML/CSS/JavaScript file." The first attempt failed, but the second delivered successfully, with output of shippable quality.
Therefore, it's not hard to understand why GitHub chose it as the base model for the new generation of GitHub Copilot. Manus said it provides clearer handling of complex instructions and more elegant output formats; Sourcegraph pointed out that it focuses more on core issues and writes more structured code.
As a "hybrid reasoning model," the Claude 4 series supports two modes: one for near-instant responses, and another for deep thinking, suitable for more complex reasoning tasks.
In SWE-bench Verified and Terminal-bench evaluations without extended thinking, both models performed excellently; once extended thinking (with budgets of up to 64K tokens) was enabled, their capabilities were boosted further. In tests like GPQA, MMMLU, and AIME, they had almost no competitors:
In MMMLU tests, Opus 4 scored 87.4%, and Sonnet 4 also achieved 85.4%;
In AIME tests, both scored over 33%, far exceeding previous generations.
Anthropic also designed a new reasoning process for TAU-bench, allowing models to execute reasoning tasks up to 100 steps long, simulating complex thought processes such as retail strategy design and airline scheduling optimization. In this mode, Claude is encouraged to write a complete chain of thought rather than jumping directly to conclusions.
At the same time, Anthropic has further optimized model behavior.
Compared to their predecessors, Opus 4 and Sonnet 4 are less prone to taking "shortcuts" or exploiting logical loopholes, with a 65% decrease in the occurrence of related issues in tests prone to inducing AI deceptive behavior.
Once a developer authorizes the model to access local files, Claude can not only understand documents but also create, update, and maintain "memory files," recording key information to build a complete working memory.
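Anthropic hasn't published the exact mechanics, but the general pattern is to expose file access to Claude as tools it can call. Below is a minimal, hypothetical sketch: the `read_memory`/`write_memory` tools and the `memory.md` file are inventions for illustration, not Anthropic's API.

```python
import pathlib
import anthropic

client = anthropic.Anthropic()
MEMORY = pathlib.Path("memory.md")  # hypothetical location of the memory file

# Hypothetical client-side tools that give Claude a place to keep notes.
tools = [
    {
        "name": "read_memory",
        "description": "Read the agent's memory file.",
        "input_schema": {"type": "object", "properties": {}},
    },
    {
        "name": "write_memory",
        "description": "Overwrite the agent's memory file with updated notes.",
        "input_schema": {
            "type": "object",
            "properties": {"content": {"type": "string"}},
            "required": ["content"],
        },
    },
]

messages = [{"role": "user", "content": "Resume yesterday's refactor; check your notes first."}]
while True:
    response = client.messages.create(
        model="claude-opus-4-20250514", max_tokens=2048, tools=tools, messages=messages
    )
    if response.stop_reason != "tool_use":
        break
    # Execute each requested tool call and feed the results back.
    results = []
    for block in response.content:
        if block.type == "tool_use":
            if block.name == "read_memory":
                output = MEMORY.read_text() if MEMORY.exists() else "(no notes yet)"
            else:  # write_memory
                MEMORY.write_text(block.input["content"])
                output = "saved"
            results.append(
                {"type": "tool_result", "tool_use_id": block.id, "content": output}
            )
    messages.append({"role": "assistant", "content": response.content})
    messages.append({"role": "user", "content": results})

print(response.content[-1].text)
```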
Anthropic clearly stated that future excellent AI Agents will require three capabilities:
Contextual intelligence: Not only understanding the task but also understanding who you are, what you're doing, and even why. It can understand organizational habits and personal styles, continuously optimizing itself.
Long-task execution capability: Able to independently complete long-flow, complex-structured tasks, and even collaborate with other humans or AIs.
True collaboration capability: Able to engage in high-quality conversations, adapt to your workflow, and provide clear reasoning explanations for its actions.
For example, while playing Pokémon, Opus 4 created a "navigation guide" memory file to keep track of its progress.
Finally, at the tool level, Anthropic also introduced a new feature called "thought summarization." This mechanism automatically invokes a smaller model to compress and summarize ideas when the model's thinking path is too long, making the final presented information more concise and clear.
Reportedly, this feature is only triggered in about 5% of complex tasks; in most scenarios, the model's reasoning chain is already efficient enough without simplification.
Renowned blogger Dan Shipper also experienced the Claude 4 series models and provided his evaluation.
He believes Opus excels particularly in programming, especially within Claude Code, where it can independently complete programming tasks for extended periods without intervention, and is more powerful than OpenAI's Codex.
For instance, it successfully implemented an infinite scrolling feature; although it requires further optimization, the effect is already close to a publishable version.
On writing, o3 remains the stronger writer, but Opus is an excellent editor. It edits text honestly, doesn't hand out empty praise, points out real problems, and can even help surface writing topics and patterns you hadn't noticed.
However, for everyday tasks, Opus's performance is not as good as o3. ChatGPT's memory function is more sticky and effective in daily use, while Opus still needs significant improvements in intelligence and speed to become the preferred tool for daily use.
Currently, both models are available on Anthropic API, Amazon Bedrock, and Google Cloud Vertex AI platforms, supporting Pro, Max, Team, and Enterprise plans, with Sonnet 4 even available to free users.
Prices remain consistent with previous generations: Opus 4 is $15/$75 per million tokens (input/output), and Sonnet 4 is $3/$15.
At a time when AI Agents are becoming mainstream productivity tools, Anthropic's two new models offer clear options for different levels of users: Opus 4 is for ultimate performance and research breakthroughs, while Sonnet 4 is for mainstream adoption and engineering efficiency.
AI models need to be not only smart but also durable, robust, and controllable. This is precisely the clear signal that Claude Opus 4 and Sonnet 4 are demonstrating, from basic capabilities to detailed mechanisms, from code scenarios to long-task execution.
Claude Code Fully Open: Is the Developer's New "AI Assistant" Trustworthy?
A few months ago, Anthropic launched Claude Code, a programming tool for developers, as a research preview. Today, this tool is officially open to all developers.
Starting today, whether in the command line terminal, common IDEs, or your self-built application backend, Claude Code will be deeply embedded in more real development scenarios. Anthropic also simultaneously released the Claude Code SDK, helping developers build custom workflows and automated toolchains based on this Agent.
One significant update is the beta extensions launched for VS Code and the JetBrains family of IDEs.
With this extension, Claude can directly provide modification suggestions in the code editor, allowing developers to quickly review changes and track task progress without leaving their familiar work environment. Simply run an installation command in the IDE's terminal to start Claude Code.
Beyond the IDEs, Anthropic also released the extensible Claude Code SDK, making it easy to build your own agents and applications on top of Claude Code.
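At its simplest, this amounts to driving Claude Code non-interactively from a script. A minimal sketch in Python, assuming the CLI's `-p` (print) flag and JSON output format work as described at launch:

```python
import json
import subprocess

# Run Claude Code headlessly: -p sends a single prompt and exits, and
# --output-format json returns a machine-readable result (flag names per
# the launch-day CLI; verify against `claude --help` on your install).
completed = subprocess.run(
    ["claude", "-p", "Add a unit test for the date helpers and run it",
     "--output-format", "json"],
    capture_output=True, text=True, check=True,
)
reply = json.loads(completed.stdout)
print(reply.get("result"))  # the final assistant message, per the JSON schema
```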
Furthermore, Claude Code has entered a deep integration testing phase with GitHub. Developers can now @Claude Code in Pull Requests to assist with code review comments, fixing CI errors, submitting changes, and other common tasks. Simply install the GitHub plugin via the /install-github-app command to achieve "prompt-to-change" automated collaboration.
During today's live session, Anthropic CPO Mike Krieger stated that as Claude Code enters the scaled application phase, "prompt caching" has become another frequently requested feature. This capability is now officially launched: the default prompt cache TTL is 5 minutes, and advanced users can extend it to 1 hour.
This upgrade will significantly reduce the cost of running long-duration Agent tasks: up to 90% reduction in token costs and 85% reduction in response latency, making Claude more suitable for handling complex task chains involving continuous interaction and multi-turn reasoning.
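On the API side, caching is opted into per content block. A minimal sketch, assuming the launch-day `cache_control` syntax and the beta header for the extended 1-hour TTL (both are assumptions; check the current docs):

```python
import anthropic

client = anthropic.Anthropic()

# A large, stable prefix worth caching, e.g. a digest of the whole codebase.
big_context = open("repo_digest.txt").read()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    # Beta header enabling the extended 1-hour TTL (name per launch docs).
    extra_headers={"anthropic-beta": "extended-cache-ttl-2025-04-11"},
    system=[
        {
            "type": "text",
            "text": big_context,
            # Mark the reusable prefix as cacheable; "ttl": "1h" opts into
            # the extended cache instead of the default 5 minutes.
            "cache_control": {"type": "ephemeral", "ttl": "1h"},
        }
    ],
    messages=[{"role": "user", "content": "Where is the retry logic implemented?"}],
)
# usage reports cache_creation_input_tokens / cache_read_input_tokens,
# which is how you verify the cache is actually being hit.
print(response.usage)
```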
Claude Code's product manager demonstrated a real task at the launch event: using Claude Code to add a table component to Excalidraw. This long-shelved feature request was fully implemented by Claude with just one prompt.
After opening the project in VS Code, the developer submitted a clear requirement description to Claude Code: hoping to add a custom-sized, draggable, and style-compatible table component. Claude Code immediately generated a detailed task list and began modifying the project code step by step.
Thanks to deep IDE integration, developers can clearly see the code differences (diff) for each change and choose to approve manually or enable auto-accept mode as needed. During the demo, Claude Code also handled all processes including Lint checks, test runs, and PR submission, completing the entire implementation cycle in less than 90 minutes.
The final results included a complete new table function, automatic generation and passing of test cases, seamless integration with Excalidraw UI, code quality meeting Lint requirements, successful build, and all output completed independently by Claude Code without manual editing.
For example, when a user @-mentions Claude in an Issue, it not only responds to the request but also proactively opens a PR and keeps posting progress updates in the comments until the change is submitted. This means Claude Code is no longer limited to the local environment; it becomes a "cloud code colleague" you can dispatch from GitHub, Slack, or any platform that supports the API.
Anthropic also mentioned that some customers have used the Claude Code SDK to build more complex use cases: including running multiple instances in parallel to fix unstable tests, automatically increasing coverage, and even performing emergency troubleshooting during night shifts.
Programming is the most realistic application scenario for AI Agents. In the past two weeks, OpenAI launched Codex, Google revealed Jules, and Anthropic announced the full release of Claude Code in the early hours.
Three leading AI companies, almost simultaneously, chose the same path: Agents are starting work.
This is no coincidence. Among all tasks requiring "thinking + execution," programming is the most naturally suited scenario for AI Agent deployment: input and output are highly structured, standard answers are clear, tool invocation interfaces are rich, and there is a large amount of reusable open-source data and feedback data.
More importantly, its users are the developer community, who were among the first to adopt AI. They are accustomed to customization, willing to try new things, proficient in integration, and possess the ability and willingness to pay for good tools. This is a naturally adapted application field for Agent product iteration.
Whether AI can "do work" for programmers may be another "productivity earthquake" following ChatGPT's transformation of content creation. The first shot may well be this full release of Claude Code.
In just ten short minutes, it completed a development task that previously took days, or even several iteration cycles, to advance. Such changes are constantly happening. The next generation of developers will begin by learning to write their first instructions to an Agent.
At the end of the launch event, Anthropic CEO Dario Amodei and CPO Mike Krieger held a fireside chat, summarized by APPSO as follows:
Mike Krieger: Welcome back to the stage, Dario. Next we'll have a one-on-one conversation.
Dario Amodei: Hello, good to see you again. It's like having a one-on-one conversation in front of a full audience, which is nice.
Mike Krieger: Claude 4 has been released, including Claude Sonnet 4 and Claude Opus 4, and both are live. What are you most excited about with the Claude 4 models? And how has it changed your view on what might be achievable in the next 12 months?
Dario Amodei: Yes, from an abstract perspective, what I'm most excited about is that whenever a new model category is launched, you can do more with it, right? We'll continue to release models after Claude 4, maybe a Claude 4.1, just like we did with Sonnet 3.5.
I think we're only beginning to explore what this new generation of models can do. The "autonomy" of models will go far beyond current levels; letting models execute tasks on their own for long stretches is something we're just getting started on. I'm also increasingly optimistic about applying models to cybersecurity tasks, which can be seen as a kind of programming task, just usually at a higher level.
So I think we might have finally reached a threshold where we can handle these types of tasks. As a former biologist, I'm also very excited about the application of models in biomedicine and detailed scientific research, and I think Opus, in particular, will be very good at this work.
Mike Krieger: This reminds me of "Machines of Loving Grace." What role do you think Claude 4 plays in the overall development path? I like to joke that people read "Machines of Loving Grace" as an essay, but I read it as a product roadmap for the next few years. How do you think Claude 4 fits into this journey?
Dario Amodei: Yes, that article was actually a bit like my product roadmap, but at the time I didn't actually know how to implement it, and then I just said, "Okay, everyone, this is what you need to do."