GPT-5 vs Claude Opus 4.1: Coding Capability Assessment

Reproduced with authorization from Big Data Digest, originally from Xi Xiaoyao Tech Talk.

When it comes to serious programming, Anthropic's Claude has long been regarded as the king, holding the No. 1 spot in many developers' minds.

But recently, the tide seems to have turned.

OpenAI released GPT-5, and I've seen messages circulating widely in official accounts, communities, and forums: GPT-5 is here, and its coding capabilities are "terrifyingly strong."

I've seen plenty of hyped claims crowning GPT-5 the "new king of programming," along with various GPT-5 evaluations, but honestly, none has been convincing. They either rehash official demos or declare GPT-5 strong after generating a few decent-looking web pages. Drawing conclusions from that seems premature.

So, like many others, I'm curious about whether GPT-5 or Claude is more powerful, and which programming tasks each model excels at.

Today, I came across a blog post by a developer named Rohit, who published a practical comparison of GPT-5 vs Claude Opus 4.1's coding capabilities, and I'm sharing it here.

First, all the code generated during the evaluation is open source and can be viewed here: https://github.com/rohittcodes/gpt-5-vs-opus-4-1

Here are the core conclusions:

Algorithms: GPT‑5 wins on both speed and token count (8K vs 79K tokens).

Web Development: Opus 4.1 reproduces the Figma design with higher fidelity but consumes far more tokens (1.4M+ vs GPT-5's ~900K).

GPT-5 responds faster and costs less (on the algorithm task it used roughly 90% fewer tokens than Opus 4.1), making it better suited as an efficient daily development assistant. If you prioritize high design fidelity and have a flexible budget, Opus 4.1 has the edge.

Next, let's look at the basic model information and token usage efficiency comparison:

Context Window: Claude Opus 4.1 supports 200,000 tokens of context (the author did not report its maximum output length); GPT‑5 supports a 400,000-token context and can output up to 128K tokens.

Token Usage Efficiency: Although GPT‑5 has a larger context window, it consistently uses fewer tokens for the same tasks, significantly reducing running costs.

GPT‑5 slightly outperforms Opus 4.1 on coding benchmarks such as SWE-bench, but benchmark scores only say so much, so the author also ran practical tests on several cases.


The test content covers common scenarios in actual development:

Programming Languages and Task Types:

Algorithm Problems: Implementing a LeetCode Hard problem in Java.

Web Development: Using TypeScript + React to write Next.js pages based on Figma designs, generating code via Rube MCP (a general MCP access layer).

Other Tasks: Business logic implementation, such as a customer churn prediction model.

Environment: All tasks were completed in the Cursor IDE in conjunction with Rube MCP.

Measurement Metrics: Token count, time taken, code quality, and actual results.

Both models used the exact same prompts.

01 Figma Design Development

Rohit found a complex dashboard design from the Figma community and asked both models to replicate it using Next.js and TypeScript.


Prompt as follows:

Create a Figma design clone using the given Figma design as a reference: [FIGMA_URL]. Use MCP's Figma toolkit for this task.

Try to make it as close as possible. Use Next.js with TypeScript. Include:

Responsive design

Proper component structure

Styled-components or CSS modules

Interactive elements

Performance of the two contestants:

GPT-5:

Time Taken: Approximately 10 minutes

Tokens: 906,485 (~900K)

GPT-5's efficiency is undeniable: it completed the task in 10 minutes, and the application ran. But the result... how to put it: functionally complete, yet visually unsatisfying. It grasped the design's framework but missed its essence. Colors, spacing, and fonts were far from the original, as if it were running in "low-fidelity" mode.

圖片

It's like an engineer who gets the job done but lacks aesthetic sense and turns in rough work.

Claude Opus 4.1:

Time Taken: Longer (due to iterative refinement)

Tokens: Over 1.4 million tokens (55% more than GPT-5!)

Opus 4.1 started with a bit of "temper," insisting on using Tailwind even though styled-components were specified, requiring manual correction. But once it "admitted its mistake" and started working, the results were astonishing.

The UI was almost identical to the Figma design! The visual fidelity was outstanding.

圖片

A perfectionist "artist": expensive and a bit stubborn, but the work is impeccable.

02 LeetCode Algorithm Problem

To test pure logic and efficiency, Rohit posed the classic LeetCode problem: "Median of Two Sorted Arrays," requiring an O(log(m+n)) time complexity.

Prompt as follows:

Given two sorted arrays nums1 and nums2 of size m and n respectively, return the median of the two sorted arrays. The overall run time complexity should be O(log (m+n)).

GPT-5:

Time Taken: Approximately 13 seconds

Tokens: 8,253

GPT-5 provided a clean, concise, and perfectly correct binary search solution in just 13 seconds, with virtually no unnecessary output. The code was elegant and highly efficient.
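The article doesn't reproduce the generated code itself, but the classic O(log(m+n)) solution this prompt targets is a partition-based binary search over the shorter array. Here is a minimal sketch of that approach (in Python for readability; the actual test was run in Java):

```python
def find_median_sorted_arrays(nums1, nums2):
    # Binary-search the partition of the shorter array so that the
    # two left halves together contain half of all elements.
    if len(nums1) > len(nums2):
        nums1, nums2 = nums2, nums1
    m, n = len(nums1), len(nums2)
    lo, hi = 0, m
    half = (m + n + 1) // 2
    while lo <= hi:
        i = (lo + hi) // 2   # elements taken from nums1's left side
        j = half - i         # elements taken from nums2's left side
        left1 = nums1[i - 1] if i > 0 else float("-inf")
        right1 = nums1[i] if i < m else float("inf")
        left2 = nums2[j - 1] if j > 0 else float("-inf")
        right2 = nums2[j] if j < n else float("inf")
        if left1 <= right2 and left2 <= right1:  # valid partition found
            if (m + n) % 2:
                return float(max(left1, left2))
            return (max(left1, left2) + min(right1, right2)) / 2
        if left1 > right2:
            hi = i - 1       # took too many elements from nums1
        else:
            lo = i + 1       # took too few elements from nums1
    raise ValueError("input arrays must be sorted")
```

Searching only the shorter array keeps the complexity at O(log(min(m, n))), which comfortably satisfies the O(log(m+n)) requirement.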

Claude Opus 4.1:

Time Taken: Approximately 34 seconds

Tokens: 78,920 (nearly 10 times GPT-5's!)

Opus 4.1, however, took a completely different approach. It not only provided the answer but also included a "mini-thesis": detailed reasoning steps, comprehensive code comments, and even built-in test cases, as if it feared you wouldn't understand. While the core algorithm was the same, its output offered extremely high "educational value."


If you want a quick answer, ask GPT-5; if you want to learn the problem-solving methodology, Opus 4.1 is your best teacher.

03 Complex ML Task

The final challenge was to build a complete machine learning pipeline for predicting customer churn.

However, after witnessing Opus 4.1's astonishing token consumption in the first round, Rohit wisely "rested" it out of respect for his wallet. This round, GPT-5 went solo.

Prompt as follows:

Build a complete ML pipeline for predicting customer churn, including:

Data preprocessing and cleaning

Feature engineering

Model selection and training

Evaluation and metrics

Explain the reasoning behind each step in detail

The results showed that GPT-5 was fully capable of handling this kind of complex end-to-end task. From data preprocessing and feature engineering, through multi-model training (logistic regression, random forest, XGBoost) with SMOTE to address class imbalance, to comprehensive performance evaluation, the entire process was seamless and the code robust and reliable.

Time Taken: Approximately 4-5 minutes

Tokens: Approximately 86,850
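The article doesn't include GPT-5's generated code, but a condensed sketch of the pipeline it describes could look like the following. This is illustrative Python assuming scikit-learn, imbalanced-learn, and xgboost; the dataset path and the 0/1 "churn" column are hypothetical placeholders, not Rohit's actual data:

```python
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

# Data preprocessing and cleaning (hypothetical dataset; "churn" is a 0/1 target).
df = pd.read_csv("churn.csv").dropna()

# Basic feature engineering: one-hot encode categorical columns.
X = pd.get_dummies(df.drop(columns=["churn"]))
y = df["churn"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scale features, fitting on the training split only to avoid leakage.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# SMOTE oversamples the minority class on the training data only,
# addressing the class imbalance mentioned in the writeup.
X_train, y_train = SMOTE(random_state=42).fit_resample(X_train, y_train)

# Multi-model training and evaluation, as described above.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "xgboost": XGBClassifier(eval_metric="logloss"),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    print(f"{name}: ROC-AUC = {roc_auc_score(y_test, proba):.3f}")
    print(classification_report(y_test, model.predict(X_test)))
```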

04 Cost Showdown: A Battle of Real Money

Now that we've seen the performance, let's look at the costs. After all, this might be the most influential factor for developers' choices.


GPT-5 (Thinking Mode) - Completed three test tasks

Web Application: ~$2.58

Algorithm: ~$0.03

ML Pipeline: ~$0.88

Total: Approximately $3.50

Opus 4.1 (Thinking + Max Mode) - Completed only two test tasks

Web Application: ~$7.15

Algorithm: ~$0.43

Total: $7.58

The conclusion is clear at a glance: despite completing one fewer task, Opus 4.1 cost more than double what GPT-5 did.
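As a quick sanity check on the arithmetic, a few lines of Python using only the per-task figures reported above reproduce the totals and the cost ratio:

```python
# Per-task costs in USD, as reported in the tests above.
gpt5_costs = {"web_app": 2.58, "algorithm": 0.03, "ml_pipeline": 0.88}
opus_costs = {"web_app": 7.15, "algorithm": 0.43}  # Opus sat out the ML round

gpt5_total = sum(gpt5_costs.values())  # 3.49
opus_total = sum(opus_costs.values())  # 7.58

print(f"GPT-5 total (3 tasks):    ${gpt5_total:.2f}")
print(f"Opus 4.1 total (2 tasks): ${opus_total:.2f}")
print(f"Opus / GPT-5 cost ratio:  {opus_total / gpt5_total:.2f}x")  # ~2.17x
```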

05 Evaluation Conclusions

Advantages of GPT-5:

Low token usage and fast response in algorithm tasks, extremely high efficiency.

More suitable for daily development, especially for rapid iteration and prototype validation.

Overall token cost is significantly lower than Opus 4.1.

Advantages of Claude Opus 4.1:

Provides clear, step-by-step explanations of its code logic, making it great for learning.

Excels at visual fidelity, reproducing designs very close to the Figma originals.

Suitable for scenarios requiring high interface precision.

Therefore, if you are doing daily development, prioritize GPT‑5 for a balance of performance and cost. If a design task demands high interface reproduction accuracy, choose Claude Opus 4.1 for a better final result, but be prepared with a sufficient budget.

Recommended Combination Strategy: First, lay the foundation with GPT‑5, then use Opus 4.1 to refine details in critical interface segments, achieving a balance between efficiency and precision.

Reference: https://composio.dev/blog/openai-gpt-5-vs-claude-opus-4-1-a-coding-comparison
