Crushing DeepSeek V3! Alibaba Open-Sources New Qwen3, Dominating Benchmarks with a Clear Lead

At 1 AM today, Alibaba open-sourced a new version of the Qwen3 series, Qwen3-235B-A22B-Instruct-2507.

Surprisingly, Alibaba has dropped the hybrid thinking mode. The new Qwen3 is a non-thinking model, returning to a pure instruction-tuned design, yet its performance is exceptionally strong.

According to data released by Alibaba, the new Qwen3 significantly surpasses DeepSeek's recently open-sourced V3-0324 model across dozens of benchmarks in six major categories: knowledge, reasoning, code, alignment, agent capabilities, and multilingual testing.

For example, on the SimpleQA test, DeepSeek V3 scored 27.2 points while the new Qwen3 scored 54.3; on CSimpleQA, DeepSeek V3 scored 71.1 while the new Qwen3 scored 84.3.

On ZebraLogic, DeepSeek V3 scored 83.4 while the new Qwen3 scored 95.0; on WritingBench, DeepSeek V3 scored 74.5 while the new Qwen3 scored 85.2; on TAU-Airline, DeepSeek V3 scored 32.0 while the new Qwen3 scored 44.0; and on PolyMATH, DeepSeek V3 scored 32.2 while the new Qwen3 scored 50.2.

Similarly, the new Qwen3 also surpasses Moonshot AI's recently open-sourced Kimi K2.

Qwen3 vs DeepSeek V3 Performance Comparison

Open-source addresses: https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507

https://modelscope.cn/models/Qwen/Qwen3-235B-A22B-Instruct-2507
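
For anyone pulling the weights from the repositories above, the following is a minimal sketch of standard transformers chat inference. It assumes a recent transformers release with Qwen3 MoE support and enough GPU memory to shard a 235B-parameter MoE model; the prompt and generation settings are illustrative only.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-235B-A22B-Instruct-2507"

# Load tokenizer and model; device_map="auto" shards the weights across
# available GPUs (a 235B MoE checkpoint needs a multi-GPU node).
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)

# This is a non-thinking Instruct model, so the chat template is applied
# directly, with no thinking-mode switches.
messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer([text], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=1024)

# Strip the prompt tokens and decode only the newly generated text.
response = tokenizer.decode(output_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```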

A netizen commented, "Of all the mid-sized large language models I've evaluated, none come close to Qwen in terms of strict adherence to prompts. I don't know what secret sauce you used, but keep up the excellent work."

"Wow, does this mean your new thoughtless model has beaten KimiK2 in all these benchmarks?"

"Impressive optimization improvements."

"Great work, folks. But when will you release a smaller model?"

"It has already beaten Kimi-K2."

"I just compared KimiK2's single-turn coding. The prompt was: 'Create a complete POS system in an HTML file, with great design and suitable for mobile use.' I was more impressed with Qwen3 than KimiK2."

"The Qwen team's update this time is fantastic! The new Qwen3-235B-A22B-Instruct-2507 adopts a mode where instruction models and thought models are trained separately. This move is very smart and is expected to improve model performance and versatility. I look forward to seeing this innovative achievement continue to develop!"

"Honestly, I love your team so much! Keep up the great work! Super excited for the visual language version!"

The new Qwen3 has 235 billion total parameters, of which 22 billion are activated. The non-embedding parameter count is 234 billion. The model has 94 layers and uses grouped-query attention with 64 query heads and 4 key-value heads. Its MoE design includes 128 experts, 8 of which are activated per token, and the native context length is 262,144 tokens.
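
These architecture figures can be checked directly against the published configuration file, which is a small JSON and does not require downloading the weights. The sketch below is illustrative: it assumes the transformers library can resolve the Hugging Face repo above, and the MoE-specific attribute names follow the Qwen MoE configuration convention, so they are read defensively.

```python
from transformers import AutoConfig

# Fetch only the model configuration, not the 235B-parameter weights.
cfg = AutoConfig.from_pretrained("Qwen/Qwen3-235B-A22B-Instruct-2507")

print(cfg.num_hidden_layers)        # expected: 94 layers
print(cfg.num_attention_heads)      # expected: 64 query heads (GQA)
print(cfg.num_key_value_heads)      # expected: 4 key-value heads
print(cfg.max_position_embeddings)  # expected: 262144 native context

# MoE fields are read defensively; the exact attribute names are an assumption.
print(getattr(cfg, "num_experts", None))          # expected: 128 experts
print(getattr(cfg, "num_experts_per_tok", None))  # expected: 8 activated experts
```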

The new Qwen3 has been significantly optimized in general capabilities such as instruction following, logical reasoning, text comprehension, mathematics, science, programming, and tool usage. It has also made notable progress in covering long-tail knowledge across multiple languages, aligns more closely with user preferences in subjective and open-ended tasks, generates more helpful and higher-quality text, and offers enhanced understanding of 256K long contexts.

In terms of performance, Qwen3-235B-A22B-Instruct-2507 performs excellently across multiple benchmarks. For instance, in knowledge-based tests, it scored 83.0 in MMLU-Pro, 93.1 in MMLU-Redux, and 77.5 in GPQA. For reasoning capabilities, it scored 70.3 in AIME25 and 55.4 in HMMT25.

Qwen3 Benchmark Performance Chart

In terms of programming capabilities, it scored 51.8 in LiveCodeBench v6 and 87.9 in MultiPL-E. For alignment capabilities, it scored 88.7 in IFEval and 79.2 in Arena-Hard v2. Furthermore, it demonstrates excellent performance in multilingual capabilities, scoring 77.5 in MultiIF and 79.4 in MMLU-ProX.

Additionally, Qwen3 excels at tool calling. It is recommended to use Qwen-Agent to fully leverage its agent capabilities. Qwen-Agent internally encapsulates tool-calling templates and tool-calling parsers, significantly reducing coding complexity. Available tools can be defined via an MCP configuration file, Qwen-Agent's built-in tools, or by integrating other tools yourself.
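
As a concrete illustration, here is a minimal sketch of the Qwen-Agent pattern described above. The local OpenAI-compatible endpoint, the API key placeholder, the MCP server entries, and the example question are assumptions for illustration; adjust them to your own deployment.

```python
from qwen_agent.agents import Assistant

# Point Qwen-Agent at an OpenAI-compatible endpoint serving the new Qwen3.
# The endpoint URL and API key below are placeholders for your own deployment.
llm_cfg = {
    "model": "Qwen3-235B-A22B-Instruct-2507",
    "model_server": "http://localhost:8000/v1",
    "api_key": "EMPTY",
}

# Tools can come from an MCP configuration (the servers below are examples)
# or from Qwen-Agent's built-in tools such as the code interpreter.
tools = [
    {
        "mcpServers": {
            "time": {"command": "uvx", "args": ["mcp-server-time", "--local-timezone=Asia/Shanghai"]},
            "fetch": {"command": "uvx", "args": ["mcp-server-fetch"]},
        }
    },
    "code_interpreter",
]

# Qwen-Agent wraps the tool-calling templates and parsers internally,
# so the agent can be driven with plain chat messages.
bot = Assistant(llm=llm_cfg, function_list=tools)

messages = [{"role": "user", "content": "What time is it in Shanghai right now?"}]
responses = []
for responses in bot.run(messages=messages):
    pass  # bot.run streams incremental response lists; keep the final one
print(responses)
```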

Main Tag: Large Language Models

Sub Tags: Open Source AI, AI Benchmarks, LLM Performance, Alibaba


