At 1 AM today, Alibaba open-sourced a new version of the Qwen3 series, Qwen3-235B-A22B-Instruct-2507.
Surprisingly, Alibaba has dropped the hybrid thinking mode. The new Qwen3 is a non-thinking model, returning to a pure instruction-tuned design, yet its performance is exceptionally strong.
According to data released by Alibaba, the new Qwen3 significantly surpasses DeepSeek's recently open-sourced V3-0324 model across dozens of benchmarks in six major categories: knowledge, reasoning, coding, alignment, agent capabilities, and multilingual tests.
For example, on SimpleQA, DeepSeek V3 scored 27.2 versus 54.3 for the new Qwen3; on CSimpleQA, 71.1 versus 84.3; on ZebraLogic, 83.4 versus 95.0; on WritingBench, 74.5 versus 85.2; on TAU-Airline, 32.0 versus 44.0; and on PolyMATH, 32.2 versus 50.2.
The new Qwen3 also surpasses Moonshot AI's recently open-sourced Kimi K2.
Open-source addresses: https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507
https://modelscope.cn/models/Qwen/Qwen3-235B-A22B-Instruct-2507
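For reference, the checkpoint loads through the standard Hugging Face transformers chat workflow. Below is a minimal sketch using the model name from the links above; actually running the full 235B weights assumes a multi-GPU setup or a quantized variant.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-235B-A22B-Instruct-2507"

# device_map="auto" shards the weights across the available GPUs.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)

# Build the prompt with the model's own chat template.
messages = [{"role": "user", "content": "Briefly introduce large language models."}]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512)
# Strip the prompt tokens and decode only the generated continuation.
print(tokenizer.decode(output_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```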
A netizen commented, "Of all the mid-sized large language models I've evaluated, none come close to Qwen in terms of strict adherence to prompts. I don't know what secret sauce you used, but keep up the excellent work."
"Wow, does this mean your new thoughtless model has beaten KimiK2 in all these benchmarks?"
"Impressive optimization improvements."
"Great work, folks. But when will you release a smaller model?"
"It has already beaten Kimi-K2."
"I just compared KimiK2's single-turn coding. The prompt was: 'Create a complete POS system in an HTML file, with great design and suitable for mobile use.' I was more impressed with Qwen3 than KimiK2."
"The Qwen team's update this time is fantastic! The new Qwen3-235B-A22B-Instruct-2507 adopts a mode where instruction models and thought models are trained separately. This move is very smart and is expected to improve model performance and versatility. I look forward to seeing this innovative achievement continue to develop!"
"Honestly, I love your team so much! Keep up the great work! Super excited for the visual language version!"
The new Qwen3 has 235 billion total parameters, 22 billion of which are activated per token, with a non-embedding parameter count of 234 billion. The model has 94 layers and uses grouped-query attention with 64 query heads and 4 key-value heads. Its mixture-of-experts design has 128 experts, 8 of which are activated per token, and it natively supports a context length of 262,144 tokens (256K).
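These numbers can be cross-checked against the published config without downloading the weights. A small sketch using transformers' AutoConfig follows; the exact field names assume the usual Qwen MoE config schema and are worth verifying against the repo's config.json.

```python
from transformers import AutoConfig

# Fetch only the config (a small JSON file), not the 235B weights.
config = AutoConfig.from_pretrained("Qwen/Qwen3-235B-A22B-Instruct-2507")

# Field names below are the typical Qwen MoE keys (an assumption).
print(config.num_hidden_layers)        # expected: 94
print(config.num_attention_heads)      # expected: 64 query heads
print(config.num_key_value_heads)      # expected: 4 (grouped-query attention)
print(config.num_experts)              # expected: 128
print(config.num_experts_per_tok)      # expected: 8 activated
print(config.max_position_embeddings)  # expected: 262144
```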
The new Qwen3 has been substantially optimized in general capabilities such as instruction following, logical reasoning, text comprehension, mathematics, science, programming, and tool use. It also makes notable progress in covering long-tail knowledge across many languages, aligns more closely with user preferences on subjective and open-ended tasks, generating more helpful, higher-quality text, and improves its comprehension of 256K-token long contexts.
On benchmarks, Qwen3-235B-A22B-Instruct-2507 performs strongly across the board. In knowledge tests it scored 83.0 on MMLU-Pro, 93.1 on MMLU-Redux, and 77.5 on GPQA. For reasoning, it scored 70.3 on AIME25 and 55.4 on HMMT25.
For coding, it scored 51.8 on LiveCodeBench v6 and 87.9 on MultiPL-E. For alignment, it scored 88.7 on IFEval and 79.2 on Arena-Hard v2. It also shows strong multilingual performance, scoring 77.5 on MultiIF and 79.4 on MMLU-ProX.
Additionally, Qwen3 excels at tool calling. It is recommended to use Qwen-Agent to get the most out of its agent capabilities: Qwen-Agent wraps the tool-calling templates and parsers internally, greatly reducing coding complexity. Available tools can be defined through MCP configuration files, Qwen-Agent's built-in tools, or custom tools you integrate yourself.
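A minimal sketch of that setup, following Qwen-Agent's documented Assistant pattern; the model_server URL and the MCP time server are placeholders, so substitute your own endpoint and tools.

```python
from qwen_agent.agents import Assistant

# LLM endpoint config. The model_server URL is a placeholder for wherever
# the model is served (e.g., a local vLLM OpenAI-compatible endpoint).
llm_cfg = {
    'model': 'Qwen3-235B-A22B-Instruct-2507',
    'model_server': 'http://localhost:8000/v1',
    'api_key': 'EMPTY',
}

# Tools: one MCP server config plus a built-in Qwen-Agent tool.
# The 'mcp-server-time' package is illustrative; use any MCP server you run.
tools = [
    {'mcpServers': {
        'time': {'command': 'uvx', 'args': ['mcp-server-time']},
    }},
    'code_interpreter',  # one of Qwen-Agent's integrated tools
]

bot = Assistant(llm=llm_cfg, function_list=tools)

messages = [{'role': 'user', 'content': 'What time is it in UTC right now?'}]
# bot.run streams intermediate states; the last yielded list is the final answer.
for responses in bot.run(messages=messages):
    pass
print(responses)
```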