Interpretation of the Qwen3 Technical Report

Original text: https://zhuanlan.zhihu.com/p/1905735426339218114

Technical Report: https://github.com/QwenLM/Qwen3/blob/main/Qwen3_Technical_Report.pdf

0 Abstract

Qwen3 includes a series of LLMs designed to enhance performance, efficiency, and multilingual capabilities.

It covers Dense and MoE architectures, with parameter sizes ranging from 0.6B to 235B.

A key innovation of Qwen3 is the integration of thinking mode (for complex multi-step reasoning) and non-thinking mode (for fast, context-driven responses) into a unified framework, with the ability to dynamically switch modes based on user query or chat templates. This eliminates the need to switch between chat-optimized models (like GPT-4o) and reasoning-specific models (like QwQ-32B).

At the same time, Qwen3 introduces a thinking budget mechanism that allows for adaptive allocation of computational resources during inference, balancing latency and performance.

Furthermore, by leveraging the knowledge of flagship models, the computational resources required to build smaller models are significantly reduced while ensuring performance.

Test results show that Qwen3 achieves SOTA results on multiple benchmarks such as code generation, mathematical reasoning, and Agent tasks, demonstrating competitiveness against larger MoE models and closed-source models.

Compared to the previous Qwen2.5, Qwen3 expands multilingual support from 29 languages to 119 languages and dialects.

1 Introduction

The pre-training process of Qwen3 utilizes a large-scale dataset containing approximately 36T tokens.

To effectively expand training data, a multimodal approach was adopted: fine-tuning Qwen2.5-VL to extract text from a large number of PDF documents.

Domain-specific models were also used to produce synthetic data: Qwen2.5-Math for mathematical content, and Qwen2.5-Coder for code-related data.

The pre-training process adopts a three-stage strategy:

Stage 1, training on about 30T tokens, builds a solid foundation of general knowledge.

Stage 2, further training on knowledge-intensive data, enhances reasoning capabilities in areas such as science, technology, engineering, mathematics, and code.

Stage 3, training on long-context data, increases the maximum context length from 4,096 to 32,768 tokens.

Post-training also adopts a multi-stage strategy, simultaneously enhancing both thinking and non-thinking modes:

The first two stages cultivate reasoning ability through long CoT cold start fine-tuning and RL on mathematical and code tasks.

The final two stages combine datasets with and without reasoning paths to form a unified dataset for further fine-tuning, enabling the model to effectively handle both types of input. Then, general domain RL is applied to improve performance on a large number of downstream tasks.

For smaller models, a strong-to-weak distillation method is used, leveraging off-policy and on-policy knowledge transfer from larger models to enhance the capabilities of smaller models. Distillation from better teacher models significantly outperforms RL in terms of performance and efficiency.

Pre-trained and post-trained models were evaluated on comprehensive benchmarks covering various tasks and domains. Results show that Qwen3 Base pre-trained models achieve SOTA performance. Post-trained models (both thinking and non-thinking modes) perform well in competition with currently leading closed-source models (like o1, o3-mini) and large MoE models (like DeepSeek-V3).

Qwen3 performs particularly well in programming, math, and Agent tasks. For example, Qwen3-235B-A22B scored 85.7 on AIME'24, 81.5 on AIME'25, 70.7 on LiveCodeBench v5, 2056 on CodeForces, and 70.8 on BFCL v3. Other models in the Qwen3 series also exhibit strong performance at similar scales.

Furthermore, it was observed that increasing the budget for thinking tokens leads to a continuous improvement in the model's performance on various tasks.

2 Architecture

The Qwen3 series includes 6 Dense models (0.6B, 1.7B, 4B, 8B, 14B, 32B) and 2 MoE models (Qwen3-30B-A3B and Qwen3-235B-A22B).


The Dense model architecture is similar to Qwen2.5, including the use of GQA, SwiGLU, RoPE, and RMSNorm with pre-normalization. The QKV bias used in Qwen2 was removed, and QK-Norm was introduced in the attention mechanism to ensure stable training.
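To make the QK-Norm change concrete, here is a minimal PyTorch sketch (not the actual Qwen3 implementation; the dimensions and head counts are made up, and GQA and RoPE are omitted for brevity): RMSNorm is applied to the per-head query and key vectors before the attention dot product, and the projections carry no bias.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square normalization over the last dimension."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class QKNormAttention(nn.Module):
    """Multi-head attention with QK-Norm: q and k are RMS-normalized per head
    before the dot product, which bounds attention logits and stabilizes training.
    No bias on the q/k/v projections (Qwen3 drops the QKV bias used in Qwen2)."""
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)
        self.q_norm = RMSNorm(self.head_dim)
        self.k_norm = RMSNorm(self.head_dim)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim)
        k = self.k_proj(x).view(b, t, self.n_heads, self.head_dim)
        v = self.v_proj(x).view(b, t, self.n_heads, self.head_dim)
        q, k = self.q_norm(q), self.k_norm(k)          # <-- QK-Norm
        q, k, v = (z.transpose(1, 2) for z in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(b, t, -1)
        return self.o_proj(out)

x = torch.randn(2, 16, 512)
print(QKNormAttention()(x).shape)  # torch.Size([2, 16, 512])
```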

MoE models share the same basic architecture as the Dense models. Consistent with Qwen2.5-MoE, fine-grained expert segmentation is used: Qwen3 MoE models have 128 experts in total, with 8 experts activated per token. Unlike Qwen2.5-MoE, shared experts were removed, and a global-batch load-balancing loss is used. These architectural and training refinements significantly improve performance on downstream tasks.
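The following is a rough sketch, under simplifying assumptions, of what fine-grained routing with a global-batch load-balancing auxiliary loss can look like: 128 experts, 8 activated per token, no shared expert, and the balancing statistics computed over all tokens in the batch rather than per sequence. It is illustrative only, not Qwen3's code.

```python
import torch
import torch.nn.functional as F

def route_with_global_balance_loss(hidden, router_weight, top_k=8):
    """Top-k expert routing with a global-batch load-balancing auxiliary loss.
    hidden: (num_tokens, d_model) -- all tokens in the batch, flattened.
    router_weight: (num_experts, d_model). Simplified sketch, not Qwen3's code."""
    logits = hidden @ router_weight.T                     # (tokens, experts)
    probs = logits.softmax(dim=-1)
    topk_probs, topk_idx = probs.topk(top_k, dim=-1)      # 8 experts per token

    num_experts = router_weight.shape[0]
    # Fraction of tokens dispatched to each expert, measured across the whole
    # batch (global-batch balancing) rather than within each sequence.
    dispatch = F.one_hot(topk_idx, num_experts).float().sum(dim=1)  # (tokens, experts)
    frac_tokens = dispatch.mean(dim=0)                    # f_i
    mean_probs = probs.mean(dim=0)                        # P_i
    balance_loss = num_experts * (frac_tokens * mean_probs).sum()

    return topk_probs, topk_idx, balance_loss

hidden = torch.randn(4096, 512)            # e.g. all tokens in a global batch
router_w = torch.randn(128, 512)           # 128 experts, as in Qwen3 MoE
_, idx, aux = route_with_global_balance_loss(hidden, router_w)
print(idx.shape, aux.item())               # torch.Size([4096, 8]) ...
```

In a full implementation, only the selected experts' FFNs run for each token and their outputs are combined using the routing weights, while the auxiliary loss above is typically added to the language-modeling loss with a small coefficient.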

Qwen3 models use Qwen's tokenizer (byte-level BPE) with a vocabulary size of 151,669.
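As a quick usage check (assuming the Hugging Face checkpoint name "Qwen/Qwen3-0.6B"; any Qwen3 checkpoint should ship the same tokenizer):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
print(tok.vocab_size, len(tok))   # base vocabulary vs. vocabulary incl. special tokens
print(tok.tokenize("Qwen3 uses a byte-level BPE tokenizer: 你好, мир, مرحبا"))  # any script works
```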

3 Pre-training

3.1 Pre-training Data

Compared to Qwen2.5, the scale and diversity of the training data were significantly expanded: roughly twice as many pre-training tokens were collected, covering more than three times as many languages.

All Qwen3 models are trained on data containing 119 languages and dialects, totaling 36T tokens.

The data includes high-quality content, covering multiple domains such as code, STEM (Science, Technology, Engineering, Mathematics), reasoning tasks, books, multilingual text, and synthetic data.

To further expand the pre-training corpus, Qwen2.5-VL was first used to recognize text from a large number of PDF documents. Qwen2.5 was then used to refine the recognized text and improve its quality. This yielded trillions of additional high-quality tokens.

Additionally, Qwen2.5, Qwen2.5-Math, and Qwen2.5-Coder were used to synthesize trillions of tokens in different formats, including textbooks, Q&A, instructions, code snippets, and content from dozens of other domains.

Finally, additional multilingual data was added to further expand the corpus.

A multilingual data annotation system was developed and applied to the large-scale pre-training dataset, annotating over 30T tokens across multiple dimensions such as educational value, fields, domains, and safety. These detailed annotations support more effective data filtering and combination.

Unlike previous work that optimizes the data mixture at the data-source or domain level, extensive ablation experiments on small models with these fine-grained labels were conducted to optimize the data mixture at the instance level.

3.2 Pre-training Stage

Qwen3 underwent 3 stages of pre-training:

General Stage (S1): sequence length of 4,096, trained on over 30T tokens. In this stage, the model was comprehensively pre-trained on data covering 119 languages and dialects to build language proficiency and general world knowledge.

Reasoning Stage (S2): The proportion of STEM, code, reasoning, and synthetic data was increased to optimize the pre-training corpus. The model was trained on approximately 5T high-quality tokens at a sequence length of 4,096, with an accelerated learning rate decay in this stage.

Long Context Stage (S3): A high-quality long-context corpus was collected, and all models were trained on hundreds of billions of tokens at a sequence length of 32,768; 75% of the data was 16,384 to 32,768 tokens long, and 25% was 4,096 to 16,384 tokens. The ABF technique was used to increase the RoPE base frequency from 10,000 to 1,000,000, and YaRN and Dual Chunk Attention (DCA) were introduced to achieve a 4x increase in sequence length capacity during inference.
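A minimal sketch of what the ABF change does (illustrative only; YaRN and DCA are omitted): raising the RoPE base lowers every rotation frequency, so the slowest-rotating dimensions take far longer to wrap and distant positions remain distinguishable at long context lengths.

```python
import math
import torch

def rope_inv_freq(head_dim: int, base: float) -> torch.Tensor:
    """Inverse rotation frequencies used by rotary position embeddings (RoPE)."""
    return 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))

# ABF: raising the base from 10,000 to 1,000,000 lowers every rotation frequency,
# so positional phases wrap far more slowly over long sequences.
for base in (10_000.0, 1_000_000.0):
    inv_freq = rope_inv_freq(128, base)
    slowest_wavelength = 2 * math.pi / inv_freq[-1].item()   # in tokens
    print(f"base={base:>11,.0f}  slowest wavelength ~ {slowest_wavelength:,.0f} tokens")
```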

Based on the above three pre-training stages, scaling laws were explored to predict optimal hyperparameters (such as the learning rate scheduler and batch size). The relationship between model architecture, training data, training stage, and optimal hyperparameters was systematically studied through extensive experiments, and the predicted optimal learning rate and batch size strategies were then set for each Dense and MoE model.

3.3 Pre-training Evaluation

15 benchmarks:

General Tasks: MMLU (5-shot), MMLU-Pro (5-shot, CoT), MMLU-redux (5-shot), BBH (3-shot, CoT), SuperGPQA (5-shot, CoT)

Math & STEM Tasks: GPQA (5-shot, CoT), GSM8K (4-shot, CoT), MATH (4-shot, CoT)

Coding Tasks: EvalPlus (0-shot) (average of HumanEval, MBPP, HumanEval+, MBPP+), MultiPL-E (0-shot) (Python, C++, Java, PHP, TypeScript, C#, Bash, JavaScript), MBPP (3-shot), CRUX-O from CRUXEval (1-shot)

Multilingual Tasks: MGSM (8-shot, CoT), MMMLU (5-shot), INCLUDE (5-shot)

Qwen3 series Base models were compared with Qwen2.5, DeepSeek-V3, Gemma-3, Llama-3, and Llama-4. All models used the same evaluation process and widely used evaluation settings to ensure fair comparison.

Pre-training Evaluation Summary

(1) Compared to previous strong open-source baselines (such as DeepSeek-V3 Base, Llama-4-Maverick Base, and Qwen2.5-72B-Base), Qwen3-235B-A22B-Base performs better on most tasks while using significantly fewer total or activated parameters.

(2) For Qwen3 MoE Base models, experimental results show that:

With the same pre-training data, MoE models can achieve performance similar to Qwen3 Dense models using only 1/5 of the activated parameters.

Qwen3 MoE Base models can outperform Qwen2.5 MoE Base models with less than 1/2 of the activated parameters and fewer total parameters.

Even with only 1/10 of the activated parameters of the Qwen2.5 Dense model, Qwen3 MoE Base models can achieve comparable performance.

(3) The overall performance of Qwen3 Dense Base models is comparable to Qwen2.5 Base models with more parameters.


4 Post-training


The post-training pipeline aims to achieve two core objectives:

Thinking Control: Integrate thinking and non-thinking modes, allowing users to flexibly choose whether the model performs reasoning and control the depth of thinking by specifying a token budget for thinking.

Strong-to-Weak Distillation: Aims to simplify and optimize the post-training process for smaller models.

Directly distilling the teacher model's output logits into the smaller models effectively improves their performance while maintaining fine-grained control over the reasoning process, eliminating the need to run the full four-stage post-training pipeline separately for each smaller model. This yields better Pass@1 scores and also improves exploration (reflected in better Pass@64 performance), while requiring only about 1/10 of the GPU hours of the four-stage approach.

4.1 Long-CoT Cold Start

First, a comprehensive dataset covering a wide range of categories, including math, code, logical reasoning, and general STEM problems, is constructed. Each problem in the dataset is paired with a verified reference answer or code-based test cases. This dataset is used for the long-CoT cold start.

Dataset construction involves two filtering processes: query filtering and response filtering.

Query filtering: Qwen2.5-72B-Instruct is used to identify and remove queries that are difficult to verify, including queries with multiple sub-problems or general text generation queries. In addition, queries that Qwen2.5-72B-Instruct can answer correctly without using CoT reasoning are excluded. Furthermore, Qwen2.5-72B-Instruct is used to label the domain of each query to balance the dataset.

Response filtering: For the queries retained after query filtering, QwQ-32B is used to generate N candidate responses each. When QwQ-32B consistently fails to generate a correct answer, human annotators assess the accuracy of the responses. For queries with a positive Pass@N, stricter filtering is applied to remove responses that: (1) produce an incorrect final answer; (2) contain significant repetition; (3) clearly indicate guessing without sufficient reasoning; (4) have inconsistent thinking and summary content; (5) involve inappropriate language mixing or style shifts; (6) are suspected of being too similar to potential validation-set items.

Afterward, a subset is carefully selected from the refined dataset for the initial cold-start training of the reasoning mode. The goal is to instill basic reasoning patterns without limiting the model's potential, leaving room for greater flexibility and improvement in the subsequent RL stage. The amount of data and the number of training steps in this stage are kept to a minimum.

4.2 Reasoning RL

The query-verifier pairs used in the Reasoning RL stage must satisfy the following four criteria:

Not used in the cold start stage

Learnable by the cold start model

As challenging as possible

Cover a wide range of subdomains

Ultimately, 3995 query-verifier pairs were collected, and GRPO was used to update model parameters.
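For reference, a minimal sketch of GRPO's group-relative advantage computation (simplified; the clipped policy-gradient objective, any KL penalty, and batching details are omitted, and binary verifier rewards are assumed):

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO normalizes rewards within the group of rollouts sampled for the
    same query: advantage = (r - mean(group)) / std(group). No value network.
    rewards: (num_queries, group_size), e.g. verifier scores per rollout."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# e.g. 2 queries, 8 rollouts each, scored 1.0 if the verifier accepts the answer
rewards = torch.tensor([[1., 0., 0., 1., 1., 0., 0., 0.],
                        [0., 0., 0., 0., 1., 0., 0., 0.]])
print(grpo_advantages(rewards))
```

The normalized advantage then weights a clipped, PPO-style policy-gradient objective over the sampled tokens, so no separate value model is needed.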

It was observed that using a large batch size, a large number of rollouts per query, and off-policy training improves sample efficiency during training.

Exploration and exploitation are balanced by controlling the model's entropy so that it increases steadily or remains stable, which is crucial for keeping training stable.

As a result, consistent improvements in training reward and validation performance were achieved in a single RL run without any manual hyperparameter intervention. For example, Qwen3-235B-A22B's AIME'24 score increased from 70.1 to 85.1 over a total of 170 RL training steps.

4.3 Thinking Mode Fusion

The goal of Thinking Mode Fusion is to integrate non-thinking capabilities into the previously developed thinking model, allowing developers to manage and control reasoning behavior.

The Reasoning RL model is further fine-tuned with SFT, and a chat template is designed to fuse the two modes. It was found that models capable of handling both modes skillfully perform well under different thinking budgets.

Construction of SFT Data

The SFT dataset combines thinking and non-thinking data.

To ensure that the performance of the Stage 2 model is not degraded by the additional SFT, the thinking data is generated by rejection sampling on Stage 1 queries using the Stage 2 model itself.

Non-thinking data is carefully designed to cover diverse tasks, including code, math, instruction following, multilingual tasks, creative writing, Q&A, role-playing, etc. Automated checklists are used to evaluate the quality of non-thinking data. The proportion of translation tasks is particularly increased to improve performance on low-resource language tasks.

Chat Template Design

To better integrate the two modes and enable dynamic switching, a chat template was designed for Qwen3.


Introducing /think and /no_think tags in the user query or system message allows the model to select the appropriate thinking mode based on user input.

For non-thinking samples, an empty thinking block is kept in the response to ensure internal format consistency.

Thinking mode is the default, so some thinking-mode training samples whose user queries do not contain the /think tag were also added.

For more complex multi-turn conversations, multiple /think and /no_think tags are randomly inserted across the user queries, and the model's response follows the last tag encountered.
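As a hedged illustration of both switching mechanisms (assuming the Hugging Face checkpoint name "Qwen/Qwen3-0.6B" and the enable_thinking argument exposed by its chat template, as documented on the Qwen3 model cards; this usage is not taken from the technical report itself):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")

# Hard switch via the template argument: disables the thinking block entirely.
messages = [{"role": "user", "content": "Explain RoPE in one sentence."}]
prompt = tok.apply_chat_template(messages, tokenize=False,
                                 add_generation_prompt=True,
                                 enable_thinking=False)
print(prompt)

# Soft switch via in-query tags: the model follows the last /think or /no_think
# tag it has seen in the conversation.
messages = [{"role": "user", "content": "Prove that sqrt(2) is irrational. /think"}]
print(tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
```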

Thinking Budget

An additional advantage of Thinking Mode Fusion is that once the model learns to respond in both non-thinking and thinking modes, it naturally develops the ability to handle intermediate situations, namely generating responses based on incomplete thinking. This provides a foundation for controlling the model's thinking budget.

When the model's thinking length reaches a user-defined threshold, the thinking process is stopped and the stop-thinking instruction is inserted: "Considering the limited time by the user, I have to give the solution based on the thinking directly now.</think>." The model then generates its final response based on the reasoning accumulated up to that point. This capability was not explicitly trained for, but emerges naturally from applying thinking mode fusion.
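A minimal sketch of how a serving layer might enforce such a budget (illustrative only; generate_fn is a placeholder for any text-generation backend with the assumed signature, and the token accounting is simplified):

```python
STOP_THINKING = ("Considering the limited time by the user, I have to give the "
                 "solution based on the thinking directly now.\n</think>.\n\n")

def generate_with_thinking_budget(generate_fn, prompt: str, budget_tokens: int) -> str:
    """Sketch of budget enforcement at serving time (not Qwen's implementation).
    generate_fn(text, max_new_tokens, stop) -> newly generated text; any backend
    with these semantics can be wrapped to fit this signature."""
    # Let the model think, but cap the thinking phase at the user-defined budget.
    thinking = generate_fn(prompt, max_new_tokens=budget_tokens, stop=["</think>"])
    if "</think>" not in thinking:
        # Budget exhausted: force the model out of the thinking phase with the
        # stop-thinking instruction quoted above, then let it answer.
        thinking += STOP_THINKING
    answer = generate_fn(prompt + thinking, max_new_tokens=1024, stop=None)
    return thinking + answer
```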

4.4 General RL

The General RL stage aims to broadly improve the model's capabilities and stability in various scenarios.

A complex reward system was built, covering over 20 different tasks, each with customized scoring criteria. These tasks target the enhancement of the following core capabilities:

Instruction Following: Ensure the model accurately interprets and follows user instructions, including requirements related to content, format, length, and the use of structured output, to provide responses that meet user expectations.

Format Following: Expect the model to adhere to specific format conventions, for example switching between thinking and non-thinking modes based on the /think and /no_think tags, and consistently using the designated tags to separate the thinking and response parts of the final output.

Preference Alignment: Focus on improving the model's usefulness, engagement, and style, ultimately providing a more natural and satisfying user experience.

Agent Capability: Involves training the model to correctly call tools through specified interfaces. During RL rollout, the model is allowed to execute a complete multi-turn interaction cycle and receive feedback from real environment execution, thereby improving its performance and stability in long-term decision-making tasks.

Scenario-Specific Capability: Design tasks for specific contexts in more specialized scenarios. For example, in RAG tasks, reward signals are combined to guide the model to generate accurate and contextually relevant responses, thereby minimizing the risk of generating hallucinations.

To provide feedback for the above tasks, three different types of rewards are used (a small illustrative sketch follows the list):

(1) Rule-based Reward: Well-designed rule-based rewards can evaluate the correctness of model output with high accuracy, preventing issues like reward hacking.

(2) Model-based Reward with Reference Answer: A reference answer is provided for each query, and Qwen2.5-72B-Instruct scores the model's response against it. This handles diverse tasks more flexibly without requiring strict answer formats, avoiding the false negatives of purely rule-based rewards.

(3) Model-based Reward without Reference Answer: Utilize human preference data to train a Reward Model to provide a scalar score for each response.
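As mentioned above, here is a minimal sketch of the first two reward types: a rule-based check for verifiable answers and an LLM-judge reward that compares against a reference answer. The judge prompt, the 0-10 scale, and the judge_fn callable are assumptions for illustration; in the report the judge model is Qwen2.5-72B-Instruct.

```python
import re

def rule_based_reward(response: str, reference: str) -> float:
    """Rule-based check for verifiable tasks: extract a boxed final answer and
    compare it to the reference string (simplified; real rules are task-specific)."""
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    return 1.0 if match and match.group(1).strip() == reference.strip() else 0.0

JUDGE_PROMPT = """You are grading a model response against a reference answer.
Question: {question}
Reference answer: {reference}
Model response: {response}
Reply with a single score from 0 to 10."""

def model_based_reward(judge_fn, question: str, response: str, reference: str) -> float:
    """Model-based reward with a reference answer: an LLM judge (e.g. a
    Qwen2.5-72B-Instruct endpoint wrapped as judge_fn) scores the response.
    judge_fn(prompt: str) -> str is assumed to return the judge's text."""
    raw = judge_fn(JUDGE_PROMPT.format(question=question, response=response,
                                       reference=reference))
    digits = re.search(r"\d+(\.\d+)?", raw)
    return min(float(digits.group()), 10.0) / 10.0 if digits else 0.0
```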

4.5 Strong-to-Weak Distillation

Strong-to-weak distillation is used to optimize the smaller models, covering 5 Dense models (0.6B, 1.7B, 4B, 8B, 14B) and 1 MoE model (Qwen3-30B-A3B). It consists of two main stages:

(1) Off-policy Distillation: In this initial stage, the outputs of the teacher model in both /think and /no_think modes are combined for response distillation.

(2) On-policy Distillation: The student model generates on-policy data for fine-tuning. Specifically, the student model is sampled in /think or /no_think mode and fine-tuned by aligning its logits with the logits of the teacher model (Qwen3-32B or Qwen3-235B-A22B), minimizing KL divergence.
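A minimal sketch of the on-policy distillation step (the KL direction, temperature, and tensor shapes are illustrative assumptions, not taken from the report): the student samples a response, both models score the same tokens, and the student is updated to minimize the KL divergence between the two token distributions.

```python
import torch
import torch.nn.functional as F

def on_policy_distill_loss(student_logits, teacher_logits, temperature=1.0):
    """Token-level distillation loss on a student-sampled sequence.
    Both logit tensors: (seq_len, vocab). The KL direction and temperature are
    illustrative assumptions."""
    s_logp = F.log_softmax(student_logits / temperature, dim=-1)
    t_logp = F.log_softmax(teacher_logits / temperature, dim=-1)
    # KL(student || teacher), averaged over positions of the sampled response.
    kl = (s_logp.exp() * (s_logp - t_logp)).sum(dim=-1)
    return kl.mean()

# e.g. the student rolls out a response in /think mode, then both models score it
seq_len, vocab = 32, 151_669
student = torch.randn(seq_len, vocab, requires_grad=True)
teacher = torch.randn(seq_len, vocab)
loss = on_policy_distill_loss(student, teacher)
loss.backward()
print(loss.item())
```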

4.6 Post-training Evaluation

Detailed results tables are provided in the original paper.


4.7 Discussion

The Effectiveness of Thinking Budget

To verify whether Qwen3 can enhance its intelligence level by utilizing an increased thinking budget, the allocated thinking budget was adjusted on four benchmarks in the math, code, and STEM domains. As the budget continuously increased, the thinking model showed scalable and smooth performance improvement.


The Effectiveness and Efficiency of On-Policy Distillation

The detailed comparison is in the original report; as noted above, on-policy distillation achieves better Pass@1 and Pass@64 scores than the four-stage pipeline while requiring only about 1/10 of the GPU hours.

The Effects of Thinking Mode Fusion and General RL

To evaluate the effectiveness of Thinking Mode Fusion (Stage 3) and General RL (Stage 4), several internal benchmarks were also used, such as:

CounterFactQA: Contains counterfactual questions, requiring the model to identify the counterfactuality of the question and avoid generating hallucinatory answers.

LengthCtrl: Includes creative writing tasks with length requirements, where the final score is based on the difference between the generated content length and the target length.

ThinkFollow: Involves multi-turn conversations with randomly inserted /think and /no_think tags, testing the model's ability to switch modes correctly.

ToolUse: Evaluates the stability of single-turn, multi-turn, and multi-step tool calling processes. Scores include accuracy of tool call intent identification, format accuracy, and parameter accuracy.

