ZTE Research: LLM Adaptive Question Difficulty Grading Distillation Gives Small Models 'Long Chain Thinking'

The authors of this article are all from the "Large Model Deep Diving" team at ZTE Wireless Research Institute. The team's key areas of focus include "Inference Model Construction: Distillation and Reinforcement Learning Methods," "Wireless Communication Fault Location and Root Cause Analysis Inference Models," "Multimodal Inference Models," and "Inference Acceleration Technology." Core members graduated from prestigious universities and research institutes such as the University of Science and Technology of China and the Institute of Software, Chinese Academy of Sciences.

In recent years, Chain of Thought (CoT) has become a prominent technique in large-model inference, but endowing small models with long-chain reasoning capabilities is far from easy.

The "Large Model Deep Diving Team" at ZTE Wireless Research Institute approached this from the perspective of "Data Static Experience Flow," pioneering the "LLM Adaptive Question Difficulty Grading Distillation" method, which simultaneously maximizes the efficiency and effectiveness of generating high-quality CoT data.


Paper Title: Rethinking the Generation of High-Quality CoT Data from the Perspective of LLM-Adaptive Question Difficulty Grading

Paper Link: https://arxiv.org/pdf/2504.11919

Open-source links:

Code Data: https://huggingface.co/datasets/ZTE-AIM/32B_LLM_AdaptiveCode_data

Math Data: https://huggingface.co/datasets/ZTE-AIM/32B_LLM_AdaptiveMath_data

Code Model: https://huggingface.co/ZTE-AIM/LLM-Adaptive-ZCode-model-32B

Math Model: https://huggingface.co/ZTE-AIM/LLM-Adaptive-ZMath-model-32B

Research Motivation: Small Models Also Want "Long Chain Thinking"

Large Models Have Clear Advantages, But Deployment is Difficult

With the release of the DeepSeek-R1 model (671B parameters), long Chain of Thought (CoT) inference has spread rapidly through foundational large models and industrial applications. DeepSeek-R1's inference capabilities are powerful, but a model with over 600B parameters is difficult to deploy on edge devices and in real-time systems.

Small Models Urgently Need a "Boost"

This has prompted sustained industry research into small models, particularly those below 7 billion parameters, for long-chain reasoning scenarios such as complex mathematical problem-solving and code generation. Notably, the inference traces of DeepSeek-R1 can be used to build high-quality Chain of Thought (CoT) data, significantly enhancing the inference capabilities of small models. However, models ranging from a few billion to tens of billions of parameters still face significant bottlenecks on multi-step reasoning tasks (such as complex mathematical and programming problems) and struggle to fully meet the demands of such applications.

The Dilemma of Existing CoT Data

Research on generating CoT data based on DeepSeek-R1 generally falls into two technical routes:

1. Massive Data Driven (Labs 2025; Team 2025c): Improves inference capabilities by stacking ultra-large-scale CoT corpora, but incurs high computational and labeling costs with low efficiency.

2. High-Quality Data Driven (Ye et al. 2025; Muennighoff et al. 2025): Relies on a small number of high-quality samples to activate the model's potential, but the limited scale makes sustained performance gains difficult.

Although existing work (Wen et al. 2025a) has introduced curriculum learning and rejection sampling to optimize the training process, these methods generally overlook the dynamic matching between "model capability and data difficulty."

This directly leads to two core questions:

1. How should a high-quality CoT corpus be defined?

2. How can a transferable "static experience flow" be extracted from existing data?

New Method: Model-Adaptive Difficulty Grading Distillation

Recently, Richard Sutton, the father of reinforcement learning, proposed that "experience" is the next-generation super data source, framing the essence of large-model reinforcement learning as mining a dynamic experience flow from data. Building on this idea, our team started from the perspective of building a static experience flow from data and proposed a CoT-corpus distillation method based on model-adaptive question difficulty grading, significantly improving the quality of long CoT corpora.

This method proposes a complete CoT construction process centered around "model-data dynamic matching," featuring four major innovations:

1. Based on the model's inherent inference capability, establish a question difficulty grading system, forming reusable "static experience."

2. According to the difficulty labels, construct an adaptive question bank covering the full difficulty gradient.

3. Design a difficulty distribution sampling strategy that conforms to the idea of curriculum learning, ensuring that training data is aligned with model capability in real time.

4. Leverage DeepSeek-R1 to batch generate high-quality CoT corpus in two major scenarios: mathematical reasoning and code generation.

Under the same computational budget, this adaptive scheme consistently improves the inference performance of models at different scales. On the AIME24 mathematical competition dataset, for example, accuracy across parameter scales improved by 6.66%–26.7% over the traditional "non-adaptive" strategy (see Figure 1).

Figure 1: Comparison of CoT Data Construction Effects Based on LLM Adaptive Question Difficulty Grading

For LLMs of different parameter scales, inference models trained on CoT data built with the adaptive question difficulty grading method (left) consistently outperform those trained with non-adaptive methods (right) on the AIME24 mathematical competition dataset. This indicates that the adaptive method produces higher-quality CoT data and finds a static data experience flow adapted to the model.

This method effectively mines the static experience flow in CoT data, and this static experience flow is closely related to the model itself.

Method Framework at a Glance

Figure 2: CoT Data Generation Framework Based on LLM Adaptive Question Difficulty Grading

The framework includes three core components: Distribution Construction, LLM Adaptive Question Difficulty Grading and Distribution Sampling, and LLM Adaptive Chain of Thought (CoT) Generation.

1. Distribution Construction

Construct two difficulty distribution strategies as the basis for subsequent sampling:

Option 1: Distribution Based on the Model's Actual Performance (Pₑᵥₐₗ)

Dynamically generate the difficulty distribution from the performance of the base LLM (Sₗₗₘ) on the evaluation dataset (DBₑᵥₐₗ):

Correctly answered questions: labeled "Easy".

Incorrectly answered questions: further graded by PRM-Grader (a Process Reward Model), which scores the quality of the model's inference trajectory in [0, 1] (a lower score means a harder question) and maps the score to 5 difficulty levels. A rough sketch of this mapping follows.
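The article does not spell out the exact score-to-level mapping, so the sketch below assumes five equal-width score buckets; treat it as an illustration rather than the paper's implementation:

```python
def prm_score_to_level(prm_score: float) -> int:
    """Map a PRM quality score in [0, 1] to one of 5 difficulty levels.

    Assumption: five equal-width buckets; a lower score means a harder
    question, so Level 1 (hardest) covers the lowest scores.
    """
    assert 0.0 <= prm_score <= 1.0
    # [0.0, 0.2) -> Level 1, [0.2, 0.4) -> Level 2, ..., [0.8, 1.0] -> Level 5
    return min(int(prm_score * 5) + 1, 5)
```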

Option 2: Prior Distribution Based on Curriculum Learning (P₆)

Manually define five difficulty levels, following the principle of "more easy problems, fewer difficult problems," with weights decreasing as difficulty increases:

For example, difficulty level 1 receives the most samples and level 5 the fewest. A minimal sketch follows.
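The article gives only the decreasing-weight principle, not the actual weights; this sketch assumes a simple linearly decreasing scheme:

```python
LEVELS = [1, 2, 3, 4, 5]
RAW_WEIGHTS = [5, 4, 3, 2, 1]  # assumed: level 1 weighted most, level 5 least
PRIOR = {lvl: w / sum(RAW_WEIGHTS) for lvl, w in zip(LEVELS, RAW_WEIGHTS)}

def samples_per_level(budget: int) -> dict:
    """Split a total sample budget across difficulty levels by the prior."""
    return {lvl: round(budget * p) for lvl, p in PRIOR.items()}

print(samples_per_level(2000))
# {1: 667, 2: 533, 3: 400, 4: 267, 5: 133}
```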

2. LLM Adaptive Question Difficulty Grading and Distribution Sampling

Step 1: Construct Adaptive Question Bank (DBₐdₐₚₜᵢᵥₑ)

Collect original questions (DBᵣₐw) from open-source datasets, then use Sₗₗₘ to generate answers and record the inference trajectories.

Verify answer correctness (both checks are sketched below):

Mathematical reasoning tasks: directly compare the model's answer with the standard answer.

Code generation tasks: verify correctness by executing the code against test cases.
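A hedged sketch of both checks; the answer normalization and the test harness are assumptions for illustration, not the paper's actual verifier:

```python
import os
import subprocess
import tempfile

def verify_math(model_answer: str, gold_answer: str) -> bool:
    """Math tasks: direct comparison after light normalization (assumed)."""
    normalize = lambda s: s.strip().replace(" ", "").lower()
    return normalize(model_answer) == normalize(gold_answer)

def verify_code(solution_src: str, test_src: str, timeout: float = 10.0) -> bool:
    """Code tasks: run the candidate solution against its test cases.

    Hypothetical harness: writes solution + tests to a temp file and
    treats a zero exit status as 'all tests passed'.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution_src + "\n\n" + test_src)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)
```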

Difficulty Grading (see the sketch after this list):

Correct questions are labeled "Easy" and added to the question bank.

Incorrect questions are subdivided by PRM-Grader into 5 difficulty levels (Levels 1-5, with Level 1 the hardest) and added to the question bank.
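Putting Step 1 together (reusing prm_score_to_level from the earlier sketch); solve, verify, and prm_score are hypothetical callables standing in for Sₗₗₘ, the answer checks, and PRM-Grader:

```python
from dataclasses import dataclass

@dataclass
class Question:
    text: str
    gold: str                  # standard answer or test cases
    difficulty: object = None  # "Easy" or an int level in 1..5

def build_adaptive_bank(db_raw, solve, verify, prm_score):
    """Label every raw question in DB_raw with a model-adaptive difficulty."""
    bank = []
    for q in db_raw:
        answer, trajectory = solve(q.text)   # answer + inference trace
        if verify(answer, q.gold):           # answer match or test cases
            q.difficulty = "Easy"
        else:
            # PRM-Grader scores the failed trajectory; map to Level 1-5
            q.difficulty = prm_score_to_level(prm_score(q.text, trajectory))
        bank.append(q)
    return bank
```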

Step 2: Distribution Sampling (DBₛₐₘₚₗₑ)

According to the constructed distribution (Pₑᵥₐₗ or P₆), sample questions from the adaptive question bank in proportion to difficulty, as in the sketch below.
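A minimal sampling sketch, assuming the distribution maps each difficulty label ("Easy" or 1-5) to a proportion and capping each level at the questions actually available:

```python
import random
from collections import defaultdict

def sample_by_distribution(bank, distribution, budget, seed=0):
    """Draw DB_sample so its difficulty mix follows the target distribution."""
    rng = random.Random(seed)
    by_level = defaultdict(list)
    for q in bank:
        by_level[q.difficulty].append(q)
    sample = []
    for level, proportion in distribution.items():
        k = min(round(budget * proportion), len(by_level[level]))
        sample.extend(rng.sample(by_level[level], k))
    return sample
```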

3. LLM Adaptive CoT Generation

Generation Stage: Input the sampled questions (DBₛₐₘₚₗₑ) into the teacher model (Tₗₗₘ, i.e., DeepSeek-R1) to generate detailed inference chains (CoT).

Verification Stage: Strictly filter for correct CoT data with the Result-Verifier (using the same verification methods as in Step 2), finally forming the high-quality dataset COTₐdₐₚₜᵢᵥₑ. The generation-and-verification loop is sketched after this list.

Model Training: Perform supervised fine-tuning (SFT) on the base model (Sₗₗₘ) with COTₐdₐₚₜᵢᵥₑ to obtain the optimized inference model (Rₗₗₘ).
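A sketch of generation plus verification; teacher_generate is an assumed wrapper around Tₗₗₘ returning (cot, answer), verify is the Step 2 checker, and the retry count is an assumption:

```python
def generate_adaptive_cot(db_sample, teacher_generate, verify, max_attempts=2):
    """Build COT_adaptive: keep only teacher chains whose answers verify."""
    cot_adaptive = []
    for q in db_sample:
        for _ in range(max_attempts):
            cot, answer = teacher_generate(q.text)
            if verify(answer, q.gold):  # Result-Verifier: same checks as Step 2
                cot_adaptive.append(
                    {"question": q.text, "cot": cot, "answer": answer}
                )
                break
    return cot_adaptive  # the SFT corpus for fine-tuning S_llm into R_llm
```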

Key Innovation Points of the Method:

Model-adaptive difficulty matching: Adjust the question difficulty distribution based on the model's actual capabilities, avoiding subjective "one-size-fits-all" grading, and building a static data experience flow that is truly closely tied to the model;

Lightweight process: No complex curriculum learning or rejection sampling required, data quality can be improved simply through grading and sampling;

Multi-task compatibility: Supports mathematical reasoning and code generation tasks, with flexible verification methods (answer comparison / test cases).

Experimental Results: Constant Surprises

To examine the quality of the proposed CoT data, we conducted detailed evaluations on models of different sizes and families, covering both mathematical reasoning and code generation tasks.

The key experimental results are as follows:

Mathematical Reasoning (MATH500, AIME24/25, GPQA)

On benchmarks including MATH500, AIME24/25, and GPQA, ZMath series models trained on just 2k adaptive CoT samples significantly outperformed baseline models.

ZMath-32B achieved 94.6% accuracy on MATH500, surpassing DeepSeek-Distill-32B (89.8%) and Sky-32B-Preview (90%); on AIME24 it improved to 73.33% (baseline: 66.67%).

ZMath-14B achieved 50% accuracy on AIME24, significantly higher than phi4-14B (30%), and reached 63.13% on GPQA (phi4-14B was 54.55%).

Figure 3: Mathematical Reasoning Experiment Results

Code Generation (LiveCodeBench)

ZCode-32B achieved 96.06%, 75.53%, and 31.85% on Easy, Medium, and Hard difficulty levels respectively, comprehensively outperforming DeepSeek-Distill-32B (92.11%, 74.92%, 30%).

ZCode-14B scored 89.96% on the Easy level, far ahead of phi4-14B (72.4%), showing that small-parameter models can also achieve competitive performance through adaptive data training.

Figure 4: Code Generation Experiment Results

Ablation Study & Distribution Transfer

When the 32B model's difficulty distribution was applied directly to the 7B model, the latter's accuracy on MATH500 was only 92%, below the 93.2% it achieved when trained with its own difficulty distribution. This shows that the difficulty distribution must be dynamically matched to the target model's capability, and that adaptive distribution is the key to the performance gains. It also indicates that the truly valuable experience in the static experience flow corresponds to a specific model and should not be transferred across models in a "one-size-fits-all" manner.

Figure 5: Ablation Study and Distribution Transfer Results

Summary and Outlook

The paper proposes a high-quality CoT data generation framework based on LLM adaptive difficulty grading, and verifies its efficiency, effectiveness, and generalization capabilities through systematic experiments. The core conclusions are as follows:

Efficient Data Generation

First dynamically evaluate the model's current inference capability, then construct a matching adaptive question bank. Only about 2k high-quality CoT samples are needed for significant performance gains, greatly reducing data and computing costs.

Cross-task and Parameter Generalization

Achieved leading performance in both mathematical reasoning (the AIME series) and code generation (LiveCodeBench), and delivered stable gains to models of different scales from 7B to 32B.

Methodological Contribution

Constructed a systematic CoT data generation and evaluation pipeline, providing a new path for improving the chain-reasoning ability of small-parameter LLMs in resource-constrained environments, as well as a reusable paradigm for mining the "static experience flow".

Future work: Further combine reinforcement learning to explore deeper reasoning capabilities, and extend to more complex cross-domain tasks such as communication fault diagnosis.
