Source | SyncedReview
The authors of this article are all from the "Large Model Diving" team at ZTE Wireless Institute. The team focuses on key areas including "Reasoning Model Construction: Distillation and Reinforcement Learning Methods", "Wireless Communication Fault Location and Root Cause Analysis Reasoning Models", "Multimodal Reasoning Models", and "Reasoning Acceleration Technology". Core members graduated from well-known universities and research institutes such as the University of Science and Technology of China and the Institute of Software, Chinese Academy of Sciences.
In recent years, Chain of Thought (CoT) reasoning has become a prominent technique for large-model inference, but endowing small models with the same long-chain reasoning capability is far from easy.
The "Large Model Diving Team" at ZTE Wireless Institute tackled this from the perspective of "static data experience flow" and pioneered the "LLM-Adaptive Question Difficulty Distillation" method, simultaneously maximizing the production efficiency and effectiveness of high-quality CoT corpus.
Paper Title: Rethinking the Generation of High-Quality CoT Data from the Perspective of LLM-Adaptive Question Difficulty Grading
Paper Link: https://arxiv.org/pdf/2504.11919
Open source links are as follows:
Code Data: https://huggingface.co/datasets/ZTE-AIM/32B_LLM_AdaptiveCode_data
Math Data: https://huggingface.co/datasets/ZTE-AIM/32B_LLM_AdaptiveMath_data
Code Model: https://huggingface.co/ZTE-AIM/LLM-Adaptive-ZCode-model-32B
Math Model: https://huggingface.co/ZTE-AIM/LLM-Adaptive-ZMath-model-32B
Research Motivation: Small Models Also Want "Long-Chain Thinking"
Large Models Have Clear Advantages but Are Hard to Deploy
With the release of the DeepSeek-R1 (671B parameters) model, long Chain of Thought (CoT) reasoning technology has rapidly become popular in foundational large models and industrial applications. Although DeepSeek-R1 has powerful reasoning capabilities, a model with over 600B parameters is difficult to use in edge devices and real-time systems.
Small Models Urgently Need "Enhancement"
This has prompted sustained industry research on small models below 7B parameters, particularly for long-chain reasoning scenarios such as complex mathematical problem-solving and code generation. Notably, high-quality Chain of Thought (CoT) data can be constructed by leveraging DeepSeek-R1's reasoning process, significantly enhancing the reasoning ability of small models. However, models in the billions-to-tens-of-billions parameter range still hit significant bottlenecks on multi-step reasoning tasks (such as complex mathematical problems and programming questions) and struggle to fully meet the requirements of such applications.
Predicament of Existing CoT Data
Research on generating CoT data based on DeepSeek-R1 is roughly divided into two technical routes:
1. Massive Data Driven (Labs 2025; Team 2025c): Improve reasoning capability by stacking ultra-large-scale CoT corpora, at the cost of high computation and annotation expense and low efficiency.
2. Boutique Data Driven (Ye et al. 2025; Muennighoff et al. 2025): Rely on a small number of high-quality samples to activate model potential, but the gains are hard to sustain because of the limited scale.
Although existing work (Wen et al. 2025a) has introduced curriculum learning and rejection sampling to optimize training, these methods generally overlook the dynamic match between model capability and data difficulty.
This directly leads to two core questions:
1. How should high-quality CoT corpus be defined?
2. How to extract transferable "static experience flow" from existing data?
New Method: Model-Adaptive Difficulty Grading Distillation
Recently, Richard Sutton, a founding father of reinforcement learning, argued that "experience" will be the next-generation super data source, framing the essence of large-model reinforcement learning as mining a dynamic experience flow from data. Building on this idea, the team set out instead to construct a static data experience flow and proposed a CoT-corpus distillation method based on model-adaptive question difficulty grading, which markedly improves the quality of long-CoT corpora.
This method proposes a complete CoT construction process centered on "model-data dynamic matching" and has four major innovations:
1. Based on the model's inherent reasoning capabilities, establish a question difficulty grading system to form reusable "static experience".
2. According to difficulty labels, construct an adaptive question bank covering all gradients.
3. Design a difficulty distribution sampling strategy that conforms to the idea of curriculum learning to ensure real-time alignment of training data with model capabilities.
4. With the help of DeepSeek-R1, batch generate high-quality CoT corpus in two major scenarios: mathematical reasoning and code generation.
Under the same computational budget, this adaptive scheme can continuously improve the reasoning performance of models of different scales — taking the AIME24 mathematical competition dataset as an example, the accuracy of models with various parameter sizes increased by 6.66%–26.7% compared to the traditional "non-adaptive" strategy (see Figure 1).
Figure 1: Comparison of CoT Data Construction Effects Based on LLM-Adaptive Question Difficulty Grading
For LLMs of different parameter scales, reasoning models trained on CoT data constructed with the LLM-adaptive question difficulty grading method (left) consistently outperform those trained with non-adaptive methods (right) on the AIME24 mathematical competition dataset. This indicates that the former yields higher-quality CoT data and finds a static data experience flow adapted to the model itself.
In short, the method effectively mines the static experience flow contained in CoT data, and that experience flow is closely tied to the model itself.
Method Framework, Understood in One Picture
Figure 2: CoT Data Generation Framework Based on LLM-Adaptive Question Difficulty Grading
The framework includes three core components: Distribution Construction, LLM-Adaptive Question Difficulty Grading and Distribution Sampling, and LLM-Adaptive Chain of Thought (CoT) Generation.
1. Distribution Construction
Construct two difficulty distribution strategies as the basis for subsequent sampling:
Option 1: Distribution Based on Actual Model Performance (Pₑᵥₐₗ)
Dynamically generate the difficulty distribution based on the performance of the base LLM (Sₗₗₘ) on the evaluation dataset (DBₑᵥₐₗ):
Correctly answered questions: labeled as "Simple" (Easy).
Incorrectly answered questions: further graded by the PRM-Grader (Process Reward Model), which maps the quality of the model's generated reasoning trajectory (a score between 0 and 1) to five difficulty levels; a lower score means a higher difficulty (a minimal mapping sketch follows).
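One way to picture this mapping is the minimal Python sketch below; the function name and the uniform 0.2-wide buckets are our own illustrative assumptions, not the paper's exact thresholds:

```python
def prm_score_to_level(score: float) -> int:
    """Map a PRM trajectory-quality score in [0, 1] to a difficulty level.

    Convention assumed here (matching the text): lower score = harder,
    level 1 = most difficult, level 5 = least difficult among the failed
    questions. The uniform 0.2-wide buckets are an illustrative choice.
    """
    assert 0.0 <= score <= 1.0
    return min(int(score / 0.2) + 1, 5)
```

For example, a trajectory scored 0.15 lands in level 1 (hardest among the failed questions), while one scored 0.85 lands in level 5.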
Option 2: Prior Distribution Based on Curriculum Learning (P_c)
Manually define five difficulty levels following the principle of "more easy questions, fewer hard questions", with sampling weight decreasing as difficulty increases: the easiest level receives the most samples and the hardest level the fewest, as in the sketch below.
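A minimal sketch of such a prior, assuming simple linearly decreasing weights; only the decreasing shape is prescribed by the method, and these exact numbers are illustrative:

```python
# Hypothetical curriculum prior over five levels, ordered easiest -> hardest.
raw_weights = [5, 4, 3, 2, 1]               # easiest ... hardest
total = sum(raw_weights)
p_c = [w / total for w in raw_weights]      # [0.33, 0.27, 0.20, 0.13, 0.07]
```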
2. LLM-Adaptive Question Difficulty Grading and Distribution Sampling
Step 1: Construct the Adaptive Question Bank (DBₐdₐₚₜᵢᵥₑ)
Collect original questions (DBᵣₐw) from open-source datasets, use Sₗₗₘ to generate answers and record reasoning trajectories.
Verify answer correctness:
Mathematical reasoning tasks: directly compare the model answer with the standard answer.
Code generation tasks: execute test cases to verify code correctness.
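A compact sketch of these two verification paths; running the candidate program via `subprocess` is a simplified stand-in for whatever sandboxed execution harness is actually used, and exact string matching for math answers is a simplification (real checkers usually normalize expressions first):

```python
import subprocess

def verify_math(model_answer: str, gold_answer: str) -> bool:
    """Math tasks: compare the model's final answer with the reference.
    Exact string match is a simplification of real answer checkers."""
    return model_answer.strip() == gold_answer.strip()

def verify_code(program: str, test_cases: list[tuple[str, str]]) -> bool:
    """Code tasks: the program must pass every (stdin, expected_stdout) case."""
    for stdin_data, expected in test_cases:
        try:
            result = subprocess.run(
                ["python", "-c", program],
                input=stdin_data, capture_output=True, text=True, timeout=10,
            )
        except subprocess.TimeoutExpired:
            return False  # a timed-out run counts as a failure
        if result.stdout.strip() != expected.strip():
            return False
    return True
```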
Difficulty grading:
Correct questions are labeled as "Simple" and added to the question bank.
Incorrect questions are further divided into 5 difficulty levels (1-5, 1 being the most difficult) by PRM-Grader and added to the question bank.
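Putting Step 1 together, a hedged sketch of the bank-building loop; `base_llm_answer` and `prm_score` are hypothetical stand-ins for the base-model call and the PRM-Grader, and `is_correct` dispatches to the verifiers sketched above:

```python
def build_adaptive_bank(db_raw, base_llm_answer, is_correct, prm_score):
    """Construct DB_adaptive: label every raw question with a difficulty
    derived from the base model's own behaviour on it."""
    bank = []
    for item in db_raw:
        answer, trajectory = base_llm_answer(item["question"])
        if is_correct(item, answer):
            label = "easy"  # solved by the base model: labelled Simple/Easy
        else:
            # failed: the PRM scores the reasoning trajectory in [0, 1], and
            # prm_score_to_level maps it to levels 1 (hardest) .. 5
            label = prm_score_to_level(prm_score(item, trajectory))
        bank.append({"question": item["question"], "difficulty": label})
    return bank
```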
Step 2: Distribution Sampling (DBₛₐₘₚₗₑ)
Sample questions from the adaptive question bank according to the difficulty ratios given by the constructed distribution (Pₑᵥₐₗ or P_c), as in the sketch below.
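A minimal sketch of this sampling step, assuming the distribution is given as a mapping from difficulty label to target fraction (labels as produced by the bank-building sketch above):

```python
import random

def sample_by_distribution(bank, distribution, n_total, seed=0):
    """Draw DB_sample so the difficulty mix follows P_eval or P_c.

    `distribution` maps a difficulty label ("easy", 1, ..., 5) to its
    target fraction; fractions are assumed to sum to 1.
    """
    rng = random.Random(seed)
    sampled = []
    for label, fraction in distribution.items():
        pool = [q for q in bank if q["difficulty"] == label]
        k = min(round(n_total * fraction), len(pool))  # cap at pool size
        sampled.extend(rng.sample(pool, k))
    return sampled
```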
3. LLM-Adaptive CoT Generation
Generation stage: Input the sampled questions (DBₛₐₘₚₗₑ) into the teacher model (Tₗₗₘ, i.e., DeepSeek-R1) to generate a detailed reasoning chain (CoT).
Verification stage: Strictly filter for correct CoT data with the Result-Verifier (using the same checks as in the question-bank construction step), yielding the final high-quality dataset COTₐdₐₚₜᵢᵥₑ.
Model training: Perform supervised fine-tuning (SFT) of the base model (Sₗₗₘ) using COTₐdₐₚₜᵢᵥₑ to obtain the optimized inference model (Rₗₗₘ).
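Finally, a hedged end-to-end sketch of this third stage; `teacher_generate` (the DeepSeek-R1 call) and `result_verifier` are hypothetical stand-ins, and the closing SFT step would use any standard fine-tuning stack:

```python
def generate_adaptive_cot(db_sample, teacher_generate, result_verifier):
    """Stage 3: distill CoT from the teacher, keep only verified samples."""
    cot_adaptive = []
    for item in db_sample:
        cot, final_answer = teacher_generate(item["question"])
        # the Result-Verifier reuses the question-bank checks:
        # answer comparison for math, test-case execution for code
        if result_verifier(item, final_answer):
            cot_adaptive.append(
                {"question": item["question"], "cot": cot, "answer": final_answer}
            )
    return cot_adaptive  # then SFT the base model S_llm on this set to get R_llm
```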
Key innovative points of the method:
Model-adaptive difficulty matching: Adjust the question difficulty distribution to the model's actual capability, avoiding "one-size-fits-all" subjective grading and building a static data experience flow genuinely tied to the model;
Lightweight process: No complex curriculum learning or rejection sampling is needed; data quality improves through grading and sampling alone;
Multi-task compatibility: Supports mathematical reasoning and code generation tasks, with flexible verification methods (answer comparison / test cases).
Experimental Results: Pleasant Surprises
To assess the quality of the proposed CoT data, the team ran detailed evaluations on models of different sizes and types, covering both mathematical reasoning and code generation tasks.
The key experimental results are as follows:
Mathematical Reasoning (MATH500, AIME24/25, GPQA)
In mathematical benchmarks such as MATH500, AIME24/25, and GPQA, the ZMath series models trained with 2k adaptive CoT data significantly outperform the baseline models.
ZMath-32B achieved 94.6% accuracy on MATH500, surpassing DeepSeek-Distill-32B (89.8%) and Sky-T1-32B-Preview (90%); on AIME24 it improved to 73.33% (baseline: 66.67%).
ZMath-14B achieved 50% accuracy on AIME24, significantly exceeding phi4-14B (30%), and reached 63.13% on GPQA (phi4-14B was 54.55%).
Figure 3: Mathematical Reasoning Experimental Results
Code Generation (LiveCodeBench)
ZCode-32B achieved 96.06%, 75.53%, and 31.85% on Easy, Medium, and Hard difficulty levels respectively, comprehensively outperforming DeepSeek-Distill-32B (92.11%, 74.92%, 30%).
ZCode-14B reached 89.96% on the Easy level, far ahead of phi4-14B (72.4%), showing that small-parameter models can also achieve competitive performance when trained on adaptive data.
Figure 4: Code Generation Experimental Results
Ablation Study & Distribution Transfer
When the 32B model's difficulty distribution was applied directly to the 7B model, the latter reached only 92% accuracy on MATH500, below the 93.2% obtained when it was trained with its own difficulty distribution. The difficulty distribution must therefore dynamically match the target model's capability: the adaptive distribution is the key to the performance gains, and the truly valuable experience in the static experience flow is specific to a given model rather than transferable across models in a "one-size-fits-all" way.
Figure 5: Ablation Study (Distribution Transfer) Experimental Results
Summary and Outlook
The paper proposes a high-quality CoT data generation framework based on LLM-adaptive difficulty grading and verifies its efficiency, effectiveness, and generalization ability through systematic experiments. The core conclusions are as follows:
Efficient Data Generation
First, dynamically evaluate the model's current reasoning ability, then construct a matching adaptive question bank. Only about 2k high-quality CoT samples are needed for significant performance gains, sharply reducing data and computation costs.
Cross-Task and Parameter Generalization
The method achieves leading performance in both mathematical reasoning (the AIME series) and code generation (LiveCodeBench), and delivers stable gains for models of all scales from 7B to 32B.
Methodological Contribution
Constructed a systematic CoT data generation and evaluation process, providing a new path for improving the chain reasoning capability of small parameter LLMs in resource-constrained environments, and presenting a reusable paradigm for "static experience flow" mining.
Future work: Further combine reinforcement learning to explore deeper reasoning capabilities and expand to more complex cross-domain tasks such as communication fault diagnosis.