The explosive growth of Large Language Models (LLMs) has exposed a core tension: the more capable the model, the more high-quality training data it needs. Traditional human-annotated data faces three dilemmas:
- Cost Trap: Annotation in specialized fields (e.g., mathematical proofs) requires expert involvement, making costs escalate rapidly.
- Quality Bottleneck: Human annotation inherently carries errors (studies report average error rates above 5%).
- Privacy Minefield: Data in fields like healthcare/finance is difficult to obtain legally.
More problematically, existing synthetic data methods (such as Self-Instruct) carry a "garbage in, garbage out" risk: models directly replicate surface patterns from seed examples without deep reasoning. This is akin to asking elementary school students to mimic a university thesis; the form may look similar, but there is no substantive depth.
- Paper: CoT-Self-Instruct: Building high-quality synthetic prompts for reasoning and non-reasoning tasks
- Link: https://arxiv.org/pdf/2507.23751
The CoT-Self-Instruct method proposed in this paper is like equipping LLMs with a "mind-mapping tool":
- Plan First, Then Generate: It requires the model to analyze the essential characteristics of seed examples through Chain of Thought (CoT).
- Dual Assurance Filtering: For verifiable problems, consistency of answers is used for quality control; for open-ended problems, a reward model is employed for screening.
Experiments show that the synthetic data generated by this method surpasses human-annotated datasets by 12.6% on mathematical reasoning tasks and outperforms the best baseline by 7.3% on instruction-following tasks, opening a new path to resolve the data dilemma.
Methodology Explained: A Reasoning-Driven Data Factory
Overall Process
As shown in the paper's pipeline figure, the process resembles a sophisticated "data pipeline":
- Seed Input: A small number of high-quality human-annotated examples (e.g., 10 math problems).
- CoT Engine: The LLM progressively reasons to generate new prompts (core innovation).
- Quality Gate: Different filters are applied depending on the scenario (a minimal code sketch follows).
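To make the flow concrete, here is a minimal Python sketch of the three stages, assuming hypothetical `llm_generate` and `quality_filter` callables. This is an illustration of the described pipeline, not the authors' code.

```python
import random

def synthesize_dataset(seed_prompts, llm_generate, quality_filter, n_target=5000):
    """Seed input -> CoT engine -> quality gate (the three stages above)."""
    synthetic = []
    while len(synthetic) < n_target:
        seeds = random.sample(seed_prompts, k=2)   # a small batch of seed examples per call
        candidate = llm_generate(seeds)            # CoT engine: reason first, then emit a new prompt
        if quality_filter(candidate):              # Answer-Consistency or RIP, depending on task type
            synthetic.append(candidate)
    return synthetic
```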
Chain of Thought Generator (Core Innovation)
The biggest difference from traditional methods lies in enforced deep analysis. Taking mathematical problem generation as an example (a prompt-template sketch follows the comparison):
```
# Traditional Self-Instruct
Input: Seed Problem → Output: New Problem

# CoT-Self-Instruct
Input: Seed Problem → Model Execution:
  Step 1: Analyze the domain/difficulty/structural characteristics of the seed problem
  Step 2: Design a new problem framework that satisfies the same characteristics
  Step 3: Derive the answer step by step to ensure logical rigor
  Step 4: Output the complete problem together with its standard answer
```
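A rough prompt-template sketch of these four steps is shown below; the wording is illustrative, and the authors' actual prompt may differ.

```python
# Illustrative CoT-generation prompt covering the four steps above.
# The exact wording used in the paper may differ; treat this as a sketch.
COT_GENERATION_PROMPT = """You are given the following seed problems:

{seed_problems}

Step 1: Analyze the domain, difficulty, and structural characteristics of the seed problems.
Step 2: Design a new problem that matches those characteristics (do not copy the seeds).
Step 3: Solve the new problem step by step to make sure it is well-posed.
Step 4: Output the complete problem together with its final answer as a single scalar value
        (integer, simplest fraction, or exact radical)."""

def build_cot_prompt(seed_problems: list[str]) -> str:
    """Fill the template with a handful of seed problems."""
    return COT_GENERATION_PROMPT.format(seed_problems="\n\n".join(seed_problems))
```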
Rigorous Guarantee of Mathematical Principles
For problems with verifiable answers (e.g., math problems), the final answer must be a single scalar value:
- Format Requirements: Integer (42), simplest fraction (3/7), exact radical (√2).
- Verification Formula: a synthetic problem is kept only if $a_{\text{gen}} = \operatorname{majority}(a_1, \ldots, a_K)$, where $a_{\text{gen}}$ is the answer generated with the problem and $\operatorname{majority}(a_1, \ldots, a_K)$ is the majority-vote result over $K$ model inferences. This design ensures that the problem has a clear solution and filters out "out-of-scope" problems that the model itself cannot consistently solve.
Dual-Track Filtering Mechanism
Verifiable Tasks: Answer-Consistency
Like "multiple grading" for a math exam
- Generate K model solutions (K=16 in experiments).
- If the majority answer ≠ the generated standard answer → discard the data.
- Essence: Eliminating problems that LLMs collectively "got wrong" (see the sketch below).
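A minimal sketch of this check, assuming a hypothetical `solve` callable that returns the model's final answer string for a given problem (illustrative, not the authors' code):

```python
from collections import Counter

def answer_consistency_filter(problem: str, claimed_answer: str, solve, k: int = 16) -> bool:
    """Keep a synthetic problem only if the majority of K sampled solutions
    reproduces the answer that was generated together with the problem."""
    answers = [solve(problem) for _ in range(k)]        # K independent solutions (K=16 in the paper)
    majority_answer, _count = Counter(answers).most_common(1)[0]
    return majority_answer == claimed_answer            # discard if the majority disagrees
```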
Open-Ended Tasks: RIP Filtering
Similar to a "survival elimination round"
- Generate K responses → score with a Reward Model (RM).
- Take the lowest score as the quality score for that prompt.
- Retain only high-scoring prompts (experiments show cutting at the 50th percentile is optimal); see the sketch below.
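A sketch of RIP-style filtering under the same caveats, assuming hypothetical `generate_responses` and `reward_score` helpers:

```python
def rip_filter(prompts, generate_responses, reward_score, k: int = 16, keep_fraction: float = 0.5):
    """Score each prompt by the minimum reward over K sampled responses,
    then keep the top `keep_fraction` of prompts (50th percentile in the paper)."""
    scores = []
    for prompt in prompts:
        responses = generate_responses(prompt, k)                        # K candidate responses
        scores.append(min(reward_score(prompt, r) for r in responses))   # worst-case quality
    cutoff = sorted(scores)[int(len(scores) * (1 - keep_fraction))]      # score threshold
    return [p for p, s in zip(prompts, scores) if s >= cutoff]
```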
Experimental Design: Comprehensive Stress Test
Reasoning Task Arena
- Datasets: MATH500 (competition math problems), AMC23 (American Mathematics Competition 2023), GPQA (graduate-level, "Google-proof" QA).
- Seed Data: 893 verifiable math problems from s1k (theorem-proving problems filtered out).
- Training Method: GRPO reinforcement learning + Qwen3-4B model.
- Key Comparisons:
- Traditional Self-Instruct.
- Human-annotated dataset (s1k).
- 10K-scale OpenMath-Reasoning.
Open-Ended Task Arena
- Datasets: AlpacaEval 2.0 (instruction following), Arena-Hard (complex interaction).
- Seed Data: 4K high-quality dialogues from WildChat (categorized into 8 major domains to prevent mixing).
- Training Method: DPO alignment + Llama-3.1-8B.
- Evaluators: Owing to OpenAI API limitations, GPT-4-Turbo and GPT-4o were used as dual judges.
Filtering Strategy Comparison
| Filtering Type | Applicable Scenario | Core Metric |
|---|---|---|
| Self-Consistency | Verifiable Tasks | Majority Vote Pass Rate |
| RIP | Open-Ended Tasks | Reward Model Lowest Score |
| Answer-Consistency | Verifiable Tasks | Match between the majority-vote answer and the generated standard answer |
Results Analysis: The Counterattack of Synthetic Data
Reasoning Tasks: Decisively Outperforming Human Data
Key Data Interpretation:
- Quality > Quantity: 5K CoT data (57.2%) > 10K OpenMath data (47.5%).
- Filtering Power: CoT + Answer-Consistency improves results by 4.2 percentage points over the unfiltered version (53.0% → 57.2%).
- Historic Breakthrough: Achieved 47.4% on GPQA diamond-level problems, surpassing s1k's 40.1%.
Counter-Intuitive Discovery:
When the training volume was fixed at 893 samples:
- CoT synthetic data (54.2%) > human s1k data (44.6%). Meaning: sample for sample, carefully designed synthetic data is far more effective than human annotation.
Open-Ended Tasks: Surpassing Human Dialogue
Striking Comparison:
- Basic Performance: CoT data (54.7%) > human WildChat data (50.7%).
- Online Evolution: After online DPO training, the gap widened to 67.1% vs 63.1%.
- Length Trap: Human data tends to produce redundant answers (resolved in experiments by length normalization).
Key Insight:
Human data improved more after RIP filtering (46.8% → 50.7%)
→ Suggesting that human data is noisier, so filtering yields more significant gains
Impact of Filtering Mechanism
| Method | Before Filtering | +Answer-Consistency | +RIP |
|---|---|---|---|
| Self-Instruct | 49.5% | - | 54.5% |
| CoT-Self-Instruct | 53.0% | 57.2% | 56.2% |
Note: Answer-Consistency performs best for verifiable tasks.
Conclusion
CoT-Self-Instruct is not just a data generation tool, but an engine for elevating LLM cognitive capabilities. It rests on three key designs:
- Deep Reasoning Guidance (replacing mechanical copying).
- Contextual Filtering (using mathematical consistency for verifiable tasks, and reward distribution for open-ended tasks).
- Domain Pure Sampling (preventing knowledge contamination).
Together, these designs deliver breakthroughs across multiple dimensions:
- Mathematical Reasoning: 58.7% accuracy sets a new record (exceeding human data by 14.1%).
- Instruction Following: 67.1% win rate defines a new benchmark.
- Data Efficiency: 893 synthetic data samples > 893 human data samples.
This work heralds a new paradigm in AI development: when large models learn to create data through "deep thinking," humanity will be liberated from the drudgery of data annotation, shifting towards higher-level creative empowerment. The future path to AGI will undoubtedly be paved by self-evolving synthetic data.