Today, Large Language Models (LLMs) are the star technology of AI, but have you noticed that even the latest models still make head-scratching errors on math problems and code? Why is this the case? The issue may lie not with the models themselves, but with their "diet" - the quality of their pre-training data.
Recently, a study achieved substantial improvements in models' mathematical reasoning and code generation capabilities by "rewriting" rather than merely "filtering" pre-training data. The research not only open-sourced the resulting datasets but, more importantly, described a reproducible method for continuously improving model performance in specialized domains.
1. Limitations of Traditional Data Processing Methods
Imagine learning programming from a textbook full of incorrect code, inconsistent naming conventions, and fragmented snippets with no context. Could you learn well? Large Language Models face much the same problem during training.
Currently, publicly available code datasets (like The-Stack-v1/v2) and mathematical datasets (like Finemath-4+) primarily rely on rule-based extraction or model scoring to filter low-quality samples. However, these methods have clear limitations:
(1) Simple Filtering: Only removes "unqualified" samples, but the remaining content may still have issues like inconsistent style or inefficient algorithms.
(2) Keeping As Is: While avoiding information loss, it doesn't improve sample quality.
(3) Quality Scoring: Although it can identify high-quality content, it cannot improve content of medium quality.
This is like sorting fruit: we throw away the obviously rotten pieces but do nothing to clean up the ones with minor blemishes.
2. SwallowCode: Full Pipeline Optimization from Cleaning to Rewriting
The research team proposed a four-stage data processing pipeline that not only filters low-quality code but also completely rewrites the retained code:
(1) Syntax Error Filtering
First, the team used Python's built-in compile() function to check every code sample and removed those that failed to compile. This step alone removed about 10.6% of the samples and laid the foundation for subsequent processing.
Interestingly, simply removing code with syntax errors can improve model performance on the HumanEval and HumanEval+ benchmarks.
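As a rough illustration, a check of this kind takes only a few lines of Python. The sketch below is not the authors' code - the corpus format and helper names are assumptions - but it shows why the step is cheap: compile() parses the source without ever executing it.

    # Minimal sketch of compile()-based syntax filtering (illustrative, not the paper's code).

    def passes_syntax_check(source: str) -> bool:
        """Return True if the snippet is valid Python 3 source."""
        try:
            compile(source, "<snippet>", "exec")  # parse and compile only; nothing is executed
            return True
        except SyntaxError:
            return False
        except (ValueError, MemoryError, RecursionError):
            # compile() can also fail on null bytes or pathologically nested code
            return False

    def filter_corpus(samples: list[dict]) -> list[dict]:
        """Keep only samples whose 'text' field compiles (the field name is an assumption)."""
        return [s for s in samples if passes_syntax_check(s["text"])]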
(2) Code Quality Filtering
The team used Pylint (a widely used Python code quality checker) to score each sample, keeping only those scoring above 7.0 out of 10, and applied a custom heuristic to penalize overly verbose comments. This step removed a further 34.3% of the samples.
These seemingly strict filtering criteria proved worthwhile in subsequent tests, as they provided the model with higher quality "learning material."
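One way to reproduce this kind of filter is to run Pylint on each sample and parse the score it reports. The 7.0 threshold matches the paper; everything else below - the Pylint invocation, the comment-density heuristic, and its 0.5 cutoff - is an assumption of this sketch.

    import os
    import re
    import subprocess
    import tempfile

    SCORE_RE = re.compile(r"rated at (-?[\d.]+)/10")

    def pylint_score(source: str) -> float:
        """Run Pylint on a code string and return its overall score (can be negative)."""
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(source)
            path = f.name
        try:
            result = subprocess.run(["pylint", path], capture_output=True, text=True)
            match = SCORE_RE.search(result.stdout)
            return float(match.group(1)) if match else 0.0
        finally:
            os.unlink(path)

    def comment_ratio(source: str) -> float:
        """Fraction of non-blank lines that are comments - a crude verbosity proxy (assumption)."""
        lines = [line.strip() for line in source.splitlines() if line.strip()]
        return sum(line.startswith("#") for line in lines) / max(len(lines), 1)

    def keep_sample(source: str, min_score: float = 7.0, max_comments: float = 0.5) -> bool:
        return pylint_score(source) >= min_score and comment_ratio(source) <= max_comments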
(3) Style-Guided Code Rewriting (SGCR)
This stage used a Large Language Model (Llama-3.3-70B-Instruct) to rewrite the code according to the Google Python style guide:
1) Add docstrings and type hints
2) Unify variable reassignment patterns
3) Standardize function and class names
Through this step, the model's performance on HumanEval and HumanEval+ improved by over 9 percentage points!
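The paper's actual rewriting prompt is not reproduced in this article, so the sketch below only shows the general shape of such a call; the instruction wording and the generate() callable are placeholders, not the authors' implementation.

    # Illustrative shape of a style-guided rewriting (SGCR) call.
    # The instruction text and `generate` interface are placeholders, not the paper's prompt/API.

    SGCR_INSTRUCTIONS = (
        "Rewrite the following Python code so that it follows the Google Python style guide, "
        "adds docstrings and type hints to every function and class, avoids needless variable "
        "reassignment, and uses descriptive function and class names. Return only the code."
    )

    def rewrite_with_sgcr(source: str, generate) -> str:
        """Send one code sample through an instruction-tuned LLM (the paper used Llama-3.3-70B-Instruct)."""
        prompt = f"{SGCR_INSTRUCTIONS}\n\n{source}"
        return generate(prompt)  # `generate` wraps whatever inference backend is available

In the pipeline, the rewritten output replaces the original sample - "transform and retain" rather than "filter and discard".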
(4) Self-Contained Optimization Rewriting (SCOR)
While SGCR primarily focuses on code style, SCOR further optimizes code semantics and functionality:
1) Ensure code self-containment: inline or otherwise resolve external dependencies
2) Replace inefficient algorithms with more efficient ones
3) Transform trivial code snippets into meaningful executable examples
This step brought an additional 5 percentage points of performance improvement, emphasizing the importance of semantic-level rewriting.
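As a concrete, invented example of the kind of semantic rewrite SCOR targets (it does not come from the dataset), a quadratic duplicate check might become a self-contained, documented, linear-time function:

    # Before: the kind of snippet SCOR would rewrite (quadratic, no types, no docstring).
    def has_dup(xs):
        for i in range(len(xs)):
            for j in range(i + 1, len(xs)):
                if xs[i] == xs[j]:
                    return True
        return False

    # After: self-contained, documented, and linear time using a set.
    def has_duplicate(values: list[int]) -> bool:
        """Return True if any value appears more than once."""
        seen: set[int] = set()
        for value in values:
            if value in seen:
                return True
            seen.add(value)
        return False

    if __name__ == "__main__":
        print(has_duplicate([3, 1, 4, 1, 5]))  # True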
The final SwallowCode dataset contains approximately 16.1 billion tokens, and models trained on it significantly outperform those trained on other public code datasets under the same training budget.
3. SwallowMath: Improving the Quality of Math Datasets
The research team also applied a similar rewriting method to the Finemath-4+ math dataset, creating SwallowMath (approximately 2.3 billion tokens). The rewriting process included:
(1) Removing residual web headers, footers, and privacy statements
(2) Eliminating redundant metadata (such as question and answer timestamps)
(3) Restoring missing context in incomplete questions or answers
(4) Rewriting explanations into a concise and comprehensive form
(5) Providing clear, step-by-step solutions
These improvements led to performance gains of 12.4 and 7.6 percentage points on the GSM8K and MATH benchmarks, respectively, proving the effectiveness of the rewriting method in the mathematical domain.
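As an illustration of the rule-like cleanup in steps (1) and (2) above, boilerplate and metadata can be stripped with a handful of patterns before the LLM rewriting steps. The patterns here are invented examples, not the filters actually used to build SwallowMath.

    import re

    # Illustrative cleanup pass for steps (1)-(2); the patterns are invented examples.
    BOILERPLATE_PATTERNS = [
        re.compile(r"^(Home|About|Contact|Privacy Policy).*$", re.MULTILINE),      # nav/footer lines
        re.compile(r"asked \w+ \d{1,2}, \d{4} at \d{1,2}:\d{2}", re.IGNORECASE),   # Q&A timestamps
        re.compile(r"answered \w+ \d{1,2}, \d{4} at \d{1,2}:\d{2}", re.IGNORECASE),
    ]

    def strip_boilerplate(text: str) -> str:
        """Remove web navigation and Q&A metadata before the LLM rewriting step."""
        for pattern in BOILERPLATE_PATTERNS:
            text = pattern.sub("", text)
        return re.sub(r"\n{3,}", "\n\n", text).strip()  # collapse leftover blank lines

Steps (3) through (5), by contrast, require the LLM rewriting stage rather than rules.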
4. Results: Doubling Model Progress
Researchers conducted continued pre-training on the Llama-3.1-8B model within a fixed budget of 50 billion tokens. The results showed:
(1) Using SwallowCode: 17.0 percentage points improvement on HumanEval, 17.7 percentage points improvement on HumanEval+
(2) Using SwallowMath: 12.4 percentage points improvement on GSM8K, 7.6 percentage points improvement on MATH
This is close to "doubling" the model's progress at the same training cost!
The research team also conducted rigorous test set contamination checks to ensure that performance gains were not due to the training data containing test samples. The results showed that there were no documents in SwallowCode with high similarity to HumanEval or HumanEval+ prompts.
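The article does not spell out how the contamination check was performed. A common approach, shown here purely as an assumption-laden sketch, is to measure n-gram overlap between each training document and the benchmark prompts (the 13-gram window and 0.8 threshold are illustrative values, not the paper's settings).

    # Sketch of an n-gram overlap contamination check (a common technique; settings are illustrative).

    def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
        tokens = text.lower().split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def overlap_ratio(document: str, prompt: str, n: int = 13) -> float:
        """Fraction of the benchmark prompt's n-grams that also occur in the training document."""
        prompt_grams = ngrams(prompt, n)
        if not prompt_grams:
            return 0.0
        return len(prompt_grams & ngrams(document, n)) / len(prompt_grams)

    def is_contaminated(document: str, benchmark_prompts: list[str], threshold: float = 0.8) -> bool:
        return any(overlap_ratio(document, p) >= threshold for p in benchmark_prompts)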
5. Why Is Rewriting More Effective Than Filtering?
Traditional data processing methods primarily rely on filtering, whereas the "transform and retain" method adopted in this study can:
(1) Increase Data Utilization: Instead of simply discarding low-quality samples, their quality is improved through rewriting.
(2) Unify Style and Structure: Make datasets consistent in style, facilitating model learning.
(3) Ensure Data Self-Containment: Reduce external dependencies and improve code executability.
(4) Optimize Algorithm Efficiency: Replace inefficient implementations with more efficient algorithms.
This is like not only selecting high-quality "textbooks" but also carefully editing and restructuring them so that the model can learn deeper knowledge from the same material.
6. Implications
This research not only provides two high-quality open-source datasets but, more importantly, offers a systematic method for improving pre-training data that can be applied to various specialized domains:
(1) Quality over Quantity: With limited computational resources, improving data quality is more effective than simply increasing data volume.
(2) Rewriting over Filtering: By rewriting low-quality data, data utilization can be maximized.
(3) Domain-Specific Processing: Different domains (such as code, math) require specially designed data processing pipelines.
It is worth mentioning that although the experiments focused on Python, the pipeline design is language-agnostic: it can be applied to any programming language that has a parser for syntax checking and a style checker or linter, as the sketch below suggests.
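For example, the first pipeline stage only needs some way to reject code that does not parse. The dispatch below uses stock compiler/interpreter flags for a few languages; the specific tools are this article's illustration, not part of the paper.

    import subprocess

    # Per-language syntax-only checks (illustrative choices; any parser that
    # rejects malformed code would do for stage 1 of the pipeline).
    SYNTAX_CHECKERS = {
        "python": ["python3", "-m", "py_compile"],
        "javascript": ["node", "--check"],
        "c": ["gcc", "-fsyntax-only", "-x", "c"],
    }

    def syntax_ok(path: str, language: str) -> bool:
        """Return True if the file parses under its language's syntax checker."""
        cmd = SYNTAX_CHECKERS[language] + [path]
        return subprocess.run(cmd, capture_output=True).returncode == 0

The style-checking and rewriting stages would likewise swap Pylint and the Google Python style guide for the target language's linter and conventions.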
This pioneering research shows us that through carefully designed data processing pipelines, we can significantly enhance the capabilities of Large Language Models in specialized domains without relying on larger models or more training data. It paves the way for advancements in key areas of AI such as automated reasoning and software development.
With the open-sourcing of these methods and datasets, we have reason to believe that future Large Language Models will become more adept at handling math and programming tasks, providing us with more reliable assistance in these areas. This also proves once again that in AI development, data quality is as important as model architecture, and perhaps even more critical.
Paper Title: Rewriting Pre-Training Data Boosts LLM Performance in Math and Code
Paper Link: https://arxiv.org/abs/2505.02881
Recommended Reading
NVIDIA Releases the Llama-Nemotron Series of Reasoning Models
Zero to One: A Detailed Explanation of AI Agent Design Patterns
RM-R1: An Innovative Approach to Reward Modeling as a Reasoning Process
100 Days After DeepSeek-R1 Release: A Survey on Replication Studies and Reasoning Language Models