(Reading time: 7 minutes)
Editor's note: Data is the "fuel" driving the development of artificial intelligence, but that fuel now risks running out. This "data wall" has become a key bottleneck limiting performance breakthroughs in large models. Against this backdrop, synthetic data technology has emerged. Recently, Microsoft Research Asia introduced SYNTHLLM, a scalable framework that generates diverse synthetic data to fill the gaps left by natural data. The researchers also discovered and validated scaling laws for synthetic data, providing a scientific basis for training and optimizing large models with it.
One of the key factors behind today's rapid progress in artificial intelligence is the vast amount of data that fuels model training; high-quality data in particular is central to improving model performance. However, as the data available for training on the internet is gradually exhausted, obtaining high-quality pre-training data is becoming increasingly difficult, as if a "data wall" had been erected on the path of AI development. The result is a bottleneck: performance improvements in current large models are slowing, training costs keep rising while the gains diminish, and the overall pace of development slows.
Facing this data shortage, synthetic data offers an effective solution. Although algorithmically generated data does not come from the real world, it can closely mimic real-world patterns. However, while previous research has established scaling laws for pre-training data, whether synthetic data follows similar laws has long lacked systematic verification.
To address this, Microsoft Research Asia built the scalable SYNTHLLM framework, which can generate synthetic data at scale, and verified the scaling laws of synthetic data through extensive experiments. These results provide a scientific basis for training and optimizing large models with synthetic data.
Scaling Laws of Synthetic Data for Language Models
Paper link:
https://arxiv.org/pdf/2503.19551
Synthetic data follows the rectified scaling law
The performance of large language models exhibits a power-law relationship with both model size and dataset size. These scaling laws provide a predictive framework for estimating model performance and have been widely studied and confirmed. They offer valuable insight into how performance scales with compute, supporting more informed decisions about how to allocate computational resources when pre-training large language models.
However, these scaling laws primarily apply to pre-training on natural data; whether synthetic data follows similar patterns has remained an open question. In the latest research based on the SYNTHLLM framework, researchers empirically validated for the first time that scaling laws also apply when fine-tuning language models with synthetic data.
Through extensive experiments in the field of mathematical reasoning, the researchers obtained the following key findings:
1. Synthetic data generated by SYNTHLLM reliably follows a rectified scaling law across various scales. This means synthetic data is predictable: researchers can use the scaling law to choose model size and training data volume so as to maximize model performance.
2. Performance gains plateau at around 300 billion tokens: beyond this scale, additional synthetic data yields diminishing improvements in model performance. This finding helps researchers strike the right balance between data generation and model training.
3. Larger models approach optimal performance with fewer training tokens. For example, an 8-billion-parameter model reaches near-peak performance at 1 trillion tokens, while a 3-billion-parameter model requires 4 trillion. In other words, larger models need less training data to reach strong performance, while smaller models need more; this relationship between model size and training efficiency offers guidance for developing and optimizing future large models.
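The predictability claimed in finding 1 can be illustrated with a small curve-fitting sketch. The functional form, parameter values, and data points below are illustrative assumptions, not the paper's actual fit; one common parameterization of a rectified scaling law adds a data-size shift and an irreducible-error floor to a plain power law.

```python
import numpy as np
from scipy.optimize import curve_fit

# Assumed rectified scaling law: test error is a shifted power law of
# fine-tuning data size D (here in billions of tokens), saturating at
# an irreducible error E as D grows.
def rectified_law(D, B, Dl, beta, E):
    return B / (Dl + D) ** beta + E

# Hypothetical (token count, error rate) observations; real curves
# would come from SYNTHLLM fine-tuning runs at several data scales.
D = np.array([0.1, 0.5, 2.0, 10.0, 50.0, 300.0])
err = rectified_law(D, B=0.3, Dl=0.2, beta=0.4, E=0.12)

# Fit the four parameters; in practice one curve is fit per model size.
params, _ = curve_fit(rectified_law, D, err,
                      p0=[1.0, 1.0, 0.5, 0.1], maxfev=20000)

# Extrapolate to 1000B tokens to check how close the curve is to
# its plateau (the irreducible-error floor E).
pred = rectified_law(1000.0, *params)
print(pred)
```

Once fitted on small-scale runs, such a curve lets researchers predict the payoff of generating more synthetic data before spending the compute, which is the practical use of the scaling law described above.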
Figure 1: Synthetic data generated by SYNTHLLM consistently follows the rectified scaling law across various model sizes. (Note: The curves in the figure represent error rate, not accuracy)
SYNTHLLM Synthetic Dataset: More Scalable and Diverse
Traditional methods for constructing synthetic datasets rely heavily on a limited number of human-annotated seed examples in the target domain, which fundamentally limits the diversity and scalability of the resulting datasets. Pre-training corpora, by contrast, are both massive and highly diverse, making them an underutilized resource for scalable synthetic data generation. Building on this observation, researchers developed SYNTHLLM, a scalable, web-scale synthetic data generation framework that systematically transforms pre-training data into high-quality synthetic datasets.
SYNTHLLM completes the generation of synthetic data through the following three stages:
First, SYNTHLLM autonomously identifies and filters high-quality web documents in the target domain.
Next, SYNTHLLM takes these high-quality reference documents and, with the help of open-source large language models, applies three complementary methods to generate diverse questions (or prompts) at scale. Each method is designed to progressively increase question diversity.
Finally, SYNTHLLM again uses open-source large language models to generate corresponding answers (or responses) for these generated questions, forming complete synthetic data samples.
It is worth noting that, in the second stage, previous methods typically generated questions by direct extraction or document back-translation. Both have inherent scalability limits: question generation is either bounded by the number of reference documents that contain high-quality questions, or requires training specialized back-translation models. SYNTHLLM goes beyond direct extraction by using graph algorithms to automatically extract high-level concepts from multiple documents and randomly recombine them, establishing connections across reference documents in the process.
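As a rough illustration of the concept-graph recombination just described, here is a toy Python sketch of stage two. The documents, concept lists, and prompt template are invented for illustration; in SYNTHLLM, concept extraction and question generation are performed by open-source LLMs rather than hard-coded rules.

```python
import random
from itertools import combinations
from collections import defaultdict

# Stand-in for stage 1 output: filtered reference documents, each
# tagged with the high-level concepts it covers (here hard-coded;
# in SYNTHLLM an LLM extracts them).
docs = {
    "doc_a": ["quadratic equations", "factoring", "roots"],
    "doc_b": ["roots", "polynomials", "complex numbers"],
    "doc_c": ["factoring", "polynomials", "gcd"],
}

# Build a concept co-occurrence graph: concepts appearing in the same
# document are connected, which also links the documents themselves.
graph = defaultdict(set)
for concepts in docs.values():
    for c1, c2 in combinations(concepts, 2):
        graph[c1].add(c2)
        graph[c2].add(c1)

def sample_concept_pair(rng):
    """Pick a concept and one of its graph neighbors, possibly
    bridging two different source documents."""
    c1 = rng.choice(sorted(graph))
    c2 = rng.choice(sorted(graph[c1]))
    return c1, c2

def make_prompt(c1, c2):
    # In the real framework, the combined concepts (plus reference
    # text) are fed to an open-source LLM; here we just emit the
    # hypothetical meta-prompt.
    return f"Write a math problem that combines {c1} and {c2}."

rng = random.Random(0)
prompts = [make_prompt(*sample_concept_pair(rng)) for _ in range(5)]
for p in prompts:
    print(p)
```

Because pairs are drawn from the graph rather than from a single document, the number of distinct prompts grows combinatorially with the concept vocabulary instead of being capped by the document count, which is the scalability advantage claimed above.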
Experiments show that SYNTHLLM generates more diverse questions by decomposing and recombining knowledge concepts. As Figure 2 shows, questions produced by the latter two methods have lower similarity, indicating greater diversity among questions generated from the same document.
Figure 2: Histogram of question similarity within the same document
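The diversity comparison behind Figure 2 can be approximated with a simple proxy: mean pairwise similarity among questions generated from the same document. The similarity metric and the example questions below are illustrative assumptions; the paper's actual measurement may differ.

```python
import itertools

# Jaccard similarity of word sets: a crude but dependency-free
# proxy for question similarity. Lower mean similarity across a
# group of questions indicates greater diversity.
def jaccard(a, b):
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def mean_pairwise_similarity(questions):
    pairs = list(itertools.combinations(questions, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Invented questions from one reference document: direct extraction
# tends to yield near-duplicates, while concept recombination yields
# more varied prompts.
extracted = [
    "Solve the quadratic equation x^2 - 5x + 6 = 0",
    "Solve the quadratic equation x^2 - 7x + 12 = 0",
]
recombined = [
    "Solve the quadratic equation x^2 - 5x + 6 = 0",
    "Factor the polynomial and relate its roots to its gcd with x - 2",
]

print(mean_pairwise_similarity(extracted))
print(mean_pairwise_similarity(recombined))
```

Plotting these per-document similarity scores as a histogram, as in Figure 2, makes the diversity gap between generation methods visible at a glance.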
Furthermore, compared to existing augmentation methods, SYNTHLLM's knowledge-guided approach uses limited reference documents more effectively, enabling more scalable generation of high-quality synthetic questions, as shown in Figure 3. This provides a more effective training path for further improving model performance.
Figure 3: (a) Performance of other augmentation methods on the MATH benchmark; (b) Average performance across various benchmarks. (x-axis represents sample number, y-axis represents accuracy)
Synthetic Data: A Continuous Supply Source for Model Training Data
In the foreseeable future, the data wall will continue to accompany the development of artificial intelligence, and synthetic data will become an important supplement to model training data. Synthetic data has several advantages: first, it is highly scalable and can quickly generate large-scale datasets according to demand; second, it is low-cost, requiring no extensive human effort for data annotation. These characteristics make synthetic data an ideal choice for solving the problem of data scarcity.
In different fields, the application value of synthetic data is particularly prominent. For example, in the medical field, synthetic cases can effectively avoid privacy issues; in the autonomous driving field, virtual scenarios can be generated infinitely, providing rich testing materials for technology development; in the field of AI education, through algorithmic combinations, millions of math problems can be easily generated.
The SYNTHLLM framework further amplifies the advantages of synthetic data. Besides the field of mathematical reasoning, this framework can be easily extended to other downstream areas, such as code generation, physics and chemistry, and healthcare, to explore its application potential in different fields.
In the future, researchers will develop more efficient strategies to optimize and improve the SYNTHLLM framework, explore the effectiveness of synthetic data in the pre-training phase, and further raise the efficiency and quality of synthetic data generation, injecting a continuous stream of power into the ongoing development of artificial intelligence.
With the rapid development of artificial intelligence technology, ensuring that related technologies can be trusted is an urgent issue. Microsoft has taken a series of proactive measures to anticipate and mitigate the risks brought by AI technology, and is committed to advancing AI in accordance with human-centered ethical principles. As early as 2018, it released six Responsible AI Principles: fairness, inclusiveness, reliability and safety, transparency, privacy and security, and accountability. It subsequently released the Responsible AI Standard to put these principles into practice and established a governance structure to ensure that teams apply them in their daily work. Microsoft also continues to collaborate with researchers and academic institutions worldwide to advance the practice and technology of responsible AI.