Existing data synthesis methods fall short on plausibility and distributional consistency, cannot automatically adapt to different kinds of data, and scale poorly.
Large language models are limited by sampling efficiency and context window size, making it difficult to directly synthesize large-scale datasets.
How to use large models to generate data that is structurally aligned, statistically credible, and semantically plausible has therefore become a pressing open problem.
To this end, a team from McGill University proposed a new method, LLMSynthor, which turns large models into structure-aware data simulators that generate high-quality, non-leaking synthetic data for privacy-sensitive and data-scarce scenarios.
LLMSynthor: Turning LLMs into "Structure-Aware Generators"
In domains such as population statistics, e-commerce, and urban mobility, data sensitivity makes sharing difficult, and each data format requires a separately designed model, leading to high costs and poor portability.
Traditional methods such as Bayesian networks and GANs either struggle to model high-dimensional dependencies or suffer from poor generalization and instability, often generating samples like "9-year-old doctors" that are statistically plausible but semantically absurd.
Large models themselves have also recently been used for data generation, but they suffer from slow sampling, uncontrollable distributions, and context-window limits, making it hard to efficiently generate large-scale, structurally complete datasets.
LLMSynthor's solution is to have the LLM act not as a direct data generator but as a "structure-aware generator" that iterates and improves through statistical-alignment feedback.
The overall framework is as follows:
Step 1: Structure Inference
To generate credible data, the key is to understand the dependency structure between variables.
Traditional copula models can separate the modeling of marginal distributions from the modeling of variable dependencies, but they are hard to extend to high-dimensional, semantically rich scenarios.
LLMSynthor's key innovation is to let a large language model play the role of the copula.
LLMs themselves can be regarded as a high-dimensional prior for real-world joint distributions, having internalized the co-occurrence patterns of human behavior and social structures during their pre-training.
Combined with its ability to read statistical summaries (e.g., frequencies, distributions), the model can infer higher-order relationships between variables and use semantic knowledge to uncover hidden dependencies.
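A minimal sketch of what this inference step could look like in practice is shown below; the `llm_complete` callable, the prompt wording, and the JSON output format are illustrative assumptions, not the paper's actual interface.

```python
import json

def infer_dependencies(summaries: dict, llm_complete) -> list:
    """Ask an LLM to propose variable dependencies from aggregate summaries.

    `summaries` maps each variable name to its marginal statistics, e.g.
    {"age": {"type": "int", "histogram": {...}}, "occupation": {...}}.
    `llm_complete` is any prompt-in, text-out callable (assumed interface).
    """
    prompt = (
        "Below are statistical summaries of a tabular dataset:\n"
        + json.dumps(summaries, indent=2)
        + "\n\nList the variable pairs or groups you expect to be strongly"
        " dependent, as a JSON array of arrays of variable names."
    )
    # The LLM answers from its pretrained prior over real-world co-occurrence
    # patterns; note that no individual records are ever shown to it.
    return json.loads(llm_complete(prompt))
```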
Step 2: Statistical Alignment
LLMSynthor does not directly compare with raw data; instead, it measures the discrepancy between real and synthetic data through statistical summaries (e.g., variable distributions, joint frequencies).
This preserves structural information while avoiding leakage of individual records.
(Because it relies only on statistical features, it can generate structurally reasonable and semantically consistent synthetic data even when only aggregated indicators are available as input, making it particularly suitable for privacy-sensitive settings such as population censuses and surveys.)
Furthermore, LLMSynthor's alignment mechanism is attributable: it not only measures "overall deviation" but also pinpoints which variable or variable combination caused a specific deviation.
This fine-grained feedback can be directly used for structural adjustments in the next round of generation, enabling gradual alignment.
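To make the idea concrete, here is a small sketch of an attributable alignment check; the function shape and the use of total variation distance as the discrepancy measure are our assumptions, and the paper may use different summary statistics.

```python
from collections import Counter

def alignment_feedback(real_marginals, synth_records, variables, top_k=3):
    """Compare synthetic data against target summaries and attribute the gaps.

    `real_marginals[v]` holds the target category frequencies for variable v
    (aggregates only; raw records are never needed). Returns the variables
    whose synthetic marginals deviate most, as feedback for the next round.
    """
    n = len(synth_records)
    gaps = {}
    for v in variables:
        counts = Counter(rec[v] for rec in synth_records)
        cats = set(real_marginals[v]) | set(counts)
        # total variation distance between target and synthetic marginals
        gaps[v] = 0.5 * sum(
            abs(real_marginals[v].get(c, 0.0) - counts[c] / n) for c in cats
        )
    worst = sorted(gaps, key=gaps.get, reverse=True)[:top_k]
    return {v: gaps[v] for v in worst}
```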
Step 3: Generating Distributions Instead of Samples
Traditional methods generate samples one by one, which is inefficient and makes distribution control difficult.
LLMSynthor instead generates sampleable distribution rules ("proposals") such as "25-year-old female, tier-one city, purchasing beauty products," then samples from them in batches; it can even call external generators (e.g., image generators) to extend to multimodal tasks.
Proposals are guided by both statistical feedback and LLM common sense, naturally avoiding absurd variable combinations like "10-year-old doctor."
This approach is not only efficient and structurally credible but also enables coordination with other models for collaborative generation through a "distribution description language," achieving cross-modal, multi-source, and multi-task data synthesis and simulation.
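A toy sketch of the proposal-then-batch-sample pattern follows; the field names, weights, and record layout are made up purely for illustration.

```python
import random

# A "proposal" is a sampleable rule: a partial record plus a sampling weight.
proposals = [
    {"weight": 0.6, "fields": {"age": 25, "gender": "F",
                               "city_tier": 1, "category": "beauty"}},
    {"weight": 0.4, "fields": {"age": 42, "gender": "M",
                               "city_tier": 2, "category": "electronics"}},
]

def sample_batch(proposals, n):
    """Draw n records by sampling proposals in proportion to their weights."""
    weights = [p["weight"] for p in proposals]
    chosen = random.choices(proposals, weights=weights, k=n)
    return [dict(p["fields"]) for p in chosen]

batch = sample_batch(proposals, 10_000)  # one batch, no per-record LLM calls
```

Because the unit of generation is a rule rather than a record, one LLM call can govern thousands of samples, which is where the efficiency gain comes from.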
Step 4: Iterative Alignment
By repeatedly looping through structure inference, statistical comparison, rule generation, and new-data sampling, the model eventually produces a synthetic dataset that closely matches the real data in structure and statistics while remaining consistent with common sense.
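Putting the pieces together, the loop might look like the sketch below, reusing the illustrative helpers above plus a hypothetical `propose_rules` call that asks the LLM for updated proposals; the stopping rule is our assumption.

```python
def llmsynthor_loop(summaries, real_marginals, variables, llm_complete,
                    n_per_round=10_000, max_rounds=10, tol=0.02):
    """Sketch of the full loop: infer structure, propose rules, sample a
    batch, compare against target summaries, and feed the gaps back."""
    feedback, synth = {}, []
    for _ in range(max_rounds):
        deps = infer_dependencies(summaries, llm_complete)
        # `propose_rules` is hypothetical: it would prompt the LLM with the
        # summaries, inferred dependencies, and last round's feedback.
        proposals = propose_rules(summaries, deps, feedback, llm_complete)
        synth = sample_batch(proposals, n_per_round)
        feedback = alignment_feedback(real_marginals, synth, variables)
        if max(feedback.values(), default=0.0) < tol:
            break  # all tracked marginal gaps are within tolerance
    return synth
```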
Theoretical Guarantees
In addition to empirical results, LLMSynthor also comes with theoretical convergence guarantees.
The LLMSynthor team proposed the Local Structural Consistency Theorem: under reasonable assumptions, if the distribution of a variable or variable group is initially biased, the error can be driven below any chosen tolerance within a finite number of iterations.
In other words, LLMSynthor does not approximate the target by guesswork; it converges toward the real data structure step by step, with a mathematical guarantee.
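In symbols, the guarantee can be paraphrased roughly as follows; the notation is ours, not the paper's.

```latex
% Our paraphrase of the theorem's shape; the symbols and the discrepancy D
% are illustrative notation, not the paper's.
\[
\forall \varepsilon > 0 \;\; \exists\, T < \infty :\qquad
D\!\left(P_{\mathrm{real}}^{(S)},\, \hat{P}_{t}^{(S)}\right) \le \varepsilon
\quad \text{for all } t \ge T
\]
```

Here $S$ denotes a variable or variable group, $\hat{P}_t^{(S)}$ the synthetic marginal after iteration $t$, and $D$ the discrepancy used in the alignment step.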
Multi-Scenario Testing
To verify the practicality and stability of LLMSynthor, the authors conducted experiments in three representative real-world scenarios, including e-commerce transactions, population statistics, and urban mobility.
E-commerce Transaction Generation
This is a mixed scenario containing continuous and discrete variables, with complex variable relationships.
The authors constructed controllable datasets from Bayesian networks with known structure, providing a clear ground truth for evaluating modeling capability.
The results show that LLMSynthor performs optimally in both marginal and joint distribution errors, accurately restoring variable dependencies.
Further prediction experiments also showed that models trained with its synthetic data performed best on real data, demonstrating strong practical value.
Population Micro-Synthesis
Population data contains nested household-individual structures and is widely used in critical tasks such as urban planning, policy evaluation, and resource allocation. LLMSynthor handles this hierarchical structure well and significantly outperforms existing methods across 6 categories and 16 policy indicators (e.g., elderly poverty rate).
Urban Mobility Simulation
Mobility data includes various complex types such as temporal, geographical, and behavioral information, forming the basis for traffic simulation and emergency management.
Drawing on multi-source data, LLMSynthor successfully generates simulated trajectories that match real urban rhythms. More importantly, generation can be steered through prompts.
For example, given the prompt "There's a concert at Tokyo Dome at 8 PM," the synthetic data exhibits the corresponding tidal surge in passenger flow around that time. This demonstrates both faithful reproduction of real-world dynamics and controllable scenario manipulation, making the method suitable for policy simulation and event rehearsal.
Large Model Compatibility
LLMSynthor boasts high generation efficiency, requires no training, and is compatible with various large models. It can run stably even with open-source models like Qwen-2.5-7B, demonstrating good scalability and adaptability for deployment.
Paper link: https://arxiv.org/pdf/2505.14752
Project address: https://yihongt.github.io/llmsynthor_web/