Open-Sourcing the Largest High-Quality Scientific Reasoning Post-Training Dataset to Quickly Turn Qwen3 and Others into "Scientists"

The largest-scale open-source scientific reasoning post-training dataset in history is here!

Shanghai AI Academy and Shanghai Jiao Tong University (GAIR Lab) have released MegaScience. The dataset contains approximately 1.25 million question-answer pairs with reference answers, spanning disciplines including biology, chemistry, computer science, economics, mathematics, medicine, and physics. It aims to provide a solid data foundation for training and evaluating the scientific reasoning capabilities of general artificial intelligence systems.


Experiments have shown that models trained on MegaScience significantly outperform their official Instruct models in scientific reasoning tasks. Furthermore, MegaScience demonstrates excellent scalability: as the scale of the base model increases, the performance gains brought by MegaScience become more pronounced.


Currently, the team has fully open-sourced MegaScience and all related components, including the source code of the data construction pipeline, the scientific reasoning evaluation system, the dataset itself, and the models trained on it. The goal is to provide systematic, high-quality resource support to the research community and further advance the research and application of general artificial intelligence in the scientific domain.


MegaScience's answers are comparatively short yet achieve the best performance, combining efficiency and effectiveness.

Within just one week of release, the dataset surpassed 4.6k downloads and ranked fourth on the Hugging Face Datasets trending list, drawing widespread attention and positive feedback from researchers in academia and industry.


Why is MegaScience needed?

Although advanced reasoning models like o1 and DeepSeek-R1 have reached, or even surpassed, human-expert performance on math and programming tasks, mainstream models still lag significantly behind on scientific reasoning. The gap stems largely from a long-standing lack of large-scale, high-quality training data in the scientific reasoning domain.

Existing scientific reasoning post-training datasets still present some unresolved challenges:

Unreliable Benchmark Evaluation: Many open-source scientific benchmarks use multiple-choice formats, which are easy to implement but oversimplify the complexity of scientific reasoning. Post-training datasets in the scientific domain consequently often adopt the same format to keep the data distribution consistent. However, the authors' experiments show that models trained on such data do well on multiple-choice evaluations yet perform markedly worse on computational tasks, indicating a disconnect between benchmark scores and true reasoning ability.

Imprecise Data Decontamination: Existing decontamination techniques typically rely on n-gram or vector similarity to identify and remove potential benchmark leakage. These methods are inherently fragile and easily circumvented by subtle changes in wording or structure, making it difficult to truly guarantee fair benchmark evaluation. The authors found significant overlap between most existing scientific post-training datasets and the evaluation benchmarks.

Low Quality of Reference Answers: Reference answers in many scientific datasets come from unreliable sources, often web scraping or direct generation by large language models. However, as web content is increasingly saturated with AI-generated text, coupled with LLMs' susceptibility to hallucination, the reliability of both methods continuously decreases, making it difficult to ensure the factual accuracy and scientific rigor of the answers.

Superficial Knowledge Distillation: A common practice is to distill data from large reasoning models, for example by directly using DeepSeek-R1 to generate longer chains of thought (CoT). While intuitive and easy to implement, this remains a shallow form of knowledge transfer. The generated CoT data often suffers from "overthinking," which creates training challenges (especially for smaller models) and hurts inference efficiency, and this shallowness limits further progress on principled knowledge transfer, efficiency, and generalization.


To address these challenges, the MegaScience team proposed a systematic solution, including the following four key components:

Construction of a Scientific Reasoning Evaluation System: The team first developed an evaluation framework for scientific reasoning tasks covering 15 representative benchmarks, with question types including multiple-choice, computational, true/false, and short-answer questions. This breadth of task types enables comprehensive and reliable evaluation of a model's scientific reasoning capabilities.

Large-Model-Based Data Decontamination: To address data contamination issues, the authors implemented a strict large-model-based decontamination process for both the proposed dataset and the baseline datasets used. Experiments show that after processing with this method, other existing open-source datasets exhibited significant performance drops under the same benchmarks, further validating the effectiveness of this decontamination strategy in improving evaluation trustworthiness.

High-Quality Data Source Construction Strategy: In terms of data construction, the team systematically collected question-answer content primarily from university-level professional textbooks. Compared to traditional web-based Q&A resources, textbook content offers higher authority and accuracy of reference answers, providing a solid guarantee for data quality.

Optimized Data Refinement Method: Unlike previous approaches that distill from reasoning models, the authors opted to refine the initially extracted data with a chat model. This improves the linguistic fluency and logical consistency of the Q&A while avoiding the efficiency bottlenecks common to long chain-of-thought methods, combining high quality with high efficiency.

Specifically:

The MegaScience team first proposed TextbookReasoning, an open-source post-training dataset for university-level scientific reasoning with reliable reference answers. Drawing on nearly 120,000 university textbooks, the team constructed a total of 650,000 scientific reasoning questions covering physics, biology, chemistry, computer science, mathematics, and economics. The construction pipeline includes textbook digitization, dual question-answer pair extraction, deduplication, question-answer pair refinement, filtering, and large-model-based decontamination. The process is fully automated and driven by large language models, greatly improving the scalable acquisition of high-quality data.

To further promote the construction of open-source post-training data for scientific reasoning, the team then proposed MegaScience, a large-scale mixture of high-quality open-source datasets containing 1.25 million data points. The team first collected multiple public datasets and conducted systematic ablation experiments on different data filtering strategies to select the optimal subset from each dataset. In addition, step-by-step solutions were annotated for all datasets other than TextbookReasoning.

To support the development of scientific reasoning capabilities in the open-source community, the team designed and open-sourced an evaluation framework covering a wide range of disciplines and various question types, encompassing 15 representative benchmarks. This framework not only facilitates the reproduction of experimental results but also enables fair comparison between models through unified evaluation standards. A comprehensive answer extraction strategy was also designed to ensure the accuracy of the final evaluation metrics.

Experiments show that the constructed datasets not only achieved efficient training and inference processes but also obtained leading performance in the scientific domain. The team further trained Llama3.1, Qwen2.5, and Qwen3 series base models on MegaScience, which outperformed their official Instruct models on average, significantly advancing the development of the open-source community in the scientific domain. Concurrently, MegaScience shows more significant effects on larger and stronger models, demonstrating its good scalability advantage in instruction fine-tuning. The team has open-sourced the data construction process, evaluation system, datasets, and trained models to support the continuous development of scientific reasoning research.

TextbookReasoning Construction Process

The research team proposed a data construction process fully driven by large language models, used to build a large-scale, high-quality scientific reasoning dataset—TextbookReasoning. This process extracts and refines a total of 650,000 question-answer pairs from approximately 120,000 university and graduate-level textbooks. The overall process comprises five stages:


TextbookReasoning Dataset Construction Flowchart

1. Book Collection and Digitization

Researchers collected a total of 128,000 university-level and above textbooks covering multiple scientific fields and used the olmOCR system to process them via OCR, converting them into structured text content. To strictly adhere to copyright regulations, the research team combined rule matching and large language model techniques to comprehensively review book copyright information and removed books with copyright restrictions. Furthermore, this open-source dataset adopts the CC-BY-NC-SA-4.0 license, strictly limiting commercial use.

2. Dual Question-Answer Pair Extraction

Researchers first chunked each textbook into document segments of 4,096 tokens and designed two extraction templates for each subject:

High-standard extraction: Only retains question-answer pairs containing detailed reasoning steps and explanations;

Low-standard extraction: Retains any question-answer pair that contains a clear answer.

Llama3.3-70B-Instruct was used to perform Q&A extraction on all documents, ultimately yielding 945,000 raw Q&A pairs; a minimal code sketch of this step is shown below.
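The following is a rough sketch of the chunking-and-dual-extraction step, assuming an OpenAI-compatible endpoint (for example a local vLLM server) serving Llama3.3-70B-Instruct; the prompt templates and server URL are illustrative assumptions, not the team's exact setup.

```python
# Hypothetical sketch only: prompts, model name, and endpoint are assumptions.
from openai import OpenAI
from transformers import AutoTokenizer

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # e.g. a local vLLM server
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.3-70B-Instruct")

HIGH_STANDARD = ("From the document below, extract only question-answer pairs that "
                 "include detailed reasoning steps and explanations.\n\n{doc}")
LOW_STANDARD = ("From the document below, extract every question that has a clear "
                "answer, even if no reasoning steps are given.\n\n{doc}")

def chunk_document(text: str, max_tokens: int = 4096) -> list[str]:
    """Split a digitized textbook into segments of at most max_tokens tokens."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    return [tokenizer.decode(ids[i:i + max_tokens]) for i in range(0, len(ids), max_tokens)]

def extract_qa(segment: str, template: str) -> str:
    """Ask the extraction model for QA pairs under one of the two standards."""
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.3-70B-Instruct",
        messages=[{"role": "user", "content": template.format(doc=segment)}],
        temperature=0.0,
    )
    return response.choices[0].message.content

# Each segment is processed under both extraction standards:
# raw_pairs = [extract_qa(seg, tpl)
#              for seg in chunk_document(book_text)
#              for tpl in (HIGH_STANDARD, LOW_STANDARD)]
```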


Question-Answer Pair Extraction Statistics for Each Discipline

3. Question Deduplication

To avoid redundant information, researchers applied MinHash-based Locality-Sensitive Hashing (LSH) to remove near-duplicate questions across the whole collection.
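As a rough illustration of this step, near-duplicate questions can be detected with MinHash LSH, for instance via the datasketch library; the similarity threshold and word-level shingling below are illustrative assumptions rather than the paper's settings.

```python
# Illustrative MinHash-LSH deduplication; threshold and shingling are assumed values.
from datasketch import MinHash, MinHashLSH

def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for token in text.lower().split():   # word shingles; character n-grams also work
        m.update(token.encode("utf-8"))
    return m

def deduplicate(questions: list[str], threshold: float = 0.8) -> list[str]:
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for idx, q in enumerate(questions):
        mh = minhash_of(q)
        if lsh.query(mh):          # an earlier, highly similar question already exists
            continue
        lsh.insert(str(idx), mh)
        kept.append(q)
    return kept
```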

4. Question-Answer Pair Refinement

Researchers used DeepSeek-V3 to refine the content of question-answer pairs by referencing the original document content and further called Llama3.3-70B-Instruct to identify questions lacking a chain of thought, which were then completed using DeepSeek-V3. Additionally, to ensure data quality, researchers again utilized Llama3.3-70B-Instruct to automatically filter out low-quality question-answer pairs with logical contradictions or incorrect answers.

5. Large-Model-Based Question Decontamination

To reduce training contamination caused by overlap with existing evaluation benchmarks, researchers designed a large-model-driven contamination identification mechanism; the process is as follows (a simplified code sketch follows the list):

a. For each question, perform a vector-similarity search with BGE-large-en-v1.5 to retrieve the 5 most similar questions across all benchmarks in the 15-benchmark evaluation suite;

b. Then, use Llama3.3-70B-Instruct to compare the candidate questions one by one and judge whether any is semantically near-identical to the training question; if any pair is judged a duplicate, the question is marked as a contaminated sample and removed from the training set.
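A simplified sketch of this two-step decontamination, assuming sentence-transformers for the BGE embeddings and an LLM judge behind an OpenAI-compatible endpoint; the judge prompt is paraphrased, and in practice the benchmark embeddings would be pre-computed and indexed rather than re-encoded per question.

```python
# Sketch of LLM-based decontamination; judge prompt, endpoint, and settings are assumptions.
import numpy as np
from openai import OpenAI
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("BAAI/bge-large-en-v1.5")
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def is_contaminated(question: str, benchmark_questions: list[str], top_k: int = 5) -> bool:
    # Step a: retrieve the top-k most similar benchmark questions by cosine similarity.
    q_emb = encoder.encode([question], normalize_embeddings=True)
    b_emb = encoder.encode(benchmark_questions, normalize_embeddings=True)
    scores = (b_emb @ q_emb.T).squeeze(-1)
    candidates = [benchmark_questions[i] for i in np.argsort(-scores)[:top_k]]

    # Step b: ask the judge model whether any candidate is essentially the same question.
    for cand in candidates:
        prompt = (f"Question A:\n{question}\n\nQuestion B:\n{cand}\n\n"
                  "Do these two questions ask essentially the same thing? Answer Yes or No.")
        verdict = client.chat.completions.create(
            model="meta-llama/Llama-3.3-70B-Instruct",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.0,
        ).choices[0].message.content
        if verdict.strip().lower().startswith("yes"):
            return True
    return False
```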


MegaScience Construction Process

To further promote the development of open-source scientific reasoning post-training datasets, the authors systematically integrated multiple existing public data sources and extensively explored various data filtering strategies and problem annotation methods. This led to the construction of MegaScience, a mixed dataset comprising 1.25 million high-quality question-answer pairs. The construction process of this dataset includes four key steps, ensuring data diversity, accuracy, and applicability.


Dataset Construction Flowchart

1. Public Dataset Collection

The authors selected NaturalReasoning, Nemotron-Science, and TextbookReasoning datasets as initial corpus sources to build the raw dataset collection.

2. Question Deduplication and Decontamination

To improve data quality, the authors applied the same deduplication strategy as TextbookReasoning to the NaturalReasoning and Nemotron-Science datasets, along with large language model-based question decontamination, thereby eliminating duplicates and contaminated questions.

3. Data Filtering

The authors proposed three data filtering techniques:

(1) Filtering based on answer length: The authors used Qwen2.5-72B-Instruct to annotate answers for questions and retained those questions that generated the longest answers.

(2) Filtering based on question difficulty: Since high-difficulty questions are crucial for improving model reasoning capabilities, the authors proposed a two-stage difficulty assessment and filtering method:

a. Reference Answer Annotation:

For the TextbookReasoning dataset, the authors used Llama3.3-70B-Instruct to generate high-quality reference answers for each question;

For NaturalReasoning, its officially provided reference answers were directly used;

For Nemotron-Science, the summary paragraphs from DeepSeek-R1's model output were used as reference answers.

b. Difficulty Assessment: The authors used Qwen2.5-7B-Instruct to generate 16 candidate answers for each question and Qwen2.5-32B-Instruct to score each candidate from 0 to 10 against the reference answer, judging accuracy and completeness. The average score was taken as the question's difficulty index, with lower scores indicating more challenging questions (a code sketch of this scoring step follows the list).

(3) Random Sampling Filtering: Randomly selected questions.
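A hypothetical sketch of the difficulty-scoring step in item (2)b, again assuming OpenAI-compatible endpoints serving the two Qwen models; the grading prompt and sampling settings are assumptions, not the paper's exact configuration.

```python
# Hypothetical difficulty scoring: prompts and endpoints are assumptions.
import re
from openai import OpenAI

solver = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")   # serves Qwen2.5-7B-Instruct
grader = OpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY")   # serves Qwen2.5-32B-Instruct

def difficulty_score(question: str, reference: str, n_samples: int = 16) -> float:
    """Lower average grade of the sampled answers => harder question."""
    samples = solver.chat.completions.create(
        model="Qwen/Qwen2.5-7B-Instruct",
        messages=[{"role": "user", "content": question}],
        n=n_samples, temperature=1.0,
    )
    grades = []
    for choice in samples.choices:
        prompt = (f"Question:\n{question}\n\nReference answer:\n{reference}\n\n"
                  f"Candidate answer:\n{choice.message.content}\n\n"
                  "Grade the candidate from 0 to 10 for accuracy and completeness. "
                  "Reply with the number only.")
        verdict = grader.chat.completions.create(
            model="Qwen/Qwen2.5-32B-Instruct",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.0,
        ).choices[0].message.content
        match = re.search(r"\d+(\.\d+)?", verdict)
        if match:
            grades.append(min(float(match.group()), 10.0))
    return sum(grades) / len(grades) if grades else 0.0
```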


Effect of 3 Data Filtering Methods on Each Dataset

For each dataset, the authors first used the difficulty-based method to select n samples, then set the answer-length and random-selection methods to the same budget of n samples to ensure a fair comparison. They then performed supervised fine-tuning of Qwen2.5-7B on each subset and used the results to pick the optimal data selection strategy for each dataset.

Random selection performed best on the NaturalReasoning dataset, while difficulty selection achieved optimal performance on Nemotron-Science. However, no single data selection method could surpass the results achieved by directly using the complete TextbookReasoning, indicating that this dataset contains very few low-quality samples. This finding supports the authors' decision to retain all samples in TextbookReasoning.

4. Solution Step Annotation

For TextbookReasoning, the authors retained its refined solutions. For NaturalReasoning, because the original answers generated by Llama3.3-70B-Instruct were of lower quality, the authors used DeepSeek-V3 to annotate step-by-step solutions. For Nemotron-Science, DeepSeek-R1 generated overly verbose answers even for relatively simple questions, significantly reducing inference efficiency, so the authors likewise used DeepSeek-V3 to annotate step-by-step solutions. They then filtered out answers exceeding 4,096 tokens, removing approximately 8,000 samples from the dataset (a small sketch of this length filter follows).
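An illustrative version of the final length filter; the tokenizer choice is an assumed stand-in, since the article does not specify which tokenizer was used for counting.

```python
# Illustrative token-length filter; the tokenizer is an assumed stand-in.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

def within_length(solution: str, max_tokens: int = 4096) -> bool:
    """Keep only solutions whose token count does not exceed max_tokens."""
    return len(tokenizer.encode(solution, add_special_tokens=False)) <= max_tokens

# filtered_samples = [s for s in samples if within_length(s["solution"])]
```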


Sample-count changes during the MegaScience construction process (DC: data decontamination, DS: data filtering)

MegaScience Evaluation Framework

To enhance the reliability, reproducibility, and fairness of the evaluation process, the authors proposed an open-source scientific reasoning evaluation framework—Language Model Open Science Evaluation. This framework covers 15 representative scientific reasoning benchmark tasks, encompassing various types of question formats, aiming to comprehensively assess the scientific reasoning capabilities of language models.


List of Benchmarks Involved in the MegaScience Evaluation Framework

This evaluation system has the following characteristics:

Supports evaluation of Instruct models and base models;

Easy to integrate new evaluation benchmarks and configurations;

Supports multi-node and multi-GPU parallel execution, enabling scalable evaluation across multiple models, benchmarks, and tasks;

Provides comprehensive instance-level output data, supporting fine-grained analysis of model prediction results.

The authors also optimized answer extraction, which is crucial during evaluation because extraction accuracy directly affects the final results. Many scientific evaluation pipelines only extract content inside \boxed{}, ignoring answers given in other formats and incorrectly counting such formatting issues as accuracy losses. To improve extraction precision, the authors designed a comprehensive rule-based extraction method tailored to different question types. It employs a two-stage process: (1) identifying cue phrases that signal the presence of the final answer; (2) extracting the specific answer content from various formats. Additionally, for multiple-choice questions, if the option label cannot be extracted directly, the system also matches against the option content to determine the corresponding label. A simplified sketch is shown below.
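A much-simplified sketch of such a two-stage extractor; the cue phrases and regular expressions are illustrative only and do not reproduce the framework's full rule set.

```python
# Simplified two-stage answer extraction; patterns are illustrative, not the framework's rules.
import re
from typing import Optional

ANSWER_CUES = [r"final answer is", r"the answer is", r"answer:"]

def extract_answer(response: str) -> Optional[str]:
    # Stage 1: prefer an explicit \boxed{...} span if one exists.
    boxed = re.search(r"\\boxed\{([^{}]*)\}", response)
    if boxed:
        return boxed.group(1).strip()
    # Otherwise look for a cue phrase that signals the final answer.
    for cue in ANSWER_CUES:
        m = re.search(cue + r"\s*:?\s*(.+)", response, flags=re.IGNORECASE)
        if m:
            # Stage 2: take the rest of that line as the answer content.
            return m.group(1).split("\n")[0].strip().rstrip(".")
    return None

def extract_choice(response: str, options: dict[str, str]) -> Optional[str]:
    """Multiple-choice fallback: try the option label first, then match option text."""
    answer = extract_answer(response) or response
    label = re.search(r"\b([A-D])\b", answer)
    if label:
        return label.group(1)
    for key, text in options.items():
        if text.lower() in answer.lower():
            return key
    return None
```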


Experimental Results

The authors first trained the Qwen2.5-7B base model on the TextbookReasoning and MegaScience datasets and systematically compared the results with those obtained from existing scientific reasoning datasets. Both datasets achieved the best performance in the current open-source community across multiple evaluation metrics. Furthermore, the MegaScience-trained model also surpassed the officially released Qwen2.5-7B-Instruct on scientific reasoning tasks.


To further demonstrate the effectiveness of this dataset, the authors fine-tuned Llama3.1, Qwen2.5, and Qwen3 series base models using MegaScience and compared them with their official instruct models, leading to the following interesting conclusions:

Breaking Performance Bottlenecks in Scientific Domains: Introducing MegaScience during training significantly boosted performance across different model families and scales. Qwen2.5-7B, all Qwen3 series models, and Llama3.1-8B trained with MegaScience all substantially surpassed their official Instruct versions in average performance. This widespread improvement across various base models indicates that MegaScience can effectively push the frontiers of performance in scientific domains.

Scalability Advantage for Larger, Stronger Models: MegaScience delivers larger gains on bigger and more capable models, indicating a scalability advantage for instruction fine-tuning. In the Qwen2.5 series, the gap shifts steadily with scale: Qwen2.5-1.5B-Instruct was 2.99% higher than Qwen2.5-1.5B-MegaScience, the gap narrowed to only 0.15% at 3B, and it reversed at 7B, where the MegaScience version achieved a 2.21% improvement over the Instruct version. Furthermore, in the stronger Qwen3 series, the MegaScience versions consistently outperformed the official Instruct models at every scale, and the performance gap widened progressively with model size.

Mathematical Reasoning Capability Depends on Model Capacity: The authors found that improvements in mathematical ability particularly rely on sufficient base model capacity. Only in stronger base models (e.g., Qwen2.5-7B and Qwen3-8B) could MegaScience outperform official instruction fine-tuned models in mathematical reasoning tasks. The authors speculate that this selective improvement stems from the high difficulty characteristics of mathematical problems in their dataset, many of which involve professional mathematical concepts at university undergraduate level and above. Such complex mathematical reasoning tasks seem to require models to possess a certain capability threshold to effectively learn and benefit from this type of challenging training data.

Future Outlook

While the current work primarily focuses on supervised fine-tuning, it does not yet involve scientific reasoning research based on reinforcement learning. Notably, MegaScience provides high-quality and reliable reference answers, which can serve as supervision for generating accurate reward signals within reinforcement learning frameworks. This feature provides a good research foundation for the community, encouraging further exploration of the potential of reinforcement learning in scientific reasoning tasks to see if it can further enhance models' reasoning capabilities beyond existing supervised training results.
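As a concrete (and purely hypothetical) illustration of how those reference answers could drive a reward signal, a minimal rule-based verifier might look like the sketch below; this is an assumption about future use, not something implemented in the paper.

```python
# Hypothetical outcome reward built on MegaScience reference answers; the matching rule is illustrative.
import re

def _normalize(text: str) -> str:
    return " ".join(text.lower().strip().rstrip(".").split())

def reference_reward(model_output: str, reference_answer: str) -> float:
    """Return 1.0 if the model's final answer matches the reference after normalization, else 0.0."""
    boxed = re.search(r"\\boxed\{([^{}]*)\}", model_output)
    lines = model_output.strip().splitlines()
    predicted = boxed.group(1) if boxed else (lines[-1] if lines else "")
    return 1.0 if _normalize(predicted) == _normalize(reference_answer) else 0.0
```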

This dataset uses short chains of thought. A promising research direction is to introduce reinforcement learning on this basis to further learn more complex, longer reasoning chains, and explore whether this strategy can surpass the performance of models obtained from traditional intermediate training stages in a more efficient manner. If research proves this direction feasible, it will provide new opportunities for the expansion of reinforcement learning in language models and also suggest that supervised fine-tuning based on MegaScience could be an efficient alternative path to intermediate training.

Given the limitations of computational resources, the authors have not yet conducted systematic research on chain-of-thought compression strategies. Future work could further explore whether compressing longer CoT reasoning into a more concise form could achieve better performance at a response length comparable to MegaScience.

Paper Title: MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning

Paper Link: https://arxiv.org/abs/2507.16812

Open-Source Dataset & Models: https://huggingface.co/MegaScience

Data Processing Code: https://github.com/GAIR-NLP/MegaScience

Evaluation System Code: https://github.com/GAIR-NLP/lm-open-science-evaluation
