Tsinghua and Collaborators Propose Absolute Zero, a Self-Play Paradigm for Large Models That Achieves Top Performance on Multiple Tasks with Zero Training Data

Introducing a new method, the Absolute Zero Reasoner (AZR), which lets a model autonomously evolve its reasoning skills without human input.

Authors: Andrew Zhao, Yiran Wu, Yang Yue, Tong Wu, Quentin Xu, Matthieu Lin, Shenzhi Wang, Qingyun Wu, Zilong Zheng*, Gao Huang*

Affiliations:

  • Tsinghua University
  • Beijing Institute for General Artificial Intelligence (BIGAI)
  • Pennsylvania State University

Paper Link: https://arxiv.org/abs/2505.03335

Code Link: https://github.com/LeapLabTHU/Absolute-Zero-Reasoner

Introduction

How can Artificial Intelligence models learn autonomously without human data?

While LLMs' reasoning abilities keep improving by learning from human-curated examples, this reliance on expertly crafted data is becoming a bottleneck: as model capabilities advance, the effort required to maintain high-quality training datasets becomes unsustainable.

This article introduces a new method called the Absolute Zero Reasoner (AZR): a single Large Language Model (LLM) that simultaneously acts as a Proposer and a Solver and is trained via reinforced self-play through interaction with an executable environment (such as a Python interpreter). Despite being trained entirely without human-annotated data, AZR surpasses multiple SOTA models trained on tens of thousands of annotated samples on mathematical and programming reasoning tasks. The Absolute Zero paradigm is shown in Figure 1.


Figure 1. Absolute Zero Paradigm

Supervised learning relies on human-curated reasoning trajectories for behavior cloning. Reinforcement learning, based on verified rewards, enables agents to learn reasoning autonomously but still depends on expert-defined learning distributions and a set of carefully curated question-answer pairs, which require domain expertise and human input. In contrast, this paper introduces a new paradigm—Absolute Zero—for training reasoning models without any human-curated data. The idea is that an agent should autonomously propose tasks optimized for learnability and learn how to solve these tasks using a unified model. The agent learns by interacting with an environment that provides verifiable feedback, thereby achieving reliable and continuous self-improvement entirely without human intervention.

Research Motivation

  • Traditional Supervised Fine-tuning (SFT) requires manual annotation of reasoning processes, which is not scalable.
  • Reinforcement Learning with Verifiable Rewards (RLVR), while alleviating some issues, still requires humans to provide QA distributions.
  • As large model capabilities improve, the training gain from manually designed tasks gradually decreases.
  • There is an urgent need for a self-proposing, self-solving, self-learning paradigm, i.e., the Absolute Zero Paradigm.

Paper Contribution

  1. Proposes the Absolute Zero Paradigm: zero data, zero external QA, pure self-play reinforcement learning.
  2. Implements the Absolute Zero Reasoner (AZR): a unified model that bootstraps its own learning across diverse reasoning tasks.
  3. Uses an executable environment (code executor) as the sole source of reward.
  4. Designs three basic reasoning tasks: induction, deduction, and abduction.
  5. AZR surpasses multiple SOTA models in code and math tasks without any human data.
  6. Proposes a new advantage estimator TRR++ for multi-task reinforcement learning.

Absolute Zero Reasoner's Working Principle

The AZR model, as shown in Figure 2, employs a continuous cycle of task creation and problem solving, guided by three core reasoning modes. It relies on a code executor that verifies tasks, checks solutions, and provides objective feedback without human intervention.
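To make the executor's role concrete, here is a minimal sketch of how such an environment could validate a proposed task: it runs the proposed program on the proposed input and keeps the task only if execution succeeds and is deterministic. The function names (run_program, validate_task) and the single-argument convention f(x) are illustrative assumptions, not the authors' released implementation.

    # Minimal sketch of environment-side task validation (illustrative, not the
    # authors' released code). A proposed (program, input) pair is accepted only
    # if the program runs without error and behaves deterministically.

    def run_program(program_src: str, input_repr: str):
        """Execute the function f defined in program_src on the literal input."""
        namespace = {}
        exec(program_src, namespace)          # define the proposed function f
        arg = eval(input_repr, namespace)     # parse the literal input, e.g. "3" or "[1, 2]"
        return namespace["f"](arg)

    def validate_task(program_src: str, input_repr: str):
        """Return the verified output if the proposal is valid, otherwise None."""
        try:
            out1 = run_program(program_src, input_repr)
            out2 = run_program(program_src, input_repr)   # re-run to check determinism
        except Exception:
            return None                                   # syntax/runtime error: reject
        return out1 if out1 == out2 else None             # nondeterministic: reject

    # Example: a valid deduction-style proposal
    gold_output = validate_task("def f(x):\n    return x + 2", "3")   # -> 5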

(1) Dual Roles:

AZR leverages the LLM as both:

  • Task Proposer: generates learnable reasoning tasks.
  • Task Solver: attempts to solve these tasks.

The Proposer and Solver are the same model serving two functions. As a proposer, it generates coding tasks, such as writing functions or predicting outputs, while ensuring these tasks are neither too simple nor too difficult to solve. As a solver, it attempts to complete these tasks, improving its reasoning abilities through trial and error. Rewards are structured so that the proposer earns points for creating tasks of moderate difficulty (the most learnable ones), while the solver is scored on correctness.
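A minimal sketch of the dual-role setup, assuming a generic generate(prompt) call into the single underlying model; the prompt templates here are hypothetical placeholders, not the paper's actual prompts.

    # One model, two roles: the same generate() backend is prompted either to
    # propose a new task or to solve one (templates are illustrative only).

    PROPOSE_TEMPLATE = (
        "You are proposing a new {task_type} task.\n"
        "Past examples:\n{examples}\n"
        "Write a new Python program and input that is challenging but solvable."
    )

    SOLVE_TEMPLATE = (
        "You are solving a {task_type} task.\n"
        "{task}\n"
        "Reason step by step, then give your final answer."
    )

    def propose(generate, task_type, examples):
        return generate(PROPOSE_TEMPLATE.format(task_type=task_type, examples=examples))

    def solve(generate, task_type, task):
        return generate(SOLVE_TEMPLATE.format(task_type=task_type, task=task))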

(2) Three Reasoning Modes

Tasks are categorized into three types, inspired by logical reasoning:

Deduction: Predict output based on code and input (e.g., "Given f(x)=x+2, what does it return when x=3?").

Abduction: Infer the input that produced a specific output (e.g., "Find x such that f(x)=5").

Induction: Write code that matches given input-output examples (e.g., "Create a function that maps these pairs").
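Using the f(x) = x + 2 example above, the three modes can be seen as three views of one underlying (program, input, output) triplet; the snippet below is purely illustrative.

    # One underlying triplet, three ways of hiding part of it (illustrative).
    program = "def f(x):\n    return x + 2"
    task_input, task_output = 3, 5

    # Deduction: reveal (program, input), ask for the output.
    deduction = {"given": (program, task_input), "predict": "output"}      # answer: 5

    # Abduction: reveal (program, output), ask for an input that produces it.
    abduction = {"given": (program, task_output), "predict": "input"}      # answer: any x with x + 2 == 5

    # Induction: reveal input-output examples, ask for a matching program.
    induction = {"given": [(1, 3), (3, 5), (10, 12)], "predict": "program"}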

A single objective function couples the two roles: it rewards the proposer for the learnability of the tasks it creates and rewards the solver for the accuracy of its answers.
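Roughly, the paper formalizes this as follows (notation paraphrased here; z is a conditioning variable for the proposer, f_e denotes the environment turning a proposal τ into a verified task (x, y*), and λ balances the propose and solve rewards):

    J(\theta) := \max_{\theta}\,
      \mathbb{E}_{z \sim p(z)}\Big[\,
        \mathbb{E}_{(x, y^{\star}) \sim f_e(\cdot \mid \tau),\; \tau \sim \pi_{\theta}^{\text{propose}}(\cdot \mid z)}
        \big[\, r_e^{\text{propose}}(\tau, \pi_{\theta})
          + \lambda\, \mathbb{E}_{y \sim \pi_{\theta}^{\text{solve}}(\cdot \mid x)}
            r_e^{\text{solve}}(y, y^{\star}) \,\big]
      \Big]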


Figure 2. AZR Overall Process

Module 1: Three Types of Reasoning Task

Each task is in the form of a triplet (program, input, output):

  1. Deduction: Given a program and input, predict the output.
  2. Abduction: Given a program and output, predict the input such that the program applied to the input yields the output.
  3. Induction: Given multiple input-output examples, induce the program that generates them.

Module 2: Task Reward Mechanism

  • Proposer Task Reward (Learnability): If a task is too simple or too difficult, no reward is given; medium difficulty tasks provide the maximum training gain.
  • Solver Task Reward (Accuracy): Rewards for correct solutions.
  • Final Reward: layers format penalties on top of the role-specific rewards (see the sketch after this list):
    • Correct format and correct output: full role-specific reward.
    • Correct format but wrong output: a small penalty.
    • Incorrect format: a larger penalty.
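A minimal sketch of how these rewards might be computed; the learnability reward is derived from the solver's empirical success rate on the proposed task, and the specific penalty constants below are illustrative placeholders rather than the paper's exact values.

    # Proposer reward (learnability): zero for tasks that are always solved or
    # never solved; highest for tasks of intermediate difficulty.
    def propose_reward(solver_success_rate: float) -> float:
        if solver_success_rate in (0.0, 1.0):
            return 0.0
        return 1.0 - solver_success_rate

    # Solver reward (accuracy).
    def solve_reward(is_correct: bool) -> float:
        return 1.0 if is_correct else 0.0

    # Composite reward with format penalties (penalty values are illustrative).
    def final_reward(role_reward: float, format_ok: bool, answer_ok: bool) -> float:
        if not format_ok:
            return -1.0        # wrong format: strongest penalty
        if not answer_ok:
            return -0.5        # correct format but wrong answer: milder penalty
        return role_reward     # correct format and correct answer: role-specific reward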

Module 3: Self-Play Training Process

Each self-play iteration proceeds as follows (a minimal loop sketch is given after the list):

  1. Initialize three task buffers, one per task type (deduction, abduction, induction).
  2. In each round:
    • Propose new tasks, conditioned on past examples drawn from the buffers.
    • Validate the proposals with the executable environment.
    • Add valid tasks to the corresponding buffer.
    • Sample buffered tasks and attempt to solve them.
    • Compute rewards and apply the RL update (using TRR++).
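Putting the steps together, the training loop can be sketched as below. Here model.propose, env.validate, env.score, and model.update are stand-ins for the model's prompting interface, the code-executor environment, and the RL machinery; proposer rewards (computed from solver success rates on newly proposed tasks) are omitted for brevity.

    # Simplified self-play loop (illustrative sketch, not the released implementation).
    import random

    TASK_TYPES = ("deduction", "abduction", "induction")

    def self_play(model, env, seed_tasks, num_rounds, batch_size=8):
        buffers = {t: list(seed_tasks[t]) for t in TASK_TYPES}   # one buffer per task type

        for _ in range(num_rounds):
            experiences = []
            for task_type in TASK_TYPES:
                # 1) Propose, conditioned on a few past examples from the buffer.
                refs = random.sample(buffers[task_type], k=min(3, len(buffers[task_type])))
                proposal = model.propose(task_type, refs)

                # 2) Validate with the executable environment; keep only verified tasks.
                task = env.validate(task_type, proposal)          # None if invalid
                if task is not None:
                    buffers[task_type].append(task)

                # 3) Solve a batch of buffered tasks under verifiable rewards.
                batch = random.sample(buffers[task_type], k=min(batch_size, len(buffers[task_type])))
                for t in batch:
                    answer = model.solve(task_type, t)
                    reward = env.score(task_type, t, answer)
                    experiences.append((task_type, "solve", reward))

            # 4) RL update with task-relative baselines (TRR++, see the sketch below).
            model.update(experiences)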

Reinforcement learning uses Task-Relative REINFORCE++ (TRR++).
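TRR++ maintains a separate baseline for each combination of task type and role (3 × 2 = 6 baselines) and normalizes each reward against the statistics of its own group. Below is a minimal sketch of that group-wise normalization, with the rest of the REINFORCE++ machinery omitted.

    from collections import defaultdict
    from statistics import mean, pstdev

    def task_relative_advantages(experiences):
        """experiences: list of (task_type, role, reward).
        Returns advantages normalized within each (task_type, role) group."""
        groups = defaultdict(list)
        for task_type, role, reward in experiences:
            groups[(task_type, role)].append(reward)

        advantages = []
        for task_type, role, reward in experiences:
            rewards = groups[(task_type, role)]
            mu, sigma = mean(rewards), pstdev(rewards)
            advantages.append((reward - mu) / (sigma + 1e-6))   # group-wise normalization
        return advantages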

Experiment Results

Experiment Settings

  • Models: Qwen2.5 series (3B / 7B / 14B), Llama3.1-8B.
  • Data: Completely without human data.
  • Evaluation:
    • Math: AIME, OlympiadBench, AMC, MATH500, Minerva, etc.
    • Programming: HumanEval+, MBPP+, LiveCodeBench, etc.

Main Results

Advantages of Absolute Zero Reasoner:

The Absolute Zero Reasoner model can be trained entirely without human data, and its performance even surpasses models fine-tuned on thousands of expert examples. It set new SOTA scores on coding benchmarks like HumanEval+ and MBPP+.

Table 1: Absolute Zero Reasoner Performance on Coding Benchmarks

In mathematical reasoning (AIME, AMC), even when trained only on code tasks, it demonstrates strong cross-domain generalization. Key findings include:

Scaling Advantages: Larger base models (7B → 14B parameters) see larger gains from AZR training, suggesting performance continues to improve with model scale.

Code-Enhanced Reasoning Capabilities: Models pre-trained on code perform better in mathematics after AZR training than general-purpose models, hinting at a synergy between programming and abstract reasoning.

Emergent Planning: Much like a human planner, AZR begins to add step-by-step comments to its code, resembling ReAct-style prompting, a behavior it was never explicitly taught.

Table 2: Absolute Zero Reasoner Performance on Math Benchmarks

Figure 3: Qualitative Analysis of Emergent Behaviors

However, there are also caveats. Some trained models occasionally produce concerning content in their reasoning chains, highlighting the need for safety oversight. In addition, autonomous systems can exhibit unexpected behaviors, and verifying their solutions becomes increasingly difficult as tasks grow more abstract.

Paper Summary

Overall, the main characteristics of AZR are as follows:

  1. No human data required, yet performance exceeds SOTA.
  2. Models with initial coding capabilities improve faster.
  3. The three task types are complementary: keeping all of them yields the best performance.
  4. The model gradually develops "intermediate planning via comments" behavior, interleaving step-by-step comments with its code.
  5. Different reasoning tasks elicit different "cognitive behaviors".
  6. Llama models occasionally produce unsettling outputs, raising safety concerns.

Main Tag: Artificial Intelligence

Sub Tags: Large Language Models, Code Generation, Self-Supervised Learning, Reinforcement Learning

