"Absolute Zero": A Zero-Data, Self-Evolving AI Reasoning Method Surpasses SOTA

A groundbreaking piece of AI research was quietly released recently, challenging our traditional understanding of how artificial intelligence learns. The method, named "Absolute Zero", produces AI systems that rely on no human-annotated data whatsoever, yet surpass existing state-of-the-art models on multiple complex reasoning tasks through self-play and self-evolution. This breakthrough may redefine the future path of AI training.

1. The Bottleneck of Traditional AI Learning: Dependence on Human Data

Current state-of-the-art large language models (LLMs) have made significant progress in reasoning ability, primarily through a method called Reinforcement Learning with Verifiable Rewards (RLVR). However, this approach remains highly dependent on carefully curated, expert-written question-answer datasets.

This dependence brings significant challenges:

(1) Unsustainable human cost: As model capabilities improve, the difficulty of building high-quality datasets grows exponentially.

(2) Development bottlenecks: Similar scalability issues have already emerged in LLM pre-training.

(3) Potential limitations: If AI systems continue to develop and eventually surpass human intelligence, over-reliance on human-designed tasks may limit their capacity for autonomous learning and growth.

2. Absolute Zero: A New Paradigm for AI Self-Evolution

The "Absolute Zero" paradigm proposed by the research team completely overturns this status quo. In this paradigm, the model simultaneously learns to propose tasks that maximize learning potential and effectively solve these tasks, evolving through self-play without relying on any external data.

The core mechanisms of this method include:

(1) Dual Roles: The same model acts as both "Questioner" and "Solver".

(2) Environment Feedback: A code executor serves as a verifiable feedback source, ensuring training stability (a minimal sketch of this executor-as-verifier idea follows this list).

(3) Three Reasoning Modes: The system introduces three complementary reasoning modes: Deduction (predicting outputs), Abduction (inferring inputs), and Induction (synthesizing programs).
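
To make the environment feedback concrete, below is a minimal Python sketch of the executor-as-verifier idea: a task is a (program, input) pair, and correctness is decided by actually running the code. This is an illustration only, assuming every task defines a single function `f`; it is not the paper's actual execution harness.

```python
# Minimal sketch (not the paper's implementation): a Python executor
# as a verifiable reward source. A task is a (program, input) pair;
# running the code supplies the ground-truth label, so no human
# annotation is needed.

def run_program(program_src: str, inp):
    """Execute a task program and apply its function `f` to `inp`."""
    namespace = {}
    exec(program_src, namespace)   # assumption: tasks define exactly one `f`
    return namespace["f"](inp)

def verify(program_src: str, inp, predicted_output) -> bool:
    """Binary reward: did the solver predict the true output?"""
    try:
        return run_program(program_src, inp) == predicted_output
    except Exception:
        return False               # broken programs or inputs earn no reward

# A deduction-style task the model might propose for itself:
task = "def f(x):\n    return sorted(set(x))"
print(verify(task, [3, 1, 3, 2], [1, 2, 3]))  # True -> positive reward
```

Because the executor, rather than a human, supplies the ground truth, every self-proposed task comes with a free, verifiable label.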

Diagram 1

3. Results: Zero-Data Training Surpasses SOTA

Chart 1

Chart 2

The research team developed the "Absolute Zero Reasoner" (AZR) based on this paradigm and conducted extensive experimental evaluations. The results are astonishing:

(1) Despite never being exposed to any domain-specific human-annotated data, AZR's overall performance on math and programming reasoning tasks exceeded that of all previous models.

(2) In code generation tasks, AZR outperformed models trained specifically on curated programming datasets by 0.3 percentage points.

(3) In mathematical reasoning, AZR showed remarkable cross-domain generalization, improving by 15.2 percentage points over the base model.

These results demonstrate that even without human-designed, domain-specific training data, AI systems can develop strong reasoning abilities through self-play.

Chart 3

4. In-Depth Analysis: How Does AZR Work?

Diagram 2

(1) Self-Generated Tasks and Self-Evaluation

AZR uses a single unified large language model to play two roles simultaneously (a toy sketch of one self-play round follows this list):

1) Questioner: Creates new reasoning tasks to promote diversity and broad coverage of the task space.

2) Solver: Attempts to solve these newly posed tasks, receiving feedback from the environment.
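
A toy, self-contained sketch of one such self-play round is below. The `propose` and `solve` functions are random stubs standing in for the LLM so the snippet runs on its own; their signatures are illustrative assumptions, not the paper's interface, and the real system conditions both roles on a buffer of previously generated tasks.

```python
import random

# Seed programs the Questioner can draw from (the real system
# generates and mutates tasks with the LLM itself).
TASK_POOL = [
    "def f(x):\n    return x * 2",
    "def f(x):\n    return x + 3",
]

def propose():
    """Questioner role (stub): pick a program and an input for it."""
    return random.choice(TASK_POOL), random.randint(0, 9)

def execute(src, inp):
    """Environment: run the program to obtain the ground-truth output."""
    ns = {}
    exec(src, ns)
    return ns["f"](inp)

def solve(src, inp):
    """Solver role (stub): guess an answer; in AZR this is the same LLM."""
    return random.randint(0, 20)

program, x = propose()
truth = execute(program, x)                      # verifiable label
reward = 1.0 if solve(program, x) == truth else 0.0
print(f"task: f({x}) -> {truth}, solver reward = {reward}")
```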

(2) Three Core Reasoning Modes

AZR uses a code executor as a flexible interface and verifiable environment, learning through three distinct reasoning modes (a concrete example follows this list):

1) Deduction: Given a program and input, predict the output, capturing step-by-step logical reasoning.

2) Abduction: Given a program and its output, infer a plausible input, akin to trial-and-error or online search.

3) Induction: From a set of input-output examples, synthesize a generalizable program, requiring generalization from partial information.
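
A single (program, input, output) triplet can instantiate all three modes. The example below is illustrative, not taken from the paper:

```python
# One triplet, three task types (illustrative example).
program = "def f(xs):\n    return sum(x * x for x in xs)"
inp, out = [1, 2, 3], 14

# Deduction: show (program, inp), ask for out -> 14.
# Abduction: show (program, out), ask for any input mapping to 14;
#            [1, 2, 3] works, and so does [0, 1, 2, 3].
# Induction: show pairs like ([1, 2, 3], 14) and ([2], 4),
#            ask for a program equivalent to f.

ns = {}
exec(program, ns)
assert ns["f"](inp) == out            # deduction target checks out
assert ns["f"]([0, 1, 2, 3]) == out   # a second valid abduction answer
```

Note that abduction answers need not be unique: any input that reproduces the output is accepted, which is why the mode resembles trial-and-error search.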

(3) Reward Design

The Questioner's reward function encourages generating tasks with meaningful learning potential, neither too simple nor unsolvable (a small sketch of such a reward follows this list):

1) Task too simple (success rate = 1): Provides little learning signal.

2) Task too difficult (success rate = 0): Also provides little learning signal.

3) Medium-difficulty tasks: Provide the richest feedback and learning potential.
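
One simple way to encode this, loosely following the paper's learnability-style proposer reward (the exact shaping in the paper may differ), is to estimate the solver's empirical success rate over several attempts and zero out the reward at both extremes:

```python
# Sketch of a learnability-style proposer reward: tasks that are
# always solved or never solved earn nothing; solvable-but-hard
# tasks earn the most.

def proposer_reward(solve_attempts: list) -> float:
    rate = sum(solve_attempts) / len(solve_attempts)
    if rate == 0.0 or rate == 1.0:   # unsolvable or trivial: no signal
        return 0.0
    return 1.0 - rate                # harder (but solvable) pays more

print(proposer_reward([True] * 8))                  # 0.0 (too easy)
print(proposer_reward([False] * 8))                 # 0.0 (too hard)
print(proposer_reward([True, False, False, True]))  # 0.5 (informative)
```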

Chart 4

5. Discovery: Increasingly Human-like Thinking Patterns

During the research process, the team discovered several interesting phenomena:

(1) Coding ability amplifies overall reasoning: The initial Qwen-Coder-7b model scored 3.6 points lower on math than the standard Qwen-7b. After AZR training, however, the code-expert model surpassed the standard model on math by 0.7 percentage points, suggesting that strong coding ability can amplify the reasoning gains from AZR training.

(2) Significant cross-domain transfer: Code-expert models trained with conventional RLVR improved average math accuracy by only 0.65 percentage points, whereas AZR-trained models improved by 10.9 to 15.2 percentage points, demonstrating far stronger generalization in reasoning.

(3) Larger models benefit more: Performance gains scale with model size. The 3B, 7B, and 14B models improved by +5.7, +10.2, and +13.2 points respectively, indicating that scaling up continues to pay off for AZR.

(4) Intermediate planning emerges naturally: When solving code-induction tasks, AZR often interleaves step-by-step plans as comments alongside its code, resembling the ReAct prompting framework. Similar behavior has been observed in much larger formal-math models such as DeepSeek Prover v2 (671B), suggesting that letting models keep intermediate scratchpads while generating long-form answers may help in other domains as well.

(5) Cognitive behavior and token length vary by reasoning mode: Different task types elicit different cognitive behaviors. Response length grows most on abduction tasks, where the model keeps retrying inputs until the output matches, while deduction and induction responses grow more moderately.

6. Outlook: The Beginning of the Experience Era

This research marks a new phase for AI reasoning models: the beginning of the "Experience Era". By letting models not only solve given tasks but also define and evolve their own task distributions, the work shows that this shift can deliver strong performance across diverse reasoning tasks while sharply reducing reliance on privileged resources such as human-curated data.

Future research directions may include:

(1) Exploring more environments as verifiable feedback sources, such as the World Wide Web, formal mathematical languages, world simulators, or even the real world.

(2) Extending to different application areas such as more complex agent tasks or scientific experiments.

(3) Exploring multimodal reasoning models.

(4) Designing more effective exploration/diversity reward mechanisms.

This breakthrough may ultimately free reasoning models from the limitations of human-curated data, ushering in a new era where AI systems continuously evolve through their own experience.

"Absolute Zero" paradigm offers a new approach to AI training, challenging our inherent assumption that AI learning must rely on human data. It demonstrates that AI systems can develop strong reasoning capabilities through self-play and environment feedback, without direct human guidance.

This finding is not only theoretically significant but could also fundamentally change how AI models are trained in practice. As model capabilities continue to improve, human-provided tasks may offer less and less challenge to highly capable or even superintelligent systems, and the "Absolute Zero" paradigm may prove a crucial step toward truly autonomous, self-improving AI.

Paper Title: Absolute Zero: Reinforced Self-play Reasoning with Zero Data

Paper Link: https://arxiv.org/abs/2505.03335

