First Code World Model Ignites AI Community, Enabling "True Reasoning" for Agents, Meta Open-Sources It

Reported by JIQIZHIXIN

Editors: Zenan, Lengmao

Is the architecture of large models about to evolve completely?

Starting last night, the AI community has been studying a magical new species—the Code World Model (CWM).


The first major research initiative from Meta's reorganized AI department is a world model designed for writing code.

Its approach differs from that of "traditional" Large Language Models (LLMs) and rests on the following premise:

When humans plan, we imagine the potential outcomes of different actions in our minds. When we reason about code, we mentally simulate parts of its execution. Current generations of large language models perform poorly in this regard, often struggling with true reasoning and simulation. Could an explicitly trained Code World Model open up new research directions?


Meta's newly released CWM is a 32-billion-parameter open-weight LLM, designed to advance research on world-model-based code generation.

CWM is a dense, decoder-only LLM that supports context lengths up to 131k tokens. Independent of its world modeling capabilities, CWM demonstrates strong performance on general programming and mathematical tasks:

  • SWE-bench Verified (with test-time scaling): pass@1 65.8%

  • LiveCodeBench: 68.6%

  • Math-500: 96.6%

  • AIME 2024: 76.0%


Evidently, while CWM's absolute performance is not yet top-tier, it compares favorably with other models at the roughly 30B-parameter scale.


SWE-bench Verified pass@1 scores

To deepen code understanding beyond what static code alone can teach, the Meta FAIR CodeGen team mid-trained CWM on extensive observation-action trajectories collected from Python interpreters and agentic Docker environments. This was followed by large-scale multi-task reasoning reinforcement learning (RL) in verifiable coding, mathematics, and multi-turn software engineering environments.
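
To make the idea of an observation-action trajectory concrete, here is a minimal sketch of how such data can be harvested from a Python interpreter: each executed source line is treated as an action and the resulting local-variable state as an observation. This only illustrates the shape of the data, not Meta's actual collection pipeline or trace format.

```python
import linecache
import sys

def gcd(a: int, b: int) -> int:
    while b:
        a, b = b, a % b
    return a

trajectory = []  # list of {"action": source line, "observation": locals} steps

def tracer(frame, event, arg):
    # Record line-level events only for the function being traced.
    if event == "line" and frame.f_code.co_name == "gcd":
        action = linecache.getline(frame.f_code.co_filename, frame.f_lineno).strip()
        observation = dict(frame.f_locals)  # local variables before the line executes
        trajectory.append({"action": action, "observation": observation})
    return tracer

sys.settrace(tracer)   # install the tracer
gcd(48, 18)            # run the program we want a trajectory for
sys.settrace(None)     # remove it again

for step in trajectory:
    print(step["observation"], "->", step["action"])
```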

To support further code world modeling research, Meta has released model checkpoints from the mid-training, SFT, and RL stages.
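
If the released checkpoints follow the usual open-weight distribution route, loading one should look roughly like the sketch below. The repository id used here ("facebook/cwm") is an assumption for illustration; consult Meta's official release for the exact names of the mid-training, SFT, and RL checkpoints, and note that a 32B dense model needs substantial GPU memory or quantization.

```python
# Sketch only: the Hugging Face repository id below is an assumption, not confirmed here.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/cwm"  # assumed id; substitute the actual checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

prompt = "def count_vowels(word: str) -> int:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```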


With CWM, Meta proposes a powerful testbed to explore opportunities for world modeling to improve reasoning and planning capabilities in code generation.

This research demonstrates how world models can benefit agentic coding, allowing Python code execution to be progressively simulated, and shows early results of how reasoning can benefit from such simulation.

In this research, Meta appears to have drawn inspiration from how developers actually work. Good programmers mentally rehearse execution before writing code, whereas current LLM-based code generation tools mostly produce "imitations" of relevant code learned from vast amounts of data. The output may look correct, but there is often a gap between imitating code and truly understanding what it does.

An explicitly trained code world model should be able to predict the consequences of its actions, thereby making judgments and effective decisions.

There's an interesting example: large models often make elementary mistakes, such as being unable to count how many "r"s are in "strawberry".


With CWM, one can trace the execution of code that counts the letter "r" in "strawberry". This can be likened to a neural version of pdb (the Python debugger): it can be set to any initial frame state, and the reasoning process can then query this "tool" directly in token space.


CWM's Python tracing format. Given the source code context and trace starting point markers, CWM predicts a series of call stack frames, representing the program state and corresponding execution actions.
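
The strawberry example itself is tiny, but it shows what the model has to simulate. The sketch below spells out, step by step, the frame states a world model would be asked to predict while counting the "r"s; the trace here is produced conventionally in Python and only mirrors the kind of per-step program state described above, not CWM's exact trace syntax.

```python
def count_r(word: str) -> int:
    count = 0
    for ch in word:
        if ch == "r":
            count += 1
    return count

# Spell out the execution the world model must simulate: after each loop
# iteration, print the frame-local state (the "observation" for that step).
word = "strawberry"
count = 0
for i, ch in enumerate(word):
    if ch == "r":
        count += 1
    print(f"step {i}: ch={ch!r} count={count}")

assert count_r(word) == count == 3  # "strawberry" contains three 'r's
```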

The CWM model is trained on a large amount of coding data and customized Python + Bash world modeling data, enabling it to simulate Python function execution and agent interactions in a Bash environment.
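
The Bash side can be pictured the same way. Below is a minimal sketch (not Meta's actual environment) of an agentic trajectory in a shell: each action is a command and each observation is the resulting output and exit code, which is exactly the kind of step a world model is trained to predict before the command is actually run. It assumes python is available on the PATH.

```python
import subprocess

# Illustrative agentic Bash trajectory: action = shell command,
# observation = stdout + exit code produced by running it.
actions = [
    "mkdir -p demo && echo 'print(1 + 1)' > demo/main.py",
    "ls demo",
    "python demo/main.py",  # assumes `python` is on the PATH
]

trajectory = []
for action in actions:
    result = subprocess.run(action, shell=True, capture_output=True, text=True)
    observation = {"stdout": result.stdout.strip(), "returncode": result.returncode}
    trajectory.append((action, observation))
    print(action, "->", observation)
```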


In further experiments conducted by Meta on SWE-bench Verified, CWM achieved state-of-the-art results with and without test-time scaling (tts), scoring 65.8% and 53.9% respectively. Note that the GPT-oss score is calculated on a subset of 477 of the 500 problems.


CWM vs. baseline models on Aider Polyglot, taken from the official leaderboard.


CWM vs. various baseline models on Terminal-Bench, taken from the official leaderboard.


BigOBench results

On tasks involving predicting and generating time and space complexity, CWM was compared against Qwen3-32B (with reasoning capabilities), Qwen3-coder-30B, and Gemma-3-27B. CWM surpassed baseline models on all metrics for time complexity prediction and generation. For space complexity generation, CWM achieved the best pass@1 score in code-only mode and ranked second in other metrics.
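
To illustrate the two task flavors, here is a hypothetical example in the spirit of these benchmarks (not an actual BigOBench item): complexity prediction asks the model to state the bound of given code, while complexity-constrained generation asks it to produce code meeting a target bound.

```python
# Complexity prediction: given the code, the model should answer
# "time O(n^2), extra space O(1)".
def two_sum_naive(nums, target):
    for i in range(len(nums)):
        for j in range(i + 1, len(nums)):
            if nums[i] + nums[j] == target:
                return (i, j)
    return None

# Complexity-constrained generation: "solve two-sum in O(n) time" should
# yield something like the hash-map variant below (O(n) time, O(n) space).
def two_sum_linear(nums, target):
    seen = {}  # value -> index, filled in a single pass
    for i, x in enumerate(nums):
        if target - x in seen:
            return (seen[target - x], i)
        seen[x] = i
    return None

assert two_sum_naive([3, 2, 4], 6) == two_sum_linear([3, 2, 4], 6) == (1, 2)
```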

The Meta team's vision is for code world models to bridge the gap between language-level reasoning and executable semantics.

Ablation studies have shown that world modeling data, Python execution trajectories, and executable Docker environments can directly improve downstream task performance. More broadly, CWM provides a powerful experimental platform to support future research in areas such as zero-shot planning, embodied chain-of-thought, and reinforcement learning with sparse and verifiable rewards.

World models should improve reinforcement learning because agents already familiar with environmental dynamics can focus more on learning which actions lead to rewards. Nevertheless, further research is needed to consistently leverage the advantages of world models across tasks during pre-training. Ultimately, models capable of reasoning about the consequences of their actions will be more efficient in interacting with environments and are expected to expand the complexity of tasks they can handle.

For more details, please refer to the original paper.

Main Tag: Artificial Intelligence

Sub Tags: Large Language Models, Open Source, World Models, Code Generation

