Just now, Meta released its latest open-source world model, V-JEPA 2, claiming it achieves state-of-the-art visual understanding and prediction in the physical world, thereby enhancing the physical reasoning capabilities of AI agents.
Yann LeCun, Meta's VP and Chief AI Scientist, personally unveiled the model. In an official video, he stated that with the help of world models, AI no longer requires millions of training iterations to master a new ability. Instead, world models directly inform AI how the world operates, which can significantly boost efficiency.
For example, the AI can predict that when we scoop something out of one container, the next step is to transfer it into another:
It can even follow an athlete's complex diving motion and break the action down into its component steps:
According to Meta's test data, V-JEPA 2 reduced per-step planning time in test tasks to one-thirtieth of that of Nvidia's Cosmos model, while also achieving a higher success rate. V-JEPA 2 reportedly used over a million hours of video for self-supervised training.
From Meta's perspective, physical reasoning is crucial for building AI agents that operate in the real world and for achieving advanced machine intelligence (AMI), allowing AI agents to truly "think before acting."
Additionally, Meta has released three new benchmarks for evaluating existing models' ability to reason about the physical world from video.
Yesterday, news broke that Meta plans to establish a new AI lab, recruit a 28-year-old Chinese-American genius, and invest $14.8 billion (approximately 106.1 billion RMB) to acquire a 49% stake in Scale AI. Today, Meta announced its new world model, with Yann LeCun presenting Meta AI's key research directions and vision, which seems somewhat like an "advertisement" for recruiting talent.
Paper link:
https://ai.meta.com/research/publications/v-jepa-2-self-supervised-video-models-enable-understanding-prediction-and-planning/
World Models Give AI "Human-like Intuition"
Enhancing AI Agents' Understanding, Prediction, and Planning Capabilities
Understanding the physical laws of the world may not sound complex, but this is an area where AI lags significantly behind humans.
For example, when you throw a ball into the air, you know gravity will pull it back to the ground; when you navigate a crowded, unfamiliar area, you move towards your destination while avoiding bumping into pedestrians or obstacles; when playing hockey, you skate to where the puck will be, not where it currently is.
▲ Judging the trajectory of a basketball
However, AI finds it difficult to master this ability and to build such a "mental model" for understanding the physical world.
Meta's world model primarily aims to strengthen AI agents' three core capabilities: understanding, prediction, and planning.
Key Architectural Innovation Significantly Improves Learning Efficiency
High Performance and Accuracy Simultaneously
Meta uses video to train V-JEPA 2, helping the model learn important laws in the physical world, including how humans interact with objects, how objects move in the physical world, and how objects interact with each other.
V-JEPA 2 reportedly trained on over 1 million hours of video through self-supervised learning.
V-JEPA 2 is a Joint Embedding Predictive Architecture (JEPA) model, which is where the "JEPA" name comes from.
The model includes two main components:
An encoder, which takes in raw video and outputs embeddings that capture semantically useful information about the state of the observed world.
A predictor, which takes in a video embedding together with additional context about what is to be predicted, and outputs the predicted embeddings.
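To make this encoder/predictor split concrete, here is a minimal PyTorch sketch. The layer choices, dimensions, and class names are illustrative assumptions chosen for readability, not Meta's released architecture; the actual encoder is a much larger video transformer trained at scale.

```python
# Minimal sketch of the two JEPA components described above.
# Shapes, layers, and names are illustrative assumptions, not Meta's implementation.
import torch
import torch.nn as nn

class VideoEncoder(nn.Module):
    """Maps a raw video clip to an embedding of the observed world state."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.patchify = nn.Conv3d(3, 64, kernel_size=4, stride=4)  # crude space-time patch embedding
        self.pool = nn.AdaptiveAvgPool3d(1)                        # collapse to one clip-level vector
        self.proj = nn.Linear(64, embed_dim)

    def forward(self, video):                 # video: (B, 3, T, H, W)
        x = self.patchify(video)
        x = self.pool(x).flatten(1)           # (B, 64)
        return self.proj(x)                   # (B, embed_dim)

class Predictor(nn.Module):
    """Takes the current embedding plus extra context (e.g. an action, or a token
    describing what should be predicted) and outputs the predicted embedding."""
    def __init__(self, embed_dim=256, context_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim + context_dim, embed_dim),
            nn.GELU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, embedding, context):
        return self.net(torch.cat([embedding, context], dim=-1))

# Example: encode two short clips and predict the embedding of what comes next.
encoder, predictor = VideoEncoder(), Predictor()
clips = torch.randn(2, 3, 16, 64, 64)         # batch of 2 clips: 16 frames of 64x64 RGB
context = torch.randn(2, 32)                  # e.g. an encoded action or prediction target
state = encoder(clips)                        # (2, 256) embeddings of the observed state
predicted_next = predictor(state, context)    # (2, 256) predicted future embeddings
```

The key design point is that the predictor works entirely on embeddings: it never has to reconstruct pixels, only the abstract state of the scene.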
Because it makes predictions in this embedding space rather than generating pixels, V-JEPA 2 behaves very differently from traditional generative models that predict pixels. According to Meta's test data, it reduced per-step planning time to one-thirtieth of that of the Cosmos model, while also achieving a higher success rate.
V-JEPA 2's capabilities are critical for real-world agents to understand complex movements and temporal dynamics, as well as to predict actions based on contextual cues.
Based on this predictive capability, world models are highly useful for planning a sequence of actions toward a given goal, such as the actions needed to move a cup from its current position on the table to the table's edge.
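As a rough illustration of how such planning can work on top of the encoder and predictor sketched above: sample candidate action sequences, roll each one forward in embedding space with the predictor, and keep the sequence whose predicted end state is closest to the embedding of a goal image. Meta reportedly uses goal-image-conditioned planning of this kind for the action-conditioned variant of V-JEPA 2; the random-shooting search and L1 cost below are simplifying assumptions, not the paper's exact procedure.

```python
# Hedged sketch of goal-directed planning in embedding space (assumed setup).
import torch

def plan_next_action(encoder, predictor, current_clip, goal_clip,
                     horizon=5, num_candidates=128, action_dim=32):
    """Pick an action by searching over candidate action sequences in embedding space."""
    with torch.no_grad():
        z = encoder(current_clip)                   # (1, D) embedding of the current observation
        z_goal = encoder(goal_clip)                 # (1, D) embedding of the desired goal state

        # Random candidate action sequences: (num_candidates, horizon, action_dim).
        actions = torch.randn(num_candidates, horizon, action_dim)

        z = z.expand(num_candidates, -1)            # roll all candidates forward in parallel
        for t in range(horizon):
            z = predictor(z, actions[:, t])         # predicted embedding after taking action t

        cost = (z - z_goal).abs().mean(dim=-1)      # distance between predicted and goal embeddings
        best = cost.argmin()
        return actions[best, 0]                     # return the first action of the best sequence

# Example call, reusing the encoder/predictor sketched earlier:
# next_action = plan_next_action(encoder, predictor,
#                                torch.randn(1, 3, 16, 64, 64),   # current camera clip
#                                torch.randn(1, 3, 16, 64, 64))   # clip showing the goal state
```

In practice the first action is executed, a new observation is taken, and the search is repeated, in the style of receding-horizon control.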
Most AI systems today require specialized training to solve each specific task. V-JEPA 2's self-supervised approach, by contrast, lets it pick up new abilities from only a few examples, achieving strong performance across different tasks and domains.
The model can be deployed on robotic arms to perform object manipulation tasks such as Reach, Grasp, and Pick-and-place, without requiring large amounts of robot data or task-specific training.
According to test data, V-JEPA 2 achieved success rates of 100%, 45%, and 73% for these three types of tasks, respectively.
Yann LeCun Demonstrates World Model Applications
Launches Three Specialized Benchmarks
Yann LeCun also presented some potential application scenarios for world models.
AI agents powered by world models can help visually impaired individuals better perceive the world;
AI agents in MR headsets can provide guidance for more complex tasks, for example, making education more personalized;
AI coding assistants can truly understand how a new line of code will change the state or variables of a program;
World models are also very important for autonomous systems, such as self-driving cars and robots.
Meta believes that world models will usher in a new era for robotics, allowing AI agents in the real world to help with household chores or physical labor without needing astronomical amounts of training data.
In addition to releasing V-JEPA 2, Meta also shared three new benchmarks to help the research community evaluate existing models' ability to learn from and reason about the world through video:
1. IntPhys 2: for testing models' intuitive physics understanding in complex synthetic environments.
2. A Shortcut-aware Video-QA Benchmark for Physical Understanding via Minimal Video Pairs.
3. CausalVQA: A Physically Grounded Causal Reasoning Benchmark for Video Models.
Benchmark links:
IntPhys 2:
https://ai.meta.com/research/publications/intphys-2-benchmarking-intuitive-physics-understanding-in-complex-synthetic-environments/
CausalVQA:
https://ai.meta.com/research/publications/causalvqa-a-physically-grounded-causal-reasoning-benchmark-for-video-models/
Shortcut-aware Video-QA Benchmark:
https://ai.meta.com/research/publications/a-shortcut-aware-video-qa-benchmark-for-physical-understanding-via-minimal-video-pairs/
Conclusion: AI's Perception of the World Accelerates
AI Accelerates from the Digital World to the Physical World
With this second-generation world model, Meta has further improved performance and accuracy, enabling AI agents to carry out tasks in the physical world more efficiently without requiring massive amounts of training data. This direction is undoubtedly one of the focal points of the current AI industry.
As data bottleneck issues become increasingly prominent, achieving breakthroughs at the underlying technical level becomes even more critical. Meta's innovation at the model architecture level is a core advantage of its world model.
With an increasing number of video models being released today, AI is gradually moving from text and images to dynamic videos. AI's speed in understanding and perceiving the world is continuously accelerating. From giants like Nvidia, Meta, and Google to various startups, all are keenly interested in building world models. The battle for world models may become a key highlight in future AI industry technological competition.
Source: Meta Official Website