The code, model, and project homepage are now available:
- Project Homepage & Code: https://github.com/wangqinsi1/Vision-Zero
- Paper Link: https://arxiv.org/abs/2509.25541
The authors include Qinsi Wang, Yueqian Lin, and Professors Hai Li and Yiran Chen from Duke University; Bo Liu from the National University of Singapore; Professor Tianyi Zhou from the University of Maryland; and researchers Jing Shi, Kun Wan, and Wentian Zhao from Adobe.
Background Introduction
Although current Vision-Language Models (VLMs) excel at multimodal tasks, their training relies heavily on manually annotated data and carefully designed reinforcement learning (RL) rewards. This dependence creates two problems: data scarcity, because the high cost of multimodal annotation limits the scale and diversity of training data, and a knowledge ceiling, because model capability is bounded by human supervision and therefore struggles to surpass existing human knowledge and strategies.

AlphaGo's self-play technique, in which the model competed against copies of itself and obtained feedback automatically, turned computation into data and removed the reliance on human supervision, allowing AlphaGo to keep improving and break through the limits of human capability. However, because of the multimodal nature of VLMs, systematic research applying self-play to VLMs remains rare. The research team therefore designed a self-play framework adapted to VLM characteristics, called Vision-Zero. This framework has the following features:
- Policy Self-Play Framework: Vision-Zero trains VLMs in an environment modeled after social deduction games, allowing the agent to automatically generate highly complex reasoning data during self-play, without the need for manual annotation.
- Arbitrary Image Input: Unlike previous game-based training frameworks that impose restrictive input conditions, Vision-Zero can start the game from images of arbitrary form. This lets the model gain performance improvements across many different domains and exhibit excellent generalization capabilities.
- Continuous Performance Improvement: The research team proposed the Iterative Self-Play Policy Optimization (Iterative-SPO) algorithm, which alternates between self-play and Reinforcement Learning with Verifiable Rewards (RLVR), alleviating the performance plateaus commonly found in traditional self-play algorithms.
Despite not using any annotated data for training, Vision-Zero surpasses other SOTA post-training methods that rely on annotation across multiple domains such as reasoning, chart question answering, and vision-centric understanding tasks.
From Chessboard to Reality: Generalizing AlphaGo's Self-Play Idea
Self-play, one of OpenAI's key early technical routes, has driven several milestones in the development of artificial intelligence; typical examples include AlphaGo defeating Lee Sedol in 2016 and OpenAI Five beating the Dota 2 world champions OG in 2019. Having watched self-play decisively surpass human intelligence in these specialized domains, many have wondered whether the idea could be applied to more open scenarios. However, taking AlphaGo's self-play from the chessboard to reality requires solving the following challenges:
- The skills learned by the agent to win the game must be highly consistent with the skills required for the target task.
- The game environment must be diverse and complex enough that a wide range of target tasks can satisfy the first condition.
- Skill growth must be scalable: as self-play progresses, the environment should continuously increase difficulty, allowing increasingly stronger agents to emerge, rather than letting the training converge to a fixed ceiling.
Inspired by social deduction games such as "Who is the Spy" and Werewolf, the research team designed a complete set of self-play rules to address these challenges. The specific rules are as follows:
- The game involves n civilians and 1 spy. Players are first informed of their role.
- Each player receives an image. The spy’s image is slightly different from the civilians’ (e.g., an object is missing, added, or modified).
- Clue Phase: Each player observes their image and gives a verbal clue describing the image content (which can be an object description, inferred information, etc.).
- Decision Phase: After multiple rounds of clues, the game enters the decision phase. Players use the clues combined with their own image to vote and identify the spy.
This game is highly strategic and challenging. The spy must infer and disguise based on others' clues to avoid exposure, while civilians must give clues that are accurate yet not too revealing and analyze others' clues for suspicious points. In this way, the agent generates sufficiently long and complex reasoning chains during play. Moreover, as opponents grow stronger, the challenge each agent faces increases, eliciting stronger visual understanding and reasoning capabilities.
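For readers who prefer pseudocode, the loop below is a minimal sketch of one game round under these rules. It assumes a hypothetical `vlm_generate(prompt, image)` call into the policy VLM and simplifies away prompt engineering and multi-round bookkeeping; it is illustrative only, not Vision-Zero's actual implementation.

```python
import random


def play_round(vlm_generate, original_img, edited_img, n_players=4, num_clue_rounds=2):
    """Minimal sketch of one 'Who is the Spy' round.

    vlm_generate(prompt, image) is a hypothetical call into the policy VLM
    being trained; it returns a text response.
    """
    spy = random.randrange(n_players)  # secretly choose which player is the spy
    images = {pid: (edited_img if pid == spy else original_img)
              for pid in range(n_players)}
    transcript = []

    # Clue phase: each player, knowing only their own role and image,
    # gives a short natural-language clue.
    for _ in range(num_clue_rounds):
        for pid in range(n_players):
            role = "spy" if pid == spy else "civilian"
            prompt = (
                f"You are player {pid}, role: {role}. Clues so far: {transcript}. "
                "Give one short clue describing your image without revealing too much."
            )
            transcript.append((pid, vlm_generate(prompt, images[pid])))

    # Decision phase: every player votes on who holds the different image.
    votes = []
    for pid in range(n_players):
        prompt = (
            f"You are player {pid}. Clues so far: {transcript}. "
            "Which player holds the different image? Answer with a single player id."
        )
        votes.append(int(vlm_generate(prompt, images[pid])))

    # The outcome is automatically verifiable: no human annotation is needed
    # to check whether the majority vote caught the spy.
    spy_caught = votes.count(spy) > n_players // 2
    return transcript, votes, spy, spy_caught
```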
Domain-Agnostic Data Input
The game requires only a pair of slightly different images as input to start, and thanks to powerful image editing tools such as ChatGPT or Nano Banana, constructing such pairs is extremely simple and low-cost (a rough construction sketch follows the list below), making the application scope of this framework very broad. The research team used three completely different types of scene images as training data:
- CLEVR Synthetic Scenes: 2000 image pairs were automatically generated with the CLEVR renderer. Each original image contained 4–6 randomly arranged objects; in the modified image, two objects had their color and shape changed.
- Chart Data: 1000 charts were randomly selected from the ChartQA training set as original images, and the corresponding modified images were generated by using Gemini 2.5 Flash to randomly swap numerical attributes within each chart.
- Real-World Images: 1000 image pairs were randomly sampled from the ImgEdit training set, a dataset containing high-quality real-world single-turn image editing pairs.
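As a rough illustration of how such pairs could be built, the sketch below hides a generic editing model behind a hypothetical `edit_image(image_path, instruction)` helper; the edit instructions are illustrative examples, not the authors' exact pipeline.

```python
import random


def build_image_pair(edit_image, original_path):
    """Sketch: derive a (civilian image, spy image) pair from one source image.

    edit_image(image_path, instruction) is a hypothetical wrapper around an
    off-the-shelf editing model (e.g., an image-capable LLM or diffusion editor).
    """
    edits = [
        "remove one small object from the scene",
        "change the color of one object",
        "swap the values of two bars or columns in the chart",
    ]
    instruction = random.choice(edits)
    spy_image = edit_image(original_path, instruction)
    # All civilians see the original; the spy receives the edited copy.
    return original_path, spy_image, instruction
```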
From Local Equilibrium to Sustainable Improvement
Pure self-play training easily gets stuck in a local equilibrium and struggles to explore new reasoning paths, while standalone reinforcement learning likewise tends to saturate once the existing problem set has been mastered. To mitigate these issues, the authors proposed a two-stage alternating training scheme: when performance in the decision phase saturates, training switches to the clue phase to raise game difficulty; otherwise, it switches back to decision-phase training. This method is named Iterative Self-Play Policy Optimization (Iterative-SPO). Experiments show that the two-stage alternating training significantly outperforms single-stage training.
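A minimal sketch of this alternating schedule is shown below. The training and evaluation callables (`train_decision_rlvr`, `train_clue_selfplay`, `decision_accuracy`) are hypothetical placeholders, and the plateau check is a simplified stand-in for the saturation criterion described in the paper.

```python
def iterative_spo(model, env, train_decision_rlvr, train_clue_selfplay,
                  decision_accuracy, max_steps=100, plateau_eps=0.005):
    """Sketch of Iterative Self-Play Policy Optimization (Iterative-SPO).

    Alternates between (a) RLVR on the decision phase, where the vote carries
    a verifiable reward (did the player identify the spy?), and (b) self-play
    on the clue phase, which raises game difficulty once decisions saturate.
    The training/evaluation callables are assumptions for illustration.
    """
    prev_acc, stage = 0.0, "decision"
    for _ in range(max_steps):
        if stage == "decision":
            model = train_decision_rlvr(model, env)   # verifiable reward: correct vote
            acc = decision_accuracy(model, env)
            if acc - prev_acc < plateau_eps:          # decision accuracy has plateaued
                stage = "clue"                        # switch to clue-phase self-play
            prev_acc = acc
        else:
            model = train_clue_selfplay(model, env)   # sharpen clue strategies via self-play
            stage = "decision"                        # then return to decision training
    return model
```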
Experimental Results
Strong Task Generalization Capability. To evaluate whether the VLM trained under the Vision-Zero framework can generalize to broader reasoning and mathematical tasks, the authors tested the model on six benchmark datasets (results shown in Table 1). Experiments show that even without using annotated data for training, Vision-Zero consistently outperforms other SOTA methods requiring annotation across all benchmarks. Specifically, VisionZero-Qwen-7B (CLEVR, Real-World) achieved an improvement of about 3% over the baseline, VisionZero-Qwen-7B (Chart) improved by about 2.8%, while the best existing baseline method improved by only about 1.9%. Notably, baseline methods require extensive training on mathematical and reasoning samples, whereas the Vision-Zero environment did not explicitly include mathematical tasks. It improved logical reasoning solely through natural language strategy games and effectively transferred the learned abilities to broader mathematical and reasoning tasks, even exceeding models specially trained on large-scale task data.
Mitigation of Cross-Capability Negative Transfer. One key difficulty in VLM post-training is cross-capability negative transfer, where training the model on a specific task degrades its performance on other tasks. Table 2 shows that baseline models suffer significant performance drops after post-training on reasoning and mathematical data; MM-Eureka-Qwen-7B, for example, dropped about 10% on ChartQA. In contrast, models trained with Vision-Zero effectively mitigate negative transfer: VisionZero-Qwen-7B (CLEVR) improved markedly on vision tasks while dropping only 0.2% on average across four chart/OCR tasks, and VisionZero-Qwen-7B (Chart) improved on all chart/OCR benchmarks while also gaining an average of 1% on vision tasks. This indicates that Vision-Zero's multi-ability strategy training substantially alleviates the negative transfer common in traditional single-task training.
Implications
Vision-Zero demonstrates the feasibility and great potential of extending self-play from single tasks to general tasks. By constructing an open and scalable game environment, it eliminates reliance on manual annotation and overcomes data and knowledge bottlenecks, enabling sustained capability growth and cross-domain generalization without task-specific training. Furthermore, the two-stage alternating optimization effectively avoids the local-equilibrium problem common in self-play, and models trained via self-play also mitigate the cross-capability negative transfer prevalent in traditional single-task training.