o3-pro Completes 'Sokoban,' Classic Retro Games Become New Benchmarks for Large Models

By Crecy from Aofei Temple | QbitAI Official Account

Sokoban, Tetris… these classic retro games, beloved by humans, have now become new benchmarks for large language models.

o3-pro recently took on these two games and performed remarkably well, outright maxing out what the benchmark could measure.


Specifically, the benchmark's Sokoban maps only went up to the sixth level, a cap o3-pro blew past; its Tetris run ended only because it was forcibly terminated, since o3-pro showed no sign of stopping on its own.

Compared with the previous SOTA, o3, o3-pro outright doubled the score.


Some netizens commented that this benchmark is better suited to testing large models than the LLM Arena.


Classic Mini-Games Become New Benchmarks

The two games challenged by o3-pro come from a benchmark set called Lmgame, which, as the name suggests, involves large models playing games.

The Sokoban that o3-pro took on is a modified version of the 1989 release. The evaluation metric is the total number of boxes pushed onto target locations before the game ends.

This time, however, o3-pro completed every level, creating the impression that "it only scored 100 because the test tops out at 100 points."

But there's no need to worry: the benchmark is dynamically updated. The game maps pushed to the GitHub repository half a month ago contained only four levels, while the original game has over 50.

Before o3-pro's challenge, the best performer was o3, followed closely by o4-mini, and then the latest version of DeepSeek-R1 (0528).


Tetris is scored by adding the number of placed pieces to 10 times the number of cleared lines, accumulated until the game ends.
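Expressed as code, the rule is a simple linear combination; here is a minimal sketch (variable names are ours, not the benchmark's actual implementation):

```python
# Minimal sketch of the Tetris scoring rule described above;
# variable names are hypothetical, not taken from the benchmark's code.
def tetris_score(pieces_placed: int, lines_cleared: int) -> int:
    return pieces_placed + 10 * lines_cleared

print(tetris_score(120, 35))  # 120 + 10 * 35 = 470
```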

Before o3-pro, the best-performing model was also o3, but R1 and o4-mini swapped positions compared to their Sokoban rankings.


In terms of speed, however, o3-pro's play is quite time-consuming, taking several minutes per move.


Additionally, some netizens believe the results might be even better if the model wrote a program to play the games rather than playing them directly.


Besides Sokoban and Tetris played by o3-pro, Lmgame also includes four other games: 2048, Candy Crush, Super Mario Bros., and Ace Attorney.

During testing, an iterative interaction loop is used: the game environment provides the current game state to the large model; the model generates an action based on that state; the action is executed in the game environment; a reward is computed from the execution result; and the updated game state feeds the next round of decision-making.
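To make the loop concrete, here is a minimal sketch with a hypothetical toy environment and model call standing in for the benchmark's actual harness (which lives in the lmgame-org/GamingAgent repository and differs in its details):

```python
import random

class ToyEnv:
    """Dummy environment standing in for a real game such as Sokoban."""
    def __init__(self, turns=10):
        self.turns_left = turns

    def get_state(self):
        return f"turns left: {self.turns_left}"  # serialized game state

    def step(self, action):
        self.turns_left -= 1
        reward = 1 if action == "good" else 0    # reward from execution result
        done = self.turns_left <= 0              # game-over condition
        return reward, done

def query_model(state):
    """Stand-in for an LLM call that maps a game state to an action."""
    return random.choice(["good", "bad"])

def run_episode(env, max_turns=500):
    total_reward = 0
    for _ in range(max_turns):
        state = env.get_state()          # 1. environment provides the state
        action = query_model(state)      # 2. model generates an action
        reward, done = env.step(action)  # 3. action executed, reward computed
        total_reward += reward
        if done:                         # 4. updated state feeds the next round
            break
    return total_reward

print(run_episode(ToyEnv()))
```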


An agent framework is also provided as an auxiliary harness, with modules for perception, memory, and reasoning. To keep evaluation results stable and comparable, the harness also standardizes prompts, reducing performance fluctuations caused by prompt wording.
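A rough sketch of how such a modular agent might be wired up, with hypothetical module and function names and a single standardized prompt template (the real GamingAgent code is organized differently in its details):

```python
class Perception:
    def parse(self, raw_state: str) -> str:
        """Turn the raw game state (text or pixels) into a structured view."""
        return raw_state.strip()

class Memory:
    def __init__(self):
        self.history = []

    def remember(self, state, action):
        self.history.append((state, action))

    def recall(self, k=5):
        return self.history[-k:]  # last k (state, action) pairs as context

class Reasoning:
    # One fixed template for every model keeps runs comparable and reduces
    # fluctuations caused by prompt wording.
    PROMPT_TEMPLATE = "State:\n{state}\n\nRecent moves:\n{history}\n\nNext action:"

    def decide(self, state, history):
        prompt = self.PROMPT_TEMPLATE.format(state=state, history=history)
        return call_llm(prompt)  # hypothetical LLM call

def call_llm(prompt: str) -> str:
    """Stand-in for a real model API call."""
    return "move_up"

# Wiring the modules together for one decision step:
perception, memory, reasoning = Perception(), Memory(), Reasoning()
state = perception.parse("  player at (2,3), box at (2,4)  ")
action = reasoning.decide(state, memory.recall())
memory.remember(state, action)
print(action)
```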


The evaluation method differs according to each game's characteristics and rules:

Super Mario Bros.: The measure is the cumulative horizontal movement distance (in game units) of Mario across all levels, until all three lives are lost or the final level is completed.

2048: The evaluation metric is the sum of all merged tile values, recorded until the board stagnates (no merges or board changes for ten consecutive turns); the final score is 10 times the base-2 logarithm of that sum (see the sketch below).

Candy Crush: The evaluation standard is the total number of candies cleared within a fixed 50 turns.

Ace Attorney: Measured by the total count of correct actions (e.g., submitting evidence, selecting dialogue) across all case levels, until five incorrect decisions are made (i.e., health runs out).

However, none of these game performance metrics consider time as a factor.
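As an illustration, the 2048 metric described in the list above can be computed like this (a minimal sketch with hypothetical helper names, not the benchmark's actual code):

```python
import math

# Sum every merged tile's value, then report 10 x log2 of that sum.
def score_2048(merged_tile_values):
    total = sum(merged_tile_values)  # e.g. merges produced tiles 4, 4, 8, 16
    return 10 * math.log2(total) if total > 0 else 0.0

print(score_2048([4, 4, 8, 16]))  # sum = 32 -> 10 * log2(32) = 50.0
```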

Furthermore, this benchmark is open-source, so if you're interested, you can download it yourself to test models.


Some netizens also commented that they want to see the Pokémon results, and the team said they would arrange it soon.


Speaking of Pokémon, Gemini's playthrough was livestreamed and widely followed across the web, and it successfully completed Pokémon Blue in early May this year.

At the time, Google CEO Sundar Pichai excitedly announced the news right away and released footage of the moment of completion.


Produced by the Team of an LLM Arena Advisor

This project comes from Hao AI Lab at UCSD, affiliated with UCSD's Machine Learning Systems Lab and NLP Lab, led by Assistant Professor Hao Zhang of the Halıcıoğlu Data Science Institute.

Hao Zhang received his bachelor's, master's, and doctoral degrees from South China University of Technology, Shanghai Jiao Tong University, and Carnegie Mellon University, respectively. He then conducted postdoctoral research at UC Berkeley before joining UCSD.


In addition, Hao Zhang co-founded LMSYS and has served as an advisor for the LLM Arena.

LMSYS is a non-profit organization; the LLM Arena and the well-known inference frameworks SGLang and vLLM all came out of it.

Back to Hao AI Lab: the lab has produced multiple open-source projects, of which FastVideo, a video-generation acceleration framework, has the most GitHub stars at 1.5k.


Hao AI Lab also receives funding from Google and NVIDIA. In April this year, NVIDIA donated a DGX B200 to the lab.


References: https://x.com/haoailab/status/1933614723507106226
Project repository: https://github.com/lmgame-org/GamingAgent
Leaderboard: https://huggingface.co/spaces/lmgame/lmgame_bench
Paper: https://arxiv.org/abs/2505.15146

— End —
