Large Models Struggling with Sudoku?! Transformer Author's Startup Releases Ranking: o3 Mini High's "Variant Sudoku" Accuracy Only 2.9%

Wen Le from Aofei Temple, QbitAI Official Account

Large models solve Sudoku puzzles with an overall accuracy of only 15%?!

Following the debut of the first-ever "AI Scientist" that came with ten complete academic papers, Transformer author Llion Jones is stirring things up again with his startup Sakana AI.

This time, Sakana AI has released a ranking of AI models' ability to solve Sudoku puzzles.

The problem set is the company's new benchmark, Sudoku-Bench, which includes modern Sudoku problems ranging from simple 4x4 to complex 9x9, designed to test large models' creative reasoning abilities.

The ranking shows that large models reach an overall accuracy of only 15%, and on 9x9 modern Sudoku even the high-performance model o3 Mini High manages just 2.9%.

The Sudoku-Bench project was showcased at the 2025 NVIDIA GTC developer conference.

NVIDIA CEO Jensen Huang commented:

"Puzzles like Sudoku will help improve AI's reasoning capabilities."

Sudoku-Bench: A New Benchmark Test

Sudoku-Bench, released by Sakana AI in March 2025, is a benchmark made up of Sudoku puzzles of varying difficulty, used to measure AI's multi-level and creative reasoning abilities.

1. Existing Problems: Large Models' "Memory Dependency Syndrome"

Most current reasoning benchmarks have a flaw: large models often complete tasks by memorizing standard answers or fixed patterns, rather than truly applying logical reasoning capabilities.

When encountering problems "similar" to those in the training data, models directly apply memorized solutions instead of deriving answers through logical deduction.

For new rules or unseen patterns, models often cannot cope effectively due to the lack of directly matching memory templates.

Traditional Sudoku games might already be "too simple" for large models; they might have just memorized routines rather than learned how to creatively solve new problems.

2. Solution: Sudoku-Bench "Trips Up" Large Models with "Variant Sudoku"

In recent years, various derivative puzzles with unique rules have emerged.

These "variant Sudoku" puzzles require multi-step and creative reasoning skills, but have only one correct answer. They cannot be solved by memorization; a "breakthrough" must be found through multi-step logical reasoning.

These characteristics make "variant Sudoku" an ideal choice for testing AI reasoning capabilities.

Below is an example of a "variant Sudoku": you not only need to follow the original rules, but numbers arranged along colored lines also need to follow additional rules.

Image
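For readers who want the "original rules" spelled out, here is a minimal sketch in Python of a checker for a completed grid. It is an illustration only, not code from Sudoku-Bench: the function name `is_valid_solution` is made up for this example, and it assumes a square grid whose side is a perfect square (such as 4x4 or 9x9). The extra line rules of a variant puzzle would need their own checks on top of this.

```python
from math import isqrt


def is_valid_solution(grid):
    """Return True if a completed n x n grid obeys the classic Sudoku rules.

    Hypothetical helper for illustration; assumes n is a perfect square.
    """
    n = len(grid)
    box = isqrt(n)
    if box * box != n or any(len(row) != n for row in grid):
        raise ValueError("expected a square grid whose side is a perfect square")

    expected = set(range(1, n + 1))

    # Every row, column, and box must contain each digit exactly once.
    rows = [set(row) for row in grid]
    cols = [{grid[r][c] for r in range(n)} for c in range(n)]
    boxes = [
        {grid[r][c] for r in range(br, br + box) for c in range(bc, bc + box)}
        for br in range(0, n, box)
        for bc in range(0, n, box)
    ]
    return all(unit == expected for unit in rows + cols + boxes)


# A toy 4x4 grid, invented for this example.
solved = [
    [1, 2, 3, 4],
    [3, 4, 1, 2],
    [2, 1, 4, 3],
    [4, 3, 2, 1],
]
broken = [row[:] for row in solved]
broken[0][0] = 2  # introduces a duplicate 2 in the first row

print(is_valid_solution(solved))
print(is_valid_solution(broken))
```

Checking a finished grid is the easy part; the benchmark is hard because the model has to find the logical path to that grid.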

The Sudoku-Bench benchmark includes traditional and modern Sudoku (variant Sudoku) problems, categorized by difficulty, ranging from simple problems that current models can solve to extremely difficult ones that even the most advanced reasoning models cannot handle.

Sudoku-Bench also includes 100 hand-crafted Sudoku puzzles provided by Nikoli (the well-known Japanese puzzle publisher that coined the name "Sudoku").

3. Large Models' "Devastating Defeat": Baseline Experiment Results

After the benchmark was released in March 2025, researchers tested multiple AI models, including advanced large models such as Gemini 2.5 Pro, GPT-4.1, and Claude 3.7.

To give the models a fair chance, the team provided them with partially completed puzzles and evaluated their ability to finish them.

Image

The results show that some models performed quite well with this assistance, but the key results are in the last two columns.

Even the most advanced models could not, on average, place a single correct digit, and OpenAI's latest reasoning model, o3, was the only model capable of solving all puzzles in the benchmark.
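To make the "correct digits placed on average" idea concrete, here is a rough sketch in Python. The article does not say exactly how the team computes this figure, so the scoring rule below (count the digits a model places correctly, stopping at its first mistake, then average over puzzles) is an assumption, and the names `correct_placements` and `average_correct_placements` are hypothetical. The grid and moves are toy values, not benchmark data.

```python
def correct_placements(solution, moves):
    """Count moves that match the solution, stopping at the first wrong digit.

    `moves` is an ordered list of (row, col, digit) tuples, 0-indexed.
    Assumed scoring rule for illustration only.
    """
    count = 0
    for row, col, digit in moves:
        if solution[row][col] != digit:
            break
        count += 1
    return count


def average_correct_placements(runs):
    """Average the per-puzzle counts over a list of (solution, moves) pairs."""
    return sum(correct_placements(sol, moves) for sol, moves in runs) / len(runs)


# Toy 4x4 solution and two invented attempts.
solution = [
    [1, 2, 3, 4],
    [3, 4, 1, 2],
    [2, 1, 4, 3],
    [4, 3, 2, 1],
]
attempt_a = [(0, 0, 1), (1, 1, 4), (2, 2, 3)]  # third move is wrong
attempt_b = [(0, 0, 2)]                        # first move is wrong

print(correct_placements(solution, attempt_a))
print(average_correct_placements([(solution, attempt_a), (solution, attempt_b)]))
```

Under this rule, an average below one means a typical attempt goes wrong on its very first digit.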

The latest ranking shows:

Without tool assistance, the overall accuracy of all models across 100 puzzles was below 15%;

Models fared better on smaller 4x4 grids (40%-73% accuracy), but failed almost entirely on 9x9 grids, where accuracy was close to 0%. Even the high-performance model "o3 Mini High" reached only 2.9%.

Common failure modes include producing incorrect solutions, giving up partway, and wrongly concluding that the rules contradict each other. The problem is most acute on puzzles that require a "breakthrough": models resort to blind guessing and cannot narrow the search space through a chain of logic the way humans do.

The testing team has provided a detailed list of model performance on each puzzle. Interested readers can check the links at the end of the article~

About Sakana AI

Sakana AI was founded in Tokyo in July 2023 by former Google researchers Llion Jones (one of the Transformer authors) and David Ha, focusing on research into foundation models for generating text and images.

Previously, the company open-sourced "AI Scientist" and "AI Reviewer." On its debut, the former independently completed ten full academic papers on topics including diffusion models, Transformers, and reinforcement learning, causing quite a stir.

The latter reviews AI-written papers and suggests improvements, a case of "attacking one's own shield with one's own spear," as the Chinese idiom goes.

The company has also released a new type of AI model called the "Continuous Thought Machine (CTM)," which goes beyond simple pattern recognition by thinking step by step as humans do and learning internal models of the world, allowing it to progressively solve complex problems such as mazes.

Sakana AI has also partnered with Cracking The Cryptic (one of the largest puzzle commentary channels on YouTube). Cracking The Cryptic demonstrates logical solutions to some of the world's best Sudoku puzzles daily.

Sakana AI obtained transcripts of these videos along with data on the actions taken during solving. This material is well suited to training AI reasoning models and will be released together with Sudoku-Bench.

Well-known Sudoku setter Marty Sears also designed a custom puzzle called "Parity Fish" for Sakana AI: any two adjacent digits along the red line tracing the Sakana AI logo must consist of one even and one odd number.

Interested friends can give it a try (the solution process is attached at the end of the article)~

Image
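The Parity Fish rule is simple enough to express in a few lines. The Python sketch below checks whether the digits along a line alternate between odd and even. It is only an illustration: the function name `satisfies_parity_line` is made up, and the grid and line coordinates are toy values, not the actual puzzle or the real path of the logo line.

```python
def satisfies_parity_line(grid, line):
    """Return True if every pair of neighbouring cells on the line mixes parity.

    `line` is an ordered list of (row, col) cells, 0-indexed.
    Hypothetical helper for illustration only.
    """
    digits = [grid[r][c] for r, c in line]
    # One odd plus one even digit always gives an odd sum.
    return all((a + b) % 2 == 1 for a, b in zip(digits, digits[1:]))


# Toy 4x4 grid and two invented lines.
grid = [
    [1, 2, 3, 4],
    [3, 4, 1, 2],
    [2, 1, 4, 3],
    [4, 3, 2, 1],
]
good_line = [(0, 0), (0, 1), (0, 2), (0, 3)]  # digits 1, 2, 3, 4
bad_line = [(0, 0), (0, 1), (1, 1), (1, 2)]   # digits 1, 2, 4, 1

print(satisfies_parity_line(grid, good_line))
print(satisfies_parity_line(grid, bad_line))
```

A human solver uses the same fact in reverse: knowing one cell on the line is odd immediately fixes the parity of every other cell along it.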

Technical Report: https://arxiv.org/abs/2505.16135
Ranking: https://pub.sakana.ai/sudoku/
Github: https://github.com/SakanaAI/Sudoku-Bench
Parity Fish Puzzle: https://sudokupad.app/wsj7iunsg6
Solution Process: https://www.youtube.com/watch?v=JdHSSNKuIzU

References:
[1] https://x.com/SakanaAILabs/status/1926905826465161629
[2] https://sakana.ai/sudoku-bench/
