Can AI Solve the Sakana AI Sudoku Puzzle?
Sakana AI is excited to announce the release of a novel reasoning benchmark centered around Sudoku puzzles. This benchmark features puzzles that are exceedingly difficult, even for expert human solvers, presenting a significant challenge for AI reasoning capabilities. To learn more about the technical details, you can visit our GitHub Repo.
We’ve partnered with the popular YouTube channel Cracking The Cryptic to provide thousands of hours of high-quality examples of human puzzle-solving. This immense dataset will be invaluable for training AI reasoning models. Furthermore, the benchmark will include intricate, custom-made Sudoku puzzles from Nikoli, the Japanese company credited with popularizing Sudoku.
The Enduring Appeal of Sudoku
Sudoku is a number-placement puzzle played on a 9×9 grid that is partially filled with numbers. Players aim to fill in the missing numbers so that each row, column, and 3×3 box contains all the numbers from 1 to 9, without any repetition. These puzzles gained immense popularity in Japan in the 1980s, thanks to Nikoli’s puzzle books, and later in the UK in the 2000s, when they began appearing in newspapers.
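To make the classic rules concrete, here is a minimal Python sketch of a checker for a completed grid. The function names `is_valid_solution` and `example_grid` are our own illustrative choices and are not part of the benchmark tooling; the example grid is built from a simple shifting pattern rather than taken from any real puzzle.

```python
from typing import List

Grid = List[List[int]]  # 9 rows of 9 digits each


def is_valid_solution(grid: Grid) -> bool:
    """Return True if every row, column, and 3x3 box holds the digits 1-9 exactly once."""
    digits = set(range(1, 10))
    rows = [set(row) for row in grid]
    cols = [set(grid[r][c] for r in range(9)) for c in range(9)]
    boxes = [
        set(grid[r][c] for r in range(br, br + 3) for c in range(bc, bc + 3))
        for br in range(0, 9, 3)
        for bc in range(0, 9, 3)
    ]
    # Each unit has nine cells, so equalling {1..9} also rules out repeats.
    return all(unit == digits for unit in rows + cols + boxes)


def example_grid() -> Grid:
    """Build a valid filled grid from a shifting pattern (illustrative only)."""
    return [[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)] for r in range(9)]


if __name__ == "__main__":
    print(is_valid_solution(example_grid()))
```

Checking a finished grid is cheap, which is part of what makes Sudoku attractive as a benchmark: a proposed answer can be verified mechanically, even when finding it is hard.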
Since then, Sudoku puzzles have continued to evolve, branching out into numerous variations, which are now called ‘Modern Sudokus’.

An example of a Modern Sudoku puzzle, featuring additional rules (Pierced Butterfly by Awedish).
While computers can solve Sudoku puzzles using computationally intensive search algorithms, and AIs can be trained to solve them, neither approach truly replicates human-like reasoning when problem-solving. Moreover, the most difficult Modern Sudoku puzzles are beyond the capabilities of search algorithms. Training AI for these puzzles is also difficult due to their unique rules and solution paths. These puzzles demand an AI system capable of understanding new rules and applying truly creative reasoning to derive solutions. The central question remains: Can we build an AI that can solve Sudoku puzzles in a way that mirrors how humans approach them?
The Current Frontier in AI: Reasoning Capabilities
Despite significant advancements in AI, the development of robust reasoning capabilities remains an enormous challenge. While modern AI models have shown impressive performance across various domains, they still struggle with tasks that demand creativity or accurate, sustained reasoning over multiple steps. Recent advancements in language models have necessitated the creation of more demanding benchmarks, as advanced tests, including those at the PhD level, are becoming increasingly saturated with the rise of sophisticated reasoning models. The next step requires more rigorous evaluation approaches.
What could the next challenge be? We believe that Modern Sudokus are perfect for this, and below, we will explain why.

Llion Jones presenting at NVIDIA’s GTC 2025 event.
At the NVIDIA GTC 2025 event, Jensen Huang agreed that data from puzzles like Sudoku could be used to train AIs to reason.
A Look to Japan
As an AI research firm with a base in Japan, we often turn to Japanese culture for inspiration with our research, such as our Ukiyo-e style image generation. In this instance, we’re looking to Japanese culture to solve a key problem in AI research: enabling AI models to reason robustly when faced with complex problems. We found a treasure trove of explicit reasoning data in a classic piece of Japanese culture: the logic puzzle, Sudoku, which was popularized by Nikoli in the 1980s.
Sudoku starts with a deceptively simple 9×9 grid, associated with well-known newspapers and weekly brain teaser magazines. However, Sudokus have expanded beyond these traditional definitions, varying in shape and kind, and with new rules. Modern Sudokus typically include unique rules, which require creative reasoning to solve. This presents a unique challenge to AI. Unlike Chess or Go, where the rules are constant, difficult Sudoku puzzles have very specific rules that must be understood correctly before attempting to solve the puzzle. This is more than simply learning how to approach a well-defined puzzle. This requires a kind of meta-reasoning where you have to decide how to approach the puzzle before you actually try to solve it.
At first glance, Sudokus might not seem to offer a diverse benchmark, but the rules of Modern Sudokus have become incredibly diverse. Some examples include puzzles that require deducing the path a rat takes through a maze, moving cars to locations before solving, or violating the constraints visible in the puzzle. Rules like these demand a strong understanding of language, abstract thinking, and strong vision capabilities.

Because of their diverse nature, these Sudokus present the perfect next step and a unique opportunity to push the boundaries of modern foundation models. In fact, difficult Sudokus require even world puzzle-solving champions to spend hours thinking and annotating before attempting to place a single digit into the grid. Yet, not only is each solution unique and every digit immediately verifiable, but there is also a large amount of human reasoning available on the web, which makes training approaches directly applicable.
The New Reasoning Benchmark
Sakana AI is releasing a new reasoning benchmark based on these traditional and modern Sudokus. You can access this new benchmark, along with all the accompanying data and tools here: https://github.com/SakanaAI/Sudoku-Bench
We’ve carefully selected puzzles that demand exceptionally strong reasoning capabilities, including puzzles with unique reasoning requirements, not seen in other puzzles. We have also curated the puzzles to create a smooth ramp from simple Sudokus, which current models can solve easily, to ones that are beyond the reach of today’s strongest reasoning models. This will help us accurately measure progress on this benchmark.
Beyond this benchmark, tens of thousands of puzzles are available on the internet, and more are being created daily. Sakana AI thanks all the talented Sudoku setters who have created these amazing puzzles!
To gauge the capabilities of current reasoning models, we tested several open-source baselines. We’ll keep the GitHub repo updated with the latest results.

A comparison chart of how current reasoning models perform on the Sakana AI Sudoku benchmark.
To give the models a fair chance, we provided them with partially completed puzzles and assessed their ability to finish them. Some models performed reasonably well with this assistance, but the most interesting results lie in the last two columns. Even the most advanced models currently fail to place a single correct digit on average, and OpenAI’s latest reasoning model, ChatGPT o3, is the only one capable of solving any puzzles within the benchmark. Remember that a 5% success rate doesn’t equate to 5% progress, as the puzzles in this benchmark ramp up greatly in difficulty.
See our GitHub Repo for detailed methodology.
Current Limitations in AI Approaches to Sudoku
Contemporary AI systems show a fundamental limitation in their approach to these puzzles. Despite their ability to comprehend new Sudoku rulesets, current reasoning models often falter at the final hurdle. They generate near-complete solutions by placing digits in a series of locally consistent steps. But these models will sometimes choose paths that look valid until a contradiction emerges, forcing them to produce an “almost there” solution. This failure mode highlights a core challenge of modern reasoning models: preserving global consistency across long chains of reasoning.
Human experts approach these puzzles by using exploratory reasoning. They avoid assumptions; they carefully analyze unique constraints and look for the puzzle’s “break-in point,” the critical insight, often embedded intentionally by the puzzle designer, that makes an elegant solution possible. These “break-in points” are a critical part of the reasoning process and are, at present, beyond many state-of-the-art models.
Our Sudoku benchmark is designed to inspire reasoning models to adopt a similar approach.
Partnering with ‘Cracking The Cryptic’
The fact that AI can learn from internet text is remarkable. The issue is that examples of high-quality reasoning are rare on the internet, and even when available, the reasoning is often not written down. This limitation could become a bottleneck in improving the reasoning capabilities of current models. We need a large amount of explicit, step-by-step reasoning data to train AI models to mimic human-like reasoning.
Where can we find such data? Sakana AI is pleased to announce a partnership with Cracking The Cryptic!

Simon Anthony and Mark Goodliffe, the hosts of the popular YouTube channel Cracking The Cryptic.
Cracking The Cryptic is the largest puzzle-solving channel on YouTube, with over 600,000 subscribers. The hosts, Simon Anthony and Mark Goodliffe, regularly feature world-class variant Sudoku puzzles and have released daily videos of themselves solving very difficult puzzles. As they solve, they explain the reasoning behind each step in great detail.
Both Simon and Mark have appeared in the World Sudoku Championships and the World Puzzle Championships. This means that their YouTube channel contains thousands of hours of content filled with World Championship level reasoning! Not only have we extracted the reasoning transcripts from the videos, we have also extracted the actions they take while solving, creating the perfect data for training an AI reasoning model.
Data to Train AI
In summary, Sakana AI and Cracking The Cryptic will release:
- Over 2,500 videos worth of puzzle-solving data
- Over 2,000 hours of high-quality reasoning traces transcribed into text, on the order of 10 million words
- Roughly 2 million actions taken from the solving videos
We are also releasing tools to collect more data, clean it up, and preprocess it for training AI models, so you can start training immediately!
See more at our GitHub Repo.
Beautiful Hand-made Sudokus from Nikoli
We are also very proud to announce that Nikoli, the Japanese puzzle company that gave Sudoku its name, kindly agreed to supply us with 100 hand-made Sudokus for the benchmark.
The reason we decided to ask the Sudoku setters from Nikoli for hand-made puzzles, rather than generating some with a computer, is that hand-made puzzles are much more interesting and require more varied reasoning to solve.

The logos of Sakana AI and Nikoli.
Computers have been able to solve Sudokus for a long time, but typically by brute force, trying a very large number of candidate digits quickly. Our benchmark sets a different challenge: can AI systems develop human-like reasoning approaches? Hand-made Sudokus by Nikoli are designed to have a “beautiful idea” that you, or the AI, will need to find to solve the puzzle without brute force. The elegant insights required to solve hand-crafted puzzles remain beyond the capabilities of current AI systems.
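For contrast with the human approach, the brute-force style mentioned above can be sketched as a plain backtracking search for classic Sudoku. This is a generic textbook technique, not code from the benchmark; the names `can_place` and `solve`, and the use of `0` for an empty cell, are assumptions made for illustration.

```python
from typing import List

Grid = List[List[int]]  # 0 marks an empty cell (an assumed encoding)


def can_place(grid: Grid, r: int, c: int, d: int) -> bool:
    """Check that digit d does not already appear in row r, column c, or their 3x3 box."""
    if any(grid[r][j] == d for j in range(9)):
        return False
    if any(grid[i][c] == d for i in range(9)):
        return False
    br, bc = 3 * (r // 3), 3 * (c // 3)
    return all(grid[i][j] != d for i in range(br, br + 3) for j in range(bc, bc + 3))


def solve(grid: Grid) -> bool:
    """Fill the grid in place by trial and error; return True if a solution was found."""
    for r in range(9):
        for c in range(9):
            if grid[r][c] == 0:
                for d in range(1, 10):
                    if can_place(grid, r, c, d):
                        grid[r][c] = d
                        if solve(grid):
                            return True
                        grid[r][c] = 0  # undo the guess and try the next digit
                return False  # no digit fits here, so an earlier guess was wrong
    return True  # no empty cells remain
```

Note that nothing in this search resembles an insight: it simply guesses and backs up. It also only knows the classic rules, so every new variant ruleset would need its own hand-written constraint checks.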

Example of a beautifully hand-made Nikoli Sudoku puzzle.
Bonus: The Sakana AI Sudoku
As a fun extra, for this project we commissioned a custom Sakana AI Sudoku by Marty Sears, a well-known Sudoku setter whose puzzles often appear on Cracking The Cryptic. This puzzle is called ‘Parity Fish’: any two cells adjacent along the red Sakana AI logo line must contain one even digit and one odd digit.

The ‘Parity Fish’ Sudoku designed for the Sakana AI benchmark.
Normal Sudoku rules apply: Fill the grid with the digits 1-9 so that digits don’t repeat in any row, column, or marked 3×3 box. Two cells adjacent along the lines in the Sakana AI logo must contain one even digit and one odd digit. Two cells connected by a white dot contain consecutive digits. Two cells connected by a black dot contain digits where one is double the other.
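The three extra rules can each be written as a simple test on a pair of digits. The Python sketch below assumes a hypothetical encoding in which each constraint is a list of cell pairs given as (row, column) coordinates; the actual positions of the logo line and dots in ‘Parity Fish’ are not reproduced here, and none of these names come from the benchmark code.

```python
from typing import Callable, List, Tuple

Grid = List[List[int]]
Cell = Tuple[int, int]    # (row, column), zero-indexed
Pair = Tuple[Cell, Cell]  # two cells linked by a line segment or a dot


def parity_ok(a: int, b: int) -> bool:
    """Logo line: one digit is even and the other is odd."""
    return a % 2 != b % 2


def white_dot_ok(a: int, b: int) -> bool:
    """White dot: the two digits are consecutive."""
    return abs(a - b) == 1


def black_dot_ok(a: int, b: int) -> bool:
    """Black dot: one digit is double the other."""
    return a == 2 * b or b == 2 * a


def check_variant_constraints(
    grid: Grid,
    line_pairs: List[Pair],
    white_dots: List[Pair],
    black_dots: List[Pair],
) -> bool:
    """Return True if every listed pair of cells satisfies its rule."""

    def holds(pairs: List[Pair], rule: Callable[[int, int], bool]) -> bool:
        return all(rule(grid[r1][c1], grid[r2][c2]) for (r1, c1), (r2, c2) in pairs)

    return (
        holds(line_pairs, parity_ok)
        and holds(white_dots, white_dot_ok)
        and holds(black_dots, black_dot_ok)
    )
```

Verifying these rules on a finished grid is straightforward; the hard part, for people and models alike, is working out how they interact to force the first few digits.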