Sakana AI super-powers AI reasoning using Japan’s own Sudoku Puzzles

March 21, 2025

Can you solve the Sakana AI Sudoku Puzzle?

Update (May 26, 2025).

For this project, we have also released our:

Technical Report: https://arxiv.org/abs/2505.16135
Leaderboard: https://pub.sakana.ai/sudoku/
GitHub: https://github.com/SakanaAI/Sudoku-Bench

(Original Blog, from March 21, 2025).

We are happy to announce that:

Sakana AI is releasing a challenging new reasoning benchmark based on Sudoku™ puzzles.¹ The hardest puzzles in this benchmark are extremely difficult even for professional puzzle solvers and will present a grand challenge to AI reasoning. Read the technical details on our GitHub Repo.
We have partnered with the popular YouTube channel Cracking The Cryptic to provide thousands of hours of high quality examples of human puzzle solving for the purpose of training AI reasoning models.
The benchmark will include beautifully hand-made sudokus from Nikoli, the Japanese company that named Sudoku.

Introduction

Sudoku is a puzzle that takes place in a 9 by 9 grid partially filled with numbers. The aim is to fill in the missing numbers so that each row, column and 3 by 3 box contains all the numbers from 1 to 9. These puzzles can be highly addictive: they exploded in popularity in Japan in the 1980s thanks to Nikoli’s puzzle books, and again in the UK in the 2000s when they started featuring in newspapers. Since then, the Sudoku puzzle has continued to evolve and can now have widely varied rules; we call these puzzles ‘Modern Sudokus’. An example of such a puzzle is below. Not only must you satisfy the original rules, but the numbers along the colorful lines must also follow extra rules.

Pierced Butterfly by Awedish
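To make the classic rules concrete, here is a minimal Python sketch of a checker for a completed grid. The names `units` and `is_solved` and the list-of-lists grid representation are our own choices for illustration, not part of the benchmark's tooling.

```python
from typing import Iterator, List

Grid = List[List[int]]  # 9 rows of 9 digits, each digit in 1..9


def units(grid: Grid) -> Iterator[List[int]]:
    """Yield the 27 units of a classic Sudoku: 9 rows, 9 columns, 9 boxes."""
    for r in range(9):
        yield list(grid[r])
    for c in range(9):
        yield [grid[r][c] for r in range(9)]
    for br in range(0, 9, 3):
        for bc in range(0, 9, 3):
            yield [grid[r][c] for r in range(br, br + 3) for c in range(bc, bc + 3)]


def is_solved(grid: Grid) -> bool:
    """True if every row, column and 3x3 box contains each of 1..9 exactly once."""
    target = list(range(1, 10))
    return all(sorted(unit) == target for unit in units(grid))


# A simple valid grid built from a shifting pattern (illustrative only).
example = [[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)] for r in range(9)]
print(is_solved(example))
```

Checking a finished grid is cheap, which is part of what makes Sudoku attractive as a benchmark: any proposed solution can be verified immediately.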

While computers have long been able to solve Sudoku puzzles using computationally intensive search algorithms, and AIs can be trained to solve them, neither approach truly replicates human-like reasoning when problem-solving. Furthermore, the most challenging Modern Sudoku puzzles defy search algorithms, and training AI for these puzzles is difficult due to their unique rules and solution paths. These puzzles demand an AI capable of comprehending new rules and reasoning creatively to uncover the solution. The question remains: can we develop an AI that solves Sudoku puzzles in a manner akin to humans?
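For contrast, the kind of search algorithm mentioned above can be sketched in a few lines of Python. This is a plain backtracking search for classic Sudoku only, written by us as an illustration (the names `can_place` and `solve` are hypothetical). It tries digits one by one and undoes a choice when it leads to a dead end, which is nothing like how a human solver reasons.

```python
from typing import List

Grid = List[List[int]]  # 9x9 grid, 0 marks an empty cell


def can_place(grid: Grid, r: int, c: int, d: int) -> bool:
    """True if digit d does not already appear in row r, column c, or their box."""
    if d in grid[r]:
        return False
    if any(grid[i][c] == d for i in range(9)):
        return False
    br, bc = 3 * (r // 3), 3 * (c // 3)
    return all(grid[i][j] != d for i in range(br, br + 3) for j in range(bc, bc + 3))


def solve(grid: Grid) -> bool:
    """Fill the grid in place by backtracking; return True if a solution was found."""
    for r in range(9):
        for c in range(9):
            if grid[r][c] == 0:
                for d in range(1, 10):
                    if can_place(grid, r, c, d):
                        grid[r][c] = d
                        if solve(grid):
                            return True
                        grid[r][c] = 0  # undo the guess and try the next digit
                return False  # no digit fits this cell: backtrack
    return True  # no empty cells left
```

A search like this needs the rules hard-coded in `can_place`. A Modern Sudoku with a new ruleset would need a new, hand-written constraint check for every puzzle, which is exactly the step we want an AI to work out for itself.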

The Current Frontier in AI: Reasoning Capabilities

Despite significant advancements in artificial intelligence technologies, a critical challenge persists in the development of robust reasoning capabilities. While recent models such as OpenAI’s ChatGPT o3 and DeepSeek’s R1 have demonstrated impressive performance across various domains, they continue to encounter limitations when confronted with tasks requiring sustained accurate reasoning over many steps or reasoning that requires a deep level of creativity.

With the advancement of language models comes the need for increasingly challenging evaluations. Traditional assessment frameworks, including PhD level tests and maths competitions, are becoming saturated with the rise of modern reasoning models, so evaluation methodologies must evolve accordingly. What could be the next challenge? We believe that modern Sudokus are perfect for this, for reasons we will explain below.

Llion Jones’ talk on “The Next Reasoning Benchmark” at GTC 2025

At the NVIDIA GTC 2025 event, during the keynote speech, Jensen Huang agreed that the data for training AIs to reason could come from puzzles like Sudoku.

Looking to Japan

As an AI research company having chosen Japan as our home country, we often look to Japanese culture to inspire our research direction, for example generating Ukiyo-e style images. This time we look to Japanese culture to solve one of the most pressing issues in current AI research: allowing AI models to reason about very hard problems robustly.

We found a treasure trove of explicit reasoning data in a classic piece of Japanese culture: Sudoku, the logic puzzle popularized in Japan in the 1980s by Nikoli. When talking about Sudoku, the first thing that comes to mind is the deceptively simple 9 x 9 grid of numbers originating from Japan, associated with famous newspapers and weekly brain teaser magazines. However, in modern times, Sudokus have evolved to break these traditional conceptions, varying in shape, in kind and in their rules. Modern Sudokus usually have unique rules which require creative reasoning to solve, posing a unique challenge to AI. Unlike Chess or Go, where the rules are always the same, the difficult puzzles each have their own rules, which have to be understood properly before attempting a solve; it is not enough to learn how to solve puzzles that always have the same rules. This requires a kind of meta-reasoning where you have to decide what method you will use to solve the puzzle before you attempt to solve it.

Initially it might seem that Sudokus would not offer a particularly diverse benchmark, but it’s difficult to overstate how diverse the rules of modern Sudokus have become. Below are three illustrative examples: the first puzzle requires deducing the path a rat takes through a maze of teleporters in order to find the cupcake, the second requires moving cars to the correct locations before attempting to solve, and the final puzzle requires violating the constraints visible in the puzzle. These rules demand a strong understanding of language, abstract thinking and strong vision capabilities.

A very varied ruleset. Some examples: (1) RAT RUN 7: Multiple Choice by Marty Sears, (2) Reserved Parking by rockratzero, (3) Chaotic Wrogn by Under Beyond.

Because of their unique nature, these Sudokus present the perfect next milestone and a unique opportunity to advance the reasoning capabilities of modern foundation models. In fact, hard Sudokus require even world puzzle solving champions to spend hours thinking and annotating before attempting to place a single digit into the grid. Yet not only is each solution unique and each digit immediately verifiable, but thousands of hours of content including human reasoning and explanations are available on the web, making both reinforcement and imitation approaches directly applicable.

The New Reasoning Benchmark

Sakana AI is releasing a new reasoning benchmark based on these traditional and modern Sudokus. This new benchmark and all the accompanying data and tools are available here:

https://github.com/SakanaAI/Sudoku-Bench

We have very carefully chosen puzzles that require extremely strong reasoning capabilities, including puzzles with unique reasoning requirements that will not have been seen in other puzzles. We have also curated the puzzles so that there is a smooth ramp from simple Sudokus that current models are able to solve to puzzles that are utterly out of reach of today’s strongest reasoning models, so that progress on this benchmark can be measured accurately.

A smooth ramp from simple to impossible…

Outside of this benchmark there are tens of thousands of puzzles to attempt available on the internet and more are being created every day. Sakana AI thanks all of the talented Sudoku setters that have created all of these amazing puzzles!

In order to gauge the capabilities of current reasoning models we tested several baselines, both open and closed source. We will update our GitHub repo with the latest results. Here are some of the current results measured for our benchmark:

In order to give the models a fair chance, we provided them with partially completed puzzles and assessed their ability to finish them. Some models performed reasonably well with this assistance, but the key results lie in the final two columns. Even the most advanced models currently fail to place even a single correct digit on average, and OpenAI’s latest reasoning model, ChatGPT o3, is the only one capable of solving any puzzles within the benchmark. It’s important to note that a 5% success rate on the benchmark’s puzzles doesn’t equate to 5% progress in solving the benchmark, as these are the simplest puzzles and there is a massive ramp in complexity after that, as described earlier.

Please see our GitHub Repo for details of how these results were obtained.
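As a rough illustration of what "correct digits placed on average" could mean, here is a small Python sketch. It is our own simplified assumption, not the benchmark's actual evaluation code: the function names, the `(row, col, digit)` placement format and the stop-at-first-error rule are all hypothetical, and the real procedure is the one documented in the GitHub repo.

```python
from typing import Iterable, List, Sequence, Tuple

Grid = List[List[int]]
Placement = Tuple[int, int, int]  # (row, col, digit), zero-based row and col


def correct_placements_before_error(solution: Grid, placements: Iterable[Placement]) -> int:
    """Count how many digits a solver placed correctly before its first mistake."""
    count = 0
    for r, c, d in placements:
        if solution[r][c] != d:
            break  # assumption: the attempt ends at the first wrong digit
        count += 1
    return count


def average_correct_placements(scores: Sequence[int]) -> float:
    """Mean number of correct placements over a set of puzzle attempts."""
    return sum(scores) / len(scores) if scores else 0.0
```

Under a metric like this, an average below 1 means that a model's very first digit is wrong more often than not.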

Current Limitations in AI Approaches to Sudoku

Contemporary AI systems demonstrate a fundamental limitation in their approach to Sudoku puzzles. Despite their aptitude for understanding new Sudoku rulesets, current reasoning models often stumble in the final stretch of a problem. They generate near-complete solutions by meticulously placing digits in a series of locally consistent steps. Yet, these models sometimes commit to paths that appear valid until a late-stage contradiction emerges—forcing them to either present the user with an “almost there” solution or, in many cases, conclude that the puzzle is unsolvable and convince the user that the puzzle is underspecified. This failure mode underscores a core challenge of frontier reasoning models: maintaining global consistency over long chains of reasoning.

Human experts approach these puzzles through methodical, exploratory reasoning. They avoid premature assumptions, thoroughly analyze unique constraints, and systematically search for the puzzle’s “break-in point”—the critical insight, typically embedded intentionally by the puzzle designer, that facilitates an elegant solution path. Creative logical deductions, once found, are easy to follow and understand; a rich source of eureka moments in logical reasoning. These “break-in points” are a critical part of the reasoning and are currently beyond many state of the art models. We have carefully selected puzzles for the benchmark that include very interesting and challenging break-ins. Our Sudoku benchmark aims to inspire reasoning models to adopt a similarly deliberate and creative strategy.

Cracking The Cryptic

The fact that AI can learn to reason from being trained on internet text is remarkable. The issue is that examples of high quality reasoning are rare on the internet, and even when they are available, the reasoning that produced the text is often not explicitly written down, meaning the AIs have to infer the reasoning steps behind the text that is written. This limitation might be the bottleneck in improving current reasoning capabilities. If we had access to a large amount of data of humans doing very explicit step by step reasoning, then we would be able to train AI models to mimic human-like reasoning directly. The problem is that this data is very hard to find, would be extremely expensive to collect and is difficult to generate automatically. Where might we be able to find such data?

Sakana AI is extremely excited to announce that we have partnered with Cracking The Cryptic!

Cracking The Cryptic is the biggest puzzle solving channel on YouTube with over 600 thousand subscribers. Every day they give their viewers the chance to tackle world-class variant sudoku puzzles and release videos in which the hosts attempt to logically solve those puzzles themselves. The stars of the channel are Simon Anthony and Mark Goodliffe (pictured above respectively), who have daily videos of themselves solving very difficult puzzles. While solving these puzzles they will explain in great detail step by step exactly the reasoning they used to solve each part of the puzzle. These solves take time, sometimes hours, and they have released thousands of videos since creating their channel. Both Simon and Mark have represented the UK at the World Sudoku Championships and the World Puzzle Championships, meaning that their YouTube channel contains thousands of hours of World Championship level reasoning!

Not only have we extracted the reasoning transcripts from the videos, we have also extracted the actions they take while solving, making it perfect data for training an AI reasoning model. And we are also releasing this data alongside the benchmark!!

Sudoku Solving Actions Data. From Its a secret by Jaxar, Cracking The Cryptic.

In summary, Sakana AI and Cracking The Cryptic will be releasing:

Over 2500 videos’ worth of puzzle solving data.
Over 2000 hours of high quality reasoning traces transcribed into text, on the order of ~10 million words.
Roughly 2 million actions extracted from the solving videos.

We are also releasing tools to collect more data, clean up the data and preprocess the data for training AI models, so you can start training immediately! See more at our GitHub Repo.

Thank You Cracking The Cryptic! You can find this incredible channel on YouTube.

Beautiful Hand-made Sudokus From Nikoli

We are also very proud to announce that Nikoli, the famous Japanese puzzle company that actually gave Sudoku its name, has kindly agreed to supply us with 100 hand-made Sudokus for the benchmark. The reason that we decided to ask the expert Sudoku setters at Nikoli for hand-made puzzles, rather than simply generating some using a computer, is that hand-made puzzles are much more interesting and require more varied kinds of reasoning to solve. Computers have been able to solve Sudokus for a long time, but mostly by using a ‘brute force’ approach: trying many different numbers very quickly.

Our benchmark presents a different challenge: can AI systems develop human-like reasoning approaches? Hand made Sudokus by Nikoli will have a ‘beautiful idea’ that you (or the AI) will need to find in order to solve the puzzle without brute force. The elegant insights required to efficiently solve hand-crafted puzzles remain beyond the capabilities of current AI systems, underscoring the value of our Nikoli-sourced puzzle collection. Thank You Nikoli!

A Beautiful Nikoli Sudoku

Bonus: The Sakana AI Sudoku

As a fun extra, for this project we commissioned a custom Sakana AI Sudoku by Marty Sears, a well known Sudoku setter whose puzzles often appear on Cracking The Cryptic. This puzzle is called ‘Parity Fish’, and any two cells adjacent along the red Sakana AI logo line must contain one even digit and one odd digit. Please try solving this puzzle here.

Or if you are feeling lazy you can watch Simon solving it here. Thank You Marty Sears!

The Sakana AI Sudoku (Parity Fish by Marty Sears). Normal Sudoku rules apply: Fill the grid with the digits 1-9 so that digits don’t repeat in any row, column, and marked 3x3 box. Two cells adjacent along the lines in the Sakana AI logo must contain one even digit and one odd digit. Two cells connected by a white dot contain consecutive digits. Two cells connected by a black dot contain digits where one is double the other.
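The three extra rules in the caption above are simple to state as checks. The Python sketch below is only an illustration: the function names are ours, and the line used in the usage note is a made-up path, not the actual layout of the Parity Fish puzzle.

```python
from typing import Dict, List, Tuple

Cell = Tuple[int, int]  # (row, col), zero-based


def parity_line_ok(digits: Dict[Cell, int], line: List[Cell]) -> bool:
    """Each pair of cells adjacent along the line holds one even and one odd digit."""
    return all((digits[a] + digits[b]) % 2 == 1 for a, b in zip(line, line[1:]))


def white_dot_ok(a: int, b: int) -> bool:
    """Cells joined by a white dot contain consecutive digits."""
    return abs(a - b) == 1


def black_dot_ok(a: int, b: int) -> bool:
    """Cells joined by a black dot contain digits where one is double the other."""
    return a == 2 * b or b == 2 * a
```

For example, with a hypothetical three-cell line `[(0, 0), (0, 1), (1, 1)]` holding the digits 3, 8 and 5, `parity_line_ok` returns True, because the digits alternate odd, even, odd. Writing these checks is the easy part. The hard part is reading the rules in natural language and finding the chain of deductions that solves the puzzle.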

Acknowledgements

Sakana AI thanks all of the amazing setters that have created all of these wonderful puzzles for all of us to enjoy. A particularly big thanks to the setters of the puzzles that appeared in the GTC Talk and this blog:

Sakana AI

Interested in joining us? Please see our career opportunities for more information.

Footnotes

¹ “Sudoku” is a trademark of Nikoli Co., Ltd.