
We introduce ALE-Bench, a coding benchmark inspired by hard optimization problems whose true optima are computationally out of reach (e.g., NP-hard). This benchmark enabled the development of our new coding agent, ALE-Agent, designed to tackle hard optimization problems. In May 2025, ALE-Agent placed 21st out of over 1,000 human participants in a live AtCoder Heuristic Contest (AHC), marking a turning point for AI discovery of solutions to hard optimization problems with important real-world applications.
Summary
To what extent can AI automate the discovery of algorithms for hard optimization problems? This question represents AI’s next grand challenge for two reasons: (1) whether AI can optimize real-world industries such as logistics, factory planning, and power-grid balancing, and (2) whether AI can perform the long-horizon, creative reasoning needed to continuously improve, through trial and error, answers to hard problems with no perfect solution.
To answer this question, Sakana AI collaborated with AtCoder to develop ALE-Bench (ALgorithm Engineering Benchmark), the world’s first benchmark built on past problems from the AtCoder Heuristic Contest, one of the largest and most prestigious competition series in this field.
The tasks in this benchmark are hard optimization problems whose true optima are computationally out of reach (e.g., because the underlying problems are NP-hard). Historically, participants spent weeks iteratively refining their programs to push their scores higher.
Unlike most competitive programming benchmarks, which are based on short-duration, pass/fail coding questions with exact solutions (inspired by Olympiad-style programming contests), ALE-Bench encourages iterative solution refinement over long time horizons, which is key to advancing solutions to hard, NP-hard optimization problems.
Overview of ALE-Bench and ALE-Agent
This benchmark enabled the development of our ALE-Agent, a specialized agent for this domain, to demonstrate the potential of AI-driven solutions to hard optimization problems. We deployed ALE-Agent in the wild, entering a live coding contest and competing with top human contestants around the world in real time!
In a real AtCoder contest on May 18th, 2025 (entered with AtCoder’s permission), our ALE-Agent achieved an impressive 21st-place finish among over 1,000 human participants. This marks a turning point for AI discovery of solutions that push the frontiers of hard optimization problems with important real-world applications.

Leaderboard from AtCoder Heuristic Contest #47 (AHC047). The 21st place entry, “fishylene,” is Sakana AI’s agent, ALE-Agent. It competed in real-time under the same rules as over 1,000 human contestants, with AtCoder’s permission.
In this blog post, and in our paper, we analyze ALE-Agent’s performance in the AtCoder contest, including some of the algorithmic findings it made. We discuss what we learned from its performance, highlight the current limitations of ALE-Agent, and present challenges for future work.
Introduction
Combinatorial optimization problems sit at the core of many of society’s essential systems, such as logistics, factory planning, and power-grid stability. Combinatorial optimization is a field of mathematics and computer science focused on finding a high-quality solution from a vast set of possibilities under given constraints. Since the properties of these problems and the most effective solution methods differ for each specific setting, human experts must dedicate considerable time and effort to building tailored algorithms through a process of trial and error. This raises the question: To what extent can AI automate the discovery of these algorithms? If AI could solve such problems, it would have a tremendous impact, unlocking new levels of efficiency across numerous industries.
This connects to a grand challenge in the AI field: how to measure the more general reasoning capabilities of AI. Conventionally, we have used success rates on coding tasks with strictly correct or incorrect answers from platforms like Codeforces. Yet, the performance of modern AI systems is advancing so quickly on these benchmarks that it is already on par with top human competitors and nearing saturation. Existing benchmarks often fall short of adequately measuring capabilities such as creativity, persistent thought processes, and the accumulation of knowledge gained through trial and error. These very aspects are anticipated to be pivotal for the future evolution of AI. Is it feasible to objectively assess these more advanced reasoning abilities?
These questions motivated us to introduce ALE-Bench (ALgorithm Engineering Benchmark) and an AI agent, ALE-Agent. Developed in partnership with AtCoder Inc., ALE-Bench allows AI systems to engage with a rich history of challenges from one of the world’s most prestigious programming competitions in this field. ALE-Agent, a specialized agent designed by Sakana AI for this domain, not only demonstrated outstanding performance on the benchmark but also competed in a live AtCoder contest. There, it achieved an impressive 21st-place finish among more than 1,000 human participants. Through our research with ALE-Bench and ALE-Agent, we have gained insights into the current capabilities and limitations of AI, illuminating the path forward for future research.
ALE-Bench: A Coding Benchmark for Next-Generation Long-horizon and Creative Algorithm Engineering
Designing effective benchmarks for AI and analyzing/interpreting the results requires high-quality data, well-posed problems, and collaboration with domain experts. At Sakana AI, we have previously developed benchmarks like Sudoku-Bench to measure creative reasoning, and EDINET-Bench, a Japanese financial benchmark focused on real-world applications. Now, in partnership with AtCoder Inc., a global leader in competitive programming, we have created ALE-Bench (ALgorithm Engineering Benchmark) to address the challenge of algorithm engineering for optimization problems.
ALE-Bench is built from past AtCoder Heuristic Contests (AHC) hosted by AtCoder Inc. The AHC primarily features optimization problems directly linked to real-world industrial challenges, such as logistics optimization and factory production planning. These are high-quality, difficult problems that can take from several hours to several weeks to solve. The contests are fierce, at times attracting over 1,000 participants, including experts in combinatorial optimization algorithms and professionals working on industrial applications.
Below is an example of a problem from a past contest: “From 1,000 delivery orders, select 50 and determine their delivery route to minimize the total travel distance.” This is a variation of the Traveling Salesperson Problem, a classic combinatorial optimization problem. It is well known that for such problems, exhaustively checking every possible combination (which amounts to roughly 10^200 in this case) is computationally infeasible. Instead, the challenge lies in designing clever algorithms, like simulated annealing, to efficiently discover better solutions.
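To make the flavor of such an approach concrete, below is a minimal simulated-annealing sketch for a simplified version of this task: pick a fixed number of points and order them so the round trip is short. This is our own illustrative simplification (random 2D points, a plain swap/replace neighborhood), not the contest problem or a competitive solution, and a real contest program would evaluate move deltas incrementally instead of recomputing the whole tour.

```python
import math
import random

random.seed(0)

# Simplified setting: 1,000 random 2D points; pick 50 and order them
# so that the round trip visiting them in order is as short as possible.
N, K = 1000, 50
points = [(random.random(), random.random()) for _ in range(N)]

def dist(a, b):
    return math.hypot(points[a][0] - points[b][0], points[a][1] - points[b][1])

def tour_length(route):
    # Full recomputation for clarity; real solutions update the score incrementally.
    return sum(dist(route[i], route[(i + 1) % len(route)]) for i in range(len(route)))

# Initial solution: an arbitrary subset in arbitrary order.
route = random.sample(range(N), K)
used = set(route)
unused = [i for i in range(N) if i not in used]
best = cur = tour_length(route)

T0, T1, iters = 0.5, 0.001, 50_000
for step in range(iters):
    T = T0 * (T1 / T0) ** (step / iters)  # geometric cooling schedule
    if random.random() < 0.5:
        # Move 1: swap the visiting order of two selected points.
        i, j = random.sample(range(K), 2)
        route[i], route[j] = route[j], route[i]
        new = tour_length(route)
        if new < cur or random.random() < math.exp((cur - new) / T):
            cur = new
        else:
            route[i], route[j] = route[j], route[i]  # rejected: undo the move
    else:
        # Move 2: replace a selected point with an unselected one.
        i, j = random.randrange(K), random.randrange(len(unused))
        route[i], unused[j] = unused[j], route[i]
        new = tour_length(route)
        if new < cur or random.random() < math.exp((cur - new) / T):
            cur = new
        else:
            route[i], unused[j] = unused[j], route[i]  # rejected: undo the move
    best = min(best, cur)

print(f"best tour length found: {best:.3f}")
```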

An example problem from a past AtCoder Heuristic Contest. “Food Delivery” from AHC006 (Nov 2021), a problem to determine the optimal selection and route for 50 out of 1000 orders to minimize travel distance. (Rendered with tools provided by AtCoder)
ALE-Bench consists of 40 diverse optimization problems from past AHCs (figure below, left). In this benchmark, we provide the problem statements, visualization tools, a code execution environment, and software for calculating rankings (figure below, right). This setup allows an AI system to simulate the experience of a human participant, enabling fair performance comparisons between humans and AI systems. For more details, please refer to our paper and GitHub.

Overview of ALE-Bench.
Left: ALE-Bench consists of challenging tasks posted in past AtCoder Heuristic Contests: NP-hard optimization problems, such as routing and scheduling, with no known optimum. Submitted programs are ranked by score.
Right: ALE-Bench covers evaluation from bare LLMs to scaffolded agents. An agent receives a task and submits code. Along the way, it can optionally invoke test runs and visualization utilities to iteratively refine its solution, just as a human participant would.
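The refinement loop described above can be sketched roughly as follows. This is a hypothetical outline under our own assumptions: `Task`, `generate_code`, and `run_public_tests` are placeholder names with dummy bodies so the sketch runs, not the actual ALE-Bench or ALE-Agent API.

```python
import random
from dataclasses import dataclass

@dataclass
class Task:
    statement: str          # full problem statement
    time_limit_sec: float   # per-test execution time limit

def generate_code(task: Task, feedback: str) -> str:
    # Placeholder for an LLM call that drafts or revises a solution program.
    # A dummy string keeps the loop below runnable.
    return f"// candidate revised with feedback: {feedback[:40]!r}"

def run_public_tests(code: str, task: Task) -> tuple[float, str]:
    # Placeholder for compiling and running the code on local test cases.
    # A real harness would return the contest score and execution logs.
    return random.random(), f"ran {len(code)} bytes within {task.time_limit_sec}s"

def refine(task: Task, rounds: int = 10) -> str:
    """Iteratively refine a solution, keeping the best-scoring candidate."""
    best_code, best_score, feedback = "", float("-inf"), ""
    for _ in range(rounds):
        code = generate_code(task, feedback)
        score, log = run_public_tests(code, task)
        if score > best_score:
            best_code, best_score = code, score
        # Execution results (scores, timeouts, errors) inform the next draft.
        feedback = f"last score: {score:.3f}\n{log}"
    return best_code

print(refine(Task("Select and route 50 of 1,000 orders...", 2.0)))
```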
While attempts to automate solving combinatorial optimization problems with AI have been limited, the potential for real-world application makes this a worthwhile area of research. Unlike most existing benchmarks that focus on pass/fail outcomes, ALE-Bench tasks require long-horizon reasoning, creativity, and continuous improvement in the pursuit of an unknown optimal solution. This open-ended nature of ALE-Bench makes it a valuable tool not only for optimization but also for advancing the broader field of AI.
ALE-Agent: A Specialized AI Agent for Algorithm Engineering
Along with the benchmark, Sakana AI developed ALE-Agent, an AI agent specializing in algorithm engineering. It is built upon a state-of-the-art AI (Gemini 2.5 Pro) and combines two key approaches: 1) providing domain knowledge, such as frequently used algorithms and techniques, through prompts, and 2) employing a form of inference-time scaling that generates multiple diverse answers to improve performance. The technical details can be found in our paper.
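As a rough illustration of the second ingredient, the sketch below generates many diverse candidates in parallel and keeps the best-scoring one. It is a simplified best-of-N picture under our own assumptions: `HINTS`, `draft_solution`, and `evaluate` are hypothetical placeholders with dummy bodies, and the actual ALE-Agent search strategy is more elaborate (see the paper).

```python
import random
from concurrent.futures import ThreadPoolExecutor

# Hypothetical prompt hints encoding domain knowledge (ingredient 1 above).
HINTS = [
    "Use simulated annealing with a carefully tuned cooling schedule.",
    "Use beam search over partial assignments.",
    "Use greedy construction followed by local search.",
]

def draft_solution(task: str, hint: str, seed: int) -> str:
    # Placeholder for an LLM call that writes a program given a strategy hint.
    return f"// solution for {task!r} using hint '{hint}' (seed {seed})"

def evaluate(code: str) -> float:
    # Placeholder: a real harness would run the program on local test cases
    # and return its contest score.
    return random.random()

def best_of_n(task: str, n_per_hint: int = 8) -> str:
    """Ingredient 2: sample many diverse candidates, keep the highest scorer."""
    jobs = [(hint, seed) for hint in HINTS for seed in range(n_per_hint)]
    with ThreadPoolExecutor(max_workers=8) as pool:
        codes = list(pool.map(lambda hs: draft_solution(task, *hs), jobs))
    return max(codes, key=evaluate)

print(best_of_n("AHC-style routing task"))
```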
With permission from AtCoder Inc., we had our AI agent participate in two real-time contests (AHC046 and AHC047), competing under the same rules as over 1,000 human participants. The agent placed 154th (top 16%) in AHC046 and an outstanding 21st (top 2%) in AHC047.

Evaluation results on ALE-Bench. ALE-Agent showed higher performance than AI models using a standard refinement method.
Furthermore, we conducted evaluations on a broader range of combinatorial optimization problems using ALE-Bench. In addition to ALE-Agent, we assessed various state-of-the-art AI models in a setting where they continuously refined their solutions via self-refinement within four hours (see the graph above). While the AI models using this standard method performed at a level roughly equivalent to the top 50% of human participants, ALE-Agent achieved performance within the top 6.8%. This is a significant improvement over standalone AI models. For full experimental settings and results, please refer to the paper.
Analysis of ALE-Agent and Insights
ALE-Agent is designed to be competitive at identifying algorithmic improvements to hard optimization problems. When we observed ALE-Agent’s iterative process of improving its solutions, we saw that it often boosted its scores by applying domain knowledge, such as speeding up search algorithms and fine-tuning hyperparameters, just as competitive human experts in this domain would.
In AHC047, the live contest in which ALE-Agent scored in the top 2% of all participants, we can see examples of such iterative innovations. In the examples shown below, ALE-Agent incorporated a Poisson distribution to approximate and accelerate score calculation and devised creative neighborhood-search moves for simulated annealing, both of which were essential to getting a higher score in AHC047.

ALE-Agent devised a method for speeding up score calculation using Poisson approximation. This was an essential strategy for getting a higher score in AHC047. The actual submission code is here (lines 254-276).
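We do not reproduce the agent’s actual score estimator here (see the linked submission); the snippet below only illustrates the general trick under assumed conditions. When an event has a small per-trial probability p and many opportunities n, the number of occurrences is approximately Poisson(n·p), so the probability of at least one occurrence can be estimated with the closed form 1 − exp(−n·p) instead of a much slower exact computation. The comparison below treats trials as independent, which occurrences of a pattern in a sequence are not in general, so this is a motivating illustration only.

```python
import math

def p_at_least_once_independent(p: float, n: int) -> float:
    """Probability of >= 1 success in n independent trials of probability p."""
    return 1.0 - (1.0 - p) ** n

def p_at_least_once_poisson(p: float, n: int) -> float:
    """Poisson approximation: occurrences ~ Poisson(n*p), so P(>=1) = 1 - exp(-n*p)."""
    return 1.0 - math.exp(-n * p)

# The approximation is accurate when p is small and n is large, which is
# exactly the regime of "does this rare pattern appear at least once in a
# very long randomly generated sequence?"
for p, n in [(1e-6, 10**6), (1e-5, 10**5), (1e-3, 10**4)]:
    exact = p_at_least_once_independent(p, n)
    approx = p_at_least_once_poisson(p, n)
    print(f"p={p:g}, n={n}: independent={exact:.6f}, poisson={approx:.6f}")
```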

ALE-Agent progressively grew its neighborhood-search strategy for simulated annealing. This figure presents overviews of the initial (lines 304-342) and final (lines 492-771) solutions. The score improved as the agent incorporated a more diverse and efficient set of moves, allowing better exploration of the solution space, which ultimately lifted its ranking from 82nd to 21st place out of over 1,000 participants.
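In generic terms, “a more diverse set of moves” means the annealer samples from several neighborhood operators rather than just one. The sketch below shows that pattern on an abstract permutation state; it is our own illustrative example (swap, insert, and segment-reverse moves with hand-picked weights), not the agent’s AHC047 neighborhood.

```python
import random

def swap_move(state):
    """Swap two elements."""
    s = state[:]
    i, j = random.sample(range(len(s)), 2)
    s[i], s[j] = s[j], s[i]
    return s

def insert_move(state):
    """Remove one element and reinsert it at another position."""
    s = state[:]
    x = s.pop(random.randrange(len(s)))
    s.insert(random.randrange(len(s) + 1), x)
    return s

def reverse_move(state):
    """Reverse a contiguous segment (2-opt style)."""
    s = state[:]
    i, j = sorted(random.sample(range(len(s)), 2))
    s[i:j + 1] = reversed(s[i:j + 1])
    return s

# Weighted mix of neighborhood operators: cheap, local moves are tried often,
# while larger, more disruptive moves keep the search from stalling.
MOVES = [(swap_move, 0.5), (insert_move, 0.3), (reverse_move, 0.2)]

def propose(state):
    ops, weights = zip(*MOVES)
    return random.choices(ops, weights=weights, k=1)[0](state)

print(propose(list(range(10))))
```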
How did ALE-Agent rank in the top 2% in AHC047? One key reason is the difference in how humans and AI solve problems. In a four-hour contest, a person might refine their code a dozen times at most. In contrast, current AI can make about 100 revisions. Furthermore, our ALE-Agent generates hundreds or even thousands of potential solutions. This ability to create solutions rapidly and in parallel gives AI a significant advantage, particularly in shorter contests. We also discovered that current AI is very good at using “simulated annealing,” a common algorithm in AHCs (example: the agent’s best solution for AHC039, which would have placed 5th in the actual competition).
Limitations, Challenges and Future Work
Despite its successes, ALE-Agent has limitations. We noticed that it sometimes failed to fix bugs, repeatedly exceeded the time limit because it could not properly analyze the complexity of its own code, and persisted in improving parts of the code that contributed little to the score.
While ALE-Agent performed well in four-hour contests and on problems where simulated annealing was a good fit, it struggled with two-week contests and with problems that demanded algorithms other than simulated annealing. It also tended to struggle with designing algorithms that require experimental analysis, that is, trial and error guided by observing the program’s behavior.
For future improvements, one direction is to develop an agent capable of making more reliable improvements, for example by incorporating more of the techniques and tools used by human experts and by enhancing feedback to enable detailed analysis of execution results. Another path is to advance the agent technology itself, for instance by combining these ideas with approaches in which the agent improves itself. Through such refinements, the ultimate goal is to create an AI whose algorithm engineering skills match or even surpass those of the best human algorithm engineers.
We are grateful that AtCoder collaborated with us on the creation of ALE-Bench, and permitted ALE-Agent’s participation in their contests. Our objective here was to understand the current state of AI’s capabilities in algorithm engineering. AtCoder Heuristic Contests should remain an environment where humans can enjoy competing with each other and learning algorithmic and programming skills. Based on the results of this study, AtCoder, in cooperation with Sakana AI, has established new rules for the use of AI in future contests. We believe this represents an important direction for the coexistence and collaboration of AI and humans.
Conclusion
In this work, we developed a new benchmark, ALE-Bench, to measure the algorithm engineering capabilities of AI for hard, combinatorial optimization problems. We also developed a domain-specialized AI agent, ALE-Agent, which achieved remarkable results on the benchmark and in a real-time contest. If AI-driven automation of algorithm discovery becomes a reality, it will trigger a paradigm shift, bringing efficiency gains across numerous industries. Building on the insights from this study, Sakana AI will continue to tackle the challenge of developing AI with even greater algorithm engineering capabilities.
This research was conducted in collaboration with AtCoder Inc. We are deeply grateful for their outstanding expertise and contributions in optimization and algorithms, which were invaluable in providing data, analyzing results, and enabling our AI agent’s participation in their contests.
For more details, please see our paper:

Sakana AI
Interested in joining us? Please see our career opportunities for more information.