The AI CUDA Engineer is an agentic system that can produce highly optimized CUDA kernels, reaching 10-100x speedups over common machine learning operations in PyTorch.
At Sakana AI, we believe the path to developing much stronger AI systems is to automate the development of AI using AI. We aim to develop AI systems that can create even more capable and efficient AI systems.
In the past year, we introduced an AI system that can automate the creation of new AI foundation models, at a fraction of the cost. We showed that LLMs can invent more efficient methods to train LLMs. Recently, we proposed the first comprehensive agentic framework for fully automating the entire AI research process in The AI Scientist. This led us to the question: If AI can be used to conduct AI research, can we use AI to research ways to make AI run faster?
Introduction
Just like the human brain, modern AI systems rely heavily on parallel processing, enabled by hardware accelerators such as GPUs. But unlike the human brain, which evolved (biologically and culturally) to operate efficiently under resource constraints, recent advances in AI foundation models have led to large-scale deployment and ever-growing inference-time and energy demands, with exponentially increasing resources required to train and deploy AI models.
We believe that fundamentally, modern AI systems can and should be as efficient as the human brain, and that the best path to achieve this efficiency is to use AI to make AI more efficient! Inspired by our earlier work on The AI Scientist, we are proud to announce The AI CUDA Engineer, the first comprehensive agentic framework for fully automatic CUDA kernel discovery and optimization.
CUDA is a low-level software layer that gives direct access to the NVIDIA GPU’s hardware instruction set for parallel computation. CUDA kernels are functions written in the CUDA language that run on GPUs. By writing instructions directly at the CUDA kernel level, we can achieve much higher performance for AI algorithms. However, working with CUDA requires quite a bit of GPU knowledge, and in practice, most machine learning algorithms are written at a higher level of abstraction, such as PyTorch or JAX.
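To make the contrast concrete, here is a minimal, hand-written sketch (not an output of The AI CUDA Engineer) of how a custom CUDA kernel can stand in for a simple PyTorch operation; the extension and function names are purely illustrative:

```python
# Illustrative only: a hand-written CUDA kernel for element-wise addition,
# compiled at runtime and called from PyTorch via load_inline. The names
# my_add / add_kernel are hypothetical and not part of our framework.
import torch
from torch.utils.cpp_extension import load_inline

cpp_source = "torch::Tensor my_add(torch::Tensor a, torch::Tensor b);"

cuda_source = r"""
#include <torch/extension.h>

// Each CUDA thread computes one element of the output tensor.
__global__ void add_kernel(const float* a, const float* b, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a[i] + b[i];
}

torch::Tensor my_add(torch::Tensor a, torch::Tensor b) {
    auto out = torch::empty_like(a);
    int n = a.numel();
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    add_kernel<<<blocks, threads>>>(
        a.data_ptr<float>(), b.data_ptr<float>(), out.data_ptr<float>(), n);
    return out;
}
"""

# Compile the extension and use the kernel like any other PyTorch function.
ext = load_inline(name="my_add_ext", cpp_sources=cpp_source,
                  cuda_sources=cuda_source, functions=["my_add"])

a = torch.randn(1024, device="cuda")
b = torch.randn(1024, device="cuda")
assert torch.allclose(ext.my_add(a, b), a + b)
```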
The AI CUDA Engineer is an agentic framework that leverages frontier LLMs to automate the conversion of standard PyTorch code into highly optimized CUDA kernels. Through evolutionary optimization, leveraging concepts from evolutionary computation such as ‘crossover’ operations and an ‘innovation archive’ to discover promising ‘stepping stone’ kernels, our proposed framework not only automates the process of converting PyTorch modules to CUDA kernels, but also produces highly optimized kernels that often run significantly faster.
We believe this technology can enable speedups that will accelerate both the training and running (inference) of foundation models like LLMs or other generative AI models, eventually making AI models run much faster on NVIDIA hardware.
The AI CUDA Engineer is able to generate CUDA kernels with speedups of 10-100x over common PyTorch operations. Our framework is also able to produce highly optimized CUDA kernels that are much faster than existing CUDA kernels already commonly used in production (up to 5x speedups).

High-Level Overview of The AI CUDA Engineer Agentic Framework
Stages 1 and 2 (Conversion and Translation): The AI CUDA Engineer first translates PyTorch code into functioning CUDA kernels. We already observe initial runtime improvements at this stage, without explicitly targeting them.
Stage 3 (Evolutionary Optimization): Inspired by biological evolution, our framework utilizes evolutionary optimization (‘survival of the fittest’) to ensure only the best CUDA kernels are produced. Furthermore, we introduce a novel kernel crossover prompting strategy to combine multiple optimized kernels in a complementary fashion.
Stage 4 (Innovation Archive): Just as cultural evolution shaped human intelligence with know-how passed down from our ancestors through millennia of civilization, The AI CUDA Engineer takes advantage of its own past innovations and discoveries, building an Innovation Archive from the ancestry of known high-performing CUDA kernels and using these stepping stones to achieve further translation and performance gains.
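For intuition only, the sketch below shows how such an evolutionary loop over LLM-generated kernels could be organized. The helpers llm_propose_kernel, llm_crossover, and compile_and_benchmark are hypothetical placeholders for LLM calls and the evaluation harness, and the population sizes and selection rules do not reflect our actual pipeline:

```python
# A rough, hypothetical sketch of evolutionary kernel optimization (Stage 3)
# with an innovation archive of stepping stones (Stage 4). All helper
# functions are placeholders, not our actual implementation.

def evolve_kernel(task, generations=10, population_size=4):
    archive = []  # innovation archive: (runtime, kernel) pairs that passed verification
    population = [llm_propose_kernel(task) for _ in range(population_size)]

    for _ in range(generations):
        # Evaluate: keep only kernels that compile and pass correctness checks.
        scored = []
        for kernel in population:
            correct, runtime = compile_and_benchmark(kernel, task)
            if correct:
                scored.append((runtime, kernel))
        scored.sort(key=lambda x: x[0])        # 'survival of the fittest'
        archive.extend(scored[:2])             # store stepping stones

        # Next generation: crossover of strong parents plus fresh proposals
        # conditioned on high-performing kernels retrieved from the archive.
        parents = [kernel for _, kernel in scored[:2]]
        stepping_stones = [k for _, k in sorted(archive, key=lambda x: x[0])[:3]]
        n_cross = population_size // 2
        population = (
            [llm_crossover(parents) for _ in range(n_cross)]
            + [llm_propose_kernel(task, hints=stepping_stones)
               for _ in range(population_size - n_cross)]
        )

    return min(archive, key=lambda x: x[0])    # best (runtime, kernel) discovered
```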
Kernel Runtime Speedups Discovered by the AI CUDA Engineer
The AI CUDA Engineer robustly discovers CUDA kernels for common machine learning operations, with speedups of up to 10-100x over native and compiled kernels in PyTorch. Our approach is also able to convert entire machine learning architectures into optimized CUDA kernels. Here we highlight a couple of significant speedup discoveries made completely autonomously:

More details on these optimized CUDA kernels are available on our interactive website’s leaderboard.
Our approach finds more efficient CUDA kernels for everything from fundamental operations such as matrix multiplication to common deep learning operations, and as of writing, our discovered CUDA kernels achieve state-of-the-art performance on KernelBench.
Technical Report and Dataset Release
We believe that this is just the beginning of the great optimization of AI!
We’re excited to release our new paper, The AI CUDA Engineer: Agentic CUDA Kernel Discovery and Optimization.
In our report:
- We introduce an end-to-end agentic workflow capable of translating PyTorch code to working CUDA kernels, optimizing CUDA runtime performance, and automatically fusing multiple kernels.
- Furthermore, we construct various techniques for enhancing the consistency and performance of the pipeline, including LLM ensembling, an iterative profiling feedback loop, local kernel code-editing, and crossover kernel optimization.
- We show that The AI CUDA Engineer robustly translates more than 230 out of 250 considered torch operations and achieves strong runtime performance improvements for the majority of kernels. Furthermore, our approach is capable of efficiently fusing various kernel operations and can outperform several existing accelerated operations.
- We release a dataset of over 17,000 verified kernels covering a wide range of PyTorch operations.
We highlight some notable examples of discovered CUDA kernels that achieved significant speedups on key computation operations in AI models.
Highlighted AI CUDA Engineer-Discovered Kernels
Leveraging our novel LLM-driven evolutionary kernel optimization procedure, we robustly obtain speedups for a diverse range of operations. More specifically, we outperform PyTorch native runtimes on 81% of the 229 considered tasks. Furthermore, 20% of all discovered CUDA kernels are at least twice as fast as their PyTorch implementations.
The AI CUDA Engineer robustly discovers CUDA kernels that outperform PyTorch implementations.
Below we show a subset of kernels. They highlight the diversity of different operations for which the AI CUDA Engineer can successfully be deployed. This includes normalization methods, loss functions, special matrix multiplications and even entire neural network architectures:
Examples of highly optimized CUDA kernels produced by The AI CUDA Engineer. Please click on the individual thumbnails of the CUDA kernels above to view their details and further analysis, such as runtime speedups, on our interactive website.
The AI CUDA Engineer Archive: A Dataset of 17,000+ Verified CUDA Kernels
A Text Embedding visualization of the AI CUDA Engineer Archive shows that the discovered kernels group into tasks (e.g. MatMul, Pooling, Convolution) and implementation strategies (unrolling, fusing, vectorization). The Archive is openly accessible and can be used for downstream fine-tuning of LLMs.
Along with this paper, we release The AI CUDA Engineer Archive, a dataset consisting of more than 30,000 CUDA kernels generated by The AI CUDA Engineer. It is released under the CC-BY-4.0 license and is accessible via HuggingFace. The dataset includes torch reference implementations; torch, NCU, and Clang-tidy profiling data; multiple kernels per task; error messages; and speedup scores against torch native and compiled runtimes.
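As a minimal sketch, the archive can be loaded with the HuggingFace datasets library roughly as follows; the dataset identifier and the exact column names are assumptions, so please consult the dataset card for the precise schema:

```python
# Assumed dataset identifier; check the HuggingFace dataset card for the exact name.
from datasets import load_dataset

ds = load_dataset("SakanaAI/AI-CUDA-Engineer-Archive")

split = next(iter(ds.values()))   # pick the first available split
print(split.column_names)         # e.g. torch reference, CUDA code, profiling, speedups
print(split[0])                   # inspect a single kernel record
```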
Summary statistics of the AI CUDA Engineer Archive, consisting of more than 30,000 kernels and more than 17,000 correct, verified implementations. Approximately 50% of all kernels improve over the torch native runtime.
We envision that this dataset can enable post-training of open-source models to become better at generating CUDA modules. This includes offline reinforcement learning, preference optimization, and standard supervised fine-tuning.
Explore 17,000+ Kernels in The AI CUDA Engineer Archive
We have also published an interactive website for inspecting more than 17,000 verified kernels and their profiles, including torch, NCU, and Clang-tidy data. You can access our interactive website here.
The website allows you to explore various high-performing kernels across 230 tasks. It comes with a custom leaderboard that can be used to inspect related kernels across experiments and LLMs.
Leaderboard of the Kernels discovered by The AI CUDA Engineer
Furthermore, you can visualize the kernel, retrieve related kernels, download code to verify the implementation and speedup, and view the obtained profiling data. Finally, you can take an in-depth look at the optimization experiment.
Detailed view of an optimized kernel including profiling data, downloading of evaluation scripts, related kernels and discovery experiment details.
Limitations and Bloopers
Combining evolutionary optimization with LLMs is powerful but can also find ways to trick the verification sandbox. We are fortunate that Twitter user @main_horse helped test our CUDA kernels and identified that The AI CUDA Engineer had found a way to “cheat”. The system had found a memory exploit in the evaluation code which, in a small percentage of cases, allowed it to avoid checking for correctness:
We have since made the evaluation harness more robust to eliminate such loopholes and have updated our results.
Our team experienced similar issues in previous work, when The AI Scientist found a way to modify and launch its own evaluation script. Instead of making its code run faster, it simply tried to modify its own code to extend the timeout period! It is well documented in the literature that AI systems often find creative solutions that are surprising or unexpected to their users.
In addition, we observed limitations in frontier LLMs’ ability to effectively utilize TensorCore WMMA capabilities. While LLMs could generate basic CUDA code, they often struggled to implement the specialized matrix multiplication acceleration features offered by modern GPU architectures. This suggests a potential gap in the training data or the models’ understanding of advanced hardware-specific optimizations.
As frontier LLMs, especially those with advanced coding and reasoning capabilities, become more capable, we expect code-optimization systems such as ours to continue to face these challenges. We envision a future where the role of human engineers is to work with code-optimization systems as tools, to produce the best and most reliable results.
Future Implications of The AI CUDA Engineer
The AI revolution is just getting started, and we are just at the very beginning of the transformation cycle. It is our view that today’s LLMs are our generation’s “Mainframe Computers”. We are still in the very early stages of AI, and it is inevitable, due to market competition and global innovation (especially from those innovating with resource constraints), that this technology will become a million times more efficient.
Currently, our AI systems consume immense resources, and if the technology continues to scale without thought for efficiency and energy consumption, the result will not be sustainable. There is no fundamental reason why our AI systems can’t be as efficient (or even more efficient) than human intelligence. We believe that the best path to achieve this greater efficiency is to use AI to make AI more efficient.
This is the direction that Sakana AI is pursuing, and this project is an important step towards making AI a million times faster. Just as early, clunky mainframe computers evolved into modern computing, how we use AI in a few years will look very different from today’s ‘clunky’, inefficient LLMs.
Sakana AI
Want to make the AI that improves AI? Please see our Careers page for more information.