When we first introduced The AI Scientist in our initial preprint, we shared an ambitious vision: an agent powered by foundation models capable of executing the entire machine learning research lifecycle. Soon after, we shared a historic update: the improved AI Scientist-v2 produced the first fully AI-generated paper to pass a rigorous human peer-review process.
Today, we are happy to announce that a paper describing all of this work, along with new insights, has been published in Nature.
This substantial milestone is the result of a close and fruitful collaboration between researchers at Sakana AI, the University of British Columbia (UBC) and the Vector Institute, and the University of Oxford.
Building upon our previous open-source releases, this open-access Nature publication comprehensively details our system’s architecture, outlines several new scaling results, and discusses the promise and challenges of AI-generated science.
Read the full Nature paper here: https://www.nature.com/articles/s41586-026-10265-5
Explore the code and generated papers on GitHub: AI Scientist-v1, AI Scientist-v2
Example sections of a paper produced by The AI Scientist that passed the peer-review process for a workshop at a top-tier international AI conference.
The Journey So Far
Our journey to this publication has spanned roughly 1.5 years, with distinct phases shaped by foundation model developments and our improvements to the system:
- Proving It’s Possible: In our first release, we gave The AI Scientist a starting code template (such as a simple training run for nanoGPT). It autonomously generated novel ideas, created and ran experiments to test those ideas, and wrote a full paper. We also developed The Automated Reviewer, which scored the quality of the paper. This work demonstrated, for the first time, that end-to-end automation of the entire machine learning research process was possible.
- The “Turing Test” of Science: In our second update, we gave the system much more freedom to investigate any broadly defined topic in AI research. We then put the system to the ultimate test: we submitted unedited, fully AI-generated papers to the rigorous, blind, human peer-review process of the ICLR 2025 I Can’t Believe It’s Not Better (ICBINB) workshop. One manuscript achieved an average score of 6.33 (individual scores: 6, 7, 6), surpassing the average human acceptance threshold! The paper scored higher than 55% of human-authored papers. We conducted this experiment with the permission of the workshop organizers, and we had predetermined that we would withdraw any accepted paper prior to publication, which we did.
This new Nature paper consolidates these breakthroughs and dives deep into the underlying foundation model improvements that make them possible. Under the hood, after being given a broad research direction, the system autonomously generates novel research ideas, searches for and reads the relevant literature, designs, programs, and conducts experiments via parallelized agentic tree search, and writes the entire paper (in LaTeX, with feedback on its figures provided by a foundation model with vision capabilities).
Conceptual overview of The AI Scientist workflow, including coming up with research ideas, implementing experiments, executing those experiments, writing the paper, and reviewing it.
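The experiment stage described above can be sketched as a best-first tree search over candidate implementations, where each node is one experiment attempt and children are revisions of their parent. The code below is a minimal, illustrative sketch with toy stand-ins for the LLM revision and scoring steps; none of the function names or the scoring rule come from the actual system.

```python
import heapq
import itertools
from dataclasses import dataclass, field

@dataclass(order=True)
class Node:
    neg_score: float                        # heapq is a min-heap, so store -score
    id: int = field(compare=False)          # tiebreak excluded from ordering
    code: str = field(compare=False)
    depth: int = field(compare=False, default=0)

def evaluate(code: str) -> float:
    """Toy stand-in for running the experiment and scoring its result."""
    return min(1.0, 0.2 + 0.15 * code.count("revise"))

def revise(code: str) -> str:
    """Toy stand-in for an LLM proposing an improved implementation."""
    return code + " revise"

def tree_search(seed_code: str, budget: int = 10, branching: int = 2) -> Node:
    """Repeatedly expand the most promising node with several revisions."""
    ids = itertools.count()
    root = Node(-evaluate(seed_code), next(ids), seed_code)
    frontier = [root]
    best = root
    for _ in range(budget):
        node = heapq.heappop(frontier)      # most promising attempt so far
        for _ in range(branching):          # propose competing revisions
            child_code = revise(node.code)
            child = Node(-evaluate(child_code), next(ids), child_code, node.depth + 1)
            heapq.heappush(frontier, child)
            if child.neg_score < best.neg_score:
                best = child
    return best

best = tree_search("train model")
print(round(-best.neg_score, 2))  # → 1.0 with this toy scorer
```

In the real system the expansion step runs in parallel and scoring involves actually executing the generated code, but the search skeleton (propose, run, score, expand the best branch) has this shape.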
New Results: The Automated Reviewer & Scaling Laws of Science
To evaluate AI-generated science at scale without exhausting human reviewers, we built an Automated Reviewer. We prompted it to act as an Area Chair, ensembling five independent reviews into a final decision based on official NeurIPS guidelines, and benchmarked it against thousands of actual human decisions from the OpenReview dataset. The Automated Reviewer matches human performance: it achieved a balanced accuracy of 69% (comparable to human reviewers) and an F1-score that exceeded the inter-human agreement measured in the well-known NeurIPS 2021 consistency experiment.
The Automated Reviewer matches human review judgments on AI papers published at a top conference (ICLR), including papers published after the model was trained (its “knowledge cutoff”). These results suggest The Automated Reviewer is as reliable as human reviewers at providing review scores for newly written AI papers.
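The ensembling step can be sketched as follows. The review schema, confidence weighting, threshold, and example scores below are all illustrative assumptions, not the system's actual aggregation rule.

```python
from statistics import mean

def area_chair_decision(reviews: list[dict], accept_threshold: float = 6.0) -> dict:
    """Aggregate independent reviews into a single meta-decision.
    Each review is a dict with a 1-10 'score' and a 1-5 'confidence'."""
    weights = [r["confidence"] for r in reviews]
    weighted = sum(r["score"] * w for r, w in zip(reviews, weights)) / sum(weights)
    return {
        "mean_score": round(mean(r["score"] for r in reviews), 2),
        "weighted_score": round(weighted, 2),
        "decision": "accept" if weighted >= accept_threshold else "reject",
    }

# Five hypothetical independent reviews of the same paper.
reviews = [
    {"score": s, "confidence": c}
    for s, c in [(6, 3), (7, 4), (6, 3), (5, 2), (7, 3)]
]
print(area_chair_decision(reviews))
```

A confidence-weighted mean is one common way an Area Chair might discount low-confidence reviews; the actual system ensembles full LLM-written reviews, not just numeric scores.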
Crucially, by using this reviewer to grade papers generated by different foundation models, we discovered a clear scaling law: as the underlying foundation models improve, the quality of the generated papers increases correspondingly. This strongly implies that as compute costs decrease, and model capabilities continue to exponentially increase, future versions of The AI Scientist will be substantially more capable.
The quality of papers generated by The AI Scientist increases when using newer, more intelligent foundation models, as judged by the Automated Reviewer.
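The analysis behind this trend can be sketched as follows: grade batches of generated papers with the Automated Reviewer, then compare mean scores across the underlying foundation models. The model names and scores below are synthetic placeholders used only to show the shape of the computation, not measured results.

```python
from statistics import mean

# Reviewer scores for papers generated with each foundation model
# (synthetic placeholder data, ordered from older to newer models).
scores_by_model = {
    "older-model": [3.1, 3.4, 2.9, 3.6],
    "mid-model":   [4.0, 4.3, 3.8, 4.5],
    "newer-model": [4.9, 5.2, 4.7, 5.4],
}

# Mean reviewer score per model, preserving old-to-new ordering.
means = {model: round(mean(s), 2) for model, s in scores_by_model.items()}

# Does paper quality increase monotonically with model generation?
ordered = list(means.values())
improves = all(a < b for a, b in zip(ordered, ordered[1:]))
print(means, improves)
```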
Limitations and the Road Ahead
While passing human peer review is a breakthrough, The AI Scientist is still in its early days. As we describe in the Nature paper, the system currently exhibits several limitations:
- It occasionally produces naive or underdeveloped ideas.
- It can struggle with deep methodological rigor and complex code implementation.
- It is susceptible to hallucinations or obvious mistakes, such as generating inaccurate citations or duplicating figures in the appendix.
However, there is a clear trend in machine learning: once a new capability starts to work, even with clear limitations, it becomes superhuman surprisingly soon. That is because scale and better core models rapidly push it past human performance levels. Currently, The AI Scientist is limited to computational experiments. But we expect the playbook we’ve published will be adapted to other domains and catalyze scientific advances by making truly open-ended discoveries.
A Paradigm Shift for Scientific Discovery
The ability to automate paper generation raises profound ethical and societal questions—from overwhelming peer-review systems to artificially inflating research credentials. We are committed to developing this technology responsibly, which we feel includes the need to inform the public that AI-generated papers are not only possible, but in some cases match human performance. We proactively withdrew our accepted AI submissions and obtained IRB approval for our experiments. We also watermark all of our papers so it is clear they were AI-generated, a practice we recommend the community adopt. Additionally, we recommend that the scientific community establish clear norms regarding how to treat AI-generated research.
We extend our deepest gratitude to our incredible collaborators, Jeff Clune (University of British Columbia, the Vector Institute, and a CIFAR Chair) and Jakob Foerster (University of Oxford), for their invaluable contributions to this project.
This Nature publication marks the dawn of a new era where the process of discovery is no longer a solely human pursuit. With AI agents acting as tireless companions, we are accelerating toward a future where we can dramatically speed up the pace of scientific breakthroughs. If done safely, systems like The AI Scientist could thus potentially enable everything from curing all diseases and providing abundance for all humans to protecting our environment and exploring the stars.
To learn more about The AI Scientist, please read our Nature paper or check out the open-source code on GitHub.
AI scientists autonomously explore a “tree” of possibilities to discover scientific breakthroughs. Credit: Artwork by CERTO, Inc.
Sakana AI
Interested in joining us?
Please see our career opportunities for more information.
