
A paper produced by The AI Scientist-v2 passed the peer-review process at a workshop in a top international AI conference.
Note: We conducted this experiment with the full cooperation of both the ICLR leadership and the organizers of an ICLR workshop. See the Transparency and Ethical Code of Conduct section for a discussion.
Summary
We are proud to announce that a paper produced by The AI Scientist passed the peer-review process at a workshop in a top machine learning conference. To our knowledge, this is the first fully AI-generated paper that has passed the same peer-review process that human scientists go through.
The paper was generated by an improved version of the original AI Scientist, called The AI Scientist-v2. We will be sharing the full details of The AI Scientist-v2 in an upcoming release. This paper was submitted to an ICLR 2025 workshop that agreed to work with our team to conduct an experiment in which AI-generated manuscripts were put through double-blind review. We selected this workshop because of its broad scope, which challenges researchers (and our AI Scientist) to tackle diverse research topics that address practical limitations of deep learning. The workshop is hosted at ICLR, one of the three premier conferences in machine learning and artificial intelligence research, along with NeurIPS and ICML.
We conducted this experiment with the full cooperation of both the ICLR leadership and the organizers of this ICLR workshop. We thank all of them for supporting this research into how AI-generated papers fare in peer review. Furthermore, we also received institutional review board (IRB) approval for this research from the University of British Columbia. Lastly, we plan to give a talk at the ICLR workshop to share our experiences with the AI Scientist project, and particularly its challenges.
We proudly collaborated with the University of British Columbia and the University of Oxford on this exciting project.
Evaluation Process
We worked with the ICLR workshop organizers and agreed to submit 3 AI-generated papers to the workshop for peer review. The reviewers were informed that some of the papers they were reviewing might be AI-generated (3 out of 43 submissions), but not whether any particular paper assigned to them actually was (for details, see the ICLR workshop’s Reviewer Guidelines).
Critically, the AI-generated papers we submitted were entirely generated end-to-end by AI, without any modifications from humans. The AI Scientist-v2 came up with the scientific hypothesis, proposed the experiments to test the hypothesis, wrote and refined the code to conduct those experiments, ran the experiments, analyzed the data, visualized the data in figures, and wrote every word of the entire scientific manuscript, from the title to the final reference, including placing figures and all formatting.
We, as the humans overseeing this research, merely gave it the broad topic to perform research on (because the topic should be relevant to the workshop we submitted to) and picked 3 AI-generated papers to submit. We chose this number following discussions with the workshop organizers to avoid overburdening reviewers.
We looked at the generated papers and submitted the 3 we judged strongest, factoring in both diversity and quality (we conducted our own detailed analysis of these 3 papers; see our analysis section below). Of the 3 papers submitted, two did not meet the bar for acceptance. One paper received an average score of 6.33, ranking approximately in the top 45% of all submissions. These scores are higher than those of many other accepted human-written papers at the workshop, placing the paper above the average acceptance threshold. Specifically, the scores were:
- Rating: 6: Marginally above acceptance threshold
- Rating: 7: Good paper, accept
- Rating: 6: Marginally above acceptance threshold
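For reference, the 6.33 average follows directly from these three ratings: (6 + 7 + 6) / 3 ≈ 6.33.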
However, as we will highlight in the next section about the Importance of Transparency and Ethical Code of Conduct, it was determined ahead of time, as part of our experiment protocol, that even if papers by The AI Scientist were accepted, we would withdraw them before they were actually published. This is because they were AI-generated, and the AI and scientific communities have not yet decided whether we want to publish AI-generated manuscripts in the same venues.
For transparency, because this paper was withdrawn after the peer-review process, the ICLR workshop organizers did not perform any additional meta-review on the paper, as they were already aware of this experiment. Hence, even though the paper received an average score of 6.33, a meta-reviewer (in this case, the workshop organizers) could, in theory, still have rejected it.
The original AI Scientist represented the first time AI generated entire scientific manuscripts. To our knowledge, this is the first time a fully AI-generated paper was good enough to pass a standard scientific peer-review process like the one described.
The AI Scientist-v2, after being given a broad topic to conduct research on, generated a paper titled “Compositional Regularization: Unexpected Obstacles in Enhancing Neural Network Generalization”. This paper reported a negative result that The AI Scientist encountered while exploring novel regularization methods intended to improve the compositional generalization of neural networks. The manuscript received an average reviewer score of 6.33 at the ICLR workshop, placing it above the average acceptance threshold.
Importance of Transparency and Ethical Code of Conduct
We believe it is important for the scientific community to study the quality of AI-generated research, and one of the best ways to do so is to submit a small sample of it to the same rigorous peer-review processes we use to assess human-generated science (provided one has permission from those managing such processes).
As earlier mentioned, we conducted this research with the full cooperation of both the ICLR leadership and the organizers of this ICLR workshop. We thank all of them for supporting this research into how AI-generated papers fare in peer-review. We also received IRB approval from the University of British Columbia for this study.
Furthermore, our AI-generated papers will not be made accessible on OpenReview’s public forum. This is because, for the purpose of this particular experiment, the ICLR conference organizers, the ICLR workshop organizers, and we agreed that AI-generated papers would be withdrawn from further consideration and desk-rejected after the peer-review process was completed.
We as a community also need to develop norms regarding AI-generated science, including when and how to declare that a paper is fully or partially AI-generated, and at what point in the process. We will share more details on these issues in our forthcoming paper, but at a high level, we believe in providing as much transparency as possible regarding what is AI-generated, although there are difficult questions about whether the science should be judged on its own merits first to avoid bias against it.
Going forward, we will continue to exchange views with the research community on the state of this technology, to ensure that it does not evolve into a tool whose sole purpose is to pass peer review, which would substantially undermine the meaning of the scientific peer-review process.
Challenges and Limitations
We note that while our AI Scientist successfully generated work that passed peer review, the venue was a workshop track rather than the main conference track. We also reiterate that only 1 of the 3 generated papers passed the bar for acceptance at this workshop.
Typically, workshop papers present preliminary findings that are less refined than main conference submissions, and in fact many conference papers started off as workshop papers. As we describe in our analysis section below, we, as human AI researchers, also conducted our own internal reviews of the 3 papers and concluded that none of them passed our internal bar for an ICLR conference-track publication.
Acceptance rates for the main conference track at top machine learning conferences such as ICLR, ICML, and NeurIPS are typically in the 20-30% range, while workshops like the one we submitted to, hosted alongside these conferences, have acceptance rates in the 60-70% range. In future work, we intend to improve our process to produce even higher-quality scientific papers that may pass the bar of top-tier conferences.
We would also like to note that The AI Scientist is a system built largely on state-of-the-art large language models, and thus the performance of The AI Scientist is directly tied to the performance of these LLMs. If frontier foundation models keep getting better, as many scientists expect, then The AI Scientist will also keep improving.
Our Analysis of the AI-Generated Papers
In addition to the peer-review process, as human AI researchers, we also conducted our own analysis and reviewed all 3 AI-generated papers. We treated the 3 papers as if they were manuscripts submitted to the main ICLR conference track (which has a higher bar for acceptance), and our team wrote comprehensive reviews for each generated paper.
In addition to our own reviews, we also added inline comments for each AI-generated paper.
We assumed the role of an ICLR conference reviewer, pointing out the issues we found in each paper and suggesting how the author (The AI Scientist) could address them to improve the work. Unlike the workshop review process, this back-and-forth exchange is part of the typical peer-review process at a top conference or journal, where reviewers work together with the authors to improve the work.
The AI Scientist occasionally made embarrassing citation errors. For instance, in one paper we found that it incorrectly attributed “an LSTM-based neural network” to Goodfellow (2016) rather than to the correct authors, Hochreiter and Schmidhuber (1997).
In addition to our reviews and comments, we also provided initial assessment scores for each paper in the initial review phase, following the reviewer guidelines of top ML conferences such as NeurIPS and ICLR.
Furthermore, we also conducted a code review to ensure that the experimental results produced by The AI Scientist-v2 are reproducible, and we checked the papers for errors such as missing figures, missing citations, and formatting issues. To improve scientific accuracy, reproducibility, and statistical rigor in the results, we encouraged The AI Scientist to repeat each of the experiments selected for inclusion in the paper several times.
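As a rough illustration of what this kind of repetition looks like in practice (a minimal sketch, not the actual AI Scientist-v2 code; the run_experiment function and its numbers are hypothetical), an experiment can be repeated over several random seeds and the aggregated result reported instead of a single run:

```python
import random
import statistics

def run_experiment(seed: int) -> float:
    """Stand-in for a single training/evaluation run; returns a metric."""
    rng = random.Random(seed)
    # Hypothetical accuracy with some run-to-run noise.
    return 0.80 + rng.gauss(0, 0.02)

# Repeat the experiment several times and report the mean and spread,
# rather than relying on a single run.
scores = [run_experiment(seed) for seed in range(5)]
print(f"accuracy: {statistics.mean(scores):.3f} ± {statistics.stdev(scores):.3f}")
```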
Ultimately, we concluded that none of the 3 papers, in their current form, passed our internal bar for an accepted ICLR conference-track paper. However, we believe the papers we sent to the workshop contain interesting, original, though preliminary ideas that can be developed further, and hence may qualify for the ICLR workshop track.
We have made our own human reviews available alongside the 3 AI-generated papers in our GitHub repository. We invite you, our readers, to judge these papers for yourselves, and to let us know your feedback or even your own reviews!
The Future of The AI Scientist
We believe the next generations of The AI Scientist will usher in a new era in science. That AI can generate an entire scientific paper that passes peer review at a top-tier ML workshop is a very promising early sign of progress. But this is just the beginning. We expect AI to continue to improve, potentially exponentially. At some point in the future, AI will probably be able to generate papers at and beyond human levels, including at the highest level of scientific publishing. We predict The AI Scientist and systems like it will create papers worthy of acceptance not only at top ML conferences, but also in the top journals in science.
Ultimately, we believe what matters most is not how AI science is judged vs. human science, but whether its discoveries aid in human flourishing, such as curing diseases or expanding our knowledge of the laws that govern our universe. We look forward to helping usher in this era of AI science contributing to the betterment of humanity.