
Summary
We introduce a new way to teach large language models (LLMs) how to reason: by training teachers that learn to teach rather than learn to solve.

Our Reinforcement-Learned Teacher (RLT) framework trains a teacher model to generate explanations from question-answer pairs, optimized to improve a student model’s understanding. Rather than solving problems from scratch, the teacher is rewarded based on how effectively its explanations help the student recover the correct solutions.
Many advanced reasoning models, like DeepSeek R1, follow a two-stage learning process: first, a teacher model is trained, and then its outputs are used to train a student model, which becomes the final product. As illustrated above, these teacher models are traditionally trained with expensive reinforcement learning (RL), where the model must “learn to solve” complex problems from scratch and is rewarded only when it gets the right answer. This process is slow, costly, and often narrowly focused, and it requires carefully filtering the teacher’s outputs to ensure the student learns effectively.
Our method tackles precisely these challenges. Instead of teaching by solving, our new Reinforcement-Learned Teachers (RLTs) “learn to teach”: they are tasked with producing clear, step-by-step explanations from known solutions, just like great human instructors. Crucially, during training we provide our teachers with the correct answers alongside the questions, and reward them not for solving problems themselves, but for how helpful their explanations are to the student. This feedback loop aligns teacher training with its true purpose (being helpful to students), making it far more effective. It also allows us to use small, efficient models that wouldn’t otherwise be able to solve the problems on their own.

Average performance across the 2024 American Invitational Mathematics Examination (AIME), competition MATH, and the Graduate-Level Q&A benchmarks (GPQA).
The result is surprising. Our compact teachers, with only 7B parameters, are better at teaching reasoning skills than orders-of-magnitude larger LLMs, making advanced AI more affordable and much faster to train. This holds not only for students of the same size (26.3% on our set of tasks vs. 18.9% when using DeepSeek R1, which has 671B parameters) but also for 32B students, much larger than the teacher itself (37.6% vs. 34.4% using R1). We have released our code, research report, and open models to support broader innovation in AI.
The Role of Reinforcement Learning in Reasoning Models

The modern rise of LLMs with advanced reasoning capabilities, such as the DeepSeek R1 models, leveraged a powerful technique called reinforcement learning (RL). Through RL, large and expensive LLMs learn to solve intricate math, coding, and logical problems from scratch, improving through trial and error by making their own past correct attempts more likely (“reinforcing” them). While highly effective, this approach comes with significant drawbacks. Most notably, RL-trained models tend to become narrowly focused: they are good at the tasks they have been trained on, but less capable of generalizing to broader applications.
To work around this limitation, researchers often use a two-stage “Learning to Solve” training process. First, a large teacher model is trained with RL to solve problems. Then, its outputs are carefully filtered and repurposed as training data for a student model, which becomes the final product. This second phase, illustrated above and sketched in code after the list below, is often referred to as distillation or cold-starting. However, two major issues further constrain this process:
- First, reasoning-oriented RL training can only practically be applied to models that are already capable enough to solve challenging tasks. This limits its applicability to only the most expensive teacher LLMs.
- Second, there is a critical misalignment between the teacher’s objective during RL training and its intended role at test time. In “Learning to Solve,” teachers are trained solely to solve problems from scratch rather than to generate clear, instructive outputs suitable for teaching the student models.
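To make the traditional pipeline concrete, here is a minimal sketch of its two stages. It is illustrative only: `teacher_generate` and `supervised_finetune` are hypothetical placeholders standing in for an RL-trained solver and a standard SFT step, not functions of any specific library, and the answer checker is deliberately simplistic.

```python
# Illustrative sketch of the two-stage "Learning to Solve" pipeline.
# teacher_generate and supervised_finetune are hypothetical placeholders,
# not the API of any particular framework.

def answers_match(trace: str, reference: str) -> bool:
    """Toy checker: compare the trace's final line with the reference answer.
    Real pipelines use much more careful answer extraction and verification."""
    return trace.strip().splitlines()[-1].strip() == reference.strip()

def build_cold_start_dataset(teacher_generate, questions, reference_answers):
    """Stage 1: collect and filter solution traces from the RL-trained teacher."""
    dataset = []
    for question, reference in zip(questions, reference_answers):
        trace = teacher_generate(question)      # teacher solves from scratch
        if answers_match(trace, reference):     # keep only correct solutions
            dataset.append({"prompt": question, "completion": trace})
    return dataset

# Stage 2: distill / cold-start the student by supervised fine-tuning on the
# filtered traces (e.g., with any standard SFT trainer).
# student = supervised_finetune(student_base, build_cold_start_dataset(...))
```

The key point is the filtering step: only traces whose final answer is verified correct are kept, which is exactly the costly, lossy step that our approach below sidesteps by giving the teacher the answer up front.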
Learning to Teach with Reinforcement Learning

To overcome the limitations of “Learning to Solve”, we introduce a new class of models inspired by how real teachers work, which we call Reinforcement-Learned Teachers (RLTs). Just like a good teacher doesn’t need to rediscover math theorems to explain them, RLTs are given both the question and the correct answer to each problem in their input prompt. Their job is to connect the dots with helpful, step-by-step explanations that a student model can learn from.
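As a rough illustration of this setup, the snippet below builds the kind of input an RLT receives. The template is a hypothetical example written for this post, not the exact prompt format from our paper.

```python
# Illustrative RLT input: the teacher is given both the question and its
# verified solution, and only has to produce the connecting explanation.
# This template is a hypothetical example, not the paper's exact format.

def rlt_teacher_prompt(question: str, solution: str) -> str:
    return (
        "You are a teacher. Explain, step by step, how to get from the "
        "question to the given solution, so that a student can follow along.\n\n"
        f"Question: {question}\n"
        f"Solution: {solution}\n\n"
        "Explanation:"
    )

# explanation = teacher_model.generate(rlt_teacher_prompt(q, a))
```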
What makes this approach powerful is how we train our teachers. RLTs are trained to maximize the clarity and instructiveness of their explanations, similar to how human teachers gauge student comprehension in classrooms. Specifically, if the student model can easily understand the correct solution given the teacher’s explanation of a problem, that is a signal that the teacher did a good job. We quantify this understanding with the student’s “log probabilities” of the correct solution, a metric analogous to how clearly the student has grasped the lesson. We refer to our paper for the full technical details.
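For readers who want a more concrete picture, here is a minimal sketch of this reward signal, written with Hugging Face transformers as an assumed toolkit. It captures only the core idea described above (the student’s average log-probability of the ground-truth solution given the question and the teacher’s explanation); the model name, prompt format, and the `teaching_reward` function are illustrative, and the complete reward is described in our paper.

```python
# Minimal sketch of the teaching reward: the student's average log-probability
# of the ground-truth solution, conditioned on the question and the teacher's
# explanation. Assumes Hugging Face transformers; model name, prompt format,
# and the function itself are illustrative, not the paper's exact recipe.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

student_name = "Qwen/Qwen2.5-7B-Instruct"  # example student checkpoint
tokenizer = AutoTokenizer.from_pretrained(student_name)
student = AutoModelForCausalLM.from_pretrained(student_name)

def teaching_reward(question: str, explanation: str, solution: str) -> float:
    """Higher when the student finds the correct solution easy to predict."""
    context = f"Question: {question}\nExplanation: {explanation}\nSolution: "
    context_ids = tokenizer(context, return_tensors="pt").input_ids
    solution_ids = tokenizer(solution, return_tensors="pt",
                             add_special_tokens=False).input_ids
    input_ids = torch.cat([context_ids, solution_ids], dim=1)

    with torch.no_grad():
        logits = student(input_ids).logits

    # Log-probability assigned to each next token, given everything before it.
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = input_ids[:, 1:]
    token_logprobs = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)

    # Average only over the positions that correspond to solution tokens.
    num_solution_tokens = solution_ids.shape[1]
    return token_logprobs[:, -num_solution_tokens:].mean().item()
```

During RLT training, a score of this kind is what the teacher’s RL objective pushes up: explanations that make the correct solution more predictable for the student earn higher rewards.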
Our new “Learning to Teach” approach tackles both issues of the traditional “Learning to Solve” framework. First, our training loop aligns teacher training with its true purpose (being helpful to students for distillation/cold-starting), making it far more effective. Second, feeding RLTs both the question and its correct answer allows us to use small, efficient teacher models that wouldn’t otherwise be able to solve the problems on their own. Ultimately, these properties make our method faster, cheaper, and more effective at producing strong reasoning students.
The Unreasonable Effectiveness of Tiny Specialized Teachers
We put our approach to the test by comparing a small RLT model, with just 7 billion parameters, to the best-known methods in the field. These competing methods use much larger models, like DeepSeek R1 and QwQ, combined with extra help from tools like GPT-4o-mini to clean up their outputs before using them to train student models.
Even so, our much smaller RLT outperformed them across multiple challenging benchmarks in math and science (see table below, top group). Using the same Qwen2.5 student models, the same questions, and the same evaluation setup, our RLT delivered better results with far less computational effort. It set a new standard for both efficiency and effectiveness in teaching reasoning to language models.
The results were just as impressive when we scaled up the student. Our 7B teacher successfully trained a 32B student model, more than four times its own size, with excellent outcomes (see table below, bottom group). This shows that small, specialized teachers can transfer deep reasoning skills even to much larger students.

Performance on the 2024 American Invitational Mathematics Examination (AIME), competition MATH, and the Graduate-Level Q&A benchmarks (GPQA). Our RLTs obtain improved performance and complement traditional reinforcement learning for problem solving.
We also found that our approach complements traditional RL. When used as a starting point, our RLT helped the student model reach even higher levels of performance (see the plot below). And from a cost perspective, the difference is dramatic: training the 32B student with our method took less than a day on a single compute node, while traditional RL would have taken months on the same hardware.
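A rough sketch of this combination, using hypothetical placeholder functions rather than any specific library: distill the student on the RLT’s explanations first, then continue with standard answer-rewarded RL from that much stronger starting point.

```python
# Hedged sketch of combining RLT distillation with traditional RL.
# sft_on_explanations and rl_finetune are hypothetical placeholders.

def train_student(student_base, rlt_explanations, rl_tasks):
    # Stage 1: cold-start the student on the RLT's explanations (cheap SFT).
    student = sft_on_explanations(student_base, rlt_explanations)
    # Stage 2 (optional): traditional RL with answer-correctness rewards,
    # starting from the distilled checkpoint rather than from scratch.
    return rl_finetune(student, rl_tasks)
```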

Average performance across the 2024 American Invitational Mathematics Examination (AIME), competition MATH, and the Graduate-Level Q&A benchmarks (GPQA). Our RLTs obtain improved performance and complement traditional reinforcement learning for problem solving.
A qualitative examination reveals clear differences between the explanations produced by our RLTs and the distillation traces from DeepSeek R1. We find that the outputs of this traditional RL-trained model often appear to rely on external tools like calculators, and even include out-of-place language patterns such as humorous comments. In contrast, our RLT explanations are more focused and even add logical steps omitted by R1, using clear and direct language. These enhancements translate to improved learning for our student language models, mirroring the conciseness and clarity of expert human educators.

Compared to the reasoning traces from DeepSeek R1, the explanations from our RLTs avoid confusing language and add additional logical steps to help the students.
The Future: A New Frontier of More Advanced and Cheaper Reasoning Models
Our RLT framework rethinks how we build reasoning models. Rather than training models to solve problems from scratch, we train them to explain known solutions clearly, much like skilled human educators. This shift makes it possible to apply RL in areas once considered too difficult for language models to handle directly.
RLTs could dramatically reduce the cost of training advanced models. Instead of relying on massive systems at every stage, we can train small, specialized teachers and use them to teach much larger models efficiently. This flips the traditional scaling paradigm: the heaviest work is handled by compact, affordable models that unlock powerful capabilities in the students they train.
Looking ahead, this framework hints at something even more intriguing: a model that plays both the teacher and the student roles at once. By generating explanations for its own benefit, such a system could learn how to teach itself better over time. This idea echoes the vision of the Darwin Gödel Machine, where a model evolves through self-reflection and recursive learning.

Sakana AI
Interested in joining us? Please see our career opportunities for more information.