TAID: A Novel Method for Efficient Knowledge Transfer from Large Language Models to Small Language Models

TAID has been accepted as a Spotlight paper at ICLR 2025, a top international conference in machine learning. Using TAID, we have developed TinySwallow-1.5B, a Japanese small language model that achieves state-of-the-art performance among models of similar size.


This is the third in a series of blog posts presenting the results of Sakana AI’s research projects that were supported by the Japanese Ministry of Economy, Trade and Industry’s GENIAC supercomputing grant.


Summary

Large Language Models (LLMs) have achieved remarkable capabilities, performing at human-competitive levels not only in everyday conversation but also in complex tasks like mathematics and coding. As these models become increasingly essential tools, we face a significant challenge: the enormous computational resources they require. This makes it difficult for many companies to develop their own LLMs and impractical to run these models on edge devices like smartphones, ultimately limiting their accessibility and use cases.

At Sakana AI, we have been advancing research to develop and utilize large language models more efficiently through projects like Evolutionary Model Merge and The AI Scientist. In parallel with these efforts, we have focused on developing high-performing yet compact language models. We are excited to introduce TAID (Temporally Adaptive Interpolated Distillation), a novel method for efficiently building high-performing small language models (SLMs), which has been accepted as a Spotlight paper at ICLR 2025.


A demonstration of text generation using TinySwallow-1.5B on an iPhone, running at actual generation speed.


TAID represents a new approach to knowledge distillation, a technique for transferring knowledge from LLMs to SLMs. Unlike existing distillation methods, TAID achieves more efficient and effective knowledge transfer by gradually transferring LLM knowledge based on the student model’s learning progress. To demonstrate TAID’s practical utility, we have developed two models: TAID-LLM-1.5B for English and TinySwallow-1.5B for Japanese. TinySwallow-1.5B, developed in collaboration with the Swallow team at the Institute of Science Tokyo, achieves state-of-the-art performance among similarly sized models through knowledge distillation from a 32B parameter LLM to a 1.5B parameter SLM.

Key highlights of this release:

- TAID, our new knowledge distillation method, has been accepted as a Spotlight paper at ICLR 2025.
- TinySwallow-1.5B, a compact Japanese language model built with TAID, achieves state-of-the-art performance among models of similar size.
- The TinySwallow-1.5B model weights are publicly available on the Hugging Face Hub.

This research was conducted in collaboration with Han Bao (Kyoto University) and Sho Yokoi (National Institute for Japanese Language and Linguistics / Tohoku University / RIKEN). The paper is available on arXiv, and TinySwallow-1.5B is publicly available on the Hugging Face Hub. We hope that TAID will be widely adopted by the community and contribute to the further advancement of efficient AI development.


What is Knowledge Distillation?

Knowledge distillation is one of the most promising approaches for training high-performance SLMs. This technique enables the transfer of knowledge from a high-performing LLM (the teacher model) to an SLM (the student model), offering a more efficient path to creating capable compact models compared to training from scratch.

What makes knowledge distillation particularly fascinating is that it can transfer not just “correct answers” but also the teacher model’s “way of thinking.” The figure below illustrates this concept by comparing traditional training methods (left) with knowledge distillation (right) in language models.


Consider the task of predicting the missing word in the sentence: “Sakana AI develops efficient methods for _”. In traditional training (left), the model would only learn from the correct answer “AI”.

However, with knowledge distillation (right), the teacher model can communicate a more nuanced understanding: while recognizing “AI” as the most appropriate choice at 35% probability, it also acknowledges “ML” at 25% and “LLM” at 15% as contextually natural alternatives. This judgment, expressed as a probability distribution, represents crucial knowledge possessed by the language model.
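In code, this kind of soft-target training is typically implemented as a KL-divergence loss between the teacher's and the student's next-token distributions. The following is a minimal PyTorch sketch of that idea, not code from the paper; the tensor shapes, vocabulary size, and temperature are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """Soft-target distillation: match the student's next-token
    distribution to the teacher's at every position.

    Both logit tensors have shape (batch, seq_len, vocab_size).
    """
    # Teacher's probability distribution (e.g. "AI" 35%, "ML" 25%, "LLM" 15%, ...)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # Student's log-probabilities over the same vocabulary
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence; "batchmean" sums over positions and vocabulary entries,
    # then divides by the batch size
    kl = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    return kl * (temperature ** 2)

# Toy usage: random logits stand in for real model outputs
student_logits = torch.randn(2, 8, 32000)
teacher_logits = torch.randn(2, 8, 32000)
loss = distillation_loss(student_logits, teacher_logits)
```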

Through this knowledge distillation process, we can transfer richer information from the teacher model to the student model, going beyond what’s available in simple correct/incorrect training data. This raises an intriguing question: Should we aim to use the largest and most capable LLMs as teacher models to create better student models (SLMs)?

Counterintuitively, research has shown that “bigger isn’t always better” when it comes to choosing teacher models. This represents one of the fundamental challenges in traditional knowledge distillation approaches. Because student models have limited capacity, too large a gap between teacher and student capabilities can actually hinder effective knowledge transfer. It’s similar to trying to teach graduate-level concepts to elementary school students - the teacher’s knowledge, while extensive, may be too advanced for the student to effectively absorb and understand.


Introducing TAID

TAID represents a novel approach to knowledge distillation that directly addresses the challenges described above. The key innovation is its ability to gradually adapt the teacher model based on the student’s learning progress, enabling more effective knowledge transfer.


Specifically, TAID introduces an “intermediate teacher” that bridges the student and teacher models. This intermediate teacher is carefully designed to provide knowledge that is accessible to the student model while remaining slightly more advanced than the student’s current capabilities. As training progresses, the intermediate teacher gradually evolves to introduce more sophisticated knowledge. This approach mirrors effective classroom education, where teachers adjust their instruction based on students’ understanding and gradually introduce more complex concepts.
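One way to picture the intermediate teacher is as a mixture of the student's and the teacher's output distributions whose mixing weight grows over the course of training. The PyTorch sketch below illustrates this idea under that assumption; it is a simplified illustration rather than the paper's exact formulation, and the linear schedule for the interpolation parameter t stands in for TAID's adaptive update, which adjusts t according to the student's learning progress.

```python
import torch
import torch.nn.functional as F

def interpolated_teacher_loss(student_logits: torch.Tensor,
                              teacher_logits: torch.Tensor,
                              t: float) -> torch.Tensor:
    """Distillation against an interpolated "intermediate teacher".

    t = 0.0 -> the target equals the student's own distribution (easy)
    t = 1.0 -> the target equals the full teacher's distribution (hard)
    """
    student_probs = F.softmax(student_logits, dim=-1)
    teacher_probs = F.softmax(teacher_logits, dim=-1)

    # Intermediate teacher: mix the (detached) student with the teacher,
    # so the target itself receives no gradients.
    intermediate = (1.0 - t) * student_probs.detach() + t * teacher_probs

    # Train the student to match the intermediate teacher.
    student_log_probs = F.log_softmax(student_logits, dim=-1)
    return F.kl_div(student_log_probs, intermediate, reduction="batchmean")

# Toy usage: t increases over training; here a simple linear schedule
# replaces TAID's adaptive update of t.
student_logits = torch.randn(2, 8, 32000)
teacher_logits = torch.randn(2, 8, 32000)
total_steps = 1000
for step in (0, 500, 999):
    t = step / total_steps
    loss = interpolated_teacher_loss(student_logits, teacher_logits, t)
```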

The figure below demonstrates TAID’s effectiveness compared to conventional methods. The horizontal axis shows different teacher model sizes, while the vertical axis indicates the student model’s performance (Test Accuracy). Traditional approaches (KL and RKL) actually show declining performance as teacher models get larger. In contrast, TAID shows consistent improvement in student model performance as teacher size increases, demonstrating its ability to effectively bridge the capacity gap between teacher and student models.


Performance comparison of student models (70M parameters) across different teacher model sizes. Models were pre-trained using the Pythia Suite on a 1B-token subset of the SmolLM-Corpus. Test Accuracy shows the accuracy on the LAMBADA dataset using lm-eval-harness.

This research has been accepted as a Spotlight paper at ICLR 2025, a top-tier conference in machine learning. For those interested in the technical details, the complete paper is available on arXiv.


Release of TinySwallow-1.5B: A New Compact Japanese Language Model

As part of our efforts to validate TAID’s effectiveness, we collaborated with the Institute of Science Tokyo to develop TinySwallow-1.5B, a compact Japanese language model. This model was created through TAID-based knowledge distillation from a 32B parameter LLM to a 1.5B parameter SLM — representing a reduction to approximately 1/20th of the original size.

Average scores on Japanese language understanding and generation tasks from the Japanese LLM Evaluation Benchmark for models under 3B parameters.

Thanks to its compact size, TinySwallow-1.5B can run efficiently not only on PCs but also on mobile devices. As demonstrated in the demo video above, TinySwallow-1.5B-Instruct running on an iPhone 14 achieves impressive text generation speeds.


To showcase the possibility of running the entire TinySwallow model on your personal computer, we’ve also developed a Japanese web interface that runs a JavaScript version of the model directly in your browser, without relying on external APIs. (Note that the model understands both Japanese and English, so you can chat with it in English.)


The model weights are publicly available on the Hugging Face Hub. We hope these models will contribute to the advancement of Japanese language AI technology.
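If you would like to try the model locally, the snippet below shows one way to load it with the Hugging Face transformers library. It is a minimal sketch: the repository id is our assumption, so please check the Hugging Face Hub page for the exact name, the prompt format, and recommended generation settings.

```python
# Minimal sketch of running TinySwallow locally with Hugging Face transformers.
# The repository id below is an assumption; verify it on the Hugging Face Hub.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "SakanaAI/TinySwallow-1.5B-Instruct"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

messages = [{"role": "user", "content": "Briefly introduce yourself."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
output_ids = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```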


The Future: TAID’s Broader Impact

TAID represents a significant advancement in knowledge distillation, offering a powerful new way to transfer knowledge from large models to smaller ones. While this blog post has focused on its application to language models, TAID’s potential extends far beyond LLMs. In fact, our research paper demonstrates this versatility through the development of TAID-VLM-2B, a compact English vision-language model that outperforms existing methods. We plan to continue exploring TAID’s effectiveness across various model architectures and settings.

At Sakana AI, we have consistently drawn inspiration from natural phenomena in our research projects. In developing TAID, we achieved smooth knowledge transfer by modeling the process after human learning patterns. This biomimetic approach has proven particularly effective in creating efficient and high-performing models.

Looking ahead, we remain committed to our mission of making powerful AI models more accessible to everyone. By developing technologies like TAID that enable the creation of efficient, high-performing compact models, we’re working toward a future where the benefits of AI can be enjoyed by a broader audience.




Acknowledgement

We extend our heartfelt gratitude to the New Energy and Industrial Technology Development Organization (NEDO) and the Japanese Ministry of Economy, Trade and Industry (METI) for organizing the Generative AI Accelerator Challenge (GENIAC) and for selecting us as one of the participants. This project, JPNP20017, was made possible through the support provided by this initiative.


Sakana AI

Interested in joining us? Please see our career opportunities for more information.