## Essentials

### Losses

| Loss Type | Best For | Special Requirements |
|---|---|---|
| KL Divergence (`fkl`, `kld`) | Same-tokenizer distillation | None |
| Universal Logit Distillation (`uld`) | Cross-tokenizer distillation | Requires `teacher_labels` |
| Multi-Level Optimal Transport (`multi-ot`) | Cross-tokenizer distillation | Requires `teacher_labels` and additional parameters |
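
The sketches below are minimal illustrations of the losses in this table, not this project's implementation; function names, tensor shapes, and the `temperature` argument are assumptions made for the examples.

For same-tokenizer distillation, forward KL compares the teacher's and student's next-token distributions position by position over a shared vocabulary:

```python
import torch
import torch.nn.functional as F

def forward_kl_loss(student_logits: torch.Tensor,
                    teacher_logits: torch.Tensor,
                    temperature: float = 1.0) -> torch.Tensor:
    """Forward KL(teacher || student), assuming a shared tokenizer.

    Both logit tensors have shape (batch, seq_len, vocab) over the *same*
    vocabulary, which is why this loss needs no teacher_labels.
    """
    vocab = student_logits.size(-1)
    # Flatten so "batchmean" averages the divergence over every token position.
    s_logp = F.log_softmax(student_logits.reshape(-1, vocab) / temperature, dim=-1)
    t_prob = F.softmax(teacher_logits.reshape(-1, vocab) / temperature, dim=-1)
    # kl_div(input=log q, target=p) computes sum p * (log p - log q) = KL(p || q).
    # The T^2 factor keeps gradient magnitudes comparable across temperatures
    # (Hinton et al., 2015).
    return F.kl_div(s_logp, t_prob, reduction="batchmean") * temperature ** 2
```

Following Hinton et al. (2015), this distillation term is typically mixed with the ordinary cross-entropy loss on the ground-truth labels.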
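
For cross-tokenizer distillation, the Universal Logit Distillation loss (Boizard et al., 2024) compares *sorted* probability vectors, so no token-to-token vocabulary mapping is needed. The sketch below shows only that distance term and assumes the positions being compared have already been aligned and masked upstream (presumably what `teacher_labels` is used for); `uld_distance` is a hypothetical name.

```python
import torch
import torch.nn.functional as F

def uld_distance(student_logits: torch.Tensor,
                 teacher_logits: torch.Tensor) -> torch.Tensor:
    """Sorted-probability L1 distance in the spirit of ULD (sketch only).

    student_logits: (num_positions, student_vocab)
    teacher_logits: (num_positions, teacher_vocab)
    The vocabularies may differ in size and ordering; sorting each
    distribution removes the need for any vocabulary alignment.
    """
    s = F.softmax(student_logits, dim=-1).sort(dim=-1, descending=True).values
    t = F.softmax(teacher_logits, dim=-1).sort(dim=-1, descending=True).values
    # Zero-pad the smaller vocabulary so the sorted vectors line up.
    if s.size(-1) < t.size(-1):
        s = F.pad(s, (0, t.size(-1) - s.size(-1)))
    elif t.size(-1) < s.size(-1):
        t = F.pad(t, (0, s.size(-1) - t.size(-1)))
    return (s - t).abs().sum(dim=-1).mean()
```

In the paper, this term is combined with the standard cross-entropy loss on the ground-truth targets.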
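
The `multi-ot` loss (Cui et al., 2024) also works across tokenizers, but applies optimal-transport distances at more than one level of granularity, which is likely where the additional parameters noted in the table come from. As background only (the paper's exact cost matrices, levels, and solver differ), the generic entropy-regularized optimal-transport objective such methods build on is

$$
\mathrm{OT}_{\varepsilon}(p, q) \;=\; \min_{\pi \in \Pi(p, q)} \sum_{i,j} \pi_{ij} C_{ij} \;+\; \varepsilon \sum_{i,j} \pi_{ij} \log \pi_{ij},
$$

where $p$ and $q$ are the teacher and student distributions, $C_{ij}$ is the cost of moving probability mass between entries $i$ and $j$, and $\Pi(p, q)$ is the set of couplings whose marginals are $p$ and $q$.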
## References

- *Distilling the Knowledge in a Neural Network*. Geoffrey Hinton, Oriol Vinyals, Jeff Dean. arXiv preprint arXiv:1503.02531, 2015. https://arxiv.org/abs/1503.02531
- *Towards Cross-Tokenizer Distillation: the Universal Logit Distillation Loss for LLMs*. Nicolas Boizard, Kevin El Haddad, Céline Hudelot, Pierre Colombo. arXiv preprint arXiv:2402.12030, 2024. https://arxiv.org/abs/2402.12030
- *Multi-Level Optimal Transport for Universal Cross-Tokenizer Knowledge Distillation on Language Models*. Xiao Cui, Mo Zhu, Yulei Qin, Liang Xie, Wengang Zhou, Houqiang Li. arXiv preprint arXiv:2412.14528, 2024. https://arxiv.org/abs/2412.14528