| Loss Type | Best For | Special Requirements |
| --- | --- | --- |
| KL Divergence (`fkl`, `kld`) | Same-tokenizer distillation | None |
| Universal Logit Distillation (`uld`) | Cross-tokenizer distillation | Requires `teacher_labels` |
| Multi-Level Optimal Transport (`multi-ot`) | Cross-tokenizer distillation | Requires `teacher_labels` and additional parameters |
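
For the same-tokenizer case (`fkl`/`kld`), the loss is a temperature-scaled forward KL divergence between the teacher's and student's token distributions. The sketch below is a minimal illustration, not the library's implementation; the function name, argument names, and shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def forward_kl_distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Temperature-scaled forward KL divergence KL(teacher || student).

    Assumes teacher and student share a tokenizer, so their logits align
    position-by-position over the same vocabulary.
    """
    # Soften both distributions with the temperature, then flatten the
    # (batch, seq) dimensions so the reduction averages over token positions.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1).flatten(0, 1)
    teacher_log_probs = F.log_softmax(teacher_logits / temperature, dim=-1).flatten(0, 1)

    # KL(teacher || student); the T^2 factor keeps gradient magnitudes
    # comparable across temperatures (Hinton et al., 2015).
    kl = F.kl_div(student_log_probs, teacher_log_probs,
                  log_target=True, reduction="batchmean")
    return kl * temperature ** 2

# Illustrative shapes: (batch, sequence_length, vocab_size).
student_logits = torch.randn(2, 8, 32000)
teacher_logits = torch.randn(2, 8, 32000)
loss = forward_kl_distillation_loss(student_logits, teacher_logits, temperature=2.0)
```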

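For the cross-tokenizer losses (`uld`, `multi-ot`), teacher and student vocabularies differ, so logits cannot be compared index-by-index. Universal Logit Distillation sidesteps this by comparing sorted probability distributions, for which the 1-Wasserstein distance reduces to an elementwise absolute difference. The sketch below illustrates only that distance term under assumed names and shapes; the full loss also relies on the teacher's own tokenization of the targets (`teacher_labels`) and a cross-entropy term, both omitted here.

```python
import torch
import torch.nn.functional as F

def sorted_probability_distance(student_logits, teacher_logits):
    """Sketch of the sorted-probability (1-Wasserstein) distance used for
    cross-tokenizer distillation: distributions over different vocabularies
    are compared after sorting, with the shorter one zero-padded."""
    student_sorted, _ = torch.softmax(student_logits, dim=-1).sort(dim=-1, descending=True)
    teacher_sorted, _ = torch.softmax(teacher_logits, dim=-1).sort(dim=-1, descending=True)

    # Zero-pad the smaller vocabulary so the sorted vectors have equal length.
    pad = student_sorted.size(-1) - teacher_sorted.size(-1)
    if pad > 0:
        teacher_sorted = F.pad(teacher_sorted, (0, pad))
    elif pad < 0:
        student_sorted = F.pad(student_sorted, (0, -pad))

    # Absolute difference summed over the (sorted) vocabulary dimension,
    # averaged over batch and sequence positions.
    return (student_sorted - teacher_sorted).abs().sum(dim=-1).mean()

# Illustrative: teacher and student use different vocabulary sizes.
student_logits = torch.randn(2, 8, 32000)
teacher_logits = torch.randn(2, 8, 50000)
distance = sorted_probability_distance(student_logits, teacher_logits)
```
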
References

  1. Distilling the Knowledge in a Neural Network
    Geoffrey Hinton, Oriol Vinyals, Jeff Dean
    arXiv preprint arXiv:1503.02531, 2015.
    https://arxiv.org/abs/1503.02531

  2. Towards Cross-Tokenizer Distillation: the Universal Logit Distillation Loss for LLMs
    Nicolas Boizard, Kevin El Haddad, Céline Hudelot, Pierre Colombo
    arXiv preprint arXiv:2402.12030, 2024.
    https://arxiv.org/abs/2402.12030

  3. Multi-Level Optimal Transport for Universal Cross-Tokenizer Knowledge Distillation on Language Models
    Xiao Cui, Mo Zhu, Yulei Qin, Liang Xie, Wengang Zhou, Houqiang Li
    arXiv preprint arXiv:2412.14528, 2024.
    https://arxiv.org/abs/2412.14528