| Loss Type | Best For | Special Requirements |
| --- | --- | --- |
| KL Divergence (`fkl`, `kld`) | Same-tokenizer distillation | None |
| Universal Logit Distillation (`uld`) | Cross-tokenizer distillation | Requires `teacher_labels` |
| Multi-Level Optimal Transport (`multi-ot`) | Cross-tokenizer distillation | Requires `teacher_labels` and additional parameters |
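
For the same-tokenizer case (`fkl`/`kld`), the loss is a temperature-scaled forward KL divergence between the teacher's and student's token distributions. The sketch below is a minimal illustration, not the library's implementation; the function name, argument names, and shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def forward_kl_distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Temperature-scaled forward KL divergence KL(teacher || student).

    Assumes teacher and student share a tokenizer, so their logits align
    position-by-position over the same vocabulary.
    """
    # Soften both distributions with the temperature, then flatten the
    # (batch, seq) dimensions so the reduction averages over token positions.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1).flatten(0, 1)
    teacher_log_probs = F.log_softmax(teacher_logits / temperature, dim=-1).flatten(0, 1)

    # KL(teacher || student); the T^2 factor keeps gradient magnitudes
    # comparable across temperatures (Hinton et al., 2015).
    kl = F.kl_div(student_log_probs, teacher_log_probs,
                  log_target=True, reduction="batchmean")
    return kl * temperature ** 2

# Illustrative shapes: (batch, sequence_length, vocab_size).
student_logits = torch.randn(2, 8, 32000)
teacher_logits = torch.randn(2, 8, 32000)
loss = forward_kl_distillation_loss(student_logits, teacher_logits, temperature=2.0)
```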

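For the cross-tokenizer losses (`uld`, `multi-ot`), teacher and student vocabularies differ, so logits cannot be compared index-by-index. Universal Logit Distillation sidesteps this by comparing sorted probability distributions, for which the 1-Wasserstein distance reduces to an elementwise absolute difference. The sketch below illustrates only that distance term under assumed names and shapes; the full loss also relies on the teacher's own tokenization of the targets (`teacher_labels`) and a cross-entropy term, both omitted here.

```python
import torch
import torch.nn.functional as F

def sorted_probability_distance(student_logits, teacher_logits):
    """Sketch of the sorted-probability (1-Wasserstein) distance used for
    cross-tokenizer distillation: distributions over different vocabularies
    are compared after sorting, with the shorter one zero-padded."""
    student_sorted, _ = torch.softmax(student_logits, dim=-1).sort(dim=-1, descending=True)
    teacher_sorted, _ = torch.softmax(teacher_logits, dim=-1).sort(dim=-1, descending=True)

    # Zero-pad the smaller vocabulary so the sorted vectors have equal length.
    pad = student_sorted.size(-1) - teacher_sorted.size(-1)
    if pad > 0:
        teacher_sorted = F.pad(teacher_sorted, (0, pad))
    elif pad < 0:
        student_sorted = F.pad(student_sorted, (0, -pad))

    # Absolute difference summed over the (sorted) vocabulary dimension,
    # averaged over batch and sequence positions.
    return (student_sorted - teacher_sorted).abs().sum(dim=-1).mean()

# Illustrative: teacher and student use different vocabulary sizes.
student_logits = torch.randn(2, 8, 32000)
teacher_logits = torch.randn(2, 8, 50000)
distance = sorted_probability_distance(student_logits, teacher_logits)
```
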
References

  1. Distilling the Knowledge in a Neural Network
    Geoffrey Hinton, Oriol Vinyals, Jeff Dean
    arXiv preprint arXiv:1503.02531, 2015.
    https://arxiv.org/abs/1503.02531

  2. Towards Cross-Tokenizer Distillation: the Universal Logit Distillation Loss for LLMs
    Nicolas Boizard, Kevin El Haddad, Céline Hudelot, Pierre Colombo
    arXiv preprint arXiv:2402.12030, 2024.
    https://arxiv.org/abs/2402.12030

  3. Multi-Level Optimal Transport for Universal Cross-Tokenizer Knowledge Distillation on Language Models
    Xiao Cui, Mo Zhu, Yulei Qin, Liang Xie, Wengang Zhou, Houqiang Li
    arXiv preprint arXiv:2412.14528, 2024.
    https://arxiv.org/abs/2412.14528