Detailed Configuration Parameters

Here’s a complete reference for all available configuration options:

Top-Level Configuration

| Parameter | Type | Description |
|-----------|------|-------------|
| project_name | string | Name for the distillation project (used for logging) |
| dataset | object | Dataset configuration (see below) |
| models | object | Model configuration (see below) |
| tokenizer | object | Tokenizer configuration (see below) |
| training | object | Training arguments (see below) |
| distillation | object | Distillation-specific settings (see below) |
| model_config | object | Model loading options (see below) |
| lora | object | LoRA/PEFT configuration (see below) |
| quantization | object | Quantization settings (see below) |
| execution | object | Execution environment settings (see below) |
| hf_token | string \| null | Hugging Face API token for private models/datasets |

Dataset Configuration (dataset)

| Parameter | Type | Description |
|-----------|------|-------------|
| name | string | Path or Hugging Face dataset name |
| split | string | Dataset split to use (e.g., “train”, “validation”) |
| logits_file | string \| null | Path to TFRecord file with pre-computed logits (null for on-the-fly computation) |
| num_samples | number \| null | Maximum number of samples to use (null for all) |
| select_range | [number, number] \| null | Range of samples to select [start, end] (null for all) |
| format_function | string \| null | Name of formatter function (see the Formatters section) |
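
To see how these fields interact, here is a rough sketch (assuming the Hugging Face datasets library; the project's actual loading code may differ) of how name, split, select_range, and num_samples would typically be applied:

from datasets import load_dataset

# dataset.name / dataset.split
ds = load_dataset("tatsu-lab/alpaca", split="train")

# select_range: keep only rows in [start, end)
ds = ds.select(range(1000, 5000))

# num_samples: cap the total number of rows used
ds = ds.select(range(min(10000, len(ds))))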

Models Configuration (models)

| Parameter | Type | Description |
|-----------|------|-------------|
| teacher | string \| null | Teacher model path/ID (needed if logits_file is null) |
| student | string | Student model path/ID |
| student_adapter | string \| null | Path to pre-trained student adapter (e.g., LoRA) |
| teacher_adapter | string \| null | Path to pre-trained teacher adapter |
| teacher_vocab_size | number | Vocabulary size of teacher model (required if using logits_file) |
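
When distilling from pre-computed logits the teacher model itself is never loaded, so teacher_vocab_size has to be supplied by hand. A quick way to look it up (a sketch assuming the transformers library; the model ID is only an example):

from transformers import AutoConfig

# Gated models additionally need hf_token / huggingface-cli login
config = AutoConfig.from_pretrained("meta-llama/Llama-3.1-70B-Instruct")
print(config.vocab_size)  # 128256 for the Llama 3.1 family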

Tokenizer Configuration (tokenizer)

| Parameter | Type | Description |
|-----------|------|-------------|
| max_length | number | Maximum sequence length for truncation/filtering |
| chat_template | string \| null | Optional Jinja chat template string |
| student_pad_token_id | number | Pad token ID for student tokenizer |
| teacher_pad_token_id | number | Pad token ID for teacher tokenizer |
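
The chat_template value is a Jinja string of the kind transformers tokenizers use. A minimal sketch of what such a template is for (model ID and message are illustrative; the pipeline's exact formatting step may differ):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
# A custom Jinja string would replace the tokenizer's built-in template:
# tok.chat_template = "{% for message in messages %}...{% endfor %}"

messages = [{"role": "user", "content": "Explain distillation in one sentence."}]
text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(text)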

Training Configuration (training)

This section contains standard Hugging Face TrainingArguments parameters. Here are the most common ones:

| Parameter | Type | Description |
|-----------|------|-------------|
| output_dir | string | Directory to save model checkpoints and results |
| num_train_epochs | number | Number of training epochs |
| per_device_train_batch_size | number | Batch size per GPU |
| gradient_accumulation_steps | number | Number of micro-batches to accumulate before each optimizer update |
| save_steps | number | Save a checkpoint every N steps |
| logging_steps | number | Log metrics every N steps |
| learning_rate | number | Initial learning rate |
| warmup_ratio | number | Fraction of total training steps used for learning rate warmup |
| lr_scheduler_type | string | LR scheduler (e.g., “cosine”, “linear”) |
| resume_from_checkpoint | string \| null | Path to checkpoint to resume from |
| bf16 | boolean | Enable bfloat16 mixed precision training |
| fp16 | boolean | Enable float16 mixed precision training |
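
per_device_train_batch_size and gradient_accumulation_steps trade off against each other: the effective batch size is their product multiplied by the number of GPUs. A quick sanity check (the GPU count is illustrative and depends on your Accelerate setup):

per_device_train_batch_size = 1
gradient_accumulation_steps = 8
num_gpus = 4  # illustrative

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch_size)  # 32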

Distillation Configuration (distillation)

| Parameter | Type | Description |
|-----------|------|-------------|
| temperature | number | Temperature for softening distributions (typically 2.0-4.0) |
| alpha | number | Weight for the distillation loss (between 0 and 1) |
| loss_type | string | Distillation loss type: “fkl”, “kld”, “uld”, “multi-ot” |
| student_response_template | string | Template for the student response (used in “uld”/“multi-ot”) |
| teacher_response_template | string | Template for the teacher response (used in “uld”/“multi-ot”) |
| k | number | Top-k parameter for the “uld” and “multi-ot” losses |
| loss_kwargs | object | Additional parameters for the “multi-ot” loss type: “log_loss_weight”, “sikhorn_loss_weight” |
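
To make temperature and alpha concrete, here is a minimal sketch of the standard forward-KL formulation; the project's actual loss implementations (especially “uld” and “multi-ot”) differ in detail:

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, task_loss, temperature=2.0, alpha=0.1):
    # Soften both distributions with the temperature
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # Forward KL from teacher to student, rescaled by T^2
    kd_loss = F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2
    # alpha weights the distillation term against the ordinary task (cross-entropy) loss
    return alpha * kd_loss + (1 - alpha) * task_loss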

Model Configuration (model_config)

| Parameter | Type | Description |
|-----------|------|-------------|
| use_flash_attention | boolean | Enable Flash Attention 2 during model loading |
| trust_remote_code | boolean | Set trust_remote_code for model loading |
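
These flags map onto standard transformers loading options. Roughly (a sketch, not the project's exact loading code; the dtype here is an assumption that matches bf16 training):

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    attn_implementation="flash_attention_2",  # use_flash_attention: true
    torch_dtype=torch.bfloat16,               # Flash Attention 2 requires fp16/bf16
    trust_remote_code=False,                  # trust_remote_code
)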

LoRA Configuration (lora)

| Parameter | Type | Description |
|-----------|------|-------------|
| enable_training | boolean | Enable LoRA training for the student model |
| r | number | LoRA rank (typically 8-64) |
| alpha | number | LoRA alpha scaling factor (typically 2×r) |
| dropout | number | Dropout probability in LoRA layers |
| bias | string | LoRA bias type: “none”, “all”, “lora_only” |
| task_type | string | Type of task (usually “CAUSAL_LM”) |
| target_modules | array of strings | List of modules to apply LoRA to (e.g., “q_proj”, “k_proj”) |
| modules_to_save | array of strings | Additional modules to make trainable |
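
These fields mirror the arguments of a PEFT LoraConfig. A rough sketch of the correspondence (assuming the peft library; values taken from the sample configuration below):

from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                   # lora.r
    lora_alpha=32,          # lora.alpha
    lora_dropout=0.05,      # lora.dropout
    bias="none",            # lora.bias
    task_type="CAUSAL_LM",  # lora.task_type
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # lora.target_modules
    modules_to_save=[],     # lora.modules_to_save
)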

Quantization Configuration (quantization)

| Parameter | Type | Description |
|-----------|------|-------------|
| enabled | boolean | Enable 4-bit quantization (BitsAndBytes NF4) |
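
Setting enabled: true corresponds to 4-bit NF4 loading through the BitsAndBytes integration. Roughly (a sketch; the compute dtype and double quantization flags are assumptions, not documented options):

import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumption: pairs with bf16 training
    bnb_4bit_use_double_quant=True,         # assumption
)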

Execution Configuration (execution)

| Parameter | Type | Description |
|-----------|------|-------------|
| use_accelerate | boolean | Whether HF Accelerate is used (for distributed training) |
| accelerate_config | string \| null | Path to accelerate config file (only required when using modal) |

Sample Configuration

Here’s a complete example configuration file for a typical distillation scenario:

{
  "project_name": "llama-3.1-70b-to-8b-distillation",
  "dataset": {
    "name": "tatsu-lab/alpaca",
    "split": "train",
    "logits_file": "/vol/logits/llama-3.1-70b-alpaca.tfrecord",
    "num_samples": 10000,
    "select_range": null,
    "format_function": "default_format"
  },
  "models": {
    "teacher": null,
    "student": "meta-llama/Llama-3.1-8B-Instruct",
    "student_adapter": null,
    "teacher_adapter": null,
    "teacher_vocab_size": 128256
  },
  "tokenizer": {
    "max_length": 2048,
    "chat_template": null,
    "student_pad_token_id": 128001,
    "teacher_pad_token_id": 128001
  },
  "training": {
    "output_dir": "/vol/distilled_model",
    "num_train_epochs": 3,
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 8,
    "save_steps": 500,
    "logging_steps": 10,
    "learning_rate": 2e-5,
    "weight_decay": 0.01,
    "warmup_ratio": 0.03,
    "lr_scheduler_type": "cosine",
    "resume_from_checkpoint": null,
    "fp16": false,
    "bf16": true
  },
  "distillation": {
    "temperature": 2.0,
    "alpha": 0.1,
    "loss_type": "fkl",
    "student_response_template": "<|start_header_id|>assistant<|end_header_id|>\n\n",
    "teacher_response_template": "<|start_header_id|>assistant<|end_header_id|>\n\n",
    "k": 100,
    "loss_kwargs": {}
  },
  "model_config": {
    "use_flash_attention": true,
    "trust_remote_code": false
  },
  "lora": {
    "enable_training": true,
    "r": 16,
    "alpha": 32,
    "dropout": 0.05,
    "bias": "none",
    "task_type": "CAUSAL_LM",
    "target_modules": [
      "q_proj", "k_proj", "v_proj", "o_proj",
      "gate_proj", "up_proj", "down_proj"
    ],
    "modules_to_save": []
  },
  "quantization": {
    "enabled": true
  },
  "execution": {
    "use_accelerate": true,
    "accelerate_config": null
  },
  "hf_token": null
}

Important: Paths specified in the configuration (e.g., dataset.logits_file, training.output_dir, model paths) should point to locations within your accessible storage volume. In the example above, paths like /vol/logits/... and /vol/distilled_model assume your data and output directories are mapped to /vol inside your execution environment (like a container or VM).
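
A small, purely illustrative pre-flight check that the paths referenced in a config actually resolve inside the mounted volume (the file name config.json and the keys follow the example above):

import json
from pathlib import Path

config = json.loads(Path("config.json").read_text())

logits_file = config["dataset"]["logits_file"]
if logits_file is not None and not Path(logits_file).exists():
    print(f"warning: {logits_file} not found -- is the volume mounted at /vol?")

# Create the output directory up front so a typo fails fast
Path(config["training"]["output_dir"]).mkdir(parents=True, exist_ok=True)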

Configuration Tips

  1. Memory Optimization:

    • Enable quantization (quantization.enabled: true) for large models
    • Use LoRA (lora.enable_training: true) instead of full model fine-tuning
    • Adjust tokenizer.max_length based on your GPU memory
  2. Training Speed:

    • Enable Flash Attention 2 with model_config.use_flash_attention: true
    • Use bfloat16 mixed precision with training.bf16: true on compatible hardware
    • Increase training.per_device_train_batch_size if memory allows
    • Set up distributed training with execution.use_accelerate: true
  3. Result Quality:

    • Experiment with temperature values (typically between 1.0 and 4.0); the sketch below shows how temperature softens a distribution
    • Adjust alpha to balance the distillation loss against the standard task loss
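
To see what the temperature actually does, compare a softmax over the same logits with and without softening (numbers purely illustrative):

import torch

logits = torch.tensor([4.0, 2.0, 0.5])
print(torch.softmax(logits, dim=-1))        # ~[0.86, 0.12, 0.03] -- sharp
print(torch.softmax(logits / 4.0, dim=-1))  # ~[0.49, 0.30, 0.21] -- softened at T = 4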