Configuration
Detailed Configuration Parameters
Here’s a complete reference for all available configuration options:
Top-Level Configuration
Parameter | Type | Description |
---|---|---|
project_name | string | Name for the distillation project (used for logging) |
dataset | object | Dataset configuration (see below) |
models | object | Model configuration (see below) |
tokenizer | object | Tokenizer configuration (see below) |
training | object | Training arguments (see below) |
distillation | object | Distillation-specific settings (see below) |
model_config | object | Model loading options (see below) |
lora | object | LoRA/PEFT configuration (see below) |
quantization | object | Quantization settings (see below) |
execution | object | Execution environment settings (see below) |
hf_token | string or null | Hugging Face API token for private models/datasets |
Dataset Configuration (`dataset`)
Parameter | Type | Description |
---|---|---|
name | string | Path or Hugging Face dataset name |
split | string | Dataset split to use (e.g., “train”, “validation”) |
logits_file | string or null | Path to TFRecord file with pre-computed logits (null for on-the-fly) |
num_samples | number or null | Maximum number of samples to use (null for all) |
select_range | [number, number] or null | Range of samples to select [start, end] (null for all) |
format_function | string or null | Name of formatter function (see Formatters section) |
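For example, a `dataset` block that loads a Hugging Face dataset and computes teacher logits on the fly might look like the sketch below (the dataset name and formatter name are placeholders):

```yaml
dataset:
  name: "your-org/your-dataset"      # placeholder Hugging Face dataset ID
  split: "train"
  logits_file: null                  # null -> teacher logits computed on the fly
  num_samples: 10000                 # cap the number of training samples
  select_range: null
  format_function: "default_format"  # placeholder; see the Formatters section
```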
Models Configuration (`models`)
Parameter | Type | Description |
---|---|---|
teacher | string or null | Teacher model path/ID (needed if logits_file is null) |
student | string | Student model path/ID |
student_adapter | string or null | Path to pre-trained student adapter (e.g., LoRA) |
teacher_adapter | string or null | Path to pre-trained teacher adapter |
teacher_vocab_size | number | Vocabulary size of teacher model (required if using logits_file) |
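A matching `models` sketch for the on-the-fly dataset example above; because `logits_file` is null there, a teacher model must be provided (model IDs are placeholders):

```yaml
models:
  teacher: "your-org/teacher-model-7b"   # needed because dataset.logits_file is null
  student: "your-org/student-model-1b"
  student_adapter: null
  teacher_adapter: null
  # teacher_vocab_size is only required when using a pre-computed logits_file
```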
Tokenizer Configuration (`tokenizer`)
Parameter | Type | Description |
---|---|---|
max_length | number | Maximum sequence length for truncation/filtering |
chat_template | string or null | Optional Jinja chat template string |
student_pad_token_id | number | Pad token ID for student tokenizer |
teacher_pad_token_id | number | Pad token ID for teacher tokenizer |
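For instance (the pad token IDs below are placeholders; use the IDs defined by your student and teacher tokenizers):

```yaml
tokenizer:
  max_length: 2048
  chat_template: null       # or a Jinja template string to override the default
  student_pad_token_id: 0   # placeholder; check your student tokenizer
  teacher_pad_token_id: 0   # placeholder; check your teacher tokenizer
```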
Training Configuration (`training`)
This section contains standard Hugging Face `TrainingArguments` parameters. Here are the most common ones:
Parameter | Type | Description |
---|---|---|
output_dir | string | Directory to save model checkpoints and results |
num_train_epochs | number | Number of training epochs |
per_device_train_batch_size | number | Batch size per GPU |
gradient_accumulation_steps | number | Number of batches to accumulate before each optimizer update |
save_steps | number | Save checkpoint every N steps |
logging_steps | number | Log metrics every N steps |
learning_rate | number | Initial learning rate |
warmup_ratio | number | Ratio of steps for learning rate warmup |
lr_scheduler_type | string | LR scheduler (e.g., “cosine”, “linear”) |
resume_from_checkpoint | string or null | Path to checkpoint to resume from |
bf16 | boolean | Enable bfloat16 mixed precision training |
fp16 | boolean | Enable float16 mixed precision training |
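A sketch of a typical `training` block (the values are illustrative starting points, not recommendations):

```yaml
training:
  output_dir: "./results"
  num_train_epochs: 3
  per_device_train_batch_size: 2
  gradient_accumulation_steps: 8   # effective batch of 16 per device
  learning_rate: 2.0e-5
  lr_scheduler_type: "cosine"
  warmup_ratio: 0.05
  bf16: true                       # or fp16: true on pre-Ampere GPUs
```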
Distillation Configuration (`distillation`)
Parameter | Type | Description |
---|---|---|
temperature | number | Temperature for softening distributions (typically 2.0-4.0) |
alpha | number | Weight for distillation loss (between 0-1) |
loss_type | string | Distillation loss type: “fkl”, “kld”, “uld”, “multi-ot” |
student_response_template | string | Template for student response (used in uld/multi-ot) |
teacher_response_template | string | Template for teacher response (used in uld/multi-ot) |
k | number | Top-k parameter for “uld” and “multi-ot” losses |
loss_kwargs | object | Additional parameters for the “multi-ot” loss type (e.g., “log_loss_weight”, “sikhorn_loss_weight”) |
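For example, a plain forward-KL setup could look like the sketch below; the response templates and `k` only come into play with the `uld` and `multi-ot` losses:

```yaml
distillation:
  temperature: 2.0     # softens teacher and student distributions
  alpha: 0.5           # weight on the distillation loss
  loss_type: "fkl"
  # only relevant for "uld" / "multi-ot":
  student_response_template: null
  teacher_response_template: null
  k: 100
  loss_kwargs: {}
```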
Model Configuration (`model_config`)
Parameter | Type | Description |
---|---|---|
use_flash_attention | boolean | Enable Flash Attention 2 during model loading |
trust_remote_code | boolean | Set trust_remote_code for model loading |
LoRA Configuration (`lora`)
Parameter | Type | Description |
---|---|---|
enable_training | boolean | Enable LoRA training for the student model |
r | number | LoRA rank (typically 8-64) |
alpha | number | LoRA alpha scaling factor (typically 2×r) |
dropout | number | Dropout probability in LoRA layers |
bias | string | LoRA bias type: “none”, “all”, “lora_only” |
task_type | string | Type of task (usually “CAUSAL_LM”) |
target_modules | array of strings | List of modules to apply LoRA to (e.g., “q_proj”, “k_proj”) |
modules_to_save | array of strings | Additional modules to make trainable |
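A common starting point is sketched below; target module names vary by architecture (the ones shown match Llama/Mistral-style attention projections):

```yaml
lora:
  enable_training: true
  r: 16
  alpha: 32                 # commonly set to 2 x r
  dropout: 0.05
  bias: "none"
  task_type: "CAUSAL_LM"
  target_modules: ["q_proj", "k_proj", "v_proj", "o_proj"]
  modules_to_save: []       # e.g., ["lm_head"] to also train the output head
```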
Quantization Configuration (`quantization`)
Parameter | Type | Description |
---|---|---|
enabled | boolean | Enable 4-bit quantization (BitsAndBytes NF4) |
Execution Configuration (`execution`)
Parameter | Type | Description |
---|---|---|
use_accelerate | boolean | Whether HF Accelerate is used (for distributed training) |
accelerate_config | string or null | Path to Accelerate config file (only required when using Modal) |
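For single-GPU runs both values can stay at their defaults; a multi-GPU sketch might look like:

```yaml
execution:
  use_accelerate: true     # launch training through Hugging Face Accelerate
  accelerate_config: null  # path to an Accelerate config file (only needed on Modal)
```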
Sample Configuration
Here’s an example configuration for a typical distillation scenario, sketched below in YAML (model IDs, dataset names, sizes, and paths are illustrative placeholders; adapt them to your setup):
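```yaml
# Sketch of a full config using pre-computed teacher logits; all IDs, sizes, and paths are placeholders.
project_name: "distillation-example"
hf_token: null                        # or your HF token for private models/datasets
dataset:
  name: "your-org/your-dataset"
  split: "train"
  logits_file: "/vol/logits/teacher_logits.tfrecord"   # hypothetical file under /vol
  num_samples: null
  select_range: null
  format_function: null
models:
  teacher: null                       # not needed when logits_file is provided
  student: "your-org/student-model-1b"
  student_adapter: null
  teacher_adapter: null
  teacher_vocab_size: 32000           # must match the teacher that produced the logits
tokenizer:
  max_length: 2048
  chat_template: null
  student_pad_token_id: 0             # placeholder; use your tokenizer's pad token ID
  teacher_pad_token_id: 0
training:
  output_dir: "/vol/distilled_model"
  num_train_epochs: 3
  per_device_train_batch_size: 2
  gradient_accumulation_steps: 8
  save_steps: 500
  logging_steps: 10
  learning_rate: 2.0e-5
  warmup_ratio: 0.05
  lr_scheduler_type: "cosine"
  resume_from_checkpoint: null
  bf16: true
  fp16: false
distillation:
  temperature: 2.0
  alpha: 0.5
  loss_type: "fkl"
  student_response_template: null
  teacher_response_template: null
  k: 100
  loss_kwargs: {}
model_config:
  use_flash_attention: true
  trust_remote_code: false
lora:
  enable_training: true
  r: 16
  alpha: 32
  dropout: 0.05
  bias: "none"
  task_type: "CAUSAL_LM"
  target_modules: ["q_proj", "k_proj", "v_proj", "o_proj"]
  modules_to_save: []
quantization:
  enabled: true
execution:
  use_accelerate: false
  accelerate_config: null
```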
Important: Paths specified in the configuration (e.g., `dataset.logits_file`, `training.output_dir`, model paths) should point to locations within your accessible storage volume. In the example above, paths like `/vol/logits/...` and `/vol/distilled_model` assume your data and output directories are mapped to `/vol` inside your execution environment (like a container or VM).
Configuration Tips
- Memory Optimization (see the sketch after this list):
  - Enable quantization (`quantization.enabled: true`) for large models
  - Use LoRA (`lora.enable_training: true`) instead of full model fine-tuning
  - Adjust `tokenizer.max_length` based on your GPU memory
- Training Speed:
  - Enable Flash Attention 2 with `model_config.use_flash_attention: true`
  - Use bfloat16 mixed precision with `training.bf16: true` on compatible hardware
  - Increase `training.per_device_train_batch_size` if memory allows
  - Set up distributed training with `execution.use_accelerate: true`
- Result Quality:
  - Experiment with temperature values (typically between 1.0 and 4.0)
  - Adjust alpha to balance between distillation and task losses
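A minimal fragment combining the memory and speed settings above (assumed values; tune them to your hardware):

```yaml
# Illustrative memory- and speed-oriented settings (adjust to your GPUs)
tokenizer:
  max_length: 1024                  # shorter sequences reduce activation memory
model_config:
  use_flash_attention: true         # Flash Attention 2
training:
  per_device_train_batch_size: 1
  gradient_accumulation_steps: 16   # keeps the effective batch size up
  bf16: true                        # on bf16-capable hardware; use fp16 otherwise
lora:
  enable_training: true
  r: 16
  alpha: 32
quantization:
  enabled: true                     # 4-bit NF4 base weights
```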