## Detailed Configuration Parameters

Here’s a complete reference for all available configuration options:

### Top-Level Configuration
| Parameter | Type | Description |
|---|---|---|
| project_name | string | Name for the distillation project (used for logging) |
| dataset | object | Dataset configuration (see below) |
| models | object | Model configuration (see below) |
| tokenizer | object | Tokenizer configuration (see below) |
| training | object | Training arguments (see below) |
| distillation | object | Distillation-specific settings (see below) |
| model_config | object | Model loading options (see below) |
| lora | object | LoRA/PEFT configuration (see below) |
| quantization | object | Quantization settings (see below) |
| execution | object | Execution environment settings (see below) |
| hf_token | string \| null | Hugging Face API token for private models/datasets |
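To show how these keys nest, here is a minimal skeleton (rendered as YAML for readability; the exact serialization format depends on your setup, and the project name is a made-up placeholder). Each empty block is expanded in the sections below.

```yaml
project_name: "my-distillation-run"   # hypothetical project name, used for logging
hf_token: null                        # or an HF token string for private models/datasets

dataset: {}        # see "Dataset Configuration" below
models: {}         # see "Models Configuration" below
tokenizer: {}      # see "Tokenizer Configuration" below
training: {}       # see "Training Configuration" below
distillation: {}   # see "Distillation Configuration" below
model_config: {}   # see "Model Configuration" below
lora: {}           # see "LoRA Configuration" below
quantization: {}   # see "Quantization Configuration" below
execution: {}      # see "Execution Configuration" below
```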
### Dataset Configuration (`dataset`)
| Parameter | Type | Description |
|---|---|---|
| name | string | Path or Hugging Face dataset name |
| split | string | Dataset split to use (e.g., “train”, “validation”) |
| logits_file | string \| null | Path to TFRecord file with pre-computed logits (null for on-the-fly) |
| num_samples | number \| null | Maximum number of samples to use (null for all) |
| select_range | [number, number] \| null | Range of samples to select [start, end] (null for all) |
| format_function | string \| null | Name of formatter function (see Formatters section) |
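As an illustration, a `dataset` block might look like the sketch below (YAML shown for readability); the dataset ID and formatter name are hypothetical placeholders.

```yaml
dataset:
  name: "tatsu-lab/alpaca"      # hypothetical Hugging Face dataset ID (or a local path)
  split: "train"
  logits_file: null             # null = compute teacher logits on the fly
  num_samples: 10000            # cap the number of samples; null uses the full split
  select_range: null            # or [0, 10000] to slice a specific range
  format_function: "sharegpt"   # hypothetical formatter name; see the Formatters section
```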
### Models Configuration (`models`)
| Parameter | Type | Description |
|---|---|---|
| teacher | string \| null | Teacher model path/ID (needed if logits_file is null) |
| student | string | Student model path/ID |
| student_adapter | string \| null | Path to pre-trained student adapter (e.g., LoRA) |
| teacher_adapter | string \| null | Path to pre-trained teacher adapter |
| teacher_vocab_size | number | Vocabulary size of teacher model (required if using logits_file) |
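A sketch of a `models` block for on-the-fly distillation (i.e., `logits_file` is null, so a teacher must be given); both model IDs are hypothetical examples.

```yaml
models:
  teacher: "meta-llama/Llama-3.1-8B-Instruct"   # hypothetical teacher; required when logits_file is null
  student: "meta-llama/Llama-3.2-1B-Instruct"   # hypothetical student
  student_adapter: null      # optional path to an existing LoRA adapter for the student
  teacher_adapter: null
  teacher_vocab_size: null   # only needed when reading pre-computed logits from a TFRecord file
```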
### Tokenizer Configuration (`tokenizer`)
| Parameter | Type | Description |
|---|---|---|
| max_length | number | Maximum sequence length for truncation/filtering |
| chat_template | string \| null | Optional Jinja chat template string |
| student_pad_token_id | number | Pad token ID for student tokenizer |
| teacher_pad_token_id | number | Pad token ID for teacher tokenizer |
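A minimal `tokenizer` block might look like this; the pad token IDs below are placeholders, so use the actual IDs from your tokenizers.

```yaml
tokenizer:
  max_length: 2048          # sequences longer than this are truncated/filtered
  chat_template: null       # or a Jinja template string to override the default
  student_pad_token_id: 0   # placeholder; use the student tokenizer's real pad ID
  teacher_pad_token_id: 0   # placeholder; use the teacher tokenizer's real pad ID
```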
### Training Configuration (`training`)

This section contains standard Hugging Face `TrainingArguments` parameters. Here are the most common ones:
| Parameter | Type | Description |
|---|---|---|
| output_dir | string | Directory to save model checkpoints and results |
| num_train_epochs | number | Number of training epochs |
| per_device_train_batch_size | number | Batch size per GPU |
| gradient_accumulation_steps | number | Number of steps to accumulate gradients over before each optimizer update |
| save_steps | number | Save checkpoint every N steps |
| logging_steps | number | Log metrics every N steps |
| learning_rate | number | Initial learning rate |
| warmup_ratio | number | Ratio of steps for learning rate warmup |
| lr_scheduler_type | string | LR scheduler (e.g., “cosine”, “linear”) |
| resume_from_checkpoint | string \| null | Path to checkpoint to resume from |
| bf16 | boolean | Enable bfloat16 mixed precision training |
| fp16 | boolean | Enable float16 mixed precision training |
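Because this block maps onto Hugging Face `TrainingArguments`, other standard arguments should also be accepted. An illustrative example with common settings (the values are starting points, not recommendations):

```yaml
training:
  output_dir: "/vol/distilled_model"   # checkpoints and the final model land here
  num_train_epochs: 3
  per_device_train_batch_size: 2
  gradient_accumulation_steps: 8       # effective batch size = 2 x 8 x num_gpus
  learning_rate: 2.0e-5
  warmup_ratio: 0.05
  lr_scheduler_type: "cosine"
  logging_steps: 10
  save_steps: 500
  bf16: true                           # use fp16: true instead on GPUs without bfloat16 support
  fp16: false
  resume_from_checkpoint: null
```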
### Distillation Configuration (`distillation`)
| Parameter | Type | Description |
|---|---|---|
| temperature | number | Temperature for softening distributions (typically 2.0-4.0) |
| alpha | number | Weight for the distillation loss (between 0 and 1) |
| loss_type | string | Distillation loss type: “fkl”, “kld”, “uld”, “multi-ot” |
| student_response_template | string | Template for student response (used in uld/multi-ot) |
| teacher_response_template | string | Template for teacher response (used in uld/multi-ot) |
| k | number | Top-k parameter for “uld” and “multi-ot” losses |
| loss_kwargs | object | Additional parameters for the “multi-ot” loss type (“log_loss_weight”, “sikhorn_loss_weight”) |
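As a rule of thumb (the exact combination depends on the loss implementation you select), the training objective blends the two losses roughly as `alpha * distillation_loss + (1 - alpha) * task_loss`, with logits softened by `temperature` before the divergence is computed. An illustrative block follows; the response templates are hypothetical strings that should match your chat format.

```yaml
distillation:
  temperature: 2.0    # softens teacher/student distributions
  alpha: 0.5          # 0 = task loss only, 1 = distillation loss only
  loss_type: "fkl"    # one of "fkl", "kld", "uld", "multi-ot"
  # The fields below only matter for the "uld" / "multi-ot" losses:
  student_response_template: "<|assistant|>"   # hypothetical template string
  teacher_response_template: "<|assistant|>"   # hypothetical template string
  k: 100              # top-k logits to compare
  loss_kwargs: {}     # e.g. log_loss_weight / sikhorn_loss_weight for "multi-ot"
```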
### Model Configuration (`model_config`)
| Parameter | Type | Description |
|---|---|---|
| use_flash_attention | boolean | Enable Flash Attention 2 during model loading |
| trust_remote_code | boolean | Set trust_remote_code for model loading |
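Both options are plain booleans; note that Flash Attention 2 also requires the `flash-attn` package and a supported GPU.

```yaml
model_config:
  use_flash_attention: true   # requires flash-attn and compatible hardware
  trust_remote_code: false    # set true only for models whose custom code you trust
```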
### LoRA Configuration (`lora`)
| Parameter | Type | Description |
|---|---|---|
| enable_training | boolean | Enable LoRA training for the student model |
| r | number | LoRA rank (typically 8-64) |
| alpha | number | LoRA alpha scaling factor (typically 2×r) |
| dropout | number | Dropout probability in LoRA layers |
| bias | string | LoRA bias type: “none”, “all”, “lora_only” |
| task_type | string | Type of task (usually “CAUSAL_LM”) |
| target_modules | array of strings | List of modules to apply LoRA to (e.g., “q_proj”, “k_proj”) |
| modules_to_save | array of strings | Additional modules to make trainable |
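An illustrative `lora` block; the target module names follow common Llama/Mistral-style attention naming and are an assumption, so adjust them to your student architecture.

```yaml
lora:
  enable_training: true
  r: 16
  alpha: 32           # commonly 2 x r
  dropout: 0.05
  bias: "none"
  task_type: "CAUSAL_LM"
  target_modules: ["q_proj", "k_proj", "v_proj", "o_proj"]   # architecture-dependent
  modules_to_save: []   # e.g. ["lm_head"] to keep extra modules trainable
```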
### Quantization Configuration (`quantization`)
| Parameter | Type | Description |
|---|---|---|
| enabled | boolean | Enable 4-bit quantization (BitsAndBytes NF4) |
### Execution Configuration (`execution`)
| Parameter | Type | Description |
|---|---|---|
| use_accelerate | boolean | Whether HF Accelerate is used (for distributed training) |
| accelerate_config | string \| null | Path to accelerate config file (only required when using modal) |
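A short sketch of the `execution` block; the accelerate config path is a hypothetical example.

```yaml
execution:
  use_accelerate: true
  accelerate_config: "configs/accelerate_fsdp.yaml"   # hypothetical path; per the table above, only required when using modal
```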
## Sample Configuration
Here’s a complete example configuration file for a typical distillation scenario.
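The sketch below (shown as YAML) illustrates one way such a file could look, assembled from the parameters documented above; the model and dataset IDs, file paths, and numeric values are hypothetical placeholders to adapt to your setup.

```yaml
project_name: "llama-distillation"   # hypothetical project name

dataset:
  name: "tatsu-lab/alpaca"           # hypothetical dataset ID
  split: "train"
  logits_file: "/vol/logits/teacher_logits.tfrecord"   # hypothetical path to pre-computed teacher logits
  num_samples: null
  select_range: null
  format_function: "sharegpt"        # hypothetical formatter name

models:
  teacher: null                      # not needed when logits_file is set
  student: "meta-llama/Llama-3.2-1B-Instruct"   # hypothetical student model
  student_adapter: null
  teacher_adapter: null
  teacher_vocab_size: 128256         # hypothetical; must match the teacher used to write the logits

tokenizer:
  max_length: 2048
  chat_template: null
  student_pad_token_id: 0            # placeholder; use your tokenizer's real pad ID
  teacher_pad_token_id: 0

training:
  output_dir: "/vol/distilled_model"
  num_train_epochs: 3
  per_device_train_batch_size: 2
  gradient_accumulation_steps: 8
  learning_rate: 2.0e-5
  warmup_ratio: 0.05
  lr_scheduler_type: "cosine"
  logging_steps: 10
  save_steps: 500
  bf16: true
  fp16: false
  resume_from_checkpoint: null

distillation:
  temperature: 2.0
  alpha: 0.5
  loss_type: "fkl"
  k: 100
  loss_kwargs: {}

model_config:
  use_flash_attention: true
  trust_remote_code: false

lora:
  enable_training: true
  r: 16
  alpha: 32
  dropout: 0.05
  bias: "none"
  task_type: "CAUSAL_LM"
  target_modules: ["q_proj", "k_proj", "v_proj", "o_proj"]
  modules_to_save: []

quantization:
  enabled: false

execution:
  use_accelerate: true
  accelerate_config: null

hf_token: null
```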
**Important:** Paths specified in the configuration (e.g., `dataset.logits_file`, `training.output_dir`, model paths) should point to locations within your accessible storage volume. In the example above, paths like `/vol/logits/...` and `/vol/distilled_model` assume your data and output directories are mapped to `/vol` inside your execution environment (such as a container or VM).

## Configuration Tips
- **Memory Optimization:**
  - Enable quantization (`quantization.enabled: true`) for large models
  - Use LoRA (`lora.enable_training: true`) instead of full model fine-tuning
  - Adjust `tokenizer.max_length` based on your GPU memory
- **Training Speed:**
  - Enable Flash Attention 2 with `model_config.use_flash_attention: true`
  - Use bfloat16 mixed precision with `training.bf16: true` on compatible hardware
  - Increase `training.per_device_train_batch_size` if memory allows
  - Set up distributed training with `execution.use_accelerate: true`
- **Result Quality:**
  - Experiment with temperature values (typically between 1.0 and 4.0)
  - Adjust `alpha` to balance between distillation and task losses

