Configuration
Detailed Configuration Parameters
Here’s a complete reference for all available configuration options:
Top-Level Configuration
Parameter | Type | Description |
---|---|---|
project_name | string | Name for the distillation project (used for logging) |
dataset | object | Dataset configuration (see below) |
models | object | Model configuration (see below) |
tokenizer | object | Tokenizer configuration (see below) |
training | object | Training arguments (see below) |
distillation | object | Distillation-specific settings (see below) |
model_config | object | Model loading options (see below) |
lora | object | LoRA/PEFT configuration (see below) |
quantization | object | Quantization settings (see below) |
execution | object | Execution environment settings (see below) |
hf_token | string or null | Hugging Face API token for private models/datasets |
Dataset Configuration (`dataset`)
Parameter | Type | Description |
---|---|---|
name | string | Path or Hugging Face dataset name |
split | string | Dataset split to use (e.g., “train”, “validation”) |
logits_file | string or null | Path to TFRecord file with pre-computed logits (null for on-the-fly) |
num_samples | number or null | Maximum number of samples to use (null for all) |
select_range | [number, number] or null | Range of samples to select [start, end] (null for all) |
format_function | string or null | Name of formatter function (see Formatters section) |
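For example, a `dataset` block that loads a Hugging Face dataset and computes teacher logits on the fly might look like the sketch below (the dataset name and formatter name are placeholders):

```yaml
dataset:
  name: "your-org/your-dataset"      # placeholder Hugging Face dataset ID
  split: "train"
  logits_file: null                  # null -> teacher logits computed on the fly
  num_samples: 10000                 # cap the number of training samples
  select_range: null
  format_function: "default_format"  # placeholder; see the Formatters section
```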
Models Configuration (`models`)
Parameter | Type | Description |
---|---|---|
teacher | string or null | Teacher model path/ID (needed if logits_file is null) |
student | string | Student model path/ID |
student_adapter | string or null | Path to pre-trained student adapter (e.g., LoRA) |
teacher_adapter | string or null | Path to pre-trained teacher adapter |
teacher_vocab_size | number | Vocabulary size of teacher model (required if using logits_file) |
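A matching `models` sketch for the on-the-fly dataset example above; because `logits_file` is null there, a teacher model must be provided (model IDs are placeholders):

```yaml
models:
  teacher: "your-org/teacher-model-7b"   # needed because dataset.logits_file is null
  student: "your-org/student-model-1b"
  student_adapter: null
  teacher_adapter: null
  # teacher_vocab_size is only required when using a pre-computed logits_file
```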
Tokenizer Configuration (`tokenizer`)
Parameter | Type | Description |
---|---|---|
max_length | number | Maximum sequence length for truncation/filtering |
chat_template | string or null | Optional Jinja chat template string |
student_pad_token_id | number | Pad token ID for student tokenizer |
teacher_pad_token_id | number | Pad token ID for teacher tokenizer |
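For instance (the pad token IDs below are placeholders; use the IDs defined by your student and teacher tokenizers):

```yaml
tokenizer:
  max_length: 2048
  chat_template: null       # or a Jinja template string to override the default
  student_pad_token_id: 0   # placeholder; check your student tokenizer
  teacher_pad_token_id: 0   # placeholder; check your teacher tokenizer
```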
Training Configuration (`training`)
This section contains standard Hugging Face `TrainingArguments` parameters. Here are the most common ones:
Parameter | Type | Description |
---|---|---|
output_dir | string | Directory to save model checkpoints and results |
num_train_epochs | number | Number of training epochs |
per_device_train_batch_size | number | Batch size per GPU |
gradient_accumulation_steps | number | Number of batches to accumulate before each optimizer update |
save_steps | number | Save checkpoint every N steps |
logging_steps | number | Log metrics every N steps |
learning_rate | number | Initial learning rate |
warmup_ratio | number | Ratio of steps for learning rate warmup |
lr_scheduler_type | string | LR scheduler (e.g., “cosine”, “linear”) |
resume_from_checkpoint | string or null | Path to checkpoint to resume from |
bf16 | boolean | Enable bfloat16 mixed precision training |
fp16 | boolean | Enable float16 mixed precision training |
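A sketch of a typical `training` block (the values are illustrative starting points, not recommendations):

```yaml
training:
  output_dir: "./results"
  num_train_epochs: 3
  per_device_train_batch_size: 2
  gradient_accumulation_steps: 8   # effective batch of 16 per device
  learning_rate: 2.0e-5
  lr_scheduler_type: "cosine"
  warmup_ratio: 0.05
  bf16: true                       # or fp16: true on pre-Ampere GPUs
```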
Distillation Configuration (`distillation`)
Parameter | Type | Description |
---|---|---|
temperature | number | Temperature for softening distributions (typically 2.0-4.0) |
alpha | number | Weight for distillation loss (between 0-1) |
loss_type | string | Distillation loss type: “fkl”, “kld”, “uld”, “multi-ot” |
student_response_template | string | Template for student response (used in uld/multi-ot) |
teacher_response_template | string | Template for teacher response (used in uld/multi-ot) |
k | number | Top-k parameter for “uld” and “multi-ot” losses |
loss_kwargs | object | Additional parameters for the “multi-ot” loss type (e.g., “log_loss_weight”, “sikhorn_loss_weight”) |
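For example, a plain forward-KL setup could look like the sketch below; the response templates and `k` only come into play with the `uld` and `multi-ot` losses:

```yaml
distillation:
  temperature: 2.0     # softens teacher and student distributions
  alpha: 0.5           # weight on the distillation loss
  loss_type: "fkl"
  # only relevant for "uld" / "multi-ot":
  student_response_template: null
  teacher_response_template: null
  k: 100
  loss_kwargs: {}
```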
Model Configuration (`model_config`)
Parameter | Type | Description |
---|---|---|
use_flash_attention | boolean | Enable Flash Attention 2 during model loading |
trust_remote_code | boolean | Set trust_remote_code for model loading |
LoRA Configuration (`lora`)
Parameter | Type | Description |
---|---|---|
enable_training | boolean | Enable LoRA training for the student model |
r | number | LoRA rank (typically 8-64) |
alpha | number | LoRA alpha scaling factor (typically 2×r) |
dropout | number | Dropout probability in LoRA layers |
bias | string | LoRA bias type: “none”, “all”, “lora_only” |
task_type | string | Type of task (usually “CAUSAL_LM”) |
target_modules | array of strings | List of modules to apply LoRA to (e.g., “q_proj”, “k_proj”) |
modules_to_save | array of strings | Additional modules to make trainable |
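A common starting point is sketched below; target module names vary by architecture (the ones shown match Llama/Mistral-style attention projections):

```yaml
lora:
  enable_training: true
  r: 16
  alpha: 32                 # commonly set to 2 x r
  dropout: 0.05
  bias: "none"
  task_type: "CAUSAL_LM"
  target_modules: ["q_proj", "k_proj", "v_proj", "o_proj"]
  modules_to_save: []       # e.g., ["lm_head"] to also train the output head
```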
Quantization Configuration (`quantization`)
Parameter | Type | Description |
---|---|---|
enabled | boolean | Enable 4-bit quantization (BitsAndBytes NF4) |
Execution Configuration (`execution`)
Parameter | Type | Description |
---|---|---|
use_accelerate | boolean | Whether HF Accelerate is used (for distributed training) |
accelerate_config | string or null | Path to Accelerate config file (only required when using Modal) |
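For single-GPU runs both values can stay at their defaults; a multi-GPU sketch might look like:

```yaml
execution:
  use_accelerate: true     # launch training through Hugging Face Accelerate
  accelerate_config: null  # path to an Accelerate config file (only needed on Modal)
```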
Sample Configuration
Here’s an example configuration for a typical distillation scenario, sketched below in YAML (model IDs, dataset names, sizes, and paths are illustrative placeholders; adapt them to your setup):
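```yaml
# Sketch of a full config using pre-computed teacher logits; all IDs, sizes, and paths are placeholders.
project_name: "distillation-example"
hf_token: null                        # or your HF token for private models/datasets
dataset:
  name: "your-org/your-dataset"
  split: "train"
  logits_file: "/vol/logits/teacher_logits.tfrecord"   # hypothetical file under /vol
  num_samples: null
  select_range: null
  format_function: null
models:
  teacher: null                       # not needed when logits_file is provided
  student: "your-org/student-model-1b"
  student_adapter: null
  teacher_adapter: null
  teacher_vocab_size: 32000           # must match the teacher that produced the logits
tokenizer:
  max_length: 2048
  chat_template: null
  student_pad_token_id: 0             # placeholder; use your tokenizer's pad token ID
  teacher_pad_token_id: 0
training:
  output_dir: "/vol/distilled_model"
  num_train_epochs: 3
  per_device_train_batch_size: 2
  gradient_accumulation_steps: 8
  save_steps: 500
  logging_steps: 10
  learning_rate: 2.0e-5
  warmup_ratio: 0.05
  lr_scheduler_type: "cosine"
  resume_from_checkpoint: null
  bf16: true
  fp16: false
distillation:
  temperature: 2.0
  alpha: 0.5
  loss_type: "fkl"
  student_response_template: null
  teacher_response_template: null
  k: 100
  loss_kwargs: {}
model_config:
  use_flash_attention: true
  trust_remote_code: false
lora:
  enable_training: true
  r: 16
  alpha: 32
  dropout: 0.05
  bias: "none"
  task_type: "CAUSAL_LM"
  target_modules: ["q_proj", "k_proj", "v_proj", "o_proj"]
  modules_to_save: []
quantization:
  enabled: true
execution:
  use_accelerate: false
  accelerate_config: null
```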
Important: Paths specified in the configuration (e.g., `dataset.logits_file`, `training.output_dir`, model paths) should point to locations within your accessible storage volume. In the example above, paths like `/vol/logits/...` and `/vol/distilled_model` assume your data and output directories are mapped to `/vol` inside your execution environment (like a container or VM).
Configuration Tips
- Memory Optimization (see the sketch after this list):
  - Enable quantization (`quantization.enabled: true`) for large models
  - Use LoRA (`lora.enable_training: true`) instead of full model fine-tuning
  - Adjust `tokenizer.max_length` based on your GPU memory
- Training Speed:
  - Enable Flash Attention 2 with `model_config.use_flash_attention: true`
  - Use bfloat16 mixed precision with `training.bf16: true` on compatible hardware
  - Increase `training.per_device_train_batch_size` if memory allows
  - Set up distributed training with `execution.use_accelerate: true`
- Result Quality:
  - Experiment with temperature values (typically between 1.0 and 4.0)
  - Adjust alpha to balance between distillation and task losses
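A minimal fragment combining the memory and speed settings above (assumed values; tune them to your hardware):

```yaml
# Illustrative memory- and speed-oriented settings (adjust to your GPUs)
tokenizer:
  max_length: 1024                  # shorter sequences reduce activation memory
model_config:
  use_flash_attention: true         # Flash Attention 2
training:
  per_device_train_batch_size: 1
  gradient_accumulation_steps: 16   # keeps the effective batch size up
  bf16: true                        # on bf16-capable hardware; use fp16 otherwise
lora:
  enable_training: true
  r: 16
  alpha: 32
quantization:
  enabled: true                     # 4-bit NF4 base weights
```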