Policy Distillation
Train a student network to match a teacher’s behavior. Used in CSPL (Phase 2) to merge multiple specialist teachers into one generalist.
API
use rl4burn::algo::imitation::distillation::{distillation_loss, DistillationConfig};
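let config = DistillationConfig {
    temperature: 2.0,        // softmax temperature T (see Temperature)
    soft_weight: 1.0,        // weight on the soft-target term
    hard_weight: 0.0,        // weight on the hard-target term (0.0 here, so only soft targets contribute)
    t_squared_scaling: true, // scale the soft-target loss by T^2 (see T-squared scaling)
};
// teacher_logits and student_logits: action logits from each network for the same inputs.
let loss = distillation_loss(teacher_logits, student_logits, &config);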
Temperature
The logits are divided by the temperature T before the softmax, so a higher temperature produces a softer probability distribution. The student then learns from the teacher’s relative ordering of actions, not just its top choice.
- T=1: standard softmax (peaked)
- T=5: much softer (exposes teacher’s “second choice” preferences)
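The effect of the temperature can be checked with a few lines of plain Rust. This is a minimal sketch using only the standard library; `softmax_with_temperature` is a hypothetical helper written for illustration, not a function exported by rl4burn.

```rust
/// Softmax over `logits` after dividing each logit by `temperature`.
/// Illustrative helper only; not part of the rl4burn API.
fn softmax_with_temperature(logits: &[f32], temperature: f32) -> Vec<f32> {
    // Scale the logits by 1/T.
    let scaled: Vec<f32> = logits.iter().map(|&z| z / temperature).collect();
    // Subtract the maximum before exponentiating for numerical stability.
    let max = scaled.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scaled.iter().map(|&z| (z - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}
```

For logits `[2.0, 1.0, 0.0]`, T=1 gives roughly `[0.67, 0.24, 0.09]`, while T=5 gives roughly `[0.40, 0.33, 0.27]`: the ordering is unchanged, but the second and third actions carry much more of the probability mass.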
T-squared scaling
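Hinton et al. recommend multiplying the soft-target loss by T-squared. The gradients produced by soft targets scale as 1/T-squared, so without this correction the soft-target term shrinks relative to the hard-target term as the temperature rises.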
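The scaling can be sketched for a single sample as follows. This assumes the soft-target term is a KL divergence between the temperature-softened teacher and student distributions; the source does not say which divergence `distillation_loss` actually uses, and `soft_target_loss` and `softmax_t` are hypothetical names used only for this example.

```rust
/// Temperature-scaled softmax (illustrative helper, not part of rl4burn).
fn softmax_t(logits: &[f32], temperature: f32) -> Vec<f32> {
    let scaled: Vec<f32> = logits.iter().map(|&z| z / temperature).collect();
    let max = scaled.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scaled.iter().map(|&z| (z - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

/// Assumed form of the soft-target term: KL(teacher || student) at
/// temperature T, optionally multiplied by T^2.
fn soft_target_loss(
    teacher_logits: &[f32],
    student_logits: &[f32],
    temperature: f32,
    t_squared_scaling: bool,
) -> f32 {
    let p = softmax_t(teacher_logits, temperature);
    let q = softmax_t(student_logits, temperature);
    // Sum p * ln(p / q), skipping zero-probability teacher entries and
    // guarding against a student probability that underflowed to zero.
    let kl: f32 = p
        .iter()
        .zip(q.iter())
        .map(|(&pi, &qi)| {
            if pi > 0.0 {
                pi * (pi / qi.max(f32::MIN_POSITIVE)).ln()
            } else {
                0.0
            }
        })
        .sum();
    if t_squared_scaling {
        kl * temperature * temperature
    } else {
        kl
    }
}
```

With `temperature = 2.0`, the scaled loss is exactly four times the unscaled one, which compensates for the 1/T-squared shrinkage of the gradients.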
Value distillation
use rl4burn::algo::imitation::distillation::value_distillation_loss;
let vloss = value_distillation_loss(teacher_values, student_values);
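The source does not state which regression loss `value_distillation_loss` uses. One common choice for matching a student's value estimates to a teacher's is mean squared error, sketched below on plain slices; `value_mse_sketch` is a hypothetical stand-in, and the real function may differ.

```rust
/// Mean squared error between teacher and student value estimates.
/// Assumed loss form for illustration; not the rl4burn implementation.
fn value_mse_sketch(teacher_values: &[f32], student_values: &[f32]) -> f32 {
    assert_eq!(teacher_values.len(), student_values.len());
    assert!(!teacher_values.is_empty());
    // Average the squared differences over all samples.
    let sum_sq: f32 = teacher_values
        .iter()
        .zip(student_values.iter())
        .map(|(&t, &s)| (s - t) * (s - t))
        .sum();
    sum_sq / teacher_values.len() as f32
}
```

The loss is zero when the student reproduces the teacher's values exactly and grows quadratically with the gap between them.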