Policy Distillation

Train a student network to match a teacher’s behavior. Used in CSPL (Phase 2) to merge multiple specialist teachers into one generalist.

API

use rl4burn::algo::imitation::distillation::{distillation_loss, DistillationConfig};

let config = DistillationConfig {
    temperature: 2.0,
    soft_weight: 1.0,
    hard_weight: 0.0,
    t_squared_scaling: true,
};

let loss = distillation_loss(teacher_logits, student_logits, &config);

Temperature

The logits are divided by the temperature T before the softmax is applied, so a higher temperature produces a softer probability distribution. The student then learns from the teacher's relative ranking of all actions, not just its top choice.

  • T=1: standard softmax (peaked)
  • T=5: much softer (exposes teacher’s “second choice” preferences)
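To make the effect of temperature concrete, here is a minimal sketch in plain Rust, independent of rl4burn. The function name `softmax_with_temperature` is hypothetical and is not part of the library's API; it only illustrates the standard temperature-scaled softmax.

```rust
/// Softmax over `logits / temperature`, computed in a numerically stable way.
/// Hypothetical helper for illustration; not part of rl4burn.
fn softmax_with_temperature(logits: &[f32], temperature: f32) -> Vec<f32> {
    assert!(temperature > 0.0, "temperature must be positive");
    // Subtract the largest logit so that exp() cannot overflow.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits
        .iter()
        .map(|&z| ((z - max) / temperature).exp())
        .collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}
```

For logits `[2.0, 1.0, 0.0]`, T=1 puts roughly two thirds of the probability mass on the first action, while T=5 spreads it much more evenly and still preserves the ordering of the three actions.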

T-squared scaling

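Hinton et al. recommend scaling the soft-target loss by T-squared. The gradients produced by soft targets shrink in proportion to 1/T-squared, so without this scaling the soft-target term contributes less and less relative to the hard-target term as T increases.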
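The scaling can be sketched in plain Rust as follows. This is an illustration under stated assumptions, not rl4burn's implementation: it assumes the soft-target term is the KL divergence between the temperature-softened teacher and student distributions, and the names `log_softmax` and `soft_target_loss` are hypothetical.

```rust
/// Log-softmax of `logits / temperature` (numerically stable).
/// Hypothetical helper for illustration; not part of rl4burn.
fn log_softmax(logits: &[f32], temperature: f32) -> Vec<f32> {
    let scaled: Vec<f32> = logits.iter().map(|&z| z / temperature).collect();
    let max = scaled.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let log_sum = scaled.iter().map(|&z| (z - max).exp()).sum::<f32>().ln() + max;
    scaled.iter().map(|&z| z - log_sum).collect()
}

/// KL(teacher || student) on temperature-softened distributions.
/// Assumption: KL divergence is used as the soft-target loss.
fn soft_target_loss(
    teacher_logits: &[f32],
    student_logits: &[f32],
    temperature: f32,
    t_squared_scaling: bool,
) -> f32 {
    let log_p = log_softmax(teacher_logits, temperature);
    let log_q = log_softmax(student_logits, temperature);
    let kl: f32 = log_p
        .iter()
        .zip(&log_q)
        .map(|(&lp, &lq)| lp.exp() * (lp - lq))
        .sum();
    // Multiply by T^2 to compensate for the 1/T^2 shrinkage of the gradients.
    if t_squared_scaling {
        kl * temperature * temperature
    } else {
        kl
    }
}
```

With T=2, enabling the scaling multiplies the loss (and therefore its gradient) by 4; with identical teacher and student logits the loss is zero either way.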

Value distillation

use rl4burn::algo::imitation::distillation::value_distillation_loss;
let vloss = value_distillation_loss(teacher_values, student_values);
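The page does not say which loss `value_distillation_loss` computes. A common choice for regressing a student's value estimates onto a teacher's is mean squared error, so the sketch below assumes that; the function `value_mse` is hypothetical and may not match the library's behavior.

```rust
/// Mean squared error between student and teacher value estimates.
/// Assumption: MSE is used; rl4burn's actual loss may differ.
fn value_mse(teacher_values: &[f32], student_values: &[f32]) -> f32 {
    assert_eq!(teacher_values.len(), student_values.len());
    assert!(!teacher_values.is_empty(), "need at least one value");
    let sum_sq: f32 = teacher_values
        .iter()
        .zip(student_values)
        .map(|(&t, &s)| (s - t).powi(2))
        .sum();
    sum_sq / teacher_values.len() as f32
}
```

For teacher values `[1.0, 2.0, 3.0]` and student values `[1.0, 2.5, 2.0]`, the squared errors are 0, 0.25, and 1, giving a mean of about 0.417.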