Behavioral Cloning

Behavioral cloning (BC) trains a policy to imitate expert demonstrations via supervised learning. JueWu showed that BC used as initialization provides ~64% of final RL performance.

API

use rl4burn::{bc_loss_discrete, bc_step};

// Single loss computation
let loss = bc_loss_discrete(logits, expert_actions, &device);

// Full training step (forward + backward + optimizer step)
let (model, loss_val) = bc_step(model, &mut optim, obs, expert_actions, lr, &device);
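
For intuition, here is a dependency-free sketch of the quantity a discrete BC loss typically measures: the mean cross-entropy between softmax(logits) and the expert's action indices. This is an assumption about what bc_loss_discrete computes, not its actual implementation. The names log_softmax and cross_entropy_discrete are hypothetical, and plain Vec<f64> rows stand in for tensors.

```rust
/// Numerically stable log-softmax of one row of logits.
fn log_softmax(logits: &[f64]) -> Vec<f64> {
    // Subtract the max before exponentiating to avoid overflow.
    let max = logits.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let log_sum_exp = max + logits.iter().map(|&x| (x - max).exp()).sum::<f64>().ln();
    logits.iter().map(|&x| x - log_sum_exp).collect()
}

/// Mean cross-entropy between policy logits and expert action indices.
/// `logits` has one row per sample; `expert_actions[i]` indexes into row `i`.
/// (Hypothetical helper for illustration; not part of rl4burn.)
fn cross_entropy_discrete(logits: &[Vec<f64>], expert_actions: &[usize]) -> f64 {
    assert_eq!(logits.len(), expert_actions.len(), "batch sizes must match");
    assert!(!logits.is_empty(), "batch must not be empty");
    let total: f64 = logits
        .iter()
        .zip(expert_actions)
        .map(|(row, &action)| -log_softmax(row)[action])
        .sum();
    total / logits.len() as f64
}
```

Calling cross_entropy_discrete with all-zero logits over four actions gives ln(4), and the loss shrinks toward zero as the logit of the expert's action grows relative to the others.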

Multi-head actions

For hierarchical action spaces:

use rl4burn::bc_loss_multi_head;

let loss = bc_loss_multi_head(logits, expert_actions, &[11, 30, 8], &device);
// head_sizes: action_type(11), target(30), ability(8)
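
The following sketch shows one plausible way a multi-head loss can be formed, using the head sizes [11, 30, 8] from the call above. It assumes the logits of all heads are concatenated into one flat vector per sample, that the expert supplies one action index per head, and that per-head cross-entropies are summed. None of these details are confirmed by the source, and the function names are hypothetical.

```rust
/// Cross-entropy of one head: -log softmax(logits)[action], computed stably.
fn head_cross_entropy(logits: &[f64], action: usize) -> f64 {
    let max = logits.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let log_sum_exp = max + logits.iter().map(|&x| (x - max).exp()).sum::<f64>().ln();
    log_sum_exp - logits[action]
}

/// Splits one sample's flat logits into heads and sums the per-head losses.
/// Assumption: heads are laid out back to back in `head_sizes` order.
fn multi_head_cross_entropy(flat_logits: &[f64], actions: &[usize], head_sizes: &[usize]) -> f64 {
    assert_eq!(actions.len(), head_sizes.len(), "one action per head");
    assert_eq!(
        flat_logits.len(),
        head_sizes.iter().sum::<usize>(),
        "logit width must equal the sum of head sizes"
    );
    let mut offset = 0;
    let mut loss = 0.0;
    for (&size, &action) in head_sizes.iter().zip(actions) {
        assert!(action < size, "action index out of range for its head");
        loss += head_cross_entropy(&flat_logits[offset..offset + size], action);
        offset += size;
    }
    loss
}
```

With head sizes [11, 30, 8] the flat logit vector has 49 entries. Under the summing assumption, all-zero logits give ln(11) + ln(30) + ln(8), whichever actions the expert chose.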

Tips

A uniform policy has a cross-entropy loss of exactly ln(K), where K is the number of actions. A freshly initialized policy is usually close to uniform, so the initial loss should be near ln(K). If it is much higher, something is wrong.
BC is most useful as weight initialization for RL, not as a standalone method. BC policies are brittle: they fail on states that are not in the training data.
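
The ln(K) check from the first tip can be automated before a long training run. The sketch below is illustrative only: uniform_baseline and initial_loss_looks_sane are hypothetical helpers, and the slack value is an arbitrary threshold you would choose yourself.

```rust
/// Expected cross-entropy (in nats) of a uniform policy over `num_actions` actions.
fn uniform_baseline(num_actions: usize) -> f64 {
    (num_actions as f64).ln()
}

/// Returns true when the observed initial loss is at most `slack` nats above ln(K).
/// `slack` is a hypothetical tolerance; pick one that suits your setup.
fn initial_loss_looks_sane(initial_loss: f64, num_actions: usize, slack: f64) -> bool {
    initial_loss <= uniform_baseline(num_actions) + slack
}
```

For K = 11 the baseline is ln(11), about 2.40 nats. An initial loss of 2.5 passes with a slack of 0.5, while an initial loss of 6.0 does not.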