Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

UPGO (Self-Imitation Learning)

UPGO (Upgoing Policy Gradient) reinforces only trajectories where the agent performed better than expected. Used by ROA-Star alongside V-trace.

API

use rl4burn::upgo_advantages;

let advantages = upgo_advantages(&rewards, &values, &dones, last_value, gamma);

How it works

At each timestep, UPGO checks if the one-step TD error is positive (did better than the value predicted):

  • Positive TD: Propagate the actual return backward (learn from this)
  • Negative TD: Truncate to the value estimate (ignore this)

This creates a self-imitation effect: the agent only reinforces actions that led to above-average outcomes.

When to use

UPGO is complementary to V-trace, not a replacement. ROA-Star uses both:

  • V-trace for stable off-policy value targets
  • UPGO for the policy gradient (only reinforce good trajectories)