UPGO (Self-Imitation Learning)

UPGO (the upgoing policy update) reinforces only the parts of a trajectory where the agent performed better than its value function expected. ROA-Star uses it alongside V-trace.

API

use rl4burn::upgo_advantages;

let advantages = upgo_advantages(&rewards, &values, &dones, last_value, gamma);

How it works

At each timestep, UPGO checks whether the one-step TD error, r_t + gamma * V(s_{t+1}) - V(s_t), is positive, that is, whether the step turned out better than the value function predicted:

Positive TD: keep propagating the actual return backward, so earlier steps learn from this outcome.
Negative TD: truncate the return to the value estimate, so earlier steps ignore this outcome.

This creates a self-imitation effect: the agent reinforces only actions that led to outcomes better than the value function predicted.
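The truncation rule above can be sketched in plain Rust. This is a minimal illustration, not rl4burn's implementation: the function name upgo_advantages_sketch, the f32 and bool slice argument types, and the choice to treat a TD error of exactly zero as "positive" are all assumptions made for the example. It follows the common formulation in which the return at step t bootstraps from the return at step t+1 only when step t+1 had a non-negative TD error, and otherwise from V(s_{t+1}).

```rust
/// Hypothetical stand-in for illustration only; argument types are assumed.
/// Returns one UPGO advantage (UPGO return minus V(s_t)) per timestep.
pub fn upgo_advantages_sketch(
    rewards: &[f32],
    values: &[f32],
    dones: &[bool],
    last_value: f32,
    gamma: f32,
) -> Vec<f32> {
    let n = rewards.len();
    assert_eq!(values.len(), n, "values must match rewards in length");
    assert_eq!(dones.len(), n, "dones must match rewards in length");

    let mut advantages = vec![0.0_f32; n];
    // State carried backward from step t+1.
    let mut next_value = last_value; // V(s_{t+1})
    let mut next_return = last_value; // UPGO return at t+1
    // Beyond the rollout there is no observed step, so only bootstrap.
    let mut next_td_nonneg = false;

    for t in (0..n).rev() {
        // A terminal step cuts off everything that came after it.
        let (boot_value, boot_return, carry) = if dones[t] {
            (0.0, 0.0, false)
        } else {
            (next_value, next_return, next_td_nonneg)
        };

        // Keep the actual return if step t+1 beat its value estimate,
        // otherwise truncate to the value estimate V(s_{t+1}).
        let tail = if carry { boot_return } else { boot_value };
        let target = rewards[t] + gamma * tail;
        advantages[t] = target - values[t];

        // One-step TD error at t decides what step t-1 is allowed to use.
        let td = rewards[t] + gamma * boot_value - values[t];
        next_td_nonneg = td >= 0.0; // assumption: zero counts as "positive"
        next_return = target;
        next_value = values[t];
    }
    advantages
}
```

For example, with rewards [1, 0, 1], values [0.5, 2, 0.5], no terminal steps, last_value = 0 and gamma = 1, the sketch returns [2.5, -1.0, 0.5]. The middle step has a negative TD error, so the first step bootstraps from V(s_1) = 2.0 instead of the lower actual return that followed.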

When to use

UPGO is complementary to V-trace, not a replacement. ROA-Star uses both:

V-trace for stable off-policy value targets
UPGO for the policy gradient (only reinforce good trajectories)
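
The split described above might be wired together roughly as follows. This is a sketch under stated assumptions, not ROA-Star's or rl4burn's actual loss: combined_loss_sketch, its argument names, and the value_coef weighting are hypothetical, and the inputs are assumed to be precomputed per-timestep scalars. In a real training loop the advantages and V-trace targets would be treated as constants, with gradients flowing only through the log-probabilities and value predictions.

```rust
/// Hypothetical helper for illustration only (not part of rl4burn).
/// Returns (policy_loss, value_loss, total_loss).
pub fn combined_loss_sketch(
    log_probs: &[f32],      // log pi(a_t | s_t) for the actions taken
    upgo_adv: &[f32],       // UPGO advantages, treated as constants
    values: &[f32],         // value predictions V(s_t)
    vtrace_targets: &[f32], // V-trace value targets, treated as constants
    value_coef: f32,        // assumed weighting of the value loss
) -> (f32, f32, f32) {
    let n = log_probs.len();
    assert!(n > 0, "need at least one timestep");
    assert_eq!(upgo_adv.len(), n);
    assert_eq!(values.len(), n);
    assert_eq!(vtrace_targets.len(), n);

    // Policy gradient surrogate: raise the log-probability of actions
    // in proportion to their UPGO advantage.
    let weighted_sum: f32 = log_probs
        .iter()
        .zip(upgo_adv)
        .map(|(lp, adv)| lp * adv)
        .sum();
    let policy_loss = -weighted_sum / n as f32;

    // Value regression toward the V-trace targets.
    let squared_sum: f32 = values
        .iter()
        .zip(vtrace_targets)
        .map(|(v, target)| 0.5 * (target - v).powi(2))
        .sum();
    let value_loss = squared_sum / n as f32;

    (policy_loss, value_loss, policy_loss + value_coef * value_loss)
}
```

Keeping the two terms separate makes the division of labor explicit: the value head is trained only on V-trace targets, while UPGO only shapes which actions the policy is pushed toward.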