AlphaStar & ROA-Star
The 30-second version
AlphaStar (DeepMind, 2019) was the first AI to beat a top professional StarCraft II player. ROA-Star (Tencent, NeurIPS 2023) achieves the same level with 4x less compute by adding opponent modeling and smarter exploiter training.
Both are massive RL systems, but their core ideas decompose into modular building blocks — most of which are in rl4burn.
What makes StarCraft II hard for RL?
Imagine playing chess, except:
- You can only see part of the board (fog of war)
- Both players move simultaneously
- You control 200 pieces at once
- Each piece has 10+ possible actions
- Games last 20+ minutes (thousands of decisions)
Standard RL algorithms break under this complexity. AlphaStar’s solution: decompose the problem.
Key ideas (and where they are in rl4burn)
1. Auto-regressive action space
Instead of choosing from millions of possible joint actions, AlphaStar samples one decision at a time:
action_type → delay → queue → selected_units → target_unit → target_location
Each head is conditioned on the previous samples. This is exactly what CompositeDistribution provides.
use rl4burn::CompositeDistribution;
let dist = CompositeDistribution::from_heads(
&["action_type", "target", "ability"],
&[11, 30, 8],
);
See Auto-Regressive Action Distributions for details.
2. V-trace for off-policy correction
With thousands of parallel actors, the behavior policy is always slightly stale. V-trace corrects for this. Already in rl4burn as vtrace_targets.
See V-trace.
3. UPGO (self-imitation learning)
Only learn from experiences where you did better than expected. If the return exceeds the value baseline, reinforce it. Otherwise, ignore it.
use rl4burn::upgo_advantages;
let advantages = upgo_advantages(&rewards, &values, &dones, last_value, gamma);
See UPGO.
4. League training with PFSP
Instead of just self-play, AlphaStar trains a league of agents:
- Main agent: plays against everyone
- Main exploiter: specializes in beating the main agent
- League exploiters: find weaknesses across the entire pool
Opponents are sampled using PFSP — harder opponents (lower win rate) get sampled more often.
use rl4burn::{League, AgentRole, LeagueAgentConfig, PfspMatchmaking};
See League Training and PFSP Matchmaking.
5. ROA-Star’s additions
ROA-Star adds two ideas:
- Beta-VAE opponent modeling: A frozen encoder predicts what the opponent is doing behind fog of war. The latent embedding is fed to all agents as extra context. See Beta-VAE Opponent Modeling.
- Goal-conditioned exploiters: Exploiters are conditioned on strategy descriptors z, letting them specialize rapidly. See Goal-Conditioned RL.
Further reading
- AlphaStar paper (Nature, 2019)
- ROA-Star paper (NeurIPS, 2023)