DreamerV3
The 30-second version
DreamerV3 (Hafner et al., Nature 2025) learns a world model from experience and then trains its policy entirely on trajectories imagined by that model. It works across very different domains (Atari, robotic control, Minecraft) with one fixed set of hyperparameters, so no per-domain tuning is needed.
A key ingredient: the symlog transform, which together with twohot encoding keeps gradient magnitudes roughly independent of reward scale.
What’s a world model?
Most RL algorithms learn by trial and error directly in the real environment. World-model methods move most of that learning into a learned simulator:
1. Play the game a bit and store the transitions
2. Train a model to predict what happens next (the “world model”)
3. Imagine thousands of trajectories inside the model
4. Train the policy on the imagined data
Step 3 needs no environment interaction, only compute, which is what makes DreamerV3 so sample-efficient.
The RSSM (how the world model works)
The RSSM (Recurrent State-Space Model) has 5 networks:
| Component | Input | Output | Purpose |
|---|---|---|---|
| Sequence model (GRU) | h_{t-1}, z_{t-1}, a_{t-1} | h_t | Deterministic memory |
| Encoder (posterior) | h_t, observation | z_t | What actually happened |
| Dynamics (prior) | h_t | z_hat_t | What the model predicts |
| Reward predictor | h_t, z_t | reward | Expected reward |
| Continue predictor | h_t, z_t | continue prob | Is the episode over? |
The state is (h_t, z_t), where h_t is the deterministic GRU hidden state and z_t is a stochastic categorical variable: 32 groups × 32 classes, one one-hot vector per group, which gives 1024 dims when the groups are concatenated.
use rl4burn::{Rssm, RssmConfig};
let rssm = RssmConfig::new(obs_dim, action_dim).init(&device);
let state = rssm.initial_state(batch_size, &device);
// Training: use observations
let (next_state, post_logits, prior_logits) = rssm.obs_step(&state, action, obs);
// Imagination: no observations needed
let next_state = rssm.imagine_step(&state, action);
See RSSM.
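To make the 32 × 32 latent layout concrete, here is a minimal standalone sketch in plain Rust. It does not use rl4burn or Burn; `mode_one_hot` is a hypothetical helper, and it takes the most likely class per group so the result is deterministic, where training would typically sample instead.

```rust
/// Number of categorical groups and classes per group, as stated above.
const GROUPS: usize = 32;
const CLASSES: usize = 32;

/// Turn per-group logits into a flat one-hot latent by picking the most
/// likely class in each group (ties resolve to the lowest class index).
fn mode_one_hot(logits: &[[f32; CLASSES]; GROUPS]) -> Vec<f32> {
    let mut z = vec![0.0f32; GROUPS * CLASSES];
    for (g, group) in logits.iter().enumerate() {
        // Index of the largest logit in this group.
        let best = group
            .iter()
            .enumerate()
            .fold(0, |best, (c, &v)| if v > group[best] { c } else { best });
        // Group g occupies the slice [g * CLASSES, (g + 1) * CLASSES).
        z[g * CLASSES + best] = 1.0;
    }
    z
}
```

The flat vector always has length 1024 and exactly 32 ones, one per group.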
Symlog: the key to fixed hyperparameters
A major obstacle to using one configuration across domains is reward scale. Rewards in some Atari games reach the hundreds or thousands, while a robotic control task might keep rewards between -1 and 0. Without normalization, you need different learning rates for each.
DreamerV3 solves this with symlog: symlog(x) = sign(x) * ln(|x| + 1). This compresses large values and stays approximately linear near zero. Its inverse is symexp(x) = sign(x) * (exp(|x|) - 1). Combined with twohot encoding (distributional predictions), gradient magnitudes become independent of value scale.
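The two formulas are small enough to write out directly. This is a scalar sketch in plain Rust for illustration, not rl4burn's tensor implementation of `symlog` and `symexp`.

```rust
/// symlog(x) = sign(x) * ln(|x| + 1)
fn symlog(x: f64) -> f64 {
    // ln_1p(a) computes ln(1 + a) accurately for small a.
    x.signum() * x.abs().ln_1p()
}

/// symexp(x) = sign(x) * (exp(|x|) - 1), the inverse of symlog
fn symexp(x: f64) -> f64 {
    // exp_m1(a) computes exp(a) - 1 accurately for small a.
    x.signum() * x.abs().exp_m1()
}
```

For example, a reward of 1000 maps to ln(1001), about 6.91, while a reward of 0.01 stays close to 0.01.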
use burn::tensor::activation::softmax; // assumed source of `softmax` used below
use rl4burn::{symlog, symexp, TwohotEncoder};
let encoder = TwohotEncoder::new(); // 255 bins, [-20, 20] symlog space
let targets = encoder.encode(values, &device); // [batch, 255]
let loss = encoder.loss(logits, values, &device); // cross-entropy
let decoded = encoder.decode(softmax(logits, 1), &device); // back to scalars
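To show what a twohot target looks like, here is a scalar sketch in plain Rust. It assumes 255 bins evenly spaced over [-20, 20] in symlog space, matching the comment above. The function names are hypothetical and this is not rl4burn's `TwohotEncoder`.

```rust
const NUM_BINS: usize = 255;
const LOW: f64 = -20.0;
const HIGH: f64 = 20.0;

fn symlog(x: f64) -> f64 {
    x.signum() * x.abs().ln_1p()
}

fn symexp(x: f64) -> f64 {
    x.signum() * x.abs().exp_m1()
}

/// Location of bin `i` in symlog space (evenly spaced from LOW to HIGH).
fn bin_center(i: usize) -> f64 {
    LOW + (HIGH - LOW) * i as f64 / (NUM_BINS - 1) as f64
}

/// Encode a scalar as a twohot vector: the weight is split between the two
/// bins that bracket symlog(x), in proportion to how close each one is.
fn twohot_encode(x: f64) -> Vec<f64> {
    let y = symlog(x).clamp(LOW, HIGH);
    let step = (HIGH - LOW) / (NUM_BINS - 1) as f64;
    // Bin at or below y, capped so that lower + 1 is still a valid index.
    let lower = (((y - LOW) / step).floor() as usize).min(NUM_BINS - 2);
    let upper_weight = ((y - bin_center(lower)) / step).clamp(0.0, 1.0);
    let mut target = vec![0.0; NUM_BINS];
    target[lower] = 1.0 - upper_weight;
    target[lower + 1] = upper_weight;
    target
}

/// Decode bin probabilities: expected bin location, mapped back with symexp.
fn twohot_decode(probs: &[f64]) -> f64 {
    let expected: f64 = probs
        .iter()
        .enumerate()
        .map(|(i, p)| p * bin_center(i))
        .sum();
    symexp(expected)
}
```

Encoding then decoding recovers the original value as long as its symlog lies within [-20, 20]. The training loss is the cross-entropy between the predicted bin distribution and this target.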
See Symlog and Twohot Encoding.
KL balancing: training the world model
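The RSSM is trained with two KL losses between the posterior and the prior:
- Dynamics loss: KL(sg(posterior) || prior). Make the prior match the posterior (train the predictor)
- Representation loss: KL(posterior || sg(prior)). Make the posterior predictable (don’t be too complex)
Here sg is the stop-gradient, so the two terms have the same value and differ only in which side receives gradients. Each term is also clipped from below at a “free bits” threshold of 1 nat, so a KL already below 1 nat contributes no gradient.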
use rl4burn::{kl_balanced_loss, KlBalanceConfig};
let config = KlBalanceConfig {
    dyn_weight: 0.5,
    rep_weight: 0.1,
    free_bits: 1.0,
};
let loss = kl_balanced_loss(posterior_logits, prior_logits, &config);
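The arithmetic behind the balanced loss can be sketched on plain probability vectors. This is an illustrative scalar version with hypothetical function names, not rl4burn's `kl_balanced_loss`. It assumes the per-group KLs are summed before the free-bits clip, and it cannot express the stop-gradient, which only matters under autodiff.

```rust
/// KL(p || q) in nats for one categorical distribution given as probabilities.
fn categorical_kl(p: &[f64], q: &[f64]) -> f64 {
    p.iter()
        .zip(q)
        .filter(|(pi, _)| **pi > 0.0) // terms with p_i = 0 contribute 0
        .map(|(pi, qi)| pi * (pi / qi).ln())
        .sum()
}

/// Scalar value of the balanced KL loss for one latent state.
/// `posterior` and `prior` hold one probability vector per categorical group.
fn kl_balanced_value(
    posterior: &[Vec<f64>],
    prior: &[Vec<f64>],
    dyn_weight: f64,
    rep_weight: f64,
    free_bits: f64,
) -> f64 {
    // Sum the per-group KLs to get the KL of the whole latent (assumption).
    let kl: f64 = posterior
        .iter()
        .zip(prior)
        .map(|(p, q)| categorical_kl(p, q))
        .sum();
    // Free bits: clip from below, so a KL under the threshold is a constant.
    let clipped = kl.max(free_bits);
    // Under autodiff the two terms differ only in stop-gradient placement:
    //   dynamics:       KL(sg(posterior) || prior)
    //   representation: KL(posterior || sg(prior))
    // Their values are identical, so the value is a weighted sum.
    dyn_weight * clipped + rep_weight * clipped
}
```

With identical posterior and prior the KL is 0, the clip raises it to 1 nat, and the value is 0.5 + 0.1 = 0.6 with zero gradient.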
See KL Balancing with Free Bits.
Imagination rollouts
Once the world model is trained, generate trajectories purely in latent space:
use rl4burn::algo::planning::imagination::{imagine_rollout, lambda_returns};
let trajectory = imagine_rollout(&rssm, initial_states, |h, z| actor(h, z), 15);
// trajectory.states: 16 states (initial + 15 steps)
// trajectory.reward_logits: 15 predicted reward distributions
Compute lambda-returns on the imagined rewards, then train actor and critic on these imagined trajectories. The world model parameters are frozen during actor-critic training.
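The lambda-return recursion can be sketched in plain Rust as follows. It is a scalar illustration of the standard formula R_t = r_t + gamma * c_t * ((1 - lambda) * v_{t+1} + lambda * R_{t+1}), bootstrapped from the critic at the final state. The name `lambda_returns_sketch` and the slice-based signature are assumptions and may differ from rl4burn's `lambda_returns`.

```rust
/// Lambda-returns for one imagined trajectory.
/// `rewards[t]` and `continues[t]` belong to the step from state t to t + 1;
/// `values` has one more entry than `rewards`: the critic's value per state.
fn lambda_returns_sketch(
    rewards: &[f64],
    continues: &[f64],
    values: &[f64],
    gamma: f64,
    lambda: f64,
) -> Vec<f64> {
    let horizon = rewards.len();
    assert_eq!(continues.len(), horizon);
    assert_eq!(values.len(), horizon + 1);
    let mut returns = vec![0.0; horizon];
    // Bootstrap from the critic's value of the last imagined state.
    let mut next_return = values[horizon];
    // Work backwards so each step can reuse the return of the next one.
    for t in (0..horizon).rev() {
        let bootstrap = (1.0 - lambda) * values[t + 1] + lambda * next_return;
        next_return = rewards[t] + gamma * continues[t] * bootstrap;
        returns[t] = next_return;
    }
    returns
}
```

With lambda = 0 this reduces to one-step TD targets. With lambda = 1 it becomes the discounted reward sum plus the bootstrapped final value. A predicted continue probability of 0 cuts the return off at that step.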
See Imagination Rollouts.
Sequence replay buffer
DreamerV3 samples contiguous sequences (T=64) from a FIFO buffer, never crossing episode boundaries.
use rl4burn::{SequenceReplayBuffer, SequenceStep};
let mut buffer = SequenceReplayBuffer::new(1_000_000, 64);
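One way to keep sampled windows inside a single episode is sketched below in plain Rust. The `Step` and `SequenceBuffer` types and the `episode` counter are hypothetical and do not describe rl4burn's `SequenceReplayBuffer`. The sketch assumes episode ids never decrease in insertion order, so comparing the first and last step of a window is enough.

```rust
use std::collections::VecDeque;

/// One stored transition; `episode` is a hypothetical episode counter.
#[derive(Clone, Debug, PartialEq)]
struct Step {
    episode: u64,
    reward: f32,
}

/// Fixed-capacity FIFO buffer that only exposes single-episode windows.
struct SequenceBuffer {
    steps: VecDeque<Step>,
    capacity: usize,
    seq_len: usize,
}

impl SequenceBuffer {
    fn new(capacity: usize, seq_len: usize) -> Self {
        Self { steps: VecDeque::new(), capacity, seq_len }
    }

    /// Append a step, evicting the oldest one when the buffer is full.
    fn push(&mut self, step: Step) {
        if self.steps.len() == self.capacity {
            self.steps.pop_front();
        }
        self.steps.push_back(step);
    }

    /// Start indices whose `seq_len`-step window stays within one episode.
    fn valid_starts(&self) -> Vec<usize> {
        let n = self.steps.len();
        if self.seq_len == 0 || n < self.seq_len {
            return Vec::new();
        }
        (0..=n - self.seq_len)
            .filter(|&s| self.steps[s].episode == self.steps[s + self.seq_len - 1].episode)
            .collect()
    }

    /// Copy out the window beginning at `start` (taken from `valid_starts`).
    fn sequence(&self, start: usize) -> Vec<Step> {
        self.steps.range(start..start + self.seq_len).cloned().collect()
    }
}
```

Sampling a training sequence then means drawing one entry of `valid_starts()` with whatever random number generator the project already uses.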
Percentile return normalization
Instead of per-minibatch normalization, DreamerV3 tracks the range between the 5th and 95th percentiles of returns with an EMA and divides by max(1, range). The floor of 1 means small returns (for example under sparse rewards) are never scaled up, which would amplify noise.
use rl4burn::PercentileNormalizer;
let mut normalizer = PercentileNormalizer::new();
normalizer.update(&returns);
let normalized = normalizer.normalize(&advantages);
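The update and the max(1, range) floor can be sketched in plain Rust. `PercentileScale` is a hypothetical type, not rl4burn's `PercentileNormalizer`. The nearest-rank percentile, the explicit `decay` parameter, and starting the EMA at the first batch's percentiles are all assumptions made for this illustration.

```rust
/// Tracks an EMA of the 5th and 95th percentiles of returns.
struct PercentileScale {
    low: f64,
    high: f64,
    decay: f64,
    initialized: bool,
}

impl PercentileScale {
    fn new(decay: f64) -> Self {
        Self { low: 0.0, high: 0.0, decay, initialized: false }
    }

    /// Nearest-rank percentile of a sorted, non-empty slice; `q` in [0, 1].
    fn percentile(sorted: &[f64], q: f64) -> f64 {
        let idx = (q * (sorted.len() - 1) as f64).round() as usize;
        sorted[idx]
    }

    /// Update the running percentiles from a batch of returns.
    fn update(&mut self, returns: &[f64]) {
        if returns.is_empty() {
            return;
        }
        let mut sorted = returns.to_vec();
        sorted.sort_by(|a, b| a.total_cmp(b));
        let lo = Self::percentile(&sorted, 0.05);
        let hi = Self::percentile(&sorted, 0.95);
        if self.initialized {
            self.low = self.decay * self.low + (1.0 - self.decay) * lo;
            self.high = self.decay * self.high + (1.0 - self.decay) * hi;
        } else {
            // First batch: start the EMA at the observed percentiles.
            self.low = lo;
            self.high = hi;
            self.initialized = true;
        }
    }

    /// Divisor max(1, P95 - P5); the floor of 1 never scales values up.
    fn scale(&self) -> f64 {
        (self.high - self.low).max(1.0)
    }

    fn normalize(&self, xs: &[f64]) -> Vec<f64> {
        let s = self.scale();
        xs.iter().map(|x| x / s).collect()
    }
}
```

When returns span a wide range, values are divided by that range. When the range is below 1, the divisor stays at 1 and values pass through unchanged.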
See Percentile Return Normalization.
Further reading
- DreamerV3 paper (Nature, 2025)
- DreamerV2 paper (ICLR, 2021)
- Original Dreamer paper (ICLR, 2020)