Cookbook

rl4burn ships with 15 runnable examples in the examples/ directory, organized into five tiers of increasing complexity. Each example is a standalone Cargo package that you can run with cargo run -p <name> --release.

Tier 1: Fundamentals

| Example | Command | Description |
|---|---|---|
| quickstart | cargo run -p quickstart --release | Minimal PPO on CartPole, the “hello world” of RL |
| ppo-annotated | cargo run -p ppo-annotated --release | Same as quickstart but with detailed comments explaining every line |
| config-driven | cargo run -p config-driven --release | Load hyperparameters from a TOML file instead of hardcoding them |

Tier 2: Environment Variations

| Example | Command | Description |
|---|---|---|
| custom-env | cargo run -p custom-env --release | Implement the Env trait for your own environment |
| ppo-continuous | cargo run -p ppo-continuous --release | PPO with continuous actions on Pendulum |
| ppo-multi-discrete | cargo run -p ppo-multi-discrete --release | PPO with multi-discrete action spaces |
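
To give a rough idea of what implementing an environment trait involves, here is a minimal sketch in plain Rust. The Env trait, StepResult, Corridor, Move, and run_episode below are all hypothetical names defined inside the snippet itself; rl4burn's actual Env trait may have different methods and types, so treat the custom-env example as the reference for the real signatures.

```rust
/// Hypothetical, simplified environment trait for illustration only.
/// rl4burn's real `Env` trait may differ; see the custom-env example.
trait Env {
    type Observation;
    type Action;

    /// Reset to the initial state and return the first observation.
    fn reset(&mut self) -> Self::Observation;

    /// Apply one action and return the resulting transition.
    fn step(&mut self, action: Self::Action) -> StepResult<Self::Observation>;
}

#[derive(Debug, Clone, PartialEq)]
struct StepResult<O> {
    observation: O,
    reward: f32,
    done: bool,
}

#[derive(Debug, Clone, Copy)]
enum Move {
    Left,
    Right,
}

/// Toy 1-D corridor: the agent starts at cell 0 and must reach `goal`.
struct Corridor {
    position: i32,
    goal: i32,
    steps: u32,
    max_steps: u32,
}

impl Corridor {
    fn new(goal: i32, max_steps: u32) -> Self {
        Corridor { position: 0, goal, steps: 0, max_steps }
    }
}

impl Env for Corridor {
    type Observation = i32;
    type Action = Move;

    fn reset(&mut self) -> i32 {
        self.position = 0;
        self.steps = 0;
        self.position
    }

    fn step(&mut self, action: Move) -> StepResult<i32> {
        self.position = match action {
            // A wall at cell 0 stops the agent from moving further left.
            Move::Left => (self.position - 1).max(0),
            Move::Right => self.position + 1,
        };
        self.steps += 1;
        let reached = self.position >= self.goal;
        StepResult {
            observation: self.position,
            // Small step penalty, +1 on reaching the goal.
            reward: if reached { 1.0 } else { -0.01 },
            // End on success or when the step budget runs out.
            done: reached || self.steps >= self.max_steps,
        }
    }
}

/// Run one episode with a fixed "always move right" policy.
/// Returns the final position and the undiscounted return.
fn run_episode<E>(env: &mut E) -> (i32, f32)
where
    E: Env<Observation = i32, Action = Move>,
{
    let mut position = env.reset();
    let mut total = 0.0;
    let mut done = false;
    while !done {
        let result = env.step(Move::Right);
        position = result.observation;
        total += result.reward;
        done = result.done;
    }
    (position, total)
}

fn main() {
    let mut env = Corridor::new(3, 10);
    env.reset();
    let bumped = env.step(Move::Left);
    println!("after Left at the wall: {:?}", bumped);

    let (position, total) = run_episode(&mut env);
    println!("final position {position}, return {total:.2}");
}
```

The associated types keep the sketch generic over observation and action spaces, which is one common way to let the same training loop drive discrete, continuous, and multi-discrete environments.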

Tier 3: Techniques

| Example | Command | Description |
|---|---|---|
| action-masking | cargo run -p action-masking --release | Invalid action masking with the masked PPO pipeline |
| reward-shaping | cargo run -p reward-shaping --release | Intrinsic rewards and reward shaping wrappers |
| lstm-policy | cargo run -p lstm-policy --release | Recurrent policy for partially observable environments |
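
Invalid action masking is commonly implemented by replacing the logits of invalid actions with negative infinity before the softmax, so those actions receive exactly zero probability. The sketch below shows that idea in plain Rust with no dependency on rl4burn or Burn. The function name masked_softmax is illustrative, the code assumes all valid logits are finite, and the masked PPO pipeline in the action-masking example may implement masking differently.

```rust
/// Softmax over `logits` that assigns zero probability to every entry
/// whose `valid` flag is false.
///
/// Returns `None` if the slices differ in length or no action is valid.
/// Illustrative helper, not part of rl4burn. Assumes valid logits are finite.
fn masked_softmax(logits: &[f32], valid: &[bool]) -> Option<Vec<f32>> {
    if logits.len() != valid.len() || !valid.iter().any(|&v| v) {
        return None;
    }

    // Invalid actions get a logit of -inf, so exp() maps them to exactly 0.
    let masked: Vec<f32> = logits
        .iter()
        .zip(valid)
        .map(|(&l, &v)| if v { l } else { f32::NEG_INFINITY })
        .collect();

    // Subtract the largest valid logit for numerical stability.
    let max = masked.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = masked.iter().map(|&l| (l - max).exp()).collect();
    let sum: f32 = exps.iter().sum();

    Some(exps.iter().map(|&e| e / sum).collect())
}

fn main() {
    let logits = [1.0, 2.0, 3.0];
    let valid = [true, false, true]; // action 1 is not allowed this step

    match masked_softmax(&logits, &valid) {
        Some(probs) => println!("masked probabilities: {:?}", probs),
        None => println!("no valid action available"),
    }
}
```

Masking the logits rather than zeroing probabilities after the softmax keeps the distribution normalized over valid actions only, which is the property a policy-gradient loss relies on.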

Tier 4: Multi-Agent & Game AI

| Example | Command | Description |
|---|---|---|
| self-play | cargo run -p self-play --release | Self-play training with an opponent pool |
| multi-agent | cargo run -p multi-agent --release | Shared-weight multi-agent training |
| curriculum | cargo run -p curriculum --release | Curriculum self-play learning (CSPL) |

Tier 5: Production

| Example | Command | Description |
|---|---|---|
| diagnostics | cargo run -p diagnostics --release | TensorBoard logging, video recording, and training diagnostics |
| hyperparameter-tuning | cargo run -p hyperparameter-tuning --release | Systematic hyperparameter sweeps |
| deploy-policy | cargo run -p deploy-policy --release | Export a trained policy for inference on a different backend |

Which algorithm should I use?

Use this decision guide to pick the right starting point:

| Scenario | Recommended algorithm | Start from example |
|---|---|---|
| Discrete actions (e.g., CartPole, Atari) | PPO or DQN | quickstart |
| Continuous actions (e.g., Pendulum, MuJoCo) | PPO with Gaussian policy | ppo-continuous |
| Multi-discrete actions (e.g., RTS games) | PPO with multi-head | ppo-multi-discrete |
| Invalid actions vary per step | Masked PPO | action-masking |
| Competitive game (1v1 or teams) | Self-play PPO | self-play |
| Partial observability | LSTM policy + PPO | lstm-policy |
| Multiple cooperating agents | Shared-weight PPO | multi-agent |
| Large observation space / model-based | DreamerV3 (future) | |
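
The guide can also be read as a simple lookup from scenario to starting point. The sketch below encodes the rows as a Rust match. The Scenario enum, Recommendation struct, and recommend function are illustrative names that are not part of rl4burn; only the algorithm and example strings come from the guide itself.

```rust
/// Scenarios from the decision guide (illustrative enum, not an rl4burn type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Scenario {
    DiscreteActions,
    ContinuousActions,
    MultiDiscreteActions,
    VaryingInvalidActions,
    CompetitiveGame,
    PartialObservability,
    CooperatingAgents,
    ModelBased,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Recommendation {
    algorithm: &'static str,
    /// `None` where the guide lists no example to start from.
    example: Option<&'static str>,
}

/// Map a scenario to the algorithm and example named in the guide.
fn recommend(scenario: Scenario) -> Recommendation {
    let (algorithm, example) = match scenario {
        Scenario::DiscreteActions => ("PPO or DQN", Some("quickstart")),
        Scenario::ContinuousActions => ("PPO with Gaussian policy", Some("ppo-continuous")),
        Scenario::MultiDiscreteActions => ("PPO with multi-head", Some("ppo-multi-discrete")),
        Scenario::VaryingInvalidActions => ("Masked PPO", Some("action-masking")),
        Scenario::CompetitiveGame => ("Self-play PPO", Some("self-play")),
        Scenario::PartialObservability => ("LSTM policy + PPO", Some("lstm-policy")),
        Scenario::CooperatingAgents => ("Shared-weight PPO", Some("multi-agent")),
        Scenario::ModelBased => ("DreamerV3 (future)", None),
    };
    Recommendation { algorithm, example }
}

fn main() {
    let scenarios = [
        Scenario::DiscreteActions,
        Scenario::ContinuousActions,
        Scenario::MultiDiscreteActions,
        Scenario::VaryingInvalidActions,
        Scenario::CompetitiveGame,
        Scenario::PartialObservability,
        Scenario::CooperatingAgents,
        Scenario::ModelBased,
    ];

    for scenario in scenarios {
        let rec = recommend(scenario);
        match rec.example {
            // Every example runs with the same cargo invocation.
            Some(name) => println!(
                "{:?}: {} -> cargo run -p {} --release",
                scenario, rec.algorithm, name
            ),
            None => println!("{:?}: {} -> no example listed", scenario, rec.algorithm),
        }
    }
}
```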

When in doubt, start with PPO (quickstart). It is the most versatile algorithm and works well across a wide range of problems. Switch to DQN only if you need off-policy learning or have a small discrete action space where sample efficiency matters.