Distributed Training

Abstractions for multi-GPU/multi-machine gradient synchronization.

The GradientSync Trait

use rl4burn::algo::distributed::{GradientSync, ReduceStrategy};

pub trait GradientSync {
    fn all_reduce_f32(&self, values: &[f32], strategy: ReduceStrategy) -> Vec<f32>;
    fn rank(&self) -> usize;
    fn world_size(&self) -> usize;
    fn barrier(&self);
}

Local Development

Use LocalSync for single-machine development. All operations are no-ops.

use rl4burn::algo::distributed::LocalSync;
let sync = LocalSync;
assert_eq!(sync.world_size(), 1);

Custom Implementations

Implement GradientSync for your cluster’s communication library (MPI, NCCL, gRPC, etc.):

struct MpiSync { /* ... */ }

impl GradientSync for MpiSync {
    fn all_reduce_f32(&self, values: &[f32], strategy: ReduceStrategy) -> Vec<f32> {
        // Call MPI_Allreduce
    }
    // ...
}

At scale (SCC: ~1000 envs per agent, HoK: 320 GPUs), ring all-reduce is the standard choice for gradient averaging.

Keyboard shortcuts

rl4burn

Distributed Training

The GradientSync Trait

Local Development

Custom Implementations