
eval: gate eval_manager calls on rank 0 under DDP #421

Merged
eugenevinitsky merged 1 commit into emerge/temp_training from ev/eval-rank0-guard on May 11, 2026

Conversation


eugenevinitsky commented on May 11, 2026

Summary


Under torchrun/DDP, every rank's `evaluate()` would fire `_eval_manager.maybe_run`
in parallel. Since every rank has identical weights post-allreduce, this is pure
duplication, and it is actively harmful: N ranks each spin up an eval env (~5GB
each), each open ffmpeg pipes to the same per-evaluator subdir and race on the
same mp4 filenames, and each call `logger.log`, which double-publishes scalars
to wandb.

Other rank-0-only operations in pufferl.py (`save_checkpoint`, `print_dashboard`,
the logger path) already guard with `if torch.distributed.is_initialized() and
torch.distributed.get_rank() != 0: return`. The eval block missed that pattern.
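For reference, a minimal sketch of that existing guard (the body of `save_checkpoint` is assumed; only the guard itself is quoted from the codebase):

```python
import torch

def save_checkpoint(self):
    # Non-rank-0 workers bail out before touching the filesystem.
    if torch.distributed.is_initialized() and torch.distributed.get_rank() != 0:
        return
    # ... rank 0 alone writes the checkpoint ...
```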

Add an `is_rank0` check around both eval call sites (the in-loop `maybe_run`
and the shutdown `maybe_run(force=True)`). Non-rank-0 workers block naturally
on the next DDP allreduce until rank 0 rejoins.
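Concretely, the change amounts to this (a sketch; `self._eval_manager` and `maybe_run` come from the PR, the surrounding loop and shutdown contexts are assumed, and the in-loop arguments are elided as in the PR description):

```python
is_rank0 = (not torch.distributed.is_initialized()) or torch.distributed.get_rank() == 0

# In-loop call site: only rank 0 runs the periodic eval.
if self._eval_manager is not None and is_rank0:
    self._eval_manager.maybe_run(...)

# Shutdown call site: only rank 0 forces the final eval.
if self._eval_manager is not None and is_rank0:
    self._eval_manager.maybe_run(force=True)
```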

The bug is latent today (we don't run DDP for any current jobs), but it would
silently corrupt mp4 outputs and inflate wandb scalars the moment torchrun is used.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings on May 11, 2026 00:32

Copilot AI left a comment

Pull request overview

This PR ensures unified evaluation via EvalManager only runs on rank 0 when training under torchrun / DistributedDataParallel (DDP), preventing duplicated rollouts, MP4 filename races, and duplicated metric logging across ranks.

Changes:

  • Gate the in-training-loop _eval_manager.maybe_run(...) call behind a rank-0 check under DDP.
  • Gate the shutdown _eval_manager.maybe_run(force=True) call behind the same rank-0 check under DDP.


Comment thread pufferlib/pufferl.py
Comment on lines +464 to +469
```python
# Under DDP, only rank 0 runs eval — every rank has identical
# weights so duplicating the rollout wastes memory + compute,
# and parallel mp4 writes from N ranks race on filenames. Other
# ranks block on the next allreduce until rank 0 rejoins.
is_rank0 = (not torch.distributed.is_initialized()) or torch.distributed.get_rank() == 0
if self._eval_manager is not None and is_rank0:
```
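To see why non-rank-0 workers simply park on the next collective rather than deadlocking, here is a self-contained toy, not taken from the PR, that can be run with `torchrun --nproc_per_node=2`; the sleep stands in for rank 0's eval rollout:

```python
# toy_rank0_gate.py -- run with: torchrun --nproc_per_node=2 toy_rank0_gate.py
import time

import torch
import torch.distributed as dist

dist.init_process_group("gloo")  # gloo backend works on CPU-only machines
rank = dist.get_rank()

is_rank0 = (not dist.is_initialized()) or dist.get_rank() == 0
if is_rank0:
    time.sleep(2.0)  # stand-in for the eval rollout; only rank 0 pays this cost

# Non-rank-0 workers reach this collective immediately and simply wait
# here until rank 0 finishes its "eval" and rejoins.
t = torch.ones(1)
dist.all_reduce(t)
print(f"rank {rank}: all_reduce done, sum = {t.item()}")
dist.destroy_process_group()
```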
eugenevinitsky merged commit a8f46a6 into emerge/temp_training on May 11, 2026
15 checks passed