
eval: gate eval_manager calls on rank 0 under DDP #421

Merged
eugenevinitsky merged 1 commit into emerge/temp_training from ev/eval-rank0-guard on May 11, 2026

Conversation


eugenevinitsky commented on May 11, 2026

Summary


Under torchrun/DDP, every rank's `evaluate()` would fire `_eval_manager.maybe_run`
in parallel. Since every rank has identical weights post-allreduce, this is pure
duplication, and it is actively harmful: N ranks each spin up an eval env (~5GB
each), each open ffmpeg pipes to the same per-evaluator subdir and race on the
same mp4 filenames, and each call `logger.log`, which double-publishes scalars
to wandb.

Other rank-0-only operations in pufferl.py (`save_checkpoint`, `print_dashboard`,
the logger path) already guard with `if torch.distributed.is_initialized() and
torch.distributed.get_rank() != 0: return`. The eval block missed that pattern.
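For reference, a minimal sketch of that existing guard (the body of `save_checkpoint` is assumed; only the guard itself is quoted from the codebase):

```python
import torch

def save_checkpoint(self):
    # Non-rank-0 workers bail out before touching the filesystem.
    if torch.distributed.is_initialized() and torch.distributed.get_rank() != 0:
        return
    # ... rank 0 alone writes the checkpoint ...
```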

Add an `is_rank0` check around both eval call sites (the in-loop `maybe_run`
and the shutdown `maybe_run(force=True)`). Non-rank-0 workers block naturally
on the next DDP allreduce until rank 0 rejoins.
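Concretely, the change amounts to this (a sketch; `self._eval_manager` and `maybe_run` come from the PR, the surrounding loop and shutdown contexts are assumed, and the in-loop arguments are elided as in the PR description):

```python
is_rank0 = (not torch.distributed.is_initialized()) or torch.distributed.get_rank() == 0

# In-loop call site: only rank 0 runs the periodic eval.
if self._eval_manager is not None and is_rank0:
    self._eval_manager.maybe_run(...)

# Shutdown call site: only rank 0 forces the final eval.
if self._eval_manager is not None and is_rank0:
    self._eval_manager.maybe_run(force=True)
```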

The bug is latent today (we don't run DDP for any current jobs), but it would
silently corrupt mp4 outputs and inflate wandb scalars the moment torchrun is used.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings on May 11, 2026 00:32

Copilot AI left a comment

Pull request overview

This PR ensures unified evaluation via EvalManager only runs on rank 0 when training under torchrun / DistributedDataParallel (DDP), preventing duplicated rollouts, MP4 filename races, and duplicated metric logging across ranks.

Changes:

  • Gate the in-training-loop _eval_manager.maybe_run(...) call behind a rank-0 check under DDP.
  • Gate the shutdown _eval_manager.maybe_run(force=True) call behind the same rank-0 check under DDP.


Comment thread pufferlib/pufferl.py
Comment on lines +464 to +469
```python
# Under DDP, only rank 0 runs eval — every rank has identical
# weights so duplicating the rollout wastes memory + compute,
# and parallel mp4 writes from N ranks race on filenames. Other
# ranks block on the next allreduce until rank 0 rejoins.
is_rank0 = (not torch.distributed.is_initialized()) or torch.distributed.get_rank() == 0
if self._eval_manager is not None and is_rank0:
```
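To see why non-rank-0 workers simply park on the next collective rather than deadlocking, here is a self-contained toy, not taken from the PR, that can be run with `torchrun --nproc_per_node=2`; the sleep stands in for rank 0's eval rollout:

```python
# toy_rank0_gate.py -- run with: torchrun --nproc_per_node=2 toy_rank0_gate.py
import time

import torch
import torch.distributed as dist

dist.init_process_group("gloo")  # gloo backend works on CPU-only machines
rank = dist.get_rank()

is_rank0 = (not dist.is_initialized()) or dist.get_rank() == 0
if is_rank0:
    time.sleep(2.0)  # stand-in for the eval rollout; only rank 0 pays this cost

# Non-rank-0 workers reach this collective immediately and simply wait
# here until rank 0 finishes its "eval" and rejoins.
t = torch.ones(1)
dist.all_reduce(t)
print(f"rank {rank}: all_reduce done, sum = {t.item()}")
dist.destroy_process_group()
```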
eugenevinitsky merged commit a8f46a6 into emerge/temp_training on May 11, 2026
15 checks passed