
[WIP] eval: unified Evaluator + EvalManager refactor#420

Open
eugenevinitsky wants to merge 40 commits into emerge/temp_training from ev/unified-evals

Conversation


@eugenevinitsky eugenevinitsky commented May 9, 2026

Summary

Replaces the five overlapping eval pipelines with one base class + manager.

Test plan

  • Run a short training job on the cluster end-to-end and verify `validation_gigaflow` fires every 25 epochs, `behaviors_*` fires every 250.
  • Verify wandb panels land under `{eval_name}/` and renders land under `{eval_name}/render/{view}`.
  • Verify `puffer eval puffer_drive --evaluator behaviors_hard_stop` runs standalone against a fixed checkpoint.
  • Verify `mode = "subprocess"` works (currently untested; the JSON-out path is plumbed but not exercised).
  • CI pre-commit + utest workflows green.

Out of scope

  • Persistent eval-worker pool (deferred until subprocess overhead becomes painful).
  • C-engine multi-pipe render (one rollout, N ffmpeg outputs per view).
  • Engine-level vecenv reuse via runtime `map_dir` swap.
  • `eval_summary` rollup metric.

Eugene Vinitsky and others added 5 commits May 9, 2026 18:27
…ation yet)

Adds the new evaluator framework as pure additions:

  pufferlib/ocean/benchmark/evaluators/
    base.py            Evaluator + EvalResult dataclass
    multi_scenario.py  MultiScenarioEvaluator (replaces eval_multi_scenarios)
    human_replay.py    HumanReplayEvaluator (replay + control_sdc_only loop)
    behavior_class.py  BehaviorClassEvaluator (per-class nuPlan suite)
    wosac.py           Thin wrapper around the existing WOSACEvaluator
  pufferlib/ocean/benchmark/manager.py
    EvalManager: section discovery, inheritance chain, clean-eval macro,
    dotted-key flattening, inline + subprocess dispatch, wandb logging.

Plus tests/test_eval_manager.py for the config-merge logic.

See docs/eval_unification.md for the full design rationale.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Auto-discovered evaluator sections, schema A (dotted keys). Each
behavior class is its own section inheriting from behaviors_defaults
template; gigaflow validation has separate metric-only and render
sections at different intervals.

driving_behaviours_eval.ini deleted — folded into drive.ini.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
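The "schema A" dotted-key expansion mentioned above can be sketched roughly as follows. `expand_dotted` is a hypothetical name, and the exact section shape is an assumption based on the commit message, not the PR's actual code:

```python
# Sketch: a flat configparser section like {"env.map_dir": "...", "type": "..."}
# becomes a nested dict keyed by the dotted path segments.
def expand_dotted(flat: dict) -> dict:
    nested: dict = {}
    for key, value in flat.items():
        parts = key.split(".")
        cursor = nested
        for part in parts[:-1]:
            # Descend, creating intermediate dicts as needed.
            cursor = cursor.setdefault(part, {})
        cursor[parts[-1]] = value
    return nested

section = {"type": "behavior_class", "env.map_dir": "/data/hard_stop", "eval.render": True}
print(expand_dotted(section))
# {'type': 'behavior_class', 'env': {'map_dir': '/data/hard_stop'}, 'eval': {'render': True}}
```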
Removed from pufferl.py (~990 lines):
  eval(), eval_multi_scenarios(), eval_multi_scenarios_render(),
  build_eval_overrides(), load_eval_multi_scenarios_config(),
  _swap_policy_obs_counts(), _render_driving_behaviours(),
  _export_metrics(), _log_eval_metrics(),
  verify_scenario_coverage(), verify_scenario_coverage_gigaflow().
  Plus the legacy eval block in PuffeRL.evaluate() and the
  driving_behaviours_eval.ini loader in load_config.

Removed from utils.py (~300 lines):
  run_human_replay_eval_in_subprocess,
  run_wosac_eval_in_subprocess,
  run_driving_behaviours_eval_in_subprocess.

PuffeRL.evaluate() now calls self._eval_manager.maybe_run() — single
unified path for all evals. main() wires `puffer eval --evaluator <name>
--out <path>` for both standalone and subprocess use.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…EPLAY in c_step

Add GOAL_ADVANCE_REGENERATE / GOAL_ADVANCE_SATURATE constants and a
goal_advance_mode field on Drive. c_step's last-goal branch now
dispatches on goal_advance_mode instead of simulation_mode.

Drive.__init__ accepts goal_advance_mode kwarg with auto-pick based
on simulation_mode (gigaflow → regenerate, replay → saturate) — same
behavior as before, but the choice is explicit and per-eval-overridable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 9, 2026 22:33
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

This PR refactors PufferDrive’s evaluation system by replacing multiple overlapping eval entry points with a unified Evaluator framework and an EvalManager that discovers [eval.<name>] sections from drive.ini and runs them inline or via subprocess. It also introduces a new C/Python goal_advance_mode knob to decouple last-goal behavior from simulation_mode.

Changes:

  • Added pufferlib/ocean/benchmark/evaluators/ (base Evaluator + concrete evaluators) and pufferlib/ocean/benchmark/manager.py (discovery, inheritance merge, clean macro, dispatch, logging).
  • Refactored pufferlib/pufferl.py eval CLI + training loop to route evaluation through EvalManager; removed legacy subprocess eval helpers from pufferlib/utils.py.
  • Updated drive.ini to [eval.<name>] suites (with templates + inheritance), removed driving_behaviours_eval.ini, and added goal_advance_mode through Python/C bindings.

Reviewed changes

Copilot reviewed 16 out of 17 changed files in this pull request and generated 9 comments.

File Description
tests/test_eval_manager.py New unit tests for eval-section discovery, inheritance merge, dotted-key expansion, and clean macro behavior.
README.md Updates cluster install wording (NYU cluster specifics).
pufferlib/utils.py Removes legacy subprocess eval helpers; retains video rendering helper.
pufferlib/pufferl.py Routes eval through EvalManager; rewires CLI puffer eval --evaluator ... --out ...; removes legacy eval pipelines.
pufferlib/ocean/drive/drive.py Adds Python-side goal_advance_mode and passes it into C env init kwargs.
pufferlib/ocean/drive/drive.h Adds GOAL_ADVANCE_* constants + goal_advance_mode field; switches last-goal behavior to the new knob in c_step.
pufferlib/ocean/drive/binding.c Unpacks goal_advance_mode into the C env struct.
pufferlib/ocean/benchmark/manager.py New EvalManager with section discovery, inheritance merge, clean macro, inline/subprocess dispatch, and logging.
pufferlib/ocean/benchmark/evaluators/base.py New Evaluator base class + EvalResult.
pufferlib/ocean/benchmark/evaluators/multi_scenario.py New multi-scenario evaluator + optional EGL render pass.
pufferlib/ocean/benchmark/evaluators/human_replay.py New human replay evaluator implementation.
pufferlib/ocean/benchmark/evaluators/behavior_class.py New behavior-class evaluator (sampling + symlink-dir cleanup).
pufferlib/ocean/benchmark/evaluators/wosac.py New WOSAC evaluator adapter around existing realism evaluator logic.
pufferlib/ocean/benchmark/evaluators/__init__.py Registers evaluator types for manager lookup.
pufferlib/config/ocean/driving_behaviours_eval.ini Deleted (folded into drive.ini).
pufferlib/config/ocean/drive.ini Replaces [eval] with [eval.<name>] suites + templates/inheritance; adds WOSAC and behavior-class sections.
.gitignore Un-ignores pufferlib/ocean/benchmark/** (was previously matched by benchmark*/).


Comment on lines +62 to +68
obs, _, _, _, info_list = vecenv.step(action_np)
if info_list:
    all_infos.extend(info_list)
# Stop once every bin has yielded one info to avoid double-counting
# on the second cycle through the dir.
if len(all_infos) >= num_maps:
    break
Comment thread pufferlib/ocean/benchmark/manager.py Outdated
Comment on lines +105 to +125
def _run_inline(self, ev: Evaluator, policy, env_name: str, global_step) -> EvalResult:
    args = self._build_eval_args(ev, env_name=env_name, global_step=global_step)

    package = args.get("package", "ocean")
    module_name = "pufferlib.ocean" if package == "ocean" else f"pufferlib.environments.{package}"
    env_module = importlib.import_module(module_name)
    make_env = env_module.env_creator(env_name)

    vec_kwargs = ev.vec_overrides()
    num_envs = int(vec_kwargs.get("num_envs", 1))
    env_kwargs_list = [args["env"] for _ in range(num_envs)]
    env_creators = [make_env] * num_envs
    env_args_list = [[]] * num_envs

    vec_call_kwargs = dict(vec_kwargs)
    vec_call_kwargs.setdefault("num_workers", num_envs)
    vec_call_kwargs.setdefault("batch_size", num_envs)

    vecenv = pufferlib.vector.make(
        env_creators, env_args=env_args_list, env_kwargs=env_kwargs_list, **vec_call_kwargs
    )
Comment thread pufferlib/ocean/benchmark/manager.py Outdated
Comment on lines +132 to +137
def _run_subprocess(self, ev: Evaluator, env_name: str, global_step) -> EvalResult:
    out_path = Path(self.train_config.get("data_dir", ".")) / "eval_subprocess_out" / f"{ev.name}.json"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    cfg_path = out_path.with_suffix(".cfg.json")
    with open(cfg_path, "w") as f:
        json.dump({"name": ev.name, "global_step": global_step}, f)
Comment on lines +153 to +155
subprocess.run(cmd, check=True)
with open(out_path) as f:
    payload = json.load(f)
Comment thread pufferlib/pufferl.py
eval_args["load_model_path"] = os.path.join(
self.config["data_dir"], experiment_name, "models", f"inline_epoch_{self.epoch}.pt"
logger=self.logger,
global_step=self.global_step,
Comment thread pufferlib/config/ocean/drive.ini Outdated
Comment on lines +266 to +284
env.map_dir = "/scratch/$USER/data/nuplan/nuplan_mini_train_bins"

[eval.behaviors_hard_stop]
inherits = "behaviors_defaults"
type = "behavior_class"
enabled = true
env.map_dir = "/scratch/$USER/data/nuplan/categories_v021/hard_stop"

[eval.behaviors_highway_straight]
inherits = "behaviors_defaults"
type = "behavior_class"
enabled = true
env.map_dir = "/scratch/$USER/data/nuplan/categories_v021/highway_straight"

[eval.behaviors_lane_change]
inherits = "behaviors_defaults"
type = "behavior_class"
enabled = true
env.map_dir = "/scratch/$USER/data/nuplan/categories_v021/lane_change"
Comment thread pufferlib/utils.py
Comment on lines 1 to 8
import os
import sys
import glob
import random
import shutil
import subprocess
import tempfile
import json
Comment on lines +71 to +84
def maybe_run(self, epoch: int, policy, env_name: str, logger=None, global_step=None) -> dict:
    """Called from the training loop. Runs every enabled evaluator
    whose `interval` divides `epoch`. Returns a dict of {eval_name → metrics}."""
    results = {}
    for ev in self.evaluators:
        if not ev.enabled:
            continue
        if ev.interval <= 0:
            continue
        if epoch % ev.interval != 0:
            continue
        res = self._run_one(ev, policy=policy, env_name=env_name, logger=logger, global_step=global_step)
        results[ev.name] = res
    return results
Comment on lines +126 to +127
try:
    res = ev.rollout(vecenv, policy, args)
Eugene Vinitsky and others added 21 commits May 9, 2026 18:46
The HumanReplay and MultiScenario rollouts shared ~80% of the same step loop
(reset → forward_eval → step → collect infos → aggregate). Move it to
the base class as `_run_rollout_loop`. Subclasses override only the
hooks they actually need to diverge on:

  _initial_reset       (sync vs async reset)
  _maybe_reset_lstm    (per-scenario reset cadence)
  _should_stop         (termination condition)
  _flatten_infos       (multi-worker vs single-worker info shape)
  _aggregate_infos     (per-key mean is the default)
  _render_pass         (no-op default; MultiScenario implements EGL)

HumanReplayEvaluator now overrides only `env_overrides` + `_should_stop`.
BehaviorClassEvaluator still inherits HumanReplay; only adds the tmp
symlink dir for random sampling. WOSAC keeps its custom rollout —
its per-scene multi-rollout loop doesn't fit the default shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
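The hook split described above is a classic template-method pattern. A minimal sketch, using the hook names from the commit message but invented signatures and a toy vecenv contract:

```python
# Template-method sketch: the base class owns the step loop; subclasses
# override only the hooks where they diverge. All signatures are assumptions.
class Evaluator:
    def _initial_reset(self, vecenv):
        return vecenv.reset()

    def _should_stop(self, step, infos):
        raise NotImplementedError  # termination condition is per-evaluator

    def _flatten_infos(self, raw):
        return list(raw or [])  # default: single-worker flat list

    def _aggregate_infos(self, infos):
        # Default aggregation: per-key mean across all collected infos.
        keys = {k for info in infos for k in info}
        return {k: sum(i[k] for i in infos if k in i) / sum(k in i for i in infos)
                for k in keys}

    def _run_rollout_loop(self, vecenv, policy):
        obs = self._initial_reset(vecenv)
        infos, step = [], 0
        while not self._should_stop(step, infos):
            action = policy(obs)
            obs, _, _, _, raw = vecenv.step(action)
            infos.extend(self._flatten_infos(raw))
            step += 1
        return self._aggregate_infos(infos)
```

A subclass like the HumanReplay evaluator described above would then override only `_should_stop` (and env overrides), inheriting the loop unchanged.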
The benchmark/** unignore was over-broad — it pulled in pycache files
that should never be committed. Add an explicit re-ignore for
__pycache__ under that path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
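The resulting ignore rules might look like this (a sketch of the patterns implied by the commit message; the repo's actual .gitignore may differ):

```gitignore
# Ignore benchmark output dirs everywhere...
benchmark*/
# ...but keep the evaluator package tracked
!pufferlib/ocean/benchmark/
# and never commit bytecode caches under it
pufferlib/ocean/benchmark/**/__pycache__/
```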
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- eval.render_num_scenarios: explicit per-evaluator render budget;
  defaults to min(eval.num_scenarios, 3). Renders are expensive
  (mp4 encode + wandb upload) and shouldn't run at metric scale.
- Render path randomizes starting_map per epoch when not pinned,
  so successive renders show different bins from the dir instead
  of the first N alphabetically. Restores the old behavior from
  _render_driving_behaviours that the refactor lost.
- behaviors_defaults pins render_num_scenarios = 2 so 12 classes
  × 2 views × 2 scenarios = 48 mp4s/epoch (vs 1200 if it inherited
  num_scenarios=50).
- Test: render_num_scenarios is inheritable.

"Worst N" selection deferred — needs per-scenario scores piped out
of the metric rollout into the render call.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two coupled fixes:

  EvalManager:
    - Accepts run_id at construction. Used to locate the per-run
      models/ directory.
    - latest_checkpoint(env_name) walks data_dir/<env>_<run_id>/models/
      for the newest model_*.pt. Falls back to train_config.load_model_path
      if no checkpoints exist yet.
    - has_subprocess_evals_at(epoch) reports whether any enabled
      subprocess evaluator would fire at that epoch.
    - _run_subprocess uses latest_checkpoint instead of train_config.load_model_path.

  PuffeRL.evaluate:
    - Calls save_checkpoint() before maybe_run() if any subprocess
      evaluator would fire. Mirrors the old run_driving_behaviours
      _eval_in_subprocess flow.
    - train() passes logger.run_id when constructing the manager.

For all-inline configs (today's drive.ini default) this is a no-op.
Activates when an evaluator is flipped to mode=subprocess.

Tests: latest_checkpoint picks the newest by ctime, falls back to
load_model_path when no models exist; has_subprocess_evals_at fires
only on enabled subprocess evaluators at matching intervals.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
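A hedged sketch of the checkpoint-discovery logic described above. The `data_dir/<env>_<run_id>/models/` layout and the ctime ordering come from the commit message; the glob pattern and fallback details are assumptions:

```python
from pathlib import Path
from typing import Optional

def latest_checkpoint(data_dir: str, env_name: str, run_id: str,
                      fallback: Optional[str] = None) -> Optional[str]:
    # Walk the per-run models/ dir for the newest model_*.pt.
    models = Path(data_dir) / f"{env_name}_{run_id}" / "models"
    candidates = sorted(models.glob("model_*.pt"), key=lambda p: p.stat().st_ctime)
    # Fall back to the configured load_model_path if nothing saved yet.
    return str(candidates[-1]) if candidates else fallback
```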
The unified-eval refactor doesn't actually need this knob, and pushing
it through the C struct + binding + drive.py at the same time bloats
this PR. Restoring the original `if (env->simulation_mode == SIMULATION_REPLAY)`
branch in c_step. The "promote implicit branches to explicit knobs"
audit is real work but lives in its own PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…erride stack

Adds 4 tests covering the regression-prone surface the parser-only
tests miss:

  test_maybe_run_dispatches_by_interval_and_enabled
    Stubs _run_one and verifies that maybe_run fires only enabled
    evaluators whose interval divides epoch. epoch=33 fires nothing,
    epoch=250 fires both 25-interval and 250-interval evaluators.

  test_flatten_infos_handles_shape_variations
    _flatten_infos must handle multi-worker (list of lists) AND
    PufferEnv (flat list) backends, plus None / empty entries —
    one bad isinstance check silently drops episode infos.

  test_behavior_class_cleanup_removes_symlink_dir
    Builds a real 5-bin map_dir, requests a 2-bin sample, verifies
    the tmp symlink dir gets created with 2 bins, then cleanup()
    removes it. tempfile.mkdtemp leftovers are a real footgun.

  test_eval_args_compose_train_section_and_clean_macro
    The full override stack: train baseline → section overrides →
    clean macro. Section beats baseline, explicit beats macro,
    untouched baseline survives.

25 tests total (was 21).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Aligns with the heavier-eval cadence the rest of the new eval
sections use. 25-epoch interval was too aggressive for inline
gigaflow validation given the per-pass setup cost.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Time the rollout in Evaluator.rollout (base class) and inject
`eval_seconds` into the returned metrics dict. Manager's _log
posts it to wandb under {ev.name}/eval_seconds — wall-clock cost
per evaluator becomes a first-class panel.

Refactor WOSACEvaluator to override _run_rollout_loop instead of
rollout — now WOSAC also benefits from the timing without code
duplication.

Test: stub evaluator with a forced 20ms floor; verifies eval_seconds
lands in the result and the inner metrics survive.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The eval-mode resample path at drive.py:447 was missing goal_radius
in its kwargs to binding.shared, while the initial-spawn call at
line 293 has it. binding.shared's C side (binding.c:1545) requires
goal_radius via unpack(), so the resample crashes with
"Missing required keyword argument 'goal_radius'" once a scenario
batch completes and a new one is requested.

Latent bug — never triggered because the legacy [eval] section had
multi_scenario_eval = False as the default, so the eval rollout
path that triggers resample never fired in production. The new
EvalManager flips multi_scenario_eval-equivalent ([eval.validation_gigaflow])
on by default, surfacing the crash.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Single section now does both: 250-scenario metric pass + 5-scenario
render pass via the render_num_scenarios knob. The split made sense
before render became a per-section flag; it's redundant now.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PufferEnv backend in pufferlib.vector.make treats env_creator as a
single callable and passes env_args/env_kwargs to it directly (line
697 in vector.py). Multiprocessing/Serial backends expect lists.
Render must use PufferEnv (one ffmpeg pipe per env), so pass a
single creator + dict, not lists.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 8-map carla dir doesn't need 250 scenarios — gigaflow's C-side eval
already creates one internal env per scenario (capped at num_scenarios)
and steps them in a single batched rollout, so num_scenarios=8 covers
every map exactly once in parallel. Drop num_agents 512→400 to fill
exactly 8 × 50 slots with no wasted capacity.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…o_length

Render path was falling back to env.scenario_length (3000 = 100s mp4 at
30fps) because render_max_steps wasn't a real config knob. Two fixes:

  1. multi_scenario.py: read eval.render_max_steps from args["eval"]
     where _build_eval_args puts evaluator-private fields, and default
     to 91 (~3 sec) — not scenario_length, which is the metric pass.

  2. drive.ini: set eval.render_max_steps = 91 on validation_gigaflow
     explicitly so the comment + value document the intent.

EGL render is ~3 fps wall-clock at 1080p, so 91 × 5 × 2 ≈ 5 min/render
pass instead of 28 min when defaulting to scenario_length=3000.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The old MultiScenarioEvaluator forced Multiprocessing + async_reset by
default, which (a) added fork/IPC startup cost without throughput gain
at our scale, (b) broke render (one ffmpeg pipe per env requires
single-process), and (c) couldn't construct the vec env at all because
the manager passed the PufferEnv-shaped single creator to the list-shaped
backend dispatch. Three changes:

  1. multi_scenario.py: drop the vec_overrides() and _initial_reset()
     overrides that pinned Multiprocessing — inherit base-class PufferEnv
     + sync reset.

  2. manager.py:_run_inline: branch on backend so Multiprocessing remains
     a valid opt-in via [eval.<name>.vec] backend = "Multiprocessing".
     Useful for memory-bound replay sweeps, hetero-config evals, or
     async overlap on long rollouts — not needed for the 8-map carla
     validation eval that's the hot path today.

  3. The 8-map validation eval now spins up one Drive with 8 internal
     envs in-process, no fork.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… path

Two fixes for the multi-evaluator smoke test:

  1. multi_scenario._render_view: glob filter excludes mp4s that already
     existed in out_dir before this pass. The dir is shared across epochs
     and across views, so a bare `out_dir.glob('*.mp4')` was returning
     every mp4 from prior render passes too — turning epoch 12510's
     16-mp4 render into a 32-mp4 wandb upload (the 16 from epoch 12505
     were still on disk).

  2. drive.ini: replace `/scratch/$USER/data/nuplan/...` with the literal
     path. configparser doesn't expand env vars, and the manager doesn't
     either, so behavior evals were crashing on FileNotFoundError trying
     to open `/scratch/$USER/data/...`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously `{scenario_id}{view_suffix}.mp4` — each render epoch
overwrote the previous epoch's mp4 in place, making the prior fix's
existing-file snapshot filter incorrect (filter would exclude the
freshly-written file because Path equality matches the prior path).

Now `{scenario_id}_step{N}{view_suffix}.mp4` so:
  - successive eval epochs produce distinct mp4s (no overwrites).
  - wandb's render carousel shows one entry per epoch, letting the
    user watch policy evolve over training.
  - the return-paths glob is exact: just files matching this step's
    pattern, no snapshot-filtering trickery.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Behavior-class evaluators previously had render=true in config but no
_render_pass implementation, so the videos silently never appeared.
Three changes:

  1. base.py: lift _render_pass + _render_view from MultiScenarioEvaluator
     into Evaluator. The render loop is generic — fresh PufferEnv with
     render_mode=headless, ffmpeg pipe per active env per view, mp4s
     stamped with global_step in the filename.

  2. base.py: add _render_env_overrides hook so subclasses can tweak the
     render env (default = metric env + render_mode=headless). Render
     loop now caps internal-env render count at eval.render_num_scenarios
     instead of always rendering the full batch (the C kernel still steps
     the full batch — just fewer ffmpeg pipes). Drops the dead
     batch_size_eval write.

  3. multi_scenario.py: keeps the random-starting_map override (its only
     real difference from the default render path) — everything else is
     now inherited.

Result: behavior_class and human_replay inherit a working render path.
For 12 behavior classes with render_num_scenarios=2, render is bounded
by ~3 fps × 300 steps × 2 views ≈ 3.5 min/class.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Every evaluator runs at the same global_step, so the previous shared
out_dir + step glob made each evaluator's _render_view re-collect every
earlier evaluator's mp4s into result.frames. wandb then logged
validation_gigaflow's videos under behaviors_*/render too.

Per-evaluator subdir (`mp4/<ev_name>/`) keeps each evaluator's render
output isolated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… aliasing

Every behavior_class evaluator was rendering the same bin (the LAST
section's map_dir won for all siblings) because _deep_merge mutates
its dst in place, but _expand_dotted's shallow-copy meant the first
level merged into a fresh `merged` dict had `merged["env"]` aliasing
the parent section's env dict directly. The next level's merge then
recursed into that alias and overwrote keys on the parent itself, so:

  - building child_a left defaults["env"]["map_dir"] = a's map_dir
  - building child_b read the polluted defaults, set map_dir = b's
  - … repeated through all siblings — every cfg["env"] is the same
    reference, the last write wins for everyone.

Diagnostic showed all 12 behavior evaluators returning map_dir =
unprotected_right (the last section), num_maps=1, num_agents=1.

Fix: deepcopy each level before merging so `merged` owns its own
nested dicts and the parent section dict in `all_sections` stays
pristine across sibling builds.

Plus a regression test that builds three siblings inheriting from
the same parent and asserts each gets its own map_dir AND that the
parent dict is untouched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
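The aliasing bug and its deepcopy fix can be reproduced in miniature. `_deep_merge` mirrors the behavior the commit message describes; the `build` wrapper and exact signatures are assumptions:

```python
import copy

def _deep_merge(dst: dict, src: dict) -> dict:
    # Mutates dst in place -- the root cause of the bug when dst's nested
    # dicts alias a shared parent section.
    for k, v in src.items():
        if isinstance(v, dict) and isinstance(dst.get(k), dict):
            _deep_merge(dst[k], v)
        else:
            dst[k] = v
    return dst

def build(parent: dict, child: dict) -> dict:
    # The fix: deepcopy each level before merging so `merged` owns its own
    # nested dicts and the shared parent stays pristine across siblings.
    merged = copy.deepcopy(parent)
    return _deep_merge(merged, copy.deepcopy(child))

defaults = {"env": {"map_dir": "default", "num_agents": 64}}
a = build(defaults, {"env": {"map_dir": "hard_stop"}})
b = build(defaults, {"env": {"map_dir": "lane_change"}})
assert a["env"]["map_dir"] == "hard_stop" and b["env"]["map_dir"] == "lane_change"
assert defaults["env"]["map_dir"] == "default"  # parent untouched
```

Without the `deepcopy` calls, `merged["env"]` would alias `defaults["env"]`, and every sibling build would overwrite the same dict, reproducing the last-write-wins symptom above.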
Eugene Vinitsky and others added 13 commits May 10, 2026 11:40
Render mp4 filenames now use `_epoch{E}_step{N}{view_suffix}.mp4`
instead of just `_step{N}{view_suffix}.mp4`. Epoch is the human-readable
checkpoint index ("which epoch did this come from"), step is the
precise env-step count. Both are useful — epoch for sorting through
training history at a glance, step for tying back to wandb's x-axis.

Threads `epoch` through maybe_run → _run_one → _run_inline →
_build_eval_args → args["epoch"], where _render_pass picks it up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…enarios=2

The 'Template — no type, never instantiated' inline comment on
[eval.behaviors_defaults] was redundant with the section-header block
above ('All inherit from the template below') and read like internal
prose mid-section.

Also drop the explicit `eval.render_num_scenarios = 2` whose comment
defended pinning lower than the default of 3 — the 48-vs-72 difference
in mp4 count isn't load-bearing, and the contradiction (set 2, comment
defaults 3) was confusing. Inherits the default 3 now.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…efaults

base.py:_render_view used to default `render_num_scenarios = min(metric_count, 3)`
and `render_max_steps = 300` if the section didn't set them. That hides the
render-cost knob inside code where readers won't see it. Render is expensive
(EGL frame-by-frame at 1080p, ffmpeg encode + wandb upload), and the right
budget depends on scenario_length and how many evaluators are firing — both
are config-side concerns.

Now: if `render = true`, the section MUST set both `eval.render_num_scenarios`
and `eval.render_max_steps`. Missing either raises KeyError at first render
attempt.

drive.ini: behaviors_defaults already had render=true; add explicit
render_num_scenarios=3 and render_max_steps=200 (=full bin at scenario_length=201).
validation_gigaflow already set both.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The default rollout loop now reads `terminals` and `truncations` from
vecenv.step (previously discarded) and masks the LSTM state per-agent —
either signal means 'episode over, env reset' so we OR them. Agents
whose env truncated/terminated start the next forward with zeroed
hidden state; the rest carry recurrent memory across the step.

This replaces the whole-batch heuristic (`steps % scenario_length == 0`)
that lived in MultiScenarioEvaluator._maybe_reset_lstm. The heuristic
only worked because gigaflow validation runs uniform scenarios with
termination_mode=0, where every env truncated on the same Python tick.
It would silently desync if any env truncated early (e.g.
termination_mode=1, future per-bin scenario_length).

Also fixes the silent correctness bug for behavior_class / human_replay
evaluators, which inherited the base no-op and were never resetting
LSTM state across scenario boundaries. (Latent because rnn_name=None in
the smoke configs we've been running, but it'd bite as soon as RNN is
enabled.)

  - base.py: capture terminals/truncations from vecenv.step; mask state
    per-agent with `(1 - (term | trunc)).reshape(-1, 1)` before next
    forward.
  - base.py: drop the `_maybe_reset_lstm` hook (now redundant).
  - multi_scenario.py: drop the override (was the only implementer).
  - tests: regression test that constructs a 4-agent rollout where 3
    agents fire terminate/truncate and one stays alive; asserts only
    the alive agent's state survives.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
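The per-agent mask described above can be sketched as follows. The `(1 - (term | trunc)).reshape(-1, 1)` expression comes from the commit message; the (agents, hidden) state shape and function name are assumptions:

```python
import numpy as np

def mask_lstm_state(state, terminals, truncations):
    # Either signal means "episode over, env reset", so OR them.
    done = np.asarray(terminals, dtype=bool) | np.asarray(truncations, dtype=bool)
    keep = (1 - done.astype(state.dtype)).reshape(-1, 1)
    # Zero hidden state only for finished agents; others keep recurrent memory.
    return state * keep

state = np.ones((4, 8))
out = mask_lstm_state(state, [1, 0, 1, 0], [0, 0, 0, 1])
# rows 0, 2, 3 are zeroed; row 1 carries its memory into the next forward
```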
Each test now opens with a 1-3 line docstring stating what it asserts
and why — useful for future readers scanning the file or for failure
output to make the intent obvious without re-reading the body.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The C-side replay branch caps env_count at num_eval_scenarios when
eval_mode is on (binding.c:1653). HumanReplay/BehaviorClass were never
setting it, so it fell back to drive.py's default of 16. Behavior
classes with >16 bins (or eval.num_scenarios > 16 with sampling) were
silently only loading 16 of N envs as bins, then deterministically
re-running those same 16 until len(infos) >= num_maps padded the count.
Net effect: 34 of 50 sampled bins wasted CPU on tmp symlinks and never
got evaluated.

  - human_replay.env_overrides: always (re)set num_agents = num_maps =
    num_eval_scenarios = bin count. Drops the prior `if num_agents not
    in env` guard that was silently bypassed by the train config's
    num_agents=1024 default.

  - human_replay._should_stop: now stops at scenario_length+1 (= one
    full cycle). Every env truncates simultaneously at scenario_length,
    vec_log emits one aggregate dict averaging all SDCs at that step,
    and we have the full per-bin metric. Continuing past that just
    deterministically re-runs the same bins (resample_frequency is
    effectively never in replay), so the prior `len(infos) >= num_maps`
    threshold was 49× wasteful for 50-bin classes.

  - behavior_class.env_overrides: sets num_eval_scenarios alongside
    num_agents/num_maps in both the sampling and all-bins branches.

  - regression test: builds a 50-bin map_dir and asserts both branches
    of behavior_class.env_overrides set num_eval_scenarios = N.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ip rest

Two fixes:

1. Subprocess eval now receives --global-step and --epoch.

   Previously _run_subprocess wrote a .cfg.json next to --out but never
   passed it (or any other reference to it) on the cmd line, and the
   eval CLI had no flags for either field. Result: every subprocess
   render wrote `{scenario_id}_epoch0_step0.mp4`, and successive eval
   epochs would silently overwrite each other on disk. Metric data was
   correct (parent re-attached the right global_step on _log) but the
   wandb captions were uninterpretable.

   Now: _run_subprocess passes --global-step and --epoch on the cmd
   line; pufferl.py's main() pulls them out of argv (same pattern as
   --evaluator / --out) and threads them into eval(); the eval()
   function passes them to run_one_by_name; the manager threads them
   through to _build_eval_args where the render path picks them up
   via args["global_step"] / args["epoch"]. Drops the dead .cfg.json
   write.

2. maybe_run isolates per-evaluator failures.

   A bad bin dir or KeyError in one [eval.behaviors_X] section used to
   crash the for-loop and skip every subsequent evaluator at that
   epoch. Now each _run_one is wrapped: the exception is printed in
   full (header to stdout, traceback to stderr) and the loop moves on.
   Successful evaluators land in the returned results dict; the failed
   one is absent.

   Errors stay loud — no silent swallow. Just stops one bad evaluator
   from dragging the whole eval batch down with it.

  - test_maybe_run_isolates_evaluator_failures: 3 evaluators where the
    middle one raises; asserts the other two succeed and that the
    failure is visible in stdout+stderr.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rocess

Two new tests filled in (concerns 16 and 17 from the audit) — and the
render test caught an actual bug while writing it.

  - test_render_pass_per_evaluator_subdir_and_step_glob: mocks
    vector.make + sample_logits so the render loop runs without EGL.
    Verifies per-evaluator subdir, epoch+step in filename, and that
    stale mp4s from a prior epoch don't leak into result.frames via
    the glob.

  - test_run_subprocess_passes_step_epoch_and_reads_back_result:
    mocks subprocess.run; asserts the cmd line carries --global-step,
    --epoch, --evaluator, --out, --load-model-path. After the fake
    subprocess writes the canonical JSON, the parent reads it back
    into EvalResult.metrics/frames correctly.

Bug the render test caught: _render_view was globbing
`*{step_suffix}*.mp4` to filter prior-epoch mp4s, but that glob also
matched OTHER views' files from the same pass (e.g. the bev view
glob picked up sim_state files that already existed in the subdir).
all_paths.extend across views was then double-counting sim_state
mp4s. Cluster runs were silently logging duplicates to wandb's render
key.

Fix: glob by full view_suffix (step + view marker), with exact
.mp4 suffix anchoring. sim_state's `*_epoch7_step12.mp4` no longer
matches the bev variant `*_epoch7_step12_bev.mp4`, so each view's
return is just its own mp4s. The extend call accumulates cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
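The cross-view glob leak is easy to demonstrate. A minimal reproduction with hypothetical filenames matching the scheme above (the actual view suffixes in the PR may differ):

```python
from pathlib import Path
import tempfile

d = Path(tempfile.mkdtemp())
for name in ["sc1_epoch7_step12.mp4", "sc1_epoch7_step12_bev.mp4"]:
    (d / name).touch()

# Buggy pattern: loose `*{step_suffix}*.mp4` matches BOTH views' files.
loose = sorted(p.name for p in d.glob("*_epoch7_step12*.mp4"))
# Fixed pattern: anchor the full view suffix + exact .mp4 tail.
exact = sorted(p.name for p in d.glob("*_epoch7_step12_bev.mp4"))

assert len(loose) == 2
assert exact == ["sc1_epoch7_step12_bev.mp4"]
```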
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… counts

Three of the five aggregation issues from the recent audit. The
remaining two (failed-spawn dilution in gigaflow, per-bin vs per-SDC
weighting) need C-side changes — flagging as future work.

  - _aggregate_infos: switch from mean-of-means to weighted mean by
    per-emission `n`. Equivalent to current behavior when batches are
    uniform (today), correct when they aren't (termination_mode=1,
    mixed scenario_lengths, future per-bin overrides). Caveat in the
    docstring for ratio fields like avg_distance_per_infraction
    (numerator/denominator aren't observable from Python — weighted
    mean is the same approximation as mean-of-means there).

  - _aggregate_infos: rename `num_scenarios_completed` →
    `num_log_cycles` (it was the count of vec_log emissions, not the
    count of unique scenes), and add `num_agents_evaluated` (sum of n
    across emissions = true count of SDC-trajectories behind the
    metric). For behaviors with one cycle of N bins, that's now
    `num_log_cycles=1, num_agents_evaluated=N` — much less misleading
    than the prior `num_scenarios_completed=1`.

  - BehaviorClassEvaluator: seed random.sample with the section name
    so every epoch picks the same bins. Cross-epoch metric jitter
    from bin selection used to be indistinguishable from policy
    improvement; now jitter comes only from policy changes. Also
    sorted the bin listing for determinism before the sample.

  - tests: weighted-mean correctness with two emissions of different
    n; empty-infos handling; reproducible sampling across two
    evaluator instances with the same name; differing sections sample
    different bins.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
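The weighted-mean aggregation described above can be sketched as follows. The per-emission count field `n` and the output key names follow the commit message; everything else is an assumption:

```python
# Weighted mean by per-emission n: equivalent to mean-of-means when all
# emissions have equal n, correct when they don't.
def aggregate(infos):
    total_n = sum(i["n"] for i in infos)
    if total_n == 0:
        return {}
    keys = {k for i in infos for k in i if k != "n"}
    out = {k: sum(i[k] * i["n"] for i in infos if k in i) / total_n for k in keys}
    out["num_log_cycles"] = len(infos)          # count of vec_log emissions
    out["num_agents_evaluated"] = total_n       # true SDC-trajectory count
    return out

infos = [{"n": 10, "collision_rate": 0.1}, {"n": 40, "collision_rate": 0.2}]
print(aggregate(infos)["collision_rate"])  # 0.18 = (10*0.1 + 40*0.2) / 50
```

Note the same caveat as the commit message: for ratio fields whose numerator and denominator are not separately observable, this remains an approximation.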