[WIP] eval: unified Evaluator + EvalManager refactor #420
Open
eugenevinitsky wants to merge 40 commits into emerge/temp_training from
Conversation
…ation yet)
Adds the new evaluator framework as pure additions:
pufferlib/ocean/benchmark/evaluators/
base.py Evaluator + EvalResult dataclass
multi_scenario.py MultiScenarioEvaluator (replaces eval_multi_scenarios)
human_replay.py HumanReplayEvaluator (replay + control_sdc_only loop)
behavior_class.py BehaviorClassEvaluator (per-class nuPlan suite)
wosac.py Thin wrapper around the existing WOSACEvaluator
pufferlib/ocean/benchmark/manager.py
EvalManager: section discovery, inheritance chain, clean-eval macro,
dotted-key flattening, inline + subprocess dispatch, wandb logging.
Plus tests/test_eval_manager.py for the config-merge logic.
See docs/eval_unification.md for the full design rationale.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Auto-discovered evaluator sections, schema A (dotted keys). Each behavior class is its own section inheriting from behaviors_defaults template; gigaflow validation has separate metric-only and render sections at different intervals. driving_behaviours_eval.ini deleted — folded into drive.ini. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
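The dotted-key flattening mentioned above (schema A) can be sketched as follows. This is a minimal illustration, not the actual EvalManager code; the function name `expand_dotted` and the example keys are made up for demonstration.

```python
def expand_dotted(flat: dict) -> dict:
    """Expand dotted keys like 'env.map_dir' into nested dicts.

    Illustrative sketch of the schema-A idea: {'env.map_dir': x}
    becomes {'env': {'map_dir': x}}.
    """
    nested = {}
    for key, value in flat.items():
        parts = key.split(".")
        node = nested
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return nested

cfg = expand_dotted({
    "env.map_dir": "/data/bins",
    "env.num_agents": 64,
    "type": "behavior_class",
})
# cfg == {"env": {"map_dir": "/data/bins", "num_agents": 64}, "type": "behavior_class"}
```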
Removed from pufferl.py (~990 lines): eval(), eval_multi_scenarios(), eval_multi_scenarios_render(), build_eval_overrides(), load_eval_multi_scenarios_config(), _swap_policy_obs_counts(), _render_driving_behaviours(), _export_metrics(), _log_eval_metrics(), verify_scenario_coverage(), verify_scenario_coverage_gigaflow(). Plus the legacy eval block in PuffeRL.evaluate() and the driving_behaviours_eval.ini loader in load_config.
Removed from utils.py (~300 lines): run_human_replay_eval_in_subprocess, run_wosac_eval_in_subprocess, run_driving_behaviours_eval_in_subprocess.
PuffeRL.evaluate() now calls self._eval_manager.maybe_run() — single unified path for all evals. main() wires `puffer eval --evaluator <name> --out <path>` for both standalone and subprocess use.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…EPLAY in c_step
Add GOAL_ADVANCE_REGENERATE / GOAL_ADVANCE_SATURATE constants and a goal_advance_mode field on Drive. c_step's last-goal branch now dispatches on goal_advance_mode instead of simulation_mode.
Drive.__init__ accepts goal_advance_mode kwarg with auto-pick based on simulation_mode (gigaflow → regenerate, replay → saturate) — same behavior as before, but the choice is explicit and per-eval-overridable.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
This PR refactors PufferDrive’s evaluation system by replacing multiple overlapping eval entry points with a unified Evaluator framework and an EvalManager that discovers [eval.<name>] sections from drive.ini and runs them inline or via subprocess. It also introduces a new C/Python goal_advance_mode knob to decouple last-goal behavior from simulation_mode.
Changes:
- Added pufferlib/ocean/benchmark/evaluators/ (base Evaluator + concrete evaluators) and pufferlib/ocean/benchmark/manager.py (discovery, inheritance merge, clean macro, dispatch, logging).
- Refactored pufferlib/pufferl.py eval CLI + training loop to route evaluation through EvalManager; removed legacy subprocess eval helpers from pufferlib/utils.py.
- Updated drive.ini to [eval.<name>] suites (with templates + inheritance), removed driving_behaviours_eval.ini, and added goal_advance_mode through Python/C bindings.
Reviewed changes
Copilot reviewed 16 out of 17 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| tests/test_eval_manager.py | New unit tests for eval-section discovery, inheritance merge, dotted-key expansion, and clean macro behavior. |
| README.md | Updates cluster install wording (NYU cluster specifics). |
| pufferlib/utils.py | Removes legacy subprocess eval helpers; retains video rendering helper. |
| pufferlib/pufferl.py | Routes eval through EvalManager; rewires CLI `puffer eval --evaluator ... --out ...`; removes legacy eval pipelines. |
| pufferlib/ocean/drive/drive.py | Adds Python-side goal_advance_mode and passes it into C env init kwargs. |
| pufferlib/ocean/drive/drive.h | Adds GOAL_ADVANCE_* constants + goal_advance_mode field; switches last-goal behavior to the new knob in c_step. |
| pufferlib/ocean/drive/binding.c | Unpacks goal_advance_mode into the C env struct. |
| pufferlib/ocean/benchmark/manager.py | New EvalManager with section discovery, inheritance merge, clean macro, inline/subprocess dispatch, and logging. |
| pufferlib/ocean/benchmark/evaluators/base.py | New Evaluator base class + EvalResult. |
| pufferlib/ocean/benchmark/evaluators/multi_scenario.py | New multi-scenario evaluator + optional EGL render pass. |
| pufferlib/ocean/benchmark/evaluators/human_replay.py | New human replay evaluator implementation. |
| pufferlib/ocean/benchmark/evaluators/behavior_class.py | New behavior-class evaluator (sampling + symlink-dir cleanup). |
| pufferlib/ocean/benchmark/evaluators/wosac.py | New WOSAC evaluator adapter around existing realism evaluator logic. |
| pufferlib/ocean/benchmark/evaluators/__init__.py | Registers evaluator types for manager lookup. |
| pufferlib/config/ocean/driving_behaviours_eval.ini | Deleted (folded into drive.ini). |
| pufferlib/config/ocean/drive.ini | Replaces [eval] with [eval.<name>] suites + templates/inheritance; adds WOSAC and behavior-class sections. |
| .gitignore | Un-ignores pufferlib/ocean/benchmark/** (was previously matched by benchmark*/). |
Comment on lines +62 to +68

    obs, _, _, _, info_list = vecenv.step(action_np)
    if info_list:
        all_infos.extend(info_list)
    # Stop once every bin has yielded one info to avoid double-counting
    # on the second cycle through the dir.
    if len(all_infos) >= num_maps:
        break
Comment on lines +105 to +125

    def _run_inline(self, ev: Evaluator, policy, env_name: str, global_step) -> EvalResult:
        args = self._build_eval_args(ev, env_name=env_name, global_step=global_step)

        package = args.get("package", "ocean")
        module_name = "pufferlib.ocean" if package == "ocean" else f"pufferlib.environments.{package}"
        env_module = importlib.import_module(module_name)
        make_env = env_module.env_creator(env_name)

        vec_kwargs = ev.vec_overrides()
        num_envs = int(vec_kwargs.get("num_envs", 1))
        env_kwargs_list = [args["env"] for _ in range(num_envs)]
        env_creators = [make_env] * num_envs
        env_args_list = [[]] * num_envs

        vec_call_kwargs = dict(vec_kwargs)
        vec_call_kwargs.setdefault("num_workers", num_envs)
        vec_call_kwargs.setdefault("batch_size", num_envs)

        vecenv = pufferlib.vector.make(
            env_creators, env_args=env_args_list, env_kwargs=env_kwargs_list, **vec_call_kwargs
        )
Comment on lines +132 to +137

    def _run_subprocess(self, ev: Evaluator, env_name: str, global_step) -> EvalResult:
        out_path = Path(self.train_config.get("data_dir", ".")) / "eval_subprocess_out" / f"{ev.name}.json"
        out_path.parent.mkdir(parents=True, exist_ok=True)
        cfg_path = out_path.with_suffix(".cfg.json")
        with open(cfg_path, "w") as f:
            json.dump({"name": ev.name, "global_step": global_step}, f)
Comment on lines +153 to +155

    subprocess.run(cmd, check=True)
    with open(out_path) as f:
        payload = json.load(f)
    eval_args["load_model_path"] = os.path.join(
        self.config["data_dir"], experiment_name, "models", f"inline_epoch_{self.epoch}.pt"
    logger=self.logger,
    global_step=self.global_step,
Comment on lines +266 to +284

    env.map_dir = "/scratch/$USER/data/nuplan/nuplan_mini_train_bins"

    [eval.behaviors_hard_stop]
    inherits = "behaviors_defaults"
    type = "behavior_class"
    enabled = true
    env.map_dir = "/scratch/$USER/data/nuplan/categories_v021/hard_stop"

    [eval.behaviors_highway_straight]
    inherits = "behaviors_defaults"
    type = "behavior_class"
    enabled = true
    env.map_dir = "/scratch/$USER/data/nuplan/categories_v021/highway_straight"

    [eval.behaviors_lane_change]
    inherits = "behaviors_defaults"
    type = "behavior_class"
    enabled = true
    env.map_dir = "/scratch/$USER/data/nuplan/categories_v021/lane_change"
Comment on lines 1 to 8

    import os
    import sys
    import glob
    import random
    import shutil
    import subprocess
    import tempfile
    import json
Comment on lines +71 to +84

    def maybe_run(self, epoch: int, policy, env_name: str, logger=None, global_step=None) -> dict:
        """Called from the training loop. Runs every enabled evaluator
        whose `interval` divides `epoch`. Returns a dict of {eval_name → metrics}."""
        results = {}
        for ev in self.evaluators:
            if not ev.enabled:
                continue
            if ev.interval <= 0:
                continue
            if epoch % ev.interval != 0:
                continue
            res = self._run_one(ev, policy=policy, env_name=env_name, logger=logger, global_step=global_step)
            results[ev.name] = res
        return results
Comment on lines +126 to +127

    try:
        res = ev.rollout(vecenv, policy, args)
The HumanReplay and MultiScenario rollouts had ~80% the same step loop (reset → forward_eval → step → collect infos → aggregate). Move it to the base class as `_run_rollout_loop`. Subclasses override only the hooks they actually need to diverge on:
- _initial_reset (sync vs async reset)
- _maybe_reset_lstm (per-scenario reset cadence)
- _should_stop (termination condition)
- _flatten_infos (multi-worker vs single-worker info shape)
- _aggregate_infos (per-key mean is the default)
- _render_pass (no-op default; MultiScenario implements EGL)
HumanReplayEvaluator now overrides only `env_overrides` + `_should_stop`. BehaviorClassEvaluator still inherits HumanReplay; only adds the tmp symlink dir for random sampling. WOSAC keeps its custom rollout — its per-scene multi-rollout loop doesn't fit the default shape.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
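The hook structure described in that commit can be sketched as a template method. This is a simplified illustration, not the real classes: signatures are cut down (the real vecenv.step returns five values; the fake below returns two), and `FakeVecEnv` is invented for the demo.

```python
class EvaluatorSketch:
    """Sketch of the shared rollout loop; hook names follow the commit message."""

    def _initial_reset(self, vecenv):
        return vecenv.reset()

    def _should_stop(self, step, infos, args):
        return step >= args["max_steps"]

    def _flatten_infos(self, info_list):
        # Drop None/empty entries; real code also handles nested worker lists.
        return [i for i in info_list if i]

    def _aggregate_infos(self, infos):
        # Default aggregation: per-key mean over every emitted info dict.
        keys = {k for info in infos for k in info}
        return {k: sum(i[k] for i in infos if k in i) / sum(k in i for i in infos)
                for k in keys}

    def _run_rollout_loop(self, vecenv, policy, args):
        obs = self._initial_reset(vecenv)
        infos, step = [], 0
        while not self._should_stop(step, infos, args):
            action = policy(obs)
            obs, info_list = vecenv.step(action)
            infos.extend(self._flatten_infos(info_list))
            step += 1
        return self._aggregate_infos(infos)

class FakeVecEnv:
    def reset(self):
        return 0
    def step(self, action):
        return 0, [{"reward": 1.0}]

result = EvaluatorSketch()._run_rollout_loop(FakeVecEnv(), lambda obs: None, {"max_steps": 3})
# result == {"reward": 1.0}
```

A subclass like HumanReplay would then override only `_should_stop` (stop after one full scenario cycle) while inheriting everything else.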
The benchmark/** unignore was over-broad — it pulled in pycache files that should never be committed. Add an explicit re-ignore for __pycache__ under that path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- eval.render_num_scenarios: explicit per-evaluator render budget; defaults to min(eval.num_scenarios, 3). Renders are expensive (mp4 encode + wandb upload) and shouldn't run at metric scale.
- Render path randomizes starting_map per epoch when not pinned, so successive renders show different bins from the dir instead of the first N alphabetically. Restores the old behavior from _render_driving_behaviours that the refactor lost.
- behaviors_defaults pins render_num_scenarios = 2 so 12 classes × 2 views × 2 scenarios = 48 mp4s/epoch (vs 1200 if it inherited num_scenarios=50).
- Test: render_num_scenarios is inheritable.
"Worst N" selection deferred — needs per-scenario scores piped out of the metric rollout into the render call.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two coupled fixes:
EvalManager:
- Accepts run_id at construction. Used to locate the per-run
models/ directory.
- latest_checkpoint(env_name) walks data_dir/<env>_<run_id>/models/
for the newest model_*.pt. Falls back to train_config.load_model_path
if no checkpoints exist yet.
- has_subprocess_evals_at(epoch) reports whether any enabled
subprocess evaluator would fire at that epoch.
- _run_subprocess uses latest_checkpoint instead of train_config.load_model_path.
PuffeRL.evaluate:
- Calls save_checkpoint() before maybe_run() if any subprocess
  evaluator would fire. Mirrors the old
  run_driving_behaviours_eval_in_subprocess flow.
- train() passes logger.run_id when constructing the manager.
For all-inline configs (today's drive.ini default) this is a no-op.
Activates when an evaluator is flipped to mode=subprocess.
Tests: latest_checkpoint picks the newest by ctime, falls back to
load_model_path when no models exist; has_subprocess_evals_at fires
only on enabled subprocess evaluators at matching intervals.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
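The latest_checkpoint behavior described above can be sketched as below. The directory layout (`data_dir/<env>_<run_id>/models/model_*.pt`) follows the commit message; the function body and demo filenames are illustrative, not the actual manager code.

```python
import tempfile
import time
from pathlib import Path

def latest_checkpoint(data_dir, env_name, run_id, fallback=None):
    """Newest model_*.pt by ctime under data_dir/<env>_<run_id>/models/,
    falling back when no checkpoints exist yet (sketch, not the real code)."""
    models = Path(data_dir) / f"{env_name}_{run_id}" / "models"
    if not models.is_dir():
        return fallback
    candidates = sorted(models.glob("model_*.pt"), key=lambda p: p.stat().st_ctime)
    return str(candidates[-1]) if candidates else fallback

# Demo: two checkpoints, newest wins; missing run falls back.
root = tempfile.mkdtemp()
models = Path(root) / "drive_run7" / "models"
models.mkdir(parents=True)
(models / "model_000100.pt").touch()
time.sleep(0.01)  # ensure distinct ctimes
(models / "model_000200.pt").touch()

newest = latest_checkpoint(root, "drive", "run7")
missing = latest_checkpoint(root, "drive", "other_run", fallback="init.pt")
```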
The unified-eval refactor doesn't actually need this knob, and pushing it through the C struct + binding + drive.py at the same time bloats this PR. Restoring the original `if (env->simulation_mode == SIMULATION_REPLAY)` branch in c_step. The "promote implicit branches to explicit knobs" audit is real work but lives in its own PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…erride stack
Adds 4 tests covering the regression-prone surface the parser-only
tests miss:
test_maybe_run_dispatches_by_interval_and_enabled
Stubs _run_one and verifies that maybe_run fires only enabled
evaluators whose interval divides epoch. epoch=33 fires nothing,
epoch=250 fires both 25-interval and 250-interval evaluators.
test_flatten_infos_handles_shape_variations
_flatten_infos must handle multi-worker (list of lists) AND
PufferEnv (flat list) backends, plus None / empty entries —
one bad isinstance check silently drops episode infos.
test_behavior_class_cleanup_removes_symlink_dir
Builds a real 5-bin map_dir, requests a 2-bin sample, verifies
the tmp symlink dir gets created with 2 bins, then cleanup()
removes it. tempfile.mkdtemp leftovers are a real footgun.
test_eval_args_compose_train_section_and_clean_macro
The full override stack: train baseline → section overrides →
clean macro. Section beats baseline, explicit beats macro,
untouched baseline survives.
25 tests total (was 21).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Aligns with the heavier-eval cadence the rest of the new eval sections use. 25-epoch interval was too aggressive for inline gigaflow validation given the per-pass setup cost. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Time the rollout in Evaluator.rollout (base class) and inject
`eval_seconds` into the returned metrics dict. Manager's _log
posts it to wandb under {ev.name}/eval_seconds — wall-clock cost
per evaluator becomes a first-class panel.
Refactor WOSACEvaluator to override _run_rollout_loop instead of
rollout — now WOSAC also benefits from the timing without code
duplication.
Test: stub evaluator with a forced 20ms floor; verifies eval_seconds
lands in the result and the inner metrics survive.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
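The eval_seconds injection can be sketched as a thin timing wrapper around the rollout. The wrapper name and the metric keys here are illustrative; only the idea (time the rollout, inject wall-clock cost into the returned metrics dict) comes from the commit.

```python
import time

def rollout_with_timing(rollout_fn, *args, **kwargs):
    """Run a rollout and inject its wall-clock cost as 'eval_seconds'
    (sketch of the base-class timing described in the commit)."""
    start = time.perf_counter()
    metrics = dict(rollout_fn(*args, **kwargs))
    metrics["eval_seconds"] = time.perf_counter() - start
    return metrics

m = rollout_with_timing(lambda: {"collision_rate": 0.02})
```

The manager's _log can then post `{ev.name}/eval_seconds` alongside the inner metrics, which survive untouched.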
The eval-mode resample path at drive.py:447 was missing goal_radius in its kwargs to binding.shared, while the initial-spawn call at line 293 has it. binding.shared's C side (binding.c:1545) requires goal_radius via unpack(), so the resample crashes with "Missing required keyword argument 'goal_radius'" once a scenario batch completes and a new one is requested. Latent bug — never triggered because the legacy [eval] section had multi_scenario_eval = False as the default, so the eval rollout path that triggers resample never fired in production. The new EvalManager flips multi_scenario_eval-equivalent ([eval.validation_gigaflow]) on by default, surfacing the crash. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Single section now does both: 250-scenario metric pass + 5-scenario render pass via the render_num_scenarios knob. The split made sense before render became a per-section flag; it's redundant now. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PufferEnv backend in pufferlib.vector.make treats env_creator as a single callable and passes env_args/env_kwargs to it directly (line 697 in vector.py). Multiprocessing/Serial backends expect lists. Render must use PufferEnv (one ffmpeg pipe per env), so pass a single creator + dict, not lists. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 8-map carla dir doesn't need 250 scenarios — gigaflow's C-side eval already creates one internal env per scenario (capped at num_scenarios) and steps them in a single batched rollout, so num_scenarios=8 covers every map exactly once in parallel. Drop num_agents 512→400 to fill exactly 8 × 50 slots with no wasted capacity. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…o_length
Render path was falling back to env.scenario_length (3000 = 100s mp4 at
30fps) because render_max_steps wasn't a real config knob. Two fixes:
1. multi_scenario.py: read eval.render_max_steps from args["eval"]
where _build_eval_args puts evaluator-private fields, and default
to 91 (~3 sec) — not scenario_length, which is the metric pass.
2. drive.ini: set eval.render_max_steps = 91 on validation_gigaflow
explicitly so the comment + value document the intent.
EGL render is ~3 fps wall-clock at 1080p, so 91 × 5 × 2 ≈ 5 min/render
pass instead of 28 min when defaulting to scenario_length=3000.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The old MultiScenarioEvaluator forced Multiprocessing + async_reset by
default, which (a) added fork/IPC startup cost without throughput gain
at our scale, (b) broke render (one ffmpeg pipe per env requires
single-process), and (c) couldn't construct the vec env at all because
the manager passed the PufferEnv-shaped single creator to the list-shaped
backend dispatch. Three changes:
1. multi_scenario.py: drop the vec_overrides() and _initial_reset()
overrides that pinned Multiprocessing — inherit base-class PufferEnv
+ sync reset.
2. manager.py:_run_inline: branch on backend so Multiprocessing remains
a valid opt-in via [eval.<name>.vec] backend = "Multiprocessing".
Useful for memory-bound replay sweeps, hetero-config evals, or
async overlap on long rollouts — not needed for the 8-map carla
validation eval that's the hot path today.
3. The 8-map validation eval now spins up one Drive with 8 internal
envs in-process, no fork.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… path
Two fixes for the multi-evaluator smoke test:
1. multi_scenario._render_view: glob filter excludes mp4s that already
existed in out_dir before this pass. The dir is shared across epochs
and across views, so a bare `out_dir.glob('*.mp4')` was returning
every mp4 from prior render passes too — turning epoch 12510's
16-mp4 render into a 32-mp4 wandb upload (the 16 from epoch 12505
were still on disk).
2. drive.ini: replace `/scratch/$USER/data/nuplan/...` with the literal
path. configparser doesn't expand env vars, and the manager doesn't
either, so behavior evals were crashing on FileNotFoundError trying
to open `/scratch/$USER/data/...`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
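The configparser behavior behind fix 2 is easy to demonstrate: basic interpolation only handles `%(...)s` patterns, so `$USER` passes through verbatim. The snippet below shows the failure mode; `os.path.expandvars` is one way the manager could have opted into expansion, not what the commit actually did (it switched to a literal path).

```python
import configparser
import os

cfg = configparser.ConfigParser()
cfg.read_string(
    "[eval.behaviors_hard_stop]\n"
    "env.map_dir = /scratch/$USER/data/nuplan/categories_v021/hard_stop\n"
)
raw = cfg["eval.behaviors_hard_stop"]["env.map_dir"]
# raw still contains the literal "$USER" -- configparser does no env expansion,
# so opening this path raises FileNotFoundError.

expanded = os.path.expandvars(raw)  # hypothetical fix, had expansion been wanted
```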
Previously `{scenario_id}{view_suffix}.mp4` — each render epoch
overwrote the previous epoch's mp4 in place, making the prior fix's
existing-file snapshot filter incorrect (filter would exclude the
freshly-written file because Path equality matches the prior path).
Now `{scenario_id}_step{N}{view_suffix}.mp4` so:
- successive eval epochs produce distinct mp4s (no overwrites).
- wandb's render carousel shows one entry per epoch, letting the
user watch policy evolve over training.
- the return-paths glob is exact: just files matching this step's
pattern, no snapshot-filtering trickery.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Behavior-class evaluators previously had render=true in config but no
_render_pass implementation, so the videos silently never appeared.
Three changes:
1. base.py: lift _render_pass + _render_view from MultiScenarioEvaluator
into Evaluator. The render loop is generic — fresh PufferEnv with
render_mode=headless, ffmpeg pipe per active env per view, mp4s
stamped with global_step in the filename.
2. base.py: add _render_env_overrides hook so subclasses can tweak the
render env (default = metric env + render_mode=headless). Render
loop now caps internal-env render count at eval.render_num_scenarios
instead of always rendering the full batch (the C kernel still steps
the full batch — just fewer ffmpeg pipes). Drops the dead
batch_size_eval write.
3. multi_scenario.py: keeps the random-starting_map override (its only
real difference from the default render path) — everything else is
now inherited.
Result: behavior_class and human_replay inherit a working render path.
For 12 behavior classes with render_num_scenarios=2, render is bounded
by ~3 fps × 300 steps × 2 views ≈ 3.5 min/class.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Every evaluator runs at the same global_step, so the previous shared out_dir + step glob made each evaluator's _render_view re-collect every earlier evaluator's mp4s into result.frames. wandb then logged validation_gigaflow's videos under behaviors_*/render too. Per-evaluator subdir (`mp4/<ev_name>/`) keeps each evaluator's render output isolated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… aliasing
Every behavior_class evaluator was rendering the same bin (the LAST
section's map_dir won for all siblings) because _deep_merge mutates
its dst in place, but _expand_dotted's shallow-copy meant the first
level merged into a fresh `merged` dict had `merged["env"]` aliasing
the parent section's env dict directly. The next level's merge then
recursed into that alias and overwrote keys on the parent itself, so:
- building child_a left defaults["env"]["map_dir"] = a's map_dir
- building child_b read the polluted defaults, set map_dir = b's
- … repeated through all siblings — every cfg["env"] is the same
reference, the last write wins for everyone.
Diagnostic showed all 12 behavior evaluators returning map_dir =
unprotected_right (the last section), num_maps=1, num_agents=1.
Fix: deepcopy each level before merging so `merged` owns its own
nested dicts and the parent section dict in `all_sections` stays
pristine across sibling builds.
Plus a regression test that builds three siblings inheriting from
the same parent and asserts each gets its own map_dir AND that the
parent dict is untouched.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
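The aliasing bug and its deepcopy fix can be reproduced in miniature. The helper names (`deep_merge`, `build`) are illustrative; the point is that each merge level owns its own nested dicts, so sibling sections never write through to the shared parent.

```python
from copy import deepcopy

def deep_merge(dst, src):
    """Recursively merge src into dst, mutating dst in place."""
    for key, value in src.items():
        if isinstance(value, dict) and isinstance(dst.get(key), dict):
            deep_merge(dst[key], value)
        else:
            dst[key] = value
    return dst

def build(parent, child):
    """Deepcopy each level before merging so siblings never alias the
    parent's nested dicts (the fix described in the commit)."""
    merged = deepcopy(parent)
    return deep_merge(merged, deepcopy(child))

defaults = {"env": {"map_dir": "default", "num_agents": 64}}
a = build(defaults, {"env": {"map_dir": "/bins/a"}})
b = build(defaults, {"env": {"map_dir": "/bins/b"}})
# Each sibling keeps its own map_dir, and defaults stays pristine.
```

Without the deepcopy, `merged["env"]` would alias `defaults["env"]`, so building `b` would also rewrite `a`'s view of map_dir, exactly the last-write-wins symptom in the diagnostic.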
Render mp4 filenames now use `_epoch{E}_step{N}{view_suffix}.mp4`
instead of just `_step{N}{view_suffix}.mp4`. Epoch is the human-readable
checkpoint index ("which epoch did this come from"), step is the
precise env-step count. Both are useful — epoch for sorting through
training history at a glance, step for tying back to wandb's x-axis.
Threads `epoch` through maybe_run → _run_one → _run_inline →
_build_eval_args → args["epoch"], where _render_pass picks it up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…enarios=2
The 'Template — no type, never instantiated' inline comment on
[eval.behaviors_defaults] was redundant with the section-header block
above ('All inherit from the template below') and read like internal
prose mid-section.
Also drop the explicit `eval.render_num_scenarios = 2` whose comment
defended pinning lower than the default of 3 — the 48-vs-72 difference
in mp4 count isn't load-bearing, and the contradiction (set 2, comment
defaults 3) was confusing. Inherits the default 3 now.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…efaults
base.py:_render_view used to default `render_num_scenarios = min(metric_count, 3)` and `render_max_steps = 300` if the section didn't set them. That hides the render-cost knob inside code where readers won't see it. Render is expensive (EGL frame-by-frame at 1080p, ffmpeg encode + wandb upload), and the right budget depends on scenario_length and how many evaluators are firing — both are config-side concerns.
Now: if `render = true`, the section MUST set both `eval.render_num_scenarios` and `eval.render_max_steps`. Missing either raises KeyError at first render attempt.
drive.ini: behaviors_defaults already had render=true; add explicit render_num_scenarios=3 and render_max_steps=200 (=full bin at scenario_length=201). validation_gigaflow already set both.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The default rollout loop now reads `terminals` and `truncations` from
vecenv.step (previously discarded) and masks the LSTM state per-agent —
either signal means 'episode over, env reset' so we OR them. Agents
whose env truncated/terminated start the next forward with zeroed
hidden state; the rest carry recurrent memory across the step.
This replaces the whole-batch heuristic (`steps % scenario_length == 0`)
that lived in MultiScenarioEvaluator._maybe_reset_lstm. The heuristic
only worked because gigaflow validation runs uniform scenarios with
termination_mode=0, where every env truncated on the same Python tick.
It would silently desync if any env truncated early (e.g.
termination_mode=1, future per-bin scenario_length).
Also fixes the silent correctness bug for behavior_class / human_replay
evaluators, which inherited the base no-op and were never resetting
LSTM state across scenario boundaries. (Latent because rnn_name=None in
the smoke configs we've been running, but it'd bite as soon as RNN is
enabled.)
- base.py: capture terminals/truncations from vecenv.step; mask state
per-agent with `(1 - (term | trunc)).reshape(-1, 1)` before next
forward.
- base.py: drop the `_maybe_reset_lstm` hook (now redundant).
- multi_scenario.py: drop the override (was the only implementer).
- tests: regression test that constructs a 4-agent rollout where 3
agents fire terminate/truncate and one stays alive; asserts only
the alive agent's state survives.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
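The per-agent masking expression from the commit can be sketched with numpy. The helper name and the (batch, hidden) state shape are illustrative; the core line, `(1 - (term | trunc)).reshape(-1, 1)`, is the one the commit describes.

```python
import numpy as np

def mask_lstm_state(h, c, terminals, truncations):
    """Zero hidden state only for agents whose episode just ended.
    Either signal means 'episode over, env reset', so OR them."""
    done = terminals.astype(bool) | truncations.astype(bool)
    keep = (1 - done.astype(h.dtype)).reshape(-1, 1)
    return h * keep, c * keep

h = np.ones((4, 8))
c = np.ones((4, 8))
term = np.array([1, 0, 0, 0])
trunc = np.array([0, 1, 1, 0])
h2, c2 = mask_lstm_state(h, c, term, trunc)
# Agents 0-2 restart with zeroed state; agent 3 keeps its memory.
```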
Each test now opens with a 1-3 line docstring stating what it asserts and why — useful for future readers scanning the file or for failure output to make the intent obvious without re-reading the body. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The C-side replay branch caps env_count at num_eval_scenarios when
eval_mode is on (binding.c:1653). HumanReplay/BehaviorClass were never
setting it, so it fell back to drive.py's default of 16. Behavior
classes with >16 bins (or eval.num_scenarios > 16 with sampling) were
silently only loading 16 of N envs as bins, then deterministically
re-running those same 16 until len(infos) >= num_maps padded the count.
Net effect: 34 of 50 sampled bins wasted CPU on tmp symlinks and never
got evaluated.
- human_replay.env_overrides: always (re)set num_agents = num_maps =
num_eval_scenarios = bin count. Drops the prior `if num_agents not
in env` guard that was silently bypassed by the train config's
num_agents=1024 default.
- human_replay._should_stop: now stops at scenario_length+1 (= one
full cycle). Every env truncates simultaneously at scenario_length,
vec_log emits one aggregate dict averaging all SDCs at that step,
and we have the full per-bin metric. Continuing past that just
deterministically re-runs the same bins (resample_frequency is
effectively never in replay), so the prior `len(infos) >= num_maps`
threshold was 49× wasteful for 50-bin classes.
- behavior_class.env_overrides: sets num_eval_scenarios alongside
num_agents/num_maps in both the sampling and all-bins branches.
- regression test: builds a 50-bin map_dir and asserts both branches
of behavior_class.env_overrides set num_eval_scenarios = N.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ip rest
Two fixes:
1. Subprocess eval now receives --global-step and --epoch.
Previously _run_subprocess wrote a .cfg.json next to --out but never
passed it (or any other reference to it) on the cmd line, and the
eval CLI had no flags for either field. Result: every subprocess
render wrote `{scenario_id}_epoch0_step0.mp4`, and successive eval
epochs would silently overwrite each other on disk. Metric data was
correct (parent re-attached the right global_step on _log) but the
wandb captions were uninterpretable.
Now: _run_subprocess passes --global-step and --epoch on the cmd
line; pufferl.py's main() pulls them out of argv (same pattern as
--evaluator / --out) and threads them into eval(); the eval()
function passes them to run_one_by_name; the manager threads them
through to _build_eval_args where the render path picks them up
via args["global_step"] / args["epoch"]. Drops the dead .cfg.json
write.
2. maybe_run isolates per-evaluator failures.
A bad bin dir or KeyError in one [eval.behaviors_X] section used to
crash the for-loop and skip every subsequent evaluator at that
epoch. Now each _run_one is wrapped: the exception is printed in
full (header to stdout, traceback to stderr) and the loop moves on.
Successful evaluators land in the returned results dict; the failed
one is absent.
Errors stay loud — no silent swallow. Just stops one bad evaluator
from dragging the whole eval batch down with it.
- test_maybe_run_isolates_evaluator_failures: 3 evaluators where the
middle one raises; asserts the other two succeed and that the
failure is visible in stdout+stderr.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
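The failure-isolation pattern in fix 2 can be sketched as a try/except around each evaluator in the loop. The dict-of-callables shape below is a simplification of the real evaluator objects; the loud-print-and-continue behavior is the part taken from the commit.

```python
import sys
import traceback

def maybe_run(evaluators, epoch):
    """Run every evaluator firing at this epoch; a failure is printed in
    full but does not skip the remaining evaluators (illustrative sketch)."""
    results = {}
    for name, fn in evaluators.items():
        try:
            results[name] = fn()
        except Exception:
            # Loud, not silent: header to stdout, traceback to stderr.
            print(f"[eval] evaluator '{name}' failed at epoch {epoch}")
            traceback.print_exc(file=sys.stderr)
    return results

res = maybe_run(
    {"a": lambda: {"ok": 1}, "bad": lambda: 1 / 0, "c": lambda: {"ok": 3}},
    epoch=250,
)
# res contains "a" and "c"; "bad" is absent but its traceback was printed.
```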
…rocess
Two new tests filled in (concerns 16 and 17 from the audit) — and the
render test caught an actual bug while writing it.
- test_render_pass_per_evaluator_subdir_and_step_glob: mocks
vector.make + sample_logits so the render loop runs without EGL.
Verifies per-evaluator subdir, epoch+step in filename, and that
stale mp4s from a prior epoch don't leak into result.frames via
the glob.
- test_run_subprocess_passes_step_epoch_and_reads_back_result:
mocks subprocess.run; asserts the cmd line carries --global-step,
--epoch, --evaluator, --out, --load-model-path. After the fake
subprocess writes the canonical JSON, the parent reads it back
into EvalResult.metrics/frames correctly.
Bug the render test caught: _render_view was globbing
`*{step_suffix}*.mp4` to filter prior-epoch mp4s, but that glob also
matched OTHER views' files from the same pass (e.g. the bev view
glob picked up sim_state files that already existed in the subdir).
all_paths.extend across views was then double-counting sim_state
mp4s. The cluster run was silently logging duplicates to wandb's
render key.
Fix: glob by full view_suffix (step + view marker), with exact
.mp4 suffix anchoring. sim_state's `*_epoch7_step12.mp4` no longer
matches the bev variant `*_epoch7_step12_bev.mp4`, so each view's
return is just its own mp4s. The extend call accumulates cleanly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
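The glob-anchoring fix can be demonstrated directly: globbing on the full view suffix plus an exact `.mp4` tail means the bare sim_state pattern cannot also match the `_bev` variant. The function name and the scene filenames below are made up for the demo; the pattern shape (`*_epoch{E}_step{N}{view_suffix}.mp4`) follows the commits.

```python
import tempfile
from pathlib import Path

def render_paths(out_dir, epoch, step, view_suffix):
    """Glob anchored on view suffix + '.mp4' so '_epoch7_step12.mp4'
    never matches '_epoch7_step12_bev.mp4' (sketch of the fix)."""
    return sorted(Path(out_dir).glob(f"*_epoch{epoch}_step{step}{view_suffix}.mp4"))

out = Path(tempfile.mkdtemp())
(out / "scene1_epoch7_step12.mp4").touch()      # sim_state view, this pass
(out / "scene1_epoch7_step12_bev.mp4").touch()  # bev view, this pass
(out / "scene1_epoch6_step11.mp4").touch()      # stale mp4 from a prior epoch

sim = render_paths(out, 7, 12, "")
bev = render_paths(out, 7, 12, "_bev")
# Each view's return is just its own mp4s; the stale file matches neither.
```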
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… counts
Three of the five aggregation issues from the recent audit. The
remaining two (failed-spawn dilution in gigaflow, per-bin vs per-SDC
weighting) need C-side changes — flagging as future work.
- _aggregate_infos: switch from mean-of-means to weighted mean by
per-emission `n`. Equivalent to current behavior when batches are
uniform (today), correct when they aren't (termination_mode=1,
mixed scenario_lengths, future per-bin overrides). Caveat in the
docstring for ratio fields like avg_distance_per_infraction
(numerator/denominator aren't observable from Python — weighted
mean is the same approximation as mean-of-means there).
- _aggregate_infos: rename `num_scenarios_completed` →
`num_log_cycles` (it was the count of vec_log emissions, not the
count of unique scenes), and add `num_agents_evaluated` (sum of n
across emissions = true count of SDC-trajectories behind the
metric). For behaviors with one cycle of N bins, that's now
`num_log_cycles=1, num_agents_evaluated=N` — much less misleading
than the prior `num_scenarios_completed=1`.
- BehaviorClassEvaluator: seed random.sample with the section name
so every epoch picks the same bins. Cross-epoch metric jitter
from bin selection used to be indistinguishable from policy
improvement; now jitter comes only from policy changes. Also
sorted the bin listing for determinism before the sample.
- tests: weighted-mean correctness with two emissions of different
n; empty-infos handling; reproducible sampling across two
evaluator instances with the same name; differing sections sample
different bins.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
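The two Python-side aggregation fixes can be sketched together: a mean weighted by each emission's sample count `n`, and bin sampling seeded with the section name so every epoch evaluates the same bins. Function names and the metric key are illustrative.

```python
import random

def weighted_mean(emissions, key):
    """Mean weighted by per-emission sample count `n`. Equals the plain
    mean-of-means only when every emission has the same n."""
    num = sum(e[key] * e["n"] for e in emissions if key in e)
    den = sum(e["n"] for e in emissions if key in e)
    return num / den

def sample_bins(bins, k, section_name):
    """Seed with the section name so bin selection is stable across
    epochs; sort first so directory listing order can't leak in."""
    rng = random.Random(section_name)
    return rng.sample(sorted(bins), k)

# Two emissions with different n: mean-of-means would say 0.25,
# the weighted mean correctly says (0.1*10 + 0.4*30) / 40 = 0.325.
m = weighted_mean(
    [{"collision_rate": 0.1, "n": 10}, {"collision_rate": 0.4, "n": 30}],
    "collision_rate",
)
```

With seeded sampling, cross-epoch metric jitter can only come from policy changes, not from which bins happened to be drawn.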
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Replaces the five overlapping eval pipelines with one base class + manager.
Test plan
Out of scope