
[WIP] eval: unified Evaluator + EvalManager refactor#420

Open
eugenevinitsky wants to merge 40 commits into emerge/temp_training from ev/unified-evals

Conversation


@eugenevinitsky eugenevinitsky commented May 9, 2026

Summary

Replaces the five overlapping eval pipelines with one base class + manager.

Test plan

  • Run a short training job on the cluster end-to-end and verify `validation_gigaflow` fires every 25 epochs, `behaviors_*` fires every 250.
  • Verify wandb panels land under `{eval_name}/` and renders land under `{eval_name}/render/{view}`.
  • Verify `puffer eval puffer_drive --evaluator behaviors_hard_stop` runs standalone against a fixed checkpoint.
  • Verify `mode = "subprocess"` works (currently untested; the JSON-out path is plumbed but not exercised).
  • CI pre-commit + utest workflows green.

Out of scope

  • Persistent eval-worker pool (deferred until subprocess overhead becomes painful).
  • C-engine multi-pipe render (one rollout, N ffmpeg outputs per view).
  • Engine-level vecenv reuse via runtime `map_dir` swap.
  • `eval_summary` rollup metric.

Eugene Vinitsky and others added 5 commits May 9, 2026 18:27
…ation yet)

Adds the new evaluator framework as pure additions:

  pufferlib/ocean/benchmark/evaluators/
    base.py            Evaluator + EvalResult dataclass
    multi_scenario.py  MultiScenarioEvaluator (replaces eval_multi_scenarios)
    human_replay.py    HumanReplayEvaluator (replay + control_sdc_only loop)
    behavior_class.py  BehaviorClassEvaluator (per-class nuPlan suite)
    wosac.py           Thin wrapper around the existing WOSACEvaluator
  pufferlib/ocean/benchmark/manager.py
    EvalManager: section discovery, inheritance chain, clean-eval macro,
    dotted-key flattening, inline + subprocess dispatch, wandb logging.

Plus tests/test_eval_manager.py for the config-merge logic.

See docs/eval_unification.md for the full design rationale.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Auto-discovered evaluator sections, schema A (dotted keys). Each
behavior class is its own section inheriting from behaviors_defaults
template; gigaflow validation has separate metric-only and render
sections at different intervals.

driving_behaviours_eval.ini deleted — folded into drive.ini.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
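The "schema A" dotted-key expansion mentioned above can be sketched roughly as follows. `expand_dotted` is a hypothetical name, and the exact section shape is an assumption based on the commit message, not the PR's actual code:

```python
# Sketch: a flat configparser section like {"env.map_dir": "...", "type": "..."}
# becomes a nested dict keyed by the dotted path segments.
def expand_dotted(flat: dict) -> dict:
    nested: dict = {}
    for key, value in flat.items():
        parts = key.split(".")
        cursor = nested
        for part in parts[:-1]:
            # Descend, creating intermediate dicts as needed.
            cursor = cursor.setdefault(part, {})
        cursor[parts[-1]] = value
    return nested

section = {"type": "behavior_class", "env.map_dir": "/data/hard_stop", "eval.render": True}
print(expand_dotted(section))
# {'type': 'behavior_class', 'env': {'map_dir': '/data/hard_stop'}, 'eval': {'render': True}}
```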
Removed from pufferl.py (~990 lines):
  eval(), eval_multi_scenarios(), eval_multi_scenarios_render(),
  build_eval_overrides(), load_eval_multi_scenarios_config(),
  _swap_policy_obs_counts(), _render_driving_behaviours(),
  _export_metrics(), _log_eval_metrics(),
  verify_scenario_coverage(), verify_scenario_coverage_gigaflow().
  Plus the legacy eval block in PuffeRL.evaluate() and the
  driving_behaviours_eval.ini loader in load_config.

Removed from utils.py (~300 lines):
  run_human_replay_eval_in_subprocess,
  run_wosac_eval_in_subprocess,
  run_driving_behaviours_eval_in_subprocess.

PuffeRL.evaluate() now calls self._eval_manager.maybe_run() — single
unified path for all evals. main() wires `puffer eval --evaluator <name>
--out <path>` for both standalone and subprocess use.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…EPLAY in c_step

Add GOAL_ADVANCE_REGENERATE / GOAL_ADVANCE_SATURATE constants and a
goal_advance_mode field on Drive. c_step's last-goal branch now
dispatches on goal_advance_mode instead of simulation_mode.

Drive.__init__ accepts goal_advance_mode kwarg with auto-pick based
on simulation_mode (gigaflow → regenerate, replay → saturate) — same
behavior as before, but the choice is explicit and per-eval-overridable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 9, 2026 22:33
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

This PR refactors PufferDrive’s evaluation system by replacing multiple overlapping eval entry points with a unified Evaluator framework and an EvalManager that discovers [eval.<name>] sections from drive.ini and runs them inline or via subprocess. It also introduces a new C/Python goal_advance_mode knob to decouple last-goal behavior from simulation_mode.

Changes:

  • Added pufferlib/ocean/benchmark/evaluators/ (base Evaluator + concrete evaluators) and pufferlib/ocean/benchmark/manager.py (discovery, inheritance merge, clean macro, dispatch, logging).
  • Refactored pufferlib/pufferl.py eval CLI + training loop to route evaluation through EvalManager; removed legacy subprocess eval helpers from pufferlib/utils.py.
  • Updated drive.ini to [eval.<name>] suites (with templates + inheritance), removed driving_behaviours_eval.ini, and added goal_advance_mode through Python/C bindings.

Reviewed changes

Copilot reviewed 16 out of 17 changed files in this pull request and generated 9 comments.

File Description
tests/test_eval_manager.py New unit tests for eval-section discovery, inheritance merge, dotted-key expansion, and clean macro behavior.
README.md Updates cluster install wording (NYU cluster specifics).
pufferlib/utils.py Removes legacy subprocess eval helpers; retains video rendering helper.
pufferlib/pufferl.py Routes eval through EvalManager; rewires CLI puffer eval --evaluator ... --out ...; removes legacy eval pipelines.
pufferlib/ocean/drive/drive.py Adds Python-side goal_advance_mode and passes it into C env init kwargs.
pufferlib/ocean/drive/drive.h Adds GOAL_ADVANCE_* constants + goal_advance_mode field; switches last-goal behavior to the new knob in c_step.
pufferlib/ocean/drive/binding.c Unpacks goal_advance_mode into the C env struct.
pufferlib/ocean/benchmark/manager.py New EvalManager with section discovery, inheritance merge, clean macro, inline/subprocess dispatch, and logging.
pufferlib/ocean/benchmark/evaluators/base.py New Evaluator base class + EvalResult.
pufferlib/ocean/benchmark/evaluators/multi_scenario.py New multi-scenario evaluator + optional EGL render pass.
pufferlib/ocean/benchmark/evaluators/human_replay.py New human replay evaluator implementation.
pufferlib/ocean/benchmark/evaluators/behavior_class.py New behavior-class evaluator (sampling + symlink-dir cleanup).
pufferlib/ocean/benchmark/evaluators/wosac.py New WOSAC evaluator adapter around existing realism evaluator logic.
pufferlib/ocean/benchmark/evaluators/__init__.py Registers evaluator types for manager lookup.
pufferlib/config/ocean/driving_behaviours_eval.ini Deleted (folded into drive.ini).
pufferlib/config/ocean/drive.ini Replaces [eval] with [eval.<name>] suites + templates/inheritance; adds WOSAC and behavior-class sections.
.gitignore Un-ignores pufferlib/ocean/benchmark/** (was previously matched by benchmark*/).


Comment on lines +62 to +68
obs, _, _, _, info_list = vecenv.step(action_np)
if info_list:
    all_infos.extend(info_list)
# Stop once every bin has yielded one info to avoid double-counting
# on the second cycle through the dir.
if len(all_infos) >= num_maps:
    break
Comment thread pufferlib/ocean/benchmark/manager.py Outdated
Comment on lines +105 to +125
def _run_inline(self, ev: Evaluator, policy, env_name: str, global_step) -> EvalResult:
    args = self._build_eval_args(ev, env_name=env_name, global_step=global_step)

    package = args.get("package", "ocean")
    module_name = "pufferlib.ocean" if package == "ocean" else f"pufferlib.environments.{package}"
    env_module = importlib.import_module(module_name)
    make_env = env_module.env_creator(env_name)

    vec_kwargs = ev.vec_overrides()
    num_envs = int(vec_kwargs.get("num_envs", 1))
    env_kwargs_list = [args["env"] for _ in range(num_envs)]
    env_creators = [make_env] * num_envs
    env_args_list = [[]] * num_envs

    vec_call_kwargs = dict(vec_kwargs)
    vec_call_kwargs.setdefault("num_workers", num_envs)
    vec_call_kwargs.setdefault("batch_size", num_envs)

    vecenv = pufferlib.vector.make(
        env_creators, env_args=env_args_list, env_kwargs=env_kwargs_list, **vec_call_kwargs
    )
Comment thread pufferlib/ocean/benchmark/manager.py Outdated
Comment on lines +132 to +137
def _run_subprocess(self, ev: Evaluator, env_name: str, global_step) -> EvalResult:
    out_path = Path(self.train_config.get("data_dir", ".")) / "eval_subprocess_out" / f"{ev.name}.json"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    cfg_path = out_path.with_suffix(".cfg.json")
    with open(cfg_path, "w") as f:
        json.dump({"name": ev.name, "global_step": global_step}, f)
Comment on lines +153 to +155
subprocess.run(cmd, check=True)
with open(out_path) as f:
    payload = json.load(f)
Comment thread pufferlib/pufferl.py
eval_args["load_model_path"] = os.path.join(
self.config["data_dir"], experiment_name, "models", f"inline_epoch_{self.epoch}.pt"
logger=self.logger,
global_step=self.global_step,
Comment thread pufferlib/config/ocean/drive.ini Outdated
Comment on lines +266 to +284
env.map_dir = "/scratch/$USER/data/nuplan/nuplan_mini_train_bins"

[eval.behaviors_hard_stop]
inherits = "behaviors_defaults"
type = "behavior_class"
enabled = true
env.map_dir = "/scratch/$USER/data/nuplan/categories_v021/hard_stop"

[eval.behaviors_highway_straight]
inherits = "behaviors_defaults"
type = "behavior_class"
enabled = true
env.map_dir = "/scratch/$USER/data/nuplan/categories_v021/highway_straight"

[eval.behaviors_lane_change]
inherits = "behaviors_defaults"
type = "behavior_class"
enabled = true
env.map_dir = "/scratch/$USER/data/nuplan/categories_v021/lane_change"
Comment thread pufferlib/utils.py
Comment on lines 1 to 8
import os
import sys
import glob
import random
import shutil
import subprocess
import tempfile
import json
Comment on lines +71 to +84
def maybe_run(self, epoch: int, policy, env_name: str, logger=None, global_step=None) -> dict:
    """Called from the training loop. Runs every enabled evaluator
    whose `interval` divides `epoch`. Returns a dict of {eval_name → metrics}."""
    results = {}
    for ev in self.evaluators:
        if not ev.enabled:
            continue
        if ev.interval <= 0:
            continue
        if epoch % ev.interval != 0:
            continue
        res = self._run_one(ev, policy=policy, env_name=env_name, logger=logger, global_step=global_step)
        results[ev.name] = res
    return results
Comment on lines +126 to +127
try:
    res = ev.rollout(vecenv, policy, args)
Eugene Vinitsky and others added 21 commits May 9, 2026 18:46
The HumanReplay and MultiScenario rollouts shared ~80% of the same step loop
(reset → forward_eval → step → collect infos → aggregate). Move it to
the base class as `_run_rollout_loop`. Subclasses override only the
hooks they actually need to diverge on:

  _initial_reset       (sync vs async reset)
  _maybe_reset_lstm    (per-scenario reset cadence)
  _should_stop         (termination condition)
  _flatten_infos       (multi-worker vs single-worker info shape)
  _aggregate_infos     (per-key mean is the default)
  _render_pass         (no-op default; MultiScenario implements EGL)

HumanReplayEvaluator now overrides only `env_overrides` + `_should_stop`.
BehaviorClassEvaluator still inherits HumanReplay; only adds the tmp
symlink dir for random sampling. WOSAC keeps its custom rollout —
its per-scene multi-rollout loop doesn't fit the default shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
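The hook split described above is a classic template-method pattern. A minimal sketch, using the hook names from the commit message but invented signatures and a toy vecenv contract:

```python
# Template-method sketch: the base class owns the step loop; subclasses
# override only the hooks where they diverge. All signatures are assumptions.
class Evaluator:
    def _initial_reset(self, vecenv):
        return vecenv.reset()

    def _should_stop(self, step, infos):
        raise NotImplementedError  # termination condition is per-evaluator

    def _flatten_infos(self, raw):
        return list(raw or [])  # default: single-worker flat list

    def _aggregate_infos(self, infos):
        # Default aggregation: per-key mean across all collected infos.
        keys = {k for info in infos for k in info}
        return {k: sum(i[k] for i in infos if k in i) / sum(k in i for i in infos)
                for k in keys}

    def _run_rollout_loop(self, vecenv, policy):
        obs = self._initial_reset(vecenv)
        infos, step = [], 0
        while not self._should_stop(step, infos):
            action = policy(obs)
            obs, _, _, _, raw = vecenv.step(action)
            infos.extend(self._flatten_infos(raw))
            step += 1
        return self._aggregate_infos(infos)
```

A subclass like the HumanReplay evaluator described above would then override only `_should_stop` (and env overrides), inheriting the loop unchanged.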
The benchmark/** unignore was over-broad — it pulled in pycache files
that should never be committed. Add an explicit re-ignore for
__pycache__ under that path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
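The resulting ignore rules might look like this (a sketch of the patterns implied by the commit message; the repo's actual .gitignore may differ):

```gitignore
# Ignore benchmark output dirs everywhere...
benchmark*/
# ...but keep the evaluator package tracked
!pufferlib/ocean/benchmark/
# and never commit bytecode caches under it
pufferlib/ocean/benchmark/**/__pycache__/
```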
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- eval.render_num_scenarios: explicit per-evaluator render budget;
  defaults to min(eval.num_scenarios, 3). Renders are expensive
  (mp4 encode + wandb upload) and shouldn't run at metric scale.
- Render path randomizes starting_map per epoch when not pinned,
  so successive renders show different bins from the dir instead
  of the first N alphabetically. Restores the old behavior from
  _render_driving_behaviours that the refactor lost.
- behaviors_defaults pins render_num_scenarios = 2 so 12 classes
  × 2 views × 2 scenarios = 48 mp4s/epoch (vs 1200 if it inherited
  num_scenarios=50).
- Test: render_num_scenarios is inheritable.

"Worst N" selection deferred — needs per-scenario scores piped out
of the metric rollout into the render call.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two coupled fixes:

  EvalManager:
    - Accepts run_id at construction. Used to locate the per-run
      models/ directory.
    - latest_checkpoint(env_name) walks data_dir/<env>_<run_id>/models/
      for the newest model_*.pt. Falls back to train_config.load_model_path
      if no checkpoints exist yet.
    - has_subprocess_evals_at(epoch) reports whether any enabled
      subprocess evaluator would fire at that epoch.
    - _run_subprocess uses latest_checkpoint instead of train_config.load_model_path.

  PuffeRL.evaluate:
    - Calls save_checkpoint() before maybe_run() if any subprocess
      evaluator would fire. Mirrors the old run_driving_behaviours
      _eval_in_subprocess flow.
    - train() passes logger.run_id when constructing the manager.

For all-inline configs (today's drive.ini default) this is a no-op.
Activates when an evaluator is flipped to mode=subprocess.

Tests: latest_checkpoint picks the newest by ctime, falls back to
load_model_path when no models exist; has_subprocess_evals_at fires
only on enabled subprocess evaluators at matching intervals.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
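A hedged sketch of the checkpoint-discovery logic described above. The `data_dir/<env>_<run_id>/models/` layout and the ctime ordering come from the commit message; the glob pattern and fallback details are assumptions:

```python
from pathlib import Path
from typing import Optional

def latest_checkpoint(data_dir: str, env_name: str, run_id: str,
                      fallback: Optional[str] = None) -> Optional[str]:
    # Walk the per-run models/ dir for the newest model_*.pt.
    models = Path(data_dir) / f"{env_name}_{run_id}" / "models"
    candidates = sorted(models.glob("model_*.pt"), key=lambda p: p.stat().st_ctime)
    # Fall back to the configured load_model_path if nothing saved yet.
    return str(candidates[-1]) if candidates else fallback
```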
The unified-eval refactor doesn't actually need this knob, and pushing
it through the C struct + binding + drive.py at the same time bloats
this PR. Restoring the original `if (env->simulation_mode == SIMULATION_REPLAY)`
branch in c_step. The "promote implicit branches to explicit knobs"
audit is real work but lives in its own PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…erride stack

Adds 4 tests covering the regression-prone surface the parser-only
tests miss:

  test_maybe_run_dispatches_by_interval_and_enabled
    Stubs _run_one and verifies that maybe_run fires only enabled
    evaluators whose interval divides epoch. epoch=33 fires nothing,
    epoch=250 fires both 25-interval and 250-interval evaluators.

  test_flatten_infos_handles_shape_variations
    _flatten_infos must handle multi-worker (list of lists) AND
    PufferEnv (flat list) backends, plus None / empty entries —
    one bad isinstance check silently drops episode infos.

  test_behavior_class_cleanup_removes_symlink_dir
    Builds a real 5-bin map_dir, requests a 2-bin sample, verifies
    the tmp symlink dir gets created with 2 bins, then cleanup()
    removes it. tempfile.mkdtemp leftovers are a real footgun.

  test_eval_args_compose_train_section_and_clean_macro
    The full override stack: train baseline → section overrides →
    clean macro. Section beats baseline, explicit beats macro,
    untouched baseline survives.

25 tests total (was 21).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Aligns with the heavier-eval cadence the rest of the new eval
sections use. 25-epoch interval was too aggressive for inline
gigaflow validation given the per-pass setup cost.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Time the rollout in Evaluator.rollout (base class) and inject
`eval_seconds` into the returned metrics dict. Manager's _log
posts it to wandb under {ev.name}/eval_seconds — wall-clock cost
per evaluator becomes a first-class panel.

Refactor WOSACEvaluator to override _run_rollout_loop instead of
rollout — now WOSAC also benefits from the timing without code
duplication.

Test: stub evaluator with a forced 20ms floor; verifies eval_seconds
lands in the result and the inner metrics survive.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The eval-mode resample path at drive.py:447 was missing goal_radius
in its kwargs to binding.shared, while the initial-spawn call at
line 293 has it. binding.shared's C side (binding.c:1545) requires
goal_radius via unpack(), so the resample crashes with
"Missing required keyword argument 'goal_radius'" once a scenario
batch completes and a new one is requested.

Latent bug — never triggered because the legacy [eval] section had
multi_scenario_eval = False as the default, so the eval rollout
path that triggers resample never fired in production. The new
EvalManager flips multi_scenario_eval-equivalent ([eval.validation_gigaflow])
on by default, surfacing the crash.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Single section now does both: 250-scenario metric pass + 5-scenario
render pass via the render_num_scenarios knob. The split made sense
before render became a per-section flag; it's redundant now.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PufferEnv backend in pufferlib.vector.make treats env_creator as a
single callable and passes env_args/env_kwargs to it directly (line
697 in vector.py). Multiprocessing/Serial backends expect lists.
Render must use PufferEnv (one ffmpeg pipe per env), so pass a
single creator + dict, not lists.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 8-map carla dir doesn't need 250 scenarios — gigaflow's C-side eval
already creates one internal env per scenario (capped at num_scenarios)
and steps them in a single batched rollout, so num_scenarios=8 covers
every map exactly once in parallel. Drop num_agents 512→400 to fill
exactly 8 × 50 slots with no wasted capacity.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…o_length

Render path was falling back to env.scenario_length (3000 = 100s mp4 at
30fps) because render_max_steps wasn't a real config knob. Two fixes:

  1. multi_scenario.py: read eval.render_max_steps from args["eval"]
     where _build_eval_args puts evaluator-private fields, and default
     to 91 (~3 sec) — not scenario_length, which is the metric pass.

  2. drive.ini: set eval.render_max_steps = 91 on validation_gigaflow
     explicitly so the comment + value document the intent.

EGL render is ~3 fps wall-clock at 1080p, so 91 × 5 × 2 ≈ 5 min/render
pass instead of 28 min when defaulting to scenario_length=3000.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The old MultiScenarioEvaluator forced Multiprocessing + async_reset by
default, which (a) added fork/IPC startup cost without throughput gain
at our scale, (b) broke render (one ffmpeg pipe per env requires
single-process), and (c) couldn't construct the vec env at all because
the manager passed the PufferEnv-shaped single creator to the list-shaped
backend dispatch. Three changes:

  1. multi_scenario.py: drop the vec_overrides() and _initial_reset()
     overrides that pinned Multiprocessing — inherit base-class PufferEnv
     + sync reset.

  2. manager.py:_run_inline: branch on backend so Multiprocessing remains
     a valid opt-in via [eval.<name>.vec] backend = "Multiprocessing".
     Useful for memory-bound replay sweeps, hetero-config evals, or
     async overlap on long rollouts — not needed for the 8-map carla
     validation eval that's the hot path today.

  3. The 8-map validation eval now spins up one Drive with 8 internal
     envs in-process, no fork.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… path

Two fixes for the multi-evaluator smoke test:

  1. multi_scenario._render_view: glob filter excludes mp4s that already
     existed in out_dir before this pass. The dir is shared across epochs
     and across views, so a bare `out_dir.glob('*.mp4')` was returning
     every mp4 from prior render passes too — turning epoch 12510's
     16-mp4 render into a 32-mp4 wandb upload (the 16 from epoch 12505
     were still on disk).

  2. drive.ini: replace `/scratch/$USER/data/nuplan/...` with the literal
     path. configparser doesn't expand env vars, and the manager doesn't
     either, so behavior evals were crashing on FileNotFoundError trying
     to open `/scratch/$USER/data/...`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously `{scenario_id}{view_suffix}.mp4` — each render epoch
overwrote the previous epoch's mp4 in place, making the prior fix's
existing-file snapshot filter incorrect (filter would exclude the
freshly-written file because Path equality matches the prior path).

Now `{scenario_id}_step{N}{view_suffix}.mp4` so:
  - successive eval epochs produce distinct mp4s (no overwrites).
  - wandb's render carousel shows one entry per epoch, letting the
    user watch policy evolve over training.
  - the return-paths glob is exact: just files matching this step's
    pattern, no snapshot-filtering trickery.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Behavior-class evaluators previously had render=true in config but no
_render_pass implementation, so the videos silently never appeared.
Three changes:

  1. base.py: lift _render_pass + _render_view from MultiScenarioEvaluator
     into Evaluator. The render loop is generic — fresh PufferEnv with
     render_mode=headless, ffmpeg pipe per active env per view, mp4s
     stamped with global_step in the filename.

  2. base.py: add _render_env_overrides hook so subclasses can tweak the
     render env (default = metric env + render_mode=headless). Render
     loop now caps internal-env render count at eval.render_num_scenarios
     instead of always rendering the full batch (the C kernel still steps
     the full batch — just fewer ffmpeg pipes). Drops the dead
     batch_size_eval write.

  3. multi_scenario.py: keeps the random-starting_map override (its only
     real difference from the default render path) — everything else is
     now inherited.

Result: behavior_class and human_replay inherit a working render path.
For 12 behavior classes with render_num_scenarios=2, render is bounded
by ~3 fps × 300 steps × 2 views ≈ 3.5 min/class.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Every evaluator runs at the same global_step, so the previous shared
out_dir + step glob made each evaluator's _render_view re-collect every
earlier evaluator's mp4s into result.frames. wandb then logged
validation_gigaflow's videos under behaviors_*/render too.

Per-evaluator subdir (`mp4/<ev_name>/`) keeps each evaluator's render
output isolated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… aliasing

Every behavior_class evaluator was rendering the same bin (the LAST
section's map_dir won for all siblings) because _deep_merge mutates
its dst in place, but _expand_dotted's shallow-copy meant the first
level merged into a fresh `merged` dict had `merged["env"]` aliasing
the parent section's env dict directly. The next level's merge then
recursed into that alias and overwrote keys on the parent itself, so:

  - building child_a left defaults["env"]["map_dir"] = a's map_dir
  - building child_b read the polluted defaults, set map_dir = b's
  - … repeated through all siblings — every cfg["env"] is the same
    reference, the last write wins for everyone.

Diagnostic showed all 12 behavior evaluators returning map_dir =
unprotected_right (the last section), num_maps=1, num_agents=1.

Fix: deepcopy each level before merging so `merged` owns its own
nested dicts and the parent section dict in `all_sections` stays
pristine across sibling builds.

Plus a regression test that builds three siblings inheriting from
the same parent and asserts each gets its own map_dir AND that the
parent dict is untouched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
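The aliasing bug and its deepcopy fix can be reproduced in miniature. `_deep_merge` mirrors the behavior the commit message describes; the `build` wrapper and exact signatures are assumptions:

```python
import copy

def _deep_merge(dst: dict, src: dict) -> dict:
    # Mutates dst in place -- the root cause of the bug when dst's nested
    # dicts alias a shared parent section.
    for k, v in src.items():
        if isinstance(v, dict) and isinstance(dst.get(k), dict):
            _deep_merge(dst[k], v)
        else:
            dst[k] = v
    return dst

def build(parent: dict, child: dict) -> dict:
    # The fix: deepcopy each level before merging so `merged` owns its own
    # nested dicts and the shared parent stays pristine across siblings.
    merged = copy.deepcopy(parent)
    return _deep_merge(merged, copy.deepcopy(child))

defaults = {"env": {"map_dir": "default", "num_agents": 64}}
a = build(defaults, {"env": {"map_dir": "hard_stop"}})
b = build(defaults, {"env": {"map_dir": "lane_change"}})
assert a["env"]["map_dir"] == "hard_stop" and b["env"]["map_dir"] == "lane_change"
assert defaults["env"]["map_dir"] == "default"  # parent untouched
```

Without the `deepcopy` calls, `merged["env"]` would alias `defaults["env"]`, and every sibling build would overwrite the same dict, reproducing the last-write-wins symptom above.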
Eugene Vinitsky and others added 13 commits May 10, 2026 11:40
Render mp4 filenames now use `_epoch{E}_step{N}{view_suffix}.mp4`
instead of just `_step{N}{view_suffix}.mp4`. Epoch is the human-readable
checkpoint index ("which epoch did this come from"), step is the
precise env-step count. Both are useful — epoch for sorting through
training history at a glance, step for tying back to wandb's x-axis.

Threads `epoch` through maybe_run → _run_one → _run_inline →
_build_eval_args → args["epoch"], where _render_pass picks it up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…enarios=2

The 'Template — no type, never instantiated' inline comment on
[eval.behaviors_defaults] was redundant with the section-header block
above ('All inherit from the template below') and read like internal
prose mid-section.

Also drop the explicit `eval.render_num_scenarios = 2` whose comment
defended pinning lower than the default of 3 — the 48-vs-72 difference
in mp4 count isn't load-bearing, and the contradiction (set 2, comment
defaults 3) was confusing. Inherits the default 3 now.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…efaults

base.py:_render_view used to default `render_num_scenarios = min(metric_count, 3)`
and `render_max_steps = 300` if the section didn't set them. That hides the
render-cost knob inside code where readers won't see it. Render is expensive
(EGL frame-by-frame at 1080p, ffmpeg encode + wandb upload), and the right
budget depends on scenario_length and how many evaluators are firing — both
are config-side concerns.

Now: if `render = true`, the section MUST set both `eval.render_num_scenarios`
and `eval.render_max_steps`. Missing either raises KeyError at first render
attempt.

drive.ini: behaviors_defaults already had render=true; add explicit
render_num_scenarios=3 and render_max_steps=200 (=full bin at scenario_length=201).
validation_gigaflow already set both.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The default rollout loop now reads `terminals` and `truncations` from
vecenv.step (previously discarded) and masks the LSTM state per-agent —
either signal means 'episode over, env reset' so we OR them. Agents
whose env truncated/terminated start the next forward with zeroed
hidden state; the rest carry recurrent memory across the step.

This replaces the whole-batch heuristic (`steps % scenario_length == 0`)
that lived in MultiScenarioEvaluator._maybe_reset_lstm. The heuristic
only worked because gigaflow validation runs uniform scenarios with
termination_mode=0, where every env truncated on the same Python tick.
It would silently desync if any env truncated early (e.g.
termination_mode=1, future per-bin scenario_length).

Also fixes the silent correctness bug for behavior_class / human_replay
evaluators, which inherited the base no-op and were never resetting
LSTM state across scenario boundaries. (Latent because rnn_name=None in
the smoke configs we've been running, but it'd bite as soon as RNN is
enabled.)

  - base.py: capture terminals/truncations from vecenv.step; mask state
    per-agent with `(1 - (term | trunc)).reshape(-1, 1)` before next
    forward.
  - base.py: drop the `_maybe_reset_lstm` hook (now redundant).
  - multi_scenario.py: drop the override (was the only implementer).
  - tests: regression test that constructs a 4-agent rollout where 3
    agents fire terminate/truncate and one stays alive; asserts only
    the alive agent's state survives.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
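The per-agent mask described above can be sketched as follows. The `(1 - (term | trunc)).reshape(-1, 1)` expression comes from the commit message; the (agents, hidden) state shape and function name are assumptions:

```python
import numpy as np

def mask_lstm_state(state, terminals, truncations):
    # Either signal means "episode over, env reset", so OR them.
    done = np.asarray(terminals, dtype=bool) | np.asarray(truncations, dtype=bool)
    keep = (1 - done.astype(state.dtype)).reshape(-1, 1)
    # Zero hidden state only for finished agents; others keep recurrent memory.
    return state * keep

state = np.ones((4, 8))
out = mask_lstm_state(state, [1, 0, 1, 0], [0, 0, 0, 1])
# rows 0, 2, 3 are zeroed; row 1 carries its memory into the next forward
```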
Each test now opens with a 1-3 line docstring stating what it asserts
and why — useful for future readers scanning the file or for failure
output to make the intent obvious without re-reading the body.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The C-side replay branch caps env_count at num_eval_scenarios when
eval_mode is on (binding.c:1653). HumanReplay/BehaviorClass were never
setting it, so it fell back to drive.py's default of 16. Behavior
classes with >16 bins (or eval.num_scenarios > 16 with sampling) were
silently only loading 16 of N envs as bins, then deterministically
re-running those same 16 until len(infos) >= num_maps padded the count.
Net effect: 34 of 50 sampled bins wasted CPU on tmp symlinks and never
got evaluated.

  - human_replay.env_overrides: always (re)set num_agents = num_maps =
    num_eval_scenarios = bin count. Drops the prior `if num_agents not
    in env` guard that was silently bypassed by the train config's
    num_agents=1024 default.

  - human_replay._should_stop: now stops at scenario_length+1 (= one
    full cycle). Every env truncates simultaneously at scenario_length,
    vec_log emits one aggregate dict averaging all SDCs at that step,
    and we have the full per-bin metric. Continuing past that just
    deterministically re-runs the same bins (resample_frequency is
    effectively never in replay), so the prior `len(infos) >= num_maps`
    threshold was 49× wasteful for 50-bin classes.

  - behavior_class.env_overrides: sets num_eval_scenarios alongside
    num_agents/num_maps in both the sampling and all-bins branches.

  - regression test: builds a 50-bin map_dir and asserts both branches
    of behavior_class.env_overrides set num_eval_scenarios = N.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ip rest

Two fixes:

1. Subprocess eval now receives --global-step and --epoch.

   Previously _run_subprocess wrote a .cfg.json next to --out but never
   passed it (or any other reference to it) on the cmd line, and the
   eval CLI had no flags for either field. Result: every subprocess
   render wrote `{scenario_id}_epoch0_step0.mp4`, and successive eval
   epochs would silently overwrite each other on disk. Metric data was
   correct (parent re-attached the right global_step on _log) but the
   wandb captions were uninterpretable.

   Now: _run_subprocess passes --global-step and --epoch on the cmd
   line; pufferl.py's main() pulls them out of argv (same pattern as
   --evaluator / --out) and threads them into eval(); the eval()
   function passes them to run_one_by_name; the manager threads them
   through to _build_eval_args where the render path picks them up
   via args["global_step"] / args["epoch"]. Drops the dead .cfg.json
   write.

2. maybe_run isolates per-evaluator failures.

   A bad bin dir or KeyError in one [eval.behaviors_X] section used to
   crash the for-loop and skip every subsequent evaluator at that
   epoch. Now each _run_one is wrapped: the exception is printed in
   full (header to stdout, traceback to stderr) and the loop moves on.
   Successful evaluators land in the returned results dict; the failed
   one is absent.

   Errors stay loud — no silent swallow. Just stops one bad evaluator
   from dragging the whole eval batch down with it.

  - test_maybe_run_isolates_evaluator_failures: 3 evaluators where the
    middle one raises; asserts the other two succeed and that the
    failure is visible in stdout+stderr.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rocess

Two new tests filled in (concerns 16 and 17 from the audit) — and the
render test caught an actual bug while writing it.

  - test_render_pass_per_evaluator_subdir_and_step_glob: mocks
    vector.make + sample_logits so the render loop runs without EGL.
    Verifies per-evaluator subdir, epoch+step in filename, and that
    stale mp4s from a prior epoch don't leak into result.frames via
    the glob.

  - test_run_subprocess_passes_step_epoch_and_reads_back_result:
    mocks subprocess.run; asserts the cmd line carries --global-step,
    --epoch, --evaluator, --out, --load-model-path. After the fake
    subprocess writes the canonical JSON, the parent reads it back
    into EvalResult.metrics/frames correctly.

Bug the render test caught: _render_view was globbing
`*{step_suffix}*.mp4` to filter prior-epoch mp4s, but that glob also
matched OTHER views' files from the same pass (e.g. the bev view
glob picked up sim_state files that already existed in the subdir).
all_paths.extend across views was then double-counting sim_state
mp4s. Cluster runs were silently logging duplicates to wandb's render
key.

Fix: glob by full view_suffix (step + view marker), with exact
.mp4 suffix anchoring. sim_state's `*_epoch7_step12.mp4` no longer
matches the bev variant `*_epoch7_step12_bev.mp4`, so each view's
return is just its own mp4s. The extend call accumulates cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
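The cross-view glob leak is easy to demonstrate. A minimal reproduction with hypothetical filenames matching the scheme above (the actual view suffixes in the PR may differ):

```python
from pathlib import Path
import tempfile

d = Path(tempfile.mkdtemp())
for name in ["sc1_epoch7_step12.mp4", "sc1_epoch7_step12_bev.mp4"]:
    (d / name).touch()

# Buggy pattern: loose `*{step_suffix}*.mp4` matches BOTH views' files.
loose = sorted(p.name for p in d.glob("*_epoch7_step12*.mp4"))
# Fixed pattern: anchor the full view suffix + exact .mp4 tail.
exact = sorted(p.name for p in d.glob("*_epoch7_step12_bev.mp4"))

assert len(loose) == 2
assert exact == ["sc1_epoch7_step12_bev.mp4"]
```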
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… counts

Three of the five aggregation issues from the recent audit. The
remaining two (failed-spawn dilution in gigaflow, per-bin vs per-SDC
weighting) need C-side changes — flagging as future work.

  - _aggregate_infos: switch from mean-of-means to weighted mean by
    per-emission `n`. Equivalent to current behavior when batches are
    uniform (today), correct when they aren't (termination_mode=1,
    mixed scenario_lengths, future per-bin overrides). Caveat in the
    docstring for ratio fields like avg_distance_per_infraction
    (numerator/denominator aren't observable from Python — weighted
    mean is the same approximation as mean-of-means there).

  - _aggregate_infos: rename `num_scenarios_completed` →
    `num_log_cycles` (it was the count of vec_log emissions, not the
    count of unique scenes), and add `num_agents_evaluated` (sum of n
    across emissions = true count of SDC-trajectories behind the
    metric). For behaviors with one cycle of N bins, that's now
    `num_log_cycles=1, num_agents_evaluated=N` — much less misleading
    than the prior `num_scenarios_completed=1`.

  - BehaviorClassEvaluator: seed random.sample with the section name
    so every epoch picks the same bins. Cross-epoch metric jitter
    from bin selection used to be indistinguishable from policy
    improvement; now jitter comes only from policy changes. Also
    sorted the bin listing for determinism before the sample.

  - tests: weighted-mean correctness with two emissions of different
    n; empty-infos handling; reproducible sampling across two
    evaluator instances with the same name; differing sections sample
    different bins.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
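The weighted-mean aggregation described above can be sketched as follows. The per-emission count field `n` and the output key names follow the commit message; everything else is an assumption:

```python
# Weighted mean by per-emission n: equivalent to mean-of-means when all
# emissions have equal n, correct when they don't.
def aggregate(infos):
    total_n = sum(i["n"] for i in infos)
    if total_n == 0:
        return {}
    keys = {k for i in infos for k in i if k != "n"}
    out = {k: sum(i[k] * i["n"] for i in infos if k in i) / total_n for k in keys}
    out["num_log_cycles"] = len(infos)          # count of vec_log emissions
    out["num_agents_evaluated"] = total_n       # true SDC-trajectory count
    return out

infos = [{"n": 10, "collision_rate": 0.1}, {"n": 40, "collision_rate": 0.2}]
print(aggregate(infos)["collision_rate"])  # 0.18 = (10*0.1 + 40*0.2) / 50
```

Note the same caveat as the commit message: for ratio fields whose numerator and denominator are not separately observable, this remains an approximation.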