Fix: benchmark_rounds.sh parses device log instead of RunTiming#828
Conversation
PR hw-native-sys#790 rewrote the script to read Worker.run's RunTiming (host/device wall) and print Avg Host / Avg Device lines. That broke the benchmark workflow: - The /benchmark skill (SKILL.md Step 6) parses `Trimmed Avg:` and `Orch Trimmed Avg:` and reports Elapsed / Sched / Orch columns — lines the RunTiming-based script no longer emitted, so /benchmark reported no timing. - Host wall (Python dispatch) and the coarse device wall are not the per-round Elapsed / Sched / Orch latencies the benchmark targets. Restore the device-log parser: scrape the runtime's `Thread N: orch_start=… orch_end=…` / `sched_start=… sched_end=…` lines (emitted under compile-time PTO2_PROFILING, independent of the enable_l2_swimlane flag that --rounds > 1 forces off) and report the three per-round times the skill consumes. RunTiming itself is untouched.
There was a problem hiding this comment.
Code Review
This pull request refactors the benchmark_rounds.sh script to parse per-round latency from device logs, adding a new awk-based timing parser and support for platform-specific clock frequencies. The review feedback identifies a logic error in array indexing that could lead to incorrect averages and suggests making the trimming logic dynamic to accommodate smaller round counts. Further improvements were suggested regarding the handling of missing timing metrics and the removal of redundant legacy fallback logic.
| function flush_round() { | ||
| if (round >= 0 && max_end > 0 && min_start > 0) { | ||
| results[round] = (max_end - min_start) / freq | ||
| if (max_sched_end > 0 && min_sched_start > 0) | ||
| sched_results[round] = (max_sched_end - min_sched_start) / freq | ||
| if (max_orch_end > 0 && min_orch_start > 0) | ||
| orch_results[round] = (max_orch_end - min_orch_start) / freq | ||
| count++ | ||
| } | ||
| } |
There was a problem hiding this comment.
There is a logic error in how results, sched_results, and orch_results are indexed. They are currently indexed by round, but the subsequent loops (lines 269, 297, etc.) iterate from 0 to count. If any round is skipped by the if guard in flush_round (e.g., due to missing log lines), count will be less than round, and the arrays will have gaps. This causes the loops to access undefined elements (treated as 0 in numeric context), leading to incorrect averages and sorting. These arrays should be indexed by count instead.
| function flush_round() { | |
| if (round >= 0 && max_end > 0 && min_start > 0) { | |
| results[round] = (max_end - min_start) / freq | |
| if (max_sched_end > 0 && min_sched_start > 0) | |
| sched_results[round] = (max_sched_end - min_sched_start) / freq | |
| if (max_orch_end > 0 && min_orch_start > 0) | |
| orch_results[round] = (max_orch_end - min_orch_start) / freq | |
| count++ | |
| } | |
| } | |
| function flush_round() { | |
| if (round >= 0 && max_end > 0 && min_start > 0) { | |
| results[count] = (max_end - min_start) / freq | |
| if (max_sched_end > 0 && min_sched_start > 0) | |
| sched_results[count] = (max_sched_end - min_sched_start) / freq | |
| if (max_orch_end > 0 && min_orch_start > 0) | |
| orch_results[count] = (max_orch_end - min_orch_start) / freq | |
| count++ | |
| } | |
| } |
| for (i = 0; i < count; i++) ss[i] = sched_results[i] | ||
| for (i = 1; i < count; i++) { | ||
| k = ss[i]; j = i - 1 | ||
| while (j >= 0 && ss[j] > k) { ss[j+1] = ss[j]; j-- } | ||
| ss[j+1] = k | ||
| } |
There was a problem hiding this comment.
Even with the indexing fix, the Sched Trimmed Avg (and Orch Trimmed Avg below) calculation is fragile. If a round has valid 'Elapsed' data but is missing 'Sched' data, sched_results[i] will be undefined and treated as 0 in the ss array. This will skew the trimmed average significantly. It would be better to maintain a separate count of valid samples for each metric and only average/sort those samples, or ensure that only rounds with complete data are processed.
| if [[ -z "$EXAMPLE_DIR" ]]; then | ||
| for dir in "${EXAMPLES_DIRS[@]}"; do | ||
| candidate="$dir/$example" | ||
| if [[ -f "$candidate/golden.py" && -d "$candidate/kernels" ]]; then | ||
| EXAMPLE_DIR="$candidate" | ||
| break | ||
| fi | ||
| done | ||
| fi |
There was a problem hiding this comment.
The legacy test fallback logic added here is ineffective and contradicts the repository rule that each example directory should contain exactly one test_*.py file. Since run_bench already skips directories without a test_*.py file, this fallback block for golden.py is dead code and should be removed to align with the repository standard.
References
- Each example directory is expected to contain exactly one
test_*.pyfile. Scripts processing these directories, such asbenchmark_rounds.sh, should be designed to find and use only that single test file.
| TRIM = 10 | ||
| if (count > 2 * TRIM) { |
There was a problem hiding this comment.
The TRIM value is hardcoded to 10, and the trimmed average is only calculated if count > 20. If a user overrides the number of rounds to a small value (e.g., ./benchmark_rounds.sh -n 20), the script will not emit the Trimmed Avg lines. According to the PR description, these lines are required by the benchmark skill. Consider making TRIM dynamic (e.g., 10% of count) to ensure these lines are present even for smaller round counts.
| TRIM = 10 | |
| if (count > 2 * TRIM) { | |
| TRIM = int(count * 0.1) | |
| if (TRIM > 0 && count > 2 * TRIM) { |
Restore the Host (RunTiming.host_wall) and Device (RunTiming.device_wall) per-round columns that hw-native-sys#828 dropped, alongside the device-log Total/Sched/Orch. Rename Elapsed -> Total since "elapsed" is ambiguous next to two wall clocks; "Total" describes what the column actually is: the device-log span across all sched/orch events. - parse_timing takes (framework_stdout, device_log) and merges per-round Host/Device from _log_round_timings with Total/Sched/Orch from sched_*/orch_* device-log lines into a 5-column table. - Framework header matched anywhere on the line -- the test runner prints the case name with end="" so the header gets glued onto the same line as "Case1 ...". - Avg / Trimmed Avg lines expanded to all five metrics; /benchmark skill report updated to show all five columns (Single Mode) and five metric sub-rows per example (Compare Mode).
#832) Restore the Host (RunTiming.host_wall) and Device (RunTiming.device_wall) per-round columns that #828 dropped, alongside the device-log Total/Sched/Orch. Rename Elapsed -> Total since "elapsed" is ambiguous next to two wall clocks; "Total" describes what the column actually is: the device-log span across all sched/orch events. - parse_timing takes (framework_stdout, device_log) and merges per-round Host/Device from _log_round_timings with Total/Sched/Orch from sched_*/orch_* device-log lines into a 5-column table. - Framework header matched anywhere on the line -- the test runner prints the case name with end="" so the header gets glued onto the same line as "Case1 ...". - Avg / Trimmed Avg lines expanded to all five metrics; /benchmark skill report updated to show all five columns (Single Mode) and five metric sub-rows per example (Compare Mode). Co-authored-by: Chao Wang <26245345+ChaoWao@users.noreply.github.com>
Summary
tools/benchmark_rounds.shstopped reporting timing after #790 rewrote itto read
Worker.run'sRunTiming(host/device wall) and printAvg Host:/Avg Device:lines.This broke the
/benchmarkworkflow:SKILL.mdStep 6) parsesTrimmed Avg:andOrch Trimmed Avg:and reports Elapsed / Sched / Orch columns —lines the RunTiming-based script no longer emitted, so
/benchmarkreported no timing.
the per-round Elapsed / Sched / Orch latencies the benchmark targets
(the three are genuinely different numbers).
Fix
Restore the device-log parser. It scrapes the runtime's
Thread N: orch_start=… orch_end=…andsched_start=… sched_end=…linesand computes per-round Elapsed / Sched / Orch in µs, emitting the
Avg:/Sched Avg:/Orch Avg:/Trimmed Avg:/Orch Trimmed Avg:lines the skill consumes.
Those log lines are gated only by compile-time
PTO2_PROFILING(defaulton), not by the runtime
enable_l2_swimlaneflag that--rounds > 1forces off — so they remain available during benchmark runs.
RunTiming(Worker.run return,_log_round_timings, the nanobindbinding) is left untouched; the benchmark script simply no longer relies
on it. The example/case map is unchanged.
Testing
bash -n tools/benchmark_rounds.shpasses; emitted summary linesmatch
SKILL.mdStep 6.exercised on the local macOS sim).