Fix: benchmark_rounds.sh parses device log instead of RunTiming by ChaoWao · Pull Request #828 · hw-native-sys/simpler

ChaoWao · 2026-05-20T09:28:58Z

Summary

tools/benchmark_rounds.sh stopped reporting timing after #790 rewrote it
to read Worker.run's RunTiming (host/device wall) and print
Avg Host: / Avg Device: lines.

This broke the /benchmark workflow:

The benchmark skill (SKILL.md Step 6) parses Trimmed Avg: and
Orch Trimmed Avg: and reports Elapsed / Sched / Orch columns —
lines the RunTiming-based script no longer emitted, so /benchmark
reported no timing.
Host wall (Python dispatch overhead) and the coarse device wall are not
the per-round Elapsed / Sched / Orch latencies the benchmark targets
(the three are genuinely different numbers).

Fix

Restore the device-log parser. It scrapes the runtime's
Thread N: orch_start=… orch_end=… and sched_start=… sched_end=… lines
and computes per-round Elapsed / Sched / Orch in µs, emitting the
Avg: / Sched Avg: / Orch Avg: / Trimmed Avg: / Orch Trimmed Avg:
lines the skill consumes.

Those log lines are gated only by compile-time PTO2_PROFILING (default
on), not by the runtime enable_l2_swimlane flag that --rounds > 1
forces off — so they remain available during benchmark runs.

RunTiming (Worker.run return, _log_round_timings, the nanobind
binding) is left untouched; the benchmark script simply no longer relies
on it. The example/case map is unchanged.

Testing

bash -n tools/benchmark_rounds.sh passes; emitted summary lines
match SKILL.md Step 6.
Hardware run on a2a3 (log-parsing path is hardware-only; cannot be
exercised on the local macOS sim).

PR hw-native-sys#790 rewrote the script to read Worker.run's RunTiming (host/device wall) and print Avg Host / Avg Device lines. That broke the benchmark workflow: - The /benchmark skill (SKILL.md Step 6) parses `Trimmed Avg:` and `Orch Trimmed Avg:` and reports Elapsed / Sched / Orch columns — lines the RunTiming-based script no longer emitted, so /benchmark reported no timing. - Host wall (Python dispatch) and the coarse device wall are not the per-round Elapsed / Sched / Orch latencies the benchmark targets. Restore the device-log parser: scrape the runtime's `Thread N: orch_start=… orch_end=…` / `sched_start=… sched_end=…` lines (emitted under compile-time PTO2_PROFILING, independent of the enable_l2_swimlane flag that --rounds > 1 forces off) and report the three per-round times the skill consumes. RunTiming itself is untouched.

gemini-code-assist

Code Review

This pull request refactors the benchmark_rounds.sh script to parse per-round latency from device logs, adding a new awk-based timing parser and support for platform-specific clock frequencies. The review feedback identifies a logic error in array indexing that could lead to incorrect averages and suggests making the trimming logic dynamic to accommodate smaller round counts. Further improvements were suggested regarding the handling of missing timing metrics and the removal of redundant legacy fallback logic.

gemini-code-assist · 2026-05-20T09:32:34Z

+    function flush_round() {
+        if (round >= 0 && max_end > 0 && min_start > 0) {
+            results[round] = (max_end - min_start) / freq
+            if (max_sched_end > 0 && min_sched_start > 0)
+                sched_results[round] = (max_sched_end - min_sched_start) / freq
+            if (max_orch_end > 0 && min_orch_start > 0)
+                orch_results[round] = (max_orch_end - min_orch_start) / freq
+            count++
+        }
+    }


There is a logic error in how results, sched_results, and orch_results are indexed. They are currently indexed by round, but the subsequent loops (lines 269, 297, etc.) iterate from 0 to count. If any round is skipped by the if guard in flush_round (e.g., due to missing log lines), count will be less than round, and the arrays will have gaps. This causes the loops to access undefined elements (treated as 0 in numeric context), leading to incorrect averages and sorting. These arrays should be indexed by count instead.

Suggested change

function flush_round() {

if (round >= 0 && max_end > 0 && min_start > 0) {

results[round] = (max_end - min_start) / freq

if (max_sched_end > 0 && min_sched_start > 0)

sched_results[round] = (max_sched_end - min_sched_start) / freq

if (max_orch_end > 0 && min_orch_start > 0)

orch_results[round] = (max_orch_end - min_orch_start) / freq

count++

}

}

function flush_round() {

if (round >= 0 && max_end > 0 && min_start > 0) {

results[count] = (max_end - min_start) / freq

if (max_sched_end > 0 && min_sched_start > 0)

sched_results[count] = (max_sched_end - min_sched_start) / freq

if (max_orch_end > 0 && min_orch_start > 0)

orch_results[count] = (max_orch_end - min_orch_start) / freq

count++

}

}

gemini-code-assist · 2026-05-20T09:32:34Z

+                for (i = 0; i < count; i++) ss[i] = sched_results[i]
+                for (i = 1; i < count; i++) {
+                    k = ss[i]; j = i - 1
+                    while (j >= 0 && ss[j] > k) { ss[j+1] = ss[j]; j-- }
+                    ss[j+1] = k
+                }


Even with the indexing fix, the Sched Trimmed Avg (and Orch Trimmed Avg below) calculation is fragile. If a round has valid 'Elapsed' data but is missing 'Sched' data, sched_results[i] will be undefined and treated as 0 in the ss array. This will skew the trimmed average significantly. It would be better to maintain a separate count of valid samples for each metric and only average/sort those samples, or ensure that only rounds with complete data are processed.

gemini-code-assist · 2026-05-20T09:32:34Z

+    if [[ -z "$EXAMPLE_DIR" ]]; then
+        for dir in "${EXAMPLES_DIRS[@]}"; do
+            candidate="$dir/$example"
+            if [[ -f "$candidate/golden.py" && -d "$candidate/kernels" ]]; then
+                EXAMPLE_DIR="$candidate"
+                break
+            fi
+        done
+    fi


The legacy test fallback logic added here is ineffective and contradicts the repository rule that each example directory should contain exactly one test_*.py file. Since run_bench already skips directories without a test_*.py file, this fallback block for golden.py is dead code and should be removed to align with the repository standard.

References

Each example directory is expected to contain exactly one test_*.py file. Scripts processing these directories, such as benchmark_rounds.sh, should be designed to find and use only that single test file.

gemini-code-assist · 2026-05-20T09:32:34Z

+        TRIM = 10
+        if (count > 2 * TRIM) {


The TRIM value is hardcoded to 10, and the trimmed average is only calculated if count > 20. If a user overrides the number of rounds to a small value (e.g., ./benchmark_rounds.sh -n 20), the script will not emit the Trimmed Avg lines. According to the PR description, these lines are required by the benchmark skill. Consider making TRIM dynamic (e.g., 10% of count) to ensure these lines are present even for smaller round counts.

Suggested change

TRIM = 10

if (count > 2 * TRIM) {

TRIM = int(count * 0.1)

if (TRIM > 0 && count > 2 * TRIM) {

Restore the Host (RunTiming.host_wall) and Device (RunTiming.device_wall) per-round columns that hw-native-sys#828 dropped, alongside the device-log Total/Sched/Orch. Rename Elapsed -> Total since "elapsed" is ambiguous next to two wall clocks; "Total" describes what the column actually is: the device-log span across all sched/orch events. - parse_timing takes (framework_stdout, device_log) and merges per-round Host/Device from _log_round_timings with Total/Sched/Orch from sched_*/orch_* device-log lines into a 5-column table. - Framework header matched anywhere on the line -- the test runner prints the case name with end="" so the header gets glued onto the same line as "Case1 ...". - Avg / Trimmed Avg lines expanded to all five metrics; /benchmark skill report updated to show all five columns (Single Mode) and five metric sub-rows per example (Compare Mode).

#832) Restore the Host (RunTiming.host_wall) and Device (RunTiming.device_wall) per-round columns that #828 dropped, alongside the device-log Total/Sched/Orch. Rename Elapsed -> Total since "elapsed" is ambiguous next to two wall clocks; "Total" describes what the column actually is: the device-log span across all sched/orch events. - parse_timing takes (framework_stdout, device_log) and merges per-round Host/Device from _log_round_timings with Total/Sched/Orch from sched_*/orch_* device-log lines into a 5-column table. - Framework header matched anywhere on the line -- the test runner prints the case name with end="" so the header gets glued onto the same line as "Case1 ...". - Avg / Trimmed Avg lines expanded to all five metrics; /benchmark skill report updated to show all five columns (Single Mode) and five metric sub-rows per example (Compare Mode). Co-authored-by: Chao Wang <26245345+ChaoWao@users.noreply.github.com>

gemini-code-assist Bot reviewed May 20, 2026

View reviewed changes

ChaoWao merged commit d423e87 into hw-native-sys:main May 20, 2026
15 checks passed

ChaoWao deleted the fix-benchmark-rounds-log-parsing branch May 20, 2026 09:45

hw-native-sys-bot mentioned this pull request May 21, 2026

Update: benchmark_rounds.sh adds Host/Device, renames Elapsed to Total #832

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix: benchmark_rounds.sh parses device log instead of RunTiming#828

Fix: benchmark_rounds.sh parses device log instead of RunTiming#828
ChaoWao merged 1 commit into
hw-native-sys:mainfrom
ChaoWao:fix-benchmark-rounds-log-parsing

ChaoWao commented May 20, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 20, 2026

Uh oh!

gemini-code-assist Bot May 20, 2026

Uh oh!

gemini-code-assist Bot May 20, 2026

Uh oh!

gemini-code-assist Bot May 20, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ChaoWao commented May 20, 2026

Summary

Fix

Testing

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant