Add: parallel SPMD block dispatch across schedulers by zhusy54 · Pull Request #827 · hw-native-sys/simpler

zhusy54 · 2026-05-20T09:28:24Z

Summary

- Collapse the serial SPMD dispatch staircase: dispatch_shape claims a contiguous block range, pushes the slot back to the ready queue early, then performs the expensive dispatch_block calls so other schedulers can concurrently claim the remaining blocks of the same SPMD task.
- Thread block_idx explicitly through build_payload / dispatch_subtask_to_core / dispatch_mix_block_to_cluster / dispatch_block to avoid the race once the slot is pushed back early.
- Apply the same optimization to both the a2a3 and a5 schedulers (a5 adapts to its Runtime* first argument and its s_block_idx/s_block_num fields).
- Add spmd_parallel_dispatch_demo (a2a3) for L2-swimlane comparison.
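The claim-then-dispatch ordering described in the summary can be sketched roughly as follows. This is a minimal illustration, not the runtime's code: `SpmdSlot`, `BlockRange`, `claim_blocks`, and `simulate_dispatch` are hypothetical names, an atomic compare-and-swap stands in for popping and re-pushing the slot on the ready queue, and the schedulers are simulated by taking turns on one thread instead of running concurrently.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical SPMD task slot; field names are illustrative only.
struct SpmdSlot {
    uint32_t block_num = 0;                   // total blocks in the task
    std::atomic<uint32_t> next_block_idx{0};  // first block not yet claimed
};

// Half-open range [begin, end) of claimed block indices.
struct BlockRange {
    uint32_t begin;
    uint32_t end;
};

// Cheap step: atomically claim up to max_blocks contiguous blocks.
// In this sketch the CAS models "advance next_block_idx and make the
// slot visible to other schedulers again".
BlockRange claim_blocks(SpmdSlot& slot, uint32_t max_blocks) {
    uint32_t begin = slot.next_block_idx.load();
    for (;;) {
        assert(begin <= slot.block_num);
        uint32_t end = begin + std::min(max_blocks, slot.block_num - begin);
        if (slot.next_block_idx.compare_exchange_weak(begin, end)) {
            return BlockRange{begin, end};
        }
        // CAS failed: begin now holds the cursor another claimer published.
    }
}

// Simulate num_sched schedulers taking turns on one task.
// Returns, for each block, the id of the scheduler that dispatched it.
std::vector<int> simulate_dispatch(uint32_t block_num, uint32_t max_blocks,
                                   int num_sched) {
    SpmdSlot slot;
    slot.block_num = block_num;
    std::vector<int> owner(block_num, -1);
    bool progressed = true;
    while (progressed) {
        progressed = false;
        for (int s = 0; s < num_sched; ++s) {
            BlockRange r = claim_blocks(slot, max_blocks);  // claim first
            for (uint32_t b = r.begin; b < r.end; ++b) {
                owner[b] = s;  // stand-in for the expensive dispatch_block call
                progressed = true;
            }
        }
    }
    return owner;
}
```

With 24 blocks, a claim size of 8, and 3 simulated schedulers, `simulate_dispatch(24, 8, 3)` assigns blocks 0-7, 8-15, and 16-23 to schedulers 0, 1, and 2 respectively. How the real scheduler sizes each claim is not stated on this page, so the fixed claim size here is an assumption.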

Testing

- a2a3sim numerical PASS; a2a3 hardware swimlane PASS: each task's 24 blocks split 8/8/8 across the 3 schedulers with overlapping first-dispatch times.
- a5sim numerical PASS: spmd_multiblock_mix, spmd_sync_start_stress, spmd_starvation.

Swimlane comparison

Before: [swimlane image]

After: [swimlane image]

Related Issues

Closes #818

gemini-code-assist

Code Review

This pull request optimizes SPMD parallel dispatch in the a2a3 and a5 runtimes by allowing scheduler threads to claim block ranges and return tasks to the ready queue immediately, enabling concurrent execution across threads. It includes a new QK matmul kernel, orchestration logic, and a demo test case. Feedback recommends using more robust types for size calculations to avoid potential truncation and simplifying std::min calls by removing redundant template arguments.
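The individual review comments are not reproduced on this page, so the following is only a generic illustration of the two kinds of feedback mentioned: letting `std::min` deduce its type from operands that already match, and widening operands before a size multiplication. The helper names `blocks_to_claim`, `buffer_bytes`, and `buffer_bytes_narrow` are hypothetical and do not come from the PR.

```cpp
#include <algorithm>
#include <cstdint>

// Both operands already share a type, so std::min deduces it and an
// explicit std::min<uint32_t>(...) would be redundant.
inline uint32_t blocks_to_claim(uint32_t idle_cores, uint32_t remaining_blocks) {
    return std::min(idle_cores, remaining_blocks);
}

// Widen to 64 bits before multiplying so large shapes do not wrap.
inline uint64_t buffer_bytes(uint32_t rows, uint32_t cols, uint32_t elem_size) {
    return static_cast<uint64_t>(rows) * cols * elem_size;
}

// For contrast only: the same product computed in 32 bits wraps modulo 2^32.
inline uint32_t buffer_bytes_narrow(uint32_t rows, uint32_t cols, uint32_t elem_size) {
    return rows * cols * elem_size;
}
```

For a 65536 x 65536 buffer of 2-byte elements, `buffer_bytes` returns 2^33 while `buffer_bytes_narrow` wraps to 0, which is the kind of truncation the wider type avoids.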

Previously dispatch_shape() drained all idle cores for one SPMD task on a single scheduler thread before re-queuing the slot, so peer schedulers spun idle while one thread serialized every block dispatch.

- Claim a contiguous block range, advance next_block_idx, and push the slot back to the ready queue immediately, then perform the expensive per-block dispatches afterward so other schedulers can concurrently claim and dispatch the remaining blocks of the same task.
- Thread an explicit block_idx through build_payload / dispatch_block / dispatch_mix_block_to_cluster / dispatch_subtask_to_core, since next_block_idx may already be advanced by another scheduler once the slot is re-queued.
- Apply symmetrically to a2a3 and a5 runtimes.
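To see why block_idx has to be passed explicitly once the slot is re-queued early, the sketch below replays one interleaving on a single thread. It rests on assumptions: `Slot`, `Payload`, `claim`, `build_payload_implicit`, and `build_payload_explicit` are hypothetical stand-ins, and the real build_payload signature is not shown on this page.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical task slot and per-block payload; names are illustrative.
struct Slot {
    uint32_t block_num = 0;
    uint32_t next_block_idx = 0;  // shared cursor, advanced by every claim
};

struct Payload {
    uint32_t block_idx = 0;
    uint32_t block_num = 0;
};

// Claim n blocks and return the first claimed index (single-threaded replay,
// so no atomics are needed in this sketch).
uint32_t claim(Slot& slot, uint32_t n) {
    assert(slot.next_block_idx + n <= slot.block_num);
    uint32_t begin = slot.next_block_idx;
    slot.next_block_idx += n;
    return begin;
}

// For contrast only: reads the shared cursor at dispatch time.
Payload build_payload_implicit(const Slot& slot) {
    Payload p;
    p.block_idx = slot.next_block_idx;
    p.block_num = slot.block_num;
    return p;
}

// Takes the claimed index as an argument, so later claims cannot affect it.
Payload build_payload_explicit(const Slot& slot, uint32_t block_idx) {
    Payload p;
    p.block_idx = block_idx;
    p.block_num = slot.block_num;
    return p;
}

struct Replay {
    std::vector<uint32_t> implicit_idx;
    std::vector<uint32_t> explicit_idx;
};

// Scheduler A claims [0, 2); scheduler B claims [2, 4) before A dispatches.
Replay replay_interleaving() {
    Slot slot;
    slot.block_num = 4;
    uint32_t a_begin = claim(slot, 2);  // A's claim
    claim(slot, 2);                     // B's claim moves the cursor to 4
    Replay r;
    for (uint32_t i = 0; i < 2; ++i) {
        r.implicit_idx.push_back(build_payload_implicit(slot).block_idx);
        r.explicit_idx.push_back(build_payload_explicit(slot, a_begin + i).block_idx);
    }
    return r;
}
```

In this replay the implicit variant gives scheduler A the index 4 for both of its blocks, because it observes the cursor after B's claim, while the explicit variant yields the intended indices 0 and 1.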

gemini-code-assist Bot reviewed May 20, 2026

zhusy54 force-pushed the opt-parallel-spmd-dispatch branch from f4cca1f to f732288 Compare May 20, 2026 09:42

zhusy54 force-pushed the opt-parallel-spmd-dispatch branch from f732288 to ca84559 Compare May 20, 2026 09:59

ChaoWao approved these changes May 21, 2026

ChaoWao merged commit 827fc27 into hw-native-sys:main May 21, 2026
15 checks passed

ChaoWao merged 1 commit into hw-native-sys:main from zhusy54:opt-parallel-spmd-dispatch
