Add: parallel SPMD block dispatch across schedulers#827

Merged
ChaoWao merged 1 commit into
hw-native-sys:mainfrom
zhusy54:opt-parallel-spmd-dispatch
May 21, 2026

@zhusy54 (Contributor) commented May 20, 2026

Summary

  • Collapse the serial SPMD dispatch staircase: dispatch_shape now claims
    a contiguous block range, pushes the slot back to the ready queue
    early, and only then performs the expensive dispatch_block calls, so
    that other schedulers can concurrently claim and dispatch the
    remaining blocks of the same SPMD task
  • Thread block_idx explicitly through build_payload /
    dispatch_subtask_to_core / dispatch_mix_block_to_cluster /
    dispatch_block to avoid the race once the slot is pushed back early
  • Apply the same optimization to both the a2a3 and a5 schedulers (a5
    adapts to its Runtime* first argument and its s_block_idx/s_block_num
    fields)
  • Add spmd_parallel_dispatch_demo (a2a3) for L2-swimlane comparison
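
The claim-then-requeue ordering described above can be sketched roughly as follows. This is a minimal single-threaded model, not the runtime's code: `TaskSlot`, `Claim`, `claim_and_requeue`, and the `std::deque` ready queue are hypothetical names chosen for illustration, and the real ready queue is presumably a concurrent structure.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <deque>

// Hypothetical SPMD task slot; field names are illustrative only.
struct TaskSlot {
    std::uint32_t next_block_idx = 0;  // first block not yet claimed
    std::uint32_t block_num = 0;       // total number of blocks in the task
};

// A claimed half-open block range [begin, end) of one slot.
struct Claim {
    TaskSlot* slot;       // nullptr when nothing could be claimed
    std::uint32_t begin;  // inclusive
    std::uint32_t end;    // exclusive
};

// Pop a slot, claim up to idle_cores contiguous blocks, and push the slot
// back BEFORE any dispatch work is done. The caller dispatches the returned
// range afterwards, while peers are free to claim the rest of the task.
inline Claim claim_and_requeue(std::deque<TaskSlot*>& ready_queue,
                               std::uint32_t idle_cores) {
    if (ready_queue.empty() || idle_cores == 0) {
        return Claim{nullptr, 0, 0};
    }
    TaskSlot* slot = ready_queue.front();
    ready_queue.pop_front();
    assert(slot->next_block_idx <= slot->block_num);

    const std::uint32_t begin = slot->next_block_idx;
    const std::uint32_t count =
        std::min(idle_cores, slot->block_num - begin);
    slot->next_block_idx = begin + count;

    // Re-queue early only if blocks remain for other schedulers.
    if (slot->next_block_idx < slot->block_num) {
        ready_queue.push_back(slot);
    }
    return Claim{slot, begin, begin + count};
}
```

With a 24-block task and 8 idle cores per scheduler, three successive calls in this model return the ranges [0, 8), [8, 16), and [16, 24), and the slot leaves the queue after the third claim.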

Testing

  • a2a3sim numerical PASS; a2a3 hardware swimlane PASS — each task's 24
    blocks split 8/8/8 across the 3 schedulers with overlapping
    first-dispatch times
  • a5sim numerical PASS: spmd_multiblock_mix, spmd_sync_start_stress,
    spmd_starvation

Swimlane comparison

Before:
[swimlane screenshot]

After:
[swimlane screenshot]

Related Issues

Closes #818

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request optimizes SPMD parallel dispatch in the a2a3 and a5 runtimes by allowing scheduler threads to claim block ranges and return tasks to the ready queue immediately, enabling concurrent execution across threads. It includes a new QK matmul kernel, orchestration logic, and a demo test case. Feedback recommends using more robust types for size calculations to avoid potential truncation and simplifying std::min calls by removing redundant template arguments.
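
As a generic illustration of the kind of issue that feedback points at (not the actual code under review, which is not shown here), the sketch below contrasts a `std::min` call with a forced narrow template argument against one whose operands already share a wide type. The function names are hypothetical.

```cpp
#include <algorithm>
#include <cstdint>

// Forcing std::min<std::uint32_t> converts BOTH operands to 32 bits before
// comparing, so a 64-bit 'remaining' value is silently reduced modulo 2^32.
inline std::uint32_t claim_count_truncating(std::uint64_t remaining,
                                            std::uint32_t idle) {
    return std::min<std::uint32_t>(remaining, idle);
}

// With both operands in the same wide type, template argument deduction
// succeeds without an explicit argument and nothing is truncated.
inline std::uint64_t claim_count(std::uint64_t remaining,
                                 std::uint64_t idle) {
    return std::min(remaining, idle);
}
```

For `remaining = 2^32 + 5` and `idle = 8`, the first version compares 5 with 8 and returns 5, while the second returns 8.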

Comment thread src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp Outdated
@zhusy54 zhusy54 force-pushed the opt-parallel-spmd-dispatch branch from f4cca1f to f732288 Compare May 20, 2026 09:42
Previously dispatch_shape() drained all idle cores for one SPMD task on
a single scheduler thread before re-queuing the slot, so peer schedulers
spun idle while one thread serialized every block dispatch.

- Claim a contiguous block range, advance next_block_idx, and push the
  slot back to the ready queue immediately, then perform the expensive
  per-block dispatches afterward so other schedulers can concurrently
  claim and dispatch the remaining blocks of the same task.
- Thread an explicit block_idx through build_payload / dispatch_block /
  dispatch_mix_block_to_cluster / dispatch_subtask_to_core, since
  next_block_idx may already be advanced by another scheduler once the
  slot is re-queued.
- Apply symmetrically to a2a3 and a5 runtimes.
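
A rough sketch of why the explicit block_idx matters, under assumed types: `TaskSlot`, `Payload`, `build_payload_from_slot`, and `dispatch_claimed_range` are hypothetical, and the `build_payload` signature here is illustrative rather than the runtime's actual one.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical slot and payload types for illustration.
struct TaskSlot {
    std::uint32_t next_block_idx = 0;  // shared cursor, advanced by any scheduler
    std::uint32_t block_num = 0;
};

struct Payload {
    std::uint32_t block_idx;
    std::uint32_t block_num;
};

// Racy once the slot is re-queued early: the cursor may already have been
// advanced by a peer scheduler by the time the payload is built.
inline Payload build_payload_from_slot(const TaskSlot& slot) {
    return Payload{slot.next_block_idx, slot.block_num};
}

// Safe: the caller passes the index it claimed while it owned the slot.
inline Payload build_payload(const TaskSlot& slot, std::uint32_t block_idx) {
    return Payload{block_idx, slot.block_num};
}

// Dispatch a previously claimed range [begin, end) using explicit indices.
inline std::vector<Payload> dispatch_claimed_range(const TaskSlot& slot,
                                                   std::uint32_t begin,
                                                   std::uint32_t end) {
    assert(begin <= end && end <= slot.block_num);
    std::vector<Payload> payloads;
    for (std::uint32_t b = begin; b < end; ++b) {
        payloads.push_back(build_payload(slot, b));
    }
    return payloads;
}
```

If a scheduler claims [0, 8) and a peer then advances the cursor to 16, `dispatch_claimed_range(slot, 0, 8)` still yields block indices 0 through 7, whereas `build_payload_from_slot` would report 16.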
@zhusy54 zhusy54 force-pushed the opt-parallel-spmd-dispatch branch from f732288 to ca84559 Compare May 20, 2026 09:59
@ChaoWao ChaoWao merged commit 827fc27 into hw-native-sys:main May 21, 2026
15 checks passed
Development

Successfully merging this pull request may close these issues.

[Performance] SPMD blocks start sequentially instead of simultaneously despite single dispatch

2 participants