Add: parallel SPMD block dispatch across schedulers #827
Merged
Code Review
This pull request optimizes SPMD parallel dispatch in the a2a3 and a5 runtimes by allowing scheduler threads to claim block ranges and return tasks to the ready queue immediately, enabling concurrent execution across threads. It includes a new QK matmul kernel, orchestration logic, and a demo test case. Feedback recommends using more robust types for size calculations to avoid potential truncation and simplifying std::min calls by removing redundant template arguments.
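The two review points can be illustrated with a small sketch. The helper names and parameters below (blocks_to_claim, payload_bytes, idle_cores, remaining_blocks) are hypothetical and do not come from the PR; they only show the general idea of letting std::min deduce its type and of doing size arithmetic in a wide type.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Hypothetical helper, not from the PR: how many blocks one scheduler may claim.
// Both operands share one type, so std::min deduces it and no explicit
// template argument (e.g. std::min<int>(...)) or narrowing cast is needed.
std::size_t blocks_to_claim(std::size_t idle_cores, std::size_t remaining_blocks) {
    return std::min(idle_cores, remaining_blocks);
}

// Hypothetical helper, not from the PR: the product is computed in a 64-bit
// unsigned type, so it cannot be truncated the way a 32-bit int product could.
std::uint64_t payload_bytes(std::uint64_t block_count, std::uint64_t bytes_per_block) {
    return block_count * bytes_per_block;
}
```

Keeping both arguments of std::min in the same type is what makes the explicit template argument redundant; mixed types would otherwise fail to deduce.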
Force-pushed from f4cca1f to f732288.
Previously dispatch_shape() drained all idle cores for one SPMD task on a single scheduler thread before re-queuing the slot, so peer schedulers spun idle while one thread serialized every block dispatch.

- Claim a contiguous block range, advance next_block_idx, and push the slot back to the ready queue immediately, then perform the expensive per-block dispatches afterward so other schedulers can concurrently claim and dispatch the remaining blocks of the same task.
- Thread an explicit block_idx through build_payload / dispatch_block / dispatch_mix_block_to_cluster / dispatch_subtask_to_core, since next_block_idx may already be advanced by another scheduler once the slot is re-queued.
- Apply symmetrically to a2a3 and a5 runtimes.
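The claim step described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: Slot, BlockRange, claim_blocks, block_num, and the use of a std::mutex are all assumptions; only next_block_idx is a name taken from the commit message. It assumes C++17 for std::optional.

```cpp
#include <algorithm>
#include <cstdint>
#include <mutex>
#include <optional>

// Hypothetical stand-in for an SPMD task slot. Only next_block_idx is a
// name from the commit message; block_num is an assumed field.
struct Slot {
    std::uint32_t next_block_idx = 0;  // first block not yet claimed
    std::uint32_t block_num = 0;       // total blocks in the task
};

struct BlockRange {
    std::uint32_t begin;  // first claimed block index
    std::uint32_t end;    // one past the last claimed block index
};

// Claim up to idle_cores contiguous blocks while holding the lock.
// The caller would re-queue the slot right after this returns and only
// then run the expensive per-block dispatches, passing each block index
// explicitly instead of re-reading slot.next_block_idx.
std::optional<BlockRange> claim_blocks(Slot& slot, std::mutex& m,
                                       std::uint32_t idle_cores) {
    std::lock_guard<std::mutex> guard(m);
    const std::uint32_t remaining = slot.block_num - slot.next_block_idx;
    const std::uint32_t count = std::min(idle_cores, remaining);
    if (count == 0) return std::nullopt;  // nothing left, or no idle cores
    BlockRange r{slot.next_block_idx, slot.next_block_idx + count};
    slot.next_block_idx = r.end;  // advanced before any dispatch runs
    return r;
}
```

Because the claimed range is returned by value, a scheduler can iterate over r.begin up to r.end after the slot is back in the ready queue, even if a peer has advanced next_block_idx in the meantime.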
Force-pushed from f732288 to ca84559.
ChaoWao approved these changes on May 21, 2026
Summary
- dispatch_shape claims a contiguous block range, pushes the slot back to the ready queue early, then performs the expensive dispatch_block calls so that other schedulers can concurrently claim the remaining blocks of the same SPMD task
- Thread block_idx explicitly through build_payload / dispatch_subtask_to_core / dispatch_mix_block_to_cluster / dispatch_block to avoid the race once the slot is pushed back early
- … to its Runtime* first-arg and s_block_idx / s_block_num fields)
- spmd_parallel_dispatch_demo (a2a3) for L2-swimlane comparison

Testing
- … blocks split 8/8/8 across the 3 schedulers with overlapping first-dispatch times
- spmd_multiblock_mix, spmd_sync_start_stress, spmd_starvation

Swimlane comparison
Before:

After:

Related Issues
Closes #818