Add explicit multi-buffer annotation and manual buffer select support #685
zhangstevenunity wants to merge 4 commits into
Code Review
This pull request introduces an explicit multi-buffer expression and automatic synchronization scheme, adding the !pto.multi_tile_buf type and associated pto.alloc_multi_tile and pto.multi_tile_get operations. The implementation includes a new PTOResolveBufferSelect pass to lower slot selections into arith.select chains or single-address casts, and updates synchronization solvers to support dynamic event ID derivation and disjoint slot optimization via affine analysis. Review feedback suggests optimizing the arith.select lowering for high slot counts, refining the constant-peeling heuristic in affine analysis, and adding tracking for the current lack of support for low-precision types in multi-buffer allocations.
for (uint32_t i = 1; i < n; ++i) {
  Value iIdx = rewriter.create<arith::ConstantIndexOp>(loc, i);
  Value isThis = rewriter.create<arith::CmpIOp>(
      loc, arith::CmpIPredicate::eq, slot, iIdx);
  selected = rewriter.create<arith::SelectOp>(loc, isThis, slotMems[i],
                                              selected);
}
The N-way arith.select chain is O(N) in terms of generated operations. While N is currently capped at 16, for larger N this could lead to significant IR bloat and potentially inefficient hardware execution. Consider if a more efficient lowering (like a binary search tree of selects or a jump table if supported by the target) would be beneficial if the slot count limit is increased in the future.
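To make the trade-off concrete, here is a minimal standalone sketch in plain C++ (not MLIR, and not code from this PR) that models both shapes on integers. `selectLinear` mirrors the chain in the snippet above; `selectTree` is a hypothetical balanced alternative that uses one bit of `slot` per level. Note that a tree of two-way selects still needs N-1 selects, so the gain is in dropping the N-1 equality compares and reducing the dependency depth from N-1 to about log2(N), not in the select count. The tree assumes `slot < N`; the linear chain falls back to slot 0 for out-of-range values, so the two differ there.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Models the linear chain: start from slot 0 and override whenever slot == i.
// Costs n-1 compares and n-1 selects, with a dependency depth of n-1.
int selectLinear(const std::vector<int> &vals, uint32_t slot) {
  int selected = vals[0];
  for (uint32_t i = 1; i < vals.size(); ++i)
    selected = (slot == i) ? vals[i] : selected;
  return selected;
}

// Hypothetical balanced variant: level k pairs up neighbours and picks by
// bit k of slot. Costs n-1 selects and no equality compares.
// Assumes slot < vals.size(); out-of-range slots are not handled.
int selectTree(std::vector<int> vals, uint32_t slot) {
  for (uint32_t bit = 0; vals.size() > 1; ++bit) {
    bool takeOdd = ((slot >> bit) & 1u) != 0;
    std::vector<int> next;
    for (std::size_t i = 0; i < vals.size(); i += 2) {
      if (i + 1 == vals.size())
        next.push_back(vals[i]);  // unpaired trailing element passes through
      else
        next.push_back(takeOdd ? vals[i + 1] : vals[i]);
    }
    vals = std::move(next);
  }
  return vals[0];
}
```

Both functions return `vals[slot]` for every in-range slot, including slot counts that are not a power of two.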
// Peel at most one add/sub of a constant.
Value rem = inner;
int peeled = 0;
while (peeled++ < 4) {
if (isPTOLowPrecisionType(elemTy))
  return emitOpError() << "slot dtype " << elemTy
                       << " is not supported by pto.alloc_multi_tile yet";
Codex Review
This comment is automatically updated by the review bot.
Summary
Found 2 issues: dynamic-valid multi-buffer in
Findings
Stage 0 will take the function signature's
Remove the `allComparableSlotPairsEqual` early-bail in `getMultiBufferEventIdInfo`.

When producer and consumer access a multi-buffer alloc through the same slot SSA, GSS previously fell back to a single static event id; InsertSync kept allocating N dyn event ids so consecutive iterations touching different physical slots could overlap. The two solvers now emit the same pre-loop primes, loop-body dyn `set_flag_dyn` / `wait_flag_dyn`, and post-loop drains for this case.

* `multi_tile_prefetch_gss_event_id.pto`: CHECKs updated to expect a 2-slot dyn flag pipeline (matching InsertSync).
* Design doc: status table marks alignment landed; same-SSA asymmetry removed from limitations / follow-ups.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
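The prime / loop-body / drain shape described in this commit can be illustrated with a small sequential model. This is a sketch under stated assumptions, not the code either solver emits: `FlagModel` and `runPipeline` are hypothetical names, each event id is modelled as a counter of pending sets, and the event id for iteration `i` is assumed to be `i % n`. It only checks that every wait has a matching set and that no flag is left pending after the drains.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy model: one counter of pending "set" signals per event id.
struct FlagModel {
  std::vector<int> pending;
  explicit FlagModel(uint32_t n) : pending(n, 0) {}
  void setFlag(uint32_t id) { ++pending[id]; }
  // Returns false if the wait could never be satisfied in this model.
  bool waitFlag(uint32_t id) {
    if (pending[id] == 0)
      return false;
    --pending[id];
    return true;
  }
};

// Runs `iters` iterations over `n` slots; returns true when every wait found
// a matching set and all flags are back to zero at the end.
bool runPipeline(uint32_t n, uint32_t iters) {
  FlagModel flags(n);
  for (uint32_t id = 0; id < n; ++id)
    flags.setFlag(id);                 // pre-loop primes: every slot is free
  for (uint32_t i = 0; i < iters; ++i) {
    uint32_t id = i % n;               // assumed: event id follows the slot
    if (!flags.waitFlag(id))           // stands in for wait_flag_dyn
      return false;
    // ... produce into the slot, then consume from it ...
    flags.setFlag(id);                 // stands in for set_flag_dyn
  }
  for (uint32_t id = 0; id < n; ++id)
    if (!flags.waitFlag(id))           // post-loop drains
      return false;
  for (int p : flags.pending)
    if (p != 0)
      return false;
  return true;
}
```

In this model the set/wait counts balance for any trip count, including zero iterations and trip counts that are not a multiple of the slot count, because each slot gets exactly one prime and one drain.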
Resolves conflicts in:

- lib/PTO/Transforms/InsertSync/InsertSyncAnalysis.cpp: Keep both the slot-affine forward-dep filter (this branch) and CanPrunePipeVBarrier (main's #674). They're orthogonal: the affine filter prunes depVec for same-iter deps; the pipe-v pruner short-circuits the whole sync if all remaining deps are exact-same-access on PIPE_V.
- lib/PTO/Transforms/InsertSync/PTOIRTranslator.cpp: Keep the arith.select alias handler (multi-buffer dynamic-slot path) and the multi-address PointerCast slot-offset population. Main only had a codecheck reformat of the single-address-only legacy path, which is now the `op.getAddrs().size() == 1` branch.
- lib/PTO/Transforms/InsertSync/SyncCodegen.cpp: Keep this branch's full rewrite of CreateSetWaitOpForMultiBuffer that emits pto.set_flag_dyn / pto.wait_flag_dyn via an N-way select chain over eventIds[slot % N]. Main had a stale GetBufferSelected call whose return was already `(void)`-discarded; the new dyn-flag path supersedes it.
- tools/ptoas/ptoas.cpp: Keep the createPTOResolveBufferSelectPass() registration.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
No description provided.