Skip to content

Add explicit multi-buffer annotation and manual buffer select support#685

Draft
zhangstevenunity wants to merge 4 commits into
mainfrom
claude/serene-bhaskara-f0221e
Draft

Add explicit multi-buffer annotation and manual buffer select support#685
zhangstevenunity wants to merge 4 commits into
mainfrom
claude/serene-bhaskara-f0221e

Conversation

@zhangstevenunity
Copy link
Copy Markdown
Collaborator

No description provided.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces an explicit multi-buffer expression and automatic synchronization scheme, adding the !pto.multi_tile_buf type and associated pto.alloc_multi_tile and pto.multi_tile_get operations. The implementation includes a new PTOResolveBufferSelect pass to lower slot selections into arith.select chains or single-address casts, and updates synchronization solvers to support dynamic event ID derivation and disjoint slot optimization via affine analysis. Review feedback suggests optimizing the arith.select lowering for high slot counts, refining the constant-peeling heuristic in affine analysis, and adding tracking for the current lack of support for low-precision types in multi-buffer allocations.

Comment on lines +181 to +187
for (uint32_t i = 1; i < n; ++i) {
Value iIdx = rewriter.create<arith::ConstantIndexOp>(loc, i);
Value isThis = rewriter.create<arith::CmpIOp>(
loc, arith::CmpIPredicate::eq, slot, iIdx);
selected = rewriter.create<arith::SelectOp>(loc, isThis, slotMems[i],
selected);
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The N-way arith.select chain is O(N) in terms of generated operations. While N is currently capped at 16, for larger N this could lead to significant IR bloat and potentially inefficient hardware execution. Consider if a more efficient lowering (like a binary search tree of selects or a jump table if supported by the target) would be beneficial if the slot count limit is increased in the future.

// Peel at most one add/sub of a constant.
Value rem = inner;
int peeled = 0;
while (peeled++ < 4) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The heuristic of peeling up to 4 constant additions/subtractions is simple and likely sufficient for common prefetch patterns. However, it is a hardcoded limit. If more complex index expressions are expected, a more robust affine expression walker might be needed.

Comment thread lib/PTO/IR/PTO.cpp
Comment on lines +2188 to +2190
if (isPTOLowPrecisionType(elemTy))
return emitOpError() << "slot dtype " << elemTy
<< " is not supported by pto.alloc_multi_tile yet";
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Low precision types are explicitly blocked for multi-buffer allocations. If this is a temporary hardware or compiler limitation, it should be tracked with a TODO or a reference to a tracking issue to ensure it's addressed when support is added.

@reedhecre
Copy link
Copy Markdown

reedhecre commented May 19, 2026

Codex Review

该评论由 review 机器人自动更新。

Summary

检查到 2 个问题:dynamic-valid multi-buffer 在 multi_tile_get lowering 中会丢失 valid shape;另外 multi_tile_buf 仍可出现在函数边界并被静默错误 lowering。

Findings

  1. P1 multi_tile_get lowering 丢失了动态 valid_row/valid_col lib/PTO/Transforms/PTOViewToMemref.cpp:1498

pto.alloc_multi_tile 允许通过 operands 传入动态 valid_row/valid_col,但这里在 lowering pto.multi_tile_get 时重新造了一个新的 pto.bind_tile,只会从结果类型里回填静态常量 valid dim。对 v_row/v_col = ? 的 slot type,这个新 bind_tile 会完全丢掉运行时 valid shape。后续 PTOMaterializeTileHandles 看不到动态 valid operands,就会按整块静态 shape 物化 tile,尾块场景会把“部分有效”tile 当成“全有效”tile 继续参与 tload/tstore/compute,属于直接的结果错误风险。当前新增测试里也没有覆盖动态-valid 的 multi-buffer 用例。

  1. P2 未禁止函数边界上的 multi_tile_buf,当前会被静默错误 lowering lib/PTO/Transforms/PTOViewToMemref.cpp:1247

Stage 0 会把函数签名上的 !pto.multi_tile_buf 直接改写成普通 slot memref,但这里不会保留 count,也不会生成后续多槽位所依赖的 pto.multi_buffer / multi-address pto.pointer_cast 锚点。这样一来,如果用户把 multi_tile_buf 放在函数参数或返回值上,后续 pto.multi_tile_get 虽然仍能通过 verifier,却只会在 memref 层看到一个普通 buffer;PTOResolveBufferSelect 也找不到 multi-buffer 根节点,只能把 slot 选择退化成 identity,导致不同 slot 实际别名到同一块内存。设计文档明确说首版不支持 function boundary,但当前实现没有把这类 IR 拒掉,而是静默错误 lowering。

@zhangstevenunity zhangstevenunity changed the title Add explicit multi-buffer support Add explicit multi-buffer and manual buffer select support May 19, 2026
Remove the `allComparableSlotPairsEqual` early-bail in
`getMultiBufferEventIdInfo`. When producer and consumer access a
multi-buffer alloc through the same slot SSA, GSS previously fell
back to a single static event id; InsertSync kept allocating N dyn
event ids so consecutive iterations touching different physical
slots could overlap. The two solvers now emit the same pre-loop
primes, loop-body dyn `set_flag_dyn` / `wait_flag_dyn`, and
post-loop drains for this case.

* `multi_tile_prefetch_gss_event_id.pto`: CHECKs updated to expect
  2-slot dyn flag pipeline (matching InsertSync).
* Design doc: status table marks alignment landed; same-SSA
  asymmetry removed from limitations / follow-ups.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@zhangstevenunity zhangstevenunity changed the title Add explicit multi-buffer and manual buffer select support Add explicit multi-buffer annotation and manual buffer select support May 20, 2026
Resolves conflicts in:
- lib/PTO/Transforms/InsertSync/InsertSyncAnalysis.cpp:
  Keep both slot-affine forward-dep filter (this branch) and
  CanPrunePipeVBarrier (main's #674). They're orthogonal: the affine
  filter prunes depVec for same-iter deps; the pipe-v pruner short-
  circuits the whole sync if all remaining deps are exact-same-access
  on PIPE_V.
- lib/PTO/Transforms/InsertSync/PTOIRTranslator.cpp:
  Keep the arith.select alias handler (multi-buffer dynamic-slot path)
  and the multi-address PointerCast slot-offset population. Main only
  had a codecheck reformat of the single-address-only legacy path,
  which is now the `op.getAddrs().size() == 1` branch.
- lib/PTO/Transforms/InsertSync/SyncCodegen.cpp:
  Keep this branch's full rewrite of CreateSetWaitOpForMultiBuffer
  that emits pto.set_flag_dyn / pto.wait_flag_dyn via an N-way
  select chain over eventIds[slot % N]. Main had a stale
  GetBufferSelected call whose return was already `(void)`-discarded;
  the new dyn-flag path supersedes it.
- tools/ptoas/ptoas.cpp:
  Keep the createPTOResolveBufferSelectPass() registration.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants