Add explicit multi-buffer annotation and manual buffer select support by zhangstevenunity · Pull Request #685 · hw-native-sys/PTOAS

zhangstevenunity · 2026-05-19T07:58:28Z

No description provided.

gemini-code-assist

Code Review

This pull request introduces an explicit multi-buffer expression and automatic synchronization scheme, adding the !pto.multi_tile_buf type and associated pto.alloc_multi_tile and pto.multi_tile_get operations. The implementation includes a new PTOResolveBufferSelect pass to lower slot selections into arith.select chains or single-address casts, and updates synchronization solvers to support dynamic event ID derivation and disjoint slot optimization via affine analysis. Review feedback suggests optimizing the arith.select lowering for high slot counts, refining the constant-peeling heuristic in affine analysis, and adding tracking for the current lack of support for low-precision types in multi-buffer allocations.

gemini-code-assist · 2026-05-19T08:04:19Z

+      for (uint32_t i = 1; i < n; ++i) {
+        Value iIdx = rewriter.create<arith::ConstantIndexOp>(loc, i);
+        Value isThis = rewriter.create<arith::CmpIOp>(
+            loc, arith::CmpIPredicate::eq, slot, iIdx);
+        selected = rewriter.create<arith::SelectOp>(loc, isThis, slotMems[i],
+                                                    selected);
+      }


The N-way arith.select chain is O(N) in terms of generated operations. While N is currently capped at 16, for larger N this could lead to significant IR bloat and potentially inefficient hardware execution. Consider if a more efficient lowering (like a binary search tree of selects or a jump table if supported by the target) would be beneficial if the slot count limit is increased in the future.

gemini-code-assist · 2026-05-19T08:04:19Z

+    // Peel at most one add/sub of a constant.
+    Value rem = inner;
+    int peeled = 0;
+    while (peeled++ < 4) {


The heuristic of peeling up to 4 constant additions/subtractions is simple and likely sufficient for common prefetch patterns. However, it is a hardcoded limit. If more complex index expressions are expected, a more robust affine expression walker might be needed.

gemini-code-assist · 2026-05-19T08:04:19Z

+  if (isPTOLowPrecisionType(elemTy))
+    return emitOpError() << "slot dtype " << elemTy
+                         << " is not supported by pto.alloc_multi_tile yet";


Low precision types are explicitly blocked for multi-buffer allocations. If this is a temporary hardware or compiler limitation, it should be tracked with a TODO or a reference to a tracking issue to ensure it's addressed when support is added.

reedhecre · 2026-05-19T08:10:27Z

Codex Review

该评论由 review 机器人自动更新。

PR: Add explicit multi-buffer annotation and manual buffer select support #685 Add explicit multi-buffer annotation and manual buffer select support
Author: zhangstevenunity
Base/Head: main / claude/serene-bhaskara-f0221e
Head SHA: 95cb6c6ff0a0
Trigger: PR 有新提交
Generated At: 2026-05-20T02:40:14Z
Previous Head SHA: efef1281c881
Status: completed

Summary

检查到 2 个问题：dynamic-valid multi-buffer 在 multi_tile_get lowering 中会丢失 valid shape；另外 multi_tile_buf 仍可出现在函数边界并被静默错误 lowering。

Findings

P1 multi_tile_get lowering 丢失了动态 valid_row/valid_col lib/PTO/Transforms/PTOViewToMemref.cpp:1498

pto.alloc_multi_tile 允许通过 operands 传入动态 valid_row/valid_col，但这里在 lowering pto.multi_tile_get 时重新造了一个新的 pto.bind_tile，只会从结果类型里回填静态常量 valid dim。对 v_row/v_col = ? 的 slot type，这个新 bind_tile 会完全丢掉运行时 valid shape。后续 PTOMaterializeTileHandles 看不到动态 valid operands，就会按整块静态 shape 物化 tile，尾块场景会把“部分有效”tile 当成“全有效”tile 继续参与 tload/tstore/compute，属于直接的结果错误风险。当前新增测试里也没有覆盖动态-valid 的 multi-buffer 用例。

P2 未禁止函数边界上的 multi_tile_buf，当前会被静默错误 lowering lib/PTO/Transforms/PTOViewToMemref.cpp:1247

Stage 0 会把函数签名上的 !pto.multi_tile_buf 直接改写成普通 slot memref，但这里不会保留 count，也不会生成后续多槽位所依赖的 pto.multi_buffer / multi-address pto.pointer_cast 锚点。这样一来，如果用户把 multi_tile_buf 放在函数参数或返回值上，后续 pto.multi_tile_get 虽然仍能通过 verifier，却只会在 memref 层看到一个普通 buffer；PTOResolveBufferSelect 也找不到 multi-buffer 根节点，只能把 slot 选择退化成 identity，导致不同 slot 实际别名到同一块内存。设计文档明确说首版不支持 function boundary，但当前实现没有把这类 IR 拒掉，而是静默错误 lowering。

Remove the `allComparableSlotPairsEqual` early-bail in `getMultiBufferEventIdInfo`. When producer and consumer access a multi-buffer alloc through the same slot SSA, GSS previously fell back to a single static event id; InsertSync kept allocating N dyn event ids so consecutive iterations touching different physical slots could overlap. The two solvers now emit the same pre-loop primes, loop-body dyn `set_flag_dyn` / `wait_flag_dyn`, and post-loop drains for this case. * `multi_tile_prefetch_gss_event_id.pto`: CHECKs updated to expect 2-slot dyn flag pipeline (matching InsertSync). * Design doc: status table marks alignment landed; same-SSA asymmetry removed from limitations / follow-ups. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Resolves conflicts in: - lib/PTO/Transforms/InsertSync/InsertSyncAnalysis.cpp: Keep both slot-affine forward-dep filter (this branch) and CanPrunePipeVBarrier (main's #674). They're orthogonal: the affine filter prunes depVec for same-iter deps; the pipe-v pruner short- circuits the whole sync if all remaining deps are exact-same-access on PIPE_V. - lib/PTO/Transforms/InsertSync/PTOIRTranslator.cpp: Keep the arith.select alias handler (multi-buffer dynamic-slot path) and the multi-address PointerCast slot-offset population. Main only had a codecheck reformat of the single-address-only legacy path, which is now the `op.getAddrs().size() == 1` branch. - lib/PTO/Transforms/InsertSync/SyncCodegen.cpp: Keep this branch's full rewrite of CreateSetWaitOpForMultiBuffer that emits pto.set_flag_dyn / pto.wait_flag_dyn via an N-way select chain over eventIds[slot % N]. Main had a stale GetBufferSelected call whose return was already `(void)`-discarded; the new dyn-flag path supersedes it. - tools/ptoas/ptoas.cpp: Keep the createPTOResolveBufferSelectPass() registration. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Add explicit multi-buffer support

f215ce1

gemini-code-assist Bot reviewed May 19, 2026

View reviewed changes

Refine multi-tile slot sync handling

5b6f6cb

zhangstevenunity changed the title ~~Add explicit multi-buffer support~~ Add explicit multi-buffer and manual buffer select support May 19, 2026

zhangstevenunity changed the title ~~Add explicit multi-buffer and manual buffer select support~~ Add explicit multi-buffer annotation and manual buffer select support May 20, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add explicit multi-buffer annotation and manual buffer select support#685

Add explicit multi-buffer annotation and manual buffer select support#685
zhangstevenunity wants to merge 4 commits into
mainfrom
claude/serene-bhaskara-f0221e

zhangstevenunity commented May 19, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 19, 2026

Uh oh!

gemini-code-assist Bot May 19, 2026

Uh oh!

gemini-code-assist Bot May 19, 2026

Uh oh!

reedhecre commented May 19, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

zhangstevenunity commented May 19, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

reedhecre commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codex Review

Summary

Findings

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

reedhecre commented May 19, 2026 •

edited

Loading