Skip to content

Add PTO pointer cast op#692

Merged
zhangstevenunity merged 1 commit into
mainfrom
codex/ptr-cast
May 22, 2026
Merged

Add PTO pointer cast op#692
zhangstevenunity merged 1 commit into
mainfrom
codex/ptr-cast

Conversation

@zhangstevenunity
Copy link
Copy Markdown
Collaborator

Summary: add pto.ptr_cast IR op, verifier, Python helper, manual docs, GM memref lowering, and EmitC reinterpret_cast support. Validation: cmake --build build-codex-ci --target ptoas -j2; llvm-lit -sv --filter=ptr_cast build-codex-ci/test/lit; llvm-lit -sv --filter=issue481_addptr_gm_slot_buffer build-codex-ci/test/lit; python3 -m py_compile python/pto/dialects/pto.py; git diff --cached --check.

@zhangstevenunity zhangstevenunity marked this pull request as ready for review May 21, 2026 08:42
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces the pto.ptr_cast operation, enabling the reinterpretation of global pointers with different element types. The changes include IR definitions, documentation, Python bindings, and lowering logic for both memref and EmitC. Review feedback identifies a potential issue where hardcoding the size to 1 in memref.reinterpret_cast could lead to out-of-bounds errors when the resulting pointer is used for array indexing. Additionally, it is suggested to optimize the lowering pass by replacing the nested while(changed) walk with a more efficient traversal strategy to improve performance on larger functions.

Comment thread lib/PTO/Transforms/PTOViewToMemref.cpp Outdated
auto offsetTy =
MemRefType::get({dyn}, targetTy.getElementType(), offsetLayout,
targetTy.getMemorySpace());
SmallVector<OpFoldResult, 1> sizes{rewriter.getIndexAttr(1)};
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Hardcoding the size to 1 in the memref.reinterpret_cast operation is potentially problematic. If the resulting pointer is used to access an array (e.g., via pto.load_scalar with a non-zero offset or pto.make_tensor_view), a static size of 1 will lead to out-of-bounds accesses in the memref representation. In the provided test case, pto.load_scalar at index 1 would be invalid for a memref<1xT>. It is safer to preserve the dynamic size from the source memref, potentially scaling it if the element sizes differ.

Comment thread lib/PTO/Transforms/PTOViewToMemref.cpp Outdated
Comment on lines +865 to +868
while (changed) {
changed = false;
DefaultInlineVector<mlir::pto::PtrCastOp> ptrCasts;
func.walk([&](mlir::pto::PtrCastOp op) { ptrCasts.push_back(op); });
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The current implementation uses a while(changed) loop with a full func.walk inside each iteration. This results in $O(N \cdot M)$ complexity, where $N$ is the nesting depth of pointer casts and $M$ is the total number of operations in the function. While functional for small functions, this is inefficient. Consider using a single walk with a worklist or a post-order (bottom-up) traversal to handle nested transformations in a single pass.

@reedhecre
Copy link
Copy Markdown

reedhecre commented May 21, 2026

Codex Review

该评论由 review 机器人自动更新。

  • PR: Add PTO pointer cast op #692 Add PTO pointer cast op
  • Author: zhangstevenunity
  • Base/Head: main / codex/ptr-cast
  • Head SHA: a43976d9aa42
  • Trigger: PR 有新提交
  • Generated At: 2026-05-22T06:08:25Z
  • Previous Head SHA: fe5cdc49d788
  • Status: completed

Summary

发现 2 个问题:ptrtoint 的 lowering 还不能处理控制流产出的指针,且 memref 形式的 ptr/int cast 校验过宽,会接受会被错误降级的布局类型。

Findings

  1. P2 `ptrtoint` lowering 不能处理 `scf.if` / `scf.for` 等控制流产出的指针 lib/PTO/Transforms/PTOViewToMemref.cpp:913

lowerPtrToIntOps 只特判了直接以 pto.addptr 为输入的情况,否则就要求 pto.ptrtoint 的输入在这个阶段已经是 memref。问题是这里发生在 Stage 0.30,而 scf.if / scf.for 结果类型的修正要到后面的 reconcileSCFIfResultTypes / reconcileSCFForResultTypes 才做。因此像 %p = scf.if ... -> !pto.ptr<T> / %addr = pto.ptrtoint %p 这类语义上合法的用法,会在这里落到 ptrtoint source could not be lowered to a GM memref 的报错路径,导致编译失败。也就是说,新 op 目前实际上不能处理控制流产出的指针值。

  1. P3 memref 形式的 ptr/int cast 校验过宽,会接受后端无法正确表达的 layout lib/PTO/IR/PTO.cpp:1995

这里把 pto.ptrtoint / pto.inttoptr 的 memref 形式放行到了“任意 rank-1 GM memref”,但后续 EmitC lowering 并不会使用 memref 的 layout 信息,而是直接按原始 __gm__ T* / int64_t 指针处理。这样一来,像 memref<8xf32, strided<[2], offset:1>> 这类带 offset/stride 的类型现在会被 verifier 接受,却会在 codegen 时被当成 identity-layout 指针静默错误降级。若 memref 形式只是内部 lowering 形态,这里至少应拒绝非 identity layout;否则这是对已接受 IR 的静默误编译。

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 23a98762d8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread lib/PTO/Transforms/PTOViewToMemref.cpp Outdated
Comment on lines +844 to +848
if (getConstIndexValue(offset, constOffset)) {
int64_t byteOffset = constOffset * static_cast<int64_t>(srcBytes);
if (byteOffset % static_cast<int64_t>(dstBytes) != 0) {
anchorOp->emitError(
"ptr_cast addptr byte offset is not aligned to the result element size");
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Accumulate addptr offsets before checking cast alignment

When lowering pto.ptr_cast(pto.addptr(...)), alignment is checked on each addptr offset independently (byteOffset % dstBytes) instead of on the combined byte offset. This rejects valid programs where individual offsets are misaligned but their sum is aligned (for example, casting !pto.ptr<ui16> to !pto.ptr<ui32> after two +1 element steps should be legal because total byte offset is 4). In this case the pass emits an error and fails lowering even though the final address is representable in destination element units.

Useful? React with 👍 / 👎.

@zhangstevenunity zhangstevenunity force-pushed the codex/ptr-cast branch 6 times, most recently from c212290 to fe5cdc4 Compare May 22, 2026 03:46
@zhangstevenunity zhangstevenunity merged commit 7715654 into main May 22, 2026
10 checks passed
@reedhecre
Copy link
Copy Markdown

A5 板测成功

  • 触发方式:merged
  • 源码提交:7715654d7d4f
  • 结果汇总:OK 14 / FAIL 0 / SKIP 0
  • 日志:/root/ptoas-board-monitor-a5/logs/20260522_133205_merged_pr692.log
  • 结果 TSV:/root/ptoas-board-monitor-a5/logs/20260522_133205_merged_pr692.tsv

@reedhecre
Copy link
Copy Markdown

A3 板测失败

  • 触发方式:merged
  • 源码提交:7715654d7d4f
  • 结果汇总:OK 207 / FAIL 2 / SKIP 2
  • 日志:/home/zhongxuan/ptoas-board-monitor/runtime/logs/20260522_133204_merged_pr692.log
  • 失败阶段:board-validation / exit=1

失败用例

  • down_proj_residual (run, exit=1)
  • out_proj_residual (run, exit=1)

@reedhecre
Copy link
Copy Markdown

A3 板测失败详情:PR #692

down_proj_residual

stage=run info=exit=1

[ERROR] aclrtSynchronizeStream(stream) failed: 507015 (/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260522_133204_merged_pr692/npu_validation/Qwen3DecodeA3/down_proj_residual/main.cpp:116)
[ERROR] RecentErrMsg: EZ9999: Inner Error!
EZ9999[PID: 3215898] 2026-05-22-13:55:09.456.068 (EZ9999):  The error from device(chipId:3, dieId:1), serial number is 32, there is an exception of fftsplus aivector error, core id is 7, error code = 0, dump info: pc start: 0x124e000009bc, current: 0x124e00000b40, vec error info: 0xab00000078, mte error info: 0x9800000052, ifu error info: 0x1000000000000, ccu error info: 0xd00000e000000000, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd00028c, para base: 0x12c100000080.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:645]
        TraceBack (most recent call last):
       The extend info: errcode:(0, 0x8000, 0) errorStr: When the D-cache reads and writes data to the UB, the response value returned by the bus is a non-zero value. fixp_error0 info: 0x52, fixp_error1 info: 0x98, fsmId:0, tslot:0, thread:0, ctxid:0, blk:0, sublk:0, subErrType:4.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:658]
       Kernel task happen error, retCode=0x26, [aicore exception].[FUNC:PreCheckTaskErr][FILE:davinci_kernel_task.cc][LINE:1729]
       AICORE Kernel task happen error, retCode=0x26.[FUNC:GetError][FILE:stream.cc][LINE:1475]
       [AIC_INFO] after execute:args print end[FUNC:GetError][FILE:stream.cc][LINE:1475]
       [AIC_INFO] after execute:mixCtx print end[FUNC:GetError][FILE:stream.cc][LINE:1475]
       [DFX_INFO]Aicore kernel execute failed, device_id=7, stream_id=46, report_stream_id=46, task_id=0, flip_num=0, fault kernel_name=_Z18down_proj_residualPu6__bf16PfS_S_S0_i, fault kernel info ext=_Z18down_proj_residualPu6__bf16PfS_S_S0_i, program id=0, hash=4204309005333304519.[FUNC:GetError][FILE:stream.cc][LINE:1475]
       rtStreamSynchronize execution failed, reason=aicore exception[FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:65]
       synchronize stream failed, runtime result = 507015[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:148]
[2026-05-22 13:55:11] ERROR: testcase failed (exit 1): down_proj_residual
out_proj_residual

stage=run info=exit=1

[ERROR] aclrtSynchronizeStream(stream) failed: 507015 (/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260522_133204_merged_pr692/npu_validation/Qwen3DecodeA3/out_proj_residual/main.cpp:116)
[ERROR] RecentErrMsg: EZ9999: Inner Error!
EZ9999[PID: 3223047] 2026-05-22-13:55:16.748.905 (EZ9999):  The error from device(chipId:3, dieId:1), serial number is 33, there is an exception of fftsplus aivector error, core id is 9, error code = 0, dump info: pc start: 0x124e000009cc, current: 0x124e00000b14, vec error info: 0x5316b72342, mte error info: 0x580301501e, ifu error info: 0x1000000000000, ccu error info: 0xd00000e000000000, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd00028c, para base: 0x12c100000080.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:645]
        TraceBack (most recent call last):
       The extend info: errcode:(0, 0x8000, 0) errorStr: When the D-cache reads and writes data to the UB, the response value returned by the bus is a non-zero value. fixp_error0 info: 0x301501e, fixp_error1 info: 0x58, fsmId:0, tslot:0, thread:0, ctxid:0, blk:0, sublk:0, subErrType:4.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:658]
       Kernel task happen error, retCode=0x26, [aicore exception].[FUNC:PreCheckTaskErr][FILE:davinci_kernel_task.cc][LINE:1729]
       AICORE Kernel task happen error, retCode=0x26.[FUNC:GetError][FILE:stream.cc][LINE:1475]
       [AIC_INFO] after execute:args print end[FUNC:GetError][FILE:stream.cc][LINE:1475]
       [AIC_INFO] after execute:mixCtx print end[FUNC:GetError][FILE:stream.cc][LINE:1475]
       [DFX_INFO]Aicore kernel execute failed, device_id=7, stream_id=46, report_stream_id=46, task_id=0, flip_num=0, fault kernel_name=_Z17out_proj_residualPfPu6__bf16S0_S0_S_i, fault kernel info ext=_Z17out_proj_residualPfPu6__bf16S0_S0_S_i, program id=0, hash=16695015867854040655.[FUNC:GetError][FILE:stream.cc][LINE:1475]
       rtStreamSynchronize execution failed, reason=aicore exception[FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:65]
       synchronize stream failed, runtime result = 507015[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:148]
[2026-05-22 13:55:18] ERROR: testcase failed (exit 1): out_proj_residual

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants