Skip to content

feat: Support low precision#691

Open
FangRui0 wants to merge 4 commits into
hw-native-sys:mainfrom
FangRui0:support_low_precision
Open

feat: Support low precision#691
FangRui0 wants to merge 4 commits into
hw-native-sys:mainfrom
FangRui0:support_low_precision

Conversation

@FangRui0
Copy link
Copy Markdown
Contributor

@FangRui0 FangRui0 commented May 21, 2026

1、 alloc_tile tstore tload tcvt tprefetch op在a5上支持接受低精度类型
2、支持非mlir内置类型返回正确位宽

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements support for low-precision element types (FP8, FP4, etc.) on the A5 NPU architecture. It updates the PTO IR manual and adds design documentation, refactors type utility functions for storage size calculations, and enhances verifiers for tload, tstore, tprefetch, and tcvt to enforce architecture-specific constraints. Feedback highlights a bug in getPTOStorageElemByteSize that causes regressions for sub-byte types, the presence of absolute local paths in documentation, and the need to use isI8Like instead of isInteger(8) in verifiers to correctly support both signed and unsigned 8-bit integers.


unsigned mlir::pto::getPTOStorageElemByteSize(Type t) {
unsigned bitWidth = getPTOStorageElemBitWidth(t);
return bitWidth == 0 ? 0 : bitWidth / kBitsPerByte;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The current implementation of getPTOStorageElemByteSize returns 0 for types with a bit width less than 8 (e.g., i1). In most NPU contexts, the smallest addressable storage unit is a byte, and sub-byte types like i1 are stored in a full byte. This causes regressions in several passes (like PTOIRTranslator and PTOPlanMemory) where a size of 0 now triggers failures or fatal errors instead of falling back to 1 byte. Using (bitWidth + kBitsPerByte - 1) / kBitsPerByte correctly calculates the storage size in bytes for all types.

Suggested change
return bitWidth == 0 ? 0 : bitWidth / kBitsPerByte;
return (bitWidth + kBitsPerByte - 1) / kBitsPerByte;

## Scope

This note records the current low-precision C++ API support status in
`/Users/fangrui/workspace/huawei/pto-isa` for the following PTO operations:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The documentation contains absolute local file paths (e.g., /Users/fangrui/...). These should be replaced with relative paths or simply the repository name to ensure the documentation is portable and relevant for all contributors.

Suggested change
`/Users/fangrui/workspace/huawei/pto-isa` for the following PTO operations:
`.` for the following PTO operations:

Comment thread lib/PTO/IR/PTO.cpp

static bool isA5AccStorePreQuantDstType(Type srcElem, Type dstElem) {
if (srcElem.isInteger(32))
return dstElem.isInteger(8) || dstElem.isF16() || dstElem.isBF16();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The check dstElem.isInteger(8) is too restrictive as it typically only matches signless 8-bit integers. The previous implementation used isI8Like(dstElem), which likely covers both signed and unsigned 8-bit integers (i8 and ui8). According to the manual (line 899), both i8 and ui8 should be supported here.

Suggested change
return dstElem.isInteger(8) || dstElem.isF16() || dstElem.isBF16();
return isI8Like(dstElem) || dstElem.isF16() || dstElem.isBF16();

Comment thread lib/PTO/IR/PTO.cpp
return dstElem.isInteger(8) || dstElem.isF16() || dstElem.isBF16();
if (!srcElem.isF32())
return false;
return dstElem.isInteger(8) || dstElem.isF16() || dstElem.isBF16() ||
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Similar to the i32 source case, isInteger(8) should be replaced with isI8Like(dstElem) to correctly support both i8 and ui8 destination types as specified in the manual.

Suggested change
return dstElem.isInteger(8) || dstElem.isF16() || dstElem.isBF16() ||
return isI8Like(dstElem) || dstElem.isF16() || dstElem.isBF16() ||

@reedhecre
Copy link
Copy Markdown

reedhecre commented May 21, 2026

Codex Review

该评论由 review 机器人自动更新。

  • PR: feat: Support low precision #691 feat: Support low precision
  • Author: FangRui0
  • Base/Head: main / support_low_precision
  • Head SHA: 0c17c69c098f
  • Trigger: PR 有新提交
  • Generated At: 2026-05-22T02:41:40Z
  • Previous Head SHA: 417e9d757edd
  • Status: completed

Summary

检查到 3 个 P2 级问题:tprefetch 的 memref 源约束缺失,且 tload/tstore/tcvt 的低精度 memref 形式会绕过这次 PR 新增的校验。

Findings

  1. P2 `TPREFETCH` 的 memref 源没有限制为 GM,验证器与 lowering 假设不一致 lib/PTO/IR/PTO.cpp:2404

这里对 memref 源只检查了 rank/shape 和元素类型,没有检查地址空间是否为 GM/Zero,也没有保证它能被 lowering 成 GlobalTensor。但 PTOTPrefetchToTPREFETCH 只会把 GM/Zero memref 包成 GlobalTensor,其它 memref 会直接作为原始指针发给 TPREFETCH。这样像 memref<..., #pto.address_space<vec>> 这样的 IR 现在会通过 verifier,却落到一个不符合 TPREFETCH 合约的 codegen 路径,存在较高的编译/运行失败风险。

  1. P2 `tload`/`tstore` 的低精度 memref 形式会绕过新增的 A5/A3 约束 lib/PTO/IR/PTO.cpp:2304

这次 PR 明确新增了低精度 tload/tstore 支持,并且还新增了直接 memref 形式的正向用例;但这两个 verifier 仍然通过 dispatchVerifierByArch 退出,而该 helper 在任一操作数是 memref 时会直接返回 success。结果是,这次新增的“低精度仅 A5 支持”检查根本不会作用到 direct-memref 形式上:--pto-arch=a3 下的 memref<...x!pto.hif8> / f8* tload/tstore 仍会被接受并继续生成 TLOAD/TSTORE,而不是在 verifier 阶段拒绝。

  1. P2 `tcvt` 的 memref 形式会跳过低精度 pair 和 packed-shape 检查 lib/PTO/IR/PTO.cpp:4892

TCvtOp::verify 新增了 A5 低精度 pair 限制,以及 FP4 packed 目标需要匹配逻辑 extent 的检查,但在这里一旦操作已经是 memref 形式就会直接 return success()。而这次 PR 自己的 --emit-pto-ir 用例已经说明 pto.tcvt 会保留为 memref 形式,所以 direct-memref tcvt 现在可以绕过 A2/A3 的低精度禁用,以及 A5 的 packed-shape/type-pair 校验;例如 bf16 -> !pto.f4E1M2x2 仍可用未压缩的 16x16 目标形状通过 verifier。

@FangRui0 FangRui0 force-pushed the support_low_precision branch 2 times, most recently from 4c5e02b to cc66d99 Compare May 21, 2026 07:02
@FangRui0
Copy link
Copy Markdown
Contributor Author

/review

@reedhecre
Copy link
Copy Markdown

Manual Codex Review

该评论由 /review 手动触发。

Summary

收到 /review,正在执行 Codex review。

Findings

Review in progress.

@reedhecre
Copy link
Copy Markdown

Manual Codex Review

该评论由 /review 手动触发。

Summary

Review failed at stage codex-review: exit=1

Findings

未生成结构化 findings,因为 review 过程提前失败。

Log Tail

  8771	  auto isAllowedDataElem = [&](mlir::Type t) -> bool {
  8772	    if (t.isF16() || t.isF32() || t.isBF16()) return true;
  8773	    if (auto it = mlir::dyn_cast<mlir::IntegerType>(t))
  8774	      return (it.getWidth() == 8 || it.getWidth() == 16 || it.getWidth() == 32);
  8775	    return false;
  8776	  };
  8777	  auto isAllowedIndexElem = [&](mlir::Type t) -> bool {
  8778	    if (auto it = mlir::dyn_cast<mlir::IntegerType>(t))
  8779	      return (it.getWidth() == 16 || it.getWidth() == 32);
exec
/bin/bash -lc "git diff --unified=20 origin/main...HEAD -- lib/PTO/Transforms/InsertSync/PTOIRTranslator.cpp | sed -n '1,220p'" in /tmp/ptoas-pr-review-monitor/runs/20260521_150503_pr691/repo succeeded in 0ms:
diff --git a/lib/PTO/Transforms/InsertSync/PTOIRTranslator.cpp b/lib/PTO/Transforms/InsertSync/PTOIRTranslator.cpp
index 1aad797..0f3dba4 100644
--- a/lib/PTO/Transforms/InsertSync/PTOIRTranslator.cpp
+++ b/lib/PTO/Transforms/InsertSync/PTOIRTranslator.cpp
@@ -188,42 +188,44 @@ getMemrefSubViewBaseAddresses(memref::SubViewOp op, MemRefType sourceType,
     addresses.reserve(static_cast<size_t>(colSize));
     for (int64_t col = 0; col < colSize; ++col) {
       int64_t elemOffset =
           rowOffset * strides[0] + (colOffset + col) * strides[1];
       addresses.push_back(static_cast<uint64_t>(elemOffset * elemBytes));
     }
   } else {
     return std::nullopt;
   }
 
   return addresses;
 }
 
 // [辅助函数] 尝试从 Operation 中计算相对于 Source 的字节偏移量和新大小
 // 返回值: pair<offsetInBytes, sizeInBytes>
exec
/bin/bash -lc "git diff --unified=20 origin/main...HEAD -- lib/PTO/Transforms/PTOPlanMemory.cpp | sed -n '1,220p'" in /tmp/ptoas-pr-review-monitor/runs/20260521_150503_pr691/repo succeeded in 0ms:
diff --git a/lib/PTO/Transforms/PTOPlanMemory.cpp b/lib/PTO/Transforms/PTOPlanMemory.cpp
index 36ec6bb..cf6c25c 100644
--- a/lib/PTO/Transforms/PTOPlanMemory.cpp
+++ b/lib/PTO/Transforms/PTOPlanMemory.cpp
@@ -1,33 +1,35 @@
 // Copyright (c) 2026 Huawei Technologies Co., Ltd.
 // This program is free software, you can redistribute it and/or modify it under the terms and conditions of
 // CANN Open Software License Agreement Version 2.0 (the "License").
 // Please refer to the License for details. You may not use this file except in compliance with the License.
 // THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
 // INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
 // See LICENSE in the root of the software repository for the full text of the License.
 
 //===- PlanMemory.cpp ----Plan Buffer Memory Address ----------------------===//
 //===----------------------------------------------------------------------===//
 
 #include "PTOPlanMemory.h"
 
+#include "PTO/IR/PTOTypeUtils.h"
+
exec
/bin/bash -lc "git diff --unified=20 origin/main...HEAD -- lib/PTO/Transforms/PTOToEmitC.cpp | sed -n '1,260p'" in /tmp/ptoas-pr-review-monitor/runs/20260521_150503_pr691/repo succeeded in 0ms:
diff --git a/lib/PTO/Transforms/PTOToEmitC.cpp b/lib/PTO/Transforms/PTOToEmitC.cpp
index 6228c78..6fb8bfa 100644
--- a/lib/PTO/Transforms/PTOToEmitC.cpp
+++ b/lib/PTO/Transforms/PTOToEmitC.cpp
@@ -3364,108 +3364,82 @@ struct SubviewToEmitCPattern : public OpConversionPattern<memref::SubViewOp> {
     gtConstructorArgs.push_back(strideInstOp.getResult(0)); // 拿到 stride_inst 的 SSA Value
 
     rewriter.replaceOpWithNewOp<emitc::CallOpaqueOp>(
         op, 
         gtType, 
         gtCppType,
         /*args=*/ArrayAttr{}, 
         /*templateArgs=*/ArrayAttr{},
         /*operands=*/ValueRange(gtConstructorArgs)
     );
 
     return success();
   }
 };
 
ERROR: Selected model is at capacity. Please try a different model.
Warning: no last agent message; wrote empty content to /tmp/ptoas-pr-review-monitor/runs/20260521_150503_pr691/codex_last_message.json
tokens used
83,078
===== END STAGE codex-review rc=1 @ 2026-05-21 15:07:33 =====

@FangRui0
Copy link
Copy Markdown
Contributor Author

/review

@reedhecre
Copy link
Copy Markdown

Manual Codex Review

该评论由 /review 手动触发。

Summary

收到 /review,正在执行 Codex review。

Findings

Review in progress.

@reedhecre
Copy link
Copy Markdown

Manual Codex Review

该评论由 /review 手动触发。

Summary

Review failed at stage codex-review: exit=1

Findings

未生成结构化 findings,因为 review 过程提前失败。

Log Tail

     if (!srcElem || !tmpElem || !dstElem || srcElem != dstElem || srcElem != tmpElem)
       return emitOpError() << "expects src, tmp, and dst to have the same element type";
-    unsigned elemBytes = srcElem.getIntOrFloatBitWidth() / 8;
+    unsigned elemBytes = getPTOStorageElemByteSize(srcElem);
+    if (elemBytes == 0)
+      return emitOpError() << "failed to get transpose element size";
     if (elemBytes != 1 && elemBytes != 2 && elemBytes != 4)
       return emitOpError() << "expects transpose element size to be 1, 2, or 4 bytes";
     auto isAllowedWidthType = [&](Type ty) {
       if (elemBytes == 4)
         return ty.isInteger(32) || ty.isF32();
       if (elemBytes == 2)
         return ty.isInteger(16) || ty.isF16() || ty.isBF16();
       return ty.isInteger(8);
     };
     if (!isAllowedWidthType(srcElem))
       return emitOpError() << "expects transpose element type to match the supported set for its width";
     auto checkAlignedMajor = [&](Type ty, StringRef name) -> LogicalResult {
       auto tb = mlir::dyn_cast<pto::TileBufType>(ty);
       if (!tb)
exec
/bin/bash -lc "git diff --unified=20 origin/main...HEAD -- lib/PTO/IR/PTO.cpp | sed -n '1115,1165p'" in /tmp/ptoas-pr-review-monitor/runs/20260521_153004_pr691/repo succeeded in 0ms:
-  }
+  int64_t elemBytes = static_cast<int64_t>(getElemByteSize(elemTy));
   if (elemBytes <= 0) return failure();
 
   if (fr == 1024) {
     innerRows = 16;
     innerCols = 16;
     return success();
   }
   if (fr == 32) {
     innerRows = 16;
     innerCols = 2;
     return success();
   }
   if (fr == 512) {
     if (sl == 1) {
       innerRows = 16;
       innerCols = 32 / elemBytes;
       return success();
     }
exec
/bin/bash -lc "grep -n \"verifyTFillPadLike\\|TFillPadOp::verify\\|TExtractOp::verify\\|TPadOp::verify\" -n lib/PTO/IR/PTO.cpp" in /tmp/ptoas-pr-review-monitor/runs/20260521_153004_pr691/repo succeeded in 0ms:
5050:mlir::LogicalResult mlir::pto::TExtractOp::verify() {
5552:static mlir::LogicalResult verifyTFillPadLike(Operation *op, Type srcTy, Type dstTy,
5646:mlir::LogicalResult mlir::pto::TFillPadOp::verify() {
5647:  return verifyTFillPadLike(getOperation(), getSrc().getType(), getDst().getType(),
5652:  return verifyTFillPadLike(getOperation(), getSrc().getType(), getDst().getType(),
5657:  return verifyTFillPadLike(getOperation(), getSrc().getType(), getDst().getType(),
exec
/bin/bash -lc "sed -n '5552,5675p' lib/PTO/IR/PTO.cpp" in /tmp/ptoas-pr-review-monitor/runs/20260521_153004_pr691/repo succeeded in 0ms:
static mlir::LogicalResult verifyTFillPadLike(Operation *op, Type srcTy, Type dstTy,
                                              bool allowDstExpand,
                                              llvm::StringRef opName) {
  if (!isPTOShapedLike(srcTy) || !isPTOShapedLike(dstTy))
    return op->emitError("expects src/dst to be PTO shaped-like types");

  auto srcShape = getShapeVec(srcTy);
  auto dstShape = getShapeVec(dstTy);
  if (srcShape.size() != 2 || dstShape.size() != 2)
    return op->emitError("expects rank-2 shaped types for src/dst");

  auto srcElem = getElemTy(srcTy);
  auto dstElem = getElemTy(dstTy);

  auto getElemBytes = [](mlir::Type t) -> int64_t {
    unsigned elemBytes = getPTOStorageElemByteSize(t);
    return elemBytes == 0 ? -1 : static_cast<int64_t>(elemBytes);
  };

  int64_t srcB = getElemBytes(srcElem);
exec
/bin/bash -lc "grep -R \"tfillpad\\|TFILLPAD\" -n docs include/PTO/IR/PTOOps.td test/lit/pto | head -120" in /tmp/ptoas-pr-review-monitor/runs/20260521_153004_pr691/repo succeeded in 1ms:

ERROR: Selected model is at capacity. Please try a different model.
Warning: no last agent message; wrote empty content to /tmp/ptoas-pr-review-monitor/runs/20260521_153004_pr691/codex_last_message.json
tokens used
900,991
===== END STAGE codex-review rc=1 @ 2026-05-21 15:45:53 =====

@FangRui0 FangRui0 force-pushed the support_low_precision branch from cc66d99 to 417e9d7 Compare May 21, 2026 11:49
@FangRui0
Copy link
Copy Markdown
Contributor Author

/review

@reedhecre
Copy link
Copy Markdown

Manual Codex Review

该评论由 /review 手动触发。

Summary

收到 /review,正在执行 Codex review。

Findings

Review in progress.

@reedhecre
Copy link
Copy Markdown

Manual Codex Review

该评论由 /review 手动触发。

Summary

PR #691 has three real issues: FP4 tcvt shape validation is inconsistent with packed-dimension semantics, the new memref element interface is registered twice, and memref-form tprefetch skips the new arch/type checks.

Findings

  1. P1 FP4 `tcvt` accepts packed shapes that lower to mismatched tile extents lib/PTO/IR/PTO.cpp:4851

TCvtOp::verify still treats src/dst as shape-compatible only when their raw rows/cols are equal. That is wrong for the new packed FP4 types, because this PR also documents !pto.f4E1M2x2/!pto.f4E2M1x2 dimensions as packed extents rather than logical scalar extents. The mismatch is visible in lowering: EmitC expands the packed FP4 dimension by 2 (renderTileTemplateDim), so a verifier-accepted case like bf16 16x16 -> !pto.f4E1M2x2 16x16 lowers to TCVT between different tile extents (16x16 vs 16x32). That is a high-probability wrong-code / no-overload situation for the main feature added in this PR.

  1. P2 Dialect initialization registers `MemRefElementTypeInterface` twice for the new types lib/PTO/IR/PTO.cpp:2049

The three new low-precision types already declare MemRefElementTypeInterface in ODS, so the generated type classes register that interface statically. PTODialect::initialize() then attaches external MemRefElementTypeInterface models for the same types a second time. In MLIR this interface registration is expected to happen once per type; assertion-enabled builds typically abort during dialect initialization on the duplicate registration. That makes ptoas fail before parsing any IR in Debug/ASan-style CI configurations.

  1. P2 Memref-form `tprefetch` no longer enforces the new low-precision arch checks lib/PTO/IR/PTO.cpp:2404

TPrefetchOp::verify now routes through dispatchVerifierByArch, but dispatchVerifierByArch immediately returns success whenever any operand is a memref (shouldBypassDecodedMemrefVerifier). tprefetch is also the one verifier here that explicitly accepts memref forms, so the new low-precision A2/A3/A5 checks never run for that post-lowering form anymore. As a result, an unsupported memref-form pto.tprefetch (for example A3 + !pto.hif8) now passes verification and only fails later in codegen/runtime, which breaks the new “A5-only” contract for low-precision prefetch.

…verage

Signed-off-by: FangRui <fangrui_95@163.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants