Port frontend tile fusion to EmitC mainline by Zhendong404 · Pull Request #679 · hw-native-sys/PTOAS

Zhendong404 · 2026-05-16T06:21:31Z

Summary

Reintroduce frontend tile fusion on the current A5 EmitC mainline behind
--enable-op-fusion, but keep the implementation intentionally small:

- run fusion planning and scheduling on tile-native PTO IR before
  PTOViewToMemref
- mark fused tile ops with pto.last_use directly on scheduled block-local
  spans
- preserve the final EmitC contract by emitting
  [[pto::last_use(... )]] CALLEE(...)
- do not introduce or preserve a pto.fusion_region / pto.yield
  lifecycle in the shared mainline

In other words, this PR keeps the user-visible goal of "frontend op scheduling
+ final last_use emission", while removing the larger FusionRegion-based IR
contract from the implementation.

What changed

Driver and pipeline

- add --enable-op-fusion on the current ptoas driver
- gate it to --pto-arch=a5 with --pto-level=level2|level3
- run the frontend fusion core on tile-native PTO IR:
  - FusionPlan
  - OpScheduling
  - PTOMarkLastUse
- keep this pipeline before PTOViewToMemref
- leave unsupported configurations on the ordinary unfused path with warnings
  instead of failing compilation
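
The gating rule above can be sketched roughly as follows. This is a minimal illustration, not the ptoas driver code: the `DriverOptions` struct, the `shouldRunOpFusion` helper, and the warning texts are all hypothetical names introduced here; only the flag names and accepted values come from the description.

```cpp
#include <string>
#include <vector>

// Hypothetical option bundle; field names are illustrative only.
struct DriverOptions {
  bool enableOpFusion = false;  // --enable-op-fusion
  std::string arch;             // value of --pto-arch
  std::string level;            // value of --pto-level
};

// Returns true when the frontend fusion pipeline should run.
// Unsupported configurations record a warning and return false, so the
// caller continues on the ordinary unfused path instead of failing.
bool shouldRunOpFusion(const DriverOptions &opts,
                       std::vector<std::string> &warnings) {
  if (!opts.enableOpFusion)
    return false;
  if (opts.arch != "a5") {
    warnings.push_back("--enable-op-fusion ignored: requires --pto-arch=a5");
    return false;
  }
  if (opts.level != "level2" && opts.level != "level3") {
    warnings.push_back(
        "--enable-op-fusion ignored: requires --pto-level=level2|level3");
    return false;
  }
  return true;
}
```

The point of the shape is that an unsupported configuration is never an error: the helper only decides whether the extra passes are scheduled.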

Frontend fusion core

- port the tile-fusion planning/scheduling support needed on the current
  mainline:
  - FusionAnalysis
  - FusionOpSemantics
  - PTOFusionPlan
  - PTOOpScheduling
- represent accepted fusion groups as contiguous scheduled spans in a block
  rather than wrapping them in a region op
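
To illustrate what "contiguous scheduled span" means without a region op, here is a small standalone sketch. It models a block as a vector of optional group ids (standing in for pto.fusion.group_id) and is an assumption-laden simplification: `Span` and `collectSpans` are hypothetical names, and the real passes operate on MLIR operations rather than plain vectors.

```cpp
#include <cstddef>
#include <map>
#include <optional>
#include <vector>

// Inclusive range of block positions occupied by one fusion group.
struct Span {
  std::size_t first;
  std::size_t last;
};

// groupIds[i] is the group id of the op at block position i, or nullopt
// for ops that belong to no group. Returns one span per group, or nullopt
// if some group is split by an intervening op.
std::optional<std::map<int, Span>>
collectSpans(const std::vector<std::optional<int>> &groupIds) {
  std::map<int, Span> spans;
  for (std::size_t i = 0; i < groupIds.size(); ++i) {
    if (!groupIds[i])
      continue;
    int group = *groupIds[i];
    auto it = spans.find(group);
    if (it == spans.end()) {
      spans.emplace(group, Span{i, i});
      continue;
    }
    if (it->second.last + 1 != i)
      return std::nullopt;  // gap inside the group: span is split
    it->second.last = i;
  }
  return spans;
}
```

With this representation a group is just a range of positions in its parent block, so downstream passes see ordinary ops and need no knowledge of a region lifecycle.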

`last_use` implementation

- introduce PTOMarkLastUse as the place that computes pto.last_use
- make the analysis span-based instead of region/yield-based:
  - collect each contiguous scheduled group span from
    pto.fusion.group_id / pto.fusion.order
  - compute last-use per tile operand slot inside that span
  - leave a slot's bit unset if the tile value is used later in the same span
  - also leave it unset if the tile value is used later in the parent block
    after the span
- encode last_use per tile operand slot, with the following rules:
  - scalar operands do not occupy slots
  - DPS init / output tile slots are preserved but always stay 0
  - repeated SSA tile operands are evaluated independently per slot
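
The slot rules above can be sketched on a toy model of a block. This is not the PTOMarkLastUse implementation: `Operand`, `Op`, `usedAfter`, and `markLastUse` are hypothetical names, values are plain integers instead of SSA values, and the sketch assumes that any later tile operand occurrence (including as an init) counts as a later use and that "evaluated independently" means each repeated slot runs the same check.

```cpp
#include <cstddef>
#include <vector>

struct Operand {
  int value;    // toy SSA value id (ignored for scalars)
  bool isTile;  // scalars do not occupy a slot
  bool isInit;  // DPS init / output slot: kept, but always 0
};

struct Op {
  std::vector<Operand> operands;
};

// True if `value` appears as a tile operand of any op after position `pos`
// in the block, whether still inside the span or after it.
static bool usedAfter(const std::vector<Op> &block, std::size_t pos,
                      int value) {
  for (std::size_t j = pos + 1; j < block.size(); ++j)
    for (const Operand &o : block[j].operands)
      if (o.isTile && o.value == value)
        return true;
  return false;
}

// Returns one bit vector per op in the span [first, last], with one bit
// per tile operand slot in operand order.
std::vector<std::vector<int>> markLastUse(const std::vector<Op> &block,
                                          std::size_t first,
                                          std::size_t last) {
  std::vector<std::vector<int>> masks;
  for (std::size_t i = first; i <= last && i < block.size(); ++i) {
    std::vector<int> bits;
    for (const Operand &o : block[i].operands) {
      if (!o.isTile)
        continue;  // scalar: no slot
      if (o.isInit) {
        bits.push_back(0);  // output slot stays 0
        continue;
      }
      bits.push_back(usedAfter(block, i, o.value) ? 0 : 1);
    }
    masks.push_back(bits);
  }
  return masks;
}
```

Because the check scans to the end of the parent block, a use after the span clears the bit just as a use inside the span does, which is what lets the analysis work without a region boundary or yield.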

EmitC `last_use` output

- keep the final output contract as [[pto::last_use(... )]] CALLEE(...)
- lower marked fused tile ops through a PTOAS-local marker callee path in
  PTOToEmitC
- rewrite that marker to the final C++ attribute spelling in
  CppPostprocess
- fix marker bit ordering so single-DPS-init tile intrinsics follow the final
  emitted operand order, which keeps the output tile slot at 0 in the final
  emitted attribute
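
As a rough picture of the final spelling step, the sketch below assembles the attribute-prefixed call from its parts. It is only an illustration: `spellLastUseCall` is a hypothetical helper, the marker parsing done by CppPostprocess is not shown, and the assumption that the attribute arguments are comma-separated 0/1 bits (one per tile slot) is mine, since the description elides them as `(... )`.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Builds "[[pto::last_use(b0, b1, ...)]] CALLEE(args)".
// `bits` holds one 0/1 entry per tile operand slot (assumed encoding);
// `args` is the already-printed argument list of the call.
std::string spellLastUseCall(const std::string &callee,
                             const std::vector<int> &bits,
                             const std::string &args) {
  std::string lastUseArgs;
  for (std::size_t i = 0; i < bits.size(); ++i) {
    if (i != 0)
      lastUseArgs += ", ";
    lastUseArgs += bits[i] ? '1' : '0';
  }

  std::string replacement;
  // Reserve for every piece up front to avoid reallocations.
  replacement.reserve(callee.size() + lastUseArgs.size() + args.size() + 32);
  replacement += "[[pto::last_use(";
  replacement += lastUseArgs;
  replacement += ")]] ";
  replacement += callee;
  replacement += "(";
  replacement += args;
  replacement += ")";
  return replacement;
}
```

For example, under these assumptions `spellLastUseCall("TADD", {0, 0, 1}, "dst, a, b")` yields `[[pto::last_use(0, 0, 1)]] TADD(dst, a, b)`, where `TADD` and the argument names are placeholders.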

Explicit non-goals / removed scope

- no pto.fusion_region
- no pto.yield
- no PTOFusionRegionGen
- no PTOFlattenFusionRegion
- no shared-pass preservation contract for fusion-region lifecycle through
  PTOViewToMemref, memory planning, reserved-buffer resolution, sync
  insertion, or tile-handle materialization

Why this shape

The original larger port bundled three concerns together:

1. frontend fusion planning/scheduling
2. region formation / flattening
3. final EmitC last_use emission

For the current goal, only (1) and (3) are essential. This PR keeps the
useful part of the feature and localizes the extra complexity to
PTOMarkLastUse, instead of requiring multiple existing shared passes to
understand and preserve a new region lifecycle.

Testing

Added focused tile-fusion coverage for:

- fusion planning:
  - join
  - diamond
  - interleaved join
  - treshape boundary
  - dynamic-shape negative case
- scheduling:
  - basic compaction
  - treshape bridge
  - pure-op bridge
  - negative region / call / SSA boundary cases
- last_use:
  - slot-mask encoding
  - repeated SSA operands
  - post-span later-use blocking
- end-to-end EmitC output:
  - final [[pto::last_use(... )]] emission
  - absence of residual pto.fusion_region / pto.yield
- control surface:
  - CLI visibility / gating
  - non-fused fallback behavior
  - adapter placement in level2 and level3 shared lowering paths

Focused verification run:

llvm-lit -sv build/test/lit/tile_fusion

gemini-code-assist

Code Review

This pull request ports frontend tile fusion capabilities to the EmitC mainline, introducing pto.fusion_region and pto.yield operations along with passes for analysis, planning, scheduling, and region formation. The implementation ensures fusion occurs on tile-native PTO IR and is preserved through the shared mainline passes until a final flattening stage. Review feedback correctly identifies non-deterministic logic in the liveness analysis where lastLocalConsumer is assigned without considering block order, as Value::getUses() returns uses in an arbitrary sequence. An improvement to string reservation in the C++ post-processing logic was also suggested to optimize performance by reducing reallocations.

gemini-code-assist · 2026-05-16T06:24:01Z

+        if (nodeIt == computeNodeByOp.end())
+          continue;
+        appendUniqueNode(state.live.consumerNodes, nodeIt->second);
+        state.live.lastLocalConsumer = nodeIt->second;


The assignment of lastLocalConsumer here is non-deterministic because Value::getUses() returns operands in an arbitrary order. Since node.id is assigned in block order, you should only update lastLocalConsumer if the current node.id is greater than the previously recorded one.

        unsigned consumerId = nodeIt->second;
        appendUniqueNode(state.live.consumerNodes, consumerId);
        if (!state.live.lastLocalConsumer ||
            consumerId > *state.live.lastLocalConsumer)
          state.live.lastLocalConsumer = consumerId;
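
The reasoning behind this suggestion can be checked in isolation: keeping a running maximum makes the result independent of the order in which uses are visited. The sketch below is a standalone illustration with a hypothetical `lastConsumer` helper; it assumes, as the comment states, that node ids increase in block order.

```cpp
#include <optional>
#include <vector>

// Returns the largest consumer id seen, or nullopt if there are none.
// Because it keeps a maximum rather than the most recently visited id,
// the result does not depend on the iteration order of the input.
std::optional<unsigned> lastConsumer(const std::vector<unsigned> &consumerIds) {
  std::optional<unsigned> last;
  for (unsigned id : consumerIds)
    if (!last || id > *last)
      last = id;
  return last;
}
```

A plain assignment inside the loop would instead return whichever id happened to be visited last, which is the order-dependence the review comment points out.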

gemini-code-assist · 2026-05-16T06:24:01Z

+        if (nodeIt == computeNodeByOp.end())
+          continue;
+        appendUniqueNode(writeLive.consumerNodes, nodeIt->second);
+        writeLive.lastLocalConsumer = nodeIt->second;


Similar to the liveness finalization above, lastLocalConsumer for write instances should be updated using a maximum check to ensure it correctly identifies the last consumer in block order, regardless of the iteration order of getUses().

        unsigned consumerId = nodeIt->second;
        appendUniqueNode(writeLive.consumerNodes, consumerId);
        if (!writeLive.lastLocalConsumer ||
            consumerId > *writeLive.lastLocalConsumer)
          writeLive.lastLocalConsumer = consumerId;

gemini-code-assist · 2026-05-16T06:24:01Z

+    }
+
+    std::string replacement;
+    replacement.reserve(callee.size() + lastUseArgs.size() + 32);


To minimize reallocations when constructing the replacement string, consider including the size of the original arguments string (argsRef) in the initial reservation.

Suggested change

-    replacement.reserve(callee.size() + lastUseArgs.size() + 32);
+    replacement.reserve(callee.size() + lastUseArgs.size() + argsRef.size() + 32);

reedhecre · 2026-05-16T06:32:59Z

Codex Review

This comment is updated automatically by the review bot.

PR: Port frontend tile fusion to EmitC mainline #679
Author: Zhendong404
Base/Head: main / feature-tile-fusion-frontend
Head SHA: 681c5747b682
Trigger: new commits pushed to the PR
Generated At: 2026-05-18T14:38:29Z
Previous Head SHA: 15ec7fc3a355
Status: completed

Summary

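Found 2 P2 issues: a class of operators declared fusible loses its last_use/fusion metadata after pto-view-to-memref; in addition, the planner forms groups across unschedulable boundaries, so turning on --enable-op-fusion converts programs that previously compiled into hard errors.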

Findings

P2 The TDiv/TMax/TMin family loses frontend fusion metadata in PTOViewToMemref lib/PTO/Transforms/PTOViewToMemref.cpp:2505

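PTOMarkLastUse attaches pto.fusion.* and pto.last_use to planned fusion members before pto-view-to-memref, but this code still rebuilds TDivOp directly with replaceOpWithNewOp. The same omission occurs for TDivSOp, TMaxOp, TMaxSOp, TMinOp, and TMinSOp. All of these operators are listed as fusible by isCurrentlyPlannableOp, and the later PTOToEmitC relies on pto.last_use to generate [[pto::last_use(...)]]. Once the attributes are dropped at this layer, these fused operators silently degrade into ordinary EmitC calls without last-use annotations, which is inconsistent with the mainline contract this PR establishes for the other operators.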

P2 FusionPlan forms groups across unschedulable boundaries and turns `--enable-op-fusion` into a compilation failure lib/PTO/Transforms/TileFusion/PTOFusionPlan.cpp:352

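Here evaluateAppend only requires the same iteration domain and any dataflow connection to the existing group; it does not take scheduling boundaries such as call/region/SSA blockers into account as planning conditions, and planBlock repeatedly absorbs candidate nodes across the whole block. As a result, the planner first tags such nodes with the same pto.fusion.group_id, the scheduling stage then cannot compact them into a contiguous span, and finally PTOMarkLastUse exits with an error when it finds a split span. The PR's own op_scheduling_negative_call_boundary.pto, op_scheduling_negative_region.pto, and op_scheduling_negative_ssa.pto tests already pin down this failure. This means that once a user turns on --enable-op-fusion, a previously valid kernel may hard-fail because of an unfusible boundary, instead of conservatively staying on the unfused code path.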

Zhendong404 marked this pull request as ready for review May 16, 2026 06:22

gemini-code-assist Bot reviewed May 16, 2026


Zhendong404 force-pushed the feature-tile-fusion-frontend branch 2 times, most recently from 806764b to 59bf8fb Compare May 16, 2026 06:30

Zhendong404 force-pushed the feature-tile-fusion-frontend branch 2 times, most recently from 15ec7fc to f05f582 Compare May 18, 2026 14:03

feature(tile fusion): support tile op scheduling and marking last_use

681c574

Zhendong404 force-pushed the feature-tile-fusion-frontend branch from f05f582 to 681c574 Compare May 18, 2026 14:04

Zhendong404 wants to merge 1 commit into hw-native-sys:main from Zhendong404:feature-tile-fusion-frontend

2 participants
