Port frontend tile fusion to EmitC mainline #679

Open
Zhendong404 wants to merge 1 commit into hw-native-sys:main from Zhendong404:feature-tile-fusion-frontend
@Zhendong404 Zhendong404 commented May 16, 2026

Summary

Reintroduce frontend tile fusion on the current A5 EmitC mainline behind
--enable-op-fusion, but keep the implementation intentionally small:

  • run fusion planning and scheduling on tile-native PTO IR before
    PTOViewToMemref
  • mark fused tile ops with pto.last_use directly on scheduled block-local
    spans
  • preserve the final EmitC contract by emitting
    [[pto::last_use(... )]] CALLEE(...)
  • do not introduce or preserve a pto.fusion_region / pto.yield
    lifecycle in the shared mainline

In other words, this PR keeps the user-visible goal of "frontend op scheduling
+ final last_use emission", while removing the larger FusionRegion-based IR
contract from the implementation.

What changed

Driver and pipeline

  • add --enable-op-fusion on the current ptoas driver
  • gate it to --pto-arch=a5 with --pto-level=level2|level3
  • run the frontend fusion core on tile-native PTO IR:
    • FusionPlan
    • OpScheduling
    • PTOMarkLastUse
  • keep this pipeline before PTOViewToMemref
  • leave unsupported configurations on the ordinary unfused path with warnings
    instead of failing compilation
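
The gating described above can be sketched as a small decision function. This is only an illustration under stated assumptions: `DriverOptions`, `FusionDecision`, `decideFusion`, and the warning strings are hypothetical names, not the actual ptoas driver code; only the flag names and accepted values come from the description.

```cpp
#include <cassert>
#include <string>

// Hypothetical model of the driver options named in the PR description.
struct DriverOptions {
  bool enableOpFusion; // --enable-op-fusion
  std::string arch;    // --pto-arch
  std::string level;   // --pto-level
};

// Hypothetical result type: whether to run the frontend fusion passes,
// plus an optional warning for unsupported configurations.
struct FusionDecision {
  bool runFusion;
  std::string warning; // empty when there is nothing to report
};

// Unsupported configurations stay on the ordinary unfused path and only
// produce a warning; they never fail compilation.
FusionDecision decideFusion(const DriverOptions &opts) {
  FusionDecision d{false, ""};
  if (!opts.enableOpFusion)
    return d;
  if (opts.arch != "a5") {
    d.warning = "--enable-op-fusion ignored: requires --pto-arch=a5";
    return d;
  }
  if (opts.level != "level2" && opts.level != "level3") {
    d.warning =
        "--enable-op-fusion ignored: requires --pto-level=level2|level3";
    return d;
  }
  d.runFusion = true;
  return d;
}
```

When `runFusion` is true, the driver would schedule FusionPlan, OpScheduling, and PTOMarkLastUse ahead of PTOViewToMemref; otherwise the pipeline is left unchanged.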

Frontend fusion core

  • port the tile-fusion planning/scheduling support needed on the current
    mainline:
    • FusionAnalysis
    • FusionOpSemantics
    • PTOFusionPlan
    • PTOOpScheduling
  • represent accepted fusion groups as contiguous scheduled spans in a block
    rather than wrapping them in a region op

last_use implementation

  • introduce PTOMarkLastUse as the place that computes pto.last_use
  • make the analysis span-based instead of region/yield-based:
    • collect each contiguous scheduled group span from
      pto.fusion.group_id / pto.fusion.order
    • compute last-use per tile operand slot inside that span
    • block a slot's last_use bit if the tile value is used later in the same
      span
    • also block the bit if the tile value is used later in the parent block
      after the span
  • encode last_use per tile operand slot, with the following rules:
    • scalar operands do not occupy slots
    • DPS init / output tile slots are preserved but always stay 0
    • repeated SSA tile operands are evaluated independently per slot

EmitC last_use output

  • keep the final output contract as [[pto::last_use(... )]] CALLEE(...)
  • lower marked fused tile ops through a PTOAS-local marker callee path in
    PTOToEmitC
  • rewrite that marker to the final C++ attribute spelling in
    CppPostprocess
  • fix marker bit ordering so single-DPS-init tile intrinsics follow the final
    emitted operand order, which keeps the output tile slot at 0 in the final
    emitted attribute
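
As an illustration of the final output contract, the string assembly in the post-processing step might look roughly like the sketch below. The function name `buildLastUseCall` is hypothetical, the variable names are borrowed from the review snippet on CppPostprocess.cpp, and the textual format of the last_use arguments is an assumption; the marker callee that gets rewritten is not shown in this PR page and is therefore left out.

```cpp
#include <cassert>
#include <string>

// Builds "[[pto::last_use(<lastUseArgs>)]] <callee>(<argsRef>)".
// All three inputs are assumed to be already-extracted text fragments.
std::string buildLastUseCall(const std::string &callee,
                             const std::string &lastUseArgs,
                             const std::string &argsRef) {
  std::string replacement;
  // Reserve for every piece up front to avoid repeated reallocation.
  replacement.reserve(callee.size() + lastUseArgs.size() + argsRef.size() +
                      32);
  replacement += "[[pto::last_use(";
  replacement += lastUseArgs;
  replacement += ")]] ";
  replacement += callee;
  replacement += "(";
  replacement += argsRef;
  replacement += ")";
  return replacement;
}
```

Calling `buildLastUseCall("CALLEE", "1, 0", "a, b")` yields `[[pto::last_use(1, 0)]] CALLEE(a, b)`.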

Explicit non-goals / removed scope

  • no pto.fusion_region
  • no pto.yield
  • no PTOFusionRegionGen
  • no PTOFlattenFusionRegion
  • no shared-pass preservation contract for fusion-region lifecycle through
    PTOViewToMemref, memory planning, reserved-buffer resolution, sync
    insertion, or tile-handle materialization

Why this shape

The original larger port bundled three concerns together:

  1. frontend fusion planning/scheduling
  2. region formation / flattening
  3. final EmitC last_use emission

For the current goal, only (1) and (3) are essential. This PR keeps the
useful part of the feature and localizes the extra complexity to
PTOMarkLastUse, instead of requiring multiple existing shared passes to
understand and preserve a new region lifecycle.

Testing

Added focused tile-fusion coverage for:

  • fusion planning:
    • join
    • diamond
    • interleaved join
    • treshape boundary
    • dynamic-shape negative case
  • scheduling:
    • basic compaction
    • treshape bridge
    • pure-op bridge
    • negative region / call / SSA boundary cases
  • last_use:
    • slot-mask encoding
    • repeated SSA operands
    • post-span later-use blocking
  • end-to-end EmitC output:
    • final [[pto::last_use(... )]] emission
    • absence of residual pto.fusion_region / pto.yield
  • control surface:
    • CLI visibility / gating
    • non-fused fallback behavior
    • adapter placement in level2 and level3 shared lowering paths

Focused verification run:

  • llvm-lit -sv build/test/lit/tile_fusion

@Zhendong404 Zhendong404 marked this pull request as ready for review May 16, 2026 06:22

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request ports frontend tile fusion capabilities to the EmitC mainline, introducing pto.fusion_region and pto.yield operations along with passes for analysis, planning, scheduling, and region formation. The implementation ensures fusion occurs on tile-native PTO IR and is preserved through the shared mainline passes until a final flattening stage. Review feedback correctly identifies non-deterministic logic in the liveness analysis where lastLocalConsumer is assigned without considering block order, as Value::getUses() returns uses in an arbitrary sequence. An improvement to string reservation in the C++ post-processing logic was also suggested to optimize performance by reducing reallocations.

        if (nodeIt == computeNodeByOp.end())
          continue;
        appendUniqueNode(state.live.consumerNodes, nodeIt->second);
        state.live.lastLocalConsumer = nodeIt->second;

medium

The assignment of lastLocalConsumer here is non-deterministic because Value::getUses() returns operands in an arbitrary order. Since node.id is assigned in block order, you should only update lastLocalConsumer if the current node.id is greater than the previously recorded one.

        unsigned consumerId = nodeIt->second;
        appendUniqueNode(state.live.consumerNodes, consumerId);
        if (!state.live.lastLocalConsumer || consumerId > *state.live.lastLocalConsumer)
          state.live.lastLocalConsumer = consumerId;
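
The effect of the max check can be seen in a standalone model. This is a sketch with hypothetical names (`Liveness`, `finalize`); it assumes, as the comment states, that consumer ids grow in block order, and it feeds the ids in an arbitrary order to mimic an unordered use list.

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Hypothetical stand-in for the liveness record in the snippet above.
struct Liveness {
  std::optional<unsigned> lastLocalConsumer;
};

// Visits consumer ids in whatever order the use list yields them. The max
// check makes the result the highest block-order id regardless of order.
Liveness finalize(const std::vector<unsigned> &consumerIdsInUseListOrder) {
  Liveness live;
  for (unsigned id : consumerIdsInUseListOrder)
    if (!live.lastLocalConsumer || id > *live.lastLocalConsumer)
      live.lastLocalConsumer = id;
  return live;
}
```

With plain assignment the result would be whichever id happened to be visited last; with the check, `{3, 7, 5}` and `{7, 5, 3}` both produce 7.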

        if (nodeIt == computeNodeByOp.end())
          continue;
        appendUniqueNode(writeLive.consumerNodes, nodeIt->second);
        writeLive.lastLocalConsumer = nodeIt->second;

medium

Similar to the liveness finalization above, lastLocalConsumer for write instances should be updated using a maximum check to ensure it correctly identifies the last consumer in block order, regardless of the iteration order of getUses().

        unsigned consumerId = nodeIt->second;
        appendUniqueNode(writeLive.consumerNodes, consumerId);
        if (!writeLive.lastLocalConsumer || consumerId > *writeLive.lastLocalConsumer)
          writeLive.lastLocalConsumer = consumerId;

Comment thread lib/PTO/Transforms/CppPostprocess.cpp Outdated
}

std::string replacement;
replacement.reserve(callee.size() + lastUseArgs.size() + 32);

medium

To minimize reallocations when constructing the replacement string, consider including the size of the original arguments string (argsRef) in the initial reservation.

Suggested change
- replacement.reserve(callee.size() + lastUseArgs.size() + 32);
+ replacement.reserve(callee.size() + lastUseArgs.size() + argsRef.size() + 32);

@Zhendong404 Zhendong404 force-pushed the feature-tile-fusion-frontend branch 2 times, most recently from 806764b to 59bf8fb Compare May 16, 2026 06:30
reedhecre commented May 16, 2026

Codex Review

This comment is updated automatically by the review bot.

  • PR: #679 Port frontend tile fusion to EmitC mainline
  • Author: Zhendong404
  • Base/Head: main / feature-tile-fusion-frontend
  • Head SHA: 681c5747b682
  • Trigger: new commits pushed to the PR
  • Generated At: 2026-05-18T14:38:29Z
  • Previous Head SHA: 15ec7fc3a355
  • Status: completed

Summary

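Found 2 P2 issues: one class of ops declared fusible loses its last_use/fusion metadata after pto-view-to-memref; in addition, the planner forms groups across unschedulable boundaries, so turning on --enable-op-fusion converts previously compilable programs into hard errors.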

Findings

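  1. P2 The TDiv/TMax/TMin family loses frontend fusion metadata in PTOViewToMemref lib/PTO/Transforms/PTOViewToMemref.cpp:2505

PTOMarkLastUse tags planned fusion members with pto.fusion.* / pto.last_use before pto-view-to-memref, but this code still rebuilds TDivOp directly with replaceOpWithNewOp. The same omission appears for TDivSOp, TMaxOp, TMaxSOp, TMinOp, and TMinSOp. All of these ops are listed as fusible by isCurrentlyPlannableOp, and the later PTOToEmitC relies on pto.last_use to generate [[pto::last_use(...)]]. Once the attributes are dropped at this layer, these fused ops silently degrade into ordinary EmitC calls without last-use annotations, which is inconsistent with the mainline contract this PR establishes for the other ops.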

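  2. P2 FusionPlan forms groups across unschedulable boundaries and turns `--enable-op-fusion` into a compilation failure lib/PTO/Transforms/TileFusion/PTOFusionPlan.cpp:352

Here evaluateAppend only requires the same iteration domain and any data-flow connection to the existing group; it does not factor scheduling boundaries such as call/region/SSA blockers into the planning conditions, and planBlock repeatedly absorbs candidate nodes across the entire block. As a result, the planner first assigns these nodes the same pto.fusion.group_id, the scheduling stage then cannot compact them into a contiguous span, and finally PTOMarkLastUse errors out when it finds a split span. The PR's own op_scheduling_negative_call_boundary.pto, op_scheduling_negative_region.pto, and op_scheduling_negative_ssa.pto already pin down this failure, which means that once a user turns on --enable-op-fusion, an originally valid kernel may hard-fail because of an unfusable boundary instead of conservatively staying on the unfused code path.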
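
One conservative response to a split span, sketched here purely as an illustration, is to drop the group instead of erroring. Neither the PR nor the review prescribes this fix. The function name `dropSplitGroups` is hypothetical, and a plain integer per op (with -1 meaning "ungrouped") stands in for the pto.fusion.group_id attribute.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// groupIds[i] models the group id of the i-th op in block order; -1 means
// the op is not in any fusion group. Every group whose members do not form
// one contiguous run is cleared, so its ops fall back to the unfused path.
void dropSplitGroups(std::vector<int> &groupIds) {
  std::map<int, std::pair<std::size_t, std::size_t>> range; // first, last
  std::map<int, std::size_t> count;                         // member count
  for (std::size_t i = 0; i < groupIds.size(); ++i) {
    int g = groupIds[i];
    if (g < 0)
      continue;
    auto it = range.find(g);
    if (it == range.end())
      range[g] = std::make_pair(i, i);
    else
      it->second.second = i;
    ++count[g];
  }
  for (int &g : groupIds) {
    if (g < 0)
      continue;
    const std::pair<std::size_t, std::size_t> &r = range[g];
    // A contiguous group covers exactly as many positions as it has members.
    if (r.second - r.first + 1 != count[g])
      g = -1;
  }
}
```

For the sequence `{0, 0, -1, 0, 1, 1}`, group 0 is interrupted by an ungrouped op and is cleared, while group 1 stays intact.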

@Zhendong404 Zhendong404 force-pushed the feature-tile-fusion-frontend branch 2 times, most recently from 15ec7fc to f05f582 Compare May 18, 2026 14:03
@Zhendong404 Zhendong404 force-pushed the feature-tile-fusion-frontend branch from f05f582 to 681c574 Compare May 18, 2026 14:04