Port frontend tile fusion to EmitC mainline #679
Code Review
This pull request ports frontend tile fusion capabilities to the EmitC mainline, introducing pto.fusion_region and pto.yield operations along with passes for analysis, planning, scheduling, and region formation. The implementation ensures fusion occurs on tile-native PTO IR and is preserved through the shared mainline passes until a final flattening stage. Review feedback correctly identifies non-deterministic logic in the liveness analysis where lastLocalConsumer is assigned without considering block order, as Value::getUses() returns uses in an arbitrary sequence. An improvement to string reservation in the C++ post-processing logic was also suggested to optimize performance by reducing reallocations.
if (nodeIt == computeNodeByOp.end())
  continue;
appendUniqueNode(state.live.consumerNodes, nodeIt->second);
state.live.lastLocalConsumer = nodeIt->second;
The assignment of lastLocalConsumer here depends on iteration order, because Value::getUses() yields uses in an order that is not guaranteed to match block order. Since node.id is assigned in block order, update lastLocalConsumer only when the current node.id is greater than the previously recorded one.
unsigned consumerId = nodeIt->second;
appendUniqueNode(state.live.consumerNodes, consumerId);
if (!state.live.lastLocalConsumer || consumerId > *state.live.lastLocalConsumer)
  state.live.lastLocalConsumer = consumerId;

if (nodeIt == computeNodeByOp.end())
  continue;
appendUniqueNode(writeLive.consumerNodes, nodeIt->second);
writeLive.lastLocalConsumer = nodeIt->second;
Similar to the liveness finalization above, lastLocalConsumer for write instances should be updated using a maximum check to ensure it correctly identifies the last consumer in block order, regardless of the iteration order of getUses().
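The maximum check described in these comments can be tried out in isolation. The following is a minimal sketch, not the PR's code: `Liveness`, `recordConsumer`, and `lastConsumerOf` are hypothetical stand-ins, and it assumes, as the review states, that node ids are assigned in block order.

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Hypothetical stand-in for the liveness record; only the two fields
// discussed in the review are modeled.
struct Liveness {
  std::vector<unsigned> consumerNodes;
  std::optional<unsigned> lastLocalConsumer;
};

// Record one consumer node id. Ids are assumed to follow block order, so the
// largest id seen so far is the last consumer, whatever the visit order was.
void recordConsumer(Liveness &live, unsigned consumerId) {
  bool seen = false;
  for (unsigned id : live.consumerNodes)
    seen = seen || (id == consumerId);
  if (!seen)
    live.consumerNodes.push_back(consumerId);
  if (!live.lastLocalConsumer || consumerId > *live.lastLocalConsumer)
    live.lastLocalConsumer = consumerId;
}

// Feed consumer ids in the given visit order and return the recorded last
// consumer. Expects a non-empty visit order.
unsigned lastConsumerOf(const std::vector<unsigned> &visitOrder) {
  Liveness live;
  for (unsigned id : visitOrder)
    recordConsumer(live, id);
  return live.lastLocalConsumer.value();
}
```

With this shape, visiting the same consumers as {3, 7, 5} or as {7, 5, 3} records the same last consumer, which is the property the plain assignment does not have.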
unsigned consumerId = nodeIt->second;
appendUniqueNode(writeLive.consumerNodes, consumerId);
if (!writeLive.lastLocalConsumer || consumerId > *writeLive.lastLocalConsumer)
  writeLive.lastLocalConsumer = consumerId;

}

std::string replacement;
replacement.reserve(callee.size() + lastUseArgs.size() + 32);
To minimize reallocations when constructing the replacement string, consider including the size of the original arguments string (argsRef) in the initial reservation.
- replacement.reserve(callee.size() + lastUseArgs.size() + 32);
+ replacement.reserve(callee.size() + lastUseArgs.size() + argsRef.size() + 32);
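To show why the extra term matters, here is a small self-contained sketch of building such a replacement string. `buildReplacement` is a hypothetical helper and the output layout is only an assumption based on the `[[pto::last_use(... )]] CALLEE(...)` form mentioned in this PR; the contents of the attribute arguments are passed through untouched.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical helper: produces "[[pto::last_use(<lastUseArgs>)]] <callee>(<args>)".
// The real post-processing code may assemble the string differently.
std::string buildReplacement(std::string_view callee,
                             std::string_view lastUseArgs,
                             std::string_view args) {
  std::string replacement;
  // Reserve room for all three variable-length pieces plus the fixed
  // punctuation (22 characters here), so the appends below fit in one
  // allocation.
  replacement.reserve(callee.size() + lastUseArgs.size() + args.size() + 32);
  replacement.append("[[pto::last_use(");
  replacement.append(lastUseArgs);
  replacement.append(")]] ");
  replacement.append(callee);
  replacement.push_back('(');
  replacement.append(args);
  replacement.push_back(')');
  return replacement;
}
```

If the argument list were left out of the reservation, a long `args` string could force a reallocation partway through the appends, which is the cost the suggestion aims to avoid.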
806764b to 59bf8fb
Codex Review
This comment is updated automatically by the review bot.
Summary
Found 2 P2 issues: one class of operators declared as fusible in
Findings
Here, the
15ec7fc to f05f582
f05f582 to 681c574
Summary
Reintroduce frontend tile fusion on the current A5 EmitC mainline behind --enable-op-fusion, but keep the implementation intentionally small:
PTOViewToMemref
pto.last_use directly on scheduled block-local spans
[[pto::last_use(... )]] CALLEE(...)
pto.fusion_region / pto.yield lifecycle in the shared mainline
In other words, this PR keeps the user-visible goal of "frontend op scheduling
FusionRegion-based IR contract from the implementation.
What changed
Driver and pipeline
--enable-op-fusion on the current ptoas driver
--pto-arch=a5 with --pto-level=level2|level3
FusionPlan OpScheduling PTOMarkLastUse PTOViewToMemref
instead of failing compilation
Frontend fusion core
mainline:
FusionAnalysis
FusionOpSemantics
PTOFusionPlan
PTOOpScheduling
rather than wrapping them in a region op
last_use implementation
PTOMarkLastUse as the place that computes pto.last_use
pto.fusion.group_id / pto.fusion.order
the span
last_use per tile operand slot, with the following rules:
0
EmitC last_use output
[[pto::last_use(... )]] CALLEE(...)
PTOToEmitCCppPostprocess
emitted operand order, which keeps the output tile slot at 0 in the final emitted attribute
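As a rough illustration of per-operand-slot last-use marking on a linear span, the sketch below uses a generic backward walk. It is not the rule set used by PTOMarkLastUse, which is not reproduced here: `Op` and `markLastUse` are hypothetical, tiles are modeled as plain names, uses outside the span are ignored, and no special handling of the output slot is attempted.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical model: each op in a span lists the names of its tile operands.
using Op = std::vector<std::string>;

// For every op in the span, return one flag per operand slot: 1 if no later
// op in the span uses the same tile, 0 otherwise.
std::vector<std::vector<int>> markLastUse(const std::vector<Op> &span) {
  std::vector<std::vector<int>> flags(span.size());
  std::unordered_set<std::string> usedLater;
  // Walk backwards so "used later" is known when each op is visited.
  for (std::size_t i = span.size(); i-- > 0;) {
    const Op &op = span[i];
    flags[i].assign(op.size(), 0);
    for (std::size_t slot = 0; slot < op.size(); ++slot)
      flags[i][slot] = usedLater.count(op[slot]) ? 0 : 1;
    // Insert only after all slots of this op are classified, so a tile that
    // appears twice in one op gets the same flag in both slots.
    for (const std::string &tile : op)
      usedLater.insert(tile);
  }
  return flags;
}
```

For a span of two ops using tiles (a, b) and then (a, c), this marks b in the first op and both a and c in the second op as last uses, while a in the first op is not.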
Explicit non-goals / removed scope
pto.fusion_region
pto.yield
PTOFusionRegionGen
PTOFlattenFusionRegion
PTOViewToMemref, memory planning, reserved-buffer resolution, sync insertion, or tile-handle materialization
Why this shape
The original larger port bundled three concerns together:
last_use emission
For the current goal, only (1) and (3) are essential. This PR keeps the useful part of the feature and localizes the extra complexity to PTOMarkLastUse, instead of requiring multiple existing shared passes to understand and preserve a new region lifecycle.
Testing
Added focused tile-fusion coverage for:
treshape boundary
treshape bridge
last_use:
[[pto::last_use(... )]] emission
pto.fusion_region / pto.yield
Focused verification run:
llvm-lit -sv build/test/lit/tile_fusion