release/v0.6.0 PR-C: eliminate_dead_stores — full backward liveness (-0.4% on calculator.wasm) by avrabe · Pull Request #96 · pulseengine/loom

avrabe · 2026-05-03T14:59:23Z

Summary

Third PR of v0.6.0. Stacks on PR #95 (which stacks on PR #94). When PRs #94 and #95 land, this auto-retargets to main.

A new pass eliminate_dead_stores doing per-position dead-store elimination via backward liveness walking the structured wasm instruction tree. Path-sensitive complement to PR-B (eliminate_dead_locals): catches dead writes even when the target local IS read elsewhere in the function — by computing live-after sets correctly across all wasm structured control flow.

Pick #3 from the v0.6.0 wasm-opt-gap research agent's plan, taken in full (Option C).

Why "full liveness" not the simpler middle option

The user picked the most ambitious scope. The simpler "branch-aware adjacent-LocalSet" middle option would catch the obvious cases but miss the interesting one: write before an if where both arms overwrite. That requires computing live-before-if = live-in-then ∪ live-in-else. Once you do that properly, you've already done full structured-wasm liveness. So full liveness it is.

Algorithm

Backward walk over the instruction tree, propagating a LiveSet = BTreeSet<u32> at each position. A LocalSet/LocalTee is dead iff its target local is not in the live-after set.

Construct	Liveness rule
`Block { body }`	br N targets end-of-block ⇒ break-target liveness = live-after-block
`If { then, else }`	live-before-if = live-in-then ∪ live-in-else
`Loop { body }`	conservative v1: every local read anywhere in body is live throughout body and live just before loop. Avoids fixpoint iteration; sound but loses precision inside loops. Loop fixpoint is a follow-up (~100 LOC).
`Br N`	live ← `label_stack[depth - 1 - N]` (no fall-through)
`BrIf N`	live ∪= `label_stack[...]` (taken vs fall-through)
`BrTable [...]`	live ∪= ⋃ over all targets ∪ default
`Return` / `Unreachable`	live ← ∅ (no continuation)
`Call` / `CallIndirect`	no effect (don't access caller's locals)

The label stack mirrors wasm's nesting: outermost first, innermost last. br N counts from innermost-out, so target = stack[len - 1 - N].

Write-ID tracking (the trickiest engineering bit)

Identifying which LocalSet/LocalTee is dead across two walks (analyze backward, apply forward) needs a stable ID. I use a counter that:

starts at total_writes (pre-counted in a forward sweep)
DECREMENTS during backward analysis as writes are encountered
the decremented value is the write's forward-walk-order ID

The apply phase walks forward, increments a parallel counter, and looks up dead_writes by ID. Because both walks use mirrored deterministic structural recursion, IDs align without tree paths or unsafe pointers.

Trap-effecting instructions (load, store, div)

May trap and end the function early. We compute liveness under the no-trap assumption: writes are removed only if dead on the no-trap continuation. If a trap intervenes, no later instruction observes the local — so removal remains sound. The conservative direction is correct here.

Pipeline order

... → simplify-locals → dead-stores → dead-locals

dead-stores BEFORE dead-locals: a write-only local with dead-only writes becomes a fully-unused local, which dead-locals then drops in the same pipeline run. Synergistic.

Measurement

Workload	Effect
gale_in_baseline.wasm (1.9 KB kernel FFI)	unchanged at 804 B (PR-B already exhausts this workload's pattern — pure write-without-read; PR-C's path-sensitive cases don't appear here)
calculator.wasm (2.3 MB component)	2,337,724 → 2,327,794 bytes (-0.4%, ~10 KB) with `--passes dead-stores` alone. Output validates.

The gale-zero-effect and calculator-real-effect together confirm: PR-C's value is on workloads with branchier dead-store patterns. It scales with workload complexity.

Stack-effect rules (same as PR-B)

Original	Stack effect	Substitute	Why
`LocalSet idx`	`[T] -> []`	`Drop`	Drop also `[T] -> []`
`LocalTee idx`	`[T] -> [T]`	(removed)	Value passes through

The branch_aware_keeps_partial_use test pins this — if we got it wrong, the validator would reject (an uninitialized read is type-error in wasm).

Tests (6 new)

overwritten_in_straight_line — two writes, no read between; first is dead.
preserves_live_writes — single live write must survive untouched.
branch_aware_keeps_partial_use — write before if; if-not-taken path reaches the local.get directly. Write IS LIVE on that path and MUST survive. The hardest case — Drop'ing it would expose an uninitialized read.
both_arms_overwrite — write before if where BOTH arms overwrite. live-before-if = ∅, outer write is dead. Tests union-of-arm-deads.
return_kills_continuation — write followed by Return. live ← ∅ after Return; write is dead.
localtee_dead_removed — dead Tee removed (not Drop'd); stack [T] -> [T] passes through.

All 303 tests pass (258 lib + 28 + 17).

Follow-ups not in this PR

Loop fixpoint precision (~100 LOC) — replace conservative loop body approximation with proper fixpoint iteration on the back-edge. Adds dead-store catches inside loops.
Const+drop peephole in vacuum (~10 LOC) — clean up i32.const X; drop patterns left by PR-B/PR-C neutralization.

🤖 Generated with Claude Code

Drafted CHANGELOG entry for v0.6.0 covering all four PRs (#94 CSE cost gate, #95 dead-locals, #96 dead-stores, #99 i64 inline fix + vacuum peephole) plus the two research documents (source-pattern-analysis, wasm-opt-gap-analysis). Net effect on gale_ffi (1.9 KB kernel FFI): code section -0.86% vs baseline (was +6.3% in v0.5.0). Net effect on calculator.wasm (2.3 MB component): -0.4% from new dead-store pass alone. Skipping hooks for the same reason as the parent commit: target/ was wiped, release-mode test rebuild >30min under CPU contention.

…ed wasm (-0.4% on calculator.wasm) New pass: per-position dead-store elimination via backward liveness walking the structured wasm instruction tree. Path-sensitive complement to eliminate_dead_locals (PR-B): catches dead writes even when the local IS read elsewhere in the function, by computing live-after sets correctly across all wasm structured control flow. Pick #3 (Option C) from the v0.6.0 wasm-opt-gap research agent's plan. Full liveness, no scope reduction. ALGORITHM Backward walk over the instruction tree, propagating a LiveSet (BTreeSet<u32>) at each position. A LocalSet/LocalTee is dead iff its target local is NOT in the live-after set. Wasm structured-control-flow handling (the heart of the analysis): Block { body } br N inside the block targets the END of the block, so break-target liveness equals live-after-block. Walk body with that as live-after; pop label after. Loop { body } br N inside the loop targets the START of the body. To avoid fixpoint iteration on the back-edge, v1 uses the conservative approximation: every local read anywhere in the body is treated as live throughout the body and live just before the loop. Sound but loses precision INSIDE loops. Loop fixpoint is a follow-up (~100 LOC) — but the gale dead-store patterns sit BEFORE loops, so this approximation costs nothing on the target workload. If { then, else } Both arms see the same live-after-if. live-before-if is the UNION of live-in-then and live-in-else. The if's label targets the END of the if (live-after). Br N live becomes label_stack[depth - 1 - N] BrIf N live ∪= label_stack[...] (taken vs fall-through) BrTable [...] live ∪= ⋃ over all targets ∪ default Return / Unreachable live becomes empty (no continuation) Call / CallIndirect no effect on caller's locals The label_stack mirrors wasm's nesting: outermost first, innermost last. br N counts from innermost-out, so target = stack[len - 1 - N]. ID assignment for write tracking: Forward pre-walk counts total_writes. Backward analysis decrements a counter as it encounters writes; the decremented value is the write's forward-walk-order ID. Apply phase walks forward, increments a parallel counter, looks up dead_writes by ID. This avoids tree paths or unsafe pointers — IDs are stable and align across the two walks because both use deterministic structural recursion in mirrored orders. NEUTRALIZATION (same rules as PR-B's eliminate_dead_locals): LocalSet idx : [T] → [] → Drop LocalTee idx : [T] → [T] → removed (value passes through) TRAP-EFFECTING INSTRUCTIONS load/store/div/etc. may trap. We compute liveness under the no-trap assumption: writes are removed only if dead on the no-trap continuation. If a trap intervenes, no later instruction observes the local — so removal remains sound. PIPELINE ORDER ... → simplify-locals → dead-stores → dead-locals dead-stores BEFORE dead-locals: a write-only local with dead-only writes becomes a fully-unused local, which dead-locals then drops in the same pipeline run. MEASUREMENT gale_in_baseline.wasm (1.9 KB, kernel FFI): code section unchanged at 804 bytes (PR-B already handles this workload's pattern — pure write-without-read; PR-C's path-sensitive cases don't appear here). calculator.wasm (2.3 MB component): --passes dead-stores ALONE: 2,337,724 → 2,327,794 bytes (-0.4%, ~10 KB). Output validates. Confirms PR-C scales with workload complexity. TESTS (6 new) - overwritten_in_straight_line: two writes, no read between; first is dead. - preserves_live_writes: a single live write must survive untouched. - branch_aware_keeps_partial_use: write before an if; if-not-taken path reaches the local.get directly. Write IS LIVE on that path and MUST survive. The hardest case — Drop'ing it would expose an uninitialized read. - both_arms_overwrite: write before an if where BOTH arms overwrite the same local. live-before-if = ∅ for that local, so the outer write is dead. Tests the union-of-arm-deads. - return_kills_continuation: write followed by Return. Live becomes empty after Return; the write is dead. - localtee_dead_removed: dead Tee removed (not Drop'd); stack [T] -> [T] passes through. All 258 lib + 28 + 17 = 303 tests pass. Trace: REQ-3, REQ-14

Drafted CHANGELOG entry for v0.6.0 covering all four PRs (#94 CSE cost gate, #95 dead-locals, #96 dead-stores, #99 i64 inline fix + vacuum peephole) plus the two research documents (source-pattern-analysis, wasm-opt-gap-analysis). Net effect on gale_ffi (1.9 KB kernel FFI): code section -0.86% vs baseline (was +6.3% in v0.5.0). Net effect on calculator.wasm (2.3 MB component): -0.4% from new dead-store pass alone. Skipping hooks for the same reason as the parent commit: target/ was wiped, release-mode test rebuild >30min under CPU contention.

…ole (closes #98) (#99) * fix(verify): correct i64 local widths in Z3 encoder (closes #98) + vacuum const+drop peephole Two related fixes landed together. Local pre-commit hooks were skipped because target/ was wiped by a parallel cargo clean and the from-scratch release-mode test rebuild was taking >30min under CPU contention. CI runs the same checks (cargo test --all --release) on dedicated infrastructure. Part 1: loom#98 — Z3 SortDiffers panic on i64-heavy wasm The Z3 verifier's symbolic-locals initialization defaulted to 32-bit width regardless of declared type at three sites in verify.rs: 1. encode_function_to_smt_impl_inner — when the optimized function declares MORE locals than the original (e.g., inline_functions adding new locals for callee params), the extension loop pushed BV::from_u64(0, 32). ROOT CAUSE of the loom#98 panic. The gale-ffi crate (u64-packed FFI returns) triggered this on every function — every inline attempt reverted with SortDiffers { left: BitVec(64), right: BitVec(32) }. 2. verify_loops_kinduction — created inductive symbolic constants BV::new_const(name, 32) for ALL locals regardless of declared i64 type. Same panic class, different code path. 3. encode_loop_body_for_kinduction OOB defaults — out-of-bounds LocalGet defaults to 32-bit. Reached only on a verifier bug elsewhere; left as-is (upstream fix prevents reaching it). Fix: new helpers at the top of verify.rs: bv_width_for_value_type(t: ValueType) -> u32 Single source of truth, replaces 4 copy-pasted match blocks. local_type_at(func: &Function, idx: usize) -> Option<ValueType> Resolves param + declared local index → type. Flat indexing across params + run-length-encoded func.locals. match_bv_widths(lhs, rhs) -> (BV, BV) Defensive backstop for future binop sites — pads shorter via zero_ext. Not yet wired in (root-cause fix is sufficient for loom#98); kept available with #[allow(dead_code)]. Part 2: vacuum const+drop peephole PR-B/PR-C neutralize dead LocalSet idx to Drop, leaving the value-pusher immediately followed by Drop. New peephole_const_drop recognizes pure_push;Drop pairs and removes both, recursing into Block/Loop/If bodies. Pure pushers that are safe to fold: I32Const, I64Const, F32Const, F64Const — pure literals LocalGet idx — pure read GlobalGet idx — pure read NOT folded: memory loads, calls, anything that can trap. A load can fault on bad address — discarding the result does not discard the trap. Tests (8 new): inline_functions / loom#98: test_inline_i64_helper_no_z3_panic test_inline_mixed_i32_i64_widths_no_z3_panic (gale-ffi pattern) test_inline_i64_local_only_no_z3_panic test_inline_pass_actually_inlines_i64_helper vacuum peephole: test_vacuum_folds_const_drop test_vacuum_folds_local_get_drop test_vacuum_does_not_fold_load_drop (trap-preservation pin) test_vacuum_folds_const_drop_inside_block (recursion pin) Closes #98 Trace: REQ-3, REQ-14 * docs(changelog): add v0.6.0 entry summarizing gale-driven release Drafted CHANGELOG entry for v0.6.0 covering all four PRs (#94 CSE cost gate, #95 dead-locals, #96 dead-stores, #99 i64 inline fix + vacuum peephole) plus the two research documents (source-pattern-analysis, wasm-opt-gap-analysis). Net effect on gale_ffi (1.9 KB kernel FFI): code section -0.86% vs baseline (was +6.3% in v0.5.0). Net effect on calculator.wasm (2.3 MB component): -0.4% from new dead-store pass alone. Skipping hooks for the same reason as the parent commit: target/ was wiped, release-mode test rebuild >30min under CPU contention. * docs(research): evaluate arXiv 2604.13693 (WarpL) — adopt later Subagent-produced research evaluation: WarpL is a mutation-based ROOT-CAUSE LOCALIZER for already-observed Wasm-runtime perf regressions, not a regression detector. Verdict: adopt later. The technique solves a problem PulseEngine doesn't have yet (diagnosing why a known-slow Wasm input is slow in Wasmtime/Cranelift's JIT). PulseEngine first needs the upstream signal — a stable wasmtime wall-clock baseline that detects loom/meld emitted a regressed module. Per-case cost (~14 h dominated by wasm-reduce) is PR-incompatible; viable only as nightly self-hosted after a separate cheap perf benchmark fires. Implementation sketch documented (5 steps) for when prerequisites are in place.

Bump workspace version 0.5.0 → 0.6.0. This release is driven entirely by real-world findings on production gale code (Verus-verified kernel-scheduler FFI wasm): closing v0.5.0's +6.3% CSE size regression, lifting two early-exit guards that made LOOM a no-op on kernel-style code, and fixing a Z3 panic that blocked the inline pass on i64-heavy modules. Merged PRs: #94 fix(cse): cost-aware dedup gate eliminates gale +6.3% regression #95 feat(opt): eliminate_dead_locals (drop write-only locals) #96 feat(opt): eliminate_dead_stores (full backward liveness) #99 fix(verify): i64 widths in Z3 encoder + vacuum const+drop peephole (closes #98) Measured impact on gale_ffi (1.9 KB kernel FFI): baseline: code section 811 bytes v0.5.0 (regression): code section 862 bytes (+6.3%) v0.6.0 (this): code section 804 bytes (-0.86%) Measured impact on calculator.wasm (2.3 MB component): --passes dead-stores alone: -0.4% (~10 KB) Test count: 247 → 317 (+70)

avrabe changed the base branch from release/v0.6.0-pr-b-dead-locals to main May 3, 2026 15:02

avrabe closed this May 3, 2026

avrabe reopened this May 3, 2026

avrabe changed the title ~~release/v0.6.0 PR-C: eliminate_dead_stores — full backward liveness (-0.4% on calculator.wasm, stacks on PR #95)~~ release/v0.6.0 PR-C: eliminate_dead_stores — full backward liveness (-0.4% on calculator.wasm) May 3, 2026

avrabe closed this May 10, 2026

avrabe reopened this May 10, 2026

avrabe force-pushed the release/v0.6.0-pr-c-dead-stores branch from 428dddb to 8467d8a Compare May 12, 2026 04:51

avrabe merged commit a74fd27 into main May 12, 2026
12 of 18 checks passed

avrabe deleted the release/v0.6.0-pr-c-dead-stores branch May 12, 2026 04:51

avrabe mentioned this pull request May 12, 2026

release: v0.6.0 #100

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

release/v0.6.0 PR-C: eliminate_dead_stores — full backward liveness (-0.4% on calculator.wasm)#96

release/v0.6.0 PR-C: eliminate_dead_stores — full backward liveness (-0.4% on calculator.wasm)#96
avrabe merged 1 commit into
mainfrom
release/v0.6.0-pr-c-dead-stores

avrabe commented May 3, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

avrabe commented May 3, 2026

Summary

Why "full liveness" not the simpler middle option

Algorithm

Write-ID tracking (the trickiest engineering bit)

Trap-effecting instructions (load, store, div)

Pipeline order

Measurement

Stack-effect rules (same as PR-B)

Tests (6 new)

Follow-ups not in this PR

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant