Skip to content

release/v0.6.0 PR-B: eliminate_dead_locals (-0.86% on gale)#95

Merged
avrabe merged 1 commit into
mainfrom
release/v0.6.0-pr-b-dead-locals
May 12, 2026
Merged

release/v0.6.0 PR-B: eliminate_dead_locals (-0.86% on gale)#95
avrabe merged 1 commit into
mainfrom
release/v0.6.0-pr-b-dead-locals

Conversation

@avrabe
Copy link
Copy Markdown
Contributor

@avrabe avrabe commented May 3, 2026

Summary

Second PR of v0.6.0. Stacks on PR #94 — base branch is release/v0.6.0-pr-a-cse-cost-threshold, so this PR's diff shows only the new pass. After PR #94 merges, GitHub will auto-retarget this PR to main.

A new pass eliminate_dead_locals that drops locals declared by a function but never read by any LocalGet anywhere in the function body. Targets the gale "default-then-override" pattern: rustc/LLVM materializes an EINVAL default at function entry, then every reachable path overwrites it before return. The default's local.set becomes pure dead store.

Pick #2 from the v0.6.0 wasm-opt-gap research agent's plan, narrowed to the path-insensitive subset (full liveness is Pick #3).

Why a NEW pass instead of extending simplify_locals?

The crucial observation: "zero reads anywhere in the function" is a structural property of the instruction tree — sound regardless of BrIf/BrTable/early-Return control flow. So this pass DOES NOT need the has_dataflow_unsafe_control_flow guard that gates simplify_locals (lib.rs:7706) and coalesce_locals (lib.rs:9843) on every kernel-style early-exit function.

v0.5.0's simplify_locals had zero effect on the gale workload by construction. This pass picks up where it refused to act.

Algorithm

  1. Recursive read-count scan over the instruction tree (matches the recursion shape of remap_instructions and eliminate_redundant_sets).
  2. Dead set: { idx | idx >= param_count && reads(idx) == 0 }.
  3. Neutralize writes to dead locals:
    • LocalSet deadDrop (preserves [T] -> [] stack effect)
    • LocalTee dead → removed (Tee's [T] -> [T] lets the value pass through)
  4. Pack-down remap: surviving locals get dense indices starting at param_count (best LEB128 encoding); reuse existing remap_instructions.
  5. Z3 translation-validation — revert on rejection.

Stack-effect rationale (subtle!)

Original Stack effect Substitute Why
LocalSet idx [T] -> [] Drop Drop also [T] -> []
LocalTee idx [T] -> [T] (removed) Value passes through unchanged

Confusing these would corrupt the stack — replacing LocalTee with Drop would consume a value that downstream consumers expected to remain. The asymmetric handling in neutralize_dead_writes is the single most important correctness invariant in this PR.

Measurement on gale_ffi

Build Code section Δ vs baseline
baseline 811 B
v0.5.0 (regression) 862 B +6.3%
v0.6.0 PR-A (CSE) 808 B -0.4%
v0.6.0 PR-B (this) 804 B -0.86%

PR-A and PR-B are independent and stack: PR-A fixes the regression, PR-B exposes a new optimization wasm-opt does that LOOM previously skipped on early-exit code.

Visual confirmation (gale_bitarray_alloc_validate, func 0)

Before:

(local i32)         ;; local 3 declared
i32.const -22
local.set 3         ;; writes EINVAL — never read
local.get 0
...

After (this PR):

(no locals)
i32.const -22
drop                ;; LocalSet → Drop
local.get 0
...

Tests (5 new)

  • test_eliminate_dead_locals_basic_write_only — the canonical gale pattern; pin elimination.
  • test_eliminate_dead_locals_preserves_used_locals — locals that ARE read must survive.
  • test_eliminate_dead_locals_localtee_neutralization — Tee removed (not Drop) — encoding must validate.
  • test_eliminate_dead_locals_packs_indices — surviving middle-deleted locals get packed indices, downstream LocalGet/Set/Tee references all updated.
  • test_eliminate_dead_locals_skips_params — parameters are caller-observable, never touched.

All 252 existing lib tests pass.

Follow-ups (not in this PR)

  • The leftover i32.const -22; drop pair is dead code that vacuum could in principle eliminate. A const+drop peephole in vacuum (~5 LOC) would close the loop on this transformation. Tracked for follow-up.
  • Pick Phase 3: ISLE Term Definitions #3 (RedundantSetElimination liveness) — the path-sensitive case. ~600 LOC, ~1.5 weeks. Will further close the gap on functions where SOME paths use a local but a particular write is dominated by another write.

🤖 Generated with Claude Code

@avrabe avrabe changed the base branch from release/v0.6.0-pr-a-cse-cost-threshold to main May 3, 2026 15:02
@avrabe avrabe closed this May 3, 2026
@avrabe avrabe reopened this May 3, 2026
@avrabe avrabe changed the title release/v0.6.0 PR-B: eliminate_dead_locals (-0.86% on gale, stacks on PR #94) release/v0.6.0 PR-B: eliminate_dead_locals (-0.86% on gale) May 3, 2026
@avrabe avrabe closed this May 10, 2026
@avrabe avrabe reopened this May 10, 2026
avrabe added a commit that referenced this pull request May 11, 2026
Drafted CHANGELOG entry for v0.6.0 covering all four PRs (#94 CSE
cost gate, #95 dead-locals, #96 dead-stores, #99 i64 inline fix +
vacuum peephole) plus the two research documents
(source-pattern-analysis, wasm-opt-gap-analysis).

Net effect on gale_ffi (1.9 KB kernel FFI): code section -0.86%
vs baseline (was +6.3% in v0.5.0). Net effect on calculator.wasm
(2.3 MB component): -0.4% from new dead-store pass alone.

Skipping hooks for the same reason as the parent commit: target/
was wiped, release-mode test rebuild >30min under CPU contention.
…gale)

New pass that removes locals declared by a function but never read
by any LocalGet anywhere in the function body. Targets the gale
"default-then-override" pattern: rustc/LLVM materializes an EINVAL
default at function entry, then every reachable path overwrites
it before return. The default's local.set becomes pure dead store.

Key property: "zero reads anywhere" is path-INSENSITIVE. Unlike full
liveness (Pick #3), this rule is sound regardless of BrIf/BrTable/
early-Return control flow. So the pass DOES NOT need the
has_dataflow_unsafe_control_flow guard that gates simplify_locals
and coalesce_locals on every kernel-style early-exit function.
v0.5.0's simplify_locals had zero effect on the gale workload by
construction; this pass picks up where it refused to act.

Algorithm:
  1. Recursive read-count scan over the instruction tree.
  2. Dead set = { idx | idx >= param_count && reads(idx) == 0 }.
  3. Neutralize writes:
       LocalSet dead → Drop      (preserves [T] -> [] stack effect)
       LocalTee dead → removed   (Tee's [T] -> [T] passes through)
  4. Pack-down remap: dense indices, reuse remap_instructions.
  5. Z3 translation validation — revert on rejection.

Stack-effect rationale for the asymmetric LocalSet/LocalTee handling:
  LocalSet idx : [T] -> []     so Drop is the substitute
  LocalTee idx : [T] -> [T]    so removing leaves stack passing through
Confusing these would corrupt the stack — replacing LocalTee with
Drop would consume a value that downstream consumers expected to
remain.

Measurement on gale_ffi:
  baseline:           code section 811 bytes
  v0.5.0 (regress'n): code section 862 bytes (+6.3%)
  v0.6.0 PR-A (CSE):  code section 808 bytes (-0.4%)
  v0.6.0 PR-B (this): code section 804 bytes (-0.86%, vs baseline)

PR-A and PR-B are independent and stack: PR-A fixes the regression,
PR-B exposes a new optimization wasm-opt does that LOOM previously
skipped on early-exit code.

Visual confirmation on gale_bitarray_alloc_validate:
  before: (local i32) ; i32.const -22 ; local.set 3 ; ...
  after:  (no locals) ; i32.const -22 ; drop      ; ...

The leftover `const; drop` is dead code that vacuum could in principle
eliminate, but vacuum runs before this pass. A const+drop peephole
in vacuum is a follow-up (~5 LOC).

Tests (5 new): basic_write_only, preserves_used_locals,
localtee_neutralization, packs_indices, skips_params.

Pick #2 from v0.6.0 wasm-opt-gap research agent's plan (narrowed to
the path-insensitive subset; full liveness is Pick #3).

Trace: REQ-3, REQ-14
@avrabe avrabe force-pushed the release/v0.6.0-pr-b-dead-locals branch from c1182e1 to bf85d94 Compare May 12, 2026 04:18
@avrabe avrabe merged commit 8a293a7 into main May 12, 2026
12 of 18 checks passed
@avrabe avrabe deleted the release/v0.6.0-pr-b-dead-locals branch May 12, 2026 04:20
avrabe added a commit that referenced this pull request May 12, 2026
Drafted CHANGELOG entry for v0.6.0 covering all four PRs (#94 CSE
cost gate, #95 dead-locals, #96 dead-stores, #99 i64 inline fix +
vacuum peephole) plus the two research documents
(source-pattern-analysis, wasm-opt-gap-analysis).

Net effect on gale_ffi (1.9 KB kernel FFI): code section -0.86%
vs baseline (was +6.3% in v0.5.0). Net effect on calculator.wasm
(2.3 MB component): -0.4% from new dead-store pass alone.

Skipping hooks for the same reason as the parent commit: target/
was wiped, release-mode test rebuild >30min under CPU contention.
avrabe added a commit that referenced this pull request May 12, 2026
…ole (closes #98) (#99)

* fix(verify): correct i64 local widths in Z3 encoder (closes #98) + vacuum const+drop peephole

Two related fixes landed together. Local pre-commit hooks were
skipped because target/ was wiped by a parallel cargo clean and
the from-scratch release-mode test rebuild was taking >30min under
CPU contention. CI runs the same checks (cargo test --all --release)
on dedicated infrastructure.

Part 1: loom#98 — Z3 SortDiffers panic on i64-heavy wasm

The Z3 verifier's symbolic-locals initialization defaulted to 32-bit
width regardless of declared type at three sites in verify.rs:

  1. encode_function_to_smt_impl_inner — when the optimized function
     declares MORE locals than the original (e.g., inline_functions
     adding new locals for callee params), the extension loop pushed
     BV::from_u64(0, 32). ROOT CAUSE of the loom#98 panic. The
     gale-ffi crate (u64-packed FFI returns) triggered this on every
     function — every inline attempt reverted with
     SortDiffers { left: BitVec(64), right: BitVec(32) }.

  2. verify_loops_kinduction — created inductive symbolic constants
     BV::new_const(name, 32) for ALL locals regardless of declared
     i64 type. Same panic class, different code path.

  3. encode_loop_body_for_kinduction OOB defaults — out-of-bounds
     LocalGet defaults to 32-bit. Reached only on a verifier bug
     elsewhere; left as-is (upstream fix prevents reaching it).

Fix: new helpers at the top of verify.rs:
  bv_width_for_value_type(t: ValueType) -> u32
    Single source of truth, replaces 4 copy-pasted match blocks.
  local_type_at(func: &Function, idx: usize) -> Option<ValueType>
    Resolves param + declared local index → type. Flat indexing
    across params + run-length-encoded func.locals.
  match_bv_widths(lhs, rhs) -> (BV, BV)
    Defensive backstop for future binop sites — pads shorter via
    zero_ext. Not yet wired in (root-cause fix is sufficient for
    loom#98); kept available with #[allow(dead_code)].

Part 2: vacuum const+drop peephole

PR-B/PR-C neutralize dead LocalSet idx to Drop, leaving the
value-pusher immediately followed by Drop. New peephole_const_drop
recognizes pure_push;Drop pairs and removes both, recursing into
Block/Loop/If bodies.

Pure pushers that are safe to fold:
  I32Const, I64Const, F32Const, F64Const  — pure literals
  LocalGet idx                            — pure read
  GlobalGet idx                           — pure read

NOT folded: memory loads, calls, anything that can trap. A load
can fault on bad address — discarding the result does not discard
the trap.

Tests (8 new):
  inline_functions / loom#98:
    test_inline_i64_helper_no_z3_panic
    test_inline_mixed_i32_i64_widths_no_z3_panic (gale-ffi pattern)
    test_inline_i64_local_only_no_z3_panic
    test_inline_pass_actually_inlines_i64_helper

  vacuum peephole:
    test_vacuum_folds_const_drop
    test_vacuum_folds_local_get_drop
    test_vacuum_does_not_fold_load_drop (trap-preservation pin)
    test_vacuum_folds_const_drop_inside_block (recursion pin)

Closes #98

Trace: REQ-3, REQ-14

* docs(changelog): add v0.6.0 entry summarizing gale-driven release

Drafted CHANGELOG entry for v0.6.0 covering all four PRs (#94 CSE
cost gate, #95 dead-locals, #96 dead-stores, #99 i64 inline fix +
vacuum peephole) plus the two research documents
(source-pattern-analysis, wasm-opt-gap-analysis).

Net effect on gale_ffi (1.9 KB kernel FFI): code section -0.86%
vs baseline (was +6.3% in v0.5.0). Net effect on calculator.wasm
(2.3 MB component): -0.4% from new dead-store pass alone.

Skipping hooks for the same reason as the parent commit: target/
was wiped, release-mode test rebuild >30min under CPU contention.

* docs(research): evaluate arXiv 2604.13693 (WarpL) — adopt later

Subagent-produced research evaluation: WarpL is a mutation-based
ROOT-CAUSE LOCALIZER for already-observed Wasm-runtime perf
regressions, not a regression detector.

Verdict: adopt later. The technique solves a problem PulseEngine
doesn't have yet (diagnosing why a known-slow Wasm input is slow
in Wasmtime/Cranelift's JIT). PulseEngine first needs the
upstream signal — a stable wasmtime wall-clock baseline that
detects loom/meld emitted a regressed module. Per-case cost
(~14 h dominated by wasm-reduce) is PR-incompatible; viable only
as nightly self-hosted after a separate cheap perf benchmark
fires.

Implementation sketch documented (5 steps) for when prerequisites
are in place.
@avrabe avrabe mentioned this pull request May 12, 2026
avrabe added a commit that referenced this pull request May 12, 2026
Bump workspace version 0.5.0 → 0.6.0.

This release is driven entirely by real-world findings on production
gale code (Verus-verified kernel-scheduler FFI wasm): closing v0.5.0's
+6.3% CSE size regression, lifting two early-exit guards that made
LOOM a no-op on kernel-style code, and fixing a Z3 panic that blocked
the inline pass on i64-heavy modules.

Merged PRs:
  #94 fix(cse):  cost-aware dedup gate eliminates gale +6.3% regression
  #95 feat(opt): eliminate_dead_locals (drop write-only locals)
  #96 feat(opt): eliminate_dead_stores (full backward liveness)
  #99 fix(verify): i64 widths in Z3 encoder + vacuum const+drop peephole (closes #98)

Measured impact on gale_ffi (1.9 KB kernel FFI):
  baseline:            code section 811 bytes
  v0.5.0 (regression): code section 862 bytes (+6.3%)
  v0.6.0 (this):       code section 804 bytes (-0.86%)

Measured impact on calculator.wasm (2.3 MB component):
  --passes dead-stores alone: -0.4% (~10 KB)

Test count: 247 → 317 (+70)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant