feat(opt): fold constant address into base+offset for memory ops (#95) #96
Merged
Detects the wasm pattern `i32.const C; i32.{load,store}{,8,16}{_s,_u}
offset=O` in `select_with_stack` and lowers it to a single
`LDR/STR rd, [R11, #(C+O)]` (4 bytes) when the effective offset fits in
the Thumb-2 imm12 range (0..=4095) and bounds checking is disabled
(bare-metal default). Replaces the previous `MOVW + MOVT + LDR.W`
sequence (10 bytes) with a single 4-byte load — saving 6 bytes per
constant-address access.
Per the issue, this accounts for ~20% of the gale-ffi size delta vs
LLVM-LTO. The fold is target-aware and stays in the instruction
selector to avoid inventing a synthetic wasm op.
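As a rough illustration of how such a peephole sits in a selection loop, here is a minimal sketch. The `Op` enum, the `select` function, and the emitted strings are hypothetical stand-ins for exposition only; the real `select_with_stack` in synth is more involved.

```rust
// Hypothetical shape of the peephole inside a selection loop. `Op` and the
// emitted strings are illustrative stand-ins, not the real synth API.
#[derive(Clone, Copy)]
enum Op {
    I32Const(i32),
    I32Load { offset: u32 },
}

fn select(ops: &[Op], out: &mut Vec<String>) {
    let mut idx = 0;
    while idx < ops.len() {
        match (ops.get(idx), ops.get(idx + 1)) {
            // Fold `i32.const C; i32.load offset=O` when C+O fits in imm12.
            (Some(&Op::I32Const(c)), Some(&Op::I32Load { offset: o }))
                if (c as u32).wrapping_add(o) <= 0xFFF =>
            {
                let eff = (c as u32).wrapping_add(o);
                out.push(format!("LDR r3, [r11, #{eff:#x}]")); // single 4-byte load
                idx += 2; // consumed both wasm ops
            }
            _ => {
                // Generic path: materialize the address (MOVW+MOVT+indexed LDR).
                out.push("generic".to_string());
                idx += 1;
            }
        }
    }
}
```

Because the fold consumes two wasm ops at once, the fallback arm advances by one op at a time, which is what keeps the existing materialization path intact when any precondition fails.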
Constraints:
- Effective offset `(C as u32).wrapping_add(O)` must be <= 4095.
- Only applied when `BoundsCheckConfig::None` (the bare-metal default).
- For stores, the value-pusher at idx-1 must be a (0,1) op
(I32Const / LocalGet / GlobalGet) so the address-pusher at idx-2 is
reliably an I32Const without intermediate stack effects.
- Falls back to the existing materialization path when any precondition
fails — confirmed by `const_addr_load_falls_back_when_offset_too_large`.
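The first two preconditions can be sketched as a small predicate. The types below are simplified, hypothetical stand-ins; the real `try_fold_const_addr` takes the selector's state and also checks that `wasm_ops[idx-1]` is the `I32Const`.

```rust
// Sketch of the fold predicate under simplified types; the real helper
// lives in instruction_selector.rs and inspects the wasm op stream.
const IMM12_MAX: u32 = 0xFFF; // Thumb-2 LDR/STR immediate-offset range

/// Illustrative stand-in for the real bounds-check configuration.
#[derive(PartialEq)]
enum BoundsCheckConfig {
    None,
    Checked,
}

/// Returns the folded effective offset when every precondition holds.
fn try_fold_const_addr(c: i32, o: u32, bounds: &BoundsCheckConfig) -> Option<u32> {
    if *bounds != BoundsCheckConfig::None {
        return None; // fold is only sound without bounds checking
    }
    let effective = (c as u32).wrapping_add(o);
    (effective <= IMM12_MAX).then_some(effective)
}
```

Using `wrapping_add` means a constant near `u32::MAX` simply wraps and is rejected by the range check rather than panicking in debug builds.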
Coverage:
- I32Load + I32Load8S/U + I32Load16S/U
- I32Store + I32Store8 + I32Store16
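The `(0,1)` value-pusher constraint from the store fold above can be sketched as a stack-effect check. `WasmOp` and both functions are simplified stand-ins for illustration, not synth's real types.

```rust
// Sketch of the (0,1) value-pusher check used by the store fold.
// `WasmOp` is a simplified stand-in for the real wasm-op type.
enum WasmOp {
    I32Const(i32),
    LocalGet(u32),
    GlobalGet(u32),
    I32Add,
    I32Load { offset: u32 },
}

/// Stack effect of an op as (values popped, values pushed).
fn stack_effect(op: &WasmOp) -> (u32, u32) {
    match op {
        WasmOp::I32Const(_) | WasmOp::LocalGet(_) | WasmOp::GlobalGet(_) => (0, 1),
        WasmOp::I32Add => (2, 1),
        WasmOp::I32Load { .. } => (1, 1),
    }
}

/// A (0,1) pusher leaves the ops below it untouched, so when the store's
/// value-pusher is (0,1), the op before it is reliably the address const.
fn value_pusher_is_simple(op: &WasmOp) -> bool {
    stack_effect(op) == (0, 1)
}
```

An `i32.add`-produced value fails this check because it consumes operands from the stack, which is exactly why complex-value stores fall back instead of folding.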
Tests added:
- `instruction_selector::tests::test_issue_95_*` (6 unit tests)
- `synth-backend/tests/issue_95_const_addr_load.rs` (5 integration tests
including before/after byte-count comparison via the real encoder)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes #95.
Summary
When wasm code emits `i32.const C; i32.load offset=O` (the typical pattern for reads of merged statics in `.data`/`.rodata` after meld linking), synth was producing a three-instruction, ten-byte MOVW+MOVT+LDR.W sequence. This PR folds the const-address pattern at instruction-selection time, emitting a single Thumb-2 immediate-offset load.
For the canonical `(i32.const 0x100) (i32.load offset=8)` benchmark from the issue, the body shrinks from ~10 bytes (MOVW+MOVT+indexed LDR) to 4 bytes — a 6-byte saving per access. Per #95, this accounts for ~20% of the gale-ffi size delta vs LLVM-LTO, dominated by hot-path scalar loads like `sem->count`/`sem->limit`.

Where the fold lives
`crates/synth-synthesis/src/instruction_selector.rs` — `select_with_stack()`. Done at the target-aware lowering layer rather than as a wasm-IR rewrite, so we don't have to invent a synthetic op. Two helpers added:

- `try_fold_const_addr` — load fold predicate (checks `wasm_ops[idx-1] == I32Const(C)`, `BoundsCheckConfig::None`, and `(C as u32).wrapping_add(O) <= 0xFFF`).
- `try_fold_const_addr_store` + `splice_out_addr_const_materialization` — store fold for `i32.const ADDR; <(0,1) value-pusher>; i32.store`. Splices the addr-const instructions out of the tail while preserving the value chunk that sits on top.

Coverage
`i32.load`, `i32.load8_s`, `i32.load8_u`, `i32.load16_s`, `i32.load16_u`; `i32.store`, `i32.store8`, `i32.store16`.

For stores, the fold is applied conservatively: the value-pusher must be `(0, 1)` (I32Const/LocalGet/GlobalGet) so that `wasm_ops[idx-2]` is reliably the address-pushing op without intermediate stack consumption. Complex-value stores (e.g., value computed via `i32.add`) intentionally do not fold, to keep the splice logic simple — covered by `test_issue_95_no_fold_when_value_is_complex_expression`.

Before / after — canonical sequence
`(i32.const 0x100) (i32.load offset=8)`:

- Before: `MOVW r3, #0x100`; `MOVT r3, #0`; `ADD ip, r3, #8`; `LDR.W r3, [r11, ip]`
- After: `LDR.W r3, [r11, #0x108]`

`(i32.const 0x10000) (i32.load offset=8)` (effective offset > 4095) falls back to MOVW+MOVT+indexed LDR. Verified by `const_addr_load_falls_back_when_offset_too_large`.

Tests added

Unit tests in `crates/synth-synthesis/src/instruction_selector.rs`:

- `test_issue_95_const_addr_load_folds_to_base_offset`
- `test_issue_95_const_addr_load_falls_back_when_offset_too_large`
- `test_issue_95_const_addr_store_folds_to_base_offset`
- `test_issue_95_const_addr_subword_loads_fold`
- `test_issue_95_const_addr_subword_stores_fold`
- `test_issue_95_no_fold_when_value_is_complex_expression`

Integration tests in `crates/synth-backend/tests/issue_95_const_addr_load.rs` (drive the real encoder):

- `canonical_const_addr_load_drops_from_10_to_4_bytes`
- `canonical_load_before_vs_after_byte_count` (encoder-validated byte counts)
- `const_addr_load_falls_back_when_offset_too_large`
- `canonical_const_addr_store_folds`
- `const_addr_subword_loads_fold`

Test plan
- `cargo test --workspace` — green (all suites pass; no regressions)
- `cargo clippy -p synth-synthesis -p synth-backend --all-targets -- -D warnings` — clean
- `cargo fmt --check -p synth-synthesis -p synth-backend` — clean
- Fallback path (effective offset `(C as u32).wrapping_add(O) > 4095`) exercised against the real `ArmEncoder`

🤖 Generated with Claude Code