
feat(opt): fold constant address into base+offset for memory ops (#95) · PR #96

Merged
avrabe merged 1 commit into main from feat/issue-95-const-addr-load on May 11, 2026
Conversation

@avrabe
Contributor

@avrabe avrabe commented May 10, 2026

Closes #95.

Summary

When wasm code emits i32.const C; i32.load offset=O (the typical pattern for reads of merged statics in .data/.rodata after meld linking), synth was producing a three-instruction, ten-byte sequence:

movw  ip, #lo16(C)        ; 4 bytes
movt  ip, #hi16(C)        ; 4 bytes
ldr.w r3, [r11, ip, #O]   ; 4 bytes  (encoder-expanded: ADD scratch + LDR reg)

This PR folds the const-address pattern at instruction-selection time, emitting a single Thumb-2 immediate-offset load:

ldr.w r3, [r11, #(C+O)]   ; 4 bytes

For the canonical (i32.const 0x100) (i32.load offset=8) benchmark from the issue, the body shrinks from ~10 bytes (MOVW+MOVT+indexed LDR) to 4 bytes — a 6-byte saving per access. Per #95, this accounts for ~20% of the gale-ffi size delta vs LLVM-LTO, dominated by hot-path scalar loads like sem->count / sem->limit.

Where the fold lives

The fold lives in crates/synth-synthesis/src/instruction_selector.rs, inside select_with_stack(). It is done at the target-aware lowering layer rather than as a wasm-IR rewrite, so we don't have to invent a synthetic op.

Two helpers added:

  • try_fold_const_addr — load fold predicate (checks wasm_ops[idx-1] == I32Const(C), BoundsCheckConfig::None, and (C as u32).wrapping_add(O) <= 0xFFF).
  • try_fold_const_addr_store + splice_out_addr_const_materialization — store fold for i32.const ADDR; <(0,1) value-pusher>; i32.store. Splices the addr-const instructions out of the tail while preserving the value chunk that sits on top.
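The load-fold predicate described above can be illustrated with a minimal, self-contained sketch. The enum, field names, and function signature here are illustrative stand-ins, not the crate's actual types:

```rust
/// Illustrative wasm-op model; the real crate's IR types differ.
#[derive(Debug, Clone, Copy, PartialEq)]
enum WasmOp {
    I32Const(i32),
    I32Load { offset: u32 },
}

/// Returns the folded effective offset if `i32.const C; i32.load offset=O`
/// at `idx` can lower to a single `LDR rd, [r11, #(C+O)]`: bounds checks
/// must be off and the sum must fit the Thumb-2 imm12 range (0..=0xFFF).
fn try_fold_const_addr(ops: &[WasmOp], idx: usize, bounds_checks_off: bool) -> Option<u32> {
    if !bounds_checks_off || idx == 0 {
        return None;
    }
    let WasmOp::I32Const(c) = ops[idx - 1] else { return None };
    let WasmOp::I32Load { offset } = ops[idx] else { return None };
    let eff = (c as u32).wrapping_add(offset);
    (eff <= 0xFFF).then_some(eff)
}
```

On the canonical pattern, 0x100 + 8 = 0x108 fits imm12 and folds, while 0x10000 + 8 exceeds it and returns None, matching the fall-back behaviour described below.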

Coverage

  • i32.load, i32.load8_s, i32.load8_u, i32.load16_s, i32.load16_u
  • i32.store, i32.store8, i32.store16

For stores, the fold is applied conservatively: the value-pusher must be a (0, 1) op (I32Const / LocalGet / GlobalGet) so that wasm_ops[idx-2] is reliably the address-pushing op with no intermediate stack consumption. Complex-value stores (e.g., a value computed via i32.add) intentionally do not fold, to keep the splice logic simple; this case is covered by test_issue_95_no_fold_when_value_is_complex_expression.
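The (0, 1) precondition can be sketched with stack effects written as (pops, pushes) pairs. The string-keyed table below is purely illustrative; the crate's actual op representation differs:

```rust
/// Stack effect (pops, pushes) of a few wasm ops; illustrative subset only.
fn stack_effect(op: &str) -> (u32, u32) {
    match op {
        "i32.const" | "local.get" | "global.get" => (0, 1),
        "i32.add" => (2, 1),
        "i32.store" => (2, 0),
        _ => (0, 0),
    }
}

/// A store at `idx` may fold only when the value-pusher at `idx - 1` is a
/// pure (0, 1) op, so the op at `idx - 2` is still the address I32Const.
fn store_value_pusher_is_simple(ops: &[&str], idx: usize) -> bool {
    idx >= 2 && stack_effect(ops[idx - 1]) == (0, 1)
}
```

A value computed via i32.add is a (2, 1) op, so it fails the check and the store keeps the existing materialization path.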

Before / after — canonical sequence

(i32.const 0x100) (i32.load offset=8):

| | ARM ops emitted (body; prologue/epilogue elided) | Bytes |
| --- | --- | --- |
| Before | `MOVW r3, #0x100; MOVT r3, #0; ADD ip, r3, #8; LDR.W r3, [r11, ip]` | ≥ 10 |
| After | `LDR.W r3, [r11, #0x108]` | 4 |

(i32.const 0x10000) (i32.load offset=8) (effective offset > 4095) — falls back to MOVW+MOVT+indexed LDR. Verified by const_addr_load_falls_back_when_offset_too_large.

Tests added

Unit tests in crates/synth-synthesis/src/instruction_selector.rs:

  • test_issue_95_const_addr_load_folds_to_base_offset
  • test_issue_95_const_addr_load_falls_back_when_offset_too_large
  • test_issue_95_const_addr_store_folds_to_base_offset
  • test_issue_95_const_addr_subword_loads_fold
  • test_issue_95_const_addr_subword_stores_fold
  • test_issue_95_no_fold_when_value_is_complex_expression

Integration tests in crates/synth-backend/tests/issue_95_const_addr_load.rs (drive the real encoder):

  • canonical_const_addr_load_drops_from_10_to_4_bytes
  • canonical_load_before_vs_after_byte_count (encoder-validated byte counts)
  • const_addr_load_falls_back_when_offset_too_large
  • canonical_const_addr_store_folds
  • const_addr_subword_loads_fold

Test plan

  • cargo test --workspace — green (all suites pass; no regressions)
  • cargo clippy -p synth-synthesis -p synth-backend --all-targets -- -D warnings — clean
  • cargo fmt --check -p synth-synthesis -p synth-backend — clean
  • Fold verified to reduce canonical sequence to 4 bytes via ArmEncoder
  • Fall-back path verified for (C as u32).wrapping_add(O) > 4095

🤖 Generated with Claude Code

@codecov

codecov Bot commented May 10, 2026

Codecov Report

❌ Patch coverage is 96.19048% with 12 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
crates/synth-synthesis/src/instruction_selector.rs 96.19% 12 Missing ⚠️


@avrabe avrabe force-pushed the feat/issue-95-const-addr-load branch from 6bba829 to 69210b1 Compare May 11, 2026 03:37
avrabe added a commit that referenced this pull request May 11, 2026
Unblocks every open PR (#96, #97, #99, #100, #101) that was failing on `No space left on device` during z3-sys's C++ build on smithy runners.
@avrabe avrabe force-pushed the feat/issue-95-const-addr-load branch from 69210b1 to 9630fef Compare May 11, 2026 05:30
Detects the wasm pattern `i32.const C; i32.{load,store}{,8,16}{_s,_u}
offset=O` in `select_with_stack` and lowers it to a single
`LDR/STR rd, [R11, #(C+O)]` (4 bytes) when the effective offset fits in
the Thumb-2 imm12 range (0..=4095) and bounds checking is disabled
(bare-metal default). Replaces the previous `MOVW + MOVT + LDR.W`
sequence (10 bytes) with a single 4-byte load — saving 6 bytes per
constant-address access.

Per the issue, this accounts for ~20% of the gale-ffi size delta vs
LLVM-LTO. The fold is target-aware and stays in the instruction
selector to avoid inventing a synthetic wasm op.

Constraints:
- Effective offset `(C as u32).wrapping_add(O)` must be <= 4095.
- Only applied when `BoundsCheckConfig::None` (the bare-metal default).
- For stores, the value-pusher at idx-1 must be a (0,1) op
  (I32Const / LocalGet / GlobalGet) so the address-pusher at idx-2 is
  reliably an I32Const without intermediate stack effects.
- Falls back to the existing materialization path when any precondition
  fails — confirmed by `const_addr_load_falls_back_when_offset_too_large`.

Coverage:
- I32Load + I32Load8S/U + I32Load16S/U
- I32Store + I32Store8 + I32Store16

Tests added:
- `instruction_selector::tests::test_issue_95_*` (6 unit tests)
- `synth-backend/tests/issue_95_const_addr_load.rs` (5 integration tests
  including before/after byte-count comparison via the real encoder)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@avrabe avrabe force-pushed the feat/issue-95-const-addr-load branch from 9630fef to 10bd36b Compare May 11, 2026 17:42
@avrabe avrabe merged commit a32800f into main May 11, 2026
8 checks passed
@avrabe avrabe deleted the feat/issue-95-const-addr-load branch May 11, 2026 18:16

Development

Successfully merging this pull request may close these issues.

wasm linear-memory access lowering: emit base+offset instead of movw+movt+ldr for constant addresses