feat(opt): fold constant address into base+offset for memory ops (#95) #96
Merged
Detects the wasm pattern `i32.const C; i32.{load,store}{,8,16}{_s,_u}
offset=O` in `select_with_stack` and lowers it to a single
`LDR/STR rd, [R11, #(C+O)]` (4 bytes) when the effective offset fits in
the Thumb-2 imm12 range (0..=4095) and bounds checking is disabled
(bare-metal default). Replaces the previous `MOVW + MOVT + LDR.W`
sequence (10 bytes) with a single 4-byte load — saving 6 bytes per
constant-address access.
Per the issue, this accounts for ~20% of the gale-ffi size delta vs
LLVM-LTO. The fold is target-aware and stays in the instruction
selector to avoid inventing a synthetic wasm op.
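As a rough illustration of how such a peephole sits in a selection loop, here is a minimal sketch. The `Op` enum, the `select` function, and the emitted strings are hypothetical stand-ins for exposition only; the real `select_with_stack` in synth is more involved.

```rust
// Hypothetical shape of the peephole inside a selection loop. `Op` and the
// emitted strings are illustrative stand-ins, not the real synth API.
#[derive(Clone, Copy)]
enum Op {
    I32Const(i32),
    I32Load { offset: u32 },
}

fn select(ops: &[Op], out: &mut Vec<String>) {
    let mut idx = 0;
    while idx < ops.len() {
        match (ops.get(idx), ops.get(idx + 1)) {
            // Fold `i32.const C; i32.load offset=O` when C+O fits in imm12.
            (Some(&Op::I32Const(c)), Some(&Op::I32Load { offset: o }))
                if (c as u32).wrapping_add(o) <= 0xFFF =>
            {
                let eff = (c as u32).wrapping_add(o);
                out.push(format!("LDR r3, [r11, #{eff:#x}]")); // single 4-byte load
                idx += 2; // consumed both wasm ops
            }
            _ => {
                // Generic path: materialize the address (MOVW+MOVT+indexed LDR).
                out.push("generic".to_string());
                idx += 1;
            }
        }
    }
}
```

Because the fold consumes two wasm ops at once, the fallback arm advances by one op at a time, which is what keeps the existing materialization path intact when any precondition fails.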
Constraints:
- Effective offset `(C as u32).wrapping_add(O)` must be <= 4095.
- Only applied when `BoundsCheckConfig::None` (the bare-metal default).
- For stores, the value-pusher at idx-1 must be a (0,1) op
(I32Const / LocalGet / GlobalGet) so the address-pusher at idx-2 is
reliably an I32Const without intermediate stack effects.
- Falls back to the existing materialization path when any precondition
fails — confirmed by `const_addr_load_falls_back_when_offset_too_large`.
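The first two preconditions can be sketched as a small predicate. The types below are simplified, hypothetical stand-ins; the real `try_fold_const_addr` takes the selector's state and also checks that `wasm_ops[idx-1]` is the `I32Const`.

```rust
// Sketch of the fold predicate under simplified types; the real helper
// lives in instruction_selector.rs and inspects the wasm op stream.
const IMM12_MAX: u32 = 0xFFF; // Thumb-2 LDR/STR immediate-offset range

/// Illustrative stand-in for the real bounds-check configuration.
#[derive(PartialEq)]
enum BoundsCheckConfig {
    None,
    Checked,
}

/// Returns the folded effective offset when every precondition holds.
fn try_fold_const_addr(c: i32, o: u32, bounds: &BoundsCheckConfig) -> Option<u32> {
    if *bounds != BoundsCheckConfig::None {
        return None; // fold is only sound without bounds checking
    }
    let effective = (c as u32).wrapping_add(o);
    (effective <= IMM12_MAX).then_some(effective)
}
```

Using `wrapping_add` means a constant near `u32::MAX` simply wraps and is rejected by the range check rather than panicking in debug builds.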
Coverage:
- I32Load + I32Load8S/U + I32Load16S/U
- I32Store + I32Store8 + I32Store16
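The `(0,1)` value-pusher constraint from the store fold above can be sketched as a stack-effect check. `WasmOp` and both functions are simplified stand-ins for illustration, not synth's real types.

```rust
// Sketch of the (0,1) value-pusher check used by the store fold.
// `WasmOp` is a simplified stand-in for the real wasm-op type.
enum WasmOp {
    I32Const(i32),
    LocalGet(u32),
    GlobalGet(u32),
    I32Add,
    I32Load { offset: u32 },
}

/// Stack effect of an op as (values popped, values pushed).
fn stack_effect(op: &WasmOp) -> (u32, u32) {
    match op {
        WasmOp::I32Const(_) | WasmOp::LocalGet(_) | WasmOp::GlobalGet(_) => (0, 1),
        WasmOp::I32Add => (2, 1),
        WasmOp::I32Load { .. } => (1, 1),
    }
}

/// A (0,1) pusher leaves the ops below it untouched, so when the store's
/// value-pusher is (0,1), the op before it is reliably the address const.
fn value_pusher_is_simple(op: &WasmOp) -> bool {
    stack_effect(op) == (0, 1)
}
```

An `i32.add`-produced value fails this check because it consumes operands from the stack, which is exactly why complex-value stores fall back instead of folding.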
Tests added:
- `instruction_selector::tests::test_issue_95_*` (6 unit tests)
- `synth-backend/tests/issue_95_const_addr_load.rs` (5 integration tests
including before/after byte-count comparison via the real encoder)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes #95.
Summary
When wasm code emits `i32.const C; i32.load offset=O` (the typical pattern for reads of merged statics in `.data`/`.rodata` after meld linking), synth was producing a three-instruction, ten-byte MOVW+MOVT+LDR.W sequence. This PR folds the const-address pattern at instruction-selection time, emitting a single Thumb-2 immediate-offset load.
For the canonical `(i32.const 0x100) (i32.load offset=8)` benchmark from the issue, the body shrinks from ~10 bytes (MOVW+MOVT+indexed LDR) to 4 bytes — a 6-byte saving per access. Per #95, this accounts for ~20% of the gale-ffi size delta vs LLVM-LTO, dominated by hot-path scalar loads like `sem->count`/`sem->limit`.

Where the fold lives
`crates/synth-synthesis/src/instruction_selector.rs` — `select_with_stack()`. Done at the target-aware lowering layer rather than as a wasm-IR rewrite, so we don't have to invent a synthetic op. Two helpers added:

- `try_fold_const_addr` — load fold predicate (checks `wasm_ops[idx-1] == I32Const(C)`, `BoundsCheckConfig::None`, and `(C as u32).wrapping_add(O) <= 0xFFF`).
- `try_fold_const_addr_store` + `splice_out_addr_const_materialization` — store fold for `i32.const ADDR; <(0,1) value-pusher>; i32.store`. Splices the addr-const instructions out of the tail while preserving the value chunk that sits on top.

Coverage
`i32.load`, `i32.load8_s`, `i32.load8_u`, `i32.load16_s`, `i32.load16_u`; `i32.store`, `i32.store8`, `i32.store16`.

For stores, the fold is applied conservatively: the value-pusher must be `(0, 1)` (I32Const/LocalGet/GlobalGet) so that `wasm_ops[idx-2]` is reliably the address-pushing op without intermediate stack consumption. Complex-value stores (e.g., value computed via `i32.add`) intentionally do not fold, to keep the splice logic simple — covered by `test_issue_95_no_fold_when_value_is_complex_expression`.

Before / after — canonical sequence
`(i32.const 0x100) (i32.load offset=8)`:

- Before: `MOVW r3, #0x100`; `MOVT r3, #0`; `ADD ip, r3, #8`; `LDR.W r3, [r11, ip]`
- After: `LDR.W r3, [r11, #0x108]`

`(i32.const 0x10000) (i32.load offset=8)` (effective offset > 4095) falls back to MOVW+MOVT+indexed LDR. Verified by `const_addr_load_falls_back_when_offset_too_large`.

Tests added

Unit tests in `crates/synth-synthesis/src/instruction_selector.rs`:

- `test_issue_95_const_addr_load_folds_to_base_offset`
- `test_issue_95_const_addr_load_falls_back_when_offset_too_large`
- `test_issue_95_const_addr_store_folds_to_base_offset`
- `test_issue_95_const_addr_subword_loads_fold`
- `test_issue_95_const_addr_subword_stores_fold`
- `test_issue_95_no_fold_when_value_is_complex_expression`

Integration tests in `crates/synth-backend/tests/issue_95_const_addr_load.rs` (drive the real encoder):

- `canonical_const_addr_load_drops_from_10_to_4_bytes`
- `canonical_load_before_vs_after_byte_count` (encoder-validated byte counts)
- `const_addr_load_falls_back_when_offset_too_large`
- `canonical_const_addr_store_folds`
- `const_addr_subword_loads_fold`

Test plan
- `cargo test --workspace` — green (all suites pass; no regressions)
- `cargo clippy -p synth-synthesis -p synth-backend --all-targets -- -D warnings` — clean
- `cargo fmt --check -p synth-synthesis -p synth-backend` — clean
- Fallback path (effective offset `(C as u32).wrapping_add(O) > 4095`) exercised against the real `ArmEncoder`

🤖 Generated with Claude Code