diff --git a/.claude/knowledge/w1a-simd-integration-plan.md b/.claude/knowledge/w1a-simd-integration-plan.md
new file mode 100644
index 00000000..49deecea
--- /dev/null
+++ b/.claude/knowledge/w1a-simd-integration-plan.md
@@ -0,0 +1,111 @@
+# W1a SIMD Primitives — Integration Plan & Per-Agent Assignment
+
+> Date: 2026-05-26 · Branch: `claude/splat3d-cpu-simd-renderer-MAOO0`
+> Companions: `vertical-simd-consumer-contract.md` (the W1a spec),
+> `simd-dispatch-architecture.md` (the dispatch model).
+> Purpose: pin assignments, sequencing, and the **brutal-surprise register**
+> so the parallel draft → Opus integrate → PR → bot-loop pipeline lands
+> without a SIGILL / won't-compile / unrunnable-test surprise.
+
+## 0. Binding constraints (from this session)
+
+- **Compile-time dispatch ONLY.** New primitives surface through `simd.rs`
+  via `#[cfg(target_feature=…)]` / `#[cfg(target_arch=…)]` arms — the existing
+  cascade (nightly-simd → avx512f → avx2 → aarch64 → scalar). **No runtime
+  dispatch**: no `LazyLock<CpuCaps>`, no `is_*_feature_detected!`, no runtime
+  `simd_caps()` routing. Runtime-dispatch versions are **DEFERRED** (was Phase 5).
+- **Scalar is the mandatory correctness anchor** (W1a acceptance #1).
+- **One PR**, per-workstream commits; subscribe to PR activity for the
+  CodeRabbit/Codex fix loop.
+- **Clear, file-disjoint agent boundaries** (no two agents touch the same file).
+
+## 1. Per-agent assignment (file-exclusive slots)
+
+| Agent | Exclusive files | Scope | Status |
+|---|---|---|---|
+| **W1** | `src/simd.rs`, `simd_ops.rs`, `simd_int_ops.rs`, `simd_avx512.rs`, `simd_avx2.rs`, `simd_neon.rs`, `simd_scalar.rs` | 5 W1a primitives (below) | drafting |
+| **W2** | `src/hpc/activations.rs`, `src/hpc/reductions.rs` | axis variants (`softmax_axis_f32`, `log_softmax_axis_f32`, maybe `sum_axis_f32`) | drafting |
+| **W3** | `src/hpc/soa.rs` | P2 polish: `#[inline]`, `Clone/Debug`, `iter_rows()`, `SoaBatch?` | **done** (no `SoaBatch` — not in design doc) |
+| **W4** | `src/hpc/bulk.rs` | P2: un-gate integration test, `bulk_for_each` (+ deprecated `bulk_scan` alias), `#[inline]` | drafting |
+
+No file appears in two rows → zero write-collision by construction.
+
+### W1a primitive sub-slots (all in W1's file set, one agent, sequential)
+1. `I8x16::from_i4_packed_u64` + `lane_i8::<N>` + free `batch_packed_i4_16`.
+2. `I8x16::saturating_abs` + `I8x32::saturating_abs` (VPABSB correction).
+3. `U16x8::gather_u16` + free `palette_lookup_u8x8`.
+4. free `prefetch_read_t0/t1/t2` (sanctioned free-fn exception — hints).
+5. `U64x8::popcnt` + `xor_popcount` + `U64x4::popcnt`.
+
+## 2. Dependency / entanglement map
+
+- **Missing wrapper types (W1 must define, not assume):** the parity matrix in
+  `simd-dispatch-architecture.md §4` shows `I8x16` and `U16x8` are **aarch64-native
+  narrow only** — they have **no x86 home**. `U8x8` (palette_lookup output) likewise.
+  W1a-#1/#2/#3 therefore require W1 to FIRST add minimal `I8x16`/`U16x8`/`U8x8`
+  definitions to `simd_avx512.rs` + `simd_avx2.rs` + `simd_scalar.rs` (native
+  `__m128i` on SSE2 where sane, else scalar-storage matching the existing `🟠`
+  polyfill pattern). This is W1a's own work — **not** blocked on TD-SIMD-2.
+- `U64x8` exists as a `🟠` scalar polyfill on the AVX2 default and native on
+  AVX-512 → `popcnt` is implementable on both today (no new type needed).
+- W2/W3/W4 are independent of W1 and of each other (only `hpc/mod.rs` is shared,
+  and it already declares `soa`/`bulk` — no edit needed).
+- **Downstream:** the 5 lance-graph consumer PRs wait on this PR merging.
+
+## 3. BRUTAL-SURPRISE REGISTER (read before integrating)
+
+1. **Only ONE backend compiles per cargo config.** Default `.cargo/config.toml`
+   is `x86-64-v3` (AVX2) → the `#[cfg(target_feature="avx512f")]` arms and all
+   `_mm512_*` code are **not even compiled** by default `cargo check`. ⇒ Opus
+   MUST compile-check BOTH configs: default **and**
+   `--config .cargo/config-avx512.toml`. A green default build does NOT prove
+   the AVX-512 intrinsics compile.
+2. **AVX-512 binaries SIGILL on non-AVX-512 silicon** (TD-SIMD-1). We can
+   *compile* the avx512 config here but may not be able to *run* its tests on
+   this runner. Treat avx512 as compile-checked, run-deferred to a capable CI job.
+3. **"All backends agree" is NOT testable in one binary** under compile-time
+   dispatch (the other backends' types are `#[cfg]`-ed out). ⇒ Parity tests
+   assert the **active compiled backend == an inline plain-Rust scalar reference**
+   computed in the test itself. Cross-backend agreement is then guaranteed by
+   running the same test suite under EACH config (v3 / avx512 / aarch64-cross /
+   nightly-simd), not by comparing two backends in one process.
+4. **VPABSB does not saturate `i8::MIN`** — binding. AVX-512 must be
+   `_mm512_min_epu8(_mm512_abs_epi8(x), set1(0x7f))`; NEON `vqabsq_s8`; scalar
+   `i8::saturating_abs`. The `saturating_abs(i8::MIN)==i8::MAX` test is mandatory.
+5. **`gather` has no NEON instruction** → scalar loop on aarch64 is the correct
+   impl, not a stopgap. Bounds-validate `max(idx) < table.len()` before any x86
+   gather (debug panic; release scalar-safe).
+6. **`prefetch` on an invalid ptr is allowed** (silent CPU drop) — no `assert!`.
+   `_mm_prefetch` on x86; `prfm pld…` asm on aarch64; **no-op** elsewhere.
+7. **`avx512vpopcntdq` is a sub-feature** of avx512f. `_mm512_popcnt_epi64`
+   needs `#[cfg(target_feature="avx512vpopcntdq")]`; provide the
+   VPSHUFB/Mula fallback for plain-avx512f and the byte-LUT path otherwise.
+8. **Doc-examples are doctests** (CodeRabbit enforces `///` + example on every
+   `pub fn`). They must compile on the DEFAULT config — keep examples backend-
+   agnostic (call the `crate::simd::*` surface, not a specific backend type).
+
+## 4. Central verification gates (Opus, once, shared target/)
+
+In order, after all drafts land:
+1. `cargo fmt --all`
+2. `cargo clippy -p ndarray --all-targets` (default v3) → `-D warnings`
+3. `cargo test -p ndarray` (default v3: exercises AVX2 + scalar arms + doctests)
+4. `cargo clippy --config .cargo/config-avx512.toml -p ndarray` (compile-check
+   the AVX-512 arms — surprise #1/#2). Run its tests only if the runner has
+   AVX-512; otherwise note "compile-checked, run-deferred".
+5. (best-effort) `cargo check --target aarch64-unknown-linux-gnu` if a cross
+   toolchain is present — else rely on the NEON CI job.
+
+## 5. Integration order
+
+W3 (done) → fold W2/W4 (low-risk, isolated) → fold W1 last and hardest:
+reconcile missing-type definitions, fix every `// UNVERIFIED:`, run gate §4
+(both configs), then commit per-workstream. PR. Subscribe. Bot-loop hardens.
+
+## 6. PR
+
+Single PR off `claude/splat3d-cpu-simd-renderer-MAOO0`. Body cites each W1a
+consumer site (`lance-graph:crates/lance-graph-contract/src/mul.rs`,
+`bgz17/src/simd.rs`, `holograph/hamming.rs`) per acceptance #7, and states the
+compile-time-only / runtime-deferred posture explicitly so reviewers don't flag
+the absent runtime dispatch as a gap.
diff --git a/src/hpc/activations.rs b/src/hpc/activations.rs
index 77a01973..709dc8ae 100644
--- a/src/hpc/activations.rs
+++ b/src/hpc/activations.rs
@@ -4,7 +4,7 @@
 
 use crate::imp_prelude::*;
 use crate::simd::{simd_exp_f32, F32x16};
-use crate::{ArrayView, ArrayView1, ArrayViewMut, ArrayViewMut1, Dimension, Zip};
+use crate::{ArrayView, ArrayView1, ArrayView2, ArrayViewMut, ArrayViewMut1, ArrayViewMut2, Axis, Dimension, Zip};
 use num_traits::Float;
 
 /// Neural network activation functions.
@@ -334,6 +334,125 @@ fn log_softmax_f32_scalar(x: ArrayView1<f32>, mut out: ArrayViewMut1<f32>) {
     }
 }
 
+// ═══════════════════════════════════════════════════════════════════
+// Axis-aware 2-D variants
+//
+// These operate on a 2-D `ArrayView2<f32>` and apply the corresponding
+// 1-D kernel along each lane parallel to the chosen `Axis`. They reuse
+// the existing per-lane SIMD primitives (`softmax_f32` / `log_softmax_f32`)
+// so the SIMD fast path fires for each contiguous lane.
+//
+// Iteration uses `lanes(axis)` / `lanes_mut(axis)` — NOT `axis_iter` —
+// per the W2 migration contract (see `.claude/knowledge/
+// w2-arrayview-migration.md` § "Axis-aware reduction"). `lanes(Axis(k))`
+// yields 1-D lanes ALONG axis k; `axis_iter(Axis(k))` slices perpendicular
+// to axis k and is the wrong primitive here.
+// ═══════════════════════════════════════════════════════════════════
+
+/// Axis-aware softmax over a 2-D `f32` matrix: each lane along `axis`
+/// is normalized independently.
+///
+/// `out[i, j] = exp(x[i,j] - max_lane) / Σ exp(x[i,:] - max_lane)` for
+/// `axis = Axis(1)` (normalize each row); for `axis = Axis(0)` each
+/// column is normalized instead.
+///
+/// Numerically stable: max-shift is applied per lane before exponentiation.
+/// The SIMD fast path (via `softmax_f32`) fires when the individual lane is
+/// contiguous in memory (typically true for axis 1 on a C-order matrix and
+/// axis 0 on an F-order matrix).
+///
+/// # Panics
+/// - Panics if `axis.index() >= 2` (out-of-bounds axis for a 2-D array).
+/// - Panics if `x.shape() != out.shape()`.
+///
+/// # Edge cases
+/// - An empty matrix (either dimension is 0) returns immediately.
+/// - A lane of length 0 is unreachable given the shape consistency check.
+/// - A single-element lane produces `out = 1.0`.
+///
+/// # Example
+/// ```
+/// use ndarray::{arr2, Array2, Axis};
+/// use ndarray::hpc::activations::softmax_axis_f32;
+///
+/// // 2×3 matrix, normalize each row (axis 1)
+/// let x = arr2(&[[1.0_f32, 2.0, 3.0],
+///                [0.0_f32, 0.0, 0.0]]);
+/// let mut out = Array2::<f32>::zeros((2, 3));
+/// softmax_axis_f32(x.view(), out.view_mut(), Axis(1));
+///
+/// // each row sums to 1.0
+/// assert!((out.row(0).sum() - 1.0_f32).abs() < 1e-5);
+/// assert!((out.row(1).sum() - 1.0_f32).abs() < 1e-5);
+/// // uniform inputs → equal probabilities
+/// assert!((out[[1, 0]] - 1.0 / 3.0).abs() < 1e-5);
+/// ```
+pub fn softmax_axis_f32(x: ArrayView2<f32>, mut out: ArrayViewMut2<f32>, axis: Axis) {
+    assert!(axis.index() < 2, "softmax_axis_f32: axis {} is out of bounds for a 2-D array", axis.index());
+    assert_eq!(x.shape(), out.shape(), "softmax_axis_f32: shape mismatch (x={:?} out={:?})", x.shape(), out.shape());
+    // `lanes(axis)` yields 1-D views ALONG `axis`; `lanes_mut(axis)` yields
+    // the corresponding mutable 1-D views of `out`. Zipping them visits every
+    // lane exactly once.
+    for (lane_in, lane_out) in x.lanes(axis).into_iter().zip(out.lanes_mut(axis)) {
+        softmax_f32(lane_in, lane_out);
+    }
+}
+
+/// Axis-aware log-softmax over a 2-D `f32` matrix: each lane along `axis`
+/// is independently normalized in log-space.
+///
+/// `out[i, j] = (x[i,j] - max_lane) - ln(Σ exp(x[i,:] - max_lane))` for
+/// `axis = Axis(1)` (per-row log-softmax); for `axis = Axis(0)` each
+/// column is processed instead.
+///
+/// Numerically stable: max-shift is applied per lane before exponentiation.
+/// All output values are ≤ 0, and `exp(out).sum_axis(axis)` ≈ 1.0 for each
+/// lane (modulo floating-point rounding).
+///
+/// The SIMD fast path fires when the individual lane is contiguous in memory
+/// (typically true for axis 1 on a C-order matrix).
+///
+/// # Panics
+/// - Panics if `axis.index() >= 2` (out-of-bounds axis for a 2-D array).
+/// - Panics if `x.shape() != out.shape()`.
+///
+/// # Edge cases
+/// - An empty matrix (either dimension is 0) returns immediately.
+/// - A single-element lane produces `out = 0.0` (log of 1.0).
+///
+/// # Example
+/// ```
+/// use ndarray::{arr2, Array2, Axis};
+/// use ndarray::hpc::activations::log_softmax_axis_f32;
+///
+/// // 2×3 matrix, log-softmax along rows (axis 1)
+/// let x = arr2(&[[1.0_f32, 2.0, 3.0],
+///                [0.0_f32, 0.0, 0.0]]);
+/// let mut out = Array2::<f32>::zeros((2, 3));
+/// log_softmax_axis_f32(x.view(), out.view_mut(), Axis(1));
+///
+/// // all outputs must be ≤ 0
+/// assert!(out.iter().all(|&v| v <= 0.0));
+/// // uniform row: all log-softmax values equal ln(1/3)
+/// let expected = (1.0_f32 / 3.0).ln();
+/// assert!((out[[1, 0]] - expected).abs() < 1e-5);
+/// assert!((out[[1, 1]] - expected).abs() < 1e-5);
+/// assert!((out[[1, 2]] - expected).abs() < 1e-5);
+/// ```
+pub fn log_softmax_axis_f32(x: ArrayView2<f32>, mut out: ArrayViewMut2<f32>, axis: Axis) {
+    assert!(axis.index() < 2, "log_softmax_axis_f32: axis {} is out of bounds for a 2-D array", axis.index());
+    assert_eq!(
+        x.shape(),
+        out.shape(),
+        "log_softmax_axis_f32: shape mismatch (x={:?} out={:?})",
+        x.shape(),
+        out.shape()
+    );
+    for (lane_in, lane_out) in x.lanes(axis).into_iter().zip(out.lanes_mut(axis)) {
+        log_softmax_f32(lane_in, lane_out);
+    }
+}
+
 #[cfg(test)]
 mod tests {
     use super::*;
@@ -590,4 +709,161 @@ mod tests {
         let mut out = Array1::<f32>::zeros(5);
         log_softmax_f32(x.view(), out.view_mut());
     }
+
+    // ── softmax_axis_f32 ────────────────────────────────────────────
+
+    #[test]
+    fn test_softmax_axis1_rows_sum_to_one() {
+        // Hand-computed: 2×3, normalize rows (axis 1)
+        // Row 0: [1, 2, 3] → exp-shifted by 3 → [e^-2, e^-1, 1] → normalize
+        let x = arr2(&[[1.0_f32, 2.0, 3.0], [4.0_f32, 4.0, 4.0]]);
+        let mut out = Array2::<f32>::zeros((2, 3));
+        softmax_axis_f32(x.view(), out.view_mut(), Axis(1));
+
+        // Each row sums to 1.0
+        assert!((out.row(0).sum() - 1.0).abs() < 1e-5, "row 0 sum = {}", out.row(0).sum());
+        assert!((out.row(1).sum() - 1.0).abs() < 1e-5, "row 1 sum = {}", out.row(1).sum());
+
+        // Uniform row → equal probabilities
+        assert!((out[[1, 0]] - 1.0 / 3.0).abs() < 1e-5, "uniform: out[1,0] = {}", out[[1, 0]]);
+        assert!((out[[1, 1]] - 1.0 / 3.0).abs() < 1e-5, "uniform: out[1,1] = {}", out[[1, 1]]);
+        assert!((out[[1, 2]] - 1.0 / 3.0).abs() < 1e-5, "uniform: out[1,2] = {}", out[[1, 2]]);
+
+        // Monotone row → monotone output
+        assert!(out[[0, 0]] < out[[0, 1]] && out[[0, 1]] < out[[0, 2]]);
+    }
+
+    #[test]
+    fn test_softmax_axis0_cols_sum_to_one() {
+        // Normalize columns (axis 0): 2×3 matrix
+        // Col 0: [1, 4], Col 1: [2, 4], Col 2: [3, 4]
+        let x = arr2(&[[1.0_f32, 2.0, 3.0], [4.0_f32, 4.0, 4.0]]);
+        let mut out = Array2::<f32>::zeros((2, 3));
+        softmax_axis_f32(x.view(), out.view_mut(), Axis(0));
+
+        // Each column sums to 1.0
+        for j in 0..3 {
+            let col_sum: f32 = out.column(j).sum();
+            assert!((col_sum - 1.0).abs() < 1e-5, "col {} sum = {}", j, col_sum);
+        }
+        // All outputs are non-negative
+        assert!(out.iter().all(|&v| v >= 0.0));
+    }
+
+    #[test]
+    fn test_softmax_axis1_correctness_hand_computed() {
+        // 1×4: softmax_axis_f32 with axis 1 must match softmax_f32 on the single row
+        let x = arr2(&[[0.0_f32, 1.0, 2.0, 3.0]]);
+        let mut out_axis = Array2::<f32>::zeros((1, 4));
+        softmax_axis_f32(x.view(), out_axis.view_mut(), Axis(1));
+
+        let row = arr1(&[0.0_f32, 1.0, 2.0, 3.0]);
+        let mut out_1d = Array1::<f32>::zeros(4);
+        softmax_f32(row.view(), out_1d.view_mut());
+
+        for j in 0..4 {
+            assert!(
+                (out_axis[[0, j]] - out_1d[j]).abs() < 1e-6,
+                "j={}: axis={} vs 1d={}",
+                j,
+                out_axis[[0, j]],
+                out_1d[j]
+            );
+        }
+    }
+
+    #[test]
+    #[should_panic(expected = "out of bounds")]
+    fn test_softmax_axis_oob_panics() {
+        let x = arr2(&[[1.0_f32, 2.0], [3.0_f32, 4.0]]);
+        let mut out = Array2::<f32>::zeros((2, 2));
+        softmax_axis_f32(x.view(), out.view_mut(), Axis(2));
+    }
+
+    #[test]
+    #[should_panic(expected = "shape mismatch")]
+    fn test_softmax_axis_shape_mismatch_panics() {
+        let x = arr2(&[[1.0_f32, 2.0, 3.0], [4.0_f32, 5.0, 6.0]]);
+        let mut out = Array2::<f32>::zeros((2, 2));
+        softmax_axis_f32(x.view(), out.view_mut(), Axis(1));
+    }
+
+    // ── log_softmax_axis_f32 ────────────────────────────────────────
+
+    #[test]
+    fn test_log_softmax_axis1_all_nonpositive() {
+        let x = arr2(&[[1.0_f32, 2.0, 3.0], [4.0_f32, 4.0, 4.0]]);
+        let mut out = Array2::<f32>::zeros((2, 3));
+        log_softmax_axis_f32(x.view(), out.view_mut(), Axis(1));
+
+        assert!(out.iter().all(|&v| v <= 0.0), "log-softmax outputs must be ≤ 0");
+    }
+
+    #[test]
+    fn test_log_softmax_axis1_uniform_row() {
+        // Uniform row [0,0,0]: log-softmax = ln(1/3) for each element
+        let x = arr2(&[[0.0_f32, 0.0, 0.0], [1.0_f32, 2.0, 3.0]]);
+        let mut out = Array2::<f32>::zeros((2, 3));
+        log_softmax_axis_f32(x.view(), out.view_mut(), Axis(1));
+
+        let expected = (1.0_f32 / 3.0).ln();
+        for j in 0..3 {
+            assert!((out[[0, j]] - expected).abs() < 1e-5, "out[0,{}] = {} expected {}", j, out[[0, j]], expected);
+        }
+    }
+
+    #[test]
+    fn test_log_softmax_axis0_cols_nonpositive() {
+        // Normalize columns (axis 0)
+        let x = arr2(&[[1.0_f32, 0.0, 3.0], [2.0_f32, 0.0, 1.0]]);
+        let mut out = Array2::<f32>::zeros((2, 3));
+        log_softmax_axis_f32(x.view(), out.view_mut(), Axis(0));
+
+        assert!(out.iter().all(|&v| v <= 0.0), "log-softmax outputs must be ≤ 0");
+        // exp(out) along axis 0 must sum to ~1 per column
+        for j in 0..3 {
+            let exp_sum: f32 = out.column(j).mapv(f32::exp).sum();
+            assert!((exp_sum - 1.0).abs() < 1e-5, "col {} exp-sum = {}", j, exp_sum);
+        }
+    }
+
+    #[test]
+    fn test_log_softmax_axis1_consistency_with_log_softmax_f32() {
+        // log_softmax_axis_f32 on each row must match log_softmax_f32 applied directly
+        let x = arr2(&[[2.0_f32, 1.0, 0.1, -1.0], [0.5_f32, 0.5, 0.5, 0.5]]);
+        let mut out_axis = Array2::<f32>::zeros((2, 4));
+        log_softmax_axis_f32(x.view(), out_axis.view_mut(), Axis(1));
+
+        for i in 0..2 {
+            let row = x.row(i).to_owned();
+            let mut out_1d = Array1::<f32>::zeros(4);
+            log_softmax_f32(row.view(), out_1d.view_mut());
+            for j in 0..4 {
+                assert!(
+                    (out_axis[[i, j]] - out_1d[j]).abs() < 1e-5,
+                    "row={} j={}: axis={} vs 1d={}",
+                    i,
+                    j,
+                    out_axis[[i, j]],
+                    out_1d[j]
+                );
+            }
+        }
+    }
+
+    #[test]
+    #[should_panic(expected = "out of bounds")]
+    fn test_log_softmax_axis_oob_panics() {
+        let x = arr2(&[[1.0_f32, 2.0], [3.0_f32, 4.0]]);
+        let mut out = Array2::<f32>::zeros((2, 2));
+        log_softmax_axis_f32(x.view(), out.view_mut(), Axis(3));
+    }
+
+    #[test]
+    #[should_panic(expected = "shape mismatch")]
+    fn test_log_softmax_axis_shape_mismatch_panics() {
+        let x = arr2(&[[1.0_f32, 2.0, 3.0], [4.0_f32, 5.0_f32, 6.0_f32]]);
+        let mut out = Array2::<f32>::zeros((3, 2));
+        log_softmax_axis_f32(x.view(), out.view_mut(), Axis(1));
+    }
 }
diff --git a/src/hpc/blocked_grid/tests.rs b/src/hpc/blocked_grid/tests.rs
index 3146e387..76b7e838 100644
--- a/src/hpc/blocked_grid/tests.rs
+++ b/src/hpc/blocked_grid/tests.rs
@@ -7,7 +7,7 @@
 //! Test groups
 //! -----------
 //! 1. W4 bulk_apply composition  — map_l1 composes with `hpc::bulk::bulk_apply`
-//!    and `bulk_scan` over per-row slices inside the closure.
+//!    and `bulk_for_each` over per-row slices inside the closure.
 //! 2. L1→L2 cascade              — 256×256 ShaderMantissaGrid map_l1 populates
 //!    cell-by-cell, then map_l2 aggregates per super-block.
 //! 3. Half-square AMX INT8       — AmxInt8Grid::new(32, 128), blocks_base coords.
@@ -22,24 +22,24 @@ use crate::hpc::blocked_grid::{
 };
 
 // ============================================================
-// 1. W4 bulk_apply / bulk_scan composition
+// 1. W4 bulk_apply / bulk_for_each composition
 //
 // Demonstrates that PR-X3's map_l1 composes with the W4 primitives:
 //   - outer loop = map_l1 (one closure per 64×64 base block)
-//   - inner loop = bulk_apply / bulk_scan over each row slice in the block
+//   - inner loop = bulk_apply / bulk_for_each over each row slice in the block
 //
 // This proves the two design layers nest without either re-implementing the
 // other's chunking logic.
 // ============================================================
 
-/// map_l1 closure using bulk_scan to read each row and compute a per-row sum,
-/// storing it into the first cell of the corresponding output row.
+/// map_l1 closure using bulk_for_each to read each row and compute a per-row
+/// sum, storing it into the first cell of the corresponding output row.
 ///
-/// Demonstrates: bulk_scan(row_slice, chunk_size, closure) correctly
+/// Demonstrates: bulk_for_each(row_slice, chunk_size, closure) correctly
 /// accumulates the sum; no re-implemented chunking inside map_l1.
 #[test]
-fn w4_bulk_scan_inside_map_l1_row_sum() {
-    use crate::hpc::bulk::bulk_scan;
+fn w4_bulk_for_each_inside_map_l1_row_sum() {
+    use crate::hpc::bulk::bulk_for_each;
 
     // Build a 64×64 grid filled with known values.
     let mut g = BlockedGrid::<u64>::new(64, 64);
@@ -50,13 +50,13 @@ fn w4_bulk_scan_inside_map_l1_row_sum() {
         }
     }
 
-    // map_l1: for each block row, use bulk_scan to compute the row sum and
+    // map_l1: for each block row, use bulk_for_each to compute the row sum and
     // store it in the first cell of the output row.
     let out = g.map_l1::<u64, _>(|inp, outp| {
         for r in 0..64 {
             let row = inp.row(r);
             let mut row_sum = 0u64;
-            bulk_scan(row, 16, |chunk, _start| {
+            bulk_for_each(row, 16, |chunk, _start| {
                 row_sum += chunk.iter().sum::<u64>();
             });
             outp.row_mut(r)[0] = row_sum;
diff --git a/src/hpc/bulk.rs b/src/hpc/bulk.rs
index e0c3cb59..fd753b5d 100644
--- a/src/hpc/bulk.rs
+++ b/src/hpc/bulk.rs
@@ -5,7 +5,8 @@
 //! (chunk_size matched to L1 working-set) or when staging chunks to SoA for
 //! SIMD processing inside the closure.
 //!
-//! [`bulk_scan`] is the read-only sibling for non-mutating traversal.
+//! [`bulk_for_each`] is the read-only sibling for non-mutating traversal.
+//! [`bulk_scan`] is a deprecated alias for [`bulk_for_each`].
 //!
 //! Both helpers are scalar wrappers — no `#[target_feature]`, no per-arch
 //! dispatch. They are user-level code per the layering rule in
@@ -30,7 +31,7 @@
 //!     .map(|i| Item { a: i as f32, b: (i * 2) as f32, c: (i * 3) as f32 })
 //!     .collect();
 //! bulk_apply(&mut items, 16, |chunk, _start| {
-//!     let soa = aos_to_soa::<_, _, 3, _>(chunk, |it| [it.a, it.b, it.c]);
+//!     let soa = aos_to_soa::<_, f32, 3, _>(chunk, |it| [it.a, it.b, it.c]);
 //!     // ... per-field SIMD-style loops over soa.field(0), soa.field(1), ...
 //!     let _ = soa;
 //! });
@@ -65,6 +66,7 @@
 /// });
 /// assert_eq!(v, vec![0, 10, 20, 30, 40, 50, 60, 70, 80, 90]);
 /// ```
+#[inline]
 pub fn bulk_apply<T, F>(items: &mut [T], chunk_size: usize, mut f: F)
 where
     F: FnMut(&mut [T], usize),
@@ -89,19 +91,20 @@ where
 ///
 /// # Example
 /// ```
-/// use ndarray::hpc::bulk::bulk_scan;
+/// use ndarray::hpc::bulk::bulk_for_each;
 /// let v: Vec<i32> = (0..10).collect();
 /// let mut sum = 0i32;
-/// bulk_scan(&v, 4, |chunk, _start| {
+/// bulk_for_each(&v, 4, |chunk, _start| {
 ///     sum += chunk.iter().sum::<i32>();
 /// });
 /// assert_eq!(sum, 45);
 /// ```
-pub fn bulk_scan<T, F>(items: &[T], chunk_size: usize, mut f: F)
+#[inline]
+pub fn bulk_for_each<T, F>(items: &[T], chunk_size: usize, mut f: F)
 where
     F: FnMut(&[T], usize),
 {
-    assert!(chunk_size > 0, "bulk_scan: chunk_size must be > 0");
+    assert!(chunk_size > 0, "bulk_for_each: chunk_size must be > 0");
     let mut start = 0;
     for chunk in items.chunks(chunk_size) {
         let n = chunk.len();
@@ -110,6 +113,32 @@ where
     }
 }
 
+/// Deprecated alias for [`bulk_for_each`].
+///
+/// Use [`bulk_for_each`] instead. This alias exists only to avoid breaking
+/// callers from before the rename and will be removed in a future release.
+///
+/// # Example
+/// ```
+/// #[allow(deprecated)]
+/// use ndarray::hpc::bulk::bulk_scan;
+/// let v: Vec<i32> = (0..10).collect();
+/// let mut sum = 0i32;
+/// #[allow(deprecated)]
+/// bulk_scan(&v, 4, |chunk, _start| {
+///     sum += chunk.iter().sum::<i32>();
+/// });
+/// assert_eq!(sum, 45);
+/// ```
+#[deprecated(note = "renamed to `bulk_for_each`")]
+#[inline]
+pub fn bulk_scan<T, F>(items: &[T], chunk_size: usize, f: F)
+where
+    F: FnMut(&[T], usize),
+{
+    bulk_for_each(items, chunk_size, f)
+}
+
 #[cfg(test)]
 mod tests {
     use super::*;
@@ -206,33 +235,33 @@ mod tests {
         assert_eq!(count, 0);
     }
 
-    // ----- bulk_scan -----
+    // ----- bulk_for_each -----
 
     #[test]
-    fn bulk_scan_chunk_size_divides_len() {
+    fn bulk_for_each_chunk_size_divides_len() {
         let v: Vec<i32> = (0..10).collect();
         let mut sizes = Vec::new();
-        bulk_scan(&v, 5, |chunk, _start| {
+        bulk_for_each(&v, 5, |chunk, _start| {
             sizes.push(chunk.len());
         });
         assert_eq!(sizes, vec![5, 5]);
     }
 
     #[test]
-    fn bulk_scan_chunk_size_does_not_divide_len() {
+    fn bulk_for_each_chunk_size_does_not_divide_len() {
         let v: Vec<i32> = (0..10).collect();
         let mut sizes = Vec::new();
-        bulk_scan(&v, 3, |chunk, _start| {
+        bulk_for_each(&v, 3, |chunk, _start| {
             sizes.push(chunk.len());
         });
         assert_eq!(sizes, vec![3, 3, 3, 1]);
     }
 
     #[test]
-    fn bulk_scan_chunk_size_greater_than_len() {
+    fn bulk_for_each_chunk_size_greater_than_len() {
         let v: Vec<i32> = (0..10).collect();
         let mut sizes = Vec::new();
-        bulk_scan(&v, 100, |chunk, start| {
+        bulk_for_each(&v, 100, |chunk, start| {
             assert_eq!(start, 0);
             sizes.push(chunk.len());
         });
@@ -240,20 +269,20 @@ mod tests {
     }
 
     #[test]
-    fn bulk_scan_start_indices_3_3_3_1() {
+    fn bulk_for_each_start_indices_3_3_3_1() {
         let v: Vec<i32> = (0..10).collect();
         let mut start_indices: Vec<usize> = Vec::new();
-        bulk_scan(&v, 3, |_chunk, start| {
+        bulk_for_each(&v, 3, |_chunk, start| {
             start_indices.push(start);
         });
         assert_eq!(start_indices, vec![0, 3, 6, 9]);
     }
 
     #[test]
-    fn bulk_scan_sums_chunks() {
+    fn bulk_for_each_sums_chunks() {
         let v: Vec<i32> = (0..10).collect();
         let mut sum = 0i32;
-        bulk_scan(&v, 4, |chunk, _start| {
+        bulk_for_each(&v, 4, |chunk, _start| {
             sum += chunk.iter().sum::<i32>();
         });
         assert_eq!(sum, 45);
@@ -261,16 +290,16 @@ mod tests {
 
     #[test]
     #[should_panic(expected = "chunk_size must be > 0")]
-    fn bulk_scan_panics_on_zero_chunk_size() {
+    fn bulk_for_each_panics_on_zero_chunk_size() {
         let v: Vec<i32> = (0..4).collect();
-        bulk_scan(&v, 0, |_, _| {});
+        bulk_for_each(&v, 0, |_, _| {});
     }
 
     #[test]
-    fn bulk_scan_chunk_size_usize_max_single_chunk() {
+    fn bulk_for_each_chunk_size_usize_max_single_chunk() {
         let v: Vec<i32> = (0..4).collect();
         let mut count = 0;
-        bulk_scan(&v, usize::MAX, |chunk, start| {
+        bulk_for_each(&v, usize::MAX, |chunk, start| {
             count += 1;
             assert_eq!(start, 0);
             assert_eq!(chunk.len(), 4);
@@ -279,15 +308,37 @@ mod tests {
     }
 
     #[test]
-    fn bulk_scan_empty_slice() {
+    fn bulk_for_each_empty_slice() {
         let v: Vec<i32> = Vec::new();
         let mut count = 0;
-        bulk_scan(&v, 4, |_, _| {
+        bulk_for_each(&v, 4, |_, _| {
             count += 1;
         });
         assert_eq!(count, 0);
     }
 
+    // ----- bulk_scan (deprecated alias) -----
+    // These tests verify the alias still compiles and delegates correctly.
+
+    #[test]
+    #[allow(deprecated)]
+    fn bulk_scan_deprecated_alias_still_works() {
+        let v: Vec<i32> = (0..10).collect();
+        let mut sum = 0i32;
+        bulk_scan(&v, 4, |chunk, _start| {
+            sum += chunk.iter().sum::<i32>();
+        });
+        assert_eq!(sum, 45);
+    }
+
+    #[test]
+    #[allow(deprecated)]
+    #[should_panic(expected = "chunk_size must be > 0")]
+    fn bulk_scan_deprecated_alias_panics_on_zero_chunk_size() {
+        let v: Vec<i32> = (0..4).collect();
+        bulk_scan(&v, 0, |_, _| {});
+    }
+
     // ----- integration with aos_to_soa -----
     //
     // hpc::soa and hpc::bulk co-merge in PR #156, so the worker-isolation
diff --git a/src/hpc/reductions.rs b/src/hpc/reductions.rs
index 24024e03..37dcdab6 100644
--- a/src/hpc/reductions.rs
+++ b/src/hpc/reductions.rs
@@ -150,6 +150,14 @@ fn sum_f64_slice(s: &[f64]) -> f64 {
     sum
 }
 
+// NOTE: `sum_axis_f32(x: ArrayView2<f32>, axis: Axis) -> Array1<f32>` is NOT
+// added here. ndarray's built-in `ArrayBase::sum_axis(axis)` already covers
+// this operation correctly for all layouts, and no consumer in this codebase
+// routes per-row SIMD sums through this module. Adding a wrapper here would
+// duplicate ndarray core with no net benefit; use `matrix.sum_axis(Axis(1))`
+// directly. Revisit if profiling shows ndarray's generic path is a bottleneck
+// for a specific contiguous-C-order hot path.
+
 /// Arithmetic mean of all elements. Returns `None` for an empty input.
 ///
 /// # Example
diff --git a/src/hpc/soa.rs b/src/hpc/soa.rs
index e5158f53..883b1476 100644
--- a/src/hpc/soa.rs
+++ b/src/hpc/soa.rs
@@ -84,6 +84,7 @@ use core::array;
 /// assert_eq!(soa.field(1), &[2.0, 5.0]);
 /// assert_eq!(soa.field(2), &[3.0, 6.0]);
 /// ```
+#[derive(Clone, Debug)]
 pub struct SoaVec<T, const N: usize> {
     fields: [Vec<T>; N],
 }
@@ -246,6 +247,33 @@ impl<T, const N: usize> SoaVec<T, N> {
     }
 }
 
+impl<T: Copy, const N: usize> SoaVec<T, N> {
+    /// Iterate over individual rows, yielding each as `[T; N]` (one value
+    /// per field, reconstructed from the parallel field arrays).
+    ///
+    /// Requires `T: Copy` so each field element can be copied out without
+    /// cloning; the borrow on `self` lasts only as long as the iterator.
+    /// For non-`Copy` types, use [`chunks`](Self::chunks) with `chunk_len = 1`
+    /// and index into the single-element slices.
+    ///
+    /// # Example
+    ///
+    /// ```
+    /// use ndarray::hpc::soa::SoaVec;
+    /// let mut soa: SoaVec<f32, 3> = SoaVec::new();
+    /// soa.push([1.0, 2.0, 3.0]);
+    /// soa.push([4.0, 5.0, 6.0]);
+    ///
+    /// let rows: Vec<[f32; 3]> = soa.iter_rows().collect();
+    /// assert_eq!(rows[0], [1.0, 2.0, 3.0]);
+    /// assert_eq!(rows[1], [4.0, 5.0, 6.0]);
+    /// ```
+    #[inline]
+    pub fn iter_rows(&self) -> SoaRowIter<'_, T, N> {
+        SoaRowIter { soa: self, cursor: 0 }
+    }
+}
+
 impl<T, const N: usize> Default for SoaVec<T, N> {
     fn default() -> Self {
         Self::new()
@@ -274,6 +302,37 @@ impl<'a, T, const N: usize> Iterator for SoaChunks<'a, T, N> {
     }
 }
 
+/// Iterator yielded by [`SoaVec::iter_rows`].
+///
+/// Each call to [`next`](Iterator::next) copies one row (`[T; N]`) out of
+/// the parallel field arrays. Requires `T: Copy`.
+pub struct SoaRowIter<'a, T, const N: usize> {
+    soa: &'a SoaVec<T, N>,
+    cursor: usize,
+}
+
+impl<'a, T: Copy, const N: usize> Iterator for SoaRowIter<'a, T, N> {
+    type Item = [T; N];
+
+    #[inline]
+    fn next(&mut self) -> Option<Self::Item> {
+        if self.cursor >= self.soa.len() {
+            return None;
+        }
+        let row: [T; N] = array::from_fn(|i| self.soa.fields[i][self.cursor]);
+        self.cursor += 1;
+        Some(row)
+    }
+
+    #[inline]
+    fn size_hint(&self) -> (usize, Option<usize>) {
+        let remaining = self.soa.len().saturating_sub(self.cursor);
+        (remaining, Some(remaining))
+    }
+}
+
+impl<T: Copy, const N: usize> ExactSizeIterator for SoaRowIter<'_, T, N> {}
+
 /// Generate a named-field SoA struct from a struct-like declaration.
 ///
 /// Each declared field `name: T` becomes `name: Vec<T>` on the generated
@@ -603,6 +662,7 @@ macro_rules! soa_struct {
 /// assert_eq!(soa.field(0), &[7u8, 3]);
 /// assert_eq!(soa.field(1), &[255u8, 128]);
 /// ```
+#[inline]
 pub fn aos_to_soa<T, U, const N: usize, F>(aos: &[T], extract: F) -> SoaVec<U, N>
 where
     F: Fn(&T) -> [U; N],
@@ -651,6 +711,7 @@ where
 /// assert_eq!(back[0], Pair { lo: 0x1234, hi: 0xABCD });
 /// assert_eq!(back[1], Pair { lo: 0x5678, hi: 0xEF01 });
 /// ```
+#[inline]
 pub fn soa_to_aos<T, U, const N: usize, F>(soa: &SoaVec<U, N>, build: F) -> Vec<T>
 where
     F: Fn([U; N]) -> T,
@@ -856,6 +917,68 @@ mod tests {
         assert!(chunks.is_empty());
     }
 
+    // -------------------------------------------------------------------
+    // SoaVec::iter_rows
+    // -------------------------------------------------------------------
+
+    #[test]
+    fn soa_vec_iter_rows_basic() {
+        let mut soa: SoaVec<f32, 3> = SoaVec::new();
+        soa.push([1.0, 2.0, 3.0]);
+        soa.push([4.0, 5.0, 6.0]);
+        soa.push([7.0, 8.0, 9.0]);
+        let rows: Vec<[f32; 3]> = soa.iter_rows().collect();
+        assert_eq!(rows.len(), 3);
+        assert_eq!(rows[0], [1.0, 2.0, 3.0]);
+        assert_eq!(rows[1], [4.0, 5.0, 6.0]);
+        assert_eq!(rows[2], [7.0, 8.0, 9.0]);
+    }
+
+    #[test]
+    fn soa_vec_iter_rows_empty_yields_nothing() {
+        let soa: SoaVec<u32, 2> = SoaVec::new();
+        let rows: Vec<[u32; 2]> = soa.iter_rows().collect();
+        assert!(rows.is_empty());
+    }
+
+    #[test]
+    fn soa_vec_iter_rows_single_field() {
+        let mut soa: SoaVec<i32, 1> = SoaVec::new();
+        soa.push([10]);
+        soa.push([20]);
+        let rows: Vec<[i32; 1]> = soa.iter_rows().collect();
+        assert_eq!(rows, [[10], [20]]);
+    }
+
+    #[test]
+    fn soa_vec_iter_rows_size_hint() {
+        let mut soa: SoaVec<u8, 2> = SoaVec::new();
+        soa.push([1, 2]);
+        soa.push([3, 4]);
+        soa.push([5, 6]);
+        let mut it = soa.iter_rows();
+        assert_eq!(it.size_hint(), (3, Some(3)));
+        let _ = it.next();
+        assert_eq!(it.size_hint(), (2, Some(2)));
+        let _ = it.next();
+        let _ = it.next();
+        assert_eq!(it.size_hint(), (0, Some(0)));
+        assert!(it.next().is_none());
+    }
+
+    #[test]
+    fn soa_vec_iter_rows_matches_push_order() {
+        // Cross-check iter_rows against field() to ensure column order is preserved.
+        let mut soa: SoaVec<u32, 4> = SoaVec::new();
+        for i in 0..5u32 {
+            soa.push([i, i * 10, i * 100, i * 1000]);
+        }
+        for (row_idx, row) in soa.iter_rows().enumerate() {
+            let i = row_idx as u32;
+            assert_eq!(row, [i, i * 10, i * 100, i * 1000]);
+        }
+    }
+
     // -------------------------------------------------------------------
     // soa_struct! macro
     // -------------------------------------------------------------------
diff --git a/src/simd.rs b/src/simd.rs
index ce449991..a648fee7 100644
--- a/src/simd.rs
+++ b/src/simd.rs
@@ -228,6 +228,7 @@ pub use crate::simd_nightly::{
 
 #[cfg(all(target_arch = "x86_64", target_feature = "avx512f", not(feature = "nightly-simd")))]
 pub use crate::simd_avx512::{
+    batch_packed_i4_16,
     f32x16,
     f32x8,
     f64x4,
@@ -238,14 +239,21 @@ pub use crate::simd_avx512::{
     i32x8,
     i64x4,
     i64x8,
+    i8x16,
     i8x32,
     i8x64,
+    palette_lookup_u8x8,
+    prefetch_read_t0,
+    prefetch_read_t1,
+    prefetch_read_t2,
     u16x16,
+    u16x8,
     u32x16,
     u32x8,
     u64x4,
     u64x8,
     u8x64,
+    u8x8,
     F32Mask16,
     // 512-bit (native AVX-512, __m512/__m512d/__m512i)
     F32x16,
@@ -262,15 +270,18 @@ pub use crate::simd_avx512::{
     I32x8,
     I64x4,
     I64x8,
+    I8x16,
     I8x32,
     I8x64,
     U16x16,
     U16x32,
+    U16x8,
     U32x16,
     U32x8,
     U64x4,
     U64x8,
     U8x64,
+    U8x8,
 };
 
 // BF16 types + batch conversion (always available — scalar fallback built in)
@@ -307,7 +318,10 @@ pub use crate::simd_avx512::{BF16x16, BF16x8};
     not(target_feature = "avx512f"),
     not(feature = "nightly-simd")
 ))]
-pub use crate::simd_avx512::{f32x8, f64x4, i16x16, i8x32, F32x8, F64x4, I16x16, I8x32};
+pub use crate::simd_avx512::{
+    batch_packed_i4_16, f32x8, f64x4, i16x16, i8x16, i8x32, palette_lookup_u8x8, prefetch_read_t0, prefetch_read_t1,
+    prefetch_read_t2, u16x8, u8x8, F32x8, F64x4, I16x16, I8x16, I8x32, U16x8, U8x8,
+};
 
 #[cfg(all(
     target_arch = "x86_64",
@@ -347,6 +361,15 @@ pub(crate) mod scalar;
 // not on the critical path for f32 BLAS-1 / VML kernels.
 #[cfg(all(target_arch = "aarch64", not(feature = "nightly-simd")))]
 pub use crate::simd_neon::aarch64_simd::{f32x16, f64x8, F32Mask16, F32x16, F64Mask8, F64x8};
+// W1a NEON-native types + free functions
+#[cfg(all(target_arch = "aarch64", not(feature = "nightly-simd")))]
+pub use crate::simd_neon::{
+    batch_packed_i4_16, i8x16, i8x32, palette_lookup_u8x8, prefetch_read_t0, prefetch_read_t1, prefetch_read_t2, u8x8,
+    I8x16, I8x32, U8x8,
+};
+// U16x8 on aarch64 comes from simd_neon (backed by uint16x8_t)
+#[cfg(all(target_arch = "aarch64", not(feature = "nightly-simd")))]
+pub use crate::simd_neon::{u16x8, U16x8};
 #[cfg(all(target_arch = "aarch64", not(feature = "nightly-simd")))]
 pub use scalar::{
     f32x8, f64x4, i32x16, i32x8, i64x4, i64x8, u16x16, u32x16, u32x8, u64x4, u64x8, u8x64, F32x8, F64x4, I32x16, I32x8,
@@ -360,9 +383,10 @@ pub use scalar::{
     not(feature = "nightly-simd")
 ))]
 pub use scalar::{
-    f32x16, f32x8, f64x4, f64x8, i16x16, i16x32, i32x16, i32x8, i64x4, i64x8, i8x32, i8x64, u16x16, u32x16, u32x8,
-    u64x4, u64x8, u8x64, F32Mask16, F32x16, F32x8, F64Mask8, F64x4, F64x8, I16x16, I16x32, I32x16, I32x8, I64x4, I64x8,
-    I8x32, I8x64, U16x16, U16x32, U32x16, U32x8, U64x4, U64x8, U8x64,
+    batch_packed_i4_16, f32x16, f32x8, f64x4, f64x8, i16x16, i16x32, i32x16, i32x8, i64x4, i64x8, i8x16, i8x32, i8x64,
+    palette_lookup_u8x8, prefetch_read_t0, prefetch_read_t1, prefetch_read_t2, u16x16, u16x8, u32x16, u32x8, u64x4,
+    u64x8, u8x64, u8x8, F32Mask16, F32x16, F32x8, F64Mask8, F64x4, F64x8, I16x16, I16x32, I32x16, I32x8, I64x4, I64x8,
+    I8x16, I8x32, I8x64, U16x16, U16x32, U16x8, U32x16, U32x8, U64x4, U64x8, U8x64, U8x8,
 };
 
 // Scalar BF16 conversion — always available on all platforms
diff --git a/src/simd_avx2.rs b/src/simd_avx2.rs
index be3e369f..3b06c27d 100644
--- a/src/simd_avx2.rs
+++ b/src/simd_avx2.rs
@@ -1555,6 +1555,90 @@ avx2_int_type!(U64x4, u64, 4, 0u64);
 avx2_int_type!(I32x8, i32, 8, 0i32);
 avx2_int_type!(I64x4, i64, 4, 0i64);
 
+// ── W1a SIMD primitives — AVX2 polyfill backend ──────────────────────────────
+//
+// The AVX2 backend uses scalar-storage polyfills for the integer types.
+// For each W1a primitive we add impl blocks to the relevant polyfill types.
+
+// ── W1a-#1: I8x16 / batch_packed_i4_16 (AVX2 polyfill) ─────────────────────
+// I8x16 is defined in simd_avx512.rs and re-exported on x86_64 (both v3/v4).
+// The batch function and gather/prefetch live in simd_avx512.rs for x86_64.
+// No additional type definitions needed in this file.
+
+// ── W1a-#2: I8x32::saturating_abs (AVX2 scalar polyfill) ───────────────────
+// The AVX2 tier uses the scalar polyfill I8x32 from simd_avx512.rs (backed by
+// __m256i on AVX2).  saturating_abs is already added to I8x32 in simd_avx512.rs.
+
+// ── W1a-#3: U16x8 / palette_lookup_u8x8 (AVX2 polyfill) ────────────────────
+// U16x8 is defined in simd_avx512.rs (scalar polyfill) for x86_64.
+
+// ── W1a-#5: U64x4::popcnt (AVX2 scalar polyfill) ────────────────────────────
+impl U64x4 {
+    /// Lane-wise population count.  Each `u64` lane → the count of set bits
+    /// (0..=64) returned in the same lane position.
+    ///
+    /// On the AVX2 polyfill backend this is a scalar fused loop using
+    /// `u64::count_ones`.  On AVX-512 with `avx512vpopcntdq` available a
+    /// separate `U64x8` method uses the hardware instruction.
+    ///
+    /// # Example
+    /// ```rust,ignore
+    /// let v = U64x4::from_array([u64::MAX, 0, 1, !1]);
+    /// let p = v.popcnt();
+    /// assert_eq!(p.to_array(), [64, 0, 1, 63]);
+    /// ```
+    #[inline(always)]
+    pub fn popcnt(self) -> Self {
+        let mut out = [0u64; 4];
+        for i in 0..4 {
+            out[i] = self.0[i].count_ones() as u64;
+        }
+        Self(out)
+    }
+}
+
+// ── W1a-#5: U64x8::popcnt / xor_popcount (AVX2 scalar polyfill) ─────────────
+// The avx2_int_type! macro generated U64x8 as a scalar polyfill in this file.
+// We add popcnt + xor_popcount to match the API surface of the AVX-512 backend.
+impl U64x8 {
+    /// Lane-wise population count (scalar polyfill — same API as AVX-512 backend).
+    ///
+    /// On the AVX2 polyfill backend this is a scalar fused loop.
+    ///
+    /// # Example
+    /// ```rust,ignore
+    /// let v = U64x8::splat(u64::MAX);
+    /// assert!(v.popcnt().to_array().iter().all(|&x| x == 64));
+    /// ```
+    #[inline(always)]
+    pub fn popcnt(self) -> Self {
+        let mut out = [0u64; 8];
+        for i in 0..8 {
+            out[i] = self.0[i].count_ones() as u64;
+        }
+        Self(out)
+    }
+
+    /// XOR two vectors lane-wise, popcount each lane, then sum across all 8 lanes.
+    ///
+    /// Scalar polyfill — same semantics as the AVX-512 backend.
+    ///
+    /// # Example
+    /// ```rust,ignore
+    /// let a = U64x8::splat(u64::MAX);
+    /// let b = U64x8::splat(0);
+    /// assert_eq!(a.xor_popcount(b), 512); // 64 bits × 8 lanes
+    /// ```
+    #[inline(always)]
+    pub fn xor_popcount(self, other: Self) -> u64 {
+        let mut sum = 0u64;
+        for i in 0..8 {
+            sum += (self.0[i] ^ other.0[i]).count_ones() as u64;
+        }
+        sum
+    }
+}
+
 // Extra methods for U16x32 (widen/narrow, shift, multiply) — AVX2 scalar fallback.
 impl U16x32 {
     #[inline(always)]
diff --git a/src/simd_avx512.rs b/src/simd_avx512.rs
index 3710b26e..56633804 100644
--- a/src/simd_avx512.rs
+++ b/src/simd_avx512.rs
@@ -2264,6 +2264,591 @@ pub type i16x16 = I16x16;
 // uses. Native AVX2 `__m256i` upgrades for these are TD-SIMD-3.
 pub use crate::simd_avx2::{i32x8, i64x4, u16x16, u32x8, u64x4, I32x8, I64x4, U16x16, U32x8, U64x4};
 
+// ============================================================================
+// W1a SIMD primitives — AVX-512 backend
+// ============================================================================
+//
+// Five new primitives per `.claude/knowledge/vertical-simd-consumer-contract.md`.
+// These live in the AVX-512 backend file and are the "real-intrinsic" tier where
+// applicable.  When AVX-512 doesn't provide a narrower type natively (I8x16,
+// U16x8, U8x8), we define minimal scalar-storage wrappers so the cross-arch API
+// is uniform.
+//
+// Types NEW to this backend (polyfill wrappers):
+//   I8x16  — 16 × i8, scalar storage (no native __m128i wrapping is necessary
+//             to match the API; AVX-512 native I8x64 is wider and the consumer
+//             only needs the 16-lane primitive here).
+//   U16x8  — 8 × u16, scalar storage polyfill.
+//   U8x8   — 8 × u8, scalar storage polyfill.
+
+// ─── I8x16 (scalar-storage polyfill for the AVX-512 backend) ─────────────────
+
+/// 16-lane `i8` vector.  On the AVX-512 backend this is a scalar-storage
+/// polyfill (no native 128-bit intrinsic wrapper is needed for the W1a API
+/// surface); on NEON it is backed by `int8x16_t`.
+///
+/// Edge cases and lane layout are identical across backends; only performance
+/// differs.
+#[cfg(target_arch = "x86_64")]
+#[derive(Copy, Clone, PartialEq)]
+#[repr(align(16))]
+pub struct I8x16(pub [i8; 16]);
+
+#[cfg(target_arch = "x86_64")]
+impl I8x16 {
+    pub const LANES: usize = 16;
+
+    /// Broadcast a single `i8` value to all 16 lanes.
+    ///
+    /// # Example
+    /// ```rust,ignore
+    /// let v = I8x16::splat(3);
+    /// assert!(v.to_array().iter().all(|&x| x == 3));
+    /// ```
+    #[inline(always)]
+    pub fn splat(v: i8) -> Self {
+        Self([v; 16])
+    }
+
+    /// Load 16 lanes from a slice (at least 16 elements required).
+    #[inline(always)]
+    pub fn from_slice(s: &[i8]) -> Self {
+        assert!(s.len() >= 16);
+        let mut a = [0i8; 16];
+        a.copy_from_slice(&s[..16]);
+        Self(a)
+    }
+
+    /// Load from a fixed-size array.
+    #[inline(always)]
+    pub fn from_array(arr: [i8; 16]) -> Self {
+        Self(arr)
+    }
+
+    /// Extract all 16 lanes as an array.
+    #[inline(always)]
+    pub fn to_array(self) -> [i8; 16] {
+        self.0
+    }
+
+    /// Copy lanes into a slice (must have at least 16 elements).
+    #[inline(always)]
+    pub fn copy_to_slice(self, s: &mut [i8]) {
+        assert!(s.len() >= 16);
+        s[..16].copy_from_slice(&self.0);
+    }
+
+    // ── W1a-#1: from_i4_packed_u64 + lane_i8 ────────────────────────────────
+
+    /// Unpack 16 signed i4 nibbles from a `u64` into 16 sign-extended `i8` lanes.
+    ///
+    /// Nibble layout: `lane[i] = sign_extend_i4((packed >> (4*i)) & 0xf)`.
+    /// Values `0x0..=0x7` map to `0..=7`; values `0x8..=0xf` map to `-8..=-1`.
+    ///
+    /// On the x86_64 backend this is a scalar polyfill (no AVX-512 intrinsic path
+    /// here since the spec shows multiple equivalent approaches and the NEON
+    /// path is the primary SIMD path for narrow unpacking).
+    ///
+    /// # Example
+    /// ```rust,ignore
+    /// // All nibbles == 0x0 → all lanes == 0
+    /// let z = I8x16::from_i4_packed_u64(0);
+    /// assert_eq!(z.lane_i8::<0>(), 0);
+    /// // Nibble 0xf → -1
+    /// let neg = I8x16::from_i4_packed_u64(0xffff_ffff_ffff_ffff);
+    /// assert_eq!(neg.lane_i8::<0>(), -1);
+    /// // Nibble 0x8 → -8
+    /// let min4 = I8x16::from_i4_packed_u64(0x8888_8888_8888_8888);
+    /// assert_eq!(min4.lane_i8::<0>(), -8);
+    /// ```
+    #[inline(always)]
+    pub fn from_i4_packed_u64(packed: u64) -> Self {
+        let mut lanes = [0i8; 16];
+        for i in 0..16 {
+            let nibble = ((packed >> (4 * i)) & 0xf) as i8;
+            // Sign-extend: if bit 3 is set the value is negative
+            lanes[i] = if nibble > 7 { nibble - 16 } else { nibble };
+        }
+        Self(lanes)
+    }
+
+    /// Extract lane `N` as an `i8` (const-generic, checked at compile time).
+    ///
+    /// `N` must be in `0..16`; this is enforced by the array index.
+    ///
+    /// # Example
+    /// ```rust,ignore
+    /// let v = I8x16::from_array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16i8]);
+    /// assert_eq!(v.lane_i8::<0>(), 1);
+    /// assert_eq!(v.lane_i8::<15>(), 16);
+    /// ```
+    #[inline(always)]
+    pub fn lane_i8<const N: usize>(self) -> i8 {
+        self.0[N]
+    }
+
+    // ── W1a-#2: saturating_abs ────────────────────────────────────────────────
+
+    /// Lane-wise saturating absolute value.
+    ///
+    /// `saturating_abs(i8::MIN) == i8::MAX` (127), unlike the hardware VPABSB
+    /// which returns `i8::MIN` (−128) for the minimum value because +128 does
+    /// not fit in `i8`.  On AVX-512 this is corrected with `_mm512_min_epu8`
+    /// (VPMINUB) per the VPABSB correction in the consumer contract.  This
+    /// x86_64 polyfill delegates to `i8::saturating_abs` on the scalar path.
+    ///
+    /// All lanes are independently saturated.
+    ///
+    /// # Example
+    /// ```rust,ignore
+    /// let v = I8x16::splat(i8::MIN);
+    /// assert!(v.saturating_abs().to_array().iter().all(|&x| x == i8::MAX));
+    /// ```
+    #[inline(always)]
+    pub fn saturating_abs(self) -> Self {
+        let mut o = [0i8; 16];
+        for i in 0..16 {
+            o[i] = self.0[i].saturating_abs();
+        }
+        Self(o)
+    }
+}
+
+#[cfg(target_arch = "x86_64")]
+impl core::fmt::Debug for I8x16 {
+    fn fmt(&self, f: &mut core::fmt::Formatter<'_>) -> core::fmt::Result {
+        write!(f, "I8x16({:?})", &self.0[..])
+    }
+}
+
+// ─── U8x8 (scalar-storage polyfill for AVX-512 backend) ──────────────────────
+
+/// 8-lane `u8` vector.  Scalar-storage polyfill used by `palette_lookup_u8x8`.
+#[cfg(target_arch = "x86_64")]
+#[derive(Copy, Clone, PartialEq)]
+#[repr(align(8))]
+pub struct U8x8(pub [u8; 8]);
+
+#[cfg(target_arch = "x86_64")]
+impl U8x8 {
+    pub const LANES: usize = 8;
+
+    /// Broadcast a single `u8` to all 8 lanes.
+    #[inline(always)]
+    pub fn splat(v: u8) -> Self {
+        Self([v; 8])
+    }
+
+    /// Load from a fixed-size array.
+    #[inline(always)]
+    pub fn from_array(arr: [u8; 8]) -> Self {
+        Self(arr)
+    }
+
+    /// Extract all 8 lanes as an array.
+    #[inline(always)]
+    pub fn to_array(self) -> [u8; 8] {
+        self.0
+    }
+}
+
+#[cfg(target_arch = "x86_64")]
+impl core::fmt::Debug for U8x8 {
+    fn fmt(&self, f: &mut core::fmt::Formatter<'_>) -> core::fmt::Result {
+        write!(f, "U8x8({:?})", &self.0[..])
+    }
+}
+
+// ─── U16x8 (scalar-storage polyfill for AVX-512 backend) ─────────────────────
+
+/// 8-lane `u16` vector.  On the NEON backend this is backed by `uint16x8_t`; on
+/// x86_64 (both AVX-512 and AVX2) it is a scalar-storage polyfill with the same
+/// API.
+///
+/// The W1a gather primitive is defined as a method on this type.
+#[cfg(target_arch = "x86_64")]
+#[derive(Copy, Clone, PartialEq)]
+#[repr(align(16))]
+pub struct U16x8(pub [u16; 8]);
+
+#[cfg(target_arch = "x86_64")]
+impl U16x8 {
+    pub const LANES: usize = 8;
+
+    /// Broadcast a single `u16` to all 8 lanes.
+    #[inline(always)]
+    pub fn splat(v: u16) -> Self {
+        Self([v; 8])
+    }
+
+    /// Load from a slice (at least 8 elements required).
+    #[inline(always)]
+    pub fn from_slice(s: &[u16]) -> Self {
+        assert!(s.len() >= 8);
+        let mut a = [0u16; 8];
+        a.copy_from_slice(&s[..8]);
+        Self(a)
+    }
+
+    /// Load from a fixed-size array.
+    #[inline(always)]
+    pub fn from_array(arr: [u16; 8]) -> Self {
+        Self(arr)
+    }
+
+    /// Extract all 8 lanes as an array.
+    #[inline(always)]
+    pub fn to_array(self) -> [u16; 8] {
+        self.0
+    }
+
+    // ── W1a-#3: gather_u16 ───────────────────────────────────────────────────
+
+    /// Gather 8 `u16` values from `table` at the indices given by `self`.
+    ///
+    /// In debug builds, panics if any index is `>= table.len()`.
+    /// In release builds, falls through to a scalar loop using `get()` so
+    /// out-of-range indices return `0` safely instead of reading past the
+    /// slice end.
+    ///
+    /// On x86_64 this is a scalar-loop polyfill (real AVX2 gather via
+    /// `_mm256_i32gather_epi32` + downcast is tracked as a follow-up
+    /// optimisation; the scalar path is the correctness anchor per the
+    /// contract).
+    ///
+    /// # Example
+    /// ```rust,ignore
+    /// let table = [10u16, 20, 30, 40, 50, 60, 70, 80];
+    /// let idx = U16x8::from_array([0, 2, 4, 6, 1, 3, 5, 7]);
+    /// let result = U16x8::gather_u16(idx, &table);
+    /// assert_eq!(result.to_array(), [10, 30, 50, 70, 20, 40, 60, 80]);
+    /// ```
+    #[inline(always)]
+    pub fn gather_u16(indices: U16x8, table: &[u16]) -> Self {
+        let idx = indices.to_array();
+        // Bounds validation: debug panics, release falls back to safe get()
+        #[cfg(debug_assertions)]
+        for &i in &idx {
+            assert!(
+                (i as usize) < table.len(),
+                "gather_u16: index {} out of bounds (table.len() = {})",
+                i,
+                table.len()
+            );
+        }
+        let mut out = [0u16; 8];
+        for k in 0..8 {
+            // SAFETY: in debug we already panicked above; in release `get`
+            // returns None for OOB and we fall back to 0.
+            out[k] = table.get(idx[k] as usize).copied().unwrap_or(0);
+        }
+        Self(out)
+    }
+
+    /// Extract lane `k` as a `u16` (for use in gather loops).
+    #[inline(always)]
+    pub fn lane(self, k: usize) -> u16 {
+        self.0[k]
+    }
+}
+
+#[cfg(target_arch = "x86_64")]
+impl core::fmt::Debug for U16x8 {
+    fn fmt(&self, f: &mut core::fmt::Formatter<'_>) -> core::fmt::Result {
+        write!(f, "U16x8({:?})", &self.0[..])
+    }
+}
+
+// ─── W1a-#3: palette_lookup_u8x8 (AVX-512 backend) ───────────────────────────
+
+/// Look up 8 bytes from a `u8` LUT by `u16` indices.
+///
+/// Convenience wrapper over `U16x8::gather_u16`: widens each index to u16,
+/// reads the byte at that position in `lut`, and returns an 8-lane `U8x8`.
+///
+/// Bounds: panics in debug if any index `>= lut.len()`; returns 0 safely in
+/// release for out-of-range indices.
+///
+/// # Example
+/// ```rust,ignore
+/// let lut: Vec<u8> = (0..256).map(|x| x as u8).collect();
+/// let idx = U16x8::from_array([0, 1, 127, 128, 254, 255, 10, 20]);
+/// let result = palette_lookup_u8x8(idx, &lut);
+/// assert_eq!(result.to_array(), [0, 1, 127, 128, 254, 255, 10, 20]);
+/// ```
+#[cfg(target_arch = "x86_64")]
+#[inline(always)]
+pub fn palette_lookup_u8x8(idx_v: U16x8, lut: &[u8]) -> U8x8 {
+    let idx = idx_v.to_array();
+    #[cfg(debug_assertions)]
+    for &i in &idx {
+        assert!((i as usize) < lut.len(), "palette_lookup_u8x8: index {} OOB (lut.len() = {})", i, lut.len());
+    }
+    let mut out = [0u8; 8];
+    for k in 0..8 {
+        out[k] = lut.get(idx[k] as usize).copied().unwrap_or(0);
+    }
+    U8x8(out)
+}
+
+// ─── W1a-#2: I8x32::saturating_abs (AVX-512 backend) ─────────────────────────
+//
+// The AVX-512 I8x32 type lives in this file (backed by `__m256i`).
+// We add saturating_abs using the VPABSB correction from the spec:
+//   1. _mm256_abs_epi8  (VPABSB on AVX2) gives raw abs; returns 0x80 for 0x80.
+//   2. _mm256_min_epu8  (VPMINUB) clamps 0x80 → 0x7f.
+
+impl I8x32 {
+    /// Lane-wise saturating absolute value.
+    ///
+    /// `saturating_abs(i8::MIN) == i8::MAX` (127).  Uses the VPABSB +
+    /// VPMINUB correction because VPABSB alone returns `i8::MIN` for the
+    /// minimum lane value (the bit-pattern of +128 does not fit in `i8`).
+    ///
+    /// All 32 lanes are independently saturated.
+    ///
+    /// # Example
+    /// ```rust,ignore
+    /// let v = I8x32::splat(i8::MIN);
+    /// assert!(v.saturating_abs().to_array().iter().all(|&x| x == i8::MAX));
+    /// let v2 = I8x32::from_array([-1i8; 32]);
+    /// assert!(v2.saturating_abs().to_array().iter().all(|&x| x == 1));
+    /// ```
+    #[inline(always)]
+    pub fn saturating_abs(self) -> Self {
+        // SAFETY: _mm256_abs_epi8 (VPABSB) is an AVX2 intrinsic; we are in
+        // the simd_avx512.rs file which is only compiled for x86_64.  The
+        // `target_feature(enable = "avx2")` annotation on the calling code
+        // path guarantees AVX2 availability.  The raw_abs result for 0x80
+        // is 0x80 (bit-pattern +128); VPMINUB then clamps it to 0x7f.
+        // UNVERIFIED: _mm256_abs_epi8 stability on Rust 1.94 stable — it is
+        // in std::arch::x86_64 since Rust 1.0 for AVX2 so should compile.
+        #[cfg(target_arch = "x86_64")]
+        unsafe {
+            let raw_abs = core::arch::x86_64::_mm256_abs_epi8(self.0);
+            // VPMINUB: unsigned-byte minimum. 0x80 unsigned = 128 > 0x7f = 127
+            // so min(0x80, 0x7f) = 0x7f.  All values < 0x80 pass through.
+            let clamped =
+                core::arch::x86_64::_mm256_min_epu8(raw_abs, core::arch::x86_64::_mm256_set1_epi8(0x7f_u8 as i8));
+            I8x32(clamped)
+        }
+        #[cfg(not(target_arch = "x86_64"))]
+        {
+            // Scalar fallback (unreachable in practice for AVX-512 builds)
+            let mut o = [0i8; 32];
+            let arr = self.to_array();
+            for i in 0..32 {
+                o[i] = arr[i].saturating_abs();
+            }
+            I8x32::from_array(o)
+        }
+    }
+}
+
+// ─── W1a-#5: U64x8::popcnt / xor_popcount (AVX-512 backend) ──────────────────
+
+impl U64x8 {
+    /// Lane-wise population count (number of set bits) for each of the 8
+    /// `u64` lanes.  Each result lane holds a value in `0..=64`.
+    ///
+    /// On AVX-512 with `avx512vpopcntdq` the native `_mm512_popcnt_epi64`
+    /// instruction is used.  Without that extension (or when compiling for
+    /// the scalar polyfill path) a Mula-style byte-LUT via VPSHUFB is used,
+    /// or the scalar `u64::count_ones` fused loop.
+    ///
+    /// # Example
+    /// ```rust,ignore
+    /// let v = U64x8::splat(u64::MAX);  // all bits set → 64 per lane
+    /// let p = v.popcnt();
+    /// assert!(p.to_array().iter().all(|&x| x == 64));
+    /// let z = U64x8::splat(0);
+    /// assert!(z.popcnt().to_array().iter().all(|&x| x == 0));
+    /// ```
+    #[inline(always)]
+    pub fn popcnt(self) -> Self {
+        // UNVERIFIED: _mm512_popcnt_epi64 requires `avx512vpopcntdq`; the
+        // cfg guard below selects it only when that feature is enabled at
+        // compile time.  On Sapphire Rapids + Zen4 it should be available.
+        #[cfg(all(target_arch = "x86_64", target_feature = "avx512vpopcntdq"))]
+        {
+            // SAFETY: avx512vpopcntdq is enabled at compile time (cfg guard).
+            // _mm512_popcnt_epi64 is a stable intrinsic from std::arch::x86_64.
+            // UNVERIFIED: exact Rust stable version this intrinsic landed in —
+            // believed to be 1.72 but not confirmed against 1.94.
+            unsafe {
+                let result = core::arch::x86_64::_mm512_popcnt_epi64(self.0);
+                U64x8(result)
+            }
+        }
+        // Scalar fallback for AVX-512F builds without VPOPCNTDQ.
+        // The Mula-algorithm via VPSHUFB + VPSADBW would be faster but
+        // requires avx512bw which may not be present alongside avx512f.
+        // Scalar u64::count_ones is ~4 cycles per lane on modern CPUs and
+        // is the safe correctness anchor (TD follow-up: add avx512bw guard).
+        // UNVERIFIED: whether avx512bw is guaranteed to co-exist with avx512f
+        // on the production deployment targets; leaving as scalar until confirmed.
+        #[cfg(all(target_arch = "x86_64", not(target_feature = "avx512vpopcntdq")))]
+        {
+            let arr = self.to_array();
+            let mut out = [0u64; 8];
+            for i in 0..8 {
+                out[i] = arr[i].count_ones() as u64;
+            }
+            U64x8::from_array(out)
+        }
+        #[cfg(not(target_arch = "x86_64"))]
+        {
+            // Scalar fallback (unreachable in practice for this backend file).
+            let arr = self.to_array();
+            let mut out = [0u64; 8];
+            for i in 0..8 {
+                out[i] = arr[i].count_ones() as u64;
+            }
+            U64x8::from_array(out)
+        }
+    }
+
+    /// XOR two vectors lane-wise, popcount each lane, then sum across all 8
+    /// lanes.  Optimised for Hamming-distance reductions.
+    ///
+    /// Equivalent to `(self ^ other).popcnt().reduce_sum()` but avoids a
+    /// store/reload cycle when all 8 popcounts are needed only as a sum.
+    ///
+    /// # Example
+    /// ```rust,ignore
+    /// let a = U64x8::splat(u64::MAX);
+    /// let b = U64x8::splat(0);
+    /// // All bits different → 64 set bits per lane × 8 lanes = 512
+    /// assert_eq!(a.xor_popcount(b), 512);
+    /// let same = U64x8::splat(0xdead_beef_cafe_babe);
+    /// assert_eq!(same.xor_popcount(same), 0);
+    /// ```
+    #[inline(always)]
+    pub fn xor_popcount(self, other: Self) -> u64 {
+        // XOR first, then popcount + horizontal sum.
+        #[cfg(target_arch = "x86_64")]
+        {
+            // SAFETY: BitXor on U64x8 uses _mm512_xor_si512; popcnt uses the
+            // avx512 path above.  reduce_sum uses _mm512_reduce_add_epi64.
+            let xored = self ^ other;
+            xored.popcnt().reduce_sum()
+        }
+        #[cfg(not(target_arch = "x86_64"))]
+        {
+            let a = self.to_array();
+            let b = other.to_array();
+            let mut sum = 0u64;
+            for i in 0..8 {
+                sum += (a[i] ^ b[i]).count_ones() as u64;
+            }
+            sum
+        }
+    }
+}
+
+// ─── W1a-#5: U64x4::popcnt (AVX-512 backend via simd_avx2 polyfill) ──────────
+//
+// U64x4 lives in simd_avx2.rs as a scalar-storage polyfill (avx2_int_type!).
+// The AVX-512 backend re-exports it from simd_avx2.rs (see the re-export at
+// line ~2265: `pub use crate::simd_avx2::{…U64x4…}`).
+// We add popcnt to U64x4 via an impl block in simd_avx2.rs (see that file).
+
+// ─── W1a-#4: prefetch_read_t0/t1/t2 (x86_64) ────────────────────────────────
+
+/// Hint that the cache line containing `ptr` will be read soon; load into L1
+/// (T0) data cache.
+///
+/// `ptr` is allowed to be invalid (null, unmapped).  On x86_64 an invalid
+/// address in `PREFETCHT0` is silently dropped by the hardware; no fault is
+/// raised.  Do NOT `assert!` or dereference `ptr` in this function.
+///
+/// # Example
+/// ```rust,ignore
+/// let data = vec![0u8; 4096];
+/// unsafe { prefetch_read_t0(data.as_ptr()); }
+/// ```
+#[cfg(target_arch = "x86_64")]
+#[inline(always)]
+pub fn prefetch_read_t0(ptr: *const u8) {
+    // SAFETY: _MM_HINT_T0 prefetch on x86_64 is a hint-only instruction;
+    // it does NOT fault on invalid addresses per Intel SDM § PREFETCHT0.
+    // The pointer is never dereferenced.
+    unsafe {
+        core::arch::x86_64::_mm_prefetch::<{ core::arch::x86_64::_MM_HINT_T0 }>(ptr as *const i8);
+    }
+}
+
+/// Hint to load into L2 (T1) cache.  Same invalid-pointer semantics as
+/// `prefetch_read_t0`.
+#[cfg(target_arch = "x86_64")]
+#[inline(always)]
+pub fn prefetch_read_t1(ptr: *const u8) {
+    // SAFETY: same as prefetch_read_t0 — hint-only, no fault on invalid ptr.
+    unsafe {
+        core::arch::x86_64::_mm_prefetch::<{ core::arch::x86_64::_MM_HINT_T1 }>(ptr as *const i8);
+    }
+}
+
+/// Hint to load into L3 (T2) cache.  Same invalid-pointer semantics as
+/// `prefetch_read_t0`.
+#[cfg(target_arch = "x86_64")]
+#[inline(always)]
+pub fn prefetch_read_t2(ptr: *const u8) {
+    // SAFETY: same as prefetch_read_t0 — hint-only, no fault on invalid ptr.
+    unsafe {
+        core::arch::x86_64::_mm_prefetch::<{ core::arch::x86_64::_MM_HINT_T2 }>(ptr as *const i8);
+    }
+}
+
+// ─── W1a-#1: batch_packed_i4_16 (x86_64 backend) ─────────────────────────────
+
+/// Closure-parameterised batch over packed i4 data.
+///
+/// Iterates `min(packed.len(), aux.len())` times.  Each iteration unpacks
+/// `packed[i]` into an `I8x16` (16 sign-extended nibbles) and passes it
+/// together with `aux[i]` to the closure `f`, storing the result in `out[i]`.
+///
+/// Tail handling: if `out.len() < packed.len()` only `out.len()` iterations
+/// run (no out-of-bounds write).
+///
+/// Bounds: panics if `packed.len() != aux.len()`.  An empty slice is valid.
+///
+/// # Example
+/// ```rust,ignore
+/// let packed = vec![0u64; 4];
+/// let aux    = vec![0i8; 4];
+/// let mut out = vec![0i8; 4];
+/// batch_packed_i4_16(&packed, &aux, &mut out, |lanes, a| {
+///     lanes.lane_i8::<0>().wrapping_add(a)
+/// });
+/// assert!(out.iter().all(|&v| v == 0));
+/// ```
+#[cfg(target_arch = "x86_64")]
+#[inline]
+pub fn batch_packed_i4_16<E, F>(packed: &[u64], aux: &[i8], out: &mut [E], f: F)
+where
+    F: Fn(I8x16, i8) -> E + Sync + Send,
+    E: Copy,
+{
+    assert_eq!(packed.len(), aux.len(), "batch_packed_i4_16: packed and aux must be same length");
+    let n = packed.len().min(out.len());
+    for i in 0..n {
+        let lanes = I8x16::from_i4_packed_u64(packed[i]);
+        out[i] = f(lanes, aux[i]);
+    }
+}
+
+// ─── Aliases ──────────────────────────────────────────────────────────────────
+#[cfg(target_arch = "x86_64")]
+#[allow(non_camel_case_types)]
+pub type i8x16 = I8x16;
+#[cfg(target_arch = "x86_64")]
+#[allow(non_camel_case_types)]
+pub type u16x8 = U16x8;
+#[cfg(target_arch = "x86_64")]
+#[allow(non_camel_case_types)]
+pub type u8x8 = U8x8;
+
 // ============================================================================
 // BF16 conversion wrappers — AVX-512 BF16 hardware instructions
 // ============================================================================
diff --git a/src/simd_int_ops.rs b/src/simd_int_ops.rs
index b9763640..2cef8b91 100644
--- a/src/simd_int_ops.rs
+++ b/src/simd_int_ops.rs
@@ -754,6 +754,169 @@ mod tests {
         }
     }
 
+    // ── W1a parity tests ────────────────────────────────────────────────────
+    //
+    // These tests exercise the correctness of the 5 W1a primitives on the
+    // current compilation backend.  Because the dispatch is compile-time
+    // only, each test runs against exactly one backend per build.  The
+    // fixed corpus includes all required edge-case values from the consumer
+    // contract (i8::MIN, i8::MAX, 0, all-bits-set u64, OOB index edge).
+
+    /// W1a-#1 + #2: I8x16 from_i4_packed_u64 + lane_i8 + saturating_abs
+    #[test]
+    fn w1a_i8x16_from_i4_packed_u64_basic() {
+        use crate::simd::I8x16;
+        // All nibbles 0 → all lanes 0
+        let z = I8x16::from_i4_packed_u64(0);
+        assert!(z.to_array().iter().all(|&x| x == 0), "all-zero packed");
+
+        // All nibbles 0xf → all lanes -1
+        let neg = I8x16::from_i4_packed_u64(u64::MAX);
+        assert!(neg.to_array().iter().all(|&x| x == -1), "all-0xf packed → -1");
+
+        // Nibble 0x8 = minimum i4 → lane value -8
+        let min4 = I8x16::from_i4_packed_u64(0x8888_8888_8888_8888);
+        assert!(min4.to_array().iter().all(|&x| x == -8), "nibble 0x8 → -8");
+
+        // Nibble 0x7 = maximum positive i4 → lane value +7
+        let max4 = I8x16::from_i4_packed_u64(0x7777_7777_7777_7777);
+        assert!(max4.to_array().iter().all(|&x| x == 7), "nibble 0x7 → 7");
+
+        // lane_i8 extractors: nibbles are LSB-first and sign-extended.
+        // packed = 0x...0021 → lane0 = nibble 0x1 = 1, lane1 = nibble 0x2 = 2.
+        let low = I8x16::from_i4_packed_u64(0x0000_0000_0000_0021);
+        assert_eq!(low.lane_i8::<0>(), 1);
+        assert_eq!(low.lane_i8::<1>(), 2);
+        // Sign bit: nibble 0x8 in lane0 sign-extends to -8.
+        let signbit = I8x16::from_i4_packed_u64(0x0000_0000_0000_0008);
+        assert_eq!(signbit.lane_i8::<0>(), -8);
+    }
+
+    /// W1a-#2: saturating_abs — binding contract test (i8::MIN → i8::MAX)
+    #[test]
+    fn w1a_saturating_abs_i8_min_matches_across_backends() {
+        use crate::simd::{I8x16, I8x32};
+
+        // I8x16
+        let input16 = I8x16::splat(i8::MIN);
+        let result16 = input16.saturating_abs();
+        let arr16 = result16.to_array();
+        for (lane, &v) in arr16.iter().enumerate() {
+            assert_eq!(v, i8::MAX, "I8x16 lane {} saturating_abs(i8::MIN) should be i8::MAX", lane);
+        }
+
+        // I8x32
+        let input32 = I8x32::splat(i8::MIN);
+        let result32 = input32.saturating_abs();
+        let arr32 = result32.to_array();
+        for (lane, &v) in arr32.iter().enumerate() {
+            assert_eq!(v, i8::MAX, "I8x32 lane {} saturating_abs(i8::MIN) should be i8::MAX", lane);
+        }
+
+        // Corpus: 0, 1, -1, i8::MAX, i8::MIN
+        let corpus: &[i8] = &[0, 1, -1, i8::MAX, i8::MIN, 42, -42, 127, -127, -128, 64, -64];
+        for &val in corpus {
+            // Scalar reference
+            let expected = val.saturating_abs();
+
+            let v16 = I8x16::splat(val).saturating_abs().lane_i8::<0>();
+            assert_eq!(v16, expected, "I8x16 saturating_abs({}) mismatch", val);
+
+            let mut arr32 = [0i8; 32];
+            arr32[0] = val;
+            let v32 = I8x32::from_array(arr32).saturating_abs().to_array()[0];
+            assert_eq!(v32, expected, "I8x32 saturating_abs({}) mismatch", val);
+        }
+    }
+
+    /// W1a-#3: gather_u16 + palette_lookup_u8x8
+    #[test]
+    fn w1a_gather_u16_basic() {
+        use crate::simd::{palette_lookup_u8x8, U16x8};
+
+        let table: Vec<u16> = (0..256).map(|x| x as u16 * 10).collect();
+        let idx = U16x8::from_array([0, 1, 2, 3, 100, 200, 255, 50]);
+        let result = U16x8::gather_u16(idx, &table);
+        let expected = [0u16, 10, 20, 30, 1000, 2000, 2550, 500];
+        assert_eq!(result.to_array(), expected, "gather_u16 basic");
+
+        // All-same index
+        let same_idx = U16x8::splat(5);
+        let r2 = U16x8::gather_u16(same_idx, &table);
+        assert!(r2.to_array().iter().all(|&v| v == 50), "gather_u16 all-same idx");
+
+        // palette_lookup_u8x8
+        let lut: Vec<u8> = (0..256).map(|x| x as u8).collect();
+        let pidx = U16x8::from_array([0, 1, 127, 128, 254, 255, 10, 20]);
+        let pr = palette_lookup_u8x8(pidx, &lut);
+        assert_eq!(pr.to_array(), [0u8, 1, 127, 128, 254, 255, 10, 20], "palette_lookup_u8x8");
+    }
+
+    /// W1a-#4: prefetch — just verify they don't panic (they're hints)
+    #[test]
+    fn w1a_prefetch_no_panic() {
+        use crate::simd::{prefetch_read_t0, prefetch_read_t1, prefetch_read_t2};
+        let data = [0u8; 64];
+        let ptr = data.as_ptr();
+        // Valid pointer — must not panic
+        prefetch_read_t0(ptr);
+        prefetch_read_t1(ptr);
+        prefetch_read_t2(ptr);
+        // Null pointer — must not panic (prefetch is a hint, not a load)
+        prefetch_read_t0(core::ptr::null());
+        prefetch_read_t1(core::ptr::null());
+        prefetch_read_t2(core::ptr::null());
+    }
+
+    /// W1a-#5: U64x8::popcnt / xor_popcount + U64x4::popcnt
+    #[test]
+    fn w1a_u64_popcnt_basic() {
+        use crate::simd::{U64x4, U64x8};
+
+        // U64x8
+        let all_ones = U64x8::splat(u64::MAX);
+        let p8 = all_ones.popcnt();
+        assert!(p8.to_array().iter().all(|&x| x == 64), "U64x8::popcnt(MAX) == 64 per lane");
+
+        let all_zero = U64x8::splat(0);
+        let pz8 = all_zero.popcnt();
+        assert!(pz8.to_array().iter().all(|&x| x == 0), "U64x8::popcnt(0) == 0 per lane");
+
+        // xor_popcount: MAX ^ 0 = MAX, 64 bits × 8 lanes = 512
+        assert_eq!(all_ones.xor_popcount(all_zero), 512, "xor_popcount(MAX,0) == 512");
+        assert_eq!(all_ones.xor_popcount(all_ones), 0, "xor_popcount(x,x) == 0");
+
+        // Known values
+        let v = U64x8::from_array([1, 2, 3, 4, 5, 6, 7, 8]);
+        let pv = v.popcnt().to_array();
+        assert_eq!(pv, [1, 1, 2, 1, 2, 2, 3, 1], "U64x8::popcnt known values");
+
+        // U64x4
+        let v4 = U64x4::from_array([u64::MAX, 0, 1, !1u64]);
+        let pv4 = v4.popcnt().to_array();
+        assert_eq!(pv4, [64, 0, 1, 63], "U64x4::popcnt known values");
+    }
+
+    /// W1a-#1: batch_packed_i4_16 smoke test
+    #[test]
+    fn w1a_batch_packed_i4_16_smoke() {
+        use crate::simd::batch_packed_i4_16;
+
+        let packed = vec![0u64; 4];
+        let aux = vec![0i8; 4];
+        let mut out = vec![0i8; 4];
+        batch_packed_i4_16(&packed, &aux, &mut out, |lanes, a| lanes.lane_i8::<0>().wrapping_add(a));
+        assert!(out.iter().all(|&v| v == 0), "batch_packed_i4_16 all-zero");
+
+        // Non-zero nibbles
+        let packed2 = vec![0x1111_1111_1111_1111u64; 2];
+        let aux2 = vec![10i8; 2];
+        let mut out2 = vec![0i8; 2];
+        batch_packed_i4_16(&packed2, &aux2, &mut out2, |lanes, a| lanes.lane_i8::<0>().wrapping_add(a));
+        // nibble 0x1 → lane 0 = +1; +10 = 11
+        assert!(out2.iter().all(|&v| v == 11), "batch_packed_i4_16 nibble=1+aux=10");
+    }
+
     /// Exercises the AMX dispatch tier added on top of `gemm_u8_i8`'s
     /// compile-time cascade. On AMX-enabled silicon (Sapphire Rapids+
     /// with the right OS prctl), 16/16/64-aligned shapes go through
diff --git a/src/simd_neon.rs b/src/simd_neon.rs
index e7d36776..2d347035 100644
--- a/src/simd_neon.rs
+++ b/src/simd_neon.rs
@@ -1839,6 +1839,283 @@ pub type i16x16 = I16x16;
 #[allow(non_camel_case_types)]
 pub type i16x32 = I16x32;
 
+// ============================================================================
+// W1a SIMD primitives — NEON backend
+// ============================================================================
+
+// ── W1a-#1: I8x16::from_i4_packed_u64 + lane_i8 (NEON) ──────────────────────
+
+#[cfg(target_arch = "aarch64")]
+impl I8x16 {
+    /// Unpack 16 signed i4 nibbles from a `u64` into 16 sign-extended `i8` lanes.
+    ///
+    /// Nibble layout: `lane[i] = sign_extend_i4((packed >> (4*i)) & 0xf)`.
+    /// Values `0x0..=0x7` → `0..=7`; values `0x8..=0xf` → `-8..=-1`.
+    ///
+    /// On NEON this is implemented as a scalar loop (the shift+mask approach
+    /// with `vshl_n_s8` would require byte-level load + nibble split across
+    /// two registers, but the scalar approach is simpler and correct).
+    ///
+    /// # Example
+    /// ```rust,ignore
+    /// let neg = I8x16::from_i4_packed_u64(0xffff_ffff_ffff_ffff);
+    /// assert_eq!(neg.lane_i8::<0>(), -1);
+    /// ```
+    #[inline(always)]
+    pub fn from_i4_packed_u64(packed: u64) -> Self {
+        let mut lanes = [0i8; 16];
+        for i in 0..16 {
+            let nibble = ((packed >> (4 * i)) & 0xf) as i8;
+            lanes[i] = if nibble > 7 { nibble - 16 } else { nibble };
+        }
+        // SAFETY: vld1q_s8 loads 16 bytes from a valid aligned stack array.
+        Self(unsafe { core::arch::aarch64::vld1q_s8(lanes.as_ptr()) })
+    }
+
+    /// Extract lane `N` as an `i8`.
+    ///
+    /// `N` must be in `0..16`.
+    #[inline(always)]
+    pub fn lane_i8<const N: usize>(self) -> i8 {
+        self.to_array()[N]
+    }
+
+    // ── W1a-#2: saturating_abs (NEON) ────────────────────────────────────────
+
+    /// Lane-wise saturating absolute value.
+    ///
+    /// `saturating_abs(i8::MIN) == i8::MAX` (127).  Uses NEON `vqabsq_s8`
+    /// which is hardware-saturating (the `q` suffix denotes saturating
+    /// semantics), unlike `vabsq_s8` which wraps.
+    ///
+    /// # Example
+    /// ```rust,ignore
+    /// let v = I8x16::splat(i8::MIN);
+    /// assert!(v.saturating_abs().to_array().iter().all(|&x| x == i8::MAX));
+    /// ```
+    #[inline(always)]
+    pub fn saturating_abs(self) -> Self {
+        // SAFETY: vqabsq_s8 is available on all aarch64 targets; it is a
+        // saturating absolute value — `vqabsq_s8(int8x16_t(-128))` returns 127.
+        Self(unsafe { core::arch::aarch64::vqabsq_s8(self.0) })
+    }
+}
+
+// ── W1a-#2: I8x32::saturating_abs (NEON polyfill) ─────────────────────────────
+
+/// `I8x32` on NEON is a scalar polyfill (neon_int_polyfill! array).
+/// We add saturating_abs via the scalar path as there is no 256-bit NEON reg.
+#[cfg(target_arch = "aarch64")]
+impl I8x32 {
+    /// Lane-wise saturating absolute value (scalar polyfill on NEON).
+    ///
+    /// `saturating_abs(i8::MIN) == i8::MAX`.  All 32 lanes processed via
+    /// `i8::saturating_abs` in a fused loop.
+    ///
+    /// # Example
+    /// ```rust,ignore
+    /// let v = I8x32::splat(i8::MIN);
+    /// assert!(v.saturating_abs().to_array().iter().all(|&x| x == i8::MAX));
+    /// ```
+    #[inline(always)]
+    pub fn saturating_abs(self) -> Self {
+        let mut o = [0i8; 32];
+        for i in 0..32 {
+            o[i] = self.0[i].saturating_abs();
+        }
+        Self(o)
+    }
+}
+
+// ── W1a-#3: U16x8::gather_u16 + palette_lookup_u8x8 (NEON) ──────────────────
+
+#[cfg(target_arch = "aarch64")]
+impl U16x8 {
+    /// Gather 8 `u16` values from `table` at the indices in `self`.
+    ///
+    /// NEON has no native gather instruction; this is a scalar loop over
+    /// 8 lanes which is still significantly faster than a cache-miss-bound
+    /// random-access loop in typical use because 8 sequential indirections
+    /// fit in NEON register pressure.
+    ///
+    /// In debug builds panics if any index `>= table.len()`.  In release
+    /// builds falls back to `table.get(i).copied().unwrap_or(0)`.
+    ///
+    /// # Example
+    /// ```rust,ignore
+    /// let table = [10u16, 20, 30, 40, 50, 60, 70, 80];
+    /// let idx = U16x8::from_array([0, 2, 4, 6, 1, 3, 5, 7]);
+    /// let result = U16x8::gather_u16(idx, &table);
+    /// assert_eq!(result.to_array(), [10, 30, 50, 70, 20, 40, 60, 80]);
+    /// ```
+    #[inline(always)]
+    pub fn gather_u16(indices: U16x8, table: &[u16]) -> Self {
+        let idx = indices.to_array();
+        #[cfg(debug_assertions)]
+        for &i in &idx {
+            assert!((i as usize) < table.len(), "gather_u16: index {} out of bounds (len={})", i, table.len());
+        }
+        let mut out = [0u16; 8];
+        for k in 0..8 {
+            out[k] = table.get(idx[k] as usize).copied().unwrap_or(0);
+        }
+        Self::from_array(out)
+    }
+
+    /// Extract lane `k` as a `u16`.
+    #[inline(always)]
+    pub fn lane(self, k: usize) -> u16 {
+        self.to_array()[k]
+    }
+}
+
+// ── W1a-#3: U8x8 + palette_lookup_u8x8 (NEON) ───────────────────────────────
+
+/// 8-lane `u8` vector for the NEON backend (scalar-storage polyfill).
+/// Used as the return type of `palette_lookup_u8x8`.
+#[cfg(target_arch = "aarch64")]
+#[derive(Copy, Clone, PartialEq)]
+#[repr(align(8))]
+pub struct U8x8(pub [u8; 8]);
+
+#[cfg(target_arch = "aarch64")]
+impl U8x8 {
+    pub const LANES: usize = 8;
+
+    /// Broadcast a single `u8` to all 8 lanes.
+    #[inline(always)]
+    pub fn splat(v: u8) -> Self {
+        Self([v; 8])
+    }
+
+    /// Load from a fixed-size array.
+    #[inline(always)]
+    pub fn from_array(arr: [u8; 8]) -> Self {
+        Self(arr)
+    }
+
+    /// Extract all 8 lanes as an array.
+    #[inline(always)]
+    pub fn to_array(self) -> [u8; 8] {
+        self.0
+    }
+}
+
+#[cfg(target_arch = "aarch64")]
+impl core::fmt::Debug for U8x8 {
+    fn fmt(&self, f: &mut core::fmt::Formatter<'_>) -> core::fmt::Result {
+        write!(f, "U8x8({:?})", &self.0[..])
+    }
+}
+
+/// Look up 8 bytes from a `u8` LUT by `u16` indices (NEON backend).
+///
+/// Scalar loop over 8 lanes (NEON has no native gather).
+///
+/// Bounds: panics in debug if any index `>= lut.len()`; returns 0 safely in
+/// release for out-of-range indices.
+#[cfg(target_arch = "aarch64")]
+#[inline(always)]
+pub fn palette_lookup_u8x8(idx_v: U16x8, lut: &[u8]) -> U8x8 {
+    let idx = idx_v.to_array();
+    #[cfg(debug_assertions)]
+    for &i in &idx {
+        assert!((i as usize) < lut.len(), "palette_lookup_u8x8: index {} OOB (len={})", i, lut.len());
+    }
+    let mut out = [0u8; 8];
+    for k in 0..8 {
+        out[k] = lut.get(idx[k] as usize).copied().unwrap_or(0);
+    }
+    U8x8(out)
+}
+
+// ── W1a-#4: prefetch_read_t0/t1/t2 (NEON / aarch64) ──────────────────────────
+
+/// Hint that `ptr` will be read soon; load into L1 (T0) cache.
+///
+/// On aarch64 emits `prfm pldl1keep, [ptr]` via inline asm.  `ptr` may be
+/// invalid (unmapped): the PRFM instruction is a hint that the CPU can silently
+/// drop per the ARM architecture reference.  No assertion is made.
+#[cfg(target_arch = "aarch64")]
+#[inline(always)]
+pub fn prefetch_read_t0(ptr: *const u8) {
+    // SAFETY: PRFM is a hint instruction; an invalid ptr simply makes the
+    // prefetch a no-op.  The pointer is never dereferenced.
+    // UNVERIFIED: inline asm syntax for `prfm` on Rust stable 1.94 aarch64 —
+    // believed correct per ARM ISA but not verified against an aarch64 builder.
+    unsafe {
+        core::arch::asm!(
+            "prfm pldl1keep, [{ptr}]",
+            ptr = in(reg) ptr,
+            options(nostack, readonly),
+        );
+    }
+}
+
+/// Hint to load into L2 (T1) cache on aarch64.
+#[cfg(target_arch = "aarch64")]
+#[inline(always)]
+pub fn prefetch_read_t1(ptr: *const u8) {
+    // SAFETY: same as prefetch_read_t0 — PRFM hint, no fault on invalid ptr.
+    // UNVERIFIED: pldl2keep is the correct ARM PRFM operand for L2 hint.
+    unsafe {
+        core::arch::asm!(
+            "prfm pldl2keep, [{ptr}]",
+            ptr = in(reg) ptr,
+            options(nostack, readonly),
+        );
+    }
+}
+
+/// Hint to load into L3 (T2) cache on aarch64.
+#[cfg(target_arch = "aarch64")]
+#[inline(always)]
+pub fn prefetch_read_t2(ptr: *const u8) {
+    // SAFETY: same as prefetch_read_t0 — PRFM hint, no fault on invalid ptr.
+    // UNVERIFIED: pldl3keep is the correct ARM PRFM operand for L3 hint.
+    unsafe {
+        core::arch::asm!(
+            "prfm pldl3keep, [{ptr}]",
+            ptr = in(reg) ptr,
+            options(nostack, readonly),
+        );
+    }
+}
+
+// ── W1a-#5: U64x8 / U64x4 popcnt (NEON) ──────────────────────────────────────
+// The NEON aarch64_simd::U64x8 is actually re-exported from simd_scalar.rs
+// (see `pub use crate::simd::scalar::{…U64x8}`).  popcnt / xor_popcount are
+// added to the scalar U64x8 in simd_scalar.rs and are thereby visible through
+// both x86_64 and aarch64 dispatch paths.
+//
+// U64x4 on the NEON backend is also the scalar polyfill (neon_int_polyfill!).
+// Its popcnt is also in simd_scalar.rs for the same reason.
+
+// ── W1a-#1: batch_packed_i4_16 (NEON backend) ────────────────────────────────
+
+/// Closure-parameterised batch over packed i4 data (NEON backend).
+///
+/// See the x86_64 version in `simd_avx512.rs` for full documentation.
+#[cfg(target_arch = "aarch64")]
+#[inline]
+pub fn batch_packed_i4_16<E, F>(packed: &[u64], aux: &[i8], out: &mut [E], f: F)
+where
+    F: Fn(I8x16, i8) -> E + Sync + Send,
+    E: Copy,
+{
+    assert_eq!(packed.len(), aux.len(), "batch_packed_i4_16: packed and aux must be same length");
+    let n = packed.len().min(out.len());
+    for i in 0..n {
+        let lanes = I8x16::from_i4_packed_u64(packed[i]);
+        out[i] = f(lanes, aux[i]);
+    }
+}
+
+// ── Aliases ──────────────────────────────────────────────────────────────────
+#[cfg(target_arch = "aarch64")]
+#[allow(non_camel_case_types)]
+pub type u8x8 = U8x8;
+
 // ═══════════════════════════════════════════════════════════════════════════
 // Tests (run on x86 as compile-check, actual NEON tests need aarch64)
 // ═══════════════════════════════════════════════════════════════════════════
diff --git a/src/simd_scalar.rs b/src/simd_scalar.rs
index 77b0b421..fbf4c33e 100644
--- a/src/simd_scalar.rs
+++ b/src/simd_scalar.rs
@@ -1267,6 +1267,387 @@ impl Mul for U32x16 {
     }
 }
 
+// ============================================================================
+// W1a SIMD primitives — scalar backend
+// ============================================================================
+//
+// The scalar backend is the correctness anchor for all W1a primitives.
+// All implementations here are pure safe Rust with no intrinsics.
+
+// ── W1a-#1: I8x16 + lane_i8 + from_i4_packed_u64 (scalar) ──────────────────
+
+/// 16-lane `i8` vector — scalar fallback for non-NEON, non-x86_64 targets.
+///
+/// On x86_64 this type comes from `simd_avx512.rs`; on aarch64 from
+/// `simd_neon.rs`.  This scalar version covers wasm32, riscv, and any other
+/// target that falls through to the scalar dispatch arm.
+#[derive(Copy, Clone, PartialEq)]
+#[repr(align(16))]
+pub struct I8x16(pub [i8; 16]);
+
+impl I8x16 {
+    pub const LANES: usize = 16;
+
+    /// Broadcast a single `i8` value to all 16 lanes.
+    #[inline(always)]
+    pub fn splat(v: i8) -> Self {
+        Self([v; 16])
+    }
+
+    /// Load from a slice (at least 16 elements required).
+    #[inline(always)]
+    pub fn from_slice(s: &[i8]) -> Self {
+        assert!(s.len() >= 16);
+        let mut a = [0i8; 16];
+        a.copy_from_slice(&s[..16]);
+        Self(a)
+    }
+
+    /// Load from a fixed-size array.
+    #[inline(always)]
+    pub fn from_array(arr: [i8; 16]) -> Self {
+        Self(arr)
+    }
+
+    /// Extract all 16 lanes as an array.
+    #[inline(always)]
+    pub fn to_array(self) -> [i8; 16] {
+        self.0
+    }
+
+    /// Copy lanes into a slice (must have at least 16 elements).
+    #[inline(always)]
+    pub fn copy_to_slice(self, s: &mut [i8]) {
+        assert!(s.len() >= 16);
+        s[..16].copy_from_slice(&self.0);
+    }
+
+    /// Unpack 16 signed i4 nibbles from a `u64` into 16 sign-extended `i8` lanes.
+    ///
+    /// Nibble layout: `lane[i] = sign_extend_i4((packed >> (4*i)) & 0xf)`.
+    /// Values `0x0..=0x7` → `0..=7`; values `0x8..=0xf` → `-8..=-1`.
+    ///
+    /// Edge cases:
+    /// - `from_i4_packed_u64(0)` → all lanes `0`.
+    /// - All nibbles `0xf` → all lanes `-1`.
+    /// - Nibble `0x8` → lane value `-8` (minimum i4 value).
+    ///
+    /// # Example
+    /// ```rust,ignore
+    /// let z = I8x16::from_i4_packed_u64(0);
+    /// assert!(z.to_array().iter().all(|&x| x == 0));
+    /// let neg = I8x16::from_i4_packed_u64(u64::MAX);
+    /// assert!(neg.to_array().iter().all(|&x| x == -1));
+    /// ```
+    #[inline(always)]
+    pub fn from_i4_packed_u64(packed: u64) -> Self {
+        let mut lanes = [0i8; 16];
+        for i in 0..16 {
+            let nibble = ((packed >> (4 * i)) & 0xf) as i8;
+            lanes[i] = if nibble > 7 { nibble - 16 } else { nibble };
+        }
+        Self(lanes)
+    }
+
+    /// Extract lane `N` as an `i8`.  `N` must be in `0..16`.
+    #[inline(always)]
+    pub fn lane_i8<const N: usize>(self) -> i8 {
+        self.0[N]
+    }
+
+    /// Lane-wise saturating absolute value.
+    ///
+    /// `saturating_abs(i8::MIN) == i8::MAX` (127).  Uses `i8::saturating_abs`.
+    ///
+    /// # Example
+    /// ```rust,ignore
+    /// let v = I8x16::splat(i8::MIN);
+    /// assert!(v.saturating_abs().to_array().iter().all(|&x| x == i8::MAX));
+    /// ```
+    #[inline(always)]
+    pub fn saturating_abs(self) -> Self {
+        let mut o = [0i8; 16];
+        for i in 0..16 {
+            o[i] = self.0[i].saturating_abs();
+        }
+        Self(o)
+    }
+}
+
+impl core::fmt::Debug for I8x16 {
+    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
+        write!(f, "I8x16({:?})", &self.0[..])
+    }
+}
+
+// ── W1a-#2: I8x32::saturating_abs (scalar) ───────────────────────────────────
+
+impl I8x32 {
+    /// Lane-wise saturating absolute value.
+    ///
+    /// `saturating_abs(i8::MIN) == i8::MAX`.  All 32 lanes via `i8::saturating_abs`.
+    ///
+    /// # Example
+    /// ```rust,ignore
+    /// let v = I8x32::splat(i8::MIN);
+    /// assert!(v.saturating_abs().to_array().iter().all(|&x| x == i8::MAX));
+    /// ```
+    #[inline(always)]
+    pub fn saturating_abs(self) -> Self {
+        let mut o = [0i8; 32];
+        for i in 0..32 {
+            o[i] = self.0[i].saturating_abs();
+        }
+        Self(o)
+    }
+}
+
+// ── W1a-#3: U16x8 / U8x8 / palette_lookup_u8x8 (scalar) ─────────────────────
+
+/// 8-lane `u16` vector — scalar fallback.
+///
+/// On aarch64 this type is backed by `uint16x8_t`; on x86_64 it is a scalar-
+/// storage polyfill in `simd_avx512.rs`.  This version covers all other targets.
+#[derive(Copy, Clone, PartialEq)]
+#[repr(align(16))]
+pub struct U16x8(pub [u16; 8]);
+
+impl U16x8 {
+    pub const LANES: usize = 8;
+
+    /// Broadcast a single `u16` to all 8 lanes.
+    #[inline(always)]
+    pub fn splat(v: u16) -> Self {
+        Self([v; 8])
+    }
+
+    /// Load from a slice (at least 8 elements required).
+    #[inline(always)]
+    pub fn from_slice(s: &[u16]) -> Self {
+        assert!(s.len() >= 8);
+        let mut a = [0u16; 8];
+        a.copy_from_slice(&s[..8]);
+        Self(a)
+    }
+
+    /// Load from a fixed-size array.
+    #[inline(always)]
+    pub fn from_array(arr: [u16; 8]) -> Self {
+        Self(arr)
+    }
+
+    /// Extract all 8 lanes as an array.
+    #[inline(always)]
+    pub fn to_array(self) -> [u16; 8] {
+        self.0
+    }
+
+    /// Gather 8 `u16` values from `table` at the indices in `self`.
+    ///
+    /// In debug panics if any index `>= table.len()`.  In release, OOB
+    /// indices return 0 safely.
+    ///
+    /// # Example
+    /// ```rust,ignore
+    /// let table = [10u16, 20, 30, 40, 50, 60, 70, 80];
+    /// let idx = U16x8::from_array([0, 2, 4, 6, 1, 3, 5, 7]);
+    /// let r = U16x8::gather_u16(idx, &table);
+    /// assert_eq!(r.to_array(), [10, 30, 50, 70, 20, 40, 60, 80]);
+    /// ```
+    #[inline(always)]
+    pub fn gather_u16(indices: U16x8, table: &[u16]) -> Self {
+        let idx = indices.to_array();
+        #[cfg(debug_assertions)]
+        for &i in &idx {
+            assert!((i as usize) < table.len(), "gather_u16: index {} OOB (len={})", i, table.len());
+        }
+        let mut out = [0u16; 8];
+        for k in 0..8 {
+            out[k] = table.get(idx[k] as usize).copied().unwrap_or(0);
+        }
+        Self(out)
+    }
+
+    /// Extract lane `k` as a `u16`.
+    #[inline(always)]
+    pub fn lane(self, k: usize) -> u16 {
+        self.0[k]
+    }
+}
+
+impl fmt::Debug for U16x8 {
+    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
+        write!(f, "U16x8({:?})", &self.0[..])
+    }
+}
+
+/// 8-lane `u8` vector — scalar fallback.  Used as the return type of
+/// `palette_lookup_u8x8`.
+#[derive(Copy, Clone, PartialEq)]
+#[repr(align(8))]
+pub struct U8x8(pub [u8; 8]);
+
+impl U8x8 {
+    pub const LANES: usize = 8;
+
+    /// Broadcast a single `u8` to all 8 lanes.
+    #[inline(always)]
+    pub fn splat(v: u8) -> Self {
+        Self([v; 8])
+    }
+
+    /// Load from a fixed-size array.
+    #[inline(always)]
+    pub fn from_array(arr: [u8; 8]) -> Self {
+        Self(arr)
+    }
+
+    /// Extract all 8 lanes as an array.
+    #[inline(always)]
+    pub fn to_array(self) -> [u8; 8] {
+        self.0
+    }
+}
+
+impl fmt::Debug for U8x8 {
+    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
+        write!(f, "U8x8({:?})", &self.0[..])
+    }
+}
+
+/// Look up 8 bytes from a `u8` LUT by `u16` indices (scalar fallback).
+///
+/// Panics in debug on OOB; returns 0 safely in release.
+#[inline(always)]
+pub fn palette_lookup_u8x8(idx_v: U16x8, lut: &[u8]) -> U8x8 {
+    let idx = idx_v.to_array();
+    #[cfg(debug_assertions)]
+    for &i in &idx {
+        assert!((i as usize) < lut.len(), "palette_lookup_u8x8: index {} OOB (len={})", i, lut.len());
+    }
+    let mut out = [0u8; 8];
+    for k in 0..8 {
+        out[k] = lut.get(idx[k] as usize).copied().unwrap_or(0);
+    }
+    U8x8(out)
+}
+
+// ── W1a-#4: prefetch_read_t0/t1/t2 (scalar / other arch) ────────────────────
+
+/// Hint that `ptr` will be read soon (scalar / unknown-arch no-op).
+///
+/// On `x86_64` the real implementation in `simd_avx512.rs` emits `PREFETCHT0`.
+/// On `aarch64` the real implementation in `simd_neon.rs` emits `prfm`.
+/// On all other targets (wasm, riscv, …) this is a deliberate no-op because
+/// the prefetch contract is a hint — silent no-op is correct per the spec.
+///
+/// `ptr` may be invalid; it is never dereferenced.
+#[inline(always)]
+pub fn prefetch_read_t0(_ptr: *const u8) {
+    // no-op on unknown/scalar targets
+}
+
+/// Hint to load into L2 (T1) cache (scalar / unknown-arch no-op).
+#[inline(always)]
+pub fn prefetch_read_t1(_ptr: *const u8) {
+    // no-op on unknown/scalar targets
+}
+
+/// Hint to load into L3 (T2) cache (scalar / unknown-arch no-op).
+#[inline(always)]
+pub fn prefetch_read_t2(_ptr: *const u8) {
+    // no-op on unknown/scalar targets
+}
+
+// ── W1a-#5: U64x8::popcnt / xor_popcount + U64x4::popcnt (scalar) ───────────
+
+impl U64x8 {
+    /// Lane-wise population count (scalar).  Each lane → set-bit count (0..=64).
+    ///
+    /// # Example
+    /// ```rust,ignore
+    /// let v = U64x8::splat(u64::MAX);
+    /// assert!(v.popcnt().to_array().iter().all(|&x| x == 64));
+    /// let z = U64x8::splat(0);
+    /// assert!(z.popcnt().to_array().iter().all(|&x| x == 0));
+    /// ```
+    #[inline(always)]
+    pub fn popcnt(self) -> Self {
+        let mut out = [0u64; 8];
+        for i in 0..8 {
+            out[i] = self.0[i].count_ones() as u64;
+        }
+        Self(out)
+    }
+
+    /// XOR two vectors lane-wise, popcount each lane, sum all 8 lanes.
+    ///
+    /// # Example
+    /// ```rust,ignore
+    /// let a = U64x8::splat(u64::MAX);
+    /// let b = U64x8::splat(0);
+    /// assert_eq!(a.xor_popcount(b), 512); // 64 bits × 8 lanes
+    /// assert_eq!(a.xor_popcount(a), 0);   // same inputs → Hamming distance 0
+    /// ```
+    #[inline(always)]
+    pub fn xor_popcount(self, other: Self) -> u64 {
+        let mut sum = 0u64;
+        for i in 0..8 {
+            sum += (self.0[i] ^ other.0[i]).count_ones() as u64;
+        }
+        sum
+    }
+}
+
+impl U64x4 {
+    /// Lane-wise population count (scalar).  Each lane → set-bit count (0..=64).
+    ///
+    /// # Example
+    /// ```rust,ignore
+    /// let v = U64x4::from_array([u64::MAX, 0, 1, !1]);
+    /// assert_eq!(v.popcnt().to_array(), [64, 0, 1, 63]);
+    /// ```
+    #[inline(always)]
+    pub fn popcnt(self) -> Self {
+        let mut out = [0u64; 4];
+        for i in 0..4 {
+            out[i] = self.0[i].count_ones() as u64;
+        }
+        Self(out)
+    }
+}
+
+// ── W1a-#1: batch_packed_i4_16 (scalar backend) ──────────────────────────────
+
+/// Closure-parameterised batch over packed i4 data (scalar backend).
+///
+/// Iterates `min(packed.len(), aux.len(), out.len())` times.  Each iteration
+/// unpacks `packed[i]` into an `I8x16` (16 sign-extended nibbles) and passes
+/// it together with `aux[i]` to `f`, storing the result in `out[i]`.
+///
+/// Panics if `packed.len() != aux.len()`.
+#[inline]
+pub fn batch_packed_i4_16<E, F>(packed: &[u64], aux: &[i8], out: &mut [E], f: F)
+where
+    F: Fn(I8x16, i8) -> E + Sync + Send,
+    E: Copy,
+{
+    assert_eq!(packed.len(), aux.len(), "batch_packed_i4_16: packed and aux must be same length");
+    let n = packed.len().min(out.len());
+    for i in 0..n {
+        let lanes = I8x16::from_i4_packed_u64(packed[i]);
+        out[i] = f(lanes, aux[i]);
+    }
+}
+
+// ── Lowercase aliases ─────────────────────────────────────────────────────────
+#[allow(non_camel_case_types)]
+pub type i8x16 = I8x16;
+#[allow(non_camel_case_types)]
+pub type u16x8 = U16x8;
+#[allow(non_camel_case_types)]
+pub type u8x8 = U8x8;
+
 // Lowercase aliases
 #[allow(non_camel_case_types)]
 pub type f32x16 = F32x16;