From 63f91df741c66f5a2c3cf525adf3a208db4cb2af Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Wed, 20 May 2026 14:45:58 +0000
Subject: [PATCH] =?UTF-8?q?docs(simd):=20TD-SIMD-8=20=E2=80=94=20F16=20hon?=
 =?UTF-8?q?esty=20+=20matrix=20audit=20for=20missing=20lanes?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

# F16 honesty (TD-SIMD-8)

`src/simd_half.rs` F16x16: docstring now explicitly discloses scalar
storage and routes hot loops to `core::simd::f16x16` (under
`nightly-simd`) or to fp32 with conversion at boundaries. Disambiguates
from `simd_avx2::F16Scaler` — a scaling CONTEXT for range-normalizing
values before f16 encoding, not the F16x16 SIMD type. Both files cross-
reference each other so a future reader doesn't repeat the confusion.

`src/simd_avx2.rs` F16Scaler: docstring strengthened with the same
disambiguation note.

# Matrix audit (user request)

Cross-referenced every `pub struct *x*` in simd_avx512.rs, simd_avx2.rs,
simd_neon.rs, simd_nightly/mod.rs against the parity matrix in the
architecture doc. Corrections:

- **F32x8 / F64x4 v3 column: ❌ → ✅ `__m256`/`__m256d` (in `simd_avx512`)**.
  The dispatch at `src/simd.rs:294` already imports these from
  simd_avx512 on the v3 / AVX2 path. They're AVX (not AVX-512), so they
  work on every Sandy Bridge+ host. The matrix was stale.
- **U32x8, U64x4 rows added** — nightly-only currently; ❌ on x86 +
  aarch64 + scalar. core::simd has them via `simd_nightly`.
- **U16x16, I32x8, I64x4 rows added** — missing across EVERY backend
  including nightly. Theoretical 256-bit shapes no consumer has reached
  for yet.
- **F32Mask8 / F64Mask4 rows added** — declared in simd_scalar as
  `F32Mask8Scalar` / `F64Mask4Scalar` (rename came from a duplicate-
  decl conflict on i686); not surfaced through `crate::simd::*`. AVX-512
  has them natively via `__mmask8` but they're not typed.
- **Sub-byte lanes section added** — I4 / U4 lanes used by INT4
  quantized inference (Q4_0, Q4_K, GPTQ, AWQ). No first-class wrapper;
  consumers pack 2× nibbles per byte and operate through U8x64 + shr/
  mask. Documents the hardware story (AVX-512 VBMI2, VPCOMPRESSB on
  x86; shr+mask trick on aarch64). Tracked as TD-SIMD-11 if a consumer
  files for it.

TD-SIMD-8 description updated in §5 to point at `simd_half.rs:123` (the
actual F16x16 polyfill) rather than `simd_avx2.rs:2566` (the unrelated
F16Scaler scaling utility).
---
 .../knowledge/simd-dispatch-architecture.md   | 62 ++++++++++++++++---
 src/simd_avx2.rs                              | 11 ++++
 src/simd_half.rs                              | 23 ++++++-
 3 files changed, 85 insertions(+), 11 deletions(-)

diff --git a/.claude/knowledge/simd-dispatch-architecture.md b/.claude/knowledge/simd-dispatch-architecture.md
index 50dc5e89..6181932f 100644
--- a/.claude/knowledge/simd-dispatch-architecture.md
+++ b/.claude/knowledge/simd-dispatch-architecture.md
@@ -144,31 +144,73 @@ tracked as TD-SIMD-3.)
 | Lane type | `simd_avx512` (v4) | `simd_avx2` (v3) | `simd_neon` (aarch64) | `simd_nightly` | `scalar` |
 |---|---|---|---|---|---|
 | `F32x16` | ✅ `__m512` | 🟡 `(f32x8, f32x8)` | 🟡 `[float32x4_t; 4]` | 🔵 `core::simd::f32x16` | ✅ `[f32; 16]` |
-| `F32x8` | ✅ `__m256` | ❌ | ⛔ | 🔵 | ✅ |
+| `F32x8` | ✅ `__m256` | ✅ `__m256` (in `simd_avx512`) | ⛔ | 🔵 | ✅ |
 | `F64x8` | ✅ `__m512d` | 🟡 `(f64x4, f64x4)` | 🟡 `[float64x2_t; 4]` | 🔵 | ✅ |
-| `F64x4` | ✅ `__m256d` | ❌ | ⛔ | 🔵 | ✅ |
+| `F64x4` | ✅ `__m256d` | ✅ `__m256d` (in `simd_avx512`) | ⛔ | 🔵 | ✅ |
 | `U8x64` | ✅ `__m512i` | 🟠 `[u8; 64]` polyfill | ❌ | 🔵 | ✅ |
 | `U8x32` | ✅ `__m256i` | ✅ `__m256i` | ❌ | 🔵 | ✅ |
 | `U16x32` | ✅ `__m512i` | 🟠 `[u16; 32]` polyfill | ❌ | 🔵 | ✅ |
+| `U16x16` | ❌ | ❌ | ❌ | ❌ | ❌ |
 | `U32x16` | ✅ `__m512i` | 🟠 `[u32; 16]` polyfill | ❌ | 🔵 | ✅ |
+| `U32x8` | ❌ | ❌ | ❌ | 🔵 `core::simd::u32x8` | ❌ |
 | `U64x8` | ✅ `__m512i` | 🟠 `[u64; 8]` polyfill | ❌ | 🔵 | ✅ |
+| `U64x4` | ❌ | ❌ | ❌ | 🔵 `core::simd::u64x4` | ❌ |
 | `I8x32` | ✅ `__m256i` | ✅ `__m256i` (in `simd_avx512`) | ❌ | 🔵 | ✅ |
 | `I8x64` | ✅ `__m512i` | 🟠 `[i8; 64]` polyfill | ❌ | 🔵 | ✅ |
 | `I16x16` | ✅ `__m256i` | ✅ `__m256i` (in `simd_avx512`) | ❌ | 🔵 | ✅ |
 | `I16x32` | ✅ `__m512i` | 🟠 `[i16; 32]` polyfill | ❌ | 🔵 | ✅ |
 | `I32x16` | ✅ `__m512i` | 🟠 `[i32; 16]` polyfill | ❌ | 🔵 | ✅ |
+| `I32x8` | ❌ | ❌ | ❌ | ❌ | ❌ |
 | `I64x8` | ✅ `__m512i` | 🟠 `[i64; 8]` polyfill | ❌ | 🔵 | ✅ |
+| `I64x4` | ❌ | ❌ | ❌ | ❌ | ❌ |
 | `BF16x8` | ✅ `__m128bh` | ❌ | ❌ | 🔵 | ✅ |
 | `BF16x16` | ✅ `__m256bh` | ❌ | ❌ | 🔵 | ✅ |
-| `F16x16` | ❌ | 🟡 `F16Scaler` (scalar) | ❌ | 🔵 | ✅ |
+| `F16x16` | ❌ | 🟠 `[u16; 16]` polyfill (`simd_half`) | 🟠 `[u16; 16]` polyfill (`simd_half`) | 🔵 | ✅ |
 | `F32Mask16` | ✅ `__mmask16` | ✅ `u16` bitmask | ✅ `u16` bitmask | 🔵 | ✅ |
+| `F32Mask8` | ❌ exposed (avail via `__mmask8`) | ❌ exposed (avail in `simd_scalar` as `F32Mask8Scalar`) | ❌ exposed | 🔵 | ✅ via `F32Mask8Scalar` |
 | `F64Mask8` | ✅ `__mmask8` | ✅ `u8` bitmask | ✅ `u8` bitmask | 🔵 | ✅ |
-
-**Aarch64-native narrower types** (only useful directly when the
-consumer wants 128-bit shapes): `I8x16`, `I16x8`, `U8x16`, `U16x8`,
-`U32x4`, `U64x2`, `I32x4`, `I64x2`. These are not in the cross-arch
-parity surface — consumers requesting 256-bit / 512-bit shapes go
-through the composed wrappers.
+| `F64Mask4` | ❌ exposed (avail via `__mmask8`) | ❌ exposed (avail in `simd_scalar` as `F64Mask4Scalar`) | ❌ exposed | 🔵 | ✅ via `F64Mask4Scalar` |
+
+### Sub-byte lanes (not a SIMD wrapper anywhere)
+
+**`I4` / `U4`** — 4-bit (nibble) lanes used by INT4 quantized inference
+(GGUF Q4_0 / Q4_K, GPTQ, AWQ). No first-class wrapper exists or is
+planned. Consumers pack 2× nibbles per byte and operate through
+`U8x64` with `shr_epi16` + `& 0x0F` masks; the same trick gives them
+the 128- and 64-byte shapes via the existing AVX-512 / AVX2 / NEON
+paths. If a first-class `I4x128` were ever wanted, AVX-512 VBMI2's
+`VPCOMPRESSB` + `VPEXPANDB` and AVX-512 IFMA's `VPMADD52` give the
+hardware story; on aarch64 there's no native nibble support and the
+shr+mask trick stays. Tracked as TD-SIMD-11 if a consumer files for it.
+
+### Aarch64-native narrower types
+
+Only useful directly when the consumer wants 128-bit shapes:
+`I8x16`, `I16x8`, `U8x16`, `U16x8`, `U32x4`, `U64x2`, `I32x4`, `I64x2`.
+These are not in the cross-arch parity surface — consumers requesting
+256-bit / 512-bit shapes go through the composed wrappers.
+
+### Gaps surfaced 2026-05-20
+
+- **`F32x8` / `F64x4` are universal on x86**, even on the v3 / AVX2 path
+  — they share the `__m256` / `__m256d` declarations exposed by
+  `simd_avx512.rs` (AVX, not AVX-512; works on every host with AVX
+  support, i.e. Sandy Bridge+). The previous matrix marked them `❌`
+  in the v3 column — corrected above.
+- **`U32x8` / `U64x4`** exist only in `simd_nightly` (via `core::simd`).
+  No native or polyfill wrapper on x86 or aarch64. Add to `simd_avx512`
+  + `simd_scalar` if a consumer needs them at 256-bit width.
+- **`I32x8` / `I64x4` / `U16x16`** missing across every backend (incl.
+  nightly). Theoretical 256-bit shapes that no consumer has reached for
+  yet; add to backlog if needed.
+- **`F32Mask8` / `F64Mask4`** are declared in `simd_scalar` as
+  `F32Mask8Scalar` / `F64Mask4Scalar` (the rename came from a duplicate-
+  decl conflict on i686 — see `src/simd_scalar.rs:340-345`). Not
+  surfaced through `crate::simd::*`. If consumers want these mask
+  widths, expose them and unify the name (drop the `Scalar` suffix on
+  AVX-512 where `__mmask8` natively maps to F64Mask8 already; the
+  256-bit f64 lane width needs a 4-bit mask which `__mmask8` can hold
+  but isn't yet typed as `F64Mask4`).
 
 ### Read of the matrix
 
@@ -199,7 +241,7 @@ Ranked by P0 (blocks current CI / consumers) → P3 (nice-to-have).
 | **TD-SIMD-5** | **P1** | Scalar fallback inline in `simd.rs` (`pub(crate) mod scalar`) makes symmetry hard — every other backend is its own file. | inspection | Promote to `src/simd_scalar.rs`; `simd.rs` becomes pure dispatch. ~mechanical refactor. |
 | **TD-SIMD-6** | **P2** | No `runtime-dispatch` feature / `simd_runtime` module exists yet. Release-binary distribution to heterogeneous silicon requires recompile per target today. | `grep -r "LazyLock<CpuCaps>"` only matches reporting code in `simd.rs:52-55` | New module wiring per-op trampolines from the compiled-in backends. ~300 LoC + one new cargo feature. |
 | **TD-SIMD-7** | **P2** | Compile-time arms in `simd.rs:153-194` are duplicated four times (one per type group: F32x16, F64x8, U8x32, BF16x16). Adding a new lane requires copy-pasting four `#[cfg(...)]` arms. | inspection | Single source-of-truth macro emitting the arms. ~one macro_rules!, 50 LoC. |
-| **TD-SIMD-8** | **P2** | `F16Scaler` in `simd_avx2.rs:2566` is a scalar implementation masquerading as a SIMD type. Consumers using `F16x16` on v3 get scalar perf without warning. | grep `F16Scaler` | Either gate `F16x16` behind `target_feature = "f16c"` or rename / document the scalar nature. ~20 LoC + docs. |
+| **TD-SIMD-8** | **P2** | `F16x16` in `src/simd_half.rs:123` is a scalar `[u16; 16]` polyfill — every arithmetic op upcasts to f32, computes, downcasts. Consumers using `crate::simd::F16x16` get scalar perf even on AVX-512 hardware with `vcvtph2ps` / `vcvtps2ph`. (`F16Scaler` in `simd_avx2.rs:2566` is unrelated — it's a *scaling context* for range-normalizing values before f16 encoding, not the F16x16 SIMD type.) | inspection of `src/simd_half.rs:115-150` | (a) Replace the `[u16; 16]` storage with `__m256i` + `_mm256_cvtph_ps` / `_mm256_cvtps_ph` under `target_feature = "f16c"` (Sapphire Rapids+, all Skylake AVX-512). (b) Add an `F16x16Scalar` alias and route consumers explicitly. (c) Add a doc-warning at the type level pointing at the architecture doc. ~80 LoC. |
 | **TD-SIMD-9** | **P3** | No CI matrix entry for the `nightly-simd` polyfill path. | `.github/workflows/ci.yaml` | Add a `nightly-simd-polyfill` job that builds with `--features nightly-simd` on nightly rustc. ~20 LoC YAML. |
 | **TD-SIMD-10** | **P3** | No CI matrix entry for `.cargo/config-avx512.toml`. AVX-512 deployment path silently bit-rots between PRs. | `.github/workflows/ci.yaml` | Add an `avx-512-explicit` job using a runner with AVX-512 silicon. ~20 LoC YAML; runner availability TBD. |
 
diff --git a/src/simd_avx2.rs b/src/simd_avx2.rs
index 5e164eab..2a3f9ed3 100644
--- a/src/simd_avx2.rs
+++ b/src/simd_avx2.rs
@@ -2562,6 +2562,17 @@ pub fn f16_kahan_dot(a: &[u16], b: &[u16]) -> f32 {
 ///
 /// Analyzes the input range, computes scale that maps |max| → 1.0,
 /// then uses that scale for all encode/decode operations.
+///
+/// # NOT a SIMD type
+///
+/// This is a *scaling utility* — it normalizes value ranges before
+/// f32 → f16 conversion so the dynamic range maps cleanly into f16's
+/// `[-65504, 65504]` window. The SIMD f16 wrapper is `simd_half::F16x16`
+/// (also a scalar polyfill on stable — see TD-SIMD-8 in
+/// `.claude/knowledge/simd-dispatch-architecture.md`). Earlier versions
+/// of the architecture doc's parity matrix mistakenly listed
+/// `F16Scaler` in the `F16x16` row's AVX2 column; the two are
+/// unrelated.
 #[derive(Debug, Clone, Copy)]
 pub struct F16Scaler {
     /// Multiply by this before f32→f16 (shifts into sweet spot)
diff --git a/src/simd_half.rs b/src/simd_half.rs
index 327f0943..223b50a1 100644
--- a/src/simd_half.rs
+++ b/src/simd_half.rs
@@ -118,7 +118,28 @@ impl BF16x16 {
 
 /// 16 × F16 (IEEE 754 binary16) packed into a scalar array.
 ///
-/// All arithmetic operates via f32 upcast → op → F16 downcast (round-to-nearest-even).
+/// # Scalar-perf disclosure (TD-SIMD-8)
+///
+/// **This is a scalar polyfill, not a SIMD type.** Storage is plain
+/// `[u16; 16]` (no `__m256i` / `__m256bh` / `float16x8_t`). Every
+/// arithmetic op upcasts to f32, computes lane-by-lane, downcasts back
+/// to f16 with round-to-nearest-even — same path on every backend
+/// (AVX-512, AVX2, NEON, scalar). Consumers in hot loops should NOT
+/// reach for `crate::simd::F16x16` expecting SIMD throughput.
+///
+/// The hardware-native paths exist on x86 via `_mm256_cvtph_ps` /
+/// `_mm256_cvtps_ph` (F16C; Ivy Bridge+) and on aarch64 via
+/// `vfmaq_f16` (ARMv8.2-A `+fp16`; Pi 5, Apple, modern Snapdragons).
+/// Wiring those into `F16x16` is tracked as TD-SIMD-8 in
+/// `.claude/knowledge/simd-dispatch-architecture.md`. Until then, hot
+/// loops on f16 should use `core::simd::f16x16` under the `nightly-simd`
+/// feature (real `core::simd::*` codegen) or stay in f32 and convert
+/// at storage boundaries.
+///
+/// Not to be confused with `simd_avx2::F16Scaler` — that's a *scaling
+/// context* for range-normalizing values before f16 encoding (so the
+/// dynamic range maps to f16's `[-65504, 65504]` window without
+/// clipping), not a SIMD lane type.
 #[derive(Clone, Copy, Debug)]
 pub struct F16x16([u16; 16]);