From 63f91df741c66f5a2c3cf525adf3a208db4cb2af Mon Sep 17 00:00:00 2001 From: Claude Date: Wed, 20 May 2026 14:45:58 +0000 Subject: [PATCH] =?UTF-8?q?docs(simd):=20TD-SIMD-8=20=E2=80=94=20F16=20hon?= =?UTF-8?q?esty=20+=20matrix=20audit=20for=20missing=20lanes?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit # F16 honesty (TD-SIMD-8) `src/simd_half.rs` F16x16: docstring now explicitly discloses scalar storage and routes hot loops to `core::simd::f16x16` (under `nightly-simd`) or to fp32 with conversion at boundaries. Disambiguates from `simd_avx2::F16Scaler` — a scaling CONTEXT for range-normalizing values before f16 encoding, not the F16x16 SIMD type. Both files cross- reference each other so a future reader doesn't repeat the confusion. `src/simd_avx2.rs` F16Scaler: docstring strengthened with the same disambiguation note. # Matrix audit (user request) Cross-referenced every `pub struct *x*` in simd_avx512.rs, simd_avx2.rs, simd_neon.rs, simd_nightly/mod.rs against the parity matrix in the architecture doc. Corrections: - **F32x8 / F64x4 v3 column: ❌ → ✅ `__m256`/`__m256d` (in `simd_avx512`)**. The dispatch at `src/simd.rs:294` already imports these from simd_avx512 on the v3 / AVX2 path. They're AVX (not AVX-512), so they work on every Sandy Bridge+ host. The matrix was stale. - **U32x8, U64x4 rows added** — nightly-only currently; ❌ on x86 + aarch64 + scalar. core::simd has them via `simd_nightly`. - **U16x16, I32x8, I64x4 rows added** — missing across EVERY backend including nightly. Theoretical 256-bit shapes no consumer has reached for yet. - **F32Mask8 / F64Mask4 rows added** — declared in simd_scalar as `F32Mask8Scalar` / `F64Mask4Scalar` (rename came from a duplicate- decl conflict on i686); not surfaced through `crate::simd::*`. AVX-512 has them natively via `__mmask8` but they're not typed. - **Sub-byte lanes section added** — I4 / U4 lanes used by INT4 quantized inference (Q4_0, Q4_K, GPTQ, AWQ). No first-class wrapper; consumers pack 2× nibbles per byte and operate through U8x64 + shr/ mask. Documents the hardware story (AVX-512 VBMI2, VPCOMPRESSB on x86; shr+mask trick on aarch64). Tracked as TD-SIMD-11 if a consumer files for it. TD-SIMD-8 description updated in §5 to point at `simd_half.rs:123` (the actual F16x16 polyfill) rather than `simd_avx2.rs:2566` (the unrelated F16Scaler scaling utility). --- .../knowledge/simd-dispatch-architecture.md | 62 ++++++++++++++++--- src/simd_avx2.rs | 11 ++++ src/simd_half.rs | 23 ++++++- 3 files changed, 85 insertions(+), 11 deletions(-) diff --git a/.claude/knowledge/simd-dispatch-architecture.md b/.claude/knowledge/simd-dispatch-architecture.md index 50dc5e89..6181932f 100644 --- a/.claude/knowledge/simd-dispatch-architecture.md +++ b/.claude/knowledge/simd-dispatch-architecture.md @@ -144,31 +144,73 @@ tracked as TD-SIMD-3.) | Lane type | `simd_avx512` (v4) | `simd_avx2` (v3) | `simd_neon` (aarch64) | `simd_nightly` | `scalar` | |---|---|---|---|---|---| | `F32x16` | ✅ `__m512` | 🟡 `(f32x8, f32x8)` | 🟡 `[float32x4_t; 4]` | 🔵 `core::simd::f32x16` | ✅ `[f32; 16]` | -| `F32x8` | ✅ `__m256` | ❌ | ⛔ | 🔵 | ✅ | +| `F32x8` | ✅ `__m256` | ✅ `__m256` (in `simd_avx512`) | ⛔ | 🔵 | ✅ | | `F64x8` | ✅ `__m512d` | 🟡 `(f64x4, f64x4)` | 🟡 `[float64x2_t; 4]` | 🔵 | ✅ | -| `F64x4` | ✅ `__m256d` | ❌ | ⛔ | 🔵 | ✅ | +| `F64x4` | ✅ `__m256d` | ✅ `__m256d` (in `simd_avx512`) | ⛔ | 🔵 | ✅ | | `U8x64` | ✅ `__m512i` | 🟠 `[u8; 64]` polyfill | ❌ | 🔵 | ✅ | | `U8x32` | ✅ `__m256i` | ✅ `__m256i` | ❌ | 🔵 | ✅ | | `U16x32` | ✅ `__m512i` | 🟠 `[u16; 32]` polyfill | ❌ | 🔵 | ✅ | +| `U16x16` | ❌ | ❌ | ❌ | ❌ | ❌ | | `U32x16` | ✅ `__m512i` | 🟠 `[u32; 16]` polyfill | ❌ | 🔵 | ✅ | +| `U32x8` | ❌ | ❌ | ❌ | 🔵 `core::simd::u32x8` | ❌ | | `U64x8` | ✅ `__m512i` | 🟠 `[u64; 8]` polyfill | ❌ | 🔵 | ✅ | +| `U64x4` | ❌ | ❌ | ❌ | 🔵 `core::simd::u64x4` | ❌ | | `I8x32` | ✅ `__m256i` | ✅ `__m256i` (in `simd_avx512`) | ❌ | 🔵 | ✅ | | `I8x64` | ✅ `__m512i` | 🟠 `[i8; 64]` polyfill | ❌ | 🔵 | ✅ | | `I16x16` | ✅ `__m256i` | ✅ `__m256i` (in `simd_avx512`) | ❌ | 🔵 | ✅ | | `I16x32` | ✅ `__m512i` | 🟠 `[i16; 32]` polyfill | ❌ | 🔵 | ✅ | | `I32x16` | ✅ `__m512i` | 🟠 `[i32; 16]` polyfill | ❌ | 🔵 | ✅ | +| `I32x8` | ❌ | ❌ | ❌ | ❌ | ❌ | | `I64x8` | ✅ `__m512i` | 🟠 `[i64; 8]` polyfill | ❌ | 🔵 | ✅ | +| `I64x4` | ❌ | ❌ | ❌ | ❌ | ❌ | | `BF16x8` | ✅ `__m128bh` | ❌ | ❌ | 🔵 | ✅ | | `BF16x16` | ✅ `__m256bh` | ❌ | ❌ | 🔵 | ✅ | -| `F16x16` | ❌ | 🟡 `F16Scaler` (scalar) | ❌ | 🔵 | ✅ | +| `F16x16` | ❌ | 🟠 `[u16; 16]` polyfill (`simd_half`) | 🟠 `[u16; 16]` polyfill (`simd_half`) | 🔵 | ✅ | | `F32Mask16` | ✅ `__mmask16` | ✅ `u16` bitmask | ✅ `u16` bitmask | 🔵 | ✅ | +| `F32Mask8` | ❌ exposed (avail via `__mmask8`) | ❌ exposed (avail in `simd_scalar` as `F32Mask8Scalar`) | ❌ exposed | 🔵 | ✅ via `F32Mask8Scalar` | | `F64Mask8` | ✅ `__mmask8` | ✅ `u8` bitmask | ✅ `u8` bitmask | 🔵 | ✅ | - -**Aarch64-native narrower types** (only useful directly when the -consumer wants 128-bit shapes): `I8x16`, `I16x8`, `U8x16`, `U16x8`, -`U32x4`, `U64x2`, `I32x4`, `I64x2`. These are not in the cross-arch -parity surface — consumers requesting 256-bit / 512-bit shapes go -through the composed wrappers. +| `F64Mask4` | ❌ exposed (avail via `__mmask8`) | ❌ exposed (avail in `simd_scalar` as `F64Mask4Scalar`) | ❌ exposed | 🔵 | ✅ via `F64Mask4Scalar` | + +### Sub-byte lanes (not a SIMD wrapper anywhere) + +**`I4` / `U4`** — 4-bit (nibble) lanes used by INT4 quantized inference +(GGUF Q4_0 / Q4_K, GPTQ, AWQ). No first-class wrapper exists or is +planned. Consumers pack 2× nibbles per byte and operate through +`U8x64` with `shr_epi16` + `& 0x0F` masks; the same trick gives them +the 128- and 64-byte shapes via the existing AVX-512 / AVX2 / NEON +paths. If a first-class `I4x128` were ever wanted, AVX-512 VBMI2's +`VPCOMPRESSB` + `VPEXPANDB` and AVX-512 IFMA's `VPMADD52` give the +hardware story; on aarch64 there's no native nibble support and the +shr+mask trick stays. Tracked as TD-SIMD-11 if a consumer files for it. + +### Aarch64-native narrower types + +Only useful directly when the consumer wants 128-bit shapes: +`I8x16`, `I16x8`, `U8x16`, `U16x8`, `U32x4`, `U64x2`, `I32x4`, `I64x2`. +These are not in the cross-arch parity surface — consumers requesting +256-bit / 512-bit shapes go through the composed wrappers. + +### Gaps surfaced 2026-05-20 + +- **`F32x8` / `F64x4` are universal on x86**, even on the v3 / AVX2 path + — they share the `__m256` / `__m256d` declarations exposed by + `simd_avx512.rs` (AVX, not AVX-512; works on every host with AVX + support, i.e. Sandy Bridge+). The previous matrix marked them `❌` + in the v3 column — corrected above. +- **`U32x8` / `U64x4`** exist only in `simd_nightly` (via `core::simd`). + No native or polyfill wrapper on x86 or aarch64. Add to `simd_avx512` + + `simd_scalar` if a consumer needs them at 256-bit width. +- **`I32x8` / `I64x4` / `U16x16`** missing across every backend (incl. + nightly). Theoretical 256-bit shapes that no consumer has reached for + yet; add to backlog if needed. +- **`F32Mask8` / `F64Mask4`** are declared in `simd_scalar` as + `F32Mask8Scalar` / `F64Mask4Scalar` (the rename came from a duplicate- + decl conflict on i686 — see `src/simd_scalar.rs:340-345`). Not + surfaced through `crate::simd::*`. If consumers want these mask + widths, expose them and unify the name (drop the `Scalar` suffix on + AVX-512 where `__mmask8` natively maps to F64Mask8 already; the + 256-bit f64 lane width needs a 4-bit mask which `__mmask8` can hold + but isn't yet typed as `F64Mask4`). ### Read of the matrix @@ -199,7 +241,7 @@ Ranked by P0 (blocks current CI / consumers) → P3 (nice-to-have). | **TD-SIMD-5** | **P1** | Scalar fallback inline in `simd.rs` (`pub(crate) mod scalar`) makes symmetry hard — every other backend is its own file. | inspection | Promote to `src/simd_scalar.rs`; `simd.rs` becomes pure dispatch. ~mechanical refactor. | | **TD-SIMD-6** | **P2** | No `runtime-dispatch` feature / `simd_runtime` module exists yet. Release-binary distribution to heterogeneous silicon requires recompile per target today. | `grep -r "LazyLock"` only matches reporting code in `simd.rs:52-55` | New module wiring per-op trampolines from the compiled-in backends. ~300 LoC + one new cargo feature. | | **TD-SIMD-7** | **P2** | Compile-time arms in `simd.rs:153-194` are duplicated four times (one per type group: F32x16, F64x8, U8x32, BF16x16). Adding a new lane requires copy-pasting four `#[cfg(...)]` arms. | inspection | Single source-of-truth macro emitting the arms. ~one macro_rules!, 50 LoC. | -| **TD-SIMD-8** | **P2** | `F16Scaler` in `simd_avx2.rs:2566` is a scalar implementation masquerading as a SIMD type. Consumers using `F16x16` on v3 get scalar perf without warning. | grep `F16Scaler` | Either gate `F16x16` behind `target_feature = "f16c"` or rename / document the scalar nature. ~20 LoC + docs. | +| **TD-SIMD-8** | **P2** | `F16x16` in `src/simd_half.rs:123` is a scalar `[u16; 16]` polyfill — every arithmetic op upcasts to f32, computes, downcasts. Consumers using `crate::simd::F16x16` get scalar perf even on AVX-512 hardware with `vcvtph2ps` / `vcvtps2ph`. (`F16Scaler` in `simd_avx2.rs:2566` is unrelated — it's a *scaling context* for range-normalizing values before f16 encoding, not the F16x16 SIMD type.) | inspection of `src/simd_half.rs:115-150` | (a) Replace the `[u16; 16]` storage with `__m256i` + `_mm256_cvtph_ps` / `_mm256_cvtps_ph` under `target_feature = "f16c"` (Sapphire Rapids+, all Skylake AVX-512). (b) Add an `F16x16Scalar` alias and route consumers explicitly. (c) Add a doc-warning at the type level pointing at the architecture doc. ~80 LoC. | | **TD-SIMD-9** | **P3** | No CI matrix entry for the `nightly-simd` polyfill path. | `.github/workflows/ci.yaml` | Add a `nightly-simd-polyfill` job that builds with `--features nightly-simd` on nightly rustc. ~20 LoC YAML. | | **TD-SIMD-10** | **P3** | No CI matrix entry for `.cargo/config-avx512.toml`. AVX-512 deployment path silently bit-rots between PRs. | `.github/workflows/ci.yaml` | Add an `avx-512-explicit` job using a runner with AVX-512 silicon. ~20 LoC YAML; runner availability TBD. | diff --git a/src/simd_avx2.rs b/src/simd_avx2.rs index 5e164eab..2a3f9ed3 100644 --- a/src/simd_avx2.rs +++ b/src/simd_avx2.rs @@ -2562,6 +2562,17 @@ pub fn f16_kahan_dot(a: &[u16], b: &[u16]) -> f32 { /// /// Analyzes the input range, computes scale that maps |max| → 1.0, /// then uses that scale for all encode/decode operations. +/// +/// # NOT a SIMD type +/// +/// This is a *scaling utility* — it normalizes value ranges before +/// f32 → f16 conversion so the dynamic range maps cleanly into f16's +/// `[-65504, 65504]` window. The SIMD f16 wrapper is `simd_half::F16x16` +/// (also a scalar polyfill on stable — see TD-SIMD-8 in +/// `.claude/knowledge/simd-dispatch-architecture.md`). Earlier versions +/// of the architecture doc's parity matrix mistakenly listed +/// `F16Scaler` in the `F16x16` row's AVX2 column; the two are +/// unrelated. #[derive(Debug, Clone, Copy)] pub struct F16Scaler { /// Multiply by this before f32→f16 (shifts into sweet spot) diff --git a/src/simd_half.rs b/src/simd_half.rs index 327f0943..223b50a1 100644 --- a/src/simd_half.rs +++ b/src/simd_half.rs @@ -118,7 +118,28 @@ impl BF16x16 { /// 16 × F16 (IEEE 754 binary16) packed into a scalar array. /// -/// All arithmetic operates via f32 upcast → op → F16 downcast (round-to-nearest-even). +/// # Scalar-perf disclosure (TD-SIMD-8) +/// +/// **This is a scalar polyfill, not a SIMD type.** Storage is plain +/// `[u16; 16]` (no `__m256i` / `__m256bh` / `float16x8_t`). Every +/// arithmetic op upcasts to f32, computes lane-by-lane, downcasts back +/// to f16 with round-to-nearest-even — same path on every backend +/// (AVX-512, AVX2, NEON, scalar). Consumers in hot loops should NOT +/// reach for `crate::simd::F16x16` expecting SIMD throughput. +/// +/// The hardware-native paths exist on x86 via `_mm256_cvtph_ps` / +/// `_mm256_cvtps_ph` (F16C; Ivy Bridge+) and on aarch64 via +/// `vfmaq_f16` (ARMv8.2-A `+fp16`; Pi 5, Apple, modern Snapdragons). +/// Wiring those into `F16x16` is tracked as TD-SIMD-8 in +/// `.claude/knowledge/simd-dispatch-architecture.md`. Until then, hot +/// loops on f16 should use `core::simd::f16x16` under the `nightly-simd` +/// feature (real `core::simd::*` codegen) or stay in f32 and convert +/// at storage boundaries. +/// +/// Not to be confused with `simd_avx2::F16Scaler` — that's a *scaling +/// context* for range-normalizing values before f16 encoding (so the +/// dynamic range maps to f16's `[-65504, 65504]` window without +/// clipping), not a SIMD lane type. #[derive(Clone, Copy, Debug)] pub struct F16x16([u16; 16]);