Skip to content

Add SQ8↔FP16 x86 SIMD distance kernels [MOD-14954]#970

Open
dor-forer wants to merge 19 commits into
mainfrom
dor-forer-sq8-fp16-x86-kernels-mod-14954
Open

Add SQ8↔FP16 x86 SIMD distance kernels [MOD-14954]#970
dor-forer wants to merge 19 commits into
mainfrom
dor-forer-sq8-fp16-x86-kernels-mod-14954

Conversation

@dor-forer
Copy link
Copy Markdown
Collaborator

@dor-forer dor-forer commented May 26, 2026

Summary

  • Implement AVX-512 (F+BW+VL+VNNI) / AVX2+FMA / AVX2 / SSE4 kernels for asymmetric SQ8↔FP16 distance (IP / Cosine / L2²) on Intel x86.
  • Wire the new kernels into the dispatcher (IP_space.cpp / L2_space.cpp) with F16C gating on the AVX2/SSE4 tiers; AVX-512 path uses _mm512_cvtph_ps which is part of AVX512F (no F16C requirement).
  • Append -mf16c conditionally to AVX2_FMA / AVX2 / SSE4 dispatcher source files in CMake (purely additive — no SQ8_FP32 / SQ8_SQ8 / INT8 / UINT8 codegen change since those sources contain no F16C intrinsics).
  • Extend SQ8_FP16_SpacesOptimizationTest to walk all four ISA tiers against the scalar baseline across dim ∈ [16, 32]; update existing GetDistFunc_*_SQ8_FP16 assertions accordingly.
  • Register per-ISA microbenchmarks in bm_spaces_sq8_fp16.cpp, mirroring the SQ8_FP32 layout.

Spec: docs/superpowers/specs/2026-05-26-sq8-fp16-x86-kernels-design.md
Plan: docs/superpowers/plans/2026-05-26-sq8-fp16-x86-kernels.md

Test plan

  • `make clean ALL=1 && make build && make unit_test` — 2271 tests pass (release).
  • `make asan` — 2271 tests pass under AddressSanitizer.
  • Benchmark binary runs on hosts that lack AVX-512 (AVX-512 tier gracefully reports "AVX512F_BW_VL_VNNI not available"; AVX2+FMA / AVX2 / SSE4 / scalar tiers all execute and report timings).

🤖 Generated with Claude Code


Note

Medium Risk
Changes hot-path vector distance math and runtime ISA selection; incorrect dispatch or SIMD numerics could affect search quality, though tier tests assert ~0.01f agreement with scalar and builds gate intrinsics by CPU flags.

Overview
Adds x86 SIMD for asymmetric SQ8 (index) ↔ FP16 (query) distances: inner product, cosine, and L2², with scalar fallback unchanged for dim < 16 or missing ISA features.

Build: F16C-dependent code moves into separate objects (AVX2_F16C, AVX2_FMA_F16C, SSE4_F16C) compiled with -mf16c (and -mavx for SSE4) only when the toolchain reports F16C+FMA+AVX; base AVX2 / AVX2_FMA / SSE4 TUs stay F16C-free. AVX-512 SQ8↔FP16 kernels live in the existing AVX512F TU (uses _mm512_cvtph_ps, no separate F16C flag).

Runtime: IP_SQ8_FP16_GetDistFunc, Cosine_SQ8_FP16_GetDistFunc, and L2_SQ8_FP16_GetDistFunc select AVX-512F → AVX2+FMA+F16C → AVX2+F16C → SSE4.1+F16C+AVX, set SQ8 alignment hints when dim is a multiple of the chunk size, and otherwise return the scalar implementations.

Validation: Parameterized tests compare each tier to the scalar baseline; benchmarks register the same ISA ladder beside the naive kernels.

Reviewed by Cursor Bugbot for commit 3565985. Bugbot is set up for automated code reviews on this repo. Configure here.

Captures the architecture, file-level plan, CMake F16C gating, and risk
register for adding AVX-512 / AVX2+FMA / AVX2 / SSE4 kernels for the
asymmetric SQ8 (storage) ↔ FP16 (query) distance functions, wiring them
into the existing dispatcher tables and SQ8_FP16 unit/benchmark
scaffolding from MOD-15141.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jit-ci
Copy link
Copy Markdown

jit-ci Bot commented May 26, 2026

🛡️ Jit Security Scan Results

CRITICAL HIGH MEDIUM

✅ No security findings were detected in this PR


Security scan by Jit

dor-forer and others added 8 commits May 26, 2026 13:39
Enables _mm{,256}_cvtph_ps in the AVX2+FMA, AVX2, and SSE4 dispatcher
translation units so the upcoming SQ8↔FP16 kernels can widen FP16 lanes
to FP32. The flag is appended only when CXX_F16C is detected; existing
SQ8_FP32 / SQ8_SQ8 / INT8 / UINT8 sources contain no F16C intrinsics so
emitted code for those kernels is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Parameterised gtest fixture mirroring SQ8_FP32_SpacesOptimizationTest;
currently asserts only the scalar fallback path. Per-tier SIMD
assertion blocks (AVX-512, AVX2+FMA, AVX2, SSE4) are added alongside
the kernel implementations in subsequent commits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implements asymmetric SQ8 (storage) ↔ FP16 (query) Inner Product,
Cosine, and L2² kernels for the AVX-512 F+BW+VL+VNNI tier. Each chunk
widens 16 SQ8 lanes via cvtepu8_epi32 + cvtepi32_ps and 16 FP16 lanes
via _mm512_cvtph_ps, then fmadds into a 16-lane FP32 accumulator. SQ8
storage and FP16 query metadata reads use load_unaligned to tolerate
odd dimensions. Dispatcher branches in IP_space.cpp / L2_space.cpp
select the new Choose_SQ8_FP16_*_implementation_AVX512F_BW_VL_VNNI
when features.avx512f && features.avx512bw && features.avx512vl &&
features.avx512vnni; otherwise behaviour is unchanged from MOD-15141.
A parameterised gtest fixture exercises every residual class in
[16, 32] against the scalar baseline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
8-wide AVX2+FMA kernels widen 8 SQ8 lanes via cvtepu8_epi32 +
cvtepi32_ps and 8 FP16 lanes via _mm256_cvtph_ps, then fmadd into a
256-bit FP32 accumulator. Residual (< 8) lanes load the full 16-byte
FP16 block, convert, then blend zero across unused lanes — mirroring
the existing F16C FP16 kernel pattern. Dispatcher branch in
{IP,Cosine,L2}_SQ8_FP16_GetDistFunc selects the new
Choose_SQ8_FP16_*_implementation_AVX2_FMA when features.avx2 &&
features.fma3 && features.f16c.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors the AVX2+FMA kernels but uses _mm256_mul_ps + _mm256_add_ps
instead of _mm256_fmadd_ps so it can run on Haswell-era AVX2 hardware
without FMA support (uncommon but matches the existing SQ8_FP32
tiering). Dispatcher gate requires features.avx2 && features.f16c
and runs between the AVX2+FMA and SSE4 tiers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
4-wide SSE4 kernels widen 4 SQ8 lanes via cvtepu8_epi32 + cvtepi32_ps
and 4 FP16 lanes via _mm_cvtph_ps (F16C), then mul+add into a 128-bit
FP32 accumulator (SSE4 has no FMA). Residual % 4 lanes are materialised
via _mm_set_ps + the scalar FP16_to_FP32 helper, mirroring the existing
SSE4 SQ8_FP32 residual pattern. Dispatcher gate requires
features.sse4_1 && features.f16c && features.avx since F16C is
VEX-encoded — matches the existing F16C/FP16 dispatcher gate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The SQ8_FP16 GetDistFunc dispatcher now returns AVX-512 / AVX2+FMA /
AVX2 / SSE4 SIMD kernels when the corresponding feature flags are set
(only scalar previously). Updates the GetDistFunc_*_SQ8_FP16 asserts
to compute the expected function for the host's highest supported tier.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds AVX-512 / AVX2+FMA / AVX2 / SSE4 benchmark registrations to
bm_spaces_sq8_fp16.cpp, mirroring the SQ8_FP32 layout. Gates each tier
on the corresponding OPT_* defines plus the runtime feature checks
that mirror the dispatcher in IP_space.cpp / L2_space.cpp.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dor-forer dor-forer force-pushed the dor-forer-sq8-fp16-x86-kernels-mod-14954 branch from a43eac4 to e21cb3b Compare May 26, 2026 10:42
@codecov
Copy link
Copy Markdown

codecov Bot commented May 26, 2026

Codecov Report

❌ Patch coverage is 99.04762% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 97.05%. Comparing base (bbe9dfd) to head (3565985).

Files with missing lines Patch % Lines
src/VecSim/spaces/IP_space.cpp 95.23% 2 Missing ⚠️
src/VecSim/spaces/L2_space.cpp 95.23% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #970      +/-   ##
==========================================
+ Coverage   96.99%   97.05%   +0.05%     
==========================================
  Files         130      141      +11     
  Lines        7793     8105     +312     
==========================================
+ Hits         7559     7866     +307     
- Misses        234      239       +5     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Comment thread src/VecSim/spaces/CMakeLists.txt Outdated
dor-forer and others added 9 commits May 26, 2026 14:07
- CMake: gate `-mf16c` on CXX_F16C AND CXX_FMA AND CXX_AVX (matches OPT_F16C
  macro) and append `-mavx` to the SSE4 dispatcher when adding -mf16c, since
  F16C is VEX-encoded and requires AVX state. Mirrors the existing F16C.cpp
  recipe and prevents miscompiles on toolchains with F16C but without AVX.
- IP_SSE4_SQ8_FP16.h: replace `*reinterpret_cast<const int32_t *>(pVect1)`
  with `load_unaligned<int32_t>(pVect1)` to remove strict-aliasing UB on
  the uint8_t SQ8 lane load.
- IP_AVX2{,_FMA}_SQ8_FP16.h: improve the residual-mask comment to spell out
  the asymmetric-mask reasoning (SQ8 unmasked is safe because the FP16
  query blend forces those FP32 query lanes to 0 → garbage·0=0).
- IP_AVX{512,2,2_FMA,SSE4}_SQ8_FP16.h: add the `IP = min·y_sum + delta·Σ(q·y)`
  algebraic-identity comment header that AVX-512 already carried, plus a
  precondition note that callers must enforce dim >= 16 (matches the
  established SQ8_FP32 convention; no runtime assert because sibling
  SQ8_FP32 SIMD kernels also rely on the dispatcher gate).
- test_spaces.cpp: route the SQ8_FP16 edge-case tests (ZeroQuery,
  ConstantStorage, MixedSignQuery) through {IP,Cosine,L2}_SQ8_FP16_GetDistFunc
  so the runtime-selected SIMD tier is actually exercised on those inputs,
  not just the scalar reference.
- test_spaces.cpp: add SQ8_FP16_SIMD_HighDim suite with dims {64, 128, 256,
  512, 1024} so multi-iteration do-while loop bugs would fire (the existing
  [16, 32] range covers at most two AVX-512 chunk iterations).
- test_spaces.cpp: add SQ8_FP16_SIMD_TierCoverage.ReportTiersExercised — a
  single test that emits per-tier coverage to stderr and GTEST_SKIPs when
  no SIMD tier is available, so CI runners without AVX-512 do not silently
  report zero tier-1 coverage.
- test_spaces.cpp: scalar-fallback `alignment` checks now seed the value
  with 0xFF and assert it remains 0xFF, verifying the dispatcher contract
  ("scalar leaves caller's value untouched") instead of just measuring
  that the variable's pre-zeroed init survived.
- test_spaces.cpp: drop the stale MOD-15152/MOD-15153 wiring-TODO comment
  on SQ8_FP16_NoOptimizationSpacesTest now that the SIMD tiers are wired.
- bm_spaces_sq8_fp16.cpp: drop the matching stale comment.

Out of scope (separate ticket): two-accumulator FMA refactor (also affects
SQ8_FP32) and the SSE4 residual `_mm_cvtph_ps` perf opportunity.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Break the FMA / mul+add dependency chain in all four SQ8↔FP16 IP kernels
by widening the inner loop to use multiple independent accumulators.
L2 kernels inherit the change through their `…InnerProductImp_…` call.

- IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h: 1 → 4 accumulators, unroll-4 main loop
  (64 lanes/iter) with a 16-lane tail for the 0..3 remaining chunks.
- IP_AVX2_FMA_SQ8_FP16.h, IP_AVX2_SQ8_FP16.h: 1 → 2 accumulators; the
  existing 2-step unrolled body now routes each step to an independent
  accumulator. The `residual >= 8` half-chunk feeds the second accumulator
  so the prologue also breaks the dependency chain.
- IP_SSE4_SQ8_FP16.h: 1 → 2 accumulators; do-while unrolled 1 → 2 steps
  per iteration (4 → 8 lanes/iter). Residual-ladder steps alternate
  between sum_a and sum_b for prologue ILP.

Correctness invariant: residual block consumes exactly `residual` lanes
(0..15) → remaining tail is always a multiple of 16, so the unrolled
loops (multiples of 8 / 16 / 64) terminate exactly. Verified by 131
SQ8_FP16 unit tests + 115 under ASan.
The SQ8↔FP16 AVX-512 kernel does not actually issue any VNNI instruction
— the inner loop is FP32 FMA (`_mm512_fmadd_ps`) over lanes widened from
SQ8 (`_mm512_cvtepu8_epi32` + `_mm512_cvtepi32_ps`) and FP16
(`_mm512_cvtph_ps`). Real VNNI use would require an integer-encoded
query, which is a different kernel entirely.

The file/function names are renamed to match what the kernel actually
uses (AVX-512F). The dispatcher .cpp/.h files stay named after the
runtime tier (AVX512F_BW_VL_VNNI) since the SQ8↔FP16 kernel still
registers under that tier alongside the genuinely VNNI-using SQ8↔SQ8 /
INT8 / UINT8 kernels — the gate is a CPU-feature gate, not an ISA claim.

The same misnomer exists for SQ8↔FP32; tracked separately so the rename
there can ship as its own commit.

Also: fix a strict-aliasing-class UB introduced by the AVX-512 unroll-4
loop. `while (pVec1 + 64 <= pEnd1)` forms a pointer past one-past-end of
the SQ8 storage object when fewer than 64 lane bytes remain, which is UB
in C++ regardless of dereference. Switched to pointer subtraction
(`static_cast<size_t>(pEnd1 - pVec1) >= 64`).

Renames:
- IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h -> IP_AVX512F_SQ8_FP16.h
- L2_AVX512F_BW_VL_VNNI_SQ8_FP16.h -> L2_AVX512F_SQ8_FP16.h
- SQ8_FP16_{InnerProduct,Cosine,L2Sqr}SIMD16_AVX512F_BW_VL_VNNI -> _AVX512F
- Choose_SQ8_FP16_{IP,Cosine,L2}_implementation_AVX512F_BW_VL_VNNI -> _AVX512F

Verified: 131 SQ8_FP16 unit tests + 115 under ASan.
Design doc was added in ad941b8 for planning; not appropriate as
a long-lived in-repo artifact. Keep externally (Confluence / scratch)
rather than ship with the kernel commit.
Two trims, both restoring pre-existing patterns elsewhere in the file:

1. `GetDistFuncSQ8FP16Asymmetric` had grown a runtime SIMD-tier walk that
   duplicated coverage already provided by `SQ8_FP16_SpacesOptimizationTest`.
   Reduced to the bare dispatcher-equality check used by the FP32 / SQ8↔SQ8
   sister tests at lines 540-548 and 551-559.

2. The `SQ8_FP16_EdgeCases` tests (`ZeroQueryTest`, `ConstantStorageTest`,
   `MixedSignQueryTest`) were routed through
   `{IP,Cosine,L2}_SQ8_FP16_GetDistFunc(dim, nullptr)` to force runtime SIMD
   dispatch on adversarial inputs. Reverted to direct scalar calls
   (`SQ8_FP16_InnerProduct`, etc.) — the original pre-fdc5c1cd shape.

   Coverage rationale: the SIMD kernels are branchless on input values
   (verified by grep — no value-dependent `if` in any tier). Every code
   path is therefore exercised by `SQ8_FP16_SpacesOptimizationTest`'s
   random inputs at multiple dims. The edge-case tests verify the
   *algebraic identity* (IP of zero query = 1.0, constant storage matches
   dequant baseline, mixed-sign handling) — scalar correctness on these
   inputs is what was actually being checked, and the SIMD path matches
   scalar via the SpacesOptimizationTest tier walk.

Net: 77 lines removed from the test file, matches sister conventions,
no coverage gap.
The SQ8↔FP16 kernels in the SSE4, AVX2, and AVX2+FMA tiers depend on
F16C (`_mm_cvtph_ps` / `_mm256_cvtph_ps`), while every other kernel in
those dispatcher TUs is F16C-clean. The previous arrangement mixed
both under `#ifdef OPT_F16C` blocks inside the base dispatcher .cpp/.h
files.

Split each tier's F16C-dependent kernels off into a sibling TU:

  functions/SSE4.cpp           → SSE4 + SQ8_FP32 (no F16C)
  functions/SSE4_F16C.cpp      → SQ8_FP16 only (requires -mavx -mf16c)

  functions/AVX2.cpp           → AVX2 + BF16 + SQ8_FP32 (no F16C)
  functions/AVX2_F16C.cpp      → SQ8_FP16 only (requires -mf16c)

  functions/AVX2_FMA.cpp       → SQ8_FP32 (no F16C)
  functions/AVX2_FMA_F16C.cpp  → SQ8_FP16 only (requires -mf16c)

The AVX-512 tier is unaffected — its SQ8_FP16 kernel uses
`_mm512_cvtph_ps`, which is part of AVX-512F and not F16C.

CMake now compiles each sibling TU conditionally on
`_has_full_f16c` and applies the F16C flags only there. Base TUs no
longer carry `-mf16c`, since they no longer reference F16C intrinsics.

Result:
- No `#ifdef OPT_F16C` directives in `functions/*.cpp` or `functions/*.h`.
- Compile-time isolation: an F16C intrinsic accidentally added outside
  a `_F16C` sibling will fail to build, not silently miscompile.
- Caller sites (`IP_space.cpp`, `L2_space.cpp`, `test_spaces.cpp`,
  `bm_spaces.h`) still gate the *calls* with `#ifdef OPT_F16C`; the new
  sibling .h includes are unconditional, since declarations alone don't
  link-error and the calls remain guarded.

Verified: 131 SQ8_FP16 unit tests + 115 ASan + 1166 full test_spaces
suite (covers other ISA tiers SQ8_FP32 / BF16 / INT8 / UINT8 to confirm
no regression from the dispatcher restructure).
…[MOD-14954]

Two related cleanups in the SQ8↔FP16 dispatch path:

1. The AVX-512 SQ8↔FP16 kernel only uses AVX-512F instructions
   (`_mm512_cvtph_ps`, `_mm512_fmadd_ps`, etc.) but was registered under
   the VNNI tier (`OPT_AVX512_F_BW_VL_VNNI` + check of avx512f/bw/vl/vnni).
   That meant CPUs with AVX-512F but no VNNI (Skylake-X, some Cascade Lake
   variants, etc.) would fall through to AVX2_FMA even though they can
   run the AVX-512 kernel.

   Moved the `Choose_SQ8_FP16_{IP,Cosine,L2}_implementation_AVX512F`
   definitions from `functions/AVX512F_BW_VL_VNNI.cpp` to
   `functions/AVX512F.cpp`, with matching header reshuffle. Dispatch
   sites now gate on `OPT_AVX512F` + `features.avx512f`.

2. F16C is a transversal requirement across the non-AVX-512 SQ8↔FP16
   tiers (SSE4, AVX2, AVX2+FMA) — every one of them widens FP16 query
   lanes via `vcvtph2ps`. Per-tier nested `#ifdef OPT_F16C` was hoisted
   into a single outer block around the three ISA branches in
   `IP_SQ8_FP16_GetDistFunc`, `Cosine_SQ8_FP16_GetDistFunc`, and
   `L2_SQ8_FP16_GetDistFunc`.

Verified: 131 SQ8_FP16 release + 115 ASan + 1166 full test_spaces suite.
Remove extraneous blank lines in SSE4 and AVX2_FMA source files, fix indentation in AVX512F SQ8_FP16 function signatures, and reformat benchmark macro invocation to fit line length conventions.
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 839fe3c. Configure here.

any_simd = true;
} else {
std::cerr << "[SQ8_FP16] AVX-512 F+BW+VL+VNNI tier NOT exercised on this host\n";
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tier coverage test checks wrong AVX-512 preprocessor guard

Low Severity

The ReportTiersExercised test gates the AVX-512 check on OPT_AVX512_F_BW_VL_VNNI, but the SQ8_FP16 kernels are dispatched under OPT_AVX512F (as seen in IP_space.cpp and L2_space.cpp and in the SQ8_FP16_SpacesOptimizationTest at line 3097). On CPUs with AVX-512F but without BW+VL+VNNI, the SIMD optimization tests correctly exercise the AVX-512F tier, but this coverage reporter incorrectly omits it — potentially causing any_simd to remain false and the test to spuriously skip.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 839fe3c. Configure here.

The comments referencing SQ8-to-FP16 dispatch location are no longer accurate after the recent refactoring that moved the dispatch logic. Clean up these stale comments from the AVX512F_BW_VL_VNNI files.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant