perf: AMX BF16 dispatch in EuclideanDistances::computeABt #3564
napetrov wants to merge 20 commits into uxlfoundation:main from
Conversation
/intelci: run
// compute (A x B')
#ifndef DAAL_REF
// AMX-BF16 capability check (MKL builds only; cross-platform)
static bool knn_has_amx_bf16()
It would be better to extend the function we already have for CPU features detection, if needed:
https://github.com/uxlfoundation/oneDAL/blob/main/cpp/daal/src/services/compiler/generic/env_detect_features.cpp#L303
Add AMX BF16 fast path for kNN brute-force and DBSCAN distance computation.

When AMX BF16 hardware is available and all GEMM dimensions >= 64, converts
float32 operands to BF16 and uses cblas_gemm_bf16bf16f32 instead of sgemm.

Benchmarks on Xeon 6975P-C (sklearnex, float32, cpu):
- KNeighborsClassifier: 2.7-3.4x speedup
- KNeighborsRegressor: 2.9-3.3x speedup
- DBSCAN: 3.5-4.0x speedup

Accuracy impact vs float32 baseline:
- Classifier: delta < 1.2% (within dataset noise)
- Regressor R2: delta < 0.001

Runtime detection via CPUID (AMX_BF16 bit) + XGETBV XCR0[18:17]. Falls back
to sgemm if AMX unavailable or dims < 64.

Signed-off-by: Nikolay Petrov <nikolay.a.petrov@intel.com>
- Add daal::CpuFeature::amx_bf16 to the CpuFeature enum in cpu_type.h
- Add check_amx_bf16_features() to env_detect_features.cpp using the existing
  DAAL CPUID/XGETBV infrastructure (CPUID(7,0).EDX[22] + XCR0[17:18]); remove
  the standalone knn_has_amx_bf16() from service_kernel_math.h
- Replace std::is_same dispatch with template specialization: add a
  computeABtImpl primary template (sgemm/dgemm fallback) and a float
  specialization (AMX-BF16 path, guarded by DAAL_REF)
- Replace the union-based float->BF16 conversion with memcpy to avoid
  strict-aliasing UB (C++17 [basic.lval])

Signed-off-by: Nikolay Petrov <nikolay.a.petrov@intel.com>
Introduces a precision-hint API modelled after torch.set_float32_matmul_precision
/ jax_default_matmul_precision so users can opt in to reduced-precision
arithmetic where the library determines it is safe.
enum daal::Float32MatmulPrecision { highest (default), high }
daal_get_float32_matmul_precision() -- query effective precision
daal_set_float32_matmul_precision(p) -- set at runtime
Env var: ONEDAL_FLOAT32_MATMUL_PRECISION=HIGH
* Hardware availability (AMX-BF16) is folded into the effective precision
at initialisation time in env_detect_precision.cpp. Call-sites never
query hardware themselves; they only check daal_get_float32_matmul_precision().
* BF16GemmDispatcher<FPType, cpu> encapsulates all dispatch logic:
- primary template: standard sgemm/dgemm, zero overhead for non-float types
- float specialization: AMX-BF16 path when precision=='high' and all dims >=64
- matrix-size threshold (kBF16MinDim=64) owned by the dispatcher, not callers
* g_use_bf16_gemm is a TU-level static bool evaluated once at startup,
so the hot path is a single branch on a constant-like value.
* computeABt() in EuclideanDistances is now a one-liner delegating to
BF16GemmDispatcher; no template ambiguity, no DAAL_REF ifdefs in caller.
* Default is Float32MatmulPrecision::highest (no precision loss without
explicit opt-in); consistent with PyTorch/JAX defaults.
Signed-off-by: Nikolay Petrov <nikolay.a.petrov@intel.com>
Previously compute_effective_precision() was called in both the setter
and the getter, folding HW availability into the stored value.
New design:
g_hw_amx_bf16 -- bool, computed ONCE at process init via
daal_serv_cpu_feature_detect(); never recomputed.
daal_has_amx_bf16() -- returns g_hw_amx_bf16 (constant for process lifetime)
g_float32_matmul_precision -- stores the user's requested level as-is;
setter writes directly, no HW re-check.
daal_get_float32_matmul_precision() -- plain atomic load, zero HW queries.
BF16GemmDispatcher gates on both:
g_use_bf16_gemm = daal_has_amx_bf16() && (get_precision() == high)
This is evaluated once at TU init; the hot path sees a single bool.
Setter behaviour: daal_set_float32_matmul_precision(high) on a machine
without AMX-BF16 stores 'high' but g_use_bf16_gemm remains false, so
BF16GemmDispatcher silently falls through to sgemm. No HW check in setter.
Signed-off-by: Nikolay Petrov <nikolay.a.petrov@intel.com>
Add comprehensive doc comment for the Float32MatmulPrecision enum:
- Explains 'highest' vs 'high' semantics with concrete examples
- Documents current eligible operations for 'high' (Euclidean GEMM, AMX-BF16)
- Notes fallback behaviour on non-AMX hardware
- Documents env var + programmatic API
- Reserves 'medium' for future 2xBF16 pass

Signed-off-by: Nikolay Petrov <nikolay.a.petrov@intel.com>
The API is not yet stable; keep it internal until the design is finalised and reviewed.

Changes:
- Remove Float32MatmulPrecision from public cpu_type.h
- Define daal::internal::Float32MatmulPrecision in service_defines.h (internal header, not installed) with a clear 'INTERNAL, experimental' notice
- Update env_detect_precision.cpp and service_kernel_math.h to use daal::internal::Float32MatmulPrecision
- daal_has_amx_bf16 / daal_get_float32_matmul_precision / daal_set_float32_matmul_precision remain DAAL_EXPORT so internal oneDAL code and sklearnex can reach them, but they are not part of the public API surface

Signed-off-by: Nikolay Petrov <nikolay.a.petrov@intel.com>
Correct the ordering: highest > high > medium by decreasing precision, matching torch.set_float32_matmul_precision semantics exactly.

Add 'medium' as a reserved enum value (not yet implemented, falls back to 'high') for forward compatibility.

Signed-off-by: Nikolay Petrov <nikolay.a.petrov@intel.com>
Replace PyTorch-style highest/high/medium naming with explicit names:

  strict     -- always float32, IEEE-754, bit-exact (was: highest)
  allow_bf16 -- permit BF16 kernels where oneDAL deems safe (was: high)

Rationale: 'high'/'highest' are ambiguous (performance vs precision). 'strict' and 'allow_bf16' express intent directly with no ambiguity. Extensible: allow_fp16 and allow_tf32 follow the same pattern naturally.

Env var updated: ONEDAL_FLOAT32_MATMUL_PRECISION=ALLOW_BF16

Signed-off-by: Nikolay Petrov <nikolay.a.petrov@intel.com>
…on.cpp

The sed substitution previously broke daal::CpuFeature::amx_bf16 -> daal::amx_bf16. amx_bf16 is a member of enum daal::CpuFeature, not of the daal namespace directly.

Signed-off-by: Nikolay Petrov <nikolay.a.petrov@intel.com>
The enum was defined in both cpu_type.h (with highest/high values) and service_defines.h (with strict/allow_bf16 values). When both headers were included in the same translation unit the compiler emitted:

  error: redefinition of 'Float32MatmulPrecision'

Float32MatmulPrecision is an internal API and belongs exclusively in service_defines.h (internal header, not installed). Remove the definition and its associated doc comment from cpu_type.h entirely.

Signed-off-by: Nikolay Petrov <nikolay.a.petrov@intel.com>
- env_detect_precision.cpp: fix include path from 'services/cpu_type.h' to 'src/services/cpu_type.h' (internal header, not on the public include path)
- Apply clang-format to service_kernel_math.h and env_detect_precision.cpp to pass FormatterChecks CI

Signed-off-by: Nikolay Petrov <nikolay.a.petrov@intel.com>
…on.cpp

CpuFeature is defined in the daal::internal namespace (cpu_type.h), not in the daal namespace directly. Use daal::internal::CpuFeature::amx_bf16.

Signed-off-by: Nikolay Petrov <nikolay.a.petrov@intel.com>
AMX-BF16 is x86-only; CpuFeature::amx_bf16 is only defined when TARGET_X86_64 is set. Add an #if defined(TARGET_X86_64) guard around the hardware detection, falling back to false on ARM/RISCV64.

Signed-off-by: Nikolay Petrov <nikolay.a.petrov@intel.com>
…uard

CpuFeature::amx_bf16 is defined inside #if defined(TARGET_X86_64) in cpu_type.h. Without that macro set (e.g. non-Intel CI runners, make builds without an explicit TARGET define), the enum member does not exist.

Replace the raw static initialiser with a detect_hw_amx_bf16() wrapper that guards the CpuFeature reference with #if defined(TARGET_X86_64) and returns false on non-x86 targets.

Signed-off-by: Nikolay Petrov <nikolay.a.petrov@intel.com>
TARGET_X86_64 is a oneDAL build-system macro not set in all build configurations (e.g. plain make without an explicit TARGET define). Using CpuFeature::amx_bf16 directly fails when TARGET_X86_64 is not defined, since the enum member is gated behind that macro in cpu_type.h.

Two changes:
1. Guard the x86 detection path with the portable compiler-provided __x86_64__ / _M_X64 in addition to TARGET_X86_64.
2. Replace CpuFeature::amx_bf16 with its numeric value (1ULL<<5) to avoid the header-guard dependency entirely. A comment documents the correspondence.

Signed-off-by: Nikolay Petrov <nikolay.a.petrov@intel.com>
/intelci: run
Pull request overview
Adds an internal, opt-in float32 matmul precision hint and uses it to dispatch Euclidean distance GEMM to an AMX-BF16 fast path (BF16 operands, float32 output) when eligible.
Changes:
- Introduces daal::internal::Float32MatmulPrecision and exported getter/setter + AMX-BF16 capability query.
- Extends CPU feature detection with an amx_bf16 capability bit (CPUID + XCR0 OS-tile-state checks).
- Refactors EuclideanDistances::computeABt() to route GEMM via a new BF16GemmDispatcher that can convert FP32->BF16 and call BF16 GEMM.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| cpp/daal/src/services/service_defines.h | Adds internal precision-hint enum and exported getter/setter + capability API. |
| cpp/daal/src/services/cpu_type.h | Adds CpuFeature::amx_bf16 flag bit. |
| cpp/daal/src/services/compiler/generic/env_detect_precision.cpp | Implements env parsing + stored precision hint + AMX-BF16 capability query. |
| cpp/daal/src/services/compiler/generic/env_detect_features.cpp | Detects AMX-BF16 feature (CPUID + XCR0 tile-state enabled). |
| cpp/daal/src/algorithms/service_kernel_math.h | Adds BF16 GEMM dispatcher and rewires Euclidean distance GEMM through it. |
* Allow oneDAL to use BF16 kernels in operations where the library has
* determined the accuracy impact is bounded and acceptable for the algorithm.
* Currently: float32 Euclidean GEMM on AMX-BF16 hardware, dims >= 64.
* Output dtype remains float32; only internal GEMM accumulation uses BF16.
* Falls back to sgemm silently on hardware without AMX-BF16.
The comment claims that only GEMM accumulation uses BF16, but the implementation uses BF16 inputs with float32 accumulation/output (e.g., cblas_gemm_bf16bf16f32 accumulates in FP32). Please update the documentation to reflect that BF16 is used for operand storage/multiply while accumulation/output remain FP32.
#include "src/externals/service_math.h"
#include "src/services/service_profiler.h"

#include <mutex>
#include <mutex> appears to be unused in this header (no mutex symbols referenced). Please remove it to avoid unnecessary compile-time overhead in a widely included header.
// Design principles:
// - Hardware availability is already folded into daal_get_float32_matmul_precision()
//   at initialisation time (see env_detect_precision.cpp). Call-sites only
//   query the effective precision; no hardware checks here.
The dispatcher documentation says hardware availability is “folded into daal_get_float32_matmul_precision()”, but the getter returns only the user hint; hardware capability is checked separately via daal_has_amx_bf16(). Please correct the comment to match the actual gating logic to avoid misleading future changes.
/// Whether AMX-BF16 GEMM is usable in this process.
/// Combines hardware capability (checked once at init) with the user's precision hint.
/// This is the only place both conditions are AND-ed; call-sites check nothing else.
static const bool g_use_bf16_gemm =
    daal_has_amx_bf16() && (daal_get_float32_matmul_precision() == daal::internal::Float32MatmulPrecision::allow_bf16);
g_use_bf16_gemm is a TU-level static const bool computed during static initialization, so calling daal_set_float32_matmul_precision(...) at runtime will not affect dispatch in this TU (programmatic opt-in/out won’t work reliably). Consider making the cached flag an atomic/volatile that is updated by the setter, or check the atomic precision hint at call time (still allowing a fast path when it’s strict).
// col-major: C = B16 * A16^T => cblas NoTrans B, Trans A
cblas_gemm_bf16bf16f32(CblasColMajor, CblasNoTrans, CblasTrans, m, n, k, 1.0f, b16, m, a16, n, 0.0f, out, m);
The BF16 GEMM call uses CblasColMajor with b16 treated as an m x k column-major matrix (lda=m) and a16 as n x k (lda=n), but the source data blocks are row-major and the existing float32 path relies on interpreting them as transposed column-major (lda=k with trans='t'). As written, the BF16 path is likely multiplying incorrectly and will produce wrong results. Please align the transposition/leading-dimension logic with the existing xxgemm call so the BF16 and float32 paths are mathematically equivalent.
Suggested change:

// Match the float32 path:
// xxgemm('t', 'n', ..., b, lda=k, a, lda=k, ..., out, ldc=m)
// The source blocks are row-major, so interpret b16 as a k x m
// column-major matrix and transpose it, and interpret a16 as a
// k x n column-major matrix without transposition.
cblas_gemm_bf16bf16f32(CblasColMajor, CblasTrans, CblasNoTrans, m, n, k, 1.0f, b16, k, a16, k, 0.0f, out, m);
This is wrong because xxgemm uses Fortran memory layout, which is column-major, while cblas_gemm_bf16bf16f32 here is called with C memory layout in mind, which is row-major.
// BF16 = upper 16 bits of IEEE-754 float32; use memcpy to read bits safely.
for (size_t i = 0; i < szA; ++i)
{
    uint32_t bits;
    memcpy(&bits, &a[i], sizeof(bits));
    a16[i] = static_cast<MKL_BF16>(bits >> 16);
}
for (size_t i = 0; i < szB; ++i)
{
    uint32_t bits;
    memcpy(&bits, &b[i], sizeof(bits));
    b16[i] = static_cast<MKL_BF16>(bits >> 16);
}
The BF16 conversion truncates float32 to the upper 16 bits, which is not the standard round-to-nearest-even BF16 conversion and can introduce extra bias/error vs real BF16 hardware behavior. Consider using an MKL-provided FP32→BF16 conversion routine (or implement correct rounding) so the AMX BF16 path matches expected BF16 semantics.
if (g_use_bf16_gemm && m >= kBF16MinDim && n >= kBF16MinDim && k >= kBF16MinDim)
{
    const size_t szA = static_cast<size_t>(n) * static_cast<size_t>(k);
    const size_t szB = static_cast<size_t>(m) * static_cast<size_t>(k);
    MKL_BF16 * a16 = static_cast<MKL_BF16 *>(mkl_malloc(szA * sizeof(MKL_BF16), 64));
    MKL_BF16 * b16 = static_cast<MKL_BF16 *>(mkl_malloc(szB * sizeof(MKL_BF16), 64));
    if (a16 && b16)
    {
        // BF16 = upper 16 bits of IEEE-754 float32; use memcpy to read bits safely.
        for (size_t i = 0; i < szA; ++i)
        {
            uint32_t bits;
            memcpy(&bits, &a[i], sizeof(bits));
            a16[i] = static_cast<MKL_BF16>(bits >> 16);
        }
        for (size_t i = 0; i < szB; ++i)
        {
            uint32_t bits;
            memcpy(&bits, &b[i], sizeof(bits));
            b16[i] = static_cast<MKL_BF16>(bits >> 16);
        }
        // col-major: C = B16 * A16^T => cblas NoTrans B, Trans A
        cblas_gemm_bf16bf16f32(CblasColMajor, CblasNoTrans, CblasTrans, m, n, k, 1.0f, b16, m, a16, n, 0.0f, out, m);
        mkl_free(a16);
        mkl_free(b16);
        return;
    }
    if (a16) mkl_free(a16);
    if (b16) mkl_free(b16);
}
This path allocates and frees BF16 buffers (mkl_malloc/mkl_free) on every GEMM call and reconverts both A and B each time. In computeFull(), B is reused across many A blocks, so repeatedly converting B can dominate runtime and reduce the benefit of AMX. Consider caching/reusing converted buffers (e.g., per-thread scratch or converting B once per outer call) to avoid repeated allocation and conversion overhead.
/// If 'high' is requested but hardware lacks AMX-BF16, BF16GemmDispatcher
/// will fall through to sgemm transparently via its own g_use_bf16_gemm gate.
The setter comment refers to requesting “high”, but the only non-default level in this API is allow_bf16. Please update the comment to match the actual enum values/behavior (and note that dispatch should remain correct when the precision hint is changed at runtime).
Suggested change:

/// If allow_bf16 is requested but hardware lacks AMX-BF16, BF16GemmDispatcher
/// will fall through to sgemm transparently via its own g_use_bf16_gemm gate.
/// Changing this hint at runtime remains correct because dispatch combines the
/// stored level with the cached hardware capability check on each use.
david-cortes-intel left a comment
This would require benchmarks to assess whether it leads to improved runtimes. Less precision could potentially result in requiring more iterations for convergence.
{
    const size_t szA = static_cast<size_t>(n) * static_cast<size_t>(k);
    const size_t szB = static_cast<size_t>(m) * static_cast<size_t>(k);
    MKL_BF16 * a16 = static_cast<MKL_BF16 *>(mkl_malloc(szA * sizeof(MKL_BF16), 64));
We have smart pointer classes to use. Would prevent potential errors from this sort of memory management.
Yes, it's better to use TArray or other classes defined here:
https://github.com/uxlfoundation/oneDAL/blob/main/cpp/daal/src/services/service_arrays.h#L109
{
    uint32_t bits;
    memcpy(&bits, &a[i], sizeof(bits));
    a16[i] = static_cast<MKL_BF16>(bits >> 16);
I believe MKL itself should have utilities for this sort of thing.
It looks like no. Intrinsics like _mm512_cvtneps_pbh might be faster, but I am Ok to have it like this for now.
Shouldn't it have #pragma omp simd then?
{}

~EuclideanDistances() override {}
virtual ~EuclideanDistances() override {}
Please remove redundant "virtual" here and in other similar places.
/intelci: run
Added a third precision-policy mode in 466f40e: strict / allow_bf16 / require_bf16.

Semantics:
- strict: existing float32 path only
- allow_bf16: use AMX-BF16 only when hardware supports it and the conservative performance heuristic passes (currently dims >= 64)
- require_bf16: for kernels with an implemented BF16 path on AMX-BF16 hardware, use BF16 even when the performance-size heuristic would otherwise skip it; unsupported algorithms/hardware/allocation failures still fall back to the regular path

The env var now accepts ONEDAL_FLOAT32_MATMUL_PRECISION=ALLOW_BF16 or REQUIRE_BF16; the programmatic setter supports Float32MatmulPrecision::require_bf16 as well.
Interface/behavior summary for review: This PR adds an internal precision-control interface for float32 GEMM-like paths, currently used only by EuclideanDistances::computeABt. The new internal enum is daal::internal::Float32MatmulPrecision.

The requested mode can be set through the ONEDAL_FLOAT32_MATMUL_PRECISION environment variable or the internal setter. Dispatch behavior is encapsulated in BF16GemmDispatcher. So the external behavior is unchanged by default: reduced precision is opt-in, limited to the Euclidean distance GEMM path, and safe to disable or leave unavailable on non-AMX machines without changing callers.
AMX-BF16 fast path for float32 Euclidean distance GEMM
This PR adds an opt-in AMX-BF16 execution path for the float32 GEMM used inside Euclidean distance computation, specifically EuclideanDistances::computeABt (A * B'). This path is used by brute-force KNN and DBSCAN distance computation.

By default, behavior is unchanged: oneDAL continues to use the existing full-float32 path. Reduced precision is used only when explicitly requested and when the operation/hardware are eligible.
User-visible behavior
Default mode is strict float32; no environment variable or API call is needed.

Opt in through the environment:

  ONEDAL_FLOAT32_MATMUL_PRECISION=ALLOW_BF16

For experiments/frontends that want to force the implemented BF16 path on supported hardware:

  ONEDAL_FLOAT32_MATMUL_PRECISION=REQUIRE_BF16

The same policy can also be set programmatically through the internal API (daal_set_float32_matmul_precision).

The API is intentionally internal for now. Float32MatmulPrecision lives in cpp/daal/src/services/service_defines.h under daal::internal; it is not exposed through installed public headers yet.

Precision policy interface

daal::internal::Float32MatmulPrecision currently has three modes:

- strict: always use the full-float32 BLAS path (sgemm for float, dgemm for double). No reduced-precision arithmetic.
- allow_bf16: use the BF16 path when the hardware supports AMX-BF16 and m, n, and k are all at least 64. Accumulation and output remain float32.
- require_bf16: use the implemented BF16 path on AMX-BF16 hardware even when the size heuristic would otherwise skip it.

The getter/setter store only the requested precision policy. They do not perform hardware checks.
Hardware availability is queried separately through daal_has_amx_bf16(). AMX-BF16 detection is integrated into the existing CPU feature detection infrastructure in env_detect_features.cpp.

Dispatch behavior

The dispatch logic is encapsulated in BF16GemmDispatcher in cpp/daal/src/algorithms/service_kernel_math.h.

The primary template keeps the existing BLAS behavior for non-float types. The float specialization does the BF16 dispatch when all gates pass:

- the requested policy permits BF16 (allow_bf16 or require_bf16),
- the hardware supports AMX-BF16,
- in allow_bf16 mode, m, n, and k are all at least 64.

When BF16 is selected, float32 inputs are converted to BF16 with round-to-nearest-even conversion, then dispatched through cblas_gemm_bf16bf16f32. Accumulation and output remain float32.

If any gate fails, the code transparently falls back to the existing sgemm path. This includes non-AMX hardware, unsupported architectures, small matrices in allow_bf16 mode, unsupported algorithms, and allocation failures.

Accuracy and compatibility
The default strict mode preserves existing behavior and numerical results.

In BF16 modes, only the operands are reduced to BF16; accumulation and output are float32. The BF16 path is limited to Euclidean distance GEMM and is opt-in. On machines without AMX-BF16, the BF16 policy is accepted but execution falls back to the regular float32 path.
Benchmark results
Hardware: Intel Xeon 6975P-C (Granite Rapids), KVM, AMX-BF16 enabled.
Method: scikit-learn_bench / sklearnex, float32, CPU backend.
Observed speedups for feature dimensions where the BF16 path is eligible:

- KNeighborsClassifier: 2.7-3.4x
- KNeighborsRegressor: 2.9-3.3x
- DBSCAN: 3.5-4.0x

No regressions were observed for paths that do not hit the BF16 gate, such as KNN training, KNN regression with low feature count, and DBSCAN with low feature count.
Files changed
- cpp/daal/src/services/service_defines.h: adds the Float32MatmulPrecision enum and precision-control declarations.
- cpp/daal/src/services/compiler/generic/env_detect_precision.cpp: implements env parsing and the stored precision hint.
- cpp/daal/src/services/compiler/generic/env_detect_features.cpp: detects the AMX-BF16 feature.
- cpp/daal/src/services/cpu_type.h: adds CpuFeature::amx_bf16.
- cpp/daal/src/algorithms/service_kernel_math.h: adds BF16GemmDispatcher and wires EuclideanDistances::computeABt through it.

Summary
This PR keeps oneDAL's default numerical behavior unchanged while adding an internal, opt-in BF16 policy for Euclidean distance GEMM. The implementation centralizes hardware/policy dispatch, keeps fallback behavior transparent, and provides a path for sklearnex or other internal frontends to enable AMX-BF16 acceleration when appropriate.