
ENH: Covariance CPU SPMD support #3507

Merged
Vika-F merged 8 commits into uxlfoundation:main from DDJHB:dev_cov_cpu_spmd
Apr 14, 2026

Conversation

@DDJHB
Contributor

@DDJHB DDJHB commented Feb 9, 2026

Description

This PR adds distributed (SPMD) support for Covariance on CPU.


Checklist:

Completeness and readability

  • I have commented my code, particularly in hard-to-understand areas.
  • I have updated the documentation to reflect the changes or created a separate PR with updates and provided its number in the description, if necessary.
  • Git commit message contains an appropriate signed-off-by string (see CONTRIBUTING.md for details).
  • I have resolved any merge conflicts that might occur with the base branch.

Testing

  • I have run it locally and tested the changes extensively.
  • All CI jobs are green or I have provided justification why they aren't.

No new failures in the CIs.

  • I have extended testing suite if new functionality was introduced in this PR.

Performance

  • I have measured performance for affected algorithms using scikit-learn_bench and provided at least a summary table with measured data, if performance change is expected.
  • I have provided justification why performance and/or quality metrics have changed or why changes are not expected.
  • I have extended the benchmarking suite and provided a corresponding scikit-learn_bench PR if new measurable functionality was introduced in this PR.

@DDJHB
Contributor Author

DDJHB commented Feb 9, 2026

/intelci: run

// Simple allreduce of centered crossproducts is incorrect because each
// rank uses its local mean. Un-center before allreduce, then re-center
// with global statistics after.
if (!desc.get_assume_centered()) {
Contributor

Would it be simpler to use DAAL's Distributed<step2Master> algorithm for aggregation?
https://github.com/uxlfoundation/oneDAL/blob/main/samples/daal/cpp/mpi/sources/covariance_dense_distributed_mpi.cpp#L106

Contributor Author

Interesting suggestion, will look into this

Comment thread cpp/oneapi/dal/algo/covariance/compute_types.hpp Outdated
@DDJHB
Contributor Author

DDJHB commented Mar 29, 2026

/intelci: run

Vika-F added a commit that referenced this pull request Apr 9, 2026
Fixes the ABI breakage introduced by the CPU SPMD algorithms in PRs #3507 and #3585.
Makes the binding type of the destructors (~compute_input(), ~partial_compute_input(), ~train_input(), etc.) stably "WEAK", independent of the compiler's logic, by adding their non-default implementations.

Also fixes the `set_responses` method in KNN.
mergify Bot pushed a commit that referenced this pull request Apr 9, 2026
(cherry picked from commit 0a5f4cb)
maria-Petrova pushed a commit that referenced this pull request Apr 9, 2026
(cherry picked from commit 0a5f4cb)

Co-authored-by: Victoriya Fedotova <victoriya.s.fedotova@intel.com>
@Vika-F
Contributor

Vika-F commented Apr 9, 2026

/intelci: run

@Vika-F Vika-F marked this pull request as ready for review April 10, 2026 08:54
Copilot AI review requested due to automatic review settings April 10, 2026 08:54

Copilot AI left a comment


Pull request overview

This PR adds CPU distributed (SPMD) execution support for the covariance algorithm in oneDAL’s oneAPI C++ implementation, enabling multi-rank aggregation and unblocking CPU execution of existing SPMD tests and samples.

Changes:

  • Enable CPU SPMD dispatch for covariance compute operations.
  • Implement CPU backend SPMD aggregation logic for covariance (partial -> allreduce -> finalize).
  • Add new distributed covariance samples for MPI and CCL; enable CPU execution of the covariance SPMD integration test.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Summary per file:

  • cpp/oneapi/dal/algo/covariance/detail/compute_ops.cpp - Switches covariance CPU dispatch to a universal SPMD-capable kernel dispatcher.
  • cpp/oneapi/dal/algo/covariance/backend/cpu/compute_kernel_dense.cpp - Adds the CPU SPMD path: compute local partials, allreduce partial statistics, and finalize on aggregated partials.
  • cpp/oneapi/dal/algo/covariance/test/spmd.cpp - Allows the covariance SPMD integration test to run on CPU policies.
  • samples/oneapi/cpp/mpi/sources/covariance_distr_mpi.cpp - New MPI distributed covariance sample using dal::preview::compute.
  • samples/oneapi/cpp/ccl/sources/covariance_distr_ccl.cpp - New CCL distributed covariance sample using dal::preview::compute.

Comment thread cpp/oneapi/dal/algo/covariance/backend/cpu/compute_kernel_dense.cpp
Comment thread cpp/oneapi/dal/algo/covariance/backend/cpu/compute_kernel_dense.cpp Outdated
Comment on lines +47 to +57
static compute_result<Task> call_daal_spmd_kernel(const context_cpu& ctx,
                                                  const detail::descriptor_base<Task>& desc,
                                                  const detail::compute_parameters<Task>& params,
                                                  const table& data) {
    auto& comm = ctx.get_communicator();
    const std::int64_t component_count = data.get_column_count();

    // Compute partial results locally on this rank's data
    partial_compute_input<Task> partial_input(data);
    auto partial_result =
        partial_compute_kernel_cpu<Float, method::by_default, Task>{}(ctx, desc, partial_input);

Copilot AI Apr 10, 2026


call_daal_spmd_kernel() receives params but never uses it; this means the CPU hyperparameters selected in compute_parameters_cpu (block size, grain size, etc.) are ignored in SPMD mode. Consider passing params through to the DAAL online/finalize calls (instead of default-constructing detail::compute_parameters{} inside partial/finalize kernels), or otherwise document/encode that these parameters are intentionally unused for SPMD.

@Vika-F Vika-F requested a review from Copilot April 13, 2026 11:36

Copilot AI left a comment


Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Comment on lines 34 to 39
 compute_parameters<Task> select_parameters(const Policy& ctx,
                                            const descriptor_base<Task>& desc,
                                            const compute_input<Task>& input) const {
-    using kernel_dispatcher_t = dal::backend::kernel_dispatcher<KERNEL_SINGLE_NODE_CPU(
+    using kernel_dispatcher_t = dal::backend::kernel_dispatcher<KERNEL_UNIVERSAL_SPMD_CPU(
         parameters::compute_parameters_cpu<Float, Method, Task>)>;
     return kernel_dispatcher_t{}(ctx, desc, input);

Copilot AI Apr 13, 2026


KERNEL_UNIVERSAL_SPMD_CPU dispatcher uses context_cpu{} for host_policy (see cpp/oneapi/dal/backend/dispatcher.hpp:231-233), which drops any non-default settings carried by the passed policy (e.g., enabled/disabled CPU extensions). To preserve single-node behavior while enabling SPMD, consider using a kernel_dispatcher with both KERNEL_SINGLE_NODE_CPU(...) (for host_policy) and KERNEL_UNIVERSAL_SPMD_CPU(...) (for spmd_host_policy), similar to how compute_ops_dpc.cpp mixes kernel specs.

static compute_result<Task> call_daal_spmd_kernel(const context_cpu& ctx,
                                                  const detail::descriptor_base<Task>& desc,
                                                  const detail::compute_parameters<Task>& params,
                                                  const table& data) {

Copilot AI Apr 13, 2026


params is unused in call_daal_spmd_kernel(), and the SPMD path currently ignores covariance CPU hyperparameters (compute_parameters) that are applied in the single-node path via convert_parameters(params). Either thread params through the partial/finalize kernels (so distributed runs honor the same tuning knobs) or explicitly mark/remove the parameter to avoid unused-parameter warnings and clarify that hyperparameters are intentionally not supported in SPMD mode.

Suggested change
-                                                  const table& data) {
+                                                  const table& data) {
+    // SPMD covariance currently does not thread compute_parameters through
+    // partial/finalize kernels, so keep the argument for interface
+    // compatibility and mark it as intentionally unused here.
+    (void)params;

Comment on lines +87 to +90
comm.allreduce(nobs_ary).wait();
comm.allreduce(sums_ary).wait();
comm.allreduce(crossproduct_ary).wait();


Copilot AI Apr 13, 2026


The SPMD implementation performs three separate allreduce() calls with an immediate .wait() after each one. This introduces extra synchronization/latency; consider reducing the number of collectives (e.g., pack nobs, sums, and crossproduct into a single buffer and allreduce once) or at least issuing the collectives first and waiting after to allow potential overlap.

Suggested change
-comm.allreduce(nobs_ary).wait();
-comm.allreduce(sums_ary).wait();
-comm.allreduce(crossproduct_ary).wait();
+auto nobs_request = comm.allreduce(nobs_ary);
+auto sums_request = comm.allreduce(sums_ary);
+auto crossproduct_request = comm.allreduce(crossproduct_ary);
+nobs_request.wait();
+sums_request.wait();
+crossproduct_request.wait();

@Vika-F
Contributor

Vika-F commented Apr 13, 2026

/intelci: run

@Vika-F Vika-F merged commit dc0b21d into uxlfoundation:main Apr 14, 2026
29 of 30 checks passed