Add chipStar (SPIR-V) support for HIP backends by pvelesko · Pull Request #1942 · CEED/libCEED

pvelesko · 2026-03-17T18:42:03Z

Purpose:

Add support for chipStar, enabling libCEED HIP backends (/gpu/hip/ref, /gpu/hip/shared, /gpu/hip/gen) to run on any OpenCL or Level Zero GPU (Intel, etc.) via SPIR-V.

LLM/GenAI Disclosure: Claude Code Opus 4.6

By submitting this PR, the author certifies to its contents as described by the Developer's Certificate of Origin.
Please follow the Contributing Guidelines for all PRs.

jeremylt · 2026-03-17T18:43:41Z

Note: Please use the libCEED PR template, including the LLM usage declaration. Thanks (requirement in CONTRIBUTING.md)

pvelesko · 2026-04-08T05:40:29Z

@jeremylt When will the CI run for this PR?

jeremylt · 2026-04-08T12:09:33Z

We have to manually trigger any jobs from forks, so just let us know when you want to run CI. I just approved CI running for the current push.

pvelesko · 2026-04-08T14:33:46Z

@jeremylt I've addressed the review feedback in the latest push:

Makefile (db0d71ce → eb092232): hipconfig invocation is now guarded by ROCM_DIR non-empty AND hipconfig binary present, so non-HIP machines build cleanly. Default HIP_LIB_NAME=amdhip64.
Style fixes (blank lines, brace-around-if) + duplication reduction: introduced include/ceed/jit-source/hip/hip-element-loop.h with three macros:
- CEED_HIP_ELEM_LOOP_BEGIN(num_elem) / _END — for kernels with no shared-memory ops
- CEED_HIP_ELEM_LOOP_BEGIN_SHARED(num_elem) / _END_SHARED — for kernels that need to reach __syncthreads() unconditionally
- CEED_HIP_IF_ELEM_VALID — guards individual loads/stores; expands to if (elem < num_elem) on chipStar and to nothing on AMD/NVIDIA (where the for-loop already enforces validity)
Both hip-ref-basis-nontensor.h and hip-shared-basis-tensor.h now use the macros — no more duplicated #ifdef __HIP_PLATFORM_SPIRV__ blocks. The code-generator (ceed-hip-gen-operator-build.cpp) emits the macro names too, so the JIT'd source benefits from the same single source of truth.

Net result: hip-shared-basis-tensor.h shrank from ~760 to ~600 lines (and the diff vs upstream is much smaller).
Verified: clean build with chipStar, t320-basis and ex1-volume (5M DOFs, 3D) pass on all 3 HIP backends (ref/shared/gen). Both AMD and SPIRV macro paths preprocess and compile cleanly.

Could you re-trigger CI when convenient? Thanks!

pvelesko · 2026-04-08T14:46:47Z

Follow-up: replaced the stderr line filter with CHIP_LOGLEVEL=crit set in the test process environment (2b8414e8). Cleaner — chipStar suppresses the noise at source instead of filtering after the fact. Could you re-trigger CI?

jrwrigh · 2026-04-08T15:14:04Z

@jeremylt We disabled the HIP backend checks when we removed the AMD card from Noether, right? Did we ever try and get it back going with the NVIDIA card? I'm afraid the CI may not be actually testing the code here (assuming I understand how chip star interacts with the backends).

jeremylt · 2026-04-08T16:26:21Z

@jeremylt We disabled the HIP backend checks when we removed the AMD card from Noether, right? Did we ever try and get it back going with the NVIDIA card? I'm afraid the CI may not be actually testing the code here (assuming I understand how chip star interacts with the backends).

Correct - CI is only testing that the build system is correct and that the HIP backend compiles correctly.

jeremylt · 2026-04-08T16:38:02Z

Overall this is looking pretty great. I need to do manual testing of the changes since ROCm testing isn't part of CI (as discussed above). Can you post the output of make prove -j for the full test suite so we have documented in the PR the test suite passing on a machine using chipStar?

zatkins-dev

I think that hijacking the HIP backend for this support is interesting, but creates more brittle and confusing code. I think that this ought to be its own backend, perhaps utilizing the same inner kernels, to clarify the differences with the existing HIP backend

jrwrigh · 2026-04-08T20:21:19Z

I think that hijacking the HIP backend for this support is interesting, but creates more brittle and confusing code. I think that this ought to be its own backend, perhaps utilizing the same inner kernels, to clarify the differences with the existing HIP backend

I obviously don't work on the backends that much, so take my opinion with a grain of salt. But the point of chipStar is to just operate straight on HIP code and the changes made, while a bit confusing, are pretty minimal. Perhaps maybe there's some cleaner way to implement the changes.

zatkins-dev · 2026-04-08T20:24:08Z

the changes made, while a bit confusing, are pretty minima

I think my main issue is why is this necessary and in what situations. I don't understand it, which means I can't maintain it.

jeremylt · 2026-04-08T20:25:51Z

The GPU backends are in a tricky spot with maintainability today, let alone 2 months from now, so I think its a good concern to voice. I agree that the why about not looping over elements is not clear to me either.

pvelesko · 2026-04-09T06:51:27Z

@zatkins-dev I wouldn't call this hijacking since chipStar is a full HIP implementation+runtime.

The only real change here is the hoisting of syncthreads call outside of the inner loop. SPIR-V semantics dictate that all threads must reach syncthreads and this case we have cases where some threads are not participating in the calculation thus never reach syncthreads and thus hang.

These changes are benign but I can run some performance benchmarks on MI50 before and after this refactor and report performance.

Having two HIP backends doesn't make sense to me even in the case where there is a performance impact on the AMD side.

If I observe a performance degradation on the AMD side, we can discuss how to best set this up.

jeremylt · 2026-04-09T12:33:44Z

If the goal is to prevent some threads from not hitting sync points, I would tackle this differently. I have done that before with other kernels. Let me take a look at the code again.

zatkins-dev · 2026-04-09T14:14:39Z

Hijacking may have been a strong word, I meant no offense.

I am not concerned about performance, since the macro restores the existing code. I am concerned about how this can be doing the same computation when it removes a loop over elements. I am also confused as to why only the two files that you changed need these changes. As one of the people who will have to maintain this code in the future, I need to know when I should or should not be using these annotations, and I currently have no mental model for that.

jeremylt · 2026-04-09T19:38:26Z

I would use the approach in #1950 instead of the macro magic.

I think it should do what you're trying to do, but clearer so it can be maintained.

I do think that for a big change like this, it is better to first create an issue so design decisions can be discussed first.

pvelesko · 2026-04-13T10:14:00Z

After discussion with @jeremylt, I've pulled all kernel changes out of this PR. @jeremylt's PR #1950 handles the syncthreads fix with a cleaner loop-padding approach that works across all platforms without preprocessor guards. This PR now focuses only on the build-system and test-infrastructure changes needed for chipStar support.

What's in this PR now

Commit	Description
`eb092232`	Makefile: chipStar HIP platform detection and dynamic lib name
`5b80b42c`	Makefile: use SYCLCXX as linker for libceed.so when SYCL is enabled
`f10a5a85`	Makefile: filter chipStar clang-only flags from HIPCONFIG_CPPFLAGS for gcc
`ad0af6e3`	hip-gen: add explicit `#include <cstring>` for chipStar compatibility
`8d62fa67`	tests/junit: add `--env KEY=VAL` flag for subprocess environment

Answers to inline review comments

@zatkins-dev on hip-ref-basis-nontensor.h line 27 ("why does this file need different behavior?"):
That change has been reverted to upstream. It was unnecessarily included in the earlier version of the PR.

@zatkins-dev on hip-element-loop.h line 51 (correctness concerns):
hip-element-loop.h has been deleted — the entire macro-based approach is gone. The kernel fix is now entirely in PR #1950.

@zatkins-dev on ceed-hip-gen-operator-build.cpp line 15 (#include <cstring>):
The include stays. chipStar HIP headers do not transitively include <cstring>, which causes build failures for the four pre-existing strcmp() calls in this file (lines 1191, 1215, 2164, 2188 — resource_root comparisons). AMD ROCm headers include it transitively, so this is a no-op on non-chipStar builds.

Note on PR #1950

I verified the loop-padding approach empirically on chipStar and found a formula bug: elem_loop_bound = num_elem * ceil(num_elem/stride) deadlocks when stride > num_elem. I posted the fix with a reproducer on PR #1950.

chipStar test invocation (for reference)

make junit BACKENDS='/gpu/hip/shared /gpu/hip/gen /gpu/hip/ref' JUNIT_ARGS='--env CHIP_LOGLEVEL=crit'

Test results

With the loop-padding approach (corrected formula equivalent to what I posted on #1950), the following pass on chipStar (Intel Arc A770, Level Zero):

t310-basis, t314-basis, t316-basis on /gpu/hip/shared — all pass
Basis tests previously deadlocked with the wrong formula (hipErrorOutOfMemory is chipStar's symptom for a workgroup barrier deadlock)

Full make prove results will be posted after #1950 merges and the kernel fix is in main.

jeremylt · 2026-04-13T11:14:19Z

It looks like you heavily use Claude or something similar in your workflow - don't forget to update the LLM usage the disclosure so we know how you do so.

Note - you are replying to human generated review comments with machine generated messages. I think it's worth keeping in mind that is a potential frustration point

jeremylt · 2026-04-13T17:06:08Z

 subsearch ?= .*
 JUNIT_BATCH ?= ''
+# Extra arguments forwarded to tests/junit.py (e.g. --env CHIP_LOGLEVEL=crit)
+JUNIT_ARGS ?=


Ideally, we should automatically suppress the Chipstar warnings when we build with Chipstar so the test suite passes - so JUNIT_ARGS should be set appropriately by default if Chipstar is enabled

- Detect chipStar via hipconfig and set HIP_LIB_NAME=CHIP - Filter clang-only flags from HIPCONFIG_CPPFLAGS for gcc-compiled sources - Support separate HIPBLAS_DIR for chipStar (hipblas is a separate install) - Use SYCLCXX as linker when SYCL backends are enabled - Skip AMD-specific HIPRTC flags on chipStar; pass only -DCEED_RUNNING_JIT_PASS=1 - Add explicit #include <cstring> in hip-gen for chipStar header compat

pvelesko · 2026-04-17T04:55:51Z

@jeremylt rebased

jeremylt · 2026-04-17T16:40:48Z

Thanks for the work on this, it looks just about ready @pvelesko

Can you post the output from make prove -j here showing that the HIP backends pass with these changes? I like to include that in PRs when the change is not covered by CI (we don't currently have CI for our HIP backends working).

@jrwrigh you may want to look at this. Unclear how much the perf difference is chipStar vs SYCL and how much is my perf improvements to the HIP backend family. (but this could provide motivation to find someone with the skills to rework the SYCL backend now that I've overhauled CUDA/HIP backends). I wonder how this looks for HONEE's performance on Intel hardware.

jrwrigh · 2026-04-18T00:10:51Z

I don't have access to Aurora at the moment. Or at least I'm 90% sure I don't have access to any computer time on there. I can try and double check that tomorrow or Monday.

pvelesko · 2026-04-18T12:19:49Z

I tested on my local system and on Aurora. All tests pass except for one (unrelated to this PR, working on it). On Aurora haven't been able to run SYCL tests at all:

    terminate called after throwing an instance of 'sycl::_V1::exception'
      what():  No kernel named _ZTSZZN4sycl3_V16detail16NDRangeReductionILNS1_9reduction8strategyE1EE3runINS1_9auto_nameELi1E...CeedVectorNorm_Sycl... was found

    Exit code 134 (SIGABRT from std::terminate()).

  HIP (chipStar) Performance — ex3-volume, 50× CeedOperatorApply:

  ┌─────────────────┬──────────┬────────┬──────────────────────┐
  │     Backend     │ 500K DOF │ 4M DOF │ 4M DOF (DOF·apply/s) │
  ├─────────────────┼──────────┼────────┼──────────────────────┤
  │ /gpu/hip/ref    │ 486 ms   │ 848 ms │ 235M                 │
  ├─────────────────┼──────────┼────────┼──────────────────────┤
  │ /gpu/hip/shared │ 467 ms   │ 768 ms │ 260M                 │
  ├─────────────────┼──────────┼────────┼──────────────────────┤
  │ /gpu/hip/gen    │ 451 ms   │ 653 ms │ 306M                 │
  └─────────────────┴──────────┴────────┴──────────────────────┘

jeremylt · 2026-04-18T21:27:45Z

Well, like I said above, will merge this once you post the output of make prove.

pvelesko · 2026-04-19T11:39:46Z

aurora-make-prove-hip.txt

pvelesko · 2026-04-19T14:05:50Z

Aurora Benchmarking Results:

(200 runs × 50M nodes):

  ┌───────────────────┬────────────┬─────────────┬────────────┐
  │      Backend      │ ex1-volume │ ex2-surface │ ex3-volume │
  ├───────────────────┼────────────┼─────────────┼────────────┤
  │ hip/gen level0    │ 4.6s       │ 7.6s        │ 9.1s       │
  ├───────────────────┼────────────┼─────────────┼────────────┤
  │ hip/gen opencl    │ 4.7s       │ 7.6s        │ 9.2s       │
  ├───────────────────┼────────────┼─────────────┼────────────┤
  │ sycl/gen          │ 5.8s       │ 8.6s        │ 9.7s       │
  ├───────────────────┼────────────┼─────────────┼────────────┤
  │ hip/shared level0 │ 5.6s       │ 11.4s       │ 12.2s      │
  ├───────────────────┼────────────┼─────────────┼────────────┤
  │ hip/shared opencl │ 5.7s       │ 11.5s       │ 12.2s      │
  ├───────────────────┼────────────┼─────────────┼────────────┤
  │ sycl/shared       │ 6.9s       │ 10.6s       │ 12.4s      │
  ├───────────────────┼────────────┼─────────────┼────────────┤
  │ sycl/ref          │ 7.0s       │ 12.0s       │ 14.3s      │
  ├───────────────────┼────────────┼─────────────┼────────────┤
  │ hip/ref level0    │ 7.3s       │ 14.4s       │ 16.7s      │
  ├───────────────────┼────────────┼─────────────┼────────────┤
  │ hip/ref opencl    │ 7.6s       │ 14.8s       │ 17.3s      │
  └───────────────────┴────────────┴─────────────┴────────────┘

pvelesko · 2026-04-20T14:48:13Z

@jeremylt make prove results attached btw

jeremylt · 2026-04-20T15:52:25Z

It just occurred to me that we need to tell the users how to turn on this feature.

Can you add 2-3 sentences explaining what chipStar is and what environment variables the user needs to set here:

libCEED/README.md

Line 209 in 5a7ad6a

ROCm version 4.2 or newer is required.

That's the last thing I think we need to merge!

Explain that the HIP backends can also run on non-AMD GPUs via chipStar (SPIR-V over Level Zero/OpenCL), how to point the build at a chipStar install via HIP_DIR, and which CHIP_* runtime environment variables select the backend and device.

pvelesko · 2026-04-22T13:45:53Z

@jeremylt readme updated

jeremylt · 2026-04-22T16:45:25Z

And libCEED now officially supports chipStar 🎉

pvelesko force-pushed the chipStar branch from d2d8709 to a65eb92 Compare March 18, 2026 06:36

pvelesko marked this pull request as ready for review April 8, 2026 05:40

jeremylt requested changes Apr 8, 2026

View reviewed changes

Comment thread include/ceed/jit-source/hip/hip-ref-basis-nontensor.h Outdated

Comment thread include/ceed/jit-source/hip/hip-ref-basis-nontensor.h

Comment thread include/ceed/jit-source/hip/hip-ref-basis-nontensor.h Outdated

Comment thread Makefile Outdated

pvelesko force-pushed the chipStar branch from a65eb92 to 6a5b77b Compare April 8, 2026 14:33

pvelesko force-pushed the chipStar branch from 6a5b77b to e8e7c92 Compare April 8, 2026 14:46

jeremylt reviewed Apr 8, 2026

View reviewed changes

Comment thread tests/junit_common.py Outdated

zatkins-dev requested changes Apr 8, 2026

View reviewed changes

Comment thread tests/junit_common.py Outdated

Comment thread tests/junit_common.py Outdated

Comment thread tests/junit_common.py Outdated

Comment thread include/ceed/jit-source/hip/hip-element-loop.h Outdated

Comment thread backends/hip-gen/ceed-hip-gen-operator-build.cpp

pvelesko force-pushed the chipStar branch from e8e7c92 to 85c856e Compare April 9, 2026 06:49

zatkins-dev reviewed Apr 9, 2026

View reviewed changes

Comment thread include/ceed/jit-source/hip/hip-ref-basis-nontensor.h Outdated

jeremylt mentioned this pull request Apr 9, 2026

gpu - pad out elem loop for shared/gen #1950

Merged

pvelesko force-pushed the chipStar branch from 85c856e to ad0af6e Compare April 13, 2026 10:13

jeremylt reviewed Apr 13, 2026

View reviewed changes

Comment thread tests/junit.py Outdated

jeremylt reviewed Apr 13, 2026

View reviewed changes

jeremylt added enhancement building GPU 1-In Review HIP labels Apr 15, 2026

jeremylt reviewed Apr 15, 2026

View reviewed changes

Comment thread Makefile Outdated

jeremylt mentioned this pull request Apr 16, 2026

Don't use JiT options that chipStar cannot process #1954

Merged

pvelesko force-pushed the chipStar branch from ad0af6e to ad91931 Compare April 17, 2026 04:52

pvelesko force-pushed the chipStar branch from ad91931 to e85e3fb Compare April 17, 2026 04:55

pvelesko force-pushed the chipStar branch from 60c5f6f to e85e3fb Compare April 19, 2026 11:39

jeremylt merged commit c381141 into CEED:main Apr 22, 2026
25 of 26 checks passed

Conversation

pvelesko commented Mar 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jeremylt commented Mar 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pvelesko commented Apr 8, 2026

Uh oh!

jeremylt commented Apr 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

pvelesko commented Apr 8, 2026

Uh oh!

pvelesko commented Apr 8, 2026

Uh oh!

jrwrigh commented Apr 8, 2026

Uh oh!

jeremylt commented Apr 8, 2026

Uh oh!

Uh oh!

jeremylt commented Apr 8, 2026

Uh oh!

zatkins-dev left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jrwrigh commented Apr 8, 2026

Uh oh!

zatkins-dev commented Apr 8, 2026

Uh oh!

jeremylt commented Apr 8, 2026

Uh oh!

pvelesko commented Apr 9, 2026

Uh oh!

jeremylt commented Apr 9, 2026

Uh oh!

zatkins-dev commented Apr 9, 2026

Uh oh!

Uh oh!

jeremylt commented Apr 9, 2026

Uh oh!

pvelesko commented Apr 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What's in this PR now

Answers to inline review comments

Note on PR #1950

chipStar test invocation (for reference)

Test results

Uh oh!

jeremylt commented Apr 13, 2026

Uh oh!

Uh oh!

jeremylt Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

pvelesko commented Apr 17, 2026

Uh oh!

jeremylt commented Apr 17, 2026

Uh oh!

jrwrigh commented Apr 18, 2026

Uh oh!

pvelesko commented Apr 18, 2026

Uh oh!

jeremylt commented Apr 18, 2026

Uh oh!

pvelesko commented Apr 19, 2026

Uh oh!

pvelesko commented Apr 19, 2026

Uh oh!

pvelesko commented Apr 20, 2026

pvelesko commented Mar 17, 2026 •

edited

Loading

jeremylt commented Mar 17, 2026 •

edited

Loading

jeremylt commented Apr 8, 2026 •

edited

Loading

pvelesko commented Apr 13, 2026 •

edited

Loading