
Fix reverse NVVM barrier calls for LLVM 21+ CUDA intrinsics#2785

Open
Copilot wants to merge 4 commits into main from copilot/fix-cuda-barrier-handling

Contributor

Copilot AI commented Apr 9, 2026

Enzyme was emitting malformed reverse-mode NVVM barrier calls on newer LLVM/Clang CUDA toolchains. In particular, LLVM 21+ llvm.nvvm.barrier.cta.sync.aligned.* intrinsics require explicit operands, but Enzyme was recreating some reverse barrier calls with an empty argument list.

  • Reverse-mode NVVM barrier lowering

    • Preserve the correct intrinsic signature across LLVM version boundaries:
      • LLVM 19/20: continue emitting zero-arg llvm.nvvm.barrier0()
      • LLVM 21: emit llvm.nvvm.barrier.cta.sync.aligned.all/count with the required barrier id / count operands
      • LLVM 22+: also map the newer barrier.cta.red.*.aligned.{all,count} reductions back to the matching sync intrinsic with the correct operands
    • Avoid reusing reduction operands as barrier ids on LLVM 21, where the legacy barrier0_* reduction intrinsics still use the old operand shape
  • Version-specific correctness

    • Split the reverse barrier handling at the LLVM 21 / 22 boundary rather than treating all >20 versions the same
    • Use constant barrier id 0 when lowering LLVM 21 legacy reduction barriers to sync.aligned.all, matching LLVM’s NVVM upgrade semantics
  • Regression coverage

    • Add a focused reverse-mode CUDA lit test covering:
      • legacy llvm.nvvm.barrier0()
      • llvm.nvvm.barrier.cta.sync.aligned.all(i32)
      • llvm.nvvm.barrier.cta.sync.aligned.count(i32, i32)

Example of the corrected reverse IR shape on newer LLVM:

; before
call void @llvm.nvvm.barrier.cta.sync.aligned.all()

; after
call void @llvm.nvvm.barrier.cta.sync.aligned.all(i32 0)
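
The version split described above can be modeled without LLVM as a small lookup. This is only a sketch of the selection rules as stated in this description, not Enzyme's actual code: `ForwardBarrier`, `ReverseBarrier`, and `reverseBarrierFor` are hypothetical names, operands are plain strings, and the `.count` reduction variants are left out.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins for the forward-pass barrier kinds named above.
enum class ForwardBarrier {
  Barrier0,          // legacy llvm.nvvm.barrier0()
  Barrier0Reduction, // legacy barrier0 reductions: operand is not a barrier id
  SyncAlignedAll,    // llvm.nvvm.barrier.cta.sync.aligned.all(i32 id)
  SyncAlignedCount,  // llvm.nvvm.barrier.cta.sync.aligned.count(i32 id, i32 n)
  RedAlignedAll,     // LLVM 22+ barrier.cta.red.*.aligned.all: operand 0 is the id
};

// Name and operand list of the barrier to emit in the reverse pass.
struct ReverseBarrier {
  std::string intrinsic;
  std::vector<std::string> operands;
};

ReverseBarrier reverseBarrierFor(int llvmMajor, ForwardBarrier kind,
                                 const std::vector<std::string> &fwdOperands) {
  const std::string syncAll = "llvm.nvvm.barrier.cta.sync.aligned.all";
  const std::string syncCount = "llvm.nvvm.barrier.cta.sync.aligned.count";

  // LLVM 19/20: keep emitting the zero-argument legacy barrier.
  if (llvmMajor <= 20)
    return {"llvm.nvvm.barrier0", {}};

  switch (kind) {
  case ForwardBarrier::Barrier0:
  case ForwardBarrier::Barrier0Reduction:
    // Do not reuse the reduction operand as a barrier id; use constant id 0.
    return {syncAll, {"i32 0"}};
  case ForwardBarrier::SyncAlignedAll:
    assert(fwdOperands.size() >= 1);
    return {syncAll, {fwdOperands[0]}};
  case ForwardBarrier::SyncAlignedCount:
    assert(fwdOperands.size() >= 2);
    return {syncCount, {fwdOperands[0], fwdOperands[1]}};
  case ForwardBarrier::RedAlignedAll:
    // Per the description, these reductions are only handled on LLVM 22+.
    assert(llvmMajor >= 22 && !fwdOperands.empty());
    return {syncAll, {fwdOperands[0]}};
  }
  return {syncAll, {"i32 0"}}; // not reached for valid enum values
}
```

For instance, `reverseBarrierFor(21, ForwardBarrier::Barrier0Reduction, {"i32 %pred"})` yields the `sync.aligned.all` name with the single operand `i32 0`, which corresponds to the "after" IR shown above.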

Agent-Logs-Url: https://github.com/EnzymeAD/Enzyme/sessions/2c2c6db1-8a94-4691-a34c-569c8949f747

Co-authored-by: minansys <149007967+minansys@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix incorrect arguments for CUDA barrier calls" to "Fix reverse NVVM barrier calls for LLVM 21+ CUDA intrinsics" Apr 9, 2026
Copilot AI requested a review from minansys April 9, 2026 20:56
Collaborator

@minansys minansys left a comment

1) change

"#if LLVM_VERSION_MAJOR > 20
  auto BarrierInst = Arch == Triple::amdgcn
                         ? (llvm::Intrinsic::ID)Intrinsic::amdgcn_s_barrier
                         : (llvm::Intrinsic::ID)
                               Intrinsic::nvvm_barrier_cta_sync_aligned_all;
#else
  auto BarrierInst = Arch == Triple::amdgcn
                         ? (llvm::Intrinsic::ID)Intrinsic::amdgcn_s_barrier
                         : (llvm::Intrinsic::ID)Intrinsic::nvvm_barrier0;
#endif
  instbuilder.CreateCall(
      getIntrinsicDeclaration(gutils->newFunc->getParent(), BarrierInst),
      {});"

to

"#if LLVM_VERSION_MAJOR > 20
  auto BarrierInst = Arch == Triple::amdgcn
                         ? (llvm::Intrinsic::ID)Intrinsic::amdgcn_s_barrier
                         : (llvm::Intrinsic::ID)
                               Intrinsic::nvvm_barrier_cta_sync_aligned_all;
#else
  auto BarrierInst = Arch == Triple::amdgcn
                         ? (llvm::Intrinsic::ID)Intrinsic::amdgcn_s_barrier
                         : (llvm::Intrinsic::ID)Intrinsic::nvvm_barrier0;
#endif
  SmallVector<Value *, 1> BarrierArgs;
#if LLVM_VERSION_MAJOR > 20
  if (Arch != Triple::amdgcn) {
    BarrierArgs.push_back(
        ConstantInt::get(Type::getInt32Ty(gutils->newFunc->getContext()),
                         0));
  }
#endif
  instbuilder.CreateCall(
      getIntrinsicDeclaration(gutils->newFunc->getParent(), BarrierInst),
      BarrierArgs);"

in Enzyme/EnzymeLogic.cpp

2) use clang-format to format it so that it passes the format requirement

"DoozyX/clang-format-lint-action@v0.20
with:
  source: enzyme/Enzyme enzyme/tools/enzyme-tblgen
  exclude: ./CMakeLists.txt enzyme/Enzyme/SCEV
  style: llvm
  clangFormatVersion: 16
  extensions: c,h,C,H,cpp,hpp,cc,hh,c++,h++,cxx,hxx
  inplace: false"

Agent-Logs-Url: https://github.com/EnzymeAD/Enzyme/sessions/0b1ed8c1-4684-4ccb-aa22-f3c5a8317d93

Co-authored-by: minansys <149007967+minansys@users.noreply.github.com>
Contributor Author

Copilot AI commented Apr 10, 2026

change "#if LLVM_VERSION_MAJOR > 20
auto BarrierInst = Arch == Triple::amdgcn
? (llvm::Intrinsic::ID)Intrinsic::amdgcn_s_barrier...

Addressed in 078c6be. EnzymeLogic.cpp now passes i32 0 when creating the LLVM > 20 NVVM barrier call, leaves AMDGCN unchanged, and the file was formatted with clang-format-16. I also rebuilt LLVMEnzyme-16 and reran the focused CUDA reverse-mode tests (cuda-barrier-sync.ll and sharedcachefwd.ll).
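
The behaviour described in this reply (an `i32 0` operand for the LLVM > 20 NVVM barrier, no operands for AMDGCN or for the legacy NVVM barrier) can be illustrated with a small LLVM-free model. This is a sketch under assumptions: `TargetArch`, `BarrierCall`, and `makeCacheBarrier` are hypothetical names, and the real code in EnzymeLogic.cpp selects the intrinsic with `#if LLVM_VERSION_MAJOR` at compile time rather than with a runtime argument.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for the two GPU targets handled in this code path.
enum class TargetArch { AMDGCN, NVPTX };

// Intrinsic name plus the i32 constant operands passed to it.
struct BarrierCall {
  std::string intrinsic;
  std::vector<std::int32_t> args;
};

// llvmMajor models LLVM_VERSION_MAJOR, which the real code tests at compile time.
BarrierCall makeCacheBarrier(TargetArch arch, int llvmMajor) {
  // AMDGCN is unchanged: s_barrier takes no operands on any version.
  if (arch == TargetArch::AMDGCN)
    return {"llvm.amdgcn.s.barrier", {}};
  // LLVM > 20: the aligned sync barrier needs an explicit barrier id (0 here).
  if (llvmMajor > 20)
    return {"llvm.nvvm.barrier.cta.sync.aligned.all", {0}};
  // Older LLVM: the legacy zero-argument barrier.
  return {"llvm.nvvm.barrier0", {}};
}
```

Only the NVPTX path on LLVM > 20 receives an operand, which mirrors the `if (Arch != Triple::amdgcn)` guard around `BarrierArgs.push_back(...)` in the requested change.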

Copilot AI requested a review from minansys April 10, 2026 15:54
Contributor

@superustc superustc left a comment

fix this formatting issue "--- enzyme/Enzyme/AdjointGenerator.h (original)
+++ enzyme/Enzyme/AdjointGenerator.h (reformatted)
@@ -4013,8 +4013,8 @@
   case Intrinsic::nvvm_barrier0_or: {
     SmallVector<Value *, 1> args = {
         ConstantInt::get(Type::getInt32Ty(M->getContext()), 0)};
-    auto *Fn =
-        getIntrinsicDeclaration(M, Intrinsic::nvvm_barrier_cta_sync_aligned_all);
+    auto *Fn = getIntrinsicDeclaration(
+        M, Intrinsic::nvvm_barrier_cta_sync_aligned_all);
     auto cal = cast<CallInst>(Builder2.CreateCall(Fn, args));
     cal->setCallingConv(Fn->getCallingConv());
     cal->setDebugLoc(gutils->getNewFromOriginal(I.getDebugLoc()));
@@ -4025,8 +4025,8 @@
   case Intrinsic::nvvm_barrier_cta_red_or_aligned_all:
   case Intrinsic::nvvm_barrier_cta_red_popc_aligned_all: {
     SmallVector<Value *, 1> args = {I.getOperand(0)};
-    auto *Fn =
-        getIntrinsicDeclaration(M, Intrinsic::nvvm_barrier_cta_sync_aligned_all);
+    auto *Fn = getIntrinsicDeclaration(
+        M, Intrinsic::nvvm_barrier_cta_sync_aligned_all);
     auto cal = cast<CallInst>(Builder2.CreateCall(Fn, args));
     cal->setCallingConv(Fn->getCallingConv());
     cal->setDebugLoc(gutils->getNewFromOriginal(I.getDebugLoc()));"
    

Collaborator

@minansys minansys left a comment

fix this clang format issue "--- enzyme/Enzyme/AdjointGenerator.h (original)
+++ enzyme/Enzyme/AdjointGenerator.h (reformatted)
@@ -4013,8 +4013,8 @@
   case Intrinsic::nvvm_barrier0_or: {
     SmallVector<Value *, 1> args = {
         ConstantInt::get(Type::getInt32Ty(M->getContext()), 0)};
-    auto *Fn =
-        getIntrinsicDeclaration(M, Intrinsic::nvvm_barrier_cta_sync_aligned_all);
+    auto *Fn = getIntrinsicDeclaration(
+        M, Intrinsic::nvvm_barrier_cta_sync_aligned_all);
     auto cal = cast<CallInst>(Builder2.CreateCall(Fn, args));
     cal->setCallingConv(Fn->getCallingConv());
     cal->setDebugLoc(gutils->getNewFromOriginal(I.getDebugLoc()));
@@ -4025,8 +4025,8 @@
   case Intrinsic::nvvm_barrier_cta_red_or_aligned_all:
   case Intrinsic::nvvm_barrier_cta_red_popc_aligned_all: {
     SmallVector<Value *, 1> args = {I.getOperand(0)};
-    auto *Fn =
-        getIntrinsicDeclaration(M, Intrinsic::nvvm_barrier_cta_sync_aligned_all);
+    auto *Fn = getIntrinsicDeclaration(
+        M, Intrinsic::nvvm_barrier_cta_sync_aligned_all);
     auto cal = cast<CallInst>(Builder2.CreateCall(Fn, args));
     cal->setCallingConv(Fn->getCallingConv());
     cal->setDebugLoc(gutils->getNewFromOriginal(I.getDebugLoc()));"
    

@minansys minansys self-requested a review April 11, 2026 12:46
@minansys minansys marked this pull request as ready for review April 11, 2026 12:46
Collaborator

@minansys minansys left a comment

fix the clang format issue "--- enzyme/Enzyme/AdjointGenerator.h (original)
+++ enzyme/Enzyme/AdjointGenerator.h (reformatted)
@@ -4013,8 +4013,8 @@
   case Intrinsic::nvvm_barrier0_or: {
     SmallVector<Value *, 1> args = {
         ConstantInt::get(Type::getInt32Ty(M->getContext()), 0)};
-    auto *Fn =
-        getIntrinsicDeclaration(M, Intrinsic::nvvm_barrier_cta_sync_aligned_all);
+    auto *Fn = getIntrinsicDeclaration(
+        M, Intrinsic::nvvm_barrier_cta_sync_aligned_all);
     auto cal = cast<CallInst>(Builder2.CreateCall(Fn, args));
     cal->setCallingConv(Fn->getCallingConv());
     cal->setDebugLoc(gutils->getNewFromOriginal(I.getDebugLoc()));
@@ -4025,8 +4025,8 @@
   case Intrinsic::nvvm_barrier_cta_red_or_aligned_all:
   case Intrinsic::nvvm_barrier_cta_red_popc_aligned_all: {
     SmallVector<Value *, 1> args = {I.getOperand(0)};
-    auto *Fn =
-        getIntrinsicDeclaration(M, Intrinsic::nvvm_barrier_cta_sync_aligned_all);
+    auto *Fn = getIntrinsicDeclaration(
+        M, Intrinsic::nvvm_barrier_cta_sync_aligned_all);
     auto cal = cast<CallInst>(Builder2.CreateCall(Fn, args));
     cal->setCallingConv(Fn->getCallingConv());
     cal->setDebugLoc(gutils->getNewFromOriginal(I.getDebugLoc()));"
    

Agent-Logs-Url: https://github.com/EnzymeAD/Enzyme/sessions/25c353e0-b395-4365-bbd9-3dda55c8f3e1

Co-authored-by: minansys <149007967+minansys@users.noreply.github.com>
Contributor Author

Copilot AI commented Apr 11, 2026

fix this clang format issue "--- enzyme/Enzyme/AdjointGenerator.h (original)
+++ enzyme/Enzyme/AdjointGenerator.h (reformatted)
@@ -4013,8 +4013,8 @@...

Fixed in 01e0b55. AdjointGenerator.h now matches clang-format-16 for those getIntrinsicDeclaration(...) calls, and I rebuilt LLVMEnzyme-16 plus reran the focused CUDA reverse-mode tests (cuda-barrier-sync.ll and sharedcachefwd.ll).
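
As a rough illustration of why clang-format rewrapped those calls: the `llvm` style uses an 80-column limit, and the original continuation line is longer than that if it sits at the 8-space indent assumed here. The helper below is a hypothetical sketch (`exceedsColumnLimit` and `kLLVMColumnLimit` are not part of Enzyme or clang-format), and it only checks line length, not clang-format's full layout rules.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Column limit of clang-format's built-in LLVM style.
constexpr std::size_t kLLVMColumnLimit = 80;

// Returns true when a source line is longer than the given column limit.
// Assumes one byte per column, which holds for the ASCII lines in question.
bool exceedsColumnLimit(const std::string &line,
                        std::size_t limit = kLLVMColumnLimit) {
  return line.size() > limit;
}
```

Under that indentation assumption, the unwrapped `getIntrinsicDeclaration(M, Intrinsic::nvvm_barrier_cta_sync_aligned_all);` continuation line exceeds the limit, while both lines of the rewrapped form stay within it.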

@EnzymeAD EnzymeAD deleted a comment from superustc Apr 11, 2026
Contributor Author

Copilot AI commented Apr 11, 2026

fix the clang format issue "--- enzyme/Enzyme/AdjointGenerator.h (original)
+++ enzyme/Enzyme/AdjointGenerator.h (reformatted)
@@ -4013,8 +4013,8 @@...

Already fixed in 01e0b55. AdjointGenerator.h has the clang-format-16 wrapping for those getIntrinsicDeclaration(...) calls in the current branch head.

Copilot AI requested a review from minansys April 11, 2026 13:16

Development

Successfully merging this pull request may close these issues.

Incorrect number of arguments passed to called function! call void @llvm.nvvm.barrier.cta.sync.aligned.all()

3 participants