
Enable rocm/7.2.1 #67

Open
michaelmckinsey1 wants to merge 4 commits into LBANN:main from michaelmckinsey1:rocm-721



@michaelmckinsey1 michaelmckinsey1 commented May 7, 2026

Enable rocm/7.2.1 using the public AMD wheels (https://repo.radeon.com/rocm/manylinux/rocm-rel-7.2.1/) and the torch wheels (https://download.pytorch.org/whl/torch/). Apparently we should not rely on the WCI wheels going forward, so we will likely maintain scripts/install-tuolumne-torchpypi.sh. This also uses the pre-installed rccl plugin at /collab/usr/global/tools/rccl/toss_4_x86_64_ib_cray/rocm-7.2.0/install/lib/librccl-net.so
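For anyone checking that the intended wheels ended up in the environment after the install, here is a minimal sketch. `describe_torch_build` is a hypothetical helper name and is not part of this PR; it only reads attributes that torch itself exposes (`torch.__version__`, `torch.version.hip`, `torch.cuda.is_available()`) and returns placeholders if torch is not importable.

```python
import importlib.util


def describe_torch_build():
    """Return a small dict describing the installed torch build, if any.

    Hypothetical helper for illustration; not part of the install script.
    """
    if importlib.util.find_spec("torch") is None:
        # torch is not installed in this environment
        return {"torch": None, "hip": None, "gpu_available": False}

    import torch

    return {
        "torch": torch.__version__,
        # torch.version.hip is None on builds without ROCm support
        "hip": getattr(torch.version, "hip", None),
        # ROCm builds report GPUs through the torch.cuda namespace
        "gpu_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    print(describe_torch_build())
```

On a ROCm build the `hip` entry should be a version string rather than `None`, which is a quick sanity check that a ROCm wheel was picked up rather than a CPU or CUDA one.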

With scale 7, 1,1,2 sharding, and 10 epochs, rocm/7.2.1 is slightly faster than 7.1.0:

  • 1% faster on 1 node
  • 6% faster on 2 nodes
  • 4% faster on 4 nodes

@michaelmckinsey1 michaelmckinsey1 self-assigned this May 7, 2026

michaelmckinsey1 commented May 8, 2026

Caveat: I found a regression with this version, where batch_size>1 fails with the error "No suitable algorithm was found to execute the required convolution". This does not happen in rocm/7.1.1.

I still think we should proceed with updating the version, as the 7.1.1 WCI install is separate.
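A standalone probe along these lines might help narrow down the batch_size>1 failure, as a sketch only. The layer type (`Conv3d`), channel count, and input size are assumptions chosen for illustration and are not taken from the model in this repo, so this may or may not trigger the same error; `try_conv` is a hypothetical helper name.

```python
import importlib.util


def try_conv(batch_size, channels=4, size=16):
    """Run one forward convolution; return 'ok', 'skipped: ...', or 'failed: ...'.

    The shapes and the Conv3d layer are illustrative assumptions, not the
    configuration used in this PR.
    """
    if importlib.util.find_spec("torch") is None:
        return "skipped: torch not installed"

    import torch

    # ROCm builds expose GPUs through the "cuda" device name
    device = "cuda" if torch.cuda.is_available() else "cpu"
    conv = torch.nn.Conv3d(channels, channels, kernel_size=3, padding=1).to(device)
    x = torch.randn(batch_size, channels, size, size, size, device=device)

    try:
        with torch.no_grad():
            conv(x)
        if device == "cuda":
            # Wait for the kernel so asynchronous errors surface here
            torch.cuda.synchronize()
        return "ok"
    except RuntimeError as err:
        return f"failed: {err}"


if __name__ == "__main__":
    for bs in (1, 2, 4):
        print(bs, try_conv(bs))
```

Running it under both rocm/7.1.1 and rocm/7.2.1 and comparing the output for batch sizes 1 and 2 would show whether a bare convolution is enough to reproduce the problem, or whether it depends on the real model's shapes.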
