Description
When using ClusterPolicy, the NVIDIA GPU Operator currently assumes that all GPU nodes in the cluster run the same RHEL minor version and selects the driver image based on the OS version detected from the first GPU node returned by the Kubernetes node list API.
The current implementation:
- Lists GPU nodes in the cluster
- Selects the first GPU node returned by the Kubernetes API
- Uses that node’s OS version to determine the driver image tag
Since Kubernetes does not guarantee deterministic ordering of nodes returned by the list API, different reconciliations may select different GPU nodes in environments where GPU worker nodes run different RHEL minor versions (for example, RHEL 9.4 and RHEL 9.7).
As a result, GPU Operator may inconsistently select different driver image tags across reconciliations, even though all nodes belong to the same supported RHEL major version.
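The order dependence described above can be illustrated with a minimal Go sketch. It is not the operator's actual code: the `gpuNode` type, the `tagFromFirstNode` helper, and the node names are hypothetical stand-ins, and the tag format simply follows the minor-version-specific pattern described in this issue.

```go
package main

import "fmt"

// gpuNode is a hypothetical stand-in for a GPU node and its detected OS version.
type gpuNode struct {
	Name      string
	OSVersion string // RHEL version as "major.minor", e.g. "9.4"
}

// tagFromFirstNode mimics the behavior described in this issue: it looks only
// at the first node in the list and builds a minor-version-specific tag.
func tagFromFirstNode(nodes []gpuNode) string {
	if len(nodes) == 0 {
		return ""
	}
	return "rhel" + nodes[0].OSVersion
}

func main() {
	// The same two nodes, returned in two different orders.
	orderA := []gpuNode{{"worker-a", "9.4"}, {"worker-b", "9.7"}}
	orderB := []gpuNode{{"worker-b", "9.7"}, {"worker-a", "9.4"}}

	fmt.Println(tagFromFirstNode(orderA)) // rhel9.4
	fmt.Println(tagFromFirstNode(orderB)) // rhel9.7
}
```

The cluster is identical in both cases; only the list order differs, yet the selected tag changes.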
Currently, this behavior affects RHEL 8 and RHEL 9 because GPU Operator uses minor-version-specific image tags such as:
- rhel8.x
- rhel9.x
RHEL 10 already uses major-version-based image selection logic and is not affected.
Enhance GPU Operator ClusterPolicy logic to normalize RHEL version handling for RHEL 8 and RHEL 9 by using only the RHEL major version for image selection:
- All RHEL 8.x GPU nodes should use image tag rhel8
- All RHEL 9.x GPU nodes should use image tag rhel9
This enhancement removes dependency on the specific GPU node selected during reconciliation and ensures consistent image selection behavior across mixed minor-version deployments.
Expected Behavior
| GPU Node OS Version | Image Tag Used |
| --- | --- |
| RHEL 8.6 | rhel8 |
| RHEL 8.10 | rhel8 |
| RHEL 9.4 | rhel9 |
| RHEL 9.7 | rhel9 |
| RHEL 9.8 | rhel9 |
Benefits
- Eliminates nondeterministic image selection caused by unordered GPU node listing
- Supports mixed RHEL minor-version GPU node deployments when using ClusterPolicy
- Aligns RHEL 8 and RHEL 9 behavior with existing RHEL 10 logic
- Reduces the number of driver/toolkit image variants that must be maintained and published
- Simplifies operational management and testing
The enhancement should ensure that RHEL image selection logic derives only the RHEL major version for RHEL 8 and RHEL 9 environments when using ClusterPolicy.