[Enhancement]: Enhance RHEL Driver Image Selection to Use Major-Version Tags for RHEL 8 and RHEL 9 #2471

@rahulait

Description

When using ClusterPolicy, the NVIDIA GPU Operator currently assumes that all GPU nodes in the cluster run the same RHEL minor version and selects the driver image based on the OS version detected from the first GPU node returned by the Kubernetes node list API.

The current implementation:

  • Lists GPU nodes in the cluster
  • Selects the first GPU node returned by the Kubernetes API
  • Uses that node’s OS version to determine the driver image tag

Since Kubernetes does not guarantee deterministic ordering of nodes returned by the list API, different reconciliations may select different GPU nodes in environments where GPU worker nodes run different RHEL minor versions (for example, RHEL 9.4 and RHEL 9.7).

As a result, the GPU Operator may select a different driver image tag from one reconciliation to the next, even though all nodes belong to the same supported RHEL major version.
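
The selection behavior described above can be modeled roughly as follows. This is an illustrative Python sketch, not the operator's actual code: `Node`, `minor_version_tag`, and `select_driver_tag` are hypothetical names, the plain list stands in for whatever the Kubernetes node list API returns, and the `rhel9.4` style tag format is assumed from the `rhel9.x` pattern mentioned in this issue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for a GPU node and its detected OS version."""
    name: str
    os_id: str       # e.g. "rhel"
    os_version: str  # e.g. "9.4"


def minor_version_tag(node: Node) -> str:
    """Build a minor-version-specific tag such as 'rhel9.4' (assumed format)."""
    return f"{node.os_id}{node.os_version}"


def select_driver_tag(gpu_nodes: list[Node]) -> str:
    """Model of the current behavior: only the first listed node is consulted."""
    if not gpu_nodes:
        raise ValueError("no GPU nodes found")
    return minor_version_tag(gpu_nodes[0])


nodes = [Node("gpu-a", "rhel", "9.4"), Node("gpu-b", "rhel", "9.7")]

# The same cluster, listed in two different orders, yields two different tags.
tag_first_order = select_driver_tag(nodes)
tag_second_order = select_driver_tag(list(reversed(nodes)))
print(tag_first_order, tag_second_order)  # rhel9.4 rhel9.7
```

Because the result depends only on `gpu_nodes[0]`, any change in list order between reconciliations can change the selected tag.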

Currently, this behavior affects RHEL 8 and RHEL 9 because GPU Operator uses minor-version-specific image tags such as:

  • rhel8.x
  • rhel9.x

RHEL 10 already uses major-version-based image selection logic and is not affected.

Enhance GPU Operator ClusterPolicy logic to normalize RHEL version handling for RHEL 8 and RHEL 9 by using only the RHEL major version for image selection:

  • All RHEL 8.x GPU nodes should use image tag rhel8
  • All RHEL 9.x GPU nodes should use image tag rhel9
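
A minimal sketch of the proposed normalization, assuming the node's OS version is available as a string such as `9.4`. The function name `rhel_image_tag` and the parsing rules are hypothetical and only illustrate the idea of discarding the minor version. It is not taken from the operator's source.

```python
import re


def rhel_image_tag(version: str) -> str:
    """Return a major-version image tag for a RHEL version string.

    '8.6' and '8.10' both map to 'rhel8'; '9.4' and '9.7' both map to 'rhel9'.
    Per this issue, RHEL 10 already uses major-version tags, so applying the
    same rule to '10.x' is consistent with the existing behavior.
    """
    match = re.fullmatch(r"(\d+)(?:\.\d+)?", version.strip())
    if match is None:
        raise ValueError(f"unrecognized RHEL version: {version!r}")
    return f"rhel{match.group(1)}"


for v in ("8.6", "8.10", "9.4", "9.7"):
    print(v, "->", rhel_image_tag(v))
```

Using a full match on `MAJOR` or `MAJOR.MINOR` means malformed input fails loudly instead of silently producing an unexpected tag.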

This enhancement removes the dependency on which GPU node happens to be selected during reconciliation and ensures consistent image selection across mixed minor-version deployments.

Expected Behavior

| GPU Node OS Version | Image Tag Used |
| --- | --- |
| RHEL 8.6 | rhel8 |
| RHEL 8.10 | rhel8 |
| RHEL 9.4 | rhel9 |
| RHEL 9.7 | rhel9 |
| RHEL 9.8 | rhel9 |
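
The expected behavior can also be written as a small table-driven check. The sketch below is illustrative only: `major_tag` is a hypothetical helper, and the permutation loop simply models "whichever node is listed first" for a mixed RHEL 9 cluster.

```python
from itertools import permutations

# Rows from the expected-behavior table: (node OS version, expected image tag).
EXPECTED = [
    ("8.6", "rhel8"),
    ("8.10", "rhel8"),
    ("9.4", "rhel9"),
    ("9.7", "rhel9"),
    ("9.8", "rhel9"),
]


def major_tag(version: str) -> str:
    """Hypothetical helper: keep only the major part of 'MAJOR.MINOR'."""
    return "rhel" + version.split(".", 1)[0]


# Every row of the table maps to its expected tag.
table_ok = all(major_tag(version) == tag for version, tag in EXPECTED)

# For a mixed RHEL 9 cluster, the tag is the same no matter which node
# happens to come first in the list.
rhel9_versions = ["9.4", "9.7", "9.8"]
tags_seen = {major_tag(order[0]) for order in permutations(rhel9_versions)}
print(table_ok, tags_seen)  # True {'rhel9'}
```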

Benefits

  • Eliminates nondeterministic image selection caused by unordered GPU node listing
  • Supports mixed RHEL minor-version GPU node deployments when using ClusterPolicy
  • Aligns RHEL 8 and RHEL 9 behavior with existing RHEL 10 logic
  • Reduces the number of driver/toolkit image variants that must be maintained and published
  • Simplifies operational management and testing

The enhancement should ensure that RHEL image selection logic derives only the RHEL major version for RHEL 8 and RHEL 9 environments when using ClusterPolicy.

Metadata

Labels

  • enhancement: Improvements to existing features, performance, or usability (not bug fixes or new features).
  • lifecycle/frozen
  • needs-triage: issue or PR has not been assigned a priority-px label
