Add KEP to support granular suspend/resume for individual Jobs within a JobSet#1178
imreddy13 wants to merge 13 commits into kubernetes-sigs:main
/ok-to-test |
|
|
||
| N/A | ||
|
|
||
| ## Implementation History |
There was a problem hiding this comment.
IMO we should have an alternative.
Is it possible to support this via https://github.com/kubernetes-sigs/kueue/tree/main/pkg/controller/jobs/jobset or by adding this to a proper Job interface for Kueue?
I suspect that this wouldn't be useful for LWS or RayJobs/RayClusters.
There was a problem hiding this comment.
@kannon92 I think an alternative is to leverage the RestartJob feature Giuseppe just built, but it doesn't work for some Kueue use cases.
@mimowo can comment on when they plan to introduce the partial admission functionality in Kueue via kubernetes-sigs/kueue#9940
|
The test failed because the TOC of another KEP is not up to date. #1177 should fix it shortly. |
|
Please rebase to get this fix |
|
@imreddy13 This proposal is well done, but I feel like I am missing the big picture. We can add this feature to JobSet, but I don't really understand the Kueue work to force these restarts. Would that be done after we have support for this in our API? We would also need to consider supporting this for other workloads, as I know that relocating failing jobs would be useful for any multi-node workload. |
|
On the Kueue side, we will be creating a Workload per JobSet Job, not for the whole JobSet. As a result, JobSet Jobs will be admitted somewhat independently, with an option for synchronized start/admission. The appropriate KEPs on the Kueue side are in the works. |
|
|
||
| ## Design Details | ||
|
|
||
| Initially we will support this via an annotation on the JobSet: `jobset.sigs.k8s.io/skip-suspend-reconciliation`. If this annotation is set, then the JobSet webhook enforces `jobSet.spec.suspend = nil` and the controller does not reconcile the `job.spec.suspend` field of child Jobs. |
There was a problem hiding this comment.
This API (the annotation) is expressed in terms of the JobSet controller implementation details. The API should express the user intent of how they want a JobSet to behave, not how the controller should execute on the behavior.
There was a problem hiding this comment.
Does the new name (ReconciliationMode) work better? Also open to other suggestions.
|
|
||
| We cannot use `jobSet.spec.suspend = nil` as a signal to not reconcile the field `job.spec.suspend` of child Jobs because the JobSet controller currently treats `jobSet.spec.suspend = nil` as a synonym for `jobSet.spec.suspend = false`. If users are unsuspending JobSets by setting `jobSet.spec.suspend = nil` instead of `false`, this change would break their workflow. | ||
|
|
||
| This simple annotation-based solution gives the use case time to evolve before proposing a more complete API (such as an `updatePolicy` field in the future). |
There was a problem hiding this comment.
I am concerned about continuing to use annotations as a temporary API, which in the exclusive placement case never graduated to a proper API on the spec.
There was a problem hiding this comment.
I agree that an API is cleaner, and customers would not need to be educated to move away from the annotation in the future. My main concern is that the future Kueue design may not be compatible with the API we introduce, so I preferred the annotation. WDYT @ahg-g? The Kueue issue for reference: kubernetes-sigs/kueue#9940
There was a problem hiding this comment.
This is still unaddressed.
I also don't love the overuse of annotations.
|
|
||
| #### Future API Evolution | ||
|
|
||
| If the `jobset.sigs.k8s.io/skip-suspend-reconciliation` annotation proves successful, we may promote this configuration to a strongly-typed API field within the `JobSetSpec`. A dedicated policy field (e.g. `UpdatePolicy`) would cleanly move this behavior out of annotations, providing better validation, discoverability via OpenAPI schemas, and extensibility if other child Job fields need decoupled reconciliation in the future. |
There was a problem hiding this comment.
I am concerned about this gaining adoption and then having to go through the process of migrating users to the new API.
There was a problem hiding this comment.
I agree. I would prefer a feature gate with a real API if we have real use cases and customer demand.
| // reconciles the `suspend` field of child Jobs to match the JobSet. | ||
| // Valid values are "Sync" (default) and "Ignore". | ||
| // +optional | ||
| SuspendReconciliation *SuspendReconciliationMode `json:"suspendReconciliation,omitempty"` |
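A minimal sketch of how the field in this hunk could be typed and defaulted, assuming only what the diff states: the field name, its pointer type, and the two values "Sync" (default) and "Ignore". The constant names, the trimmed `JobSetSpec`, and the helpers `effectiveMode` and `validateMode` are hypothetical and are not taken from the PR.

```go
package main

import "fmt"

// SuspendReconciliationMode is the type referenced by the field in the diff.
type SuspendReconciliationMode string

// Constant names are hypothetical; the values come from the diff comment.
const (
	SuspendReconciliationSync   SuspendReconciliationMode = "Sync"
	SuspendReconciliationIgnore SuspendReconciliationMode = "Ignore"
)

// JobSetSpec is a trimmed stand-in for the real spec, for illustration only.
type JobSetSpec struct {
	Suspend               *bool                      `json:"suspend,omitempty"`
	SuspendReconciliation *SuspendReconciliationMode `json:"suspendReconciliation,omitempty"`
}

// effectiveMode applies the documented default: an unset field means "Sync".
func effectiveMode(spec JobSetSpec) SuspendReconciliationMode {
	if spec.SuspendReconciliation == nil {
		return SuspendReconciliationSync
	}
	return *spec.SuspendReconciliation
}

// validateMode rejects anything other than the two documented values.
func validateMode(mode SuspendReconciliationMode) error {
	switch mode {
	case SuspendReconciliationSync, SuspendReconciliationIgnore:
		return nil
	default:
		return fmt.Errorf("unsupported suspendReconciliation value %q", mode)
	}
}

func main() {
	ignore := SuspendReconciliationIgnore
	fmt.Println(effectiveMode(JobSetSpec{}))                               // Sync
	fmt.Println(effectiveMode(JobSetSpec{SuspendReconciliation: &ignore})) // Ignore
	fmt.Println(validateMode("Bogus") != nil)                              // true
}
```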
|
We created a corresponding issue on the Kueue side (kubernetes-sigs/kueue#9940). It should give more context why this is needed and how this will work in the bigger picture. |
Co-authored-by: Kevin Hannon <kehannon@redhat.com>
kannon92
left a comment
There was a problem hiding this comment.
Can you comment on how this feature will work with DependsOn?
It may be good to also enumerate the behavior with our other policies (Failure, Success, PVC).
Some of these may be obvious, but I think we should think through the interactions. Ideally we should have test cases too.
JobSet failure policy applies to suspended Jobs too, i.e. if Job A is suspended and Job B fails, Job A will also be restarted. If a Job fails and is then suspended by Kueue before the JobSet controller processes the failure, the controller will still respect the FailurePolicy. As a follow-up, we will work with @mwielgus to support failure policy actions with Kueue suspend/resume. We can also add a new field to track the number of per-Job restarts, counting both suspensions and failure policy restarts.
If a SuccessPolicy targets a specific ReplicatedJob (e.g., "Succeed if the 'leader' Job completes"), suspending Jobs in other ReplicatedJobs will not prevent the JobSet from succeeding once the leader finishes. Once a JobSet meets its SuccessPolicy, the controller deletes or cleans up all Jobs including suspended ones.
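The two interactions described above can be sketched as a small Go model. This is an illustration under stated assumptions, not the controller's implementation: the `Job` struct and the helpers `jobsToRestart` and `successPolicyMet` are hypothetical, and the sketch deliberately does not model what happens to a Job's suspended state after a restart.

```go
package main

import "fmt"

// Job is a hypothetical, simplified view of a child Job.
type Job struct {
	Name          string
	ReplicatedJob string
	Suspended     bool
	Failed        bool
	Completed     bool
}

// jobsToRestart models "if Job A is suspended and Job B fails, Job A will
// also be restarted": one failure selects every Job, suspended or not.
func jobsToRestart(jobs []Job) []string {
	anyFailed := false
	for _, j := range jobs {
		if j.Failed {
			anyFailed = true
			break
		}
	}
	if !anyFailed {
		return nil
	}
	names := make([]string, 0, len(jobs))
	for _, j := range jobs {
		names = append(names, j.Name)
	}
	return names
}

// successPolicyMet models a SuccessPolicy that targets one ReplicatedJob:
// only Jobs of that ReplicatedJob are inspected, so suspended Jobs in
// other ReplicatedJobs do not block success.
func successPolicyMet(jobs []Job, target string) bool {
	found := false
	for _, j := range jobs {
		if j.ReplicatedJob != target {
			continue
		}
		found = true
		if !j.Completed {
			return false
		}
	}
	return found
}

func main() {
	failing := []Job{
		{Name: "leader-0", ReplicatedJob: "leader", Suspended: true},
		{Name: "workers-0", ReplicatedJob: "workers", Failed: true},
	}
	fmt.Println(jobsToRestart(failing)) // [leader-0 workers-0]

	succeeding := []Job{
		{Name: "leader-0", ReplicatedJob: "leader", Completed: true},
		{Name: "workers-0", ReplicatedJob: "workers", Suspended: true},
	}
	fmt.Println(successPolicyMet(succeeding, "leader")) // true
}
```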
Test Cases
@kannon92 PTAL. If these make sense to you, I will update the KEP with them. |
|
Thank you for that detailed answer. Please go ahead and add that to the KEP. |
Added! @kannon92 @GiuseppeTT PTAL |
Looks good to me. I'll leave the approval for @kannon92. I have a tentative implementation for it in #1202. /lgtm |
| Initially we will support this via an annotation on the JobSet: `jobset.sigs.k8s.io/reconciliation-mode: Independent`. If this annotation is set, then the JobSet webhook enforces `jobSet.spec.suspend = nil` and the controller does not reconcile the `job.spec.suspend` field of child Jobs. | ||
|
|
There was a problem hiding this comment.
Instead of introducing an annotation, why can't we just add a validation webhook that disallows setting the suspend API on individual ReplicatedJobs when this feature is disabled?
If the goal is to allow users/Kueue to suspend ReplicatedJobs, why not just allow updates to that field?
There was a problem hiding this comment.
Because it is not just about update validation in the webhook, but also about signaling to the operator whether or not to reconcile.
There was a problem hiding this comment.
Why can't we just check whether the JobSet reconciler was triggered by a `job.spec.suspend` update and, if so, end the reconcile loop?
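As an illustration of the "signaling to the operator whether or not to reconcile" point in this thread, here is a minimal Go sketch of a suspend-sync step gated on the annotation from the quoted KEP text. The annotation key and value come from that text; the `childJob` type and the helpers `shouldSyncSuspend` and `reconcileSuspend` are hypothetical and do not reflect the actual JobSet controller code.

```go
package main

import "fmt"

const (
	// Key and value as quoted in the KEP text under review.
	reconciliationModeAnnotation = "jobset.sigs.k8s.io/reconciliation-mode"
	independentMode              = "Independent"
)

// childJob is a hypothetical, simplified stand-in for a child Job.
type childJob struct {
	Name    string
	Suspend bool
}

// shouldSyncSuspend reports whether the controller should propagate the
// JobSet-level suspend value to child Jobs.
func shouldSyncSuspend(annotations map[string]string) bool {
	return annotations[reconciliationModeAnnotation] != independentMode
}

// reconcileSuspend returns the child Jobs as they would look after one pass.
// In Independent mode it leaves every child's suspend value untouched.
func reconcileSuspend(annotations map[string]string, jobSetSuspend *bool, jobs []childJob) []childJob {
	out := make([]childJob, len(jobs))
	copy(out, jobs)
	if !shouldSyncSuspend(annotations) {
		return out
	}
	// A nil JobSet suspend is treated like false, as the KEP text states.
	desired := jobSetSuspend != nil && *jobSetSuspend
	for i := range out {
		out[i].Suspend = desired
	}
	return out
}

func main() {
	jobs := []childJob{{"workers-0", true}, {"workers-1", false}}
	independent := map[string]string{reconciliationModeAnnotation: independentMode}

	fmt.Println(reconcileSuspend(independent, nil, jobs)) // [{workers-0 true} {workers-1 false}]
	fmt.Println(reconcileSuspend(nil, nil, jobs))         // [{workers-0 false} {workers-1 false}]
}
```

One relevant detail: in controller-runtime, `Reconcile` receives only the object's namespaced name, not the event that triggered it, so a sketch like this keys off state stored on the JobSet rather than off the trigger.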
|
|
||
| ## Design Details | ||
|
|
||
| Initially we will support this via an annotation on the JobSet: `jobset.sigs.k8s.io/reconciliation-mode: Independent`. If this annotation is set, then the JobSet webhook enforces `jobSet.spec.suspend = nil` and the controller does not reconcile the `job.spec.suspend` field of child Jobs. |
There was a problem hiding this comment.
I recommend naming it `jobset.sigs.k8s.io/suspend-reconciliation-mode`, since this is specific to suspend reconciliation and nothing else.
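The webhook half of the quoted design ("the JobSet webhook enforces `jobSet.spec.suspend = nil`") could look roughly like the Go sketch below. It uses the annotation key as written in the hunk, not the rename suggested in this comment; the function name `validateSuspend` and the error message are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	// Key and value as quoted in the KEP hunk under review.
	reconciliationModeAnnotation = "jobset.sigs.k8s.io/reconciliation-mode"
	independentMode              = "Independent"
)

// validateSuspend is a hypothetical validation step: when the JobSet opts
// into Independent mode, spec.suspend must be left unset.
func validateSuspend(annotations map[string]string, suspend *bool) error {
	if annotations[reconciliationModeAnnotation] == independentMode && suspend != nil {
		return errors.New("spec.suspend must be unset when reconciliation-mode is Independent")
	}
	return nil
}

func main() {
	no := false
	independent := map[string]string{reconciliationModeAnnotation: independentMode}

	fmt.Println(validateSuspend(independent, nil) == nil) // true
	fmt.Println(validateSuspend(independent, &no) == nil) // false
	fmt.Println(validateSuspend(nil, &no) == nil)         // true
}
```

Note that in this sketch an explicit `false` is rejected in Independent mode as well, since the quoted text requires the field to be `nil` rather than merely not `true`.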
| creation-date: 2026-03-05 | ||
| reviewers: | ||
| - "@kannon92" | ||
| - "@ahg" |
There was a problem hiding this comment.
| - "@ahg" | |
| - "@ahg-g" |
|
|
||
| #### Future API Evolution | ||
|
|
||
| If the `jobset.sigs.k8s.io/reconciliation-mode: Independent` annotation proves successful, we may promote this configuration to a strongly-typed API field within the `JobSetSpec`. A dedicated policy field (e.g. `UpdatePolicy`) would cleanly move this behavior out of annotations, providing better validation, discoverability via OpenAPI schemas, and extensibility if other child Job fields need decoupled reconciliation in the future. |
There was a problem hiding this comment.
IMO a proper API should be a requirement for this.
Could we allow the annotation in alpha behind a feature gate, but require moving to a real API before promoting the feature to beta?
|
What type of PR is this?
/kind feature
What this PR does / why we need it:
This KEP introduces fine-grained suspend and resume capabilities at the individual Job level within a JobSet. This allows specific child Jobs to be independently suspended and resumed without altering the execution state of other actively running Jobs within the same JobSet.
This feature is necessary to support advanced scheduling and placement optimizations in Kueue, particularly around topology awareness.
If the underlying topology of a specific Job's placement becomes fragmented or broken, Kueue needs the ability to independently relocate (recreate) that single Job. Currently, lacking granular control, adjusting the placement of one Job disrupts the others in the JobSet. By enabling individual Job suspension, Kueue will be able to suspend a specific Job, safely recreate it in a new topological location, and resume it, all without affecting the uninterrupted execution of the other sibling Jobs in the JobSet.
Which issue(s) this PR fixes:
Fixes #1172
Does this PR introduce a user-facing change?
Yes.