
Add KEP to support granular suspend/resume for individual Jobs within a JobSet #1178

Open
imreddy13 wants to merge 13 commits into kubernetes-sigs:main from imreddy13:no-suspend

Conversation

@imreddy13
Contributor

@imreddy13 imreddy13 commented Mar 5, 2026

What type of PR is this?

/kind feature

What this PR does / why we need it:

This KEP introduces fine-grained suspend and resume capabilities at the individual Job level within a JobSet. This allows specific child Jobs to be independently suspended and resumed without altering the execution state of other actively running Jobs within the same JobSet.

This feature is necessary to support advanced scheduling and placement optimizations in Kueue, particularly around topology awareness.

If the underlying topology of a specific Job's placement becomes fragmented or broken, Kueue needs the ability to independently relocate (recreate) that single Job. Currently, lacking granular control, adjusting the placement of one Job disrupts the others in the JobSet. By enabling individual Job suspension, Kueue will be able to suspend a specific Job, safely recreate it in a new topological location, and resume it, all without affecting the uninterrupted execution of the other sibling Jobs in the JobSet.

Which issue(s) this PR fixes:

Fixes #1172

Does this PR introduce a user-facing change?

Yes.

Action required to use this feature: add the annotation `jobset.sigs.k8s.io/reconciliation-mode: Independent` to the JobSet's `metadata.annotations`.
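
As a rough illustration of the opt-in described above, the check a controller or webhook might perform could look like the sketch below. The annotation key and value come from this PR; the helper name `isIndependent` and the exact, case-sensitive match are assumptions made for illustration, not JobSet code.

```go
package main

import "fmt"

// Annotation key and value as proposed in this PR.
const (
	reconciliationModeKey = "jobset.sigs.k8s.io/reconciliation-mode"
	independentMode       = "Independent"
)

// isIndependent is a hypothetical helper (not part of JobSet). It reports
// whether a JobSet's metadata annotations opt in to independent suspend
// handling. Exact, case-sensitive matching is an assumption.
func isIndependent(annotations map[string]string) bool {
	// Reading from a nil map is safe in Go and yields the zero value.
	return annotations[reconciliationModeKey] == independentMode
}

func main() {
	optedIn := map[string]string{reconciliationModeKey: independentMode}
	fmt.Println(isIndependent(optedIn)) // true
	fmt.Println(isIndependent(nil))     // false
}
```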

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Mar 5, 2026
@k8s-ci-robot k8s-ci-robot requested a review from ahg-g March 5, 2026 20:14
@netlify

netlify Bot commented Mar 5, 2026

Deploy Preview for kubernetes-sigs-jobset canceled.

| Name | Link |
| --- | --- |
| 🔨 Latest commit | ee30ddf |
| 🔍 Latest deploy log | https://app.netlify.com/projects/kubernetes-sigs-jobset/deploys/69c6f199d64d4200074301ba |

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: imreddy13
Once this PR has been reviewed and has the lgtm label, please assign giuseppett for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested a review from kannon92 March 5, 2026 20:14
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 5, 2026
@k8s-ci-robot
Contributor

Hi @imreddy13. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Tip

We noticed you've done this a few times! Consider joining the org to skip this step and gain /lgtm and other bot rights. We recommend asking approvers on your previous PRs to sponsor you.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Mar 5, 2026
@GiuseppeTT
Contributor

GiuseppeTT commented Mar 5, 2026

@mwielgus @mwysokin @mimowo can you take a look for the Kueue side?

@kannon92
Contributor

kannon92 commented Mar 5, 2026

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 5, 2026
Comment thread keps/1172-JobSuspendResume/README.md Outdated
Comment thread keps/1172-JobSuspendResume/README.md Outdated
Comment thread keps/1172-JobSuspendResume/kep.yaml Outdated
Comment thread keps/1172-JobSuspendResume/kep.yaml Outdated
Comment thread keps/1172-JobSuspendResume/kep.yaml Outdated
Comment thread keps/1172-JobSuspendResume/kep.yaml Outdated
Comment thread keps/1172-JobSuspendResume/README.md Outdated
Comment thread keps/1172-JobSuspendResume/README.md

N/A

## Implementation History
Contributor

IMO We should have an alternative.

Is it possible to support this via https://github.com/kubernetes-sigs/kueue/tree/main/pkg/controller/jobs/jobset or adding this into a proper Job interface for Kueue?

I suspect that this wouldn't be useful for LWS or RayJobs/RayClusters.

Contributor

Also cc @andreyvelich for TrainJob.

Contributor Author

@kannon92 I think an alternative is to leverage the RestartJob feature Giuseppe just built, but it doesn't work for some Kueue use cases.

@mimowo can comment on when they plan to introduce the partial admission functionality in Kueue via kubernetes-sigs/kueue#9940

@GiuseppeTT
Contributor

The test failed because the toc of another KEP is not up to date. #1177 should fix it shortly.

@GiuseppeTT
Contributor

Please rebase to get this fix

@kannon92
Contributor

kannon92 commented Mar 6, 2026

@imreddy13 This proposal was well done but I feel like I am missing the big picture on this.

We can add this feature into JobSet but I don't really understand the kueue work to force these restarts. Would that be done after we have support for this in our API?

And we would need to consider other workloads for supporting this as I know that relocating failing jobs would be useful for any multi-node workload.

@GiuseppeTT
Contributor

> @imreddy13 This proposal was well done but I feel like I am missing the big picture on this.

@mwielgus @mwysokin @mimowo for context on Kueue.

@mwielgus

mwielgus commented Mar 6, 2026

> > @imreddy13 This proposal was well done but I feel like I am missing the big picture on this.
>
> @mwielgus @mwysokin @mimowo for context on Kueue.

On the Kueue side, we will be creating a Workload per JobSet Job, not for the whole JobSet. As a result, JobSet Jobs will be admitted somewhat independently, with an option for synchronized start/admission. The appropriate KEPs on the Kueue side are in the works.

Comment thread keps/1172-JobSuspendResume/README.md Outdated

## Design Details

Initially we will support this via an annotation on the JobSet: `jobset.sigs.k8s.io/skip-suspend-reconciliation`. If this annotation is set, then the JobSet webhook enforces `jobSet.spec.suspend = nil` and does not reconcile the field `job.spec.suspend` of child Jobs.
Contributor

This API (the annotation) is expressed in terms of the JobSet controller implementation details. The API should express the user intent of how they want a JobSet to behave, not how the controller should execute on the behavior.

Contributor Author

Does the new name (ReconciliationMode) work better? Also open to other suggestions.


We cannot use `jobSet.spec.suspend = nil` as a signal to not reconcile the field `job.spec.suspend` of child Jobs because the JobSet controller currently treats `jobSet.spec.suspend = nil` as a synonym for `jobSet.spec.suspend = false`. If users are unsuspending JobSets by setting `jobSet.spec.suspend = nil` instead of `false`, this change would break their workflow.
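
The nil-versus-false equivalence described above can be sketched as follows. This is a minimal illustration, assuming `suspend` is an optional `*bool` as the text implies; `effectiveSuspend` is a hypothetical helper name, not the controller's actual function.

```go
package main

import "fmt"

// effectiveSuspend is a hypothetical helper that mirrors the behavior
// described above: a nil jobSet.spec.suspend is treated the same as false,
// so nil cannot double as a "do not reconcile" signal.
func effectiveSuspend(suspend *bool) bool {
	return suspend != nil && *suspend
}

func main() {
	t, f := true, false
	fmt.Println(effectiveSuspend(nil)) // false: unset behaves like false
	fmt.Println(effectiveSuspend(&f))  // false
	fmt.Println(effectiveSuspend(&t))  // true
}
```

Because both `nil` and `false` collapse to the same result, a user who unsuspends by clearing the field is indistinguishable from one who sets it to `false`, which is why a separate opt-in signal is needed.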

This simple annotation-based solution gives the use case time to evolve before proposing a more complete API (such as an `updatePolicy` field in the future).
Contributor

I am concerned about continuing to use annotations as temporary api, which in the exclusive placement case never graduated to a proper API on the spec.

Contributor Author

I agree that an API is cleaner and customers would not need to be educated to move away from the annotation in the future. My main concern is that the Kueue design in the future may not be compatible with the API we introduce. So I preferred the annotation. Wdyt @ahg-g ? The Kueue issue for reference: kubernetes-sigs/kueue#9940

Contributor

This is still unaddressed.

I also don't love the overuse of annotations.

Comment thread keps/1172-JobSuspendResume/README.md Outdated

#### Future API Evolution

If the `jobset.sigs.k8s.io/skip-suspend-reconciliation` annotation proves successful, we may promote this configuration to a strongly-typed API field within the `JobSetSpec`. A dedicated policy field (e.g. `UpdatePolicy`) would cleanly move this behavior out of annotations, providing better validation, discoverability via OpenAPI schemas, and extensibility if other child Job fields need decoupled reconciliation in the future.
Contributor

I am concerned about this gaining adoption and having to go through the process of migrating users to the new api.

Contributor

I agree. I would prefer a feature gate with a real API if we have real use cases and customer demand.

Comment thread keps/1172-JobSuspendResume/README.md Outdated
// reconciles the `suspend` field of child Jobs to match the JobSet.
// Valid values are "Sync" (default) and "Ignore".
// +optional
SuspendReconciliation *SuspendReconciliationMode `json:"suspendReconciliation,omitempty"`
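
For context, the quoted field implies a string-typed mode with two values. A minimal sketch of how such a type and its default might be expressed is shown below; the constant names and the `resolveMode` helper are assumptions for illustration, and only the type name, the values "Sync" and "Ignore", and the "Sync" default come from the quoted snippet.

```go
package main

import "fmt"

// SuspendReconciliationMode follows the type name in the quoted snippet.
type SuspendReconciliationMode string

// Constant names are hypothetical; the values come from the quoted comment.
const (
	SuspendReconciliationSync   SuspendReconciliationMode = "Sync"
	SuspendReconciliationIgnore SuspendReconciliationMode = "Ignore"
)

// resolveMode is a hypothetical helper: an unset field falls back to the
// documented default ("Sync"), and unknown values are rejected.
func resolveMode(m *SuspendReconciliationMode) (SuspendReconciliationMode, error) {
	if m == nil {
		return SuspendReconciliationSync, nil
	}
	switch *m {
	case SuspendReconciliationSync, SuspendReconciliationIgnore:
		return *m, nil
	default:
		return "", fmt.Errorf("invalid suspendReconciliation %q", *m)
	}
}

func main() {
	ignore := SuspendReconciliationIgnore
	mode, _ := resolveMode(nil)
	fmt.Println(mode) // Sync
	mode, _ = resolveMode(&ignore)
	fmt.Println(mode) // Ignore
}
```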
@mwielgus

We created a corresponding issue on the Kueue side (kubernetes-sigs/kueue#9940). It should give more context why this is needed and how this will work in the bigger picture.

imreddy13 and others added 2 commits March 17, 2026 12:02
Co-authored-by: Kevin Hannon <kehannon@redhat.com>
Co-authored-by: Kevin Hannon <kehannon@redhat.com>
Contributor

@kannon92 kannon92 left a comment

Can you comment on how this feature will work with depends on?

It may be good to also enumerate the behavior with our other policies (Failure, Success, PVC).

Some of these may be obvious but I think we should think through the interactions of this. Ideally we should have test cases too.

@imreddy13
Contributor Author

imreddy13 commented Mar 24, 2026

> Can you comment on how this feature will work with depends on?
>
> It may be good to also enumerate the behavior with our other policies (Failure, Success, PVC).
>
> Some of these may be obvious but I think we should think through the interactions of this. Ideally we should have test cases too.

  1. DependsOn
    With reconciliation-mode: Independent, suspending an already running child Job should have no effect on the dependsOn logic. The "gate" for dependsOn is only checked at the time of Job creation or re-creation on failure. It should be up to the scheduler to choose what order to suspend/unsuspend Jobs. This feature will still respect dependsOn ordering at Job creation time.

  2. FailurePolicy
    With reconciliation-mode: Independent, suspending a Job will not trigger the failure policy action for the ReplicatedJob since the Job doesn't "fail". Jobs being suspended or unsuspended will not count towards maxRestarts.

The JobSet failure policy also applies to suspended Jobs: if Job A is suspended and Job B fails, Job A will also be restarted. If a Job fails and is then suspended by Kueue before the JobSet controller processes the failure, the controller will still respect the FailurePolicy.

As a follow up, we will work with @mwielgus to support failure policy actions with Kueue suspend/resume. We can also add a new field to track number of per job restarts that includes suspensions + failure policy restarts.

  3. SuccessPolicy
    With reconciliation-mode: Independent, if the SuccessPolicy is set to All, the JobSet will never reach a Succeeded state as long as any Job is suspended.

If a SuccessPolicy targets a specific ReplicatedJob (e.g., "Succeed if the 'leader' Job completes"), suspending Jobs in other ReplicatedJobs will not prevent the JobSet from succeeding once the leader finishes.

Once a JobSet meets its SuccessPolicy, the controller deletes or cleans up all Jobs including suspended ones.

  4. PVCs
    With reconciliation-mode: Independent, when a Job is suspended, its Pods are terminated, but the Job object and its associated PVCs should persist. When the Job is resumed, the new Pods will mount the same PVCs.
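
Two of the interactions listed above can be modeled with a small sketch: a SuccessPolicy of All is blocked by any suspended (unfinished) Job, and suspension alone does not trigger the FailurePolicy. The `childJob` type and both function names are hypothetical simplifications, not JobSet's real types or controller logic.

```go
package main

import "fmt"

// childJob is a hypothetical, simplified view of a child Job's state.
type childJob struct {
	Name      string
	Suspended bool
	Succeeded bool
	Failed    bool
}

// successPolicyAllMet models a SuccessPolicy of All: every child Job must
// have succeeded, so a suspended Job that has not finished blocks success.
func successPolicyAllMet(jobs []childJob) bool {
	for _, j := range jobs {
		if !j.Succeeded {
			return false
		}
	}
	return true
}

// failurePolicyTriggered models the FailurePolicy rule above: only a failed
// Job triggers it, whether or not other Jobs are suspended.
func failurePolicyTriggered(jobs []childJob) bool {
	for _, j := range jobs {
		if j.Failed {
			return true
		}
	}
	return false
}

func main() {
	jobs := []childJob{
		{Name: "worker-0", Succeeded: true},
		{Name: "worker-1", Suspended: true},
	}
	fmt.Println(successPolicyAllMet(jobs))    // false: worker-1 is suspended
	fmt.Println(failurePolicyTriggered(jobs)) // false: nothing has failed
}
```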

Test Cases

  1. Basic Functionality
  • Independent Sync: Verify that when reconciliation-mode: Independent is set, changing JobSet.spec.suspend from false to true does not suspend Jobs that were already running, and vice versa.

  • Granular Control: Manually suspend 1 of 3 Jobs in a ReplicatedJob and verify the other 2 continue running. Verify the JobSet status remains Active.

  2. Policy Interactions with reconciliation-mode: Independent:
  • DependsOn + Suspension: Create a JobSet where Job B depends on Job A. Suspend one replica of Job A. Verify Job B is never created. Resume the replica, and verify Job B is then created.

  • FailurePolicy + Restart: Set a FailurePolicy with maxRestarts: 1. Suspend one Job, then force a failure in another Job. Verify the JobSet restarts and check if the previously suspended Job is recreated as "running" (demonstrating the state loss during restart).

  • SuccessPolicy (All): Suspend one Job and let all others succeed. Verify the JobSet does not transition to Succeeded. Unsuspend the Job, let it finish, and verify the JobSet then succeeds.

  • SuccessPolicy (Any): Define a policy where only the first ReplicatedJob needs to succeed. Suspend a Job in the second ReplicatedJob. Verify the JobSet still succeeds when the first finishes.

  3. PVC Retention with reconciliation-mode: Independent
    Suspend a Job that has a JobSet-managed PVC. Verify the PVC is not deleted. Resume the Job and verify it successfully mounts the same PVC.

  4. JobSet Deletion with reconciliation-mode: Independent: Delete a JobSet while some child Jobs are suspended. Verify all Jobs (running and suspended) and their associated resources are cleaned up correctly.

@kannon92 PTAL, if these make sense to you I will update the KEP with them

@kannon92
Contributor

Thank you for that detailed answer. Please go ahead and add that to the KEP.

@imreddy13
Contributor Author

> Thank you for that detailed answer. Please go ahead and add that to the KEP.

Added! @kannon92 @GiuseppeTT PTAL

@GiuseppeTT
Contributor

> > Thank you for that detailed answer. Please go ahead and add that to the KEP.
>
> Added! @kannon92 @GiuseppeTT PTAL

Looks good to me. I'll leave the approval for @kannon92. I have a tentative implementation for it in #1202.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 27, 2026
Comment on lines +67 to +68
Initially we will support this via an annotation on the JobSet: `jobset.sigs.k8s.io/reconciliation-mode: Independent`. If this annotation is set, then the JobSet webhook enforces `jobSet.spec.suspend = nil` and does not reconcile the field job.spec.suspend of child Jobs.

Member

@andreyvelich andreyvelich Mar 28, 2026

Instead of introducing an annotation, why can't we just add a validation webhook that doesn't allow setting the suspend API in the individual ReplicatedJobs when this feature is disabled?
If the goal is to allow users/Kueue to suspend ReplicatedJobs, why not just allow updates to that field?

Contributor

Because it is not just about update validation in the webhook, but also about signaling to the operator whether or not to reconcile.
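
The distinction between validating an update and telling the controller whether to reconcile can be sketched roughly as below. This is an illustrative model only: `desiredChildSuspend` and its parameters are hypothetical, and the real controller logic may differ.

```go
package main

import "fmt"

// desiredChildSuspend is a hypothetical helper that returns the suspend value
// the controller would leave on a child Job. In independent mode it keeps
// whatever the user or Kueue set on the child; otherwise it forces the child
// to follow the JobSet (nil is treated as false, as described in the KEP).
func desiredChildSuspend(independent bool, jobSetSuspend *bool, childSuspend bool) bool {
	if independent {
		return childSuspend
	}
	return jobSetSuspend != nil && *jobSetSuspend
}

func main() {
	suspended := true
	// Independent mode: the child stays suspended even though the JobSet
	// itself has no suspend value.
	fmt.Println(desiredChildSuspend(true, nil, true)) // true
	// Default mode: the child is forced to match the JobSet.
	fmt.Println(desiredChildSuspend(false, &suspended, false)) // true
}
```

Without an explicit signal such as the annotation, the controller in default mode would overwrite a child's `suspend` value on its next reconcile, regardless of what the webhook allowed.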

Member

Why can't we just check whether the JobSet reconciler was triggered by a job.spec.suspend update and, if so, end the reconcile loop?


## Design Details

Initially we will support this via an annotation on the JobSet: `jobset.sigs.k8s.io/reconciliation-mode: Independent`. If this annotation is set, then the JobSet webhook enforces `jobSet.spec.suspend = nil` and does not reconcile the field job.spec.suspend of child Jobs.
Contributor

I recommend naming it `jobset.sigs.k8s.io/suspend-reconciliation-mode`, since this is specific to suspend reconciliation, not everything else.

creation-date: 2026-03-05
reviewers:
- "@kannon92"
- "@ahg"
Contributor

Suggested change
- "@ahg"
- "@ahg-g"

Contributor

@kannon92 kannon92 left a comment

Since @ahg-g is a reviewer on this, I'd like a lgtm to make sure his comments are addressed.

Otherwise I think this is good to merge.


#### Future API Evolution

If the `jobset.sigs.k8s.io/reconciliation-mode: Independent` annotation proves successful, we may promote this configuration to a strongly-typed API field within the `JobSetSpec`. A dedicated policy field (e.g. `UpdatePolicy`) would cleanly move this behavior out of annotations, providing better validation, discoverability via OpenAPI schemas, and extensibility if other child Job fields need decoupled reconciliation in the future.
Contributor

IMO a proper API should be a requirement for this.

Could we allow the annotation for alpha, behind a feature gate, but require a move to a real API before promoting this feature to beta?

@k8s-ci-robot
Contributor

@imreddy13: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-jobset-test-e2e-main-1-36 | ee30ddf | link | true | `/test pull-jobset-test-e2e-main-1-36` |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


