-
Notifications
You must be signed in to change notification settings - Fork 264
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KEP 1618: Optional garbage collection of finished Workloads #2742
base: main
Are you sure you want to change the base?
KEP 1618: Optional garbage collection of finished Workloads #2742
Conversation
Hi @mwysokin. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
✅ Deploy Preview for kubernetes-sigs-kueue canceled.
|
nit, please say "KEP for #1618" in the PR description instead of "Fixes". Otherwise the robot will close the issue after merging the PR :) |
/assign |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First pass, it lgtm overall. Left some comments to improve it.
/ok-to-test |
…ories into 1 and move the description of the behavior to a new section under Design Details.
leading to unnecessary use of etcd storage and a gradual increase in Kueue's memory footprint. Some of these objects, | ||
like finished Workloads, do not contain any useful information that could be used for additional purposes, such as debugging. | ||
That's why, based on user feedback in [#1618](https://github.com/kubernetes-sigs/kueue/issues/1618) | ||
or [#1789](https://github.com/kubernetes-sigs/kueue/issues/1789), a mechanism for garbage collection of finished Workloads |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
QQ: I'm wondering if the KEP and the initial implementation tackle this issue, because the workloads in the context of this issue are not necessarily finished. It might still be ok to mention in motivation for the generic KEP on workload cleanups, but I would like to clarify if the first version will cover it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked and even though an extension to the code would be minimal I'd rather keep it as a separate task not to pollute and further delay this feature. I will update the KEP once I have something more polished then just a draft version of how the removal of orphaned workloads would look like. For now I'm going to remove the orphaned Workloads issue from the KEP. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for checking I think removing them from the KEP makes sense, because:
- they are not finished
- there is no point in keeping such workloads for >30min (or whatever is configured).
So, I consider Workloads corresponding to Jobs deleted with --cascede=orphan still consume resources #1789 to be just a bug which does not require any API changes or KEP,.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, on the second thought. I think it is better to honor user's choice of using "orphan" mode. So, I would suggest to just transition them to finished state. This way we could again link the issue and the KEP.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, one remaining question from me: https://github.com/kubernetes-sigs/kueue/pull/2742/files#r1708818068. @tenzen-y would you like to give it a pass?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
/approve
/hold
To wait for feedback from @tenzen-y as another pair of eyes.
LGTM label has been added. Git tree hash: 4b4f2f1108238ec7b0edef46b807fe0cc471a4d8
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: mimowo, mwysokin The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
LGTM from my side as well |
ACK |
/remove-kind api-change |
|
||
- Support the deletion of finished Workload objects. | ||
- Introduce configuration of global retention policies for Kueue-authored Kubernetes objects, | ||
starting with finished Workloads. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean that you will work on the deletion of other objects except the finished Workloads in the next iteration?
But, you mention that ''Support the deletion/expiration of any Kueue-authored Kubernetes object other than finished Workloads.'' in the Non-Goals section.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The goal was to include certain extensibility into the design so that other objects could be added without modifying root lvl API (Jobs
, etc.). That's the main reason why the API key for Workloads
is nested. I haven't seen any issues for removal of other completed object (other than orphaned Workloads
#1789) but if there's any interest in extending that feature I'd be happy to work on it but figuring out how to remove every object that Kueue owns wasn't the goal here, that's where the items in goals/non-goals came from. I'm also happy to work on any updates to this KEP if we decide to extend the cleanup process to other objects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would recommend limiting the scope for this KEP to Kueue managed objects, not only to Kueue Workload objects since actually, we aim to extend the API for the future extensibility for the other Kueue managed resources like AdmissionCheck.
The v1beta1 means that we can not bring breaking changes to APIs without bumping the API version.
So, I would like to take care of future API possibilities in this KEP.
So, my recommendation is to remove "starting with finished Workloads." from Goal, and move the statement to the proposal. And then, mention that we do not aim to add APIs for the Jobs like batch/job, RayJob as we discussed in another thread...
I'm wondering if we need to add another mechanisms for the Jobs retention in the future when we find the actual use cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. Let me propose another set of changes before the end of the week.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Introduce configuration of global retention policies for Kueue-authored Kubernetes objects
This sentence matches Workload
and ProvisioiningRequest
. We don't need to clean up provisioning requests explicitly, because they are children of the Workloads
.
So, for now, we only need to take care of Workloads
. But I agree we should keep the API extensible for supporting the cleanup of finished jobs, but that can be addressed in a separate KEP.
- **R**: In clusters with a large number of existing finished Workloads | ||
(thousands, tens of thousands, etc.), it may take a significant amount of time | ||
to delete all previously finished Workloads that qualify for deletion. Since the | ||
reconciliation loop is effectively synchronous, this may impact the reconciliation | ||
of new jobs. From the user's perspective, it may seem like Kueue's initialization | ||
time is very long for that initial run until all expired objects are deleted.\ | ||
**M**: Unfortunately, there is no easy way to mitigate this without complicating the code. | ||
A potential improvement would be to have a dedicated internal queue that handles deletion | ||
outside the reconciliation loop, or to delete expired objects in batches so that after | ||
a batch of deletions is completed, any remaining expired objects would be skipped until | ||
the next run of the loop. However, this would go against the synchronous nature of the reconciliation loop. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that we can mitigate this risk by setting the strictly qps only when admins enable this feature.
Once we limit the client qps and burst in the KueueConfiguration, we can mitigate the burst kube-apiserver load.
kueue/config/components/manager/controller_manager_config.yaml
Lines 21 to 23 in c6f181b
clientConnection: | |
qps: 50 | |
burst: 100 |
So, I would propose the mentioning limiting client qps when the admins enable this feature in the existing cluster.
Hence, could you mention this mitigation here?
--> | ||
|
||
|
||
### API |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TBH, I feel that this API looks like a half-bake to me.
Because when we add new policies to garbage collect other resources like integrations Jobs, we need to hard cord the API name here like FinishedRayClusterRetention
. But, the Kueue integration Jobs are the extensible interface for external Jobs like AppWrapper or in-house one. So, I guess that it is challenging to maintain retention policies for all integrations like:
[...]
objectRetentionPolicies:
finiishedWorkloadRetention: 10s
failedRayClusterRetention: 30s
# How to specify in-house or other OSS resource policy, here?
# In the upstream Kueue, we can not find all integrations
# since the integration framework could be implemented in other places regardless of private or public repositories.
[...]
From another perspective, even if we restrict the objective objects to specify retention duration, this API is a little bit extensible because in spite of that the API says that this API is used for retention policies for "object", but in this case, actually, we can specify only Workload retention policies.
objectRetentionPolicies:
finishedWorkloadRetention: 10s
succeededWorkloadRetention: 30s
In conclusion, I would ask which objects will you support in this feature through all iterations?
It's only Workload? or It's arbitrary objects?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TBH I started working on this KEP without too much vision but rather I wanted to implement #1618 and I wanted to make the API flexible enough so that it could be extended with other object types in the future. After reading what you said it indeed makes sense to come up with a consistent approach for all objects so that user's could benefit from Kueue's extensibility.
I don't have much experience with KEPs so I'm not sure what should happen now. For sure I'd like to ship the feature so that it doesn't slowly fades away in the in-review stage but at the same time if you feel that it's half-baked maybe I need to make a step back and come up with a more flexible approach. All feedback and guidance is really welcome 🙇
I'll do some research over the weekend and I'll try to propose something early next week.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a couple of questions / aspects I see touched here:
- should we make it generic to cover Job CRDs?
- should we nest the API inside
objectRetentionPolicies
?
Re 1: There is an intrinsic asymmetry between workloads and Job CRDs. Job CRDs are created by users who can specify spec.ttlSecondsAfterFinished
. Workloads are created by Kueue, and so users don't have this control. So, I think it is justified to start from Workloads only and have the API field specific to them.
Another API maintained by Kueue that comes to my mind is ProvisioningRequest. I think having finishedProvisioningRequestRetention
will make sense too (but we can leave it for later as currently we delete them eagerly anyway).
Re 2: I think it makes sense to nest the API inside objectRetentionPolicies
to be future-proof in case we have some other objects with lifetime maintained by Kueue (like provisioning requests, or different policies for finished workloads depending on success /failure).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re 1: There is an intrinsic asymmetry between workloads and Job CRDs. Job CRDs are created by users who can specify spec.ttlSecondsAfterFinished. Workloads are created by Kueue, and so users don't have this control. So, I think it is justified to start from Workloads only and have the API field specific to them.
Yes, I agree with you. TBH, I don't prefer to implement the feature to do garbage collection of Jobs since I'm thinking that deletion jobs are responsible for Job's own controller like ttlSecondsAfterFinished
. But this proposed API is extensible for Job API. So, I asked, "which objects will you support in this feature".
Another API maintained by Kueue that comes to my mind is ProvisioningRequest. I think having finishedProvisioningRequestRetention will make sense too (but we can leave it for later as currently we delete them eagerly anyway).
I agree with you. I believe that the separate and dedicated CRDs for this feature are overkill. So, I would recommend not introducing the dedicated CRD.
Re 2: I think it makes sense to nest the API inside objectRetentionPolicies to be future-proof in case we have some other objects with lifetime maintained by Kueue (like provisioning requests, or different policies for finished workloads depending on success /failure).
If we want to support additional Kueue-managed objects in the future,
could we add a dedicated field for each object something like this?
I guess that the criterion for garbage collection depends on the object’s kind.
workloadRetentionPolicy:
onConditions:
[metav1.Condition]
admissionCheckRetentionsPolicy:
onConditions:
[metav1.Condition]
onResourceFlavor:
name: xxx
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks a bit more complex than I originally anticipated but I agree that it's quite extensible. TBH I lack expertise when it comes evaluating whether this would be a consistent and a good way for Kueue to support that so I'll defer to other maintainers (@mimowo, @alculquicondor). Once we reach the consensus I'll be happy to implement it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer 2. It looks more future-proof.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds good to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a side note, although. To evolve API more easily, I want to find a way to guard the new API field :)
IIUC, k/k has the feature gate guard for the new API fields like here: https://github.com/kubernetes/kubernetes/blob/1c4f540669397d031d464877ec3d2b8ac6be9b2f/pkg/apis/core/types.go#L543-L548
We can not break the new API field without bumping the API version in the current API evolving steps. I think that a new API without a guard makes API evolution harder since we need to consider future compatibility and extensibility in the first implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The feature gate only implies that the field will be ignored when disabled. It doesn't mean that we can change the API in a future version of kubernetes, without changing the API version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I see. That was not my expected behavior.
Thanks for letting me know.
[//]: # (### Graduation Criteria) | ||
|
||
[//]: # () | ||
[//]: # (<!--) | ||
|
||
[//]: # () | ||
[//]: # (Clearly define what it means for the feature to be implemented and) | ||
|
||
[//]: # (considered stable.) | ||
|
||
[//]: # () | ||
[//]: # (If the feature you are introducing has high complexity, consider adding graduation) | ||
|
||
[//]: # (milestones with these graduation criteria:) | ||
|
||
[//]: # (- [Maturity levels (`alpha`, `beta`, `stable`)][maturity-levels]) | ||
|
||
[//]: # (- [Feature gate][feature gate] lifecycle) | ||
|
||
[//]: # (- [Deprecation policy][deprecation-policy]) | ||
|
||
[//]: # () | ||
[//]: # ([feature gate]: https://git.k8s.io/community/contributors/devel/sig-architecture/feature-gates.md) | ||
|
||
[//]: # ([maturity-levels]: https://git.k8s.io/community/contributors/devel/sig-architecture/api_changes.md#alpha-beta-and-stable-versions) | ||
|
||
[//]: # ([deprecation-policy]: https://kubernetes.io/docs/reference/using-api/deprecation-policy/) | ||
|
||
[//]: # (-->) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you revert these comment-outs since we need to revisit here when we want to graduate this feature to Beta/Stable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I explicitly commented it out instead of removing it so it could be revisited later but maybe it needs to be there from the beginning? It's my first KEP ever so I'm learning as I go and I'll follow your guidance 🙏
@mwysokin: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
/assign |
|
||
- Support the deletion of finished Workload objects. | ||
- Introduce configuration of global retention policies for Kueue-authored Kubernetes objects, | ||
starting with finished Workloads. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Introduce configuration of global retention policies for Kueue-authored Kubernetes objects
This sentence matches Workload
and ProvisioiningRequest
. We don't need to clean up provisioning requests explicitly, because they are children of the Workloads
.
So, for now, we only need to take care of Workloads
. But I agree we should keep the API extensible for supporting the cleanup of finished jobs, but that can be addressed in a separate KEP.
and make progress. | ||
--> | ||
|
||
- Support the deletion/expiration of any Kueue-authored Kubernetes object other than finished Workloads. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Support the deletion/expiration of any Kueue-authored Kubernetes object other than finished Workloads. | |
- Support the deletion/expiration of jobs or other Kubernetes objects not authored by Kueue. |
|
||
## Summary | ||
|
||
<!-- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove the comments
nitty-gritty. | ||
--> | ||
|
||
Add a new API called `objectRetentionPolicies`, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a new API called `objectRetentionPolicies`, | |
Add a new field called `objectRetentionPolicies` to the Kueue Configuration API, |
Add a new API called `objectRetentionPolicies`, | ||
which will enable specifying a retention policy for finished Workloads under an option | ||
called `FinishedWorkloadRetention`. This API section could be extended in the future | ||
with options for other Kueue-authored objects. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with options for other Kueue-authored objects. | |
with options for other Kueue-managed objects. |
--> | ||
|
||
- Initially, other naming was considered for config keys. Namely, instead of "Objects," | ||
the word "Resources" was used, but based on feedback from `@alculquicondor`, it was changed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we don't usually keep the names of the people who suggested the changes.
--> | ||
|
||
|
||
### API |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that API looks too complex. For the most part, we will be happy just cleaning up finished Workloads.
Let me propose the following:
objectRetentionPolicies:
workloads:
afterFinished: 10m
It leaves us the possibility to extend to other objects and to even add conditions, but I prefer we don't give that flexibility yet.
[//]: # (##### Prerequisite testing updates) | ||
|
||
[//]: # () | ||
[//]: # (<!--) | ||
|
||
[//]: # (Based on reviewers feedback describe what additional tests need to be added prior) | ||
|
||
[//]: # (implementing this enhancement to ensure the enhancements have also solid foundations.) | ||
|
||
[//]: # (-->) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can remove this if there aren't any requirements.
FYI I just wanted to let everyone know that I didn't abandon the work. I've been on vacation last week and I'll be back in September. Sorry for the delay 🙇 |
No worries. I'm looking forward to moving this discussion after you come back here! |
I'm really sorry it's been dragging. I'm coming back to it starting Monday. |
What type of PR is this?
/kind documentation
/kind api-change
What this PR does / why we need it:
Which issue(s) this PR fixes:
KEP for #1618
Special notes for your reviewer:
Does this PR introduce a user-facing change?