Fix firstBucketSize for CPU histogram #7554
base: master
Conversation
Welcome @glemsom!
Hi @glemsom. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: glemsom. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.
Hey @glemsom, thanks for the PR! A change in CheckpointVersion means that everyone using checkpoints as a history provider will lose their entire history, so wdyt @raywainman, @kwiesmueller, @adrianmoisey: should we make this conditionally enabled behind a feature flag? I'm kind of on the fence here: for CPU, I don't consider history samples to be very useful; the confidence for scaling up (which is the most critical concern) increases pretty quickly, so this shouldn't cause big problems. For memory, we only record one peak per half-life period, so the impact there is definitely bigger.
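For context on why the CPU confidence recovers quickly: the recommender inflates its upper-bound estimate by a factor that shrinks as history accumulates, so a fresh history stops triggering aggressive scale-ups within a few days. The sketch below only illustrates that idea, with illustrative constants rather than the recommender's exact formula.

```go
package main

import "fmt"

// scaleUpperBound illustrates the general idea behind the recommender's
// confidence interval: the upper-bound estimate is inflated by a factor
// that decays as more history is collected. The constants here are
// illustrative, not the exact values used by VPA.
func scaleUpperBound(targetMillicores, historyDays float64) float64 {
	confidence := historyDays // e.g. days of samples collected so far
	return targetMillicores * (1 + 1/confidence)
}

func main() {
	for _, days := range []float64{0.25, 1, 7, 30} {
		fmt.Printf("history=%5.2fd upperBound=%.1fm\n", days, scaleUpperBound(100, days))
	}
}
```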
/ok-to-test
🤔 I'm wondering if there's a way to give users a clean upgrade path.
Right, the buckets won't cleanly carry over from v3 to v4 here, so users would silently lose history, which could cause VPA to go a bit wonky. Putting this behind a flag seems like a good first step... If we wanted to make this the default, we'd need some sort of "migration" code.
What's slightly annoying is that the VerticalPodAutoscalerCheckpoint object itself doesn't seem to be able to handle multiple versions at once, i.e. there is no list or similar data structure containing each version. So I guess we would need a live migration: if the VPA was set to use v4 and a v3 checkpoint existed, it would need to read that checkpoint, migrate it, and write it back as v4. Maybe another option is maintaining two code paths for a few releases: new checkpoints are written as v4, but the recommender supports both v3 and v4.
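A live migration along those lines could look roughly like the sketch below. The types and helpers are hypothetical (the real VPA checkpoint and histogram APIs differ); it only illustrates the read, migrate, write-back flow being discussed.

```go
package checkpointmigration

// Histogram and Checkpoint are stand-ins for the real VPA types; they are
// defined here only to sketch the migration flow.
type Histogram interface {
	AddSample(value, weight float64)
	ForEachBucket(fn func(midpoint, weight float64))
}

type Checkpoint struct {
	Version   string
	Histogram Histogram
}

// loadCheckpoint returns a histogram in the new (v4) bucket layout. If the
// checkpoint was written with an older version, its bucket weights are
// replayed at their midpoints into a fresh histogram, and the checkpoint is
// rewritten as v4 so the migration happens only once.
func loadCheckpoint(cp *Checkpoint, newHistogram func() Histogram) Histogram {
	if cp.Version == "v4" {
		return cp.Histogram
	}
	migrated := newHistogram()
	cp.Histogram.ForEachBucket(func(midpoint, weight float64) {
		migrated.AddSample(midpoint, weight)
	})
	cp.Version = "v4"
	cp.Histogram = migrated
	return migrated
}
```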
I will have a look at putting this behind a flag, and having the
From my side, I'd prefer a plan for what to do after the feature flag has been added.
Wonder if you could do this "live" migration when reading the checkpoint.
Then subsequent checkpointing will be done using the new version.
We could iterate over the old checkpoint and re-add all of its samples to the new histogram.

Looking at what this change would do to the bucket boundaries, there would be some movement, leading to pretty much everything getting new recommendations (because the bucket boundaries have changed). See the example prints below, with start bucket values in millicores.

Before this change:

After this change:

In the end, we'd probably rather do this and indicate to users that they can expect a massive eviction, instead of hiding this behind a feature flag forever?
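The kind of comparison referenced above can be reproduced with a short program. The sketch below assumes the usual geometric-series layout of an exponential histogram, start(i) = firstBucketSize * (ratio^i - 1) / (ratio - 1), a growth ratio of 1.05, and first buckets of 10m versus 1m; these are assumptions for illustration, not necessarily the exact values used in this PR.

```go
package main

import (
	"fmt"
	"math"
)

// bucketStart computes the start value of bucket i for an exponential
// histogram laid out as a geometric series. This approximates the VPA
// histogram layout; it is not copied from its implementation.
func bucketStart(firstBucketSize, ratio float64, i int) float64 {
	return firstBucketSize * (math.Pow(ratio, float64(i)) - 1) / (ratio - 1)
}

func main() {
	const ratio = 1.05 // assumed growth ratio
	// 0.01 cores (10m) vs 0.001 cores (1m) first bucket, both assumed values.
	for _, firstBucket := range []float64{0.01, 0.001} {
		fmt.Printf("firstBucketSize=%vm:", firstBucket*1000)
		for i := 0; i < 8; i++ {
			fmt.Printf(" %.2fm", bucketStart(firstBucket, ratio, i)*1000)
		}
		fmt.Println()
	}
}
```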
Looking at the last time we lowered the starting bucket, we decreased it from

I do not feel comfortable writing a migration for this.

For what it is worth, what we did in our clusters was to simply wipe all checkpoints.
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:
- Mark this PR as fresh with /remove-lifecycle stale
- Close this PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:
- Mark this PR as fresh with /remove-lifecycle rotten
- Close this PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten
/remove-lifecycle rotten

One more question from my end: if you're really looking for CPU values this small being assigned to the workload, what's your expectation regarding evictions for CPU load changes? Should a Pod be evicted when the recommendation changes from one of these tiny values to another?

I'm asking because I'm also seeing efforts in the opposite direction with #7682, which would e.g. allow people to round to the nearest 10, 50 or even 100 mCores, depending on how much they care about these minimal differences in CPU requests.
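For reference, the rounding idea in #7682 amounts to snapping a recommendation to a multiple of a chosen step. A minimal sketch of that idea follows; it is not the code proposed in that PR, and the function name is made up.

```go
package main

import "fmt"

// roundUpMillicores snaps a CPU recommendation (in millicores) up to the
// nearest multiple of step, e.g. 3m -> 10m for a 10m step. Illustration
// only; not the implementation from #7682.
func roundUpMillicores(value, step int64) int64 {
	if step <= 1 {
		return value
	}
	return ((value + step - 1) / step) * step
}

func main() {
	for _, step := range []int64{10, 50, 100} {
		fmt.Printf("step=%3dm: 3m -> %dm, 147m -> %dm\n",
			step, roundUpMillicores(3, step), roundUpMillicores(147, step))
	}
}
```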
Hello @voelzmo, I found this PR because I have the same issue: loads of small workloads with very small CPU usage. I'd love to use VPA, but having 10m as the minimum would basically double or triple the requested CPU of my cluster.
I agree it might get tricky and cause a lot of pod restarts, but that's kinda what you sign up for when using VPA in Auto mode. I personally would only use VPA in Initial mode with such low usage to avoid that issue. Thanks, I'm looking forward to this feature 😄
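To make the "double or triple" estimate concrete, here is a tiny sketch with made-up usage numbers showing how a 10m floor inflates the summed requests of many small containers.

```go
package main

import "fmt"

// With many containers that actually need only a few millicores, clamping
// every recommendation to a 10m floor multiplies the cluster-wide CPU
// requests. The usage numbers below are made up for the example.
func main() {
	usage := []int64{2, 3, 3, 4, 5} // actual per-container usage in millicores (assumed)
	const floor int64 = 10          // current effective minimum recommendation

	var raw, floored int64
	for _, u := range usage {
		raw += u
		if u < floor {
			floored += floor
		} else {
			floored += u
		}
	}
	fmt.Printf("sum of actual usage: %dm, sum with 10m floor: %dm (%.1fx)\n",
		raw, floored, float64(floored)/float64(raw))
}
```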
I can see how using VPA in Initial mode for containers with these small values is definitely useful. I don't have a strong opinion on whether this justifies the work needed to migrate the existing checkpoints for everyone, but maybe others do. @omerap12 @adrianmoisey
What type of PR is this?
/kind bug
What this PR does / why we need it:
Adjust firstBucketSize for CPU histograms to support buckets smaller than 10m.
Which issue(s) this PR fixes:
Fixes #6415
Special notes for your reviewer:
Does this PR introduce a user-facing change?
NONE