Fix firstBucketSize for CPU histogram #7554
base: master
Conversation
Welcome @glemsom!
Hi @glemsom. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: glemsom. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.
Hey @glemsom, thanks for the PR! A change in CheckpointVersion means that everyone using checkpoints as a history provider will lose their entire history, so wdyt @raywainman, @kwiesmueller, @adrianmoisey: should we make this conditionally enabled behind a feature flag? I'm kind of on the fence here: for CPU, I don't consider history samples to be very useful; the confidence for scaling up (which is the most critical concern) increases pretty quickly, so this shouldn't cause big problems. For memory, we only record one peak per half-life period, so the impact there is definitely bigger.
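For context on why the CPU confidence recovers quickly: the recommender inflates its upper-bound estimate by a factor that shrinks as history accumulates, so a fresh history stops triggering aggressive scale-ups within a few days. The sketch below only illustrates that idea, with illustrative constants rather than the recommender's exact formula.

```go
package main

import "fmt"

// scaleUpperBound illustrates the general idea behind the recommender's
// confidence interval: the upper-bound estimate is inflated by a factor
// that decays as more history is collected. The constants here are
// illustrative, not the exact values used by VPA.
func scaleUpperBound(targetMillicores, historyDays float64) float64 {
	confidence := historyDays // e.g. days of samples collected so far
	return targetMillicores * (1 + 1/confidence)
}

func main() {
	for _, days := range []float64{0.25, 1, 7, 30} {
		fmt.Printf("history=%5.2fd upperBound=%.1fm\n", days, scaleUpperBound(100, days))
	}
}
```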
/ok-to-test
🤔 I'm wondering if there's a way to give users a clean upgrade path.
Right, the buckets won't cleanly carry over from v3 to v4 here, so users would silently lose history, which could cause VPA to go a bit wonky. Putting this behind a flag seems like a good first step... If we wanted to make this the default, we'd need some sort of "migration" code.
What's slightly annoying is that the VerticalPodAutoscalerCheckpoint object itself doesn't seem to be able to handle multiple versions at once, i.e. there is no list or similar data structure containing each version. So I guess we would need a live migration: if the VPA was set to use v4 and a v3 checkpoint existed, it would need to read that checkpoint, migrate it, and write it back as v4. Maybe another option is maintaining two code paths for a few releases: new checkpoints are written as v4, but the recommender supports both v3 and v4.
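A live migration along those lines could look roughly like the sketch below. The types and helpers are hypothetical (the real VPA checkpoint and histogram APIs differ); it only illustrates the read, migrate, write-back flow being discussed.

```go
package checkpointmigration

// Histogram and Checkpoint are stand-ins for the real VPA types; they are
// defined here only to sketch the migration flow.
type Histogram interface {
	AddSample(value, weight float64)
	ForEachBucket(fn func(midpoint, weight float64))
}

type Checkpoint struct {
	Version   string
	Histogram Histogram
}

// loadCheckpoint returns a histogram in the new (v4) bucket layout. If the
// checkpoint was written with an older version, its bucket weights are
// replayed at their midpoints into a fresh histogram, and the checkpoint is
// rewritten as v4 so the migration happens only once.
func loadCheckpoint(cp *Checkpoint, newHistogram func() Histogram) Histogram {
	if cp.Version == "v4" {
		return cp.Histogram
	}
	migrated := newHistogram()
	cp.Histogram.ForEachBucket(func(midpoint, weight float64) {
		migrated.AddSample(midpoint, weight)
	})
	cp.Version = "v4"
	cp.Histogram = migrated
	return migrated
}
```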
I will have a look at putting this behind a flag, and having the
From my side, I'd prefer a plan for what to do after the feature flag has been added.
Wonder if you could do this "live" migration when reading the checkpoint.
Then subsequent checkpointing will be done using the new version.
We could iterate over the old checkpoint and re-add all of its samples to the new histogram.

Looking at what this change would do to the bucket boundaries, there would be some movement, leading to pretty much everything getting new recommendations (because the bucket boundaries have changed). See the example prints below, with start bucket values in millicores.

Before this change:

After this change:

In the end, we'd probably rather do this and indicate to users that they can expect a massive eviction, instead of hiding this behind a feature flag forever?
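The kind of comparison referenced above can be reproduced with a short program. The sketch below assumes the usual geometric-series layout of an exponential histogram, start(i) = firstBucketSize * (ratio^i - 1) / (ratio - 1), a growth ratio of 1.05, and first buckets of 10m versus 1m; these are assumptions for illustration, not necessarily the exact values used in this PR.

```go
package main

import (
	"fmt"
	"math"
)

// bucketStart computes the start value of bucket i for an exponential
// histogram laid out as a geometric series. This approximates the VPA
// histogram layout; it is not copied from its implementation.
func bucketStart(firstBucketSize, ratio float64, i int) float64 {
	return firstBucketSize * (math.Pow(ratio, float64(i)) - 1) / (ratio - 1)
}

func main() {
	const ratio = 1.05 // assumed growth ratio
	// 0.01 cores (10m) vs 0.001 cores (1m) first bucket, both assumed values.
	for _, firstBucket := range []float64{0.01, 0.001} {
		fmt.Printf("firstBucketSize=%vm:", firstBucket*1000)
		for i := 0; i < 8; i++ {
			fmt.Printf(" %.2fm", bucketStart(firstBucket, ratio, i)*1000)
		}
		fmt.Println()
	}
}
```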
Looking at the last time we lowered the starting bucket, we decreased it from

I do not feel comfortable writing a migration for this.

For what it is worth, what we did in our clusters was to simply wipe all checkpoints.
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:
- Mark this PR as fresh with /remove-lifecycle stale
- Close this PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:
- Mark this PR as fresh with /remove-lifecycle rotten
- Close this PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten
/remove-lifecycle rotten

One more question from my end: if you're really looking for CPU values this small being assigned to the workload, what's your expectation regarding evictions for CPU load changes? Should a Pod be evicted when the recommendation changes from one of these tiny values to another?

I'm asking because I'm also seeing efforts in the opposite direction with #7682, which would e.g. allow people to round to the nearest 10, 50 or even 100 mCores, depending on how much they care about these minimal differences in CPU requests.
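For reference, the rounding idea in #7682 amounts to snapping a recommendation to a multiple of a chosen step. A minimal sketch of that idea follows; it is not the code proposed in that PR, and the function name is made up.

```go
package main

import "fmt"

// roundUpMillicores snaps a CPU recommendation (in millicores) up to the
// nearest multiple of step, e.g. 3m -> 10m for a 10m step. Illustration
// only; not the implementation from #7682.
func roundUpMillicores(value, step int64) int64 {
	if step <= 1 {
		return value
	}
	return ((value + step - 1) / step) * step
}

func main() {
	for _, step := range []int64{10, 50, 100} {
		fmt.Printf("step=%3dm: 3m -> %dm, 147m -> %dm\n",
			step, roundUpMillicores(3, step), roundUpMillicores(147, step))
	}
}
```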
Hello @voelzmo, I found this PR because I have the same issue: loads of small workloads with very small CPU usage. I'd love to use VPA, but having 10m as the minimum would basically double or triple the requested CPU of my cluster.
I agree it might get tricky and cause a lot of pod restarts, but that's kinda what you sign up for when using VPA in Auto mode. I personally would only use VPA in Initial mode with such low usage to avoid that issue. Thanks, I'm looking forward to this feature 😄
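To make the "double or triple" estimate concrete, here is a tiny sketch with made-up usage numbers showing how a 10m floor inflates the summed requests of many small containers.

```go
package main

import "fmt"

// With many containers that actually need only a few millicores, clamping
// every recommendation to a 10m floor multiplies the cluster-wide CPU
// requests. The usage numbers below are made up for the example.
func main() {
	usage := []int64{2, 3, 3, 4, 5} // actual per-container usage in millicores (assumed)
	const floor int64 = 10          // current effective minimum recommendation

	var raw, floored int64
	for _, u := range usage {
		raw += u
		if u < floor {
			floored += floor
		} else {
			floored += u
		}
	}
	fmt.Printf("sum of actual usage: %dm, sum with 10m floor: %dm (%.1fx)\n",
		raw, floored, float64(floored)/float64(raw))
}
```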
I can see how using VPA in Initial mode for containers with these small values is definitely useful. I don't have a strong opinion on whether this justifies the work needed to migrate the existing checkpoints for everyone, but maybe others do. @omerap12 @adrianmoisey
What type of PR is this?
/kind bug
What this PR does / why we need it:
Adjust firstBucketSize for CPU histograms to support buckets smaller than 10m.
Which issue(s) this PR fixes:
Fixes #6415
Special notes for your reviewer:
Does this PR introduce a user-facing change?
NONE