VPA: add performance benchmarking #9192
Conversation
Here's how I tested this locally:
# in vertical-pod-autoscaler/
$ KIND_VERSION="kindest/node:v1.35.0@sha256:452d707d4862f52530247495d180205e029056831160e22870e37e3f6c1ac31f"
$ kind delete cluster --name kind
$ kind create cluster --image=${KIND_VERSION} --config=benchmark/kind-config.yaml
$ ./hack/deploy-for-e2e-locally.sh full-vpa
$ go build -C benchmark -o ../bin/vpa-benchmark .
$ bin/vpa-benchmark --profile=xxlarge
... results

It is possible that my local setup is more or less powerful than anyone else's trying this, which might fudge the timings of the sleeps and such, so I'd love some feedback on how to make this better if it doesn't work for everyone. And yes, loops can actually take longer than the allotted updater-interval or recommender-interval times if there is too much going on in the cluster, such as evictions. I had to manually increase the intervals so that wouldn't happen, but I'm assuming that on someone else's setup the loop could still take too long and not finish, or could be very fast and not need all that time, since Docker shares local machine resources and all that.
FWIW: I did try this out on Omer's PR, and I'm not sure there was a notable difference. The bulk of the time is spent on:
# before
========== Results ==========
┌───────────────┬───────────────────┐
│ STEP │ XXLARGE ( 1000 ) │
├───────────────┼───────────────────┤
│ AdmissionInit │ 0.0004s │
│ EvictPods │ 98.6963s │
│ FilterPods │ 0.0925s │
│ ListPods │ 0.0025s │
│ ListVPAs │ 0.0027s │
│ total │ 98.7945s │
└───────────────┴───────────────────┘
# after
========== Results ==========
┌───────────────┬───────────────────┐
│ STEP │ XXLARGE ( 1000 ) │
├───────────────┼───────────────────┤
│ AdmissionInit │ 0.0003s │
│ EvictPods │ 99.1713s │
│ FilterPods │ 0.0922s │
│ ListPods │ 0.0020s │
│ ListVPAs │ 0.0026s │
│ total │ 99.2683s │
└───────────────┴───────────────────┘
omerap12
left a comment
I’m planning to go through the code in depth, but overall I like it. Especially the use of KWOK.
There’s one bug you can reproduce by running:
$ ./bin/vpa-benchmark
=== VPA Benchmark with KWOK ===
Profiles: [small], Runs per profile: 1
Installing KWOK...
KWOK already installed
Creating KWOK fake node...
KWOK node already exists
Configuring VPA deployments...
Configured VPA deployments with increased QPS/burst and updater interval
========== Profile: small (25 VPAs) ==========
Scaling down VPA components...
Waiting for 6 VPA pods to terminate...
Waiting for 2 VPA pods to terminate...
VPA components scaled down
Deleting all VPA checkpoints...
Deleted 0 VPA checkpoints
Cleaning up existing benchmark resources...
Creating 25 ReplicaSets (2 pods each, 50 total)...
Creating 25 VPAs...
Waiting for 50 KWOK pods to be running...
Pods: 0/50
Pods: 0/50
...
W0206 13:21:42.996430 69510 main.go:174] Run 1 failed: timeout waiting for pods: context deadline exceeded
No successful runs for profile small!
Final cleanup...
Benchmark completed successfully.
I guess this is because I didn't enter a profile: https://github.com/kubernetes/autoscaler/pull/9192/files#diff-95216a5af7dfd121672dd3633edd84664820b5e5390eff6902f95c2bb52ede51R69
Moreover, should we also use this https://kwok.sigs.k8s.io/docs/user/resource-usage-configuration/ ?
stagesURL := fmt.Sprintf("https://github.com/kubernetes-sigs/kwok/releases/download/%s/stage-fast.yaml", kwokVersion)
cmd = exec.CommandContext(ctx, "kubectl", "apply", "-f", stagesURL)
if output, err := cmd.CombinedOutput(); err != nil {
	return fmt.Errorf("kubectl apply stage-fast.yaml failed: %v\n%s", err, output)
I’m debating whether we want the Go code to handle the infrastructure setup as well.
We’ll probably build a bash script on top of it, so I’m wondering if the script should install all required tools (e.g., KWOK, yq, etc.), while the Go code focuses solely on benchmarking.
Yeah, I was debating this as well.
I think having a bash script outside makes sense to keep the binary focused on benchmarking. I was mostly thinking about keeping all the code contained in one program, but that makes more sense to me. Thanks for the feedback 👍
I'll look into the bug, thanks!
EDIT: Hmm, I cannot recreate the bug; there might be something going on in your environment. For example, is the kwok-controller not marking those pods as ready?
Hm... I'm wondering how we could benefit from controlling the resource usage, and what other situations we could performance-test using this CRD. I guess it would help to be explicit about what resource usage the recommendations should be generated from, so that we can always trigger updates/evictions from the updater. I will take a look to see if I can add ClusterResourceUsage to simulate usage outside of the initial resource requests of each pod. For now, I think the recommender is just hitting minCPU and minMemory and recommending those, which will trigger evictions anyway.
iamzili
left a comment
Good work Max! The PR seems reasonable to me; I've just left a few comments.
Furthermore, based on the code, the main goal of this binary is to check the duration of each phase in the updater within RunOnce. Are there any plans to do the same for the recommender, or even for the admission controller?
// configureVPADeployments modifies VPA deployments with increased QPS/burst and updater interval.
// Uses kubectl get/yq/kubectl apply.
func configureVPADeployments(ctx context.Context) error {
just a side note related to this function: I think it would be great to make the ./hack/deploy-for-e2e-locally.sh script more versatile so that it can deploy the VPA components with custom flags, which would make this function obsolete. I proposed such an improvement in the PR whose goal is to incorporate the helm chart into ./hack/deploy-for-e2e-locally.sh:
Ah, I wasn't aware of that PR. I'll keep it in mind, thanks! I want to get rid of the setup in the Go code anyway, so I'll wait for that to resolve.
Original:

for i := 0; i < count; i++ {
	i := i

Suggested change:

for i := range count {
	return err
}
fmt.Println(" Waiting for recommender to be ready...")
if err := waitForVPAPodReady(ctx, kubeClient, "vpa-recommender"); err != nil {
I think the waitForVPAPodReady function can also be used to wait until the admission controller is ready
// Step 10: Wait for updater's first loop
// The updater uses time.Tick which waits the full interval before the first tick
// We set --updater-interval=2m, so wait 2 minutes for the first loop to start
fmt.Println("Waiting 2 minutes for updater's first loop...")
time.Sleep(2 * time.Minute)
Let me ask: what is the reason for this sleep? I'm asking because we could rely entirely on the mechanism implemented in the waitForAndScrapeMetrics function (i.e. polling until the vpa_updater_execution_latency_seconds_sum{step="total"} metric is present).
The reason is that we know for sure the first tick of the updater and recommender happens at the interval time, not at t=0, so I'm sleeping to avoid wasting any execution on polling. I could in theory just add more time to the timeout of waitForAndScrapeMetrics, but again, it's guaranteed we will have to wait that long anyway.
Yeah, it's just something I came across. I'm not sure whether we should use it yet, but it might be worth considering as the benchmark evolves.
This commit adds VPA benchmarking to the project. It adds code for a benchmark binary which measures the internal latency of the VPA updater loop. This allows users to test changes and see how they will affect the performance of the VPA. The code and documentation live in the vertical-pod-autoscaler/benchmark directory. Signed-off-by: Max Cao <macao@redhat.com>
Adds a new GitHub Action which runs the vpa-benchmark e2e. Signed-off-by: Max Cao <macao@redhat.com>
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: maxcao13
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.
Yes! Hopefully if this is approved, then a followup to add scraping the recommender and displaying metrics would come.
Sounds good, I don't think we need it yet. AFAICT it can just simulate metrics, which would be good for benchmarking the recommender, but not for the updater.
I've added a commit here to add the benchmark as a GitHub Action, and it's running right now. I can split it into a new PR if we like once all this is approved, but I put it here to show it off, and also to see if the GitHub infra agrees with what I've done :-)
EDIT: https://github.com/kubernetes/autoscaler/actions/runs/22002078972/job/63576978582?pr=9192
What type of PR is this?
/kind cleanup
/kind documentation
/area vertical-pod-autoscaler
What this PR does / why we need it:
With many changes coming to the VPA project, it would be beneficial to know whether any of them could cause a performance regression, or whether a change actually improves performance. For example, a large change like adding the InPlace update mode, or a performance enhancement such as the work done in #8807. This PR proposes a new benchmarking tool for the VPA which can measure the latency of VPA components under load in a real cluster (kind), not a simulated one. We can then use the results on a pull request to compare against the main branch.
Ideally, this would run as a GitHub Action or a Prow job in CI, giving reproducible and consistent environments for each change to the VPA, which we could then compare to main/master easily. For now, this is simply a tool to be run locally; CI integration would ideally come in a follow-up PR.
Details about the benchmark itself, internals, quick-start, etc., are in the README.md in the vertical-pod-autoscaler/benchmark directory.
Which issue(s) this PR fixes:
Partially fixes #8866
Special notes for your reviewer:
This PR was done in tandem with Cursor Agent mode using the Cursor IDE.
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: