Decrease KOPS_SCHEDULER_QPS/BURST and KOPS_CONTROLLER_MANAGER_QPS/BURST to 300 #17942

Open
ronaldngounou wants to merge 1 commit into kubernetes:master from ronaldngounou:decrease-scheduler-qps

Conversation

@ronaldngounou (Member) commented Feb 9, 2026

Motivation / Background:

The ec2-master-scale 5k-node tests have been failing since September 19th, 2025, and possibly earlier, because PodStartupLatency has exceeded the 5s SLO.

JobID = 2020473671461638144

{ Failure :0
[measurement call PodStartupLatency - PodStartupLatency error: pod startup: too high latency 99th percentile: got 5.71839757s expected: 5s]
:0}

pod_startup: {p99=5718ms, p90=4518ms, p50=1861ms}
Given the breakdown above, the pod_startup phase is the one that took the longest.


Solution:

Historically, there was an experiment to increase scheduler throughput in the 5k performance tests (xref). That experiment suggested running with KOPS_SCHEDULER_QPS=300 and KOPS_SCHEDULER_BURST=300, i.e. at 300 qps, which matches the current test-infra value of 300 (xref).
In addition, @wojtek-t stated that setting scheduler QPS to 300 would hugely decrease pod startup latency (xref).
Therefore, I am changing these values.

This should hugely decrease pod startup latency across the whole test. Since individual controllers have separate QPS limits, it lets the scheduler keep up with the combined load of the Deployment, DaemonSet, and Job controllers creating pods at once.

The controller manager values are modified to match test-infra (xref).
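
As a reference point, here is a minimal sketch of the configuration this PR converges on, assuming these KOPS_* environment variables are exported for the scale-test scenario and ultimately drive the scheduler's and controller manager's API client QPS/burst settings (the exact wiring, and the previous values, are assumptions based on this thread):

    # Sketch only: variable names come from this PR; the consuming script and
    # the prior values are assumptions based on the discussion below.
    export KOPS_SCHEDULER_QPS=300             # scheduler API client QPS, aligned with test-infra
    export KOPS_SCHEDULER_BURST=300           # scheduler API client burst
    export KOPS_CONTROLLER_MANAGER_QPS=300    # controller manager API client QPS
    export KOPS_CONTROLLER_MANAGER_BURST=300  # controller manager API client burst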

Impact

https://testgrid.k8s.io/sig-release-master-informing#ec2-master-scale-performance
This job is release-informing for v1.36-alpha, and this change helps keep the ec2-master-scale 5k dashboard green.

Contributes to issue kubernetes/kubernetes#134332 (comment)

@k8s-ci-robot added the size/XS label (denotes a PR that changes 0-9 lines, ignoring generated files) Feb 9, 2026
@k8s-ci-robot requested review from dims and hakman February 9, 2026 00:51
@k8s-ci-robot added the cncf-cla: yes label (indicates the PR's author has signed the CNCF CLA) Feb 9, 2026
@ronaldngounou (Member, Author)

/assign upodroid
/assign hakman

@hakman changed the title from "Decrease KOPS_SCHEDULER_QPS, KOPS_SCHEDULER_BURST, KOPS_CONTROLLER_MANAGER_QPS, and KOPS_CONTROLLER_MANAGER_BURST to 300" to "Decrease KOPS_SCHEDULER_QPS/BURST and KOPS_CONTROLLER_MANAGER_QPS/BURST to 300" Feb 9, 2026
@ronaldngounou (Member, Author)

/test pull-kops-e2e-k8s-aws-calico

…to 300

Motivation / Background:
The ec2-master-scale tests have been failing since September 19th, 2025, and possibly earlier, because PodStartupLatency has exceeded the 5s SLO.

Solution:
Historically, there was an experiment to increase scheduler throughput in the 5k performance tests. That experiment suggested running with
KOPS_SCHEDULER_QPS=300 and KOPS_SCHEDULER_BURST=300, i.e. at 300 qps. In addition, @wojtek-t stated that a scheduler QPS of 300 would
hugely decrease pod startup latency across the whole test.

The controller manager values are modified to match test-infra.

Signed-off-by: Ronald Ngounou <ronald.ngounou@yahoo.com>
@ronaldngounou force-pushed the decrease-scheduler-qps branch from 7c1f2df to 804e89d on February 9, 2026 03:36
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from hakman. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ronaldngounou (Member, Author)

We are currently waiting for the 4th run of the DELETE events SLO to succeed before merging this PR.

@mengqiy (Member) commented Feb 10, 2026

It's not very clear what the root cause is.
IIUC, you are implying that the controllers create pods faster than the scheduler can keep up with.
Based on that reasoning, it seems we could keep the scheduler config unchanged and just reduce the qps/burst for KCM.

@ronaldngounou (Member, Author)

I understand the logic. Keeping scheduler QPS high seems like it would help process pods faster. However, I'm concerned about deviating from the proven test-infra configuration.

  • Test-infra uses Scheduler/KCM QPS=300 (not 500). xref
  • My only concern is that if we use the scheduler at 500 qps and KCM at 300 qps, we will be running a new experiment. Wouldn't the safer path be to align with the current test-infra values and unify kops and test-infra? (A sketch of the two options follows below.)

I'd recommend having 300/300 in this PR to align with test-infra configurations.

If we want to experiment with Scheduler=500qps, we can do it as a follow-up with proper testing.

What do you think?
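
To make the two options concrete, here is a rough sketch of what is being compared; the 500 figure comes from the discussion above, and everything else (burst values, how the variables are consumed) is an assumption for illustration:

    # Alternative discussed above (not adopted here): keep the scheduler at 500 qps, reduce only KCM.
    export KOPS_SCHEDULER_QPS=500            # only qps=500 was discussed; burst not specified
    export KOPS_CONTROLLER_MANAGER_QPS=300
    export KOPS_CONTROLLER_MANAGER_BURST=300

    # What this PR proposes: align all four values at 300 to match test-infra.
    export KOPS_SCHEDULER_QPS=300
    export KOPS_SCHEDULER_BURST=300
    export KOPS_CONTROLLER_MANAGER_QPS=300
    export KOPS_CONTROLLER_MANAGER_BURST=300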

@upodroid (Member)

Can we run the 5k job here?

/test pull-kops-ec2-master-scale-performance-5000

@ronaldngounou (Member, Author)

/test pull-kops-ec2-master-scale-performance-5000

@ronaldngounou (Member, Author)

/test pull-kops-gce-master-scale-performance-5000

@ronaldngounou (Member, Author)

/test pull-kops-ec2-master-scale-performance-5000

@k8s-ci-robot (Contributor)

@ronaldngounou: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                                   | Commit  | Details | Required | Rerun command
pull-kops-e2e-k8s-aws-calico                | 804e89d | link    | true     | /test pull-kops-e2e-k8s-aws-calico
pull-kops-gce-master-scale-performance-5000 | 804e89d | link    | true     | /test pull-kops-gce-master-scale-performance-5000
pull-kops-ec2-master-scale-performance-5000 | 804e89d | link    | false    | /test pull-kops-ec2-master-scale-performance-5000

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@alaypatel07 (Contributor)

{
      "data": {
        "Perc50": 509.642,
        "Perc90": 907.765,
        "Perc99": 996.954
      },
      "unit": "ms",
      "labels": {
        "Metric": "create_to_schedule"
      }
    },
    {
      "data": {
        "Perc50": 271.732,
        "Perc90": 2126.164,
        "Perc99": 3712.676
      },
      "unit": "ms",
      "labels": {
        "Metric": "schedule_to_run"
      }
    },
    {
      "data": {
        "Perc50": 1041.123679,
        "Perc90": 1560.673041,
        "Perc99": 1922.266583
      },

https://storage.googleapis.com/kubernetes-ci-logs/pr-logs/pull/kops/17942/pull-kops-ec2-master-scale-performance-5000/2021361308649132032/artifacts/StatelessPodStartupLatency_PodStartupLatency_load_2026-02-11T00:33:06Z.json

Looking at the startup data, it looks like once the pods land on a node, the kubelet takes more time to bring them to the Running state, as reflected in the schedule_to_run metric.
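
If it helps others double-check the phase breakdown, here is a small sketch for extracting the percentiles from that artifact with jq, under the assumption that the entries shown above sit in a top-level dataItems array (the usual clusterloader2 measurement layout):

    # Sketch: assumes a top-level "dataItems" array wrapping the entries above.
    ARTIFACT_URL="https://storage.googleapis.com/kubernetes-ci-logs/pr-logs/pull/kops/17942/pull-kops-ec2-master-scale-performance-5000/2021361308649132032/artifacts/StatelessPodStartupLatency_PodStartupLatency_load_2026-02-11T00:33:06Z.json"
    curl -s "$ARTIFACT_URL" | jq -r '
      .dataItems[]
      | "\(.labels.Metric): p50=\(.data.Perc50)ms p90=\(.data.Perc90)ms p99=\(.data.Perc99)ms"'

With the numbers above, schedule_to_run dominates the tail (p99 ≈ 3.7s vs ≈ 1.0s for create_to_schedule), which is consistent with the kubelet-side reading.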

@ronaldngounou (Member, Author)

That's a fair point. We currently don't have kubelet logs to root-cause this further. There was a discussion about enabling audit logs, but there was some opposition (xref). I'm going to start a discussion in sig-scalability.

@ronaldngounou (Member, Author)

I updated the issue with my recommendations, to keep this PR focused on the code review.
