Decrease KOPS_SCHEDULER_QPS/BURST and KOPS_CONTROLLER_MANAGER_QPS/BURST to 300 #17942

Open
ronaldngounou wants to merge 1 commit into kubernetes:master from ronaldngounou:decrease-scheduler-qps

Conversation

@ronaldngounou (Member) commented Feb 9, 2026

Motivation / Background:

The ec2-master-scale 5k-node tests have been failing since September 19th, 2025, and possibly earlier, because PodStartupLatency has exceeded the 5s SLO.

JobID = 2020473671461638144

{ Failure :0
[measurement call PodStartupLatency - PodStartupLatency error: pod startup: too high latency 99th percentile: got 5.71839757s expected: 5s]
:0}

pod_startup: {p99=5718ms, p90=4518ms, p50=1861ms}
Given the breakdown above, the pod_startup phase is the one that took the longest.


Solution:

Historically, there was an experiment to increase scheduler throughput in the 5k performance tests (xref). That experiment suggested running with KOPS_SCHEDULER_QPS=300 and KOPS_SCHEDULER_BURST=300, i.e. at 300 qps, which matches the current test-infra value of 300 (xref).
In addition, @wojtek-t stated that setting scheduler QPS to 300 would hugely decrease pod startup latency (xref).
Therefore, I am changing these values.

This should hugely decrease pod startup latency across the whole test. Since individual controllers have separate QPS limits, it lets the scheduler keep up with the combined load of the Deployment, DaemonSet, and Job controllers creating pods at once.

The controller manager values are modified to match test-infra (xref).
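
As a reference point, here is a minimal sketch of the configuration this PR converges on, assuming these KOPS_* environment variables are exported for the scale-test scenario and ultimately drive the scheduler's and controller manager's API client QPS/burst settings (the exact wiring, and the previous values, are assumptions based on this thread):

    # Sketch only: variable names come from this PR; the consuming script and
    # the prior values are assumptions based on the discussion below.
    export KOPS_SCHEDULER_QPS=300             # scheduler API client QPS, aligned with test-infra
    export KOPS_SCHEDULER_BURST=300           # scheduler API client burst
    export KOPS_CONTROLLER_MANAGER_QPS=300    # controller manager API client QPS
    export KOPS_CONTROLLER_MANAGER_BURST=300  # controller manager API client burst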

Impact

https://testgrid.k8s.io/sig-release-master-informing#ec2-master-scale-performance
This job is release-informing for v1.36-alpha, and this change helps keep the ec2-master-scale 5k dashboard green.

Contributes to issue kubernetes/kubernetes#134332 (comment)

@k8s-ci-robot added the size/XS label (denotes a PR that changes 0-9 lines, ignoring generated files) Feb 9, 2026
@k8s-ci-robot requested review from dims and hakman February 9, 2026 00:51
@k8s-ci-robot added the cncf-cla: yes label (indicates the PR's author has signed the CNCF CLA) Feb 9, 2026
@ronaldngounou (Member, Author)

/assign upodroid
/assign hakman

@hakman changed the title from "Decrease KOPS_SCHEDULER_QPS, KOPS_SCHEDULER_BURST, KOPS_CONTROLLER_MANAGER_QPS, and KOPS_CONTROLLER_MANAGER_BURST to 300" to "Decrease KOPS_SCHEDULER_QPS/BURST and KOPS_CONTROLLER_MANAGER_QPS/BURST to 300" Feb 9, 2026
@ronaldngounou (Member, Author)

/test pull-kops-e2e-k8s-aws-calico

…to 300

Motivation / Background:
The ec2-master-scale tests have been failing since September 19th, 2025, and possibly earlier, because PodStartupLatency has exceeded the 5s SLO.

Solution:
Historically, there was an experiment to increase scheduler throughput in the 5k performance tests. That experiment suggested running with
KOPS_SCHEDULER_QPS=300 and KOPS_SCHEDULER_BURST=300, i.e. at 300 qps. In addition, @wojtek-t stated that a scheduler QPS of 300 would
hugely decrease pod startup latency across the whole test.

The controller manager values are modified to match test-infra.

Signed-off-by: Ronald Ngounou <ronald.ngounou@yahoo.com>
@ronaldngounou force-pushed the decrease-scheduler-qps branch from 7c1f2df to 804e89d on February 9, 2026 03:36
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from hakman. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ronaldngounou (Member, Author)

We are currently waiting for the 4th run of the DELETE events SLO to succeed before merging this PR.

@mengqiy (Member) commented Feb 10, 2026

It's not very clear what the root cause is.
IIUC, you are implying that the controllers create pods faster than the scheduler can keep up with.
Based on that reasoning, it seems we could keep the scheduler config unchanged and just reduce the qps/burst for KCM.

@ronaldngounou (Member, Author)

I understand the logic. Keeping scheduler QPS high seems like it would help process pods faster. However, I'm concerned about deviating from the proven test-infra configuration.

  • Test-infra uses Scheduler/KCM QPS=300 (not 500). xref
  • My only concern is that if we use the scheduler at 500 qps and KCM at 300 qps, we will be running a new experiment. Wouldn't the safer path be to align with the current test-infra values and unify kops and test-infra? (A sketch of the two options follows below.)

I'd recommend having 300/300 in this PR to align with test-infra configurations.

If we want to experiment with Scheduler=500qps, we can do it as a follow-up with proper testing.

What do you think?
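
To make the two options concrete, here is a rough sketch of what is being compared; the 500 figure comes from the discussion above, and everything else (burst values, how the variables are consumed) is an assumption for illustration:

    # Alternative discussed above (not adopted here): keep the scheduler at 500 qps, reduce only KCM.
    export KOPS_SCHEDULER_QPS=500            # only qps=500 was discussed; burst not specified
    export KOPS_CONTROLLER_MANAGER_QPS=300
    export KOPS_CONTROLLER_MANAGER_BURST=300

    # What this PR proposes: align all four values at 300 to match test-infra.
    export KOPS_SCHEDULER_QPS=300
    export KOPS_SCHEDULER_BURST=300
    export KOPS_CONTROLLER_MANAGER_QPS=300
    export KOPS_CONTROLLER_MANAGER_BURST=300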

@upodroid (Member)

Can we run the 5k job here?

/test pull-kops-ec2-master-scale-performance-5000

@ronaldngounou (Member, Author)

/test pull-kops-ec2-master-scale-performance-5000

@ronaldngounou (Member, Author)

/test pull-kops-gce-master-scale-performance-5000

@ronaldngounou (Member, Author)

/test pull-kops-ec2-master-scale-performance-5000

@k8s-ci-robot (Contributor)

@ronaldngounou: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                                   | Commit  | Details | Required | Rerun command
pull-kops-e2e-k8s-aws-calico                | 804e89d | link    | true     | /test pull-kops-e2e-k8s-aws-calico
pull-kops-gce-master-scale-performance-5000 | 804e89d | link    | true     | /test pull-kops-gce-master-scale-performance-5000
pull-kops-ec2-master-scale-performance-5000 | 804e89d | link    | false    | /test pull-kops-ec2-master-scale-performance-5000

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@alaypatel07 (Contributor)

{
      "data": {
        "Perc50": 509.642,
        "Perc90": 907.765,
        "Perc99": 996.954
      },
      "unit": "ms",
      "labels": {
        "Metric": "create_to_schedule"
      }
    },
    {
      "data": {
        "Perc50": 271.732,
        "Perc90": 2126.164,
        "Perc99": 3712.676
      },
      "unit": "ms",
      "labels": {
        "Metric": "schedule_to_run"
      }
    },
    {
      "data": {
        "Perc50": 1041.123679,
        "Perc90": 1560.673041,
        "Perc99": 1922.266583
      },

https://storage.googleapis.com/kubernetes-ci-logs/pr-logs/pull/kops/17942/pull-kops-ec2-master-scale-performance-5000/2021361308649132032/artifacts/StatelessPodStartupLatency_PodStartupLatency_load_2026-02-11T00:33:06Z.json

Looking at the startup data, it looks like once the pods land on a node, the kubelet takes more time to bring them to the Running state, as reflected in the schedule_to_run metric.
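
If it helps others double-check the phase breakdown, here is a small sketch for extracting the percentiles from that artifact with jq, under the assumption that the entries shown above sit in a top-level dataItems array (the usual clusterloader2 measurement layout):

    # Sketch: assumes a top-level "dataItems" array wrapping the entries above.
    ARTIFACT_URL="https://storage.googleapis.com/kubernetes-ci-logs/pr-logs/pull/kops/17942/pull-kops-ec2-master-scale-performance-5000/2021361308649132032/artifacts/StatelessPodStartupLatency_PodStartupLatency_load_2026-02-11T00:33:06Z.json"
    curl -s "$ARTIFACT_URL" | jq -r '
      .dataItems[]
      | "\(.labels.Metric): p50=\(.data.Perc50)ms p90=\(.data.Perc90)ms p99=\(.data.Perc99)ms"'

With the numbers above, schedule_to_run dominates the tail (p99 ≈ 3.7s vs ≈ 1.0s for create_to_schedule), which is consistent with the kubelet-side reading.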

@ronaldngounou (Member, Author)

That's a fair point. We currently don't have kubelet logs to root-cause this further. There was a discussion about enabling audit logs, but there was some opposition (xref). I'm going to start a discussion in sig-scalability.

@ronaldngounou (Member, Author)

I updated the issue with my recommendations, to keep this PR focused on the code review.
