Decrease KOPS_SCHEDULER_QPS/BURST and KOPS_CONTROLLER_MANAGER_QPS/BURST to 300 #17942
ronaldngounou wants to merge 1 commit into kubernetes:master
/assign upodroid
/test pull-kops-e2e-k8s-aws-calico
Decrease KOPS_SCHEDULER_QPS/BURST and KOPS_CONTROLLER_MANAGER_QPS/BURST to 300

Motivation / Background: The ec2-master-scale tests have been failing since September 19th, 2025, and possibly before, due to PodStartupLatency > 5s per the SLO.

Solution: Historically, there was an experiment to increase scheduler throughput in the 5k performance tests. That experiment suggested KOPS_SCHEDULER_QPS=300 and KOPS_SCHEDULER_BURST=300, i.e. running the scheduler at 300 QPS. In addition, @wojtek-t stated that a scheduler QPS of 300 would hugely decrease pod-startup latency across the whole test. The controller-manager values are modified to be uniform with test-infra.

Signed-off-by: Ronald Ngounou <ronald.ngounou@yahoo.com>
(force-pushed from 7c1f2df to 804e89d)
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.
We are currently waiting to see the 4th run of the DELETE events SLO succeed before merging this PR.
It's not very clear what the root cause is.
I understand the logic. Keeping scheduler QPS high seems like it would help process pods faster. However, I'm concerned about deviating from the proven test-infra configuration.
I'd recommend using 300/300 in this PR to align with the test-infra configuration. If we want to experiment with a scheduler QPS of 500, we can do that as a follow-up with proper testing. What do you think?
Can we run the 5k job here?
/test pull-kops-ec2-master-scale-performance-5000
/test pull-kops-gce-master-scale-performance-5000
/test pull-kops-ec2-master-scale-performance-5000
@ronaldngounou: The following tests failed; say /retest to rerun all failed tests.
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.
Details: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Looking at the data for the startup, it looks like once the pods land on the node, the kubelet is taking more time moving them to the Running state, as reflected in
That's a fair point. We currently don't have kubelet logs to root-cause this further. There was a discussion about enabling audit logs, but there was some opposition (xref). I'm going to start a discussion in sig-scalability.
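Side note for anyone trying to reproduce this without kubelet logs: below is a minimal sketch of pulling the kubelet's pod-start timing histograms from one node's metrics endpoint via the API server node proxy. The node name is a placeholder and the metric names are the upstream kubelet histograms (which vary by Kubernetes version); this is an illustration, not the specific measurement referenced in the comment above.

```bash
#!/usr/bin/env bash
# Sketch: inspect kubelet pod-start timings on one node without SSH access.
# NODE is a hypothetical placeholder; metric names are the upstream kubelet
# histograms and may differ by Kubernetes version.
set -euo pipefail

NODE="ip-10-0-0-1.ec2.internal"   # hypothetical node name

# The kubelet /metrics endpoint is reachable through the API server node proxy.
kubectl get --raw "/api/v1/nodes/${NODE}/proxy/metrics" \
  | grep -E '^kubelet_pod_start(_sli)?_duration_seconds' \
  | head -n 40
```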
Updated the issue with my recommendations to keep this PR centered on the code review.
Motivation / Background:
The ec2-master-scale 5k-node tests have been failing since September 19th, 2025, and possibly before, due to PodStartupLatency > 5s per the SLO. JobID = 2020473671461638144

pod_startup: {p99=5718ms, p90=4518ms, p50=1861ms}

Given the latencies observed in the logs and this measurement, it is the pod_startup phase that took the longest.
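For context on how the numbers above relate to the SLO, here is a minimal sketch of checking the reported pod_startup p99 against the 5s threshold, assuming a clusterloader2-style PodStartupLatency JSON artifact. The file name and field names are illustrative assumptions, not the exact artifact produced by this job; only the 5000ms threshold and the percentile values come from this report.

```bash
#!/usr/bin/env bash
# Sketch: compare the pod_startup p99 from a PodStartupLatency artifact against
# the 5s (5000ms) SLO. The file name and JSON layout below are assumptions
# based on typical clusterloader2 output.
set -euo pipefail

ARTIFACT="PodStartupLatency_PodStartupLatency_load.json"   # hypothetical path
SLO_MS=5000

# Pull the Perc99 value for the pod_startup phase (field names assumed).
p99=$(jq -r '.dataItems[] | select(.labels.Metric == "pod_startup") | .data.Perc99' "$ARTIFACT")

echo "pod_startup p99 = ${p99} ms (SLO: ${SLO_MS} ms)"

# Fail the check when p99 exceeds the SLO, e.g. 5718 > 5000 for the run above.
if awk -v p99="$p99" -v slo="$SLO_MS" 'BEGIN { exit !(p99 > slo) }'; then
  echo "SLO violated"
  exit 1
fi
echo "SLO met"
```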
Solution:
Historically, there was an experiment to increase scheduler throughput in the 5k performance tests (xref). The experiment suggested KOPS_SCHEDULER_QPS=300 and KOPS_SCHEDULER_BURST=300, i.e. running the scheduler at 300 QPS. This value is supported by the current test-infra value of 300 (xref). In addition, @wojtek-t stated that a scheduler QPS of 300 would hugely decrease pod-startup latency (xref).
Therefore, I am modifying the values.
The controller-manager values are modified to be uniform with test-infra (xref); a minimal sketch of the resulting environment is shown below.
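For concreteness, this sketch shows what the aligned environment could look like when invoking the kops scalability scenario. The scenario script path and the invocation are hypothetical assumptions for illustration; only the four variable names and the value 300 come from this PR.

```bash
#!/usr/bin/env bash
# Sketch only: export the QPS/burst overrides discussed in this PR before
# running the scale scenario. The scenario script path below is hypothetical;
# the variable names and the value 300 are the ones proposed here.
set -euo pipefail

export KOPS_SCHEDULER_QPS=300
export KOPS_SCHEDULER_BURST=300
export KOPS_CONTROLLER_MANAGER_QPS=300
export KOPS_CONTROLLER_MANAGER_BURST=300

# Hypothetical invocation of the 5k-node scalability scenario.
./tests/e2e/scenarios/scalability/run-test.sh
```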
Impact
https://testgrid.k8s.io/sig-release-master-informing#ec2-master-scale-performance

This job is release-informing for v1.36-alpha, and this change helps keep the ec2-master-scale 5k tests dashboard green.
Contributes to issue kubernetes/kubernetes#134332 (comment)