Add performance benchmark for the CA RunOnce control loop #9237
Choraden wants to merge 13 commits into kubernetes:master
Conversation
Update NewTestProcessors to use DynamicResourceAllocationEnabled and CSINodeAwareSchedulingEnabled from AutoscalingOptions instead of hardcoded values. This allows tests to properly configure custom resource processing.
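A minimal sketch of the shape of this change; the local struct definitions below are stand-ins for the real AutoscalingOptions and processor configuration, not the actual code:

```go
package processorstest

// Stand-in types; the real AutoscalingOptions and processors live in the autoscaler packages.
type AutoscalingOptions struct {
	DynamicResourceAllocationEnabled bool
	CSINodeAwareSchedulingEnabled    bool
}

type TestProcessors struct {
	DynamicResourceAllocationEnabled bool
	CSINodeAwareSchedulingEnabled    bool
}

// NewTestProcessors derives the feature toggles from the supplied options
// instead of hardcoding them, so tests can configure custom resource processing.
func NewTestProcessors(opts *AutoscalingOptions) *TestProcessors {
	return &TestProcessors{
		DynamicResourceAllocationEnabled: opts.DynamicResourceAllocationEnabled,
		CSINodeAwareSchedulingEnabled:    opts.CSINodeAwareSchedulingEnabled,
	}
}
```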
Updated MustCreateManager in the integration test package to accept testing.TB instead of *testing.T. This allows the helper to be used within both standard tests and performance benchmarks (which use *testing.B). This change is a prerequisite for introducing performance benchmarking for the RunOnce control loop.
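A minimal sketch of why testing.TB works here; the helper body is illustrative, not the real MustCreateManager:

```go
package integration

import "testing"

// Because testing.TB is the interface implemented by both *testing.T and *testing.B,
// a helper that accepts it can be shared by regular tests and benchmarks.
func mustCreateManager(tb testing.TB) string {
	tb.Helper()
	manager := "fake-manager" // placeholder for the real manager construction
	if manager == "" {
		tb.Fatalf("failed to create manager")
	}
	return manager
}

func TestUsesHelper(t *testing.T)      { _ = mustCreateManager(t) }
func BenchmarkUsesHelper(b *testing.B) { _ = mustCreateManager(b) }
```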
This commit adds a new benchmarking suite in core/bench to evaluate the performance of the primary Cluster Autoscaler control loop (RunOnce). These benchmarks simulate large-scale cluster operations using a mock Kubernetes API and cloud provider, allowing for comparative analysis and detection of performance regressions.
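Roughly, such a benchmark has the shape sketched below; buildFakeAutoscaler and the stub types are hypothetical placeholders for the mock Kubernetes API and cloud provider setup described above:

```go
package bench

import (
	"testing"
	"time"
)

// runOncer is a stand-in for the autoscaler's RunOnce entry point.
type runOncer interface {
	RunOnce(now time.Time) error
}

// buildFakeAutoscaler is a placeholder; the real suite wires a fake Kubernetes
// client and a fake cloud provider into the autoscaler under test.
func buildFakeAutoscaler(b *testing.B) runOncer {
	b.Helper()
	return noopAutoscaler{}
}

type noopAutoscaler struct{}

func (noopAutoscaler) RunOnce(time.Time) error { return nil }

// BenchmarkRunOnce sketches the overall shape: build a large fake cluster once,
// then time the core RunOnce control loop against it.
func BenchmarkRunOnce(b *testing.B) {
	autoscaler := buildFakeAutoscaler(b)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		if err := autoscaler.RunOnce(time.Now()); err != nil {
			b.Fatal(err)
		}
	}
}
```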
Introduced a -profile-cpu flag to the RunOnce benchmarking suite. When specified, the benchmark will capture a CPU profile during the first execution of the RunOnce loop and write it to the provided file path.
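A hedged sketch of how such a flag can be wired with runtime/pprof; the flag name matches the description above, while the wrapper function is illustrative:

```go
package bench

import (
	"flag"
	"os"
	"runtime/pprof"
	"testing"
)

var profileCPU = flag.String("profile-cpu", "", "write a CPU profile of the RunOnce call to this file")

// withOptionalCPUProfile runs fn, capturing a CPU profile to *profileCPU when the flag is set.
func withOptionalCPUProfile(b *testing.B, fn func()) {
	if *profileCPU == "" {
		fn()
		return
	}
	f, err := os.Create(*profileCPU)
	if err != nil {
		b.Fatalf("creating CPU profile file: %v", err)
	}
	defer f.Close()
	if err := pprof.StartCPUProfile(f); err != nil {
		b.Fatalf("starting CPU profile: %v", err)
	}
	defer pprof.StopCPUProfile()
	fn()
}
```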
Disable Garbage Collection during RunOnce benchmarks to ensure stable and reproducible results. This prioritizes consistency over absolute performance metrics: it gives a consistent basis for comparing performance between patches and produces a cleaner CPU profile of the RunOnce loop.
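In Go this typically comes down to debug.SetGCPercent(-1); a minimal sketch, with an illustrative helper name:

```go
package bench

import (
	"runtime/debug"
	"testing"
)

// withGCDisabled turns off automatic garbage collection for the duration of fn,
// restoring the previous setting afterwards.
func withGCDisabled(b *testing.B, fn func()) {
	b.Helper()
	old := debug.SetGCPercent(-1) // -1 disables the garbage collector
	defer debug.SetGCPercent(old)
	fn()
}
```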
Introduce a no-op event recorder in RunOnce benchmarks to prevent event dropping and potential performance side-effects. This change also extends AutoscalerBuilder to support injecting custom AutoscalingKubeClients, allowing for better control over the environment in performance-sensitive tests.
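One way to build such a recorder is to implement client-go's record.EventRecorder interface with empty methods; a sketch with an illustrative type name:

```go
package bench

import (
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/tools/record"
)

// noOpEventRecorder discards every event, so event broadcasting can neither drop
// events nor add noise to benchmark timings.
type noOpEventRecorder struct{}

func (noOpEventRecorder) Event(object runtime.Object, eventtype, reason, message string) {}
func (noOpEventRecorder) Eventf(object runtime.Object, eventtype, reason, messageFmt string, args ...interface{}) {
}
func (noOpEventRecorder) AnnotatedEventf(object runtime.Object, annotations map[string]string, eventtype, reason, messageFmt string, args ...interface{}) {
}

// Compile-time check that the type satisfies record.EventRecorder.
var _ record.EventRecorder = noOpEventRecorder{}
```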
Introduce fastScaleUpCloudProvider and fastScaleUpNodeGroup in benchmarks to avoid the overhead of simulating real node creation in the fake cloud provider. Skipping node object management in the fake provider significantly reduces noise in CPU profiles when benchmarking the core autoscaling logic. Added NoOpIncreaseSize to the fake NodeGroup to support this faster scale-up simulation.
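Conceptually, the fast node group only adjusts its target size on scale-up instead of materializing fake Node objects; a simplified sketch (not the actual provider code, which implements the full cloudprovider.NodeGroup interface):

```go
package bench

import "fmt"

// fastNodeGroup tracks only the numbers the scale-up logic needs, so no fake Node
// objects are created or managed during the benchmark.
type fastNodeGroup struct {
	targetSize int
	maxSize    int
}

// IncreaseSize mirrors the NodeGroup scale-up method but only bumps a counter.
func (g *fastNodeGroup) IncreaseSize(delta int) error {
	if g.targetSize+delta > g.maxSize {
		return fmt.Errorf("size increase too large: %d > max %d", g.targetSize+delta, g.maxSize)
	}
	g.targetSize += delta
	return nil
}
```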
This change introduces fastTaintingKubeClient, which uses reactors to track and inject ToBeDeleted taints on nodes during the benchmark. This allows the scale-down logic to correctly identify nodes that have been marked for deletion by the autoscaler without relying on the standard fake client's persistence for these taints, which keeps the fake client's bookkeeping out of the CPU profile.
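A sketch of the reactor-based approach using client-go's fake clientset (assumed shape; the real client and taint handling may differ):

```go
package bench

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/kubernetes/fake"
	k8stesting "k8s.io/client-go/testing"
)

// newFastTaintingClient records ToBeDeleted taints applied by the autoscaler in a map,
// short-circuiting the default object tracker so it stays out of the CPU profile.
func newFastTaintingClient(tainted map[string]bool) *fake.Clientset {
	client := fake.NewSimpleClientset()
	client.PrependReactor("update", "nodes", func(action k8stesting.Action) (bool, runtime.Object, error) {
		node, ok := action.(k8stesting.UpdateAction).GetObject().(*corev1.Node)
		if !ok {
			return false, nil, nil // not a Node update: fall through to the default reactors
		}
		for _, taint := range node.Spec.Taints {
			if taint.Key == "ToBeDeletedByClusterAutoscaler" {
				tainted[node.Name] = true
			}
		}
		return true, node, nil // handled: skip the fake object tracker
	})
	return client
}
```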
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Choraden

The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
Hi @Choraden. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/uncc aleksandra-malinowska vadasambar

Keeping as draft until #9099 is merged.
@Choraden: GitHub didn't allow me to request PR reviews from the following users: pmendelski. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Sharing results:


What type of PR is this?
/kind cleanup
What this PR does / why we need it:
While working on #9022, it became clear that a standardized benchmark is necessary to quantify performance gains and prevent potential regressions in the core logic.
Leveraging the ongoing refactor of the autoscaler building logic in #9099, this PR introduces an initial draft of a benchmark specifically for the RunOnce function. This provides a controlled environment to measure the impact of architectural changes on the main execution loop.
The initial benchmark version at #9199 was difficult to stabilize and reason about, so we decided to simplify it to a single RunOnce call, simulating a "cold start" of the CA.
Which issue(s) this PR fixes:
Relates to #9022
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: