test: Improved Cluster State Test Suite duration from 727 seconds to 94 seconds #2059

Closed

Conversation


@rschalo rschalo commented Mar 7, 2025

Fixes #N/A

Description
Recently, the cluster state tests have been taking an increasing amount of time (see below for a run that took 727 seconds). I generated a test report with Ginkgo (see below for the commands), which identified the slowest tests in the suite and showed that the Cluster State Sync tests were running the longest.

Investigating the ExpectApplied function showed that applying objects sequentially is slow because each call makes 5 round-trips to the apiserver (sketched below):

  • fetch the object
  • create the object if not found
  • update the object resource version after create
  • update the object status resource version
  • fetch the object now that it has been updated
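
Roughly, the per-object flow looks like the following (an illustrative paraphrase of the steps above using a controller-runtime client, not the actual ExpectApplied code):

```go
package sketch

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// applyOne mirrors the per-object round-trips listed above; it is a paraphrase
// for illustration, not the real ExpectApplied implementation.
func applyOne(ctx context.Context, c client.Client, obj client.Object) error {
	current := obj.DeepCopyObject().(client.Object)
	// 1. fetch the object
	err := c.Get(ctx, client.ObjectKeyFromObject(obj), current)
	switch {
	case apierrors.IsNotFound(err):
		// 2. create the object if not found
		if err := c.Create(ctx, obj); err != nil {
			return err
		}
	case err != nil:
		return err
	default:
		// 3. carry the server's resource version forward and update the object
		obj.SetResourceVersion(current.GetResourceVersion())
		if err := c.Update(ctx, obj); err != nil {
			return err
		}
	}
	// 4. update the object status resource version (ignored for objects without status)
	_ = c.Status().Update(ctx, obj)
	// 5. fetch the object now that it has been updated
	return c.Get(ctx, client.ObjectKeyFromObject(obj), obj)
}
```

Each of these steps is a blocking request, so applying N objects one at a time costs roughly 5N sequential round-trips.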

Running these requests in parallel with a WaitGroup dramatically speeds up the tests. The parallel approach initially ran into client-side throttling, which is enforced by the client's own rate limiter rather than by API Priority & Fairness. To get around this, I also added a helper function that allows custom rest.Config settings when constructing the test environment and configured the client to not be rate-limited.
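
A minimal sketch of both pieces follows. The names here are placeholders rather than the code in this PR: withUnthrottledClient stands in for the new rest.Config helper, and ctx/kubeClient come from the suite setup; ExpectApplied is the existing test expectation helper.

```go
package state_test

import (
	"context"
	"sync"

	. "github.com/onsi/ginkgo/v2"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/util/flowcontrol"
	"sigs.k8s.io/controller-runtime/pkg/client"

	. "sigs.k8s.io/karpenter/pkg/test/expectations"
)

// withUnthrottledClient is a hypothetical stand-in for the new helper that
// accepts custom rest.Config settings when constructing the test environment.
func withUnthrottledClient(cfg *rest.Config) {
	// Swap the default token-bucket limiter for one that never throttles, so
	// parallel requests aren't rate-limited client-side (server-side API
	// Priority & Fairness is unaffected).
	cfg.RateLimiter = flowcontrol.NewFakeAlwaysRateLimiter()
}

// expectAppliedInParallel applies objects concurrently with a WaitGroup
// instead of one at a time.
func expectAppliedInParallel(ctx context.Context, kubeClient client.Client, objs ...client.Object) {
	var wg sync.WaitGroup
	for _, obj := range objs {
		wg.Add(1)
		go func(o client.Object) {
			defer wg.Done()
			defer GinkgoRecover() // surface assertion failures raised inside the goroutine
			ExpectApplied(ctx, kubeClient, o)
		}(obj)
	}
	wg.Wait()
}
```

With the rate limiter disabled, the concurrent requests are bounded only by the apiserver rather than by client-side queuing.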

One thing that may need to change: the previous test checked ExpectMetricGaugeValue(state.ClusterStateNodesCount, float64(i+1), nil) after every create, whereas this metric is now checked against the expected count only after all objects have been created. A better way to track this could be a separate goroutine that polls the metric and validates that it keeps increasing until the test reaches the expected end state. Happy to add that in this PR or in a follow-up.
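
For illustration, a rough sketch of that polling idea inside an It block (readGauge and expectedCount are placeholders, not real helpers; the actual test would go through the existing metric expectations):

```go
// Hypothetical sketch: poll the gauge while objects are being applied and
// assert it never decreases, then check the final expected count.
done := make(chan struct{})
go func() {
	defer GinkgoRecover()
	last := 0.0
	for {
		select {
		case <-done:
			return
		default:
			v := readGauge(state.ClusterStateNodesCount) // placeholder gauge reader
			Expect(v).To(BeNumerically(">=", last))      // should only ever increase
			last = v
			time.Sleep(10 * time.Millisecond)
		}
	}
}()

// ... apply all objects in parallel ...

close(done)
Eventually(func() float64 {
	return readGauge(state.ClusterStateNodesCount)
}).Should(BeNumerically("==", float64(expectedCount)))
```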

How was this change tested?
Before

❯ ginkgo --json-report=report.json ./pkg/controllers/state/...
Running Suite: Controllers/State - /Users/schalor/github.com/testing/karpenter/pkg/controllers/state
====================================================================================================
Random Seed: 1741357499

Will run 58 of 58 specs
••••••••••••••••••••••••••••••••••••••••••••••••••••••••••

Ran 58 of 58 Specs in 727.134 seconds
SUCCESS! -- 58 Passed | 0 Failed | 0 Pending | 0 Skipped
PASS

Ginkgo ran 1 suite in 12m18.298690333s
Test Suite Passed
❯ cat report.json | jq -r '
    [
        .[].SpecReports.[]?
        | select( .State | match("passed") )
        | select( .LeafNodeType | match("It"))
        | {
            Context: ( .ContainerHierarchyTexts | join("/")),
            Test: .LeafNodeText,
            Duration: (.RunTime * .000000001)
        }
    ]
    | sort_by( .Duration )
    | reverse
' | head -n 1000
[
  {
    "Context": "Cluster State Sync",
    "Test": "should consider the cluster state synced when nodes register provider id",
    "Duration": 201.822036709
  },
  {
    "Context": "Cluster State Sync",
    "Test": "shouldn't consider the cluster state synced if a node isn't tracked",
    "Duration": 101.477449167
  },
  {
    "Context": "Cluster State Sync",
    "Test": "should consider the cluster state synced when nodes don't have provider id",
    "Duration": 101.474290916
  },
  {
    "Context": "Cluster State Sync",
    "Test": "should consider the cluster state synced when all nodes are tracked",
    "Duration": 101.436484167
  },
  {
    "Context": "Cluster State Sync",
    "Test": "should consider the cluster state synced when a combination of nodeclaims and node are tracked",
    "Duration": 53.380126292
  },
  {
    "Context": "Cluster State Sync",
    "Test": "should consider the cluster state synced when the representation of nodes is the same",
    "Duration": 51.462873583000004
  },
  {
    "Context": "Cluster State Sync",
    "Test": "should consider the cluster state synced when a new node is added after the initial sync",
    "Duration": 51.249976291
  },
  {
    "Context": "Node Resource Level",
    "Test": "should nominate the node until the nomination time passes",
    "Duration": 21.214906542
  },
  {
    "Context": "Data Races",
    "Test": "should ensure that calling Synced() is valid while making updates to Nodes",
    "Duration": 10.200998417000001
  },
  {
    "Context": "Cluster State Sync",
    "Test": "shouldn't consider the cluster state synced if a nodeclaim isn't tracked",
    "Duration": 5.991445583
  },
  {
    "Context": "Cluster State Sync",
    "Test": "should consider the cluster state synced when all nodeclaims are tracked",
    "Duration": 4.708709875
  },
  {
    "Context": "Cluster State Sync",
    "Test": "shouldn't consider the cluster state synced if a nodeclaim hasn't resolved its provider id",
    "Duration": 4.649389958
  },
  {
    "Context": "DaemonSet Controller",
    "Test": "should only return daemonset pods from the daemonset cache",
    "Duration": 2.433895583
  },
  {
    "Context": "Pod Ack",
    "Test": "should only mark pods as schedulable once",
    "Duration": 2.1460760000000003
  },
  {
    "Context": "DaemonSet Controller",
    "Test": "should update daemonsetCache with the newest created pod",
    "Duration": 1.0675025
  },
  {
    "Context": "Node Resource Level",
    "Test": "should maintain a correct count of resource usage as pods are deleted/added",
    "Duration": 0.7566735
  },
  {
    "Context": "Data Races",
    "Test": "should ensure that calling Synced() is valid while making updates to NodeClaims",
    "Duration": 0.537835917
  },
  {
    "Context": "Node Resource Level",
    "Test": "should handle a node changing from no providerID to registering a providerID",
    "Duration": 0.25278754200000003
  },
  {
    "Context": "Node Resource Level",
    "Test": "should track pods correctly if we miss events or they are consolidated",
    "Duration": 0.22724291600000002
  },
  {
    "Context": "Node Resource Level",
    "Test": "should mark node for deletion when node is deleted",
    "Duration": 0.22465641700000002
  },
  {
    "Context": "Volume Usage/Limits",
    "Test": "should hydrate the volume usage on a Node update",
    "Duration": 0.22363137500000002
  },
  {
    "Context": "Cluster State Sync",
    "Test": "shouldn't consider the cluster state synced if a nodeclaim without a providerID is deleted",
    "Duration": 0.222831041
  },
  {
    "Context": "Volume Usage/Limits",
    "Test": "should maintain the volume usage state when receiving NodeClaim updates",
    "Duration": 0.21953766700000002
  },
  {
    "Context": "Volume Usage/Limits",
    "Test": "should ignore the volume usage limits breach if the pod update is for an already tracked pod",
    "Duration": 0.21275120800000002
  },
  {
    "Context": "Node Resource Level",
    "Test": "should mark node for deletion when nodeclaim is deleted",
    "Duration": 0.209749333
  },
  {
    "Context": "Taints/Managed",
    "Test": "should consider startup taints on a managed node that isn't initialized",
    "Duration": 0.20619975000000001
  },
  {
    "Context": "Pod Healthy NodePool",
    "Test": "should delete the pod schedulable time if the pod is deleted",
    "Duration": 0.205383416
  },
  {
    "Context": "Taints/Managed",
    "Test": "should consider ephemeral taints on a managed node after the node is initialized",
    "Duration": 0.20517875000000002
  },
  {
    "Context": "Pod Anti-Affinity",
    "Test": "should handle events out of order",
    "Duration": 0.20481141700000002
  },
  {
    "Context": "Pod Anti-Affinity",
    "Test": "should stop tracking pods with required anti-affinity if the pod is deleted",
    "Duration": 0.204684041
  },
  {
    "Context": "HostPort Usage",
    "Test": "should ignore the host port usage conflict if the pod update is for an already tracked pod",
    "Duration": 0.20450341700000002
  },
  {
    "Context": "Taints/Unmanaged",
    "Test": "should consider ephemeral taints on an unmanaged node that isn't initialized",
    "Duration": 0.203996334
  },
  {
    "Context": "DaemonSet Controller",
    "Test": "should not update daemonsetCache when daemonset pod is not present",
    "Duration": 0.20390591700000002
  },
  {
    "Context": "Node Resource Level",
    "Test": "should count new pods bound to nodes",
    "Duration": 0.20370375000000002
  },
  {
    "Context": "Node Resource Level",
    "Test": "should track daemonset requested resources separately",
    "Duration": 0.20348437500000002
  },
  {
    "Context": "Node Resource Level",
    "Test": "should not add requests if the pod is terminal",
    "Duration": 0.20331995900000002
  },
  {
    "Context": "Pod Anti-Affinity",
    "Test": "should track pods with required anti-affinity",
    "Duration": 0.20288845800000002
  },
  {
    "Context": "Node Resource Level",
    "Test": "should subtract requests if the pod is deleted",
    "Duration": 0.202450916
  },
  {
    "Context": "Pod Healthy NodePool",
    "Test": "should not store pod schedulable time if the nodePool that pod is scheduled to does not have NodeRegistrationHealthy=true",
    "Duration": 0.20173200000000002
  },
  {
    "Context": "Pod Healthy NodePool",
    "Test": "should store pod schedulable time if the nodePool that pod is scheduled to has NodeRegistrationHealthy=true",
    "Duration": 0.20115470800000002
  },
  {
    "Context": "Pod Healthy NodePool",
    "Test": "should not update the pod schedulable time if it is already stored for a pod",
    "Duration": 0.200633417
  },
  {
    "Context": "HostPort Usage",
    "Test": "should hydrate the HostPort usage on a Node update",
    "Duration": 0.200623833
  },
  {
    "Context": "Taints/Unmanaged",
    "Test": "should consider ephemeral taints on an unmanaged node after the node is initialized",
    "Duration": 0.20019529100000003
  },
  {
    "Context": "DaemonSet Controller",
    "Test": "should update daemonsetCache when daemonset pod is created",
    "Duration": 0.200184084
  },
  {
    "Context": "HostPort Usage",
    "Test": "should maintain the host port usage state when receiving NodeClaim updates",
    "Duration": 0.20014462500000002
  },
  {
    "Context": "Pod Anti-Affinity",
    "Test": "should not track pods with preferred anti-affinity",
    "Duration": 0.200018791
  },
  {
    "Context": "Taints/Managed",
    "Test": "should consider startup taints on a managed node after the node is initialized",
    "Duration": 0.199725708
  },
  {
    "Context": "Consolidated State",
    "Test": "should cause consolidation state to change when a NodePool is updated",
    "Duration": 0.19933329200000002
  },
  {
    "Context": "Node Resource Level",
    "Test": "should count existing pods bound to nodes",
    "Duration": 0.19877312500000002
  },
  {
    "Context": "Node Resource Level",
    "Test": "should not count pods not bound to nodes",
    "Duration": 0.19083745900000001
  },
  {
    "Context": "Node Resource Level",
    "Test": "should stop tracking nodes that are deleted",
    "Duration": 0.127762958
  },
  {
    "Context": "Taints/Managed",
    "Test": "should not consider ephemeral taints on a managed node that isn't initialized",
    "Duration": 0.12012875
  },
  {
    "Context": "Node Deletion",
    "Test": "should not leak a state node when the NodeClaim and Node names match",
    "Duration": 0.11925704100000001
  },
  {
    "Context": "Cluster State Sync",
    "Test": "should emit cluster_state_unsynced_time_seconds metric when cluster state is unsynced",
    "Duration": 0.11726162500000001
  },
  {
    "Context": "Cluster State Sync",
    "Test": "shouldn't consider the cluster state synced if a nodeclaim is added manually with UpdateNodeClaim",
    "Duration": 0.115658333
  },
  {
    "Context": "Consolidated State",
    "Test": "should update the consolidated value when setting consolidation",
    "Duration": 0.1120165
  },
  {
    "Context": "Consolidated State",
    "Test": "should update the consolidated value when state timeout (5m) has passed and state hasn't changed",
    "Duration": 0.10490637500000001
  },
  {
    "Context": "DaemonSet Controller",
    "Test": "should delete daemonset in cache when daemonset is deleted",
    "Duration": 0.10126900000000001
  }
]

After

❯ ginkgo --json-report=report.json ./pkg/controllers/state/...
Running Suite: Controllers/State - /Users/schalor/github.com/testing/karpenter/pkg/controllers/state
====================================================================================================
Random Seed: 1741366973

Will run 58 of 58 specs
••••••••••••••••••••••••••••••••••••••••••••••••••••••••••

Ran 58 of 58 Specs in 94.072 seconds
SUCCESS! -- 58 Passed | 0 Failed | 0 Pending | 0 Skipped
PASS

Ginkgo ran 1 suite in 1m41.933752958s
Test Suite Passed

❯ cat report.json | jq -r '                                   
    [
        .[].SpecReports.[]?
        | select( .State | match("passed") )
        | select( .LeafNodeType | match("It"))
        | {
            Context: ( .ContainerHierarchyTexts | join("/")),
            Test: .LeafNodeText,
            Duration: (.RunTime * .000000001)
        }
    ]
    | sort_by( .Duration )
    | reverse
' | head -n 1000
[
  {
    "Context": "Cluster State Sync",
    "Test": "should consider the cluster state synced when a new node is added after the initial sync",
    "Duration": 26.270470375000002
  },
  {
    "Context": "Node Resource Level",
    "Test": "should nominate the node until the nomination time passes",
    "Duration": 21.205630875
  },
  {
    "Context": "Data Races",
    "Test": "should ensure that calling Synced() is valid while making updates to Nodes",
    "Duration": 12.189924958
  },
  {
    "Context": "Cluster State Sync",
    "Test": "should consider the cluster state synced when a combination of nodeclaims and node are tracked",
    "Duration": 2.4730675420000003
  },
  {
    "Context": "Cluster State Sync",
    "Test": "should consider the cluster state synced when nodes register provider id",
    "Duration": 2.461390958
  },
  {
    "Context": "DaemonSet Controller",
    "Test": "should only return daemonset pods from the daemonset cache",
    "Duration": 2.2364551660000003
  },
  {
    "Context": "Cluster State Sync",
    "Test": "should consider the cluster state synced when all nodeclaims are tracked",
    "Duration": 2.12737525
  },
  {
    "Context": "Cluster State Sync",
    "Test": "shouldn't consider the cluster state synced if a nodeclaim hasn't resolved its provider id",
    "Duration": 2.1203856670000003
  },
  {
    "Context": "Cluster State Sync",
    "Test": "shouldn't consider the cluster state synced if a nodeclaim isn't tracked",
    "Duration": 2.035253458
  },
  {
    "Context": "Cluster State Sync",
    "Test": "should consider the cluster state synced when the representation of nodes is the same",
    "Duration": 1.8687752080000002
  },
  {
    "Context": "Cluster State Sync",
    "Test": "should consider the cluster state synced when all nodes are tracked",
    "Duration": 1.6557027500000001
  },
  {
    "Context": "Cluster State Sync",
    "Test": "should consider the cluster state synced when nodes don't have provider id",
    "Duration": 1.612364208
  },
  {
    "Context": "Cluster State Sync",
    "Test": "shouldn't consider the cluster state synced if a node isn't tracked",
    "Duration": 1.5754032910000002
  },
  {
    "Context": "DaemonSet Controller",
    "Test": "should update daemonsetCache with the newest created pod",
    "Duration": 1.135020166
  },
  {
    "Context": "Node Resource Level",
    "Test": "should maintain a correct count of resource usage as pods are deleted/added",
    "Duration": 0.6766881250000001
  },
  {
    "Context": "Data Races",
    "Test": "should ensure that calling Synced() is valid while making updates to NodeClaims",
    "Duration": 0.644074125
  },
  {
    "Context": "Cluster State Sync",
    "Test": "shouldn't consider the cluster state synced if a nodeclaim without a providerID is deleted",
    "Duration": 0.30745204200000004
  },
  {
    "Context": "Volume Usage/Limits",
    "Test": "should maintain the volume usage state when receiving NodeClaim updates",
    "Duration": 0.249140084
  },
  {
    "Context": "Node Resource Level",
    "Test": "should mark node for deletion when node is deleted",
    "Duration": 0.23030691700000003
  },
  {
    "Context": "Node Resource Level",
    "Test": "should handle a node changing from no providerID to registering a providerID",
    "Duration": 0.228082833
  },
  {
    "Context": "Node Resource Level",
    "Test": "should track pods correctly if we miss events or they are consolidated",
    "Duration": 0.22086750000000002
  },
  {
    "Context": "Volume Usage/Limits",
    "Test": "should hydrate the volume usage on a Node update",
    "Duration": 0.210134917
  },
  {
    "Context": "Pod Healthy NodePool",
    "Test": "should delete the pod schedulable time if the pod is deleted",
    "Duration": 0.209311083
  },
  {
    "Context": "Node Resource Level",
    "Test": "should not count pods not bound to nodes",
    "Duration": 0.208510833
  },
  {
    "Context": "Taints/Managed",
    "Test": "should not consider ephemeral taints on a managed node that isn't initialized",
    "Duration": 0.20736725
  },
  {
    "Context": "Node Resource Level",
    "Test": "should not add requests if the pod is terminal",
    "Duration": 0.20655041700000001
  },
  {
    "Context": "Consolidated State",
    "Test": "should update the consolidated value when state timeout (5m) has passed and state hasn't changed",
    "Duration": 0.204290791
  },
  {
    "Context": "Taints/Managed",
    "Test": "should consider startup taints on a managed node after the node is initialized",
    "Duration": 0.20319125000000002
  },
  {
    "Context": "Pod Anti-Affinity",
    "Test": "should stop tracking pods with required anti-affinity if the pod is deleted",
    "Duration": 0.20299925000000002
  },
  {
    "Context": "Node Resource Level",
    "Test": "should track daemonset requested resources separately",
    "Duration": 0.20295116700000002
  },
  {
    "Context": "Pod Healthy NodePool",
    "Test": "should not update the pod schedulable time if it is already stored for a pod",
    "Duration": 0.20277504200000002
  },
  {
    "Context": "Taints/Managed",
    "Test": "should consider startup taints on a managed node that isn't initialized",
    "Duration": 0.202753375
  },
  {
    "Context": "Node Resource Level",
    "Test": "should count existing pods bound to nodes",
    "Duration": 0.20267641700000003
  },
  {
    "Context": "HostPort Usage",
    "Test": "should maintain the host port usage state when receiving NodeClaim updates",
    "Duration": 0.20263920900000001
  },
  {
    "Context": "Volume Usage/Limits",
    "Test": "should ignore the volume usage limits breach if the pod update is for an already tracked pod",
    "Duration": 0.202227125
  },
  {
    "Context": "Consolidated State",
    "Test": "should cause consolidation state to change when a NodePool is updated",
    "Duration": 0.202211208
  },
  {
    "Context": "Pod Anti-Affinity",
    "Test": "should not track pods with preferred anti-affinity",
    "Duration": 0.202162541
  },
  {
    "Context": "Node Resource Level",
    "Test": "should count new pods bound to nodes",
    "Duration": 0.20175125000000002
  },
  {
    "Context": "Pod Healthy NodePool",
    "Test": "should not store pod schedulable time if the nodePool that pod is scheduled to does not have NodeRegistrationHealthy=true",
    "Duration": 0.20135962500000001
  },
  {
    "Context": "DaemonSet Controller",
    "Test": "should update daemonsetCache when daemonset pod is created",
    "Duration": 0.200780334
  },
  {
    "Context": "DaemonSet Controller",
    "Test": "should not update daemonsetCache when daemonset pod is not present",
    "Duration": 0.20060170900000002
  },
  {
    "Context": "Taints/Unmanaged",
    "Test": "should consider ephemeral taints on an unmanaged node after the node is initialized",
    "Duration": 0.200499958
  },
  {
    "Context": "HostPort Usage",
    "Test": "should ignore the host port usage conflict if the pod update is for an already tracked pod",
    "Duration": 0.200295875
  },
  {
    "Context": "Taints/Managed",
    "Test": "should consider ephemeral taints on a managed node after the node is initialized",
    "Duration": 0.19996425
  },
  {
    "Context": "Pod Anti-Affinity",
    "Test": "should track pods with required anti-affinity",
    "Duration": 0.19960312500000002
  },
  {
    "Context": "Pod Anti-Affinity",
    "Test": "should handle events out of order",
    "Duration": 0.199547209
  },
  {
    "Context": "Node Deletion",
    "Test": "should not leak a state node when the NodeClaim and Node names match",
    "Duration": 0.19906916700000002
  },
  {
    "Context": "Taints/Unmanaged",
    "Test": "should consider ephemeral taints on an unmanaged node that isn't initialized",
    "Duration": 0.19889245800000002
  },
  {
    "Context": "Consolidated State",
    "Test": "should update the consolidated value when setting consolidation",
    "Duration": 0.198867125
  },
  {
    "Context": "Node Resource Level",
    "Test": "should subtract requests if the pod is deleted",
    "Duration": 0.19807433400000002
  },
  {
    "Context": "Node Resource Level",
    "Test": "should mark node for deletion when nodeclaim is deleted",
    "Duration": 0.197335125
  },
  {
    "Context": "HostPort Usage",
    "Test": "should hydrate the HostPort usage on a Node update",
    "Duration": 0.150627917
  },
  {
    "Context": "Pod Healthy NodePool",
    "Test": "should store pod schedulable time if the nodePool that pod is scheduled to has NodeRegistrationHealthy=true",
    "Duration": 0.134957125
  },
  {
    "Context": "Node Resource Level",
    "Test": "should stop tracking nodes that are deleted",
    "Duration": 0.12018370800000001
  },
  {
    "Context": "Cluster State Sync",
    "Test": "should emit cluster_state_unsynced_time_seconds metric when cluster state is unsynced",
    "Duration": 0.116527791
  },
  {
    "Context": "Cluster State Sync",
    "Test": "shouldn't consider the cluster state synced if a nodeclaim is added manually with UpdateNodeClaim",
    "Duration": 0.11324687500000001
  },
  {
    "Context": "Pod Ack",
    "Test": "should only mark pods as schedulable once",
    "Duration": 0.11292625
  },
  {
    "Context": "DaemonSet Controller",
    "Test": "should delete daemonset in cache when daemonset is deleted",
    "Duration": 0.10252354200000001
  }
]

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Mar 7, 2025
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: rschalo
Once this PR has been reviewed and has the lgtm label, please assign mwielgus for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Mar 7, 2025
@rschalo rschalo marked this pull request as draft March 7, 2025 17:41
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 7, 2025

rschalo commented Mar 7, 2025

Tests are racy. Closing until able to fix.

@rschalo rschalo closed this Mar 7, 2025