Scale testing #8814

@sbueringer

Description

We recently merged the initial iteration of the in-memory provider (#8799), but this was just the first step of the scale test implementation. This issue provides an overview of ongoing and upcoming tasks around scale testing.

In-Memory provider features:

e2e test and test framework:

  • Implement scale test automation:
    • Cluster topologies:
      • Small workload cluster: x * (1 control-plane + 1 worker node)
      • Small-medium workload cluster: x * (3 control-plane + 10 worker nodes)
      • Medium workload cluster: x * (3 control-plane + 50 worker nodes)
      • Large workload cluster: x * (3 control-plane + 500 worker nodes)
        • Dimensions: # of MachineDeployments
    • Scenarios:
      • P0 Create & delete (🌱 Add Scale e2e - development only #8833 @ykakarap)
      • Create, upgrade & delete @killianmuldoon
      • Long-lived clusters (~ a few hours or a day, to catch memory leaks etc.)
      • Chaos testing: e.g. injecting failures such as an unreachable cluster or failing machines
      • More complex scenarios: e.g. topology is actively changed (MD scale up etc.)
      • Add MachineHealthCheck to the scaling test (@ykakarap)
  • Automate scale testing in CI (prior art: KCP, k/k):
    • Metric collection and consumption after test completion
    • Tests should fail based on SLAs (e.g. machine creation slower than x minutes); see the sketch below
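
A minimal sketch of what such an SLA check could look like, assuming Gomega and the clusterv1 API; verifyMachineCreationSLA is a hypothetical helper (not an existing framework function) that approximates provisioning time as the span from Machine creation to the Ready condition's last transition:

```go
package e2e

import (
	"context"
	"time"

	. "github.com/onsi/gomega"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/conditions"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// verifyMachineCreationSLA fails the test if any Machine took longer than
// threshold to become ready (approximated via the Ready condition).
func verifyMachineCreationSLA(ctx context.Context, c client.Client, threshold time.Duration) {
	machines := &clusterv1.MachineList{}
	Expect(c.List(ctx, machines)).To(Succeed())
	for i := range machines.Items {
		m := &machines.Items[i]
		ready := conditions.Get(m, clusterv1.ReadyCondition)
		Expect(ready).ToNot(BeNil(), "Machine %s has no Ready condition", m.Name)
		took := ready.LastTransitionTime.Sub(m.CreationTimestamp.Time)
		Expect(took).To(BeNumerically("<", threshold),
			"Machine %s took %s to become ready, SLA is %s", m.Name, took, threshold)
	}
}
```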

Metrics & observability:

Performance improvements:

Follow-up:

Anomalies found that we should further triage:

  • /convert gets called a lot (even though we never use old apiVersions)
  • When deploying > 1k clusters into a single namespace, "list machines" calls in KCP become pretty slow and apiserver CPU usage gets very high (8-14 CPUs). (Debug ideas: CPU profile, apiserver tracing; see the sketch below)
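
For the CPU-profile idea, a minimal sketch using client-go's raw REST access, assuming profiling is enabled on the apiserver and the caller has RBAC for the /debug/pprof/* non-resource URL; dumpAPIServerCPUProfile is a hypothetical helper name, and the output can be inspected with `go tool pprof`:

```go
package main

import (
	"context"
	"os"

	"k8s.io/client-go/kubernetes"
)

// dumpAPIServerCPUProfile captures a 30s CPU profile from the apiserver's
// pprof endpoint and writes it to disk for analysis with `go tool pprof`.
func dumpAPIServerCPUProfile(ctx context.Context, cs kubernetes.Interface) error {
	data, err := cs.CoreV1().RESTClient().Get().
		AbsPath("/debug/pprof/profile").
		Param("seconds", "30").
		DoRaw(ctx)
	if err != nil {
		return err
	}
	return os.WriteFile("apiserver-cpu.pprof", data, 0o600)
}
```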

Backlog improvement ideas:

  • KCP:
    • (breaking change): create an issue requiring all KCP secrets to have the cluster-name label => then configure the KCP cache & client to only cache secrets with that label (see the secret-caching sketch after this list)
    • EnsureResource: resources are cached at the moment. Consider caching only PartialObjectMetadata instead (see the metadata-only sketch after this list).
    • Consider caching only the pods we care about (at least control plane pods; check whether we access other pods, e.g. kube-proxy, CoreDNS)
    • GetMachinesForCluster: use a cached call plus safeguards that wait for the cache to be up-to-date
    • Optimize etcd client creation (cache clients instead of recreating them)
  • Others:
    • Change all CAPI controllers to cache unstructured objects per default and use the APIReader for uncached calls, as we do for regular typed objects (see the unstructured-caching sketch after this list)
    • Audit all usages of APIReader to check whether they are actually necessary
    • Run certain operations less frequently (e.g. apiVersion bumps, reconciling labels)
    • Customize the controller work queue rate limiter (see the rate-limiter sketch after this list)
    • Buffered reconciling (avoid frequently reconciling the same item within a short period of time)
    • Spread resyncs of items out over time instead of resyncing everything at once at resyncPeriod
      • Investigate whether a Reconciler re-reconciles all objects for every type it is watching (because resync is implemented on the informer level), e.g. whether the KCP controller reconciles after the KCP resync and again after the Cluster resync.
    • Priority queue
    • Use the controller-runtime transform option to strip parts of objects we don't use (fields which are not part of the contract); see the transform sketch after this list
      • Trade-off: memory vs. processing time to strip fields; it is also unclear how to configure this up front, before we know the CRDs
      • => Based on the data we have so far we don't know if it's worth it, so we won't do it for now.
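
For the secret-caching idea above, a minimal sketch of a label-filtered cache, assuming controller-runtime v0.15+ (cache.Options.ByObject) and the clusterv1.ClusterNameLabel constant; newManagerWithFilteredSecretCache is a hypothetical name:

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func newManagerWithFilteredSecretCache() (ctrl.Manager, error) {
	// Only cache Secrets that carry the cluster-name label; everything else
	// stays out of the informer cache, cutting memory usage.
	secretSelector, err := labels.Parse(clusterv1.ClusterNameLabel)
	if err != nil {
		return nil, err
	}
	return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Cache: cache.Options{
			ByObject: map[client.Object]cache.ByObject{
				&corev1.Secret{}: {Label: secretSelector},
			},
		},
	})
}
```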
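
For the EnsureResource idea, a sketch of reading only object metadata through the cached client, which makes controller-runtime start a metadata-only informer for the GVK instead of caching full objects; getConfigMapMeta is an illustrative example, not KCP code:

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// getConfigMapMeta fetches only the metadata of a ConfigMap. The cached
// client then watches PartialObjectMetadata for this GVK, so only labels,
// annotations, etc. are kept in memory instead of the full object.
func getConfigMapMeta(ctx context.Context, c client.Client, key client.ObjectKey) (*metav1.PartialObjectMetadata, error) {
	meta := &metav1.PartialObjectMetadata{}
	meta.SetGroupVersionKind(corev1.SchemeGroupVersion.WithKind("ConfigMap"))
	if err := c.Get(ctx, key, meta); err != nil {
		return nil, err
	}
	return meta, nil
}
```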
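
For the unstructured-caching idea, a sketch of the relevant manager configuration, assuming controller-runtime v0.15+ (client.CacheOptions.Unstructured); uncached reads then go through mgr.GetAPIReader() as usual:

```go
package main

import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func newManagerCachingUnstructured() (ctrl.Manager, error) {
	// Serve unstructured Get/List calls from the informer cache, like typed
	// objects; anything that must bypass the cache uses mgr.GetAPIReader().
	return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Client: client.Options{
			Cache: &client.CacheOptions{
				Unstructured: true,
			},
		},
	})
}
```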
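
For the work-queue rate-limiter idea, a sketch of per-controller tuning, assuming controller-runtime v0.15's controller.Options.RateLimiter (newer versions use typed work queues) and the client-go workqueue helpers; the concrete limits are made up for illustration, not tuned recommendations:

```go
package main

import (
	"time"

	"golang.org/x/time/rate"
	"k8s.io/client-go/util/workqueue"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

func setupWithCustomRateLimiter(mgr ctrl.Manager, r reconcile.Reconciler) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&clusterv1.Machine{}).
		WithOptions(controller.Options{
			// Per-item exponential backoff capped at 5m, combined with an
			// overall limit of 10 requeues/s (burst 100), to smooth out
			// reconcile bursts at scale.
			RateLimiter: workqueue.NewMaxOfRateLimiter(
				workqueue.NewItemExponentialFailureRateLimiter(time.Second, 5*time.Minute),
				&workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(10), 100)},
			),
		}).
		Complete(r)
}
```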
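
And although the transform idea is parked for now, for reference a minimal sketch of stripping fields via a cache transform, assuming controller-runtime's cache.Options.DefaultTransform; this example only drops managedFields, the cheapest generic case:

```go
package main

import (
	"k8s.io/apimachinery/pkg/api/meta"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
)

func newManagerWithTransform() (ctrl.Manager, error) {
	return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Cache: cache.Options{
			// Strip fields we never read before objects enter the cache;
			// trades a little CPU on every event for lower steady-state memory.
			DefaultTransform: func(obj interface{}) (interface{}, error) {
				if accessor, err := meta.Accessor(obj); err == nil {
					accessor.SetManagedFields(nil)
				}
				return obj, nil
			},
		},
	})
}
```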
