Skip to content

CA DRA: integrate DeltaSnapshotStore with dynamicresources.Snapshot #7681

Open
@towca

Description

Which component are you using?:

/area cluster-autoscaler
/area core-autoscaler
/wg device-management

Is your feature request designed to solve a problem? If so describe the problem this feature should solve.:

There are 2 ClusterSnapshotStore implementations with very different performance characteristics:

  • BasicSnapshotStore is a very simple, reference implementation that clones the whole state during Fork(). It's easy to understand and can be used e.g. in tests, but the complexity of operations is not optimized for the typical usage patterns during a Cluster Autoscaler loop. Not really intended for production use because of this.
  • DeltaSnapshotStore is a much more complex implementation, that branches and keeps deltas separately for every Fork(). The complexity of operations is optimized for typical Cluster Autoscaler usage patterns. This is the de-facto production implementation.

In order for DRA autoscaling to work, a ClusterSnapshotStore implementation has to integrate with dynamicresources.Snapshot. This means correctly handling the DRA snapshot during Fork()/Commit()/Revert().

For DRA autoscaling MVP, only BasicSnapshotStore was integrated with dynamicresources.Snapshot. It's pretty trivial in this case - we just need dynamicresources.Snapshot.Clone(). This was enough to test the MVP, but for production use we need to integrate DeltaSnapshotStore as well.

Describe the solution you'd like.:

  • Add the ability to chain multiple dynamicresources.Snapshot objects, where each object represents a delta from the previous one.
  • The chain can be queried/modified in the same way as a single dynamicresources.Snapshot. Queries fall back the chain and are expensive, modifications are applied to the top object in the chain and are cheap.
  • The chain can be consolidated into a single dynamicresources.Snapshot.
  • DeltaSnapshotStore keeps a chain of dynamicresources.Snapshot objects, and modifies the chain in the same way as the NodeInfo storage chain during Fork()/Commit()/Revert().
  • The chain is probably easiest to implement by turning dynamicresources.Snapshot into an interface and introducing another "DeltaChain" implementation that uses the current one internally.

Describe any alternative solutions you've considered.:

IMO we should avoid having two completely separate Basic/Delta implementations for dynamicresources.Snapshot like we do for ClusterSnapshotStore. This pattern leads to duplicating large portions of non-trivial code and extending ClusterSnapshotStore is more painful than it should because of it. The Delta/Chain implementation should use the Basic one internally instead.

Additional context.:

This is a part of Dynamic Resource Allocation (DRA) support in Cluster Autoscaler. An MVP of the support was implemented in #7530 (with the whole implementation tracked in kubernetes/kubernetes#118612). There are a number of post-MVP follow-ups to be addressed before DRA autoscaling is ready for production use - this is one of them.

Metadata

Assignees

No one assigned

    Labels

    area/cluster-autoscalerarea/core-autoscalerDenotes an issue that is related to the core autoscaler and is not specific to any provider.kind/featureCategorizes issue or PR as related to a new feature.wg/device-managementCategorizes an issue or PR as relevant to WG Device Management.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions