CA DRA: integrate DeltaSnapshotStore with dynamicresources.Snapshot #7681
Description
Which component are you using?:
/area cluster-autoscaler
/area core-autoscaler
/wg device-management
Is your feature request designed to solve a problem? If so describe the problem this feature should solve.:
There are 2 ClusterSnapshotStore
implementations with very different performance characteristics:
BasicSnapshotStore
is a very simple, reference implementation that clones the whole state duringFork()
. It's easy to understand and can be used e.g. in tests, but the complexity of operations is not optimized for the typical usage patterns during a Cluster Autoscaler loop. Not really intended for production use because of this.DeltaSnapshotStore
is a much more complex implementation, that branches and keeps deltas separately for everyFork()
. The complexity of operations is optimized for typical Cluster Autoscaler usage patterns. This is the de-facto production implementation.
In order for DRA autoscaling to work, a ClusterSnapshotStore
implementation has to integrate with dynamicresources.Snapshot
. This means correctly handling the DRA snapshot during Fork()/Commit()/Revert()
.
For DRA autoscaling MVP, only BasicSnapshotStore
was integrated with dynamicresources.Snapshot
. It's pretty trivial in this case - we just need dynamicresources.Snapshot.Clone()
. This was enough to test the MVP, but for production use we need to integrate DeltaSnapshotStore
as well.
Describe the solution you'd like.:
- Add the ability to chain multiple
dynamicresources.Snapshot
objects, where each object represents a delta from the previous one. - The chain can be queried/modified in the same way as a single
dynamicresources.Snapshot
. Queries fall back the chain and are expensive, modifications are applied to the top object in the chain and are cheap. - The chain can be consolidated into a single
dynamicresources.Snapshot
. DeltaSnapshotStore
keeps a chain ofdynamicresources.Snapshot
objects, and modifies the chain in the same way as the NodeInfo storage chain duringFork()/Commit()/Revert()
.- The chain is probably easiest to implement by turning
dynamicresources.Snapshot
into an interface and introducing another "DeltaChain" implementation that uses the current one internally.
Describe any alternative solutions you've considered.:
IMO we should avoid having two completely separate Basic/Delta implementations for dynamicresources.Snapshot
like we do for ClusterSnapshotStore
. This pattern leads to duplicating large portions of non-trivial code and extending ClusterSnapshotStore
is more painful than it should because of it. The Delta/Chain implementation should use the Basic one internally instead.
Additional context.:
This is a part of Dynamic Resource Allocation (DRA) support in Cluster Autoscaler. An MVP of the support was implemented in #7530 (with the whole implementation tracked in kubernetes/kubernetes#118612). There are a number of post-MVP follow-ups to be addressed before DRA autoscaling is ready for production use - this is one of them.