Introduce registry for caching and exposing TemplateNodeInfos #8911
/assign @towca
/cherry-pick cluster-autoscaler-release-1.35
@jackfrancis: once the present PR merges, I will cherry-pick it on top of …
// TemplateNodeInfoRegistry is the interface for getting template node infos.
type TemplateNodeInfoRegistry interface {
	GetNodeInfo(id string) (*framework.NodeInfo, bool)
Just id as the param name is pretty ambiguous - I'd rename to nodeGroupId so that it's clear which id we mean here.
newReadyNodes = append(newReadyNodes, node)
klog.Warningf("Failed to get template node info for node group %s with error: %v", ng.Id(), err)
continue
var nodeInfo *framework.NodeInfo
IMO the function is getting a bit too complex to read with this addition (especially with the "continue on error" now being one indentation deeper). Could you extract the part determining the nodeInfo to a helper function? It'll also read much nicer on its own with returns instead of the variable.
// TemplateNodeInfoRegistry is a component that stores and exposes template NodeInfos.
// It is updated once per autoscaling loop iteration via Recompute() and provides a consistent view of node templates to all processors.
type TemplateNodeInfoRegistry struct {
	processor TemplateNodeInfoProvider
provider seems like a more accurate name
// It is updated once per autoscaling loop iteration via Recompute() and provides a consistent view of node templates to all processors.
type TemplateNodeInfoRegistry struct {
	processor TemplateNodeInfoProvider
	nodeInfos map[string]*framework.NodeInfo
nit: Could you separate lock and nodeInfos by a newline, and move lock on top of nodeInfos? It's a visual convention that makes it easy to grasp which fields exactly a given mutex protects. Definitely not necessary here because of how simple the type is, but can be really helpful for more complex types with lots of fields.
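To illustrate the convention on a type where it matters more, here is a hedged sketch with two mutexes. `templateCache` and all of its fields are hypothetical and not part of this PR:

```go
package main

import (
	"fmt"
	"sync"
)

// templateCache is a hypothetical type used only to show the field layout.
type templateCache struct {
	name string // set once at construction; needs no lock

	// templatesLock guards only the field directly below it.
	templatesLock sync.RWMutex
	templates     map[string]string

	// statsLock guards hits and misses.
	statsLock sync.Mutex
	hits      int
	misses    int
}

// lookup reads the template under templatesLock, then records the outcome
// under statsLock. Each lock is held only around the fields it protects.
func (c *templateCache) lookup(nodeGroupId string) (string, bool) {
	c.templatesLock.RLock()
	tmpl, found := c.templates[nodeGroupId]
	c.templatesLock.RUnlock()

	c.statsLock.Lock()
	defer c.statsLock.Unlock()
	if found {
		c.hits++
	} else {
		c.misses++
	}
	return tmpl, found
}

func main() {
	c := &templateCache{name: "example", templates: map[string]string{"ng1": "template-ng1"}}
	c.lookup("ng1")
	c.lookup("missing")
	fmt.Println(c.name, c.hits, c.misses)
}
```

The blank lines between groups make it visible at a glance that `name` is unguarded and which mutex covers which fields.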
r.lock.RLock()
defer r.lock.RUnlock()
result := make(map[string]*framework.NodeInfo, len(r.nodeInfos))
maps.Copy(result, r.nodeInfos)
Wouldn't return maps.Clone(r.nodeInfos) be equivalent but more straightforward?
Good catch.
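For reference, a small sketch comparing the two forms using the standard library `maps` package (Go 1.21+). `NodeInfo` is a stand-in type and the two function names are made up for the comparison. Both produce a shallow copy; the one difference I am aware of is the nil case, where `maps.Clone` returns nil and the `make` + `maps.Copy` form returns an empty, non-nil map:

```go
package main

import (
	"fmt"
	"maps"
)

// NodeInfo is a stand-in for framework.NodeInfo (assumption for this sketch).
type NodeInfo struct{ NodeName string }

// copyWithMake mirrors the original two-step form.
func copyWithMake(src map[string]*NodeInfo) map[string]*NodeInfo {
	result := make(map[string]*NodeInfo, len(src))
	maps.Copy(result, src)
	return result
}

// copyWithClone is the suggested one-liner.
func copyWithClone(src map[string]*NodeInfo) map[string]*NodeInfo {
	return maps.Clone(src)
}

func main() {
	src := map[string]*NodeInfo{"ng1": &NodeInfo{NodeName: "template-ng1"}}
	a, b := copyWithMake(src), copyWithClone(src)

	// Shallow copies: the map is new, the pointed-to values are shared.
	fmt.Println(a["ng1"] == src["ng1"], b["ng1"] == src["ng1"])

	// Deleting from a copy leaves the source untouched.
	delete(a, "ng1")
	delete(b, "ng1")
	fmt.Println(len(src), len(a), len(b))

	// The nil case is where the two forms differ.
	fmt.Println(copyWithMake(nil) == nil, copyWithClone(nil) == nil)
}
```

If callers only range over or index the result, the nil difference is not observable.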
// Test GetNodeInfo
info, found := registry.GetNodeInfo("ng1")
assert.True(t, found)
assert.NotNil(t, info)
I'd assert some actual NodeInfo field (e.g. Node.Name) instead of just non-nil which could technically pass if the provider returned a wrong NodeInfo.
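A rough sketch of the difference, written with plain comparisons so it stays self-contained (the real test uses testify's `assert`). `NodeInfo`, `registry`, and `checkNodeInfo` are hypothetical stand-ins:

```go
package main

import "fmt"

// NodeInfo is a stand-in for framework.NodeInfo (assumption for this sketch).
type NodeInfo struct{ NodeName string }

// registry is a hypothetical map-backed lookup.
type registry map[string]*NodeInfo

func (r registry) GetNodeInfo(nodeGroupId string) (*NodeInfo, bool) {
	info, found := r[nodeGroupId]
	return info, found
}

// checkNodeInfo verifies a concrete field instead of only non-nil, and
// returns an error describing the first mismatch.
func checkNodeInfo(r registry, nodeGroupId, wantNodeName string) error {
	info, found := r.GetNodeInfo(nodeGroupId)
	if !found || info == nil {
		return fmt.Errorf("no template NodeInfo for %q", nodeGroupId)
	}
	if info.NodeName != wantNodeName {
		return fmt.Errorf("node name for %q: got %q, want %q", nodeGroupId, info.NodeName, wantNodeName)
	}
	return nil
}

func main() {
	// A buggy setup that stored ng2's template under ng1: found and non-nil,
	// so a NotNil-only assertion would pass.
	r := registry{"ng1": &NodeInfo{NodeName: "template-ng2"}}
	fmt.Println(checkNodeInfo(r, "ng1", "template-ng1"))
}
```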
// TemplateNodeInfoRegistry is the interface for getting template node infos.
type TemplateNodeInfoRegistry interface {
	GetNodeInfo(id string) (*framework.NodeInfo, bool)
Could you add comments to the interface methods? We need to document:
- What exactly is returned (template NodeInfo for a given NodeGroup, as computed by TemplateNodeInfoProvider), since `NodeInfo` on its own is ambiguous - e.g. a NodeInfo also represents an existing Node in the cluster correlated with its scheduled Pods (obtained from the ClusterSnapshot).
- How up-to-date the results of the getters are (or in other words - how internal caching works) - something like "The results are updated during the Recompute() call near the beginning of the main Cluster Autoscaler loop - and cached until the next Recompute() call. The getters can be used by logic that happens before the Recompute() call in the main CA loop - but the caller has to handle no results during the first CA loop."
- What the expectations for the returned objects are - the results of `GetNodeInfo()` are read-only, the map returned by `GetNodeInfos()` itself can be modified, but its values are read-only.
- All methods are thread-safe and can be called from separate goroutines.
Added comments.
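As an illustration of the four points above, the interface comments could read roughly as follows. This is a sketch, not the wording that was actually committed. `NodeInfo` is a stand-in for `framework.NodeInfo`, and `staticRegistry` is a hypothetical implementation added only so the example runs:

```go
package main

import (
	"fmt"
	"maps"
)

// NodeInfo is a stand-in for framework.NodeInfo (assumption for this sketch).
type NodeInfo struct{ NodeName string }

// TemplateNodeInfoRegistry exposes template NodeInfos, i.e. the NodeInfo a new
// Node in a given NodeGroup would have, as computed by TemplateNodeInfoProvider.
// Results are refreshed by Recompute() near the start of each loop iteration and
// cached until the next call; before the first Recompute() there are no results.
// All methods are safe to call from multiple goroutines.
type TemplateNodeInfoRegistry interface {
	// GetNodeInfo returns the cached template NodeInfo for the given NodeGroup
	// id, or false if there is none. The returned object is read-only.
	GetNodeInfo(nodeGroupId string) (*NodeInfo, bool)
	// GetNodeInfos returns all cached template NodeInfos keyed by NodeGroup id.
	// The returned map may be modified by the caller; its values are read-only.
	GetNodeInfos() map[string]*NodeInfo
}

// staticRegistry is a hypothetical, never-mutated implementation.
type staticRegistry struct{ nodeInfos map[string]*NodeInfo }

func (r staticRegistry) GetNodeInfo(nodeGroupId string) (*NodeInfo, bool) {
	info, found := r.nodeInfos[nodeGroupId]
	return info, found
}

func (r staticRegistry) GetNodeInfos() map[string]*NodeInfo {
	// Returning a copy is what makes "the map may be modified" safe.
	return maps.Clone(r.nodeInfos)
}

func main() {
	var reg TemplateNodeInfoRegistry = staticRegistry{
		nodeInfos: map[string]*NodeInfo{"ng1": &NodeInfo{NodeName: "template-ng1"}},
	}
	all := reg.GetNodeInfos()
	delete(all, "ng1") // only the caller's copy changes
	_, found := reg.GetNodeInfo("ng1")
	fmt.Println(found)
}
```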
klog.Warningf("Failed to get template node info for node group %s with error: %v", ng.Id(), err)
continue
var nodeInfo *framework.NodeInfo
if autoscalingCtx.TemplateNodeInfoRegistry != nil {
I think we can rely on autoscalingCtx.TemplateNodeInfoRegistry not being nil if we always set it during NewStaticAutoscaler. Having the nil check suggests to the reader that this field can be nil, but that should never happen, right?
Right, that should never be nil.
I just wanted to be safe from panics if the initial configuration is missing the registry by some mistake.
Removed the nil check.
		"node_7": true,
	},
},
"Custom DRA driver retrieved via cached template node info": {
Given that the new logic always prefers the registry over NodeGroup.TemplateNodeInfo(), IMO we should "revert" the testing pattern here:
- Rewrite the existing test cases to use the registry instead of `NodeGroup.TemplateNodeInfo()`
- Add new test cases testing the fallback to `NodeGroup.TemplateNodeInfo()` if there is no entry in the registry (1 case with both defined -> registry preferred + 1 case with just the `NodeGroup.TemplateNodeInfo()` defined and no entry in the registry -> fallback)
Done.
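The two requested fallback cases could be laid out as a name-keyed table along these lines. This is only a sketch of the case structure: `NodeInfo`, `pickTemplate`, and `testCase` are hypothetical, and the real tests exercise the DRA processor rather than a two-argument function:

```go
package main

import "fmt"

// NodeInfo is a stand-in; Source records where the template came from.
type NodeInfo struct{ Source string }

// pickTemplate is a hypothetical reduction of the lookup under test:
// a nil fromRegistry models "no entry in the registry".
func pickTemplate(fromRegistry, fromNodeGroup *NodeInfo) *NodeInfo {
	if fromRegistry != nil {
		return fromRegistry
	}
	return fromNodeGroup
}

type testCase struct {
	fromRegistry  *NodeInfo
	fromNodeGroup *NodeInfo
	wantSource    string
}

// testCases mirrors the name-keyed table style of the existing tests.
func testCases() map[string]testCase {
	return map[string]testCase{
		"both defined: registry preferred": {
			fromRegistry:  &NodeInfo{Source: "registry"},
			fromNodeGroup: &NodeInfo{Source: "node group"},
			wantSource:    "registry",
		},
		"no registry entry: fallback to node group": {
			fromNodeGroup: &NodeInfo{Source: "node group"},
			wantSource:    "node group",
		},
	}
}

func main() {
	for name, tc := range testCases() {
		got := pickTemplate(tc.fromRegistry, tc.fromNodeGroup)
		fmt.Printf("%s: got=%q ok=%t\n", name, got.Source, got.Source == tc.wantSource)
	}
}
```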
// NewTestProcessors returns a set of simple processors for use in tests.
// Note: This function injects a default TemplateNodeInfoRegistry into the provided AutoscalingContext.
// This is a necessary workaround for synthetic tests that manually construct the context without using NewStaticAutoscaler, ensuring they have access to the registry.
func NewTestProcessors(autoscalingCtx *ca_context.AutoscalingContext) *processors.AutoscalingProcessors {
I get that this was the easiest change to make the tests pass, but unfortunately little hacks like these make the tests really hard to understand and extend.
Looking at the usages of this function, it's ~always called after NewScaleTestAutoscalingContext(). IMO the order should be switched, like it's in the prod path - processors are a dependency of the context, not the other way around. NewScaleTestAutoscalingContext() should either take the processors as parameter, or call NewTestProcessors() internally. NewTestProcessors() technically depends on the full context now, but it only uses a small subset of it - config.AutoscalingOptions - which is also used as a parameter to NewScaleTestAutoscalingContext(). Have you explored something like that?
I decided to:
- decouple `NewTestProcessors` from `autoscalingCtx` and depend only on `config.AutoscalingOptions`
- update `NewScaleTestAutoscalingContext` to accept `TemplateNodeInfoRegistry` as in the original `NewAutoscalingContext`
- reorder test initialization: create options -> create processors & registry -> create context
This aligns the test setup with the production architecture and improves readability and safety.
Adding it in a separate commit to streamline review. Let me know if you want it squashed eventually.
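The resulting setup order can be sketched like this. All types and both constructors below are simplified stand-ins (the real `NewTestProcessors` and `NewScaleTestAutoscalingContext` take more parameters), and `MaxNodesTotal` is used only as an example option:

```go
package main

import "fmt"

// Simplified stand-ins for the real types (assumptions for this sketch).
type AutoscalingOptions struct{ MaxNodesTotal int }

type Processors struct{ options AutoscalingOptions }

type Registry struct{ nodeInfos map[string]string }

type AutoscalingContext struct {
	Options  AutoscalingOptions
	Registry *Registry
}

// newTestProcessors depends only on the options and returns both the
// processors and the registry built on top of them.
func newTestProcessors(options AutoscalingOptions) (*Processors, *Registry) {
	return &Processors{options: options}, &Registry{nodeInfos: map[string]string{}}
}

// newTestContext receives the registry as a dependency instead of having
// one injected into it after the fact.
func newTestContext(options AutoscalingOptions, registry *Registry) *AutoscalingContext {
	return &AutoscalingContext{Options: options, Registry: registry}
}

func main() {
	// Step 1: options. Step 2: processors and registry. Step 3: context.
	options := AutoscalingOptions{MaxNodesTotal: 10}
	processors, registry := newTestProcessors(options)
	ctx := newTestContext(options, registry)
	fmt.Println(processors.options == ctx.Options, ctx.Registry == registry)
}
```

Because the context never exists without a registry in this order, no constructor needs to mutate a context it was handed.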
…NodeInfos This change introduces a new component, TemplateNodeInfoRegistry, which wraps the existing TemplateNodeInfoProvider. It caches the computed template NodeInfos and exposes them via a thread-safe interface. This registry is added to the AutoscalingContext, allowing processors (like the DRA processor) to access the cached templates instead of relying on the less reliable NodeGroup.TemplateNodeInfo().
…gistry
Key changes:
- Updated NewScaleTestAutoscalingContext to accept TemplateNodeInfoRegistry as a parameter.
- Refactored NewTestProcessors to take AutoscalingOptions and return both Processors and TemplateNodeInfoRegistry.
- Reordered test initialization to follow the production path: Options -> Processors/Registry -> AutoscalingContext.
These changes improve testing readability and extendability by ensuring a consistent setup of the autoscaling environment with the production logic.
The DRACustomResourcesProcessor now attempts to retrieve NodeInfo from the TemplateNodeInfoRegistry before falling back to the NodeGroup. This ensures the processor uses the canonical TemplateNodeInfo for the current autoscaling loop. Crucially, this preserves any enrichments (such as custom DRA resource slices) that are computed during the registry's Recompute phase but might be absent in a fresh, raw template from the CloudProvider.
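A condensed sketch of the caching behaviour these commits describe: a registry that wraps a provider, refreshes on `Recompute()`, and serves reads under a lock. The `provider` interface here is a hypothetical zero-argument simplification of `TemplateNodeInfoProvider`, `NodeInfo` is a stand-in type, and keeping the previous cache when the provider fails is an assumption of this sketch, not something the PR text states:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// NodeInfo is a stand-in for framework.NodeInfo (assumption for this sketch).
type NodeInfo struct{ NodeName string }

// provider is a hypothetical simplification of TemplateNodeInfoProvider.
type provider interface {
	Process() (map[string]*NodeInfo, error)
}

type registry struct {
	provider provider

	lock      sync.RWMutex
	nodeInfos map[string]*NodeInfo
}

// Recompute asks the provider for fresh templates and swaps them in.
// Assumption: on error the previously cached templates are kept.
func (r *registry) Recompute() error {
	fresh, err := r.provider.Process()
	if err != nil {
		return err
	}
	r.lock.Lock()
	defer r.lock.Unlock()
	r.nodeInfos = fresh
	return nil
}

// GetNodeInfo serves the cached template; it never calls the provider.
func (r *registry) GetNodeInfo(nodeGroupId string) (*NodeInfo, bool) {
	r.lock.RLock()
	defer r.lock.RUnlock()
	info, found := r.nodeInfos[nodeGroupId]
	return info, found
}

// countingProvider is a test double that versions its output per call.
type countingProvider struct {
	calls int
	fail  bool
}

func (p *countingProvider) Process() (map[string]*NodeInfo, error) {
	if p.fail {
		return nil, errors.New("provider failed")
	}
	p.calls++
	name := fmt.Sprintf("template-v%d", p.calls)
	return map[string]*NodeInfo{"ng1": &NodeInfo{NodeName: name}}, nil
}

func main() {
	p := &countingProvider{}
	r := &registry{provider: p}

	_, found := r.GetNodeInfo("ng1")
	fmt.Println("before first Recompute, found:", found)

	if err := r.Recompute(); err != nil {
		fmt.Println("recompute failed:", err)
		return
	}
	first, _ := r.GetNodeInfo("ng1")
	second, _ := r.GetNodeInfo("ng1")
	fmt.Println(first.NodeName, first == second, "provider calls:", p.calls)
}
```

Every reader within one loop iteration sees the same object, which is what lets enrichments applied during the recompute phase reach later processors.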
Choraden
left a comment
@towca I've addressed your comments. PTAL
What type of PR is this?
/kind feature
What this PR does / why we need it:
This change introduces a new component, `TemplateNodeInfoRegistry`, which wraps the existing `TemplateNodeInfoProvider`. It caches the computed template NodeInfos and exposes them via a thread-safe interface.
This registry is added to the AutoscalingContext, allowing processors (like the DRA processor) to access the cached templates instead of relying on the less reliable `NodeGroup.TemplateNodeInfo()`.
Which issue(s) this PR fixes:
Fixes #8881
Fixes #8882
Special notes for your reviewer:
--
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: