Introduce registry for caching and exposing TemplateNodeInfos #8911

Choraden · 2025-12-10T15:13:47Z

What type of PR is this?

/kind feature

What this PR does / why we need it:

This change introduces a new component, TemplateNodeInfoRegistry, which wraps the existing TemplateNodeInfoProvider. It caches the computed template NodeInfos and exposes them via a thread-safe interface.
This registry is added to the AutoscalingContext, allowing processors (like the DRA processor) to access the cached templates instead of relying on the less reliable NodeGroup.TemplateNodeInfo().
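
To make the description concrete, here is a minimal sketch of what such a caching wrapper could look like. It is not the code from this PR: `NodeInfo`, `Provider`, `Registry`, and `staticProvider` are simplified stand-ins for illustration, and keeping the previous cache when the provider fails is an assumption of the sketch.

```go
package main

import (
	"fmt"
	"sync"
)

// NodeInfo is a simplified stand-in for framework.NodeInfo (assumption).
type NodeInfo struct {
	NodeName string
}

// Provider is a stand-in for TemplateNodeInfoProvider. The real interface
// takes more arguments; this signature is hypothetical.
type Provider interface {
	Process() (map[string]*NodeInfo, error)
}

// Registry caches the provider's latest result behind a read-write lock.
type Registry struct {
	provider Provider

	lock      sync.RWMutex
	nodeInfos map[string]*NodeInfo
}

// NewRegistry returns a registry with an empty cache.
func NewRegistry(p Provider) *Registry {
	return &Registry{provider: p, nodeInfos: map[string]*NodeInfo{}}
}

// Recompute asks the provider for fresh templates and swaps them in.
// On error the previous cache is kept (an assumption of this sketch).
func (r *Registry) Recompute() error {
	infos, err := r.provider.Process()
	if err != nil {
		return err
	}
	r.lock.Lock()
	defer r.lock.Unlock()
	r.nodeInfos = infos
	return nil
}

// GetNodeInfo returns the cached template for a node group, if any.
func (r *Registry) GetNodeInfo(nodeGroupId string) (*NodeInfo, bool) {
	r.lock.RLock()
	defer r.lock.RUnlock()
	info, found := r.nodeInfos[nodeGroupId]
	return info, found
}

// staticProvider is a hypothetical provider that returns a fixed result.
type staticProvider map[string]*NodeInfo

func (p staticProvider) Process() (map[string]*NodeInfo, error) { return p, nil }

func main() {
	reg := NewRegistry(staticProvider{"ng1": {NodeName: "template-ng1"}})

	_, found := reg.GetNodeInfo("ng1")
	fmt.Println(found) // false: nothing is cached before the first Recompute

	if err := reg.Recompute(); err != nil {
		fmt.Println("recompute failed:", err)
		return
	}
	info, found := reg.GetNodeInfo("ng1")
	fmt.Println(found, info.NodeName) // true template-ng1
}
```

Readers take only the read lock, so concurrent processors do not block each other between Recompute calls.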

Which issue(s) this PR fixes:

Fixes #8881
Fixes #8882

Special notes for your reviewer:

--

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

linux-foundation-easycla · 2025-12-10T15:13:55Z

The committers listed above are authorized under a signed CLA.

✅ login: Choraden / name: Hubert Grochowski (610aa7f, a5e549d, e37ea47, f1ba828)

k8s-ci-robot · 2025-12-10T15:13:56Z

Welcome @Choraden!

It looks like this is your first PR to kubernetes/autoscaler 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/autoscaler has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

k8s-ci-robot · 2025-12-10T15:13:58Z

Hi @Choraden. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Choraden · 2025-12-17T13:25:00Z

/assign @towca

jackfrancis · 2025-12-19T01:02:56Z

/cherry-pick cluster-autoscaler-release-1.35

k8s-infra-cherrypick-robot · 2025-12-19T01:02:59Z

@jackfrancis: once the present PR merges, I will cherry-pick it on top of cluster-autoscaler-release-1.35 in a new PR and assign it to you.

In response to this:

/cherry-pick cluster-autoscaler-release-1.35

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

towca · 2025-12-23T15:28:46Z

cluster-autoscaler/context/autoscaling_context.go

+
+// TemplateNodeInfoRegistry is the interface for getting template node infos.
+type TemplateNodeInfoRegistry interface {
+	GetNodeInfo(id string) (*framework.NodeInfo, bool)


Just id as the param name is pretty ambiguous - I'd rename to nodeGroupId so that it's clear which id we mean here.

towca · 2025-12-23T15:41:30Z

cluster-autoscaler/processors/customresources/dra_processor.go

-			newReadyNodes = append(newReadyNodes, node)
-			klog.Warningf("Failed to get template node info for node group %s with error: %v", ng.Id(), err)
-			continue
+		var nodeInfo *framework.NodeInfo


IMO the function is getting a bit too complex to read with this addition (especially with the "continue on error" now being one indentation deeper). Could you extract the part determining the nodeInfo to a helper function? It'll also read much nicer on its own with returns instead of the variable.
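
One possible shape for the suggested helper is sketched below. The helper name `templateNodeInfo` and the types `mapRegistry` and `fakeGroup` are hypothetical; `NodeGroup` and `Registry` mirror only the methods quoted in this thread, and `NodeInfo` is a simplified stand-in for `framework.NodeInfo`.

```go
package main

import (
	"errors"
	"fmt"
)

// NodeInfo is a simplified stand-in for framework.NodeInfo (assumption).
type NodeInfo struct{ NodeName string }

// NodeGroup mirrors only the two NodeGroup methods used here.
type NodeGroup interface {
	Id() string
	TemplateNodeInfo() (*NodeInfo, error)
}

// Registry mirrors the getter discussed in this review.
type Registry interface {
	GetNodeInfo(nodeGroupId string) (*NodeInfo, bool)
}

// templateNodeInfo (hypothetical name) prefers the cached template and falls
// back to the node group's own template when the registry has no entry.
func templateNodeInfo(reg Registry, ng NodeGroup) (*NodeInfo, error) {
	if info, found := reg.GetNodeInfo(ng.Id()); found {
		return info, nil
	}
	return ng.TemplateNodeInfo()
}

// mapRegistry is a hypothetical in-memory registry for the demo.
type mapRegistry map[string]*NodeInfo

func (m mapRegistry) GetNodeInfo(nodeGroupId string) (*NodeInfo, bool) {
	info, found := m[nodeGroupId]
	return info, found
}

// fakeGroup is a hypothetical node group; a nil tmpl simulates a failure.
type fakeGroup struct {
	id   string
	tmpl *NodeInfo
}

func (g fakeGroup) Id() string { return g.id }

func (g fakeGroup) TemplateNodeInfo() (*NodeInfo, error) {
	if g.tmpl == nil {
		return nil, errors.New("no template")
	}
	return g.tmpl, nil
}

func main() {
	reg := mapRegistry{"ng1": {NodeName: "cached"}}
	groups := []NodeGroup{
		fakeGroup{id: "ng1", tmpl: &NodeInfo{NodeName: "raw"}}, // registry wins
		fakeGroup{id: "ng2", tmpl: &NodeInfo{NodeName: "raw"}}, // fallback
		fakeGroup{id: "ng3"},                                   // error path
	}
	for _, ng := range groups {
		info, err := templateNodeInfo(reg, ng)
		if err != nil {
			// The "continue on error" stays at the loop's top indentation level.
			fmt.Printf("skipping %s: %v\n", ng.Id(), err)
			continue
		}
		fmt.Printf("%s -> %s\n", ng.Id(), info.NodeName)
	}
}
```

With early returns in the helper, the caller no longer needs a pre-declared `nodeInfo` variable or nested branches.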

towca · 2025-12-23T15:51:09Z

cluster-autoscaler/processors/nodeinfosprovider/template_node_info_registry.go

+// TemplateNodeInfoRegistry is a component that stores and exposes template NodeInfos.
+// It is updated once per autoscaling loop iteration via Recompute() and provides a consistent view of node templates to all processors.
+type TemplateNodeInfoRegistry struct {
+	processor TemplateNodeInfoProvider


provider seems like a more accurate name

towca · 2025-12-23T15:54:07Z

cluster-autoscaler/processors/nodeinfosprovider/template_node_info_registry.go

+// It is updated once per autoscaling loop iteration via Recompute() and provides a consistent view of node templates to all processors.
+type TemplateNodeInfoRegistry struct {
+	processor TemplateNodeInfoProvider
+	nodeInfos map[string]*framework.NodeInfo


nit: Could you separate lock and nodeInfos by a newline, and move lock on top of nodeInfos? It's a visual convention that makes it easy to grasp which fields exactly a given mutex protects. Definitely not necessary here because of how simple the type is, but can be really helpful for more complex types with lots of fields.
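
As an illustration of that layout convention on a type with more fields, here is a small sketch. `counterStore` and its fields are hypothetical and unrelated to the PR's types.

```go
package main

import (
	"fmt"
	"sync"
)

// counterStore is a hypothetical type used only to show the field layout.
type counterStore struct {
	// Immutable after construction, so not guarded by any lock.
	name string

	// mu protects the fields directly below it, up to the next blank line.
	mu     sync.Mutex
	counts map[string]int
	total  int
}

// Add increments the counter for key and the overall total.
func (s *counterStore) Add(key string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.counts[key]++
	s.total++
}

// Total returns the overall number of Add calls.
func (s *counterStore) Total() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.total
}

func main() {
	s := &counterStore{name: "demo", counts: map[string]int{}}

	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			s.Add("a")
		}()
	}
	wg.Wait()

	fmt.Println(s.name, s.Total()) // demo 100
}
```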

towca · 2025-12-23T16:02:26Z

cluster-autoscaler/processors/nodeinfosprovider/template_node_info_registry.go

+	r.lock.RLock()
+	defer r.lock.RUnlock()
+	result := make(map[string]*framework.NodeInfo, len(r.nodeInfos))
+	maps.Copy(result, r.nodeInfos)


Wouldn't return maps.Clone(r.nodeInfos) be equivalent but more straightforward?

Good catch.
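
For reference, a small standalone comparison of the two forms, using plain `int` values instead of `*framework.NodeInfo`. Both produce an independent shallow copy; the one difference I am aware of is that `maps.Clone` preserves a nil map, while the `make` plus `maps.Copy` form always returns a non-nil map. The helper name `copyWithMake` is hypothetical.

```go
package main

import (
	"fmt"
	"maps"
)

// copyWithMake is the original pattern from the diff: allocate, then copy.
func copyWithMake(src map[string]int) map[string]int {
	dst := make(map[string]int, len(src))
	maps.Copy(dst, src)
	return dst
}

func main() {
	src := map[string]int{"ng1": 1, "ng2": 2}

	a, b := copyWithMake(src), maps.Clone(src)
	fmt.Println(maps.Equal(a, b)) // true

	// Both results are independent of the source map.
	b["ng3"] = 3
	fmt.Println(len(src)) // 2

	// The forms differ only for a nil source map.
	var empty map[string]int
	fmt.Println(copyWithMake(empty) == nil, maps.Clone(empty) == nil) // false true
}
```

If the registry's map is always initialized before the getter runs, the nil case does not arise and the two forms behave the same for callers.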

towca · 2025-12-23T16:05:58Z

cluster-autoscaler/processors/nodeinfosprovider/template_node_info_registry_test.go

+	// Test GetNodeInfo
+	info, found := registry.GetNodeInfo("ng1")
+	assert.True(t, found)
+	assert.NotNil(t, info)


I'd assert some actual NodeInfo field (e.g. Node.Name) instead of just non-nil which could technically pass if the provider returned a wrong NodeInfo.
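
A minimal sketch of why the non-nil check is weak. The types are simplified stand-ins (the real `framework.NodeInfo` has a different shape), and `buggyGet`, `weakCheck`, and `strongCheck` are hypothetical names.

```go
package main

import "fmt"

// Node and NodeInfo are simplified stand-ins (assumption).
type Node struct{ Name string }
type NodeInfo struct{ Node *Node }

// buggyGet simulates a lookup that ignores its argument and always returns
// the template of "ng2".
func buggyGet(infos map[string]*NodeInfo, nodeGroupId string) (*NodeInfo, bool) {
	info, found := infos["ng2"] // bug: nodeGroupId is ignored
	return info, found
}

// weakCheck mirrors assert.True(found) plus assert.NotNil(info).
func weakCheck(info *NodeInfo, found bool) bool {
	return found && info != nil
}

// strongCheck additionally compares a field that identifies the template.
func strongCheck(info *NodeInfo, found bool, wantName string) bool {
	return found && info != nil && info.Node != nil && info.Node.Name == wantName
}

func main() {
	infos := map[string]*NodeInfo{
		"ng1": {Node: &Node{Name: "template-ng1"}},
		"ng2": {Node: &Node{Name: "template-ng2"}},
	}
	info, found := buggyGet(infos, "ng1")

	fmt.Println(weakCheck(info, found))                   // true: the bug goes unnoticed
	fmt.Println(strongCheck(info, found, "template-ng1")) // false: the bug is caught
}
```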

towca · 2025-12-23T16:33:44Z

cluster-autoscaler/context/autoscaling_context.go

+
+// TemplateNodeInfoRegistry is the interface for getting template node infos.
+type TemplateNodeInfoRegistry interface {
+	GetNodeInfo(id string) (*framework.NodeInfo, bool)


Could you add comments to the interface methods? We need to document:

What exactly is returned (template NodeInfo for a given NodeGroup, as computed by TemplateNodeInfoProvider), since NodeInfo on its own is ambiguous - e.g. a NodeInfo also represents an existing Node in the cluster correlated with its scheduled Pods (obtained from the ClusterSnapshot).

How up-to-date the results of the getters are (or in other words, how the internal caching works) - something like "The results are updated during the Recompute() call near the beginning of the main Cluster Autoscaler loop - and cached until the next Recompute() call. The getters can be used by logic that happens before the Recompute() call in the main CA loop - but the caller has to handle no results during the first CA loop."

What the expectations for the returned objects are - the results of GetNodeInfo() are read-only, the map returned by GetNodeInfos() itself can be modified, but its values are read-only.

All methods are thread-safe and can be called from separate goroutines.

Added comments.
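
For readers without the diff at hand, the following sketch shows how the four points above might be written as doc comments. The wording is an illustration based on the review comment, not the comments that were actually added; `NodeInfo` and `emptyRegistry` are stand-ins.

```go
package main

import "fmt"

// NodeInfo is a simplified stand-in for framework.NodeInfo (assumption).
type NodeInfo struct{ NodeName string }

// TemplateNodeInfoRegistry exposes template NodeInfos, one per NodeGroup, as
// computed by TemplateNodeInfoProvider. These are templates for new Nodes,
// not NodeInfos of existing Nodes taken from the ClusterSnapshot.
//
// Results are refreshed by Recompute() near the beginning of each autoscaling
// loop and cached until the next Recompute(). Code that runs before the first
// Recompute() must handle missing results.
//
// All methods are thread-safe and can be called from separate goroutines.
type TemplateNodeInfoRegistry interface {
	// GetNodeInfo returns the cached template for the given NodeGroup id and
	// whether an entry exists. The returned NodeInfo is read-only.
	GetNodeInfo(nodeGroupId string) (*NodeInfo, bool)

	// GetNodeInfos returns a copy of the cache keyed by NodeGroup id. The map
	// itself can be modified by the caller, but its values are read-only.
	GetNodeInfos() map[string]*NodeInfo
}

// emptyRegistry is a hypothetical implementation modelling the state before
// the first Recompute().
type emptyRegistry struct{}

func (emptyRegistry) GetNodeInfo(string) (*NodeInfo, bool) { return nil, false }

func (emptyRegistry) GetNodeInfos() map[string]*NodeInfo { return map[string]*NodeInfo{} }

func main() {
	var reg TemplateNodeInfoRegistry = emptyRegistry{}
	if _, found := reg.GetNodeInfo("ng1"); !found {
		fmt.Println("no template yet; the caller has to handle this case")
	}
}
```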

towca · 2025-12-23T16:41:59Z

cluster-autoscaler/processors/customresources/dra_processor.go

-			klog.Warningf("Failed to get template node info for node group %s with error: %v", ng.Id(), err)
-			continue
+		var nodeInfo *framework.NodeInfo
+		if autoscalingCtx.TemplateNodeInfoRegistry != nil {


I think we can rely on autoscalingCtx.TemplateNodeInfoRegistry not being nil if we always set it during NewStaticAutoscaler. Having the nil check suggests to the reader that this field can be nil, but that should never happen, right?

Right, that should never be nil.
I just wanted to be safe from panics if the initial configuration is missing the registry by some mistake.

Removed the nil check.

towca · 2025-12-23T16:47:33Z

cluster-autoscaler/processors/customresources/dra_processor_test.go

 				"node_7": true,
 			},
 		},
+		"Custom DRA driver retrieved via cached template node info": {


Given that the new logic always prefers the registry over NodeGroup.TemplateNodeInfo(), IMO we should "revert" the testing pattern here:

Rewrite the existing test cases to use the registry instead of NodeGroup.TemplateNodeInfo()

Add new test cases testing the fallback to NodeGroup.TemplateNodeInfo() if there is no entry in the registry (1 case with both defined -> registry preferred + 1 case with just the NodeGroup.TemplateNodeInfo() defined and no entry in the registry -> fallback)

towca · 2025-12-23T16:59:14Z

cluster-autoscaler/processors/test/common.go

 // NewTestProcessors returns a set of simple processors for use in tests.
+// Note: This function injects a default TemplateNodeInfoRegistry into the provided AutoscalingContext.
+// This is a necessary workaround for synthetic tests that manually construct the context without using NewStaticAutoscaler, ensuring they have access to the registry.
 func NewTestProcessors(autoscalingCtx *ca_context.AutoscalingContext) *processors.AutoscalingProcessors {


I get that this was the easiest change to make the tests pass, but unfortunately little hacks like these make the tests really hard to understand and extend.

Looking at the usages of this function, it's ~always called after NewScaleTestAutoscalingContext(). IMO the order should be switched, like it's in the prod path - processors are a dependency of the context, not the other way around. NewScaleTestAutoscalingContext() should either take the processors as parameter, or call NewTestProcessors() internally. NewTestProcessors() technically depends on the full context now, but it only uses a small subset of it - config.AutoscalingOptions - which is also used as a parameter to NewScaleTestAutoscalingContext(). Have you explored something like that?

I decided to:

decouple NewTestProcessors from autoscalingCtx and depend only on config.AutoscalingOptions

update NewScaleTestAutoscalingContext to accept TemplateNodeInfoRegistry as in the original NewAutoscalingContext

reorder test initialization: create options -> create processors & registry -> create context

This aligns the test setup with the production architecture and improves readability and safety.

Adding it in a separate commit to streamline review. Let me know if you want it squashed eventually.
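
The resulting initialization order can be sketched as follows. All types and constructor names here are simplified, hypothetical stand-ins for `config.AutoscalingOptions`, `NewTestProcessors`, and `NewScaleTestAutoscalingContext`; the real signatures take more parameters.

```go
package main

import "fmt"

// Hypothetical stand-ins for the real types.
type AutoscalingOptions struct{ MaxNodesTotal int }

type Processors struct{ opts AutoscalingOptions }

type Registry struct{ nodeInfos map[string]string }

type AutoscalingContext struct {
	Options                  AutoscalingOptions
	TemplateNodeInfoRegistry *Registry
}

// newTestProcessors depends only on the options and returns the processors
// together with the registry built on top of them.
func newTestProcessors(opts AutoscalingOptions) (*Processors, *Registry) {
	return &Processors{opts: opts}, &Registry{nodeInfos: map[string]string{}}
}

// newTestContext receives the registry as a dependency instead of having it
// injected after construction.
func newTestContext(opts AutoscalingOptions, reg *Registry) *AutoscalingContext {
	return &AutoscalingContext{Options: opts, TemplateNodeInfoRegistry: reg}
}

func main() {
	// 1. Options first.
	opts := AutoscalingOptions{MaxNodesTotal: 10}
	// 2. Processors and registry, built from options alone.
	procs, reg := newTestProcessors(opts)
	// 3. Context last, with the registry passed in.
	ctx := newTestContext(opts, reg)

	fmt.Println(procs != nil, ctx.TemplateNodeInfoRegistry == reg) // true true
}
```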

…NodeInfos This change introduces a new component, TemplateNodeInfoRegistry, which wraps the existing TemplateNodeInfoProvider. It caches the computed template NodeInfos and exposes them via a thread-safe interface. This registry is added to the AutoscalingContext, allowing processors (like the DRA processor) to access the cached templates instead of relying on the less reliable NodeGroup.TemplateNodeInfo().

…gistry

Key changes:

- Updated NewScaleTestAutoscalingContext to accept TemplateNodeInfoRegistry as a parameter.
- Refactored NewTestProcessors to take AutoscalingOptions and return both Processors and TemplateNodeInfoRegistry.
- Reordered test initialization to follow the production path: Options -> Processors/Registry -> AutoscalingContext.

These changes improve testing readability and extendability by ensuring a consistent setup of the autoscaling environment with the production logic.

The DRACustomResourcesProcessor now attempts to retrieve NodeInfo from the TemplateNodeInfoRegistry before falling back to the NodeGroup. This ensures the processor uses the canonical TemplateNodeInfo for the current autoscaling loop. Crucially, this preserves any enrichments (such as custom DRA resource slices) that are computed during the registry's Recompute phase but might be absent in a fresh, raw template from the CloudProvider.

k8s-ci-robot · 2025-12-29T15:00:21Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Choraden
Once this PR has been reviewed and has the lgtm label, please ask for approval from towca. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

cluster-autoscaler/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Choraden

@towca I've addressed your comments. PTAL

k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. kind/feature Categorizes issue or PR as related to a new feature. labels Dec 10, 2025

k8s-ci-robot added the do-not-merge/needs-area label Dec 10, 2025

k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Dec 10, 2025

k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. area/cluster-autoscaler labels Dec 10, 2025

k8s-ci-robot requested review from vadasambar and x13n December 10, 2025 15:14

k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed do-not-merge/needs-area cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Dec 10, 2025

Choraden force-pushed the template_node_info_registry_v1 branch from f9c0302 to ad96941 Compare December 10, 2025 15:35

k8s-ci-robot removed the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Dec 10, 2025

Choraden marked this pull request as draft December 11, 2025 07:32

k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 11, 2025

Choraden force-pushed the template_node_info_registry_v1 branch from ad96941 to 4fb808c Compare December 17, 2025 12:46

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 17, 2025

Choraden force-pushed the template_node_info_registry_v1 branch from 4fb808c to 7f36de5 Compare December 17, 2025 13:16

k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 17, 2025

Choraden marked this pull request as ready for review December 17, 2025 13:24

k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 17, 2025

k8s-ci-robot requested review from aleksandra-malinowska and elmiko December 17, 2025 13:24

k8s-ci-robot assigned towca Dec 17, 2025

jackfrancis mentioned this pull request Dec 18, 2025

Cluster Autoscaler 1.35.0 Release tracking issue #8960

Open

5 tasks

towca reviewed Dec 23, 2025

Choraden added 4 commits December 29, 2025 14:35

Refactor static autoscaler run once to use TemplateNodeInfosRegistry

a5e549d

Choraden force-pushed the template_node_info_registry_v1 branch from 7f36de5 to f1ba828 Compare December 29, 2025 15:00

k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 29, 2025
