
backend(k8cache): filter gvrList with important resources to spawning less go routines for the given context #4670

Open
upsaurav12 wants to merge 1 commit into kubernetes-sigs:main from upsaurav12:check-for-changes

Conversation

Contributor

@upsaurav12 upsaurav12 commented Feb 10, 2026

Summary

This PR filters the gvrList down to the important, most commonly used resources, resulting in fewer goroutines being spawned.

Related Issue

Related: #3854

Changes

  • Added filterImportantResources inside CheckForChanges

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Feb 10, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: upsaurav12
Once this PR has been reviewed and has the lgtm label, please assign joaquimrocha for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@upsaurav12 upsaurav12 changed the title from "fix" to "fix: filter gvrList with important resources to make sure less go routines" Feb 10, 2026
@upsaurav12 upsaurav12 changed the title from "fix: filter gvrList with important resources to make sure less go routines" to "fix: filter gvrList with important resources to spawning less go routines for the given context" Feb 10, 2026
@upsaurav12
Contributor Author

@illume ^^

Contributor

@illume illume left a comment

Thanks for this.

I’m a bit confused… Maybe the issue this fixes should be linked to one of the sub-issues? Or maybe it needs a new issue? I guess this could do with a test to help me understand the issue; I’m not sure how to reproduce it. Also, maybe writing the issue this fixes in the commit message body would help me understand it better.

Could you please squash and rebase out the merge commit? The commit message should follow conventions.

Contributor Author

upsaurav12 commented Feb 10, 2026

Currently the number of goroutines is around 400 (because there are 54 resource informers), but with this PR it is now ~100 per context. Do you think this number of goroutines is fine? @illume
To reproduce it, you would need to add something that checks the number of goroutines, such as runtime.NumGoroutine().
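
For illustration, a minimal sketch of that kind of check, assuming a hypothetical startInformers callback standing in for whatever starts the watchers (e.g. k8cache.CheckForChanges):

package main

import (
	"fmt"
	"runtime"
	"time"
)

// measureGoroutines samples runtime.NumGoroutine() before and after a
// function that starts the informers, to show how many goroutines the
// call leaves running.
func measureGoroutines(startInformers func()) (before, after int) {
	before = runtime.NumGoroutine()
	startInformers()
	// Give the informers a moment to spin up their watch goroutines.
	time.Sleep(2 * time.Second)
	after = runtime.NumGoroutine()
	return before, after
}

func main() {
	before, after := measureGoroutines(func() {
		// e.g. k8cache.CheckForChanges(...) would be called here.
	})
	fmt.Printf("goroutines before: %d, after: %d, delta: %d\n", before, after, after-before)
}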

Contributor

Copilot AI left a comment

Pull request overview

This PR aims to reduce the number of Kubernetes informer goroutines spawned per context by filtering the discovered GroupVersionResource list down to a small set of “important” resources before setting up watches in k8cache.

Changes:

  • Filter the discovered gvrList to an allowlist before starting informers.
  • Add filterImportantResources helper to centralize the allowlist logic.


Comment on lines +138 to +144
// filtering the gvrList to make sure spawning go routines for only important
// resources which are required to be watched and cached,
// this will help to reduce the performance issue and resource utilization.
filtered := filterImportantResources(gvrList)

return filtered
}
Copilot AI Feb 10, 2026

returnGVRList now filters the watch/invalidation GVRs down to a small allowlist, but the cache middleware still stores GET responses for all resources (see StoreK8sResponseInCache usage in backend/cmd/server.go). This means cached responses for non-allowlisted resources will no longer be promptly invalidated on external changes and can remain stale until TTL expiry. Consider either restricting caching to the same allowlist, or implementing a lazy/per-request watcher registration so any cached resource also gets an informer (or keep watching all list+watch resources).

Comment on lines +138 to +144
// filtering the gvrList to make sure spawning go routines for only important
// resources which are required to be watched and cached,
// this will help to reduce the performance issue and resource utilization.
filtered := filterImportantResources(gvrList)

return filtered
}
Copilot AI Feb 10, 2026

The doc comment above returnGVRList says it returns “all the supported resources that are supported by the k8s server”, but after this change it returns only the filtered subset. Update the comment to reflect the new behavior so callers don’t assume full coverage.

Comment on lines +146 to +171
// filterImportantResources filters the provided list of GroupVersionResources to
// include only those that are deemed important for caching and watching.
// This helps reduce the number of resources we watch and cache,
// improving performance and resource utilization.
func filterImportantResources(gvrList []schema.GroupVersionResource) []schema.GroupVersionResource {
allowed := map[string]struct{}{
"pods": {},
"services": {},
"deployments": {},
"replicasets": {},
"statefulsets": {},
"daemonsets": {},
"nodes": {},
"configmaps": {},
"secrets": {},
"jobs": {},
"cronjobs": {},
}

filtered := make([]schema.GroupVersionResource, 0, len(allowed))

for _, gvr := range gvrList {
if _, ok := allowed[gvr.Resource]; ok {
filtered = append(filtered, gvr)
}
}
Copilot AI Feb 10, 2026

filterImportantResources introduces behavior that directly affects cache invalidation correctness/performance, but there are no unit tests covering the filtering (and its interaction with returnGVRList). Add tests that verify the allowlist behavior (including that non-allowlisted resources are excluded) to prevent regressions when the list changes.
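
For illustration, a minimal sketch of such a test, assuming the package is k8cache and that filterImportantResources has the signature shown in the diff:

package k8cache

import (
	"testing"

	"k8s.io/apimachinery/pkg/runtime/schema"
)

// TestFilterImportantResources checks that allowlisted resources pass
// through and non-allowlisted ones are dropped.
func TestFilterImportantResources(t *testing.T) {
	input := []schema.GroupVersionResource{
		{Group: "", Version: "v1", Resource: "pods"},
		{Group: "apps", Version: "v1", Resource: "deployments"},
		{Group: "events.k8s.io", Version: "v1", Resource: "events"}, // not allowlisted
	}

	filtered := filterImportantResources(input)

	if len(filtered) != 2 {
		t.Fatalf("expected 2 resources, got %d", len(filtered))
	}
	for _, gvr := range filtered {
		if gvr.Resource == "events" {
			t.Errorf("non-allowlisted resource %q was not filtered out", gvr.Resource)
		}
	}
}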

Comment on lines +141 to +143
filtered := filterImportantResources(gvrList)

return filtered
Copilot AI Feb 10, 2026

PR description says the filter was added “inside the CheckForChanges”, but the change is actually applied in returnGVRList (affecting all watcher setups). If the intent is different (e.g., conditional filtering), either adjust the implementation or update the PR description to match the code.

@upsaurav12 upsaurav12 changed the title from "fix: filter gvrList with important resources to spawning less go routines for the given context" to "backend(k8cache): filter gvrList with important resources to spawning less go routines for the given context" Feb 10, 2026
@upsaurav12 upsaurav12 force-pushed the check-for-changes branch 2 times, most recently from 5ca07ab to 264a1cb, February 10, 2026 16:32
@illume illume requested a review from Copilot February 10, 2026 16:50
Contributor

illume commented Feb 10, 2026

@upsaurav12 thanks for the explanation, and for adjusting the commits.

Btw, you could remove some of the Signed-off-by lines in the commit message body, because we don’t use those here.

Currently the number of goroutines is around 400 (because there are 54 resource informers), but with this PR it is now ~100 per context. Do you think this number of goroutines is fine?

This is a very nice improvement. With 10 clusters that’s about 1000.

I wonder if there’s an upper limit set?

I also wonder when they get cleaned up. I mean, if someone uses it, 1000 watches are set up… how long before the server stops the watchers when the user hasn’t visited? They’re always listening, right? I think that’s fine if the admin has asked for it.

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.



Comment on lines +134 to +136
// filtering the gvrList to make sure spawning go routines for only important
// resources which are required to be watched and cached,
// this will help to reduce the performance issue and resource utilization.
Copilot AI Feb 10, 2026

The comments here have grammar/spelling issues (e.g., "make sure spawning go routines" and "go routines" -> "ensure we spawn goroutines"). Please rewrite for clarity since these comments explain a behavioral change that affects correctness.

Suggested change
// filtering the gvrList to make sure spawning go routines for only important
// resources which are required to be watched and cached,
// this will help to reduce the performance issue and resource utilization.
// Filter the gvrList so that we spawn goroutines only for important resources
// that need to be watched and cached. This helps improve performance and
// reduce resource utilization.

Comment on lines +142 to +170
// filterImportantResources filters the provided list of GroupVersionResources to
// include only those that are deemed important for caching and watching.
// This helps reduce the number of resources we watch and cache,
// improving performance and resource utilization.
func filterImportantResources(gvrList []schema.GroupVersionResource) []schema.GroupVersionResource {
allowed := map[string]struct{}{
"pods": {},
"services": {},
"deployments": {},
"replicasets": {},
"statefulsets": {},
"daemonsets": {},
"nodes": {},
"configmaps": {},
"secrets": {},
"jobs": {},
"cronjobs": {},
}

filtered := make([]schema.GroupVersionResource, 0, len(allowed))

for _, gvr := range gvrList {
if _, ok := allowed[gvr.Resource]; ok {
filtered = append(filtered, gvr)
}
}

// return the filtered list of GroupVersionResources that are important for caching and watching
return filtered
Copilot AI Feb 10, 2026

New filtering behavior isn’t covered by tests. There are existing tests for this package (e.g., cache invalidation and informer wiring), but none validate that the GVR discovery output is filtered as intended. Add a unit test for filterImportantResources (and/or returnGVRList) to prevent regressions and to document the expected allowlist behavior.

Comment on lines +134 to +141
// filtering the gvrList to make sure spawning go routines for only important
// resources which are required to be watched and cached,
// this will help to reduce the performance issue and resource utilization.
filtered := filterImportantResources(gvrList)

return filtered
}

Copilot AI Feb 10, 2026

Filtering the discovered GVRs down to a hard-coded allowlist changes cache invalidation semantics: the cache layer stores GET responses for arbitrary resource URLs, but changes to any resource not in allowed will no longer trigger informer events to invalidate cached entries (except via TTL or non-GET requests through Headlamp). This can lead to serving stale data for those resource kinds. Consider either (a) restricting caching to the same allowlist, (b) making the watched resources configurable, or (c) expanding the watched set to include all kinds that can be cached/are used by the UI (e.g., namespaces, ingresses, pv/pvc, rbac objects, etc.).

Suggested change
// filtering the gvrList to make sure spawning go routines for only important
// resources which are required to be watched and cached,
// this will help to reduce the performance issue and resource utilization.
filtered := filterImportantResources(gvrList)
return filtered
}
// Return the full list of listable/watchable resources so that cache
// invalidation covers all kinds that may be cached by the middleware.
// This avoids serving stale data for resources that would otherwise
// not be watched due to a hard-coded allowlist.
return gvrList
}
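
As an illustration of option (b) from the comment above, a hypothetical sketch that lets admins override the allowlist through an environment variable; the variable name HEADLAMP_CACHE_WATCH_RESOURCES and the fallback behavior are assumptions, not part of this PR:

package k8cache

import (
	"os"
	"strings"
)

// allowedResources returns the resource names to watch, defaulting to the
// hard-coded allowlist but allowing an override via an (assumed)
// HEADLAMP_CACHE_WATCH_RESOURCES environment variable, e.g.
// "pods,services,namespaces".
func allowedResources() map[string]struct{} {
	names := []string{
		"pods", "services", "deployments", "replicasets", "statefulsets",
		"daemonsets", "nodes", "configmaps", "secrets", "jobs", "cronjobs",
	}
	if env := os.Getenv("HEADLAMP_CACHE_WATCH_RESOURCES"); env != "" {
		names = strings.Split(env, ",")
	}

	allowed := make(map[string]struct{}, len(names))
	for _, n := range names {
		allowed[strings.TrimSpace(n)] = struct{}{}
	}

	return allowed
}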

@upsaurav12
Contributor Author

@illume done 👍

Contributor

illume commented Feb 10, 2026

@upsaurav12 great thanks.

Btw, I haven’t checked the copilot review comments yet. Probably by tomorrow I can get to them.

Contributor

illume commented Feb 10, 2026

P.S. The issue in the PR description should be changed to “Related” rather than “Fixes”, because “Fixes” closes the issue.

@upsaurav12
Contributor Author

This is a very nice improvement. With 10 clusters that’s about 1000.

I wonder if there’s an upper limit set?

Also wonder when they get cleaned up? I mean if someone uses it, 1000 watches are set up… how long before the server stops the watchers when the user hasn’t visited? They’re always listening right? I think that’s fine if the admin has asked for it.

@illume is spawning 1000 goroutines per context the serious concern, or is it the informers watching continuously? Just asking.
Currently I don't see any time-based storing of watchers in cacheInvalidation.go, so maybe I should add it. As for 1000s of watchers across 10 clusters, I'm not sure (maybe I can figure out how to reduce it further), but for one resource, say "pods", it will spawn 3-4 watchers.

Contributor

illume commented Feb 10, 2026

I mean, that would be 10 cluster contexts for 1000. The only way to find out for sure is to test it. But this reduces it a lot, so that’s a nice improvement already. I was just thinking out loud, trying to understand it better.

Currently I don't see any time-based storing of watchers in cacheInvalidation.go, so maybe I should add it.

Yeah, let’s not start that yet. There’s one mode where we want all changes coming in so the cache is as synced as possible. Maybe time-based cleanup of watchers is something to consider later.

Contributor Author

upsaurav12 commented Feb 10, 2026

I mean, that would be 10 cluster contexts for 1000. The only way to find out for sure is to test it. But this reduces it a lot, so that’s a nice improvement already. I was just thinking out loud, trying to understand it better.

Yep, will add some tests to reproduce it.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Feb 11, 2026
@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 11, 2026
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Feb 11, 2026
@upsaurav12
Contributor Author

@illume I have added the integration tests; not sure if this is the right approach to test this part. Maybe it would be better if I added a test that varies the number of resources (currently I am testing the goroutines for the filtered resources and 1 context), testing goroutines for multiple contexts, i.e. 10 or 20, with a dynamic resource list.

…er of go routines for informers

Currently around 400 goroutines are being spawned, because in
CheckForChanges there are 54 informers, so too many goroutines are
spawned per context. With this PR, the list is filtered down to the
resources that are important or widely used, such as secrets, nodes,
pods, etc.
Contributor

illume commented Feb 11, 2026

Looks like the test is failing?

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.



Comment on lines +151 to +168
allowed := map[string]struct{}{
"pods": {},
"services": {},
"deployments": {},
"replicasets": {},
"statefulsets": {},
"daemonsets": {},
"nodes": {},
"configmaps": {},
"secrets": {},
"jobs": {},
"cronjobs": {},
}

filtered := make([]schema.GroupVersionResource, 0, len(allowed))

for _, gvr := range gvrList {
if _, ok := allowed[gvr.Resource]; ok {
Copilot AI Feb 11, 2026

The new allow-list only keys off gvr.Resource. That means any API group that happens to use the same resource name will be included (even if it shouldn't be), and it also makes it easy to miss commonly used resources (e.g., namespaces, ingresses, serviceaccounts, roles) without any compile-time signal. Consider keying by full GroupResource (group+resource) and/or centralizing this list with clear rationale and tests so future updates don’t silently change invalidation behavior.

Suggested change
allowed := map[string]struct{}{
"pods": {},
"services": {},
"deployments": {},
"replicasets": {},
"statefulsets": {},
"daemonsets": {},
"nodes": {},
"configmaps": {},
"secrets": {},
"jobs": {},
"cronjobs": {},
}
filtered := make([]schema.GroupVersionResource, 0, len(allowed))
for _, gvr := range gvrList {
if _, ok := allowed[gvr.Resource]; ok {
// Use full GroupResource (group + resource) for the allow-list so that
// different API groups that happen to use the same resource name are not
// conflated. Core resources use the empty group name "".
allowed := map[schema.GroupResource]struct{}{
{Group: "", Resource: "pods"}: {},
{Group: "", Resource: "services"}: {},
{Group: "", Resource: "deployments"}: {},
{Group: "", Resource: "replicasets"}: {},
{Group: "", Resource: "statefulsets"}: {},
{Group: "", Resource: "daemonsets"}: {},
{Group: "", Resource: "nodes"}: {},
{Group: "", Resource: "configmaps"}: {},
{Group: "", Resource: "secrets"}: {},
{Group: "", Resource: "jobs"}: {},
{Group: "", Resource: "cronjobs"}: {},
}
filtered := make([]schema.GroupVersionResource, 0, len(allowed))
for _, gvr := range gvrList {
gr := schema.GroupResource{Group: gvr.Group, Resource: gvr.Resource}
if _, ok := allowed[gr]; ok {
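
One caveat with the suggestion above: the workload controllers (deployments, replicasets, statefulsets, daemonsets) live in the apps group and jobs/cronjobs in batch, not the core ("") group. A corrected sketch of a group-qualified allowlist (the helper name importantGroupResources is hypothetical):

package k8cache

import "k8s.io/apimachinery/pkg/runtime/schema"

// importantGroupResources keys the allowlist by full GroupResource, using
// the API groups these resources actually belong to.
func importantGroupResources() map[schema.GroupResource]struct{} {
	return map[schema.GroupResource]struct{}{
		{Group: "", Resource: "pods"}:             {},
		{Group: "", Resource: "services"}:         {},
		{Group: "", Resource: "nodes"}:            {},
		{Group: "", Resource: "configmaps"}:       {},
		{Group: "", Resource: "secrets"}:          {},
		{Group: "apps", Resource: "deployments"}:  {},
		{Group: "apps", Resource: "replicasets"}:  {},
		{Group: "apps", Resource: "statefulsets"}: {},
		{Group: "apps", Resource: "daemonsets"}:   {},
		{Group: "batch", Resource: "jobs"}:        {},
		{Group: "batch", Resource: "cronjobs"}:    {},
	}
}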

Comment on lines +146 to 175
// filterImportantResources filters the provided list of GroupVersionResources to
// include only those that are deemed important for caching and watching.
// This helps reduce the number of resources we watch and cache,
// improving performance and resource utilization.
func filterImportantResources(gvrList []schema.GroupVersionResource) []schema.GroupVersionResource {
allowed := map[string]struct{}{
"pods": {},
"services": {},
"deployments": {},
"replicasets": {},
"statefulsets": {},
"daemonsets": {},
"nodes": {},
"configmaps": {},
"secrets": {},
"jobs": {},
"cronjobs": {},
}

filtered := make([]schema.GroupVersionResource, 0, len(allowed))

for _, gvr := range gvrList {
if _, ok := allowed[gvr.Resource]; ok {
filtered = append(filtered, gvr)
}
}

// return the filtered list of GroupVersionResources that are important for caching and watching
return filtered
}
Copilot AI Feb 11, 2026

filterImportantResources is new behavior but there’s no unit test coverage ensuring the filter returns the expected subset and preserves Group/Version fields for allowed resources. Since this package already has cacheInvalidation_test.go, adding tests for returnGVRList/filterImportantResources would help prevent accidental regressions (e.g., typos in resource names or missed resources).

Comment on lines +1794 to +1812
time.Sleep(3 * time.Second)

running := runtime.NumGoroutine()

time.Sleep(3 * time.Second)

after := runtime.NumGoroutine()

expected := false
afterExpected := false

if (running >= 50 && running < 200) && (after >= 50 && after < 200) {
expected = true
afterExpected = true
}

assert.Equal(t, 3, start)
assert.Equal(t, running >= 50 && running < 200, expected)
assert.Equal(t, after >= 50 && after < 200, afterExpected)
Copilot AI Feb 11, 2026

This test is very likely to fail and/or be meaningless:

  • runtime.NumGoroutine() is not stable across Go versions/test runners, so assert.Equal(t, 3, start) is unreliable.
  • expected/afterExpected are derived from the same boolean being asserted, making the assertions tautological (they’ll pass regardless of goroutine count).
Consider asserting on a delta from start (or an upper bound) and using require.Eventually to wait for informers to start/settle, rather than hard-coded absolute counts.
Suggested change
time.Sleep(3 * time.Second)
running := runtime.NumGoroutine()
time.Sleep(3 * time.Second)
after := runtime.NumGoroutine()
expected := false
afterExpected := false
if (running >= 50 && running < 200) && (after >= 50 && after < 200) {
expected = true
afterExpected = true
}
assert.Equal(t, 3, start)
assert.Equal(t, running >= 50 && running < 200, expected)
assert.Equal(t, after >= 50 && after < 200, afterExpected)
// Wait until at least one additional goroutine is started, or time out.
require.Eventually(t, func() bool {
return runtime.NumGoroutine() > start
}, 10*time.Second, 100*time.Millisecond)
running := runtime.NumGoroutine()
// Allow some time for background work to settle.
time.Sleep(3 * time.Second)
after := runtime.NumGoroutine()
// Assert on deltas rather than absolute goroutine counts, which are
// environment-dependent. We expect some increase but no unbounded growth.
const maxDelta = 500
deltaRunning := running - start
if deltaRunning < 0 {
deltaRunning = 0
}
deltaAfter := after - start
if deltaAfter < 0 {
deltaAfter = 0
}
assert.GreaterOrEqual(t, deltaRunning, 1, "expected at least one additional goroutine after request")
assert.LessOrEqual(t, deltaRunning, maxDelta, "too many goroutines shortly after request")
assert.LessOrEqual(t, deltaAfter, maxDelta, "too many goroutines after settling time")

Contributor Author

@illume would this assertion be a good way to accurately check how many goroutines are spawned?

Comment on lines +1786 to +1793
apiPath := "/clusters/" + clusterName + "/api/v1/pods"
ctx := context.Background()

resp, err := httpRequestWithContext(ctx, ts.URL+apiPath, "GET")
require.NoError(t, err)

defer resp.Body.Close()

Copilot AI Feb 11, 2026

The test doesn’t verify the request succeeded (status code) or that cache watching was actually started for the context. If authorization fails in handleCacheAuthorization, CheckForChanges can be skipped entirely, making goroutine assertions unrelated to the behavior under test. Add an explicit require.Equal(http.StatusOK, resp.StatusCode) (or expected status) and ensure the request path/headers exercise the codepath that calls k8cache.CheckForChanges.
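
A minimal sketch of that status check, slotting into the quoted test right after the request; it assumes testify's require package and a net/http import, and http.StatusOK is an assumed expectation:

// Fail fast if the request was rejected; otherwise CheckForChanges may
// never have run and the goroutine assertions below would be meaningless.
require.Equal(t, http.StatusOK, resp.StatusCode)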

@upsaurav12
Contributor Author

@illume well, I assumed there would be only 3 at the beginning:

expected: 3
actual: 175

because of this part below. I assumed the number of goroutines at this point was predictable, but it isn't.

assert.Equal(t, 3, start)
assert.Equal(t, running >= 50 && running < 200, expected)
assert.Equal(t, after >= 50 && after < 200, afterExpected)

