backend(k8cache): filter gvrList to important resources to spawn fewer goroutines for the given context #4670
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: upsaurav12. The full list of commands accepted by this bot can be found here. Details: needs approval from an approver in each of these files. Approvers can indicate their approval by writing…
@illume ^^
illume left a comment
Thanks for this.
I’m a bit confused… Maybe the issue it fixes should be linked to one of the sub-issues? Or maybe it needs a new issue? I guess this could do with a test to help me understand the issue. I’m not sure how to reproduce it. Also, maybe writing the issue this fixes in the commit message body would help me understand it better.
Could you please squash and rebase out the merge commit? The commit message should follow conventions.
Currently the number of goroutines is around 400 (because there are 54 resource informers), but with this PR it is now ~100 per context. Do you think this number of goroutines is fine? @illume
Pull request overview
This PR aims to reduce the number of Kubernetes informer goroutines spawned per context by filtering the discovered GroupVersionResource list down to a small set of “important” resources before setting up watches in k8cache.
Changes:
- Filter the discovered `gvrList` to an allowlist before starting informers.
- Add a `filterImportantResources` helper to centralize the allowlist logic.
```go
	// filtering the gvrList to make sure spawning go routines for only important
	// resources which are required to be watched and cached,
	// this will help to reduce the performance issue and resource utilization.
	filtered := filterImportantResources(gvrList)

	return filtered
}
```
returnGVRList now filters the watch/invalidation GVRs down to a small allowlist, but the cache middleware still stores GET responses for all resources (see StoreK8sResponseInCache usage in backend/cmd/server.go). This means cached responses for non-allowlisted resources will no longer be promptly invalidated on external changes and can remain stale until TTL expiry. Consider either restricting caching to the same allowlist, or implementing a lazy/per-request watcher registration so any cached resource also gets an informer (or keep watching all list+watch resources).
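One way to realize the first option (restricting caching to the same allowlist) is sketched below. The `watchedResources` set and the `isWatchedResource` helper are hypothetical names introduced only for illustration; the actual signature of the cache middleware may differ. The idea is that the middleware would consult the same allowlist before storing a GET response:

```go
package k8cache

// watchedResources mirrors the allowlist used when starting informers, so the
// cache middleware and the watchers agree on which resources are covered.
// This is a sketch, not the existing API.
var watchedResources = map[string]struct{}{
	"pods": {}, "services": {}, "deployments": {}, "replicasets": {},
	"statefulsets": {}, "daemonsets": {}, "nodes": {}, "configmaps": {},
	"secrets": {}, "jobs": {}, "cronjobs": {},
}

// isWatchedResource reports whether the plural resource name (as parsed from
// the request URL) is part of the watched allowlist.
func isWatchedResource(resource string) bool {
	_, ok := watchedResources[resource]
	return ok
}
```

The middleware could then skip storing responses for any resource that no informer will ever invalidate, so nothing in the cache can go stale beyond its TTL.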
```go
	// filtering the gvrList to make sure spawning go routines for only important
	// resources which are required to be watched and cached,
	// this will help to reduce the performance issue and resource utilization.
	filtered := filterImportantResources(gvrList)

	return filtered
}
```
The doc comment above returnGVRList says it returns “all the supported resources that are supported by the k8s server”, but after this change it returns only the filtered subset. Update the comment to reflect the new behavior so callers don’t assume full coverage.
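A possible rewording of that doc comment (the phrasing is only a suggestion, not the actual comment in the tree):

```go
// returnGVRList returns the subset of list/watch-able resources discovered
// from the Kubernetes API server that is considered important enough to watch
// and cache (see filterImportantResources); it no longer returns the full
// discovery output.
```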
```go
// filterImportantResources filters the provided list of GroupVersionResources to
// include only those that are deemed important for caching and watching.
// This helps reduce the number of resources we watch and cache,
// improving performance and resource utilization.
func filterImportantResources(gvrList []schema.GroupVersionResource) []schema.GroupVersionResource {
	allowed := map[string]struct{}{
		"pods":         {},
		"services":     {},
		"deployments":  {},
		"replicasets":  {},
		"statefulsets": {},
		"daemonsets":   {},
		"nodes":        {},
		"configmaps":   {},
		"secrets":      {},
		"jobs":         {},
		"cronjobs":     {},
	}

	filtered := make([]schema.GroupVersionResource, 0, len(allowed))

	for _, gvr := range gvrList {
		if _, ok := allowed[gvr.Resource]; ok {
			filtered = append(filtered, gvr)
		}
	}
```
filterImportantResources introduces behavior that directly affects cache invalidation correctness/performance, but there are no unit tests covering the filtering (and its interaction with returnGVRList). Add tests that verify the allowlist behavior (including that non-allowlisted resources are excluded) to prevent regressions when the list changes.
```go
	filtered := filterImportantResources(gvrList)

	return filtered
```
PR description says the filter was added “inside the CheckForChanges”, but the change is actually applied in returnGVRList (affecting all watcher setups). If the intent is different (e.g., conditional filtering), either adjust the implementation or update the PR description to match the code.
Force-pushed from 5ca07ab to 264a1cb
@upsaurav12 thanks for the explanation, and for adjusting the commits. Btw, you could remove some of the Signed-off-by lines in the commit message body, because we don’t use that here.

This is a very nice improvement. With 10 clusters that’s about 1000. I wonder if there’s an upper limit set? I also wonder when they get cleaned up. I mean, if someone uses it, 1000 watches are set up… how long before the server stops the watchers when the user hasn’t visited? They’re always listening, right? I think that’s fine if the admin has asked for it.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.
```go
	// filtering the gvrList to make sure spawning go routines for only important
	// resources which are required to be watched and cached,
	// this will help to reduce the performance issue and resource utilization.
```
The comments here have grammar/spelling issues (e.g., "make sure spawning go routines" and "go routines" -> "ensure we spawn goroutines"). Please rewrite for clarity since these comments explain a behavioral change that affects correctness.
Suggested change:

```diff
-// filtering the gvrList to make sure spawning go routines for only important
-// resources which are required to be watched and cached,
-// this will help to reduce the performance issue and resource utilization.
+// Filter the gvrList so that we spawn goroutines only for important resources
+// that need to be watched and cached. This helps improve performance and
+// reduce resource utilization.
```
```go
// filterImportantResources filters the provided list of GroupVersionResources to
// include only those that are deemed important for caching and watching.
// This helps reduce the number of resources we watch and cache,
// improving performance and resource utilization.
func filterImportantResources(gvrList []schema.GroupVersionResource) []schema.GroupVersionResource {
	allowed := map[string]struct{}{
		"pods":         {},
		"services":     {},
		"deployments":  {},
		"replicasets":  {},
		"statefulsets": {},
		"daemonsets":   {},
		"nodes":        {},
		"configmaps":   {},
		"secrets":      {},
		"jobs":         {},
		"cronjobs":     {},
	}

	filtered := make([]schema.GroupVersionResource, 0, len(allowed))

	for _, gvr := range gvrList {
		if _, ok := allowed[gvr.Resource]; ok {
			filtered = append(filtered, gvr)
		}
	}

	// return the filtered list of GroupVersionResources that are important for caching and watching
	return filtered
```
New filtering behavior isn’t covered by tests. There are existing tests for this package (e.g., cache invalidation and informer wiring), but none validate that the GVR discovery output is filtered as intended. Add a unit test for filterImportantResources (and/or returnGVRList) to prevent regressions and to document the expected allowlist behavior.
```go
	// filtering the gvrList to make sure spawning go routines for only important
	// resources which are required to be watched and cached,
	// this will help to reduce the performance issue and resource utilization.
	filtered := filterImportantResources(gvrList)

	return filtered
}
```
Filtering the discovered GVRs down to a hard-coded allowlist changes cache invalidation semantics: the cache layer stores GET responses for arbitrary resource URLs, but changes to any resource not in allowed will no longer trigger informer events to invalidate cached entries (except via TTL or non-GET requests through Headlamp). This can lead to serving stale data for those resource kinds. Consider either (a) restricting caching to the same allowlist, (b) making the watched resources configurable, or (c) expanding the watched set to include all kinds that can be cached/are used by the UI (e.g., namespaces, ingresses, pv/pvc, rbac objects, etc.).
Suggested change:

```diff
-// filtering the gvrList to make sure spawning go routines for only important
-// resources which are required to be watched and cached,
-// this will help to reduce the performance issue and resource utilization.
-filtered := filterImportantResources(gvrList)
-return filtered
-}
+// Return the full list of listable/watchable resources so that cache
+// invalidation covers all kinds that may be cached by the middleware.
+// This avoids serving stale data for resources that would otherwise
+// not be watched due to a hard-coded allowlist.
+return gvrList
+}
```
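For completeness, option (b) from the comment above (making the watched set configurable) could look roughly like the sketch below. The environment variable name `HEADLAMP_CACHE_WATCH_RESOURCES`, the package name, and the fallback behavior are assumptions for illustration, not an existing Headlamp setting:

```go
package k8cache

import (
	"os"
	"strings"
)

// watchedResourceNames returns the set of plural resource names to watch.
// Operators could override the built-in defaults with a comma-separated list
// in the (hypothetical) HEADLAMP_CACHE_WATCH_RESOURCES environment variable.
func watchedResourceNames() map[string]struct{} {
	names := []string{
		"pods", "services", "deployments", "replicasets", "statefulsets",
		"daemonsets", "nodes", "configmaps", "secrets", "jobs", "cronjobs",
	}

	if v := os.Getenv("HEADLAMP_CACHE_WATCH_RESOURCES"); v != "" {
		names = strings.Split(v, ",")
	}

	allowed := make(map[string]struct{}, len(names))
	for _, n := range names {
		allowed[strings.TrimSpace(strings.ToLower(n))] = struct{}{}
	}

	return allowed
}
```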
Force-pushed from 264a1cb to 7b82323
@illume done 👍
@upsaurav12 great, thanks. Btw, I haven’t checked the Copilot review comments yet. I can probably get to them by tomorrow.
PS: The issue in the PR description should be changed to “Related” rather than “Fixes”, because “Fixes” closes the issue.
@illume is spawning 1000 goroutines per context the serious concern, or is it the informers watching continuously? Just asking.
I mean, that would be 10 cluster contexts for 1000. The only way to find out for sure is to test it. But this reduces it a lot, so that’s a nice improvement already. Was just thinking out loud, trying to understand it better.

Yeah, let’s not start that yet. There’s one mode where we want all changes coming in so the cache is as synced as possible. Maybe time-based cleanup of watchers is something to consider later.
Yep, will add some tests to reproduce it.
Force-pushed from 51f7fc5 to 0f3ed5f
Force-pushed from 0f3ed5f to 9e456f9
@illume I have added the integration tests; not sure if it is the right approach to test this part. Maybe it would be better if I added a test for a changing number of resources (currently I am testing the goroutines for the filtered resources and one context), i.e., testing goroutines for multiple contexts such as 10 or 20 with a dynamic resource list.
…er of go routines for informers

Currently the number of goroutines being spawned is around 400, because there are 54 informers in CheckForChanges, so too many goroutines are spawned per context. With this PR, the list is now filtered down to the resources that are important or widely used, such as secrets, nodes, pods, etc.
Force-pushed from 9e456f9 to 193b34f
Looks like the test is failing?
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
```go
	allowed := map[string]struct{}{
		"pods":         {},
		"services":     {},
		"deployments":  {},
		"replicasets":  {},
		"statefulsets": {},
		"daemonsets":   {},
		"nodes":        {},
		"configmaps":   {},
		"secrets":      {},
		"jobs":         {},
		"cronjobs":     {},
	}

	filtered := make([]schema.GroupVersionResource, 0, len(allowed))

	for _, gvr := range gvrList {
		if _, ok := allowed[gvr.Resource]; ok {
```
The new allow-list only keys off gvr.Resource. That means any API group that happens to use the same resource name will be included (even if it shouldn't be), and it also makes it easy to miss commonly used resources (e.g., namespaces, ingresses, serviceaccounts, roles) without any compile-time signal. Consider keying by full GroupResource (group+resource) and/or centralizing this list with clear rationale and tests so future updates don’t silently change invalidation behavior.
Suggested change:

```diff
-allowed := map[string]struct{}{
-	"pods":         {},
-	"services":     {},
-	"deployments":  {},
-	"replicasets":  {},
-	"statefulsets": {},
-	"daemonsets":   {},
-	"nodes":        {},
-	"configmaps":   {},
-	"secrets":      {},
-	"jobs":         {},
-	"cronjobs":     {},
-}
-filtered := make([]schema.GroupVersionResource, 0, len(allowed))
-for _, gvr := range gvrList {
-	if _, ok := allowed[gvr.Resource]; ok {
+// Use full GroupResource (group + resource) for the allow-list so that
+// different API groups that happen to use the same resource name are not
+// conflated. Core resources use the empty group name "".
+allowed := map[schema.GroupResource]struct{}{
+	{Group: "", Resource: "pods"}:         {},
+	{Group: "", Resource: "services"}:     {},
+	{Group: "", Resource: "deployments"}:  {},
+	{Group: "", Resource: "replicasets"}:  {},
+	{Group: "", Resource: "statefulsets"}: {},
+	{Group: "", Resource: "daemonsets"}:   {},
+	{Group: "", Resource: "nodes"}:        {},
+	{Group: "", Resource: "configmaps"}:   {},
+	{Group: "", Resource: "secrets"}:      {},
+	{Group: "", Resource: "jobs"}:         {},
+	{Group: "", Resource: "cronjobs"}:     {},
+}
+filtered := make([]schema.GroupVersionResource, 0, len(allowed))
+for _, gvr := range gvrList {
+	gr := schema.GroupResource{Group: gvr.Group, Resource: gvr.Resource}
+	if _, ok := allowed[gr]; ok {
```
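Note that the suggestion above keys every entry with the empty core group, which would exclude the apps and batch kinds entirely. If the allow-list is keyed by GroupResource, the non-core kinds need their actual API groups, roughly as in this sketch (same `schema` import as the quoted code is assumed):

```go
// Allow-list keyed by GroupResource. Core kinds use the empty group "", the
// workload controllers live in "apps", and jobs/cronjobs live in "batch".
allowed := map[schema.GroupResource]struct{}{
	{Group: "", Resource: "pods"}:             {},
	{Group: "", Resource: "services"}:         {},
	{Group: "", Resource: "nodes"}:            {},
	{Group: "", Resource: "configmaps"}:       {},
	{Group: "", Resource: "secrets"}:          {},
	{Group: "apps", Resource: "deployments"}:  {},
	{Group: "apps", Resource: "replicasets"}:  {},
	{Group: "apps", Resource: "statefulsets"}: {},
	{Group: "apps", Resource: "daemonsets"}:   {},
	{Group: "batch", Resource: "jobs"}:        {},
	{Group: "batch", Resource: "cronjobs"}:    {},
}
```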
```go
// filterImportantResources filters the provided list of GroupVersionResources to
// include only those that are deemed important for caching and watching.
// This helps reduce the number of resources we watch and cache,
// improving performance and resource utilization.
func filterImportantResources(gvrList []schema.GroupVersionResource) []schema.GroupVersionResource {
	allowed := map[string]struct{}{
		"pods":         {},
		"services":     {},
		"deployments":  {},
		"replicasets":  {},
		"statefulsets": {},
		"daemonsets":   {},
		"nodes":        {},
		"configmaps":   {},
		"secrets":      {},
		"jobs":         {},
		"cronjobs":     {},
	}

	filtered := make([]schema.GroupVersionResource, 0, len(allowed))

	for _, gvr := range gvrList {
		if _, ok := allowed[gvr.Resource]; ok {
			filtered = append(filtered, gvr)
		}
	}

	// return the filtered list of GroupVersionResources that are important for caching and watching
	return filtered
}
```
filterImportantResources is new behavior but there’s no unit test coverage ensuring the filter returns the expected subset and preserves Group/Version fields for allowed resources. Since this package already has cacheInvalidation_test.go, adding tests for returnGVRList/filterImportantResources would help prevent accidental regressions (e.g., typos in resource names or missed resources).
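A minimal sketch of such a test, assuming the function lives in the same package as the existing cache tests (the package name `k8cache` and the exact input GVRs are illustrative, not taken from the repository):

```go
package k8cache

import (
	"testing"

	"github.com/stretchr/testify/assert"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

func TestFilterImportantResources(t *testing.T) {
	input := []schema.GroupVersionResource{
		{Group: "", Version: "v1", Resource: "pods"},
		{Group: "apps", Version: "v1", Resource: "deployments"},
		// Not allow-listed: these must be dropped by the filter.
		{Group: "", Version: "v1", Resource: "events"},
		{Group: "networking.k8s.io", Version: "v1", Resource: "ingresses"},
	}

	got := filterImportantResources(input)

	// Only allow-listed resources remain, and their Group/Version are preserved.
	assert.ElementsMatch(t, []schema.GroupVersionResource{
		{Group: "", Version: "v1", Resource: "pods"},
		{Group: "apps", Version: "v1", Resource: "deployments"},
	}, got)
}
```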
```go
	time.Sleep(3 * time.Second)

	running := runtime.NumGoroutine()

	time.Sleep(3 * time.Second)

	after := runtime.NumGoroutine()

	expected := false
	afterExpected := false

	if (running >= 50 && running < 200) && (after >= 50 && after < 200) {
		expected = true
		afterExpected = true
	}

	assert.Equal(t, 3, start)
	assert.Equal(t, running >= 50 && running < 200, expected)
	assert.Equal(t, after >= 50 && after < 200, afterExpected)
```
This test is very likely to fail and/or be meaningless:
- `runtime.NumGoroutine()` is not stable across Go versions/test runners, so `assert.Equal(t, 3, start)` is unreliable.
- `expected`/`afterExpected` are derived from the same boolean being asserted, making the assertions tautological (they’ll pass regardless of goroutine count).

Consider asserting on a delta from `start` (or an upper bound) and using `require.Eventually` to wait for informers to start/settle, rather than hard-coded absolute counts.
Suggested change:

```diff
-time.Sleep(3 * time.Second)
-running := runtime.NumGoroutine()
-time.Sleep(3 * time.Second)
-after := runtime.NumGoroutine()
-expected := false
-afterExpected := false
-if (running >= 50 && running < 200) && (after >= 50 && after < 200) {
-	expected = true
-	afterExpected = true
-}
-assert.Equal(t, 3, start)
-assert.Equal(t, running >= 50 && running < 200, expected)
-assert.Equal(t, after >= 50 && after < 200, afterExpected)
+// Wait until at least one additional goroutine is started, or time out.
+require.Eventually(t, func() bool {
+	return runtime.NumGoroutine() > start
+}, 10*time.Second, 100*time.Millisecond)
+running := runtime.NumGoroutine()
+// Allow some time for background work to settle.
+time.Sleep(3 * time.Second)
+after := runtime.NumGoroutine()
+// Assert on deltas rather than absolute goroutine counts, which are
+// environment-dependent. We expect some increase but no unbounded growth.
+const maxDelta = 500
+deltaRunning := running - start
+if deltaRunning < 0 {
+	deltaRunning = 0
+}
+deltaAfter := after - start
+if deltaAfter < 0 {
+	deltaAfter = 0
+}
+assert.GreaterOrEqual(t, deltaRunning, 1, "expected at least one additional goroutine after request")
+assert.LessOrEqual(t, deltaRunning, maxDelta, "too many goroutines shortly after request")
+assert.LessOrEqual(t, deltaAfter, maxDelta, "too many goroutines after settling time")
```
@illume would this assertion be a good way to accurately check how many goroutines are spawned?
```go
	apiPath := "/clusters/" + clusterName + "/api/v1/pods"
	ctx := context.Background()

	resp, err := httpRequestWithContext(ctx, ts.URL+apiPath, "GET")
	require.NoError(t, err)

	defer resp.Body.Close()
```
The test doesn’t verify the request succeeded (status code) or that cache watching was actually started for the context. If authorization fails in handleCacheAuthorization, CheckForChanges can be skipped entirely, making goroutine assertions unrelated to the behavior under test. Add an explicit require.Equal(http.StatusOK, resp.StatusCode) (or expected status) and ensure the request path/headers exercise the codepath that calls k8cache.CheckForChanges.
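Concretely, the quoted test snippet could be extended along these lines. This is only a sketch: it assumes the `net/http` and `testify/require` imports, and `resp`/`err` come from the quoted code above:

```go
resp, err := httpRequestWithContext(ctx, ts.URL+apiPath, "GET")
require.NoError(t, err)

defer resp.Body.Close()

// Fail fast if the request did not succeed, so the goroutine assertions below
// are actually tied to the cache/watch codepath (k8cache.CheckForChanges).
require.Equal(t, http.StatusOK, resp.StatusCode)
```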
@illume well, I assumed there would be only 3 in the beginning because of this part. I assumed that the number of goroutines for this part is predictable, but it isn’t.
Summary
This PR filters the gvrList down to the important, most commonly used resources, resulting in fewer goroutines being spawned.
Related Issue
Related to #3854
Changes