fix: Restrict informer cache to operator-managed resources (#118)
sutaakar merged 2 commits into opendatahub-io:dev
Conversation
Add label-selector filtering on the controller-runtime informer cache so only resources carrying the operator-name label are cached. Without filtering, the cache loads all Pods, Services, ConfigMaps, and RBAC objects cluster-wide, causing unbounded memory growth under load.

Changes:
- Add `newCacheOptions()` with `ByObject` label selectors using an "exists" requirement on the operator-name label
- Strip managed fields from cached objects (`TransformStripManagedFields`)
- Disable cache for ConfigMap and Secret client reads (`DisableFor`)
- Add operator-name label to MPI controller resources (ConfigMap, ServiceAccount, Role, RoleBinding) that were missing it
- Set `GOMEMLIMIT=460MiB` and add resource limits (512Mi) to deployment
No actionable comments were generated in the recent review. 🎉
📝 Walkthrough
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Security and Correctness Findings
🚥 Pre-merge checks: ✅ 2 passed
🧹 Nitpick comments (1)
cmd/training-operator.v1/main.go (1)
189-200: ConfigMap in both `ByObject` and `DisableFor` creates redundancy.

ConfigMap is filtered in `ByObject` (line 98) and also listed in `DisableFor` (line 196). Since `DisableFor` causes all `client.Get`/`List` calls to bypass the cache entirely, the `ByObject` label filter for ConfigMap only affects the informer's watch/list scope (reducing memory and API load). This is valid, but the interaction is subtle.

However, `Secret` is in `DisableFor` but not in `ByObject`. If any controller inadvertently sets up a Secret watch via the cache, it would watch all Secrets cluster-wide, since no label filter applies. Consider adding `&corev1.Secret{}` to `ByObject` with `operatorFilter` for defense-in-depth, or verify no Secret watches exist.

🛡️ Optional: Add Secret to ByObject for consistency

```diff
 opts := cache.Options{
 	DefaultTransform: cache.TransformStripManagedFields(),
 	ByObject: map[client.Object]cache.ByObject{
 		&corev1.ConfigMap{}:      operatorFilter,
 		&corev1.Pod{}:            operatorFilter,
 		&corev1.Service{}:        operatorFilter,
 		&corev1.ServiceAccount{}: operatorFilter,
 		&rbacv1.Role{}:           operatorFilter,
 		&rbacv1.RoleBinding{}:    operatorFilter,
+		&corev1.Secret{}:         operatorFilter,
 	},
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/training-operator.v1/main.go` around lines 189 - 200, The ConfigMap is already restricted via CacheOptions.ByObject but Secret is only in DisableFor, so add &corev1.Secret{} to the CacheOptions.ByObject list using the same operatorFilter used for ConfigMap (the same label-filtering predicate) to ensure the informer won’t watch all cluster Secrets; update the ByObject slice (the one that configures the cache informer) to include &corev1.Secret{} with operatorFilter alongside the existing &corev1.ConfigMap{} entry so both resources are label-filtered for defense-in-depth while keeping Secret in DisableFor if you still want Get/List to bypass the cache.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 80315754-7e17-4d24-a8fb-dddb3887dc3b
📒 Files selected for processing (3)
- cmd/training-operator.v1/main.go
- manifests/base/deployment.yaml
- pkg/controller.v1/mpi/mpijob_controller.go
Add Secret to the ByObject cache configuration with the same operator-name label filter used for other core types. While no controller currently creates a Secret informer, this prevents an unfiltered cluster-wide Secret watch if one is added in the future.
Closing to re-open against the dev branch as requested.
Summary
Restricts the controller-runtime informer cache to only watch resources that carry the operator-name label. Without filtering, the cache loads all Pods, Services, ConfigMaps, ServiceAccounts, Roles, and RoleBindings cluster-wide, causing unbounded memory growth when many resources exist across namespaces.
- `ByObject` label-selector filtering using an "exists" requirement on the `training.kubeflow.org/operator-name` label for Pods, Services, ConfigMaps, ServiceAccounts, Roles, and RoleBindings
- `TransformStripManagedFields()` applied to cached objects to reduce per-object memory
- Cache disabled for ConfigMap and Secret client reads (`DisableFor`) to avoid stale data
- `GOMEMLIMIT=460MiB` (90% of the 512Mi container limit) set for GC pressure tuning
- `operator-name` label added to MPI controller resources (ConfigMap, ServiceAccount, Role, RoleBinding) that were missing it

Test Results
Tested on a RHOAI 3.2 cluster with 825 flood ConfigMaps across 8 namespaces.
Jira
RHOAIENG-52927
Test Plan