
fix: Restrict informer cache to operator-managed resources #118

Merged

sutaakar merged 2 commits into opendatahub-io:dev from ugiordan:fix/oom-harden-informer-cache on Mar 20, 2026

Conversation

@ugiordan (Member) commented Mar 19, 2026

Summary

Restricts the controller-runtime informer cache to only watch resources that carry the operator-name label. Without filtering, the cache loads all Pods, Services, ConfigMaps, ServiceAccounts, Roles, and RoleBindings cluster-wide, causing unbounded memory growth when many resources exist across namespaces.

  • Add ByObject label-selector filtering using an "exists" requirement on the training.kubeflow.org/operator-name label for Pods, Services, ConfigMaps, ServiceAccounts, Roles, and RoleBindings
  • Strip managed fields from cached objects via TransformStripManagedFields() to reduce per-object memory
  • Disable cache for ConfigMap and Secret client reads (DisableFor) to avoid stale data
  • Set GOMEMLIMIT=460MiB (90% of the 512Mi container limit) for GC pressure tuning
  • Add resource limits to the deployment manifest (512Mi memory, 500m CPU)
  • Propagate the operator-name label to MPI controller resources (ConfigMap, ServiceAccount, Role, RoleBinding) that were missing it
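
For reference, a minimal sketch of the cache filtering described above, assuming controller-runtime v0.15+ option names (the real newCacheOptions also takes the watch namespace, omitted here, and uses the kubeflowv1.OperatorNameLabel constant rather than a string literal):

```go
import (
	corev1 "k8s.io/api/core/v1"
	rbacv1 "k8s.io/api/rbac/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/apimachinery/pkg/selection"
	"sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// newCacheOptions restricts the informer cache to objects carrying the
// operator-name label (any value) and strips managedFields to cut
// per-object memory.
func newCacheOptions() cache.Options {
	req, err := labels.NewRequirement(
		"training.kubeflow.org/operator-name", selection.Exists, nil)
	if err != nil {
		panic(err) // static requirement; cannot fail at runtime
	}
	operatorFilter := cache.ByObject{Label: labels.NewSelector().Add(*req)}

	return cache.Options{
		DefaultTransform: cache.TransformStripManagedFields(),
		ByObject: map[client.Object]cache.ByObject{
			&corev1.ConfigMap{}:      operatorFilter,
			&corev1.Pod{}:            operatorFilter,
			&corev1.Service{}:        operatorFilter,
			&corev1.ServiceAccount{}: operatorFilter,
			&rbacv1.Role{}:           operatorFilter,
			&rbacv1.RoleBinding{}:    operatorFilter,
		},
	}
}
```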

Test Results

Tested on an RHOAI 3.2 cluster flooded with 825 ConfigMaps spread across 8 namespaces:

| Metric | Stock Operator | Fixed Operator |
| --- | --- | --- |
| Memory | 1452Mi (no limit) | 13Mi |
| Restarts | N/A | 0 |
| Status | Running (unbounded growth) | Ready, healthy |
| Resource limits | None | 512Mi memory |

Jira

RHOAIENG-52927

Test Plan

  • Deploy fixed operator on cluster with high resource count
  • Verify memory stays bounded under 512Mi limit
  • Verify all 6 controllers start and reconcile correctly (tfjob, pytorchjob, xgboostjob, paddlejob, jaxjob, mpijob)
  • Verify operator-name labels are set on MPI controller resources
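
For the last item, a hypothetical sketch of the labeling being verified; the helper name is invented and the label value assumed, while the real controller uses the kubeflowv1.OperatorNameLabel constant:

```go
import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// newLauncherConfigMap (hypothetical) shows the operator-name label being
// applied at creation time so the label-filtered informer cache still
// observes the object.
func newLauncherConfigMap(name, namespace string) *corev1.ConfigMap {
	return &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{
			Name:      name,
			Namespace: namespace,
			Labels: map[string]string{
				"training.kubeflow.org/operator-name": "training-operator", // assumed value
			},
		},
	}
}
```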

Summary by CodeRabbit

  • Improvements

    • Label-based filtering applied to operator-managed Kubernetes resources for clearer resource identification.
    • Training operator container now has explicit CPU/memory requests and limits and a configured Go memory cap.
    • Controller-created resources now include an operator label for consistent ownership tracking.
  • Performance

    • Caching behavior adjusted to reduce memory usage and ensure certain secrets/configmaps are fetched directly.

Add label-selector filtering on the controller-runtime informer cache
so only resources carrying the operator-name label are cached. Without
filtering, the cache loads all Pods, Services, ConfigMaps, and RBAC
objects cluster-wide, causing unbounded memory growth under load.

Changes:
- Add newCacheOptions() with ByObject label selectors using "exists"
  requirement on the operator-name label
- Strip managed fields from cached objects (TransformStripManagedFields)
- Disable cache for ConfigMap and Secret client reads (DisableFor)
- Add operator-name label to MPI controller resources (ConfigMap,
  ServiceAccount, Role, RoleBinding) that were missing it
- Set GOMEMLIMIT=460MiB and add resource limits (512Mi) to deployment
@coderabbitai (Bot) commented Mar 19, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 93483abd-a1a4-404b-8a4f-90d14dab33cb

📥 Commits

Reviewing files that changed from the base of the PR and between 58e9d95 and 61bd2f4.

📒 Files selected for processing (1)
  • cmd/training-operator.v1/main.go

📝 Walkthrough

The PR introduces a new newCacheOptions(namespace string) function returning cache.Options, used unconditionally when creating the controller-runtime manager. The cache is configured with a label selector requiring kubeflowv1.OperatorNameLabel (Exists) and cache.ByObject filters limited to ConfigMap, Pod, Service, ServiceAccount, Secret, Role, and RoleBinding. Manager options now set Client: client.Options{Cache: &client.CacheOptions{DisableFor: [...]}} so that client.Get/List bypasses the cache for ConfigMap and Secret. The deployment manifest adds GOMEMLIMIT="460MiB" and resource requests/limits for the training-operator container. The MPI controller now adds the kubeflowv1.OperatorNameLabel to created ConfigMap, ServiceAccount, Role, and RoleBinding objects.
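
A hedged sketch of that wiring, again with controller-runtime v0.15+ option names (the newCacheOptions signature is simplified, and error handling at the real call site is elided):

```go
import (
	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// newManager combines the label-filtered informer cache with direct
// (uncached) client reads for ConfigMaps and Secrets.
func newManager() (ctrl.Manager, error) {
	return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Cache: newCacheOptions(), // label-filtered informers, as above
		Client: client.Options{
			Cache: &client.CacheOptions{
				// Get/List on these types always hit the API server,
				// so ConfigMap/Secret reads are never stale.
				DisableFor: []client.Object{&corev1.ConfigMap{}, &corev1.Secret{}},
			},
		},
	})
}
```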

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Security and Correctness Findings

  • main.go — Cache bypass for ConfigMap/Secret and sensitive-data exposure (CWE-200)

    • Finding: client.CacheOptions.DisableFor disables informer cache for ConfigMap and Secret but the cache informer still exists for those types (filtered by label). If any code relies on the informer cache for Secrets/ConfigMaps, behavior differs from client.Get/List, increasing risk of stale reads or inconsistent behavior.
    • Actionable: Ensure all code paths that read Secrets/ConfigMaps use client.Get/List (which bypass cache) or validate informer usage. Add unit tests exercising both cached and direct reads for these types.
  • main.go — Label-selector uses only Exists; operator isolation risk (CWE-200)

    • Finding: The cache label selector requires only that kubeflowv1.OperatorNameLabel exists; any resource carrying the label, including ones written by a different operator instance, will match and be cached, weakening isolation between operators.
    • Actionable: Change the selector to require a specific operator value (Equals), or enforce a codepath that always sets the label on all created resources; see the sketch after this list. Add integration tests that create unlabeled resources and verify they are excluded.
  • main.go — Namespace validation and silent failure risk

    • Finding: newCacheOptions(namespace string) does not validate namespace. An empty or malformed namespace may create unintended caching behavior.
    • Actionable: Validate namespace is non-empty and conforms to DNS-1123; fail fast with a clear error when invalid.
  • mpijob_controller.go — Incomplete labeling coverage (CWE-200)

    • Finding: Operator label was added to ConfigMap, ServiceAccount, Role, RoleBinding. If any other resources are created elsewhere (e.g., Services, Secrets, Jobs) without the label, they will bypass the filter and expand the informer scope.
    • Actionable: Audit all resource creation paths in the MPI controller and other controllers to ensure the operator label is applied uniformly. Add tests verifying all resources created by the controller contain the label.
  • deployment.yaml — Memory/GC / OOM risk (CWE-770)

    • Finding: GOMEMLIMIT set to 460MiB with pod memory limit 512Mi gives a small headroom (≈52MiB). If the Go runtime reaches GOMEMLIMIT, GC pressure and allocator overhead may still push the process past the hard limit, risking OOMKill.
    • Actionable: Increase pod memory limit relative to GOMEMLIMIT (recommended margin ≥20%) or lower GOMEMLIMIT. Add resource-based load tests to validate OOM behavior under expected workloads.
  • General operational correctness — cache vs client semantics

    • Finding: Using cache.ByObject alongside DisableFor introduces dual semantics (informers filtered by label vs direct client bypassing cache). This increases cognitive load and error surface for future maintainers.
    • Actionable: Document the intended semantics in code comments and contributor docs. Consider a single consistent approach (either fully label-filtered cache plus direct reads, or exclusively bypass cache for sensitive types) and add tests covering both informer and direct client behavior.
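
For the Equals hardening suggested in the second finding above, a minimal sketch; the label value passed in is an assumption and should be whatever value the operator actually writes:

```go
import (
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/apimachinery/pkg/selection"
	"sigs.k8s.io/controller-runtime/pkg/cache"
)

// strictOperatorFilter matches only resources whose operator-name label
// equals the given value, rather than any value (Exists).
func strictOperatorFilter(value string) cache.ByObject {
	req, err := labels.NewRequirement(
		"training.kubeflow.org/operator-name", selection.Equals, []string{value})
	if err != nil {
		panic(err) // fails only for an invalid label key or value
	}
	return cache.ByObject{Label: labels.NewSelector().Add(*req)}
}
```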
🚥 Pre-merge checks: ✅ 2 passed

Description check: ✅ Passed (skipped because CodeRabbit's high-level summary is enabled)
Title check: ✅ Passed (the title clearly summarizes the main change: restricting the cache to operator-name-labeled resources to prevent unbounded memory growth across multiple resource types)



@coderabbitai (Bot) left a comment

🧹 Nitpick comments (1)
cmd/training-operator.v1/main.go (1)

189-200: ConfigMap in both ByObject and DisableFor creates redundancy.

ConfigMap is filtered in ByObject (line 98) and also listed in DisableFor (line 196). Since DisableFor causes all client.Get/List calls to bypass the cache entirely, the ByObject label filter for ConfigMap only affects the informer's watch/list scope (reducing memory and API load). This is valid but the interaction is subtle.

However, Secret is in DisableFor but not in ByObject. If any controller inadvertently sets up a Secret watch via the cache, it would watch all Secrets cluster-wide since no label filter applies. Consider adding &corev1.Secret{} to ByObject with operatorFilter for defense-in-depth, or verify no Secret watches exist.

🛡️ Optional: Add Secret to ByObject for consistency
 opts := cache.Options{
     DefaultTransform: cache.TransformStripManagedFields(),
     ByObject: map[client.Object]cache.ByObject{
         &corev1.ConfigMap{}:      operatorFilter,
         &corev1.Pod{}:            operatorFilter,
         &corev1.Service{}:        operatorFilter,
         &corev1.ServiceAccount{}: operatorFilter,
         &rbacv1.Role{}:           operatorFilter,
         &rbacv1.RoleBinding{}:    operatorFilter,
+        &corev1.Secret{}:         operatorFilter,
     },
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/training-operator.v1/main.go` around lines 189 - 200, The ConfigMap is
already restricted via CacheOptions.ByObject but Secret is only in DisableFor,
so add &corev1.Secret{} to the CacheOptions.ByObject list using the same
operatorFilter used for ConfigMap (the same label-filtering predicate) to ensure
the informer won’t watch all cluster Secrets; update the ByObject slice (the one
that configures the cache informer) to include &corev1.Secret{} with
operatorFilter alongside the existing &corev1.ConfigMap{} entry so both
resources are label-filtered for defense-in-depth while keeping Secret in
DisableFor if you still want Get/List to bypass the cache.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 80315754-7e17-4d24-a8fb-dddb3887dc3b

📥 Commits

Reviewing files that changed from the base of the PR and between bfceb07 and 58e9d95.

📒 Files selected for processing (3)
  • cmd/training-operator.v1/main.go
  • manifests/base/deployment.yaml
  • pkg/controller.v1/mpi/mpijob_controller.go

Add Secret to the ByObject cache configuration with the same
operator-name label filter used for other core types. While no
controller currently creates a Secret informer, this prevents
an unfiltered cluster-wide Secret watch if one is added in the future.
ugiordan changed the base branch from rhoai to dev, Mar 20, 2026 13:50
@ugiordan (Member, Author) commented:

Closing to reopen against the dev branch as requested.

ugiordan closed this Mar 20, 2026
ugiordan reopened this Mar 20, 2026
sutaakar merged commit 28a60bd into opendatahub-io:dev Mar 20, 2026
40 of 60 checks passed