Implement cache-level filtering for managed resources to reduce memory footprint #79

Description

@jewertow

Problem:

In large clusters, our controller can consume excessive memory because the underlying Informers watch and cache every instance of a given resource type. We use predicates to filter which events enqueue a work item for reconciliation, but predicates only act on events coming out of the cache, so they do not prevent the cache from storing every object in memory.

As the cluster scales, caching thousands of unmanaged resources wastes a growing share of the controller's memory.

Proposed Solution:

We should leverage cache.Options when initializing the Manager. This will limit the scope of the resources watched and cached at the informer level, significantly reducing the controller's memory footprint by only caching objects we actually care about.
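
A minimal sketch of what this could look like, assuming a recent controller-runtime release where `ctrl.Options` has a `Cache` field and `cache.Options` supports per-type selectors via `ByObject`. The choice of `ConfigMap` as the filtered type is purely illustrative, and the label is the example one used later in this issue; neither reflects a decision about which resources to filter.

```go
package main

import (
	"os"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func main() {
	// Selector for objects we consider managed (example label, an assumption).
	managed := labels.SelectorFromSet(labels.Set{
		"app.kubernetes.io/managed-by": "mesh-controller",
	})

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Cache: cache.Options{
			// Only the listed types are restricted; every other type
			// is still cached in full.
			ByObject: map[client.Object]cache.ByObject{
				// ConfigMap is a placeholder for a high-volume type.
				&corev1.ConfigMap{}: {Label: managed},
			},
		},
	})
	if err != nil {
		os.Exit(1)
	}

	// Controllers would be registered here before starting the manager.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}
```

With a selector like this, reads through the manager's cached client only see matching objects for the filtered type, which is exactly what makes the risk described below possible.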

Critical Risk: Correctness vs. Performance

While cache-level filtering drastically improves performance, it introduces a severe risk regarding reconciliation correctness and self-healing that we must carefully design around.

The "Ghost Object" Risk:

If the controller filters its cache based on specific criteria (e.g., matching a label like app.kubernetes.io/managed-by: mesh-controller), and a user or external process removes that label from a managed resource:

  • Expected behavior: The controller should detect the missing label during reconciliation and add it back to maintain the desired state.
  • Actual risk with cache-filtering: Because the modified object no longer matches the cache's label selector, it disappears from the controller's cache. The Informer does not surface the change as an update event; a label-filtered watch reports an object that stops matching as deleted, so at best the controller sees something indistinguishable from a real deletion. After that the controller is blind to the resource and cannot self-heal it.
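
To make the failure mode concrete, here is a toy model in plain Go. It is not the real Informer; `Object`, `FilteredCache`, and `Apply` are hypothetical names invented for illustration. It assumes that a label-filtered watch reports an object that stops matching the selector as a delete and then stays silent about it.

```go
package main

import "fmt"

// Object is a stand-in for a Kubernetes resource: a name plus labels.
type Object struct {
	Name   string
	Labels map[string]string
}

// FilteredCache is a toy model of a label-selected informer cache.
type FilteredCache struct {
	key, value string
	store      map[string]Object
}

func NewFilteredCache(key, value string) *FilteredCache {
	return &FilteredCache{key: key, value: value, store: map[string]Object{}}
}

// Apply feeds the latest state of an object into the cache and returns the
// event the controller would observe: "add", "update", "delete", or ""
// when the change is invisible to the controller.
func (c *FilteredCache) Apply(obj Object) string {
	_, cached := c.store[obj.Name]
	matches := obj.Labels[c.key] == c.value
	switch {
	case matches && !cached:
		c.store[obj.Name] = obj
		return "add"
	case matches && cached:
		c.store[obj.Name] = obj
		return "update"
	case !matches && cached:
		// The object left the selector: it looks like a deletion.
		delete(c.store, obj.Name)
		return "delete"
	default:
		// Never matched, or already dropped: nothing is surfaced.
		return ""
	}
}

// Has reports whether the object is still visible to the controller.
func (c *FilteredCache) Has(name string) bool {
	_, ok := c.store[name]
	return ok
}

func main() {
	const key, val = "app.kubernetes.io/managed-by", "mesh-controller"
	c := NewFilteredCache(key, val)

	// The managed object is created with the label.
	fmt.Println(c.Apply(Object{Name: "gw", Labels: map[string]string{key: val}}))
	// A user strips the label: the controller sees a delete, not an update.
	fmt.Println(c.Apply(Object{Name: "gw", Labels: map[string]string{}}))
	// The object still exists in the cluster but is gone from the cache.
	fmt.Println(c.Has("gw"))
	// Any further change to the unlabeled object produces no event at all.
	fmt.Printf("%q\n", c.Apply(Object{Name: "gw", Labels: map[string]string{}}))
}
```

In this model, the only moment the controller could react is the single "delete" event, and at that point it cannot tell a removed label from a removed object without asking the API server directly.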

We must ensure that reconciliation correctness takes absolute precedence over performance gains.

Implementation requirements:

  • Identify which high-volume resources would benefit most from cache-level filtering.
  • Implement cache.Options filtering for the targeted resource types.
  • Add integration/E2E tests specifically covering the "removed identifier/label" scenario to guarantee the controller can safely recover or gracefully handle the state transition without leaking unmanaged resources.

Labels: controller (Reconciler internals and performance)
