Reduce memory usage when watching resources #425

Open

LAMRobinson wants to merge 2 commits into crossplane-contrib:main from LAMRobinson:main

Conversation

@LAMRobinson

Description of your changes

Reduces memory usage of the watch feature (--enable-watches) by switching the watched-resource informer caches from full-object (Unstructured) informers to metadata-only (PartialObjectMetadata) informers, and by stripping managedFields from all cached objects (both the watch caches and the main manager cache) via cache.TransformStripManagedFields().

Note that there are a few other changes in this PR that came up during make reviewable test; these are in a dedicated commit.

Problem

When the watch feature is enabled, the provider creates a cache.Cache per (providerConfig, GVK) pair to detect changes to referenced or managed resources. These caches use Unstructured informers, which means the Kubernetes API server sends the full object (spec, status, annotations, managedFields, etc.) for every resource of each watched GVK, and the provider stores all of it in memory.

The provider never reads data from these caches. The reconciler always fetches resources directly from the API server via c.client.Get(). The watch caches exist purely as event sources -- the event handler (enqueueObjectsForReferences) only uses GetName(), GetNamespace(), and GetObjectKind().GroupVersionKind() from the event object to look up which Object resources need reconciliation.

This means for each watched GVK, the provider caches a complete in-memory copy of every object of that type across all namespaces in the target cluster, even though the cached data is never consumed. For clusters with common resource types (ConfigMaps, Secrets, Deployments, etc.) this leads to extreme memory usage -- 80GB+ observed for a few thousand managed resources.
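
For context, the existing per-GVK watch cache registration looks roughly like the sketch below (illustrative names, based on the description above rather than the provider's exact code):

```go
package sketch

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/cache"
)

// watchFullObjects is a hypothetical helper showing the pre-PR pattern:
// requesting an Unstructured informer makes the cache List/Watch and store
// complete objects of the GVK, even though only name, namespace, and GVK
// are ever read from the resulting events.
func watchFullObjects(ctx context.Context, ca cache.Cache, gvk schema.GroupVersionKind) error {
	u := &unstructured.Unstructured{}
	u.SetGroupVersionKind(gvk)

	_, err := ca.GetInformer(ctx, u)
	return err
}
```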

Additionally, the main manager cache (for Object CRs, ProviderConfigs, etc.) retains managedFields on all cached objects, which adds unnecessary memory overhead especially when SSA is enabled.

Fix

Three complementary changes (sketched in Go after the list):

  1. Switch to PartialObjectMetadata informers (watch caches): Replace Unstructured with *metav1.PartialObjectMetadata when calling cache.GetInformer() in both cluster-scoped and namespaced controllers. This causes controller-runtime to use metadata-only List/Watch requests, so the API server only sends object metadata (name, namespace, UID, labels, annotations, etc.) rather than the full spec/status. This reduces per-object size from 5-50KB to ~200-500 bytes and also reduces network bandwidth.

  2. Strip managedFields from watch caches: Apply cache.TransformStripManagedFields() to the watch cache options. Even in metadata-only responses, managedFields can be 2-10KB per object. Stripping them provides an additional 30-60% reduction on the remaining metadata.

  3. Strip managedFields from the main manager cache: Apply cache.TransformStripManagedFields() to the manager's cache options in main.go. This strips managedFields from Object CRs, ProviderConfigs, and all other control plane resources cached by the manager. The provider never reads managedFields from these resources -- the only GetManagedFields() calls in the codebase (syncer.go:149) operate on managed resources fetched directly from the target cluster API, not from the manager cache.
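
A minimal sketch of the three changes (hypothetical helper names; the real changes live in the informers.go files and main.go listed below):

```go
package sketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/rest"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
)

// (2) Build a watch cache that strips managedFields from everything it stores.
func newWatchCache(cfg *rest.Config) (cache.Cache, error) {
	return cache.New(cfg, cache.Options{
		DefaultTransform: cache.TransformStripManagedFields(),
	})
}

// (1) Register a metadata-only informer: passing *metav1.PartialObjectMetadata
// makes controller-runtime issue metadata-only List/Watch requests for the GVK.
func watchMetadataOnly(ctx context.Context, ca cache.Cache, gvk schema.GroupVersionKind) error {
	m := &metav1.PartialObjectMetadata{}
	m.SetGroupVersionKind(gvk)

	_, err := ca.GetInformer(ctx, m)
	return err
}

// (3) Strip managedFields from the main manager cache as well.
func newManager(cfg *rest.Config) (ctrl.Manager, error) {
	return ctrl.NewManager(cfg, ctrl.Options{
		Cache: cache.Options{
			DefaultTransform: cache.TransformStripManagedFields(),
		},
	})
}
```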

Why this is safe

  • The watch caches are purely event sources. Observe(), Create(), Update(), and Delete() all fetch resources directly from the API server (c.client.Get()), never from the watch cache.
  • enqueueObjectsForReferences() (in indexes.go) only accesses GetObjectKind().GroupVersionKind(), GetNamespace(), and GetName() on event objects -- all available on PartialObjectMetadata.
  • PartialObjectMetadata implements client.Object, so the existing obj.(client.Object) type assertions in event handlers continue to work (see the sketch after this list).
  • cache.GetInformer() in controller-runtime v0.19.0 has explicit support for PartialObjectMetadata via a dedicated metadata-only informer factory.
  • cache.TransformStripManagedFields() is a built-in, tested utility in controller-runtime designed for exactly this purpose.
  • The needSSAFieldManagerUpgrade() function in syncer.go reads managedFields, but from managed resources fetched via direct API calls to the target cluster -- not from the manager cache. Stripping managedFields from the manager cache does not affect this code path.
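
As an illustration of the handler-related points above, an event handler that only needs metadata can look like this (a hypothetical stand-in for enqueueObjectsForReferences, not the provider's actual code):

```go
package sketch

import (
	"fmt"

	"sigs.k8s.io/controller-runtime/pkg/client"
)

// handleEvent shows the kind of access the watch event handlers need:
// a type assertion to client.Object (which *metav1.PartialObjectMetadata
// satisfies) and the name, namespace, and GVK accessors.
func handleEvent(obj any) {
	o, ok := obj.(client.Object)
	if !ok {
		return
	}

	fmt.Printf("event for %s %s/%s\n",
		o.GetObjectKind().GroupVersionKind(), o.GetNamespace(), o.GetName())

	// The real handler would now look up which Object CRs reference this
	// resource (via the indexes in indexes.go) and enqueue them for reconciliation.
}
```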

Files changed

  • internal/controller/cluster/object/informers.go -- metadata-only informers + strip managedFields
  • internal/controller/namespaced/object/informers.go -- metadata-only informers + strip managedFields
  • cmd/provider/main.go -- strip managedFields from manager cache

I have:

  • Read and followed Crossplane's contribution process.
  • Run make reviewable test to ensure this PR is ready for review.

How has this code been tested

  • Verified that PartialObjectMetadata satisfies the client.Object interface required by event handler type assertions
  • Verified that enqueueObjectsForReferences only accesses metadata fields available on PartialObjectMetadata
  • Verified that no code path reads resource data (spec/status) from the watch caches
  • Verified that GetManagedFields() is only called on objects from direct API calls, not from the manager cache
  • Manual testing with watch feature enabled against a cluster with 1000 Objects managing 100KB ConfigMaps to confirm memory reduction and correct reconciliation triggering

Description:
- Set DefaultTransform to strip managed fields in controller-runtime caches.
- Switch informer object type to PartialObjectMetadata for efficiency.
- Update imports to use metav1 instead of unstructured for informers.

Signed-off-by: Laurence Robinson <laurence_robinson@live.co.uk>