
API Server Memory Spikes & Master Node OOM After Upgrading Kepler (Big Cluster) #2366

@laurall974

Kepler Version

0.10.0 or later (Current/Supported)

Bug Description

After upgrading Kepler from v0.8.0 to v0.11.3, our production Kubernetes cluster experienced severe kube-apiserver memory spikes that led to OOM kills on the master nodes. This happened even though Kepler was deployed only on worker nodes and process-level metrics were disabled.
The master nodes began OOM-killing critical control-plane components, including:

  • kube-apiserver
  • kube-controller-manager
  • CNI controllers
  • Storage controllers

Once we rolled Kepler back and/or removed it, the issue disappeared.

Steps to Reproduce

1. Deploy a large cluster (~50+ nodes, ~2000 pods).
2. Deploy the latest Kepler agent on all worker nodes.
3. Observe master node memory usage, especially kube-apiserver.

Expected Behavior

Kepler should not trigger extreme kube-apiserver load or cause control plane OOM.
API server interactions should be rate-limited, cached, or distributed to avoid synchronized spikes.

Environment

Nodes: 54 worker nodes
Masters: 3 control-plane nodes (≈30GB RAM each)
Pods: ~2260 running pods
Kepler: Upgraded from v0.8.0 to the new release v0.11.3
Kepler placement: Only deployed on worker nodes (bare metal)

Logs and Error Messages

No Kepler logs are available because Kepler was removed after the incident.
The API server logs from that period have also been rotated out.

Additional Context

Prometheus also experienced a spike to ~40GB during compaction, due to Kepler process metrics creating ~1.5M time series.
We resolved this by disabling process metrics, and Prometheus stabilized.
However, even after disabling process metrics, the API server continued to crash until Kepler was fully removed.
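
For scale, the series count above can be turned into a rough per-node figure. The sketch below assumes the ~1.5M series came only from the 54 worker nodes and were spread evenly across them; both are assumptions, not measurements, and `seriesPerNode` is a hypothetical helper.

```go
package main

import "fmt"

// seriesPerNode returns the average number of time series per node.
// It assumes an even distribution, which real clusters rarely have.
func seriesPerNode(totalSeries, nodes int) float64 {
	if nodes <= 0 {
		return 0
	}
	return float64(totalSeries) / float64(nodes)
}

func main() {
	const totalSeries = 1_500_000 // ~1.5M series, as reported above
	const workerNodes = 54        // worker nodes running Kepler

	fmt.Printf("~%.0f process series per worker node\n",
		seriesPerNode(totalSeries, workerNodes))
	// prints "~27778 process series per worker node"
}
```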

    Labels

    kind/bugreport bug issue
