Description
Kepler Version
0.10.0 or later (Current/Supported)
Bug Description
After upgrading Kepler from v0.8.0 to a recent release (v0.11.3), our production Kubernetes cluster experienced severe API server memory spikes that resulted in master node OOM kills. This happened even though Kepler was deployed only on worker nodes and process-level metrics were disabled.
Master nodes began OOM-killing critical control-plane components, including:
- kube-apiserver
- kube-controller-manager
- CNI controllers
- Storage controllers
Once we rolled Kepler back and then removed it, the issue disappeared.
Steps to Reproduce
Deploy a large cluster (~50+ nodes, ~2000 pods).
Deploy the latest Kepler agent on all worker nodes.
Observe master node memory usage, especially kube-apiserver.
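For step 3, one way to catch the spike is to poll kube-apiserver working-set memory while Kepler rolls out. Below is a minimal Go sketch using the Prometheus Go client; the Prometheus address and the namespace/pod/container matchers are placeholders and depend on how cAdvisor metrics are scraped in your cluster.

```go
// Minimal sketch: poll kube-apiserver working-set memory every 30s via Prometheus.
// The address and label matchers below are assumptions, not values from this report.
package main

import (
    "context"
    "fmt"
    "time"

    "github.com/prometheus/client_golang/api"
    promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
    // Hypothetical in-cluster Prometheus endpoint; replace with your own.
    client, err := api.NewClient(api.Config{Address: "http://prometheus.monitoring.svc:9090"})
    if err != nil {
        panic(err)
    }
    prom := promv1.NewAPI(client)

    // cAdvisor working-set memory for the static kube-apiserver pods.
    const query = `container_memory_working_set_bytes{namespace="kube-system",pod=~"kube-apiserver-.*",container="kube-apiserver"}`

    for range time.Tick(30 * time.Second) {
        ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
        result, warnings, err := prom.Query(ctx, query, time.Now())
        cancel()
        if err != nil {
            fmt.Println("query error:", err)
            continue
        }
        if len(warnings) > 0 {
            fmt.Println("warnings:", warnings)
        }
        fmt.Printf("%s  %v\n", time.Now().Format(time.RFC3339), result)
    }
}
```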
Expected Behavior
Kepler should not trigger extreme kube-apiserver load or cause control plane OOM.
API server interactions should be rate-limited, cached, or distributed to avoid synchronized spikes.
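To illustrate what we mean by rate-limited and cached (this is a sketch of the general client-go pattern, not a statement about how Kepler is implemented internally): capping client-side QPS/Burst and serving pod lookups from a shared informer cache avoids each agent issuing repeated LIST calls on a synchronized interval.

```go
// Sketch of the expected client pattern: client-side rate limits plus a shared
// informer cache, so pod metadata is answered locally instead of hitting the
// apiserver on every collection cycle.
package main

import (
    "fmt"
    "time"

    "k8s.io/apimachinery/pkg/labels"
    "k8s.io/client-go/informers"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

func main() {
    cfg, err := rest.InClusterConfig()
    if err != nil {
        panic(err)
    }
    // Cap client-side request rate so many agents cannot stampede the apiserver.
    cfg.QPS = 5
    cfg.Burst = 10

    clientset, err := kubernetes.NewForConfig(cfg)
    if err != nil {
        panic(err)
    }

    // One LIST+WATCH per resource at startup; afterwards reads come from the cache.
    factory := informers.NewSharedInformerFactory(clientset, 30*time.Minute)
    podLister := factory.Core().V1().Pods().Lister()

    stop := make(chan struct{})
    defer close(stop)
    factory.Start(stop)
    factory.WaitForCacheSync(stop)

    for range time.Tick(30 * time.Second) {
        pods, err := podLister.List(labels.Everything())
        if err != nil {
            fmt.Println("lister error:", err)
            continue
        }
        fmt.Printf("cached pods: %d\n", len(pods)) // no apiserver round-trip here
    }
}
```

With ~54 agents and ~2260 pods, synchronized per-interval LIST calls could add up quickly on the apiserver, whereas the informer pattern pays the LIST cost once per agent at startup and then relies on incremental WATCH events.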
Environment
Nodes: 54 worker nodes
Masters: 3 control-plane nodes (≈30GB RAM each)
Pods: ~2260 running pods
Kepler: Upgraded from v0.8.0 to the new release v0.11.3
Kepler placement: Only deployed on worker nodes (bare metal)
Logs and Error Messages
No Kepler logs are available because Kepler was removed after the incident.
API server logs had also rolled over by the time we investigated.
Additional Context
Prometheus also experienced a memory spike to ~40GB during compaction, because Kepler process metrics created ~1.5M time series.
We resolved this by disabling process metrics, and Prometheus stabilized.
However, even after disabling process metrics, the API server continued to crash until Kepler was fully removed.
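For reference, the number of active series Kepler contributes can be confirmed with count queries against Prometheus. The sketch below uses the Prometheus Go client; the kepler_ and kepler_process_ metric-name prefixes are assumptions about how the metrics are named in your deployment, and the address is a placeholder.

```go
// Minimal sketch: count how many active series match Kepler metric-name prefixes.
// Address and name prefixes are assumptions, not values taken from this report.
package main

import (
    "context"
    "fmt"
    "time"

    "github.com/prometheus/client_golang/api"
    promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
    client, err := api.NewClient(api.Config{Address: "http://prometheus.monitoring.svc:9090"})
    if err != nil {
        panic(err)
    }
    prom := promv1.NewAPI(client)

    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()

    // Total Kepler series vs. process-level series only.
    for _, q := range []string{
        `count({__name__=~"kepler_.+"})`,
        `count({__name__=~"kepler_process_.+"})`,
    } {
        result, _, err := prom.Query(ctx, q, time.Now())
        if err != nil {
            fmt.Println(q, "error:", err)
            continue
        }
        fmt.Println(q, "=>", result)
    }
}
```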