[processor/k8s_attributes] Fix memory leak in k8s_attributes map#48987

Open
giuliano-sider wants to merge 1 commit into open-telemetry:main from giuliano-sider:fix-pod-map-leak

@giuliano-sider

Contributor

[processor/k8s_attributes] Fix memory leak in k8s_attributes map due to missing IP or other dynamic attributes from Pod deletion event object.

Component(s)

[processor/k8s_attributes]

Description

In environments with high pod churn or rapid scale-down, the k8sattributes processor can leak pod IP-based cache entries (connection: <IP> and resource_attribute: k8s.pod.ip) in the internal c.Pods map. This leads to unbounded memory growth and a stable baseline of leaked keys even after the actual pods have been scaled down to zero.

Root Cause

When a pod is deleted, the WatchClient's forgetPod method is called to purge the cached entries:

func (c *WatchClient) forgetPod(pod *api_v1.Pod) {
	podToRemove := c.podFromAPI(pod)
	identifiers := c.getIdentifiersFromAssoc(podToRemove)
	for i := range identifiers {
		id := identifiers[i]
		p, ok := c.GetPod(id)

		if ok && p.PodUID == string(pod.UID) {
			c.appendDeleteQueue(id, p.PodUID)
		}
	}
}

  1. forgetPod builds podToRemove from the pod object carried in the incoming delete event.
  2. It then calls getIdentifiersFromAssoc(podToRemove) to determine which keys to add to the delete queue.
  3. If the CNI has already reclaimed or cleared the Pod's IP address by the time the final DELETE event is dispatched (or if the status payload is incomplete), pod.Status.PodIP is empty ("").
  4. As a result, getIdentifiersFromAssoc generates only the UID-based identifier (resource_attribute: k8s.pod.uid) and skips the IP-based identifiers (connection: <IP> and resource_attribute: k8s.pod.ip).
  5. Only the UID-based key is therefore queued for deletion; the IP-based keys are never cleared and remain in the c.Pods map indefinitely.
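
The sequence above can be reproduced in a toy model. The sketch below is not the processor's code: the pod and cache types, the "uid:"/"ip:" key format, and the identifiersFor and forgetFromEvent helpers are hypothetical stand-ins, and entries are deleted immediately rather than placed on a delete queue. It only illustrates how deriving keys from an IP-less delete event leaves the IP key behind.

```go
package main

import (
	"fmt"
	"sort"
)

// pod is a simplified, hypothetical stand-in for a cached pod record.
type pod struct {
	UID string
	IP  string
}

// identifiersFor mimics the idea described for getIdentifiersFromAssoc:
// an IP-based key is emitted only when the pod object carries an IP.
func identifiersFor(p pod) []string {
	ids := []string{"uid:" + p.UID}
	if p.IP != "" {
		ids = append(ids, "ip:"+p.IP)
	}
	return ids
}

// cache is a toy version of the c.Pods map.
type cache map[string]pod

// add stores the pod under every identifier it currently has.
func (c cache) add(p pod) {
	for _, id := range identifiersFor(p) {
		c[id] = p
	}
}

// forgetFromEvent derives the keys to remove from the delete event itself,
// which is the pattern described in the root cause.
func (c cache) forgetFromEvent(event pod) {
	for _, id := range identifiersFor(event) {
		if cached, ok := c[id]; ok && cached.UID == event.UID {
			delete(c, id) // the real code queues the key for deferred deletion
		}
	}
}

// leakedKeys runs one add/delete cycle in which the delete event has no IP
// and returns the keys that are still cached afterwards.
func leakedKeys() []string {
	c := cache{}
	c.add(pod{UID: "u1", IP: "10.0.0.1"})
	c.forgetFromEvent(pod{UID: "u1"}) // IP already cleared in the event
	keys := make([]string, 0, len(c))
	for k := range c {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

func main() {
	fmt.Println(leakedKeys()) // prints [ip:10.0.0.1]
}
```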

Proposed Solution

Instead of relying on the incoming delete event status to determine the keys to delete, forgetPod should look up the cached pod from c.Pods using the pod's UID (which is always present in the event). The cached pod object is guaranteed to contain the IP address and all attributes as they were stored during the pod's lifecycle.
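
As a rough illustration of this approach (not the actual diff in this PR), the sketch below uses a toy cache with hypothetical names: pod, cache, identifiersFor, forgetByUID, and the "uid:"/"ip:" key format are all assumptions made for the example. It looks up the cached record by UID and derives the keys to remove from that record instead of from the event.

```go
package main

import (
	"fmt"
	"sort"
)

// pod is a simplified, hypothetical stand-in for a cached pod record.
type pod struct{ UID, IP string }

// identifiersFor emits an IP-based key only when the pod carries an IP.
func identifiersFor(p pod) []string {
	ids := []string{"uid:" + p.UID}
	if p.IP != "" {
		ids = append(ids, "ip:"+p.IP)
	}
	return ids
}

// cache is a toy version of the c.Pods map.
type cache map[string]pod

// add stores the pod under every identifier, overwriting older owners.
func (c cache) add(p pod) {
	for _, id := range identifiersFor(p) {
		c[id] = p
	}
}

// forgetByUID resolves the cached record through the UID key and removes
// every key derived from that record, as long as the key still belongs to
// the pod being deleted.
func (c cache) forgetByUID(event pod) {
	cached, ok := c["uid:"+event.UID]
	if !ok {
		cached = event // nothing cached under this UID; fall back to the event
	}
	for _, id := range identifiersFor(cached) {
		if p, ok := c[id]; ok && p.UID == event.UID {
			delete(c, id)
		}
	}
}

// keys returns the cached keys in sorted order.
func (c cache) keys() []string {
	out := make([]string, 0, len(c))
	for k := range c {
		out = append(out, k)
	}
	sort.Strings(out)
	return out
}

func main() {
	// Delete event without an IP: both keys are still removed.
	c := cache{}
	c.add(pod{UID: "u1", IP: "10.0.0.1"})
	c.forgetByUID(pod{UID: "u1"})
	fmt.Println(len(c)) // prints 0

	// IP reused by a newer pod: its entries survive the older pod's deletion.
	c.add(pod{UID: "u1", IP: "10.0.0.1"})
	c.add(pod{UID: "u2", IP: "10.0.0.1"})
	c.forgetByUID(pod{UID: "u1"})
	fmt.Println(c.keys()) // prints [ip:10.0.0.1 uid:u2]
}
```

Keeping the UID comparison inside the loop means an IP key that has since been taken over by another pod is left alone, which matches the ownership check in the existing forgetPod.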

Link to tracking issue

Fixes #48986

Testing

Added a unit test case (TestPodDeleteIPMissingFromDeleteEvent) that covers a deletion event whose Pod object is missing its IP fields.

Documentation

Added a changelog entry for a bug fix.

Authorship

  • [x] I, a human, wrote this pull request description myself.

@giuliano-sider

Contributor Author

I can reproduce the issue (#48986) reliably when scaling my Deployment of 100K Running Pods down to zero. This time the map holds 180K entries. I suppose that if I keep scaling up and down, the leak will cause this number to continue to climb?

The unit test that I added (TestPodDeleteIPMissingFromDeleteEvent) fails without the fix in this PR.
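
Whether the count keeps climbing across repeated scale-up/scale-down cycles can be explored with a toy simulation. Everything in the sketch below is an assumption made for illustration: the leakedAfterCycles function, the key format, the premise that every delete event arrives without an IP, and the premise that a new pod overwrites a stale entry for a reused IP. It models one IP key per pod and says nothing definitive about the real processor.

```go
package main

import "fmt"

// leakedAfterCycles simulates `cycles` rounds of scaling up to podsPerCycle
// pods and back down to zero. Each delete is assumed to arrive without an IP,
// so only the UID key is removed. ipPool is the number of distinct addresses
// a hypothetical CNI hands out before it starts reusing them.
func leakedAfterCycles(cycles, podsPerCycle, ipPool int) int {
	cache := map[string]string{} // cache key -> owning pod UID
	next := 0
	for c := 0; c < cycles; c++ {
		uids := make([]string, 0, podsPerCycle)
		for i := 0; i < podsPerCycle; i++ {
			uid := fmt.Sprintf("uid-%d", next)
			ip := fmt.Sprintf("ip-%d", next%ipPool)
			next++
			cache["uid:"+uid] = uid
			cache["ip:"+ip] = uid // overwrites a stale entry for a reused IP
			uids = append(uids, uid)
		}
		// Scale down: only the UID keys are cleaned up.
		for _, uid := range uids {
			delete(cache, "uid:"+uid)
		}
	}
	return len(cache)
}

func main() {
	fmt.Println(leakedAfterCycles(1, 100, 1000)) // prints 100
	fmt.Println(leakedAfterCycles(3, 100, 1000)) // prints 300
	fmt.Println(leakedAfterCycles(3, 100, 150))  // prints 150
}
```

Under these assumptions the leaked count grows with every cycle while fresh IPs are handed out and levels off once addresses start being reused, at which point it is bounded by the size of the IP pool.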

Labels

processor/k8sattributes k8s Attributes processor

Development

Successfully merging this pull request may close these issues.

k8sattributesprocessor: Pod IP-based cache keys leaked on deletion if Pod IP is missing from the delete event status