
fix: Pod IP Deletion Leak in eBPF FilterMap #2114

Open
alexcastilio wants to merge 4 commits into main from fix/pod-ip-del-leak

Conversation


@alexcastilio alexcastilio commented Mar 16, 2026

Description

Fix: Pod IP Deletion Leak in eBPF FilterMap

Problem

Pod IPs accumulate indefinitely in the eBPF filtermap because DELETE operations fail in two ways:

  1. PodCallBackFn guard drops delete events: When a namespace is removed from the include list or a pod annotation is removed, nsOfInterest() and podOfInterest() both return false — the PodDeleted event is silently discarded before reaching handlePodEvent().

  2. applyDirtyPodsDelete uses wrong metadata: Even if the event reaches the delete path, Annotated and Namespaced flags are re-evaluated at delete time against current state (not the state when the IP was added). The filtermanager requires matching (Requestor, RequestMetadata) to remove a reference — a delete with the wrong metadata is a no-op.

This causes "no space left on device" errors when the eBPF filtermap fills up (255 entries).


Fix

Two changes in pkg/module/metrics/metrics_module.go:

  1. Bypass guard for PodDeleted events: PodCallBackFn now skips the nsOfInterest/podOfInterest check when event.Type == EventTypePodDeleted, ensuring delete events always reach handlePodEvent.

  2. Always delete with both metadata types: applyDirtyPodsDelete unconditionally issues DeleteIPs with both modulePodReqMetadata ("pod") and moduleReqMetadata ("namespace") for every IP in the delete list. The filtermanager's deleteIP is a safe no-op when the metadata doesn't exist for an IP, so the extra calls cause no harm.

Additional minor fix: Replaced zap.Any with fmt.Sprint for []net.IP log fields to fix unsupported value type errors in log output.

Tests

Unit tests added and manual validation performed (scenarios below).

Manual validation

Scenario 1 — Namespace filter change (annotations mode)

Test:

helm install with: enableAnnotations=true, enablePodLevel=true

kubectl create namespace test-leak-ns
kubectl run test-pod -n test-leak-ns --image=nginx
kubectl annotate namespace test-leak-ns retina.sh=observe
# → Verify: "Adding IPs to filter manager" with pod IP in retina logs
kubectl annotate namespace test-leak-ns retina.sh-
kubectl delete pod test-pod -n test-leak-ns
# → Verify: "Adding pod IP to DELETE dirty pods cache" and "Deleting Ips in dirty pods from filtermap" in retina logs

Logs:

=== ADD phase ===
Defaulted container "retina" out of: retina, init-retina (init)
ts=2026-03-16T10:30:44.504Z level=info caller=metrics/metrics_module.go:391 msg="Namespaces to add" namespaces=
ts=2026-03-16T10:30:44.504Z level=info caller=metrics/metrics_module.go:391 msg="Namespaces to add" namespaces=test-leak-ns
ts=2026-03-16T10:30:44.504Z level=info caller=metrics/metrics_module.go:397 msg="Adding IPs to filter manager" namespace=test-leak-ns ips=[10.224.0.32]
namespace/test-leak-ns annotated
pod "test-pod" deleted
=== DELETE phase ===
Defaulted container "retina" out of: retina, init-retina (init)
ts=2026-03-16T10:31:12.955Z level=info caller=metrics/metrics_module.go:478 msg="Adding pod IP to DELETE dirty pods cache" pod name=test-leak-ns/test-pod
ts=2026-03-16T10:31:13.504Z level=debug caller=metrics/metrics_module.go:544 msg="Deleting Ips in dirty pods from filtermap" IPs=[10.224.0.32]

Scenario 2 — Namespace filter change (MetricsConfiguration CRD mode)

Test:

helm install with: enableAnnotations=false, enablePodLevel=true

kubectl create namespace test-leak-ns
kubectl run test-pod -n test-leak-ns --image=nginx
Apply MetricsConfiguration CRD with namespaces.include: [test-leak-ns]
# → Verify: "Adding IPs to filter manager" with pod IP in retina logs
Update MetricsConfiguration CRD to namespaces.include: [default]
kubectl delete pod test-pod -n test-leak-ns
# → Verify: "Adding pod IP to DELETE dirty pods cache" and "Deleting Ips in dirty pods from filtermap" in retina logs

Logs:

=== ADD phase ===
Defaulted container "retina" out of: retina, init-retina (init)
ts=2026-03-16T10:37:22.462Z level=info caller=metrics/metrics_module.go:158 msg="Reconciling metric module" spec= specError="unsupported value type"
ts=2026-03-16T10:37:22.462Z level=info caller=metrics/metrics_module.go:391 msg="Namespaces to add" namespaces=
ts=2026-03-16T10:37:22.462Z level=info caller=metrics/metrics_module.go:391 msg="Namespaces to add" namespaces=test-leak-ns
ts=2026-03-16T10:37:22.463Z level=info caller=metrics/metrics_module.go:397 msg="Adding IPs to filter manager" namespace=test-leak-ns ips=[10.224.0.36]
metricsconfiguration.retina.sh/test-metricsconfig configured
pod "test-pod" deleted
=== DELETE phase ===
Defaulted container "retina" out of: retina, init-retina (init)
ts=2026-03-16T10:37:49.138Z level=info caller=metrics/metrics_module.go:478 msg="Adding pod IP to DELETE dirty pods cache" pod name=test-leak-ns/test-pod
ts=2026-03-16T10:37:49.465Z level=debug caller=metrics/metrics_module.go:544 msg="Deleting Ips in dirty pods from filtermap" IPs=[10.224.0.36]
metricsconfiguration.retina.sh "test-metricsconfig" deleted
namespace "test-leak-ns" deleted

Scenario 3 — Pod annotation removed then deleted

Test:

helm install with: enableAnnotations=true, enablePodLevel=true

Create pod with annotation retina.sh=observe in default namespace
# → Verify: "Adding pod IP to ADD dirty pods cache" in retina logs
kubectl annotate pod annotated-pod -n default retina.sh-
kubectl delete pod annotated-pod -n default
# → Verify: "Adding pod IP to DELETE dirty pods cache" and "Deleting Ips in dirty pods from filtermap" in retina logs

Logs:

=== ADD phase ===
Defaulted container "retina" out of: retina, init-retina (init)
ts=2026-03-16T10:32:59.294Z level=info caller=metrics/metrics_module.go:475 msg="Adding pod IP to ADD dirty pods cache" pod name=default/annotated-pod
ts=2026-03-16T10:32:59.504Z level=debug caller=metrics/metrics_module.go:515 msg="Adding annotated pod IPs to filtermap" IPs=[10.224.0.31]
pod/annotated-pod annotated
pod "annotated-pod" deleted
=== DELETE phase ===
Defaulted container "retina" out of: retina, init-retina (init)
ts=2026-03-16T10:33:14.229Z level=info caller=metrics/metrics_module.go:470 msg="Adding pod IP to DELETE dirty pods cache. Pod not annotated or in namespace of interest." pod name=default/annotated-pod
ts=2026-03-16T10:33:14.505Z level=debug caller=metrics/metrics_module.go:544 msg="Deleting Ips in dirty pods from filtermap" IPs=[10.224.0.31]

Related Issue

#2085

Checklist

  • I have read the contributing documentation.
  • I signed and signed-off the commits (git commit -S -s ...). See this documentation on signing commits.
  • I have correctly attributed the author(s) of the code.
  • I have tested the changes locally.
  • I have followed the project's style guidelines.
  • I have updated the documentation, if necessary.
  • I have added tests, if applicable.



Signed-off-by: Alex Castilio dos Santos <alexsantos@microsoft.com>
@alexcastilio alexcastilio requested a review from a team as a code owner March 16, 2026 11:00

github-actions bot commented Mar 16, 2026

Retina Code Coverage Report

Total coverage: no change

Increased diff

Impacted Files Coverage
pkg/module/metrics/metrics_module.go 80.34% ... 81.61% (1.27%) ⬆️

Decreased diff

Impacted Files Coverage
pkg/enricher/enricher.go 57.8% ... 56.4% (-1.4%) ⬇️

@aanchal22
Contributor

A few gaps I noticed from my investigation that the two PRs don't cover:

  1. Spurious DELETE event protection
    When a pod DELETE event fires, neither PR verifies the pod is actually gone from the cache before processing. Due to the cache timing issue (cache updated before event published), spurious DELETE events during
    startup or rapid pod churn could remove valid IPs from the filtermap. Our branch added a cache check:
if endpoint := m.daemonCache.GetPodByIP(ip.String()); endpoint != nil {
     // Pod still exists in cache — ignore spurious DELETE
     return
 }
  2. Forced Annotated = true on IP reuse (in handlePodEvent)
    When a pod IP is reused by an untracked pod, the current code forces podCacheEntry.Annotated = true before adding to the delete cache. This causes the delete to use pod-annotation metadata even if the original IP was added with namespace metadata, potentially leaving a stale entry. The brute-force "delete with both" approach in this PR may mask this, but the forced flag is still incorrect.

  3. Filtermanager observability
    No warning logs are emitted when deleteIP fails in the filtermanager cache (requestor not found, IP not found). This makes it harder to diagnose leak issues in production. Adding warnings to
    pkg/managers/filtermanager/cache.go for these failure paths would improve debuggability.

  4. eBPF filter map size configurability
    The retina_filter eBPF map max_entries is hardcoded at 255. For clusters with many tracked pods, this can cause "no space left on device" errors. I have a separate PR #2117 making this configurable via Helm values / env var.
