@dwindsor dwindsor commented Jan 29, 2026

Fixes

Description

Make the exit eventcache path mirror the exec path with respect to refcount ops.

When events are pushed to the eventcache, they are retried a certain number of times to see whether pod/parent info has become available. Two methods handle the retry, RetryInternal() and Retry(), both of which are interface methods. RetryInternal() is sometimes called before Retry(), depending on the circumstances.

For exec, parent.RefInc("parent") is called in Retry(), after the pod info check. For exit, parent.RefDec("parent") is called in RetryInternal(), before the pod info check. That check happens in Retry(), which for the exit event is just a call to eventcache.HandleGenericEvent.

Balance the exit path with the exec path: call parent.RefDec("parent") only after pod and parent info are confirmed to be present.
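To illustrate why the position of the decrement matters, here is a minimal, self-contained Go sketch. It is not Tetragon code: procEntry and retryExit are hypothetical names, and the model assumes that a failed info check causes the whole retry sequence to run again on the next attempt, which is an assumption made for illustration rather than something stated above.

```go
package main

import "fmt"

// procEntry is a hypothetical stand-in for a process cache entry.
type procEntry struct {
	refcnt uint32
}

func (p *procEntry) RefInc() { p.refcnt++ }
func (p *procEntry) RefDec() { p.refcnt-- } // wraps past zero, as uint32 does

// retryExit models one exit event whose info check succeeds only on the
// final attempt. decBeforeCheck selects where the decrement sits.
// Assumption: a failed check re-runs the whole sequence on the next attempt.
func retryExit(parent *procEntry, attempts int, decBeforeCheck bool) {
	for i := 1; i <= attempts; i++ {
		infoReady := i == attempts
		if decBeforeCheck {
			parent.RefDec() // runs on every attempt, including failed ones
		}
		if !infoReady {
			continue // event stays cached and is retried later
		}
		if !decBeforeCheck {
			parent.RefDec() // runs exactly once, after the check passes
		}
	}
}

func main() {
	early := &procEntry{}
	early.RefInc() // the single increment taken on the exec side
	retryExit(early, 2, true)
	fmt.Println(early.refcnt) // 4294967295

	late := &procEntry{}
	late.RefInc()
	retryExit(late, 2, false)
	fmt.Println(late.refcnt) // 0
}
```

Under that assumption, decrementing before the check drops the count once per attempt, while decrementing after the check drops it once per event, which matches the single increment.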

Results

Before Fix

tetragon_event_cache_inserts_total 161844
tetragon_event_cache_fetch_retries_total{entry_type="ancestors_info"} 0
tetragon_event_cache_fetch_retries_total{entry_type="parent_info"} 1020
tetragon_event_cache_fetch_retries_total{entry_type="pod_info"} 55169
tetragon_event_cache_fetch_retries_total{entry_type="process_info"} 0
tetragon_event_cache_fetch_failures_total{entry_type="pod_info",event_type="PROCESS_EXEC"} 509
tetragon_event_cache_fetch_failures_total{entry_type="pod_info",event_type="PROCESS_EXIT"} 458

# refcnt is uint32, so a value above 2147483647 suggests the counter wrapped below zero
$ tetra debug dump processcache --max-recv-size 200000000 2>/dev/null | jq -s '[.[] | select(.refcnt > 2147483647)] | length'
565

  {
    "process": {
      "exec_id": "Y3ZlLXRlc3QtY29udHJvbC1wbGFuZTo2MjA3MzQ5NDMxNTc0OTQ6MzY1ODY2MQ==",
      "pid": 3658661,
      "uid": 0,
      "cwd": "/",
      "binary": "/bin/sh",
      "arguments": "/scripts/churn.sh",
      "flags": "execve rootcwd clone inInitTree",
      "start_time": "2026-01-29T19:54:58.884463084Z",
      "auid": 4294967295,
      "pod": {
  ...
     }
    },
    "color": "deletePending",
    "refcnt": 4294967295,
    "refcnt_ops": {
      "parent++": 202,
      "parent--": 203,
      "process++": 1,
      "process--": 1
    }
  },
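
The refcnt_ops counters in this entry account for the reported refcnt: 203 increments (202 parent, 1 process) against 204 decrements (203 parent, 1 process) leave the counter one below zero, which a uint32 represents as 4294967295. A short Go sketch of that arithmetic, where finalRefcnt is a hypothetical helper and the counter is assumed to start at zero:

```go
package main

import "fmt"

// finalRefcnt replays a number of increments and decrements on a uint32
// counter that is assumed to start at zero. Unsigned arithmetic in Go
// wraps modulo 2^32, so a surplus decrement wraps to the top of the range.
func finalRefcnt(incs, decs uint32) uint32 {
	var refcnt uint32
	refcnt += incs
	refcnt -= decs
	return refcnt
}

func main() {
	incs := uint32(202 + 1) // parent++ and process++
	decs := uint32(203 + 1) // parent-- and process--
	fmt.Println(finalRefcnt(incs, decs)) // 4294967295
}
```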

After Fix

tetragon_event_cache_inserts_total 161948
tetragon_event_cache_fetch_retries_total{entry_type="ancestors_info"} 0
tetragon_event_cache_fetch_retries_total{entry_type="parent_info"} 4556
tetragon_event_cache_fetch_retries_total{entry_type="pod_info"} 53862
tetragon_event_cache_fetch_retries_total{entry_type="process_info"} 0
tetragon_event_cache_fetch_failures_total{entry_type="pod_info",event_type="PROCESS_EXEC"} 508
tetragon_event_cache_fetch_failures_total{entry_type="pod_info",event_type="PROCESS_EXIT"} 450

# refcnt is uint32, so a value above 2147483647 suggests the counter wrapped below zero
$ tetra debug dump processcache --max-recv-size 200000000 2>/dev/null | jq -s '[.[] | select(.refcnt > 2147483647)] | length'
0

This is the workload that was used to generate these metrics:

apiVersion: v1
kind: ConfigMap
metadata:
  name: churn-scripts
  namespace: default
data:
  churn.sh: |
    #!/bin/sh
    for i in $(seq 1 100); do
      /bin/true
    done
---
apiVersion: batch/v1
kind: Job
metadata:
  name: churn-job
  namespace: default
spec:
  parallelism: 50
  completions: 500
  backoffLimit: 500
  template:
    spec:
      restartPolicy: Never
      volumes:
        - name: scripts
          configMap:
            name: churn-scripts
            defaultMode: 0755
      containers:
        - name: churn
          image: alpine:latest
          command: ["/bin/sh", "/scripts/churn.sh"]
          volumeMounts:
            - name: scripts
              mountPath: /scripts

Changelog

@dwindsor dwindsor requested a review from a team as a code owner January 29, 2026 18:20
@dwindsor dwindsor marked this pull request as draft January 29, 2026 18:55
@dwindsor dwindsor changed the title from "draft: fix(grpc/exec): fix RefDec in exit's eventcache path" to "fix(grpc/exec): fix RefDec in exit's eventcache path" Jan 29, 2026
@dwindsor dwindsor marked this pull request as ready for review January 29, 2026 19:43