@AritraDey-Dev
Member

@AritraDey-Dev AritraDey-Dev commented Nov 29, 2025

Fixes #902

Description

This PR fixes an issue where the status field in ProcessExit events was missing from JSON output when the exit code was 0 (success).
From my testing, this was not a bug in the tetra CLI but an issue with the API definition itself.

Verification:

Before the Fix:

{
  "process_exit": {
    "process": {
      "exec_id": "dGV0cmFnb24tZGV2LWNvbnRyb2wtcGxhbmU6MzYxMTA0MzkwMjk1Njo1NjY5MA==",
      "pid": 56690,
      "uid": 0,
      "cwd": "/run/containerd/io.containerd.runtime.v2.task/k8s.io/07bad515076140f620bedeeb42ccaf2f54f3748d6e04510348287814ace9ddbc",
      "binary": "/usr/local/sbin/runc",
      "arguments": "--root /run/containerd/runc/k8s.io --log /run/containerd/io.containerd.runtime.v2.task/k8s.io/07bad515076140f620bedeeb42ccaf2f54f3748d6e04510348287814ace9ddbc/log.json --log-format json --systemd-cgroup create --bundle /run/containerd/io.containerd.runtime.v2.task/k8s.io/07bad515076140f620bedeeb42ccaf2f54f3748d6e04510348287814ace9ddbc --pid-file /run/containerd/io.containerd.runtime.v2.task/k8s.io/07bad515076140f620bedeeb42ccaf2f54f3748d6e04510348287814ace9ddbc/init.pid 07bad515076140f620bedeeb42ccaf2f54f3748d6e04510348287814ace9ddbc",
      "flags": "execve clone",
      "start_time": "2025-11-29T08:54:22.963255116Z",
      "auid": 4294967295,
      "parent_exec_id": "dGV0cmFnb24tZGV2LWNvbnRyb2wtcGxhbmU6MzYxMTAzNTgyMTgzMjo1NjY3OQ==",
      "tid": 56690,
      "in_init_tree": false
    },
    "parent": {
      "exec_id": "dGV0cmFnb24tZGV2LWNvbnRyb2wtcGxhbmU6MzYxMTAzNTgyMTgzMjo1NjY3OQ==",
      "pid": 56679,
      "uid": 0,
      "cwd": "/run/containerd/io.containerd.runtime.v2.task/k8s.io/07bad515076140f620bedeeb42ccaf2f54f3748d6e04510348287814ace9ddbc",
      "binary": "/usr/local/bin/containerd-shim-runc-v2",
      "arguments": "-namespace k8s.io -id 07bad515076140f620bedeeb42ccaf2f54f3748d6e04510348287814ace9ddbc -address /run/containerd/containerd.sock",
      "flags": "execve clone",
      "start_time": "2025-11-29T08:54:22.955174003Z",
      "auid": 4294967295,
      "parent_exec_id": "dGV0cmFnb24tZGV2LWNvbnRyb2wtcGxhbmU6MzYxMTAyMjc5NzA2Mjo1NjY3MQ==",
      "tid": 56679,
      "in_init_tree": false
    },
    "time": "2025-11-29T08:54:23.027344849Z"
  },
  "node_name": "tetragon-dev-control-plane",
  "time": "2025-11-29T08:54:23.027342859Z",
  "node_labels": {
    "beta.kubernetes.io/arch": "amd64",
    "beta.kubernetes.io/os": "linux",
    "kubernetes.io/arch": "amd64",
    "kubernetes.io/hostname": "tetragon-dev-control-plane",
    "kubernetes.io/os": "linux",
    "node-role.kubernetes.io/control-plane": ""
  }
}

After:

{
  "process_exit": {
    "process": {
      "exec_id": "dGV0cmFnb24tZGV2LWNvbnRyb2wtcGxhbmU6NDUyOTY2Mzc5Nzg0Mzo2ODg3MA==",
      "pid": 68870,
      "uid": 0,
      "cwd": "/run/containerd/io.containerd.runtime.v2.task/k8s.io/f993877bb0749ec44bb6936351daadfffeb9f85a1aedc690c9ff30cbefd3cdeb/rootfs",
      "binary": "/proc/self/fd/6",
      "arguments": "init",
      "flags": "execve clone",
      "start_time": "2025-11-29T09:09:41.582634634Z",
      "auid": 4294967295,
      "parent_exec_id": "dGV0cmFnb24tZGV2LWNvbnRyb2wtcGxhbmU6NDUyOTY1ODA1ODY3Mzo2ODg2MA==",
      "tid": 68870,
      "in_init_tree": false
    },
    "parent": {
      "exec_id": "dGV0cmFnb24tZGV2LWNvbnRyb2wtcGxhbmU6NDUyOTY1ODA1ODY3Mzo2ODg2MA==",
      "pid": 68860,
      "uid": 0,
      "cwd": "/run/containerd/io.containerd.runtime.v2.task/k8s.io/58a0689b62bcd42aa8d69edcfaaf70c79ea7adf1465d021341b5f659e70ea55e",
      "binary": "/usr/local/sbin/runc",
      "arguments": "--root /run/containerd/runc/k8s.io --log /run/containerd/io.containerd.runtime.v2.task/k8s.io/f993877bb0749ec44bb6936351daadfffeb9f85a1aedc690c9ff30cbefd3cdeb/log.json --log-format json --systemd-cgroup create --bundle /run/containerd/io.containerd.runtime.v2.task/k8s.io/f993877bb0749ec44bb6936351daadfffeb9f85a1aedc690c9ff30cbefd3cdeb --pid-file /run/containerd/io.containerd.runtime.v2.task/k8s.io/f993877bb0749ec44bb6936351daadfffeb9f85a1aedc690c9ff30cbefd3cdeb/init.pid f993877bb0749ec44bb6936351daadfffeb9f85a1aedc690c9ff30cbefd3cdeb",
      "flags": "execve clone",
      "start_time": "2025-11-29T09:09:41.576895876Z",
      "auid": 4294967295,
      "parent_exec_id": "dGV0cmFnb24tZGV2LWNvbnRyb2wtcGxhbmU6NDUyNjU5MzkzODU3MDo2ODgyMw==",
      "tid": 68860,
      "in_init_tree": false
    },
    "status": 0,
    "time": "2025-11-29T09:09:41.608000424Z"
  },
  "node_name": "tetragon-dev-control-plane",
  "time": "2025-11-29T09:09:41.608000041Z",
  "node_labels": {
    "beta.kubernetes.io/arch": "amd64",
    "beta.kubernetes.io/os": "linux",
    "kubernetes.io/arch": "amd64",
    "kubernetes.io/hostname": "tetragon-dev-control-plane",
    "kubernetes.io/os": "linux",
    "node-role.kubernetes.io/control-plane": ""
  }
}

Changelog

api: Fix missing status: 0 in process_exit JSON events.

@AritraDey-Dev AritraDey-Dev requested a review from a team as a code owner November 29, 2025 09:59
@netlify

netlify bot commented Nov 29, 2025

Deploy Preview for tetragon ready!

Name Link
🔨 Latest commit a95b631
🔍 Latest deploy log https://app.netlify.com/projects/tetragon/deploys/692ac3e7d1d35f0008f92ff9
😎 Deploy Preview https://deploy-preview-4392--tetragon.netlify.app

@AritraDey-Dev AritraDey-Dev force-pushed the fix-missing-process-exit-status branch from a95b631 to 7c2adc9 Compare November 29, 2025 10:04
The status field in ProcessExit was a plain uint32. In proto3, a scalar field without explicit presence is omitted from the JSON output when it holds its default value (0 for numbers). This hid the status code whenever a process exited successfully.

We changed it to a wrapper type (google.protobuf.UInt32Value). Now a status of 0 still appears in the JSON output, because the wrapper carries explicit presence.

Fixes cilium#902

Signed-off-by: Aritra Dey <[email protected]>
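
For illustration, the difference between a plain scalar and a presence-carrying field can be sketched with Go's standard encoding/json package. This is only an analogy under stated assumptions: the struct names are hypothetical, and omitempty on a value versus a pointer stands in for proto3 implicit versus explicit presence. It is not the protojson code path that Tetragon uses.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// implicitExit (hypothetical) mimics a plain proto3 scalar: a zero value is
// indistinguishable from "not set", so omitempty drops it from the output.
type implicitExit struct {
	Status uint32 `json:"status,omitempty"`
}

// explicitExit (hypothetical) mimics a wrapper-typed field: the pointer is
// nil when unset, so a zero that was explicitly set survives serialization.
type explicitExit struct {
	Status *uint32 `json:"status,omitempty"`
}

// encode marshals v and returns the JSON text.
func encode(v any) string {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	zero := uint32(0)
	fmt.Println(encode(implicitExit{Status: 0}))     // {}
	fmt.Println(encode(explicitExit{Status: &zero})) // {"status":0}
	fmt.Println(encode(explicitExit{}))              // {}
}
```

The second line of output is the behavior the wrapper type is meant to provide: a successful exit is reported as 0 instead of being dropped.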
@AritraDey-Dev AritraDey-Dev force-pushed the fix-missing-process-exit-status branch from 7c2adc9 to 249c893 Compare November 29, 2025 10:12
@FedeDP
Contributor

FedeDP commented Dec 1, 2025

Thanks for this PR! Can you explain where the bug was and how this PR fixes it? At least for me, it is not obvious from looking at the code :) This way, we can hopefully avoid similar bugs in the future!

  // Status code on process exit. For example, the status code can indicate
  // if an error was encountered or the program exited successfully.
- uint32 status = 4;
+ google.protobuf.UInt32Value status = 4;
Contributor

I'm not sure that omitting the status field because it's zero is a real problem; that's the case for all fields, right?

Moreover, I wonder whether we could make the above change and still stay backward compatible? @kkourt

Member

Yeah, it seems we are using a mix of google.protobuf.* wrapper types for some integers and plain scalar types for others: https://github.com/cilium/tetragon/blob/main/api/v1/tetragon/tetragon.proto. Indeed, with Go and proto I think we can't force the uint32 to be present in the JSON if its value is zero. So using a wrapper type is one way of doing it (using pointers is another, maybe worse, way).

This is indeed a breaking change, unfortunately :/!

Contributor

@andrewstrohman andrewstrohman Dec 1, 2025

Can we fix this in a backward-compatible way by using the optional label described here to get explicit presence?

Member Author

Yes, unfortunately this is a breaking change. After going through here, I think this could be a fix:

  uint32 status = 4 [deprecated = true];
  google.protobuf.UInt32Value exit_code = 7;

But I haven't tested this yet. I will keep this PR as a draft for now.

Member Author

Can we fix this in a backward-compatible way by using the optional label described here to get explicit presence?

I don't think using optional will fix the status-code issue mentioned in upstream issue #902. It will definitely prevent the breaking change, but that's not the goal of this PR.

Member

Using optional will turn the field into a pointer in Go, which is still a breaking change.

Member Author

Using optional will turn the field into a pointer in Go, which is still a breaking change.

Yes, I got this error locally:

/src/api/v1/tetragon/tetragon.proto:323:3:Field "4" with name "status" on message "ProcessExit" changed cardinality from "optional with implicit presence" to "optional with explicit presence"

Member Author

uint32 status = 4 [deprecated = true];
google.protobuf.UInt32Value exit_code = 7;

I tried it this way, and here is what the output looks like.
For status code 0:

{
  "process_exit": {
    "process": {
      "exec_id": "YXJpdHJhLUlkZWFQYWQtU2xpbS0zLTE1SVJIODoxNjMyMTA3NDgxMDk4MjoxMjUyODk=",
      "pid": 125289,
      "uid": 1000,
      "cwd": "/home/aritra/Downloads/projects/tetragon",
      "binary": "/usr/bin/true",
      "flags": "execve clone",
      "start_time": "2025-12-03T18:36:24.375076061Z",
      "auid": 1000,
      "parent_exec_id": "YXJpdHJhLUlkZWFQYWQtU2xpbS0zLTE1SVJIODoxNjMwOTA1MDAwMDAwMDoxMjUxMTU=",
      "tid": 125289,
      "in_init_tree": false
    },
    "parent": {
      "exec_id": "YXJpdHJhLUlkZWFQYWQtU2xpbS0zLTE1SVJIODoxNjMwOTA1MDAwMDAwMDoxMjUxMTU=",
      "pid": 125115,
      "uid": 1000,
      "cwd": "/home/aritra/Downloads/projects/tetragon",
      "binary": "/usr/bin/bash",
      "arguments": "./verify_fix.sh",
      "flags": "procFS",
      "start_time": "2025-12-03T17:26:41.766178459Z",
      "auid": 1000,
      "parent_exec_id": "YXJpdHJhLUlkZWFQYWQtU2xpbS0zLTE1SVJIODoxNTM1NTg3MDAwMDAwMDoxMTEwMTg=",
      "tid": 125115,
      "in_init_tree": false
    },
    "exit_code": 0,
    "time": "2025-12-03T18:36:24.375839673Z"
  },
  "node_name": "aritra-IdeaPad-Slim-3-15IRH8",
  "time": "2025-12-03T18:36:24.375838636Z"
}

And for a non-zero status code (e.g. 1):

{
  "process_exit": {
    "process": {
      "exec_id": "YXJpdHJhLUlkZWFQYWQtU2xpbS0zLTE1SVJIODoxNjMyMzA4MDE1MjE5MDoxMjUzMzY=",
      "pid": 125336,
      "uid": 1000,
      "cwd": "/home/aritra/Downloads/projects/tetragon",
      "binary": "/usr/bin/false",
      "flags": "execve clone",
      "start_time": "2025-12-03T18:36:26.380417078Z",
      "auid": 1000,
      "parent_exec_id": "YXJpdHJhLUlkZWFQYWQtU2xpbS0zLTE1SVJIODoxNjMwOTA1MDAwMDAwMDoxMjUxMTU=",
      "tid": 125336,
      "in_init_tree": false
    },
    "parent": {
      "exec_id": "YXJpdHJhLUlkZWFQYWQtU2xpbS0zLTE1SVJIODoxNjMwOTA1MDAwMDAwMDoxMjUxMTU=",
      "pid": 125115,
      "uid": 1000,
      "cwd": "/home/aritra/Downloads/projects/tetragon",
      "binary": "/usr/bin/bash",
      "arguments": "./verify_fix.sh",
      "flags": "procFS",
      "start_time": "2025-12-03T17:26:41.766178459Z",
      "auid": 1000,
      "parent_exec_id": "YXJpdHJhLUlkZWFQYWQtU2xpbS0zLTE1SVJIODoxNTM1NTg3MDAwMDAwMDoxMTEwMTg=",
      "tid": 125115,
      "in_init_tree": false
    },
    "status": 1,
    "exit_code": 1,
    "time": "2025-12-03T18:36:26.381048159Z"
  },
  "node_name": "aritra-IdeaPad-Slim-3-15IRH8",
  "time": "2025-12-03T18:36:26.381047408Z"
}

Question: is this redundancy acceptable? I am concerned that having both status and exit_code might confuse users. Does this seem like a good approach?
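
To make the trade-off concrete, here is a minimal sketch of how a JSON consumer might read both fields during a deprecation window: prefer exit_code when it is present and fall back to the legacy status otherwise. The type and function names are hypothetical, and only the standard encoding/json package is used, so this is not Tetragon client code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// processExit (hypothetical) holds only the two fields under discussion.
type processExit struct {
	Status   uint32  `json:"status"`    // legacy field, absent when zero
	ExitCode *uint32 `json:"exit_code"` // new field, present even when zero
}

// exitCodeOf returns exit_code when the event carries it and falls back to
// the legacy status field otherwise (for example, output from an older agent).
func exitCodeOf(raw []byte) (uint32, error) {
	var ev struct {
		ProcessExit processExit `json:"process_exit"`
	}
	if err := json.Unmarshal(raw, &ev); err != nil {
		return 0, err
	}
	if ev.ProcessExit.ExitCode != nil {
		return *ev.ProcessExit.ExitCode, nil
	}
	return ev.ProcessExit.Status, nil
}

func main() {
	code, err := exitCodeOf([]byte(`{"process_exit":{"status":1,"exit_code":1}}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(code) // 1
}
```

With this pattern the redundancy is harmless to programs, which suggests the confusion is mostly a concern for people reading the raw JSON.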

Member

@mtardy mtardy left a comment

Yep indeed, not sure what to do with that, let's discuss in https://github.com/cilium/tetragon/pull/4392/files#r2576338233.

Another unrelated question: why is time also missing in your "before fix" output?

@AritraDey-Dev AritraDey-Dev marked this pull request as draft December 2, 2025 16:33
@AritraDey-Dev
Member Author

Another unrelated question: why is time also missing in your "before fix" output?

Oh, my bad — I pasted the process_exec event output instead of process_exit. I’ve updated the PR description to include the correct output.

@andrewstrohman
Contributor

I have a theory about what's going on. In proto3, fields holding their default values are omitted during serialization by default, to keep the output compact. Maybe adding EmitUnpopulated: true here will fix the problem.

@AritraDey-Dev
Member Author

I have a theory about what's going on. In proto3, fields holding their default values are omitted during serialization by default, to keep the output compact. Maybe adding EmitUnpopulated: true here will fix the problem.

Yes, that does fix the missing status: 0 issue. However, it causes every event to include all null/empty fields (e.g., pod: null, docker: "", cap: null, ns: null, process_credentials: null, etc.).
This results in a huge increase in verbosity for the logs.

@andrewstrohman
Contributor

Yes, that does fix the missing status: 0 issue. However, it causes every event to include all null/empty fields (e.g., pod: null, docker: "", cap: null, ns: null, process_credentials: null, etc.).
This results in a huge increase in verbosity for the logs.

I wonder if this is something that actually needs to be fixed. I think when a client (API user) deserializes into an object, the default value will be inserted into the object, even though it wasn't included in the protobuf message. So from a programmatic interface perspective, no information is lost.

As such, this problem seems limited to human interpretation. The human expects the default value to be present, because they don't know that default values are omitted in proto3.
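
The point about deserialization can be sketched with the standard encoding/json package, which behaves the same way for a missing key. The exitEvent type is a hypothetical, trimmed-down stand-in for a process_exit payload and is not the generated Tetragon API type.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// exitEvent (hypothetical) is a trimmed-down stand-in for a process_exit payload.
type exitEvent struct {
	Status uint32 `json:"status"`
	Time   string `json:"time"`
}

// decodeExit parses a payload; a missing "status" key leaves Status at its
// zero value, so the consumer still observes exit status 0.
func decodeExit(raw string) (exitEvent, error) {
	var ev exitEvent
	err := json.Unmarshal([]byte(raw), &ev)
	return ev, err
}

func main() {
	ev, err := decodeExit(`{"time":"2025-11-29T08:54:23.027344849Z"}`)
	if err != nil {
		panic(err)
	}
	fmt.Println(ev.Status) // 0
}
```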

If it's really important for it to be visible to the human, then we could manually add it after the Marshal() call (if the key doesn't exist, add the key with a value of 0). This approach would avoid the huge increase in output caused by the other default values.
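
A minimal sketch of that post-processing idea, assuming it operates on the already-marshaled JSON bytes of one event. The helper name ensureExitStatus is hypothetical and only the standard encoding/json package is used; where such a step would hook into Tetragon's encoder is not covered here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ensureExitStatus (hypothetical helper) adds "status": 0 to the process_exit
// object of a marshaled event when the key is absent. Events without a
// process_exit object, or with a status already set, are returned unchanged.
func ensureExitStatus(event []byte) ([]byte, error) {
	var top map[string]json.RawMessage
	if err := json.Unmarshal(event, &top); err != nil {
		return nil, err
	}
	rawExit, ok := top["process_exit"]
	if !ok {
		return event, nil
	}
	var exit map[string]json.RawMessage
	if err := json.Unmarshal(rawExit, &exit); err != nil {
		return nil, err
	}
	if exit == nil { // "process_exit": null
		return event, nil
	}
	if _, present := exit["status"]; present {
		return event, nil
	}
	exit["status"] = json.RawMessage("0")
	patched, err := json.Marshal(exit)
	if err != nil {
		return nil, err
	}
	top["process_exit"] = patched
	return json.Marshal(top)
}

func main() {
	out, err := ensureExitStatus([]byte(`{"process_exit":{"time":"t"},"node_name":"n"}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out)) // {"node_name":"n","process_exit":{"status":0,"time":"t"}}
}
```

One caveat of this sketch: re-marshaling through a map sorts the keys alphabetically, so a patched event no longer keeps the original field order, and every exit event pays for an extra decode and encode.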

However, I feel like this problem is not really specific to the status key -- it's more general as it pertains to all default values.

@mtardy
Member

mtardy commented Dec 4, 2025

However, I feel like this problem is not really specific to the status key -- it's more general as it pertains to all default values.

Yeah, that's why we use google.protobuf.UInt32Value for the UID, for example: it's for fields where we want the zero to be explicitly there, even though it's not strictly useful, as you mentioned.

The only valid approach is what you proposed here #4392 (comment). However, the real question is whether we care enough to do the whole deprecation dance. It's an enhancement, I think, but not a crucial one, so I'm not sure we need to bother 🤔. That would be a good topic to bring to the community meeting if you can join next Monday: https://isogo.to/tetragon-meeting-notes.

Development

Successfully merging this pull request may close these issues.

process:exit: ensure that the "status" process exit code field is always set

5 participants