We are aware that we need to improve monitoring of IMEX daemon health: #225 (comment)
We should probably tie it back to ComputeDomain status.
What follows is just one specific (and very basic!) example that I wish to track, so that we can improve measurably.
IMEX daemon startup may fail due to a misconfiguration. Example error:
No matching network interface found for any IP addresses in the node configuration file ...
See below for the full daemon log output:
```
$ kubectl logs -n nvidia-dra-driver-gpu nickelpie-test-compute-domain-template-ntt7v-m6hjz
IMEX Log initializing at: 3/21/2025 20:45:38.937
[Mar 21 2025 20:45:38] [INFO] [tid 39] IMEX version 570.00 is running with the following configuration options
[Mar 21 2025 20:45:38] [INFO] [tid 39] Logging level = 4
[Mar 21 2025 20:45:38] [INFO] [tid 39] Logging file name/path = /var/log/nvidia-imex.log
[Mar 21 2025 20:45:38] [INFO] [tid 39] Append to log file = 0
[Mar 21 2025 20:45:38] [INFO] [tid 39] Max Log file size = 1024 (MBs)
[Mar 21 2025 20:45:38] [INFO] [tid 39] Use Syslog file = 0
[Mar 21 2025 20:45:38] [INFO] [tid 39] IMEX Library communication bind interface =
[Mar 21 2025 20:45:38] [INFO] [tid 39] IMEX library communication bind port = 50000
[Mar 21 2025 20:45:38] [ERROR] [tid 39] No matching network interface found for any IP addresses in the node configuration file /etc/nvidia-imex/nodes_config.cfg.
[Mar 21 2025 20:45:38] [INFO] [tid 39] Cleaning up imex
```
Timestamps indicate that the crash happened almost immediately. Retries do happen (the container goes into CrashLoopBackOff), but the configuration error is permanent.
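The "crashed ~immediately" signal can be derived mechanically from the daemon log timestamps. A minimal sketch, assuming the bracketed timestamp prefix shown above; the threshold and the classification labels are illustrative, not existing driver behavior:

```python
from datetime import datetime

LOG_TS_FORMAT = "%b %d %Y %H:%M:%S"

def parse_ts(line: str) -> datetime:
    # The timestamp is the bracketed prefix, e.g. "[Mar 21 2025 20:45:38]".
    return datetime.strptime(line.split("]")[0].lstrip("["), LOG_TS_FORMAT)

# Sample first/last log lines (taken from the output above).
first = "[Mar 21 2025 20:45:38] [INFO] [tid 39] IMEX version 570.00 is running"
last = "[Mar 21 2025 20:45:38] [INFO] [tid 39] Cleaning up imex"

lifetime = (parse_ts(last) - parse_ts(first)).total_seconds()

# A daemon that exits within a few seconds of starting most likely hit a
# permanent error (such as the config error above), not a transient runtime
# failure. The threshold is an arbitrary assumption for illustration.
STARTUP_FAILURE_THRESHOLD_S = 5
verdict = "startup failure" if lifetime < STARTUP_FAILURE_THRESHOLD_S else "runtime crash"
```

In practice, the container restart count and last-termination state from the kubelet would be a more robust signal than log parsing, but the idea is the same: distinguish an immediate, permanent startup failure from a late crash.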
When that happens, we should make it easier for users to detect the problem.
A user can see a crash-looping pod in the nvidia-dra-driver-gpu namespace, but only for a short while.
At some point the pod disappears, and then it becomes difficult to see that something is wrong.
In that state, the IMEX daemon is not running, which is hard to notice if you don't know where to look. The associated ComputeDomain was transitioned to status: Ready and permanently stays there. Example:
```
$ kubectl get computedomain.resource.nvidia.com -o yaml
apiVersion: v1
items:
- apiVersion: resource.nvidia.com/v1beta1
  kind: ComputeDomain
  metadata:
    ...
    finalizers:
    - resource.nvidia.com/computeDomain
    generation: 1
    name: snip-template
    namespace: default
    resourceVersion: "21799278"
    uid: 37a29477-79a4-46d1-a7b9-c593d0512473
  spec:
    channel:
      resourceClaimTemplate:
        name: snip-channel
    numNodes: 1
  status:
    nodes:
    - cliqueID: snip-bd74-4c83-afcb-cfb75ebc9304.1
      ipAddress: snip
      name: snip
    status: Ready
kind: List
metadata:
  resourceVersion: ""
```
Arguably, we also want robust detection of 'late' daemon crashes (at runtime). Building this properly includes detecting early daemon crashes.
So, the goal of "catch any daemon crash, and update the ComputeDomain status" is also a good one.
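That goal could boil down to a status-mapping step in the controller's reconcile loop: observe the IMEX daemon container states and reflect them in the ComputeDomain status instead of leaving it at Ready. The shapes and names below are illustrative assumptions, not the driver's actual code:

```python
def compute_domain_status(container_statuses: list[dict]) -> str:
    """Map observed IMEX daemon container states to a ComputeDomain status.

    `container_statuses` mimics the Kubernetes containerStatuses shape, e.g.
    {"ready": True, "restartCount": 0, "waitingReason": None}. This is a
    sketch of the mapping logic only; it does not talk to the API server.
    """
    for cs in container_statuses:
        # A crash-looping daemon is a permanent problem: surface it.
        if cs.get("waitingReason") == "CrashLoopBackOff":
            return "NotReady"
        if not cs.get("ready", False):
            return "NotReady"
    # Note: a pod that has disappeared entirely would need a separate
    # check (expected vs. observed daemon pods per ComputeDomain node).
    return "Ready"
```

Keeping this mapping in the reconcile loop (rather than setting Ready once and never revisiting it) would cover both early startup failures and late runtime crashes with the same mechanism.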