IMEX daemon startup error not caught, CD marked as ready #289

@jgehrcke

Description

We are aware that we need to improve monitoring of IMEX daemon health: #225 (comment)

We should probably tie it back to ComputeDomain status.

What follows is just one specific (and very basic!) example that I wish to track so that we can improve measurably.

IMEX daemon startup may fail because of a misconfiguration. Example error:

No matching network interface found for any IP addresses in the node configuration file ...

See below for the full daemon log output:

$ kubectl logs -n nvidia-dra-driver-gpu        nickelpie-test-compute-domain-template-ntt7v-m6hjz
IMEX Log initializing at: 3/21/2025 20:45:38.937
[Mar 21 2025 20:45:38] [INFO] [tid 39] IMEX version 570.00 is running with the following configuration options
[Mar 21 2025 20:45:38] [INFO] [tid 39] Logging level = 4
[Mar 21 2025 20:45:38] [INFO] [tid 39] Logging file name/path = /var/log/nvidia-imex.log
[Mar 21 2025 20:45:38] [INFO] [tid 39] Append to log file = 0
[Mar 21 2025 20:45:38] [INFO] [tid 39] Max Log file size = 1024 (MBs)
[Mar 21 2025 20:45:38] [INFO] [tid 39] Use Syslog file = 0
[Mar 21 2025 20:45:38] [INFO] [tid 39] IMEX Library communication bind interface = 
[Mar 21 2025 20:45:38] [INFO] [tid 39] IMEX library communication bind port = 50000
[Mar 21 2025 20:45:38] [ERROR] [tid 39] No matching network interface found for any IP addresses in the node configuration file /etc/nvidia-imex/nodes_config.cfg.
[Mar 21 2025 20:45:38] [INFO] [tid 39] Cleaning up imex

Timestamps indicate that the crash happened ~immediately. Retries are happening (container in CrashLoopBackOff), but the configuration error is permanent.

When that happens, we should make it easier for users to detect the problem.
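
For illustration: while the failing pod still exists, the condition can be spotted programmatically. A minimal client-go sketch (an assumption of how such a check could look, not driver code) that flags crash-looping containers in the driver namespace:

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)

	// List pods in the driver namespace and flag containers stuck in
	// CrashLoopBackOff -- the state the IMEX daemon ends up in after
	// the permanent configuration error shown above.
	pods, err := clientset.CoreV1().Pods("nvidia-dra-driver-gpu").List(
		context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, pod := range pods.Items {
		for _, cs := range pod.Status.ContainerStatuses {
			if w := cs.State.Waiting; w != nil && w.Reason == "CrashLoopBackOff" {
				fmt.Printf("pod %s: container %s crash-looping (%d restarts)\n",
					pod.Name, cs.Name, cs.RestartCount)
			}
		}
	}
}

Essentially the same check is something the controller could perform before (and after) marking a ComputeDomain as Ready.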

A user can see a crash-looping pod in the nvidia-dra-driver-gpu namespace, but only for a short while.

At some point it disappears, and then it becomes difficult to see that something is wrong.

In that state, the IMEX daemon is not running, which is hard to notice if you don't know where to look. The associated ComputeDomain was transitioned to status: Ready and permanently stays there. Example:

$ kubectl get computedomain.resource.nvidia.com -o yaml
apiVersion: v1
items:
- apiVersion: resource.nvidia.com/v1beta1
  kind: ComputeDomain
  metadata:
 ...
    finalizers:
    - resource.nvidia.com/computeDomain
    generation: 1
    name: snip-template
    namespace: default
    resourceVersion: "21799278"
    uid: 37a29477-79a4-46d1-a7b9-c593d0512473
  spec:
    channel:
      resourceClaimTemplate:
        name: snip-channel
    numNodes: 1
  status:
    nodes:
    - cliqueID: snip-bd74-4c83-afcb-cfb75ebc9304.1
      ipAddress: snip
      name: snip
    status: Ready
kind: List
metadata:
  resourceVersion: ""
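
The other half of the inconsistency can also be read programmatically, for cross-checking against the actual daemon pods. A minimal sketch using the dynamic client (the resource plural computedomains is inferred from the kubectl invocation above):

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	dyn := dynamic.NewForConfigOrDie(config)

	// ComputeDomain is a CRD; read it through the dynamic client.
	gvr := schema.GroupVersionResource{
		Group:    "resource.nvidia.com",
		Version:  "v1beta1",
		Resource: "computedomains",
	}
	cds, err := dyn.Resource(gvr).Namespace(metav1.NamespaceAll).List(
		context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, cd := range cds.Items {
		// "Ready" printed here while no IMEX daemon pod is running in
		// nvidia-dra-driver-gpu is exactly the inconsistent state
		// described in this issue.
		status, _, _ := unstructured.NestedString(cd.Object, "status", "status")
		fmt.Printf("ComputeDomain %s/%s: status=%s\n",
			cd.GetNamespace(), cd.GetName(), status)
	}
}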

Arguably, we also want robust detection of 'late' daemon crashes (during runtime). Building that properly includes detecting early daemon crashes.

So, the goal of "catch any daemon crash, and update CD status" is also a good one.
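
A rough sketch of that direction, assuming a pod informer on the driver namespace: any daemon container that terminates non-zero, goes into CrashLoopBackOff, or disappears gets reported to a status-updating hook. The markComputeDomainNotReady hook is hypothetical, standing in for a status patch through the driver's generated ComputeDomain clientset:

package main

import (
	"log"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

// markComputeDomainNotReady is a hypothetical hook: the real controller
// would resolve which ComputeDomain the daemon pod belongs to and patch
// that object's status (e.g., out of Ready) via its generated clientset.
func markComputeDomainNotReady(podName, reason string) {
	log.Printf("daemon pod %s unhealthy (%s): ComputeDomain should leave Ready", podName, reason)
}

func main() {
	// Assumes this runs in-cluster, e.g., as part of the controller.
	config, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)

	// Watch only the namespace where the IMEX daemon pods run.
	factory := informers.NewSharedInformerFactoryWithOptions(
		clientset, 30*time.Second,
		informers.WithNamespace("nvidia-dra-driver-gpu"))

	factory.Core().V1().Pods().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(_, obj interface{}) {
			pod := obj.(*corev1.Pod)
			for _, cs := range pod.Status.ContainerStatuses {
				// Covers both the early startup crash shown above and
				// late crashes during runtime.
				if t := cs.State.Terminated; t != nil && t.ExitCode != 0 {
					markComputeDomainNotReady(pod.Name, "container terminated non-zero")
				}
				if w := cs.State.Waiting; w != nil && w.Reason == "CrashLoopBackOff" {
					markComputeDomainNotReady(pod.Name, "container in CrashLoopBackOff")
				}
			}
		},
		DeleteFunc: func(obj interface{}) {
			// The crash-looping pod eventually disappears; that must
			// not leave the ComputeDomain stuck in Ready.
			if pod, ok := obj.(*corev1.Pod); ok {
				markComputeDomainNotReady(pod.Name, "pod deleted")
			}
		},
	})

	stop := make(chan struct{})
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	select {} // keep watching
}

Watching pod status (rather than, say, probing the daemon directly) also covers the disappearing-pod case described above.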

Labels

kind/bug