
CD daemon teardown may leave behind stale IMEX daemon (Address already in use) #654

@jgehrcke

Description

(Most of this makes sense to JP only -- that's fine; it's a pragmatic summary. I have more notes and log material.)

Overnight failover test iterations, unattended. Post-factum debugging is limited to the collected logs (imperfect data).

Run 181 details:

  • fault injected: type 3 (regular-delete of worker pod 1)
  • five launcher container restarts observed:
[2025-10-05 00:34:39.114981] 2025-10-05T00:34:39.114Z [  22.1s] launcher container restarts seen: 1
[2025-10-05 00:34:52.781099] 2025-10-05T00:34:52.780Z [  35.8s] launcher container restarts seen: 2
[2025-10-05 00:35:22.251432] 2025-10-05T00:35:22.251Z [  65.3s] launcher container restarts seen: 3
[2025-10-05 00:36:02.361049] 2025-10-05T00:36:02.360Z [ 105.4s] launcher container restarts seen: 4
[2025-10-05 00:37:26.539058] 2025-10-05T00:37:26.538Z [ 189.6s] launcher container restarts seen: 5
[2025-10-05 00:39:16.978028] 2025-10-05T00:39:16.977Z [ 300.0s] global deadline reached (300 seconds), leave control loop

Perspective from launcher:

[pod/nvbandwidth-test-2-launcher-2wp4x/mpi-launcher] 2025-10-05T00:34:33.497188780Z     Sample 14: Multinode node 1 -> Multinode node 0: 805.77 GB/s
[pod/nvbandwidth-test-2-launcher-2wp4x/mpi-launcher] 2025-10-05T00:34:38.394097510Z ssh: Could not resolve hostname nvbandwidth-test-2-worker-1.nvbandwidth-test-2.default.svc: Name or service not known
[pod/nvbandwidth-test-2-launcher-2wp4x/mpi-launcher] 2025-10-05T00:34:51.466328516Z ssh: Could not resolve hostname nvbandwidth-test-2-worker-1.nvbandwidth-test-2.default.svc: Name or service not known
[pod/nvbandwidth-test-2-launcher-2wp4x/mpi-launcher] 2025-10-05T00:35:20.457040019Z ssh: Could not resolve hostname nvbandwidth-test-2-worker-1.nvbandwidth-test-2.default.svc: Name or service not known
[pod/nvbandwidth-test-2-launcher-2wp4x/mpi-launcher] 2025-10-05T00:36:01.453257074Z ssh: Could not resolve hostname nvbandwidth-test-2-worker-1.nvbandwidth-test-2.default.svc: Name or service not known
[pod/nvbandwidth-test-2-launcher-2wp4x/mpi-launcher] 2025-10-05T00:37:25.465313075Z ssh: Could not resolve hostname nvbandwidth-test-2-worker-1.nvbandwidth-test-2.default.svc: Name or service not known

That is:

  • launcher container 1 crashed during Sample 14: Multinode node 1 -> Multinode node 0
  • then it crash-looped slowly, because the replacement worker pod never came online (hence Could not resolve hostname nvbandwidth-test-2-worker-1.nvbandwidth-test-2.default.svc: Name or service not known)

CD daemon pods:

$ cat _cd-daemon_logs_181.log | grep -oe '-compute-domain-2-.*/compute-domain' | sort | uniq
-compute-domain-2-kvr66-hr8jf/compute-domain
-compute-domain-2-kvr66-hvh5w/compute-domain
-compute-domain-2-kvr66-p24k8/compute-domain

In the replacement CD daemon pod, the IMEX daemon crash-cycled on:

Error in bind for address '[::ffff:192.168.35.86]:50000': Address already in use

(gRPC server startup failure)

Permanent condition over those five minutes:

± cat _cd-daemon_logs_181.log | grep -e 'already in use' | wc -l
274

Timestamps confirm that this took place between 00:34:43 and 00:39:16, which is why the test timed out.

The replacement pod's CD daemon never came up healthy, i.e. the CD's node entry never became Ready and the workload was never released.

Crash-cycling was internal (via our process manager).

In the to-be-torn-down CD daemon (torn down in response to deleting the workload pod): maybe this pod never actually terminated properly -- it got stuck in process.go:158] Wait() for child, because the IMEX daemon process itself got stuck in its shutdown procedure (that's not unheard of).

We've improved logging and log collection since then -- let's see if this reproduces.

Seen twice. Not reproduced since. No definite conclusion.

Metadata

Labels
    kind/bug: categorizes issue or PR as related to a bug
    lifecycle/stale: denotes an issue or PR that has remained open with no activity and has become stale
    robustness: issue/pr: edge cases & fault tolerance
