(most of this makes sense to JP only -- fine; it's a pragmatic summary. I have more notes and log material).
Overnight failover test iterations, unattended. Post-factum debugging is limited to the collected logs (imperfect data).
Run 181 details:
- fault type: inject fault type 3: regular-delete worker pod 1
- five launcher restarts
[2025-10-05 00:34:39.114981] 2025-10-05T00:34:39.114Z [ 22.1s] launcher container restarts seen: 1
[2025-10-05 00:34:52.781099] 2025-10-05T00:34:52.780Z [ 35.8s] launcher container restarts seen: 2
[2025-10-05 00:35:22.251432] 2025-10-05T00:35:22.251Z [ 65.3s] launcher container restarts seen: 3
[2025-10-05 00:36:02.361049] 2025-10-05T00:36:02.360Z [ 105.4s] launcher container restarts seen: 4
[2025-10-05 00:37:26.539058] 2025-10-05T00:37:26.538Z [ 189.6s] launcher container restarts seen: 5
[2025-10-05 00:39:16.978028] 2025-10-05T00:39:16.977Z [ 300.0s] global deadline reached (300 seconds), leave control loop
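The spacing between these restarts roughly doubles, which would be broadly consistent with kubelet CrashLoopBackOff rather than an immediate tight loop. A quick sketch over the elapsed-seconds values logged above:

```shell
# Inter-restart intervals from the elapsed-seconds values in the control-loop log above.
# The roughly doubling spacing is what kubelet CrashLoopBackOff would produce.
printf '%s\n' 22.1 35.8 65.3 105.4 189.6 |
awk 'NR > 1 { printf "restart %d -> %d: %.1fs apart\n", NR - 1, NR, $1 - prev } { prev = $1 }'
```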
Perspective from launcher:
[pod/nvbandwidth-test-2-launcher-2wp4x/mpi-launcher] 2025-10-05T00:34:33.497188780Z Sample 14: Multinode node 1 -> Multinode node 0: 805.77 GB/s
[pod/nvbandwidth-test-2-launcher-2wp4x/mpi-launcher] 2025-10-05T00:34:38.394097510Z ssh: Could not resolve hostname nvbandwidth-test-2-worker-1.nvbandwidth-test-2.default.svc: Name or service not known
[pod/nvbandwidth-test-2-launcher-2wp4x/mpi-launcher] 2025-10-05T00:34:51.466328516Z ssh: Could not resolve hostname nvbandwidth-test-2-worker-1.nvbandwidth-test-2.default.svc: Name or service not known
[pod/nvbandwidth-test-2-launcher-2wp4x/mpi-launcher] 2025-10-05T00:35:20.457040019Z ssh: Could not resolve hostname nvbandwidth-test-2-worker-1.nvbandwidth-test-2.default.svc: Name or service not known
[pod/nvbandwidth-test-2-launcher-2wp4x/mpi-launcher] 2025-10-05T00:36:01.453257074Z ssh: Could not resolve hostname nvbandwidth-test-2-worker-1.nvbandwidth-test-2.default.svc: Name or service not known
[pod/nvbandwidth-test-2-launcher-2wp4x/mpi-launcher] 2025-10-05T00:37:25.465313075Z ssh: Could not resolve hostname nvbandwidth-test-2-worker-1.nvbandwidth-test-2.default.svc: Name or service not known
That is:
- launcher container 1 crashed during "Sample 14: Multinode node 1 -> Multinode node 0"
- then it crash-looped slowly, because the replacement worker pod never came online (hence "Could not resolve hostname nvbandwidth-test-2-worker-1.nvbandwidth-test-2.default.svc: Name or service not known")
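Each observed launcher restart trails an ssh resolution failure by about a second, supporting that reading. A small cross-check, using seconds-since-00:34:00 values copied from the two logs above:

```shell
# ssh-failure vs. observed-restart timestamps (seconds since 00:34:00, copied
# from the launcher log and the control-loop log above).
awk 'BEGIN {
  split("38.394 51.466 80.457 121.453 205.465", fail, " ")  # ssh failures
  split("39.114 52.780 82.251 122.360 206.538", seen, " ")  # restarts seen
  for (i = 1; i <= 5; i++)
    printf "restart %d seen %.2fs after ssh failure\n", i, seen[i] - fail[i]
}'
```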
CD daemon pods:
$ cat _cd-daemon_logs_181.log | grep -oe '-compute-domain-2-.*/compute-domain' | sort | uniq
-compute-domain-2-kvr66-hr8jf/compute-domain
-compute-domain-2-kvr66-hvh5w/compute-domain
-compute-domain-2-kvr66-p24k8/compute-domain
In the replacement CD daemon pod, the IMEX daemon crash-cycled on "Error in bind for address '[::ffff:192.168.35.86]:50000': Address already in use" (gRPC server startup failure).
Permanent condition over those five minutes:
$ cat _cd-daemon_logs_181.log | grep -e 'already in use' | wc -l
274
Timestamps confirm this persisted from 00:34:43 to 00:39:16, which is why the test timed out.
The replacement pod's CD daemon never started up healthy, i.e. the CD's node entry never became Ready, so the workload was never released.
Crash-cycling was internal (via our process manager).
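A back-of-the-envelope rate check supports that: 274 'already in use' lines over the 00:34:43 to 00:39:16 window (273 seconds) is about one bind failure per second, far too fast for kubelet backoff:

```shell
# ~1 bind failure per second over the 273s window (00:34:43 -> 00:39:16):
# too fast for kubelet CrashLoopBackOff, so the cycling was internal to the pod.
awk 'BEGIN { printf "%.2f failures/s\n", 274 / 273 }'
```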
In the to-be-torn-down CD daemon (in response to deleting the workload pod): maybe this pod never actually terminated properly -- got stuck in "process.go:158] Wait() for child", because the IMEX daemon process itself got stuck in its shutdown procedure (that's not unheard of).
We've improved logging and log collection since then -- let's see if this reproduces.
Seen twice. Not reproduced since. No definite conclusion.