(most of this makes sense to JP only -- fine; it's a pragmatic summary. I have more notes and log material).
Overnight failover test iterations, unattended. Post-factum debugging is limited to the collected logs (imperfect data).
Run 181 details:
- fault type: inject fault type 3: regular-delete worker pod 1
- five launcher restarts
[2025-10-05 00:34:39.114981] 2025-10-05T00:34:39.114Z [ 22.1s] launcher container restarts seen: 1
[2025-10-05 00:34:52.781099] 2025-10-05T00:34:52.780Z [ 35.8s] launcher container restarts seen: 2
[2025-10-05 00:35:22.251432] 2025-10-05T00:35:22.251Z [ 65.3s] launcher container restarts seen: 3
[2025-10-05 00:36:02.361049] 2025-10-05T00:36:02.360Z [ 105.4s] launcher container restarts seen: 4
[2025-10-05 00:37:26.539058] 2025-10-05T00:37:26.538Z [ 189.6s] launcher container restarts seen: 5
[2025-10-05 00:39:16.978028] 2025-10-05T00:39:16.977Z [ 300.0s] global deadline reached (300 seconds), leave control loop
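The spacing between these restarts roughly doubles, which would be broadly consistent with kubelet CrashLoopBackOff rather than an immediate tight loop. A quick sketch over the elapsed-seconds values logged above:

```shell
# Inter-restart intervals from the elapsed-seconds values in the control-loop log above.
# The roughly doubling spacing is what kubelet CrashLoopBackOff would produce.
printf '%s\n' 22.1 35.8 65.3 105.4 189.6 |
awk 'NR > 1 { printf "restart %d -> %d: %.1fs apart\n", NR - 1, NR, $1 - prev } { prev = $1 }'
```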
Perspective from launcher:
[pod/nvbandwidth-test-2-launcher-2wp4x/mpi-launcher] 2025-10-05T00:34:33.497188780Z Sample 14: Multinode node 1 -> Multinode node 0: 805.77 GB/s
[pod/nvbandwidth-test-2-launcher-2wp4x/mpi-launcher] 2025-10-05T00:34:38.394097510Z ssh: Could not resolve hostname nvbandwidth-test-2-worker-1.nvbandwidth-test-2.default.svc: Name or service not known
[pod/nvbandwidth-test-2-launcher-2wp4x/mpi-launcher] 2025-10-05T00:34:51.466328516Z ssh: Could not resolve hostname nvbandwidth-test-2-worker-1.nvbandwidth-test-2.default.svc: Name or service not known
[pod/nvbandwidth-test-2-launcher-2wp4x/mpi-launcher] 2025-10-05T00:35:20.457040019Z ssh: Could not resolve hostname nvbandwidth-test-2-worker-1.nvbandwidth-test-2.default.svc: Name or service not known
[pod/nvbandwidth-test-2-launcher-2wp4x/mpi-launcher] 2025-10-05T00:36:01.453257074Z ssh: Could not resolve hostname nvbandwidth-test-2-worker-1.nvbandwidth-test-2.default.svc: Name or service not known
[pod/nvbandwidth-test-2-launcher-2wp4x/mpi-launcher] 2025-10-05T00:37:25.465313075Z ssh: Could not resolve hostname nvbandwidth-test-2-worker-1.nvbandwidth-test-2.default.svc: Name or service not known
That is:
- launcher container 1 crashed during "Sample 14: Multinode node 1 -> Multinode node 0"
- then it crash-looped slowly, because the replacement worker pod never came online (hence "Could not resolve hostname nvbandwidth-test-2-worker-1.nvbandwidth-test-2.default.svc: Name or service not known")
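Each observed launcher restart trails an ssh resolution failure by about a second, supporting that reading. A small cross-check, using seconds-since-00:34:00 values copied from the two logs above:

```shell
# ssh-failure vs. observed-restart timestamps (seconds since 00:34:00, copied
# from the launcher log and the control-loop log above).
awk 'BEGIN {
  split("38.394 51.466 80.457 121.453 205.465", fail, " ")  # ssh failures
  split("39.114 52.780 82.251 122.360 206.538", seen, " ")  # restarts seen
  for (i = 1; i <= 5; i++)
    printf "restart %d seen %.2fs after ssh failure\n", i, seen[i] - fail[i]
}'
```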
CD daemon pods:
$ cat _cd-daemon_logs_181.log | grep -oe '-compute-domain-2-.*/compute-domain' | sort | uniq
-compute-domain-2-kvr66-hr8jf/compute-domain
-compute-domain-2-kvr66-hvh5w/compute-domain
-compute-domain-2-kvr66-p24k8/compute-domain
In the replacement CD daemon pod, the IMEX daemon crash-cycled on "Error in bind for address '[::ffff:192.168.35.86]:50000': Address already in use" (gRPC server startup failure).
Permanent condition over those five minutes:
$ cat _cd-daemon_logs_181.log | grep -e 'already in use' | wc -l
274
Timestamps confirm this persisted from 00:34:43 to 00:39:16, which is why the test timed out.
The replacement pod's CD daemon never started up healthy, i.e. the CD's node entry never became Ready, so the workload was never released.
Crash-cycling was internal (via our process manager).
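A back-of-the-envelope rate check supports that: 274 'already in use' lines over the 00:34:43 to 00:39:16 window (273 seconds) is about one bind failure per second, far too fast for kubelet backoff:

```shell
# ~1 bind failure per second over the 273s window (00:34:43 -> 00:39:16):
# too fast for kubelet CrashLoopBackOff, so the cycling was internal to the pod.
awk 'BEGIN { printf "%.2f failures/s\n", 274 / 273 }'
```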
In the to-be-torn-down CD daemon (in response to deleting the workload pod): maybe this pod never actually terminated properly -- got stuck in "process.go:158] Wait() for child", because the IMEX daemon process itself got stuck in its shutdown procedure (that's not unheard of).
We've improved logging and log collection since then -- let's see if this reproduces.
Seen twice. Not reproduced since. No definite conclusion.