Open
Description
Package I'm using:
bottlerocket-core-kit v10.7.0
What I expected to happen:
Successful tests for nvidia-smoke-test and efa-test using aws-k8s-tester repository
What actually happened:
In the efa-test, we see the following error when using the 10.7.0 core-kit with an NVIDIA instance:
Events:
Type     Reason     Age                   From               Message
----     ------     ----                  ----               -------
Normal   Scheduled  17m                   default-scheduler  Successfully assigned default/multi-node-all-reduce-perf-launcher-245gt to ip-192-168-180-232.us-west-2.compute.internal
Normal   Pulled     16m                   kubelet            Successfully pulled image "xxxxxx.dkr.ecr.us-west-2.amazonaws.com/nccl-test:nvidia-v1" in 1m25.191s (1m25.191s including waiting)
Normal   Pulled     15m                   kubelet            Successfully pulled image "xxxxxx.dkr.ecr.us-west-2.amazonaws.com/nccl-test:nvidia-v1" in 173ms (173ms including waiting)
Normal   Pulled     15m                   kubelet            Successfully pulled image "xxxxxx.dkr.ecr.us-west-2.amazonaws.com/nccl-test:nvidia-v1" in 144ms (144ms including waiting)
Normal   Created    14m (x4 over 16m)     kubelet            Created container: nccl-test-launcher
Normal   Started    14m (x4 over 16m)     kubelet            Started container nccl-test-launcher
Normal   Pulled     14m                   kubelet            Successfully pulled image "xxxxxx.dkr.ecr.us-west-2.amazonaws.com/nccl-test:nvidia-v1" in 181ms (181ms including waiting)
Normal   Pulling    13m (x5 over 17m)     kubelet            Pulling image "xxxxxx.dkr.ecr.us-west-2.amazonaws.com/nccl-test:nvidia-v1"
Normal   Pulled     13m                   kubelet            Successfully pulled image "xxxxxx.dkr.ecr.us-west-2.amazonaws.com/nccl-test:nvidia-v1" in 164ms (164ms including waiting)
Warning  BackOff    2m27s (x60 over 15m)  kubelet            Back-off restarting failed container nccl-test-launcher in pod multi-node-all-reduce-perf-launcher-245gt_default(2d87540c-f218-478c-a00a-9dc2e5c16807)
How to reproduce the problem:
Build an aws-k8s-nvidia AMI, create a cluster, and run the smoke test.