bottlerocket-core-kit v10.7.0: issue with running workloads in nvidia variants #685

**Package I'm using:**
bottlerocket-core-kit v10.7.0


**What I expected to happen:**
The `nvidia-smoke-test` and `efa-test` tests from the `aws-k8s-tester` repository pass.


**What actually happened:**
In the `efa-test`, the `nccl-test-launcher` container keeps failing and ends up in a restart back-off when using the 10.7.0 core-kit with an NVIDIA instance:

```
Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  17m                   default-scheduler  Successfully assigned default/multi-node-all-reduce-perf-launcher-245gt to ip-192-168-180-232.us-west-2.compute.internal
  Normal   Pulled     16m                   kubelet            Successfully pulled image "xxxxxx.dkr.ecr.us-west-2.amazonaws.com/nccl-test:nvidia-v1" in 1m25.191s (1m25.191s including waiting)
  Normal   Pulled     15m                   kubelet            Successfully pulled image "xxxxxx.dkr.ecr.us-west-2.amazonaws.com/nccl-test:nvidia-v1" in 173ms (173ms including waiting)
  Normal   Pulled     15m                   kubelet            Successfully pulled image "xxxxxx.dkr.ecr.us-west-2.amazonaws.com/nccl-test:nvidia-v1" in 144ms (144ms including waiting)
  Normal   Created    14m (x4 over 16m)     kubelet            Created container: nccl-test-launcher
  Normal   Started    14m (x4 over 16m)     kubelet            Started container nccl-test-launcher
  Normal   Pulled     14m                   kubelet            Successfully pulled image "xxxxxx.dkr.ecr.us-west-2.amazonaws.com/nccl-test:nvidia-v1" in 181ms (181ms including waiting)
  Normal   Pulling    13m (x5 over 17m)     kubelet            Pulling image "xxxxxx.dkr.ecr.us-west-2.amazonaws.com/nccl-test:nvidia-v1"
  Normal   Pulled     13m                   kubelet            Successfully pulled image "xxxxxx.dkr.ecr.us-west-2.amazonaws.com/nccl-test:nvidia-v1" in 164ms (164ms including waiting)
  Warning  BackOff    2m27s (x60 over 15m)  kubelet            Back-off restarting failed container nccl-test-launcher in pod multi-node-all-reduce-perf-launcher-245gt_default(2d87540c-f218-478c-a00a-9dc2e5c16807)
```
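
The events only show the back-off, not why the container exits. A minimal sketch for collecting more detail, assuming `kubectl` access to the cluster; the `crashloop_diag` helper and the `KUBECTL` override are hypothetical names for illustration and are not part of `aws-k8s-tester`:

```shell
#!/usr/bin/env bash
# Sketch: collect details on a crash-looping container.
# KUBECTL may be overridden (e.g. KUBECTL=echo for a dry run); defaults to kubectl.
crashloop_diag() {
  local ns="$1" pod="$2" container="$3"
  local kubectl="${KUBECTL:-kubectl}"

  # Logs from the previous (crashed) instance of the container.
  "$kubectl" logs "$pod" -n "$ns" -c "$container" --previous

  # Last termination state (exit code, reason) of each container in the pod.
  "$kubectl" get pod "$pod" -n "$ns" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{": "}{.lastState.terminated}{"\n"}{end}'
}
```

For the pod above this could be invoked as `crashloop_diag default multi-node-all-reduce-perf-launcher-245gt nccl-test-launcher`.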


**How to reproduce the problem:**
Build an `aws-k8s-nvidia` AMI, create a cluster, and run the smoke test.
