bottlerocket-core-kit v10.7.0: issue with running workloads in nvidia variants #685

@piyush-jena

Description

Package I'm using:
bottlerocket-core-kit v10.7.0

What I expected to happen:
The nvidia-smoke-test and efa-test from the aws-k8s-tester repository pass.

What actually happened:
In the efa-test, using the 10.7.0 core-kit on an NVIDIA instance, the nccl-test-launcher container keeps failing and is restarted under back-off. Pod events:

Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  17m                   default-scheduler  Successfully assigned default/multi-node-all-reduce-perf-launcher-245gt to ip-192-168-180-232.us-west-2.compute.internal
  Normal   Pulled     16m                   kubelet            Successfully pulled image "xxxxxx.dkr.ecr.us-west-2.amazonaws.com/nccl-test:nvidia-v1" in 1m25.191s (1m25.191s including waiting)
  Normal   Pulled     15m                   kubelet            Successfully pulled image "xxxxxx.dkr.ecr.us-west-2.amazonaws.com/nccl-test:nvidia-v1" in 173ms (173ms including waiting)
  Normal   Pulled     15m                   kubelet            Successfully pulled image "xxxxxx.dkr.ecr.us-west-2.amazonaws.com/nccl-test:nvidia-v1" in 144ms (144ms including waiting)
  Normal   Created    14m (x4 over 16m)     kubelet            Created container: nccl-test-launcher
  Normal   Started    14m (x4 over 16m)     kubelet            Started container nccl-test-launcher
  Normal   Pulled     14m                   kubelet            Successfully pulled image "xxxxxx.dkr.ecr.us-west-2.amazonaws.com/nccl-test:nvidia-v1" in 181ms (181ms including waiting)
  Normal   Pulling    13m (x5 over 17m)     kubelet            Pulling image "xxxxxx.dkr.ecr.us-west-2.amazonaws.com/nccl-test:nvidia-v1"
  Normal   Pulled     13m                   kubelet            Successfully pulled image "xxxxxx.dkr.ecr.us-west-2.amazonaws.com/nccl-test:nvidia-v1" in 164ms (164ms including waiting)
  Warning  BackOff    2m27s (x60 over 15m)  kubelet            Back-off restarting failed container nccl-test-launcher in pod multi-node-all-reduce-perf-launcher-245gt_default(2d87540c-f218-478c-a00a-9dc2e5c16807)
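
The events above only show the back-off, not why nccl-test-launcher exits. As a minimal sketch for gathering more detail, assuming `kubectl` is configured for the affected cluster, the following Python builds the `kubectl describe pod` and `kubectl logs --previous` commands for the pod named in the events. The helper names `describe_cmd` and `previous_logs_cmd` and the `tail` default are hypothetical, chosen only for illustration.

```python
import shlex

# Pod, container, and namespace are copied from the events above
# ("default/multi-node-all-reduce-perf-launcher-245gt").
POD = "multi-node-all-reduce-perf-launcher-245gt"
CONTAINER = "nccl-test-launcher"
NAMESPACE = "default"


def describe_cmd(pod, namespace="default"):
    """Build the command that shows pod status, last state, and events."""
    return ["kubectl", "describe", "pod", pod, "--namespace", namespace]


def previous_logs_cmd(pod, container, namespace="default", tail=200):
    """Build the command that prints logs of the last terminated container run."""
    return [
        "kubectl", "logs", pod,
        "--namespace", namespace,
        "--container", container,
        "--previous",        # read the previous (crashed) instance, not the current one
        f"--tail={tail}",    # hypothetical default; limits output to the last lines
    ]


if __name__ == "__main__":
    # Print the commands so they can be reviewed before being run by hand.
    for cmd in (describe_cmd(POD, NAMESPACE),
                previous_logs_cmd(POD, CONTAINER, NAMESPACE)):
        print(shlex.join(cmd))
```

Running the printed commands and attaching the output (exit code, last state, and launcher logs) would likely make the failure easier to triage.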

How to reproduce the problem:
Build an aws-k8s-nvidia AMI, create a cluster, and run the smoke test.
