
DeepSpeed Installation Fails During Docker Build (NVML Initialization Issue) #6945

Open
@asdfry

Description

Hello,
I encountered an issue while building a Docker image for deep learning model training, specifically when attempting to install DeepSpeed.

Issue
When building the Docker image, the DeepSpeed installation step fails, and the build log shows a warning that NVML cannot be initialized.
However, if I create a container from the same image and install DeepSpeed inside the container, the installation works without any issues.
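
For anyone trying to reproduce this, here is a minimal sketch of a check that could be run both in a `RUN` step during `docker build` and inside a running container, to see whether NVML is reachable in each environment. It assumes the difference between the two cases is that the NVIDIA driver library is not exposed at build time, which I have not confirmed. The helper name `probe_nvml` is hypothetical; `libnvidia-ml.so.1` is the usual name of the NVML shared library on Linux.

```python
import ctypes
from typing import Tuple


def probe_nvml(lib_name: str = "libnvidia-ml.so.1") -> Tuple[bool, str]:
    """Try to load NVML and initialize it; return (ok, detail).

    Hypothetical helper for diagnosis only. It is not part of DeepSpeed
    or PyTorch.
    """
    try:
        # Fails with OSError when the driver library is not visible,
        # which is the assumed situation during `docker build`.
        nvml = ctypes.CDLL(lib_name)
    except OSError as exc:
        return False, f"could not load {lib_name}: {exc}"

    try:
        # nvmlInit_v2 returns 0 (NVML_SUCCESS) when initialization works.
        rc = nvml.nvmlInit_v2()
    except AttributeError:
        return False, "nvmlInit_v2 symbol not found in library"

    if rc != 0:
        return False, f"nvmlInit_v2 returned error code {rc}"

    nvml.nvmlShutdown()
    return True, "NVML initialized"


if __name__ == "__main__":
    ok, detail = probe_nvml()
    print(f"NVML available: {ok} ({detail})")
```

If this reports that NVML is unavailable during the build but available inside the container, that would at least narrow the failure down to the build-time environment rather than the image contents.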

Environment
Base Image: nvcr.io/nvidia/pytorch:23.01-py3
DeepSpeed Version: 0.16.2

Build Log
docker_build.log

Additional Context
The problem does not occur with the newer base image nvcr.io/nvidia/pytorch:24.05-py3.
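
To help compare the two base images, a small sketch like the following could list the installed versions of the packages most likely to matter. The package list is my guess at what is relevant, and the helper name `installed_versions` is hypothetical.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each package name to its installed version, or None if absent."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Assumed list of packages worth comparing between the two images.
    for name, version in installed_versions(
        ["torch", "deepspeed", "setuptools", "pip"]
    ).items():
        print(f"{name}: {version or 'not installed'}")
```

Running it in both `nvcr.io/nvidia/pytorch:23.01-py3` and `nvcr.io/nvidia/pytorch:24.05-py3` would show which versions differ between the failing and working images.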

Thank you.
