DeepSpeed Installation Fails During Docker Build (NVML Initialization Issue) #6945
Open
Description
Hello,
I encountered an issue while building a Docker image for deep learning model training, specifically when attempting to install DeepSpeed.
Issue
When building the Docker image, the DeepSpeed installation step fails, and the build output includes a warning that NVML could not be initialized.
However, if I create a container from the same image and install DeepSpeed inside the container, the installation works without any issues.
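To help narrow down the difference between the two situations, a small check like the one below could be run both from a RUN step during the build and inside a started container. This is only a sketch. It assumes the failure is related to the GPU not being visible at build time, which is an assumption rather than something established here. The function name `gpu_visibility_report` and its `dev_dir` parameter are hypothetical, and the script uses only the Python standard library.

```python
import os
import shutil


def gpu_visibility_report(dev_dir="/dev"):
    """Describe which NVIDIA-related artifacts are visible in this environment.

    Assumption: on Linux, a visible GPU usually shows up as device nodes
    whose names start with "nvidia" (for example nvidia0, nvidiactl).
    """
    try:
        entries = os.listdir(dev_dir)
    except OSError:
        # Directory missing or unreadable: treat as "no device nodes found".
        entries = []
    device_nodes = sorted(e for e in entries if e.startswith("nvidia"))
    return {
        # True if an nvidia-smi executable can be found on PATH.
        "nvidia_smi_on_path": shutil.which("nvidia-smi") is not None,
        # NVIDIA device nodes found under dev_dir (empty list if none).
        "device_nodes": device_nodes,
        # Value of CUDA_VISIBLE_DEVICES, or None if it is unset.
        "cuda_visible_devices": os.environ.get("CUDA_VISIBLE_DEVICES"),
    }


if __name__ == "__main__":
    for key, value in gpu_visibility_report().items():
        print(f"{key}: {value}")
```

If the report lists device nodes inside the running container but none during the build, that would be consistent with NVML being unable to initialize only at build time.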
Environment
Base Image: nvcr.io/nvidia/pytorch:23.01-py3
DeepSpeed Version: 0.16.2
Build Log
docker_build.log
Additional Context
The problem does not occur with the newer base image nvcr.io/nvidia/pytorch:24.05-py3.
Thank you.