Update the dockerfile base image to cuda-dl-base #1248

Steboss · 2025-01-14T11:42:27Z

Update the base Docker image so that we can use cuda-dl-base from nvcr.io.

.github/container/Dockerfile.base

Steboss · 2025-01-14T17:00:58Z

I ran a check in the cuda-dl-base image, and I can see that:

- for NCCL we need to create symlinks for the include and lib directories, so they're mapped into /opt/nvidia/nccl
- the same applies to cuDNN
- we can safely remove install-ofed.sh
- we can also remove the Amazon EFA support

For the symlink, we'd just need this part of the install-nccl.sh script (and the counterpart in install-cudnn.sh script):

arch=$(uname -m)-linux-gnu
for nccl_file in $(dpkg -L libnccl2 libnccl-dev | sort -u); do
  # Real files and symlinks are linked into $prefix
  if [[ -f "${nccl_file}" || -h "${nccl_file}" ]]; then
    # Replace /usr with $prefix and remove arch-specific lib directories
    nosysprefix="${nccl_file#"/usr/"}"
    noarchlib="${nosysprefix/#"lib/${arch}"/lib}"
    link_name="${prefix}/${noarchlib}"
    link_dir=$(dirname "${link_name}")
    mkdir -p "${link_dir}"
    ln -s "${nccl_file}" "${link_name}"
  else
    echo "Skipping ${nccl_file}"
  fi
done

@DwarKapex does it sound right to you?

Steboss · 2025-01-15T15:05:29Z

@DwarKapex

- Updated the base Dockerfile to use the cuda-dl-base image
- To avoid conflicts and a re-install of NCCL and cuDNN, I've modified install-cudnn.sh and install-nccl.sh
- In both scripts I added a symlink step, so that the resources are present in /opt/nvidia/{package}
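A minimal sketch of what a shared symlink step could look like, assuming the same path rewriting as the install-nccl.sh excerpt quoted earlier in this thread. `link_into_prefix` and its `sysroot` argument are hypothetical names introduced for illustration; they are not taken from the actual scripts.

```shell
# Hypothetical helper (not part of the PR): read absolute paths on stdin and
# symlink each one under <prefix>, dropping the system root and lib/<arch>.
# The body runs in a subshell, so its variables do not leak to the caller.
link_into_prefix() (
  prefix="$1"
  arch="${2:-$(uname -m)-linux-gnu}"
  sysroot="${3:-/usr}"
  while IFS= read -r file; do
    # Only real files and symlinks are linked; directories are skipped.
    if [ -f "$file" ] || [ -h "$file" ]; then
      rel=${file#"$sysroot"/}            # strip the system root, e.g. /usr/
      case "$rel" in
        "lib/$arch/"*)                   # lib/<arch>/foo -> lib/foo
          rest=${rel#"lib/$arch/"}
          rel="lib/$rest"
          ;;
      esac
      link_name="$prefix/$rel"
      mkdir -p "$(dirname "$link_name")"
      ln -sf "$file" "$link_name"        # -f keeps reruns idempotent
    else
      echo "Skipping $file" >&2
    fi
  done
)
```

For NCCL this could be driven with `dpkg -L libnccl2 libnccl-dev | sort -u | link_into_prefix /opt/nvidia/nccl`; the cuDNN package names would need to be filled in from the image.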

.github/container/Dockerfile.base

.github/container/install-cudnn.sh

.github/container/install-nccl.sh

olupton · 2025-01-16T08:45:47Z

The nsys-jax test failures are because the 24.12 cuda-dl-base includes Nsight Systems 2024.7, whereas we currently install 2024.6 because of some issues with 2024.7 (#1176 is the - pending - attempt to move to 2024.7). A possible workaround would be to use the 24.11 cuda-dl-base for the moment.
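As a rough illustration of guarding against this kind of mismatch, the sketch below extracts the release from a version banner and compares it with the expected one. It assumes the banner printed by `nsys --version` contains the word `version` followed by `YYYY.N`; `nsys_release` and `nsys_release_is` are hypothetical helper names, not something the PR adds.

```shell
# Hypothetical helpers: parse a version banner from stdin.
# Assumption: the banner looks like "... version 2024.6.1.90-..." .
nsys_release() {
  # Print only the leading "YYYY.N" part of the first matching line.
  sed -n 's/.*version \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}

# Succeeds only if the banner on stdin reports the expected release.
nsys_release_is() (
  expected="$1"
  actual="$(nsys_release)"
  [ "$actual" = "$expected" ]
)
```

Inside a container this could be used as `nsys --version | nsys_release_is 2024.6 || echo "unexpected Nsight Systems release" >&2`.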

.github/container/Dockerfile.base

.github/container/install-cudnn.sh

.github/container/.DS_Store

olupton

LGTM, only one minor nit left.
@yhtang and/or @chaserileyroberts to review the GCP networking relevant parts.

.github/container/Dockerfile.base

yhtang

What are the symlink-xyz scripts modeled after? How do other DLFW containers accommodate the DL core container?

Steboss · 2025-01-23T14:10:16Z

@yhtang
The symlink-xyz scripts are meant to create symlinks for NCCL and cuDNN. These packages are already installed in the cuda-dl-base image, but they're not linked into the /opt/nvidia/ folder, as we were doing before.
This is all a JAX/XLA thing, which is why we might not need this in the other DLFW containers.
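One way to sanity-check the result of such symlink scripts is to look for links whose targets are missing. This is only a sketch: `broken_links` is a hypothetical helper, and which directory to scan is up to the caller.

```shell
# Hypothetical helper: print every symlink under a directory whose target
# does not exist.
broken_links() {
  # -type l selects symlinks; [ -e ] follows the link, so it fails when
  # the target is missing.
  find "$1" -type l | while IFS= read -r link; do
    [ -e "$link" ] || printf '%s\n' "$link"
  done
}
```

Running `broken_links /opt/nvidia` in the built image should print nothing if every link resolves.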

.github/container/Dockerfile.base

Steboss · 2025-01-24T19:39:19Z

@olupton @yhtang
I think it may be wise to add an additional step in the CI that checks for the very latest cuda-dl-base image, so we can avoid updating this manually and have an automatic system that does it for us.

yhtang · 2025-01-27T08:08:21Z

> @olupton @yhtang
> I think it may be wise to add an additional step in the CI that checks for the very latest cuda-dl-base image, so we can avoid updating this manually and have an automatic system that does it for us.

Bumping the CUDA base image is usually not a light job, as it may break things. Hence why we always update it via a PR. The DL base image is only updated once a month so IMHO we can live with it.
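If a check were ever added, it could stay report-only and leave the bump itself to a PR. The sketch below picks the newest release from a list of tags on stdin. It assumes tags start with a `YY.MM` release (as in the 24.11 and 24.12 images mentioned above); `newest_release` is a hypothetical name, and how the tag list is fetched from the registry is deliberately left out.

```shell
# Hypothetical helper: read image tags on stdin, print the newest YY.MM
# release. Tags that do not start with YY.MM (e.g. "latest") are ignored.
newest_release() {
  sed -n 's/^\([0-9][0-9]\.[0-9][0-9]\).*/\1/p' |
    sort -t . -k1,1n -k2,2n |   # numeric sort on year, then month
    tail -n 1
}
```

A CI step could compare the output with the release currently pinned in Dockerfile.base and only print a notice when they differ.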

- Remove some infrastructure missed in #1296
- Fix metric calculation/check for the remaining MaxText tests
- Remove the MJX pipeline added in #497, which had been failing for months
- Update the README à la #1143 and #1198 to include dates of the first nightlies to include the base container bumps of #1248, #1276 and #1320
- Add a missing test dependency for Levanter unit tests
- Remove some more T5X tests, leaving only a ViT one, and try to fix its metric calculation/check

Update the dockerfile base image so that we can support NCCL

dedf8c2

Steboss requested review from olupton and yhtang January 14, 2025 11:42

olupton requested changes Jan 14, 2025

.github/container/Dockerfile.base

remove no needed installs

ad2493a

STEFANO BOSISIO and others added 4 commits January 15, 2025 09:36

fix installs and symlinks

d95d51c

forgot to take out tcpx

d341fcd

fix way to link files, so cudnn.h is visible

62000ec

Merge branch 'main' into sbosisio/cuda-dl-base

62f6d65

Steboss changed the title ~~Update the dockerfile base image so that we can support NCCL~~ Update the dockerfile base image to cuda-dl-base Jan 15, 2025

olupton reviewed Jan 15, 2025

.github/container/Dockerfile.base

.github/container/Dockerfile.base

.github/container/install-cudnn.sh

.github/container/install-nccl.sh

STEFANO BOSISIO added 2 commits January 15, 2025 16:05

Address @olupton comments

8b52c58

Merge branch 'sbosisio/cuda-dl-base' of github.com:NVIDIA/JAX-Toolbox…

6ff8cd4

… into sbosisio/cuda-dl-base

STEFANO BOSISIO and others added 2 commits January 16, 2025 09:03

follow @olupton suggestion for nsys and check results

bd066f1

Merge branch 'main' into sbosisio/cuda-dl-base

564ec47

olupton requested a review from DwarKapex January 16, 2025 14:14

DwarKapex reviewed Jan 16, 2025

.github/container/Dockerfile.base

DwarKapex reviewed Jan 16, 2025

.github/container/Dockerfile.base

DwarKapex reviewed Jan 16, 2025

.github/container/Dockerfile.base

DwarKapex reviewed Jan 16, 2025

.github/container/Dockerfile.base

fix newline

1154ccb

nouiz previously approved these changes Jan 20, 2025

.github/container/Dockerfile.base

olupton reviewed Jan 21, 2025

add mods in container

bfb588d

Steboss dismissed nouiz’s stale review via bfb588d January 21, 2025 11:45

Address Olli's comments

65b5dd8

olupton reviewed Jan 21, 2025

.github/container/.DS_Store

remove DS_store

5d2e464

olupton previously approved these changes Jan 21, 2025

.github/container/Dockerfile.base

remove check-shm

b058b15

Steboss dismissed olupton’s stale review via b058b15 January 21, 2025 16:01

olupton previously approved these changes Jan 21, 2025

yhtang requested changes Jan 22, 2025

gpupuck reviewed Jan 24, 2025

.github/container/Dockerfile.base

update to the latest image

235ffac

Steboss dismissed olupton’s stale review via 235ffac January 24, 2025 19:38

yhtang self-requested a review January 27, 2025 08:09

yhtang previously approved these changes Jan 27, 2025

reset to 24.11 for nsight

b2b5bcd

Steboss dismissed yhtang’s stale review via b2b5bcd January 27, 2025 10:23

Merge branch 'main' into sbosisio/cuda-dl-base

ce1f3f3

olupton approved these changes Jan 27, 2025

Steboss merged commit 1a13844 into main Jan 27, 2025
89 of 115 checks passed

Steboss deleted the sbosisio/cuda-dl-base branch January 27, 2025 17:15

olupton mentioned this pull request Feb 3, 2025

Run NCCL tests on the JAX-specific base container #1284

Merged

olupton mentioned this pull request Mar 25, 2025

CI: cleanup old files and spurious failures #1359

Merged

olupton mentioned this pull request Jun 3, 2025

Dockerfile.base: install newer nsys #1488

Merged

olupton added a commit that referenced this pull request Jun 3, 2025

Dockerfile.base: install newer nsys (#1488)

563231e

Installation script revived from #1248. This fixes `nsys-jax` test failures introduced by #1469.

Steboss added a commit that referenced this pull request Jun 6, 2025

Run NCCL tests on the JAX-specific base container (#1284)

4a8f06a

The prior setup pre-dated #1248, now things can be simpler.

Co-authored-by: Steboss

7 participants
