Skip to content

Conversation

@Steboss
Copy link
Contributor

@Steboss Steboss commented Jan 14, 2025

Update the base docker image, so we can use cuda-dl-base from nvcri.

@Steboss Steboss requested review from olupton and yhtang January 14, 2025 11:42
@Steboss
Copy link
Contributor Author

Steboss commented Jan 14, 2025

So I run a check in the cuda-dl-base image, and I can see that:

  • for nccl we need to create symlink for include and lib directories, so they're mapped in opt/nvidia/nccl
  • same for cudNN
  • we can safely remove install-ofed.sh
  • and amazon efa support

For the symlink, we'd just need this part of the install-nccl.sh script (and the counterpart in install-cudnn.sh script):

arch=$(uname -m)-linux-gnu
for nccl_file in $(dpkg -L libnccl2 libnccl-dev | sort -u); do
  # Real files and symlinks are linked into $prefix
  if [[ -f "${nccl_file}" || -h "${nccl_file}" ]]; then
    # Replace /usr with $prefix and remove arch-specific lib directories
    nosysprefix="${nccl_file#"/usr/"}"
    noarchlib="${nosysprefix/#"lib/${arch}"/lib}"
    link_name="${prefix}/${noarchlib}"
    link_dir=$(dirname "${link_name}")
    mkdir -p "${link_dir}"
    ln -s "${nccl_file}" "${link_name}"
  else
    echo "Skipping ${nccl_file}"
  fi
done

@DwarKapex does it sound right to you?

@Steboss Steboss changed the title Update the dockerfile base image so that we can support NCCL Update the dockerfile base image to cuda-dl-base Jan 15, 2025
@Steboss
Copy link
Contributor Author

Steboss commented Jan 15, 2025

@DwarKapex

  • Updated the based Dockerfile to have cuda-dl-base image
  • to avoid having conflicts and re-install of nccl and cudnn i've modified install-cudnn.sh and install-nccl.sh
  • for both script I added a part where we're doing the symlink step to that resources are present in /opt/nvidia/{package}

@olupton
Copy link
Collaborator

olupton commented Jan 16, 2025

The nsys-jax test failures are because the 24.12 cuda-dl-base includes Nsight Systems 2024.7, whereas we currently install 2024.6 because of some issues with 2024.7 (#1176 is the - pending - attempt to move to 2024.7). A possible workaround would be to use the 24.11 cuda-dl-base for the moment.

@olupton olupton requested a review from DwarKapex January 16, 2025 14:14
nouiz
nouiz previously approved these changes Jan 20, 2025
olupton
olupton previously approved these changes Jan 21, 2025
Copy link
Collaborator

@olupton olupton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, only one minor nit left.
@yhtang and/or @chaserileyroberts to review the GCP networking relevant parts.

olupton
olupton previously approved these changes Jan 21, 2025
Copy link
Collaborator

@yhtang yhtang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the symlink-xyz scripts modeled after? How do other DLFW containers accomodate the dl core container?

@Steboss
Copy link
Contributor Author

Steboss commented Jan 23, 2025

@yhtang
The symlink-xyz scripts are meant to create a symlink for nvcc and cudnn. These packages are already install in cuda-dl-base image, but they're not linked to /opt/nvidia/ folder, as we were doing before.
This is all a jax/xla thing, that's why we might not need this in the other DLFW

@Steboss
Copy link
Contributor Author

Steboss commented Jan 24, 2025

@olupton @yhtang
i think it may be wise to add an additional step in the CI, that check for the very latest cuda-dl-base image, so we can avoid updating this manually, and we'll have an automatic system that does it for us.

@yhtang
Copy link
Collaborator

yhtang commented Jan 27, 2025

@olupton @yhtang
i think it may be wise to add an additional step in the CI, that check for the very latest cuda-dl-base image, so we can avoid updating this manually, and we'll have an automatic system that does it for us.

Bumping the CUDA base image is usually not a light job, as it may break things. Hence why we always update it via a PR. The DL base image is only updated once a month so IMHO we can live with it.

@yhtang yhtang self-requested a review January 27, 2025 08:09
yhtang
yhtang previously approved these changes Jan 27, 2025
@Steboss Steboss merged commit 1a13844 into main Jan 27, 2025
89 of 115 checks passed
@Steboss Steboss deleted the sbosisio/cuda-dl-base branch January 27, 2025 17:15
olupton added a commit that referenced this pull request Mar 26, 2025
- Remove some infrastructure missed in #1296
- Fix metric calculation/check for the remaining MaxText tests
- Remove the MJX pipeline added in #497, which had been failing for
months.
- Update the README à la #1143 and #1198 to include dates of the first
nightlies to include the base container bumps of #1248, #1276 and #1320
- Add a missing test dependency for Levanter unit tests
- Remove some more T5X tests, leaving only a ViT one, and try to fix its
metric calculation/check
olupton added a commit that referenced this pull request Jun 3, 2025
Installation script revived from
#1248.

This fixes `nsys-jax` test failures introduced by
#1469.
Steboss added a commit that referenced this pull request Jun 6, 2025
The prior setup pre-dated
#1248, now things can be
simpler.

---------

Co-authored-by: Steboss <[email protected]>
Co-authored-by: Steboss <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants