Update the dockerfile base image to cuda-dl-base #1248
Merged
Changes from 6 commits
18 commits
dedf8c2 Update the dockerfile base image so that we can support NCCL
ad2493a remove no needed installs
d95d51c fix installs and symlinks
d341fcd forgot to take out tcpx
62000ec fix way to link files, so cudnn.h is visible
62f6d65 Merge branch 'main' into sbosisio/cuda-dl-base (Steboss)
8b52c58 Address @olupton comments
6ff8cd4 Merge branch 'sbosisio/cuda-dl-base' of github.com:NVIDIA/JAX-Toolbox…
bd066f1 follow @olupton suggestion for nsys and check results
564ec47 Merge branch 'main' into sbosisio/cuda-dl-base (Steboss)
1154ccb fix newline
bfb588d add mods in container
65b5dd8 Address Olli's comments
5d2e464 remove DS_store
b058b15 remove check-shm
235ffac update to the latest image
b2b5bcd reset to 24.11 for nsight (Steboss)
ce1f3f3 Merge branch 'main' into sbosisio/cuda-dl-base (Steboss)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,27 +1,10 @@ | ||
| # syntax=docker/dockerfile:1-labs | ||
| ARG BASE_IMAGE=nvidia/cuda:12.6.3-devel-ubuntu24.04 | ||
| ARG BASE_IMAGE=nvcr.io/nvidia/cuda-dl-base:24.12-cuda12.6-devel-ubuntu24.04 | ||
| ARG GIT_USER_NAME="JAX Toolbox" | ||
| ARG [email protected] | ||
| ARG CLANG_VERSION=18 | ||
| ARG JAX_TOOLBOX_REF | ||
|
|
||
| ############################################################################### | ||
| ## Obtain GCP's NCCL TCPx plugin | ||
| ############################################################################### | ||
|
|
||
| FROM us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/nccl-plugin-gpudirecttcpx:v3.1.10 AS tcpx-installer-amd64 | ||
|
|
||
| # make a stub arm64 container because GCP does not provide an arm64 version of the plugin | ||
| FROM ubuntu AS tcpx-installer-arm64 | ||
| RUN <<"OUTEREOF" bash -ex | ||
| mkdir -p /scripts /var/lib/tcpx/lib64 | ||
| echo '#!/bin/bash' > /scripts/container_entry.sh | ||
| chmod +x /scripts/container_entry.sh | ||
| OUTEREOF | ||
|
|
||
| FROM tcpx-installer-${TARGETARCH} AS tcpx-installer | ||
| RUN /scripts/container_entry.sh install | ||
|
|
||
| ############################################################################### | ||
| ## Build base image | ||
| ############################################################################### | ||
|
|
@@ -153,72 +136,20 @@ ENV PIP_BREAK_SYSTEM_PACKAGES=1 | |
| RUN pip install --upgrade --ignore-installed --no-cache-dir -e /opt/pip pip-tools && rm -rf ~/.cache/* | ||
|
|
||
| ############################################################################### | ||
| ## Install TCPx | ||
| ############################################################################### | ||
|
|
||
| ENV TCPX_LIBRARY_PATH=/usr/local/tcpx/lib64 | ||
| COPY --from=tcpx-installer /var/lib/tcpx/lib64 ${TCPX_LIBRARY_PATH} | ||
|
|
||
| ############################################################################### | ||
| ## Install the latest versions of Nsight Systems and Nsight Compute | ||
| ############################################################################### | ||
|
|
||
| ADD install-nsight.sh /usr/local/bin | ||
| RUN install-nsight.sh | ||
|
|
||
| ############################################################################### | ||
| ## Install cuDNN | ||
| ## Symlink for cuDNN | ||
| ############################################################################### | ||
|
|
||
| ADD install-cudnn.sh /usr/local/bin | ||
| RUN install-cudnn.sh | ||
|
|
||
| ############################################################################### | ||
| ## Install NCCL | ||
| ## Symlink for NCCL | ||
| ############################################################################### | ||
|
|
||
| # same fro this | ||
| ADD install-nccl.sh /usr/local/bin | ||
| RUN install-nccl.sh | ||
|
|
||
| ############################################################################### | ||
| ## RoCE and InfiniteBand support | ||
| ############################################################################### | ||
|
|
||
| ADD install-ofed.sh /usr/local/bin | ||
| RUN install-ofed.sh | ||
|
|
||
| ############################################################################## | ||
| ## Amazon EFA support (need to run it inside container separately) | ||
| ############################################################################## | ||
|
|
||
| ADD --chmod=777 \ | ||
| install-efa.sh \ | ||
| test-aws-efa.sh \ | ||
| /usr/local/bin/ | ||
| ENV LD_LIBRARY_PATH=/opt/amazon/efa/lib:${LD_LIBRARY_PATH} | ||
| ENV PATH=/opt/amazon/efa/bin:${PATH} | ||
|
|
||
| ############################################################################## | ||
| ## NCCL sanity check utility | ||
| ############################################################################## | ||
|
|
||
| ADD install-nccl-sanity-check.sh /usr/local/bin | ||
| ADD nccl-sanity-check.cu /opt | ||
| RUN install-nccl-sanity-check.sh | ||
| ADD jax-nccl-test parallel-launch /usr/local/bin/ | ||
|
|
||
| ############################################################################### | ||
| ## Add the systemcheck to the entrypoint. | ||
| ############################################################################### | ||
|
|
||
| COPY check-shm.sh /opt/nvidia/entrypoint.d/ | ||
|
|
||
| ############################################################################### | ||
| ## Add the GCP - TCPX check to the entrypoint. | ||
| ############################################################################### | ||
|
|
||
| # TODO(chaserileyroberts): Reenable once fully tested on GCP. | ||
| # COPY gcp-autoconfig.sh /opt/nvidia/entrypoint.d/ | ||
|
|
||
| ############################################################################### | ||
| ## Install the nsys-jax JAX/XLA-aware profiling scripts, patch Nsight Systems | ||
|
|
||
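A note on the `## Symlink for cuDNN` and `## Symlink for NCCL` sections and the commit "fix way to link files, so cudnn.h is visible": the rendered diff does not show the commands that were added under those headings, so the following is only a minimal sketch of the general technique, not the code from this PR. The helper name `link_into` and the paths in the comments are hypothetical; check where the base image actually places its headers and libraries before adapting anything like this.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (not taken from this PR): symlink every file in
# src_dir whose name matches a glob pattern into dest_dir, so that tools
# searching dest_dir can find headers or libraries installed elsewhere.
link_into() {
  local src_dir="$1" dest_dir="$2" pattern="$3"
  local f
  mkdir -p "${dest_dir}"
  # ${pattern} is deliberately unquoted so the shell expands the glob.
  for f in "${src_dir}"/${pattern}; do
    # When nothing matches, the glob stays literal; skip it.
    [ -e "${f}" ] || continue
    # -f replaces an existing link, so re-running the step is harmless.
    ln -sf "${f}" "${dest_dir}/$(basename "${f}")"
  done
}

# Example invocation with assumed, illustrative paths:
#   link_into /usr/include/x86_64-linux-gnu /opt/example/cudnn/include 'cudnn*.h'
```

Linking rather than copying keeps a single copy of each file in the image, so the linked location keeps following whatever the base image ships.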