Update: latest nemo-rl #1273
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open: wedu-nvidia wants to merge 16 commits into main from wedu/nemo-rl-latest
Changes from 4 commits
Commits
| Commit | Message | Author |
|---|---|---|
| bf19a4a | update config for sft and grpo | wedu-nvidia |
| c83f108 | local test | wedu-nvidia |
| ab13b2b | update yaml | wedu-nvidia |
| 9a3da6b | update to latest container | wedu-nvidia |
| 96ca50a | restore local yaml | wedu-nvidia |
| 1b905ae | Merge branch 'main' into wedu/nemo-rl-latest | wedu-nvidia |
| a2a8dc8 | Use NGC nemo-rl image by commit tag, remove local Dockerfile | wedu-nvidia |
| 63f043d | Exclude numb3rs from test_eval.py (#1275) | Kipok |
| fb0b06d | Merge branch 'main' into wedu/nemo-rl-latest | wedu-nvidia |
| 14ce48f | Merge branch 'main' into wedu/nemo-rl-latest | gwarmstrong |
| cb1b7dd | Update for vllm version | wedu-nvidia |
| 7e020d7 | Merge branch 'main' into wedu/nemo-rl-latest | gwarmstrong |
| b68c676 | Merge branch 'main' into wedu/nemo-rl-latest | gwarmstrong |
| 14a0e8b | Merge branch 'main' into wedu/nemo-rl-latest | gwarmstrong |
| bd6700d | Merge branch 'main' into wedu/nemo-rl-latest | Kipok |
| 23d67ee | Fix gpu tests | Kipok |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,149 +1,10 @@ | ||
| # syntax=docker/dockerfile:1 | ||
| # copied and edited from https://github.com/NVIDIA/NeMo-RL/blob/main/docker/Dockerfile | ||
| # TODO: from next update try to re-use their dockerfile as is as they support specifying the commit | ||
| # Lightweight Dockerfile: use pre-built nvcr.io/nvidian/nemo-rl:nightly and only add NeMo-Skills. | ||
| # To use the image without building at all, set containers.nemo-rl to nvcr.io/nvidian/nemo-rl:nightly | ||
| # in your cluster config (see cluster_configs/example-local.yaml). | ||
| ARG BASE_IMAGE=nvcr.io/nvidia/cuda-dl-base:25.05-cuda12.9-devel-ubuntu24.04 | ||
| ARG NEMO_RL_IMAGE=nvcr.io/nvidian/nemo-rl:nightly | ||
| FROM scratch AS nemo-rl | ||
| FROM ${NEMO_RL_IMAGE} | ||
| ARG NEMO_RL_COMMIT=${NEMO_RL_COMMIT:-e95efb912a6909b5da91ffeb197debe91fd480d8} | ||
| ADD --keep-git-dir=true https://github.com/NVIDIA-NeMo/RL.git#${NEMO_RL_COMMIT} / | ||
| FROM ${BASE_IMAGE} AS base | ||
| # An environment variable to indicate that we are in a container. | ||
| ENV NRL_CONTAINER=1 | ||
| # It is more convenient for users to run as root | ||
| USER root | ||
| RUN <<"EOF" bash -exu -o pipefail | ||
| export DEBIAN_FRONTEND=noninteractive | ||
| export TZ=America/Los_Angeles | ||
| apt-get update | ||
| apt-get install -y --no-install-recommends \ | ||
| jq \ | ||
| curl \ | ||
| git \ | ||
| rsync \ | ||
| wget \ | ||
| less \ | ||
| vim \ | ||
| # Nsight | ||
| apt install -y --no-install-recommends gnupg | ||
| echo "deb http://developer.download.nvidia.com/devtools/repos/ubuntu$(source /etc/lsb-release; echo "$DISTRIB_RELEASE" | tr -d .)/$(dpkg --print-architecture) /" | tee /etc/apt/sources.list.d/nvidia-devtools.list | ||
| apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub | ||
| apt update | ||
| apt install -y nsight-systems-cli | ||
| # To fix CVE-2025-68973 | ||
| apt install -y --only-upgrade gnupg | ||
| apt-get clean | ||
| rm -rf /var/lib/apt/lists/* | ||
| EOF | ||
| # Install uv and python | ||
| ARG UV_VERSION=0.9.7 | ||
| ARG PYTHON_VERSION=3.12 | ||
| ENV PATH="/root/.local/bin:$PATH" | ||
| RUN curl -LsSf https://astral.sh/uv/${UV_VERSION}/install.sh | sh && \ | ||
| uv python install ${PYTHON_VERSION} | ||
| # Disable usage stats by default for users who are sensitive to sharing usage. | ||
| # Users are encouraged to enable if the wish. | ||
| ENV RAY_USAGE_STATS_ENABLED=0 | ||
| # After ray>=2.47, this feature is enabled by default which creates uv venvs for any py_executable starting with `uv run`. | ||
| # There is severe contention and performance issues with this enabled considering our dependencies are so large and occasionally | ||
| # need to be compiled, so NeMo RL has an implementation in nemo_rl/utils/venv.py that does it once per node as opposed to once per task. | ||
| ENV RAY_ENABLE_UV_RUN_RUNTIME_ENV=0 | ||
| ENV NEMO_RL_VENV_DIR=/opt/ray_venvs | ||
| FROM base AS hermetic | ||
| WORKDIR /opt/NeMo-RL | ||
| # Variables to control the build of TE. If there are issues with parallelization, consider | ||
| # setting these to 1. | ||
| ARG MAX_JOBS | ||
| ARG NVTE_BUILD_THREADS_PER_JOB | ||
| # Only use for custom vllm installs. Learn more at https://github.com/NVIDIA-NeMo/RL/blob/main/docs/guides/use-custom-vllm.md | ||
| ARG BUILD_CUSTOM_VLLM | ||
| ENV UV_PROJECT_ENVIRONMENT=/opt/nemo_rl_venv | ||
| ENV UV_LINK_MODE=copy | ||
| # Ensure DeepEP is built for H100 and B200 (also mcore inference unified memory API now invokes a torch API that requires these to be set) | ||
| ENV TORCH_CUDA_ARCH_LIST="9.0 10.0" | ||
| # First copy only the dependency files | ||
| COPY --from=nemo-rl pyproject.toml uv.lock ./ | ||
| # Copy in the top level __init__.py/package_info.py since build-custom-vllm.sh needs the nemo_rl package to exist. | ||
| COPY --from=nemo-rl nemo_rl/__init__.py nemo_rl/package_info.py ./nemo_rl/ | ||
| COPY --from=nemo-rl tools/build-custom-vllm.sh ./tools/build-custom-vllm.sh | ||
| COPY --from=nemo-rl --link research/ ./research/ | ||
| COPY --from=nemo-rl --link 3rdparty/ ./3rdparty/ | ||
| RUN --mount=type=ssh <<"EOF" bash -exu | ||
| uv venv --seed | ||
| if [[ -n "${BUILD_CUSTOM_VLLM:-}" ]]; then | ||
| bash tools/build-custom-vllm.sh | ||
| source 3rdparty/vllm/nemo-rl.env | ||
| fi | ||
| # uv sync has a more reliable resolver than simple uv pip install which can fail | ||
| # Sync each training + inference backend one at a time (since they may conflict) | ||
| # to warm the uv cache, then at the end just sync the default dependencies. | ||
| # Do everything in one layer to prevent large layers. | ||
| # The venv is symlinked to avoid bloating the layer size | ||
| uv sync --link-mode symlink --locked --no-install-project | ||
| uv sync --link-mode symlink --locked --extra vllm --no-install-project | ||
| uv sync --link-mode symlink --locked --extra mcore --no-install-project | ||
| uv sync --link-mode symlink --locked --extra automodel --no-install-project | ||
| uv sync --link-mode symlink --locked --all-groups --no-install-project | ||
| # Remove the aiohttp in this uv cache dir to fully address CVE GHSA-mqqc-3gqh-h2x8 | ||
| # The ray install will include the older aiohttp version in its cache | ||
| find /root/.cache/uv -type d -path "*ray/_private/runtime_env/agent/thirdparty_files/aiohttp*" -exec rm -rf {} + | ||
| EOF | ||
| ENV PATH="/opt/nemo_rl_venv/bin:$PATH" | ||
| ENV NEMO_RL_VENV_DIR=/opt/ray_venvs | ||
| WORKDIR /opt/NeMo-RL | ||
| FROM hermetic AS release | ||
| ARG NVIDIA_BUILD_ID | ||
| ARG NVIDIA_BUILD_REF | ||
| ARG RC_DATE=00.00 | ||
| ARG TARGETARCH | ||
| ENV NVIDIA_BUILD_ID=${NVIDIA_BUILD_ID:-<unknown>} | ||
| ENV NVIDIA_BUILD_REF=${NVIDIA_BUILD_REF:-<unknown>} | ||
| LABEL com.nvidia.build.id="${NVIDIA_BUILD_ID}" | ||
| LABEL com.nvidia.build.ref="${NVIDIA_BUILD_REF}" | ||
| ENV NEMO_RL_VENV_DIR=/opt/ray_venvs | ||
| # Copy in source from build context (defaults to cloned repo, can be overridden) | ||
| # Exclude pyproject.toml and uv.lock since those may be altered by build-custom-vllm.sh | ||
| COPY --from=nemo-rl --exclude=pyproject.toml --exclude=uv.lock . /opt/NeMo-RL | ||
| # Unshallow the repo to get the full history (in the case it was from the scratch layer). | ||
| # Potentially not necessary if the repo is passed in as a complete repository (w/ full git history), | ||
| # so do a quick check before trying to unshallow. | ||
| RUN git rev-parse --is-shallow-repository | grep -q true && git fetch --unshallow || true | ||
| RUN UV_LINK_MODE=symlink uv run nemo_rl/utils/prefetch_venvs.py | ||
| # Generate container fingerprint for frozen environment support | ||
| # Store outside /opt/NeMo-RL to avoid being overwritten by user mounts | ||
| RUN python tools/generate_fingerprint.py > /opt/nemo_rl_container_fingerprint | ||
| # NOTICES.txt file points to where the OSS source code is archived | ||
| RUN echo "This distribution includes open source which is archived at the following URL: https://opensource.nvidia.com/oss/teams/nvidia/nemo-rl/${RC_DATE}:linux-${TARGETARCH}/index.html" > NOTICES.txt && \ | ||
| echo "For further inquiries or assistance, contact us at oss-requests@nvidia.com" >> NOTICES.txt | ||
| RUN git clone https://github.com/NVIDIA-NeMo/Skills.git /opt/NeMo-Skills && cd /opt/NeMo-Skills && uv pip install . |
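
Note on reading this hunk: the flattened table no longer marks which rows were removed and which were added. Going by the hunk header `@@ -1,149 +1,10 @@` and the rows that mention `NEMO_RL_IMAGE`, the file after this change plausibly reduces to roughly the lines below. This is a reconstruction under that assumption, not a copy of the committed file. Every instruction is taken from rows in the table, but which rows survive and where the blank lines fall are guesses.

```dockerfile
# syntax=docker/dockerfile:1
# NOTE (assumption): reconstructed from the flattened diff; the choice of
# surviving rows and the blank-line placement are inferred, not confirmed.
# Lightweight Dockerfile: use pre-built nvcr.io/nvidian/nemo-rl:nightly and only add NeMo-Skills.
# To use the image without building at all, set containers.nemo-rl to nvcr.io/nvidian/nemo-rl:nightly
# in your cluster config (see cluster_configs/example-local.yaml).

ARG NEMO_RL_IMAGE=nvcr.io/nvidian/nemo-rl:nightly

FROM ${NEMO_RL_IMAGE}

RUN git clone https://github.com/NVIDIA-NeMo/Skills.git /opt/NeMo-Skills && cd /opt/NeMo-Skills && uv pip install .
```

Under that reading, the base image could be swapped at build time with `docker build --build-arg NEMO_RL_IMAGE=<image> .`, since an `ARG` declared before `FROM` is available to the `FROM` line.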
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -368,7 +368,6 @@ def main(): | ||
| loss_fn, | ||
| master_config, | ||
| logger, | ||
| sft_task_spec, | ||
| checkpointer, | ||
| sft_save_state, | ||
| ) | ||