@Begunner
Collaborator

What does this PR do?

TransformerEngine v2.8 leads to unexpected crashes. This PR updates it to v2.10 and fixes the resulting compatibility issues.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request updates the TransformerEngine version from v2.8 to v2.10 in the sglang and vllm Dockerfiles to resolve compatibility issues. The change is correct and addresses the stated problem. My review includes a suggestion to pin the dependency to a specific commit hash instead of a tag to improve build reproducibility and security.

RUN MAX_JOBS=128 pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" git+https://github.com/NVIDIA/apex.git

- RUN export NVTE_FRAMEWORK=pytorch && MAX_JOBS=128 NVTE_BUILD_THREADS_PER_JOB=4 pip3 install --resume-retries 999 --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/TransformerEngine.git@release_v2.8
+ RUN export NVTE_FRAMEWORK=pytorch && MAX_JOBS=128 NVTE_BUILD_THREADS_PER_JOB=4 pip3 install --resume-retries 999 --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/TransformerEngine.git@release_v2.10

high

For better reproducibility and security, it's recommended to pin dependencies to a specific commit hash instead of a tag. The tag release_v2.10 can be moved, which could lead to different build results in the future. The commit hash corresponding to this tag is 06082989335780a5f7808246a30146313175883a.

RUN export NVTE_FRAMEWORK=pytorch && MAX_JOBS=128 NVTE_BUILD_THREADS_PER_JOB=4 pip3 install --resume-retries 999 --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/TransformerEngine.git@06082989335780a5f7808246a30146313175883a
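
One way to guard against a floating ref slipping back into the Dockerfiles is a small check on the pip VCS URL. The following is only a sketch: the function name `is_pinned_to_commit` is hypothetical and not part of this PR, and it assumes the ref is whatever follows the last `@` in the URL.

```shell
#!/bin/sh
# Hypothetical helper (not part of this PR): succeeds only when the ref at the
# end of a pip VCS URL is a full 40-character lowercase hex commit hash.
is_pinned_to_commit() {
    # Keep the text after the last '@'; if there is no '@', the whole URL
    # remains and fails the pattern below.
    ref="${1##*@}"
    printf '%s' "$ref" | grep -Eq '^[0-9a-f]{40}$'
}

# A hash for a given ref can be looked up with, for example:
#   git ls-remote https://github.com/NVIDIA/TransformerEngine.git release_v2.10
# (not run here, since it needs network access).

url="git+https://github.com/NVIDIA/TransformerEngine.git@release_v2.10"
if is_pinned_to_commit "$url"; then
    echo "pinned to a commit hash"
else
    echo "floating ref, consider pinning to a commit hash"
fi
```

A check like this could run in CI over every `git+https://` URL found in the Dockerfiles. It only validates the shape of the ref and does not confirm that the hash exists in the repository.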

RUN MAX_JOBS=128 pip install -v --disable-pip-version-check --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" git+https://github.com/NVIDIA/apex.git

- RUN export NVTE_FRAMEWORK=pytorch && MAX_JOBS=128 NVTE_BUILD_THREADS_PER_JOB=4 pip3 install --resume-retries 999 --no-build-isolation git+https://github.com/NVIDIA/TransformerEngine.git@release_v2.8
+ RUN export NVTE_FRAMEWORK=pytorch && MAX_JOBS=128 NVTE_BUILD_THREADS_PER_JOB=4 pip3 install --resume-retries 999 --no-build-isolation git+https://github.com/NVIDIA/TransformerEngine.git@release_v2.10

high

For better reproducibility and security, it's recommended to pin dependencies to a specific commit hash instead of a tag. The tag release_v2.10 can be moved, which could lead to different build results in the future. The commit hash corresponding to this tag is 06082989335780a5f7808246a30146313175883a.

RUN export NVTE_FRAMEWORK=pytorch && MAX_JOBS=128 NVTE_BUILD_THREADS_PER_JOB=4 pip3 install --resume-retries 999 --no-build-isolation git+https://github.com/NVIDIA/TransformerEngine.git@06082989335780a5f7808246a30146313175883a

@wuxibin89 wuxibin89 merged commit f3a0233 into volcengine:main Dec 29, 2025
98 of 115 checks passed
boren-ms pushed a commit to boren-ms/verl that referenced this pull request Dec 30, 2025
…sues (volcengine#4714)

### What does this PR do?

> TransformerEngine-v2.8 leads to unexpected crashes. Try to update it
to v2.10.
> Fix other resultant compatibility issues.

---------

Co-authored-by: Begunner <[email protected]>