fix Dockerfile.dev breakage for llmd_fs_backend#620
Open
saikat-royc wants to merge 1 commit into
/cc @kfirtoledo
fix CUDA version mismatch and dev headers symlink

- Update default CUDA_TOOLKIT_PKG to cuda-toolkit-13-0 to match the CUDA 13.0 base image and prevent PyTorch compilation version mismatch.
- Explicitly parse and update the standard /usr/local/cuda symlink after GKE package installation to resolve missing dev headers (cusparse.h) during compilation.

Signed-off-by: Saikat Roychowdhury <saikat.royc85@gmail.com>
b678b95 to f6d9d1a
Summary
This PR fixes two build-system issues in Dockerfile.dev that caused C++/CUDA source compilation failures during docker builds:
Fixes:
Problem: The default base image argument VLLM_IMAGE was set to vllm/vllm-openai:v0.21.0, which packages PyTorch built with CUDA 13.0, but the default developer toolkit argument CUDA_TOOLKIT_PKG was hardcoded to cuda-toolkit-12-9 (CUDA 12.9). This mismatch caused PyTorch's CUDA version safety check to fail during the source compilation step:
Failure log:
Fix: Updated the default build argument CUDA_TOOLKIT_PKG to cuda-toolkit-13-0 in Dockerfile.dev so that default local builds compile against the same CUDA version as the default base image.
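To illustrate the version parity described above, here is a minimal sketch of a pre-build check. The function names `toolkit_pkg_for_version` and `check_cuda_parity` are hypothetical and not part of this PR. The sketch assumes the toolkit package follows the `cuda-toolkit-<major>-<minor>` naming used in this description and that the PyTorch CUDA version is available as a `major.minor` string (for example from `torch.version.cuda`).

```shell
# Hypothetical helper (not from the PR): map a CUDA version such as "13.0"
# to a package name of the form cuda-toolkit-<major>-<minor>.
toolkit_pkg_for_version() {
    version="$1"
    major="${version%%.*}"   # text before the first dot
    rest="${version#*.}"     # text after the first dot
    minor="${rest%%.*}"      # drop any patch component
    printf 'cuda-toolkit-%s-%s\n' "$major" "$minor"
}

# Hypothetical check (not from the PR): succeed only when the toolkit package
# matches the CUDA version that PyTorch in the base image was built with.
check_cuda_parity() {
    torch_cuda="$1"    # e.g. from: python3 -c 'import torch; print(torch.version.cuda)'
    toolkit_pkg="$2"   # e.g. the value of the CUDA_TOOLKIT_PKG build argument
    expected="$(toolkit_pkg_for_version "$torch_cuda")"
    if [ "$toolkit_pkg" != "$expected" ]; then
        echo "mismatch: PyTorch uses CUDA $torch_cuda but toolkit is $toolkit_pkg" >&2
        return 1
    fi
}
```

With these definitions, `check_cuda_parity 13.0 cuda-toolkit-12-9` returns a non-zero status, which is the combination the old defaults produced, while `check_cuda_parity 13.0 cuda-toolkit-13-0` succeeds.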
Problem: The base deployment image packages a minimal, runtime-only CUDA setup (under /usr/local/cuda-13.0/) and symlinks /usr/local/cuda to it. This folder contains no developer headers. When the target developer package (cuda-toolkit-13-0) is installed, the tools are correctly installed into a separate folder (/usr/local/cuda-13.0/), but the existing /usr/local/cuda symlink is not updated and still points to the runtime-only environment. During compilation, PyTorch's builder searches for system headers under the standard symlink path (-I/usr/local/cuda/include/), which resolves to the headerless directory, causing the build to fail.
Failure Log Snippet:
Fix: Added a symlink resolution step to the dependencies RUN block that explicitly parses and updates the standard /usr/local/cuda symlink after package installation.
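As a rough illustration of the kind of symlink repointing described (this is not the PR's actual code), the sketch below picks the first `cuda-*` directory that contains `include/cusparse.h` and points the `cuda` symlink at it. The function name `repoint_cuda_symlink` and its `root` parameter are hypothetical; in the image the root would be `/usr/local`, and it is passed as an argument here so the logic can be tried in a scratch directory.

```shell
# Hypothetical helper (not from the PR): repoint <root>/cuda at the first
# <root>/cuda-* directory that contains the dev header include/cusparse.h.
repoint_cuda_symlink() {
    root="$1"   # /usr/local in the image; a parameter so it can be tested safely
    for dir in "$root"/cuda-*; do
        if [ -f "$dir/include/cusparse.h" ]; then
            # -n replaces the existing symlink itself instead of following it
            ln -sfn "$dir" "$root/cuda"
            return 0
        fi
    done
    echo "no CUDA directory with cusparse.h under $root" >&2
    return 1
}
```

Checking for a specific header rather than trusting the directory name means the step fails loudly when no development headers were installed, instead of leaving the symlink on the runtime-only directory.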
Verification of the fixes:
kv_connectors/llmd_fs_backend/tests/