fix Dockerfile.dev breakage for llmd_fs_backend #620

Merged
github-actions[bot] merged 1 commit into llm-d:main from saikat-royc:fix-dockerfile-05-28 on Jun 2, 2026

@saikat-royc saikat-royc commented May 29, 2026

Contributor

Summary

This PR fixes two build issues in Dockerfile.dev that caused C++/CUDA source compilation to fail during Docker builds:

  1. CUDA Version Mismatch (RuntimeError): Resolves the compiler mismatch error by aligning the default toolkit package with the base image's CUDA 13.0.
  2. Missing Development Headers: Resolves header lookup failures by repointing the /usr/local/cuda symlink at the newly installed toolkit directory after package installation.

Fixes:

  1. Out-of-Sync Package Defaults (CUDA Version Mismatch)
    Problem: The default base image argument VLLM_IMAGE was set to vllm/vllm-openai:v0.21.0 (which packages PyTorch built with CUDA 13.0). However, the default developer toolkit argument CUDA_TOOLKIT_PKG was hardcoded to cuda-toolkit-12-9 (CUDA 12.9). This mismatch triggered a hard PyTorch safety check failure during the source compilation step:

Failure log:

    RuntimeError: ('The detected CUDA version (%s) mismatches the version that was used to compilePyTorch (%s). Please make sure to use the same CUDA versions.', '12.9', '13.0')                         

Fix: Updated the default build argument CUDA_TOOLKIT_PKG to cuda-toolkit-13-0 in Dockerfile.dev so that standard local builds compile against the same CUDA version as the default base image.
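
The parity requirement described above can be checked before compilation with a small guard. This is only a sketch, not part of the PR: the function names `toolkit_pkg_to_version` and `check_cuda_parity` are hypothetical, and it assumes the package name always follows the `cuda-toolkit-<major>-<minor>` pattern.

```shell
# Hypothetical helper: convert a package name such as "cuda-toolkit-13-0"
# into a dotted version string such as "13.0".
# Assumes the name follows the cuda-toolkit-<major>-<minor> pattern.
toolkit_pkg_to_version() {
    printf '%s\n' "${1#cuda-toolkit-}" | tr '-' '.'
}

# Hypothetical guard: succeed only when the toolkit package version ($1)
# equals the CUDA version PyTorch was built with ($2).
check_cuda_parity() {
    toolkit_version="$(toolkit_pkg_to_version "$1")"
    torch_version="$2"
    if [ "$toolkit_version" != "$torch_version" ]; then
        echo "CUDA mismatch: toolkit ${toolkit_version}, PyTorch ${torch_version}" >&2
        return 1
    fi
    echo "CUDA versions match: ${toolkit_version}"
}
```

Inside the image, the second argument could be read from PyTorch itself, for example `check_cuda_parity "$CUDA_TOOLKIT_PKG" "$(python3 -c 'import torch; print(torch.version.cuda)')"`.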

  2. Multi-CUDA Symlink Conflict (Missing cusparse.h / Dev Headers)
    Problem: The base deployment image packages a minimal, runtime-only CUDA setup (under /usr/local/cuda-13.0/) and symlinks /usr/local/cuda to it. This folder lacks all developer headers. When the target developer package (cuda-toolkit-13-0) is installed, the tools are correctly installed into a separate folder (/usr/local/cuda-13.0/), but the existing /usr/local/cuda symlink is not updated and still points to the runtime-only environment. During compilation, PyTorch's builder searches for system headers under the standard symlink path (-I/usr/local/cuda/include/), which resolves to the headerless directory, causing the build to fail.

Failure Log Snippet:

    In file included from /usr/local/lib/python3.12/dist-packages/torch/include/ATen/cuda/CUDAContext.h:4,
                     from /workspace/llmd_fs_backend/kv_connectors/llmd_fs_backend/csrc/storage/tensor_copier.cu:17:
    /usr/local/lib/python3.12/dist-packages/torch/include/ATen/cuda/CUDAContextLight.h:10:10: fatal error: cusparse.h: No such file or directory
       10 | #include <cusparse.h>
          |          ^~~~~~~~~~~~
    compilation terminated.

Fix: Added a symlink resolution step to the dependencies RUN block that:

  • Parses the exact versioned folder dynamically from the target ${CUDA_TOOLKIT_PKG} argument (e.g., cuda-toolkit-13-0 → /usr/local/cuda-13.0).
  • Forces the standard /usr/local/cuda symlink to point to the newly installed developer path containing the standard headers.
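
The two steps above can be sketched as follows. This is an illustration under stated assumptions, not the actual Dockerfile.dev change: the function names `cuda_dir_from_pkg` and `relink_cuda` and the optional prefix argument are hypothetical, and the package name is assumed to follow the `cuda-toolkit-<major>-<minor>` pattern.

```shell
# Hypothetical helper: derive the versioned install directory from the
# package name, e.g. cuda-toolkit-13-0 -> /usr/local/cuda-13.0.
# $1: package name; $2: optional install prefix (defaults to /usr/local).
cuda_dir_from_pkg() {
    version="$(printf '%s\n' "${1#cuda-toolkit-}" | tr '-' '.')"
    printf '%s/cuda-%s\n' "${2:-/usr/local}" "$version"
}

# Hypothetical helper: force <prefix>/cuda to point at the versioned
# toolkit directory, refusing to relink if that directory is absent.
relink_cuda() {
    prefix="${2:-/usr/local}"
    target="$(cuda_dir_from_pkg "$1" "$prefix")"
    if [ ! -d "$target" ]; then
        echo "toolkit directory not found: $target" >&2
        return 1
    fi
    # -s: symbolic link, -f: replace an existing link,
    # -n: do not follow an existing link that points to a directory
    ln -sfn "$target" "$prefix/cuda"
}
```

In a RUN block this would be called after the package install, for example `relink_cuda "${CUDA_TOOLKIT_PKG}"`. The prefix argument exists only so the logic can be exercised outside a container without touching /usr/local.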

Verification of the fixes:

  1. Ensure the build command works

    make image-fs-backend-build IMAGE_TAG_BASE=gcr.io/<project> FS_BACKEND_NAME=vllm-llmd-fs DEV_VERSION=vllm-0.21-cu130-client-cache-v1

    make image-fs-backend-push IMAGE_TAG_BASE=gcr.io/<project> FS_BACKEND_NAME=vllm-llmd-fs DEV_VERSION=vllm-0.21-cu130-client-cache-v1

  2. Basic inference using inference-perf
  3. Run existing tests kv_connectors/llmd_fs_backend/tests/
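
As an additional sanity check, the presence of the developer headers behind the /usr/local/cuda symlink can be confirmed inside the built image. The sketch below is not part of the PR; `check_cuda_headers` is a hypothetical helper, and the header list is an assumption (cusparse.h is the header from the failure log, cuda_runtime.h is a standard toolkit header).

```shell
# Hypothetical check: verify that the headers needed at compile time exist
# in the include directory the compiler searches.
# $1: include directory (defaults to /usr/local/cuda/include).
check_cuda_headers() {
    include_dir="${1:-/usr/local/cuda/include}"
    missing=0
    for header in cusparse.h cuda_runtime.h; do
        if [ ! -f "$include_dir/$header" ]; then
            echo "missing header: $include_dir/$header" >&2
            missing=1
        fi
    done
    # Non-zero exit status if any header was not found.
    return "$missing"
}
```

Running `check_cuda_headers` with no arguments in the container checks the default path and exits non-zero if a header is absent.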

@github-actions github-actions Bot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label May 29, 2026
@saikat-royc

Contributor Author

/cc @kfirtoledo

fix CUDA version mismatch and dev headers symlink
- Update default CUDA_TOOLKIT_PKG to cuda-toolkit-13-0 to
  match the CUDA 13.0 base image and prevent PyTorch compilation
  version mismatch.
- Explicitly parse and update the standard /usr/local/cuda symlink
  after GKE package installation to resolve missing dev headers
  (cusparse.h) during compilation

Signed-off-by: Saikat Roychowdhury <saikat.royc85@gmail.com>
@saikat-royc

Contributor Author

/cc @kfirtoledo request a review for this PR

@kfirtoledo

Collaborator

/lgtm
/approve

@github-actions github-actions Bot added the lgtm Looks good to me, indicates that a PR is ready to be merged. label Jun 2, 2026
@github-actions github-actions Bot merged commit c8fff80 into llm-d:main Jun 2, 2026
11 checks passed