
fix Dockerfile.dev breakage for llmd_fs_backend#620

Open
saikat-royc wants to merge 1 commit into llm-d:main from saikat-royc:fix-dockerfile-05-28


@saikat-royc saikat-royc commented May 29, 2026

Summary

This PR fixes two build-system issues in Dockerfile.dev that caused C++/CUDA source compilation failures during Docker builds:

  1. CUDA Version Mismatch (RuntimeError): Resolves the compiler mismatch error by changing the default toolkit package to CUDA 13.0, matching the default base image.
  2. Missing Development Headers: Resolves header lookup failures by repointing the /usr/local/cuda symlink to the developer toolkit directory after the toolkit package is installed.

Fixes:

  1. Out-of-Sync Package Defaults (CUDA Version Mismatch)
    Problem: The default base image argument VLLM_IMAGE was set to vllm/vllm-openai:v0.21.0 (which packages PyTorch built with CUDA 13.0). However, the default developer toolkit argument CUDA_TOOLKIT_PKG was hardcoded to cuda-toolkit-12-9 (CUDA 12.9). This mismatch triggered a hard PyTorch safety check failure during the source compilation step:

Failure log:

    RuntimeError: ('The detected CUDA version (%s) mismatches the version that was used to compile PyTorch (%s). Please make sure to use the same CUDA versions.', '12.9', '13.0')

Fix: Updated the default build argument CUDA_TOOLKIT_PKG to cuda-toolkit-13-0 in Dockerfile.dev so that default local builds compile against the same CUDA version as the default base image.
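
The version parity described above could also be checked before compilation starts. The following is a minimal sketch, not code from this PR: the function names `toolkit_pkg_to_version` and `check_cuda_parity` are hypothetical, and it assumes the package name always has the form `cuda-toolkit-<major>-<minor>`.

```shell
#!/bin/sh
# Hypothetical helper (not part of the PR): map an apt package name such as
# "cuda-toolkit-13-0" to the dotted CUDA version "13.0".
toolkit_pkg_to_version() {
    printf '%s\n' "$1" | sed -E 's/^cuda-toolkit-([0-9]+)-([0-9]+)$/\1.\2/'
}

# Hypothetical check: succeed only when the toolkit package version ($1)
# matches the CUDA version PyTorch was built with ($2).
check_cuda_parity() {
    toolkit_version=$(toolkit_pkg_to_version "$1")
    torch_cuda_version=$2
    if [ "$toolkit_version" != "$torch_cuda_version" ]; then
        echo "CUDA mismatch: toolkit $toolkit_version vs PyTorch $torch_cuda_version" >&2
        return 1
    fi
}
```

Inside the image, the second argument could come from `python3 -c 'import torch; print(torch.version.cuda)'`, so a mismatch is reported before the extension build reaches PyTorch's own check.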

  2. Multi-CUDA Symlink Conflict (Missing cusparse.h / Dev Headers)
    Problem: The base deployment image packages a minimal, runtime-only CUDA setup (under /usr/local/cuda-13.0/) and symlinks /usr/local/cuda to it. This folder lacks all developer headers. When installing our target developer package (cuda-toolkit-13-0), the tools are correctly installed into a separate folder (/usr/local/cuda-13.0/), but the existing symlink /usr/local/cuda is not updated and still points to the runtime-only environment. During compilation, PyTorch's builder searches for system headers in the standard symlink path, -I/usr/local/cuda/include/, which points to the headerless directory, causing the build to fail.

Failure Log Snippet:

    In file included from /usr/local/lib/python3.12/dist-packages/torch/include/ATen/cuda/CUDAContext.h:4,
                     from /workspace/llmd_fs_backend/kv_connectors/llmd_fs_backend/csrc/storage/tensor_copier.cu:17:
    /usr/local/lib/python3.12/dist-packages/torch/include/ATen/cuda/CUDAContextLight.h:10:10: fatal error: cusparse.h: No such file or directory
       10 | #include <cusparse.h>
          |          ^~~~~~~~~~~~
    compilation terminated.

Fix: Added a symlink resolution step to the dependencies RUN block:

  • Derives the exact versioned folder from the target ${CUDA_TOOLKIT_PKG} argument (e.g., cuda-toolkit-13-0 → /usr/local/cuda-13.0).
  • Forces the standard /usr/local/cuda symlink to point to the newly installed developer path, which contains the developer headers.
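
The two steps above can be sketched in shell roughly as follows. This is an illustration under assumptions, not the exact code added to Dockerfile.dev: the function names `cuda_dir_from_pkg` and `relink_cuda` are hypothetical, the install prefix is assumed to be `/usr/local`, and the package name is assumed to follow the `cuda-toolkit-<major>-<minor>` pattern.

```shell
#!/bin/sh
# Hypothetical helper: derive the versioned install directory from the
# package name, e.g. cuda-toolkit-13-0 -> /usr/local/cuda-13.0.
# $1: package name, $2: install prefix (assumed default: /usr/local).
cuda_dir_from_pkg() {
    prefix=${2:-/usr/local}
    version=$(printf '%s\n' "$1" | sed -E 's/^cuda-toolkit-([0-9]+)-([0-9]+)$/\1.\2/')
    printf '%s/cuda-%s\n' "$prefix" "$version"
}

# Hypothetical helper: force <prefix>/cuda to point at the derived directory.
# Fails without touching the symlink if the target directory does not exist.
relink_cuda() {
    prefix=${2:-/usr/local}
    target=$(cuda_dir_from_pkg "$1" "$prefix")
    if [ ! -d "$target" ]; then
        echo "expected toolkit directory not found: $target" >&2
        return 1
    fi
    # -n replaces the existing symlink itself instead of descending into it.
    ln -sfn "$target" "$prefix/cuda"
}
```

In a RUN block this would be invoked after the package install, for example `relink_cuda "${CUDA_TOOLKIT_PKG}"`. Checking that the target directory exists first avoids leaving a dangling symlink when the package name and the install path disagree.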

Verification of the fixes:

  1. Ensure the build command works:

    make image-fs-backend-build IMAGE_TAG_BASE=gcr.io/<project> FS_BACKEND_NAME=vllm-llmd-fs DEV_VERSION=vllm-0.21-cu130-client-cache-v1

    make image-fs-backend-push IMAGE_TAG_BASE=gcr.io/<project> FS_BACKEND_NAME=vllm-llmd-fs DEV_VERSION=vllm-0.21-cu130-client-cache-v1

  2. Basic inference using inference-perf
  3. Run existing tests kv_connectors/llmd_fs_backend/tests/
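
As an additional sanity check, one could confirm inside the built image that the header from the failure log is now reachable through the standard symlink. This is a small sketch and not one of the PR's verification steps; `has_cuda_header` is a hypothetical name and the defaults are assumptions taken from the paths quoted above.

```shell
#!/bin/sh
# Hypothetical check: report whether a CUDA root exposes a given dev header.
# $1: CUDA root (assumed default: /usr/local/cuda)
# $2: header file name (assumed default: cusparse.h)
has_cuda_header() {
    cuda_root=${1:-/usr/local/cuda}
    header=${2:-cusparse.h}
    if [ -f "$cuda_root/include/$header" ]; then
        echo "found $cuda_root/include/$header"
    else
        echo "missing $cuda_root/include/$header" >&2
        return 1
    fi
}
```

Calling `has_cuda_header` with no arguments in a shell inside the image checks `/usr/local/cuda/include/cusparse.h`, the file the compiler could not find before the fix.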

@github-actions github-actions Bot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label May 29, 2026
@saikat-royc

/cc @kfirtoledo

fix CUDA version mismatch and dev headers symlink
- Update default CUDA_TOOLKIT_PKG to cuda-toolkit-13-0 to
  match the CUDA 13.0 base image and prevent PyTorch compilation
  version mismatch.
- Explicitly parse and update the standard /usr/local/cuda symlink
  after GKE package installation to resolve missing dev headers
  (cusparse.h) during compilation

Signed-off-by: Saikat Roychowdhury <saikat.royc85@gmail.com>
@saikat-royc saikat-royc force-pushed the fix-dockerfile-05-28 branch from b678b95 to f6d9d1a Compare May 29, 2026 17:31