Fix Marlin repack PTX incompatibility on H100/H200 (CUDA 12.8) #38669
DavidBellamy wants to merge 2 commits into vllm-project:main from
Conversation
Code Review
This pull request updates the build configuration to include native support for sm_90 in Marlin kernels and introduces a detailed error handler for PTX version mismatches during repack operations. Review feedback highlights that removing the +PTX suffix from sm_80 in CMakeLists.txt breaks compatibility for several architectures (like sm_86 and sm_89) and recommends ensuring both 8.0 and 9.0 retain the suffix. Additionally, the error message in vllm/_custom_ops.py should be generalized as it currently incorrectly specifies "MoE repack" for standard repack callers.
CMakeLists.txt
# Include 9.0 so that H100/H200 (sm_90) get native SASS instead of relying
# on PTX JIT, which fails when the wheel's CUDA toolkit is newer than the
# driver (e.g. wheel built with CTK 12.9 on a CUDA 12.8 driver).
cuda_archs_loose_intersection(MARLIN_OTHER_ARCHS "7.5;8.0;9.0+PTX" "${CUDA_ARCHS}")
Removing the +PTX suffix from 8.0 breaks compatibility for architectures like sm_86 (Ampere) and sm_89 (Ada) that are not explicitly listed in the Marlin arch strings. These devices rely on JIT-compiling from the sm_80 PTX because sm_80 SASS is not binary-compatible with them. To support sm_90 natively while maintaining compatibility for other sm_8x devices, both 8.0 and 9.0 should include the +PTX suffix.
Additionally, please consider applying this same change to MARLIN_ARCHS (line 354) and MARLIN_MOE_ARCHS (line 1063), as the main GEMM kernels will otherwise still trigger the same PTX JIT incompatibility on Hopper GPUs.
cuda_archs_loose_intersection(MARLIN_OTHER_ARCHS "7.5;8.0+PTX;9.0+PTX" "${CUDA_ARCHS}")
CMakeLists.txt
# Include 9.0 so that H100/H200 (sm_90) get native SASS instead of relying
# on PTX JIT, which fails when the wheel's CUDA toolkit is newer than the
# driver (e.g. wheel built with CTK 12.9 on a CUDA 12.8 driver).
cuda_archs_loose_intersection(MARLIN_MOE_OTHER_ARCHS "7.5;8.0;9.0+PTX" "${CUDA_ARCHS}")
raise RuntimeError(
    "Marlin MoE repack kernel failed with a CUDA error that usually "
    "indicates a PTX version mismatch: the pre-built vLLM wheel was "
    "compiled with a CUDA toolkit newer than what your GPU driver "
    f"supports (driver version: {cuda_ver}).\n\n"
    "To fix this, build vLLM from source with your system's CUDA "
    "toolkit:\n"
    "  pip install vllm --no-binary vllm\n\n"
    "Or install a matching CUDA toolkit / update your GPU driver.\n"
    "See https://github.com/vllm-project/vllm/issues/38619"
) from original_error
The error message is currently specific to "MoE repack", but this helper function is also used by standard Marlin repack operations (gptq_marlin_repack and awq_marlin_repack). A more generic message would be more accurate for all callers.
raise RuntimeError(
    "Marlin repack kernel failed with a CUDA error that usually "
    "indicates a PTX version mismatch: the pre-built vLLM wheel was "
    "compiled with a CUDA toolkit newer than what your GPU driver "
    f"supports (driver version: {cuda_ver}).\n\n"
    "To fix this, build vLLM from source with your system's CUDA "
    "toolkit:\n"
    "  pip install vllm --no-binary vllm\n\n"
    "Or install a matching CUDA toolkit / update your GPU driver.\n"
    "See https://github.com/vllm-project/vllm/issues/38619"
) from original_error
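Since the same message is shared by all four repack entry points, centralizing it in one helper keeps the callers uniform. The sketch below shows roughly how such a shared handler could look; the function name `handle_marlin_repack_error` and the `cuda_ver` parameter are illustrative assumptions, not vLLM's actual internals.

```python
# Illustrative sketch of a shared error handler for the four Marlin repack
# entry points (gptq_marlin_repack, awq_marlin_repack, and MoE variants).
# The helper name and signature are assumptions, not vLLM's actual API;
# `cuda_ver` is the CUDA version reported by the driver.

PTX_MISMATCH_MARKER = "unsupported toolchain"


def handle_marlin_repack_error(original_error: Exception, cuda_ver: str) -> None:
    """Re-raise a PTX toolchain mismatch as an actionable diagnostic."""
    if PTX_MISMATCH_MARKER not in str(original_error):
        # Unrelated CUDA failure: propagate unchanged.
        raise original_error
    raise RuntimeError(
        "Marlin repack kernel failed with a CUDA error that usually "
        "indicates a PTX version mismatch: the pre-built vLLM wheel was "
        "compiled with a CUDA toolkit newer than what your GPU driver "
        f"supports (driver version: {cuda_ver}).\n\n"
        "To fix this, build vLLM from source with your system's CUDA "
        "toolkit:\n"
        "  pip install vllm --no-binary vllm\n\n"
        "Or install a matching CUDA toolkit / update your GPU driver.\n"
        "See https://github.com/vllm-project/vllm/issues/38619"
    ) from original_error
```

Each call site would then wrap its `torch.ops` invocation in `try/except RuntimeError` and delegate to this helper, so the "MoE" vs non-MoE wording never diverges again.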
Add sm_90 to MARLIN_OTHER_ARCHS and MARLIN_MOE_OTHER_ARCHS so that Marlin repack kernels (gptq_marlin_repack, awq_marlin_repack) compile native SASS for H100/H200 instead of relying on PTX JIT. When a pre-built wheel is compiled with a newer CUDA toolkit than the driver supports (e.g. CTK 12.9 wheel on a 12.8 driver), PTX JIT fails with "the provided PTX was compiled with an unsupported toolchain." Also wrap all Marlin repack call sites with a try/except that catches the PTX toolchain error and raises a clear diagnostic message with the driver version and build-from-source instructions.

Fixes vllm-project#38619

Signed-off-by: David Bellamy <12414531+DavidBellamy@users.noreply.github.com>
e53adee to 3d72856
- Keep 8.0+PTX (not bare 8.0) so sm_86/sm_89 can still JIT from PTX
- Add 9.0+PTX to MARLIN_ARCHS and MARLIN_MOE_ARCHS (main GEMM kernels) to avoid the same PTX JIT issue on the inference path
- Generalize error message from "MoE repack" to "repack" since the helper is shared by all four repack functions

Signed-off-by: David Bellamy <12414531+DavidBellamy@users.noreply.github.com>
Summary
Fixes #38619. The Marlin MoE repack kernel (`gptq_marlin_moe_repack`) crashes with `CUDA error: the provided PTX was compiled with an unsupported toolchain` when serving quantized MoE models (e.g. Kimi K2.5) on H100/H200 with a CUDA 12.8 driver, because pre-built wheels compiled with a newer CUDA toolkit generate PTX that the 12.8 driver cannot JIT-compile.

Root cause:
`MARLIN_OTHER_ARCHS` and `MARLIN_MOE_OTHER_ARCHS` in `CMakeLists.txt` were set to `"7.5;8.0+PTX"`, meaning on sm_90 (H100/H200) the driver must JIT-compile sm_80 PTX at runtime. If the wheel was built with CTK 12.9+, the embedded PTX uses a newer ISA version than the 12.8 driver supports.

Changes:
- Add `9.0` to both `MARLIN_OTHER_ARCHS` and `MARLIN_MOE_OTHER_ARCHS` (`"7.5;8.0;9.0+PTX"`), so H100/H200 get native sm_90 SASS for Marlin repack kernels. The `+PTX` moves to 9.0 to preserve forward compatibility for future architectures.
- Wrap all Marlin repack call sites (`gptq_marlin_repack`, `awq_marlin_repack`, and their MoE variants) with a try/except that catches the "unsupported toolchain" CUDA error and raises a diagnostic message including the driver version and build-from-source instructions.

Testing
Validated on an M2 cluster node:
- `CUDA_HOME=/usr/local/cuda-12.8`, PyTorch 2.10.0+cu128
- `moonshotai/Kimi-K2.5` (1T params, compressed-tensors WNA16 INT4, 384 MoE experts)
- `--enforce-eager`, `--max-model-len 32768`
- Before the fix, serving crashed in `process_weights_after_loading` with the PTX toolchain error.