
[5/n] Migrate CUTLASS MLA, hadamard, awq, allspark and DSV3 fused a gemm to torch stable ABI#38671

Draft
mikaylagawarecki wants to merge 10 commits into vllm-project:main from mikaylagawarecki:new-stable-abi-phase5

Conversation

@mikaylagawarecki (Contributor) commented Apr 1, 2026


Purpose

Part of the stable ABI migration tracked in #26946.

Test Plan

On A100

  python -m pytest tests/kernels/quantization/test_allspark_gemm.py          

On H100

  python -m pytest tests/kernels/quantization/test_hadacore.py
  python -m pytest tests/kernels/quantization/test_awq.py      

On B200

  python -m pytest tests/kernels/attention/test_cutlass_mla_decode.py

The DeepSeek GEMM kernel does not appear to have an existing test.

Test Result

A100:
  [screenshot: passing test output, 2026-03-31 8:11 PM]

H100:
  [screenshot: passing test output, 2026-03-31 8:17 PM]
  [screenshot: passing test output, 2026-03-31 8:18 PM]

B200:
  [screenshot: passing test output, 2026-03-31 8:13 PM]


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.


@gemini-code-assist (bot) commented

Code Review

This pull request migrates several CUDA kernels—including AWQ, AllSpark, DeepSeek V3 fused A GEMM, Hadacore, and CUTLASS MLA—from the standard extension to the stable ABI extension (_C_stable_libtorch). The changes involve updating CMakeLists.txt to reassign source files, replacing standard Torch types and macros with stable ABI equivalents (e.g., torch::stable::Tensor, STD_TORCH_CHECK), and implementing stable ABI-compliant utilities for device property caching and cuBLAS handle retrieval. Feedback highlights critical issues regarding thread safety with global workspace tensors, potential compilation failures when using non-movable types in containers, and the need for better bounds checking and naming consistency in the new utility functions.

  // Device properties cache for stable ABI compatibility.
  // Uses raw CUDA/HIP APIs instead of ATen functions.
  // Using inline ensures a single instance across all translation units.
  inline std::deque<std::once_flag> device_flags;

Severity: high

The use of std::deque<std::once_flag> is problematic because std::once_flag is non-copyable and non-movable. While std::deque generally provides stable pointers to its elements, the resize operation (line 35) requires the type to be MoveInsertable according to the C++ standard, which std::once_flag is not. This will likely lead to compilation errors on many toolchains. A better approach is to initialize all device properties at once during the global initialization phase, removing the need for per-device once_flag containers.

@mikaylagawarecki (Author) replied Apr 1, 2026

The code here is actually a very slight adaptation of the code in torch https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/cuda/CUDAContext.cpp#L12-L59 to make it stable.

(Granted, torch uses c10::once_flag, but that is also non-copyable and non-movable, so it has the same issue.)

Since the std::deque is only ever resized once, from size 0 to num_devices, I don't think this is actually problematic. However, I can fix it if anyone thinks it is.

  #include "core/registration.h"
  #include "libtorch_stable/torch_utils.h"

  torch::stable::Tensor as_g_workspace;
@gemini-code-assist (bot) commented

Severity: high

The global variable as_g_workspace of type torch::stable::Tensor introduces a significant race condition. In a multi-threaded or multi-stream environment, concurrent calls to allspark_w8a16_gemm will attempt to check and reallocate this global tensor (lines 991-996), leading to memory corruption or use-after-free errors when one thread overwrites the workspace while another is using it. For stable ABI compatibility and thread safety, workspace memory should be managed via a thread-local cache, a per-device map, or ideally passed as an argument from the Python allocator.

@mikaylagawarecki (Author) replied

This is pre-existing behavior; the global workspace was not introduced by this PR.

Commits

All 10 commits are signed off by Mikayla Gawarecki <mikaylagawarecki@gmail.com>. Two of them are described as: "Pure move, no code changes. Preparatory step for stable ABI migration."