[5/n] Migrate CUTLASS MLA, hadamard, awq, allspark and DSV3 fused a gemm to torch stable ABI #38671
mikaylagawarecki wants to merge 10 commits into vllm-project:main
Conversation
Code Review
This pull request migrates several CUDA kernels—including AWQ, AllSpark, DeepSeek V3 fused A GEMM, Hadacore, and CUTLASS MLA—from the standard extension to the stable ABI extension (_C_stable_libtorch). The changes involve updating CMakeLists.txt to reassign source files, replacing standard Torch types and macros with stable ABI equivalents (e.g., torch::stable::Tensor, STD_TORCH_CHECK), and implementing stable ABI-compliant utilities for device property caching and cuBLAS handle retrieval. Feedback highlights critical issues regarding thread safety with global workspace tensors, potential compilation failures when using non-movable types in containers, and the need for better bounds checking and naming consistency in the new utility functions.
```cpp
// Device properties cache for stable ABI compatibility.
// Uses raw CUDA/HIP APIs instead of ATen functions.
// Using inline ensures a single instance across all translation units.
inline std::deque<std::once_flag> device_flags;
```
The use of std::deque<std::once_flag> is problematic because std::once_flag is non-copyable and non-movable. While std::deque generally provides stable pointers to its elements, the resize operation (line 35) requires the type to be MoveInsertable according to the C++ standard, which std::once_flag is not. This will likely lead to compilation errors on many toolchains. A better approach is to initialize all device properties at once during the global initialization phase, removing the need for per-device once_flag containers.
The code here is actually a very slight adaptation of the code in torch https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/cuda/CUDAContext.cpp#L12-L59 to make it stable.
(Granted, torch uses c10::once_flag, but that is also non-copyable and non-movable, so it has the same issue.)
Since the std::deque is only ever resized once, from size 0 to num_devices, I don't think this is actually problematic. However, I can fix it if anyone thinks it is.
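For reference, the once-per-process alternative the review suggests could look roughly like this. This is a minimal self-contained sketch with made-up names (`DeviceProp`, `fake_device_count` standing in for `cudaGetDeviceCount`), not the actual patch:

```cpp
#include <mutex>
#include <vector>

// Hypothetical sketch: instead of a per-device container of
// std::once_flag (non-copyable, non-movable), query the device count
// once and populate every device's properties under a single call_once.
struct DeviceProp {
  int device_id;
  // ... fields filled from cudaGetDeviceProperties in the real code
};

static std::once_flag g_init_flag;
static std::vector<DeviceProp> g_device_props;

// Stand-in for cudaGetDeviceCount in this self-contained sketch.
static int fake_device_count() { return 4; }

const DeviceProp& getDeviceProperties(int device) {
  std::call_once(g_init_flag, [] {
    int n = fake_device_count();
    g_device_props.reserve(n);
    for (int i = 0; i < n; ++i) {
      g_device_props.push_back(DeviceProp{i});
    }
  });
  return g_device_props.at(device);  // bounds-checked, throws on bad index
}
```

This trades per-device lazy initialization for one eager pass, and sidesteps the MoveInsertable question entirely since only a movable value type is ever resized.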
```cpp
#include "core/registration.h"
#include "libtorch_stable/torch_utils.h"

torch::stable::Tensor as_g_workspace;
```
The global variable as_g_workspace of type torch::stable::Tensor introduces a significant race condition. In a multi-threaded or multi-stream environment, concurrent calls to allspark_w8a16_gemm will attempt to check and reallocate this global tensor (lines 991-996), leading to memory corruption or use-after-free errors when one thread overwrites the workspace while another is using it. For stable ABI compatibility and thread safety, workspace memory should be managed via a thread-local cache, a per-device map, or ideally passed as an argument from the Python allocator.
pre-existing
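For illustration, the thread-local variant the review mentions could be sketched as follows. The names here are made up, and `std::vector<unsigned char>` stands in for `torch::stable::Tensor`; this is not the actual vLLM code:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch of a thread-local workspace cache. Each thread
// grows its own buffer, so a concurrent caller can never reallocate a
// workspace that another thread is still using.
unsigned char* get_workspace(std::size_t required_bytes) {
  thread_local std::vector<unsigned char> t_workspace;
  if (t_workspace.size() < required_bytes) {
    t_workspace.resize(required_bytes);
  }
  return t_workspace.data();
}
```

Note the caveat: thread-local storage only removes cross-thread races; if a single thread enqueues async work on multiple CUDA streams against the same buffer, a per-stream or caller-allocated workspace (as the review also suggests) would still be needed.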
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Force-pushed from 10e67b6 to 8bd7514
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Pure move, no code changes. Preparatory step for stable ABI migration. Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Pure move, no code changes. Preparatory step for stable ABI migration. Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Force-pushed from 8bd7514 to 2233700
Purpose
#26946
Test Plan
On A100
On H100
On B200
The DeepSeek GEMM kernel does not appear to have a test.
Test Result
A100: (benchmark screenshot)
H100: (benchmark screenshots)
B200: (benchmark screenshot)