🚀 The feature, motivation and pitch
Based on your investigation and the search results, SGLang and vLLM handle the problematic DeepSeek-V3.2 sparse attention (DSA) kernels very differently. SGLang has a more flexible architecture that allows it to bypass the unsupported FLASHMLA_SPARSE kernel, while vLLM's structure forces its use and fails.
Here is a breakdown of why vLLM is stuck and how SGLang works around the issue.
🔍 Why vLLM Fails: A Rigid Backend Chain
The vLLM logs show the core problem: once `index_topk` is detected, the framework's attention backend selection is forced down a specific path.
- Monolithic FlashMLA Backend: In vLLM, when a model uses DeepSeek Sparse Attention (DSA), the only backend equipped to handle it is `FLASHMLA_SPARSE`. This backend relies on the high-performance, low-level CUDA kernels from the official FlashMLA library.
- Hardware Lock-In: The official FlashMLA kernels are built only for data-center GPUs with SM90 (Hopper) and SM100 (Blackwell) architectures. They do not support the consumer-grade SM120 (RTX Blackwell) architecture of your GPU, which is a known hardware support gap (see the capability check sketched after this list).
- No Fallback: vLLM's architecture for MLA models does not appear to have a built-in, automatic fallback mechanism. When the only viable backend (`FLASHMLA_SPARSE`) fails due to incompatible hardware, the process crashes.
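For reference, here is a minimal sketch of the kind of compute-capability check that rules out SM120. This is illustrative only, not vLLM's or FlashMLA's actual guard code: the supported-SM set simply reflects the SM90/SM100 requirement described above, and the function name is a placeholder.

```python
# Illustrative sketch only -- not vLLM's or FlashMLA's actual guard code.
# The supported-SM set reflects the SM90/SM100 requirement described above;
# the function name is a placeholder.
import torch

SUPPORTED_SPARSE_SM = {(9, 0), (10, 0)}  # Hopper (SM90) and data-center Blackwell (SM100)


def can_use_flashmla_sparse() -> bool:
    """Return True if the current GPU's compute capability is in the supported set."""
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() in SUPPORTED_SPARSE_SM


if __name__ == "__main__":
    if torch.cuda.is_available():
        # On an RTX Blackwell card this reports (12, 0), so the check fails, and
        # vLLM currently has no alternative sparse-attention path to fall back to.
        print("Compute capability:", torch.cuda.get_device_capability())
    print("FLASHMLA_SPARSE usable:", can_use_flashmla_sparse())
```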
🛠️ How SGLang Succeeds: A Flexible, Multi-Kernel Backend
SGLang takes a different approach. It doesn't rely on a single, monolithic backend for MLA. Instead, it uses a Native Sparse Attention (NSA) backend, which is a sophisticated dispatcher capable of using different underlying kernels.
Your discovery of the `tilelang_kernel.py` file is the key. The NSA backend has access to multiple kernel implementations, including the portable TileLang kernels you found.
| Component | vLLM's Approach | SGLang's NSA Backend Approach |
|---|---|---|
| Primary Sparse Kernel | `flash_mla_sparse_fwd` (CUDA, SM90/SM100 only) | `flash_mla_sparse_fwd` (CUDA, SM90/SM100 only) |
| Fallback Sparse Kernel | None | `tilelang_sparse_fwd` (portable TileLang) |
| Dense/Short-Seq Kernel | Separate backend (e.g., `TRITON_MLA`) | Integrated in the NSA backend (e.g., `fa3`, `triton`) |
| Result on SM120 | Crashes when forced to use `FLASHMLA_SPARSE` | Works: silently fails over to the `tilelang` kernel, or uses a dense kernel for short sequences |
The "automatic fallback" you suspected is real. SGLang's NSA backend can dynamically choose a kernel based on the sequence length and, crucially, what is available on the hardware. When the fast flashmla_sparse kernel is not supported on SM120, the backend can select the portable tilelang kernel without the user needing to specify it.
Alternatives
SGLang
Additional context
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.