
[Feature]: FLASHMLA_SPARSE (SM90/SM100 only) fallback to TILELANG reference kernel (supports ALL) #30644

@fernandaspets

🚀 The feature, motivation and pitch

SGLang and vLLM handle the problematic DeepSeek-V3.2 sparse attention (DSA) kernels very differently: SGLang's more flexible architecture lets it bypass the unsupported FLASHMLA_SPARSE kernel, while vLLM's backend selection forces its use and fails.

Here is a breakdown of why vLLM is stuck and how SGLang works around the issue.

🔍 Why vLLM Fails: A Rigid Backend Chain

The vLLM logs show the core problem: once index_topk is detected, the framework's attention backend selection is forced down a specific path.

  • Monolithic FlashMLA Backend: In vLLM, when a model uses DeepSeek Sparse Attention (DSA), the only backend equipped to handle it is FLASHMLA_SPARSE. This backend relies on the high-performance, low-level CUDA kernels from the official FlashMLA library.
  • Hardware Lock-In: The official FlashMLA kernels are built only for enterprise GPUs with SM90 (Hopper) and SM100 (Blackwell) architectures. They do not support the consumer-grade SM120 (RTX Blackwell) architecture of my GPU, which is a known hardware support gap.
  • No Fallback: vLLM's architecture for MLA models does not seem to have a built-in, automatic fallback mechanism. When the only viable backend (FLASHMLA_SPARSE) fails due to incompatible hardware, the process crashes. The sketch after this list shows the kind of fallback this issue is asking for.
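
A minimal sketch of the requested behavior, assuming hypothetical names: `has_flashmla_sparse_support` and a `TILELANG_SPARSE` backend identifier do not exist in vLLM today and are placeholders for this proposal, not real APIs.

```python
# Hypothetical sketch only: none of these names are real vLLM APIs.
# It illustrates the requested behavior: prefer FLASHMLA_SPARSE where the
# hardware supports it, otherwise fall back to a portable TileLang kernel
# instead of crashing.
import torch


def has_flashmla_sparse_support() -> bool:
    # The official FlashMLA sparse kernels are built for SM90 (Hopper)
    # and SM100 (data-center Blackwell) only.
    major, minor = torch.cuda.get_device_capability()
    return (major, minor) in {(9, 0), (10, 0)}


def select_sparse_mla_backend() -> str:
    if has_flashmla_sparse_support():
        return "FLASHMLA_SPARSE"   # fast CUDA path
    # Requested fallback: portable reference kernel for SM120 and other GPUs.
    return "TILELANG_SPARSE"
```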

🛠️ How SGLang Succeeds: A Flexible, Multi-Kernel Backend

SGLang takes a different approach. It doesn't rely on a single, monolithic backend for MLA. Instead, it uses a Native Sparse Attention (NSA) backend, which is a sophisticated dispatcher capable of using different underlying kernels.

The tilelang_kernel.py file in SGLang is the key: the NSA backend has access to multiple kernel implementations, including these portable TileLang kernels.
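
A rough sketch of that dispatch pattern is below. This is not SGLang's actual code; the kernels are passed in as callables purely to show the structure.

```python
from typing import Callable, Optional

import torch

# A sparse-attention kernel takes (q, kv_cache, topk_indices) and returns output.
Kernel = Callable[[torch.Tensor, torch.Tensor, torch.Tensor], torch.Tensor]


class NSASparseAttention:
    """Illustrative dispatcher: hold several kernel implementations and
    pick whichever one is usable on the current GPU."""

    def __init__(self, flashmla_kernel: Optional[Kernel], tilelang_kernel: Kernel):
        # flashmla_kernel is None when the CUDA kernels are unavailable
        # (e.g., SM120 consumer Blackwell); the TileLang kernel always works.
        self.flashmla_kernel = flashmla_kernel
        self.tilelang_kernel = tilelang_kernel

    def forward(self, q, kv_cache, topk_indices):
        if self.flashmla_kernel is not None:
            return self.flashmla_kernel(q, kv_cache, topk_indices)  # fast path
        return self.tilelang_kernel(q, kv_cache, topk_indices)      # portable path
```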

| Component | vLLM's Approach | SGLang's NSA Backend Approach |
|---|---|---|
| Primary Sparse Kernel | flash_mla_sparse_fwd (CUDA, SM90/SM100 only) | flash_mla_sparse_fwd (CUDA, SM90/SM100 only) |
| Fallback Sparse Kernel | None | tilelang_sparse_fwd (portable TileLang) |
| Dense/Short-Seq Kernel | Separate backend (e.g., TRITON_MLA) | Integrated in NSA backend (e.g., fa3, triton) |
| Result on SM120 | Crashes when forced to use FLASHMLA_SPARSE | Works: silently fails over to the tilelang kernel, or uses a dense kernel for short sequences |

The "automatic fallback" you suspected is real. SGLang's NSA backend can dynamically choose a kernel based on the sequence length and, crucially, what is available on the hardware. When the fast flashmla_sparse kernel is not supported on SM120, the backend can select the portable tilelang kernel without the user needing to specify it.

Alternatives

SGLang

Additional context

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
