_qattn_sm80 extension fails on Ada Lovelace (sm89/L40S) — illegal instruction at runtime or compilation failure #360

@ShaiDiamant


Environment

  • GPU: NVIDIA L40S (sm89, Ada Lovelace)
  • CUDA Toolkit: 12.6
  • PyTorch: 2.11.0+cu126
  • Triton: 3.6.0
  • Python: 3.11
  • Platform: Modal (cloud GPU, debian_slim base image)
  • SageAttention: v2.2.0

Problem

SageAttention v2.2.0 does not work correctly on L40S (sm89) GPUs. There are two failure modes, depending on the build configuration:

Failure Mode 1: Runtime illegal instruction (default build)

When building with TORCH_CUDA_ARCH_LIST="8.9", all extensions compile successfully. However, at runtime, ComfyUI workflows using SageAttention crash with:

CUDA error: an illegal instruction was encountered
XID: NVRM: Xid (PCI:0000:3e:00): 13, Graphics SM Warp Exception: Illegal Instruction Encoding

The crash occurs during KSampler attention operations. The root cause appears to be that _qattn_sm80 is built whenever HAS_SM89=True (setup.py line 173: if HAS_SM80 or HAS_SM86 or HAS_SM89 or ...), yet it shares the same NVCC_FLAGS as the other extensions, and those flags contain only -gencode arch=compute_89,code=sm_89. The sm80 tensor core MMA source code, when compiled to sm89 SASS, produces invalid instruction encodings.
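To make the shared-flags behaviour concrete, here is a simplified sketch of how an arch list such as TORCH_CUDA_ARCH_LIST="8.9" is typically expanded into nvcc -gencode flags. The helper name arch_list_to_gencode is hypothetical and this is not the code from setup.py; it only illustrates that with "8.9" alone, every extension sharing the list is compiled for sm_89 and nothing else.

```python
def arch_list_to_gencode(arch_list: str) -> list[str]:
    """Expand a TORCH_CUDA_ARCH_LIST-style string into nvcc -gencode flags.

    Hypothetical helper for illustration only. Entries are separated by ';'
    or whitespace; a '+PTX' suffix additionally embeds PTX for JIT compilation.
    """
    flags: list[str] = []
    for entry in arch_list.replace(";", " ").split():
        with_ptx = entry.endswith("+PTX")
        version = entry.removesuffix("+PTX")  # "8.0+PTX" -> "8.0"
        num = version.replace(".", "")        # "8.9" -> "89"
        # SASS for the exact architecture
        flags += ["-gencode", f"arch=compute_{num},code=sm_{num}"]
        if with_ptx:
            # PTX that newer GPUs can JIT-compile at load time
            flags += ["-gencode", f"arch=compute_{num},code=compute_{num}"]
    return flags


# With only "8.9" requested, the single shared list targets sm_89 exclusively.
shared_flags = arch_list_to_gencode("8.9")
print(shared_flags)
```

If the same list object is handed to _qattn_sm80, _qattn_sm89 and _fused, all three are built from these flags, regardless of what each source file was written for.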

Failure Mode 2: Compilation failure (when targeting sm80)

When attempting to fix this by giving _qattn_sm80 its own flags with -gencode arch=compute_80,code=sm_80, the compilation fails because the source file qk_int_sv_f16_cuda_sm80.cu uses CUDA features that require sm89+ (likely FP8 types or sm89-specific MMA instructions). So the "sm80" extension source is not actually compatible with sm80 compute capability.

Root Cause

setup.py has a structural issue:

  1. Shared NVCC_FLAGS: All extensions (_qattn_sm80, _qattn_sm89, _fused) use the same NVCC_FLAGS list, which gets gencode flags based on TORCH_CUDA_ARCH_LIST. When targeting sm89, _qattn_sm80 is compiled with sm89 gencode but its source code produces invalid sm89 SASS.
  2. Misnamed extension: _qattn_sm80 is built for all architectures ≥ sm80 (if HAS_SM80 or HAS_SM86 or HAS_SM89 or ...), but its source code appears to require sm89+ features, making it incompatible with actual sm80 compilation targets.
  3. No per-extension gencode flags: Each extension should have architecture-appropriate flags. _qattn_sm80 should either:
    - Be compiled with sm80-compatible source code and sm80 gencode flags, OR
    - Be renamed/restructured to reflect its actual minimum architecture requirement
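As a rough illustration of point 3, the sketch below builds a separate flag list per extension instead of mutating one shared list. Everything here is an assumption for illustration: the EXTENSION_TARGETS table, the nvcc_flags_for helper, the base flags, and the architectures assigned to each extension are placeholders, not values taken from setup.py. In particular, since the sm80 source reportedly fails to compile for sm_80 (Failure Mode 2), the real targets would have to be decided by the maintainers.

```python
# Illustrative common flags; the project's real base flags may differ.
BASE_NVCC_FLAGS = ["-O3", "-std=c++17"]

# Hypothetical per-extension targets (placeholders, not from setup.py):
# "sass" lists architectures to emit machine code for, "ptx" optionally
# names one architecture whose PTX is embedded for JIT on newer GPUs.
EXTENSION_TARGETS = {
    "_qattn_sm80": {"sass": ["80"], "ptx": "80"},
    "_qattn_sm89": {"sass": ["89"], "ptx": None},
}


def nvcc_flags_for(ext_name: str) -> list[str]:
    """Return an independent nvcc flag list for one extension."""
    target = EXTENSION_TARGETS[ext_name]
    flags = list(BASE_NVCC_FLAGS)  # copy, so extensions never share one list
    for arch in target["sass"]:
        flags += ["-gencode", f"arch=compute_{arch},code=sm_{arch}"]
    if target["ptx"] is not None:
        ptx = target["ptx"]
        flags += ["-gencode", f"arch=compute_{ptx},code=compute_{ptx}"]
    return flags


for name in EXTENSION_TARGETS:
    print(name, nvcc_flags_for(name))
```

Each resulting list could then be passed as the "nvcc" entry of extra_compile_args for the corresponding CUDAExtension, so that changing one extension's targets cannot silently change another's.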

Steps to Reproduce

On Modal with an L40S GPU:

import modal

image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install("torch==2.11.0+cu126", extra_index_url="https://download.pytorch.org/whl/cu126")
    .pip_install("triton==3.6.0")
    .run_commands(
        # Install the CUDA toolkit for nvcc, then build SageAttention for sm89 only
        "apt-get update && apt-get install -y cuda-toolkit-12-6 build-essential && "
        'CXX=g++ CC=gcc CUDA_HOME=/usr/local/cuda TORCH_CUDA_ARCH_LIST="8.9" '
        "pip install git+https://github.com/thu-ml/SageAttention.git@v2.2.0 --no-build-isolation",
        gpu="l40s",
    )
)

Build succeeds, but running sageattn(q, k, v) on the L40S crashes with cudaErrorIllegalInstruction.
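A minimal script along the following lines could be used to trigger the crash outside ComfyUI. It is a sketch under assumptions: the tensor shape, the float16 dtype, and the helper names capability_to_arch and run_repro are illustrative choices, not taken from the failing workflow. The explicit synchronize is there because CUDA kernel errors are reported asynchronously.

```python
import importlib.util


def capability_to_arch(capability: tuple[int, int]) -> str:
    """Format a (major, minor) compute capability as e.g. 'sm_89'."""
    major, minor = capability
    return f"sm_{major}{minor}"


def run_repro(batch=1, heads=8, seq_len=1024, head_dim=128):
    """Call sageattn once on random inputs; shapes are illustrative."""
    import torch
    from sageattention import sageattn

    if not torch.cuda.is_available():
        print("No CUDA device visible; nothing to reproduce.")
        return None

    arch = capability_to_arch(torch.cuda.get_device_capability())
    print("device:", torch.cuda.get_device_name(), arch)

    shape = (batch, heads, seq_len, head_dim)
    q, k, v = (
        torch.randn(shape, dtype=torch.float16, device="cuda") for _ in range(3)
    )
    out = sageattn(q, k, v)
    # Force pending kernels to finish so an illegal instruction surfaces here.
    torch.cuda.synchronize()
    return out.shape


if __name__ == "__main__":
    if importlib.util.find_spec("sageattention") is None:
        print("sageattention is not installed; skipping.")
    else:
        print(run_repro())
```

Running it with CUDA_LAUNCH_BLOCKING=1 set in the environment may help attribute the error to the specific kernel launch.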

Expected Behavior

SageAttention v2.2.0 should work on L40S (sm89) GPUs, either by:

  • Using per-extension gencode flags so _qattn_sm80 compiles with sm80 flags (SASS + PTX for JIT on newer GPUs)
  • Or restructuring _qattn_sm80 so its source code is genuinely sm80-compatible

Workaround

Using SageAttention v1.0.6 (pure Triton implementation) works but provides significantly less speedup than v2's hand-tuned CUDA kernels.

pip install sageattention==1.0.6
