Environment
- GPU: NVIDIA L40S (sm89, Ada Lovelace)
- CUDA Toolkit: 12.6
- PyTorch: 2.11.0+cu126
- Triton: 3.6.0
- Python: 3.11
- Platform: Modal (cloud GPU, debian_slim base image)
- SageAttention: v2.2.0
Problem
SageAttention v2.2.0 does not work correctly on L40S (sm89) GPUs. There are two failure modes, depending on the build configuration:
Failure Mode 1: Runtime illegal instruction (default build)
When building with TORCH_CUDA_ARCH_LIST="8.9", all extensions compile successfully. However, at runtime, ComfyUI workflows using SageAttention crash with:
CUDA error: an illegal instruction was encountered
XID: NVRM: Xid (PCI:0000:3e:00): 13, Graphics SM Warp Exception: Illegal Instruction Encoding
The crash occurs during KSampler attention operations. The root cause is that _qattn_sm80 is built when HAS_SM89=True (due to setup.py line 173: if HAS_SM80 or HAS_SM86 or HAS_SM89 or ...), but it shares the same NVCC_FLAGS, which contain only -gencode arch=compute_89,code=sm_89. The sm80 tensor core MMA source code, when compiled to sm89 SASS, produces invalid instruction encodings.
Failure Mode 2: Compilation failure (when targeting sm80)
Giving _qattn_sm80 its own flags with -gencode arch=compute_80,code=sm_80 does not fix this: compilation fails because the source file qk_int_sv_f16_cuda_sm80.cu uses CUDA features that require sm89+ (likely FP8 types or sm89-specific MMA instructions). The "sm80" extension source is therefore not actually compatible with sm80 compute capability.
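To check which architectures actually ended up in a built extension, the embedded SASS and PTX targets can be listed with cuobjdump. The following is a minimal sketch, not part of SageAttention: the helper names are hypothetical, it assumes cuobjdump from the CUDA toolkit is on PATH, and it assumes the listing names each embedded image with an sm_XX token.

```python
import re
import subprocess


def embedded_archs(listing: str) -> list[str]:
    """Return the sorted, de-duplicated sm_XX targets named in a cuobjdump listing."""
    return sorted(set(re.findall(r"sm_\d+[a-z]?", listing)))


def list_archs(shared_object: str) -> dict[str, list[str]]:
    """Run cuobjdump on a built extension and report its SASS and PTX targets.

    Assumes `cuobjdump` is on PATH; `shared_object` is the path to a compiled .so.
    """
    report = {}
    for kind, flag in (("sass", "--list-elf"), ("ptx", "--list-ptx")):
        proc = subprocess.run(
            ["cuobjdump", flag, shared_object],
            capture_output=True,
            text=True,
            check=False,  # cuobjdump may exit non-zero when no image of that kind exists
        )
        report[kind] = embedded_archs(proc.stdout)
    return report
```

Calling list_archs() on the built _qattn_sm80 shared object (its path depends on the install location) should show whether it contains only sm_89 SASS, as the shared NVCC_FLAGS would suggest.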
Root Cause
setup.py has a structural issue:
- Shared NVCC_FLAGS: All extensions (_qattn_sm80, _qattn_sm89, _fused) use the same NVCC_FLAGS list, which gets gencode flags based on TORCH_CUDA_ARCH_LIST. When targeting sm89, _qattn_sm80 is compiled with sm89 gencode but its source code produces invalid sm89 SASS.
- Misnamed extension: _qattn_sm80 is built for all architectures ≥ sm80 (if HAS_SM80 or HAS_SM86 or HAS_SM89 or ...), but its source code appears to require sm89+ features, making it incompatible with actual sm80 compilation targets.
- No per-extension gencode flags: Each extension should have architecture-appropriate flags. _qattn_sm80 should either:
  - Be compiled with sm80-compatible source code and sm80 gencode flags, OR
  - Be renamed/restructured to reflect its actual minimum architecture requirement
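The per-extension flags idea could look roughly like the sketch below. It is illustrative only and not the project's actual setup.py: gencode_flags, BASE_NVCC_FLAGS, and NVCC_FLAGS_BY_EXTENSION are hypothetical names, and the base flags are placeholders. As Failure Mode 2 shows, the _qattn_sm80 source would also have to compile for sm80 before flags like these could help.

```python
def gencode_flags(sass_archs, ptx_arch=None):
    """Build nvcc -gencode arguments for the given compute capabilities.

    sass_archs: capabilities such as "8.0" or "8.9" to emit native SASS for.
    ptx_arch:   optional capability to also embed PTX for, so newer GPUs can JIT it.
    """
    flags = []
    for arch in sass_archs:
        num = arch.replace(".", "")
        flags += ["-gencode", f"arch=compute_{num},code=sm_{num}"]
    if ptx_arch is not None:
        num = ptx_arch.replace(".", "")
        flags += ["-gencode", f"arch=compute_{num},code=compute_{num}"]
    return flags


# Placeholder base flags; the real setup.py defines its own list.
BASE_NVCC_FLAGS = ["-O3", "-std=c++17"]

# One flag list per extension instead of a single shared NVCC_FLAGS.
NVCC_FLAGS_BY_EXTENSION = {
    "_qattn_sm80": BASE_NVCC_FLAGS + gencode_flags(["8.0"], ptx_arch="8.0"),
    "_qattn_sm89": BASE_NVCC_FLAGS + gencode_flags(["8.9"]),
}
```

Each list would then be passed as the nvcc entry of extra_compile_args for the matching extension, so that changing TORCH_CUDA_ARCH_LIST no longer silently retargets _qattn_sm80.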
Steps to Reproduce
On Modal with L40S GPU:
import modal

image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install("torch==2.11.0+cu126", extra_index_url="https://download.pytorch.org/whl/cu126")
    .pip_install("triton==3.6.0")
    .run_commands(
        # Install CUDA toolkit for nvcc, then build SageAttention for sm89 only
        "apt-get update && apt-get install -y cuda-toolkit-12-6 build-essential && "
        'CXX=g++ CC=gcc CUDA_HOME=/usr/local/cuda TORCH_CUDA_ARCH_LIST="8.9" '
        "pip install git+https://github.com/thu-ml/SageAttention.git@v2.2.0 --no-build-isolation",
        gpu="l40s",
    )
)
Build succeeds, but running sageattn(q, k, v) on the L40S crashes with cudaErrorIllegalInstruction.
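A standalone call along these lines should be enough to trigger the crash outside ComfyUI. This is a minimal sketch: the tensor shapes, the fp16 dtype, and the helper names sm_name and run_repro are assumptions chosen for illustration, not values taken from the failing workflow.

```python
def sm_name(capability):
    """Format a (major, minor) compute capability tuple as e.g. 'sm89'."""
    major, minor = capability
    return f"sm{major}{minor}"


def run_repro():
    """Call sageattn once on random inputs and force any async CUDA error to surface."""
    # Imported lazily so the helper above can be used without a GPU environment.
    import torch
    from sageattention import sageattn

    print("device:", torch.cuda.get_device_name(0), sm_name(torch.cuda.get_device_capability(0)))

    # Assumed layout "HND": (batch, heads, sequence length, head dim).
    q, k, v = (
        torch.randn(1, 8, 1024, 64, dtype=torch.float16, device="cuda")
        for _ in range(3)
    )
    out = sageattn(q, k, v, tensor_layout="HND", is_causal=False)

    # CUDA kernel errors are reported asynchronously; synchronize to raise them here.
    torch.cuda.synchronize()
    print("ok:", tuple(out.shape))
```

Running run_repro() inside the image above, on the L40S, should report the device as sm89 before the attention call.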
Expected Behavior
SageAttention v2.2.0 should work on L40S (sm89) GPUs, either by:
- Using per-extension gencode flags so _qattn_sm80 compiles with sm80 flags (SASS + PTX for JIT on newer GPUs), or
- Restructuring _qattn_sm80 so its source code is genuinely sm80-compatible
Workaround
Using SageAttention v1.0.6 (pure Triton implementation) works but provides significantly less speedup than v2's hand-tuned CUDA kernels.
pip install sageattention==1.0.6