_qattn_sm80 extension fails on Ada Lovelace (sm89/L40S) — illegal instruction at runtime or compilation failure

Environment                                                                                                                                                      
                                         
  - GPU: NVIDIA L40S (sm89, Ada Lovelace)                                                                                                                          
  - CUDA Toolkit: 12.6
  - PyTorch: 2.11.0+cu126                                                                                                                                          
  - Triton: 3.6.0                                                                                                                                                  
  - Python: 3.11                         
  - Platform: Modal (cloud GPU, debian_slim base image)                                                                                                            
  - SageAttention: v2.2.0                                                                                                                                          
                                         
  Problem                                                                                                                                                          
                                          
  SageAttention v2.2.0 does not work correctly on L40S (sm89) GPUs. There are two failure modes, depending on the build configuration:
  
  Failure Mode 1: Runtime illegal instruction (default build)                                                                                                      
                                          
  When building with TORCH_CUDA_ARCH_LIST="8.9", all extensions compile successfully. However, at runtime, ComfyUI workflows using SageAttention crash with:       
                                          
  CUDA error: an illegal instruction was encountered                                                                                                               
  XID: NVRM: Xid (PCI:0000:3e:00): 13, Graphics SM Warp Exception: Illegal Instruction Encoding
                                                                                                                                                                   
  The crash occurs during KSampler attention operations. The root cause is that _qattn_sm80 is built when HAS_SM89=True (due to setup.py line 173: if HAS_SM80 or  
  HAS_SM86 or HAS_SM89 or ...), but it shares the same NVCC_FLAGS which contain only -gencode arch=compute_89,code=sm_89. The sm80 tensor core MMA source code,    
  when compiled to sm89 SASS, produces invalid instruction encodings.                                                                                              
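One way to confirm which SASS targets actually ended up in the built extension is to inspect the compiled shared object with cuobjdump from the CUDA Toolkit. The sketch below is a diagnostic aid, not part of SageAttention: the helper names are hypothetical, and it assumes that `cuobjdump --list-elf` prints one embedded cubin per line with a name ending in `.sm_NN.cubin`.

```python
import re
import subprocess


def embedded_sm_targets(listing: str) -> set[int]:
    """Extract SM versions from `cuobjdump --list-elf` output.

    Assumes each embedded cubin is listed with a name ending in
    `.sm_NN.cubin` (for example `_qattn_sm80.1.sm_89.cubin`).
    """
    return {int(sm) for sm in re.findall(r"\.sm_(\d+)\.cubin", listing)}


def list_embedded_targets(shared_object: str) -> set[int]:
    """Run cuobjdump on a compiled extension and return its SASS targets."""
    result = subprocess.run(
        ["cuobjdump", "--list-elf", shared_object],
        check=True,
        capture_output=True,
        text=True,
    )
    return embedded_sm_targets(result.stdout)
```

Calling `list_embedded_targets` on the built `_qattn_sm80` shared object (its path depends on the install location) should report only sm89 for the build described above, if the shared-flags explanation is correct.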
                                          
  Failure Mode 2: Compilation failure (when targeting sm80)

  When attempting to fix this by giving _qattn_sm80 its own flags with -gencode arch=compute_80,code=sm_80, the compilation fails because the source file          
  qk_int_sv_f16_cuda_sm80.cu uses CUDA features that require sm89+ (likely FP8 types or sm89-specific MMA instructions). So the "sm80" extension source is not
  actually compatible with sm80 compute capability.                                                                                                                
                                          
  Root Cause                             

  setup.py has a structural issue:

  1. Shared NVCC_FLAGS: All extensions (_qattn_sm80, _qattn_sm89, _fused) use the same NVCC_FLAGS list, which gets gencode flags based on TORCH_CUDA_ARCH_LIST.    
  When targeting sm89, _qattn_sm80 is compiled with sm89 gencode but its source code produces invalid sm89 SASS.
  2. Misnamed extension: _qattn_sm80 is built for all architectures ≥ sm80 (if HAS_SM80 or HAS_SM86 or HAS_SM89 or ...), but its source code appears to require    
  sm89+ features, making it incompatible with actual sm80 compilation targets.                                                                                     
  3. No per-extension gencode flags: Each extension should have architecture-appropriate flags. _qattn_sm80 should either:
    - Be compiled with sm80-compatible source code and sm80 gencode flags, OR                                                                                      
    - Be renamed/restructured to reflect its actual minimum architecture requirement                                                                               
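The per-extension flag idea in point 3 could look roughly like the sketch below. It is not the project's setup.py: `BASE_NVCC_FLAGS`, `gencode_flags`, and `nvcc_flags_for` are hypothetical names, the base flags are placeholders, and only plain numeric architectures such as "8.0" or "8.9" are handled. It also addresses only the flag side; as Failure Mode 2 shows, the _qattn_sm80 source would still need to compile for sm80.

```python
# Placeholder base flags; the real setup.py defines its own list.
BASE_NVCC_FLAGS = ["-O3", "-std=c++17"]


def gencode_flags(archs, emit_ptx_for_highest=True):
    """Build -gencode flags for a list of archs such as ["8.0", "8.9"]."""
    nums = sorted({arch.replace(".", "") for arch in archs}, key=int)
    flags = []
    for num in nums:
        # SASS for each requested architecture.
        flags += ["-gencode", f"arch=compute_{num},code=sm_{num}"]
    if emit_ptx_for_highest and nums:
        # PTX for the highest architecture so newer GPUs can JIT-compile it.
        top = nums[-1]
        flags += ["-gencode", f"arch=compute_{top},code=compute_{top}"]
    return flags


def nvcc_flags_for(extension_archs):
    """Return a fresh flag list so extensions never share one mutable list."""
    return list(BASE_NVCC_FLAGS) + gencode_flags(extension_archs)


# Each extension gets flags matching its own minimum architecture.
ext_flags = {
    "_qattn_sm80": nvcc_flags_for(["8.0"]),
    "_qattn_sm89": nvcc_flags_for(["8.9"]),
}
```

Each entry in `ext_flags` would then be passed as that extension's own nvcc argument list instead of the shared NVCC_FLAGS.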
                                                                                                                                                                   
  Steps to Reproduce                                                                                                                                               
                                                                                                                                                                   
  # On Modal with L40S GPU:
  import modal

  image = (                                                                                                                                                        
      modal.Image.debian_slim(python_version="3.11")
      .pip_install("torch==2.11.0+cu126", extra_index_url="https://download.pytorch.org/whl/cu126")                                                                
      .pip_install("triton==3.6.0")                                                                                                                                
      .run_commands(                     
          # Install CUDA toolkit for nvcc                                                                                                                          
          "apt-get update && apt-get install -y cuda-toolkit-12-6 build-essential && "                                                                             
          'CXX=g++ CC=gcc CUDA_HOME=/usr/local/cuda TORCH_CUDA_ARCH_LIST="8.9" '
          "pip install git+https://github.com/thu-ml/SageAttention.git@v2.2.0 --no-build-isolation",                                                               
          gpu="l40s",                                                                                                                                              
      )                                                                                                                                                            
  )                                                                                                                                                                
                                          
  Build succeeds, but running sageattn(q, k, v) on the L40S crashes with cudaErrorIllegalInstruction.                                                              
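A minimal script for triggering the crash outside ComfyUI might look like the following. The tensor shapes and dtype are assumptions chosen for illustration (the original failure was observed in KSampler workloads, whose shapes are not recorded here), and `run_repro` is a hypothetical helper name. Imports are placed inside the function so the file can be loaded on a machine without a GPU.

```python
def run_repro(batch=1, heads=8, seq_len=1024, head_dim=64):
    """Call sageattn once on random fp16 inputs and wait for the kernel."""
    import torch
    from sageattention import sageattn

    shape = (batch, heads, seq_len, head_dim)  # "HND" layout
    q = torch.randn(shape, dtype=torch.float16, device="cuda")
    k = torch.randn(shape, dtype=torch.float16, device="cuda")
    v = torch.randn(shape, dtype=torch.float16, device="cuda")

    out = sageattn(q, k, v, tensor_layout="HND", is_causal=False)
    # CUDA errors are reported asynchronously; synchronizing surfaces
    # an illegal-instruction fault at this point rather than later.
    torch.cuda.synchronize()
    return out
```

Calling `run_repro()` on the L40S with the build above is expected to surface the illegal-instruction error at the synchronize call, if these shapes exercise the same kernel path.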
  
  Expected Behavior                                                                                                                                                
                                          
  SageAttention v2.2.0 should work on L40S (sm89) GPUs, by either:
  - Using per-extension gencode flags so that _qattn_sm80 compiles with sm80 flags (SASS plus PTX for JIT compilation on newer GPUs), or
  - Restructuring _qattn_sm80 so that its source code is genuinely sm80-compatible
  
  Workaround                                                                                                                                                       
                                          
  Using SageAttention v1.0.6 (pure Triton implementation) works but provides significantly less speedup than v2's hand-tuned CUDA kernels.                         
                                          
  pip install sageattention==1.0.6
