[BUG] cutlass attention completely blocked on Blackwell with no fallback path #1357

@kbenkhaled

Description

I noticed that in PR #1332 the legacy cutlass operators were blocked on Blackwell (CUDA_MAXIMUM_COMPUTE_CAPABILITY = (9, 0)), but the new cutlass_blackwell operators weren't added to the dispatch priority list.

This means on SM100+ devices:

  • cutlass_blackwell.FwOp exists but is never tried by dispatch (it can still be selected explicitly; see the sketch after this list).
  • cutlass.FwOp is blocked by the new capability check.
  • There is no fallback for the cases cutlass_blackwell doesn't support (e.g. float32, K > 128).
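
As a workaround, the op can be forced explicitly instead of relying on dispatch. A minimal sketch, assuming the new operator is importable as xformers.ops.fmha.cutlass_blackwell (the exact module path and the BwOp name are assumptions on my part) and keeping the head dim within the supported range:

```python
import torch
import xformers.ops as xops
from xformers.ops import fmha

q = torch.randn(1, 1024, 1, 128, device="cuda", dtype=torch.float16)  # head dim <= 128
k, v = torch.randn_like(q), torch.randn_like(q)

# Bypass dispatch and select the Blackwell forward/backward ops explicitly.
out = xops.memory_efficient_attention(
    q, k, v,
    op=(fmha.cutlass_blackwell.FwOp, fmha.cutlass_blackwell.BwOp),  # assumed names/path
)
```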

I ran into this on a Blackwell GPU running a Stable Diffusion pipeline with fp16 and K=512:
NotImplementedError: No operator found for memory_efficient_attention_forward with inputs: query : shape=(1, 1024, 1, 512) (torch.float16) ...cutlassF-pt is not supported because: requires device with capability <= (9, 0) but your GPU has capability (11, 0) (too new)
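
For reference, a minimal repro sketch (shapes taken from the error message above; the actual pipeline code is more involved):

```python
import torch
import xformers.ops as xops

# fp16 with head dim K=512, matching the shape in the error message.
q = torch.randn(1, 1024, 1, 512, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

# On SM100+ this raises NotImplementedError: cutlassF is rejected by the
# capability check and cutlass_blackwell is never considered by dispatch.
out = xops.memory_efficient_attention(q, k, v)
```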

Was leaving cutlass_blackwell out of dispatch intentional? And if so, should there be a fallback path for unsupported configurations?
If this wasn't intentional, I'd be happy to submit a PR to wire up cutlass_blackwell in dispatch and add a fallback to the legacy cutlass path for the cases the new cutlass Blackwell kernels don't currently support.
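
Roughly the behaviour I have in mind, sketched in user code (falling back to plain PyTorch SDPA here, since legacy cutlass is currently blocked by the capability check); the real fix would of course live in the dispatch priority list rather than in user code:

```python
import torch
import torch.nn.functional as F
import xformers.ops as xops

def attention_with_fallback(q, k, v):
    # q, k, v in xformers layout (B, M, H, K).
    try:
        return xops.memory_efficient_attention(q, k, v)
    except NotImplementedError:
        # No xformers op available (e.g. fp32 or K > 128 on Blackwell today):
        # fall back to PyTorch SDPA, which expects (B, H, M, K).
        q_, k_, v_ = (t.transpose(1, 2) for t in (q, k, v))
        return F.scaled_dot_product_attention(q_, k_, v_).transpose(1, 2)
```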

This issue is possibly related to #1356.
