Description
I noticed that in PR #1332 the legacy cutlass operators were blocked on Blackwell (CUDA_MAXIMUM_COMPUTE_CAPABILITY = (9, 0)), but the new cutlass_blackwell operators weren't added to the dispatch priority list.
This means on SM100+ devices:
- cutlass_blackwell.FwOp exists but is never tried.
- cutlass.FwOp is blocked by the new capability check.
- There is no fallback for the cases cutlass_blackwell doesn't support (e.g. float32, or K > 128).
I ran into this on a Blackwell GPU running a Stable Diffusion pipeline with fp16 and K=512:

```
NotImplementedError: No operator found for memory_efficient_attention_forward with inputs: query : shape=(1, 1024, 1, 512) (torch.float16) ...cutlassF-pt is not supported because: requires device with capability <= (9, 0) but your GPU has capability (11, 0) (too new)
```
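For reference, a minimal repro sketch (hypothetical tensor setup that mirrors the failing call above; assumes an SM100+ GPU and a current xformers build):

```python
import torch
import xformers.ops as xops

# Same shape/dtype as the failing Stable Diffusion call: (B, M, H, K) = (1, 1024, 1, 512), fp16
q = torch.randn(1, 1024, 1, 512, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# On an SM100+ GPU this raises NotImplementedError: cutlassF-pt is rejected by the
# capability cap from #1332, and cutlass_blackwell is never considered by dispatch.
out = xops.memory_efficient_attention(q, k, v)
```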
Was leaving cutlass_blackwell out of dispatch intentional? If so, should there at least be a fallback path for the configurations the new kernels don't support?
If it wasn't intentional, I'd be happy to submit a PR to wire cutlass_blackwell into dispatch and add a fallback to legacy cutlass for the cases the new Blackwell kernels don't currently support; a rough sketch of the ordering I have in mind is below.
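To make the intent concrete, here is the priority I'm thinking of, written as a caller-side fallback helper rather than the actual dispatch change (the cutlass_blackwell module path is taken from the new op's location, and the exact exception types raised when an explicit `op=` rejects the input are an assumption on my part; the real change would go in xformers/ops/fmha/dispatch.py):

```python
import xformers.ops as xops
from xformers.ops.fmha import cutlass, cutlass_blackwell

def attention_with_fallback(q, k, v, **kwargs):
    """Sketch only: try the new Blackwell forward op first, then fall back."""
    candidates = [
        (cutlass_blackwell.FwOp, None),  # new SM100 kernels (fp16/bf16, K <= 128)
        (cutlass.FwOp, None),            # legacy cutlass (currently capped at SM90)
        None,                            # let xformers dispatch pick anything else
    ]
    last_err = None
    for op in candidates:
        try:
            return xops.memory_efficient_attention(q, k, v, op=op, **kwargs)
        except (NotImplementedError, ValueError) as err:
            last_err = err
    raise last_err
```

Note that the legacy cutlass fallback only helps if the SM90 cap from #1332 is relaxed (or scoped to the configs the Blackwell kernels do cover) on SM100+ devices, which is the other half of what such a PR would need to change.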
This issue is possibly related to #1356.