Description
I noticed that in PR #1332 the legacy cutlass operators were blocked on Blackwell (CUDA_MAXIMUM_COMPUTE_CAPABILITY = (9, 0)), but the new cutlass_blackwell operators weren't added to the dispatch priority list.
This means on SM100+ devices:
- cutlass_blackwell.FwOp exists but is never tried.
- cutlass.FwOp is blocked by the new capability check.
- There is no fallback for the cases cutlass_blackwell doesn't support (e.g. float32, or K > 128).
I ran into this on a Blackwell GPU running a Stable Diffusion pipeline with fp16 and K=512:

```
NotImplementedError: No operator found for memory_efficient_attention_forward with inputs: query : shape=(1, 1024, 1, 512) (torch.float16) ...cutlassF-pt is not supported because: requires device with capability <= (9, 0) but your GPU has capability (11, 0) (too new)
```
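For reference, a minimal repro sketch (hypothetical tensor setup that mirrors the failing call above; assumes an SM100+ GPU and a current xformers build):

```python
import torch
import xformers.ops as xops

# Same shape/dtype as the failing Stable Diffusion call: (B, M, H, K) = (1, 1024, 1, 512), fp16
q = torch.randn(1, 1024, 1, 512, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# On an SM100+ GPU this raises NotImplementedError: cutlassF-pt is rejected by the
# capability cap from #1332, and cutlass_blackwell is never considered by dispatch.
out = xops.memory_efficient_attention(q, k, v)
```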
Was leaving cutlass_blackwell out of dispatch intentional? If so, should there at least be a fallback path for the configurations the new kernels don't support?
If it wasn't intentional, I'd be happy to submit a PR to wire cutlass_blackwell into dispatch and add a fallback to legacy cutlass for the cases the new Blackwell kernels don't currently support; a rough sketch of the ordering I have in mind is below.
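To make the intent concrete, here is the priority I'm thinking of, written as a caller-side fallback helper rather than the actual dispatch change (the cutlass_blackwell module path is taken from the new op's location, and the exact exception types raised when an explicit `op=` rejects the input are an assumption on my part; the real change would go in xformers/ops/fmha/dispatch.py):

```python
import xformers.ops as xops
from xformers.ops.fmha import cutlass, cutlass_blackwell

def attention_with_fallback(q, k, v, **kwargs):
    """Sketch only: try the new Blackwell forward op first, then fall back."""
    candidates = [
        (cutlass_blackwell.FwOp, None),  # new SM100 kernels (fp16/bf16, K <= 128)
        (cutlass.FwOp, None),            # legacy cutlass (currently capped at SM90)
        None,                            # let xformers dispatch pick anything else
    ]
    last_err = None
    for op in candidates:
        try:
            return xops.memory_efficient_attention(q, k, v, op=op, **kwargs)
        except (NotImplementedError, ValueError) as err:
            last_err = err
    raise last_err
```

Note that the legacy cutlass fallback only helps if the SM90 cap from #1332 is relaxed (or scoped to the configs the Blackwell kernels do cover) on SM100+ devices, which is the other half of what such a PR would need to change.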
This issue is possibly related to #1356.