[Feature][Attention][UX]: Incorporate Features into Attention Selection #30654

@robertgshaw2-redhat

Description
🚀 The feature, motivation and pitch

SUMMARY:

  • we have default attention backends by priority and a notion of which backend supports what hw
  • however, certain features are not considered in this selection (e.g. fp8 KV cache, attention sinks)

Recent example: we hit test failures after updating the logic to load KV cache quantization from the model config. Because CUTLASS_MLA is the default backend on B200 and does not support fp8 KV cache, the tests failed instead of automatically falling back to FLASHINFER_MLA (which does support it).

So the proposal is to:

  • make sure all attention backends report what features are supported
  • update the attention selector to consider these features in the selection
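The proposed selection logic could look roughly like the sketch below. This is a hypothetical illustration, not the actual vLLM API: the `AttentionFeature` enum, `BackendInfo` dataclass, and per-backend feature sets are made up for this example; only the backend names and the fp8 KV cache scenario come from the issue.

```python
# Hypothetical sketch: each backend declares the features it supports,
# and the selector walks the priority list, skipping backends that
# cannot satisfy the required feature set.
from dataclasses import dataclass
from enum import Enum, auto


class AttentionFeature(Enum):
    FP8_KV_CACHE = auto()
    ATTENTION_SINKS = auto()


@dataclass(frozen=True)
class BackendInfo:
    name: str
    supported_features: frozenset


# Priority-ordered list; feature data here is illustrative only.
BACKENDS_BY_PRIORITY = [
    BackendInfo("CUTLASS_MLA", frozenset()),  # assumed: no fp8 KV cache
    BackendInfo("FLASHINFER_MLA", frozenset({AttentionFeature.FP8_KV_CACHE})),
]


def select_backend(required_features: frozenset) -> str:
    """Return the highest-priority backend supporting all required features."""
    for backend in BACKENDS_BY_PRIORITY:
        if required_features <= backend.supported_features:
            return backend.name
    raise ValueError("no attention backend supports the requested features")
```

With no required features the selector keeps the default (`CUTLASS_MLA`); once fp8 KV cache is required it falls back to `FLASHINFER_MLA`, which is exactly the behavior the test failures above were missing.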

Alternatives

No response

Additional context

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
