[Sparse Attention][Performance] Accelerate the performance of sparse attention + Benchmark #1397

We are working on supporting sparse attention in GluonNLP: https://github.com/dmlc/gluon-nlp/pull/1395. To accelerate the related kernels, we can benchmark the following candidate solutions:

- Use a block-sparse kernel to implement the operator.
  Candidate implementations to try:
  - https://github.com/openai/blocksparse
  - https://github.com/huggingface/pytorch_block_sparse
  - TVM Block Sparse: https://github.com/ceruleangu/Block-Sparse-Benchmark
- Directly implement window attention
  - Use CUTLASS and implement our own version
  - Use TVM + Ansor: https://tvm.apache.org/docs/tutorials/auto_scheduler/tune_conv2d_layer_cuda.html#sphx-glr-tutorials-auto-scheduler-tune-conv2d-layer-cuda-py
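
Whichever kernel we pick, the benchmark probably needs a slow but trustworthy reference to validate outputs against. Below is a minimal NumPy sketch of one, assuming single-head attention with a symmetric sliding window. All function names (`window_mask`, `dense_window_attention`, `banded_window_attention`, `block_layout`) are hypothetical and do not come from GluonNLP or from any of the libraries linked above. It also shows how the same window pattern could be collapsed into a block-level layout, which is roughly the kind of input a block-sparse kernel would consume.

```python
import numpy as np


def window_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean (seq_len, seq_len) mask that is True where |i - j| <= window."""
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window


def dense_window_attention(q, k, v, window):
    """Reference path: full QK^T, then mask. O(L^2) work, for validation only."""
    seq_len, dim = q.shape
    scores = (q @ k.T) / np.sqrt(dim)
    scores = np.where(window_mask(seq_len, window), scores, -np.inf)
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ v


def banded_window_attention(q, k, v, window):
    """Computes only the in-window scores: O(L * window) work."""
    seq_len, dim = q.shape
    out = np.empty_like(v)
    for i in range(seq_len):
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        s = (k[lo:hi] @ q[i]) / np.sqrt(dim)
        p = np.exp(s - s.max())
        out[i] = (p / p.sum()) @ v[lo:hi]
    return out


def block_layout(mask: np.ndarray, block: int) -> np.ndarray:
    """Collapse an element mask to a block-level layout.

    A block is marked True if any element inside it is kept.
    Assumes the sequence length is divisible by the block size.
    """
    seq_len = mask.shape[0]
    assert seq_len % block == 0, "seq_len must be divisible by block"
    n = seq_len // block
    return mask.reshape(n, block, n, block).any(axis=(1, 3))
```

A candidate kernel can then be checked with `np.allclose` against `dense_window_attention` on random inputs before timing it. The fraction of True entries in `block_layout(...)` gives a rough idea of how much work a block-sparse kernel would still do for a given window and block size.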

