Hi,
Congratulations on the great work. I'm very interested in Flash MoBA and have been running some tests on it. My main goal was to evaluate the impact of `moba_chunk_size` and `moba_topk` on GPU memory usage and computation time, and to compare the results against PyTorch's `scaled_dot_product_attention`.
I focused on the non-causal (causal=False) case and ran experiments on an H100 with:
batch_size = 1
nheads = 4
headdim = 128
causal = False
The test settings were:
seqlens = [1024, 2048, 4096, 8192, 16384, 32768, 65536, 131072, 262144, 524288, 1_000_000]
chunk_sizes = [64, 128, 256, 512, 1024]
topks = [2, 4, 8, 16, 32]

The results are summarized as follows:
- For a fixed sequence length, `moba_chunk_size` and `moba_topk` appear to have no impact on peak memory usage.
- For a fixed sequence length, larger `moba_chunk_size` and `moba_topk` lead to longer computation time.
- Flash MoBA is faster than SDPA.
- However, Flash MoBA consumes more memory than SDPA.
The test code is as follows:
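A simplified version of the measurement loop, showing only the SDPA baseline (the Flash MoBA side is analogous; its exact call signature is not shown here, so substituting it in is left as an assumption), looks like this. On the H100 you would pass `device="cuda"`; the CUDA-specific peak-memory calls are guarded so the sketch also runs on CPU:

```python
import time
import torch
import torch.nn.functional as F

def bench_sdpa(seqlen, batch_size=1, nheads=4, headdim=128, device="cpu"):
    # Random Q/K/V in the layout SDPA expects: (batch, heads, seqlen, headdim).
    q = torch.randn(batch_size, nheads, seqlen, headdim, device=device)
    k = torch.randn_like(q)
    v = torch.randn_like(q)

    if device.startswith("cuda"):
        torch.cuda.reset_peak_memory_stats(device)
        torch.cuda.synchronize(device)
    t0 = time.perf_counter()
    out = F.scaled_dot_product_attention(q, k, v, is_causal=False)
    if device.startswith("cuda"):
        torch.cuda.synchronize(device)
    elapsed_ms = (time.perf_counter() - t0) * 1e3

    # Peak allocator usage is only tracked on CUDA devices.
    peak_mb = (torch.cuda.max_memory_allocated(device) / 2**20
               if device.startswith("cuda") else float("nan"))
    return out, elapsed_ms, peak_mb
```

Each `(seqlen, chunk_size, topk)` configuration from the lists above is then benchmarked in a loop, recording `elapsed_ms` and `peak_mb` for both kernels.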
In addition, could you please help clarify the following questions:
- In the original [MoBA](https://github.com/MoonshotAI/MoBA/blob/master/moba/moba_efficient.py) implementation, the final output is obtained by combining sparse attention and self-attention via online softmax: self-attention performs local attention within each chunk (each token attends to previous tokens inside its own chunk), while sparse attention computes top-k cross-chunk attention (selected tokens attend to the top-k most relevant chunks). Does Flash MoBA include only the top-k cross-chunk attention, without the local self-attention component?
- When `seqlen_k` is not divisible by `moba_chunk_size`, how is the tail chunk handled?
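For context on the first question, the online-softmax combination I mean can be sketched as merging two partial attention outputs via their log-sum-exp statistics. The function below is illustrative only, not MoBA's actual API:

```python
import torch

def merge_attention(out1, lse1, out2, lse2):
    # Merge two partial attention results computed over disjoint key sets.
    # out*: (..., seqlen_q, headdim), each already normalized by its own
    #       softmax denominator.
    # lse*: (..., seqlen_q), the log-sum-exp of the corresponding logits.
    lse = torch.logaddexp(lse1, lse2)          # combined denominator, in log space
    w1 = torch.exp(lse1 - lse).unsqueeze(-1)   # weight of partial result 1
    w2 = torch.exp(lse2 - lse).unsqueeze(-1)   # weight of partial result 2
    return w1 * out1 + w2 * out2
```

Because each partial output is rescaled by its share of the combined denominator, the merged result is mathematically identical to running softmax attention over the union of the two key sets.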
Thank you very much!