
Flash MoBA consumes more memory than PyTorch’s SDPA #7

@JingangQu

Description

Hi,

Congratulations on the great work. I'm very interested in Flash MoBA and have been running some tests on it. My main goal was to evaluate the impact of moba_chunk_size and moba_topk on GPU memory usage and computation time, and to compare the results against PyTorch's scaled_dot_product_attention.

I focused on the non-causal (causal=False) case and ran experiments on an H100 with:

  • batch_size = 1
  • nheads = 4
  • headdim = 128
  • causal = False

The test settings were:

seqlens = [1024, 2048, 4096, 8192, 16384, 32768, 65536, 131072, 262144, 524288, 1_000_000]
chunk_sizes = [64, 128, 256, 512, 1024]
topks = [2, 4, 8, 16, 32]
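
To relate this sweep to the tail-chunk question further down, the chunk layout of each configuration can be computed directly. This is a minimal plain-Python sketch, not Flash MoBA code: the helper names `chunk_layout` and `keys_per_query` are hypothetical, and it assumes that a partial tail chunk counts as its own chunk and that each query attends to at most `topk` chunks of `moba_chunk_size` keys (any extra local chunk is ignored).

```python
import math

# Test grid copied from the issue.
seqlens = [1024, 2048, 4096, 8192, 16384, 32768, 65536, 131072, 262144, 524288, 1_000_000]
chunk_sizes = [64, 128, 256, 512, 1024]
topks = [2, 4, 8, 16, 32]


def chunk_layout(seqlen_k: int, chunk_size: int) -> tuple[int, int]:
    """Return (number of chunks, size of the last chunk).

    Assumption: a partial tail chunk is counted as a chunk of its own.
    """
    n_chunks = math.ceil(seqlen_k / chunk_size)
    tail = seqlen_k - (n_chunks - 1) * chunk_size
    return n_chunks, tail


def keys_per_query(seqlen_k: int, chunk_size: int, topk: int) -> int:
    """Upper bound on keys one query attends to: topk full chunks,
    clamped to the sequence length (assumption: no extra local chunk)."""
    return min(topk * chunk_size, seqlen_k)


# Configurations in the grid where seqlen_k is not a multiple of chunk_size.
ragged = [
    (s, c, chunk_layout(s, c)[1])
    for s in seqlens
    for c in chunk_sizes
    if s % c != 0
]
for s, c, tail in ragged:
    print(f"seqlen={s:>9}  chunk={c:>5}  tail chunk has {tail} keys")

# Upper bound on the fraction of keys each query touches at the longest length.
for c in chunk_sizes:
    frac = keys_per_query(1_000_000, c, 32) / 1_000_000
    print(f"chunk={c:>5}  topk=32  -> at most {frac:.2%} of keys per query")
```

In this grid only `seqlen = 1_000_000` is not a multiple of the chunk size: with chunk sizes 128, 256, and 512 the tail chunk holds 64 keys, and with 1024 it holds 576. Under the stated assumption the per-query key budget is `topk * chunk_size`, which is consistent with computation time growing with both parameters.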

The results are summarized as follows:

  1. For a fixed sequence length, moba_chunk_size and moba_topk appear to have no impact on peak memory usage.
  2. For a fixed sequence length, larger moba_chunk_size and moba_topk lead to longer computation time.
  3. Flash MoBA is faster than SDPA.
  4. However, Flash MoBA consumes more memory than SDPA.
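
A useful reference point for the memory comparison is the floor that any attention kernel pays for its inputs and output. The sketch below is back-of-the-envelope arithmetic, assuming bf16 tensors of shape `(batch, seqlen, nheads, headdim)` and a forward pass only; it ignores softmax statistics, gating buffers, and allocator overhead. The helper names are hypothetical. Peak memory reported by `torch.cuda.max_memory_allocated` above this floor comes from intermediate buffers or other allocations in the benchmark script.

```python
BYTES = {"bf16": 2, "fp16": 2, "fp32": 4}
GiB = 1024 ** 3


def io_tensor_bytes(batch, seqlen, nheads, headdim, dtype="bf16"):
    """Bytes for q, k, v and the output o, each of shape
    (batch, seqlen, nheads, headdim). Every attention kernel needs these."""
    return 4 * batch * seqlen * nheads * headdim * BYTES[dtype]


def dense_scores_bytes(batch, seqlen, nheads, dtype="bf16"):
    """Bytes for a fully materialized seqlen x seqlen score matrix per head,
    which fused (FlashAttention-style) kernels avoid storing."""
    return batch * nheads * seqlen * seqlen * BYTES[dtype]


# Settings from the issue: batch_size=1, nheads=4, headdim=128.
for s in [8192, 131072, 1_000_000]:
    io = io_tensor_bytes(1, s, 4, 128) / GiB
    dense = dense_scores_bytes(1, s, 4) / GiB
    print(f"seqlen={s:>9}: q/k/v/o = {io:8.3f} GiB, dense scores = {dense:12.1f} GiB")
```

At `seqlen = 1_000_000`, q/k/v/o alone take 4,096,000,000 bytes (about 3.8 GiB), while a materialized bf16 score matrix would need 8e12 bytes. Any SDPA run that completes at the longer lengths is therefore using a fused backend rather than the math path, so the comparison is effectively between two fused kernels' workspace.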

The test code is as follows:

test_moba_memory_benchmark.py

In addition, could you please help clarify the following questions:

  1. In the original [MoBA](https://github.com/MoonshotAI/MoBA/blob/master/moba/moba_efficient.py) implementation, the final output is obtained by combining sparse attention and self-attention via online softmax. Self-attention performs local attention within each chunk (each token attends to the preceding tokens in its own chunk), while sparse attention computes top-k cross-chunk attention (selected tokens attend to their top-k most relevant chunks).
    Does Flash MoBA include only the top-k cross-chunk attention, without the local self-attention component?

  2. When seqlen_k is not divisible by moba_chunk_size, how is the tail chunk handled?
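
The online-softmax combination described in question 1 can be illustrated for a single query: each branch (the local chunk and the top-k chunks) returns its output together with the log-sum-exp (LSE) of its scaled scores, and the two outputs are merged with LSE-derived weights. This is a plain-Python sketch of the general technique, not the code in `moba_efficient.py` or Flash MoBA; `partial_attention` and `merge` are hypothetical names, and it assumes the two key sets are disjoint.

```python
import math


def partial_attention(q, keys, values, scale):
    """Softmax attention of one query over a subset of keys.
    Returns (output vector, log-sum-exp of the scaled scores)."""
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    out = [sum(w * v[d] for w, v in zip(weights, values)) / denom
           for d in range(len(values[0]))]
    return out, m + math.log(denom)


def merge(o1, lse1, o2, lse2):
    """Combine two partial results over disjoint key sets (online-softmax merge)."""
    m = max(lse1, lse2)
    lse = m + math.log(math.exp(lse1 - m) + math.exp(lse2 - m))
    w1, w2 = math.exp(lse1 - lse), math.exp(lse2 - lse)
    return [w1 * a + w2 * b for a, b in zip(o1, o2)], lse


# Toy example: 2 keys in the local chunk, 3 keys from the selected chunks.
q = [0.5, -1.0, 0.25, 2.0]
local_k = [[1.0, 0.0, 0.5, 0.2], [0.3, -0.7, 1.0, 0.9]]
local_v = [[1.0, 2.0], [0.0, -1.0]]
sparse_k = [[-0.4, 0.8, 0.1, 1.5], [2.0, 0.1, -0.3, 0.0], [0.6, 0.6, 0.6, 0.6]]
sparse_v = [[3.0, 0.5], [-2.0, 1.0], [0.25, 0.75]]
scale = 1 / math.sqrt(len(q))

o_local, lse_local = partial_attention(q, local_k, local_v, scale)
o_sparse, lse_sparse = partial_attention(q, sparse_k, sparse_v, scale)
merged, _ = merge(o_local, lse_local, o_sparse, lse_sparse)

# Reference: plain attention over the union of both key sets.
reference, _ = partial_attention(q, local_k + sparse_k, local_v + sparse_v, scale)
print(merged)
print(reference)
```

Because the merge is exact, the merged output matches plain attention over the union of the two key sets (up to floating-point rounding). Whether Flash MoBA performs this step, or instead folds the local chunk into the top-k selection, is what the question asks.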

Thank you very much!
