I noticed that the term "MODEL1" appears several times in this repository:
FlashMLA/csrc/sm100/decode/head64/kernel.cuh
Line 554 in 48c6dc4
// MODEL1: RoPE is the last 64 dims within the full 512 dim, which couples with the last 64 dim from the NoPE part when performing dual GEMM. i.e.
// The following fields are only valid for MODEL1
I have some questions about this novel term. @interestingLSY Could you please take a look?
1. Are there any publicly available models that belong to the "MODEL1" category?
2. What are the major differences in the attention mechanism between "MODEL1" and V3.2?
Update: My preliminary guess for the second question is the following:
- The top-k length is dynamic in MODEL1, while it is constant (2048) in V3.2.
- MODEL1 introduces additional KV states that are not generated from the history tokens of the current sequence.
- In MODEL1, the KV cache holds 448 features without position encoding (NoPE), down from 512 in V3.2.
- The position-encoded (RoPE) features are also included when recovering the V cache.
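
To make the dimension guesses concrete, here is a minimal Python sketch of the per-token layout I have in mind. It only restates the kernel comment (64 RoPE dims at the end of a 512-dim vector) and my guesses above; the names `split_kv_entry` and `recover_v` are hypothetical and do not exist in FlashMLA, and none of this is confirmed by the maintainers.

```python
# Illustrative only: the constants restate the kernel comment and the guesses
# in this issue. They are not values confirmed by the FlashMLA authors.
HEAD_DIM_TOTAL = 512                   # "the full 512 dim" from the kernel comment
ROPE_DIM = 64                          # "RoPE is the last 64 dims"
NOPE_DIM = HEAD_DIM_TOTAL - ROPE_DIM   # 448, the guessed NoPE width for MODEL1


def split_kv_entry(entry):
    """Split one cached KV vector into (nope, rope) parts.

    Hypothetical helper: assumes the RoPE features occupy the trailing
    ROPE_DIM positions of a HEAD_DIM_TOTAL-wide vector.
    """
    if len(entry) != HEAD_DIM_TOTAL:
        raise ValueError(f"expected {HEAD_DIM_TOTAL} features, got {len(entry)}")
    return entry[:NOPE_DIM], entry[NOPE_DIM:]


def recover_v(entry):
    """Return the V vector under the guess that V spans all 512 dims.

    Hypothetical helper: under this guess the RoPE part is not stripped
    when V is recovered from the cache.
    """
    nope, rope = split_kv_entry(entry)
    return list(nope) + list(rope)


if __name__ == "__main__":
    kv = list(range(HEAD_DIM_TOTAL))   # dummy features 0..511
    nope, rope = split_kv_entry(kv)
    print(len(nope), len(rope), len(recover_v(kv)))
```

If the guess is wrong and V only covers the NoPE part, `recover_v` would return just the first 448 features instead.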