[Question] What is MODEL1?

I found that the term "MODEL1" appears several times in this repository:
https://github.com/deepseek-ai/FlashMLA/blob/48c6dc426f045cb7743b18f5c7329f35f1b7ed79/csrc/sm100/decode/head64/kernel.cuh#L554
https://github.com/deepseek-ai/FlashMLA/blob/48c6dc426f045cb7743b18f5c7329f35f1b7ed79/csrc/sm90/decode/sparse_fp8/splitkv_mla.cuh#L161

I have some questions about this novel term. @interestingLSY Could you please take a look?

1. Are there any publicly available models that belong to the "MODEL1" category?

2. What are the major differences in the attention mechanism between "MODEL1" and V3.2?


Update: My preliminary guess for the 2nd question is the following:

 1. The top-k length is dynamic in MODEL1, while it is constant (2048) in V3.2.
 2. MODEL1 introduces additional KV states that are not generated from history tokens in the current sequence.
 3. The number of KV cache features without position encoding is 448 in MODEL1, reduced from 512 in V3.2.
 4. Position encoding is also included when recovering the V cache.
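
To make the guess concrete, here is a minimal Python sketch of the two layouts as I currently read them. Everything in it is an assumption for illustration: the `KVLayoutGuess` class and its field names are hypothetical and do not come from the FlashMLA code, the 64-dimensional RoPE part for V3.2 is taken from the public DeepSeek-V3 config, and I am only assuming MODEL1 keeps the same RoPE width.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class KVLayoutGuess:
    """Hypothetical summary of a per-token KV cache layout (not a FlashMLA type)."""
    name: str
    nope_dim: int            # KV cache features without position encoding
    rope_dim: int            # KV cache features carrying position encoding
    v_includes_rope: bool    # guess 4: is V recovered from nope + rope features?
    topk: Optional[int]      # fixed sparse top-k length, or None if dynamic

    @property
    def kv_dim(self) -> int:
        # Total features stored per token in the KV cache.
        return self.nope_dim + self.rope_dim

    @property
    def v_dim(self) -> int:
        # Width of the recovered V under the stated guess.
        return self.kv_dim if self.v_includes_rope else self.nope_dim


# V3.2: 512 features without position encoding, constant top-k of 2048.
V32 = KVLayoutGuess("V3.2", nope_dim=512, rope_dim=64,
                    v_includes_rope=False, topk=2048)

# MODEL1: values below are my guesses; rope_dim=64 is an unverified assumption.
MODEL1 = KVLayoutGuess("MODEL1 (guess)", nope_dim=448, rope_dim=64,
                       v_includes_rope=True, topk=None)

for layout in (V32, MODEL1):
    print(layout.name, "kv_dim =", layout.kv_dim, "v_dim =", layout.v_dim,
          "topk =", "dynamic" if layout.topk is None else layout.topk)
```

If the RoPE width really is unchanged, guesses 3 and 4 together would keep V at the same width as in V3.2 while shrinking the per-token KV cache, but that is speculation until a maintainer confirms it.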

[Question] What is MODEL1? #155
