Mimo-V2-Flash support #1096
Conversation
What do you get on hybrid? I wanted to grab Q3/Q4 and was hoping the smaller number of active params would let it reason at reasonable speeds. The quant is gonna take the rest of the weekend to download :(
It still does not solve the Mimo-2 quantized cache issue.
If I use -khad for the KV cache:
Then I get that. Otherwise, it works.
OK, so, my system: AMD 7800X3D, 8 cores.

Command: `build/bin/llama-server -m models/MiMo-V2-Flash-IQ3_XS.gguf -ot "blk.(?:[0-9]|[1-3][0-9]|[4][0]).ffn.*=CPU" -c 32768 -b 8192 -ub 8192 -ctk q8_0 -ctv q8_0 --threads 7 -ngl 95 -sp -amb 512 --host 0.0.0.0 --port 8080 --webui none --repeat-last-n 2048 -mqkv --jinja`

Performance: this is slower than MiniMax M2.1, which with the same settings gives me about 15 t/s. Is MTP working?

Also, the model doesn't think, which is a problem because without thinking this model is kind of dumb. In SillyTavern I have the thinking setting for chat completion set to maximum, but it doesn't seem to work.

Edit: OK, the model clearly has coherence problems. It's overall quite nonsensical, no matter the context size.

Edit 2: Apparently the first layer is dense, so my `-ot` becomes `-ot "blk.(?:[1-9]|[1-3][0-9]|[4][0]).ffn.*_exps.*=CPU"`.
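For readability, here is the same command with the revised `-ot` pattern from Edit 2 substituted in; everything else is copied verbatim from the command above, so treat this as a sketch of that particular setup rather than a recommended configuration:

```bash
build/bin/llama-server -m models/MiMo-V2-Flash-IQ3_XS.gguf \
    -ot "blk.(?:[1-9]|[1-3][0-9]|[4][0]).ffn.*_exps.*=CPU" \
    -c 32768 -b 8192 -ub 8192 -ctk q8_0 -ctv q8_0 --threads 7 -ngl 95 \
    -sp -amb 512 --host 0.0.0.0 --port 8080 --webui none \
    --repeat-last-n 2048 -mqkv --jinja
```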
Neither
Haven't you learned yet that in the age of LLMs, everybody shamelessly and massively exaggerates the utility of the thing that they have done?
It looks like there is still an issue with SWA. I'm looking into it.
Something is not quite right, so I converted this to a draft.
OK, PPL is the same as mainline (actually it is slightly lower). I checked for a few context lengths, and it is fine. If I had a bug in the SWA attention mask preparation, one would see it in PPL.

The issue is that, when generating, after a while the model starts endlessly repeating the same thing again and again. I thought there was an issue with my implementation, but I have now observed the exact same behavior in mainline. The probability of endless repetition appears to be very sensitive to the temperature. So, my best guess at this point is that my implementation is fine, and the endless repetitions come from the model itself (and/or the sampling settings) rather than from this PR.

So, I'll remove the draft status. Would appreciate test reports from more users.
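For anyone who wants to reproduce the PPL comparison against mainline, a run along the following lines should work; the test file, context length, and quant here are my assumptions, not values taken from this PR:

```bash
# Run the same text file and context length with both builds; only the binary differs.
./build/bin/llama-perplexity -m models/MiMo-V2-Flash-IQ2_XXS.gguf \
    -f wikitext-2-raw/wiki.test.raw -c 2048 -ngl 99 -fa
```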
I think there's something off again. The model is definitely better, but it's still a little incoherent, and it doesn't follow simple prompts consistently, like keeping the answer under 100 words (it's super yappy). The weirdest thing is that I'm getting 16 t/s at 29000 tokens of context, but only 12 t/s at 12000 tokens of context. Also, we need a way to turn thinking on and off.
This PR adds support for Mimo-V2-Flash (https://huggingface.co/XiaomiMiMo/MiMo-V2-Flash), and closes #1076.
Unlike the mainline PR 18328, which does not support flash attention (FA), this PR does support FA.
Split mode "graph" is not supported for now. It turns out my splitting logic for the attention tensors only works when the K and V attention head sizes are the same, which is not true for Mimo-V2. So, this will have to be a follow-up PR. Also, I did not add support for HF->GGUF conversion, so mainline will need to be used for that.
Another limitation of this PR is that the quantized KV cache cannot be used on CUDA (we get NaNs). It works fine on the CPU, so I will need to investigate why the quantized KV cache fails on CUDA. Fixed with the latest commit.

The other caveat is that the large saving in KV cache size that would be possible due to the aggressive SWA used by Mimo-V2 is not realized, so here mainline has an advantage.
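With that fix in place, the quantized KV cache can be combined with FA on CUDA via the usual flags; the sketch below is illustrative only (model path, context size, and layer count are placeholders, not values tested in this PR):

```bash
./build/bin/llama-server -m models/MiMo-V2-Flash-IQ2_XXS.gguf \
    -fa -ctk q8_0 -ctv q8_0 -c 32768 -ngl 99
```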
On the other hand, because mainline does not support FA for Mimo-V2, I was still able to go to a much larger context here than with mainline. I downloaded the IQ2_XXS quantization from Bartowski. I picked that one so that I can use full GPU offload on the 4x3090 system. With mainline the best I could do before OOM was a context of 8192 with a u-batch size of 1024. With ik_llama.cpp I can go up to a context of 32k tokens using a u-batch size of 2048. Correspondingly, performance here is quite a bit better than over there (see the sweep bench results below).

CPU-only performance is quite decent: I get 115 t/s for PP-2048 and 21.8 t/s for TG-128 on a Ryzen 3995WX CPU.
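The tables below come from sweep bench runs. The exact invocation is not given here, but it would look roughly like the following; the u-batch size matches the 2048 mentioned above, while the remaining arguments are placeholders:

```bash
./build/bin/llama-sweep-bench -m models/MiMo-V2-Flash-IQ2_XXS.gguf \
    -c 32768 -b 2048 -ub 2048 -ngl 99 -fa
```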
ik_llama.cpp, Mimo-V2-Flash, IQ2_XXS, 4x3090
llama.cpp, Mimo-V2-Flash, IQ2_XXS, 4x3090