Note: All information below is for Q4_K_M and peak llama.cpp token generation performance unless otherwise stated.
See 2).
Use Ollama if you want a simple setup. Be sure to pull the deepseek-r1:671b tag specifically; the smaller tags are distilled models, not the full R1. Alternatively, you can download GGUF quants yourself and run them with llama-server.
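For example, a minimal sketch (the GGUF file name, part count, and thread count are placeholders; adjust them for your download and hardware):

```bash
# Ollama: pull and run the full 671B model
ollama pull deepseek-r1:671b
ollama run deepseek-r1:671b

# llama.cpp: serve a downloaded GGUF quant instead
./llama-server -m DeepSeek-R1-Q4_K_M-00001-of-00009.gguf \
    --ctx-size 2048 \
    --threads 32
```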
To an extent, yes, you can (with small quants, e.g. IQ1 or Q2). The experience is nowhere near "great" (around 1 t/s), but it works if you allow the weights to be read from disk on demand. See https://www.reddit.com/r/LocalLLaMA/comments/1idseqb/deepseek_r1_671b_over_2_toksec_without_gpu_on/. Use the fastest disk you can.
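A minimal sketch of the disk-backed approach (the GGUF file name is a placeholder): llama.cpp memory-maps the model by default, so the OS pages weights in from disk as needed; just don't pass `--no-mmap`.

```bash
# Weights are mmap'd by default, so anything that doesn't fit in RAM
# is read from disk on demand (slow, but it runs).
./llama-cli -m DeepSeek-R1-IQ1_S-00001-of-00003.gguf \
    -p "Why is the sky blue?" \
    -n 256
```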
No.
Model | Minimum VRAM (GB) | Recommended VRAM (GB) | Minimum RAM (GB) | Recommended RAM (GB) | KV cache size (MiB, 2k context) | Example hardware
---|---|---|---|---|---|---
671b | 480 | 640 | 512 | 768 | 9760 | 8x H100
For max context (160K), you'll need 762.5 GiB of additional RAM/VRAM: the KV cache grows linearly with context length, so 80 × 9760 MiB ≈ 762.5 GiB.
Expect about 4-6 tokens/s with DDR5 and 3-4 tokens/s with DDR4.
With multi-channel RAM (12+ channels, i.e. a 768-bit or wider memory bus) you can reach >= 8 t/s peak (measured with Q4_K_S only, sorry) if your RAM is clocked high enough. It should be usable if you have enough RAM of that kind.
If you're willing to wait a really long time for prompt processing, then maybe.
See https://digitalspaceport.com/how-to-run-deepseek-r1-671b-fully-locally-on-2000-epyc-rig/ .
Around 30 to 40 tokens per second is expected.
- Use a fast enough GPU.
- Use faster RAM (obvious).
- Lower the number of used experts (used_expert_count) below the default of 8. You might get worse results with too few experts.
- Quantize the KV cache (see the example after this list).
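A sketch of the last two items, assuming a recent llama.cpp build (the model file name is a placeholder, and `deepseek2.expert_used_count` is the GGUF key this assumes for the expert-count override):

```bash
# Quantize the K cache to q8_0 and route 6 experts per token instead of
# the default 8 (expect some quality loss); offload a few layers with
# -ngl if a GPU is available.
./llama-server -m DeepSeek-R1-Q4_K_M-00001-of-00009.gguf \
    --cache-type-k q8_0 \
    --override-kv deepseek2.expert_used_count=int:6 \
    -ngl 8
```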
Either use a GPU or add more CPU cores. Prompt processing is compute-bound, and CPUs have far less compute throughput than GPUs, so they handle it poorly.
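If you are CPU-only, one knob worth trying is a separate, higher thread count for batch/prompt processing; a minimal sketch (model path and thread counts are placeholders):

```bash
# Use more threads for prompt (batch) processing than for generation.
./llama-server -m DeepSeek-R1-Q4_K_M-00001-of-00009.gguf \
    --threads 16 \
    --threads-batch 32
```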
See 1). You can run distributed inference, which should be okay if you have enough RAM bandwidth.
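One way to do this is llama.cpp's RPC backend; a minimal sketch, assuming llama.cpp was built with the RPC backend enabled (hostnames, ports, and the model file below are placeholders):

```bash
# On each worker node: expose the local backend over RPC.
./rpc-server --host 0.0.0.0 --port 50052

# On the main node: shard the model across the workers.
./llama-server -m DeepSeek-R1-Q4_K_M-00001-of-00009.gguf \
    --rpc 192.168.1.10:50052,192.168.1.11:50052
```

The RPC link then becomes another potential bottleneck, so keep the nodes on a fast local network.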
See 3). You'll need server/workstation CPUs to provide enough PCIe lanes for that many GPUs.
If your network is really fast, you can download the model into shared memory (tmpfs, e.g. /dev/shm) and run it directly from there. This works best on Kaggle (although you shouldn't run models on Kaggle's CPUs).
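A minimal sketch, assuming a Linux machine with enough free space in /dev/shm (the Hugging Face repo, include pattern, and file name are placeholders):

```bash
# Download a small quant straight into shared memory (tmpfs), then serve it from there.
huggingface-cli download unsloth/DeepSeek-R1-GGUF \
    --include "*Q2_K*" \
    --local-dir /dev/shm/DeepSeek-R1-GGUF
./llama-server -m /dev/shm/DeepSeek-R1-GGUF/DeepSeek-R1-Q2_K-00001-of-00005.gguf
```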
llama.cpp currently uses standard multi-head attention for this model; see ggml-org/llama.cpp#11446. Once MLA (multi-head latent attention) is implemented, it should reduce the KV cache size, at the cost of slower prompt processing.
This is an observed behaviour; see https://huggingface.co/deepseek-ai/DeepSeek-R1#usage-recommendations for more information.