Note: All information below is for Q4_K_M and peak llama.cpp token generation performance unless otherwise stated.
See 2).
Use Ollama if you want a simple setup. Be sure to pull the deepseek-r1:671b tag specifically; the smaller tags are distilled models, not the full R1. Alternatively, you can download GGUF quants yourself and run them with llama-server.
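For example, a minimal sketch (the GGUF file name, part count, and thread count are placeholders; adjust them for your download and hardware):

```bash
# Ollama: pull and run the full 671B model
ollama pull deepseek-r1:671b
ollama run deepseek-r1:671b

# llama.cpp: serve a downloaded GGUF quant instead
./llama-server -m DeepSeek-R1-Q4_K_M-00001-of-00009.gguf \
    --ctx-size 2048 \
    --threads 32
```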
To an extent, yes, you can (with small quants, e.g. IQ1 or Q2). The experience is nowhere near "great" (around 1 t/s), but it works if you allow the weights to be read from disk on demand. See https://www.reddit.com/r/LocalLLaMA/comments/1idseqb/deepseek_r1_671b_over_2_toksec_without_gpu_on/. Use the fastest disk you can.
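A minimal sketch of the disk-backed approach (the GGUF file name is a placeholder): llama.cpp memory-maps the model by default, so the OS pages weights in from disk as needed; just don't pass `--no-mmap`.

```bash
# Weights are mmap'd by default, so anything that doesn't fit in RAM
# is read from disk on demand (slow, but it runs).
./llama-cli -m DeepSeek-R1-IQ1_S-00001-of-00003.gguf \
    -p "Why is the sky blue?" \
    -n 256
```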
No.
Model | Minimum VRAM (GB) | Recommended VRAM (GB) | Minimum RAM (GB) | Recommended RAM (GB) | KV cache size (MiB, 2k context) | Example hardware
---|---|---|---|---|---|---
671b | 480 | 640 | 512 | 768 | 9760 | 8x H100
For max context (160K), you'll need 762.5 GiB of additional RAM/VRAM: the KV cache grows linearly with context length, so 80 × 9760 MiB ≈ 762.5 GiB.
Expect about 4-6 tokens/s with DDR5 and 3-4 tokens/s with DDR4.
With multi-channel RAM (12+ channels, i.e. a 768-bit or wider memory bus) you can reach >= 8 t/s peak (measured with Q4_K_S only, sorry) if your RAM is clocked high enough. It should be usable if you have enough RAM of that kind.
If you're willing to wait a really long time for prompt processing, then maybe.
See https://digitalspaceport.com/how-to-run-deepseek-r1-671b-fully-locally-on-2000-epyc-rig/ .
Around 30 to 40 tokens per second is expected.
- Use a fast enough GPU.
- Use faster RAM (obvious).
- Lower the number of used experts (used_expert_count) below the default of 8. You might get worse results with too few experts.
- Quantize the KV cache (see the example after this list).
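A sketch of the last two items, assuming a recent llama.cpp build (the model file name is a placeholder, and `deepseek2.expert_used_count` is the GGUF key this assumes for the expert-count override):

```bash
# Quantize the K cache to q8_0 and route 6 experts per token instead of
# the default 8 (expect some quality loss); offload a few layers with
# -ngl if a GPU is available.
./llama-server -m DeepSeek-R1-Q4_K_M-00001-of-00009.gguf \
    --cache-type-k q8_0 \
    --override-kv deepseek2.expert_used_count=int:6 \
    -ngl 8
```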
Either use a GPU or add more CPU cores. Prompt processing is compute-bound, and CPUs have far less compute throughput than GPUs, so they handle it poorly.
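If you are CPU-only, one knob worth trying is a separate, higher thread count for batch/prompt processing; a minimal sketch (model path and thread counts are placeholders):

```bash
# Use more threads for prompt (batch) processing than for generation.
./llama-server -m DeepSeek-R1-Q4_K_M-00001-of-00009.gguf \
    --threads 16 \
    --threads-batch 32
```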
See 1). You can run distributed inference, which should be okay if you have enough RAM bandwidth.
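One way to do this is llama.cpp's RPC backend; a minimal sketch, assuming llama.cpp was built with the RPC backend enabled (hostnames, ports, and the model file below are placeholders):

```bash
# On each worker node: expose the local backend over RPC.
./rpc-server --host 0.0.0.0 --port 50052

# On the main node: shard the model across the workers.
./llama-server -m DeepSeek-R1-Q4_K_M-00001-of-00009.gguf \
    --rpc 192.168.1.10:50052,192.168.1.11:50052
```

The RPC link then becomes another potential bottleneck, so keep the nodes on a fast local network.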
See 3). You'll need server/workstation CPUs to provide enough PCIe lanes for that many GPUs.
If your network is really fast, you can download the model into shared memory (tmpfs, e.g. /dev/shm) and run it directly from there. This works best on Kaggle (although you shouldn't run models on Kaggle's CPUs).
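A minimal sketch, assuming a Linux machine with enough free space in /dev/shm (the Hugging Face repo, include pattern, and file name are placeholders):

```bash
# Download a small quant straight into shared memory (tmpfs), then serve it from there.
huggingface-cli download unsloth/DeepSeek-R1-GGUF \
    --include "*Q2_K*" \
    --local-dir /dev/shm/DeepSeek-R1-GGUF
./llama-server -m /dev/shm/DeepSeek-R1-GGUF/DeepSeek-R1-Q2_K-00001-of-00005.gguf
```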
llama.cpp currently uses standard multi-head attention for this model; see ggml-org/llama.cpp#11446. Once MLA (multi-head latent attention) is implemented, it should reduce the KV cache size, at the cost of slower prompt processing.
This is an observed behaviour; see https://huggingface.co/deepseek-ai/DeepSeek-R1#usage-recommendations for more information.