Note that the throughput per individual request is 6-7 tokens/s.
-
I deployed the bf16 DeepSeek-R1 model using 32 A800 GPUs with the following command:

However, the token usage stays consistently low (around 0.02) and the throughput is only about 18 tokens/s, while GPU utilization remains at 100%, which suggests more is going on than just partial parameter activation per token.
Could you advise whether there are specific hyperparameters I should configure to address this issue?
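One way to separate the aggregate server-log numbers from what a single request actually sees is to time a streaming completion at the client. Below is a minimal sketch that assumes the server exposes an OpenAI-compatible endpoint; the base URL, the served model name, and the "one streamed chunk ≈ one token" approximation are assumptions for illustration, not details taken from this thread.

```python
# Minimal sketch: measure per-request decode throughput against an
# assumed OpenAI-compatible endpoint. base_url, model name, and the
# chunk~token approximation are assumptions.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

start = time.perf_counter()
first_token_time = None
n_chunks = 0

stream = client.chat.completions.create(
    model="deepseek-r1",  # hypothetical served model name
    messages=[{"role": "user", "content": "Explain the KV cache in one paragraph."}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    if not chunk.choices:  # some servers send a final chunk with no choices
        continue
    delta = chunk.choices[0].delta.content
    if delta:
        if first_token_time is None:
            first_token_time = time.perf_counter()  # time to first token
        n_chunks += 1
end = time.perf_counter()

decode_time = end - (first_token_time or start)
print(f"time to first token: {(first_token_time or end) - start:.2f} s")
print(f"~{n_chunks} chunks in {decode_time:.2f} s "
      f"-> ~{n_chunks / max(decode_time, 1e-9):.1f} tokens/s for this request")
```

If the per-request number here sits near the 6-7 tokens/s mentioned above while the aggregate stays around 18 tokens/s, the limiting factor is more likely per-token latency than batching, though that reading depends on the actual launch configuration, which is not shown in this thread.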