[v0.12.5] Vulkan 2 x NVIDIA GeForce RTX 4090 24 GB #202

b4rtaz · 2025-04-19T22:26:12Z

b4rtaz
Apr 19, 2025
Maintainer

Model	1 x GeForce RTX 4090 24 GB	2 x GeForce RTX 4090 24 GB
Llama 3.1 8B Q40 - prediction	31.2 tok/s	34.4 tok/s
Llama 3.3 70B Instruct Q40 - prediction	not enough memory	7.04 tok/s
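
A rough way to read the table: the sketch below estimates the size of the quantized weights and the 2-GPU speedup for the 8B model. It assumes the Q40 layout stores 32 weights in an 18-byte block (16 bytes of packed 4-bit values plus a 2-byte scale) and uses nominal parameter counts of 8e9 and 70e9. Both are assumptions, not figures from this post, and the estimate ignores the KV cache, activation buffers, and any tensors kept at higher precision. The helper names are hypothetical.

```python
# Back-of-the-envelope check of the results table.
# ASSUMPTIONS (not from the post): Q40 block layout and nominal parameter counts.

Q40_BYTES_PER_BLOCK = 18    # assumed: 16 bytes of 4-bit values + 2-byte scale
Q40_WEIGHTS_PER_BLOCK = 32  # assumed block size


def q40_weight_gib(n_params: float) -> float:
    """Approximate size of Q40-quantized weights in GiB (weights only)."""
    total_bytes = n_params / Q40_WEIGHTS_PER_BLOCK * Q40_BYTES_PER_BLOCK
    return total_bytes / 2**30


def speedup(single_tok_s: float, multi_tok_s: float) -> float:
    """Throughput ratio of the multi-GPU run over the single-GPU run."""
    return multi_tok_s / single_tok_s


if __name__ == "__main__":
    gpu_gib = 24564 / 1024  # per-GPU memory as reported by nvidia-smi in this post
    for name, params in [("8B", 8e9), ("70B", 70e9)]:
        size = q40_weight_gib(params)
        print(
            f"{name}: ~{size:.1f} GiB of weights | "
            f"fits on 1 GPU: {size < gpu_gib} | "
            f"fits on 2 GPUs: {size < 2 * gpu_gib}"
        )
    print(f"8B speedup, 2 GPUs vs 1 GPU: {speedup(31.2, 34.4):.2f}x")
```

Under these assumptions the 70B weights alone come to roughly 37 GiB, which is consistent with the "not enough memory" result on a single 24 GB card and with the model loading once it is split across two cards.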

dllama_model_llama3_3_70b_instruct_q40

2 GPUs

./dllama inference --prompt "Tensor parallelism is all you need" --steps 128 --model models/llama3_3_70b_instruct_q40/dllama_model_llama3_3_70b_instruct_q40.m --tokenizer models/llama3_3_70b_instruct_q40/dllama_tokenizer_llama3_3_70b_instruct_q40.t --nthreads 1 --buffer-float-type q80 --max-seq-len 256 --gpu-index 0 --workers 127.0.0.1:9999

🔶 Pred  145 ms Sync   27 ms | Sent  1392 kB Recv  1610 kB | The
🔶 Pred  144 ms Sync   27 ms | Sent  1392 kB Recv  1610 kB |  development
🔶 Pred  141 ms Sync   34 ms | Sent  1392 kB Recv  1610 kB |  of
🔶 Pred  142 ms Sync   32 ms | Sent  1392 kB Recv  1610 kB |  parallel
🔶 Pred  142 ms Sync   34 ms | Sent  1392 kB Recv  1610 kB |  computing
🔶 Pred  143 ms Sync   31 ms | Sent  1392 kB Recv  1610 kB |  has
🔶 Pred  144 ms Sync   34 ms | Sent  1392 kB Recv  1610 kB |  led
🔶 Pred  143 ms Sync   40 ms | Sent  1392 kB Recv  1610 kB |  to
🔶 Pred  146 ms Sync   34 ms | Sent  1392 kB Recv  1610 kB |  the
🔶 Pred  145 ms Sync   32 ms | Sent  1392 kB Recv  1610 kB |  exploration
🔶 Pred  145 ms Sync   29 ms | Sent  1392 kB Recv  1610 kB |  of
🔶 Pred  144 ms Sync   29 ms | Sent  1392 kB Recv  1610 kB |  various
🔶 Pred  144 ms Sync   27 ms | Sent  1392 kB Recv  1610 kB |  ways
🔶 Pred  147 ms Sync   25 ms | Sent  1392 kB Recv  1610 kB |  to
🔶 Pred  145 ms Sync   25 ms | Sent  1392 kB Recv  1610 kB |  represent

dllama_model_llama3_1_8b_instruct_q40

1 GPU

./dllama inference --prompt "Tensor parallelism is all you need" --steps 128 --model models/llama3_1_8b_instruct_q40/dllama_model_llama3_1_8b_instruct_q40.m --tokenizer models/llama3_1_8b_instruct_q40/dllama_tokenizer_llama3_1_8b_instruct_q40.t --buffer-float-type q80 --max-seq-len 4096 --nthreads 1 --gpu-index 0

🔶 Pred   51 ms Sync    0 ms | Sent     0 kB Recv     0 kB |  to
🔶 Pred   39 ms Sync    0 ms | Sent     0 kB Recv     0 kB |  be
🔶 Pred   35 ms Sync    0 ms | Sent     0 kB Recv     0 kB |  configured
🔶 Pred   31 ms Sync    0 ms | Sent     0 kB Recv     0 kB |  to
🔶 Pred   30 ms Sync    0 ms | Sent     0 kB Recv     0 kB |  perform
🔶 Pred   35 ms Sync    0 ms | Sent     0 kB Recv     0 kB |  a
🔶 Pred   32 ms Sync    0 ms | Sent     0 kB Recv     0 kB |  tensor
🔶 Pred   32 ms Sync    0 ms | Sent     0 kB Recv     0 kB |  matrix
🔶 Pred   31 ms Sync    0 ms | Sent     0 kB Recv     0 kB |  multiplication
🔶 Pred   32 ms Sync    0 ms | Sent     0 kB Recv     0 kB |  in
🔶 Pred   36 ms Sync    0 ms | Sent     0 kB Recv     0 kB |  CUDA

2 GPUs

./dllama inference --prompt "Tensor parallelism is all you need" --steps 128 --model models/llama3_1_8b_instruct_q40/dllama_model_llama3_1_8b_instruct_q40.m --tokenizer models/llama3_1_8b_instruct_q40/dllama_tokenizer_llama3_1_8b_instruct_q40.t --buffer-float-type q80 --max-seq-len 4096 --nthreads 1 --gpu-index 0 --workers 127.0.0.1:9999

./dllama worker --port 9999 --nthreads 1 --gpu-index 1

🔶 Pred   28 ms Sync    6 ms | Sent   288 kB Recv   522 kB | In
🔶 Pred   28 ms Sync    6 ms | Sent   288 kB Recv   522 kB |  order
🔶 Pred   29 ms Sync    6 ms | Sent   288 kB Recv   522 kB |  to
🔶 Pred   28 ms Sync    6 ms | Sent   288 kB Recv   522 kB |  achieve
🔶 Pred   29 ms Sync    5 ms | Sent   288 kB Recv   522 kB |  the
🔶 Pred   30 ms Sync    5 ms | Sent   288 kB Recv   522 kB |  goal
🔶 Pred   29 ms Sync    6 ms | Sent   288 kB Recv   522 kB |  of
🔶 Pred   29 ms Sync    5 ms | Sent   288 kB Recv   522 kB |  high
🔶 Pred   29 ms Sync    5 ms | Sent   288 kB Recv   522 kB | -performance
🔶 Pred   29 ms Sync    5 ms | Sent   288 kB Recv   522 kB |  computing
🔶 Pred   30 ms Sync    5 ms | Sent   288 kB Recv   522 kB | ,
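
To compare runs without eyeballing the logs, the per-token lines can be averaged with a small script. This is a minimal sketch: the regular expression is written against the line format shown in this post, and `summarize` is a hypothetical helper, not part of dllama. It derives tok/s as `1000 / mean(Pred)`, on the assumption that the table figures come from the Pred time (7.04 tok/s corresponds to about 142 ms per token, which matches the 70B log). Whether Pred already includes the Sync time is not confirmed here.

```python
import re
from statistics import mean

# Matches the "Pred <n> ms Sync <n> ms" part of each per-token log line.
LINE_RE = re.compile(r"Pred\s+(\d+)\s+ms\s+Sync\s+(\d+)\s+ms")


def summarize(log_text: str) -> dict:
    """Average Pred/Sync times over all per-token lines found in log_text."""
    rows = [(int(p), int(s)) for p, s in LINE_RE.findall(log_text)]
    if not rows:
        raise ValueError("no Pred/Sync lines found")
    pred_ms = mean(p for p, _ in rows)
    sync_ms = mean(s for _, s in rows)
    return {
        "tokens": len(rows),
        "pred_ms": pred_ms,
        "sync_ms": sync_ms,
        # ASSUMPTION: throughput is derived from the Pred time alone.
        "tok_per_s": 1000.0 / pred_ms,
    }


# Three lines copied from the 8B, 2-GPU run above.
SAMPLE = """\
🔶 Pred   28 ms Sync    6 ms | Sent   288 kB Recv   522 kB | In
🔶 Pred   28 ms Sync    6 ms | Sent   288 kB Recv   522 kB |  order
🔶 Pred   29 ms Sync    6 ms | Sent   288 kB Recv   522 kB |  to
"""

if __name__ == "__main__":
    stats = summarize(SAMPLE)
    print(
        f"{stats['tokens']} tokens | Pred {stats['pred_ms']:.1f} ms | "
        f"Sync {stats['sync_ms']:.1f} ms | {stats['tok_per_s']:.1f} tok/s"
    )
```

Feeding the full 128-step output of each run through `summarize` gives averages that are less sensitive to warm-up tokens than the short excerpts pasted here.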

Spec

$ nvidia-smi
Sat Apr 19 22:06:12 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.86.16              Driver Version: 570.86.16      CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        On  |   00000000:01:00.0 Off |                  Off |
| 30%   45C    P8             30W /  400W |      17MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        On  |   00000000:81:00.0 Off |                  Off |
|  0%   47C    P0             49W /  400W |      16MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+
