You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: playbooks/supplemental/clustering-rccl/README.md
+8-8Lines changed: 8 additions & 8 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -13,15 +13,15 @@ SPDX-License-Identifier: MIT
13
13
14
14
## Overview
15
15
16
-
Your Ryzen™ AI Halo™ is already capable of running large language models locally. Clustering takes this further by combining the GPU memory of multiple systems over a local network, giving you access to even larger models with stronger reasoning, better code generation, and deeper multilingual understanding, all entirely on your own hardware.
16
+
Your Ryzen™ AI Halo is already capable of running large language models locally. Clustering takes this further by combining the GPU memory of multiple systems over a local network, giving you access to even larger models with stronger reasoning, better code generation, and deeper multilingual understanding, all entirely on your own hardware.
17
17
18
-
This playbook teaches you how to cluster two Ryzen™ AI Halo™ systems using RCCL (ROCm Communication Collectives Library) with vLLM and run Qwen3.5-397B, a 397B parameter model, across both machines with ROCm acceleration.
18
+
This playbook teaches you how to cluster two Ryzen™ AI Halo systems using RCCL (ROCm Communication Collectives Library) with vLLM and run Qwen3.5-397B, a 397B parameter model, across both machines with ROCm acceleration.
19
19
20
20
## What You'll Learn
21
21
22
-
- How to extend VRAM allocation on Ryzen™ AI Halo™ systems
22
+
- How to extend VRAM allocation on Ryzen™ AI Halo systems
23
23
- Launching vLLM with ROCm support
24
-
- Configuring RCCL for multi-node tensor-parallel inference across two Ryzen™ AI Halo™ systems
24
+
- Configuring RCCL for multi-node tensor-parallel inference across two Ryzen™ AI Halo systems
25
25
- Running a 397B parameter model across two networked Ryzen™ AI Halo systems
26
26
27
27
## Prerequisites
@@ -92,7 +92,7 @@ You should see a speed of `10000Mb/s`:
92
92
93
93
On Linux, ROCm utilizes a shared system memory pool, and this pool is configured by default to half the system memory.
94
94
95
-
This amount can be increased by changing the kernel's Translation Table Manager (TTM) page setting, with the following instructions. AMD recommends setting the minimum dedicated VRAM in the BIOS (0.5GB)
95
+
This amount can be increased by changing the kernel's Translation Table Manager (TTM) page setting, with the following instructions. AMD recommends setting the minimum dedicated VRAM in the BIOS (0.5 GB).
96
96
97
97
* Install the pipx utility and add the path for pipx installed wheels into the system search path.
98
98
@@ -101,7 +101,7 @@ This amount can be increased by changing the kernel's Translation Table Manager
101
101
pipx ensurepath
102
102
```
103
103
104
-
* Install the amd-debug-tools wheel from PyPi.
104
+
* Install the amd-debug-tools wheel from PyPI.
105
105
```bash
106
106
pipx install amd-debug-tools
107
107
```
@@ -224,12 +224,12 @@ To connect Open WebUI to your vLLM endpoint:
224
224
225
225

226
226
227
-
Once connected, select the model from the model dropdown in Open WebUI and start chatting. The model is now running across both of your Ryzen™ AI Halo™ nodes:
227
+
Once connected, select the model from the model dropdown in Open WebUI and start chatting. The model is now running across both of your Ryzen™ AI Halo nodes:
228
228
229
229

230
230
231
231
## Next Steps
232
232
233
233
-**Explore other models**: Discover new models on [Hugging Face](https://huggingface.co/models?&sort=trending) that fit within your cluster's combined GPU memory
234
-
-**Scale to four nodes**: Add two more Ryzen™ AI Halo™ systems as additional Ray workers to shard models across even more GPUs. This requires an Ethernet switch with at least four ports, one for each node. Follow [Step 2: Join the Cluster](#step-2-join-the-cluster-machine-2) on each additional worker and increase `--tensor-parallel-size` accordingly
234
+
-**Scale to four nodes**: Add two more Ryzen™ AI Halo systems as additional Ray workers to shard models across even more GPUs. This requires an Ethernet switch with at least four ports, one for each node. Follow [Step 2: Join the Cluster](#step-2-join-the-cluster-machine-2) on each additional worker and increase `--tensor-parallel-size` accordingly
235
235
-**Try other parallelism strategies**: vLLM supports [expert parallel](https://docs.vllm.ai/en/latest/serving/expert_parallel_deployment/) for mixture-of-experts models and [data parallel](https://docs.vllm.ai/en/latest/serving/data_parallel_deployment/) for higher throughput. Experiment with `--enable-expert-parallel` and `--data-parallel-size` to find the best configuration for your workload
0 commit comments