# Playbook for 2 node RPC Clustering #79

**Merged** (8 commits):

- `20efbd7` covering llama.cpp RPC-based distributed inference across two STX Hal… (abdmalik-amd)
- `0aef735` edit page.tsx logic to support h3 headers (abdmalik-amd)
- `a86d51d` update page.tsx to add smooth scrolling to # ref links (abdmalik-amd)
- `e1b58a2` Update clustering-rpc-server README wording (abdmalik-amd)
- `219a903` Update title, add note on how to find RPC_IP for worker machine (abdmalik-amd)
- `5659302` fix RPC launch typo and update extended vram allocation text (abdmalik-amd)
- `b3d1cba` add llama-cli & llama-server ui images (abdmalik-amd)
- `2a2a8cd` Merge branch 'main' into abdmalik/rpcclusterplaybook (danielholanda)

# Clustering Two STX Halos with llama.cpp RPC

## Overview

Your STX Halo™ is already capable of running large language models locally. Clustering takes this further by combining the GPU memory of multiple systems over a local network, giving you access to even larger models with stronger reasoning, better code generation, and deeper multilingual understanding, all entirely on your own hardware.

This playbook teaches you how to cluster two STX Halo™ systems using llama.cpp's RPC engine and run GLM 4.7, a 358B parameter model, across both machines with ROCm acceleration.

## What You'll Learn

- Extending VRAM allocation on STX Halo™ systems
- Installing llama.cpp with ROCm and RPC support
- Configuring an RPC worker and launching distributed inference across two nodes
- Running a 358B parameter model across two networked STX Halo™ systems

## Prerequisites
<!-- @require:driver -->

<!-- @os:windows -->
- [Git](https://git-scm.com/downloads/win)
- [Python](https://www.python.org/downloads/)
- [Visual Studio Build Tools](https://visualstudio.microsoft.com/visual-cpp-build-tools/) with the **Desktop Development with C++** workload
- [AMD HIP SDK](https://www.amd.com/en/developer/resources/rocm-hub/hip-sdk.html)
<!-- @os:end -->

<!-- @os:linux -->
```bash
sudo apt install git cmake python3 python3-pip
```
<!-- @os:end -->

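Optionally, confirm that the ROCm driver stack can see the GPU before going further. A minimal check on Linux, assuming `rocminfo` was installed along with the driver:

<!-- @os:linux -->
```bash
# Should list the GPU agent with its gfx target (gfx1151 on STX Halo)
rocminfo | grep -i "gfx"
```
<!-- @os:end -->
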
## Extending VRAM Allocation

> **Note**: Complete this step on both Machine 1 and Machine 2.

<!-- @setup:memory-config -->

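For context, on Linux this typically means raising the GTT limit that the amdgpu driver allows applications to allocate from system RAM. The sketch below is illustrative only; follow the memory configuration steps above for your system. The kernel parameters and page counts shown are assumptions for a 128GB machine reserving roughly 96GB for the GPU:

<!-- @os:linux -->
```bash
# Illustrative sketch: raise the TTM page limits so the GPU can map ~96 GiB
# of system RAM (96 GiB / 4 KiB pages = 25165824). Adjust for your memory size.
sudo sed -i 's/GRUB_CMDLINE_LINUX_DEFAULT="/&ttm.pages_limit=25165824 ttm.page_pool_size=25165824 /' /etc/default/grub
sudo update-grub
sudo reboot
```
<!-- @os:end -->
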
## Installing llama.cpp

> **Note**: Complete this step on both Machine 1 and Machine 2.

Two installation options are available:

- [Option 1: Lemonade SDK (Recommended)](#option-1-lemonade-sdk-recommended) - pre-built binaries, fastest setup
- [Option 2: Manual Source Build](#option-2-manual-source-build) - build from source with full control over build flags

### Option 1: Lemonade SDK (Recommended)

The Lemonade SDK provides nightly builds of llama.cpp with AMD ROCm™ 7 acceleration, targeting GPUs such as gfx1151 (Strix Halo / Ryzen AI Max+ 395) and other recent Radeon architectures.

<!-- @os:windows -->
#### Step 1: Download the Pre-Built Binaries

Navigate to the latest release page and download the archive matching your platform and GPU target:

[https://github.com/lemonade-sdk/llamacpp-rocm/releases/latest/](https://github.com/lemonade-sdk/llamacpp-rocm/releases/latest/)

Download the file named `llama-bxxxx-windows-rocm-gfx1151-x64.zip` (where `xxxx` is the build number).

#### Step 2: Extract the Binaries

Unzip the downloaded archive and change into the extracted directory:

```powershell
Expand-Archive -Path llama-bxxxx-windows-rocm-gfx1151-x64.zip -DestinationPath .
cd llama-bxxxx-windows-rocm-gfx1151-x64
```

This directory now contains ROCm-enabled builds of `llama-cli.exe`, `llama-server.exe`, and `rpc-server.exe`, precompiled for your STX Halo™ system.

#### Step 3: Verify GPU Detection

```powershell
.\llama-cli.exe --list-devices
```

Expected output:

```bash
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon(TM) Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
Available devices:
ROCm0: AMD Radeon(TM) Graphics (110511 MiB, 110357 MiB free)
```
<!-- @os:end -->

<!-- @os:linux -->
#### Step 1: Download the Pre-Built Binaries

Navigate to the latest release page and download the archive matching your platform and GPU target:

[https://github.com/lemonade-sdk/llamacpp-rocm/releases/latest/](https://github.com/lemonade-sdk/llamacpp-rocm/releases/latest/)

Download the file named `llama-bxxxx-ubuntu-rocm-gfx1151-x64.zip` (where `xxxx` is the build number).

#### Step 2: Extract and Prepare the Binaries

```bash
unzip llama-bxxxx-ubuntu-rocm-gfx1151-x64.zip
cd llama-bxxxx-ubuntu-rocm-gfx1151-x64
chmod +x llama-cli llama-server rpc-server
```

This directory now contains ROCm-enabled builds of `llama-cli`, `llama-server`, and `rpc-server`, precompiled for your STX Halo™ system.

#### Step 3: Verify GPU Detection

```bash
./llama-cli --list-devices
```

Expected output:

```bash
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
Available devices:
ggml_backend_cuda_get_available_uma_memory: final available_memory_kb: 127697544
ROCm0: AMD Radeon Graphics (120000 MiB, 124704 MiB free)
```
<!-- @os:end -->

With llama.cpp prepared on each node, proceed to [Downloading the Model](#downloading-the-model).

### Option 2: Manual Source Build

<!-- @os:windows -->
#### Step 1: Build llama.cpp

Open the **x64 Native Tools Command Prompt** (installed with Visual Studio Build Tools) and clone the repository:

```cmd
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
```

Add HIP to your path and build with ROCm and RPC support:

```cmd
set PATH=%HIP_PATH%\bin;%PATH%
cmake -S . -B rocm -G Ninja -DGGML_HIP=ON -DGGML_RPC=ON -DGPU_TARGETS=gfx1151 -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_BUILD_TYPE=Release
cmake --build rocm --config Release
```

| Build Flag | Purpose |
|-----------|---------|
| `-DGGML_HIP=ON` | Enables the ROCm/HIP software stack |
| `-DGGML_RPC=ON` | Enables RPC for distributed inference |
| `-DGPU_TARGETS=gfx1151` | Targets the STX Halo™ GPU (Radeon 8060S) |
| `-G Ninja` | Uses the Ninja build system |

#### Step 2: Verify GPU Detection

```cmd
cd rocm\bin
.\llama-cli.exe --list-devices
```

Expected output:

```bash
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon(TM) Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
Available devices:
ROCm0: AMD Radeon(TM) Graphics (110511 MiB, 110357 MiB free)
```

#### Step 3: Add HIP to Your User Path

The build step above set `%HIP_PATH%\bin` for the current session only. To make the HIP libraries available in any terminal (not just the x64 Native Tools Command Prompt), add it to your user `PATH` permanently:

```cmd
powershell -Command "[System.Environment]::SetEnvironmentVariable('Path', [System.Environment]::GetEnvironmentVariable('Path', 'User') + ';%HIP_PATH%\bin', 'User')"
```

With llama.cpp prepared on each node, proceed to [Downloading the Model](#downloading-the-model).
<!-- @os:end -->

<!-- @os:linux -->
#### Step 1: Build llama.cpp

Clone the repository:

```bash
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
```

Build with ROCm and RPC support:

```bash
cmake -B rocm -DGGML_HIP=ON -DGGML_RPC=ON -DGGML_HIP_ROCWMMA_FATTN=ON -DAMDGPU_TARGETS="gfx1151"
cmake --build rocm --config Release -j$(nproc)
```

| Build Flag | Purpose |
|-----------|---------|
| `-DGGML_HIP=ON` | Enables the ROCm software stack |
| `-DGGML_RPC=ON` | Enables RPC for distributed inference |
| `-DGGML_HIP_ROCWMMA_FATTN=ON` | Enables rocWMMA for enhanced Flash Attention on AMD GPUs |
| `-DAMDGPU_TARGETS="gfx1151"` | Targets the STX Halo™ GPU (Radeon 8060S) |

For more build options, refer to the [llama.cpp build documentation](https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md).

#### Step 2: Verify GPU Detection

```bash
cd rocm/bin
./llama-cli --list-devices
```

Expected output:

```bash
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
Available devices:
ggml_backend_cuda_get_available_uma_memory: final available_memory_kb: 127697544
ROCm0: AMD Radeon Graphics (120000 MiB, 124704 MiB free)
```

With llama.cpp prepared on each node, proceed to [Downloading the Model](#downloading-the-model).
<!-- @os:end -->

## Downloading the Model

This playbook uses [GLM 4.7](https://huggingface.co/zai-org/GLM-4.7), a 358B parameter model in the `Q4_K_XL` quantization from [Unsloth](https://huggingface.co/unsloth/GLM-4.7-GGUF/tree/main/UD-Q4_K_XL). At this quantization the model requires approximately 205GB of storage and fits within the combined GPU memory of two STX Halo™ nodes.

Download the GGUF files using the Hugging Face CLI:
<!-- @os:linux -->
```bash
pip install huggingface-hub
hf download unsloth/GLM-4.7-GGUF --include "UD-Q4_K_XL/*" --local-dir GLM-4.7-GGUF
```
<!-- @os:end -->

<!-- @os:windows -->
```powershell
python -m pip install -U huggingface-hub

# Ensure the Python Scripts directory (where the `hf` entry point lands) is on PATH
$hfScripts = python -c "import sysconfig; print(sysconfig.get_path('scripts'))"
$env:Path = "$hfScripts;$env:Path"

hf download unsloth/GLM-4.7-GGUF --include "UD-Q4_K_XL/*" --local-dir GLM-4.7-GGUF
```
<!-- @os:end -->

> **Note**: The model download must be completed on Machine 1 (the controller). The RPC worker nodes do not need a local copy of the model files.

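Before launching, it is worth confirming that all five shards finished downloading. A quick check on Linux, assuming the `GLM-4.7-GGUF` directory used above:

<!-- @os:linux -->
```bash
# Expect five files, 00001-of-00005 through 00005-of-00005, totalling ~205GB
ls -lh GLM-4.7-GGUF/UD-Q4_K_XL/
```
<!-- @os:end -->
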
## Launching the Model on the Cluster

The llama.cpp RPC (Remote Procedure Call) engine allows a single llama.cpp instance to offload model layers to remote workers over the network. One machine acts as the **controller** (Machine 1), handling tokenization, scheduling, and orchestration. The other machine runs a lightweight **RPC server** (Machine 2) that exposes its GPU memory and compute to the controller.

At load time, llama.cpp shards the model across both nodes. Once loaded, inference proceeds as if running on a single accelerator. RPC handles tensor transfers and synchronization behind the scenes.

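Because the controller reaches the worker over plain TCP, it helps to verify network connectivity between the machines first. A minimal check from Machine 1 on Linux, where `<RPC_WORKER_IP>` is Machine 2's address (found as described in the notes below):

<!-- @os:linux -->
```bash
ping -c 3 <RPC_WORKER_IP>      # basic reachability
nc -zv <RPC_WORKER_IP> 50053   # port check (run after starting rpc-server below)
```
<!-- @os:end -->
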
### Step 1: Start the RPC Server (Machine 2)

On Machine 2, start the RPC server to expose its GPU resources to the controller:

```bash
./rpc-server -p 50053 -c --host 0.0.0.0
```

| Flag | Purpose |
|------|---------|
| `-p` | Port the RPC server listens on |
| `-c` | Enables a local cache for large tensors, avoiding repeated network transfers during model loading |
| `--host` | IP address to bind the RPC server to (`0.0.0.0` for all interfaces) |

For more options, refer to the [llama.cpp RPC documentation](https://github.com/ggml-org/llama.cpp/blob/master/tools/rpc/README.md).

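For longer sessions you may want the worker to survive a closed terminal and keep a log of incoming connections. One minimal approach on Linux (the log path is arbitrary):

<!-- @os:linux -->
```bash
# Run the worker in the background and capture its output
nohup ./rpc-server -p 50053 -c --host 0.0.0.0 > rpc-server.log 2>&1 &
tail -f rpc-server.log   # watch for the controller connecting at load time
```
<!-- @os:end -->
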
### Step 2: Launch the Model (Machine 1)

With the RPC server running on Machine 2, launch inference from Machine 1 using either `llama-cli` or `llama-server`.

#### llama-cli

`llama-cli` provides a terminal-based interface for interacting directly with the model. It is ideal for benchmarking, debugging, and low-level experimentation.

<!-- @os:linux -->
```bash
./llama-cli \
  -m /path/to/GLM-4.7-GGUF/UD-Q4_K_XL/GLM-4.7-UD-Q4_K_XL-00001-of-00005.gguf \
  -c 32768 \
  -fa on \
  -ngl 999 \
  --no-mmap \
  --rpc <RPC_WORKER_IP>:50053
```

> **Finding `<RPC_WORKER_IP>`**: On Machine 2, run `hostname -I | awk '{print $1}'` to find its local IP address. Use the address on your local network (typically starting with `192.168.` or `10.`).
<!-- @os:end -->

<!-- @os:windows -->
> **Note**: Run this command in PowerShell (for example, in Windows Terminal); the trailing backticks are PowerShell line-continuation characters.

```powershell
.\llama-cli.exe `
  -m C:\path\to\GLM-4.7-GGUF\UD-Q4_K_XL\GLM-4.7-UD-Q4_K_XL-00001-of-00005.gguf `
  -c 32768 `
  -fa on `
  -ngl 999 `
  --no-mmap `
  --rpc <RPC_WORKER_IP>:50053
```

> **Finding `<RPC_WORKER_IP>`**: On Machine 2, run `ipconfig | findstr /C:"IPv4"` in a terminal to find its local IP address. Use the address on your local network (typically starting with `192.168.` or `10.`).
<!-- @os:end -->

#### llama-server

`llama-server` exposes the same inference engine through a persistent server process with an integrated web UI and an OpenAI-compatible HTTP API. This is the preferred interface for longer-running deployments, multi-user access, and integration with external tooling.

<!-- @os:linux -->
```bash
./llama-server \
  -m /path/to/GLM-4.7-GGUF/UD-Q4_K_XL/GLM-4.7-UD-Q4_K_XL-00001-of-00005.gguf \
  -c 32768 \
  -fa on \
  -ngl 999 \
  --no-mmap \
  --host 0.0.0.0 \
  --port 8081 \
  --rpc <RPC_WORKER_IP>:50053
```

> **Finding `<RPC_WORKER_IP>`**: On Machine 2, run `hostname -I | awk '{print $1}'` to find its local IP address. Use the address on your local network (typically starting with `192.168.` or `10.`).
<!-- @os:end -->

<!-- @os:windows -->
> **Note**: Run this command in PowerShell (for example, in Windows Terminal); the trailing backticks are PowerShell line-continuation characters.

```powershell
.\llama-server.exe `
  -m C:\path\to\GLM-4.7-GGUF\UD-Q4_K_XL\GLM-4.7-UD-Q4_K_XL-00001-of-00005.gguf `
  -c 32768 `
  -fa on `
  -ngl 999 `
  --no-mmap `
  --host 0.0.0.0 `
  --port 8081 `
  --rpc <RPC_WORKER_IP>:50053
```

> **Finding `<RPC_WORKER_IP>`**: On Machine 2, run `ipconfig | findstr /C:"IPv4"` in a terminal to find its local IP address. Use the address on your local network (typically starting with `192.168.` or `10.`).
<!-- @os:end -->

Once started, access the web UI at `http://<HOST_IP>:8081`.

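The same port also serves the OpenAI-compatible HTTP API, so you can sanity-check the cluster from any Linux machine on the network with `curl`. A minimal chat request (the `model` value is only a label here; `llama-server` answers with whichever model it has loaded):

<!-- @os:linux -->
```bash
curl http://<HOST_IP>:8081/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "GLM-4.7",
        "messages": [{"role": "user", "content": "Say hello from the cluster."}],
        "max_tokens": 64
      }'
```
<!-- @os:end -->
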
#### Parameter Reference

| Flag | Purpose |
|------|---------|
| `-m` | Path to the GGUF model file (use the first shard, `00001-of-00005`) |
| `-c` | Context size in tokens. Larger values use more memory |
| `-fa on` | Enables Flash Attention for improved performance (accelerated via rocWMMA when llama.cpp is built with `-DGGML_HIP_ROCWMMA_FATTN=ON`) |
| `-ngl 999` | Offloads all model layers to the GPU |
| `--no-mmap` | Disables memory-mapping; the model is read once and uploaded to the GPUs directly, which speeds up loading when the full model cannot stay resident in system RAM |
| `--host` | IP to bind `llama-server` to (`llama-server` only) |
| `--port` | Port to serve the HTTP API on (`llama-server` only) |
| `--rpc` | Comma-separated list of RPC worker endpoints (`IP:port`) |

For full parameter usage, refer to the [llama-cli documentation](https://github.com/ggml-org/llama.cpp/blob/master/tools/main/README.md) and [llama-server documentation](https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md).

## Next Steps

- **Connect third-party applications**: `llama-server` exposes an OpenAI-compatible API. Point any OpenAI-compatible application (such as Open WebUI) at `http://<HOST_IP>:8081` with any placeholder API key (e.g., `none`) to connect to your cluster
- **Explore other models**: Browse quantized GGUFs on [Hugging Face](https://huggingface.co/models?search=gguf) to find models that fit within your cluster's combined GPU memory
- **Scale to four nodes**: Add two more STX Halo™ systems as additional RPC workers to access models at the 1 trillion parameter scale. Pass additional endpoints to `--rpc` as a comma-separated list (e.g., `--rpc <IP1>:50053,<IP2>:50053,<IP3>:50053`)

The playbook metadata is updated to match:

```diff
@@ -1,12 +1,12 @@
 {
   "id": "clustering-rpc-server",
-  "title": "Clustering with Two Halos (RPC Server)",
+  "title": "Clustering Two STX Halos with llama.cpp RPC",
-  "description": "Set up distributed inference using RPC server across two STX Halo™ devices with vLLM",
+  "description": "Set up distributed inference using RPC server across two STX Halo™ devices with llama.cpp to run 350B+ models",
   "time": 90,
-  "platforms": ["linux"],
+  "platforms": ["linux", "windows"],
   "difficulty": "advanced",
-  "isNew": false,
+  "isNew": true,
   "isFeatured": false,
   "published": true,
-  "tags": ["clustering", "rpc", "vllm", "distributed", "docker"]
+  "tags": ["clustering", "rpc", "distributed", "llm", "inference", "rocm"]
 }
```