| 1 | +# CUDA setup & graceful degradation for eliza-1 on Windows / Linux |
| 2 | + |
| 3 | +Scope: what an NVIDIA-GPU user on Windows or Linux actually needs to run the |
| 4 | +eliza-1 local models *well* (i.e. on CUDA, not the ~14× slower Vulkan prefill |
| 5 | +path), how Milady detects hardware today, what degrades silently, and the |
| 6 | +concrete plan to close the gaps. macOS (Metal) and Android (Vulkan) are out of |
| 7 | +scope here. |
| 8 | + |
| 9 | +## 1. CUDA dependency reality: driver vs. toolkit |
| 10 | + |
| 11 | +There are two distinct CUDA artifacts and they get conflated constantly: |
| 12 | + |
| 13 | +- **The NVIDIA display driver** ships the CUDA *driver* library: `libcuda.so.1` |
| 14 | + (Linux) / `nvcuda.dll` (Windows) plus the kernel module. This is the *only* |
| 15 | + thing required to *run* CUDA code. On Linux it comes from the distro's |
| 16 | + `nvidia-driver-NNN` package (or NVIDIA's `.run`); on Windows it's the Game |
| 17 | + Ready / Studio driver from nvidia.com or Windows Update. A recent-enough |
| 18 | + driver is needed for the CUDA toolkit version a binary was compiled against |
| 19 | + (CUDA 12.x needs driver ≥ 525 on Linux / ≥ 527 on Windows; newer minor |
| 20 | + versions bump the floor — CUDA 12.8, which adds Blackwell `sm_100`/`sm_120`, |
| 21 | + wants ≥ 570). |
| 22 | + |
| 23 | +- **The CUDA toolkit** (`nvcc`, `libcudart`, headers) is a *build-time* |
| 24 | + dependency. node-llama-cpp's CUDA prebuilt **statically bundles libcudart**, so |
| 25 | + *running* the node binding never needs the toolkit. Likewise our fork's CUDA |
| 26 | + build links `cudart_static` — once the `.so`/`.dll` exists, only the driver is |
| 27 | + needed at runtime. |
| 28 | + |
| 29 | +**So: a Windows/Linux user needs the NVIDIA driver, not the CUDA toolkit — *if* |
| 30 | +the CUDA binary already exists on their machine.** The catch is the second |
| 31 | +clause. node-llama-cpp's CUDA prebuilt downloads automatically on `npm install` |
| 32 | +of the binding (it's in `app-core`'s deps, `node-llama-cpp@3.18.1`, |
| 33 | +trusted-dependency flagged). But eliza-1's GGUFs use the custom GGML types |
| 34 | +(`QJL1_256` / `Q4_POLAR` / `TBQ3_*`) and `--spec-type dflash`, so the catalog |
| 35 | +marks every tier `preferredBackend: "llama-server"` with |
| 36 | +`requiresKernel: [...]` — they **must** route to the fork's `llama-server`, not |
| 37 | +node-llama-cpp. The fork is built by `packages/app-core/scripts/build-llama-cpp-dflash.mjs`: |
| 38 | +`detectBackend()` returns `"cuda"` only when `nvcc` *or* `nvidia-smi` is on PATH, |
| 39 | +and the `linux-x64-cuda` / `windows-x64-cuda` targets pass `-DGGML_CUDA=ON` |
| 40 | +which **requires `nvcc` on the build host**. There is no prebuilt-CUDA-fork |
| 41 | +download path in the runtime — `resolveDflashBinary()` only looks at |
| 42 | +`ELIZA_DFLASH_LLAMA_SERVER`, the fused build dir, the managed |
| 43 | +`<root>/local-inference/bin/dflash/<platform>-<arch>-<backend>/llama-server` |
| 44 | +path, and `$PATH`. The CI matrix (`.github/workflows/local-inference-matrix.yml`) |
| 45 | +*does* build `linux-x64-cuda` and uploads it as a smoke artifact, but it's |
| 46 | +`continue-on-error`, gated behind a `gpu`-labelled self-hosted runner, and not |
| 47 | +wired into any release-distribution pipeline. `release-electrobun.yml` ships |
| 48 | +nothing inference-related. So on a fresh desktop install **nothing builds or |
| 49 | +downloads the CUDA fork for the user** — they'd need `nvcc` (i.e. the full CUDA |
| 50 | +toolkit, ~3 GB) installed *and* a manual `bun run local-inference:dflash:build`. |
| 51 | + |
| 52 | +Net for a fresh install: |
| 53 | + |
| 54 | +| | Windows | Linux | |
| 55 | +|---|---|---| |
| 56 | +| NVIDIA driver | user installs (nvidia.com / Windows Update) | user installs (`nvidia-driver-NNN` or `.run`) — see `packages/inference/reports/porting/2026-05-11/cuda-bringup-operator-steps.md` for a worked example of a half-installed driver | |
| 57 | +| CUDA toolkit (`nvcc`) | only needed if *building* the fork; not shipped, not auto-installed | same | |
| 58 | +| node-llama-cpp CUDA prebuilt | auto-downloads with `app-core` deps | auto-downloads with `app-core` deps | |
| 59 | +| eliza-1 fork CUDA build (`llama-server`) | **not built, not downloaded** today | **not built, not downloaded** today | |
| 60 | +| What runs eliza-1 today | Vulkan fork build if present (CI smoke artifact), else CPU fork, else node-llama-cpp (which can't load the custom GGML types → load error) | same | |
| 61 | + |
| 62 | +## 2. Detection — what exists |
| 63 | + |
| 64 | +- **`probeHardware()`** (`local-inference/hardware.ts`) is the only real GPU |
| 65 | + probe. It `import()`s `node-llama-cpp`, calls `getLlama({ gpu: "auto" })`, and |
| 66 | + reads `llama.gpu` (`"cuda" | "metal" | "vulkan" | false`) plus |
| 67 | + `getVramState()`. It feeds `recommendBucket()` (small/mid/large/xl from |
| 68 | + effective VRAM/RAM) and `deviceCapsFromProbe()` (which backends a bundle may |
| 69 | + install). Exposed at `GET /api/local-inference/hardware`. **It does not run |
| 70 | + `nvidia-smi`** and there is no driver-presence / driver-version check. If the |
| 71 | + binding's prebuilt is missing it returns `source: "os-fallback"` with |
| 72 | + `gpu: null` — i.e. it cannot tell "no GPU" from "GPU present, binding not |
| 73 | + installed". |
| 74 | +- **The fork backend selector** (`dflash-server.ts` `platformKey()` / |
| 75 | + `fusedBackendKey()`) was the real gap: it picked `cuda` *only* when |
| 76 | + `CUDA_VISIBLE_DEVICES` was set and not `-1`. That env var is essentially never |
| 77 | + set on a desktop launch, so the runtime always keyed `…-cpu` and would run the |
| 78 | + CPU fork build *even with a CUDA fork build sitting in the managed bin dir*. |
| 79 | + **Fixed in this change** (see §5): when no `*_VISIBLE_DEVICES`/`ELIZA_DFLASH_BACKEND` |
| 80 | + override is present, the selector now probes `<root>/bin/dflash/<plat>-<arch>-{cuda,vulkan,rocm}[-fused]/llama-server` on disk and prefers the first that exists. So a downloaded/built CUDA fork artifact is now actually used. |
| 81 | +- **Backend dispatch** (`backend.ts` `decideBackend()`): chooses node-llama-cpp |
| 82 | + vs. llama-server purely from catalog metadata (`requiresKernel`, |
| 83 | + `runtime.dflash`, `preferredBackend`) + `ELIZA_LOCAL_BACKEND`. It is *not* |
| 84 | + GPU-aware — within `llama-server` the cuda-vs-vulkan-vs-cpu choice is the |
| 85 | + build-dir selection above, not a dispatcher decision. There is **no |
| 86 | + "detected RTX 4090, using CUDA" message surfaced anywhere** — the hardware |
| 87 | + probe data is available via the API but nothing renders a confirmation. |
| 88 | + |
| 89 | +## 3. Graceful degradation — what the chain actually is |
| 90 | + |
| 91 | +For eliza-1 (kernel-required tiers), the de-facto fallback is: |
| 92 | + |
| 93 | +1. fused CUDA fork build → 2. stock CUDA fork build → 3. fused Vulkan → 4. stock |
| 94 | +Vulkan → 5. fused CPU → 6. stock CPU fork → 7. node-llama-cpp (which **cannot |
| 95 | +load** the custom GGML types — it errors). |
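The ordering above can be written down as a list of managed-directory keys. This is a minimal sketch, assuming the `<platform>-<arch>-<backend>[-fused]` directory naming used elsewhere in this document; `fallbackChain` is a hypothetical helper for illustration, not a function in the repo, and node-llama-cpp is left out because it cannot load these GGUFs.

```typescript
type ForkBackend = "cuda" | "vulkan" | "cpu";

// Hypothetical helper: enumerate fork build dirs in the de-facto preference
// order (cuda before vulkan before cpu, fused before stock within each).
function fallbackChain(platform: string, arch: string): string[] {
  const backends: ForkBackend[] = ["cuda", "vulkan", "cpu"];
  const keys: string[] = [];
  for (const backend of backends) {
    keys.push(`${platform}-${arch}-${backend}-fused`);
    keys.push(`${platform}-${arch}-${backend}`);
  }
  return keys;
}

console.log(fallbackChain("linux", "x64").join(" > "));
```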
| 96 | + |
| 97 | +What the dispatcher actually walks: `BackendDispatcher.load()` picks |
| 98 | +`llama-server` (kernel-required) and calls `dflashLlamaServer.load(plan)`. |
| 99 | +`engine.ts` only falls back to node-llama-cpp when the decision reason is the |
| 100 | +*soft* `"preferred-backend"` and `!dflashRequired()` — kernel-required loads |
| 101 | +**do not** fall back; the error propagates. That's correct (a node-llama-cpp |
| 102 | +fallback would fail to load the GGUF anyway). The within-`llama-server` walk |
| 103 | +(cuda→vulkan→cpu) is the new `accelBackendKey()` disk probe — but there is no |
| 104 | +"warn the user we degraded from CUDA to CPU" message; it just runs slow. |
| 105 | + |
| 106 | +Failure modes and what the user sees today: |
| 107 | + |
| 108 | +- **No fork build at all for the platform** → `resolveDflashBinary()` returns |
| 109 | + null → `getDflashRuntimeStatus()` reports `enabled: false, reason: "No |
| 110 | + compatible llama-server found. Set ELIZA_DFLASH_LLAMA_SERVER or run |
| 111 | + packages/app-core/scripts/build-llama-cpp-dflash.mjs."` and (with eliza-1 |
| 112 | + loaded) `BackendDispatcher.load()` throws `unsatisfiedKernels` / |
| 113 | + "rebuild the fork" — clear error, but it points at a dev script, not a |
| 114 | + user action. `runDflashDoctor()` exposes this via the doctor report |
| 115 | + (`llama-server-binary` check → `fail`). |
| 116 | +- **GPU has too little VRAM** → `assessFit()` returns `tight`/`wontfit` and |
| 117 | + `recommendBucket()` downsizes the tier; this is surfaced in the catalog UI. |
| 118 | + But if a too-big tier is force-loaded, `llama-server` either OOMs the GPU |
| 119 | + (CUDA OOM, hard crash of the child) or — with `gpuLayers: "auto"` — spills to |
| 120 | + CPU silently and runs slow. No proactive warning. |
| 121 | +- **Driver too old / absent** → `getLlama({ gpu: "auto" })` falls back to |
| 122 | + `gpu: false`; `probeHardware()` reports `gpu: null`. If a CUDA fork build is |
| 123 | + on disk, launching it against a missing/old `libcuda.so.1` fails at |
| 124 | + `dlopen`/`LoadLibrary` time — `llama-server` exits non-zero, the engine |
| 125 | +  surfaces the spawn failure, but **there is no "your NVIDIA driver is missing |
| 126 | +  or too old, install one that supports CUDA 12.x" message** mapping the |
| 127 | +  cryptic loader error to an actionable fix. |
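One way to close the last gap is to classify the child's stderr and exit code before surfacing the error. The sketch below is a heuristic under stated assumptions: `diagnoseSpawnFailure` is a hypothetical name, and which of these signals the fork's `llama-server` actually produces has not been verified here. The patterns themselves are standard: glibc's loader message for a missing shared object, the Windows `STATUS_DLL_NOT_FOUND` exit status, and the CUDA runtime's "insufficient driver" error string.

```typescript
type SpawnDiagnosis = "driver-missing" | "driver-too-old" | "unknown";

// Windows exit status STATUS_DLL_NOT_FOUND (a required DLL could not be loaded).
const STATUS_DLL_NOT_FOUND = 0xc0000135;

// Hypothetical classifier for a failed llama-server spawn.
function diagnoseSpawnFailure(stderr: string, exitCode: number | null): SpawnDiagnosis {
  // glibc's dynamic loader prints this when libcuda.so.1 cannot be found.
  if (/error while loading shared libraries: libcuda\.so/.test(stderr)) {
    return "driver-missing";
  }
  // ">>> 0" normalizes a signed 32-bit exit code to its unsigned form.
  if (exitCode !== null && (exitCode >>> 0) === STATUS_DLL_NOT_FOUND) {
    return "driver-missing";
  }
  // Error string of the CUDA runtime's cudaErrorInsufficientDriver.
  if (/CUDA driver version is insufficient/i.test(stderr)) {
    return "driver-too-old";
  }
  return "unknown";
}
```

Anything classified as `"unknown"` should still show the raw loader output, so the mapping never hides information.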
| 128 | + |
| 129 | +## 4. Installer integration — the plan |
| 130 | + |
| 131 | +The installer (desktop first-run / `bun install` postinstall) should: |
| 132 | + |
| 133 | +**(a) Detect the GPU.** Run `nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv,noheader` (Linux + Windows both ship it with the driver). Parse name, VRAM, and driver version. This is cheaper and more honest than spinning up the node-llama-cpp binding, and it tells us the driver version (which the binding does not). Fall back to `probeHardware()` for non-NVIDIA. |
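Parsing that output is straightforward. A minimal sketch, assuming one CSV line per GPU with memory reported as `<n> MiB` (the default when `nounits` is not passed); `parseNvidiaSmiCsv` and the `NvidiaGpu` shape are hypothetical names, and the sample line in the comment is illustrative rather than captured output.

```typescript
interface NvidiaGpu {
  name: string;
  vramMiB: number;
  driverVersion: string;
}

// Parses stdout of:
//   nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv,noheader
// Illustrative line: "NVIDIA GeForce RTX 4090, 24564 MiB, 550.54.14"
function parseNvidiaSmiCsv(stdout: string): NvidiaGpu[] {
  const gpus: NvidiaGpu[] = [];
  for (const line of stdout.split(/\r?\n/)) {
    const fields = line.split(",").map((field) => field.trim());
    if (fields.length !== 3) continue; // blank or unexpected line
    const [name, memory, driverVersion] = fields;
    const match = /^(\d+)\s*MiB$/.exec(memory);
    if (!match) continue; // memory not reported in MiB, e.g. "[N/A]"
    gpus.push({ name, vramMiB: Number(match[1]), driverVersion });
  }
  return gpus;
}
```

In the installer this would be fed the stdout of the spawned command, with a spawn error or non-zero exit treated as "driver absent".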
| 134 | + |
| 135 | +**(b) Get the right fork build onto disk.** Building the CUDA fork needs `nvcc`, and pulling a ~3 GB CUDA toolkit on every install is unacceptable. The correct call is the same one node-llama-cpp made: **ship prebuilt CUDA fork binaries as release artifacts** (for `windows-x64-cuda`, `linux-x64-cuda`, plus the `-fused` variants), and have the installer download the matching one into `<root>/local-inference/bin/dflash/<plat>-<arch>-<backend>/`. The build matrix already produces these (`build-llama-cpp-dflash.mjs --target linux-x64-cuda` etc., with `CAPABILITIES.json` emitted next to the binary); they're just not promoted to a downloadable release today. Concretely: (1) add a `dflash-binaries` release job (parallel to `release-electrobun.yml`) that runs the existing build script for `{linux,windows}-x64-{cpu,vulkan,cuda}` (+ `-fused`), using the `gpu`-labelled self-hosted runner for the CUDA legs, and uploads each `<target>/` dir (binary + `CAPABILITIES.json`) as a GH release asset; (2) add a small `local-inference:fetch-binary` resolver in the runtime that, when `resolveDflashBinary()` finds nothing, downloads the asset for `accelBackendKey()` (CUDA if `nvidia-smi` succeeds, else Vulkan, else CPU). Keep `build-llama-cpp-dflash.mjs` as the from-source path for devs and the `MILADY_ELIZA_SOURCE=local` workflow. |
| 136 | + |
| 137 | +**(c) Warn on missing/old driver.** If `nvidia-smi` fails (driver absent) or reports `driver_version` below the CUDA-12.x floor, show a one-time card: "An NVIDIA GPU was detected but the driver is missing/outdated. eliza-1 will run on CPU (≈14× slower) until you install the driver: `https://www.nvidia.com/Download/index.aspx` (Windows) / `sudo ubuntu-drivers install` or your distro's `nvidia-driver-NNN` (Linux)." Map the `llama-server` `dlopen`/`LoadLibrary` failure to the same message. Link to `docs/.../cuda-bringup-operator-steps.md`-style guidance. |
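The driver-floor comparison can be a small table lookup. This sketch uses only the floors quoted in §1 (525 on Linux / 527 on Windows for CUDA 12.x, 570 for CUDA 12.8) and compares the major version only, which is a simplification; `driverMeetsFloor` and the `ToolkitLine` keys are hypothetical names.

```typescript
type DriverPlatform = "linux" | "windows";
type ToolkitLine = "cuda12" | "cuda12_8";

// Minimum driver major versions, as quoted in section 1 of this document.
const DRIVER_FLOOR: Record<ToolkitLine, Record<DriverPlatform, number>> = {
  cuda12: { linux: 525, windows: 527 },
  cuda12_8: { linux: 570, windows: 570 },
};

// Hypothetical check: does an nvidia-smi driver_version string (for example
// "550.54.14") meet the floor? Only the major component is compared.
function driverMeetsFloor(
  driverVersion: string,
  platform: DriverPlatform,
  toolkit: ToolkitLine = "cuda12",
): boolean {
  const major = Number.parseInt(driverVersion, 10); // stops at the first "."
  if (Number.isNaN(major)) return false; // unparsable counts as "too old"
  return major >= DRIVER_FLOOR[toolkit][platform];
}
```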
| 138 | + |
| 139 | +**(d) Pick model + context for the detected VRAM.** Already mostly there: `recommendBucket()` → `eliza-1-{0_6b,1_7b,9b,27b,...}`. Tighten it so the first-run default respects `nvidia-smi` VRAM directly (the current heuristic takes `max(vram*1.25, ram*0.5)`, which over-estimates on a 6 GB laptop dGPU + 32 GB RAM box: `max(7.5, 16)` = 16 GB). Surface the choice: "Detected RTX 4090 (24 GB) → using eliza-1-9b on CUDA" in onboarding. |
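To make the over-estimate concrete, the sketch below evaluates the heuristic as quoted above next to one possible VRAM-first alternative. Both function names are hypothetical, units are assumed to be GB, and the alternative (including its RAM-only fallback) is an assumption for illustration, not the planned implementation.

```typescript
// The heuristic as described in the text: max(vram * 1.25, ram * 0.5), in GB.
function effectiveMemoryGb(vramGb: number, ramGb: number): number {
  return Math.max(vramGb * 1.25, ramGb * 0.5);
}

// Hypothetical alternative: trust reported VRAM when a discrete GPU is
// present, and fall back to the RAM term only when there is none.
function vramFirstMemoryGb(vramGb: number | null, ramGb: number): number {
  return vramGb !== null && vramGb > 0 ? vramGb : ramGb * 0.5;
}

// 6 GB laptop dGPU + 32 GB RAM: the RAM term dominates the current heuristic.
console.log(effectiveMemoryGb(6, 32)); // 16
console.log(vramFirstMemoryGb(6, 32)); // 6
```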
| 140 | + |
| 141 | +**Windows flow:** `nvidia-smi` → if OK and the driver meets the CUDA-12.x floor: download `windows-x64-cuda[-fused]` fork build → pick tier from VRAM → "using CUDA" confirmation. If `nvidia-smi` fails: download `windows-x64-vulkan` (all of NVIDIA/AMD/Intel Arc expose Vulkan 1.3) + show driver-install card → still works, just slower. If no Vulkan-capable GPU: `windows-x64-cpu`. node-llama-cpp's own CUDA/Vulkan prebuilts come down with `app-core` deps regardless (used for the hardware probe + any stock GGUF). |
| 142 | + |
| 143 | +**Linux flow:** identical, with `linux-x64-{cuda,vulkan,cpu}` and the driver-install hint pointing at `ubuntu-drivers` / distro package. The `cuda-bringup-operator-steps.md` report shows a real `dpkg --configure -a` half-install recovery worth linking from the warning. |
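Both flows reduce to the same decision function. A minimal sketch, assuming the probe results are already computed; `planInstall`, `ProbeResult`, and `InstallPlan` are hypothetical names. One assumption goes beyond the text: a driver below the floor is treated like a failed `nvidia-smi` (Vulkan build plus the driver card).

```typescript
type InstallPlatform = "linux" | "windows";

interface ProbeResult {
  nvidiaSmiOk: boolean;      // nvidia-smi ran and reported a GPU
  driverMeetsFloor: boolean; // driver_version is at or above the CUDA 12.x floor
  hasVulkanGpu: boolean;     // some Vulkan-capable GPU is present
}

interface InstallPlan {
  target: string;          // fork build to download, e.g. "windows-x64-cuda"
  showDriverCard: boolean; // whether to show the driver-install card
}

// Hypothetical decision function mirroring the Windows/Linux flows above.
function planInstall(platform: InstallPlatform, probe: ProbeResult): InstallPlan {
  if (probe.nvidiaSmiOk && probe.driverMeetsFloor) {
    return { target: `${platform}-x64-cuda`, showDriverCard: false };
  }
  if (probe.hasVulkanGpu) {
    // Assumption: an outdated driver takes the same path as a missing one.
    return { target: `${platform}-x64-vulkan`, showDriverCard: true };
  }
  return { target: `${platform}-x64-cpu`, showDriverCard: false };
}
```

As written, the card is shown whenever the Vulkan build is chosen, which follows the flow in the text; telling an NVIDIA GPU with a missing driver apart from an AMD or Intel GPU would need an extra probe that this sketch leaves out.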
| 144 | + |
| 145 | +## 5. Changes made in this commit |
| 146 | + |
| 147 | +`packages/app-core/src/services/local-inference/dflash-server.ts`: replaced the |
| 148 | +two near-duplicate `platformKey()` / `fusedBackendKey()` env-only backend |
| 149 | +selectors with a single `accelBackendKey(suffix)` helper. Precedence is now: |
| 150 | +`ELIZA_DFLASH_BACKEND` override → `darwin`→`metal` → `HIP_/ROCR_VISIBLE_DEVICES`→`rocm` |
| 151 | +→ `CUDA_VISIBLE_DEVICES`→`cuda` → **disk probe** for an installed |
| 152 | +`…-{cuda,vulkan,rocm}[-fused]/llama-server` (cuda preferred) → `cpu`. This is |
| 153 | +the smallest fix that makes a present-on-disk CUDA fork build actually get used |
| 154 | +without the operator having to set `CUDA_VISIBLE_DEVICES` by hand. `platformKey()` |
| 155 | +and `fusedBackendKey()` are now thin wrappers, so all existing callers |
| 156 | +(`managedDflashBinaryPath`, `managedFusedDflashDir`, `managedDflashCapabilitiesPath`) |
| 157 | +pick it up. Typecheck clean; `dflash-server.test.ts` 27/27 pass. |
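The precedence described above can be sketched as follows. This is an illustration, not the code in `dflash-server.ts`: `pickBackendSketch` and `SelectorInputs` are hypothetical names, the environment and filesystem are injected so the logic is testable, and the treatment of empty env values and the bare `llama-server` file name are assumptions.

```typescript
interface SelectorInputs {
  env: Record<string, string | undefined>;
  osPlatform: string;                // e.g. process.platform
  binDir: string;                    // stands in for <root>/bin/dflash
  platArch: string;                  // e.g. "linux-x64"
  exists: (path: string) => boolean; // e.g. fs.existsSync
}

function isSet(value: string | undefined): boolean {
  return value !== undefined && value !== "";
}

// Hypothetical re-statement of the precedence: override, metal, rocm env,
// cuda env, disk probe (cuda first), then cpu. Returns the backend name only.
function pickBackendSketch(suffix: "" | "-fused", input: SelectorInputs): string {
  const { env, osPlatform, binDir, platArch, exists } = input;
  const override = env["ELIZA_DFLASH_BACKEND"];
  if (override !== undefined && override !== "") return override;
  if (osPlatform === "darwin") return "metal";
  if (isSet(env["HIP_VISIBLE_DEVICES"]) || isSet(env["ROCR_VISIBLE_DEVICES"])) return "rocm";
  const cudaDevices = env["CUDA_VISIBLE_DEVICES"];
  if (isSet(cudaDevices) && cudaDevices !== "-1") return "cuda";
  // Disk probe: prefer the first accelerated build that is actually installed.
  for (const backend of ["cuda", "vulkan", "rocm"]) {
    if (exists(`${binDir}/${platArch}-${backend}${suffix}/llama-server`)) return backend;
  }
  return "cpu";
}
```

Injecting `exists` keeps the probe order testable without touching the real managed bin dir.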
| 158 | + |
| 159 | +Not done here (recommended, larger): the `nvidia-smi`-based detector, the |
| 160 | +prebuilt-CUDA-fork release job + runtime downloader, the missing/old-driver |
| 161 | +warning card, and the VRAM-aware first-run tier pick. Those are §4 above. |