Commit 4ea9558

Author: Shaw
Merge remote-tracking branch 'origin/develop' into develop
2 parents d5dbe00 + 97c0d3c commit 4ea9558

5 files changed

Lines changed: 667 additions & 24 deletions

Lines changed: 161 additions & 0 deletions
@@ -0,0 +1,161 @@
# CUDA setup & graceful degradation for eliza-1 on Windows / Linux

Scope: what an NVIDIA-GPU user on Windows or Linux actually needs to run the
eliza-1 local models *well* (i.e. on CUDA, not the ~14× slower Vulkan prefill
path), how Milady detects hardware today, what degrades silently, and the
concrete plan to close the gaps. macOS (Metal) and Android (Vulkan) are out of
scope here.

## 1. CUDA dependency reality: driver vs. toolkit

There are two distinct CUDA artifacts and they get conflated constantly:

- **The NVIDIA display driver** ships the CUDA *driver* runtime: `libcuda.so.1`
  (Linux) / `nvcuda.dll` (Windows) plus the kernel module. This is the *only*
  thing required to *run* CUDA code. On Linux it comes from the distro's
  `nvidia-driver-NNN` package (or NVIDIA's `.run`); on Windows it's the Game
  Ready / Studio driver from nvidia.com or Windows Update. A recent-enough
  driver is needed for the CUDA toolkit version a binary was compiled against
  (CUDA 12.x needs driver ≥ 525 on Linux / ≥ 527 on Windows; newer minor
  versions bump the floor: CUDA 12.8, which adds Blackwell `sm_100`/`sm_120`,
  wants ≥ 570).

- **The CUDA toolkit** (`nvcc`, `libcudart`, headers) is a *build-time*
  dependency. node-llama-cpp's CUDA prebuilt **statically bundles libcudart**, so
  *running* the node binding never needs the toolkit. Likewise our fork's CUDA
  build links `cudart_static`: once the `.so`/`.dll` exists, only the driver is
  needed at runtime.

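The driver floors in the first bullet can be turned into a small check. This is a minimal sketch, not code from the repo: `driverMajor` and `driverSupports` are hypothetical names, and comparing only the major version is a simplifying assumption (NVIDIA's real floors include minor and patch components).

```typescript
// Driver-version floors quoted in the text above (major version only).
type Os = "linux" | "win32";

const CUDA_12_FLOOR: Record<Os, number> = { linux: 525, win32: 527 };
const CUDA_12_8_FLOOR = 570; // CUDA 12.8 (Blackwell sm_100 / sm_120)

/** Extract the major number from a driver string such as "550.54.14". */
function driverMajor(driverVersion: string): number | null {
  const match = /^\s*(\d+)(?:\.\d+)*\s*$/.exec(driverVersion);
  return match ? Number(match[1]) : null;
}

/** Hypothetical helper: can this driver run a binary built against CUDA 12.x or 12.8? */
function driverSupports(
  driverVersion: string,
  os: Os,
  toolkit: "12.x" | "12.8" = "12.x",
): boolean {
  const major = driverMajor(driverVersion);
  if (major === null) return false; // unparseable input is treated as unsupported
  const floor = toolkit === "12.8" ? CUDA_12_8_FLOOR : CUDA_12_FLOOR[os];
  return major >= floor;
}
```

For example, `driverSupports("526.98", "win32")` is `false` while the same string on `"linux"` is `true`, because the two platforms have different floors.
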
**So: a Windows/Linux user needs the NVIDIA driver, not the CUDA toolkit, *if*
the CUDA binary already exists on their machine.** The catch is the second
clause. node-llama-cpp's CUDA prebuilt downloads automatically on `npm install`
of the binding (it's in `app-core`'s deps, `node-llama-cpp@3.18.1`,
trusted-dependency flagged). But eliza-1's GGUFs use the custom GGML types
(`QJL1_256` / `Q4_POLAR` / `TBQ3_*`) and `--spec-type dflash`, so the catalog
marks every tier `preferredBackend: "llama-server"` with
`requiresKernel: [...]`: they **must** route to the fork's `llama-server`, not
node-llama-cpp.

The fork is built by `packages/app-core/scripts/build-llama-cpp-dflash.mjs`:
`detectBackend()` returns `"cuda"` only when `nvcc` *or* `nvidia-smi` is on PATH,
and the `linux-x64-cuda` / `windows-x64-cuda` targets pass `-DGGML_CUDA=ON`,
which **requires `nvcc` on the build host**. There is no prebuilt-CUDA-fork
download path in the runtime: `resolveDflashBinary()` only looks at
`ELIZA_DFLASH_LLAMA_SERVER`, the fused build dir, the managed
`<root>/local-inference/bin/dflash/<platform>-<arch>-<backend>/llama-server`
path, and `$PATH`.

The CI matrix (`.github/workflows/local-inference-matrix.yml`) *does* build
`linux-x64-cuda` and uploads it as a smoke artifact, but it's
`continue-on-error`, gated behind a `gpu`-labelled self-hosted runner, and not
wired into any release-distribution pipeline. `release-electrobun.yml` ships
nothing inference-related. So on a fresh desktop install **nothing builds or
downloads the CUDA fork for the user**; they'd need `nvcc` (i.e. the full CUDA
toolkit, ~3 GB) installed *and* a manual `bun run local-inference:dflash:build`.

Net for a fresh install:

| Component | Windows | Linux |
|---|---|---|
| NVIDIA driver | user installs (nvidia.com / Windows Update) | user installs (`nvidia-driver-NNN` or `.run`); see `packages/inference/reports/porting/2026-05-11/cuda-bringup-operator-steps.md` for a worked example of a half-installed driver |
| CUDA toolkit (`nvcc`) | only needed if *building* the fork; not shipped, not auto-installed | same |
| node-llama-cpp CUDA prebuilt | auto-downloads with `app-core` deps | auto-downloads with `app-core` deps |
| eliza-1 fork CUDA build (`llama-server`) | **not built, not downloaded** today | **not built, not downloaded** today |
| What runs eliza-1 today | Vulkan fork build if present (CI smoke artifact), else CPU fork, else node-llama-cpp (which can't load the custom GGML types → load error) | same |

## 2. Detection: what exists

- **`probeHardware()`** (`local-inference/hardware.ts`) is the only real GPU
  probe. It `import()`s `node-llama-cpp`, calls `getLlama({ gpu: "auto" })`, and
  reads `llama.gpu` (`"cuda" | "metal" | "vulkan" | false`) plus
  `getVramState()`. It feeds `recommendBucket()` (small/mid/large/xl from
  effective VRAM/RAM) and `deviceCapsFromProbe()` (which backends a bundle may
  install). Exposed at `GET /api/local-inference/hardware`. **It does not run
  `nvidia-smi`** and there is no driver-presence / driver-version check. If the
  binding's prebuilt is missing it returns `source: "os-fallback"` with
  `gpu: null`, i.e. it cannot tell "no GPU" from "GPU present, binding not
  installed".
- **The fork backend selector** (`dflash-server.ts` `platformKey()` /
  `fusedBackendKey()`) was the real gap: it picked `cuda` *only* when
  `CUDA_VISIBLE_DEVICES` was set and not `-1`. That env var is essentially never
  set on a desktop launch, so the runtime always keyed `…-cpu` and would run the
  CPU fork build *even with a CUDA fork build sitting in the managed bin dir*.
  **Fixed in this change** (see §5): when no `*_VISIBLE_DEVICES`/`ELIZA_DFLASH_BACKEND`
  override is present, the selector now probes
  `<root>/bin/dflash/<plat>-<arch>-{cuda,vulkan,rocm}[-fused]/llama-server` on
  disk and prefers the first that exists. So a downloaded/built CUDA fork
  artifact is now actually used.
- **Backend dispatch** (`backend.ts` `decideBackend()`): chooses node-llama-cpp
  vs. llama-server purely from catalog metadata (`requiresKernel`,
  `runtime.dflash`, `preferredBackend`) + `ELIZA_LOCAL_BACKEND`. It is *not*
  GPU-aware: within `llama-server` the cuda-vs-vulkan-vs-cpu choice is the
  build-dir selection above, not a dispatcher decision. There is **no
  "detected RTX 4090, using CUDA" message surfaced anywhere**: the hardware
  probe data is available via the API but nothing renders a confirmation.

## 3. Graceful degradation: what the chain actually is

For eliza-1 (kernel-required tiers), the de-facto fallback is:

1. fused CUDA fork build
2. stock CUDA fork build
3. fused Vulkan
4. stock Vulkan
5. fused CPU
6. stock CPU fork
7. node-llama-cpp (which **cannot load** the custom GGML types, so it errors)

What the dispatcher actually walks: `BackendDispatcher.load()` picks
`llama-server` (kernel-required) and calls `dflashLlamaServer.load(plan)`.
`engine.ts` only falls back to node-llama-cpp when the decision reason is the
*soft* `"preferred-backend"` and `!dflashRequired()`; kernel-required loads
**do not** fall back, and the error propagates. That's correct (a node-llama-cpp
fallback would fail to load the GGUF anyway). The within-`llama-server` walk
(cuda→vulkan→cpu) is the new `accelBackendKey()` disk probe, but there is no
"warn the user we degraded from CUDA to CPU" message; it just runs slow.

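The first six steps of the chain can be expressed as an ordered list of build-dir keys. The sketch below is illustrative only: `forkFallbackOrder` and `pickForkBuild` are hypothetical names, the `hasBinary` callback stands in for whatever on-disk check the runtime uses, and the `degraded` flag is an assumption about how a "degraded from CUDA" warning could be driven.

```typescript
type Backend = "cuda" | "vulkan" | "cpu";

/** Build-dir keys in the order listed above: fused before stock, CUDA before Vulkan before CPU. */
function forkFallbackOrder(platform: string, arch: string): string[] {
  const backends: Backend[] = ["cuda", "vulkan", "cpu"];
  const keys: string[] = [];
  for (const backend of backends) {
    keys.push(`${platform}-${arch}-${backend}-fused`);
    keys.push(`${platform}-${arch}-${backend}`);
  }
  return keys;
}

/** Return the first key that has a binary, plus whether it is a step down from CUDA. */
function pickForkBuild(
  platform: string,
  arch: string,
  hasBinary: (key: string) => boolean,
): { key: string; degraded: boolean } | null {
  for (const key of forkFallbackOrder(platform, arch)) {
    if (hasBinary(key)) {
      return { key, degraded: !key.includes("-cuda") };
    }
  }
  return null; // no fork build at all: the caller should raise the "no binary" error
}
```

A caller that sees `degraded: true` has the information needed to tell the user the model is running on a slower backend rather than failing silently.
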
Failure modes and what the user sees today:

- **No fork build at all for the platform** → `resolveDflashBinary()` returns
  null → `getDflashRuntimeStatus()` reports `enabled: false, reason: "No
  compatible llama-server found. Set ELIZA_DFLASH_LLAMA_SERVER or run
  packages/app-core/scripts/build-llama-cpp-dflash.mjs."` and (with eliza-1
  loaded) `BackendDispatcher.load()` throws `unsatisfiedKernels` /
  "rebuild the fork". This is a clear error, but it points at a dev script, not
  a user action. `runDflashDoctor()` exposes this via the doctor report
  (`llama-server-binary` check → `fail`).
- **GPU has too little VRAM** → `assessFit()` returns `tight`/`wontfit` and
  `recommendBucket()` downsizes the tier; this is surfaced in the catalog UI.
  But if a too-big tier is force-loaded, `llama-server` either OOMs the GPU
  (CUDA OOM, hard crash of the child) or, with `gpuLayers: "auto"`, spills to
  CPU silently and runs slow. No proactive warning.
- **Driver too old / absent** → `getLlama({ gpu: "auto" })` falls back to
  `gpu: false`; `probeHardware()` reports `gpu: null`. If a CUDA fork build is
  on disk, launching it against a missing/old `libcuda.so.1` fails at
  `dlopen`/`LoadLibrary` time: `llama-server` exits non-zero and the engine
  surfaces the spawn failure, but **there is no "your NVIDIA driver is missing
  or too old, install one that supports CUDA 12.x (≥ 525 on Linux / ≥ 527 on
  Windows)" message** mapping the cryptic loader error to an actionable fix.

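The missing mapping from a loader error to an actionable message might look like the sketch below. The matched substrings are assumptions: `libcuda.so.1` and `nvcuda.dll` are the library names given in §1, and the two longer strings are the standard CUDA runtime error texts for "no device" and "insufficient driver". None of them has been checked against real `llama-server` stderr, and `diagnoseSpawnFailure` / `driverHint` are hypothetical names.

```typescript
type SpawnDiagnosis = "driver-missing" | "driver-too-old" | "unknown";

// Assumed stderr fragments; verify against real output before relying on them.
const DRIVER_TOO_OLD = [
  "CUDA driver version is insufficient for CUDA runtime version",
];
const DRIVER_MISSING = [
  "libcuda.so.1",
  "nvcuda.dll",
  "no CUDA-capable device is detected",
];

/** Classify a failed llama-server launch from its captured stderr. */
function diagnoseSpawnFailure(stderr: string): SpawnDiagnosis {
  const text = stderr.toLowerCase();
  if (DRIVER_TOO_OLD.some((s) => text.includes(s.toLowerCase()))) {
    return "driver-too-old";
  }
  if (DRIVER_MISSING.some((s) => text.includes(s.toLowerCase()))) {
    return "driver-missing";
  }
  return "unknown";
}

/** Turn a diagnosis into a user-facing hint, or null when nothing useful can be said. */
function driverHint(diagnosis: SpawnDiagnosis): string | null {
  switch (diagnosis) {
    case "driver-missing":
      return "The NVIDIA driver appears to be missing. Install a driver that supports CUDA 12.x.";
    case "driver-too-old":
      return "The NVIDIA driver is too old for this build. Update to a driver that supports CUDA 12.x.";
    default:
      return null;
  }
}
```

Returning `null` for unknown failures keeps the original error visible instead of replacing it with a guess.
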
## 4. Installer integration: the plan

The installer (desktop first-run / `bun install` postinstall) should:

**(a) Detect the GPU.** Run `nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv,noheader` (Linux + Windows both ship it with the driver). Parse name, VRAM, and driver version. This is cheaper and more honest than spinning up the node-llama-cpp binding, and it tells us the driver version (which the binding does not). Fall back to `probeHardware()` for non-NVIDIA.

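Parsing the output of the query in (a) takes only a few lines. This sketch assumes the usual `csv,noheader` shape of one GPU per line, such as `NVIDIA GeForce RTX 4090, 24564 MiB, 550.54.14`; `NvidiaGpu` and `parseNvidiaSmiCsv` are hypothetical names, and spawning the process is left to the caller.

```typescript
interface NvidiaGpu {
  name: string;
  vramMiB: number;
  driverVersion: string;
}

/** Parse `name, memory.total, driver_version` CSV rows; malformed rows are skipped. */
function parseNvidiaSmiCsv(stdout: string): NvidiaGpu[] {
  const gpus: NvidiaGpu[] = [];
  for (const line of stdout.split(/\r?\n/)) {
    const fields = line.split(",").map((field) => field.trim());
    if (fields.length !== 3) continue;
    const [name = "", memory = "", driverVersion = ""] = fields;
    // memory.total is assumed to be reported as "<number> MiB".
    const mem = /^(\d+)\s*MiB$/.exec(memory);
    if (!name || !mem || !driverVersion) continue;
    gpus.push({ name, vramMiB: Number(mem[1]), driverVersion });
  }
  return gpus;
}
```

An empty result means either no NVIDIA GPU or unexpected output, so the caller should still fall back to `probeHardware()` in that case.
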
**(b) Get the right fork build onto disk.** Building the CUDA fork needs `nvcc`, and pulling a ~3 GB CUDA toolkit on every install is unacceptable. The correct call is the same one node-llama-cpp made: **ship prebuilt CUDA fork binaries as release artifacts** (per `windows-x64-cuda`, `linux-x64-cuda`, plus the `-fused` variants), and have the installer download the matching one into `<root>/local-inference/bin/dflash/<plat>-<arch>-<backend>/`. The build matrix already produces these (`build-llama-cpp-dflash.mjs --target linux-x64-cuda` etc., with `CAPABILITIES.json` emitted next to the binary); they're just not promoted to a downloadable release today. Concretely: add a `dflash-binaries` release job (parallel to `release-electrobun.yml`) that runs the existing build script for `{linux,windows}-x64-{cpu,vulkan,cuda}` (+ `-fused`), using the `gpu`-labelled self-hosted runner for the CUDA legs, and uploads each `<target>/` dir (binary + `CAPABILITIES.json`) as a GH release asset. Then add a small `local-inference:fetch-binary` resolver in the runtime that, when `resolveDflashBinary()` finds nothing, downloads the asset for `accelBackendKey()` (CUDA if `nvidia-smi` succeeds, else Vulkan, else CPU). Keep `build-llama-cpp-dflash.mjs` as the from-source path for devs and the `MILADY_ELIZA_SOURCE=local` workflow.

**(c) Warn on missing/old driver.** If `nvidia-smi` fails (driver absent) or reports `driver_version` below the CUDA-12.x floor, show a one-time card: "An NVIDIA GPU was detected but the driver is missing/outdated. eliza-1 will run on CPU (≈14× slower) until you install the driver: `https://www.nvidia.com/Download/index.aspx` (Windows) / `sudo ubuntu-drivers install` or your distro's `nvidia-driver-NNN` (Linux)." Map the `llama-server` `dlopen`/`LoadLibrary` failure to the same message. Link to `docs/.../cuda-bringup-operator-steps.md`-style guidance.

**(d) Pick model + context for the detected VRAM.** Already mostly there: `recommendBucket()` → `eliza-1-{0_6b,1_7b,9b,27b,...}`. Tighten it so the first-run default respects `nvidia-smi` VRAM directly (the current heuristic weights `max(vram*1.25, ram*0.5)`, which over-estimates on a 6 GB laptop dGPU + 32 GB RAM box). Surface the choice: "Detected RTX 4090 (24 GB) → using eliza-1-9b on CUDA" in onboarding.

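The over-estimate in (d) is easy to see with the numbers from the text. The first function below is the heuristic exactly as quoted; `vramFirstMemoryGb` is a hypothetical illustration of "respect VRAM directly", not the repo's implementation, and the bucket thresholds themselves are not shown because the source does not state them.

```typescript
/** The heuristic quoted above: max(vram * 1.25, ram * 0.5), in GB. */
function effectiveMemoryGb(vramGb: number, ramGb: number): number {
  return Math.max(vramGb * 1.25, ramGb * 0.5);
}

/** Hypothetical tightening: when a GPU reports VRAM, size by VRAM alone. */
function vramFirstMemoryGb(vramGb: number | null, ramGb: number): number {
  return vramGb !== null && vramGb > 0 ? vramGb : ramGb * 0.5;
}

// 6 GB laptop dGPU + 32 GB system RAM, the case named in the text.
const laptop = { vramGb: 6, ramGb: 32 };
console.log(effectiveMemoryGb(laptop.vramGb, laptop.ramGb)); // 16 (max of 7.5 and 16)
console.log(vramFirstMemoryGb(laptop.vramGb, laptop.ramGb)); // 6
```

The RAM term wins (16 GB against 7.5 GB), so the current heuristic sizes the tier as if the GPU had more than twice its real memory.
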
**Windows flow:** `nvidia-smi` → if OK and the driver is at or above the CUDA 12.x floor (≥ 527): download `windows-x64-cuda[-fused]` fork build → pick tier from VRAM → "using CUDA" confirmation. If `nvidia-smi` fails: download `windows-x64-vulkan` (all of NVIDIA/AMD/Intel ARC expose Vulkan 1.3) + show driver-install card → still works, just slower. If no Vulkan-capable GPU: `windows-x64-cpu`. node-llama-cpp's own CUDA/Vulkan prebuilts come down with `app-core` deps regardless (used for the hardware probe + any stock GGUF).

**Linux flow:** identical, with `linux-x64-{cuda,vulkan,cpu}`, a driver floor of ≥ 525, and the driver-install hint pointing at `ubuntu-drivers` / distro package. The `cuda-bringup-operator-steps.md` report shows a real `dpkg --configure -a` half-install recovery worth linking from the warning.

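Both flows reduce to one small decision function. This is a sketch under stated assumptions: `HostProbe`, `InstallPlan`, and `planForkInstall` are hypothetical names, `nvidiaGpuPresent` assumes some detection path other than `nvidia-smi` (which cannot report a GPU whose driver is absent), and the floors are the major versions from §1.

```typescript
type ForkBackend = "cuda" | "vulkan" | "cpu";

interface HostProbe {
  platform: "windows" | "linux";
  nvidiaGpuPresent: boolean; // assumed to come from a probe other than nvidia-smi
  nvidiaDriverMajor: number | null; // null when nvidia-smi failed
  hasVulkan: boolean;
}

interface InstallPlan {
  target: string; // release asset / build-dir name, e.g. "linux-x64-cuda"
  showDriverCard: boolean;
}

const DRIVER_FLOOR = { windows: 527, linux: 525 } as const;

/** Choose which prebuilt fork target to download and whether to warn about the driver. */
function planForkInstall(host: HostProbe, fused = false): InstallPlan {
  const driverOk =
    host.nvidiaDriverMajor !== null &&
    host.nvidiaDriverMajor >= DRIVER_FLOOR[host.platform];
  const backend: ForkBackend = driverOk ? "cuda" : host.hasVulkan ? "vulkan" : "cpu";
  return {
    target: `${host.platform}-x64-${backend}${fused ? "-fused" : ""}`,
    // Only warn when an NVIDIA GPU exists but cannot be used through CUDA.
    showDriverCard: host.nvidiaGpuPresent && !driverOk,
  };
}
```

Keeping the decision pure (no process spawning, no downloads) makes the three branches of each flow straightforward to unit-test.
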
## 5. Changes made in this commit

`packages/app-core/src/services/local-inference/dflash-server.ts`: replaced the
two near-duplicate `platformKey()` / `fusedBackendKey()` env-only backend
selectors with a single `accelBackendKey(suffix)` helper. Precedence is now:
`ELIZA_DFLASH_BACKEND` override → `darwin` (`metal`) →
`HIP_/ROCR_VISIBLE_DEVICES` (`rocm`) → `CUDA_VISIBLE_DEVICES` (`cuda`) →
**disk probe** for an installed `…-{cuda,vulkan,rocm}[-fused]/llama-server`
(cuda preferred) → `cpu`. This is the smallest fix that makes a present-on-disk
CUDA fork build actually get used without the operator having to set
`CUDA_VISIBLE_DEVICES` by hand. `platformKey()` and `fusedBackendKey()` are now
thin wrappers, so all existing callers (`managedDflashBinaryPath`,
`managedFusedDflashDir`, `managedDflashCapabilitiesPath`) pick it up. Typecheck
clean; `dflash-server.test.ts` 27/27 pass.

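The disk-probe step can be isolated with the filesystem check injected, which makes it testable without touching real directories. This is a sketch rather than the committed code: `probeInstalledBackend` and `serverBinaryNames` are hypothetical names, and the `.exe` handling is an assumption. The change described above probes only `llama-server`, and the source does not say what the binary is called in Windows build dirs.

```typescript
type Accel = "cuda" | "vulkan" | "rocm";

/** Candidate binary names; trying "llama-server.exe" on win32 is an assumption. */
function serverBinaryNames(platform: string): string[] {
  return platform === "win32"
    ? ["llama-server.exe", "llama-server"]
    : ["llama-server"];
}

/** Return the first accelerated backend with a server binary on disk, else "cpu". */
function probeInstalledBackend(
  binRoot: string,
  platform: string,
  arch: string,
  suffix: "" | "-fused",
  exists: (filePath: string) => boolean,
): Accel | "cpu" {
  for (const backend of ["cuda", "vulkan", "rocm"] as const) {
    const dir = `${binRoot}/${platform}-${arch}-${backend}${suffix}`;
    if (serverBinaryNames(platform).some((name) => exists(`${dir}/${name}`))) {
      return backend;
    }
  }
  return "cpu";
}
```

In production the `exists` argument would be something like `fs.existsSync`; in tests it can be a lookup in a `Set` of fake paths.
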
Not done here (recommended, larger): the `nvidia-smi`-based detector, the
prebuilt-CUDA-fork release job + runtime downloader, the missing/old-driver
warning card, and the VRAM-aware first-run tier pick. Those are §4 above.

packages/app-core/src/services/local-inference/dflash-server.ts

Lines changed: 52 additions & 24 deletions
@@ -361,19 +361,58 @@ function managedDflashBinaryPath(): string {
  * communicating over IPC"). We prefer the fused binary over the stock one
  * whenever both exist for the active backend.
  */
-function fusedBackendKey(): string {
+/**
+ * Resolve the llama-server fork backend tag for the current host.
+ *
+ * Precedence:
+ *   1. `ELIZA_DFLASH_BACKEND`: explicit operator override (any value).
+ *   2. `darwin` → always `metal`.
+ *   3. `HIP_VISIBLE_DEVICES` / `ROCR_VISIBLE_DEVICES` set → `rocm`.
+ *   4. `CUDA_VISIBLE_DEVICES` set (and not `-1`) → `cuda`.
+ *   5. **Installed-build probe**: if an accelerated fork build directory
+ *      exists under `<root>/bin/dflash/<platform>-<arch>-<backend>[-fused]/`
+ *      with a `llama-server` binary in it, prefer that backend (cuda before
+ *      vulkan before rocm). This is what makes a downloaded/built CUDA fork
+ *      artifact actually get used on a fresh Windows/Linux desktop install,
+ *      where none of the `*_VISIBLE_DEVICES` env vars are set. Without it
+ *      the runtime always keyed `…-cpu` and silently ran the CPU fork even
+ *      with a CUDA build sitting on disk.
+ *   6. Fall back to `cpu`.
+ *
+ * `suffix` is `"-fused"` for the omnivoice-grafted build dir, `""` for the
+ * stock build dir.
+ */
+function accelBackendKey(suffix: "" | "-fused"): string {
   const forced = process.env.ELIZA_DFLASH_BACKEND?.trim().toLowerCase();
-  const backend = forced
-    ? forced
-    : process.platform === "darwin"
-      ? "metal"
-      : process.env.HIP_VISIBLE_DEVICES || process.env.ROCR_VISIBLE_DEVICES
-        ? "rocm"
-        : process.env.CUDA_VISIBLE_DEVICES &&
-            process.env.CUDA_VISIBLE_DEVICES !== "-1"
-          ? "cuda"
-          : "cpu";
-  return `${process.platform}-${process.arch}-${backend}-fused`;
+  if (forced) return `${process.platform}-${process.arch}-${forced}${suffix}`;
+  if (process.platform === "darwin") {
+    return `${process.platform}-${process.arch}-metal${suffix}`;
+  }
+  if (process.env.HIP_VISIBLE_DEVICES || process.env.ROCR_VISIBLE_DEVICES) {
+    return `${process.platform}-${process.arch}-rocm${suffix}`;
+  }
+  if (
+    process.env.CUDA_VISIBLE_DEVICES &&
+    process.env.CUDA_VISIBLE_DEVICES !== "-1"
+  ) {
+    return `${process.platform}-${process.arch}-cuda${suffix}`;
+  }
+  for (const backend of ["cuda", "vulkan", "rocm"] as const) {
+    const dir = path.join(
+      localInferenceRoot(),
+      "bin",
+      "dflash",
+      `${process.platform}-${process.arch}-${backend}${suffix}`,
+    );
+    if (fs.existsSync(path.join(dir, "llama-server"))) {
+      return `${process.platform}-${process.arch}-${backend}${suffix}`;
+    }
+  }
+  return `${process.platform}-${process.arch}-cpu${suffix}`;
+}
+
+function fusedBackendKey(): string {
+  return accelBackendKey("-fused");
 }

 function managedFusedDflashDir(): string {
@@ -597,18 +636,7 @@ function candidateBinaryPaths(): string[] {
 }

 function platformKey(): string {
-  const forced = process.env.ELIZA_DFLASH_BACKEND?.trim().toLowerCase();
-  if (forced) return `${process.platform}-${process.arch}-${forced}`;
-  const backend =
-    process.platform === "darwin"
-      ? "metal"
-      : process.env.HIP_VISIBLE_DEVICES || process.env.ROCR_VISIBLE_DEVICES
-        ? "rocm"
-        : process.env.CUDA_VISIBLE_DEVICES &&
-            process.env.CUDA_VISIBLE_DEVICES !== "-1"
-          ? "cuda"
-          : "cpu";
-  return `${process.platform}-${process.arch}-${backend}`;
+  return accelBackendKey("");
 }

 export function resolveDflashBinary(): string | null {
