
Commit 82af8fb

checks: Added brief sanity checks in Skills
1 parent a90bcc1 commit 82af8fb

3 files changed

Lines changed: 246 additions & 54 deletions


skills/local-ai-app-integration/SKILL.md

Lines changed: 132 additions & 36 deletions
@@ -46,14 +46,21 @@ This skill follows one fixed sequence. Do not deviate without a stated reason.
4646
[ ] 1. Survey the app's current AI integration
4747
[ ] 2. Pick a model + backend profile
4848
[ ] 3. Place Embeddable Lemonade in the app's tree (full package, not just the binary)
49-
[ ] 4. Add a `lemond` launcher (subprocess + API key + port)
49+
[ ] 4. Add a `lemond` launcher (subprocess + API key + port + per-stage logging)
5050
[ ] 5. Re-point the existing client at lemond (set HTTP timeout to 120s)
51-
[ ] 6. Wait for /api/v1/health — do not pre-load; surface first-run latency to user
51+
[ ] 6. Wait for /api/v1/health, install backend, then PULL the model before first use
5252
[ ] 7. Wire shutdown and error recovery
5353
```
5454

5555
Track progress against this checklist. Move on only when each step verifies.
5656

57+
> **Log every stage.** A local integration has many silent failure points —
58+
> spawn, health, backend install, model download, first inference. Without a
59+
> log line at each transition, "nothing happened" is indistinguishable from
60+
> "broke at stage 3." Emit one clear line per stage as you build (see
61+
> [Step 4](#step-4-add-a-lemond-launcher)); the most common dead-end in this
62+
> integration — a blank result with no error — is invisible without them.
63+
5764
---
5865

5966
## Step 1: Survey the app
@@ -91,8 +98,8 @@ it.
9198
| Coding assistant | `Qwen2.5-Coder-7B-Instruct-GGUF` | `llamacpp` | Strong code, runs on iGPU |
9299
| Vision / multimodal chat | `Gemma-4-E2B-it-GGUF` | `llamacpp` | Small multimodal default |
93100
| NPU-first on Ryzen AI | `Llama-3.2-3B-Instruct-Hybrid` | `ryzenai-llm` | XDNA2 NPU on Windows |
94-
| CPU Speech-to-text | `Whisper-Large-v3-Turbo` | `whispercpp` | Best quality/speed |
95-
| NPU speech-to-text | `whisper-v3-turbo-FLM` | `flm` | XDNA2 NPU on Windows |
101+
| Speech-to-text (Windows) | `Whisper-Large-v3-Turbo` | `whispercpp` | One model; probe picks NPU → iGPU/dGPU → CPU automatically |
102+
| Speech-to-text (Linux NPU) | `whisper-v3-turbo-FLM` | `flm` | Linux NPU path; falls back to `whispercpp` iGPU/CPU off-NPU |
96103
| Text-to-speech | `kokoro-v1` | `kokoro` | CPU-only, low latency |
97104
| Image generation | `SDXL-Turbo` | `sd-cpp` | Single-step generation |
98105

@@ -115,6 +122,12 @@ Download the file matching your target OS:
115122
- Windows: `lemonade-embeddable-{VERSION}-windows-x64.zip`
116123
- Linux: `lemonade-embeddable-{VERSION}-ubuntu-x64.tar.gz`
117124

125+
> **Don't hand-build the download URL from the tag.** The git tag carries a
126+
> leading `v` (e.g. `v10.8.0`) but the asset filename strips it
127+
> (`lemonade-embeddable-10.8.0-...`), so using the tag verbatim 404s. Ask the
128+
> GitHub API for the asset by its stable name pattern and use the URL it
129+
> returns, as below — this stays correct across version and naming changes.
130+
118131
**First, create the target directory** — it does not exist in a fresh repo:
119132

120133
```powershell
@@ -130,17 +143,22 @@ mkdir -p vendor/lemonade
130143
Then download and unpack on Windows (PowerShell):
131144

132145
```powershell
133-
$ver = (Invoke-RestMethod https://api.github.com/repos/lemonade-sdk/lemonade/releases/latest).tag_name
134-
Invoke-WebRequest "https://github.com/lemonade-sdk/lemonade/releases/download/$ver/lemonade-embeddable-$ver-windows-x64.zip" -OutFile lemond.zip
146+
$rel = Invoke-RestMethod https://api.github.com/repos/lemonade-sdk/lemonade/releases/latest
147+
$asset = $rel.assets | Where-Object { $_.name -like "lemonade-embeddable-*-windows-x64.zip" } | Select-Object -First 1
148+
Invoke-WebRequest $asset.browser_download_url -OutFile lemond.zip
135149
Expand-Archive lemond.zip -DestinationPath "$env:TEMP\lemond-unpack"
136-
Copy-Item -Recurse "$env:TEMP\lemond-unpack\lemonade-embeddable-$ver-windows-x64\*" vendor\lemonade\
150+
$folder = $asset.name -replace '\.zip$','' # unpacked dir = asset name without .zip
151+
Copy-Item -Recurse "$env:TEMP\lemond-unpack\$folder\*" vendor\lemonade\
152+
# Sanity check: resources/ must be nested under vendor\lemonade\ (not flattened)
153+
if (-not (Test-Path vendor\lemonade\resources\*.json)) { throw "resources/ missing — re-extract and copy again" }
137154
```
138155

139156
On Linux (bash):
140157

141158
```bash
142-
VER=$(curl -s https://api.github.com/repos/lemonade-sdk/lemonade/releases/latest | grep tag_name | cut -d'"' -f4)
143-
curl -L "https://github.com/lemonade-sdk/lemonade/releases/download/$VER/lemonade-embeddable-$VER-ubuntu-x64.tar.gz" | tar -xz --strip-components=1 -C vendor/lemonade
159+
URL=$(curl -s https://api.github.com/repos/lemonade-sdk/lemonade/releases/latest \
160+
| grep browser_download_url | grep ubuntu-x64.tar.gz | cut -d'"' -f4)
161+
curl -L "$URL" | tar -xz --strip-components=1 -C vendor/lemonade
144162
```
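For build scripts written in Python, the same "select the asset by name pattern" idea can be sketched as below. This is a minimal illustration, not part of the skill: `pick_asset_url` and `sample_release` are hypothetical names, and the sample payload only mirrors the fields the shell snippets above already read (`tag_name`, `assets[].name`, `assets[].browser_download_url`).

```python
from fnmatch import fnmatchcase


def pick_asset_url(release: dict, pattern: str) -> str:
    """Return the download URL of the first release asset whose name matches."""
    for asset in release.get("assets", []):
        if fnmatchcase(asset.get("name", ""), pattern):
            return asset["browser_download_url"]
    raise LookupError(f"no release asset matches {pattern!r}")


# Illustrative payload only; the URLs are placeholders, not real download links.
sample_release = {
    "tag_name": "v10.8.0",
    "assets": [
        {"name": "lemonade-embeddable-10.8.0-ubuntu-x64.tar.gz",
         "browser_download_url": "https://example.invalid/ubuntu.tar.gz"},
        {"name": "lemonade-embeddable-10.8.0-windows-x64.zip",
         "browser_download_url": "https://example.invalid/windows.zip"},
    ],
}

# The tag ("v10.8.0") is never spliced into a URL; only the asset name is matched.
url = pick_asset_url(sample_release, "lemonade-embeddable-*-windows-x64.zip")
print(url)
```

Raising when nothing matches keeps a renamed or missing asset from turning into a silent bad download later in the script.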
145163

146164
> **Copy the full package, not just the binary.** The archive contains
@@ -154,7 +172,9 @@ curl -L "https://github.com/lemonade-sdk/lemonade/releases/download/$VER/lemonad
154172
> only during development/build time to install backends. Install it once on
155173
> the developer machine with `pip install lemonade-sdk`.
156174
157-
The expected layout after unpacking and customization:
175+
The expected layout **after setup** (first run + backend install). A freshly
176+
unzipped package contains only `lemond[.exe]`, `lemonade[.exe]`, `LICENSE`, and
177+
`resources/` — the items below are created later, as their comments note:
158178

159179
```
160180
vendor/lemonade/
@@ -215,6 +235,23 @@ The launcher is a thin process supervisor. Its only jobs:
215235
3. Spawn `lemond <dir> --port <port>` with `LEMONADE_API_KEY` set.
216236
4. Expose the chosen `port` and `key` to the rest of the app.
217237
238+
> **Log one line per lifecycle stage.** Build the logging in from the start —
239+
> not as an afterthought when something breaks. Each silent transition needs a
240+
> visible marker so a failure points at the exact stage. Aim for:
241+
>
242+
> ```
243+
> [lemond] Starting on port <port>
244+
> [lemond] Healthy on port <port>
245+
> [lemond] <recipe>:<backend> installed (or: already installed / install failed)
246+
> [lemond] Pulling model <name>... then: Model <name> ready (or: pull returned <status>)
247+
> [local] <modality> result: <value> (first inference output — empty string here = unpulled model)
248+
> ```
249+
>
250+
> Logging the **first inference result verbatim** is what turns the
251+
> silent-empty failure (Step 6) from a multi-hour mystery into a one-line
252+
> diagnosis. Route these through the app's normal logging so they can be quieted
253+
> for release.
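One way to produce these lines through the app's normal logging, sketched in Python under the assumption that the launcher is Python-based. The helper names `stage_line` and `first_result_line` are hypothetical; only the line formats come from the list above.

```python
import logging

# Routed through the standard logging module so release builds can raise the level.
logger = logging.getLogger("lemond")


def stage_line(stage: str, detail: str = "") -> str:
    """Emit one '[lemond] ...' line for a lifecycle stage and return it."""
    line = f"[lemond] {stage} {detail}".rstrip()
    logger.info(line)
    return line


def first_result_line(modality: str, value: str) -> str:
    """Log the first inference output verbatim; warn if it is empty."""
    line = f"[local] {modality} result: {value!r}"
    if value.strip() == "":
        logger.warning("%s (empty output: check that the model was pulled)", line)
    else:
        logger.info(line)
    return line
```

For example, `stage_line("Starting on port", "8123")` logs `[lemond] Starting on port 8123`, and `first_result_line("speech-to-text", "")` logs the empty result at warning level so it stands out in the console.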
254+
218255
> **Dev-mode file watchers:** If the app runs with a file watcher (Tauri,
219256
> Electron, Next.js, Vite, etc.) that watches the source tree, ensure
220257
> `vendor/lemonade/` is excluded from the watched paths. Lemond writes config
@@ -341,14 +378,21 @@ and the API key. Nothing else.
341378
The model identifier on requests stays a Lemonade model name (e.g.
342379
`Qwen3-4B-GGUF`), not the cloud name.
343380
344-
**Bypass the app's API-key gate in local mode.** A local backend needs no
345-
cloud key, so any onboarding wall, validator, or startup check that demands
346-
one must not block local-mode users. Skip or auto-satisfy the key-entry
347-
screen, treat local mode as already-authorized in validation logic, and
348-
re-enable the gate only for cloud mode. The `lemond` key from Step 4 is set
349-
internally by the launcher, so the user never enters one and any UI
350-
placeholder (e.g. `"local"`) is fine. Flipping into local mode should never
351-
strand the user on a key-entry wall.
381+
**Local mode needs no cloud API key — at all.** This is a defining property of
382+
local mode, not an edge case: there is no cloud service to authenticate to, so
383+
nothing should ever ask the user for a key. Any onboarding wall, validator, or
384+
startup check that demands one must not block local-mode users. Concretely:
385+
386+
- Skip or auto-satisfy the key-entry screen in local mode.
387+
- Treat local mode as already-authorized in every validation path — an
388+
empty-key check must short-circuit to "valid" when the active mode is local,
389+
never throw "API key not configured".
390+
- Re-enable the gate **only** for cloud mode.
391+
392+
The `lemond` key from Step 4 is generated internally by the launcher and used
393+
only for the local loopback connection, so the user never sees or enters one;
394+
any UI placeholder (e.g. `"local"`) is fine. Flipping into local mode should
395+
never strand the user on a key-entry wall.
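The short-circuit described above could look like the following Python sketch. The function name `api_key_gate_passes` and the `"local"` / `"cloud"` mode strings are assumptions for illustration; the app's real validator will have its own names.

```python
from typing import Optional


def api_key_gate_passes(mode: str, cloud_api_key: Optional[str]) -> bool:
    """Return True when the user may proceed past the key-entry gate."""
    if mode == "local":
        # Local mode is always authorized: there is no cloud service to
        # authenticate to, so an empty or missing key must not block the user.
        return True
    # Cloud mode keeps the original gate: a non-blank key is required.
    return bool(cloud_api_key and cloud_api_key.strip())
```

Putting the mode check first means every validation path that calls this helper inherits the local-mode behaviour, rather than each call site needing its own exception.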
352396
353397
**Set the HTTP client timeout to at least 120 seconds.** The default timeout
354398
on most HTTP clients (30s) is shorter than the time lemond takes to load a
@@ -374,32 +418,78 @@ resp = client.chat.completions.create(
374418
)
375419
```
376420
377-
## Step 6: Wait for health — do not pre-load
421+
## Step 6: Health, backend, then pull the model — *before* first inference
422+
423+
`GET /api/v1/health` returning 200 means the **server** is up. It does **not**
424+
mean inference will work. Before the first real request succeeds, three more
425+
things must be true: the backend for your modality is installed, the model's
426+
weights are **downloaded to disk**, and (on the first call) the model is loaded
427+
into memory. Treating health=200 as "ready" is the single biggest cause of a
428+
broken-looking integration.
429+
430+
**Do not call `POST /api/v1/load` at startup.** Lemond lazy-loads the model
431+
into memory on the first inference request and handles that step on its own.
432+
Pre-loading is unreliable across lemond versions (the `/load` request body
433+
shape has changed between releases) and a malformed call can crash or
434+
destabilise the server before the user takes any action. Loading is the one
435+
step you let lemond do lazily — pulling is not.
436+
437+
### Pull the model so it exists on disk
438+
439+
Lazy-load only loads weights that are **already downloaded**. If the model was
440+
never pulled, the first inference does not error — lemond returns an empty /
441+
blank result with HTTP 200. So after health passes and the backend is
442+
installed, proactively pull the model:
443+
444+
```http
445+
POST /api/v1/pull
446+
{"model": "Whisper-Large-v3-Turbo"}
447+
```
448+
449+
This is **idempotent** — a no-op if the weights are already present, a download
450+
if they are not. Run it once during setup (after backend install, before the
451+
first user-triggered inference) and log the result.
452+
453+
- **Default model** (the one you chose in Step 2): pull it by name as above.
454+
- **Custom / user-overridden model:** do not assume it exists. Confirm it is a
455+
real Lemonade model first via `GET /api/v1/models` (the **only** trusted
456+
catalog — see [reference.md](reference.md)), then pull it the same way. A
457+
model appearing in the catalog is **not** proof its weights are downloaded;
458+
a successful pull is.
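A rough Python sketch of the pull step follows. Several details are assumptions rather than documented lemond behaviour: that the key is sent as a `Bearer` header, that `/api/v1/models` returns an OpenAI-style `{"data": [{"id": ...}]}` body, and that the pull request returns only once the download has finished. `make_request` and `ensure_model_pulled` are hypothetical helper names.

```python
import json
import urllib.error
import urllib.request
from typing import Callable, Optional, Tuple

Request = Callable[[str, str, Optional[dict]], Tuple[int, dict]]


def make_request(api_key: str, timeout: float = 600.0) -> Request:
    """Build a small JSON request helper (Bearer auth is an assumption)."""
    def request(method: str, url: str, body: Optional[dict]) -> Tuple[int, dict]:
        data = json.dumps(body).encode() if body is not None else None
        req = urllib.request.Request(
            url, data=data, method=method,
            headers={"Content-Type": "application/json",
                     "Authorization": f"Bearer {api_key}"})
        try:
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                raw = resp.read()
                status = resp.status
        except urllib.error.HTTPError as err:
            return err.code, {}
        try:
            parsed = json.loads(raw) if raw else {}
        except ValueError:
            parsed = {}
        return status, parsed if isinstance(parsed, dict) else {}
    return request


def ensure_model_pulled(base_url: str, model: str, request: Request,
                        is_default: bool = True) -> bool:
    """Pull `model`; for a custom model, confirm it is in the catalog first."""
    if not is_default:
        status, catalog = request("GET", f"{base_url}/api/v1/models", None)
        known = {m.get("id") for m in catalog.get("data", [])}
        if status != 200 or model not in known:
            print(f"[lemond] Model {model} not in catalog (status {status})")
            return False
    print(f"[lemond] Pulling model {model}...")
    status, _ = request("POST", f"{base_url}/api/v1/pull", {"model": model})
    if status == 200:
        print(f"[lemond] Model {model} ready")
        return True
    print(f"[lemond] pull returned {status}")
    return False
```

In use, this would be called once after backend install, for example `ensure_model_pulled(f"http://127.0.0.1:{port}", "Whisper-Large-v3-Turbo", make_request(key))`, where `port` and `key` come from the launcher in Step 4. The long timeout reflects that a first pull may download several gigabytes.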
378459

379-
Once `GET /api/v1/health` returns 200, the integration is ready. **Do not
380-
call `POST /api/v1/load` at startup.** Lemond lazy-loads models on the first
381-
inference request and handles this correctly on its own. Pre-loading is
382-
unreliable across lemond versions (request body shape has changed between
383-
releases) and a malformed `/load` call can crash or destabilise the server
384-
before the user takes any action.
460+
> **Silent-empty is almost always an unpulled model.** If inference returns an
461+
> empty string / blank output with no HTTP error, the model was not downloaded.
462+
> Check your pull step before debugging anything else; this is the failure mode
463+
> that wastes the most time. Log the pull result and the first inference result
464+
> (see Step 4) so this is diagnosable from the console, not by guesswork.
465+
466+
### Surface the *whole* setup, not just model load
467+
468+
First-run cold start is more than a model load. The full sequence is:
469+
470+
```
471+
server spawn → health 200 → backend install → model download → model load → first result
472+
```
385473

386-
**First-run latency is expected and must be surfaced to the user.** On the
387-
very first inference after a cold start, lemond loads the model into memory.
388-
This takes 10-30 seconds depending on model size and hardware. An app that
389-
makes no attempt to communicate this will look broken.
474+
On a fresh machine, backend install and model download can each take from tens
475+
of seconds to several **minutes** (multi-GB weights over the network). Model
476+
load alone is 10-30s. An app that shows nothing during this will look frozen.
390477

391-
Minimum: show a loading indicator or status message ("Starting local AI…")
392-
from the moment the user triggers inference until the first response arrives.
393-
The simplest implementation is a flag that is set when the first request is
394-
sent and cleared when the first response arrives.
478+
Minimum: show a loading indicator or status message ("Setting up local AI…")
479+
from the moment setup begins until the first response arrives — covering the
480+
*entire* sequence above, not just the final load. The simplest implementation
481+
is a flag set when setup/first-request starts and cleared when the first
482+
response arrives. Once the model is pulled and loaded once, subsequent runs are
483+
fast; the long wait is first-run only.
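The flag can be as small as the Python sketch below, assuming the UI polls or observes a shared status object. `SetupStatus` is a hypothetical class for illustration, not part of Lemonade.

```python
import threading
from contextlib import contextmanager


class SetupStatus:
    """Thread-safe 'setup in progress' flag for the UI to read."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._busy = False
        self.message = ""

    @property
    def busy(self) -> bool:
        with self._lock:
            return self._busy

    @contextmanager
    def running(self, message: str = "Setting up local AI…"):
        # Set the flag when setup / the first request starts...
        with self._lock:
            self._busy = True
            self.message = message
        try:
            yield self
        finally:
            # ...and clear it when the first response arrives or setup fails.
            with self._lock:
                self._busy = False
                self.message = ""
```

Wrapping the whole first-run sequence in one `with status.running():` block, rather than only the final inference call, is what keeps the indicator visible through backend install and model download as well.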
395484

396485
## Step 7: Lifecycle and recovery
397486

398487
These are the only failure modes worth handling. Do not over-engineer.
399488

400489
| Symptom | Cause | Recovery |
401490
|---|---|---|
402-
| `POST /api/v1/load` returns 404 / model not found | Model not pulled yet | `POST /api/v1/pull` with `{"model": "..."}` then retry `/api/v1/load` |
491+
| **Inference returns empty / blank with HTTP 200, no error** | Model never pulled: backend is installed but weights are absent, so lazy-load has nothing to load | `POST /api/v1/pull` with `{"model":"..."}`, wait for success, retry. Log the pulled result and the first inference result. This is the most common silent failure — see [Step 6](#step-6-health-backend-then-pull-the-model--before-first-inference) |
492+
| `POST /api/v1/load` returns 404 / model not found | Model not pulled yet (same root cause as the empty-result row above) | `POST /api/v1/pull` with `{"model": "..."}` then retry `/api/v1/load` |
403493
| `POST /api/v1/load` returns 500 with backend error | Backend not installed for this hardware | `GET /api/v1/system-info`, pick a supported backend, `POST /api/v1/install` with `{"recipe": "...", "backend": "..."}`, retry |
404494
| Subprocess exits immediately | Port race: another process grabbed the port between `freePort()` and lemond binding | The reference launcher retries with a fresh port automatically (3 attempts) |
405495
| `/api/v1/health` never returns 200 | First-run backend extraction is slow on cold disk | Extend timeout to 90s on first launch, 30s after |
@@ -422,13 +512,19 @@ The integration is done when **all** of these are true:
422512
`lemonade[.exe]`, `LICENSE`, and `resources/` — not just the binary.
423513
- [ ] `lemond` starts as a subprocess with a fresh API key per launch.
424514
- [ ] `GET /api/v1/health` returns 200 within the timeout.
515+
- [ ] The default model is pulled (or bundled) before the first inference; a
516+
custom/overridden model is confirmed via `GET /api/v1/models` and then
517+
pulled. A blank result with no error means this step was skipped.
518+
- [ ] Each lifecycle stage logs a clear line (spawn, health, backend install,
519+
model pull, first result) so a failure is diagnosable from the console.
425520
- [ ] The existing client's chat / image / speech call returns a valid
426521
response with the base URL and key swapped, with no other code changed.
427522
- [ ] First-run latency is surfaced: the UI shows a loading state from the
428523
moment the first inference request is sent until the response arrives.
429524
- [ ] The HTTP client timeout is set to at least 120 seconds.
430-
- [ ] In local mode the app's API-key gate is bypassed: no onboarding wall,
431-
validator, or startup check blocks the user for lacking a cloud key.
525+
- [ ] In local mode the app requires **no** cloud API key: no onboarding wall,
526+
validator, or startup check blocks the user, and no code path throws
527+
"API key not configured" when the active mode is local.
432528
- [ ] If the app uses a dev-mode file watcher, `vendor/lemonade/` is excluded
433529
from the watched paths so runtime writes by lemond do not trigger restarts.
434530
- [ ] Killing the parent process leaves no `lemond` subprocess behind.

skills/local-ai-app-integration/reference.md

Lines changed: 30 additions & 6 deletions
@@ -40,14 +40,12 @@ hardware-optimized one at first run after a system probe.
4040

4141
### Speech-to-text
4242

43-
Two NPU paths exist. **Prefer `flm` for NPU**.
44-
4543
| Recipe | Backend | Model | Hardware | OS |
4644
|---|---|---|---|---|
47-
| `flm` | `npu` | `whisper-v3-turbo-FLM` | XDNA2 NPU | Windows |
45+
| `whispercpp` | `vulkan` | `Whisper-Large-v3-Turbo` | AMD iGPU / dGPU | Windows, Linux |
4846
| `whispercpp` | `cpu` | `Whisper-Large-v3-Turbo` | x86_64 CPU | Windows, Linux |
49-
| `whispercpp` | `vulkan` | `Whisper-Large-v3-Turbo` | x86_64 CPU | Linux |
50-
| `whispercpp` | `npu` | `.rai`-cached whisper model | XDNA2 NPU | Windows (avoid) |
47+
| `whispercpp` | `npu` | `Whisper-Large-v3-Turbo` | XDNA2 NPU | Windows |
48+
| `flm` | `npu` | `whisper-v3-turbo-FLM` | XDNA2 NPU | Linux (runtime-install only) |
5149

5250
### Text-to-speech
5351

@@ -89,6 +87,14 @@ model catalog; it can be stale or incomplete. A model only appears in
8987
`GET /v1/models` once its backend is installed (see Step 3), so install the
9088
backend first or the list will look empty/incomplete.
9189

90+
**Catalogued ≠ downloaded.** A model listed by `GET /v1/models` is *available
91+
to use*, not necessarily present on disk. It must be **pulled**
92+
(`POST /api/v1/pull {"model":"..."}`) before it can serve — until then,
93+
inference returns an empty result with HTTP 200, not an error. The surest
94+
signal that a model is ready is a successful pull, not its presence in the
95+
catalog. See SKILL.md
96+
[Step 6](SKILL.md#step-6-health-backend-then-pull-the-model--before-first-inference).
97+
9298
---
9399

94100
## Hardware probing with /v1/system-info
@@ -120,13 +126,31 @@ Response shape (truncated):
120126
}
121127
```
122128

123-
Decision rules in priority order, for the default `llamacpp` recipe:
129+
The same pattern applies to **every** recipe: read the per-backend `state`,
130+
install the best one that is `installable`, use it if already `installed`, and
131+
fall back down the priority list otherwise. Apply it to whichever recipe matches
132+
the app's modality.
133+
134+
Decision rules in priority order, for the default `llamacpp` recipe (text gen):
124135

125136
1. If `recipes.llamacpp.backends.rocm.state == "installable"` →
126137
`POST /v1/install {"recipe":"llamacpp","backend":"rocm"}`.
127138
2. Else if `state == "installed"` for `vulkan` → use it as-is.
128139
3. Else fall back to `cpu`.
129140

141+
Decision rules for the `whispercpp` recipe (speech-to-text), NPU-first:
142+
143+
1. If `recipes.whispercpp.backends.npu.state == "installed"` → use NPU as-is.
144+
2. Else if `npu.state == "installable"` →
145+
`POST /v1/install {"recipe":"whispercpp","backend":"npu"}`, then use NPU.
146+
3. Else if `vulkan` is `installed`/`installable` → use the iGPU/dGPU path.
147+
4. Else fall back to `cpu`.
148+
149+
Probe **once**, cache the chosen backend for the session (the result does not
150+
change while the app runs), and log which backend was selected. This is the
151+
mechanism that lets one build run on an NPU machine and a CPU-only machine
152+
without any user configuration.
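The `whispercpp` rules and the probe-once caching can be sketched in Python as follows. The function names are hypothetical, and the sketch reads only the `recipes.<recipe>.backends.<backend>.state` path described above; any state value other than `"installed"` or `"installable"` is simply treated as unavailable.

```python
from typing import Optional, Tuple

_cached: Optional[Tuple[str, bool]] = None  # (backend, needs_install) for this session


def choose_whisper_backend(system_info: dict) -> Tuple[str, bool]:
    """Return (backend, needs_install) following the NPU-first rules."""
    backends = (system_info.get("recipes", {})
                .get("whispercpp", {}).get("backends", {}))
    for name in ("npu", "vulkan"):
        state = backends.get(name, {}).get("state")
        if state == "installed":
            return name, False
        if state == "installable":
            return name, True
    # Fall back to CPU; install it only if it is not already present.
    cpu_state = backends.get("cpu", {}).get("state")
    return "cpu", cpu_state != "installed"


def backend_for_session(system_info: dict) -> Tuple[str, bool]:
    """Probe once, cache the choice, and log which backend was selected."""
    global _cached
    if _cached is None:
        _cached = choose_whisper_backend(system_info)
        print(f"[lemond] whispercpp backend selected: {_cached[0]}")
    return _cached
```

When `needs_install` is true, the caller would issue `POST /v1/install {"recipe":"whispercpp","backend":"<backend>"}` before first use, as in the rules above.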
153+
130154
For Ryzen AI Hybrid models on Windows, additionally check
131155
`ryzenai-llm.backends.npu.state` and install if `installable`.
132156
