## Step 6: Health, backend, then pull the model — *before* first inference
+
+`GET /api/v1/health` returning 200 means the **server** is up. It does **not**
+mean inference will work. Before the first real request succeeds, three more
+things must be true: the backend for your modality is installed, the model's
+weights are **downloaded to disk**, and (on the first call) the model is loaded
+into memory. Treating health=200 as "ready" is the single biggest cause of a
+broken-looking integration.
+
+**Do not call `POST /api/v1/load` at startup.** Lemond lazy-loads the model
+into memory on the first inference request and handles that step on its own.
+Pre-loading is unreliable across lemond versions (the `/load` request body
+shape has changed between releases) and a malformed call can crash or
+destabilise the server before the user takes any action. Loading is the one
+step you let lemond do lazily; pulling is not.
+
+### Pull the model so it exists on disk
+
+Lazy-load only loads weights that are **already downloaded**. If the model was
+never pulled, the first inference does not error: lemond returns an empty or
+blank result with HTTP 200. So after health passes and the backend is
+installed, proactively pull the model:
+
+```http
+POST /api/v1/pull
+{"model": "Whisper-Large-v3-Turbo"}
+```
+
+This is **idempotent**: a no-op if the weights are already present, a download
+if they are not. Run it once during setup (after backend install, before the
+first user-triggered inference) and log the result.
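The pull step described here could be wrapped roughly as follows. This is a minimal sketch, not the project's launcher code: the helper names `build_pull_request` and `pull_model`, the bearer-token header, and the 30-minute timeout are all assumptions made for illustration, and it assumes the endpoint answers once the pull has finished. Only the path and the JSON body come from the text above.

```python
import json
import urllib.request
from typing import Optional


def build_pull_request(base_url: str, model: str,
                       api_key: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) the POST /api/v1/pull request for `model`."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Assumption: the server accepts a bearer token. Adjust to your setup.
        headers["Authorization"] = f"Bearer {api_key}"
    body = json.dumps({"model": model}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/api/v1/pull",
        data=body,
        headers=headers,
        method="POST",
    )


def pull_model(base_url: str, model: str, api_key: Optional[str] = None,
               timeout: float = 1800.0) -> int:
    """Send the pull and return the HTTP status.

    urllib raises urllib.error.HTTPError for 4xx/5xx responses, so a
    returned value here means the server accepted the request.
    """
    request = build_pull_request(base_url, model, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        status = response.status
    # Log the result so a missing model is diagnosable from the console.
    print(f"[setup] pull {model!r}: HTTP {status}")
    return status
```

Because the pull is idempotent, calling `pull_model(...)` on every setup run should be safe; the generous timeout only matters on the first run, when multi-GB weights may actually be downloaded.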
+
+- **Default model** (the one you chose in Step 2): pull it by name as above.
+- **Custom / user-overridden model:** do not assume it exists. Confirm it is a
+  real Lemonade model first via `GET /api/v1/models` (the **only** trusted
+  catalog; see [reference.md](reference.md)), then pull it the same way. A
+  model appearing in the catalog is **not** proof its weights are downloaded;
+  a successful pull is.
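For the custom-model case, the catalog check might look like the sketch below. It assumes `GET /api/v1/models` returns an OpenAI-style body of the form `{"data": [{"id": "..."}]}`; that shape is an assumption here, so confirm it against reference.md before relying on it. The function name `is_in_catalog` is hypothetical.

```python
from typing import Any, Mapping


def is_in_catalog(models_response: Mapping[str, Any], model: str) -> bool:
    """Return True if `model` is listed in a parsed /api/v1/models response.

    Assumption: the response is OpenAI-style, {"data": [{"id": ...}, ...]}.
    A True result only means the name is valid; it does not mean the
    weights are on disk, so a pull is still required afterwards.
    """
    entries = models_response.get("data", [])
    return any(
        isinstance(entry, Mapping) and entry.get("id") == model
        for entry in entries
    )
```

A caller would reject the user's override when this returns `False`, and otherwise proceed to the pull step for that model name.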
-Once `GET /api/v1/health` returns 200, the integration is ready. **Do not
-call `POST /api/v1/load` at startup.** Lemond lazy-loads models on the first
-inference request and handles this correctly on its own. Pre-loading is
-unreliable across lemond versions (request body shape has changed between
-releases) and a malformed `/load` call can crash or destabilise the server
-before the user takes any action.
+> **Silent-empty is almost always an unpulled model.** If inference returns an
+> empty string / blank output with no HTTP error, the model was not downloaded.
+> Check your pull step before debugging anything else; this is the failure mode
+> that wastes the most time. Log the pull result and the first inference result
+> (see Step 4) so this is diagnosable from the console, not by guesswork.
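One way to make the silent-empty case visible is to classify every first-inference result before handing it to the UI. The sketch below is illustrative only; `classify_inference` and its three labels are hypothetical names, not part of lemond.

```python
def classify_inference(status: int, text: str) -> str:
    """Label an inference result so the silent-empty case is never ignored.

    Returns one of:
      "http_error"        the server reported a failure
      "empty_check_pull"  HTTP 200 but blank output: suspect an unpulled model
      "ok"                HTTP 200 with non-blank output
    """
    if status != 200:
        label = "http_error"
    elif not text.strip():
        label = "empty_check_pull"
    else:
        label = "ok"
    # Log status and output length so the console shows what happened.
    print(f"[inference] status={status} chars={len(text)} -> {label}")
    return label
```

On `"empty_check_pull"`, the natural recovery is to rerun the pull for the active model and retry once, rather than surfacing an empty transcript or reply to the user.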
+
+### Surface the *whole* setup, not just model load
+
+First-run cold start is more than a model load. The full sequence is:
+
+```
+server spawn → health 200 → backend install → model download → model load → first result
+```
-**First-run latency is expected and must be surfaced to the user.** On the
-very first inference after a cold start, lemond loads the model into memory.
-This takes 10–30 seconds depending on model size and hardware. An app that
-makes no attempt to communicate this will look broken.
+On a fresh machine, backend install and model download can each take from tens
+of seconds to several **minutes** (multi-GB weights over the network). Model
+load alone takes 10–30 s. An app that shows nothing meanwhile will look frozen.
-Minimum: show a loading indicator or status message ("Starting local AI…")
-from the moment the user triggers inference until the first response arrives.
-The simplest implementation is a flag that is set when the first request is
-sent and cleared when the first response arrives.
+Minimum: show a loading indicator or status message ("Setting up local AI…")
+from the moment setup begins until the first response arrives, covering the
+*entire* sequence above, not just the final load. The simplest implementation
+is a flag set when setup/first-request starts and cleared when the first
+response arrives. After the model has been pulled and loaded once, later runs
+are fast; the long wait is first-run only.
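The flag described here can be sketched as a small state holder that also logs each stage. This is an assumption-laden illustration: the class name `SetupStatus`, the `enter` and `message` methods, and the stage strings (copied from the sequence above) are hypothetical, and a real app would bind `busy` to its own UI state.

```python
from typing import Optional

# Stage names taken from the first-run sequence; the last one clears the flag.
STAGES = (
    "server spawn",
    "health 200",
    "backend install",
    "model download",
    "model load",
    "first result",
)


class SetupStatus:
    """Tracks whether first-run setup is still in progress."""

    def __init__(self) -> None:
        self.busy = False
        self.stage: Optional[str] = None

    def enter(self, stage: str) -> None:
        """Record that `stage` has been reached and log one line for it."""
        if stage not in STAGES:
            raise ValueError(f"unknown setup stage: {stage!r}")
        self.stage = stage
        # The flag stays set for the whole sequence, not just the model load.
        self.busy = stage != STAGES[-1]
        print(f"[setup] {stage}")

    def message(self) -> str:
        """Text for the loading indicator; empty once setup has finished."""
        if not self.busy:
            return ""
        return f"Setting up local AI… ({self.stage})"
```

The launcher would call `status.enter("server spawn")` when it starts the subprocess and `status.enter("first result")` when the first response arrives, while the UI shows `status.message()` for as long as `status.busy` is true.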
 ## Step 7: Lifecycle and recovery

 These are the only failure modes worth handling. Do not over-engineer.

 | Symptom | Cause | Recovery |
 |---|---|---|
-| `POST /api/v1/load` returns 404 / model not found | Model not pulled yet | `POST /api/v1/pull` with `{"model": "..."}` then retry `/api/v1/load` |
+| **Inference returns empty / blank with HTTP 200, no error** | Model never pulled: backend is installed but weights are absent, so lazy-load has nothing to load | `POST /api/v1/pull` with `{"model": "..."}`, wait for success, retry. Log the pull result and the first inference result. This is the most common silent failure; see [Step 6](#step-6-health-backend-then-pull-the-model--before-first-inference) |
+| `POST /api/v1/load` returns 404 / model not found | Model not pulled yet (same root cause as the empty-result row above) | `POST /api/v1/pull` with `{"model": "..."}` then retry `/api/v1/load` |
 | Subprocess exits immediately | Port race: another process grabbed the port between `freePort()` and lemond binding | The reference launcher retries with a fresh port automatically (3 attempts) |
 | `/api/v1/health` never returns 200 | First-run backend extraction is slow on cold disk | Extend timeout to 90s on first launch, 30s after |
@@ -422,13 +512,19 @@ The integration is done when **all** of these are true:
   `lemonade[.exe]`, `LICENSE`, and `resources/`, not just the binary.
 - [ ] `lemond` starts as a subprocess with a fresh API key per launch.
 - [ ] `GET /api/v1/health` returns 200 within the timeout.
+- [ ] The default model is pulled (or bundled) before the first inference; a
+  custom/overridden model is confirmed via `GET /api/v1/models` and then
+  pulled. A blank result with no error means this step was skipped.
+- [ ] Each lifecycle stage logs a clear line (spawn, health, backend install,
+  model pull, first result) so a failure is diagnosable from the console.
 - [ ] The existing client's chat / image / speech call returns a valid
   response with the base URL and key swapped, with no other code changed.
 - [ ] First-run latency is surfaced: the UI shows a loading state from the
   moment the first inference request is sent until the response arrives.
 - [ ] The HTTP client timeout is set to at least 120 seconds.
-- [ ] In local mode the app's API-key gate is bypassed: no onboarding wall,
-  validator, or startup check blocks the user for lacking a cloud key.
+- [ ] In local mode the app requires **no** cloud API key: no onboarding wall,
+  validator, or startup check blocks the user, and no code path throws
+  "API key not configured" when the active mode is local.
 - [ ] If the app uses a dev-mode file watcher, `vendor/lemonade/` is excluded
   from the watched paths so runtime writes by lemond do not trigger restarts.
 - [ ] Killing the parent process leaves no `lemond` subprocess behind.