Merged
28 changes: 23 additions & 5 deletions skills/local-ai-app-integration/SKILL.md
@@ -70,6 +70,10 @@ Record three things before continuing:
3. **One single place** where the base URL and API key are constructed. If
there isn't one, refactor to one before going further. Local-mode toggling
must flip exactly one config object.
4. **Any API-key gating** that blocks the app before a key is entered
(onboarding walls, validators that reject empty keys, startup checks that
disable AI until a key exists). Note each one — Step 5 bypasses them in
local mode.
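
Items 3 and 4 can be pictured with a minimal Python sketch. Everything named here is a hypothetical illustration rather than part of Lemonade or any particular app: `EndpointConfig`, `build_config`, the `api.example.com` cloud URL, the default port, and the `"local"` placeholder key are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EndpointConfig:
    """The single object the rest of the app reads its endpoint from."""
    base_url: str
    api_key: str
    local_mode: bool


def build_config(local_mode: bool, cloud_key: str = "", local_port: int = 8000) -> EndpointConfig:
    """Construct the base URL and API key in exactly one place."""
    if local_mode:
        # ASSUMPTION: illustrative local URL; the real port comes from the launcher.
        return EndpointConfig(f"http://127.0.0.1:{local_port}/api/v1", "local", True)
    # ASSUMPTION: placeholder cloud endpoint, used only for illustration.
    return EndpointConfig("https://api.example.com/v1", cloud_key, False)
```

With this shape, toggling local mode means calling `build_config` with a different flag; no other call site needs to know which backend is active.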

## Step 2: Pick a model + backend profile

@@ -83,7 +87,8 @@ it.
| Coding assistant | `Qwen2.5-Coder-7B-Instruct-GGUF` | `llamacpp` | Strong code, runs on iGPU |
| Vision / multimodal chat | `Gemma-4-E2B-it-GGUF` | `llamacpp` | Small multimodal default |
| NPU-first on Ryzen AI | `Llama-3.2-3B-Instruct-Hybrid` | `ryzenai-llm` | XDNA2 NPU on Windows |
| Speech-to-text | `Whisper-Large-v3-Turbo` | `whispercpp` | Best quality/speed |
| CPU speech-to-text | `Whisper-Large-v3-Turbo` | `whispercpp` | Best quality/speed |
| NPU speech-to-text | `whisper-v3-turbo-FLM` | `flm` | XDNA2 NPU on Windows |
| Text-to-speech | `kokoro-v1` | `kokoro` | CPU-only, low latency |
| Image generation | `SDXL-Turbo` | `sd-cpp` | Single-step generation |

@@ -93,7 +98,7 @@ unset. Override only if the app has hard hardware requirements.

For more options and tradeoffs, see [reference.md](reference.md).

## Step 3: Place Embeddable Lemonade in the app's tree
## Step 3: Place Embeddable Lemonade in the app's tree and install backends

Get the embeddable artifact from the latest Lemonade release:

@@ -131,9 +136,11 @@ vendor/lemonade/
private to the app. Leave as `auto` only if the user explicitly wants to
share weights with other apps.

Strip what you don't ship: delete the `lemonade` CLI and
`resources/defaults.json` from the shipping artifact once `config.json` is
initialized.
**Install the backend before running any model.** Right after placing
`lemond`, install the backend your chosen recipe needs — a model won't load
without it. Use the CLI at packaging time, e.g. `lemonade backends install
flm:npu` (or `llamacpp:vulkan`, `sd-cpp:cpu`, etc.), or `POST /v1/install`
at first run for hardware-specific backends like `llamacpp:rocm`.
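
For the first-run path, a rough Python sketch of building the `POST /v1/install` request follows. The request body shape is an assumption: the field names `recipe` and `backend` are illustrative only and are not confirmed by this document, so check the endpoint reference before relying on them. `build_install_request` is a hypothetical helper name.

```python
import json
import urllib.request


def build_install_request(base_url: str, api_key: str, backend_spec: str) -> urllib.request.Request:
    """Build (but do not send) a POST /v1/install request for e.g. 'llamacpp:rocm'."""
    recipe, _, backend = backend_spec.partition(":")
    if not recipe or not backend:
        raise ValueError(f"expected '<recipe>:<backend>', got {backend_spec!r}")
    # ASSUMPTION: these body field names are illustrative, not confirmed.
    body = json.dumps({"recipe": recipe, "backend": backend}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/install",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would send it once `lemond` is running; keeping construction separate from sending makes the request easy to inspect in a test.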

## Step 4: Add a `lemond` launcher

@@ -242,6 +249,15 @@ and the API key. Nothing else.
The model identifier on requests stays a Lemonade model name (e.g.
`Qwen3-4B-GGUF`), not the cloud name.

**Bypass the app's API-key gate in local mode.** A local backend needs no
cloud key, so any onboarding wall, validator, or startup check that demands
one must not block local-mode users. Skip or auto-satisfy the key-entry
screen, treat local mode as already-authorized in validation logic, and
re-enable the gate only for cloud mode. The `lemond` key from Step 4 is set
internally by the launcher, so the user never enters one and any UI
placeholder (e.g. `"local"`) is fine. Flipping into local mode should never
strand the user on a key-entry wall.
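
The gate rule above reduces to a small predicate. This is a minimal sketch, assuming the app tracks a boolean local-mode flag and a cloud key string; `key_gate_blocks` is a hypothetical name, not an existing function in any app or in Lemonade.

```python
def key_gate_blocks(local_mode: bool, cloud_key: str) -> bool:
    """Return True when the key-entry wall should be shown."""
    if local_mode:
        # Local mode is treated as already authorized; no cloud key is needed.
        return False
    # Cloud mode keeps the original gate: an empty or blank key blocks the app.
    return not cloud_key.strip()
```

Routing every onboarding wall, validator, and startup check through one predicate like this keeps the bypass in a single place, so cloud-mode behavior stays unchanged.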

**Python (openai) example:**

```python
@@ -303,6 +319,8 @@ The integration is done when **all** of these are true:
- [ ] The default model loads successfully via `POST /v1/load`.
- [ ] The existing client's chat / image / speech call returns a valid
response with the base URL and key swapped, with no other code changed.
- [ ] In local mode the app's API-key gate is bypassed: no onboarding wall,
validator, or startup check blocks the user for lacking a cloud key.
- [ ] Killing the parent process leaves no `lemond` subprocess behind.
- [ ] On a fresh machine without the optimal backend, the app still works
via the Vulkan fallback bundled in `bin/`.
48 changes: 18 additions & 30 deletions skills/local-ai-app-integration/reference.md
@@ -12,7 +12,6 @@ by the default-path tables.
- [Endpoint reference](#endpoint-reference)
- [Config keys you may need to set](#config-keys-you-may-need-to-set)
- [Per-model tuning via recipe_options.json](#per-model-tuning-via-recipe_optionsjson)
- [Trimming the bundled artifact](#trimming-the-bundled-artifact)
- [Linux packaging notes](#linux-packaging-notes)

---
@@ -39,13 +38,16 @@ hardware-optimized one at first run after a system probe.
| `flm` | `npu` | XDNA2 NPU | Cannot be packaging-time bundled on Linux. |
| `ryzenai-llm` | `npu` | XDNA2 NPU | Windows only. Best for the Hybrid model family. |

### Speech-to-text (`whispercpp`)
### Speech-to-text

| Backend | Hardware | OS |
|---|---|---|
| `npu` | XDNA2 NPU | Windows |
| `vulkan` | x86_64 CPU | Linux |
| `cpu` | x86_64 CPU | Windows, Linux |
Two NPU paths exist. **Prefer `flm` for NPU**.

| Recipe | Backend | Model | Hardware | OS |
|---|---|---|---|---|
| `flm` | `npu` | `whisper-v3-turbo-FLM` | XDNA2 NPU | Windows |
| `whispercpp` | `cpu` | `Whisper-Large-v3-Turbo` | x86_64 CPU | Windows, Linux |
| `whispercpp` | `vulkan` | `Whisper-Large-v3-Turbo` | x86_64 CPU | Linux |
| `whispercpp` | `npu` | `.rai`-cached whisper model | XDNA2 NPU | Windows (avoid) |

### Text-to-speech

@@ -76,11 +78,16 @@ ship a default and document how to override.
| Multimodal (vision) chat | `Gemma-4-E2B-it-GGUF` | 2.0 GB | `llamacpp` |
| Hybrid NPU chat (Ryzen AI) | `Llama-3.2-3B-Instruct-Hybrid` | 2.0 GB | `ryzenai-llm` |
| Speech-to-text | `Whisper-Large-v3-Turbo` | 1.6 GB | `whispercpp` |
| NPU speech-to-text (Ryzen AI) | `whisper-v3-turbo-FLM` | 0.6 GB | `flm` |
| Text-to-speech | `kokoro-v1` | 0.3 GB | `kokoro` |
| Image generation | `SDXL-Turbo` | 6.9 GB | `sd-cpp` |

For the full catalog, fetch `GET /v1/models` after starting `lemond`, or
read `vendor/lemonade/resources/server_models.json`.
For a catalog with more models, fetch `GET /v1/models` after starting `lemond`.
This is the **only** trusted source of available models. Never read or trust
`vendor/lemonade/resources/server_models.json` (or any other static file) as a
model catalog; it can be stale or incomplete. A model only appears in
`GET /v1/models` once its backend is installed (see Step 3), so install the
backend first or the list will look empty/incomplete.
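
A short Python sketch of treating the live endpoint as the catalog follows. It assumes the response uses the OpenAI-style `{"data": [{"id": ...}]}` list shape, which this document does not spell out; `model_ids` and `fetch_model_ids` are hypothetical helper names.

```python
import json
import urllib.request


def model_ids(models_payload: dict) -> set[str]:
    """Extract model ids from an OpenAI-style {'data': [{'id': ...}]} payload."""
    # ASSUMPTION: the response follows the OpenAI list-models shape.
    return {entry["id"] for entry in models_payload.get("data", []) if "id" in entry}


def fetch_model_ids(base_url: str, api_key: str) -> set[str]:
    """Query the running server for its live model list."""
    req = urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return model_ids(json.load(resp))
```

A caller could check `"whisper-v3-turbo-FLM" in fetch_model_ids(...)` before loading, and treat a missing id as a hint that the backend has not been installed yet.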

---

@@ -135,7 +142,7 @@ All endpoints require `Authorization: Bearer {key}` when
| Endpoint | Purpose |
|---|---|
| `GET /api/v1/health` | Readiness probe and loaded-model list |
| `GET /api/v1/models` | List available models (filtered by `server_models.json`) |
| `GET /api/v1/models` | List available models |
| `POST /api/v1/chat/completions` | OpenAI Chat Completions (text + vision + tool calls) |
| `POST /api/v1/embeddings` | OpenAI Embeddings |
| `POST /api/v1/audio/transcriptions` | OpenAI Whisper-style transcription |
@@ -189,7 +196,7 @@ hand-editing `config.json`, or at runtime via `POST /internal/set`.
| `llamacpp_backend` | string | Pin to `rocm` / `vulkan` / `cpu` / `metal`; leave unset for auto |
| `llamacpp_args` | string | Raw args appended to `llama-server` |
| `sdcpp_backend` | string | `rocm` / `cpu` |
| `whispercpp_backend` | string | `npu` / `vulkan` / `cpu` |
| `whispercpp_backend` | string | `npu`/`cpu` (Windows), `cpu`/`vulkan` (Linux). For NPU prefer the `flm` recipe instead |
| `whispercpp_args` | string | Raw whisper.cpp args |
| `flm_args` | string | Raw FastFlowLM args |
| `steps` | int | SD step count |
@@ -227,25 +234,6 @@ next to `config.json`. Example:
```

This file is consulted on every model load. No restart required.

---

## Trimming the bundled artifact

The shipping artifact should be the smallest possible footprint. Strip:

| File / dir | Keep? | Reason |
|---|---|---|
| `lemond[.exe]` | Yes | The only required binary |
| `lemonade[.exe]` (CLI) | **No** | Only useful for packaging-time config; remove from installer |
| `LICENSE` | Yes | Required by Apache 2.0 |
| `resources/server_models.json` | Yes | Trim to only the models the app exposes |
| `resources/backend_versions.json` | Yes | Pins backend versions for reproducibility |
| `resources/defaults.json` | **No** (after first launch) | Only consumed once to seed `config.json` |
| `bin/<recipe>/<backend>/` | Yes (one) | Bundle just the universal fallback (e.g. `llamacpp/vulkan`) |
| `bin/<recipe>/<other backends>/` | **No** | Install on demand via `/v1/install` |
| `models/` | Optional | Bundle one default model for offline install, or pull on first run |

---

## Linux packaging notes
3 changes: 2 additions & 1 deletion walkthroughs/README.md
@@ -8,4 +8,5 @@ Participants using other are still encouraged to participate. Just please note t

Please choose a skill to get started.

* [local-ai-use](./local-ai-use.md): Teach your agent how to run image generation locally.
* [local-ai-use](./local-ai-use.md): Teach your agent how to run image generation locally.
* [local-ai-app-integration](./local-ai-app-integration.md): Add a local AI mode to a cloud-only app.
78 changes: 78 additions & 0 deletions walkthroughs/local-ai-app-integration.md
@@ -0,0 +1,78 @@
# AMD Skills Walkthroughs: `local-ai-app-integration`

The goal of this skill is to teach your AI agent to add a **local AI mode** to an
existing app that today only talks to cloud AI APIs.

For this walkthrough we use [`danielholanda/dictate`](https://github.com/danielholanda/dictate),
a Windows dictation app that currently sends every recording to cloud
speech-to-text providers (Groq, Deepgram, Cartesia, Gemini, Mistral, etc.).

## Prerequisites
The sample app used here requires the Rust toolchain (install from https://rustup.rs/).

Because this walkthrough runs transcription on the NPU, you need a Ryzen AI PC with an XDNA2 NPU (Strix, Strix Halo, Kraken, or Gorgon Point) running Windows.

## Step 1 - Get the target app

* Clone the cloud-only app you want to upgrade:

```
git clone https://github.com/danielholanda/dictate.git
cd dictate
```

## Step 2 - Understanding which skills are available

* Run `claude "Which skills can you see?" --model opus`. You should see a list of skills that should *not* include anything related to local AI app integration.

## Step 3 - Enabling Claude to see `local-ai-app-integration`

In the future this will be enabled directly through Claude's marketplace. For now, we have to add it manually.

* Clone `https://github.com/amd/skills`
* Move the `local-ai-app-integration` skill from the repo to `.claude/skills/`
* Run `claude "Which skills can you see?" --model opus`. You should see a list of skills that includes `local-ai-app-integration`.

## Step 4 - Running the skill

Run `claude --model opus` inside the `dictate` repo, then enter the prompt:

```
This app sends my dictation audio to cloud speech-to-text providers.
Add a local AI mode that runs transcription on my machine instead by default.
I want it to run using the NPU. Keep the cloud providers as an option and minimize code changes.
```

Claude should:

1. Survey where the app calls its cloud transcription APIs.
2. Pick a local speech-to-text model + backend (e.g. `whisper-v3-turbo-FLM` using the `flm` NPU backend).
3. Vendor the Embeddable Lemonade (`lemond`) binary into the app tree.
4. Add a launcher that spawns `lemond` on a free port.
5. Re-point the app's existing client at the local endpoint and wait for `/v1/health`.
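
For orientation, the free-port and health-wait parts of steps 4 and 5 might look roughly like the Python sketch below. It is an illustration under assumptions, not the code Claude will generate: `pick_free_port` and `wait_for_health` are hypothetical names, the health URL is passed in by the caller, spawning `lemond` is left out because its command-line flags are not covered here, and the probe sends no `Authorization` header.

```python
import socket
import time
import urllib.error
import urllib.request


def pick_free_port() -> int:
    """Ask the OS for an unused TCP port on the loopback interface."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))
        return sock.getsockname()[1]


def wait_for_health(url: str, timeout_s: float = 60.0, interval_s: float = 0.5) -> bool:
    """Poll the health URL until it answers HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # Server not up yet (or returned an error); retry after a pause.
        time.sleep(interval_s)
    return False
```

Binding to port 0 lets the operating system choose the port, which avoids hard-coding one that another process may already hold. There is still a small window between releasing the port and the server binding it, so a real launcher should handle a failed start.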

Please note this may take several minutes as this app has a fairly large codebase.

## Step 5 - Running the modified app

Dictate is a Tauri (Rust + Node) app. From the repo root:

```
npm install
npm run tauri dev
```
Once the window opens, press the microphone button to speak, and confirm that transcription is now running through your local device instead of a cloud provider. The transcribed text should appear where your cursor was last located.

## Step 6 - (Optional) Going beyond

`local-ai-app-integration` works for any modality, not just speech-to-text. The
same pattern adds local chat, embeddings, image generation, or text-to-speech to
any app that already calls into the cloud. You can try using this skill to turn other cloud apps into local apps.

## Step 7 - (Optional) Try to get things done without AMD Skills

Remove the added skill from `.claude/skills/` and rerun the experiment above. Expect high variance in execution length and token usage. Common issues without the skill include:
* The model produces a local implementation that does not use NPU acceleration as instructed.
* The model invents a brittle local server setup that does not handle health checks, API keys, or shutdown.
* The model touches many files instead of flipping a single base-URL/key config object.
* The model writes a knowledge article instead of actually integrating local AI into the app.