amd · danielholanda · Jun 12, 2026 · Jun 11, 2026 · Jun 11, 2026 · Jun 11, 2026
diff --git a/skills/local-ai-app-integration/SKILL.md b/skills/local-ai-app-integration/SKILL.md
@@ -70,6 +70,10 @@ Record three things before continuing:
 3. **One single place** where the base URL and API key are constructed. If
    there isn't one, refactor to one before going further. Local-mode toggling
    must flip exactly one config object.
+4. **Any API-key gating** that blocks the app before a key is entered
+   (onboarding walls, validators that reject empty keys, startup checks that
+   disable AI until a key exists). Note each one — Step 5 bypasses them in
+   local mode.
 
 ## Step 2: Pick a model + backend profile
 
@@ -83,7 +87,8 @@ it.
 | Coding assistant | `Qwen2.5-Coder-7B-Instruct-GGUF` | `llamacpp` | Strong code, runs on iGPU |
 | Vision / multimodal chat | `Gemma-4-E2B-it-GGUF` | `llamacpp` | Small multimodal default |
 | NPU-first on Ryzen AI | `Llama-3.2-3B-Instruct-Hybrid` | `ryzenai-llm` | XDNA2 NPU on Windows |
-| Speech-to-text | `Whisper-Large-v3-Turbo` | `whispercpp` | Best quality/speed |
+| CPU speech-to-text | `Whisper-Large-v3-Turbo` | `whispercpp` | Best quality/speed |
+| NPU speech-to-text | `whisper-v3-turbo-FLM` | `flm` | XDNA2 NPU on Windows |
 | Text-to-speech | `kokoro-v1` | `kokoro` | CPU-only, low latency |
 | Image generation | `SDXL-Turbo` | `sd-cpp` | Single-step generation |
 
@@ -93,7 +98,7 @@ unset. Override only if the app has hard hardware requirements.
 
 For more options and tradeoffs, see [reference.md](reference.md).
 
-## Step 3: Place Embeddable Lemonade in the app's tree
+## Step 3: Place Embeddable Lemonade in the app's tree and install backends
 
 Get the embeddable artifact from the latest Lemonade release:
 
@@ -131,9 +136,11 @@ vendor/lemonade/
   private to the app. Leave as `auto` only if the user explicitly wants to
   share weights with other apps.
 
-Strip what you don't ship: delete the `lemonade` CLI and
-`resources/defaults.json` from the shipping artifact once `config.json` is
-initialized.
+**Install the backend before running any model.** Right after placing
+`lemond`, install the backend your chosen recipe needs — a model won't load
+without it. Use the CLI at packaging time, e.g. `lemonade backends install
+flm:npu` (or `llamacpp:vulkan`, `sd-cpp:cpu`, etc.), or `POST /v1/install`
+at first run for hardware-specific backends like `llamacpp:rocm`.
 
 ## Step 4: Add a `lemond` launcher
 
@@ -242,6 +249,15 @@ and the API key. Nothing else.
 The model identifier on requests stays a Lemonade model name (e.g.
 `Qwen3-4B-GGUF`), not the cloud name.
 
+**Bypass the app's API-key gate in local mode.** A local backend needs no
+cloud key, so any onboarding wall, validator, or startup check that demands
+one must not block local-mode users. Skip or auto-satisfy the key-entry
+screen, treat local mode as already-authorized in validation logic, and
+re-enable the gate only for cloud mode. The `lemond` key from Step 4 is set
+internally by the launcher, so the user never enters one and any UI
+placeholder (e.g. `"local"`) is fine. Flipping into local mode should never
+strand the user on a key-entry wall.
+
 **Python (openai) example:**
 
 ```python
@@ -303,6 +319,8 @@ The integration is done when **all** of these are true:
 - [ ] The default model loads successfully via `POST /v1/load`.
 - [ ] The existing client's chat / image / speech call returns a valid
       response with the base URL and key swapped, with no other code changed.
+- [ ] In local mode the app's API-key gate is bypassed: no onboarding wall,
+      validator, or startup check blocks the user for lacking a cloud key.
 - [ ] Killing the parent process leaves no `lemond` subprocess behind.
 - [ ] On a fresh machine without the optimal backend, the app still works
       via the Vulkan fallback bundled in `bin/`.
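
The API-key gate item in the checklist above can be illustrated with a minimal sketch. It assumes the app already has the single config object that Step 1 asks for; the `AIConfig` class, its field names, the `key_gate_passed` helper, and both URLs are hypothetical and are not part of the skill or of Lemonade.

```python
from dataclasses import dataclass


@dataclass
class AIConfig:
    """Hypothetical single config object holding the base URL and API key."""
    mode: str          # "local" or "cloud"
    base_url: str
    api_key: str = ""  # in local mode the launcher fills this in, not the user


def key_gate_passed(cfg: AIConfig) -> bool:
    """Return True when the app may proceed past its API-key gate."""
    if cfg.mode == "local":
        # Local mode counts as already authorized: no cloud key is needed,
        # so onboarding walls and validators must not block the user.
        return True
    # Cloud mode keeps the original rule: a non-empty key is required.
    return bool(cfg.api_key.strip())


# Placeholder URLs; the real local port is whatever the launcher picked.
local = AIConfig(mode="local", base_url="http://127.0.0.1:8000/api/v1")
cloud_no_key = AIConfig(mode="cloud", base_url="https://api.example.com/v1")
cloud_with_key = AIConfig(mode="cloud", base_url="https://api.example.com/v1",
                          api_key="sk-demo")

print(key_gate_passed(local))           # True
print(key_gate_passed(cloud_no_key))    # False
print(key_gate_passed(cloud_with_key))  # True
```

Routing every gate (onboarding screen, validator, startup check) through one predicate like this keeps the bypass in a single place, in the same spirit as flipping exactly one config object.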

diff --git a/skills/local-ai-app-integration/reference.md b/skills/local-ai-app-integration/reference.md
@@ -12,7 +12,6 @@ by the default-path tables.
 - [Endpoint reference](#endpoint-reference)
 - [Config keys you may need to set](#config-keys-you-may-need-to-set)
 - [Per-model tuning via recipe_options.json](#per-model-tuning-via-recipe_optionsjson)
-- [Trimming the bundled artifact](#trimming-the-bundled-artifact)
 - [Linux packaging notes](#linux-packaging-notes)
 
 ---
@@ -39,13 +38,16 @@ hardware-optimized one at first run after a system probe.
 | `flm` | `npu` | XDNA2 NPU | Cannot be packaging-time bundled on Linux. |
 | `ryzenai-llm` | `npu` | XDNA2 NPU | Windows only. Best for the Hybrid model family. |
 
-### Speech-to-text (`whispercpp`)
+### Speech-to-text
 
-| Backend | Hardware | OS |
-|---|---|---|
-| `npu` | XDNA2 NPU | Windows |
-| `vulkan` | x86_64 CPU | Linux |
-| `cpu` | x86_64 CPU | Windows, Linux |
+Two NPU paths exist. **Prefer `flm` for NPU**.
+
+| Recipe | Backend | Model | Hardware | OS |
+|---|---|---|---|---|
+| `flm` | `npu` | `whisper-v3-turbo-FLM` | XDNA2 NPU | Windows |
+| `whispercpp` | `cpu` | `Whisper-Large-v3-Turbo` | x86_64 CPU | Windows, Linux |
+| `whispercpp` | `vulkan` | `Whisper-Large-v3-Turbo` | x86_64 CPU | Linux |
+| `whispercpp` | `npu` | `.rai`-cached whisper model | XDNA2 NPU | Windows (avoid) |
 
 ### Text-to-speech
 
@@ -76,11 +78,16 @@ ship a default and document how to override.
 | Multimodal (vision) chat | `Gemma-4-E2B-it-GGUF` | 2.0 GB | `llamacpp` |
 | Hybrid NPU chat (Ryzen AI) | `Llama-3.2-3B-Instruct-Hybrid` | 2.0 GB | `ryzenai-llm` |
 | Speech-to-text | `Whisper-Large-v3-Turbo` | 1.6 GB | `whispercpp` |
+| NPU speech-to-text (Ryzen AI) | `whisper-v3-turbo-FLM` | 0.6 GB | `flm` |
 | Text-to-speech | `kokoro-v1` | 0.3 GB | `kokoro` |
 | Image generation | `SDXL-Turbo` | 6.9 GB | `sd-cpp` |
 
-For the full catalog, fetch `GET /v1/models` after starting `lemond`, or
-read `vendor/lemonade/resources/server_models.json`.
+For a catalog with more models, fetch `GET /v1/models` after starting `lemond`.
+This is the **only** trusted source of available models. Never read or trust
+`vendor/lemonade/resources/server_models.json` (or any other static file) as a
+model catalog; it can be stale or incomplete. A model only appears in
+`GET /v1/models` once its backend is installed (see Step 3), so install the
+backend first or the list will look empty/incomplete.
 
 ---
 
@@ -135,7 +142,7 @@ All endpoints require `Authorization: Bearer {key}` when
 | Endpoint | Purpose |
 |---|---|
 | `GET  /api/v1/health` | Readiness probe and loaded-model list |
-| `GET  /api/v1/models` | List available models (filtered by `server_models.json`) |
+| `GET  /api/v1/models` | List available models |
 | `POST /api/v1/chat/completions` | OpenAI Chat Completions (text + vision + tool calls) |
 | `POST /api/v1/embeddings` | OpenAI Embeddings |
 | `POST /api/v1/audio/transcriptions` | OpenAI Whisper-style transcription |
@@ -189,7 +196,7 @@ hand-editing `config.json`, or at runtime via `POST /internal/set`.
 | `llamacpp_backend` | string | Pin to `rocm` / `vulkan` / `cpu` / `metal`; leave unset for auto |
 | `llamacpp_args` | string | Raw args appended to `llama-server` |
 | `sdcpp_backend` | string | `rocm` / `cpu` |
-| `whispercpp_backend` | string | `npu` / `vulkan` / `cpu` |
+| `whispercpp_backend` | string | `npu`/`cpu` (Windows), `cpu`/`vulkan` (Linux). For NPU prefer the `flm` recipe instead |
 | `whispercpp_args` | string | Raw whisper.cpp args |
 | `flm_args` | string | Raw FastFlowLM args |
 | `steps` | int | SD step count |
@@ -227,25 +234,6 @@ next to `config.json`. Example:
 ```
 
 This file is consulted on every model load. No restart required.
-
----
-
-## Trimming the bundled artifact
-
-The shipping artifact should be the smallest possible footprint. Strip:
-
-| File / dir | Keep? | Reason |
-|---|---|---|
-| `lemond[.exe]` | Yes | The only required binary |
-| `lemonade[.exe]` (CLI) | **No** | Only useful for packaging-time config; remove from installer |
-| `LICENSE` | Yes | Required by Apache 2.0 |
-| `resources/server_models.json` | Yes | Trim to only the models the app exposes |
-| `resources/backend_versions.json` | Yes | Pins backend versions for reproducibility |
-| `resources/defaults.json` | **No** (after first launch) | Only consumed once to seed `config.json` |
-| `bin/<recipe>/<backend>/` | Yes (one) | Bundle just the universal fallback (e.g. `llamacpp/vulkan`) |
-| `bin/<recipe>/<other backends>/` | **No** | Install on demand via `/v1/install` |
-| `models/` | Optional | Bundle one default model for offline install, or pull on first run |
-
 ---
 
 ## Linux packaging notes

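The speech-to-text table in this diff can be read as a small decision rule. The sketch below encodes it under stated assumptions: the caller already knows the OS and whether an XDNA2 NPU is present (hardware detection is out of scope), and the `SttProfile` type, the `pick_stt_profile` function, and the `prefer_vulkan` flag are hypothetical names introduced only for illustration. The `whispercpp` + `npu` row is deliberately never returned because the table marks it "avoid".

```python
from typing import NamedTuple


class SttProfile(NamedTuple):
    recipe: str
    backend: str
    model: str


def pick_stt_profile(os_name: str, has_xdna2_npu: bool,
                     prefer_vulkan: bool = False) -> SttProfile:
    """Map (OS, NPU availability) to a recipe/backend/model row of the table."""
    os_name = os_name.lower()
    if os_name not in ("windows", "linux"):
        raise ValueError(f"no speech-to-text row for OS {os_name!r}")
    if os_name == "windows" and has_xdna2_npu:
        # Preferred NPU path: the flm recipe, listed as Windows only.
        return SttProfile("flm", "npu", "whisper-v3-turbo-FLM")
    if os_name == "linux" and prefer_vulkan:
        # The vulkan backend of whispercpp is listed for Linux only.
        return SttProfile("whispercpp", "vulkan", "Whisper-Large-v3-Turbo")
    # The cpu backend of whispercpp is listed for both Windows and Linux.
    return SttProfile("whispercpp", "cpu", "Whisper-Large-v3-Turbo")


print(pick_stt_profile("Windows", has_xdna2_npu=True).recipe)   # flm
print(pick_stt_profile("Linux", has_xdna2_npu=True).backend)    # cpu
```

Whichever row is chosen, its backend still has to be installed before the model will load or appear in `GET /v1/models`.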
diff --git a/walkthroughs/README.md b/walkthroughs/README.md
@@ -8,4 +8,5 @@ Participatns using other are still encouraged to participate. Just please note t
 
 Please choose a skill to get started.
 
-* [local-ai-use](./local-ai-use.md): Teach your agent how to run image generation locally.
+* [local-ai-use](./local-ai-use.md): Teach your agent how to run image generation locally.
+* [local-ai-app-integration](./local-ai-app-integration.md): Add a local AI mode to a cloud-only app.
diff --git a/walkthroughs/local-ai-app-integration.md b/walkthroughs/local-ai-app-integration.md
@@ -0,0 +1,78 @@
+# AMD Skills Walkthroughs: `local-ai-app-integration`
+
+The goal of this skill is to teach your AI agent to add a **local AI mode** to an
+existing app that today only talks to cloud AI APIs.
+
+For this walkthrough we use [`danielholanda/dictate`](https://github.com/danielholanda/dictate),
+a Windows dictation app that currently sends every recording to cloud
+speech-to-text providers (Groq, Deepgram, Cartesia, Gemini, Mistral, etc.).
+
+## Prerequisites
+The sample app used here requires the Rust toolchain (install from https://rustup.rs/).
+
+Because this walkthrough runs transcription on the NPU, you need a Ryzen AI PC with an XDNA2 NPU (Strix, Strix Halo, Kraken, or Gorgon Point) running Windows.
+
+## Step 1 - Get the target app
+
+* Clone the cloud-only app you want to upgrade:
+
+```
+git clone https://github.com/danielholanda/dictate.git
+cd dictate
+```
+
+## Step 2 - Understanding which skills are available
+
+* Run `claude "Which skills can you see?" --model opus`. The resulting list should *not* include anything related to local AI app integration.
+
+## Step 3 - Enabling Claude to see `local-ai-app-integration`
+
+In the future this will be enabled directly through Claude's marketplace. For now, we have to add it manually.
+
+* Clone `https://github.com/amd/skills`
+* Move the `local-ai-app-integration` skill from the repo to `.claude/skills/`
+* Run `claude "Which skills can you see?" --model opus`. You should see a list of skills that includes `local-ai-app-integration`.
+
+## Step 4 - Running the skill
+
+Run `claude --model opus` inside the `dictate` repo, then run the prompt:
+
+```
+This app sends my dictation audio to cloud speech-to-text providers.
+Add a local AI mode that runs transcription on my machine instead by default.
+I want it to run using the NPU. Keep the cloud providers as an option and minimize code changes.
+```
+
+Claude should:
+
+1. Survey where the app calls its cloud transcription APIs.
+2. Pick a local speech-to-text model + backend (e.g. `whisper-v3-turbo-FLM` with the `flm` recipe on the `npu` backend).
+3. Vendor the Embeddable Lemonade (`lemond`) binary into the app tree.
+4. Add a launcher that spawns `lemond` on a free port.
+5. Re-point the app's existing client at the local endpoint and wait for `/v1/health`.
+
+Please note this may take several minutes as this app has a fairly large codebase.
+
+## Step 5 - Running the modified app
+
+Dictate is a Tauri (Rust + Node) app. From the repo root:
+
+```
+npm install
+npm run tauri dev
+```
+Once the window opens, press the microphone button to speak, and confirm that transcription is now running through your local device instead of a cloud provider. The transcribed text should appear where your cursor was last located.
+
+## Step 6 - (Optional) Going beyond
+
+`local-ai-app-integration` works for any modality, not just speech-to-text. The
+same pattern adds local chat, embeddings, image generation, or text-to-speech to
+any app that already calls into the cloud. You can try using this skill to turn other cloud apps into local apps.
+
+## Step 7 - (Optional) Try to get things done without AMD Skills
+
+Remove the added skill from `.claude/skills/` and rerun the experiment above. Expect high variance in execution length and token usage. Some common issues without the skill include:
+* The model produces a local implementation that does not use NPU acceleration as instructed.
+* The model invents a brittle local server setup that does not handle health checks, API keys, or shutdown.
+* The model touches many files instead of flipping a single base-URL/key config object.
+* The model writes a knowledge article instead of actually integrating local AI into the app.
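
The "wait for the health endpoint" part of Step 4 in this walkthrough is a plain polling loop. The sketch below is one possible shape and not the launcher the skill generates: `http_probe`, `wait_until_ready`, and their parameters are hypothetical names, the health URL is supplied by the caller, and the probe sends no `Authorization` header, which is an assumption that may not hold when `lemond` is started with a key.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str) -> bool:
    """Return True if a GET on url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error: not ready yet.
        return False


def wait_until_ready(url: str,
                     probe: Callable[[str], bool] = http_probe,
                     timeout_s: float = 60.0,
                     interval_s: float = 0.5) -> bool:
    """Poll probe(url) until it succeeds or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Demo with a stub probe that succeeds on the third attempt,
# so the sketch runs without a server.
attempts = []


def stub_probe(url: str) -> bool:
    attempts.append(url)
    return len(attempts) >= 3


ready = wait_until_ready("http://127.0.0.1:8000/health-placeholder",
                         probe=stub_probe, interval_s=0.01)
print(ready, len(attempts))  # True 3
```

Injecting the probe keeps the loop testable without a running server. In the app, the launcher would call this with the real health URL before re-pointing the client.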