Commit 4c1eb3d

Refactored local-ai-use skill to be modular
Split the skill guidance into two use cases: speech and image. This fixes the agent setting up both flows when the user requested only one.
1 parent 71720b2 commit 4c1eb3d

4 files changed

Lines changed: 191 additions & 104 deletions

skills/local-ai-use/SKILL.md

Lines changed: 86 additions & 63 deletions
@@ -15,13 +15,26 @@ description: >-
1515

1616
# Local AI Use (route image, TTS, STT through Lemonade)
1717

18-
This is a **meta-skill**. You run it once. After that, every later request that
19-
needs image generation, text-to-speech, or speech-to-text uses the local
18+
This is a **meta-skill**. After you run it, every later request that needs
19+
image generation, text-to-speech, or speech-to-text uses the local
2020
[Lemonade Server](https://lemonade-server.ai) instead of a cloud API. The
2121
agent's own LLM keeps handling text; only the expensive multimodal calls move
2222
on-device.
2323

24-
The skill does two things:
24+
The skill covers **two independent modality groups** — set up only the one(s)
25+
the user actually needs:
26+
27+
| Group | Covers | Default model(s) |
28+
|---|---|---|
29+
| **image** | image generation + editing | `SD-Turbo` |
30+
| **speech** | text-to-speech **and** speech-to-text | `kokoro-v1`, `Whisper-Tiny` |
31+
32+
> **Do ONLY the group the user asked for.** If the user wants to generate an
33+
> image, set up `image` only — do **not** pull the speech models. The setup
34+
> command takes the group as an argument (`image`, `speech`, or `all`), and the
35+
> rule installed into `AGENTS.md` contains only the group(s) you set up.
36+
37+
For each group you set up, the skill does two things:
2538

2639
1. **Verifies that local Lemonade is reachable and has the right models.**
2740
2. **Drops a `Local AI Use` block into the workspace `AGENTS.md`** so the agent
@@ -48,32 +61,36 @@ instead.
4861
missing, install from <https://lemonade-server.ai/install_options.html>
4962
before continuing. Do not silently install on the user's machine; that is a
5063
system-wide change and must be the user's call.
51-
- **Disk:** ~8 GB free for the three default models (SD-Turbo + Whisper-Tiny
52-
+ kokoro-v1).
64+
- **Disk:** ~5 GB for `image` (SD-Turbo); ~0.4 GB for `speech`
65+
(kokoro-v1 + Whisper-Tiny). Only the group(s) you set up are downloaded.
5366
- **Network:** required for the first `lemonade pull` of each model. After
5467
that, every modality runs offline.
5568

5669
## The opinionated path
5770

58-
Run this checklist top to bottom. Track progress against it; do not move on
59-
until each step verifies.
71+
Run this checklist top to bottom for the group(s) the user needs. Track progress
72+
against it; do not move on until each step verifies.
6073

6174
```
6275
[ ] 1. Confirm Lemonade Server is installed and reachable
63-
[ ] 2. Pull the three default modality models
76+
[ ] 2. Pull the selected group's default models
6477
[ ] 3. Install the routing rule into the workspace AGENTS.md
65-
[ ] 4. Smoke-test image, TTS, and STT against the local endpoint
78+
[ ] 4. Smoke-test the selected group's endpoints
6679
```
6780

68-
The single command that does steps 1, 2, and 3 in one shot is:
81+
The single command that does steps 1, 2, and 3 in one shot, scoped to a group:
6982

7083
```bash
71-
python scripts/setup_local_ai.py
84+
python scripts/setup_local_ai.py image # image only
85+
python scripts/setup_local_ai.py speech # TTS + STT only
86+
python scripts/setup_local_ai.py all # both (only if the user wants both)
7287
```
7388

74-
(Run from this skill's folder.) The script is idempotent: re-running it on a
75-
fully configured workspace is a no-op apart from a healthcheck. Read the
76-
sections below for what to do when each step fails.
89+
(Run from this skill's folder.) The script pulls only the selected group's
90+
models and writes only that group's rule section. It is idempotent: re-running
91+
with the same group is a no-op apart from a healthcheck. To add a group later,
92+
re-run with the full set you want (e.g. `all`). Read the sections below for what
93+
to do when each step fails.
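
As a rough illustration of the group scoping described above, here is a minimal sketch of how a group argument could map to the models that get pulled. It is not the contents of `scripts/setup_local_ai.py`; the names `GROUP_MODELS`, `models_for`, and `pull_commands` are hypothetical, and only the group names, model IDs, and the `lemonade pull` command come from the skill text.

```python
# Hypothetical sketch: NOT the actual contents of scripts/setup_local_ai.py.
# Group names and model IDs are taken from the skill text; everything else
# (function and variable names) is assumed for illustration.
GROUP_MODELS = {
    "image": ["SD-Turbo"],
    "speech": ["kokoro-v1", "Whisper-Tiny"],
}


def models_for(group: str) -> list[str]:
    """Return the default models for 'image', 'speech', or 'all'."""
    if group == "all":
        # 'all' is the union of every group's defaults, in declaration order.
        return [m for models in GROUP_MODELS.values() for m in models]
    if group not in GROUP_MODELS:
        raise ValueError(f"unknown group: {group!r}")
    return list(GROUP_MODELS[group])


def pull_commands(group: str) -> list[str]:
    """Build the `lemonade pull` commands for the selected group only."""
    return [f"lemonade pull {model}" for model in models_for(group)]


if __name__ == "__main__":
    for cmd in pull_commands("speech"):
        print(cmd)
```

Keeping the mapping in one table means that a request for `image` can never reach the speech models, which is the behavior this change is meant to guarantee.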
7794

7895
---
7996

@@ -101,33 +118,36 @@ and no API key is required (the system-wide server defaults to no auth on
101118
loopback). If the user has set `LEMONADE_API_KEY`, the routing rule template
102119
in `templates/local-ai-rule.md` shows where to add the `Authorization` header.
103120

104-
## Step 2: pull the three default modality models
121+
## Step 2: pull the selected group's default models
105122

106-
Pull these three. They are the **Lite Collection** defaults from Lemonade
107-
OmniRouter, sized to keep token-and-cost savings real on commodity hardware:
123+
Pull only the models for the group(s) you are setting up. They are the
124+
**Lite Collection** defaults from Lemonade OmniRouter, sized to keep
125+
token-and-cost savings real on commodity hardware:
108126

109-
| Modality | Model | Size | Why this default |
110-
|---|---|---|---|
111-
| Image generation | `SD-Turbo` | ~5 GB | Single-step generation, runs on CPU and AMD iGPU/dGPU |
112-
| Text-to-speech | `kokoro-v1` | ~0.3 GB | Only TTS model Lemonade currently supports; CPU-only, low latency |
113-
| Speech-to-text | `Whisper-Tiny` | ~0.1 GB | Smallest Whisper; fast on CPU. Upgrade to `Whisper-Large-v3-Turbo` if accuracy matters more than latency. |
127+
| Group | Modality | Model | Size | Why this default |
128+
|---|---|---|---|---|
129+
| `image` | Image generation | `SD-Turbo` | ~5 GB | Single-step generation, runs on CPU and AMD iGPU/dGPU |
130+
| `speech` | Text-to-speech | `kokoro-v1` | ~0.3 GB | Only TTS model Lemonade currently supports; CPU-only, low latency |
131+
| `speech` | Speech-to-text | `Whisper-Tiny` | ~0.1 GB | Smallest Whisper; fast on CPU. Upgrade to `Whisper-Large-v3-Turbo` if accuracy matters more than latency. |
114132

115133
```bash
134+
# image group
116135
lemonade pull SD-Turbo
136+
# speech group
117137
lemonade pull kokoro-v1
118138
lemonade pull Whisper-Tiny
119139
```
120140

121141
To choose a different model while installing the rule, pass it to the setup
122-
script. For example, to make future image requests use SDXL:
142+
script alongside the group. For example, to make future image requests use SDXL:
123143

124144
```bash
125-
python scripts/setup_local_ai.py --image-model SDXL-Turbo
145+
python scripts/setup_local_ai.py image --image-model SDXL-Turbo
126146
```
127147

128148
The script will pull the selected model and write that model ID into the
129149
installed `AGENTS.md` rule. The same pattern works for `--tts-model` and
130-
`--stt-model`.
150+
`--stt-model` with the `speech` group.
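
The command-line shape above (a positional group plus optional model overrides) could be parsed along these lines. This is a sketch under assumptions: the real script's argument handling is not shown in this diff, `build_parser` and `selected_models` are hypothetical names, and silently ignoring an override that belongs to an unselected group is a choice made here for illustration, not confirmed behavior.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented CLI shape."""
    parser = argparse.ArgumentParser(prog="setup_local_ai.py")
    parser.add_argument("group", choices=["image", "speech", "all"])
    # Defaults are the model IDs named in the skill text.
    parser.add_argument("--image-model", default="SD-Turbo")
    parser.add_argument("--tts-model", default="kokoro-v1")
    parser.add_argument("--stt-model", default="Whisper-Tiny")
    return parser


def selected_models(args: argparse.Namespace) -> dict[str, str]:
    """Keep only the models that belong to the requested group."""
    models: dict[str, str] = {}
    if args.group in ("image", "all"):
        models["image"] = args.image_model
    if args.group in ("speech", "all"):
        models["tts"] = args.tts_model
        models["stt"] = args.stt_model
    return models


if __name__ == "__main__":
    args = build_parser().parse_args(["image", "--image-model", "SDXL-Turbo"])
    print(selected_models(args))
```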
131151

132152
Each `pull` is idempotent. To verify what is already downloaded:
133153

@@ -169,15 +189,16 @@ block to:
169189

170190
The rule's content is identical; only the file location changes.
171191

172-
## Step 4: smoke-test the three modalities
192+
## Step 4: smoke-test the group(s) you set up
173193

174-
Verify each modality against the live server before declaring success. These
175-
mirror the inline patterns in the installed rule, so a green pass here means
176-
the rule will work. If you installed with a model override such as
177-
`--image-model SDXL-Turbo`, use that model ID in the smoke test and confirm
178-
the installed `AGENTS.md` rule contains it.
194+
Verify each modality you set up against the live server before declaring
195+
success — run only the tests for the group(s) you installed. These mirror the
196+
inline patterns in the installed rule, so a green pass here means the rule will
197+
work. If you installed with a model override such as `--image-model SDXL-Turbo`,
198+
use that model ID in the smoke test and confirm the installed `AGENTS.md` rule
199+
contains it.
179200

180-
**Image generation** (writes `out.png`):
201+
**Image generation** `image` group (writes `out.png`):
181202

182203
```bash
183204
curl -sX POST http://localhost:13305/api/v1/images/generations \
@@ -186,7 +207,7 @@ curl -sX POST http://localhost:13305/api/v1/images/generations \
186207
| python -c "import sys,json,base64; open('out.png','wb').write(base64.b64decode(json.load(sys.stdin)['data'][0]['b64_json']))"
187208
```
188209

189-
**Text-to-speech** (writes `out.mp3`):
210+
**Text-to-speech** `speech` group (writes `out.mp3`):
190211

191212
```bash
192213
curl -sX POST http://localhost:13305/api/v1/audio/speech \
@@ -195,37 +216,39 @@ curl -sX POST http://localhost:13305/api/v1/audio/speech \
195216
-o out.mp3
196217
```
197218

198-
**Speech-to-text** (round-trips `out.mp3` → text via a wav re-encode):
219+
**Speech-to-text** `speech` group (round-trips `out.mp3` → text via a wav re-encode):
199220

200221
```bash
201222
ffmpeg -y -i out.mp3 -ar 16000 -ac 1 out.wav
202223
curl -sX POST http://localhost:13305/api/v1/audio/transcriptions \
203224
-F "file=@out.wav" -F "model=Whisper-Tiny"
204225
```
205226

206-
If any of the three returns a non-2xx status, fix it now. The rule we just
207-
installed sends future requests to these same endpoints, so a broken endpoint
208-
becomes a broken user experience.
227+
If a test for a group you set up returns a non-2xx status, fix it now. The rule
228+
we just installed sends future requests to these same endpoints, so a broken
229+
endpoint becomes a broken user experience.
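
The "non-2xx means fix it now" rule can be expressed as a small check over collected status codes. The sketch below assumes the statuses have already been gathered (for example with `curl -w "%{http_code}"`); the helper names are hypothetical, and only the base URL and endpoint paths come from the smoke tests above.

```python
# Endpoint paths are taken from the smoke tests; helper names are assumed.
BASE_URL = "http://localhost:13305/api/v1"
GROUP_ENDPOINTS = {
    "image": ["/images/generations"],
    "speech": ["/audio/speech", "/audio/transcriptions"],
}


def endpoints_for(groups: list[str]) -> list[str]:
    """List the endpoints to smoke-test for the installed group(s) only."""
    return [BASE_URL + path for g in groups for path in GROUP_ENDPOINTS[g]]


def failed_tests(results: dict[str, int]) -> list[str]:
    """Return the endpoints whose HTTP status falls outside the 2xx range."""
    return [url for url, status in results.items() if not 200 <= status < 300]


if __name__ == "__main__":
    # Example statuses for a speech-only install; the 500 is made up.
    urls = endpoints_for(["speech"])
    statuses = dict(zip(urls, [200, 500]))
    print(failed_tests(statuses))
```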
209230

210231
---
211232

212233
## What changes after this skill runs
213234

214235
From the next turn onward, the agent reads the rule in `AGENTS.md` on every
215-
message. The rule explicitly tells the agent:
216-
217-
- **For image generation:** call `POST /api/v1/images/generations` on the
218-
local server. Do **not** call any cloud image API and do **not** use the
219-
built-in `GenerateImage` tool (that path bills tokens to the cloud
220-
provider).
221-
- **For text-to-speech:** call `POST /api/v1/audio/speech`. Do **not** call
222-
cloud TTS providers (OpenAI TTS, ElevenLabs, etc.).
223-
- **For speech-to-text:** call `POST /api/v1/audio/transcriptions`. Do
224-
**not** call cloud transcription providers.
225-
- **Fallback:** only fall back to a cloud API after one local attempt has
226-
failed *and* the user has been told the local call failed. Never
227-
silently fall back; the whole point of this skill is to keep cost
228-
predictable.
236+
message. For each group you set up, the rule explicitly tells the agent:
237+
238+
- **image group — image generation:** call `POST /api/v1/images/generations`
239+
(or `/images/edits`) on the local server. Do **not** call any cloud image API
240+
and do **not** use the built-in `GenerateImage` tool (that path bills tokens
241+
to the cloud provider).
242+
- **speech group — text-to-speech:** call `POST /api/v1/audio/speech`. Do
243+
**not** call cloud TTS providers (OpenAI TTS, ElevenLabs, etc.).
244+
- **speech group — speech-to-text:** call `POST /api/v1/audio/transcriptions`.
245+
Do **not** call cloud transcription providers.
246+
- **Fallback (any group):** only fall back to a cloud API after one local
247+
attempt has failed *and* the user has been told the local call failed. Never
248+
silently fall back; the whole point of this skill is to keep cost predictable.
249+
250+
A group you did **not** set up is untouched — the agent keeps using its
251+
configured providers for that modality.
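
The fallback rule (one local attempt, tell the user, only then go to the cloud) can be sketched as a small wrapper. This is an illustration of the policy, not code from the installed rule; `call_with_fallback` and its three callables are hypothetical names.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_fallback(
    local_call: Callable[[], T],
    cloud_call: Callable[[], T],
    notify_user: Callable[[str], None],
) -> T:
    """Try the local endpoint once; tell the user before any cloud fallback."""
    try:
        return local_call()
    except Exception as exc:  # broad on purpose: any local failure counts
        # The user must hear about the failure before the cloud is used.
        notify_user(f"Local call failed: {exc}")
        return cloud_call()


if __name__ == "__main__":
    def broken_local() -> str:
        raise ConnectionError("server not reachable")

    print(call_with_fallback(broken_local, lambda: "cloud result", print))
```

Routing the notification through a callback keeps the "never silently fall back" requirement explicit: there is no code path to the cloud call that skips it.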
229252

230253
The agent's own text reasoning continues to use whatever LLM Cursor / Claude
231254
Code / Codex is configured with. This skill does not redirect chat tokens;
@@ -246,20 +269,20 @@ machine.
246269

247270
## Verification checklist
248271

249-
Mark this skill complete only when **all** of the following are true:
272+
Mark a group complete only when **all** of the following are true for it:
250273

251274
- [ ] `lemonade status --json` reports the server running on port 13305.
252-
- [ ] `lemonade list --downloaded` shows `SD-Turbo`, `kokoro-v1`, and
253-
`Whisper-Tiny`.
254-
- [ ] The workspace `AGENTS.md` contains the
255-
`amd-skills:local-ai-use` block.
256-
- [ ] All three smoke tests in Step 4 succeed.
257-
- [ ] On a follow-up turn, asking the agent to "generate an image of X"
258-
causes it to POST to `http://localhost:13305/api/v1/images/generations`
259-
rather than calling a cloud tool.
260-
261-
If any box is unchecked, the user is still paying cloud cost for at least
262-
one modality.
275+
- [ ] `lemonade list --downloaded` shows the group's model(s): `SD-Turbo` for
276+
`image`; `kokoro-v1` and `Whisper-Tiny` for `speech`.
277+
- [ ] The workspace `AGENTS.md` contains the `amd-skills:local-ai-use` block,
278+
and that block includes the group's section (`### Image` and/or
279+
`### Speech`).
280+
- [ ] The group's smoke test(s) in Step 4 succeed.
281+
- [ ] On a follow-up turn, a request for that modality causes the agent to POST
282+
to the local endpoint rather than calling a cloud tool.
283+
284+
You only need the rows for the group(s) you set up. A group you skipped is
285+
expected to still use cloud providers.
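
The `AGENTS.md` item in the checklist lends itself to a scripted check. The sketch below assumes the block can be detected by the literal string `amd-skills:local-ai-use` and that group sections appear as `### Image` and `### Speech` heading lines, as the checklist states. It searches the whole file rather than only inside the block, because the block's exact delimiters are not shown here. The function name is hypothetical.

```python
BLOCK_MARKER = "amd-skills:local-ai-use"
GROUP_HEADINGS = {"image": "### Image", "speech": "### Speech"}


def missing_from_rule(agents_md: str, groups: list[str]) -> list[str]:
    """Return the marker/headings that the checklist expects but are absent."""
    missing = []
    if BLOCK_MARKER not in agents_md:
        missing.append(BLOCK_MARKER)
    # Compare whole lines so that '### Image' does not match '### Images'.
    lines = {line.strip() for line in agents_md.splitlines()}
    for group in groups:
        heading = GROUP_HEADINGS[group]
        if heading not in lines:
            missing.append(heading)
    return missing


if __name__ == "__main__":
    # Made-up file content for an image-only install.
    sample = "<!-- amd-skills:local-ai-use -->\n### Image\nrule text here\n"
    print(missing_from_rule(sample, ["image", "speech"]))
```

An empty result for the group(s) you set up corresponds to that checklist row being satisfied; a heading reported for a group you skipped is expected.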
263286

264287
---
265288

skills/local-ai-use/reference.md

Lines changed: 1 addition & 1 deletion
@@ -33,7 +33,7 @@ asks for higher quality or has explicit hardware to spare.
3333
To upgrade: re-run setup with the target model, for example:
3434

3535
```bash
36-
python scripts/setup_local_ai.py --image-model SDXL-Turbo
36+
python scripts/setup_local_ai.py image --image-model SDXL-Turbo
3737
```
3838

3939
The script pulls the model and rewrites the `AGENTS.md` rule in place.
