Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -65,6 +65,7 @@ Embed AMD-optimized AI into end-user applications.
| Skill | What it does |
| --- | --- |
| `local-ai-app-integration` | Add private, on-device AI to apps that use OpenAI, Anthropic, or Ollama APIs by bundling Embeddable Lemonade as a subprocess. |
| `local-ai-use` | Apply a Lemonade-first strategy so agents default to local image generation, text-to-speech, and speech-to-text to reduce token/cost usage before any cloud fallback. |

### Cross-stack porting

Expand Down Expand Up @@ -226,7 +227,7 @@ See [AUTHORING.md](AUTHORING.md) for the full authoring guide, including when a

## Status

This repository is in its early days. The first in-repo skill, `skills/local-ai-app-integration/`, is available now and seeds the **Application integration** focus area. The Hardware-native, Cross-stack porting, and Profiling and delivery focus areas are being built out incrementally alongside manifests and CI. Expect rapid iteration. File an issue if there is a workflow you want covered, or open a PR with a skill you have been wanting to share.
This repository is in its early days. In-repo skills include `skills/local-ai-app-integration/` and `skills/local-ai-use/`, seeding the **Application integration** focus area. The Hardware-native, Cross-stack porting, and Profiling and delivery focus areas are being built out incrementally alongside manifests and CI. Expect rapid iteration. File an issue if there is a workflow you want covered, or open a PR with a skill you have been wanting to share.

## License

Expand Down
256 changes: 256 additions & 0 deletions skills/local-ai-use/SKILL.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,256 @@
---
name: local-ai-use
description: >-
Routes image generation, text-to-speech, and speech-to-text through a local
Lemonade Server so the agent stops paying for cloud calls on those modalities.
Use when the user wants to save tokens, save cost, or save money using local
AI; default to local, offline, on-device, or private image generation,
transcription, or text-to-speech in this workspace; stop using DALL-E,
Whisper-as-a-service, ElevenLabs, or other paid multimodal APIs; route the
agent's image, TTS, or STT tool calls to a local model; or mentions Lemonade
Server, OmniRouter, SD-Turbo, kokoro, Whisper, Ryzen AI, NPU/iGPU/dGPU
inference, or "use local for images but cloud for chat". Run once per
workspace; the rule it installs handles every later request.
---

# Local AI Use (route image, TTS, STT through Lemonade)

This is a **meta-skill**. You run it once. After that, every later request that
needs image generation, text-to-speech, or speech-to-text uses the local
[Lemonade Server](https://lemonade-server.ai) instead of a cloud API. The
agent's own LLM keeps handling text; only the expensive multimodal calls move
on-device.

The skill does two things:

1. **Verifies that local Lemonade is reachable and has the right models.**
2. **Drops a `Local AI Use` block into the workspace `AGENTS.md`** so the agent
reads the routing rule on every later turn, in Cursor, Claude Code, Codex,
Gemini CLI, and any other agent that respects `AGENTS.md`.

## When to use this skill

Use this skill when **all** of the following are true:

- The user has, or is willing to install, the system-wide Lemonade Server.
- The user accepts the default Lemonade endpoint `http://localhost:13305`.
- The user wants the change to be **persistent** across future turns and
agent restarts (the rule is written to disk).

If the user is instead **embedding** Lemonade as a private subprocess inside
an app installer, do not use this skill; use `local-ai-app-integration`
instead.

## Prerequisites

- **OS:** Windows 11 x64, Ubuntu/Debian x64, or macOS (beta).
- **Lemonade Server CLI on `PATH`:** verify with `lemonade --version`. If
missing, install from <https://lemonade-server.ai/install_options.html>
before continuing. Do not silently install on the user's machine; that is a
system-wide change and must be the user's call.
- **Disk:** ~8 GB free for the three default models (SD-Turbo + Whisper-Tiny
+ kokoro-v1).
- **Network:** required for the first `lemonade pull` of each model. After
that, every modality runs offline.

## The opinionated path

Run this checklist top to bottom. Track progress against it; do not move on
until each step verifies.

```
[ ] 1. Confirm Lemonade Server is installed and reachable
[ ] 2. Pull the three default modality models
[ ] 3. Install the routing rule into the workspace AGENTS.md
[ ] 4. Smoke-test image, TTS, and STT against the local endpoint
```

The single command that does steps 1, 2, and 3 in one shot is:

```bash
python scripts/setup_local_ai.py
```

(Run from this skill's folder.) The script is idempotent: re-running it on a
fully configured workspace is a no-op apart from a healthcheck. Read the
sections below for what to do when each step fails.

---

## Step 1: confirm Lemonade Server is reachable

Run:

```bash
lemonade status --json
```

Two acceptable outcomes:

| `lemonade status` says | Action |
|---|---|
| `Server is running on port 13305` | Continue to Step 2. |
| `Server is not running` | Start it. On Windows, launch the **Lemonade** Start Menu shortcut. On Linux, run `sudo systemctl start lemonade-server`. Re-check `lemonade status`. |

If `lemonade` is not on `PATH` at all, the server is not installed. Stop and
point the user at <https://lemonade-server.ai/install_options.html>. Do not
attempt a silent install.

The rest of this skill assumes the endpoint is `http://localhost:13305/api/v1`
and no API key is required (the system-wide server defaults to no auth on
loopback). If the user has set `LEMONADE_API_KEY`, the routing rule template
in `templates/local-ai-rule.md` shows where to add the `Authorization` header.

## Step 2: pull the three default modality models

Pull these three. They are the **Lite Collection** defaults from Lemonade
OmniRouter, sized to keep token-and-cost savings real on commodity hardware:

| Modality | Model | Size | Why this default |
|---|---|---|---|
| Image generation | `SD-Turbo` | ~5 GB | Single-step generation, runs on CPU and AMD iGPU/dGPU |
| Text-to-speech | `kokoro-v1` | ~0.3 GB | Only TTS model Lemonade currently supports; CPU-only, low latency |
| Speech-to-text | `Whisper-Tiny` | ~0.1 GB | Smallest Whisper; fast on CPU. Upgrade to `Whisper-Large-v3-Turbo` if accuracy matters more than latency. |

```bash
lemonade pull SD-Turbo
lemonade pull kokoro-v1
lemonade pull Whisper-Tiny
```

Each `pull` is idempotent. To verify what is already downloaded:

```bash
lemonade list --downloaded
```

For coverage of larger / higher-quality alternatives (`SDXL-Turbo`,
`Flux-2-Klein-4B`, `Whisper-Large-v3-Turbo`), see the
[model picker in reference.md](reference.md#model-picker).

## Step 3: install the routing rule into AGENTS.md

The rule is a Markdown block stored in [`templates/local-ai-rule.md`](templates/local-ai-rule.md).
Append it to the workspace's `AGENTS.md` (create the file if missing). Both
Cursor and Claude Code load `AGENTS.md` automatically on every turn, so the
agent will see the rule on its next message without any further setup.

`scripts/setup_local_ai.py` does this for you, surrounded by stable markers
so re-running the script replaces the block in place rather than appending
a second copy. The markers look like:

```
<!-- BEGIN amd-skills:local-ai-use -->
...rule...
<!-- END amd-skills:local-ai-use -->
```

If you write the file by hand, keep those exact markers. The script relies
on them for idempotent updates.

If the user's agent only respects a different convention, mirror the same
block to:

- `CLAUDE.md` (Claude Code, project-scoped) or `~/.claude/CLAUDE.md` (global)
- `.cursor/rules/local-ai-use.mdc` (Cursor user/project rules)
- `GEMINI.md` (Gemini CLI)

The rule's content is identical; only the file location changes.

## Step 4: smoke-test the three modalities

Verify each modality against the live server before declaring success. These
mirror the inline patterns in the installed rule, so a green pass here means
the rule will work.

**Image generation** (writes `out.png`):

```bash
curl -sX POST http://localhost:13305/api/v1/images/generations \
-H "Content-Type: application/json" \
-d '{"model":"SD-Turbo","prompt":"a single red apple on a white table","size":"512x512","steps":4,"response_format":"b64_json"}' \
| python -c "import sys,json,base64; open('out.png','wb').write(base64.b64decode(json.load(sys.stdin)['data'][0]['b64_json']))"
```

**Text-to-speech** (writes `out.mp3`):

```bash
curl -sX POST http://localhost:13305/api/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{"model":"kokoro-v1","input":"Local AI is now active.","response_format":"mp3"}' \
-o out.mp3
```

**Speech-to-text** (round-trips `out.mp3` → text via a wav re-encode):

```bash
ffmpeg -y -i out.mp3 -ar 16000 -ac 1 out.wav
curl -sX POST http://localhost:13305/api/v1/audio/transcriptions \
-F "file=@out.wav" -F "model=Whisper-Tiny"
```

If any of the three returns a non-2xx status, fix it now. The rule we just
installed sends future requests to these same endpoints, so a broken endpoint
becomes a broken user experience.

---

## What changes after this skill runs

From the next turn onward, the agent reads the rule in `AGENTS.md` on every
message. The rule explicitly tells the agent:

- **For image generation:** call `POST /api/v1/images/generations` on the
local server. Do **not** call any cloud image API and do **not** use the
built-in `GenerateImage` tool (that path bills tokens to the cloud
provider).
- **For text-to-speech:** call `POST /api/v1/audio/speech`. Do **not** call
cloud TTS providers (OpenAI TTS, ElevenLabs, etc.).
- **For speech-to-text:** call `POST /api/v1/audio/transcriptions`. Do
**not** call cloud transcription providers.
- **Fallback:** only fall back to a cloud API after one local attempt has
failed *and* the user has been told the local call failed. Never
silently fall back; the whole point of this skill is to keep cost
predictable.

The agent's own text reasoning continues to use whatever LLM Cursor / Claude
Code / Codex is configured with. This skill does not redirect chat tokens;
it only redirects the multimodal calls that would otherwise leave the
machine.

## Troubleshooting cheatsheet

| Symptom | Cause | Recovery |
|---|---|---|
| `lemonade: command not found` | Server CLI not installed | Install from <https://lemonade-server.ai/install_options.html>; restart shell. |
| `Server is not running` | Service stopped after install | Windows: launch the **Lemonade** Start Menu shortcut. Linux: `sudo systemctl start lemonade-server`. |
| `POST /v1/images/generations` returns 404 model not found | Image model not downloaded | `lemonade pull SD-Turbo` and retry. |
| Image generation is slow on CPU (~4–5 min) | sd-cpp on CPU backend | Install the GPU backend on supported AMD hardware: `lemonade backends install sd-cpp:rocm`. |
| `POST /v1/audio/transcriptions` returns 400 unsupported format | Input is not 16 kHz mono WAV | Re-encode with `ffmpeg -i in.* -ar 16000 -ac 1 out.wav`. |
| `POST /v1/audio/speech` returns 404 | TTS model not downloaded | `lemonade pull kokoro-v1`. |
| 401 Unauthorized on every request | User has set `LEMONADE_API_KEY` | Add `Authorization: Bearer $LEMONADE_API_KEY` to every request and to the rule block. |

## Verification checklist

Mark this skill complete only when **all** of the following are true:

- [ ] `lemonade status --json` reports the server running on port 13305.
- [ ] `lemonade list --downloaded` shows `SD-Turbo`, `kokoro-v1`, and
`Whisper-Tiny`.
- [ ] The workspace `AGENTS.md` contains the
`amd-skills:local-ai-use` block.
- [ ] All three smoke tests in Step 4 succeed.
- [ ] On a follow-up turn, asking the agent to "generate an image of X"
causes it to POST to `http://localhost:13305/api/v1/images/generations`
rather than calling a cloud tool.

If any box is unchecked, the user is still paying cloud cost for at least
one modality.

---

## Reference

For the full model picker, alternate-quality options, the complete endpoint
reference, the API-key flow, and the OmniRouter tool definitions you can
hand to an agent's tool-calling loop, see [reference.md](reference.md).
Loading
Loading