Commit 8204be0
fix(core): shape genie-ai-runtime chat requests + compact oversized prompts (#74)
Closes #73. GenieClaw's web chat through `/api/chat` used to return `{"error":"chat: genie-ai-runtime 0: "}` because the OpenAI-compatible adapter sent a multi-kB prompt (tool manifest + memory + history) into a Jetson runtime whose Qwen3 context auto-capped to ~327 tokens under memory pressure. The runtime then crashed; the client side saw a socket-drop and surfaced an empty status.
The fix has four coordinated parts:
1. **Request profile** (`RequestProfile { Generic, GenieAiRuntime }` in `openai_compat.rs`). Generic keeps today's wire format byte-for-byte so the llama.cpp path is unaffected. GenieAiRuntime sends `model: "jetson-llm"` (matches the model name the Jetson runtime registers) and `think: false` (matches the working direct-curl baseline). The `#[serde(skip_serializing_if = "Option::is_none")]` on the new `think` field means llama.cpp never sees it.
2. **Body compaction for the Jetson profile.** When the serialized body exceeds 4 KB, replace messages with a single user-role turn containing a concise system blurb plus the latest user content. Pairs with the genie-ai-runtime PR #81 prompt-size guard (which started returning HTTP 400 instead of crashing) — compaction sidesteps the rejection entirely.
3. **Honest error surfacing.** When the runtime closes the socket before sending HTTP status (`status == 0 && body.is_empty()`), report `{backend} closed connection before HTTP status (request_bytes=N)` instead of `genie-ai-runtime 0: `. Symmetric guard on the SSE path via `if !headers_done` + initial `status = 0`.
4. **Service unit tightening.** `genie-ai-runtime.service` now starts `jetson-llm-server` with `--int8-kv` + `-c 2048` (via the `GENIEPOD_AI_RUNTIME_CONTEXT=2048` env knob). INT8 KV roughly doubles fittable context vs FP16 so the Jetson runtime reliably gets the 2048 tokens GenieClaw's web prompt is sized for. Test in `tool_dispatch_test.rs` pins both invariants so they can't silently regress.
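The error-surfacing rule in part 3 can be sketched in a few lines. This is a minimal illustration, not the code from `openai_compat.rs`: the function name `describe_failure` and its parameters are hypothetical, and only the two message formats are taken from the description above.

```rust
/// Hypothetical helper (name and signature are assumptions for illustration).
/// `status == 0` stands for "no HTTP status line was ever received".
fn describe_failure(backend: &str, status: u16, body: &str, request_bytes: usize) -> String {
    if status == 0 && body.is_empty() {
        // The runtime dropped the socket before answering: say so explicitly
        // and include the request size, which is the likely culprit.
        format!(
            "{} closed connection before HTTP status (request_bytes={})",
            backend, request_bytes
        )
    } else {
        // Ordinary failure: keep the familiar "<backend> <status>: <body>" shape.
        format!("{} {}: {}", backend, status, body)
    }
}
```

Calling `describe_failure("genie-ai-runtime", 0, "", 6144)` yields a message that names the dropped connection and the request size, whereas the old format would have produced the uninformative `genie-ai-runtime 0: `.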
Three unit tests cover the compaction surface: Generic-unchanged, GenieAiRuntime-compacts-when-large (asserts model=jetson-llm, think=false, content preserved, history dropped), and the fallback-to-latest-non-system path for the case where the last message isn't a user role.
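The compaction behavior those tests pin down might look roughly like the sketch below. It is written under several assumptions: the `Message` struct, the function names, and the size estimate are hypothetical (the real adapter measures the serialized JSON body), and the fallback rule is one literal reading of "latest non-system when the last message isn't a user role".

```rust
/// Assumed budget, mirroring the 4 KB `GENIE_RUNTIME_MAX_BODY_BYTES` mentioned in this commit.
const MAX_BODY_BYTES: usize = 4 * 1024;

/// Hypothetical chat message type for illustration only.
#[derive(Clone, Debug, PartialEq)]
struct Message {
    role: String,
    content: String,
}

/// Rough stand-in for the serialized body size: sums role and content lengths
/// instead of serializing to JSON.
fn approx_body_bytes(messages: &[Message]) -> usize {
    messages.iter().map(|m| m.role.len() + m.content.len()).sum()
}

/// Collapse an oversized conversation into a single user-role turn that
/// carries a short system blurb plus the latest relevant content.
fn compact_if_oversized(messages: &[Message], blurb: &str) -> Vec<Message> {
    if approx_body_bytes(messages) <= MAX_BODY_BYTES {
        return messages.to_vec(); // small enough: leave untouched
    }
    // Use the last message if it is a user turn; otherwise fall back to the
    // latest message that is not a system message.
    let latest = match messages.last() {
        Some(m) if m.role == "user" => Some(m),
        _ => messages.iter().rev().find(|m| m.role != "system"),
    };
    let content = match latest {
        Some(m) => format!("{}\n\n{}", blurb, m.content),
        None => blurb.to_string(),
    };
    vec![Message { role: "user".to_string(), content }]
}
```

All prior history is dropped on the compacted path, which is exactly the trade-off the first follow-up item below calls out.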
Acceptance follow-ups deferred to separate issues:
- **Web-chat history preservation**: current compaction replaces ALL prior turns with the latest user message + system blurb. Web chat returns a working response but doesn't remember previous turns. Should evolve toward "system blurb + recent N user/assistant pairs" once the runtime's context capacity is more reliable.
- **Dashboard `llm Connection refused (os error 111)` (issue #73 AC #3)**: that row comes from `genie-health`'s separate probe path, not the chat client. Needs a separate issue against `genie-health`'s probe target / URL.
- **Static `GENIE_RUNTIME_MAX_BODY_BYTES = 4 KB`**: ideally derived from a runtime-capacity probe rather than hardcoded; defensible static value for an alpha.
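For the history-preservation follow-up, one possible shape of "system blurb + recent N turns" is sketched below. Nothing here exists in the codebase: `Turn`, `compact_keep_recent`, and both limits are hypothetical, and the byte accounting only counts content lengths rather than the serialized body.

```rust
/// Hypothetical conversation turn for illustration only.
#[derive(Clone, Debug, PartialEq)]
struct Turn {
    role: String,
    content: String,
}

/// Keep a system blurb plus the most recent non-system turns, stopping at
/// `max_turns` or when the content would exceed `max_bytes`.
fn compact_keep_recent(
    history: &[Turn],
    blurb: &str,
    max_turns: usize,
    max_bytes: usize,
) -> Vec<Turn> {
    let mut used = blurb.len();
    let mut kept: Vec<Turn> = Vec::new();
    // Walk backwards so the newest turns are admitted first.
    for turn in history.iter().rev().filter(|t| t.role != "system") {
        if kept.len() >= max_turns || used + turn.content.len() > max_bytes {
            break;
        }
        used += turn.content.len();
        kept.push(turn.clone());
    }
    kept.reverse(); // restore chronological order
    let mut out = vec![Turn { role: "system".to_string(), content: blurb.to_string() }];
    out.extend(kept);
    out
}
```

A real version would also need to decide what to do when the latest turn alone exceeds the budget; this sketch simply drops it and sends the blurb by itself.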
All 5 CI checks green on `5c55f62` (fmt, clippy, test, aarch64 cross-compile, `--no-default-features`).

1 parent 2c945a9 commit 8204be0
4 files changed
Lines changed: 281 additions & 29 deletions
File tree
- crates/genie-core
  - src/llm
  - tests
- deploy/systemd