Commit df5624f

lalalune and claude committed
feat(porting): add on-device chat profiling harness
Adds scripts/benchmark/profile-inference.mjs to walk a (model x kvCache x dflash x prompt) matrix against any agent that exposes the /api/local-inference + /api/conversations surfaces and emit structured JSON + Markdown reports under reports/porting/<date>/. Designed to run today against a host-side dev server and against the cuttlefish AOSP image once the chat round-trip fix lands.

Also adds:
- scripts/benchmark/configs/aosp-default.json: 2 x 3 x 2 x 4 default matrix matching the porting plan's validation matrix (llama-3.2-1b + bonsai-8b-1bit; baseline-fp16, tbq4-tbq3, qjl-tbq3; with/without dflash drafter; 4 representative prompts; 3 iterations + 1 warmup).
- scripts/benchmark/stub-agent-server.mjs: tiny in-process HTTP stub that mimics the agent surface so the harness itself can be validated end-to-end without standing up a real elizaOS instance.
- docs/porting/benchmark-harness.md: how-to-run, config schema, output format, and the documented API gap (POST /api/local-inference/active doesn't accept per-load cacheTypeK/V or drafter overrides yet, so non-default kvCache/dflash entries are recorded in configGaps[] until that ships).

Validated against the stub: 48-run matrix completes in ~47s with median tok/s, p95 latency, first-token latency, and config gaps in both profile.json and profile.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 74b8401 commit df5624f

4 files changed

Lines changed: 1306 additions & 0 deletions

docs/porting/benchmark-harness.md

Lines changed: 234 additions & 0 deletions
@@ -0,0 +1,234 @@
# On-device inference profiling harness

The `scripts/benchmark/profile-inference.mjs` script profiles the on-device
chat agent across a configurable matrix of models, KV-cache configurations,
DFlash drafter pairings, and prompts. It exists to satisfy the **Validation
matrix (per port)** section of
[`docs/porting/on-device-quantization-porting-plan.md`](./on-device-quantization-porting-plan.md):
the harness is the recurring runner for "End-to-end agent chat" across each
quantization path the porting plan ships.

The harness is HTTP-only: it talks to whatever agent is reachable at the
target URL. That means the same script runs against:

- A host-side dev server (`bun run dev`) for kernel/runtime work that
  doesn't need a phone.
- The cuttlefish AOSP image, once the chat round-trip fix lands (tracked
  separately under Agent E's branch).
- A real arm64 device (e.g. `ZL8325M37K`) when the on-device chat path is
  green and the device is reachable on loopback via `adb forward`.

## How to run

```sh
node scripts/benchmark/profile-inference.mjs [options]
```

Common invocations:

```sh
# Dev server on localhost, default config + output to reports/porting/<today>/
node scripts/benchmark/profile-inference.mjs

# Cuttlefish (after Agent E lands the fix), with a custom output dir
node scripts/benchmark/profile-inference.mjs \
  --target http://127.0.0.1:31337 \
  --label cuttlefish-x86_64 \
  --out reports/porting/2026-05-09-cuttlefish

# Real device via adb forward
adb forward tcp:31337 tcp:31337
node scripts/benchmark/profile-inference.mjs \
  --target http://127.0.0.1:31337 \
  --label arm64-ZL8325M37K
```

### Options

| Flag | Default | Purpose |
|---|---|---|
| `--target <url>` | `http://localhost:31337` | Agent API base URL |
| `--config <path>` | `scripts/benchmark/configs/aosp-default.json` | Matrix config |
| `--token <str>` | `MILADY_API_TOKEN` / `ELIZA_API_TOKEN` env | API token if the server is auth-gated |
| `--out <dir>` | `reports/porting/<YYYY-MM-DD>` | Output directory for `profile.json` + `profile.md` |
| `--non-streaming` | streaming on | Use sync `/messages` instead of SSE; no first-token latency |
| `--load-timeout-ms <n>` | `120000` | Per-load timeout |
| `--request-timeout-ms <n>` | `180000` | Per-message timeout |
| `--label <str>` | `null` | Optional label embedded in the report (e.g. host name, device id) |

Auth: the harness sends both `Authorization: Bearer <token>` and
`X-API-Token: <token>` when a token is provided, matching every header
shape the server's `getProvidedApiToken` helper accepts.
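
The dual-header behaviour can be pictured with the sketch below. It is only an illustration: `resolveToken` and `buildAuthHeaders` are hypothetical names, and the precedence of `--token` over `MILADY_API_TOKEN` over `ELIZA_API_TOKEN` is an assumption rather than something this document states.

```javascript
// Hypothetical helper: pick the token from the CLI flag or the environment.
// The precedence order is an assumption, not taken from the harness source.
function resolveToken(cliToken, env) {
  return cliToken || env.MILADY_API_TOKEN || env.ELIZA_API_TOKEN || null;
}

// Send the token in both header shapes so either server-side check matches.
// With no token, no auth headers are added at all.
function buildAuthHeaders(token) {
  const headers = {};
  if (token) {
    headers["Authorization"] = `Bearer ${token}`;
    headers["X-API-Token"] = token;
  }
  return headers;
}

// Example: token taken from the environment because no flag was passed.
const exampleHeaders = buildAuthHeaders(
  resolveToken(undefined, { ELIZA_API_TOKEN: "example-token" }),
);
console.log(Object.keys(exampleHeaders));
```

Sending both headers costs nothing on the wire and avoids having to know which shape a given server build checks first.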

## Validating the harness without the real path

A stub HTTP server lives next to the harness and implements just enough of
the agent surface to drive the matrix end-to-end with synthetic responses:

```sh
node scripts/benchmark/stub-agent-server.mjs --port 31337 &
node scripts/benchmark/profile-inference.mjs --label stub-validation
```

The stub serves `/api/health`, `/api/local-inference/active`,
`/api/conversations`, and the `/messages` + `/messages/stream` endpoints.
Use it to catch harness regressions before pointing at a real agent.

## Config schema

The matrix config is a JSON file with the following shape (validated by
`validateConfig` at startup; invalid configs fail fast with a clear
message):

```jsonc
{
  "models": ["llama-3.2-1b", "bonsai-8b-1bit"],
  "kvCacheConfigs": [
    { "name": "baseline-fp16", "k": "f16", "v": "f16" },
    { "name": "tbq4-tbq3", "k": "tbq4_0", "v": "tbq3_0" },
    { "name": "qjl-tbq3", "k": "qjl1_256", "v": "tbq3_0" }
  ],
  "dflashConfigs": [
    { "name": "no-dflash", "drafter": null },
    { "name": "dflash-bonsai", "drafter": "bonsai-8b-dflash-drafter" }
  ],
  "prompts": [
    { "id": "short-q", "text": "What is the capital of France?", "maxTokens": 50 },
    { "id": "long-gen", "text": "Write a 200-word story...", "maxTokens": 250 }
  ],
  "iterations": 3,
  "warmupIterations": 1
}
```

The matrix expands to `models × kvCacheConfigs × dflashConfigs × prompts`
combinations (one run each), and every run sends its prompt
`warmupIterations + iterations` times. The default matrix is
`2 × 3 × 2 × 4 = 48` runs, which at 1 warmup + 3 measured iterations is
`48 × 4 = 192` chat messages.
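
The expansion can be sketched as a plain cartesian product. This is an illustration only: `expandMatrix` is a hypothetical name, the loop nesting order is an assumption, and the prompt texts are shortened placeholders. The `key` shape follows the `key` field documented under Output format.

```javascript
// Hypothetical helper: expand a matrix config into one entry per combination.
// The nesting order (model outermost, prompt innermost) is an assumption.
function expandMatrix(config) {
  const runs = [];
  for (const model of config.models) {
    for (const kvCache of config.kvCacheConfigs) {
      for (const dflash of config.dflashConfigs) {
        for (const prompt of config.prompts) {
          runs.push({
            key: [model, kvCache.name, dflash.name, prompt.id].join("__"),
            model,
            kvCache,
            dflash,
            prompt,
          });
        }
      }
    }
  }
  return runs;
}

// Same dimensions as the default matrix; prompt texts are placeholders.
const exampleMatrix = {
  models: ["llama-3.2-1b", "bonsai-8b-1bit"],
  kvCacheConfigs: [
    { name: "baseline-fp16", k: "f16", v: "f16" },
    { name: "tbq4-tbq3", k: "tbq4_0", v: "tbq3_0" },
    { name: "qjl-tbq3", k: "qjl1_256", v: "tbq3_0" },
  ],
  dflashConfigs: [
    { name: "no-dflash", drafter: null },
    { name: "dflash-bonsai", drafter: "bonsai-8b-dflash-drafter" },
  ],
  prompts: ["short-q", "med-reason", "long-gen", "context-heavy"].map((id) => ({
    id,
    text: "placeholder",
    maxTokens: 50,
  })),
  iterations: 3,
  warmupIterations: 1,
};

const expandedRuns = expandMatrix(exampleMatrix);
const messageCount =
  expandedRuns.length * (exampleMatrix.warmupIterations + exampleMatrix.iterations);
console.log(expandedRuns.length, messageCount); // 48 192
console.log(expandedRuns[0].key); // llama-3.2-1b__baseline-fp16__no-dflash__short-q
```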

### Adding new entries

- **Prompts:** add an object to `prompts[]`. `id` must be unique and is
  what shows up in the report; `text` is the user message; `maxTokens` is
  recorded in the report (the chat endpoint enforces its own limits, so
  `maxTokens` is currently advisory).
- **Models:** add the canonical catalog id from
  `eliza/packages/app-core/src/services/local-inference/catalog.ts`. The
  agent must already have the model installed (or downloadable) for the
  load to succeed; otherwise the run is captured as an error.
- **KV cache configs:** each entry is `{ name, k, v }`. See **API gaps**
  below for the current limitation: per-load overrides aren't accepted by
  the server yet, so `k` / `v` are recorded in the report and the catalog
  default is what actually loads.
- **DFlash configs:** `{ name, drafter }`. `drafter: null` skips
  speculative decoding. The same gap applies: drafter pairing is read from
  the catalog, not the request.
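
The rules in the list above suggest what a fail-fast check might look like. The sketch below is not the harness's `validateConfig`; the name `validateConfigSketch`, the exact rules, and the error wording are all assumptions made for illustration.

```javascript
// Hypothetical validator; the real `validateConfig` may check more or less.
function validateConfigSketch(config) {
  const fail = (msg) => {
    throw new Error(`Invalid matrix config: ${msg}`);
  };
  const nonEmptyArray = (v) => Array.isArray(v) && v.length > 0;

  for (const key of ["models", "kvCacheConfigs", "dflashConfigs", "prompts"]) {
    if (!nonEmptyArray(config[key])) fail(`"${key}" must be a non-empty array`);
  }
  for (const kv of config.kvCacheConfigs) {
    if (!kv.name || !kv.k || !kv.v) fail("each kvCacheConfig needs name, k, and v");
  }
  for (const d of config.dflashConfigs) {
    // `drafter: null` is valid and means speculative decoding is skipped.
    if (!d.name || !("drafter" in d)) fail("each dflashConfig needs name and drafter");
  }
  const seenIds = new Set();
  for (const p of config.prompts) {
    if (!p.id || typeof p.text !== "string") fail("each prompt needs id and text");
    if (seenIds.has(p.id)) fail(`duplicate prompt id "${p.id}"`);
    seenIds.add(p.id);
  }
  if (!Number.isInteger(config.iterations) || config.iterations < 1) {
    fail('"iterations" must be a positive integer');
  }
  if (!Number.isInteger(config.warmupIterations) || config.warmupIterations < 0) {
    fail('"warmupIterations" must be a non-negative integer');
  }
  return config;
}
```

Checking prompt `id` uniqueness up front matters because the `id` is what distinguishes rows in the report.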

## Output format

Each run produces two files:

### `profile.json`

Full structured matrix output. Schema:

```jsonc
{
  "schemaVersion": 1,
  "target": "http://localhost:31337",
  "label": "...",
  "streaming": true,
  "configPath": ".../aosp-default.json",
  "startedAt": "ISO-8601",
  "finishedAt": "ISO-8601",
  "config": { /* echoed input config */ },
  "runs": [
    {
      "key": "<model>__<kvCache>__<dflash>__<prompt>",
      "model": "llama-3.2-1b",
      "kvCache": { "name": "baseline-fp16", "k": "f16", "v": "f16" },
      "dflash": { "name": "no-dflash", "drafter": null },
      "prompt": { "id": "short-q", "maxTokens": 50 },
      "startedAt": "ISO-8601",
      "finishedAt": "ISO-8601",
      "loadMs": 320,
      "loadResult": { "modelId": "...", "status": "ready", ... },
      "configGaps": [ { "kind": "...", "requested": {...}, "workaround": "..." } ],
      "warmupIterations": [ { "index": 0, "totalLatencyMs": 412, ... } ],
      "iterations": [ { "index": 0, "totalLatencyMs": 401, "tokensPerSecond": 42.1, ... } ],
      "summary": {
        "successCount": 3,
        "errorCount": 0,
        "totalLatencyMs": { "count": 3, "median": 401, "p95": 442, "min": 380, "max": 442 },
        "firstTokenLatencyMs": { "count": 3, "median": 72, "p95": 88, ... },
        "tokensPerSecond": { "count": 3, "median": 42, "p95": 47, ... },
        "estimatedTokens": { "count": 3, "median": 64, ... }
      },
      "error": null
    }
  ]
}
```
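
To make the `{ count, median, p95, min, max }` objects in `summary` concrete, here is one way such statistics could be computed. `summarize` and `percentile` are hypothetical names, and the nearest-rank percentile method is an assumption; the harness may interpolate instead.

```javascript
// Nearest-rank percentile over an ascending-sorted array (assumed method).
function percentile(sorted, p) {
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.min(sorted.length, Math.max(1, rank)) - 1];
}

// Hypothetical summarizer producing the { count, median, p95, min, max } shape.
// Non-finite samples (e.g. null latencies from failed iterations) are skipped.
function summarize(values) {
  const sorted = values.filter((v) => Number.isFinite(v)).sort((a, b) => a - b);
  if (sorted.length === 0) {
    return { count: 0, median: null, p95: null, min: null, max: null };
  }
  const mid = Math.floor(sorted.length / 2);
  const median =
    sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  return {
    count: sorted.length,
    median,
    p95: percentile(sorted, 95),
    min: sorted[0],
    max: sorted[sorted.length - 1],
  };
}

console.log(summarize([401, 442, 380]));
// { count: 3, median: 401, p95: 442, min: 380, max: 442 }
```

With only three measured iterations, a nearest-rank p95 is simply the slowest sample, which is consistent with the `totalLatencyMs` example in the schema where `p95` equals `max`.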

A run counts as successful at the harness level whenever
`runOneCombination` completes. A model that fails to load, or a run in
which every iteration errors, is captured in the run's `error` field or in
`iterations[*].error` rather than aborting the matrix. This is intentional:
the porting plan expects some kvCache configs (e.g. `qjl1_256`) to fail
until the kernel lands, and the report should record those gaps.
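
The record-instead-of-abort behaviour can be sketched as follows. This is not the body of `runOneCombination`; `runCombinationSketch` and its `load` / `sendPrompt` callbacks are hypothetical stand-ins for the HTTP calls, and the exact fields recorded on failure are assumptions.

```javascript
// Hypothetical shape of the per-combination loop: errors are recorded on the
// run object instead of being thrown, so the caller can continue the matrix.
async function runCombinationSketch({ load, sendPrompt, iterations }) {
  const run = { loadMs: null, iterations: [], error: null };
  const loadStart = Date.now();
  try {
    await load();
    run.loadMs = Date.now() - loadStart;
  } catch (err) {
    // A failed load is captured on the run; no iterations are attempted.
    run.error = String(err?.message ?? err);
    return run;
  }
  for (let index = 0; index < iterations; index += 1) {
    const start = Date.now();
    try {
      await sendPrompt();
      run.iterations.push({ index, totalLatencyMs: Date.now() - start, error: null });
    } catch (err) {
      // A failed iteration is captured per entry; later iterations still run.
      run.iterations.push({
        index,
        totalLatencyMs: null,
        error: String(err?.message ?? err),
      });
    }
  }
  return run;
}
```

Because the function always resolves, the outer matrix loop needs no `try`/`catch` of its own to keep going after an expected failure such as a missing kernel.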

Token counts are estimates (`Math.ceil(text.length / 4)`); the streaming
SSE surface doesn't emit canonical token counts. The estimate is recorded
as `estimatedTokens` so it isn't conflated with a real count.
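
A worked example of the estimate, under stated assumptions: the `Math.ceil(text.length / 4)` formula is taken from the text above, while the `tokensPerSecond` definition (estimated tokens divided by total latency) is an assumption; the harness might instead exclude the time before the first token.

```javascript
// Estimate from the text above: about four characters per token, rounded up.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Assumed definition: estimated tokens over total wall-clock seconds.
// Returns null when the latency is missing or not positive.
function tokensPerSecond(text, totalLatencyMs) {
  if (!(totalLatencyMs > 0)) return null;
  return estimateTokens(text) / (totalLatencyMs / 1000);
}

const reply = "Paris is the capital of France."; // 31 characters
console.log(estimateTokens(reply)); // 8
console.log(tokensPerSecond(reply, 500)); // 16
```

Since the numerator is a character-based estimate, throughput figures are best compared across runs of the same harness rather than against tokenizer-accurate numbers from other tools.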

### `profile.md`

Markdown summary table. Columns: model, kvCache, dflash, prompt, load
latency, first-token median, total median, total p95, tokens/s median,
OK/total iteration count, notes (errors + config gaps).
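
The first-token column only has data in streaming mode. As a rough sketch of how such a measurement could be taken, the hypothetical `timeStream` below times any async iterable of text chunks (for instance, chunks parsed out of the SSE response). The injectable `now` clock is an assumption added so the logic can be exercised deterministically.

```javascript
// Hypothetical timing wrapper around an async iterable of text chunks.
// `now` defaults to Date.now and can be replaced with a fake clock.
async function timeStream(chunks, now = Date.now) {
  const start = now();
  let firstTokenLatencyMs = null;
  let text = "";
  for await (const chunk of chunks) {
    // Record the latency once, at the first non-empty chunk.
    if (firstTokenLatencyMs === null && chunk.length > 0) {
      firstTokenLatencyMs = now() - start;
    }
    text += chunk;
  }
  return { firstTokenLatencyMs, totalLatencyMs: now() - start, text };
}

// Usage with a synthetic stream standing in for the SSE response.
async function* syntheticChunks() {
  yield "Par";
  yield "is";
}
timeStream(syntheticChunks()).then((result) => console.log(result.text)); // Paris
```

In non-streaming mode there is a single response body and no first chunk to time, which is why `--non-streaming` yields no first-token latency.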

## API gaps

The current `POST /api/local-inference/active` endpoint **does not** accept
per-load overrides for `cacheTypeK`, `cacheTypeV`, or the dflash drafter
pairing. Those values are read from the catalog entry's `runtime` block
inside `resolveLocalInferenceLoadArgs`
(`eliza/packages/app-core/src/services/local-inference/active-model.ts`).

**Implication:** any `kvCacheConfig` whose `k` / `v` differ from the
loaded model's catalog defaults, or any `dflashConfig` whose `drafter`
differs from the catalog's `runtime.dflash.drafterModelId`, won't actually
take effect at load time. The harness records each such mismatch in the
run's `configGaps[]` array with a documented workaround:

- For KV cache overrides: set
  `ELIZA_LLAMA_CACHE_TYPE_K` / `ELIZA_LLAMA_CACHE_TYPE_V` on the agent
  process before starting it, then re-run the harness against that agent.
  This is the same env-var path the AOSP shim already supports; see
  `eliza_llama_context_params_set_type_k` in
  `eliza/packages/app-core/platforms/android/...`.
- For drafter pairing: edit the catalog entry's
  `runtime.dflash.drafterModelId` (or pick a model whose catalog block
  already references the desired drafter).
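
A sketch of how a mismatch could be turned into `configGaps[]` entries is shown below. It is illustrative only: `detectConfigGaps` is a hypothetical name, the `kind` values are invented, and the `cacheTypeK` / `cacheTypeV` field names on the catalog `runtime` block are assumed (only `runtime.dflash.drafterModelId` is named in this document).

```javascript
// Hypothetical gap detector. `catalogRuntime` stands in for the catalog
// entry's `runtime` block; its cacheTypeK / cacheTypeV fields are assumed.
function detectConfigGaps(kvCache, dflash, catalogRuntime = {}) {
  const gaps = [];
  const catalogK = catalogRuntime.cacheTypeK ?? null;
  const catalogV = catalogRuntime.cacheTypeV ?? null;
  if (kvCache.k !== catalogK || kvCache.v !== catalogV) {
    gaps.push({
      kind: "kvCache", // illustrative value, not the harness's
      requested: { k: kvCache.k, v: kvCache.v },
      workaround:
        "Set ELIZA_LLAMA_CACHE_TYPE_K / ELIZA_LLAMA_CACHE_TYPE_V on the agent process and restart it.",
    });
  }
  const catalogDrafter = catalogRuntime.dflash?.drafterModelId ?? null;
  if (dflash.drafter !== catalogDrafter) {
    gaps.push({
      kind: "dflash", // illustrative value, not the harness's
      requested: { drafter: dflash.drafter },
      workaround: "Edit runtime.dflash.drafterModelId in the catalog entry.",
    });
  }
  return gaps;
}

// A model whose catalog defaults are f16/f16 with no drafter configured.
const exampleGaps = detectConfigGaps(
  { name: "tbq4-tbq3", k: "tbq4_0", v: "tbq3_0" },
  { name: "no-dflash", drafter: null },
  { cacheTypeK: "f16", cacheTypeV: "f16" },
);
console.log(exampleGaps.map((gap) => gap.kind)); // [ 'kvCache' ]
```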

When the load endpoint grows programmatic overrides (likely as part of
the QJL kernel landing), drop the `configGaps` synthesis from
`runOneCombination` and pass the values directly in the load body.

## Where the report goes

- The full matrix lives at `reports/porting/<YYYY-MM-DD>/profile.json`
  (and `.md`).
- The headline numbers from `profile.md` should be appended to the
  `## Current state on the AOSP image` section of
  [`docs/porting/on-device-quantization-porting-plan.md`](./on-device-quantization-porting-plan.md)
  as numbers come in. The porting plan is the cross-port comparison table;
  this directory is the per-run archive.

## Re-running across sessions

The harness has no machine-specific state. Pointing at a different
`--target` is the only thing needed to compare runs across hosts. The
output directory defaults to today's date, so two runs on the same day
without `--out` overwrite each other. Pass a distinct `--out` for each
matrix when running several in a single day; `--label` only adds a
distinguishing label inside the report and, per the defaults listed under
Options, does not change the output directory.
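
For scripted runs, a date-based directory with a per-run suffix avoids the overwrite. The helpers below are hypothetical and not part of the harness: `defaultOutDir` uses the UTC calendar date (the harness may use local time), and the `-<suffix>` convention in `outDirFor` simply mirrors the `reports/porting/2026-05-09-cuttlefish` example under How to run.

```javascript
// Hypothetical helper: default report directory for a given date.
// Uses the UTC calendar date; the harness may use local time instead.
function defaultOutDir(date = new Date()) {
  const day = date.toISOString().slice(0, 10); // YYYY-MM-DD
  return `reports/porting/${day}`;
}

// Hypothetical helper: append a suffix so same-day runs do not collide.
function outDirFor(suffix, date = new Date()) {
  return suffix ? `${defaultOutDir(date)}-${suffix}` : defaultOutDir(date);
}

const runDate = new Date("2026-05-09T12:00:00Z");
console.log(defaultOutDir(runDate)); // reports/porting/2026-05-09
console.log(outDirFor("cuttlefish", runDate)); // reports/porting/2026-05-09-cuttlefish
```

The resulting path would then be passed to the harness with `--out`.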
scripts/benchmark/configs/aosp-default.json

Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
{
  "$schema": "../profile-config.schema.json",
  "description": "Default profiling matrix for the AOSP on-device chat path. Reflects the porting plan's 'Validation matrix (per port)' section. Some kvCacheConfigs (qjl1_256) and dflashConfigs (drafter pairing) will fail today; they are kept in the matrix so reports document the gap until the kernels land.",
  "models": ["llama-3.2-1b", "bonsai-8b-1bit"],
  "kvCacheConfigs": [
    { "name": "baseline-fp16", "k": "f16", "v": "f16" },
    { "name": "tbq4-tbq3", "k": "tbq4_0", "v": "tbq3_0" },
    { "name": "qjl-tbq3", "k": "qjl1_256", "v": "tbq3_0" }
  ],
  "dflashConfigs": [
    { "name": "no-dflash", "drafter": null },
    { "name": "dflash-bonsai", "drafter": "bonsai-8b-dflash-drafter" }
  ],
  "prompts": [
    {
      "id": "short-q",
      "text": "What is the capital of France?",
      "maxTokens": 50
    },
    {
      "id": "med-reason",
      "text": "Explain in three sentences why the sky is blue.",
      "maxTokens": 120
    },
    {
      "id": "long-gen",
      "text": "Write a 200-word story about a small robot who wants to learn to paint.",
      "maxTokens": 250
    },
    {
      "id": "context-heavy",
      "text": "PREAMBLE:\nThe Mariana Trench is the deepest known oceanic trench in the world, located in the western Pacific Ocean. It reaches a maximum depth of about 11,034 meters at the Challenger Deep. The trench was first sounded in 1875 by HMS Challenger. Marine life persists even at these depths, including amphipods, sea cucumbers, and microbial mats that thrive on chemosynthesis. Hydrothermal vents along the trench floor support entire ecosystems independent of sunlight. The Trieste bathyscaphe carried Jacques Piccard and Don Walsh to the bottom in 1960, and James Cameron repeated the descent solo in 2012. Pressure at the bottom exceeds 1,000 atmospheres, making engineering exploration extraordinarily difficult. Recent uncrewed vehicles have catalogued new species including the Mariana snailfish, the deepest-dwelling vertebrate ever recorded. Plastic debris has nonetheless been found at the trench floor, indicating that anthropogenic pollution has reached even Earth's most remote habitats.\n\nQUESTION: Summarize the preamble in one sentence.",
      "maxTokens": 100
    }
  ],
  "iterations": 3,
  "warmupIterations": 1
}