|
| 1 | +# On-device inference profiling harness |
| 2 | + |
| 3 | +The `scripts/benchmark/profile-inference.mjs` script profiles the on-device |
| 4 | +chat agent across a configurable matrix of models, KV-cache configurations, |
| 5 | +DFlash drafter pairings, and prompts. It exists to satisfy the **Validation |
| 6 | +matrix (per port)** section of |
| 7 | +[`docs/porting/on-device-quantization-porting-plan.md`](./on-device-quantization-porting-plan.md): |
| 8 | +the harness is the recurring runner for "End-to-end agent chat" across each |
| 9 | +quantization path the porting plan ships. |
| 10 | + |
| 11 | +The harness is HTTP-only — it talks to whatever agent is reachable at the |
| 12 | +target URL. That means the same script runs against: |
| 13 | + |
| 14 | +- A host-side dev server (`bun run dev`) for kernel/runtime work that |
| 15 | + doesn't need a phone. |
| 16 | +- The cuttlefish AOSP image, once the chat round-trip fix lands (tracked |
| 17 | + separately under Agent E's branch). |
| 18 | +- A real arm64 device (e.g. `ZL8325M37K`) when the on-device chat path is |
| 19 | + green and the device is reachable on loopback via `adb forward`. |
| 20 | + |
| 21 | +## How to run |
| 22 | + |
| 23 | +```sh |
| 24 | +node scripts/benchmark/profile-inference.mjs [options] |
| 25 | +``` |
| 26 | + |
| 27 | +Common invocations: |
| 28 | + |
| 29 | +```sh |
| 30 | +# Dev server on localhost, default config + output to reports/porting/<today>/ |
| 31 | +node scripts/benchmark/profile-inference.mjs |
| 32 | + |
| 33 | +# Cuttlefish (after Agent E lands the fix), with a custom output dir |
| 34 | +node scripts/benchmark/profile-inference.mjs \ |
| 35 | + --target http://127.0.0.1:31337 \ |
| 36 | + --label cuttlefish-x86_64 \ |
| 37 | + --out reports/porting/2026-05-09-cuttlefish |
| 38 | + |
| 39 | +# Real device via adb forward |
| 40 | +adb forward tcp:31337 tcp:31337 |
| 41 | +node scripts/benchmark/profile-inference.mjs \ |
| 42 | + --target http://127.0.0.1:31337 \ |
| 43 | + --label arm64-ZL8325M37K |
| 44 | +``` |
| 45 | + |
| 46 | +### Options |
| 47 | + |
| 48 | +| Flag | Default | Purpose | |
| 49 | +|---|---|---| |
| 50 | +| `--target <url>` | `http://localhost:31337` | Agent API base URL | |
| 51 | +| `--config <path>` | `scripts/benchmark/configs/aosp-default.json` | Matrix config | |
| 52 | +| `--token <str>` | `MILADY_API_TOKEN` / `ELIZA_API_TOKEN` env | API token if the server is auth-gated | |
| 53 | +| `--out <dir>` | `reports/porting/<YYYY-MM-DD>` | Output directory for `profile.json` + `profile.md` | |
| 54 | +| `--non-streaming` | streaming on | Use synchronous `/messages` instead of SSE; first-token latency is not measured | |
| 55 | +| `--load-timeout-ms <n>` | `120000` | Per-load timeout | |
| 56 | +| `--request-timeout-ms <n>` | `180000` | Per-message timeout | |
| 57 | +| `--label <str>` | `null` | Optional label embedded in the report (e.g. host name, device id) | |
| 58 | + |
| 59 | +Auth: the harness sends both `Authorization: Bearer <token>` and |
| 60 | +`X-API-Token: <token>` when a token is provided, matching every header |
| 61 | +shape the server's `getProvidedApiToken` helper accepts. |
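The header handling described above could look roughly like the following. This is a minimal sketch, not the harness's code: `buildAuthHeaders` and `resolveToken` are hypothetical names, and the precedence (flag first, then `MILADY_API_TOKEN`, then `ELIZA_API_TOKEN`) is an assumption.

```javascript
// Hypothetical helper: builds request headers for the agent API.
// When a token is present it is sent in both header shapes named above.
function buildAuthHeaders(token) {
  const headers = { "Content-Type": "application/json" };
  if (token) {
    headers["Authorization"] = `Bearer ${token}`;
    headers["X-API-Token"] = token;
  }
  return headers;
}

// Hypothetical helper: the precedence order here is an assumption.
function resolveToken(flagValue, env) {
  return flagValue || env.MILADY_API_TOKEN || env.ELIZA_API_TOKEN || null;
}

const token = resolveToken(undefined, { ELIZA_API_TOKEN: "example-token" });
console.log(buildAuthHeaders(token));
```

Sending both headers keeps the client independent of which one the server checks first.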
| 62 | + |
| 63 | +## Validating the harness without the real path |
| 64 | + |
| 65 | +A stub HTTP server lives next to the harness and implements just enough of |
| 66 | +the agent surface to drive the matrix end-to-end with synthetic responses: |
| 67 | + |
| 68 | +```sh |
| 69 | +node scripts/benchmark/stub-agent-server.mjs --port 31337 & |
| 70 | +node scripts/benchmark/profile-inference.mjs --label stub-validation |
| 71 | +``` |
| 72 | + |
| 73 | +The stub serves `/api/health`, `/api/local-inference/active`, |
| 74 | +`/api/conversations`, and the `/messages` + `/messages/stream` endpoints. |
| 75 | +Use it to catch harness regressions before pointing at a real agent. |
| 76 | + |
| 77 | +## Config schema |
| 78 | + |
| 79 | +The matrix config is a JSON file with the following shape (validated by |
| 80 | +`validateConfig` at startup; invalid configs fail fast with a clear |
| 81 | +message): |
| 82 | + |
| 83 | +```jsonc |
| 84 | +{ |
| 85 | + "models": ["llama-3.2-1b", "bonsai-8b-1bit"], |
| 86 | + "kvCacheConfigs": [ |
| 87 | + { "name": "baseline-fp16", "k": "f16", "v": "f16" }, |
| 88 | + { "name": "tbq4-tbq3", "k": "tbq4_0", "v": "tbq3_0" }, |
| 89 | + { "name": "qjl-tbq3", "k": "qjl1_256", "v": "tbq3_0" } |
| 90 | + ], |
| 91 | + "dflashConfigs": [ |
| 92 | + { "name": "no-dflash", "drafter": null }, |
| 93 | + { "name": "dflash-bonsai", "drafter": "bonsai-8b-dflash-drafter" } |
| 94 | + ], |
| 95 | + "prompts": [ |
| 96 | + { "id": "short-q", "text": "What is the capital of France?", "maxTokens": 50 }, |
| 97 | + { "id": "long-gen", "text": "Write a 200-word story...", "maxTokens": 250 } |
| 98 | + ], |
| 99 | + "iterations": 3, |
| 100 | + "warmupIterations": 1 |
| 101 | +} |
| 102 | +``` |
| 103 | + |
| 104 | +The matrix has `models × kvCacheConfigs × dflashConfigs × prompts` |
| 105 | +combinations, and each combination sends `warmupIterations + iterations` |
| 106 | +messages. The default matrix is `2 × 3 × 2 × 4 = 48` combinations. |
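To make the arithmetic concrete, here is a small sketch that sizes a matrix from a config object. `matrixSize` is a hypothetical helper, not part of the harness. The sample config mirrors the schema example above, which lists two prompts, so it yields 24 combinations rather than the default 48.

```javascript
// Hypothetical helper: counts combinations and messages for a matrix config.
function matrixSize(config) {
  const combinations =
    config.models.length *
    config.kvCacheConfigs.length *
    config.dflashConfigs.length *
    config.prompts.length;
  // Every combination sends its warmup messages plus its measured messages.
  const messagesPerCombination = config.warmupIterations + config.iterations;
  return {
    combinations,
    messagesPerCombination,
    totalMessages: combinations * messagesPerCombination,
  };
}

const exampleConfig = {
  models: ["llama-3.2-1b", "bonsai-8b-1bit"],
  kvCacheConfigs: [
    { name: "baseline-fp16", k: "f16", v: "f16" },
    { name: "tbq4-tbq3", k: "tbq4_0", v: "tbq3_0" },
    { name: "qjl-tbq3", k: "qjl1_256", v: "tbq3_0" },
  ],
  dflashConfigs: [
    { name: "no-dflash", drafter: null },
    { name: "dflash-bonsai", drafter: "bonsai-8b-dflash-drafter" },
  ],
  prompts: [
    { id: "short-q", text: "What is the capital of France?", maxTokens: 50 },
    { id: "long-gen", text: "Write a 200-word story...", maxTokens: 250 },
  ],
  iterations: 3,
  warmupIterations: 1,
};

console.log(matrixSize(exampleConfig));
// { combinations: 24, messagesPerCombination: 4, totalMessages: 96 }
```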
| 107 | + |
| 108 | +### Adding new entries |
| 109 | + |
| 110 | +- **Prompts:** add an object to `prompts[]`. `id` must be unique and is |
| 111 | + what shows up in the report; `text` is the user message; `maxTokens` is |
| 112 | + recorded in the report (the chat endpoint enforces its own limits, so |
| 113 | + `maxTokens` is currently advisory). |
| 114 | +- **Models:** add the canonical catalog id from |
| 115 | + `eliza/packages/app-core/src/services/local-inference/catalog.ts`. The |
| 116 | + agent must already have the model installed (or downloadable) for the |
| 117 | + load to succeed; otherwise the run is captured as an error. |
| 118 | +- **KV cache configs:** each entry is `{ name, k, v }`. See **API gaps** |
| 119 | + below for the current limitation: per-load overrides aren't accepted by |
| 120 | + the server yet, so `k` / `v` are recorded in the report and the catalog |
| 121 | + default is what actually loads. |
| 122 | +- **DFlash configs:** `{ name, drafter }`. `drafter: null` skips |
| 123 | + speculative decoding. Same gap applies — drafter pairing is read from |
| 124 | + the catalog, not the request. |
| 125 | + |
| 126 | +## Output format |
| 127 | + |
| 128 | +Each run produces two files: |
| 129 | + |
| 130 | +### `profile.json` |
| 131 | + |
| 132 | +Full structured matrix output. Schema: |
| 133 | + |
| 134 | +```jsonc |
| 135 | +{ |
| 136 | + "schemaVersion": 1, |
| 137 | + "target": "http://localhost:31337", |
| 138 | + "label": "...", |
| 139 | + "streaming": true, |
| 140 | + "configPath": ".../aosp-default.json", |
| 141 | + "startedAt": "ISO-8601", |
| 142 | + "finishedAt": "ISO-8601", |
| 143 | + "config": { /* echoed input config */ }, |
| 144 | + "runs": [ |
| 145 | + { |
| 146 | + "key": "<model>__<kvCache>__<dflash>__<prompt>", |
| 147 | + "model": "llama-3.2-1b", |
| 148 | + "kvCache": { "name": "baseline-fp16", "k": "f16", "v": "f16" }, |
| 149 | + "dflash": { "name": "no-dflash", "drafter": null }, |
| 150 | + "prompt": { "id": "short-q", "maxTokens": 50 }, |
| 151 | + "startedAt": "ISO-8601", |
| 152 | + "finishedAt": "ISO-8601", |
| 153 | + "loadMs": 320, |
| 154 | + "loadResult": { "modelId": "...", "status": "ready", ... }, |
| 155 | + "configGaps": [ { "kind": "...", "requested": {...}, "workaround": "..." } ], |
| 156 | + "warmupIterations": [ { "index": 0, "totalLatencyMs": 412, ... } ], |
| 157 | + "iterations": [ { "index": 0, "totalLatencyMs": 401, "tokensPerSecond": 42.1, ... } ], |
| 158 | + "summary": { |
| 159 | + "successCount": 3, |
| 160 | + "errorCount": 0, |
| 161 | + "totalLatencyMs": { "count": 3, "median": 401, "p95": 442, "min": 380, "max": 442 }, |
| 162 | + "firstTokenLatencyMs": { "count": 3, "median": 72, "p95": 88, ... }, |
| 163 | + "tokensPerSecond": { "count": 3, "median": 42, "p95": 47, ... }, |
| 164 | + "estimatedTokens": { "count": 3, "median": 64, ... } |
| 165 | + }, |
| 166 | + "error": null |
| 167 | + } |
| 168 | + ] |
| 169 | +} |
| 170 | +``` |
| 171 | + |
| 172 | +A run counts as successful at the harness level whenever |
| 173 | +`runOneCombination` completes. A model that fails to load, or a run in |
| 174 | +which every iteration errors, is recorded in the `error` / |
| 175 | +`iterations[*].error` fields rather than aborting the matrix. This is |
| 176 | +intentional: the porting plan expects some kvCache configs (e.g. |
| 177 | +`qjl1_256`) to fail until the kernel lands, and the report should record those gaps. |
| 178 | + |
| 179 | +Token counts are estimates (`Math.ceil(text.length / 4)`); the streaming |
| 180 | +SSE surface doesn't emit canonical token counts. The estimate is recorded |
| 181 | +as `estimatedTokens` so it isn't conflated with a real count. |
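The estimate above is simple enough to sketch directly. `estimateTokens` applies the formula quoted in the text. `tokensPerSecond` is a hypothetical helper that assumes throughput is the estimated token count divided by total latency; the harness may use a different denominator.

```javascript
// Token estimate from the text: roughly four characters per token.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Assumption: throughput = estimated tokens / total latency in seconds.
// Returns null when the latency is not a positive number.
function tokensPerSecond(text, totalLatencyMs) {
  if (!(totalLatencyMs > 0)) return null;
  return estimateTokens(text) / (totalLatencyMs / 1000);
}

console.log(estimateTokens("Paris is the capital of France."));
console.log(tokensPerSecond("a".repeat(400), 2000));
```

Because the count comes from character length, it is only comparable between runs of the same harness, not with tokenizer-reported counts.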
| 182 | + |
| 183 | +### `profile.md` |
| 184 | + |
| 185 | +Markdown summary table. Columns: model, kvCache, dflash, prompt, load |
| 186 | +latency, first-token median, total median, total p95, tokens/s median, |
| 187 | +OK/total iteration count, notes (errors + config gaps). |
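The median and p95 columns can be reproduced from the per-iteration values. The sketch below assumes a nearest-rank p95, which is consistent with the sample `summary` shown in `profile.json` (three samples, p95 equal to the maximum), but the harness's exact percentile method is not stated here. `summarize` is a hypothetical name.

```javascript
// Hypothetical helper: summary statistics over per-iteration measurements.
// Non-finite entries (e.g. from failed iterations) are ignored.
function summarize(values) {
  const sorted = values
    .filter((v) => Number.isFinite(v))
    .sort((a, b) => a - b);
  const count = sorted.length;
  if (count === 0) {
    return { count: 0, median: null, p95: null, min: null, max: null };
  }
  const mid = Math.floor(count / 2);
  const median =
    count % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  // Nearest-rank percentile (assumption): smallest value with >= 95% of samples at or below it.
  const rank = Math.ceil(0.95 * count);
  const p95 = sorted[Math.min(count, rank) - 1];
  return { count, median, p95, min: sorted[0], max: sorted[count - 1] };
}

console.log(summarize([401, 442, 380]));
// { count: 3, median: 401, p95: 442, min: 380, max: 442 }
```

With only a handful of iterations, nearest-rank p95 collapses to the maximum, so treat it as a worst-case marker rather than a stable percentile.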
| 188 | + |
| 189 | +## API gaps |
| 190 | + |
| 191 | +The current `POST /api/local-inference/active` endpoint **does not** accept |
| 192 | +per-load overrides for `cacheTypeK`, `cacheTypeV`, or the dflash drafter |
| 193 | +pairing. Those values are read from the catalog entry's `runtime` block |
| 194 | +inside `resolveLocalInferenceLoadArgs` |
| 195 | +(`eliza/packages/app-core/src/services/local-inference/active-model.ts`). |
| 196 | + |
| 197 | +**Implication:** any `kvCacheConfig` whose `k` / `v` differ from the |
| 198 | +loaded model's catalog defaults, or any `dflashConfig` whose `drafter` |
| 199 | +differs from the catalog's `runtime.dflash.drafterModelId`, won't actually |
| 200 | +take effect at load time. The harness records each such mismatch in the |
| 201 | +run's `configGaps[]` array with a documented workaround: |
| 202 | + |
| 203 | +- For KV cache overrides: set |
| 204 | + `ELIZA_LLAMA_CACHE_TYPE_K` / `ELIZA_LLAMA_CACHE_TYPE_V` on the agent |
| 205 | + process before starting it, then re-run the harness against that agent. |
| 206 | + This is the same env-var path the AOSP shim already supports — see |
| 207 | + `eliza_llama_context_params_set_type_k` in |
| 208 | + `eliza/packages/app-core/platforms/android/...`. |
| 209 | +- For drafter pairing: edit the catalog entry's |
| 210 | + `runtime.dflash.drafterModelId` (or pick a model whose catalog block |
| 211 | + already references the desired drafter). |
| 212 | + |
| 213 | +When the load endpoint grows programmatic overrides (likely as part of |
| 214 | +the QJL kernel landing), drop the `configGaps` synthesis from |
| 215 | +`runOneCombination` and pass the values directly in the load body. |
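As an illustration of the mismatch detection described in this section, the sketch below compares a requested combination against a catalog `runtime` block and emits `configGaps`-style entries. It is a sketch under stated assumptions: `findConfigGaps`, the `kind` strings, and the placement of `cacheTypeK` / `cacheTypeV` directly on `runtime` are hypothetical. Only the `kind` / `requested` / `workaround` fields and `runtime.dflash.drafterModelId` come from the text above.

```javascript
// Hypothetical helper: reports requested settings the load endpoint will ignore.
function findConfigGaps(kvCache, dflash, catalogRuntime) {
  const gaps = [];

  // Assumption: the catalog runtime block exposes cacheTypeK / cacheTypeV.
  const catalogK = catalogRuntime.cacheTypeK ?? null;
  const catalogV = catalogRuntime.cacheTypeV ?? null;
  if (kvCache.k !== catalogK || kvCache.v !== catalogV) {
    gaps.push({
      kind: "kv-cache-override-not-applied", // hypothetical kind string
      requested: { k: kvCache.k, v: kvCache.v },
      workaround:
        "Set ELIZA_LLAMA_CACHE_TYPE_K / ELIZA_LLAMA_CACHE_TYPE_V on the agent process, restart it, and re-run.",
    });
  }

  const catalogDrafter = catalogRuntime.dflash?.drafterModelId ?? null;
  if ((dflash.drafter ?? null) !== catalogDrafter) {
    gaps.push({
      kind: "dflash-drafter-not-applied", // hypothetical kind string
      requested: { drafter: dflash.drafter ?? null },
      workaround:
        "Edit runtime.dflash.drafterModelId in the catalog entry, or pick a model that already references the drafter.",
    });
  }
  return gaps;
}

console.log(
  findConfigGaps(
    { name: "tbq4-tbq3", k: "tbq4_0", v: "tbq3_0" },
    { name: "no-dflash", drafter: null },
    { cacheTypeK: "f16", cacheTypeV: "f16" }
  )
);
```

Keeping the gap list next to the measurements lets a reader see which rows reflect catalog defaults rather than the requested settings.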
| 216 | + |
| 217 | +## Where the report goes |
| 218 | + |
| 219 | +- The full matrix lives at `reports/porting/<YYYY-MM-DD>/profile.json` |
| 220 | + (and `.md`). |
| 221 | +- The headline numbers from `profile.md` should be appended to the |
| 222 | + `## Current state on the AOSP image` section of |
| 223 | + [`docs/porting/on-device-quantization-porting-plan.md`](./on-device-quantization-porting-plan.md) |
| 224 | + as numbers come in. The porting plan is the cross-port comparison table; |
| 225 | + this directory is the per-run archive. |
| 226 | + |
| 227 | +## Re-running across sessions |
| 228 | + |
| 229 | +The harness has no machine-specific state. Pointing at a different |
| 230 | +`--target` is the only thing needed to compare runs across hosts. The |
| 231 | +output directory defaults to today's date, so two runs on the same day |
| 232 | +without `--out` overwrite each other. Pass a distinct `--out` when running |
| 233 | +multiple matrices in a single day; `--label` only tags the report and does |
| 234 | +not change the output directory. |