Commit 5cb8180

author
Shaw
committed
updates
1 parent 1b04787 commit 5cb8180

164 files changed

Lines changed: 21133 additions & 7257 deletions

File tree

ELIZA_1_GGUF_PLATFORM_PLAN.json

Lines changed: 446 additions & 0 deletions
Large diffs are not rendered by default.

ELIZA_1_GGUF_READINESS.md

Lines changed: 431 additions & 0 deletions
Large diffs are not rendered by default.

ELIZA_1_TESTING_TODO.md

Lines changed: 64 additions & 11 deletions
@@ -6,7 +6,7 @@ Status as of 2026-05-11 on this workspace:
66
- Standalone Vulkan SPIR-V fixtures pass on this Apple Silicon host through MoltenVK, including TurboQuant, QJL, PolarQuant, and Polar+QJL residual.
77
- Built-fork Vulkan graph dispatch source wiring now exists for `GGML_OP_ATTN_SCORE_QJL`, `GGML_OP_ATTN_SCORE_TBQ` (`turbo3`, `turbo4`, `turbo3_tcq`), and `GGML_OP_ATTN_SCORE_POLAR` (`use_qjl=0/1`), but runtime-ready capability bits stay false until native Vulkan graph smoke passes on physical hardware.
88
- `adb devices -l` currently shows only `emulator-5554`; emulator Vulkan is diagnostic only and is not recordable Eliza-1 hardware evidence.
9-
- `xcrun xctrace list devices` currently shows `Shaw's iPhone (26.3.1)` offline; simulator results are not physical iOS evidence.
9+
- `xcrun xctrace list devices` currently shows `Shaw's iPhone (26.3.1)` offline even though `xcrun devicectl list devices` sees the iPhone 15 Pro as paired/available; simulator results are not physical iOS evidence.
1010
- CUDA, ROCm, GH200, and native Windows runners are present and fail closed, but this Mac cannot provide recordable target hardware evidence.
1111
- No final Eliza-1 release bundles exist yet with final weights, hashes, eval outputs, license manifests, and Hugging Face upload evidence.
1212

@@ -30,7 +30,7 @@ cd packages/inference/verify
3030
```
3131

3232
Expected: rejects non-Linux/MoltenVK and software ICDs unless explicitly allowed, runs `make reference-test kernel-contract vulkan-verify`, builds `linux-x64-vulkan`, dumps `CAPABILITIES.json`, then runs `make vulkan-dispatch-smoke`.
33-
On pass it writes `packages/inference/verify/vulkan-runtime-dispatch-evidence.json` and rebuilds once so `CAPABILITIES.json` can flip Vulkan runtime capabilities without the smoke-only bootstrap override.
33+
The graph smoke links against the managed output directory by default (`$ELIZA_STATE_DIR/local-inference/bin/dflash/linux-x64-vulkan`) and fails closed if `libggml-vulkan.so` is missing. On pass it writes `packages/inference/verify/vulkan-runtime-dispatch-evidence.json` and rebuilds once so `CAPABILITIES.json` can flip Vulkan runtime capabilities without the smoke-only bootstrap override.
3434

3535
Android Vulkan on a physical Adreno/Mali device:
3636

@@ -45,6 +45,7 @@ Current Vulkan blockers:
4545

4646
- Need physical Linux Intel/AMD/NVIDIA Vulkan smoke, not MoltenVK.
4747
- Need physical Android Adreno and Mali smoke.
48+
- This Mac cannot produce the native `libggml-vulkan.so` graph runtime evidence; `make -C packages/inference/verify vulkan-dispatch-smoke` is expected to fail closed here until run on physical Linux Vulkan hardware or supplied with real Android graph evidence.
4849
- Current graph source patch advertises only the single-batch contiguous shapes covered by `vulkan_dispatch_smoke.cpp`; batched `ne[2]/ne[3]` support needs a separate graph smoke before it can be enabled.
4950
- Android graph evidence must cover all six routes or the five runtime capability keys with finite `maxDiff`.
5051

@@ -54,24 +55,34 @@ Native Linux NVIDIA:
5455

5556
```bash
5657
cd packages/inference/verify
58+
HOST_ID=$(hostname -s 2>/dev/null || hostname)
59+
REPORT="hardware-results/cuda-linux-x64-${HOST_ID}.json"
60+
mkdir -p hardware-results
5761
ELIZA_DFLASH_SMOKE_MODEL=/models/eliza-1-smoke.gguf \
58-
./cuda_runner.sh --report hardware-results/cuda-$(hostname).json
62+
ELIZA_DFLASH_HARDWARE_REPORT_DIR=hardware-results \
63+
./cuda_runner.sh --report "$REPORT"
64+
node -e 'const fs=require("node:fs"); const r=JSON.parse(fs.readFileSync(process.argv[1],"utf8")); if (!(r.status==="pass" && r.passRecordable && r.evidence?.gpuInfo && r.evidence?.toolchainInfo && r.evidence?.modelSha256)) { console.error(JSON.stringify(r,null,2)); process.exit(1); }' "$REPORT"
5965
```
6066

6167
Remote NVIDIA host from a non-CUDA machine:
6268

6369
```bash
6470
cd packages/inference/verify
71+
REPORT=hardware-results/cuda-remote-linux-x64.json
6572
CUDA_REMOTE=user@cuda-host \
6673
CUDA_REMOTE_DIR=/path/to/eliza \
74+
CUDA_REMOTE_REPORT=hardware-results/cuda-remote-linux-x64.json \
6775
ELIZA_DFLASH_SMOKE_MODEL=/models/eliza-1-smoke.gguf \
68-
./cuda_runner.sh --report hardware-results/cuda-remote.json
76+
ELIZA_DFLASH_HARDWARE_REPORT_DIR=hardware-results \
77+
./cuda_runner.sh --report "$REPORT"
78+
node -e 'const fs=require("node:fs"); const r=JSON.parse(fs.readFileSync(process.argv[1],"utf8")); if (!(r.status==="pass" && r.passRecordable && r.evidence?.gpuInfo && r.evidence?.toolchainInfo && r.evidence?.modelSha256)) { console.error(JSON.stringify(r,null,2)); process.exit(1); }' "$REPORT"
6979
```
7080

7181
Current CUDA blockers:
7282

7383
- Requires Linux with `nvcc`, `nvidia-smi`, and a real NVIDIA GPU.
7484
- Requires a real GGUF smoke model; fixture-only runs do not count.
85+
- Remote collection must copy back the target-generated report; a local wrapper report with missing `gpuInfo`, `toolchainInfo`, or `modelSha256` is not recordable.
7586
- Need at least one x64 CUDA pass and one aarch64 Hopper/GH200-class pass.
7687

7788
## ROCm
@@ -80,10 +91,20 @@ Native Linux AMD:
8091

8192
```bash
8293
cd packages/inference/verify
94+
HOST_ID=$(hostname -s 2>/dev/null || hostname)
95+
REPORT="hardware-results/rocm-${HOST_ID}.json"
96+
mkdir -p hardware-results
8397
ELIZA_DFLASH_SMOKE_MODEL=/models/eliza-1-smoke.gguf \
84-
./rocm_runner.sh --report hardware-results/rocm-$(hostname).json
98+
ELIZA_DFLASH_HARDWARE_REPORT_DIR=hardware-results \
99+
ELIZA_DFLASH_CMAKE_FLAGS='-DCMAKE_HIP_ARCHITECTURES=gfx942' \
100+
./rocm_runner.sh --report "$REPORT"
101+
node -e 'const fs=require("node:fs"); const r=JSON.parse(fs.readFileSync(process.argv[1],"utf8")); if (!(r.status==="pass" && r.passRecordable && r.evidence?.gpuInfo && r.evidence?.toolchainInfo && r.evidence?.modelSha256)) { console.error(JSON.stringify(r,null,2)); process.exit(1); }' "$REPORT"
85102
```
86103

104+
Use `-DCMAKE_HIP_ARCHITECTURES=gfx90a` for MI250, `gfx942` for MI300, and
105+
`gfx1100;gfx1101;gfx1102` for RDNA3-class consumer coverage. For RDNA4, pin
106+
the exact `gfx*` agent reported by `rocminfo` before recording evidence.
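The architecture guidance above can be captured as a small lookup so that an unlisted GPU family fails loudly instead of silently building for the wrong target. This is a sketch under assumptions: the helper name `hipCmakeFlags` and the family keys (`mi250`, `mi300`, `rdna3`) are hypothetical; only the `gfx*` values come from the text.

```javascript
// Sketch only: family keys and the function name are illustrative.
// The gfx values are the ones listed in the paragraph above.
const HIP_ARCHITECTURES = new Map([
  ["mi250", "gfx90a"],
  ["mi300", "gfx942"],
  ["rdna3", "gfx1100;gfx1101;gfx1102"],
]);

// Builds the value intended for ELIZA_DFLASH_CMAKE_FLAGS.
function hipCmakeFlags(family) {
  const arch = HIP_ARCHITECTURES.get(family.toLowerCase());
  if (!arch) {
    // RDNA4 and anything else: pin the exact agent reported by `rocminfo`.
    throw new Error(
      `No pinned HIP architecture for "${family}"; check rocminfo first`
    );
  }
  return `-DCMAKE_HIP_ARCHITECTURES=${arch}`;
}
```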
107+
87108
Current ROCm blockers:
88109

89110
- Requires x86_64 Linux with `hipcc`, `rocminfo`, and a `gfx*` AMD GPU agent.
@@ -96,43 +117,71 @@ Native GH200/Hopper aarch64 Linux:
96117

97118
```bash
98119
cd packages/inference/verify
120+
HOST_ID=$(hostname -s 2>/dev/null || hostname)
121+
REPORT="hardware-results/gh200-${HOST_ID}.json"
122+
CUDA_REPORT="${REPORT%.json}.cuda.json"
123+
mkdir -p hardware-results
99124
ELIZA_DFLASH_SMOKE_MODEL=/models/eliza-1-smoke.gguf \
100-
./gh200_runner.sh --report hardware-results/gh200-$(hostname).json
125+
ELIZA_DFLASH_HARDWARE_REPORT_DIR=hardware-results \
126+
./gh200_runner.sh --report "$REPORT"
127+
node -e 'const fs=require("node:fs"); for (const p of process.argv.slice(1)) { const r=JSON.parse(fs.readFileSync(p,"utf8")); if (!(r.status==="pass" && r.passRecordable && r.evidence?.gpuInfo && r.evidence?.modelSha256)) { console.error(p); console.error(JSON.stringify(r,null,2)); process.exit(1); } }' "$REPORT" "$CUDA_REPORT"
101128
```
102129

103130
Current GH200 blockers:
104131

105132
- Requires aarch64 Linux userspace and H100/H200/GH200-class GPU or compute capability 9.x.
106133
- Delegates to `cuda_runner.sh` with `CUDA_TARGET=linux-aarch64-cuda` and `-DCMAKE_CUDA_ARCHITECTURES=90a`.
134+
- Save both the GH200 wrapper report and the delegated CUDA report (`${REPORT%.json}.cuda.json` by default).
107135
- Needs real server hardware; this Mac cannot verify it.
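For tooling that collects both GH200 reports outside the shell, the `${REPORT%.json}.cuda.json` expansion can be reproduced as below. This is a minimal sketch; the function name `delegatedCudaReportPath` is hypothetical, and it assumes the runner keeps the default naming described above.

```javascript
// Sketch only: JavaScript equivalent of the shell expansion
// `${REPORT%.json}.cuda.json`. The function name is illustrative.
function delegatedCudaReportPath(reportPath) {
  const suffix = ".json";
  // `%.json` strips the suffix only when it is present.
  const stem = reportPath.endsWith(suffix)
    ? reportPath.slice(0, -suffix.length)
    : reportPath;
  return `${stem}.cuda.json`;
}
```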
108136

109137
## Windows
110138

111139
Native Windows CUDA:
112140

113141
```powershell
142+
$ReportDir = "C:\temp\eliza-hardware-results"
143+
New-Item -ItemType Directory -Force -Path $ReportDir | Out-Null
144+
$Report = Join-Path $ReportDir "windows-cuda-$env:COMPUTERNAME.json"
145+
$env:ELIZA_DFLASH_HARDWARE_REPORT_DIR = $ReportDir
114146
pwsh -File packages/inference/verify/windows_runner.ps1 `
115147
-Backend cuda `
116148
-Model C:\models\eliza-1-smoke.gguf `
117-
-Report C:\temp\eliza-cuda.json
149+
-ReportDir $ReportDir `
150+
-Report $Report
151+
$r = Get-Content $Report | ConvertFrom-Json
152+
if (-not ($r.status -eq "pass" -and $r.passRecordable -and $r.evidence.gpuInfo -and $r.evidence.toolchainInfo -and $r.evidence.modelSha256)) { $r | ConvertTo-Json -Depth 8; throw "CUDA evidence is not recordable" }
118153
```
119154

120155
Native Windows Vulkan:
121156

122157
```powershell
158+
$ReportDir = "C:\temp\eliza-hardware-results"
159+
New-Item -ItemType Directory -Force -Path $ReportDir | Out-Null
160+
$Report = Join-Path $ReportDir "windows-vulkan-$env:COMPUTERNAME.json"
161+
$env:ELIZA_DFLASH_HARDWARE_REPORT_DIR = $ReportDir
123162
pwsh -File packages/inference/verify/windows_runner.ps1 `
124163
-Backend vulkan `
125164
-Model C:\models\eliza-1-smoke.gguf `
126-
-Report C:\temp\eliza-vulkan.json
165+
-ReportDir $ReportDir `
166+
-Report $Report
167+
$r = Get-Content $Report | ConvertFrom-Json
168+
if (-not ($r.status -eq "pass" -and $r.passRecordable -and $r.evidence.gpuInfo -and $r.evidence.modelSha256)) { $r | ConvertTo-Json -Depth 8; throw "Vulkan evidence is not recordable" }
127169
```
128170

129171
Native Windows CPU:
130172

131173
```powershell
174+
$ReportDir = "C:\temp\eliza-hardware-results"
175+
New-Item -ItemType Directory -Force -Path $ReportDir | Out-Null
176+
$Report = Join-Path $ReportDir "windows-cpu-$env:COMPUTERNAME.json"
177+
$env:ELIZA_DFLASH_HARDWARE_REPORT_DIR = $ReportDir
132178
pwsh -File packages/inference/verify/windows_runner.ps1 `
133179
-Backend cpu `
134180
-Model C:\models\eliza-1-smoke.gguf `
135-
-Report C:\temp\eliza-cpu.json
181+
-ReportDir $ReportDir `
182+
-Report $Report
183+
$r = Get-Content $Report | ConvertFrom-Json
184+
if (-not ($r.status -eq "pass" -and $r.passRecordable -and $r.evidence.modelSha256)) { $r | ConvertTo-Json -Depth 8; throw "CPU evidence is not recordable" }
136185
```
137186

138187
Current Windows blockers:
@@ -152,7 +201,8 @@ node packages/app-core/scripts/ios-xcframework/run-physical-device-smoke.mjs
152201

153202
Current iOS blockers:
154203

155-
- The only iPhone currently visible to Xcode is offline.
204+
- The physical iPhone is visible to CoreDevice as paired/available, but `xctrace` still lists UDID `00008130-001955E91EF8001C` as offline.
205+
- Retrying with the CoreDevice identifier reached an interactive `Password:` prompt before XCTest output; do not enter credentials inside the runner.
156206
- Simulator runs do not count as physical iOS evidence.
157207
- Physical smoke must validate the embedded Metal library, Capacitor bridge load, and at least one real local-inference route from the app shell.
158208

@@ -161,8 +211,11 @@ Current iOS blockers:
161211
Before publishing any Eliza-1 bundle to Hugging Face:
162212

163213
- Generate final GGUF weights and fused bundle manifest.
214+
- Keep VAD as the required `vad/silero-vad-int8.onnx` sidecar; do not
215+
treat every release payload as GGUF-only.
164216
- Record SHA-256 for every payload file.
165-
- Include license manifests for text, voice, ASR, vision, DFlash, and kernel sidecars.
217+
- Include license manifests for text, voice, ASR, VAD, vision, DFlash,
218+
and kernel sidecars.
166219
- Run tier evals and hardware smoke for the target platform class.
167220
- Upload to the `elizalabs` Hugging Face org and preserve upload logs/artifact URLs.
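The "record SHA-256 for every payload file" step could be scripted along these lines. This is a sketch, not the project's release tooling: the helper names and the idea of passing an explicit list of relative paths are assumptions. Files are read in chunks because GGUF payloads can be too large to load into memory at once.

```javascript
// Sketch only: helper names are illustrative, not part of the repo.
const crypto = require("node:crypto");
const fs = require("node:fs");
const path = require("node:path");

// Hashes one file in 1 MiB chunks to keep memory use bounded.
function sha256File(filePath) {
  const hash = crypto.createHash("sha256");
  const fd = fs.openSync(filePath, "r");
  const buffer = Buffer.alloc(1024 * 1024);
  try {
    let bytesRead;
    while ((bytesRead = fs.readSync(fd, buffer, 0, buffer.length, null)) > 0) {
      hash.update(buffer.subarray(0, bytesRead));
    }
  } finally {
    fs.closeSync(fd);
  }
  return hash.digest("hex");
}

// Maps each payload path (relative to the bundle root) to its digest.
function hashPayloads(bundleRoot, relativePaths) {
  const digests = {};
  for (const rel of relativePaths) {
    digests[rel] = sha256File(path.join(bundleRoot, rel));
  }
  return digests;
}
```

Passing the list explicitly means the `vad/silero-vad-int8.onnx` sidecar is hashed alongside the GGUF files rather than being missed by a `*.gguf` glob.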
168221

bun.lock

Lines changed: 1 addition & 0 deletions
Some generated files are not rendered by default.

docs/benchmarks/eliza-browser-app-harness.md

Lines changed: 45 additions & 12 deletions
@@ -3,9 +3,10 @@
33
`scripts/eliza-browser-app-harness.mjs` is a Puppeteer-over-Eliza skeleton for
44
benchmarking browser-agent tasks through the Eliza app surface.
55

6-
The harness has one hard boundary: it does not drive target websites. It creates
7-
a normal Eliza conversation, sends a prompt instructing the agent to use its
8-
built-in `BROWSER` action, then observes Eliza-owned APIs and the Eliza app UI.
6+
The harness has one hard boundary: it does not drive target websites. By
7+
default it opens the Eliza app with Puppeteer, types the task into the normal
8+
chat composer, and instructs the agent to use its built-in `BROWSER` action.
9+
After that it only observes Eliza-owned APIs and the Eliza app UI.
910

1011
## Quick Start
1112

@@ -22,6 +23,18 @@ bun run harness:browser-app -- \
2223
--no-launch \
2324
--target-url https://example.com/ \
2425
--prompt "Open the page and report its headline." \
26+
--require-browser-tab \
27+
--require-browser-events \
28+
--timeout 90s
29+
```
30+
31+
Use the conversation API instead of the UI only for API-only CI runs:
32+
33+
```sh
34+
bun run harness:browser-app -- \
35+
--prompt-via-api \
36+
--target-url https://example.com/ \
37+
--prompt "Open the page and report its headline." \
2538
--timeout 90s
2639
```
2740

@@ -46,6 +59,16 @@ tmp/eliza-browser-harness/<run-id>/
4659
polling.
4760
- `--no-launch`: require an already-running stack.
4861
- `--prompt <text>`: task text to wrap in the harness BROWSER-action prompt.
62+
- `--prompt-via-ui`: type the prompt into the Eliza app chat UI with Puppeteer
63+
(default).
64+
- `--prompt-via-api`: send the prompt through `POST
65+
/api/conversations/:id/messages` instead of typing it into the UI.
66+
- `--require-browser-tab`: fail unless a browser workspace tab is observed by
67+
the end of the run.
68+
- `--require-browser-events`: fail unless browser workspace events are observed
69+
by the end of the run.
70+
- `--require-trajectory`: fail unless a trajectory record is observed by the
71+
end of the run.
4972
- `--target-url <url>`: target URL for the agent's browser task.
5073
- `--timeout <ms|s|m>`: total polling time after the prompt is sent.
5174
- `--api-base <url>`: Eliza API base URL, default
@@ -68,13 +91,14 @@ Before prompting, the harness captures:
6891
- `GET /api/status`
6992
- `GET /api/dev/stack`
7093

71-
After creating a conversation and sending the task prompt, it polls:
94+
After sending the task prompt, it polls:
7295

7396
- `GET /api/browser-workspace`
7497
- `GET /api/browser-workspace/events`
7598
- `GET /api/trajectories?limit=20&offset=0`
7699
- `GET /api/dev/console-log?maxLines=400&maxBytes=256000`
77-
- `GET /api/conversations/:id/messages`
100+
- `GET /api/conversations/:id/messages` when `--prompt-via-api` created a
101+
known conversation.
78102

79103
`/api/browser-workspace/events` and `/api/dev/console-log` may return `404` on
80104
some stacks. Those responses are recorded as artifacts rather than treated as
@@ -91,9 +115,9 @@ The harness blocks these browser-workspace routes in its HTTP helper:
91115
- `/api/browser-workspace/tabs/:id/show`
92116
- `/api/browser-workspace/tabs/:id/hide`
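A route guard of this kind might look like the sketch below. It is an illustration only: the names `BLOCKED_ROUTE_PATTERNS` and `assertRouteAllowed` are hypothetical, and the pattern list contains only the two routes visible in this hunk, not the harness's full block list.

```javascript
// Sketch only: covers just the two routes shown in this hunk; the real
// HTTP helper blocks more routes and may be structured differently.
const BLOCKED_ROUTE_PATTERNS = [
  /^\/api\/browser-workspace\/tabs\/[^/]+\/show$/,
  /^\/api\/browser-workspace\/tabs\/[^/]+\/hide$/,
];

// Throws before any request is made to a route that would drive the browser.
function assertRouteAllowed(pathname) {
  if (BLOCKED_ROUTE_PATTERNS.some((pattern) => pattern.test(pathname))) {
    throw new Error(`Harness must not call blocked route: ${pathname}`);
  }
  return pathname;
}
```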
93117

94-
Puppeteer is only used to open the Eliza app UI URL and capture
95-
`eliza-app-initial.png` / `eliza-app-final.png`. It does not click, type, or
96-
navigate inside target websites.
118+
Puppeteer is only used to open the Eliza app UI URL, type/click the Eliza chat
119+
composer when `--prompt-via-ui` is active, and capture app screenshots. It does
120+
not click, type, navigate, or evaluate inside target websites.
97121

98122
## Artifact Map
99123

@@ -103,15 +127,24 @@ Common files:
103127
- `probe-health.json`, `probe-status.json`, `probe-dev-stack.json`: initial
104128
probe responses.
105129
- `discovery.json`: resolved API/UI URLs and probe status summary.
106-
- `conversation-create.json`: conversation creation response.
130+
- `conversation-create.json`: conversation creation response when using
131+
`--prompt-via-api`.
107132
- `agent-prompt.json`: exact prompt sent to the agent.
108-
- `conversation-prompt-response.json`: non-streaming chat response.
133+
- `conversation-prompt-response.json`: non-streaming chat response when using
134+
`--prompt-via-api`.
135+
- `ui-prompt.json`: UI prompt selectors and screenshot metadata when using
136+
`--prompt-via-ui`.
109137
- `polls.jsonl`: every poll response, including tolerated `404`s.
110138
- `browser-workspace-events.jsonl`: event endpoint poll subset.
111139
- `poll-latest.json`: last response seen for each polled endpoint.
112140
- `final-*.json` or `final-*.txt`: final endpoint captures.
113-
- `eliza-app-initial.png`, `eliza-app-final.png`: Puppeteer screenshots when a
114-
Chrome executable is available.
141+
- `analysis.json`: derived tab/event/trajectory counts, endpoint errors, and
142+
assertion results.
143+
- `eliza-app-initial.png`, `eliza-app-after-ui-prompt.png`,
144+
`eliza-app-final.png`: Puppeteer screenshots when a Chrome executable is
145+
available.
146+
- `puppeteer-console.jsonl`: console and page-error events from the Eliza app
147+
surface.
115148
- `summary.json`: final pass/fail status and run metadata.
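As an illustration of how the `--require-*` flags could feed the assertion results in `analysis.json`, the sketch below turns observed counts into per-flag results. Everything here is assumed for the example: the function name, the camel-cased flag keys, and the shape of the `observed` object are not taken from the harness source.

```javascript
// Sketch only: names and object shapes are hypothetical.
function evaluateRequirements(observed, flags) {
  const checks = [
    { flag: "requireBrowserTab", count: observed.tabs, label: "browser workspace tab" },
    { flag: "requireBrowserEvents", count: observed.events, label: "browser workspace event" },
    { flag: "requireTrajectory", count: observed.trajectories, label: "trajectory record" },
  ];
  // Only flags that were actually requested produce an assertion result.
  return checks
    .filter((check) => flags[check.flag])
    .map((check) => ({
      name: check.flag,
      passed: (check.count ?? 0) > 0,
      detail: `${check.count ?? 0} ${check.label}(s) observed`,
    }));
}

// A run fails if any requested requirement did not pass.
function runPassed(results) {
  return results.every((result) => result.passed);
}
```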
116149

117150
If the harness launches `bun run dev:desktop`, child stdout/stderr are written
