elizaOS
diff --git a/ELIZA_1_GGUF_PLATFORM_PLAN.json b/ELIZA_1_GGUF_PLATFORM_PLAN.json
Lines changed: 446 additions & 0 deletions
diff --git a/ELIZA_1_GGUF_READINESS.md b/ELIZA_1_GGUF_READINESS.md
Lines changed: 431 additions & 0 deletions
diff --git a/ELIZA_1_TESTING_TODO.md b/ELIZA_1_TESTING_TODO.md
Lines changed: 64 additions & 11 deletions
diff --git a/bun.lock b/bun.lock
Lines changed: 1 addition & 0 deletions
diff --git a/docs/benchmarks/eliza-browser-app-harness.md b/docs/benchmarks/eliza-browser-app-harness.md
Lines changed: 45 additions & 12 deletions
@@ -6,7 +6,7 @@ Status as of 2026-05-11 on this workspace:
 - Standalone Vulkan SPIR-V fixtures pass on this Apple Silicon host through MoltenVK, including TurboQuant, QJL, PolarQuant, and Polar+QJL residual.
 - Built-fork Vulkan graph dispatch source wiring now exists for `GGML_OP_ATTN_SCORE_QJL`, `GGML_OP_ATTN_SCORE_TBQ` (`turbo3`, `turbo4`, `turbo3_tcq`), and `GGML_OP_ATTN_SCORE_POLAR` (`use_qjl=0/1`), but runtime-ready capability bits stay false until native Vulkan graph smoke passes on physical hardware.
 - `adb devices -l` currently shows only `emulator-5554`; emulator Vulkan is diagnostic only and is not recordable Eliza-1 hardware evidence.
-- `xcrun xctrace list devices` currently shows `Shaw's iPhone (26.3.1)` offline; simulator results are not physical iOS evidence.
+- `xcrun xctrace list devices` currently shows `Shaw's iPhone (26.3.1)` offline even though `xcrun devicectl list devices` sees the iPhone 15 Pro as paired/available; simulator results are not physical iOS evidence.
 - CUDA, ROCm, GH200, and native Windows runners are present and fail closed, but this Mac cannot provide recordable target hardware evidence.
 - No final Eliza-1 release bundles exist yet with final weights, hashes, eval outputs, license manifests, and Hugging Face upload evidence.
 
@@ -30,7 +30,7 @@ cd packages/inference/verify
 ```
 
 Expected: rejects non-Linux/MoltenVK and software ICDs unless explicitly allowed, runs `make reference-test kernel-contract vulkan-verify`, builds `linux-x64-vulkan`, dumps `CAPABILITIES.json`, then runs `make vulkan-dispatch-smoke`.
-On pass it writes `packages/inference/verify/vulkan-runtime-dispatch-evidence.json` and rebuilds once so `CAPABILITIES.json` can flip Vulkan runtime capabilities without the smoke-only bootstrap override.
+The graph smoke links against the managed output directory by default (`$ELIZA_STATE_DIR/local-inference/bin/dflash/linux-x64-vulkan`) and fails closed if `libggml-vulkan.so` is missing. On pass it writes `packages/inference/verify/vulkan-runtime-dispatch-evidence.json` and rebuilds once so `CAPABILITIES.json` can flip Vulkan runtime capabilities without the smoke-only bootstrap override.
 
 Android Vulkan on a physical Adreno/Mali device:
 
@@ -45,6 +45,7 @@ Current Vulkan blockers:
 
 - Need physical Linux Intel/AMD/NVIDIA Vulkan smoke, not MoltenVK.
 - Need physical Android Adreno and Mali smoke.
+- This Mac cannot produce the native `libggml-vulkan.so` graph runtime evidence; `make -C packages/inference/verify vulkan-dispatch-smoke` is expected to fail closed here until run on physical Linux Vulkan hardware or supplied with real Android graph evidence.
 - Current graph source patch advertises only the single-batch contiguous shapes covered by `vulkan_dispatch_smoke.cpp`; batched `ne[2]/ne[3]` support needs a separate graph smoke before it can be enabled.
 - Android graph evidence must cover all six routes or the five runtime capability keys with finite `maxDiff`.
 
@@ -54,24 +55,34 @@ Native Linux NVIDIA:
 
 ```bash
 cd packages/inference/verify
+HOST_ID=$(hostname -s 2>/dev/null || hostname)
+REPORT="hardware-results/cuda-linux-x64-${HOST_ID}.json"
+mkdir -p hardware-results
 ELIZA_DFLASH_SMOKE_MODEL=/models/eliza-1-smoke.gguf \
-  ./cuda_runner.sh --report hardware-results/cuda-$(hostname).json
+ELIZA_DFLASH_HARDWARE_REPORT_DIR=hardware-results \
+  ./cuda_runner.sh --report "$REPORT"
+node -e 'const fs=require("node:fs"); const r=JSON.parse(fs.readFileSync(process.argv[1],"utf8")); if (!(r.status==="pass" && r.passRecordable && r.evidence?.gpuInfo && r.evidence?.toolchainInfo && r.evidence?.modelSha256)) { console.error(JSON.stringify(r,null,2)); process.exit(1); }' "$REPORT"
 ```
 
 Remote NVIDIA host from a non-CUDA machine:
 
 ```bash
 cd packages/inference/verify
+REPORT=hardware-results/cuda-remote-linux-x64.json
 CUDA_REMOTE=user@cuda-host \
 CUDA_REMOTE_DIR=/path/to/eliza \
+CUDA_REMOTE_REPORT=hardware-results/cuda-remote-linux-x64.json \
 ELIZA_DFLASH_SMOKE_MODEL=/models/eliza-1-smoke.gguf \
-  ./cuda_runner.sh --report hardware-results/cuda-remote.json
+ELIZA_DFLASH_HARDWARE_REPORT_DIR=hardware-results \
+  ./cuda_runner.sh --report "$REPORT"
+node -e 'const fs=require("node:fs"); const r=JSON.parse(fs.readFileSync(process.argv[1],"utf8")); if (!(r.status==="pass" && r.passRecordable && r.evidence?.gpuInfo && r.evidence?.toolchainInfo && r.evidence?.modelSha256)) { console.error(JSON.stringify(r,null,2)); process.exit(1); }' "$REPORT"
 ```
 
 Current CUDA blockers:
 
 - Requires Linux with `nvcc`, `nvidia-smi`, and a real NVIDIA GPU.
 - Requires a real GGUF smoke model; fixture-only runs do not count.
+- Remote collection must copy back the target-generated report; a local wrapper report with missing `gpuInfo`, `toolchainInfo`, or `modelSha256` is not recordable.
 - Need at least one x64 CUDA pass and one aarch64 Hopper/GH200-class pass.
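
The inline `node -e` gate in the CUDA commands is dense, so here is a more readable sketch of the same predicate. It assumes only the report fields the one-liner already reads (`status`, `passRecordable`, `evidence.gpuInfo`, `evidence.toolchainInfo`, `evidence.modelSha256`). The function and constant names are hypothetical and are not part of `cuda_runner.sh`.

```javascript
const fs = require("node:fs");

// Evidence keys the CUDA one-liner requires; other backends may require fewer.
const CUDA_EVIDENCE_KEYS = ["gpuInfo", "toolchainInfo", "modelSha256"];

// Return the evidence keys that are absent or empty in a parsed report.
function missingEvidence(report, keys) {
  return keys.filter((key) => !report.evidence?.[key]);
}

// Same predicate as the inline check: pass status, recordable flag,
// and every required evidence key present.
function isRecordable(report, keys = CUDA_EVIDENCE_KEYS) {
  return (
    report.status === "pass" &&
    Boolean(report.passRecordable) &&
    missingEvidence(report, keys).length === 0
  );
}

// Load a report from disk and summarize why it is or is not recordable.
function checkReportFile(reportPath, keys = CUDA_EVIDENCE_KEYS) {
  const report = JSON.parse(fs.readFileSync(reportPath, "utf8"));
  return {
    recordable: isRecordable(report, keys),
    missing: missingEvidence(report, keys),
  };
}
```

Returning the list of missing keys makes it easier to tell a local wrapper report (no `gpuInfo`, `toolchainInfo`, or `modelSha256`) from a genuine target-generated failure.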
 
 ## ROCm
@@ -80,10 +91,20 @@ Native Linux AMD:
 
 ```bash
 cd packages/inference/verify
+HOST_ID=$(hostname -s 2>/dev/null || hostname)
+REPORT="hardware-results/rocm-${HOST_ID}.json"
+mkdir -p hardware-results
 ELIZA_DFLASH_SMOKE_MODEL=/models/eliza-1-smoke.gguf \
-  ./rocm_runner.sh --report hardware-results/rocm-$(hostname).json
+ELIZA_DFLASH_HARDWARE_REPORT_DIR=hardware-results \
+ELIZA_DFLASH_CMAKE_FLAGS='-DCMAKE_HIP_ARCHITECTURES=gfx942' \
+  ./rocm_runner.sh --report "$REPORT"
+node -e 'const fs=require("node:fs"); const r=JSON.parse(fs.readFileSync(process.argv[1],"utf8")); if (!(r.status==="pass" && r.passRecordable && r.evidence?.gpuInfo && r.evidence?.toolchainInfo && r.evidence?.modelSha256)) { console.error(JSON.stringify(r,null,2)); process.exit(1); }' "$REPORT"
 ```
 
+Use `-DCMAKE_HIP_ARCHITECTURES=gfx90a` for MI250, `gfx942` for MI300, and
+`gfx1100;gfx1101;gfx1102` for RDNA3-class consumer coverage. For RDNA4, pin
+the exact `gfx*` agent reported by `rocminfo` before recording evidence.
+
 Current ROCm blockers:
 
 - Requires x86_64 Linux with `hipcc`, `rocminfo`, and a `gfx*` AMD GPU agent.
@@ -96,43 +117,71 @@ Native GH200/Hopper aarch64 Linux:
 
 ```bash
 cd packages/inference/verify
+HOST_ID=$(hostname -s 2>/dev/null || hostname)
+REPORT="hardware-results/gh200-${HOST_ID}.json"
+CUDA_REPORT="${REPORT%.json}.cuda.json"
+mkdir -p hardware-results
 ELIZA_DFLASH_SMOKE_MODEL=/models/eliza-1-smoke.gguf \
-  ./gh200_runner.sh --report hardware-results/gh200-$(hostname).json
+ELIZA_DFLASH_HARDWARE_REPORT_DIR=hardware-results \
+  ./gh200_runner.sh --report "$REPORT"
+node -e 'const fs=require("node:fs"); for (const p of process.argv.slice(1)) { const r=JSON.parse(fs.readFileSync(p,"utf8")); if (!(r.status==="pass" && r.passRecordable && r.evidence?.gpuInfo && r.evidence?.modelSha256)) { console.error(p); console.error(JSON.stringify(r,null,2)); process.exit(1); } }' "$REPORT" "$CUDA_REPORT"
 ```
 
 Current GH200 blockers:
 
 - Requires aarch64 Linux userspace and H100/H200/GH200-class GPU or compute capability 9.x.
 - Delegates to `cuda_runner.sh` with `CUDA_TARGET=linux-aarch64-cuda` and `-DCMAKE_CUDA_ARCHITECTURES=90a`.
+- Save both the GH200 wrapper report and the delegated CUDA report (`${REPORT%.json}.cuda.json` by default).
 - Needs real server hardware; this Mac cannot verify it.
 
 ## Windows
 
 Native Windows CUDA:
 
 ```powershell
+$ReportDir = "C:\temp\eliza-hardware-results"
+New-Item -ItemType Directory -Force -Path $ReportDir | Out-Null
+$Report = Join-Path $ReportDir "windows-cuda-$env:COMPUTERNAME.json"
+$env:ELIZA_DFLASH_HARDWARE_REPORT_DIR = $ReportDir
 pwsh -File packages/inference/verify/windows_runner.ps1 `
   -Backend cuda `
   -Model C:\models\eliza-1-smoke.gguf `
-  -Report C:\temp\eliza-cuda.json
+  -ReportDir $ReportDir `
+  -Report $Report
+$r = Get-Content $Report | ConvertFrom-Json
+if (-not ($r.status -eq "pass" -and $r.passRecordable -and $r.evidence.gpuInfo -and $r.evidence.toolchainInfo -and $r.evidence.modelSha256)) { $r | ConvertTo-Json -Depth 8; throw "CUDA evidence is not recordable" }
 ```
 
 Native Windows Vulkan:
 
 ```powershell
+$ReportDir = "C:\temp\eliza-hardware-results"
+New-Item -ItemType Directory -Force -Path $ReportDir | Out-Null
+$Report = Join-Path $ReportDir "windows-vulkan-$env:COMPUTERNAME.json"
+$env:ELIZA_DFLASH_HARDWARE_REPORT_DIR = $ReportDir
 pwsh -File packages/inference/verify/windows_runner.ps1 `
   -Backend vulkan `
   -Model C:\models\eliza-1-smoke.gguf `
-  -Report C:\temp\eliza-vulkan.json
+  -ReportDir $ReportDir `
+  -Report $Report
+$r = Get-Content $Report | ConvertFrom-Json
+if (-not ($r.status -eq "pass" -and $r.passRecordable -and $r.evidence.gpuInfo -and $r.evidence.modelSha256)) { $r | ConvertTo-Json -Depth 8; throw "Vulkan evidence is not recordable" }
 ```
 
 Native Windows CPU:
 
 ```powershell
+$ReportDir = "C:\temp\eliza-hardware-results"
+New-Item -ItemType Directory -Force -Path $ReportDir | Out-Null
+$Report = Join-Path $ReportDir "windows-cpu-$env:COMPUTERNAME.json"
+$env:ELIZA_DFLASH_HARDWARE_REPORT_DIR = $ReportDir
 pwsh -File packages/inference/verify/windows_runner.ps1 `
   -Backend cpu `
   -Model C:\models\eliza-1-smoke.gguf `
-  -Report C:\temp\eliza-cpu.json
+  -ReportDir $ReportDir `
+  -Report $Report
+$r = Get-Content $Report | ConvertFrom-Json
+if (-not ($r.status -eq "pass" -and $r.passRecordable -and $r.evidence.modelSha256)) { $r | ConvertTo-Json -Depth 8; throw "CPU evidence is not recordable" }
 ```
 
 Current Windows blockers:
@@ -152,7 +201,8 @@ node packages/app-core/scripts/ios-xcframework/run-physical-device-smoke.mjs
 
 Current iOS blockers:
 
-- The only iPhone currently visible to Xcode is offline.
+- The physical iPhone is visible to CoreDevice as paired/available, but `xctrace` still lists UDID `00008130-001955E91EF8001C` as offline.
+- Retrying with the CoreDevice identifier reached an interactive `Password:` prompt before XCTest output; do not enter credentials inside the runner.
 - Simulator runs do not count as physical iOS evidence.
 - Physical smoke must validate the embedded Metal library, Capacitor bridge load, and at least one real local-inference route from the app shell.
 
@@ -161,8 +211,11 @@ Current iOS blockers:
 Before publishing any Eliza-1 bundle to Hugging Face:
 
 - Generate final GGUF weights and fused bundle manifest.
+- Keep VAD as the required `vad/silero-vad-int8.onnx` sidecar; do not
+  treat every release payload as GGUF-only.
 - Record SHA-256 for every payload file.
-- Include license manifests for text, voice, ASR, vision, DFlash, and kernel sidecars.
+- Include license manifests for text, voice, ASR, VAD, vision, DFlash,
+  and kernel sidecars.
 - Run tier evals and hardware smoke for the target platform class.
 - Upload to the `elizalabs` Hugging Face org and preserve upload logs/artifact URLs.
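
As an illustration of the "Record SHA-256 for every payload file" step, the sketch below walks a bundle directory and hashes each file in chunks, so multi-gigabyte GGUF weights are never loaded into memory at once. The output shape (`path`, `sha256`) and the function names are assumptions for illustration only; the real bundle manifest format is not shown in this diff.

```javascript
const crypto = require("node:crypto");
const fs = require("node:fs");
const path = require("node:path");

// Hash one file with SHA-256, reading 1 MiB at a time.
function sha256File(filePath) {
  const hash = crypto.createHash("sha256");
  const fd = fs.openSync(filePath, "r");
  const buffer = Buffer.alloc(1 << 20);
  try {
    let bytesRead;
    while ((bytesRead = fs.readSync(fd, buffer, 0, buffer.length, null)) > 0) {
      hash.update(buffer.subarray(0, bytesRead));
    }
  } finally {
    fs.closeSync(fd);
  }
  return hash.digest("hex");
}

// Recursively list regular files under a directory.
function listFiles(dir) {
  return fs.readdirSync(dir, { withFileTypes: true }).flatMap((entry) => {
    const full = path.join(dir, entry.name);
    if (entry.isDirectory()) return listFiles(full);
    return entry.isFile() ? [full] : [];
  });
}

// Produce one { path, sha256 } record per payload file, with
// bundle-relative, forward-slash paths in sorted order.
function hashBundle(bundleDir) {
  return listFiles(bundleDir)
    .sort()
    .map((file) => ({
      path: path.relative(bundleDir, file).split(path.sep).join("/"),
      sha256: sha256File(file),
    }));
}
```

Because the walk covers every file rather than only `*.gguf`, non-GGUF sidecars such as `vad/silero-vad-int8.onnx` are hashed as well.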
 
 
@@ -3,9 +3,10 @@
 `scripts/eliza-browser-app-harness.mjs` is a Puppeteer-over-Eliza skeleton for
 benchmarking browser-agent tasks through the Eliza app surface.
 
-The harness has one hard boundary: it does not drive target websites. It creates
-a normal Eliza conversation, sends a prompt instructing the agent to use its
-built-in `BROWSER` action, then observes Eliza-owned APIs and the Eliza app UI.
+The harness has one hard boundary: it does not drive target websites. By
+default it opens the Eliza app with Puppeteer, types the task into the normal
+chat composer, and instructs the agent to use its built-in `BROWSER` action.
+After that it only observes Eliza-owned APIs and the Eliza app UI.
 
 ## Quick Start
 
@@ -22,6 +23,18 @@ bun run harness:browser-app -- \
   --no-launch \
   --target-url https://example.com/ \
   --prompt "Open the page and report its headline." \
+  --require-browser-tab \
+  --require-browser-events \
+  --timeout 90s
+```
+
+Use the conversation API instead of the UI only for API-only CI runs:
+
+```sh
+bun run harness:browser-app -- \
+  --prompt-via-api \
+  --target-url https://example.com/ \
+  --prompt "Open the page and report its headline." \
   --timeout 90s
 ```
 
@@ -46,6 +59,16 @@ tmp/eliza-browser-harness/<run-id>/
   polling.
 - `--no-launch`: require an already-running stack.
 - `--prompt <text>`: task text to wrap in the harness BROWSER-action prompt.
+- `--prompt-via-ui`: type the prompt into the Eliza app chat UI with Puppeteer
+  (default).
+- `--prompt-via-api`: send the prompt through `POST
+  /api/conversations/:id/messages` instead of typing it into the UI.
+- `--require-browser-tab`: fail unless a browser workspace tab is observed by
+  the end of the run.
+- `--require-browser-events`: fail unless browser workspace events are observed
+  by the end of the run.
+- `--require-trajectory`: fail unless a trajectory record is observed by the
+  end of the run.
 - `--target-url <url>`: target URL for the agent's browser task.
 - `--timeout <ms|s|m>`: total polling time after the prompt is sent.
 - `--api-base <url>`: Eliza API base URL, default
@@ -68,13 +91,14 @@ Before prompting, the harness captures:
 - `GET /api/status`
 - `GET /api/dev/stack`
 
-After creating a conversation and sending the task prompt, it polls:
+After sending the task prompt, it polls:
 
 - `GET /api/browser-workspace`
 - `GET /api/browser-workspace/events`
 - `GET /api/trajectories?limit=20&offset=0`
 - `GET /api/dev/console-log?maxLines=400&maxBytes=256000`
-- `GET /api/conversations/:id/messages`
+- `GET /api/conversations/:id/messages` when `--prompt-via-api` created a
+  known conversation.
 
 `/api/browser-workspace/events` and `/api/dev/console-log` may return `404` on
 some stacks. Those responses are recorded as artifacts rather than treated as
@@ -91,9 +115,9 @@ The harness blocks these browser-workspace routes in its HTTP helper:
 - `/api/browser-workspace/tabs/:id/show`
 - `/api/browser-workspace/tabs/:id/hide`
 
-Puppeteer is only used to open the Eliza app UI URL and capture
-`eliza-app-initial.png` / `eliza-app-final.png`. It does not click, type, or
-navigate inside target websites.
+Puppeteer is only used to open the Eliza app UI URL, type/click the Eliza chat
+composer when `--prompt-via-ui` is active, and capture app screenshots. It does
+not click, type, navigate, or evaluate inside target websites.
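
To make the blocked-route boundary concrete, here is a minimal sketch of how an HTTP helper might refuse such requests. It covers only the two routes visible in this hunk (`/tabs/:id/show` and `/tabs/:id/hide`); the full blocked list is not shown here. The names are hypothetical and this is not the harness's actual helper.

```javascript
// Patterns for the two blocked routes visible in this hunk; the real
// harness list may contain more entries.
const BLOCKED_ROUTE_PATTERNS = [
  /^\/api\/browser-workspace\/tabs\/[^/]+\/show\/?$/,
  /^\/api\/browser-workspace\/tabs\/[^/]+\/hide\/?$/,
];

// Accept either a path or an absolute URL and test only its pathname,
// so query strings cannot be used to slip past the guard.
function isBlockedRoute(urlOrPath) {
  const { pathname } = new URL(urlOrPath, "http://localhost");
  return BLOCKED_ROUTE_PATTERNS.some((pattern) => pattern.test(pathname));
}

// Throw before any request is made to a blocked route.
function assertRouteAllowed(urlOrPath) {
  if (isBlockedRoute(urlOrPath)) {
    throw new Error(`harness must not call blocked route: ${urlOrPath}`);
  }
}
```

Calling `assertRouteAllowed` at the top of the request helper keeps read-only polls such as `GET /api/browser-workspace` working while failing fast on anything that would drive a tab.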
 
 ## Artifact Map
 
@@ -103,15 +127,24 @@ Common files:
 - `probe-health.json`, `probe-status.json`, `probe-dev-stack.json`: initial
   probe responses.
 - `discovery.json`: resolved API/UI URLs and probe status summary.
-- `conversation-create.json`: conversation creation response.
+- `conversation-create.json`: conversation creation response when using
+  `--prompt-via-api`.
 - `agent-prompt.json`: exact prompt sent to the agent.
-- `conversation-prompt-response.json`: non-streaming chat response.
+- `conversation-prompt-response.json`: non-streaming chat response when using
+  `--prompt-via-api`.
+- `ui-prompt.json`: UI prompt selectors and screenshot metadata when using
+  `--prompt-via-ui`.
 - `polls.jsonl`: every poll response, including tolerated `404`s.
 - `browser-workspace-events.jsonl`: event endpoint poll subset.
 - `poll-latest.json`: last response seen for each polled endpoint.
 - `final-*.json` or `final-*.txt`: final endpoint captures.
-- `eliza-app-initial.png`, `eliza-app-final.png`: Puppeteer screenshots when a
-  Chrome executable is available.
+- `analysis.json`: derived tab/event/trajectory counts, endpoint errors, and
+  assertion results.
+- `eliza-app-initial.png`, `eliza-app-after-ui-prompt.png`,
+  `eliza-app-final.png`: Puppeteer screenshots when a Chrome executable is
+  available.
+- `puppeteer-console.jsonl`: console and page-error events from the Eliza app
+  surface.
 - `summary.json`: final pass/fail status and run metadata.
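
The `--require-*` flags and the assertion results in `analysis.json` could relate roughly as sketched below. The field names (`tabs`, `events`, `trajectories`, `requireBrowserTab`, and so on) are assumptions for illustration; the actual `analysis.json` schema is not shown in this diff.

```javascript
// Map each opt-in flag to the observed count it gates (hypothetical names).
const REQUIREMENTS = [
  { option: "requireBrowserTab", flag: "--require-browser-tab", count: "tabs" },
  { option: "requireBrowserEvents", flag: "--require-browser-events", count: "events" },
  { option: "requireTrajectory", flag: "--require-trajectory", count: "trajectories" },
];

// Evaluate only the requirements that were enabled; a requirement passes
// when at least one matching item was observed by the end of the run.
function evaluateRequirements(counts, options) {
  const assertions = REQUIREMENTS.filter((req) => options[req.option]).map((req) => {
    const observed = counts[req.count] ?? 0;
    return { flag: req.flag, observed, passed: observed > 0 };
  });
  return { assertions, passed: assertions.every((a) => a.passed) };
}
```

With no flags enabled the assertion list is empty and the result passes, which matches the flags being described as opt-in failure conditions.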
 
 If the harness launches `bun run dev:desktop`, child stdout/stderr are written