fix: harden cloud E2E and elizaOS USB live path #7825

Merged
lalalune merged 4 commits into develop from nubs/messylinux-cloud-e2e-hardening
May 20, 2026

Conversation

@NubsCarson
Member

@NubsCarson NubsCarson commented May 20, 2026

Summary

  • Runs cloud mock-stack E2E against the real container-control-plane sidecar instead of the stale in-process control-plane mock.
  • Adds a guarded in-memory sandbox provider for NODE_ENV=test / CLOUD_E2E=1 only.
  • Adds a Node-hosted cloud-api Worker adapter for the E2E harness so CI exercises the generated router, DB queue, and sidecar forwarder without Wrangler local runtime.
  • Fixes control-plane request body forwarding under Node (duplex: "half"), closes local DB pools before stopping PGlite, and moves best-effort agent API-key revocation outside the sandbox-delete transaction.
  • Hardens the E2E adapter/teardown path after review: filters hop-by-hop request headers, awaits waitUntil batches deterministically, destroys already-streaming responses on late errors, and forcibly closes memory-sandbox keep-alive sockets.
  • Hardens the elizaOS Live USB path: the writer now prefers persistence-compatible .img artifacts, refuses direct ISO-to-USB writes by default, keeps the internal Tails GPT partition name needed by Persistent Storage while exposing the ELIZAOS filesystem label, and includes sudo for inherited Persistent Storage hooks.
  • Fixes the elizaOS onboarding crash found in VM testing when /api/voice/onboarding/profile/start returns a partial session without prompts.
  • Updates elizaOS Live docs with current virtual-USB evidence, exact remaining gaps, and release/update architecture notes.
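The guard on the in-memory sandbox provider can be sketched roughly as follows. The function name shouldUseMemoryTestProvider and the ELIZA_TEST_SANDBOX_PROVIDER=memory selector come from the review's file overview; passing the environment in as a parameter and the exact error message are assumptions made for illustration, not the merged implementation.

```typescript
// Hypothetical shape: the real code may read process.env directly.
type Env = Record<string, string | undefined>;

function shouldUseMemoryTestProvider(env: Env): boolean {
  // The memory provider is opt-in; anything else uses the normal provider.
  if (env.ELIZA_TEST_SANDBOX_PROVIDER !== "memory") return false;

  // Only test runs or the cloud E2E harness may select it.
  const allowed = env.NODE_ENV === "test" || env.CLOUD_E2E === "1";
  if (!allowed) {
    throw new Error(
      "ELIZA_TEST_SANDBOX_PROVIDER=memory requires NODE_ENV=test or CLOUD_E2E=1",
    );
  }
  return true;
}
```

Throwing instead of silently falling back makes a misconfigured production environment fail at startup rather than run against fake sandboxes.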

Root Cause

Mock-stack E2E was still wired to stale harness assumptions: wrong repo-root resolution, an in-process control-plane mock, Wrangler-dependent cloud-api startup, and teardown races against live PGlite sockets. After moving to the real sidecar path, deprovision also exposed slow best-effort API-key cleanup inside the sandbox-delete DB transaction.
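To illustrate the deprovision change, here is a minimal sketch of running best-effort cleanup after the delete transaction commits instead of inside it. The SandboxDeleteDeps interface and its method names are hypothetical stand-ins, not the actual eliza-sandbox.ts API.

```typescript
// Hypothetical dependency shapes, for illustration only.
interface SandboxTx {
  deleteSandboxRow(sandboxId: string): Promise<void>;
}

interface SandboxDeleteDeps {
  transaction<T>(fn: (tx: SandboxTx) => Promise<T>): Promise<T>;
  revokeAgentApiKey(sandboxId: string): Promise<void>;
}

async function deleteSandbox(deps: SandboxDeleteDeps, sandboxId: string): Promise<void> {
  // Keep the transaction short: only the row delete holds DB locks.
  await deps.transaction(async (tx) => {
    await tx.deleteSandboxRow(sandboxId);
  });

  // Best-effort cleanup runs after commit, so a slow or failing
  // revocation can neither extend the transaction nor roll it back.
  try {
    await deps.revokeAgentApiKey(sandboxId);
  } catch (err) {
    console.warn(`API-key revocation failed for ${sandboxId}`, err);
  }
}
```

The trade-off is that a failed revocation leaves a key to be cleaned up later, which is acceptable only because the cleanup is already described as best-effort.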

The elizaOS Live USB proof exposed separate release-path issues: raw ISO writes do not create the USB layout Tails expects for persistence, the generated USB image had branded the internal GPT partition name in a way that failed Tails' persistence eligibility guard, inherited Persistent Storage hooks still call /usr/bin/sudo, and onboarding assumed the voice-profile endpoint always returned a complete prompt list.
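For the onboarding crash, the fix amounts to validating the /api/voice/onboarding/profile/start response before it reaches UI state. The sketch below reuses the helper names normaliseCaptureSession and normaliseCapturePrompt from the review's file overview, but the field names (sessionId, id, text) and the fallback prompt are assumptions, since the real response shape is not shown here.

```typescript
// Assumed shapes; the real client types may differ.
interface CapturePrompt {
  id: string;
  text: string;
}

interface CaptureSession {
  sessionId: string;
  prompts: CapturePrompt[];
}

// Hypothetical local fallback used when the server sends no usable prompts.
const FALLBACK_PROMPTS: CapturePrompt[] = [
  { id: "fallback-1", text: "Please read this sentence aloud." },
];

function normaliseCapturePrompt(raw: unknown): CapturePrompt | null {
  if (typeof raw !== "object" || raw === null) return null;
  const { id, text } = raw as Record<string, unknown>;
  if (typeof id !== "string" || typeof text !== "string") return null;
  if (text.trim() === "") return null;
  return { id, text };
}

function normaliseCaptureSession(raw: unknown): CaptureSession {
  const obj =
    typeof raw === "object" && raw !== null ? (raw as Record<string, unknown>) : {};
  const sessionId = typeof obj.sessionId === "string" ? obj.sessionId : "";
  const rawPrompts: unknown[] = Array.isArray(obj.prompts) ? obj.prompts : [];

  // Drop malformed prompts instead of passing them through to the UI.
  const prompts = rawPrompts
    .map((p) => normaliseCapturePrompt(p))
    .filter((p): p is CapturePrompt => p !== null);

  return { sessionId, prompts: prompts.length > 0 ? prompts : FALLBACK_PROMPTS };
}
```

With this shape, a partial session such as `{ sessionId: "s1" }` yields the fallback prompts rather than an undefined list that the step component would try to index.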

Validation

Cloud validation from the prior commits on this branch:

  • bunx @biomejs/biome check packages/cloud-shared/src/lib/services/memory-sandbox-provider.ts packages/scripts/cloud/admin/dev/cloud-api-e2e-server.mjs
  • bun run --cwd packages/cloud-shared typecheck
  • bun run --cwd packages/cloud-api typecheck
  • bun run --cwd packages/test/cloud-e2e typecheck
  • CI=true MOCK_REDIS=1 MOCK_HETZNER_LATENCY=0 MOCK_HETZNER_ACTION_MS=30 CONTROL_PLANE_TICK_MS=50 DATABASE_URL=pglite://./.eliza-ci/.pgdata HCLOUD_TOKEN=test-token CONTAINER_CONTROL_PLANE_TOKEN=test-token CRON_SECRET=test-cron-secret timeout 15m bun run cloud:e2e — 4 passed in 24.2s
  • timeout 8m bun run test:cloud — 279 passed, 0 failed

Current elizaOS Live/USB/onboarding validation:

  • bash -n scripts/usb-write.sh tails/auto/scripts/create-usb-image-from-iso
  • ELIZAOS_STATIC_SOURCE_ONLY=1 ./scripts/static-smoke.sh
  • bunx @biomejs/biome check packages/ui/src/api/client-voice-profiles.ts packages/ui/src/api/client-voice-profiles.test.ts packages/ui/src/components/onboarding/VoicePrefixSteps.tsx
  • bun run --cwd packages/ui test src/api/client-voice-profiles.test.ts — 21 passed
  • Loop-backed guard test proved scripts/usb-write.sh refuses direct ISO writes before writing.
  • Virtual USB VM proof booted a raw disk image as QEMU USB mass storage, reached elizaOS boot/greeter/desktop/app/API, created and unlocked Tails Persistent Storage, rebooted the same virtual USB and unlocked it again, and activated the elizaOS/Milady persistence bind mounts.
  • The virtual USB test used a stale pre-fix image as the runtime artifact. It found the missing-sudo greeter crash on the persistence-enabled reboot path; this PR fixes that in source, but a fresh ISO plus .img rebuild is still required before the current HEAD can be called final USB-ready.
  • git diff --check

Remaining Release Gates

  • Rebuild the ISO and persistence-compatible USB .img from this exact branch head.
  • Re-run the VM virtual-USB proof on that fresh .img.
  • Flash and boot a physical USB for hardware evidence before making final USB-ready claims.
  • Privacy-mode embedded browser/OAuth behavior, production release keys, updater UX, SBOM/provenance, and enterprise mirror/policy remain production-readiness work, not merge blockers for this demo/productization branch.

Greptile Summary

This PR hardens the cloud mock-stack E2E harness by replacing the stale in-process control-plane mock with the real container-control-plane sidecar, introducing a guarded in-memory sandbox provider for test environments, and adding a Node-hosted Worker fetch adapter so CI exercises the actual router, DB queue, and sidecar forwarder without Wrangler.

  • Cloud E2E harness overhaul (stack.ts, cloud-api-e2e-server.mjs, memory-sandbox-provider.ts): swaps in the real control-plane sidecar, adds hop-by-hop header filtering, fixes duplex: "half" for Node body forwarding, deterministically awaits waitUntil batches, and destroys streaming responses on late errors.
  • Sandbox teardown fixes (eliza-sandbox.ts, api-keys.ts): moves best-effort API-key revocation outside the delete transaction and combines findByName + deleteByName into one RETURNING delete call.
  • USB installer and voice-profile improvements (linux-backend.ts, client-voice-profiles.ts): adds mountinfo-based system-disk detection, and normalises malformed capture-session responses with a fallback instead of passing unvalidated server data to UI state.
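The duplex: "half" fix reflects a requirement of Node's undici-based fetch: a request with a streaming body must set the duplex option or the call is rejected. A minimal sketch of building the forwarded init, where buildForwardInit and ForwardInit are illustrative names rather than the code in _container-control-plane-forward.ts:

```typescript
// Some TypeScript lib versions lack `duplex` on RequestInit, so widen it here.
type ForwardInit = RequestInit & { duplex?: "half" };

function buildForwardInit(
  method: string,
  headers: Headers,
  body: ReadableStream<Uint8Array> | null,
): ForwardInit {
  const init: ForwardInit = { method, headers };
  if (body !== null) {
    // Streaming request bodies need half-duplex mode under Node's fetch.
    init.body = body;
    init.duplex = "half";
  }
  return init;
}
```

Setting the option only when a body is present keeps bodyless GET and DELETE forwards unchanged.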

Confidence Score: 5/5

Safe to merge — all changed paths have been validated locally and the core logic changes are sound.

The deleteByName consolidation in api-keys.ts was verified against the repository implementation, which uses a RETURNING clause so deleted rows are correctly returned for cache invalidation. The sandbox teardown refactor in eliza-sandbox.ts correctly moves best-effort API-key revocation outside the transaction without any data-loss risk. The new E2E server correctly filters hop-by-hop headers, drains waitUntil promises before writing response headers, and handles late errors by destroying the socket. No correctness issues were found.
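Draining waitUntil work deterministically could look something like the following. This is a sketch of the general technique under the assumption that the adapter collects promises passed to ctx.waitUntil(); the WaitUntilQueue class is hypothetical and not taken from cloud-api-e2e-server.mjs.

```typescript
class WaitUntilQueue {
  private pending: Promise<unknown>[] = [];

  // Mirrors the Worker ctx.waitUntil(promise) contract.
  waitUntil(promise: Promise<unknown>): void {
    // Attach a no-op handler so an early rejection is not reported as unhandled.
    void promise.catch(() => undefined);
    this.pending.push(promise);
  }

  // Await work in batches: tasks may register more work while they run,
  // so keep looping until a batch finishes without adding anything new.
  async drain(): Promise<void> {
    while (this.pending.length > 0) {
      const batch = this.pending;
      this.pending = [];
      await Promise.allSettled(batch);
    }
  }
}
```

Using Promise.allSettled means one failing background task does not prevent the others from being awaited before the response is written.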

No files require special attention.

Important Files Changed

Filename | Overview
packages/scripts/cloud/admin/dev/cloud-api-e2e-server.mjs | New Node-hosted Worker adapter: correctly filters hop-by-hop headers, drains waitUntil promises before writing response, and destroys sockets on late errors.
packages/cloud-shared/src/lib/services/memory-sandbox-provider.ts | New test-only in-memory sandbox provider: properly tracks sockets, calls closeIdleConnections(), destroys all sockets, and races server.close() against a 2 s timeout.
packages/cloud-shared/src/lib/services/eliza-sandbox.ts | API-key revocation moved outside the delete transaction so slow best-effort cleanup no longer holds a DB lock.
packages/cloud-shared/src/lib/services/api-keys.ts | Replaces findByName + deleteByName with a single deleteByName call that uses RETURNING to return deleted rows; confirmed correct by reading the repository implementation.
packages/test/cloud-e2e/src/fixtures/stack.ts | Major fixture refactor: replaces in-process control-plane mock with real sidecar subprocess, adds Node executable resolution, and properly sequences PGlite vs app process teardown.
packages/cloud-api/v1/_container-control-plane-forward.ts | Adds duplex: "half" to the fetch init only when a body is present, fixing request body forwarding under Node's undici-based fetch.
packages/ui/src/api/client-voice-profiles.ts | Adds normaliseCaptureSession and normaliseCapturePrompt helpers that filter malformed server responses and fall back to local prompts.
packages/os/usb-installer/src/backend/linux-backend.ts | Adds mountinfo-based system-disk detection via sysfs ancestor traversal and a test-injectable currentSystemDiskNames override.
packages/cloud-shared/src/lib/services/sandbox-provider.ts | Adds shouldUseMemoryTestProvider() guard that throws if ELIZA_TEST_SANDBOX_PROVIDER=memory is used outside NODE_ENV=test or CLOUD_E2E=1.
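The socket-tracking teardown described for memory-sandbox-provider.ts can be approximated with Node's http module as below. The helper names (trackSockets, forceClose, createTrackedServer) are hypothetical, and the ordering of the calls is one reasonable choice rather than a copy of the merged code; the 2 s default matches the timeout mentioned in the overview.

```typescript
import { createServer, type Server } from "node:http";
import type { Socket } from "node:net";

// Remember every accepted socket so teardown can destroy keep-alive connections.
function trackSockets(server: Server): Set<Socket> {
  const sockets = new Set<Socket>();
  server.on("connection", (socket: Socket) => {
    sockets.add(socket);
    socket.once("close", () => sockets.delete(socket));
  });
  return sockets;
}

async function forceClose(
  server: Server,
  sockets: Set<Socket>,
  timeoutMs = 2000,
): Promise<void> {
  // Stop accepting new connections; the callback fires once all are gone.
  const closed = new Promise<void>((resolve) => server.close(() => resolve()));
  server.closeIdleConnections();
  for (const socket of sockets) socket.destroy();

  // Never let teardown hang: give up waiting after timeoutMs.
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<void>((resolve) => {
    timer = setTimeout(resolve, timeoutMs);
  });
  await Promise.race([closed, timeout]);
  clearTimeout(timer);
}

// Small helper showing how the two pieces fit together.
function createTrackedServer(): { server: Server; sockets: Set<Socket> } {
  const server = createServer((_req, res) => res.end("ok"));
  return { server, sockets: trackSockets(server) };
}
```

Without the explicit destroy, a client holding a keep-alive connection can keep server.close() from ever invoking its callback, which is the hang the teardown change is meant to rule out.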

Sequence Diagram

sequenceDiagram
    participant TR as Test Runner
    participant ST as stack.ts fixture
    participant PG as PGlite TCP Bridge
    participant CP as container-control-plane
    participant API as cloud-api Node adapter
    participant MSP as MemorySandboxProvider

    TR->>ST: startCloudStack()
    ST->>PG: spawn pglite-server.ts
    ST->>ST: spawnSync migrate-with-diagnostics.ts
    ST->>CP: spawn bun run start
    ST->>CP: waitForHttpOk /health
    ST->>API: spawn node --import tsx cloud-api-e2e-server.mjs
    ST->>API: waitForHttpOk /api/health
    ST-->>TR: StackHandle

    TR->>API: POST /api/v1/eliza/agents
    API->>CP: forward request
    CP->>MSP: create()
    MSP-->>CP: SandboxHandle
    CP-->>API: sandbox created
    API-->>TR: 200 OK

    TR->>ST: stop()
    ST->>API: SIGTERM
    ST->>CP: SIGTERM
    ST->>ST: closeDatabaseConnectionsForTests()
    ST->>PG: SIGTERM

Reviews (5): Last reviewed commit: "fix(os): harden USB persistence and onbo..."

@coderabbitai
Contributor

coderabbitai Bot commented May 20, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f9104407-98d7-4b4e-8d8d-83237dc40609

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

@NubsCarson NubsCarson force-pushed the nubs/messylinux-cloud-e2e-hardening branch from 24dfb21 to 7f92e47 May 20, 2026 02:43
@claude
Contributor

claude Bot commented May 20, 2026

Claude Code is working…

I'll analyze this and get back to you.

View job run

@claude
Contributor

claude Bot commented May 20, 2026

Claude encountered an error after 0s —— View job


I'll analyze this and get back to you.

Comment thread packages/cloud-shared/src/lib/services/memory-sandbox-provider.ts Outdated
@NubsCarson NubsCarson force-pushed the nubs/messylinux-cloud-e2e-hardening branch from 7f92e47 to b8d2366 May 20, 2026 02:53
Comment thread packages/scripts/cloud/admin/dev/cloud-api-e2e-server.mjs
@NubsCarson NubsCarson force-pushed the nubs/messylinux-cloud-e2e-hardening branch 2 times, most recently from 16158b7 to b78a85b May 20, 2026 05:08
@claude
Contributor

claude Bot commented May 20, 2026

Claude encountered an error after 0s —— View job


I'll analyze this and get back to you.

@NubsCarson NubsCarson force-pushed the nubs/messylinux-cloud-e2e-hardening branch 2 times, most recently from 7497cf6 to 75046d5 May 20, 2026 05:14
@NubsCarson
Member Author

Updated this PR on top of current develop with the review follow-ups from the latest run:

  • Filter hop-by-hop Node HTTP headers before constructing the Worker Request.
  • Await ctx.waitUntil() work in batches before returning the E2E adapter response.
  • Destroy the response socket if a streaming error happens after headers are sent.
  • Track and destroy memory-sandbox keep-alive sockets so teardown cannot hang.
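As a rough illustration of the first follow-up, hop-by-hop headers can be stripped while converting Node's incoming headers into a Worker-style Headers object. The function name toWorkerHeaders is hypothetical and the header set below is the commonly cited hop-by-hop list; the exact set filtered by the adapter is not shown in this thread.

```typescript
// Commonly cited hop-by-hop headers; treat this list as an assumption.
const HOP_BY_HOP = new Set([
  "connection",
  "keep-alive",
  "proxy-authenticate",
  "proxy-authorization",
  "proxy-connection",
  "te",
  "trailer",
  "transfer-encoding",
  "upgrade",
]);

type NodeHeaders = Record<string, string | string[] | undefined>;

function toWorkerHeaders(nodeHeaders: NodeHeaders): Headers {
  // Any header named in Connection is hop-by-hop for this request as well.
  const connection = nodeHeaders["connection"];
  const connectionValue = Array.isArray(connection)
    ? connection.join(",")
    : (connection ?? "");
  const listed = new Set(
    connectionValue
      .split(",")
      .map((name) => name.trim().toLowerCase())
      .filter((name) => name.length > 0),
  );

  const headers = new Headers();
  for (const [name, value] of Object.entries(nodeHeaders)) {
    const lower = name.toLowerCase();
    if (value === undefined || HOP_BY_HOP.has(lower) || listed.has(lower)) continue;
    // Multi-valued headers arrive as arrays from Node; append each value.
    for (const item of Array.isArray(value) ? value : [value]) {
      headers.append(lower, item);
    }
  }
  return headers;
}
```

The resulting Headers object can then be passed to the Request constructor, so connection-level details of the Node socket do not leak into the Worker handler.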

Latest local validation on 8639eb755b:

  • bunx @biomejs/biome check packages/cloud-shared/src/lib/services/memory-sandbox-provider.ts packages/scripts/cloud/admin/dev/cloud-api-e2e-server.mjs
  • bun run --cwd packages/cloud-shared typecheck
  • bun run --cwd packages/cloud-api typecheck
  • bun run --cwd packages/test/cloud-e2e typecheck
  • CI=true MOCK_REDIS=1 MOCK_HETZNER_LATENCY=0 MOCK_HETZNER_ACTION_MS=30 CONTROL_PLANE_TICK_MS=50 DATABASE_URL=pglite://./.eliza-ci/.pgdata HCLOUD_TOKEN=test-token CONTAINER_CONTROL_PLANE_TOKEN=test-token CRON_SECRET=test-cron-secret timeout 15m bun run cloud:e2e — 4 passed in 24.3s
  • timeout 8m bun run test:cloud — 279 passed, 0 failed
  • git diff --check

I did not use the earlier standalone BlueBubbles route invocation as the gate for this update; the broader test:cloud sweep is the current local unit evidence.

@NubsCarson NubsCarson force-pushed the nubs/messylinux-cloud-e2e-hardening branch from 8639eb7 to f3aa7bf May 20, 2026 05:25
@NubsCarson
Member Author

Revalidated after the branch was rebased onto latest origin/develop.

Current PR head: f3aa7bf8f7 on origin/develop f6f16699fc.

Latest local validation:

  • bunx @biomejs/biome check packages/cloud-shared/src/lib/services/memory-sandbox-provider.ts packages/scripts/cloud/admin/dev/cloud-api-e2e-server.mjs
  • bun run --cwd packages/cloud-shared typecheck
  • bun run --cwd packages/cloud-api typecheck
  • bun run --cwd packages/test/cloud-e2e typecheck
  • CI=true MOCK_REDIS=1 MOCK_HETZNER_LATENCY=0 MOCK_HETZNER_ACTION_MS=30 CONTROL_PLANE_TICK_MS=50 DATABASE_URL=pglite://./.eliza-ci/.pgdata HCLOUD_TOKEN=test-token CONTAINER_CONTROL_PLANE_TOKEN=test-token CRON_SECRET=test-cron-secret timeout 15m bun run cloud:e2e — 4 passed in 24.2s
  • timeout 8m bun run test:cloud — 279 passed, 0 failed
  • git diff --check

@NubsCarson NubsCarson force-pushed the nubs/messylinux-cloud-e2e-hardening branch from f3aa7bf to 12f906e May 20, 2026 05:44
@NubsCarson NubsCarson force-pushed the nubs/messylinux-cloud-e2e-hardening branch from 12f906e to af2a1d2 May 20, 2026 05:48
@claude
Contributor

claude Bot commented May 20, 2026

Claude encountered an error after 0s —— View job


I'll analyze this and get back to you.

@github-actions github-actions Bot added the ui label May 20, 2026
@claude
Contributor

claude Bot commented May 20, 2026

Claude encountered an error after 0s —— View job


I'll analyze this and get back to you.

@NubsCarson NubsCarson changed the title from "fix(cloud): harden mock-stack E2E harness" to "fix: harden cloud E2E and elizaOS USB live path" May 20, 2026
@lalalune lalalune merged commit 1d1b894 into develop May 20, 2026
38 of 40 checks passed
lalalune pushed a commit that referenced this pull request May 20, 2026
@lalalune lalalune deleted the nubs/messylinux-cloud-e2e-hardening branch May 20, 2026 07:31

2 participants