Skip to content

Message-free supervisor: process split — agent ⇄ supervisor over gRPC, programs included#977

Open
odesenfans wants to merge 47 commits into
devfrom
dev-accelerate
Open

Message-free supervisor: process split — agent ⇄ supervisor over gRPC, programs included#977
odesenfans wants to merge 47 commits into
devfrom
dev-accelerate

Conversation

@odesenfans

@odesenfans odesenfans commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Experiment: drive the supervisor/agent split up to the gRPC boundary in one pass. Rebased on dev after the #976 squash-merge — the diff is exactly the 10 experiment commits.

What & why

Phase 1 of the backport design, accelerated: the agent and the supervisor run as two separate processes, with every live VM operation — on-demand Firecracker programs included — crossing proto/supervisor.proto over a Unix-socket gRPC link. The in-process mode remains the default; split mode is the strangler toggle.

Changes

  • gRPC transport (supervisor/proto_convert.py, grpc_server.py, grpc_client.py): SupervisorService wraps any Supervisor behind the proto; GrpcSupervisor implements the ABC as a client. Errors cross as a serialized ErrorDetail in the aleph-supervisor-error-bin trailer + mapped status codes; the client rebuilds the exact SupervisorError class (UNIMPLEMENTEDNotImplementedSupervisorError). ReinstallVmRequest.wipe_volumes becomes optional bool so unset takes the ABC's True default.
  • Daemon (python -m aleph.vm.supervisor): owns the VmPool (controllers, networking, systemd, reattach), serves on ALEPH_VM_SUPERVISOR_GRPC_SOCKET. SIGTERM stops the server only — VMs keep running and reattach on the next start (Message-free supervisor: reboot-recovery from on-disk configs #957 path).
  • Contract additions for programs: CreateVmRequest.program_mode plus VmInfo.control_socket_path / runtime_version / ipv4_gateway / ipv6_gateway. Infra-only vocabulary: the payloads spoken over the control socket stay opaque to the supervisor.
  • Program path crosses the seam (the "hardest entanglement"):
    • agent: build_program_create_vm_spec downloads code/runtime/data/volumes (AlephProgramResources) and emits a path-only spec (RUNTIME/CODE/EXTRA disk roles; drive order is the contract guest device names derive from);
    • supervisor: SpecFirecrackerProgram boots from resolved paths, vsock on, init handshake awaited as part of boot;
    • agent: ProgramGuestClient owns the guest API process (<vsock>_53), the runtime config push and run_code (CONNECT 52); run_code_on_request/on_event rewritten off pool/VmExecution, serialized per VM (one create, one config push under concurrency — race-tested).
  • Split mode: ALEPH_VM_SUPERVISOR_GRPC_SOCKET set ⇒ the agent builds no pool and wires GrpcSupervisor; endpoints whose capability has not crossed yet return 501 via require_vm_pool (backups, restore, confidential, migration, network recreate, HAProxy regenerate, GPU reservation, persistent programs). Agent shutdown no longer stops VMs in this mode.
  • Hardening found by the live run: bounded guest-API startup wait (the inherited loop hung forever on child death), explicit fork context (Python 3.14 changed the Linux default), server-side traceback logging for ERROR_CODE_INTERNAL aborts.

Out of scope (501 / skipped in split mode, consistent with the series)

Persistent programs, backups/restore, confidential, migration, domains/HAProxy, GPU reservation, the advisory admission gate. Known nuance: no in-flight-run tracking across the boundary yet — a reap during a long run_code kills that request (mitigated by the expiry cancel/schedule bracketing). See the design doc §6.

Test plan

  • Live e2e through the split (local, no jailer): diagnostic-program message loaded from the Aleph network → agent-side downloads → CreateVm over UDS gRPC boots the microvm in the daemon (init handshake, runtime 2.0.0) → agent config push + run_code('/') over vsock → HTTP 200DeleteVm clean.
  • GrpcSupervisor conformance + class-exact error round-trips + streaming over a real in-test UDS channel; DTO⇄proto round-trips; spec-program boot config; _ensure_program_vm behaviors incl. the concurrency race; split-mode wiring.
  • Full tests/ suite: 741 passed, 9 pre-existing environment-only failures (identical set at the base).
  • mypy error set identical to base; ruff format, isort, proto-clean gates green.

Protocol review follow-through (second pass)

A full review of the contract, implemented as a commit series (each commit one concern):

Vocabulary scrubbed off the wire

  • DiskConfig.mount dropped (dead on both backends — Firecracker mounts travel over the guest channel, QemuVM never read the field).
  • BACKEND_QEMU_SEV dropped: the VMM and the TEE are orthogonal; confidential = backend: QEMU + TeeConfig presence. The spec path now rejects TEE specs loudly instead of booting unprotected.
  • Enum-typed wire fields (HealthStatus, Protocol, TeeBackend); is_instance field name reserved too.
  • VmInfo IPs folded into IpAssignment {address, network_cidr, gateway} × 2 families.
  • hostname moved into CreateVmRequest (base32-of-item-hash naming is computed agent-side now).
  • GuestChannel.ready_timeout_secs: boot-time policy crosses the wire instead of living in supervisor settings.
  • Same-host invariant (paths by reference over a shared fs) stated in the proto header.

Capabilities

  • GetVmSpec; idempotent CreateVm on vm_id (same spec → current info, conflict → ALREADY_EXISTS); honest RebootVm (ephemeral spec VMs are stopped and re-created from the held spec).
  • StopVm/StartVm — stop without releasing the definition (persistent VMs; ephemeral answer UNIMPLEMENTED, their cycle is DeleteVm+CreateVm).
  • WatchEvents server stream + a split-mode agent watcher that drops per-VM agent state when a VM leaves RUNNING (the cross-process replacement for the in-process reap hooks), reconnecting on failure.
  • Deadlines on every unary client RPC (30s query / 300s lifecycle — gRPC's default is infinite); live log chunks stamped at capture (no more 1970-epoch lines).

Verified (second pass): full suite 769 passed + 3 xfailed, same 9 env-only failures; mypy error set identical to base; live e2e re-ran clean on the final contract (ready_timeout_secs=20 visible on the wire, HTTP 200, clean delete). Deferred items (resource accounting, google.rpc.Status envelope, capabilities RPC, migration reshape, opaque cloud-init user-data) are recorded in the design doc §7.

Design & status: docs/plans/2026-06-11-grpc-process-split-design.md

🤖 Generated with Claude Code

…ersion

The wire layer for Phase 1: SupervisorService wraps any Supervisor behind
proto/supervisor.proto over a UDS; GrpcSupervisor implements the Supervisor
ABC as a gRPC client. SupervisorError crosses as a serialized ErrorDetail in
the aleph-supervisor-error-bin trailer plus a mapped status code, and the
client rebuilds the exact exception class (UNIMPLEMENTED → NotImplemented).

ReinstallVmRequest.wipe_volumes becomes `optional bool` so an unset field
takes the ABC's True default instead of proto3's false.

Tested: DTO/proto round-trips, conformance + behavior over a real UDS
channel, class-exact error round-trips, streaming, reinstall default.
…cket

python -m aleph.vm.supervisor owns the VmPool (controllers, networking,
systemd, reattach) and serves the gRPC contract on
ALEPH_VM_SUPERVISOR_GRPC_SOCKET (default EXECUTION_ROOT/supervisor.sock).
SIGTERM stops the server only — VMs keep running and are reattached on the
next start, the #957 recovery path. Reattach failure is non-fatal so the
contract stays reachable.

New setting SUPERVISOR_GRPC_SOCKET: when set, the agent will talk to this
socket instead of the in-process supervisor (wired in a follow-up).
…ateways on VmInfo

CreateVmRequest.program_mode asks the supervisor for the program boot flow
(vsock enabled, guest-init ready signal awaited as part of boot). VmInfo
gains the infra facts the agent needs to own the guest-level protocols
across the process split: control_socket_path (the vsock UDS the agent
dials / binds listeners beside), runtime_version (the init handshake's
version string), and the tap gateway addresses for the network config the
agent pushes. No Aleph vocabulary crosses: payloads on the control socket
stay opaque to the supervisor.
…VmSpec

SpecFirecrackerProgram boots a program microvm from resolved paths only:
kernel + RUNTIME-role root disk + optional CODE disk + EXTRA disks, vsock
enabled, guest-init handshake awaited as part of boot. The guest API and
config push are explicitly NOT here — they are agent protocols spoken over
the control socket. Drive order (rootfs, code, extras) is the contract the
agent derives guest device names from.

pool.create_vm_from_spec routes backend=FIRECRACKER + program_mode specs
to the new boot path (ephemeral only; persistent programs stay legacy).
Admission for spec-built VMs remains the agent's responsibility, as
documented on check_admission.
… boundary

The 'hardest entanglement' crosses the seam. run_code_on_request /
run_code_on_event no longer touch pool/VmExecution for ephemeral programs:

- build_program_create_vm_spec downloads code/runtime/data/volumes
  agent-side (AlephProgramResources) and emits a path-only spec
  (backend=FIRECRACKER, program_mode, RUNTIME/CODE/EXTRA disk roles).
- ProgramGuestClient owns the guest protocols the controller used to run:
  the guest API process on <vsock>_53, the runtime configuration push
  (CONNECT 52 — code bytes, entrypoint, variables, network from VmInfo's
  gateways/runtime_version), and run_code. Guest device names derive from
  the spec's disk order.
- _ensure_program_vm is the get-or-create: a running VM this agent process
  did not configure is recreated (the runtime takes one config push per
  boot, so unknown == unusable). Failures map to the same HTTP responses
  as before via both error vocabularies (download-phase controller
  exceptions + SupervisorError).
- Idle expiry / update-watch reapers also drop the agent's guest state
  (guest API process, configured mark) via the on_reaped hooks.

Persistent programs keep the legacy pool path (spec-path support deferred);
instances were already migrated. Suite: 705 passed, 9 pre-existing
environment-only failures (identical at base).
When ALEPH_VM_SUPERVISOR_GRPC_SOCKET is set, the agent constructs no VmPool:
app['supervisor'] is a GrpcSupervisor on that socket and the daemon
(python -m aleph.vm.supervisor) owns the VMs. Agent shutdown no longer stops
VMs in this mode — that is the point of separating the lifecycles.

All migrated surface works across the boundary: allocations, instance
lifecycle, operator stop/erase/reboot/reinstall/logs, port forwards, the
about/executions views, and the on-demand program path. Endpoints whose
capability has not crossed yet degrade to 501 via require_vm_pool (backups,
restore, confidential, migration, network recreate, HAProxy regenerate, GPU
reservation, persistent programs) instead of crashing on a missing pool;
about-pages fall back to pool-less reporting (no GPUs, raw free disk).

Smoke-tested: supervisor daemon in one process, agent webapp with pool=None
in another, /v2/about/executions/list served through the real UDS channel.
Two simultaneous requests to a not-yet-created program raced to
create-and-configure it; the runtime accepts one configuration push per
boot, so the loser would wedge the VM. _ensure_program_vm now runs under a
per-VM creation lock owned by the ProgramGuestClient; followers re-check
inside the lock and take the fast path. Pinned by a gather()-based race
test.

Also records the first-pass status and known gaps in the design doc
(in-flight run vs reap across the boundary, deferred capabilities).
… server-side

Found by a live end-to-end run (real diagnostic program in a Firecracker
microvm, agent and supervisor as separate processes over the UDS contract):

- The guest API wait loop (inherited from the old controller) spun forever
  when the child process died at startup. Bounded to 10s, with the child's
  death surfaced as VmSetupError. The child is now forked explicitly:
  Python 3.14 changed the Linux default away from fork, and the guest API
  relies on inheriting the parent's loaded settings.
- A stale guest-API socket from a previous run is unlinked before binding.
- The gRPC server now logs the chained traceback when it aborts an RPC with
  ERROR_CODE_INTERNAL — those are 'not a vocabulary case' bugs and were
  previously invisible on the daemon side.

The live run (Aleph network message -> agent-side downloads -> CreateVm over
gRPC -> init handshake -> agent config push + run_code over vsock -> HTTP
200 -> DeleteVm) completes cleanly.
…ready payload, mechanism-only disk roles, no is_instance

Review feedback: the contract must stay generic regardless of the
hypervisor; Firecracker-vs-QEMU is the only distinction the supervisor
should see.

- CreateVmRequest.program_mode (program/instance vocabulary on the wire)
  becomes `optional GuestChannel guest_channel { ready_port }`: a pure
  mechanism request — expose a host⇄guest channel and treat the guest's
  ready signal on ready_port as part of boot. Firecracker instances are
  naturally 'FC without a channel'.
- VmInfo.control_socket_path → guest_channel_path (empty when the VM has no
  channel, which covers QEMU). VmInfo.runtime_version → guest_ready_payload:
  the raw bytes the guest sent, passed through opaquely — the supervisor no
  longer parses the Aleph runtime's msgpack handshake; the agent does
  (runtime_config_from_ready_payload). MicroVM keeps the raw payload and
  takes the ready port as a parameter.
- DiskRole collapses to ROOTFS/EXTRA: which disk is the root device is
  mechanism; code/runtime/data are workload roles the client maps onto
  guest devices via disk order (which it already did).
- VmInfo.is_instance removed (field 18 reserved): agent vocabulary, derived
  from the registry; the guest channel's presence is the registry-miss
  fallback for the /about list labels.
- The Aleph runtime's channel conventions (control port 52, guest API 53)
  move to aleph.vm.utils.runtime_channel, agent-side, and replace the
  hardcoded CONNECT 52 strings.
- packaging: deb package ships grpcio (matches pyproject).

Re-validated live through the split: CreateVm over gRPC reports
ready_payload=b'\x81\xa7version\xa52.0.0' (raw guest msgpack), agent
parses it, config push + run_code('/') → HTTP 200, DeleteVm clean.
Suite: 742 passed, same 9 environment-only failures as base.
@codecov

codecov Bot commented Jun 11, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 73.34025% with 1028 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.01%. Comparing base (70cc3ea) to head (be9425f).

Files with missing lines Patch % Lines
tests/integration/conftest.py 24.23% 244 Missing and 3 partials ⚠️
tests/integration/test_vm_management.py 16.66% 95 Missing ⚠️
src/aleph/vm/orchestrator/run.py 53.37% 56 Missing and 20 partials ⚠️
src/aleph/vm/orchestrator/vm/program_client.py 48.22% 72 Missing and 1 partial ⚠️
tests/integration/test_resource_release.py 20.89% 53 Missing ⚠️
src/aleph/vm/supervisor/daemon.py 0.00% 51 Missing ⚠️
tests/integration/test_backup_restore.py 18.64% 48 Missing ⚠️
tests/integration/test_vm_creation.py 21.66% 47 Missing ⚠️
src/aleph/vm/pool.py 28.30% 37 Missing and 1 partial ⚠️
src/aleph/vm/supervisor/inprocess.py 89.45% 23 Missing and 14 partials ⚠️
... and 25 more
Additional details and impacted files
@@            Coverage Diff             @@
##              dev     #977      +/-   ##
==========================================
- Coverage   78.92%   78.01%   -0.91%     
==========================================
  Files         161      188      +27     
  Lines       18469    22107    +3638     
  Branches     1328     1562     +234     
==========================================
+ Hits        14576    17246    +2670     
- Misses       3517     4427     +910     
- Partials      376      434      +58     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

- Drop DiskConfig.mount: guest mount points are client vocabulary no
  hypervisor mechanism consumes (Firecracker mounts travel over the
  guest channel; QemuVM never read the controller-config field).
- Drop BACKEND_QEMU_SEV: the VMM and the TEE are orthogonal. A
  confidential VM is backend QEMU + a TeeConfig; the spec path now
  rejects TEE specs explicitly (TeeUnavailableError) instead of
  encoding them as a backend variant it would silently ignore.
- Enum-ify the stringly-typed fields: HealthStatus, Protocol (port
  forwarding), TeeBackend (TeeConfig + Measurement).
- Reserve the "is_instance" field name alongside its number.
- State the same-host invariant (paths by reference over a shared
  filesystem) in the proto header.
…ages

The six flat strings (ipv4/ipv6 address, *_network, *_gateway) with
mixed bare-IP/CIDR shapes become two IpAssignment messages, one per
address family: {address (bare IP), network_cidr, gateway (bare IP)}.
Old field numbers and names are reserved.
…equest

The supervisor was deriving the hostname from the vm_id with Aleph's
base32-of-item-hash convention. The agent now computes it and sends it;
the supervisor applies it verbatim (empty falls back to a mechanical
vm_id truncation).
… the client's

How long a guest legitimately takes to signal ready is a fact about the
image the client chose; it was bound by the supervisor-side
settings.INIT_TIMEOUT. The channel now carries the bound (0 = supervisor
default); the agent sends its runtime policy.
Live StreamLogs lines carried timestamp_ns=0 as a sentinel, which
clients rendered as the 1970 epoch. The contract says timestamp_ns is
server-side capture time; now it is.
- GetVmSpec(vm_id) returns the spec a live VM was created from, letting
  the agent rebuild its records after a restart (UNIMPLEMENTED for
  message-built legacy executions).
- CreateVm is idempotent on vm_id: an identical spec for a live VM
  returns its current info (safe retry after UNAVAILABLE or a missed
  deadline); a different spec — or a collision with a message-built
  execution — fails ALREADY_EXISTS instead of silently returning the
  wrong VM.
- RebootVm actually reboots ephemeral spec-created VMs: stop, then
  re-create from the held spec, instead of returning a stopped husk the
  client had to know to re-create.
DeleteVm was the only way down: there was no way to stop a persistent
VM without erasing it, or to bring it back. StopVm stops the VM but
keeps it listed (STOPPED, forget-on-stop defused the same way reinstall
does); StartVm restarts it through the existing restart_persistent_vm
path (tap/nftables/port-mapping re-application) and is a no-op read on
a running VM. Ephemeral VMs answer UNIMPLEMENTED — their cycle is
DeleteVm + CreateVm.
In split mode the agent had no way to learn a VM went down without
polling: the in-process reap hooks only fire for agent-initiated
teardowns. WatchEvents streams every transition the supervisor performs
(create / stop / start / reboot's down-then-up / reinstall / delete),
no replay (snapshot with ListVms, then watch).

Agent side: a split-mode background task consumes the stream and drops
per-VM agent state (guest API process, configured mark, idle-teardown
timers) when a VM leaves RUNNING — the cross-process replacement for
the reap hooks. Reconnects on stream failure; degrades gracefully if
the supervisor answers UNIMPLEMENTED.

Spontaneous guest-death detection (a crashed VMM with no supervisor
operation) is not detected anywhere today; noted in the proto as a
future extension.
gRPC's default deadline is infinite: a wedged supervisor daemon would
hang the agent's HTTP handlers forever on a UDS that never answers.
Every unary call now carries one — 30s for queries and quick host-side
mutations, 300s for operations that boot or drain VMs — via a shared
_unary helper that also centralizes the error-trailer translation.
Streams (logs, events, backup download) are long-lived by design and
carry none.
mypy parity with the base branch restored (the stubs reject valid
async (Application) -> None handlers when app is precisely typed).

@foxpatch-aleph foxpatch-aleph left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well-designed PR implementing the gRPC-based process split between agent and supervisor. Clean ABC decomposition, robust closed-error vocabulary serialized in gRPC trailers, proper deadline discipline, and good backward compatibility via the Strangler Fig pattern. Minor comments on layering and a pre-existing timeout-handling gap.

src/aleph/vm/supervisor/grpc_client.py (line 49):

src/aleph/vm/orchestrator/run.py (line 662):

CI type-checks tests/ too; the mixed-value dict inferred as
dict[str, object] rejected attribute access on the mocks.

@foxpatch-aleph foxpatch-aleph left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well-designed PR implementing the agent/supervisor process split via gRPC. The boundary contract is cleanly defined, error translation is thorough, program path migration correctly splits concerns between agent and supervisor, and test coverage is excellent including round-trip DTO conversions, class-exact error round-trip parametrization over a real UDS channel, race-condition tests for concurrent cold starts, and split-mode wiring. A few minor observations about process lifecycle timing and silent returns in deferred paths, but none blocking.

src/aleph/vm/orchestrator/vm/program_client.py (line 257): asyncio.sleep(0) after process.terminate() yields to the event loop but doesn't wait for the signal to be delivered. Consider removing the sleep(0) entirely since it's misleading here — the subsequent sleep(2) is the actual wait. Not a bug, just misleading.

src/aleph/vm/supervisor/grpc_client.py (line 115): The if metadata: check works because trailing_metadata() returns an empty tuple () when absent. A brief comment noting this assumption would help future readers.

src/aleph/vm/orchestrator/tasks.py (line 252): _handle_domains_aggregate silently returns when pool is None (split mode). Consider adding a debug log so it's clear why domain updates aren't happening.

src/aleph/vm/orchestrator/run.py (line 704): The legacy event path _run_code_on_event_legacy no longer calls expiry.cancel(vm_id) before serving, while the original run_code_on_event did. This is a minor behavioral regression for persistent programs under the legacy path — expiry isn't reset before serving.

src/aleph/vm/supervisor/types.py (line 112): Backend.QEMU_SEV enum value is removed. Since Backend ceased being a StrEnum with this PR, code comparing backend == "qemu_sev" would silently fail. Verified the test suite passes, so no such code survives — but worth a mention in the PR description or commit message as a breaking change.

Wire the six BackupOps methods to the existing qemu backup machinery
(controllers/qemu/backup.py); the gRPC plumbing for them already existed.
Backups cover the rootfs disk only, matching what restore can put back.

- start_backup: async job per VM, idempotent against a running job and
  against a non-expired archive (24h TTL, as the operator endpoint did);
  optional guest fs-freeze via quiesce_guest, best-effort.
- Supervisor-issued backup ids use microsecond timestamps so a retry
  after a failed run gets a fresh id; create_backup_archive accepts the
  timestamp so the tar stem matches the id issued at start.
- get_backup_status/list_backups: completed archives are read from disk
  (the source of truth); only in-flight and failed runs live in memory.
- download_backup: 1 MiB chunks with offsets.
- restore_backup: extract the rootfs member (member-streamed, no
  extractall), verify, stop the VM with the forget-on-stop defused,
  swap the rootfs atomically, restart; emits down-then-up events.
  Persistent QEMU VMs only.
The reap task checked its stop_event identity but then deleted by hash, so
when reboot_vm (or a delete+create sequence) recreated the VM under the same
vm_id while the old stop was still being reaped, the task removed the NEW
execution from the pool: the freshly rebooted VM vanished from GetVm/ListVms
while its hypervisor kept running. Found by the supervisor integration suite
(reboot test); forget by identity instead.
An opt-in (AVM_ITEST=1) suite that drives a real supervisor daemon over its
Unix-socket gRPC contract, agent-free: specs are built inline from local
artifacts, never from Aleph messages.

- Harness: session-scoped daemon with private CACHE_ROOT/EXECUTION_ROOT;
  per-test gRPC clients; self-gating skips (KVM, root, kernel/runtime/cloud
  image paths via AVM_ITEST_* env vars). As root, a systemd drop-in (in
  /run, removed on teardown) points aleph-vm-controller@ at this source
  tree so persistent VMs run the code under test. Venv runs get the distro
  dbus/nftables modules through a symlink shim instead of all of
  dist-packages on PYTHONPATH.
- Creation: Firecracker boots to the vsock ready signal; QEMU instances get
  an IP and answer with an SSH banner from the host; idempotent re-create
  and ALREADY_EXISTS conflicts; get_vm/list_vms/get_vm_spec agreement.
- Management: journald logs via get_logs/stream_logs, ephemeral reboot by
  recreation, watch_events transitions, port-forward round trip against the
  live nftables ruleset, QEMU stop/start/reboot with reachability after
  each transition.
- Resource release: hypervisor processes, execution-root files, TAP
  interfaces, firewall rules and controller units all gone after delete;
  repeated create/delete cycles under one vm_id accumulate nothing.
- Backups: full cycle on a live instance - backup, mutate guest state over
  SSH, download + verify the archive stream, restore and observe the
  mutation undone, delete; Firecracker rejection asserted over the wire.
…th, delete leftovers

Three product gaps found by the integration suite on a real host:

- QEMU exits at startup when EXECUTION_ROOT makes its control socket paths
  exceed sun_path's 108 bytes ('UNIX socket path is too long'), and the
  failure was invisible (see below). settings.check() now rejects roots
  that cannot fit <exec>/<64-hex>-monitor.socket.
- wait_for_controller_ready returned as soon as the unit sampled 'active',
  but a Restart=on-failure unit whose process dies right away (the exact
  symptom of the socket-path failure) is 'active' between crashes, so
  create_vm reported RUNNING for a VM that never existed. The wait now
  re-checks 2s later and fails loudly on a flapping unit.
- delete_vm left the controller config and the cloud-init seed behind
  (caught by the resource-release test). Delete now removes them; stop_vm
  keeps them, reattach and StartVm read them.
The root-mode session root moved to /var/lib/avm-<rand>/e: the previous
paths (both the pytest /tmp root and /var/lib/aleph-vm-itest-<rand>/exec)
made the qemu control socket paths exceed 108 bytes, so qemu died at
startup on every QEMU test while the flapping unit still sampled as
active. A guard asserts the socket paths fit before starting the daemon.

@foxpatch-aleph foxpatch-aleph left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Large, well-structured PR splitting the agent and supervisor into separate processes over gRPC. The implementation is thorough with proper error translation, idempotent operations, concurrent access handling, and comprehensive test coverage (769 tests). Protocol vocabulary has been carefully cleaned up. No correctness or security issues identified.

src/aleph/vm/orchestrator/vm/program_client.py (line 226): The await asyncio.sleep(0) after process.terminate() is likely insufficient to let the process exit. Consider await asyncio.sleep(0.5) as the first wait, or use process.join(timeout=2) if the forked process supports it.

src/aleph/vm/conf.py (line 477): The sun_path check says "UNIX socket paths are limited to 108 bytes" (total including NUL) but asserts longest_socket_path <= 107 (path only). This is correct, but the error message could be clearer: "108 bytes total including NUL terminator; longest user path is 107".

src/aleph/vm/models.py (line 740): The 2-second sleep before confirming 'active' is correct for crash-loop detection but adds delay for every VM start. Consider whether a shorter interval (e.g., 0.5-1s) would suffice - systemd state transitions are near-instantaneous for a crash-loop but the 2s window is conservative.

VmExecution.stop() queued the systemd stop job and immediately deleted
the TAP interface while qemu was still performing the guest's graceful
ACPI shutdown. qemu's tap file descriptor goes bad (TUNSETVNETHDRSZ
ioctl fails), it aborts without flushing the qcow2, and the next boot
kernel-panics on the corrupted image. stop() now waits for the
controller unit to reach inactive/failed before teardown, bounded a
little past systemd's TimeoutStopSec.

Two related gaps on the same paths:
- restart_persistent_vm returned right after RestartUnit, so StartVm
  and RestoreBackup could report BOOTING (or RUNNING on a unit that was
  about to crash-loop). It now awaits wait_for_controller_ready and
  stamps started_at.
- qemu never unlinks its UNIX control sockets; delete_vm now removes
  the -monitor/-qmp/-qga .socket files with the rest of the VM's
  on-disk artifacts.

Found by the supervisor integration suite (stop/start cycle, restore
cycle and delete resource-release tests).

@foxpatch-aleph foxpatch-aleph left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ autodev could not generate a review for this PR (empty output from the model).

Click Re-request review to retry.

…panics

The Ubuntu 26.04 kernel attempts a kexec-handover (KHO) scratch
reservation at every boot. On a 768 MiB guest it reserves ~550 MiB,
leaving ~72 MiB; PID 1 then OOMs and the kernel panics with "System is
deadlocked on memory" before the network comes up. Whether the
reservation succeeds depends on the KASLR-randomized kernel placement,
so 768 MiB guests booted or panicked at random (the 240s SSH banner
timeouts in test_backup_restore_cycle and test_qemu_stop_start_reboot_cycle).

At 2048 MiB the reservation always fits (~650 MiB) and the guest keeps
1.2 GiB available: verified with 4 boots at each size, all deterministic.

@foxpatch-aleph foxpatch-aleph left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR implements a major architectural change — splitting the monolith into agent and supervisor processes connected via gRPC over a Unix domain socket. The design is clean, the error vocabulary is well-defined, and the strangler pattern (InProcessSupervisor implements the same contract as GrpcSupervisor) is correctly executed. Test coverage is comprehensive with conformance tests, wire-format round-trips, event streaming, and integration tests. I found no correctness bugs, no security issues, and the code quality is high. I have several comments but none are blockers.

… reboot

The persistent branch of reboot_vm called systemd restart (which only
queues the job) and sampled the unit state immediately, so a healthy
reboot was reported as BOOTING. Await wait_for_controller_ready like
start_vm and restore_backup do, and refresh times.started_at since a
reboot is a new start.

@foxpatch-aleph foxpatch-aleph left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well-structured PR that cleanly splits agent and supervisor into two processes with a well-designed gRPC contract. Strong engineering across the board: clear proto vocabulary, thorough error translation, comprehensive testing (unit + integration), and good documentation. Minor observations only: an unnecessary asyncio.sleep(0) in program_client.forget(), a removed noqa suppression in errors.py, and the unbounded event queue should be noted for future consideration if high-frequency events are added.

src/aleph/vm/orchestrator/vm/program_client.py (line 258): await asyncio.sleep(0) after process.terminate() does not yield enough time for the process to actually die — the sleep(0) just passes control back to the event loop for one iteration. This means the code always falls through to the 2-second sleep branch. Consider either removing the sleep(0) entirely, or making it a meaningful wait like 0.5.

src/aleph/vm/supervisor/errors.py (line 140): The removal of the # noqa: BLE001 comment leaves a bare except clause. While the translating_errors context makes the intent clear, a linter configured with BLE001 enabled will now flag this.

src/aleph/vm/orchestrator/supervisor.py (line 179): The watch_supervisor_events async generator is driven over an unbounded asyncio.Queue inside InProcessSupervisor. Today this is fine because lifecycle events are human-scale. If spontaneous guest-death detection is added in the future (making events high-frequency), consider adding max-size or backpressure to the queue.

src/aleph/vm/supervisor/inprocess.py (line 604): Excellent — the InProcessSupervisor is growing beyond a thin pool wrapper into the canonical boundary implementation. The backup state management with disk-based source of truth (for completed archives) and in-memory only for in-flight/failed is the right balance.

systemd garbage-collects a unit once it reaches a clean inactive state,
so wait_for_controller_stopped polling at 1s intervals often misses the
brief "inactive" window: GetUnit then raises NoSuchUnit, which
get_service_active_state mapped to "unknown", and the loop burned the
full 75s timeout on every persistent stop (visible as a ~75s hole
between Stopped and the next Started in the integration runs).

Map NoSuchUnit to a distinct "not-loaded" state and accept it as
stopped; other D-Bus errors remain "unknown" and keep being retried so
a transient failure cannot trigger an early network teardown.

@foxpatch-aleph foxpatch-aleph left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ autodev could not generate a review for this PR (empty output from the model).

Click Re-request review to retry.

Closed grpc.aio channels can be trapped in reference cycles (exception
chains keep frames alive, frames keep the client alive), so the last
ones are only freed by the GC pass that runs during interpreter
shutdown. On Python 3.14 AioChannel.__dealloc__ then tries to join
grpc's poller thread, which finalization forbids, printing an ignored
PythonFinalizationError traceback after an otherwise green run.

Run one cyclic GC pass in pytest_sessionfinish so the channels are
collected while the interpreter is still alive. Reproduced and verified
standalone: a closed channel held in a cycle prints the exact traceback
at exit, and a gc.collect() before exit silences it.

@foxpatch-aleph foxpatch-aleph left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ autodev could not generate a review for this PR (empty output from the model).

Click Re-request review to retry.

…r visible

The backup tar was created directly at its final <backup_id>.tar path,
and the presence of that file is what get_backup_status, list_backups,
download_backup and find_existing_backup all take as backup completion.
During the multi-hundred-MB tar write, a status poll therefore reported
COMPLETE with whatever size the growing file had at that instant.

Caught by test_backup_restore_cycle: get_backup_status returned
COMPLETE with size_bytes=569361920 while the writer was still going,
and the later download streamed the real 861091840-byte archive.
find_existing_backup had the same hazard: a second StartBackup could
adopt a half-written tar as a valid existing backup.

Write the tar to a .tar.partial staging name and rename it into place
after the sha256 sidecar and metadata are written, mirroring what
_restore_rootfs_sync already does for the rootfs swap. A failed run
removes the staging file and leaves no .tar behind.
cleanup_expired_backups now also removes .tar.partial files older than
the TTL (a writer that died mid-backup); fresh ones are left alone.

@foxpatch-aleph foxpatch-aleph left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A substantial, well-structured refactor that introduces a process-split architecture between agent and supervisor over gRPC. The design is clean — message-agnostic DTOs, closed error vocabulary, clear boundary between supervisor (VMM boot, disk lifecycle, backup) and agent (Aleph message parsing, guest protocol, configuration push). However, there are several correctness and safety issues that should be addressed before merging.

src/aleph/vm/supervisor/grpc_client.py (line 108):

src/aleph/vm/orchestrator/vm/program_client.py (line 208):

src/aleph/vm/supervisor/inprocess.py (line 626):

src/aleph/vm/supervisor/inprocess.py (line 747):

src/aleph/vm/supervisor/inprocess.py (line 658):

src/aleph/vm/supervisor/inprocess.py (line 766):

src/aleph/vm/supervisor/grpc_server.py (line 94):

src/aleph/vm/supervisor/inprocess.py (line 680):

src/aleph/vm/supervisor/inprocess.py (line 699):

src/aleph/vm/supervisor/inprocess.py (line 771):

src/aleph/vm/supervisor/inprocess.py (line 406):

src/aleph/vm/supervisor/proto_convert.py (line 168):

src/aleph/vm/supervisor/inprocess.py (line 589):

tests (line ):

src/aleph/vm/supervisor/inprocess.py (line 96):

src/aleph/vm/supervisor/grpc_client.py (line 74):

src/aleph/vm/supervisor/inprocess.py (line 298):

New tests-integration job in the pytest workflow, parallel to the unit
tests. It provisions the same artifacts the unit job already uses
(Firecracker v1.15.1 + guest kernel, the debootstrapped
aleph-debian-12-python runtime) plus the Ubuntu 24.04 cloud image for
the QEMU tests (LTS rather than the 26.04 devel image for stability;
the suite takes the image via AVM_ITEST_QEMU_IMAGE), then runs
tests/integration as root with AVM_ITEST=1 so the QEMU, TAP networking,
port forwarding and backup/restore tests all execute.

Guest DNS is pinned to public resolvers like in the droplet CI: the
runner's systemd-resolved on 127.0.0.53 is unreachable from inside a
VM. On failure the job dumps the controller unit journal and the guest
console lines for debugging.

@foxpatch-aleph foxpatch-aleph left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A sound, well-structured architecture split that decomposes the monolithic agent+supervisor into two communicating processes over a gRPC/Unix-socket boundary. The design cleanly separates concerns (agent owns networking/messages/user policy, supervisor owns VMMs/systemd/backups), uses a closed error vocabulary with reliable cross-process error propagation via gRPC trailers, and maintains backward compatibility through a strangler pattern. The proto contract is carefully designed with same-host invariants explicitly documented. Backup operations are fully implemented with fsfreeze support. Tests show 769 passed. Minor issues noted below.

src/aleph/vm/supervisor/grpc_server.py (line 77):

src/aleph/vm/supervisor/grpc_server.py (line 193):

src/aleph/vm/supervisor/grpc_client.py (line 229):

src/aleph/vm/supervisor/grpc_client.py (line 139):

src/aleph/vm/orchestrator/run.py (line 237):

src/aleph/vm/orchestrator/run.py (line 398):

src/aleph/vm/orchestrator/run.py (line 555):

src/aleph/vm/orchestrator/run.py (line 253):

src/aleph/vm/supervisor/inprocess.py (line 459):

src/aleph/vm/supervisor/inprocess.py (line 698):

src/aleph/vm/supervisor/inprocess.py (line 758):

proto/supervisor.proto (line 498):

proto/supervisor.proto (line 11):

src/aleph/vm/supervisor/grpc_server.py (line 65):

src/aleph/vm/supervisor/errors.py (line 91):

src/aleph/vm/supervisor/grpc_client.py (line 99):

src/aleph/vm/supervisor/daemon.py (line 67):

src/aleph/vm/supervisor/daemon.py (line 51):

src/aleph/vm/orchestrator/supervisor.py (line 163):

src/aleph/vm/orchestrator/supervisor.py (line 211):

src/aleph/vm/supervisor/grpc_client.py (line 155):

src/aleph/vm/orchestrator/supervisor.py (line ):

src/aleph/vm/supervisor/inprocess.py (line 745):

src/aleph/vm/controllers/qemu/backup.py (line None):

src/aleph/vm/orchestrator/run.py (line 122):

src/aleph/vm/orchestrator/run.py (line 349):

src/aleph/vm/supervisor/inprocess.py (line 402):

src/aleph/vm/pool.py (line None):

src/aleph/vm/controllers/configuration.py (line None):

src/aleph/vm/supervisor/proto_convert.py (line 1):

src/aleph/vm/orchestrator/run.py (line 38):

src/aleph/vm/supervisor/inprocess.py (line 626):

src/aleph/vm/supervisor/inprocess.py (line 714):

src/aleph/vm/supervisor/errors.py (line 133):

src/aleph/vm/supervisor/grpc_server.py (line 80):

src/aleph/vm/supervisor/grpc_server.py (line 114):

src/aleph/vm/supervisor/inprocess.py (line 489):

src/aleph/vm/orchestrator/vm/program_client.py (line 180):

src/aleph/vm/orchestrator/vm/program_client.py (line 70):

src/aleph/vm/orchestrator/vm/program_client.py (line 76):

src/aleph/vm/supervisor/inprocess.py (line 312):

src/aleph/vm/orchestrator/supervisor.py (line 207):

The settings check added in 6dfb76b rejected execution roots whose
longest QEMU control socket path (<root>/<64-hex>-monitor.socket)
exceeds 107 bytes. The real limit is 108: sun_path is char[108] and
QEMU fills it exactly without a trailing NUL, only failing on 109+
(verified against qemu-system-x86_64: 108 binds, 109 dies with 'Path
must be less than 108 bytes'). Python clients do need room for a NUL
(107 max) but only ever connect to the -qmp.socket, whose suffix is 4
bytes shorter, so the bound path remains the binding constraint.

The off-by-one broke test_create_qemu_instance(_online) on CI, where
tempfile gives /tmp/alephtest_XXXXXXXX/exec: exactly 108, which booted
fine for years. Relax the assert to <= 108 and align the integration
conftest's mirror of it.

@foxpatch-aleph foxpatch-aleph left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR implements Phase 1 of the agent/supervisor process split — a substantial architectural refactor that introduces a clean gRPC boundary between the two processes. The design is contract-first with a well-defined proto, frozen-dataclass DTOs, and a capability-based ABC interface. Error handling is carefully designed with class-exact round-trip via ErrorDetail trailers and mapped gRPC status codes. The strangler pattern is correctly applied: the in-process path remains fully functional and the gRPC path is introduced alongside it, selectable via the SUPERVISOR_GRPC_SOCKET setting. Backward compatibility is preserved throughout; existing message-driven code paths are untouched. New code is well-structured, documented, and follows established patterns. The ProgramGuestClient correctly moves agent-side guest protocols (guest API, configuration push, code execution) across the boundary. The InProcessSupervisor now serves as the reference implementation for the contract, with added lifecycle methods and event emission. The proto file cleanly separates concerns with reserved fields replacing removed concepts like 'is_instance' and 'BACKEND_QEMU_SEV'. The main risk areas (deadlines on all unary calls, stream cancellation semantics, identity-based forget-on-stop fix) are addressed correctly. Testing was not verified but the code structure supports testability through the ABC abstraction.

On CI every supervisor stop_vm/delete_vm was a silent no-op for QEMU
VMs: qemu stayed alive until the session teardown, where systemd's
stop-sigterm timeout SIGKILLed it, failing three integration tests
(stop/start cycle, delete release checks, restore reachability).

Root cause chain:
- stop_and_disable only stopped the unit when is_service_active was
  true, and is_service_active first requires GetUnitFileState ==
  enabled. The integration harness's fallback controller unit had no
  [Install] section, so the instance could never be enabled, the check
  returned false for a running unit, and the stop was skipped with no
  error. Locally the packaged template (which has [Install]) masked
  this. Active does not imply enabled: gate the stop on the unit's
  actual ActiveState instead.
- The fallback unit's TimeoutStopSec=30 was shorter than the
  controller's 50s ACPI grace window, so a guest that needed escalation
  would be SIGKILLed mid-shutdown; align with the packaged unit's 60
  and add the [Install] section so enable/disable behave as in
  production.

Also filter the guest console forwarding out of the CI unit-journal
dump: it drowned the 500-line tail, hiding the stop-sigterm timeout
lines that pinpointed this.

@foxpatch-aleph foxpatch-aleph left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ autodev could not generate a review for this PR (empty output from the model).

Click Re-request review to retry.

…events and host info

Five coverage additions to the supervisor integration suite:

- Daemon restart (the recovery story): a persistent QEMU VM must survive
  a supervisor daemon SIGTERM + relaunch with the same qemu process and
  stay reachable, the new daemon must reattach it as RUNNING via
  load_persistent_executions, and stop/start/delete must work end to
  end on the reattached execution. The conftest daemon fixture now
  records its spawn environment so restart_daemon can relaunch the
  exact same daemon and wait for it to answer health.

- Extra data disks: a DiskRole.EXTRA qcow2 is visible in the guest
  (located by size, not device-name guessing), data written to it
  survives a stop/start cycle, and delete(wipe=True) erases the
  writable volume file on the host.

- Error paths over the real gRPC boundary: VmNotFoundError from every
  lookup surface, NotImplementedSupervisorError for stop/start of
  ephemeral VMs (which must leave the VM running and deletable),
  BackupNotFoundError for unknown and foreign backup ids, and
  VmNotFoundError on double delete.

- Event stream breadth: the FC test now covers create + reboot
  (down-then-up pair) + delete fanned out to two concurrent
  subscribers; the persistent QEMU cycle asserts stop/start/reboot all
  reach the stream in order.

- Smaller surfaces: get_host_info sanity, UDP port forwards
  (parametrized with TCP), and IPv6 guest reachability in the QEMU
  create test (the address is statically pushed via cloud-init, so it
  must answer on port 22 like IPv4 does).

@foxpatch-aleph foxpatch-aleph left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ autodev could not generate a review for this PR (empty output from the model).

Click Re-request review to retry.

ExportVm, ImportVm and GetMigrationStatus were NotImplemented stubs;
the wire layer (proto, gRPC server/client, conversions) already
existed. This implements the directory-based contract: export dumps a
VM into a host-local directory, import recreates it from one, and
moving the directory between hosts is the caller's business (the agent
or an external orchestrator; migrations are exercised end to end in
aleph-testnets).

Export (async job, poll via GetMigrationStatus): requires a persistent
spec-built QEMU VM, takes the same per-VM lock as backup/restore, stops
the VM through the regular stop path (the disks must be quiescent, and
the VM is leaving this host), converts qcow2 disks with qemu-img
convert -c (collapsing any backing chain so the export is
self-contained) and copies other formats verbatim, then writes a
manifest recording the spec and per-disk sha256. Progress is reported
through bytes_transferred/bytes_total. A failed export flips the job to
FAILED and restarts the VM.

Import (synchronous): validates the manifest, refuses ids that are
already registered before touching any disk (an import onto a live VM
must not overwrite the disks it runs from), verifies each disk's
sha256, copies them under PERSISTENT_VOLUMES_DIR/<vm_id>, rewrites the
spec disk paths and boots through the regular create path (conflict
guards and events included). Half-copied state is removed on failure.

The manifest serializes the spec through its proto encoding
(CreateVmRequest as JSON): the proto already is the cross-version
schema, so the manifest inherits its compatibility story instead of
inventing a second serialization.

New MIGRATION_NOT_FOUND error code end to end (proto enum, ErrorCode,
MigrationNotFoundError, status/class maps) for unknown migration ids
and directories without a manifest; the proto bindings are regenerated.
The stub-conformance lists shrink to the confidential trio.

17 new unit tests cover the manifest roundtrip and validation, export
success/failure/concurrency, import path rewriting, corruption and
conflict guards, and status lookups.

@foxpatch-aleph foxpatch-aleph left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well-structured PR that drives the agent/supervisor process split through gRPC with clean proto vocabulary, thorough error round-trips (class-exact ErrorDetail trailers), and careful concurrency management (per-VM creation locks, stop_event defusing, identity-based forget-on-stop fix). The program path is correctly split: agent downloads code/resources and emits path-only specs; supervisor boots from those paths. Tests are comprehensive (wire round-trips, error translation, streaming, split-mode wiring). Minor observation: persist_record in _ensure_program_vm runs after setup_program, so a DB write failure could leave a running VM orphaned from the registry — but this is a minor edge case without security implications, as the VM still serves correctly in memory.

src/aleph/vm/orchestrator/run.py (line 423): If persist_record fails after setup_program succeeds, the VM is running and serving but not recorded in the agent's DB. No cleanup path handles this specific failure. Consider moving persist_record before setup_program, or wrapping in a try/except that at least logs the failure.

src/aleph/vm/orchestrator/run.py (line 39): The comment in _is_spec_eligible says "Keep the two in sync" regarding build_create_vm_spec, but this coupling is fragile. Consider a shared validation function or having build_create_vm_spec itself gate the caller — or at minimum, have a test that asserts the two agree on all inputs.

src/aleph/vm/supervisor/grpc_server.py (line 176): WatchEvents, StreamLogs, and DownloadBackup have manual exception handling instead of @_translating — this is the correct pattern for async generators where the decorator can't work. Consider extracting the boilerplate (the try/except pattern is identical across all three) into a shared helper to reduce copy-paste.

src/aleph/vm/supervisor/grpc_server.py (line 95): The raise AssertionError("abort() must raise") is technically reachable if _abort is called and context.abort() does not raise (e.g. if the gRPC implementation changes). Consider adding # pragma: no cover since it's a defensive guard.

…py and host-info asserts

Four fixes from the first CI execution of the new coverage:

- test_backup_restore_cycle now takes the backup with
  quiesce_guest=True. The unquiesced qemu-img convert -U copy of a live
  disk is crash-consistent at best, and the restored guest failed to
  boot from it twice in a row on the slower CI runners (locally the
  restore boot already logged a cloud-init fatal error without failing
  the test). fsfreeze through the guest agent is how a backup of a
  running VM is meant to be taken.

- make_data_disk now creates a raw image: the controller attaches host
  volumes without a format, so qemu treats the file as a raw device. A
  qcow2 exposed its 197 KiB container bytes to the guest instead of a
  64 MiB disk, and the device-by-size lookup found nothing.

- get_host_info: cpu_architecture/vendor/model are best-effort and
  legitimately empty on GitHub's Azure runners; assert only fields
  every host populates.

- Annotate the events test's streams list: CI's mypy gate covers
  tests/ (the local run only checked src/), and it failed on the
  missing annotation.

The failure-only journal dump now prints each guest's console tail
separately: a single global tail only showed the last VM, hiding the
restored guest whose boot actually failed.

@foxpatch-aleph foxpatch-aleph left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ autodev could not generate a review for this PR (empty output from the model).

Click Re-request review to retry.

The resource-release tests sampled a count baseline of daemon-child
hypervisor processes, then asserted create added exactly one and delete
returned to the baseline. That baseline is contaminated when a previous
firecracker test is still mid-teardown: delete_vm returns before the host
reaps the firecracker process, so a neighbour's dying VM is counted in the
baseline and the create assertion (baseline + 1) fails once that process
is finally reaped. Observed locally as 'assert 1 == (1 + 1)'.

Track the PIDs this VM actually spawns via set difference against a
pre-create snapshot, so a neighbour's teardown lag cannot affect the
result.

@foxpatch-aleph foxpatch-aleph left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ autodev could not generate a review for this PR (empty output from the model).

Click Re-request review to retry.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants