feat(supervisor): Phase 2 - gRPC wire protocol + two-service packaging#981
Conversation
Two-process split as the only supported deployment: a supervisor daemon owning the pool and serving the Supervisor interface over gRPC, and an agent that reaches VMs exclusively through that interface. Key decisions: - rename InProcessSupervisor -> LocalSupervisor (the pool-backed engine) - agent keeps a thin embedded seam for dev/tests only - fold admission + GPU reservation into create_vm (atomic) - HAProxy/L7 domain mapping moves fully to the agent (fed by list_vms), out of the pool; the supervisor stays L3/L4 - interface-first, two phases: decouple agent from pool, then transport and two-service packaging
Decouple the agent from VmPool. P1.1 (rename + wiring) and P1.6a (recreate_network) are ready to build; backup/restore, migration and confidential measurement are blocked on three contract decisions noted in the plan.
Introduce orchestrator/haproxy_sync.py with sync_domain_mappings(supervisor), which derives each local VM's IP from VmInfo.ipv4.address and pushes the domain->IP map to HAProxy. Add build_map_entries_from_vm_ips and write_entries_to_backend to haproxy.py as the low-level building blocks.
Replace pool.update_domain_mapping() calls in the orchestrator views and tasks with sync_domain_mappings(supervisor): regenerate_proxy, notify_allocation, operate_update, the periodic resync task, and the domains-aggregate handler. The supervisor's list_vms() output now drives HAProxy, not the in-process pool.
Delete VmPool.update_domain_mapping and its load-time call; the agent now seeds an initial forced sync in periodic_domain_resync at startup. Drop the dead instances-based helpers from haproxy.py (fetch_list_and_update, update_backends, _build_map_entries, _resolve_vm_ip), keeping fetch_list and the low-level socket/map helpers. Add a guard test.
Add a SevInfo dataclass mirroring the seven QEMU query-sev fields and carry it (plus the base64 launch_measure) on Measurement, defaulted so existing callers and the proto path keep working. This lets the confidential measurement endpoint preserve its current response shape once the logic moves into the engine.
Move the initialize/measurement/inject-secret logic out of the agent endpoints and into the engine: write the SEV session certificates and start the controller service, query the launch measurement plus the seven query-sev fields, and inject the secret then resume the guest. The confidential ops are no longer stubs, so drop them from the conformance STUB_METHODS set.
Rewire operate_confidential_initialize, _measurement and _inject_secret
to call only the Supervisor interface: the initialize guards now read
supervisor.get_vm (status + confidential_mode) instead of the pool, and
the three endpoints no longer touch require_vm_pool, the pool or
QemuVmClient. The measurement response is preserved byte-for-byte as
{"sev_info": {7 fields}, "launch_measure": ...}; inject_secret now
returns {"status": "ok"} since the void supervisor method cannot carry
the post-injection QMP status (the agreed minor contract change).
…ot a hardcoded SEV The TEE generation is determined by the VM's confidential configuration (an input the supervisor already holds via the execution policy), so get_measurement now derives tee_backend through _confidential_mode instead of hardcoding TeeBackend.SEV. TeeBackend.SEV still covers SEV and SEV-ES (refined by sev_info.policy); SEV-SNP maps distinctly.
Grow the supervisor backup interface so the agent can drop its pool-backed backup/restore logic without any feature regression: - BackupInfo carries checksum, volumes and source_sizes for completed archives, so the agent builds the HTTP body and download sidecar headers from engine metadata. - start_backup gains include_volumes: the engine archives the VM's non-read-only persistent volumes alongside the rootfs. - restore_from_image(vm_id, image_path, max_virtual_size_bytes) restores from a QCOW2 already staged on a host path (uploaded image or downloaded volume): the engine validates, size-checks, swaps the rootfs and restarts. - GrpcSupervisor stubs restore_from_image NotImplementedError (wired in Phase 2).
Rewire the five backup/restore endpoints off the VmPool: they now reach VM and disk work only through the Supervisor interface, keeping HTTP-only concerns (owner auth, presigned download URLs and their verification, multipart/JSON parsing, staging uploaded bytes, sidecar response headers). - operate_backup calls start_backup(quiesce_guest, include_volumes): 202 while RUNNING, metadata body (sourced from BackupInfo) plus a presigned URL when COMPLETE; InvalidBackend -> 400, InsufficientResources -> 507. - operate_backup_status reports via list_backups. - operate_backup_download verifies the presigned URL, streams from download_backup, and sets Content-Length / X-Backup-Checksum / X-Source-Size from the backup metadata. - operate_backup_delete delegates to delete_backup. - operate_restore stages the uploaded QCOW2 or the volume_ref download to a temp path, then calls restore_from_image; qemu-img rejection and oversized disks surface as 400. Drops the agent-side BackupState background-task tracking, the inline QemuVmClient/fsfreeze logic, require_vm_pool and pool reads. Adds a guard test asserting the five endpoints reference no pool symbols.
The test exercised the removed operator._do_restore / _parse_restore_upload internals. Rewire it through operate_restore against a supervisor whose restore_from_image raises InvalidBackendError (what the engine's qemu-img check does on a non-QCOW2), still asserting the 4xx (not 500).
The agent's P2P pull protocol keeps the network transport (compress, hash, serve, download, rebase); add the four MigrationOps methods that own the parts touching the VmExecution and the pool, which the agent must not reach into directly: - stop_vm_for_export: graceful stop + locate the persistent-volumes dir - restart_after_failed_export: bring the VM back up after a failed export - create_migrated_vm: create a persistent VM from a staged instance message - release_migrated_vm: stop + forget a VM that has migrated away Implemented for real in LocalSupervisor, stubbed NotImplementedError in GrpcSupervisor (wired in Phase 2). Surface count 32 to 36.
Thread the supervisor (not the VmPool) through the migration runner and the three migration endpoints. The P2P pull protocol stays the agent-side network transport: the on-host HTTP disk transfer, tokens, ExportJob/ImportJob states and request/response shapes are unchanged. Only the disk/VM work moves behind the supervisor: - migration_export gates via supervisor.get_vm (VmNotFoundError = 404) and reads status/backend/confidential_mode off the returned VmInfo; run_export asks the supervisor to stop the VM and report its volumes dir, then the agent compresses and serves the files it finds there. - migration_import gates the already-running check via supervisor.get_vm; run_import downloads and rebases the peer disks (network transport) then calls supervisor.create_migrated_vm. The dest-dir rmtree decision is made against a single supervisor.get_vm probe instead of pool.executions. - migration_cleanup releases the source VM via supervisor.release_migrated_vm (engine: stop_vm + forget_vm) instead of pool.stop_vm/forget_vm. The three endpoints (and the runner) now contain no require_vm_pool / vm_pool / pool.create_a_vm / pool.executions references; a guard test asserts this via inspect.getsource.
Remove the provisional directory-based export_vm / import_vm / get_migration_status from the MigrationOps ABC, LocalSupervisor, GrpcSupervisor and the gRPC server handlers, and the migrate.py manifest module they relied on. Migration now rides the standard lifecycle RPCs; only the export-time graceful stop survives as a MigrationOps method (it needs a guest powerdown stop_vm does not do). The orphan ExportVm / ImportVm / GetMigrationStatus proto RPCs stay in supervisor.proto with a note to drop them in the Phase 2 proto pass, so the generated bindings are unchanged and check_proto_clean stays green. Surface count 36 to 30.
run_import now builds a CreateVmSpec from the fetched instance message with build_create_vm_spec (the same path run.create_vm_execution uses for a normal persistent instance) and calls supervisor.create_vm(spec) instead of the bespoke create_migrated_vm. The spec's rootfs path is PERSISTENT_VOLUMES_DIR/<vm_hash>/rootfs.qcow2, exactly where the download+rebase already staged the overlay; build_create_vm_spec adopts an already-present host-persistence overlay rather than recreating it, so create_vm reuses the staged disk with no re-download (covered by test_import_spec_build_reuses_staged_overlay). The failed-export restart rides the standard start_vm RPC instead of restart_after_failed_export. The disk/VM seam methods were removed in the previous commit; this wires the runner onto the lifecycle RPCs.
A migrated instance booted on the destination but never got its host port forwards: create_vm_from_spec only reloads persisted mappings (empty on a fresh destination), and run_import stopped at create_vm. The normal create path runs a post-create tail (wait-until-running + apply the agent's resolved port forwards, always including SSH/22); migration skipped it, so mapped_ports stayed empty, the VM was unreachable, and the testnet migration test timed out waiting for the re-dispatched VM's port 22. Extract that tail into run.finish_instance_create and call it from both the normal create path and run_import, so a migrated instance is wired identically to a freshly created one. Adds a regression assertion that run_import invokes the tail.
…/grpc-only-supervisor-phase2
foxpatch-aleph
left a comment
There was a problem hiding this comment.
Click Re-request review to retry.
foxpatch-aleph
left a comment
There was a problem hiding this comment.
Well-structured Phase 2 implementation that completes the gRPC wire protocol. The four remaining stubs are now wired, all previously-dropped fields are carried, migration RPCs are removed, HostInfo is reconciled, and the two-service packaging is implemented. Test coverage is thorough with round-trip, guard, and unit tests. No correctness or security issues found.
CI now runs mypy on tests/ for this branch (it never did while #981 targeted a non-dev base). FakePool's check_capacity/reserve_gpus were assigned dynamically in test_reserve_resources_*; declare them so mypy's attr-defined check passes.
foxpatch-aleph
left a comment
There was a problem hiding this comment.
Well-structured PR completing the gRPC supervisor wire protocol and two-service packaging split. All four previously-stubbed methods are fully wired, all eight dropped fields are carried over the wire, the orphan migration RPCs are removed, HostInfo is reconciled, and reserve_resources takes a message-free DTO. The packaging changes (two systemd units, maintainer scripts for migration, socket default) are correct and the test coverage is thorough with guard tests ensuring completeness. No correctness or security issues found.
src/aleph/vm/supervisor/local.py (line 1155): Likely fine but worth double-checking: the old check_admission(content, current_vm_hash=...) accepted an optional current_vm_hash parameter to exclude a specific VM from committed totals (for re-notification). Neither check_capacity nor reserve_resources preserves this parameter — was the current_vm_hash exclusion intentionally removed from the reserve path? Re-notification of a running instance would not normally hit reserve_resources (the message is already known), but if it did, the absence of the exclusion means a re-reserve for an already-running VM could incorrectly fail on memory. The plan doc doesn't mention this parameter, and the test test_check_admission_skips_renotify_of_running_instance was deleted in test_resources.py, so it seems intentional — just confirming the known scope.
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## dev #981 +/- ##
==========================================
+ Coverage 80.00% 80.30% +0.30%
==========================================
Files 198 201 +3
Lines 23026 23083 +57
Branches 1526 1511 -15
==========================================
+ Hits 18421 18536 +115
+ Misses 4172 4114 -58
Partials 433 433 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
The HTTP API (port 4020, /control/*) is served by the agent in the two-service split; the failure handler only dumped the supervisor unit, leaving the agent side (where /control/allocations 503s) invisible. Capture both.
foxpatch-aleph
left a comment
There was a problem hiding this comment.
Click Re-request review to retry.
The after-failure export only captured the supervisor unit; the agent (HTTP API, /control/allocations) journal is where instance-create errors surface in the two-service split.
foxpatch-aleph
left a comment
There was a problem hiding this comment.
Click Re-request review to retry.
The shared capacity-admission path (calculate_available_disk) sums get_disk_usage_delta() over every execution's resources. SpecProgramResources (spec-driven Firecracker programs) never implemented it, so once a spec program was running, any subsequent create_vm crashed with AttributeError and the agent returned 503 on /control/allocations. Spec disk admission is deferred (DiskSpec carries no size yet), so it reserves nothing: return 0, mirroring check_spec_admission's disk_mib=0.
foxpatch-aleph
left a comment
There was a problem hiding this comment.
Completes the gRPC wire protocol by wiring the four remaining stubs (run_program_code, restore_from_image, recreate_network, reserve_resources), carrying eight previously-dropped dataclass fields, reconciling HostInfo, removing orphan migration RPCs, and shipping the two-process systemd split. Clean code, thorough tests, sound architectural decisions.
tests/supervisor/test_grpc_complete.py (line 8): The "NotImplementedError" not in source guard is technically redundant with the "wired in Phase 2" check and broader than necessary — it would also catch genuine NotImplementedError usage in error handling code. Consider removing it and keeping only the "wired in Phase 2" check for a more precise guard.
The droplet test writes its own supervisor.env and installs with --force-confold,
which keeps it over the packaged one - so it must carry the two-service socket the
package ships (packaging/aleph-vm/etc/aleph-vm/supervisor.env). Without it the agent
runs embedded next to the supervisor daemon, two pools fight over the VM tap
('File descriptor in bad state'), and program networking endpoints return 503.
foxpatch-aleph
left a comment
There was a problem hiding this comment.
Click Re-request review to retry.
A confidential instance is created via the allocation path, where create_vm (build config + start to the awaiting-init state) takes ~20s. The scheduler exposes placement earlier, so the owner's one-shot init-session (/confidential/initialize) can arrive mid-create — before create_vm_execution writes the agent registry record — and got a 404 (the testnet confidential test fails here). The pre-refactor path looked the VM up in the pool (registered early) and didn't race. Wait for the record (written exactly when create completes and the VM is awaiting init) up to a bounded cap instead of 404ing.
foxpatch-aleph
left a comment
There was a problem hiding this comment.
Completes the gRPC wire protocol for the supervisor interface (4 new RPCs, 8 previously-dropped fields, removal of orphan migration RPCs), refactors pool capacity admission into a shared check_capacity core, and ships the two-process split with systemd units and maintainer scripts. All stubs are wired, all fields are carried, the embedded path is guarded for dev/test, and the test suite is comprehensive including round-trip gRPC tests over real Unix sockets, guard tests, and pool-level reservation tests.
src/aleph/vm/supervisor/grpc_client.py (line 213): The deadline formula timeout + LIFECYCLE_TIMEOUT_SECS = 300 means every run_program_code call has at least a 300s gRPC ceiling. This is correct — the engine's own timeout fires first — but worth noting for review: the gRPC deadine will be significantly larger than the engine timeout for short-lived requests.
tests/supervisor/test_grpc_complete.py (line 11): The "NotImplementedError" not in source check is slightly too broad — a legitimate NotImplementedError could be added elsewhere in grpc_client.py. The companion test_every_rpc_has_a_handler test at line 14 already covers the real concern exhaustively. Consider dropping or narrowing line 11 to skip this risk.
After the record-wait fix, init-session got past the 404 but hit a 400 'vm_running': the endpoint rejects status in (RUNNING, BOOTING), and a spec-path confidential VM awaiting its owner reports BOOTING (start() sets starting_at without launching the controller). That VM is exactly what init-session initializes, so exempt awaiting_confidential_init from the running check.
foxpatch-aleph
left a comment
There was a problem hiding this comment.
Clean, well-structured Phase 2 implementation completing the gRPC wire protocol (4 new RPCs, 8 dropped fields, HostInfo reconciliation, orphan migration RPC removal), two-service packaging (daemon + agent systemd units with upgrade migration), and thorough confinement/testing. The code is correct, readable, and well-tested. No bugs or security issues found.
src/aleph/vm/supervisor/grpc_server.py (line 67): Using is for enum comparison against ErrorCode.INTERNAL is technically correct (Python enum members are singletons), but == would be more idiomatic and robust against potential module reload edge cases. Consider changing to error.code == ErrorCode.INTERNAL for consistency with the rest of the codebase.
src/aleph/vm/pool.py (line 137): The three parameter-to-variable assignments (required_memory_mib = memory_mib, etc.) are redundant. The original check_admission needed them because it extracted values from a message object (message.resources.memory), but here the parameters can be used directly. Consider inlining to reduce noise.
src/aleph/vm/supervisor/translate.py (line 248): The build_reservation_request function accesses getattr(gpu, "model", "") for the GPU model. The comment notes "the message GPU has no model field (it is a network-derived name) so model stays empty" — but GpuProperties does have a model field in some message versions. Using getattr is safe, but the logic differs slightly from build_create_vm_spec where model is hardcoded to empty string. If a message version starts providing model, the two functions will diverge. Consider noting this intentional asymmetry more explicitly or aligning them.
tests/supervisor/test_grpc_roundtrip.py (line 86): _Fake.__abstractmethods__ = frozenset() works but is fragile — if new abstract methods are added to Supervisor, the _Fake class won't fail at definition time. Consider deriving _ASYNC_METHODS from the ABC's abstract methods via reflection instead of listing them manually. The guard test in test_grpc_complete.py catches missing server handlers but doesn't catch missing async mocks on the fake.
A confidential VM returns awaiting_confidential_init from create, so the normal create-completion (run.finish_instance_create: wait-running + port forwards) is skipped. The VM only boots once the owner injects the secret, after which nothing mapped its host ports — so SSH/22 never appeared and the testnet confidential test timed out waiting for port 22. Reconcile the port forwards in the inject-secret endpoint now that the VM is running, mirroring the normal create path.
foxpatch-aleph
left a comment
There was a problem hiding this comment.
Comprehensive Phase 2 implementation that completes the gRPC wire protocol for all Supervisor methods, removes orphan migration RPCs, carries previously-dropped fields over the wire, refactors the pool's capacity admission and GPU reservation into message-free primitives, ships a two-service systemd split with maintainer scripts for clean upgrade migration, and has thorough test coverage including guard tests and round-trip tests. The code is correct, well-structured, and properly secured.
packaging/aleph-vm/DEBIAN/postinst (line 30): The socket-wait loop doesn't log a warning on timeout, which could make post-upgrade failures hard to diagnose. Consider adding a diagnostic log message after the loop exits without finding the socket.
tests/supervisor/test_grpc_complete.py (line 12): test_no_phase2_stubs_remain is fragile — it scans for the string "NotImplementedError" in the entire grpc_client.py source. If any legitimate NotImplementedError is ever added, this test will falsely fail. Consider only checking for the specific "wired in Phase 2" pattern, or remove this guard entirely and rely on test_every_rpc_has_a_handler which is more targeted.
Phase 2: complete the gRPC wire protocol + two-service packaging
Phase 1 (#980) decoupled the agent from the pool through the
Supervisorinterface, wired to the embedded
LocalSupervisor. Phase 2 makes the sameinterface work over gRPC and ships the two-process split. Design:
docs/plans/2026-06-16-grpc-only-supervisor-design.md; plan:docs/plans/2026-06-17-grpc-only-supervisor-phase2-implementation.md.Part A — complete the wire protocol
recreate_network,restore_from_image,run_program_code(msgpack-framedscope),
reserve_resources. NoNotImplementedError("wired in Phase 2")remains (guard test enforces this).
TeeConfig.firmware_path,GpuSpec.device_id/model,CreateVmSpec.owner_address,Measurement.sev_info+launch_measure(newSevInfomessage),BackupInfo.checksum/volumes/source_sizes,StartBackup.include_volumes.reserve_resourcescarries a message-freeReservationRequestDTO — theagent translates the Aleph message; no
aleph_messagetype crosses the wire(verified: zero in
proto_convert/grpc_server/grpc_client/types/abc).Pool gained a shared
check_capacity(unifying the message/spec/reserve paths)and a
reserve_gpuskeyed ondevice_id; orphaned message-basedpool.reserve_resourcesremoved.ExportVm/ImportVm/GetMigrationStatus).Part B — two-service packaging
aleph-vm-supervisor.servicebecomes the daemon (python -m aleph.vm.supervisor,owns the pool, serves gRPC); new
aleph-vm-agent.serviceruns the HTTPorchestrator and dials the socket.
postinstmigrates the old single unit(stop → daemon → wait for socket → agent); running VMs survive in
controller@units and the daemon reattaches.supervisor.envsetsALEPH_VM_SUPERVISOR_GRPC_SOCKET, so production isgRPC-split while the code default stays
None(dev/tests embedded).Part C — confinement + validation
build_supervisorfactory guard (embedded when socket unset, gRPC when set).HealthInfo(OK)through a realGrpcSupervisor.Testing
910 passed (only the pre-existing root/env failures:
test_execution,test_interfaces); ruff/isort clean; mypy clean onsupervisor/+pool.py;_pb/regenerates with no diff.Known follow-ups (out of Phase 2 scope)
pool.pystill importsaleph_messagefor the legacy internal create path, sothe daemon process pulls it transitively (the wire is message-free); fully
purging it is a later cleanup.
types.pyMigrationInfo/MigrationPhase/MigrationIdDTOs left in place.