Merged
Changes from 85 commits
88 commits
4565fbf
docs: design for gRPC-only supervisor (full process split)
odesenfans Jun 16, 2026
6b88ba5
docs: Phase 1 implementation plan for gRPC-only supervisor
odesenfans Jun 16, 2026
ec08964
Rename InProcessSupervisor to LocalSupervisor (module local.py)
odesenfans Jun 16, 2026
a44eb3a
Update production references to LocalSupervisor
odesenfans Jun 16, 2026
f93f5fe
Update supervisor tests to LocalSupervisor
odesenfans Jun 16, 2026
6196c34
Extract build_supervisor factory for explicit agent wiring
odesenfans Jun 16, 2026
d7871fe
Remove throwaway rename smoke test
odesenfans Jun 16, 2026
cac0b67
Add recreate_network to the Supervisor interface
odesenfans Jun 16, 2026
2f30a7d
Implement recreate_network in LocalSupervisor
odesenfans Jun 16, 2026
6221648
Route recreate_network endpoint through the supervisor
odesenfans Jun 16, 2026
5c84acd
feat(haproxy): add agent-side domain-mapping sync driven by supervisor
odesenfans Jun 16, 2026
1162539
refactor(haproxy): route domain-mapping callers through agent sync
odesenfans Jun 16, 2026
b9e027e
refactor(haproxy): remove pool HAProxy coupling, agent owns domain sync
odesenfans Jun 16, 2026
31c473f
docs: record resolved contract decisions (backup, migration, confiden…
odesenfans Jun 16, 2026
4e27fd3
feat(supervisor): extend Measurement with SEV info and launch measure
odesenfans Jun 16, 2026
46bd9eb
feat(supervisor): implement confidential ops in LocalSupervisor
odesenfans Jun 16, 2026
5ca81c6
feat(agent): delegate confidential endpoints to the supervisor
odesenfans Jun 16, 2026
a7663a8
Add check_spec_admission for the spec create path
odesenfans Jun 16, 2026
9c3c625
Fold capacity admission into create_vm_from_spec atomically
odesenfans Jun 16, 2026
ffdb435
Remove agent-side admission in notify_allocation; surface boundary 503
odesenfans Jun 16, 2026
69d4128
Route reserve_resources through the Supervisor interface
odesenfans Jun 16, 2026
8e27ff7
fix(supervisor): derive measurement tee_backend from the VM config, n…
odesenfans Jun 16, 2026
28d16d8
Enrich BackupOps engine: include_volumes, metadata, restore_from_image
odesenfans Jun 16, 2026
dbfc777
Route backup/restore endpoints through the supervisor
odesenfans Jun 16, 2026
25d5c70
Update restore-rejects-invalid-image test for the supervisor path
odesenfans Jun 16, 2026
2e97558
test: stage volume_ref restore under tmp_path
odesenfans Jun 16, 2026
84878a0
Add P2P-migration disk/VM seam to the Supervisor interface
odesenfans Jun 16, 2026
886fdd1
Route the P2P migration endpoints through the supervisor
odesenfans Jun 16, 2026
5ea7eff
Delete the directory-based migration flow
odesenfans Jun 16, 2026
5da0804
Route migration import through the standard create_vm RPC
odesenfans Jun 16, 2026
d2b5a82
Route migration cleanup through the standard delete_vm RPC
odesenfans Jun 16, 2026
db07ed2
docs: revise migration design to lifecycle RPCs (delete directory-bas…
odesenfans Jun 16, 2026
20c3cd5
Add run_program_code to the Supervisor interface
odesenfans Jun 16, 2026
b9cb91a
Emit persistent program specs from build_program_create_vm_spec
odesenfans Jun 16, 2026
9c02744
Boot persistent programs through the Firecracker spec create path
odesenfans Jun 16, 2026
099f3e4
Route persistent programs through the supervisor, drop the legacy paths
odesenfans Jun 16, 2026
12e9a93
Final pool removal: the agent reaches VMs only through the supervisor
odesenfans Jun 16, 2026
c0fa925
feat(supervisor): carry GPU requests on CreateVmSpec
odesenfans Jun 16, 2026
2487e87
feat(supervisor): resolve and reserve GPUs on the spec create path
odesenfans Jun 16, 2026
23552e8
feat(orchestrator): route GPU instances through the spec create path
odesenfans Jun 16, 2026
3c9bea7
feat(supervisor): populate spec.tee for confidential instances in tra…
odesenfans Jun 16, 2026
15c6179
feat(supervisor): build confidential launch on the spec create path
odesenfans Jun 16, 2026
82494f0
feat(orchestrator): route confidential instances through the spec path
odesenfans Jun 16, 2026
8c46d72
test(supervisor): sort imports in confidential spec pool create tests
odesenfans Jun 16, 2026
43a9d34
feat(conf): default instances to QEMU hypervisor
odesenfans Jun 16, 2026
abc6c4d
refactor: drop the Firecracker-instance concept
odesenfans Jun 16, 2026
b340c22
feat(supervisor): consume owner GPU reservation engine-side
odesenfans Jun 16, 2026
e5fcb43
refactor(agent): delete create_a_vm pool bridge from run.py
odesenfans Jun 16, 2026
b212ae0
test(agent): tighten pool-free guard to forbid all pool tokens
odesenfans Jun 16, 2026
5096912
style: apply ruff format and isort across Phase 1 changes
odesenfans Jun 16, 2026
acbd7fd
refactor(migration): fold stop_vm_for_export into the standard stop_vm
odesenfans Jun 17, 2026
3e8ae66
fix(confidential): apply the requested SEV policy on both create paths
odesenfans Jun 17, 2026
dae1251
fix(supervisor): resolve mypy errors (grpc_server Path import, test a…
odesenfans Jun 17, 2026
1d1cd45
fix(pool): admit spec programs against the program memory bucket
odesenfans Jun 17, 2026
a919c3e
docs: Phase 2 implementation plan for gRPC-only supervisor
odesenfans Jun 17, 2026
1123f39
feat(supervisor): carry tee firmware, gpu request, owner, include_vol…
odesenfans Jun 17, 2026
aea881a
fix(supervisor): refresh stale Phase-1 wire docstrings; cover new fie…
odesenfans Jun 17, 2026
8ff3baa
feat(supervisor): carry SEV info and launch measure over the wire
odesenfans Jun 17, 2026
58e8f3d
feat(supervisor): carry backup archive metadata over the wire
odesenfans Jun 17, 2026
1fec45a
feat(supervisor): reconcile HostInfo hardware and reservation fields …
odesenfans Jun 17, 2026
7fc69a4
feat(supervisor): wire recreate_network over gRPC
odesenfans Jun 17, 2026
854602b
test(supervisor): document the mock gRPC fixture and simplify the ABC…
odesenfans Jun 17, 2026
26a506f
feat(supervisor): wire restore_from_image over gRPC
odesenfans Jun 17, 2026
8196305
feat(supervisor): wire run_program_code over gRPC (msgpack scope)
odesenfans Jun 17, 2026
c85b0d6
fix(supervisor): scale run_program_code gRPC deadline with the reques…
odesenfans Jun 17, 2026
7f3ee0c
feat(supervisor): reserve_resources takes a message-free resources DT…
odesenfans Jun 17, 2026
a76afe8
refactor(pool): unify capacity admission, drop orphaned reserve_resou…
odesenfans Jun 17, 2026
e365abc
refactor(supervisor): drop the orphan directory-based migration RPCs
odesenfans Jun 17, 2026
ecf7535
test(supervisor): guard that the gRPC surface is complete
odesenfans Jun 17, 2026
35a35ce
feat(packaging): split supervisor daemon and agent into two systemd u…
odesenfans Jun 17, 2026
aad4f41
test(supervisor): pin embedded-by-default, gRPC-when-socket-set wiring
odesenfans Jun 17, 2026
e5a0efa
refactor(supervisor): drop now-unused migration DTOs from types
odesenfans Jun 17, 2026
db59389
refactor(pool): remove the dead legacy message create path
odesenfans Jun 17, 2026
f2d8f19
docs: refresh comments that named the removed create_a_vm/check_admis…
odesenfans Jun 17, 2026
bd3a2e6
fix(supervisor): preserve confidential inject_secret success body
odesenfans Jun 17, 2026
a188990
fix(migration): resolve import hypervisor like the create path
odesenfans Jun 17, 2026
89543ec
fix(migration): set up port forwards on the destination after import
odesenfans Jun 17, 2026
bb77857
Merge remote-tracking branch 'origin/od/grpc-only-supervisor' into od…
odesenfans Jun 18, 2026
b13a003
ci: trigger CI for #981 (base now dev; carries #980 fixes)
odesenfans Jun 18, 2026
f151efd
Merge dev (Phase 1 #980 squashed) into Phase 2
odesenfans Jun 18, 2026
b10c5d4
test(supervisor): declare FakePool reservation attrs for mypy
odesenfans Jun 18, 2026
91ca399
ci(deb): also dump aleph-vm-agent journal on droplet-test failure
odesenfans Jun 18, 2026
7233a56
ci(deb): dump aleph-vm-agent journal in the post-test log export
odesenfans Jun 18, 2026
8f5e1e3
fix(pool): give SpecProgramResources a get_disk_usage_delta
odesenfans Jun 18, 2026
c6ec16c
ci(deb): set SUPERVISOR_GRPC_SOCKET in the droplet supervisor.env
odesenfans Jun 18, 2026
e3b87a8
fix(confidential): wait for the VM record on init-session instead of 404
odesenfans Jun 18, 2026
114279a
fix(confidential): don't reject init-session on a VM awaiting init
odesenfans Jun 18, 2026
3f476a9
fix(confidential): set up port forwards after secret injection
odesenfans Jun 18, 2026
11 changes: 11 additions & 0 deletions .github/workflows/build-deb-package-and-integration-tests.yml
@@ -211,6 +211,11 @@ jobs:
# picks up DigitalOcean's VPC-internal resolver (e.g. 10.110.15.254),
# which is unreachable from inside the VMs and breaks all guest DNS.
echo 'ALEPH_VM_DNS_NAMESERVERS=["1.1.1.1","8.8.8.8"]' >> supervisor.env
# This file replaces the packaged supervisor.env (kept via --force-confold),
# so it must carry the two-service socket the package ships; otherwise the
# agent runs embedded alongside the daemon (two pools fighting over the VM
# tap -> "File descriptor in bad state"). Matches packaging/aleph-vm/etc/aleph-vm/supervisor.env.
echo ALEPH_VM_SUPERVISOR_GRPC_SOCKET=/var/lib/aleph/vm/supervisor.sock >> supervisor.env
ssh root@${DROPLET_IPV4} mkdir -p /etc/aleph-vm/
scp supervisor.env root@${DROPLET_IPV4}:/etc/aleph-vm/supervisor.env
scp packaging/target/${{ matrix.os_config.package_name }} root@${DROPLET_IPV4}:/opt
@@ -241,6 +246,10 @@ jobs:
run: |
ssh root@${DROPLET_IPV4} "systemctl status aleph-vm-supervisor --no-pager" || true
ssh root@${DROPLET_IPV4} "journalctl -u aleph-vm-supervisor -n 100 --no-pager" || true
# Two-service split: the HTTP API (port 4020, /control/*) is served by
# the agent, so capture its status and journal too.
ssh root@${DROPLET_IPV4} "systemctl status aleph-vm-agent --no-pager" || true
ssh root@${DROPLET_IPV4} "journalctl -u aleph-vm-agent -n 200 --no-pager" || true

- name: "Test runtime: Debian 12, SDK 0.9.0"
run: ./.github/scripts/test_runtime_on_droplet.sh "${DROPLET_IPV4}" "63faf8b5db1cf8d965e6a464a0cb8062af8e7df131729e48738342d956f29ace"
@@ -262,6 +271,8 @@ jobs:
if: ${{ !cancelled() && steps.system-booted.outcome == 'success'}}
run: |
ssh root@${DROPLET_IPV4} "journalctl -u aleph-vm-supervisor"
# Two-service split: /control/* (port 4020) is served by the agent.
ssh root@${DROPLET_IPV4} "journalctl -u aleph-vm-agent" || true

- name: Cleanup
if: always()
1,518 changes: 1,518 additions & 0 deletions docs/plans/2026-06-17-grpc-only-supervisor-phase2-implementation.md

Large diffs are not rendered by default.

8 changes: 7 additions & 1 deletion packaging/aleph-vm/DEBIAN/postinst
@@ -51,6 +51,12 @@ fi
# Systemd is absent from containers
if ! [[ -v container ]]; then
systemctl daemon-reload
-systemctl enable aleph-vm-supervisor.service
+systemctl enable aleph-vm-supervisor.service aleph-vm-agent.service
 systemctl restart aleph-vm-supervisor.service
+# Wait for the daemon socket before the agent dials it (split mode).
+for _ in $(seq 1 30); do
+[ -S /var/lib/aleph/vm/supervisor.sock ] && break
+sleep 1
+done
+systemctl restart aleph-vm-agent.service
fi
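The wait loop in postinst polls for the daemon socket before restarting the agent. A rough Python equivalent is sketched below for readers who want the same check outside the packaging scripts. `wait_for_unix_socket` and its parameters are hypothetical names, not part of aleph-vm; the defaults mirror the 30 attempts and 1-second sleep of the shell loop.

```python
import os
import stat
import time


def wait_for_unix_socket(path: str, attempts: int = 30, delay_secs: float = 1.0) -> bool:
    """Poll until `path` exists and is a Unix socket (like `[ -S path ]` in a loop).

    Returns True as soon as the socket is seen, False if every attempt fails.
    """
    for _ in range(attempts):
        try:
            # os.stat follows symlinks, matching the behaviour of `[ -S ... ]`.
            if stat.S_ISSOCK(os.stat(path).st_mode):
                return True
        except OSError:
            pass  # not there yet (or not readable): keep polling
        time.sleep(delay_secs)
    return False
```

As in the shell version, a timeout is not fatal by itself: postinst restarts the agent whether or not the socket appeared, so a caller of this sketch would have to decide separately how to handle a `False` result.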
10 changes: 5 additions & 5 deletions packaging/aleph-vm/DEBIAN/preinst
@@ -5,11 +5,11 @@ set -uf -o pipefail

# Systemd is absent from containers
if ! [[ -v container ]]; then
-# Stop the service during an upgrade.
-# The service does not exist during a new install and will fail, this is okay
-systemctl stop aleph-vm-supervisor.service
+# Stop the services during an upgrade.
+# The services do not exist during a new install and will fail, this is okay.
+# Stop the agent first, then the supervisor daemon, mirroring the dependency order.
+systemctl stop aleph-vm-agent.service 2>/dev/null || true
+systemctl stop aleph-vm-supervisor.service 2>/dev/null || true
fi

set -e


8 changes: 5 additions & 3 deletions packaging/aleph-vm/DEBIAN/prerm
@@ -1,6 +1,8 @@
#!/bin/bash
set -euf -o pipefail

-systemctl disable aleph-vm-supervisor.service || true
-systemctl stop aleph-vm-supervisor.service || true
-systemctl reset-failed aleph-vm-supervisor.service 2>/dev/null || true
+for unit in aleph-vm-agent.service aleph-vm-supervisor.service; do
+systemctl disable "$unit" || true
+systemctl stop "$unit" || true
+systemctl reset-failed "$unit" 2>/dev/null || true
+done
3 changes: 3 additions & 0 deletions packaging/aleph-vm/etc/aleph-vm/supervisor.env
@@ -2,3 +2,6 @@
ALEPH_VM_PRINT_SYSTEM_LOGS=False
ALEPH_VM_DOMAIN_NAME=vm.example.org
ALEPH_VM_PAYMENT_RECEIVER_ADDRESS=
# Two-process split: the agent talks to the supervisor daemon over this socket.
# Both units read this file; the daemon binds it, the agent dials it.
ALEPH_VM_SUPERVISOR_GRPC_SOCKET=/var/lib/aleph/vm/supervisor.sock
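Both units read this file, and the commit list describes the wiring as "embedded-by-default, gRPC-when-socket-set". The selection this implies can be sketched as follows. This is only an illustration: `supervisor_mode` is a hypothetical helper, the real project presumably reads the value through its own settings layer, and treating an empty value as unset is an assumption.

```python
from __future__ import annotations

from typing import Mapping

SOCKET_ENV_VAR = "ALEPH_VM_SUPERVISOR_GRPC_SOCKET"


def supervisor_mode(environ: Mapping[str, str]) -> tuple[str, str | None]:
    """Return ("grpc", socket_path) when the socket variable is set, else ("embedded", None)."""
    # Assumption: a blank value behaves like an unset variable.
    socket_path = environ.get(SOCKET_ENV_VAR, "").strip()
    if socket_path:
        return ("grpc", socket_path)
    return ("embedded", None)
```

With the packaged supervisor.env, `supervisor_mode(os.environ)` would select the gRPC path. If a replacement supervisor.env omits the variable, the agent falls back to embedded mode, which is the situation the CI workflow comment warns about.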
20 changes: 20 additions & 0 deletions packaging/aleph-vm/etc/systemd/system/aleph-vm-agent.service
@@ -0,0 +1,20 @@
[Unit]
Description=Aleph.im VM agent (CRN HTTP API)
After=network.target aleph-vm-supervisor.service
Wants=aleph-vm-supervisor.service

[Service]
User=0
Group=0
WorkingDirectory=/opt/aleph-vm
Environment=PYTHONPATH=/opt/aleph-vm/:$PYTHONPATH
Environment=PYTHONDONTWRITEBYTECODE="enabled"
EnvironmentFile=/etc/aleph-vm/supervisor.env
ExecStart=python3 -m aleph.vm.orchestrator --print-settings
Restart=on-failure
RestartPreventExitStatus=0 130 143
RestartSec=10s
TimeoutStopSec=30s

[Install]
WantedBy=multi-user.target
@@ -1,5 +1,5 @@
[Unit]
-Description=Aleph.im VM execution engine
+Description=Aleph.im VM supervisor daemon (pool owner, gRPC)
After=network.target
After=docker.service
Wants=docker.service
@@ -11,11 +11,8 @@ WorkingDirectory=/opt/aleph-vm
Environment=PYTHONPATH=/opt/aleph-vm/:$PYTHONPATH
Environment=PYTHONDONTWRITEBYTECODE="enabled"
EnvironmentFile=/etc/aleph-vm/supervisor.env
-ExecStart=python3 -m aleph.vm.orchestrator --print-settings
+ExecStart=python3 -m aleph.vm.supervisor
 Restart=on-failure
-# Numeric exit codes for portability — signal names need to be without "SIG"
-# prefix in systemd, and older systemd versions only accept numeric codes.
-# 130 = killed by SIGINT (128+2), 143 = killed by SIGTERM (128+15).
RestartPreventExitStatus=0 130 143
RestartSec=10s
TimeoutStopSec=30s
118 changes: 70 additions & 48 deletions proto/supervisor.proto
@@ -45,6 +45,8 @@ service Supervisor {
rpc StartVm(StartVmRequest) returns (VmInfo);
rpc RebootVm(RebootVmRequest) returns (VmInfo);
rpc ReinstallVm(ReinstallVmRequest) returns (VmInfo);
rpc RunProgramCode(RunProgramCodeRequest) returns (RunProgramCodeResponse);
rpc RestoreFromImage(RestoreFromImageRequest) returns (VmInfo);

// ── Port forwarding ──
rpc AddPortForward(AddPortForwardRequest) returns (PortForwardInfo);
@@ -72,15 +74,16 @@ service Supervisor {
rpc DeleteBackup(DeleteBackupRequest) returns (DeleteBackupResponse);
rpc RestoreBackup(RestoreBackupRequest) returns (VmInfo);

-// ── Migration ──
-rpc ExportVm(ExportVmRequest) returns (MigrationInfo);
-rpc ImportVm(ImportVmRequest) returns (VmInfo);
-rpc GetMigrationStatus(GetMigrationStatusRequest) returns (MigrationInfo);
-
 // ── Confidential ──
 rpc InitializeConfidential(InitializeConfidentialRequest) returns (InitializeConfidentialResponse);
 rpc GetMeasurement(GetMeasurementRequest) returns (Measurement);
 rpc InjectSecret(InjectSecretRequest) returns (InjectSecretResponse);
+
+// ── Network ──
+rpc RecreateNetwork(RecreateNetworkRequest) returns (RecreateNetworkResponse);
+
+// ── Reservation ──
+rpc ReserveResources(ReserveResourcesRequest) returns (ReserveResourcesResponse);
}

// ── Host ─────────────────────────────────────────────────────────────────
@@ -129,6 +132,11 @@ message HostInfo {

// Networking
string host_ipv4 = 17; // primary external IPv4 of the host; empty when host networking is disabled

// Reservation-aware figures the agent's /about endpoints surface
uint64 available_disk_bytes = 18;
string gpu_inventory_json = 19; // list[dict] as JSON; rich agent GPU inventory
string available_gpus_json = 20; // list[dict] as JSON
}

message NumaNode {
@@ -201,6 +209,7 @@ message CreateVmRequest {
// signal. What is spoken over the channel is the client's business, opaque
// to the supervisor.
optional GuestChannel guest_channel = 14;
string owner_address = 16; // VM owner's Aleph address; engine consumes this owner's GPU reservation
}

message GuestChannel {
@@ -245,6 +254,7 @@ message TeeConfig {
TeeBackend backend = 1; // attestation backend (orthogonal to the top-level Backend enum, which selects the VMM)
string policy = 2; // empty = default
string session_dir = 3; // confidential session files
string firmware_path = 4; // resolved OVMF blob path (empty = none)
}

message NetworkConfig {
@@ -254,8 +264,10 @@ message NetworkConfig {
}

message GpuConfig {
-string pci_host = 1;
+string pci_host = 1; // RESOLVED concrete address; empty in a request
 bool supports_x_vga = 2;
+string device_id = 3; // REQUEST: vendor:device, e.g. "10de:2504"
+string model = 4; // REQUEST: human label
}

// The confidential-computing mode a VM is actually running under. Precise by
@@ -354,6 +366,21 @@ message ReinstallVmRequest {
// field is treated as true.
}

message RestoreFromImageRequest {
string vm_id = 1;
string image_path = 2; // host path to a staged QCOW2 image
uint64 max_virtual_size_bytes = 3; // 0 = no cap
}

message RunProgramCodeRequest {
string vm_id = 1;
bytes scope_msgpack = 2; // ASGI scope dict, msgpack-encoded
double timeout_secs = 3;
}
message RunProgramCodeResponse {
bytes reply = 1; // raw runtime reply, opaque to the supervisor
}

// ── Events ───────────────────────────────────────────────────────────────

message WatchEventsRequest {}
@@ -451,11 +478,15 @@ message BackupInfo {
uint64 size_bytes = 4; // 0 until COMPLETE
uint64 created_at_unix_secs = 5;
string error_message = 6; // populated when status = FAILED
string checksum = 7; // archive checksum, populated when COMPLETE
repeated string volumes = 8; // archived volume names
map<string, uint64> source_sizes = 9; // per-volume uncompressed source size
}

message StartBackupRequest {
string vm_id = 1;
bool quiesce_guest = 2; // request guest fs-freeze if supported
bool include_volumes = 3; // also archive non-read-only persistent volumes
}

message GetBackupStatusRequest { string vm_id = 1; string backup_id = 2; }
@@ -474,48 +505,6 @@ message DeleteBackupResponse {}

message RestoreBackupRequest { string vm_id = 1; string backup_id = 2; }

-// ── Migration ────────────────────────────────────────────────────────────
-
-enum MigrationPhase {
-MIGRATION_PHASE_UNSPECIFIED = 0;
-MIGRATION_PHASE_PREPARING = 1;
-MIGRATION_PHASE_EXPORTING = 2;
-MIGRATION_PHASE_IMPORTING = 3;
-MIGRATION_PHASE_COMPLETE = 4;
-MIGRATION_PHASE_FAILED = 5;
-}
-
-message MigrationInfo {
-string vm_id = 1;
-string migration_id = 2;
-MigrationPhase phase = 3;
-uint64 bytes_transferred = 4;
-uint64 bytes_total = 5;
-string error_message = 6;
-}
-
-// NOTE (Plan 0.A): the directory-based shape below is provisional.
-// aleph-vm's current migration is HTTP-based (the source exposes
-// /control/machine/{ref}/migration/disk/... and the destination
-// fetches). The contract needs reshaping for host-to-host transport
-// before Plan 0.C wires real implementations. See design doc §9 open
-// questions.
-// Phase 1 removed the directory-based migration; drop these RPCs in the Phase 2 proto pass.
-message ExportVmRequest {
-string vm_id = 1;
-string destination_dir = 2; // PROVISIONAL: local path on the host
-}
-
-message ImportVmRequest {
-string vm_id = 1; // id the new VM will be assigned post-import
-string source_dir = 2; // PROVISIONAL: see note on ExportVmRequest
-}
-
-message GetMigrationStatusRequest {
-string vm_id = 1;
-string migration_id = 2;
-}
-
// ── Confidential ─────────────────────────────────────────────────────────

message InitializeConfidentialRequest {
@@ -528,10 +517,22 @@ message InitializeConfidentialResponse {}

message GetMeasurementRequest { string vm_id = 1; }

message SevInfo {
bool enabled = 1;
uint32 api_major = 2;
uint32 api_minor = 3;
uint32 build_id = 4;
uint32 policy = 5;
string state = 6;
uint32 handle = 7;
}

message Measurement {
string vm_id = 1;
bytes measurement_bytes = 2; // attestation report / SEV launch measure
TeeBackend tee_backend = 3;
SevInfo sev_info = 4; // present for SEV launches
string launch_measure = 5; // base64 launch measurement
}

message InjectSecretRequest {
@@ -542,6 +543,27 @@ message InjectSecretRequest {

message InjectSecretResponse {}

// ── Network ──────────────────────────────────────────────────────────────

message RecreateNetworkRequest {}
message RecreateNetworkResponse {
string summary_json = 1; // JSON-encoded summary dict from the engine
}

// ── Reservation ──────────────────────────────────────────────────────────

message ReserveResourcesRequest {
string user_address = 1;
uint32 vcpus = 2;
uint64 memory_mib = 3;
uint64 disk_mib = 4;
bool is_instance = 5;
repeated GpuConfig gpus = 6; // request: matched by device_id
}
message ReserveResourcesResponse {
int64 expiry_unix_ns = 1; // reservation expiry, unix ns UTC
}

// ── Wire error vocabulary ────────────────────────────────────────────────
//
// Closed enum of errors the supervisor can return. Server side: map
11 changes: 11 additions & 0 deletions src/aleph/vm/controllers/firecracker/spec_program.py
@@ -62,6 +62,17 @@ def from_spec(cls, spec: CreateVmSpec) -> SpecProgramResources:
extra_disks=[disk for disk in spec.disks if disk.role is DiskRole.EXTRA],
)

def get_disk_usage_delta(self) -> int:
"""Disk reserved-but-unused by this execution, summed by the pool's
capacity admission (calculate_available_disk) across all executions.

Spec disk admission is deferred: DiskSpec carries no size yet (the
program rootfs is a read-only squashfs and extra disks declare no size),
so a spec program reserves nothing. Mirrors check_spec_admission passing
disk_mib=0. Returning 0 keeps SpecProgramResources usable in the shared
admission path that the other resource types implement."""
return 0

def to_dict(self):
return self.__dict__

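The docstring of get_disk_usage_delta says the pool's capacity admission sums this value across all executions. The pool code itself is not part of this diff, so the sketch below only illustrates the idea: every class and function name other than `get_disk_usage_delta` is hypothetical, the unit is whatever the pool uses, and clamping the result at zero is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, Protocol


class HasDiskUsageDelta(Protocol):
    def get_disk_usage_delta(self) -> int: ...


@dataclass
class SpecProgramStub:
    """Stand-in for SpecProgramResources: a spec program reserves nothing."""

    def get_disk_usage_delta(self) -> int:
        return 0


@dataclass
class InstanceStub:
    """Hypothetical resource type that reserved more disk than it currently uses."""

    reserved: int
    used: int

    def get_disk_usage_delta(self) -> int:
        return max(self.reserved - self.used, 0)


def available_disk(free: int, executions: Iterable[HasDiskUsageDelta]) -> int:
    """Free disk minus everything that is reserved but not yet written."""
    reserved_unused = sum(e.get_disk_usage_delta() for e in executions)
    return max(free - reserved_unused, 0)
```

Because the spec program returns 0, it can sit in the same list as the other resource types without affecting the sum, which is the property the fix in this file provides.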
2 changes: 1 addition & 1 deletion src/aleph/vm/models.py
@@ -152,7 +152,7 @@ async def fetch_port_redirect_config_and_setup(self):
"""Fetch the user's port-forwarding aggregate and apply updates.

Persisted-mapping reload is the creator's job
-(pool.create_a_vm / create_vm_from_spec / restart_persistent_vm).
+(pool.create_vm_from_spec / restart_persistent_vm).
"""
if not self.is_instance:
return
14 changes: 9 additions & 5 deletions src/aleph/vm/orchestrator/views/__init__.py
@@ -79,6 +79,7 @@
SupervisorError,
VmNotFoundError,
)
from aleph.vm.supervisor.translate import build_reservation_request
from aleph.vm.supervisor.types import (
ConfidentialMode,
PortForwardInfo,
@@ -831,8 +832,8 @@ async def notify_allocation(request: web.Request):
update_watcher = request.app["update_watcher"]

# Capacity admission is no longer checked here: the engine enforces it
-# atomically inside the create path (pool.check_admission for the message
-# path, pool.check_spec_admission for the spec path), raising the typed
+# atomically inside the create path (pool.check_spec_admission, on the
+# shared pool.check_capacity core), raising the typed
# InsufficientResourcesError. The vm_creation_exceptions / 503 path below
# surfaces that error to the caller.

@@ -1007,10 +1008,13 @@ async def operate_reserve_resources(request: web.Request, authenticated_sender:
except ValidationError as error:
return web.json_response(data=error.json(), status=web.HTTPBadRequest.status_code)

-# The supervisor runs capacity admission (keeping the dry-run honest) then
-# holds the requested resources, returning the reservation expiry.
+# The agent translates the message into a message-free resources DTO; the
+# supervisor runs capacity admission (keeping the dry-run honest) then holds
+# the requested resources, returning the reservation expiry. No Aleph message
+# crosses the supervisor boundary.
+reservation_request = build_reservation_request(message, authenticated_sender)
 try:
-expiration_date = await supervisor.reserve_resources(message, authenticated_sender)
+expiration_date = await supervisor.reserve_resources(reservation_request)
except BoundaryInsufficientResourcesError as error:
logger.warning("Refusing resource reservation: %s", error)
return web.HTTPServiceUnavailable(
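The reserve-resources change replaces the Aleph message with a message-free resources DTO before the call crosses the supervisor boundary. The body of `build_reservation_request` is not shown in this diff, so the sketch below is only a guess at its shape. The field names are taken from ReserveResourcesRequest in proto/supervisor.proto; the class names, the `to_reservation_request` helper, the dict layout standing in for the message, and the mapping of the authenticated sender to `user_address` are all assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class GpuRequest:
    device_id: str  # vendor:device, e.g. "10de:2504"
    model: str = ""  # human label


@dataclass(frozen=True)
class ReservationRequest:
    """Plain data only: nothing here depends on Aleph message types."""

    user_address: str
    vcpus: int
    memory_mib: int
    disk_mib: int
    is_instance: bool
    gpus: tuple[GpuRequest, ...] = ()


def to_reservation_request(message: Mapping[str, Any], sender: str) -> ReservationRequest:
    # Hypothetical message layout: a flat mapping with the keys read below.
    gpus = tuple(
        GpuRequest(device_id=g["device_id"], model=g.get("model", ""))
        for g in message.get("gpus", [])
    )
    return ReservationRequest(
        user_address=sender,
        vcpus=int(message["vcpus"]),
        memory_mib=int(message["memory_mib"]),
        disk_mib=int(message["disk_mib"]),
        is_instance=bool(message["is_instance"]),
        gpus=gpus,
    )
```

Keeping the DTO frozen and free of message types means the supervisor side, embedded or behind gRPC, can be tested with hand-built requests and never needs the agent's message-parsing dependencies.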