Commit 3dbbc76

extract supervisor interface layer, enforce agent/supervisor import boundary (#986)
1 parent 08af7b4 commit 3dbbc76

107 files changed

Lines changed: 1277 additions & 387 deletions

.github/workflows/test-using-pytest.yml

Lines changed: 8 additions & 0 deletions
@@ -49,6 +49,14 @@ jobs:
         run: |
           hatch run linting:typing
 
+      - name: Check the agent/supervisor/contract import boundary
+        # Runs in the testing env (import-linter needs the project importable).
+        # sudo so it shares the root testing env with the proto-check / unit-test
+        # steps below, matching their comment about env races.
+        run: |
+          sudo python3 -m pip install --upgrade --ignore-installed hatch hatch-vcs coverage "virtualenv<21"
+          sudo hatch run testing:imports
+
       - name: Download and build required files for running tests. Copied from packaging/Makefile.
         run: |
           sudo useradd jailman

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -31,3 +31,4 @@ node_modules
 /kernels/linux-*.tar
 /kernels/linux-*.tar.sign
 /.worktrees/
+.import_linter_cache/

docs/plans/2026-06-19-agent-supervisor-boundary-design.md

Lines changed: 291 additions & 0 deletions
Large diffs are not rendered by default.
Lines changed: 139 additions & 0 deletions
@@ -0,0 +1,139 @@
# The code move (PR-3: physical relocation, behavior-neutral)

**Status:** Design / approved scope
**Date:** 2026-06-19
**Owner:** Olivier Desenfans
**Repo:** `aleph-im/aleph-vm`
**Parent design:** `docs/plans/2026-05-28-aleph-vm-architecture-backport-design.md`
**Predecessors:** `2026-06-19-agent-supervisor-boundary-design.md` (PR-1),
`2026-06-19-controller-split-by-concern-design.md` (PR-2)

## 1. Context

After PR-1 (contract layer + enforced boundary) and PR-2 (controllers no longer
carry an Aleph-aware personality, views catch only `SupervisorError`), the code
is in the right *layers* but not yet under the right *names and directories*.
The agent still lives in a package called `orchestrator/`; the supervisor-owned
running controller still lives in a neutral-looking top-level `controllers/`.

PR-3 performs the physical move so the directory tree matches the architecture.
It is purely mechanical: renames, file moves, and import updates. No logic
changes, so it is gated on "moves only + green suite", like PR-1.

This is safe to do last precisely because PR-1 and PR-2 already separated the
concerns; moving files around now cannot smuggle in a coupling, because the
import-linter would reject it.

## 2. Moves

### 2.1 `orchestrator/` -> `agent/`

The agent package gets its real name. This is the bulk of the diff (every
internal import of `aleph.vm.orchestrator...` updates), but it is mechanical.

Packaging follow-through (service *names* are unchanged, only module targets):

- `pyproject.toml`: `scripts.aleph-vm = "aleph.vm.agent.cli:main"`.
- `packaging/.../aleph-vm-agent.service`: `ExecStart=python3 -m aleph.vm.agent
  --print-settings`.
- `aleph/vm/agent/__main__.py` (moved) keeps the same CLI surface.

### 2.2 Agent-side downloader -> `agent/`

The message->spec downloader extracted in PR-2 moves into `agent/` next to
`translate.py` / `qemu_build.py` (which PR-1 already placed agent-side).

### 2.3 `controllers/` -> `supervisor/controllers/`

After PR-2, `controllers/` holds only the running controller, the runtime
resource holder (`from_spec`), and management (`QemuVmClient`, `backup`,
`snapshots`). It is the supervisor's execution worker.

**Decision (2026-06-19): move it to `supervisor/controllers/`.** The controller
is spawned by the supervisor (via its `SystemDManager`) and is logically the
supervisor's execution worker; nesting reflects that, keeps the supervisor
subtree together, and lines up with parent-design Phase 2, where the Python
hypervisor and its controllers are swapped for Rust as one unit. Use `git mv` so
rename detection keeps the diff reviewable.

Entry-point follow-through:

- `aleph-vm-controller@.service`: `ExecStart=/usr/bin/python3 -m
  aleph.vm.supervisor.controllers --config=/var/lib/aleph/vm/%i-controller.json`.
  The service *name* and the `%i-controller.json` config path are unchanged, so
  operators and the on-disk config contract are unaffected; only the module
  target moves. The unit file and the module ship in the same `.deb`, so the
  rename is internally consistent within any version (already-running controller
  instances keep running on their old invocation until their next start, which
  uses the new, consistent unit).
- `aleph/vm/supervisor/controllers/__main__.py` (moved) keeps the same
  `--config` CLI surface.

### 2.4 `configuration.py` (already in `contract/` from PR-1)

`controllers/configuration.py`, the on-disk config-file schema (agent
`qemu_build` writes it, the controller reads it, `local.py` removes it), is
**already moved to `contract/configuration.py` in PR-1**. It had to be: once
`controllers/` nests under `supervisor/` here, leaving the schema in
`supervisor/controllers/` would make agent-side `qemu_build` import
`supervisor/controllers`, a forbidden `agent -> supervisor` edge. PR-1 does the
move (its deps are clean: stdlib, pydantic, foundation), so PR-3 has nothing to
do for it beyond confirming the import paths are already `contract.*`.

## 3. Enforcement and cleanup

- Update the import-linter config to the final package names. The only remaining
  documented residual is `agent -> {pool, models}` (the `VmExecution`/`VmPool`
  cleave, a separate adjacent effort). Every other ignore entry from PR-1 is gone
  by now (PR-2 removed the controller ones).
- Optionally leave thin re-export shims at the old `aleph.vm.orchestrator.*`
  paths for one release if any out-of-repo tooling imports them; otherwise a hard
  rename. Recommend a hard rename inside the repo (update all call sites) and
  shims only if a concrete external importer is found.

## 4. Testing strategy

- Behavior-neutral: the existing suite is the oracle, run it green (known
  env-only failures aside).
- **Launch smoke test**: confirm all three entry points still import and start:
  `python -m aleph.vm.agent --print-settings`, `python -m aleph.vm.supervisor`,
  and the controller `python -m aleph.vm.supervisor.controllers --config=<sample>`.
- `mypy` baseline unchanged; import-linter passes under the final names.
- Grep for stale `aleph.vm.orchestrator` references across `packaging/`, docs,
  CI workflows, and tests; none should remain (or only intentional shims).
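
The stale-reference sweep in the last bullet could also be scripted so it can run in CI. The following is a minimal sketch, not part of this commit: the function name `find_stale_references`, the file-suffix list, and the regular expression are illustrative assumptions, and it treats both the dotted module form and the directory form of the old package name as stale.

```python
import re
from pathlib import Path

# Assumption: both "aleph.vm.orchestrator" and "aleph/vm/orchestrator" count as stale.
STALE = re.compile(r"aleph[./]vm[./]orchestrator\b")

# Assumption: these are the file kinds worth scanning; adjust to the repo.
SUFFIXES = (".py", ".toml", ".service", ".yml", ".yaml", ".md")
NAMES = ("Makefile",)


def find_stale_references(root: Path):
    """Return (relative_path, line_number, line) for each line naming the old package."""
    hits = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        if path.suffix not in SUFFIXES and path.name not in NAMES:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if STALE.search(line):
                hits.append((path.relative_to(root), number, line.strip()))
    return hits
```

An empty result means the sweep is clean; any intentional shim would need to be filtered out by the caller.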

## 5. Risks

| Risk | Mitigation |
| ---- | ---------- |
| A missed `orchestrator` reference in packaging/CI breaks a service at deploy | §4 grep sweep across packaging, systemd units, CI, and the Makefile; launch smoke test. |
| The rename collides badly with in-flight branches | Land PR-1/PR-2 first; do the rename in one commit; rebase dependents once, immediately. |
| External tooling imports `aleph.vm.orchestrator` | Search for real importers; add re-export shims only if found, with a deprecation note. |
| Reviewers cannot see logic vs move in a huge diff | Keep PR-3 strictly mechanical (no logic edits); use `git mv` so rename detection keeps the diff reviewable. |

## 6. Sequence recap and what remains after

The sequence:

0. **PR-0** (behavior-neutral): rename `vm_hash` -> `vm_id` on the supervisor-side
   objects (`VmExecution`, `VmPool`, `local.py`) and the agent call sites reading
   that attribute. See `2026-06-19-supervisor-vm-id-rename-design.md`.
1. **PR-1** (behavior-neutral): contract layer, misfiled agent code out of
   `supervisor/`, three back-references fixed, import-linter.
2. **PR-2** (behavior-affecting): split the `Resources` dual personality, finish
   the wire-error vocabulary, remove the two `controllers` residuals.
3. **PR-3** (behavior-neutral): `orchestrator/` -> `agent/`, `controllers/` ->
   `supervisor/controllers/`, final import-linter names. (`configuration.py` ->
   `contract/` already happened in PR-1.)

After PR-3 the only remaining cross-boundary coupling is `agent -> {pool,
models}`: the `VmExecution`/`VmPool` cleave from the parent design §4, which is a
separate effort and not part of this sequence. The cosmetic work of this sequence
is then done; the next architectural milestone is that cleave, after which the
Python hypervisor can be swapped for the Rust one (parent design Phase 2).

## 7. Next step

Implementation plan (writing-plans) per PR, authored when its predecessor lands
(PR-2's plan after PR-1 merges, PR-3's after PR-2 merges), since each plan's
exact import edits depend on the prior PR's final state.
Lines changed: 161 additions & 0 deletions
@@ -0,0 +1,161 @@
# Controller split by concern (PR-2: behavior-affecting)

**Status:** Design / approved scope
**Date:** 2026-06-19
**Owner:** Olivier Desenfans
**Repo:** `aleph-im/aleph-vm`
**Parent design:** `docs/plans/2026-05-28-aleph-vm-architecture-backport-design.md` (§4, A.6)
**Predecessor:** `docs/plans/2026-06-19-agent-supervisor-boundary-design.md` (PR-1)

## 1. Context

PR-1 establishes the `contract` layer and an import-enforced boundary, but it
leaves `controllers/` as a *shared* base layer because two pieces of it are used
on both sides. PR-1 records those as documented residuals:

- `orchestrator -> controllers` for the `Resources` classes, and
- `orchestrator -> controllers` for the controller exception types caught by the
  HTTP views.

PR-2 removes both residuals. It is the smaller half of what the parent design
called the §4 cleave, because the hardest piece (moving volume download out of
`controllers.setup()`) **already happened** during the spec work: `translate.py`
(agent) calls `resources.download_all()` and resolves every path before building
the `CreateVmSpec`; the running controller no longer downloads. What remains is
to stop the controller code from carrying a second, Aleph-aware personality, and
to finish the wire-error vocabulary so the views stop reaching into controller
internals.

PR-2 changes behavior (it splits a class and changes which exception types the
views catch), so it is gated on tests rather than on "moves only".

**Out of scope (still): the `VmExecution` / `VmPool` cleave.** PR-2 does not
split the god-objects. The controller split does not require it: download is
already agent-side and the spec path is in place. The `orchestrator ->
{pool, models}` residual from PR-1 stays until that separate, adjacent effort.

## 2. The two entanglements PR-2 removes

### 2.1 The `Resources` dual personality

`AlephQemuResources` (and the firecracker `AlephProgramResources` /
`AlephFirecrackerResources`, plus `AlephQemuConfidentialResources`) are
constructed two different ways:

- **Agent / download path**: `AlephQemuResources(message_content, namespace)`
  followed by `download_all()`. This reads the Aleph message, downloads runtime
  and volumes, creates writable qcow2 images, and feeds `translate.py`'s
  `CreateVmSpec`. It is Aleph-aware (holds `message_content`).
- **Supervisor / runtime path**: `AlephQemuResources.from_spec(spec, namespace)`.
  No download, `message_content=None`; it just exposes the attribute surface the
  running controller and pool read (`rootfs_path`, `volumes`, `gpus`,
  `kernel_image_path`). `from_spec` even does a local
  `from aleph.vm.supervisor.types import DiskRole` to read the spec.

One class, two lives. The download half is agent; the runtime half is supervisor.

**Target:** two types.

- An **agent-side downloader** owns `message_content`, `download_runtime`,
  `download_volumes`, `download_all`, and the writable-image creation. It lives
  agent-side and its product is the `CreateVmSpec` (resolved paths). It replaces
  the `AlephQemuResources(message)` use in `translate.py`.
- A **supervisor-side runtime holder** is the message-free
  `VmResources`/`from_spec` shape the controller reads. It stays controller-side
  (post-PR-3, supervisor-side). After PR-1 it reads `DiskRole` from `contract`,
  not via a local `supervisor.types` import.

The shared `controllers/resources.py::VmResources` attribute surface stays as the
common base for the runtime holder. The downloader composes or subclasses it;
choice deferred to the plan (see §5).
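
To make the target shape concrete, here is a minimal sketch of the composition option. It is an illustration only, not the design's decision: `ResolvedSpec`, `RuntimeHolder`, and `Downloader` are hypothetical stand-ins (the real types are `CreateVmSpec`, the `VmResources`/`from_spec` shape, and the extracted downloader), the message is reduced to a plain dict, and nothing is actually downloaded.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Tuple


@dataclass(frozen=True)
class ResolvedSpec:
    """Hypothetical stand-in for CreateVmSpec: resolved host paths, no Aleph message."""
    rootfs_path: Path
    volumes: Tuple[Path, ...] = ()


@dataclass
class RuntimeHolder:
    """Supervisor side: the message-free attribute surface the controller reads."""
    rootfs_path: Path
    volumes: Tuple[Path, ...]
    namespace: str

    @classmethod
    def from_spec(cls, spec: ResolvedSpec, namespace: str) -> "RuntimeHolder":
        # Built purely from the spec; it never sees message_content.
        return cls(spec.rootfs_path, spec.volumes, namespace)


@dataclass
class Downloader:
    """Agent side: owns the message and produces the spec."""
    message_content: dict  # assumption: simplified message shape for the sketch
    cache_dir: Path

    def download_all(self) -> ResolvedSpec:
        # A real downloader would fetch files here; this sketch only resolves paths.
        rootfs = self.cache_dir / self.message_content["rootfs"]
        volumes = tuple(
            self.cache_dir / ref for ref in self.message_content.get("volumes", [])
        )
        return ResolvedSpec(rootfs_path=rootfs, volumes=volumes)
```

With this shape the agent never imports the runtime holder, and the holder exposes neither `message_content` nor any download method, which is the property the unit test in §4 asks for.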

### 2.2 The wire-error vocabulary (parent A.6)

The views are in a hybrid state today: `orchestrator/views/__init__.py` imports
controller and hypervisor exception types (`ResourceDownloadError`,
`VmSetupError`, `FileTooLargeError`, `MicroVMFailedInitError`,
`HostNotFoundError`, `InsufficientResourcesError`) and lists them in the
`vm_creation_exceptions` catch tuples, *and* it already catches `SupervisorError`
/ `InternalSupervisorError` from the boundary. `operator.py` similarly imports
`controllers.qemu.backup` types.

The supervisor side already has the mapping: `supervisor/error_mapping.py` (split
out in PR-1) translates every internal backend exception to the closed
`SupervisorError` set via `translating_errors()`.
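
As an illustration of the pattern described above, a translating context manager could look roughly like this. It is a sketch under assumptions, not the contents of `supervisor/error_mapping.py`: the exception classes are local stand-ins, `ResourcesUnavailable` and the `_MAPPING` table are hypothetical, and the real mapping covers more backend exceptions.

```python
from contextlib import contextmanager


class SupervisorError(Exception):
    """Stand-in for the closed error set raised across the boundary."""


class InternalSupervisorError(SupervisorError):
    """Stand-in fallback for anything without a specific mapping."""


class ResourcesUnavailable(SupervisorError):
    """Hypothetical public error used only in this sketch."""


class InsufficientResourcesError(Exception):
    """Stand-in for an internal backend exception."""


# Assumption: a simple class-to-class table; the real mapping may differ.
_MAPPING = {InsufficientResourcesError: ResourcesUnavailable}


def translate_exception(exc: Exception) -> SupervisorError:
    for internal, public in _MAPPING.items():
        if isinstance(exc, internal):
            return public(str(exc))
    return InternalSupervisorError(str(exc))


@contextmanager
def translating_errors():
    try:
        yield
    except SupervisorError:
        raise  # already in the closed set, pass through unchanged
    except Exception as exc:
        raise translate_exception(exc) from exc
```

A boundary method would wrap its body in `with translating_errors():`, so callers only ever need an `except SupervisorError` clause.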

**Target:** every Supervisor boundary method wraps its work in
`translating_errors()` (audit that the create / operate / backup paths the views
exercise are covered), so the boundary only ever raises `SupervisorError`. The
views then:

- drop the `from aleph.vm.controllers...` and
  `from aleph.vm.hypervisors...` exception imports,
- catch `contract.errors.SupervisorError` (optionally branching on `.code` /
  `ErrorCode` for the few distinct HTTP statuses), and
- map `ErrorCode -> HTTP status` in one small helper.

`operator.py`'s direct use of `controllers.qemu.backup` / `QemuVmClient` is the
same shape: those operations move behind boundary methods that raise
`SupervisorError`, and the view stops importing controller types. (If any backup
RPC is not yet on the boundary, that surfaces as a plan task.)
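
The "one small helper" for `ErrorCode -> HTTP status` might be as simple as a lookup table. The sketch below is an assumption-laden illustration: the `ErrorCode` members shown are hypothetical (the real enum lives in `contract.errors`), and only the 503 for insufficient resources comes from this design; the 413 and the 500 fallback are placeholders that the view-level tests in §4 would have to confirm against current behavior.

```python
from enum import Enum
from http import HTTPStatus


class ErrorCode(Enum):
    """Hypothetical members; the real set is defined in contract.errors."""
    INSUFFICIENT_RESOURCES = "insufficient_resources"
    FILE_TOO_LARGE = "file_too_large"
    INTERNAL = "internal"


class SupervisorError(Exception):
    """Stand-in carrying the .code attribute the views branch on."""

    def __init__(self, code: ErrorCode, message: str = ""):
        super().__init__(message)
        self.code = code


# Only the 503 mapping is stated in the design; the rest are placeholders.
_STATUS_BY_CODE = {
    ErrorCode.INSUFFICIENT_RESOURCES: HTTPStatus.SERVICE_UNAVAILABLE,
    ErrorCode.FILE_TOO_LARGE: HTTPStatus.REQUEST_ENTITY_TOO_LARGE,
}


def http_status_for(error: SupervisorError) -> int:
    """Map a boundary error to an HTTP status, defaulting to 500."""
    return int(_STATUS_BY_CODE.get(error.code, HTTPStatus.INTERNAL_SERVER_ERROR))
```

Keeping the table in one place means a view only needs `except SupervisorError as e:` followed by a response built from `http_status_for(e)`.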

## 3. Result

After PR-2:

- `controllers/` no longer carries Aleph-aware download personalities; it holds
  the runtime holder + the running controller + management (`QemuVmClient`,
  `backup`, `snapshots`) only.
- The agent imports its own downloader and catches only `SupervisorError`. The
  two `orchestrator -> controllers` residuals from PR-1 are deleted from the
  import-linter ignore list, and the linter is tightened to forbid them.
- `controllers/` is now unambiguously a supervisor-owned package (its
  config-file schema, `configuration.py`, already moved to `contract/` in PR-1).
  This is what makes PR-3's physical move into `supervisor/controllers/`
  mechanical.

The only remaining documented residual is `orchestrator -> {pool, models}`,
owned by the separate `VmExecution`/`VmPool` cleave.

## 4. Testing strategy

Behavior changes, so tests lead:

- **Error-mapping tests** (the riskiest surface): for each internal backend
  exception, assert `translate_exception` yields the right `SupervisorError` /
  `ErrorCode`, and that the view maps that code to the same HTTP status the
  current code returns. Lock the create-failure HTTP contract (503 for
  insufficient resources, the 4xx/5xx for setup/download/file-too-large) with
  view-level tests before changing the catch sites, so the refactor is provably
  status-preserving.
- **Downloader / runtime-holder split**: keep the existing
  `translate` and `test_qemu_instance` coverage green; the downloader must
  produce byte-identical spec paths, and `from_spec` must build the same runtime
  holder. Add a unit test that the runtime holder has no `message_content` /
  download surface.
- **Confidential**: `AlephQemuConfidentialResources` follows the same split;
  keep the SEV-path tests (and the testnet #27 SEV run before merge, as with the
  confidential-init work).
- Full supervisor suite green (known env-only failures aside); `mypy`
  baseline unchanged; import-linter passes with the two residuals removed.

## 5. Open questions (resolved in the implementation plan)

- **Downloader vs runtime holder relationship**: does the agent downloader
  subclass `VmResources` (sharing the attribute surface) or compose a separate
  type that emits spec fields? Composition keeps the agent free of the
  controller base class; subclassing is less code.
- **`make_writable_volume` placement**: creating the writable qcow2 runs
  `qemu-img` on the host, which is arguably supervisor/infra work, but today it
  runs in the agent download phase. Keep it in the downloader (status quo,
  behavior-neutral) or move it behind the boundary (larger, defer)? Recommend
  keeping it in the downloader for PR-2 and noting it for the cleave.
- **Backup/QMP coverage on the boundary**: confirm `operator.py`'s backup and
  `QemuVmClient` uses already have boundary methods that raise `SupervisorError`;
  any gap is a prerequisite task inside PR-2 (or a thin precursor PR).

## 6. Next step

Implementation plan (writing-plans) covering the `Resources` split, the
boundary `translating_errors()` audit, the view error-mapping helper, the
residual-removal in the import-linter config, and the test checklist above.
