|
| 1 | +# Controller split by concern (PR-2: behavior-affecting) |
| 2 | + |
| 3 | +**Status:** Design / approved scope |
| 4 | +**Date:** 2026-06-19 |
| 5 | +**Owner:** Olivier Desenfans |
| 6 | +**Repo:** `aleph-im/aleph-vm` |
| 7 | +**Parent design:** `docs/plans/2026-05-28-aleph-vm-architecture-backport-design.md` (§4, A.6) |
| 8 | +**Predecessor:** `docs/plans/2026-06-19-agent-supervisor-boundary-design.md` (PR-1) |
| 9 | + |
| 10 | +## 1. Context |
| 11 | + |
| 12 | +PR-1 establishes the `contract` layer and an import-enforced boundary, but it |
| 13 | +leaves `controllers/` as a *shared* base layer because two pieces of it are used |
| 14 | +on both sides. PR-1 records those as documented residuals: |
| 15 | + |
| 16 | +- `orchestrator -> controllers` for the `Resources` classes, and |
| 17 | +- `orchestrator -> controllers` for the controller exception types caught by the |
| 18 | + HTTP views. |
| 19 | + |
| 20 | +PR-2 removes both residuals. It is the smaller half of what the parent design |
| 21 | +called the §4 cleave, because the hardest piece (moving volume download out of |
| 22 | +`controllers.setup()`) **already happened** during the spec work: `translate.py` |
| 23 | +(agent) calls `resources.download_all()` and resolves every path before building |
| 24 | +the `CreateVmSpec`; the running controller no longer downloads. What remains is |
| 25 | +to stop the controller code from carrying a second, Aleph-aware personality, and |
| 26 | +to finish the wire-error vocabulary so the views stop reaching into controller |
| 27 | +internals. |
| 28 | + |
| 29 | +PR-2 changes behavior (it splits a class and changes which exception types the |
| 30 | +views catch), so it is gated on tests rather than on "moves only". |
| 31 | + |
| 32 | +**Out of scope (still): the `VmExecution` / `VmPool` cleave.** PR-2 does not |
| 33 | +split the god-objects. The controller split does not require it: download is |
| 34 | +already agent-side and the spec path is in place. The `orchestrator -> |
| 35 | +{pool, models}` residual from PR-1 stays until that separate, adjacent effort. |
| 36 | + |
| 37 | +## 2. The two entanglements PR-2 removes |
| 38 | + |
| 39 | +### 2.1 The `Resources` dual personality |
| 40 | + |
| 41 | +`AlephQemuResources` (and the firecracker `AlephProgramResources` / |
| 42 | +`AlephFirecrackerResources`, plus `AlephQemuConfidentialResources`) are |
| 43 | +constructed two different ways: |
| 44 | + |
| 45 | +- **Agent / download path**: `AlephQemuResources(message_content, namespace)` |
| 46 | + followed by `download_all()`. This reads the Aleph message, downloads runtime |
| 47 | + and volumes, creates writable qcow2 images, and feeds `translate.py`'s |
| 48 | + `CreateVmSpec`. It is Aleph-aware (holds `message_content`). |
| 49 | +- **Supervisor / runtime path**: `AlephQemuResources.from_spec(spec, namespace)`. |
| 50 | + No download, `message_content=None`; it just exposes the attribute surface the |
| 51 | + running controller and pool read (`rootfs_path`, `volumes`, `gpus`, |
| 52 | + `kernel_image_path`). `from_spec` even does a local |
| 53 | + `from aleph.vm.supervisor.types import DiskRole` to read the spec. |
| 54 | + |
| 55 | +One class, two lives. The download half is agent; the runtime half is supervisor. |
| 56 | + |
| 57 | +**Target:** two types. |
| 58 | + |
| 59 | +- An **agent-side downloader** owns `message_content`, `download_runtime`, |
| 60 | + `download_volumes`, `download_all`, and the writable-image creation. It lives |
| 61 | + agent-side and its product is the `CreateVmSpec` (resolved paths). It replaces |
| 62 | + the `AlephQemuResources(message)` use in `translate.py`. |
| 63 | +- A **supervisor-side runtime holder** is the message-free |
| 64 | + `VmResources`/`from_spec` shape the controller reads. It stays controller-side |
| 65 | + (post-PR-3, supervisor-side). After PR-1 it reads `DiskRole` from `contract`, |
| 66 | + not via a local `supervisor.types` import. |
| 67 | + |
| 68 | +The shared `controllers/resources.py::VmResources` attribute surface stays as the |
| 69 | +common base for the runtime holder. The downloader composes or subclasses it; |
| 70 | +choice deferred to the plan (see §5). |
| 71 | + |
| 72 | +### 2.2 The wire-error vocabulary (parent A.6) |
| 73 | + |
| 74 | +The views are in a hybrid state today: `orchestrator/views/__init__.py` imports |
| 75 | +controller and hypervisor exception types (`ResourceDownloadError`, |
| 76 | +`VmSetupError`, `FileTooLargeError`, `MicroVMFailedInitError`, |
| 77 | +`HostNotFoundError`, `InsufficientResourcesError`) and lists them in the |
| 78 | +`vm_creation_exceptions` catch tuples, *and* it already catches `SupervisorError` |
| 79 | +/ `InternalSupervisorError` from the boundary. `operator.py` similarly imports |
| 80 | +`controllers.qemu.backup` types. |
| 81 | + |
| 82 | +The supervisor side already has the mapping: `supervisor/error_mapping.py` (split |
| 83 | +out in PR-1) translates every internal backend exception to the closed |
| 84 | +`SupervisorError` set via `translating_errors()`. |
| 85 | + |
| 86 | +**Target:** every Supervisor boundary method wraps its work in |
| 87 | +`translating_errors()` (audit that the create / operate / backup paths the views |
| 88 | +exercise are covered), so the boundary only ever raises `SupervisorError`. The |
| 89 | +views then: |
| 90 | + |
| 91 | +- drop the `from aleph.vm.controllers...` and |
| 92 | + `from aleph.vm.hypervisors...` exception imports, |
| 93 | +- catch `contract.errors.SupervisorError` (optionally branching on `.code` / |
| 94 | + `ErrorCode` for the few distinct HTTP statuses), and |
| 95 | +- map `ErrorCode -> HTTP status` in one small helper. |
| 96 | + |
| 97 | +`operator.py`'s direct use of `controllers.qemu.backup` / `QemuVmClient` is the |
| 98 | +same shape: those operations move behind boundary methods that raise |
| 99 | +`SupervisorError`, and the view stops importing controller types. (If any backup |
| 100 | +RPC is not yet on the boundary, that surfaces as a plan task.) |
| 101 | + |
| 102 | +## 3. Result |
| 103 | + |
| 104 | +After PR-2: |
| 105 | + |
| 106 | +- `controllers/` no longer carries Aleph-aware download personalities; it holds |
| 107 | + the runtime holder + the running controller + management (`QemuVmClient`, |
| 108 | + `backup`, `snapshots`) only. |
| 109 | +- The agent imports its own downloader and catches only `SupervisorError`. The |
| 110 | + two `orchestrator -> controllers` residuals from PR-1 are deleted from the |
| 111 | + import-linter ignore list, and the linter is tightened to forbid them. |
| 112 | +- `controllers/` is now unambiguously a supervisor-owned package (its |
| 113 | + config-file schema, `configuration.py`, already moved to `contract/` in PR-1). |
| 114 | + This is what makes PR-3's physical move into `supervisor/controllers/` |
| 115 | + mechanical. |
| 116 | + |
| 117 | +The only remaining documented residual is `orchestrator -> {pool, models}`, |
| 118 | +owned by the separate `VmExecution`/`VmPool` cleave. |
| 119 | + |
| 120 | +## 4. Testing strategy |
| 121 | + |
| 122 | +Behavior changes, so tests lead: |
| 123 | + |
| 124 | +- **Error-mapping tests** (the riskiest surface): for each internal backend |
| 125 | + exception, assert `translate_exception` yields the right `SupervisorError` / |
| 126 | + `ErrorCode`, and that the view maps that code to the same HTTP status the |
| 127 | + current code returns. Lock the create-failure HTTP contract (503 for |
| 128 | + insufficient resources, the 4xx/5xx for setup/download/file-too-large) with |
| 129 | + view-level tests before changing the catch sites, so the refactor is provably |
| 130 | + status-preserving. |
| 131 | +- **Downloader / runtime-holder split**: keep the existing |
| 132 | + `translate` and `test_qemu_instance` coverage green; the downloader must |
| 133 | + produce byte-identical spec paths, and `from_spec` must build the same runtime |
| 134 | + holder. Add a unit test that the runtime holder has no `message_content` / |
| 135 | + download surface. |
| 136 | +- **Confidential**: `AlephQemuConfidentialResources` follows the same split; |
| 137 | + keep the SEV-path tests (and the testnet #27 SEV run before merge, as with the |
| 138 | + confidential-init work). |
| 139 | +- Full supervisor suite green (known env-only exceptions excepted); `mypy` |
| 140 | + baseline unchanged; import-linter passes with the two residuals removed. |
| 141 | + |
| 142 | +## 5. Open questions (resolved in the implementation plan) |
| 143 | + |
| 144 | +- **Downloader vs runtime holder relationship**: does the agent downloader |
| 145 | + subclass `VmResources` (sharing the attribute surface) or compose a separate |
| 146 | + type that emits spec fields? Composition keeps the agent free of the |
| 147 | + controller base class; subclassing is less code. |
| 148 | +- **`make_writable_volume` placement**: creating the writable qcow2 runs |
| 149 | + `qemu-img` on the host, which is arguably supervisor/infra work, but today it |
| 150 | + runs in the agent download phase. Keep it in the downloader (status quo, |
| 151 | + behavior-neutral) or move it behind the boundary (larger, defer)? Recommend |
| 152 | + keeping it in the downloader for PR-2 and noting it for the cleave. |
| 153 | +- **Backup/QMP coverage on the boundary**: confirm `operator.py`'s backup and |
| 154 | + `QemuVmClient` uses already have boundary methods that raise `SupervisorError`; |
| 155 | + any gap is a prerequisite task inside PR-2 (or a thin precursor PR). |
| 156 | + |
| 157 | +## 6. Next step |
| 158 | + |
| 159 | +Implementation plan (writing-plans) covering the `Resources` split, the |
| 160 | +boundary `translating_errors()` audit, the view error-mapping helper, the |
| 161 | +residual-removal in the import-linter config, and the test checklist above. |
0 commit comments