refactor(boundary): PR-1 — extract contract layer, enforce agent/supervisor import boundary#986
refactor(boundary): PR-1 — extract contract layer, enforce agent/supervisor import boundary#986odesenfans wants to merge 7 commits into
Conversation
…split PR-1 of the agent/supervisor boundary work: extract a contract layer (ABC, wire types, error enum), move misfiled agent code (translate, qemu_build, cloudinit) out of supervisor/, fix the three supervisor-> agent back-references, and enforce the boundary with import-linter. Strictly behavior-neutral; the VmExecution/VmPool cleave, download extraction, wire-error vocabulary, and true controller split are deferred to the later step (parent design 2026-05-28 sections 4 and A.6).
Completes the three-PR sequence design. PR-2 removes the two orchestrator->controllers residuals left by PR-1: split the Resources dual personality (agent message-downloader vs supervisor spec-runtime holder) and finish the wire-error vocabulary so views catch only SupervisorError. Download is already agent-side (spec work), so PR-2 is smaller than the parent design's section 4 implied. PR-3 is the behavior-neutral physical move: orchestrator/ -> agent/, controllers reclassified supervisor-owned, final import-linter names. The VmExecution/VmPool cleave stays out of the sequence as a separate adjacent effort.
… -> contract in PR-1 Per decision: move controllers under supervisor/ in PR-3 (it is the supervisor's execution worker, swapped together in Phase 2). Consequence: configuration.py (the config-file schema agent qemu_build writes) cannot live under supervisor/controllers/ without an agent->supervisor edge, so it moves to contract/ in PR-1 instead. Reclassify controllers as supervisor-side-with-residuals (not a shared base) and correct the Resources straddle description: download is already agent-side (translate.py); the tangle is the single class with two construction personalities, not controllers.setup() downloading.
Behavior-neutral identity rename, runs first (independent of the package reorg, clarifies names before later PRs move code). End-to-end on the supervisor-side objects: VmExecution.vm_hash -> vm_id, VmPool keys/params, local.py, and the agent call sites that read execution.vm_hash. Agent locals that genuinely hold the Aleph ItemHash keep the vm_hash name. Threaded into the PR-1 and PR-3 sequence recaps.
…llision vm_id is already the numeric local VM id (int, ~11 files across controllers/network/hypervisors). Renaming vm_hash->vm_id needs a prior numeric-id rename (e.g. vm_index) and owner sign-off; deferred.
…import boundary PR-1 of the agent/supervisor boundary work (behavior-neutral). - New aleph.vm.contract package: the Supervisor ABC, wire types, the SupervisorError enum, and the controller config-file schema (configuration.py). Both sides import it; it imports neither. - Split supervisor/errors.py: the closed SupervisorError vocabulary moves to contract/errors.py (no backend deps); the exception->code mapping (translate_exception/translating_errors) stays supervisor-side in supervisor/error_mapping.py. - Move the message->spec translator to the agent (orchestrator/translate.py). qemu_build and cloudinit stay supervisor/controller-side: they are used by the supervisor pool, not the agent (corrects the design's classification). - Enforce the boundary with import-linter (pyproject [tool.importlinter] + testing-env dep + CI step): contract is independent; controllers import neither side; the agent reaches the supervisor only through the ABC (modulo the composition root); the supervisor never imports the agent (modulo the shared port-mappings DB + aggregate-settings cache, documented residuals for the VmExecution/VmPool cleave). No runtime behavior change: full supervisor+migration suite green (only the pre-existing root-only test_interfaces failures remain), mypy baseline unchanged, lint-imports 4 kept / 0 broken.
foxpatch-aleph
left a comment
There was a problem hiding this comment.
A clean, behavior-neutral refactoring that extracts the contract layer, moves translate.py to the agent side, splits errors.py into contract errors + supervisor-side error mapping, and adds an import-linter gate. All file moves are correct, no stale import paths remain in src/ or tests/, the linter contracts properly enforce the boundary with documented residuals, and CI is wired up correctly. Only minor stale comments/docstrings remain.
src/aleph/vm/orchestrator/translate.py (line 3): Stale docstring: says "the supervisor refactor (Phase 0.C)" and "the rest of the supervisor pipeline can work with", but this module is now agent-side code. Update to reflect the new ownership.
src/aleph/vm/controllers/qemu/instance.py (line 133): Comment says "Local import keeps the supervisor.* dependency out of module load order" but the import now goes to aleph.vm.contract.types, not supervisor.*. Update to say "contract.*" or just "avoids a circular import."
src/aleph/vm/controllers/qemu_confidential/instance.py (line 47): Same stale comment as instance.py:133 — says "supervisor.* dependency" but imports from aleph.vm.contract.errors.
src/aleph/vm/orchestrator/views/__init__.py (line 74): Two separate from aleph.vm.contract.errors import blocks (lines 74-76 and 77-81) could be merged into one import statement. Minor cosmetic nit.
The contract-layer import-path changes reordered import blocks; re-run the CI formatters (isort, ruff format, pyproject-fmt) so the linting gate passes. No logic change.
foxpatch-aleph
left a comment
There was a problem hiding this comment.
A clean, behavior-neutral refactoring that extracts a contract layer (ABC, types, errors, configuration) shared between the agent and supervisor, splits errors.py into contract-side error classes and supervisor-side error mapping, moves translate.py to the agent (orchestrator/), and adds an import-linter CI gate. All file moves are complete (no stale files or imports remain), the contract layer has no forbidden dependencies, the import-linter configuration correctly captures forbidden edges and documented residuals, and test imports are updated throughout. The design docs are thorough and the deviations from the original design (keeping qemu_build/cloudinit supervisor-side) are well-justified. No correctness, security, or testing issues found.
.github/workflows/test-using-pytest.yml (line 57): The sudo python3 -m pip install --upgrade --ignore-installed hatch hatch-vcs coverage "virtualenv<21" on line 57 duplicates the same command at line 42. The root hatch install already happened in the previous step. This is harmless (defensive re-install) but adds ~10-15s of CI time for no benefit. Consider removing it or adding a comment explaining why a second install is needed specifically before the import-linter step.
src/aleph/vm/orchestrator/run.py (line 20): The alias from aleph.vm.contract import errors as supervisor_errors preserves the old supervisor_errors namespace. This is a reasonable transitional convenience for a behavior-neutral PR, but the name is now slightly misleading since the errors live in contract, not supervisor. Consider renaming to contract_errors or boundary_errors in a follow-up to reduce confusion.
src/aleph/vm/orchestrator/translate.py (line 38): This from aleph.vm.controllers.qemu.cloudinit import get_hostname_from_hash creates an orchestrator -> controllers edge that is not explicitly documented as a residual in the import-linter ignore_imports (unlike the Resources classes and controller exception types). It's not forbidden by any linter contract, so it passes, but future readers may wonder why this edge is allowed while the design doc lists orchestrator -> controllers as a known residual. Consider adding a brief comment noting this is a pre-existing edge to be cleaned up with the controller split.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## dev #986 +/- ##
==========================================
- Coverage 79.07% 79.07% -0.01%
==========================================
Files 202 203 +1
Lines 22639 22644 +5
Branches 1461 1461
==========================================
+ Hits 17902 17905 +3
- Misses 4353 4354 +1
- Partials 384 385 +1 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
PR-1: extract the contract layer, enforce the agent/supervisor import boundary
First of a sequence (designs in
docs/plans/2026-06-19-*) that carves a clean,import-enforced boundary between the Aleph agent and the supervisor.
Behavior-neutral: file moves, an error-class split, import-direction fixes,
and an import-linter gate. No runtime logic changes.
What changed
aleph.vm.contractpackage — the shared wire/interface vocabulary bothsides depend on: the
SupervisorABC, the value types (VmId,VmInfo,CreateVmSpec, …), the closedSupervisorErrorenum, and the on-diskcontroller config-file schema (
configuration.py, moved out ofcontrollers/so the agent's
qemu_builddoes not import a supervisor package). It importsneither side.
errors.pysplit — theSupervisorErrorclasses (no backend deps) move tocontract/errors.py; the exception→code mapping(
translate_exception/translating_errors) stays supervisor-side assupervisor/error_mapping.py.translate.py→orchestrator/(agent): it consumesaleph_messageand isused only by agent code.
[tool.importlinter]+ testing-env dep + CI step):contract is independent; controllers import neither side; the agent reaches the
supervisor only through the ABC (modulo the composition root); the supervisor
never imports the agent.
Deviations from the design doc (corrected here)
qemu_build.pyandcloudinit.pystay supervisor/controller-side. Thedesign classified them as agent code, but the only importer of
qemu_buildisthe supervisor
pool(it builds the controller config from the spec), andcloudinitis used byqemu_build. So they are supervisor concerns; onlytranslateis agent code. (I will update the design doc to match.)The supervisor→agent edges that remain are the shared port-mappings DB
(
models/pool/local/daemon→orchestrator.metrics) and theaggregate-settings cache (
pool→orchestrator.utils). Both are entangledwith the DB engine / policy layer and are behavior-affecting to extract, so they
belong with the
VmExecution/VmPoolcleave (parent design §4). They arelisted as explicit
ignore_importswith comments.AMDSEVPolicy(third-partyaleph_message) is left inlocal.py— it is not an agent-boundary violation.Verification
tests/supervisor+tests/migrationsuite: 922 passed, only thepre-existing root-only
test_interfacesfailures remain (env, not this change).lint-imports: 4 contracts kept, 0 broken.mypy src/aleph/vm/: baseline unchanged (43 pre-existing errors, none in thenew/moved modules).
Part of the agent/supervisor split. Followed by PR-2 (controller
Resourcessplit + wire-error vocabulary) and PR-3 (the physical
orchestrator/→agent/and
controllers/→supervisor/controllers/moves). PR-0 (thevm_hash→vm_idrename) is blocked on a naming decision (
vm_idalready names the numeric localid) and tracked separately.
🤖 Generated with Claude Code