Skip to content

refactor(boundary): PR-1 — extract contract layer, enforce agent/supervisor import boundary#986

Open
odesenfans wants to merge 7 commits into
devfrom
od/agent-supervisor-contract
Open

refactor(boundary): PR-1 — extract contract layer, enforce agent/supervisor import boundary#986
odesenfans wants to merge 7 commits into
devfrom
od/agent-supervisor-contract

Conversation

@odesenfans

Copy link
Copy Markdown
Contributor

PR-1: extract the contract layer, enforce the agent/supervisor import boundary

First of a sequence (designs in docs/plans/2026-06-19-*) that carves a clean,
import-enforced boundary between the Aleph agent and the supervisor.
Behavior-neutral: file moves, an error-class split, import-direction fixes,
and an import-linter gate. No runtime logic changes.

What changed

  • New aleph.vm.contract package — the shared wire/interface vocabulary both
    sides depend on: the Supervisor ABC, the value types (VmId, VmInfo,
    CreateVmSpec, …), the closed SupervisorError enum, and the on-disk
    controller config-file schema (configuration.py, moved out of controllers/
    so the agent's qemu_build does not import a supervisor package). It imports
    neither side.
  • errors.py split — the SupervisorError classes (no backend deps) move to
    contract/errors.py; the exception→code mapping
    (translate_exception/translating_errors) stays supervisor-side as
    supervisor/error_mapping.py.
  • translate.pyorchestrator/ (agent): it consumes aleph_message and is
    used only by agent code.
  • import-linter gate ([tool.importlinter] + testing-env dep + CI step):
    contract is independent; controllers import neither side; the agent reaches the
    supervisor only through the ABC (modulo the composition root); the supervisor
    never imports the agent.

Deviations from the design doc (corrected here)

  • qemu_build.py and cloudinit.py stay supervisor/controller-side. The
    design classified them as agent code, but the only importer of qemu_build is
    the supervisor pool (it builds the controller config from the spec), and
    cloudinit is used by qemu_build. So they are supervisor concerns; only
    translate is agent code. (I will update the design doc to match.)
  • The three "leaks" are documented import-linter residuals, not fixed here.
    The supervisor→agent edges that remain are the shared port-mappings DB
    (models/pool/local/daemonorchestrator.metrics) and the
    aggregate-settings cache (poolorchestrator.utils). Both are entangled
    with the DB engine / policy layer and are behavior-affecting to extract, so they
    belong with the VmExecution/VmPool cleave (parent design §4). They are
    listed as explicit ignore_imports with comments. AMDSEVPolicy (third-party
    aleph_message) is left in local.py — it is not an agent-boundary violation.

Verification

  • Full tests/supervisor + tests/migration suite: 922 passed, only the
    pre-existing root-only test_interfaces failures remain (env, not this change).
  • lint-imports: 4 contracts kept, 0 broken.
  • mypy src/aleph/vm/: baseline unchanged (43 pre-existing errors, none in the
    new/moved modules).

Part of the agent/supervisor split. Followed by PR-2 (controller Resources
split + wire-error vocabulary) and PR-3 (the physical orchestrator/agent/
and controllers/supervisor/controllers/ moves). PR-0 (the vm_hashvm_id
rename) is blocked on a naming decision (vm_id already names the numeric local
id) and tracked separately.

🤖 Generated with Claude Code

…split

PR-1 of the agent/supervisor boundary work: extract a contract layer
(ABC, wire types, error enum), move misfiled agent code (translate,
qemu_build, cloudinit) out of supervisor/, fix the three supervisor->
agent back-references, and enforce the boundary with import-linter.
Strictly behavior-neutral; the VmExecution/VmPool cleave, download
extraction, wire-error vocabulary, and true controller split are
deferred to the later step (parent design 2026-05-28 sections 4 and A.6).
Completes the three-PR sequence design. PR-2 removes the two
orchestrator->controllers residuals left by PR-1: split the Resources
dual personality (agent message-downloader vs supervisor spec-runtime
holder) and finish the wire-error vocabulary so views catch only
SupervisorError. Download is already agent-side (spec work), so PR-2 is
smaller than the parent design's section 4 implied. PR-3 is the
behavior-neutral physical move: orchestrator/ -> agent/, controllers
reclassified supervisor-owned, final import-linter names. The
VmExecution/VmPool cleave stays out of the sequence as a separate
adjacent effort.
… -> contract in PR-1

Per decision: move controllers under supervisor/ in PR-3 (it is the
supervisor's execution worker, swapped together in Phase 2). Consequence:
configuration.py (the config-file schema agent qemu_build writes) cannot
live under supervisor/controllers/ without an agent->supervisor edge, so
it moves to contract/ in PR-1 instead. Reclassify controllers as
supervisor-side-with-residuals (not a shared base) and correct the
Resources straddle description: download is already agent-side
(translate.py); the tangle is the single class with two construction
personalities, not controllers.setup() downloading.
Behavior-neutral identity rename, runs first (independent of the package
reorg, clarifies names before later PRs move code). End-to-end on the
supervisor-side objects: VmExecution.vm_hash -> vm_id, VmPool keys/params,
local.py, and the agent call sites that read execution.vm_hash. Agent
locals that genuinely hold the Aleph ItemHash keep the vm_hash name.
Threaded into the PR-1 and PR-3 sequence recaps.
…llision

vm_id is already the numeric local VM id (int, ~11 files across
controllers/network/hypervisors). Renaming vm_hash->vm_id needs a prior
numeric-id rename (e.g. vm_index) and owner sign-off; deferred.
…import boundary

PR-1 of the agent/supervisor boundary work (behavior-neutral).

- New aleph.vm.contract package: the Supervisor ABC, wire types, the
  SupervisorError enum, and the controller config-file schema
  (configuration.py). Both sides import it; it imports neither.
- Split supervisor/errors.py: the closed SupervisorError vocabulary moves to
  contract/errors.py (no backend deps); the exception->code mapping
  (translate_exception/translating_errors) stays supervisor-side in
  supervisor/error_mapping.py.
- Move the message->spec translator to the agent (orchestrator/translate.py).
  qemu_build and cloudinit stay supervisor/controller-side: they are used by
  the supervisor pool, not the agent (corrects the design's classification).
- Enforce the boundary with import-linter (pyproject [tool.importlinter] +
  testing-env dep + CI step): contract is independent; controllers import
  neither side; the agent reaches the supervisor only through the ABC (modulo
  the composition root); the supervisor never imports the agent (modulo the
  shared port-mappings DB + aggregate-settings cache, documented residuals for
  the VmExecution/VmPool cleave).

No runtime behavior change: full supervisor+migration suite green (only the
pre-existing root-only test_interfaces failures remain), mypy baseline
unchanged, lint-imports 4 kept / 0 broken.

@foxpatch-aleph foxpatch-aleph left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A clean, behavior-neutral refactoring that extracts the contract layer, moves translate.py to the agent side, splits errors.py into contract errors + supervisor-side error mapping, and adds an import-linter gate. All file moves are correct, no stale import paths remain in src/ or tests/, the linter contracts properly enforce the boundary with documented residuals, and CI is wired up correctly. Only minor stale comments/docstrings remain.

src/aleph/vm/orchestrator/translate.py (line 3): Stale docstring: says "the supervisor refactor (Phase 0.C)" and "the rest of the supervisor pipeline can work with", but this module is now agent-side code. Update to reflect the new ownership.

src/aleph/vm/controllers/qemu/instance.py (line 133): Comment says "Local import keeps the supervisor.* dependency out of module load order" but the import now goes to aleph.vm.contract.types, not supervisor.*. Update to say "contract.*" or just "avoids a circular import."

src/aleph/vm/controllers/qemu_confidential/instance.py (line 47): Same stale comment as instance.py:133 — says "supervisor.* dependency" but imports from aleph.vm.contract.errors.

src/aleph/vm/orchestrator/views/__init__.py (line 74): Two separate from aleph.vm.contract.errors import blocks (lines 74-76 and 77-81) could be merged into one import statement. Minor cosmetic nit.

The contract-layer import-path changes reordered import blocks; re-run the
CI formatters (isort, ruff format, pyproject-fmt) so the linting gate passes.
No logic change.

@foxpatch-aleph foxpatch-aleph left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A clean, behavior-neutral refactoring that extracts a contract layer (ABC, types, errors, configuration) shared between the agent and supervisor, splits errors.py into contract-side error classes and supervisor-side error mapping, moves translate.py to the agent (orchestrator/), and adds an import-linter CI gate. All file moves are complete (no stale files or imports remain), the contract layer has no forbidden dependencies, the import-linter configuration correctly captures forbidden edges and documented residuals, and test imports are updated throughout. The design docs are thorough and the deviations from the original design (keeping qemu_build/cloudinit supervisor-side) are well-justified. No correctness, security, or testing issues found.

.github/workflows/test-using-pytest.yml (line 57): The sudo python3 -m pip install --upgrade --ignore-installed hatch hatch-vcs coverage "virtualenv<21" on line 57 duplicates the same command at line 42. The root hatch install already happened in the previous step. This is harmless (defensive re-install) but adds ~10-15s of CI time for no benefit. Consider removing it or adding a comment explaining why a second install is needed specifically before the import-linter step.

src/aleph/vm/orchestrator/run.py (line 20): The alias from aleph.vm.contract import errors as supervisor_errors preserves the old supervisor_errors namespace. This is a reasonable transitional convenience for a behavior-neutral PR, but the name is now slightly misleading since the errors live in contract, not supervisor. Consider renaming to contract_errors or boundary_errors in a follow-up to reduce confusion.

src/aleph/vm/orchestrator/translate.py (line 38): This from aleph.vm.controllers.qemu.cloudinit import get_hostname_from_hash creates an orchestrator -> controllers edge that is not explicitly documented as a residual in the import-linter ignore_imports (unlike the Resources classes and controller exception types). It's not forbidden by any linter contract, so it passes, but future readers may wonder why this edge is allowed while the design doc lists orchestrator -> controllers as a known residual. Consider adding a brief comment noting this is a pre-existing edge to be cleaned up with the controller split.

@codecov

codecov Bot commented Jun 18, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 97.70115% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 79.07%. Comparing base (08af7b4) to head (ec33c32).

Files with missing lines Patch % Lines
src/aleph/vm/supervisor/error_mapping.py 77.14% 4 Missing and 4 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##              dev     #986      +/-   ##
==========================================
- Coverage   79.07%   79.07%   -0.01%     
==========================================
  Files         202      203       +1     
  Lines       22639    22644       +5     
  Branches     1461     1461              
==========================================
+ Hits        17902    17905       +3     
- Misses       4353     4354       +1     
- Partials      384      385       +1     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants