refactor(infra-local): simplify + unify into one Python compose package (~50% smaller)#796
Conversation
…s (~50%) apps/infra-local was ~4,500 lines for an 11-service local stack. This keeps every service and behavior identical while removing avoidable ceremony: 24 files, +437/-2703. No Cargo/Rust/core-SDK source changes. Code (boxlite_local/, 10 files -> 8): - doctor.py: drop the Severity/DoctorCheck/DoctorReport/DoctorError(report)/ format_report framework for plain functions returning failure strings; shrink the lsof parser. Keep a one-line DoctorError(RuntimeError). - Delete types.py: ServiceSpec/HealthCheck move to services.py (their only consumers); the Doctor* types are gone with the doctor rewrite. - Delete execwrap.py: exec_collect inlined into orchestrator.py. - Centralize the dual-layout SDK import (boxlite vs boxlite.boxlite) in _sdk.py (was copy-pasted across orchestrator/doctor). - Drop dead config: the BOXLITE_*_HOST_PORT env overrides never moved a bound port (ServiceSpec.ports are literals), the pg_url property, and the YAGNI HealthCheck.tcp_port / Severity.WARN. Delete configs/minio/init.sh (a drifted dead duplicate of the inline _MINIO_INIT_SCRIPT in services.py). Tests — delete apps/infra-local/tests/ (~1000 lines), behavior-preserving: - The suite is not in CI and mostly asserted dataclass defaults. The real E2E (scripts/test/e2e) tests the REST path, which by design forbids host bind mounts + host ports (cases/test_volume_readonly.py), so these can't move there. Dev-stack validation is now `make stack-up` + the apps `npm run e2e:local`. - Preserve the one genuinely-unique, nowhere-else-covered capability as a native-gated SDK test: sdks/python/tests/test_volume_port_persistence.py — a long-lived box with a read-write host volume that persists + a host-reachable mapped port (the shape infra-local relies on, e.g. postgres 25432:5432 over a writable volume). Mirrors the existing test_readonly_volume_remount.py. Docs (1212 -> 103): - Delete CONNECTIONS.md (it restated config.py/services.py). README rewritten to ~100 lines: what-it-is, quick start, make targets, endpoint+credential table, how to validate, top-6 troubleshooting. Also corrected the stale "Docker Desktop required" prereq (the runtime is Hypervisor.framework + libkrun). Scripts: - stack-up.sh: factor the four identical component prechecks + health-waits into prestart()/await_up() helpers; each launch invocation is unchanged. Verification: py_compile + import all modules; `python -m boxlite_local --help` + doctor fast paths; bash -n on all scripts; new SDK test pytest --collect-only (import + marker + collection; full native run is gated). End-to-end remains `make stack-up && make stack-status`.
…, flatten Unifies the two-language orchestrator into one Python package and slims the command + directory surface. Builds on the prior simplification commit; folds in the cli.py -> __main__.py merge from that step. Rename boxlite_local/ -> compose/ (`python -m compose`). The old name was redundant (the whole repo is boxlite) and didn't match its infra-local/ dir. Internal imports are relative, so only pyproject/Makefile/README referenced it. Port L2 shell -> compose/native.py: the four native host processes (API, runner, proxy, dashboard) are supervised via subprocess instead of ~800 lines of bash. Daemons spawn detached (start_new_session) and stop via SIGTERM->SIGKILL on the process group -- killpg reaps the nx/go grandchildren, replacing the bash pkill-by-name sweep. Reuses orchestrator._http_probe / doctor._lsof_owner / InfraConfig. Deletes scripts/ (9 files). Flatten: configs/api.env -> api.env (root); scripts/ + configs/ gone. The tree is now four root files + the compose/ package. CLI: one unified `python -m compose` over L1 (boxes) + L2 (procs), 7 verbs -- up / down / status / logs / restart / reset / nuke. `up` is self-contained (install, build-if-missing, preflight, seed), so build/seed/doctor/migrate are not separate commands. `restart <name>` bounces an L2 proc OR recreates an L1 box (folds the old `rebuild`). `nuke` is the full teardown; `reset` is data-only. Makefile: ~19 targets -> 9 thin aliases (one per verb + help/install). Verification: py_compile + import all modules; `python -m compose --help` + every subcommand. Process supervision is runtime-only behavior I cannot exercise here (no live stack), so native.py is a faithful behavior-port of the deleted scripts and `make up && make status` must be run live to validate end-to-end before merge. No automated test added for native.py: the suite was removed earlier (per the repo convention of not unit-testing process/glue code), and the one unique SDK capability is already pinned by sdks/python/tests/test_volume_port_persistence.py.
Adversarial review found compose/native.py's apps/.env parser stored `export KEY=val` lines under the literal key "export KEY", silently dropping the var from the API process env — a divergence from the bash it ports (`set -a && . ./.env && set +a`, which strips the `export` keyword). Latent in the checked-in template (no export lines) but live for dev edits to apps/.env. Strip the keyword (and only the keyword: `exportFOO=` stays `exportFOO`). Verified against real `bash -c 'set -a && . file && set +a && env'` (revert -> diverges on an exported var, restore -> matches).
… busybox httpd The new test_volume_port_persistence integration test failed on its first real execution (the pre-push hook): `alpine:latest`'s base busybox omits the `httpd` applet, so `httpd -p 8000 -h /data` was command-not-found. Run the in-box server as the box's own long-lived foreground cmd instead — python:3-alpine with `python3 -m http.server` over the mounted volume (mirrors how infra-local runs each service as the box's main process), rather than backgrounding a daemon via exec (whose lifetime is the exec session, not the box). Verified live: `pytest tests/test_volume_port_persistence.py -m integration` → 1 passed (30s). Proves RW-volume write-through + persistence-across-restart + host-port reachability against a real microVM.
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughReplaces the Changesinfra-local orchestrator rewrite
SDK volume and port persistence integration test
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
sdks/python/tests/test_volume_port_persistence.py (1)
95-136:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winGuarantee temp-directory cleanup on all failure paths in Line [95]-Line [136].
shutil.rmtree(host_dir, ignore_errors=True)only runs in the second box’sfinally; if an exception occurs earlier (for example during second box creation), the temp directory is leaked.Suggested fix (outer cleanup finally)
import http.client +import contextlib import os @@ def test_rw_volume_persists_and_port_is_reachable(self, runtime): host_dir = tempfile.mkdtemp(prefix="bl_vol_port_") host_port = _free_host_port() + try: + # ── Box 1: write through the RW volume, then serve it on a mapped port ── + box = _box(write_marker=True) + try: + box.start() + assert _get_when_ready(host_port, "/marker.txt") == MARKER + with open(os.path.join(host_dir, "marker.txt")) as f: + assert f.read() == MARKER, "RW volume did not write through to host" + finally: + with contextlib.suppress(Exception): + box.stop() - # ── Box 2: a fresh box on the same host volume serves the persisted data ── - box2 = _box(write_marker=False) - try: - box2.start() - assert _get_when_ready(host_port, "/marker.txt") == MARKER, ( - "volume data did not persist across a box restart" - ) - finally: - box2.stop() + # ── Box 2: a fresh box on the same host volume serves the persisted data ── + box2 = _box(write_marker=False) + try: + box2.start() + assert _get_when_ready(host_port, "/marker.txt") == MARKER, ( + "volume data did not persist across a box restart" + ) + finally: + with contextlib.suppress(Exception): + box2.stop() + finally: shutil.rmtree(host_dir, ignore_errors=True)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@sdks/python/tests/test_volume_port_persistence.py` around lines 95 - 136, The temporary directory cleanup with shutil.rmtree is only executed in the finally block of the second box, which means if any exception occurs earlier in the test (during first box operations, second box creation, or other assertions), the temp directory leaks. Add an outer try/finally block that wraps all the test logic after host_dir is created to guarantee that shutil.rmtree(host_dir, ignore_errors=True) executes on all failure paths in the test_rw_volume_persists_and_port_is_reachable method.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@apps/infra-local/compose/native.py`:
- Around line 491-494: The seed function call at line 491 (and also at line 501)
does not check the return code, allowing the up command to exit successfully
even when seed initialization fails. Capture the return value from seed(cfg,
no_bounce=True) and check if it indicates failure; if it does, propagate that
failure by exiting the up command with an appropriate non-zero status code
instead of continuing to success.
- Around line 627-633: The hard-reset logic calls _psql and _migrate functions
that can fail without raising exceptions, yet continues to report success with
the ok() and warn() messages regardless of whether these operations actually
succeeded. Modify the code to capture and check the return codes (or exceptions)
from both the _psql call (which drops and recreates the schema) and the _migrate
call, and only proceed to the ok() success message if both operations complete
successfully. If either _psql or _migrate fails, the function should raise an
error or return a failure status to prevent silent partial failures. Apply this
pattern to all affected locations in the hard-reset path.
- Around line 289-299: In the start_component() function, when a component fails
to become healthy (the healthy variable is False), add cleanup code before
logging the error and returning False. Stop and kill the spawned process p and
remove its associated pidfile to prevent stale daemons and ensure consistent
retries. The cleanup should happen in the else branch where the timeout error is
logged, after the unhealthy condition is detected but before the return False
statement.
- Around line 591-607: The restart function ignores the return value from
start_component(p, table[name]) when restarting L2 components in the loop over
l2, which means the command returns success even if a component fails to become
healthy. Capture the return value from the start_component call within the for
loop iterating over l2, check if it indicates failure (non-zero), and if so,
return that failure code immediately instead of continuing and returning 0 at
the end of the function.
In `@apps/infra-local/README.md`:
- Line 101: The opening fence in the code block lacks a language identifier,
which triggers markdownlint MD040. Add "text" as the language identifier to the
opening triple backticks (change ``` to ```text) to properly tag the fenced code
block that displays the directory structure and file descriptions.
In `@sdks/python/tests/test_volume_port_persistence.py`:
- Around line 53-59: The _free_host_port() function at line 53-59 has a TOCTOU
race condition: it finds a free port and closes the socket, allowing another
process to claim the port before actual binding occurs. Implement retry logic to
handle this: when binding to the returned port fails in the code at line 97-111
(where the port is actually used), catch the bind error and call
_free_host_port() again to get a new port, then retry the binding. Repeat this
until binding succeeds. Alternatively, modify _free_host_port() to keep the
socket open and return both the socket and port number, passing the open socket
to the binding code to ensure the port stays reserved until the actual binding
occurs.
---
Outside diff comments:
In `@sdks/python/tests/test_volume_port_persistence.py`:
- Around line 95-136: The temporary directory cleanup with shutil.rmtree is only
executed in the finally block of the second box, which means if any exception
occurs earlier in the test (during first box operations, second box creation, or
other assertions), the temp directory leaks. Add an outer try/finally block that
wraps all the test logic after host_dir is created to guarantee that
shutil.rmtree(host_dir, ignore_errors=True) executes on all failure paths in the
test_rw_volume_persists_and_port_is_reachable method.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 52f2032e-e67c-440f-bde9-a5b1daf1403e
📒 Files selected for processing (39)
apps/infra-local/CONNECTIONS.mdapps/infra-local/Makefileapps/infra-local/README.mdapps/infra-local/api.envapps/infra-local/boxlite_local/__init__.pyapps/infra-local/boxlite_local/__main__.pyapps/infra-local/boxlite_local/cli.pyapps/infra-local/boxlite_local/doctor.pyapps/infra-local/boxlite_local/execwrap.pyapps/infra-local/boxlite_local/types.pyapps/infra-local/compose/__init__.pyapps/infra-local/compose/__main__.pyapps/infra-local/compose/_sdk.pyapps/infra-local/compose/config.pyapps/infra-local/compose/doctor.pyapps/infra-local/compose/native.pyapps/infra-local/compose/orchestrator.pyapps/infra-local/compose/services.pyapps/infra-local/configs/minio/init.shapps/infra-local/pyproject.tomlapps/infra-local/scripts/_stack-common.shapps/infra-local/scripts/seed-init-data.shapps/infra-local/scripts/stack-build.shapps/infra-local/scripts/stack-down.shapps/infra-local/scripts/stack-logs.shapps/infra-local/scripts/stack-reset.shapps/infra-local/scripts/stack-restart.shapps/infra-local/scripts/stack-status.shapps/infra-local/scripts/stack-up.shapps/infra-local/tests/__init__.pyapps/infra-local/tests/integration/__init__.pyapps/infra-local/tests/integration/test_e2e_full.pyapps/infra-local/tests/integration/test_multi_service.pyapps/infra-local/tests/unit/__init__.pyapps/infra-local/tests/unit/test_config.pyapps/infra-local/tests/unit/test_doctor_lsof.pyapps/infra-local/tests/unit/test_orchestrator.pyapps/infra-local/tests/unit/test_topo.pysdks/python/tests/test_volume_port_persistence.py
💤 Files with no reviewable changes (23)
- apps/infra-local/CONNECTIONS.md
- apps/infra-local/scripts/stack-restart.sh
- apps/infra-local/boxlite_local/init.py
- apps/infra-local/scripts/stack-build.sh
- apps/infra-local/configs/minio/init.sh
- apps/infra-local/scripts/_stack-common.sh
- apps/infra-local/scripts/stack-up.sh
- apps/infra-local/boxlite_local/execwrap.py
- apps/infra-local/scripts/seed-init-data.sh
- apps/infra-local/boxlite_local/types.py
- apps/infra-local/scripts/stack-status.sh
- apps/infra-local/tests/unit/test_topo.py
- apps/infra-local/boxlite_local/main.py
- apps/infra-local/scripts/stack-down.sh
- apps/infra-local/tests/integration/test_multi_service.py
- apps/infra-local/scripts/stack-reset.sh
- apps/infra-local/scripts/stack-logs.sh
- apps/infra-local/boxlite_local/doctor.py
- apps/infra-local/tests/unit/test_doctor_lsof.py
- apps/infra-local/tests/unit/test_config.py
- apps/infra-local/tests/unit/test_orchestrator.py
- apps/infra-local/tests/integration/test_e2e_full.py
- apps/infra-local/boxlite_local/cli.py
CodeRabbit + CodeQL findings on the compose port — each verified against the code before applying: native.py: - start_component now stops the failed component on health-timeout: it was leaving a stale daemon + pidfile that a retry would treat as "already running", skipping a broken component. [CodeRabbit Major] - up and restart propagate failures into the exit code — they returned 0 even when a component never became healthy or a hard seed failure occurred. [CodeRabbit Major] - reset --hard now checks the `DROP SCHEMA` psql + `_migrate` exit codes and aborts on failure (restores the deleted bash's `set -e` fail-fast; it was printing "complete" over a broken partial schema). `_migrate` returns its code. [CodeRabbit Major] - logs() handles `os.execvp` failure (returns 1) and no longer falls through implicitly, satisfying its `-> int` contract. [CodeQL] - explanatory comments on the intentional best-effort `except` blocks (_kill_port_listeners / _terminate_group / _seed_api_env / _ensure_installed). [CodeQL] README: language tag on the layout fenced block (markdownlint MD040). [CodeRabbit] sdks/python volume-port test: pick a fresh host port and retry the box start, to close the `_free_host_port()` TOCTOU window (port released before the box binds it). Re-verified live: `pytest -m integration` -> 1 passed.
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
apps/infra-local/compose/native.py (1)
688-690:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winUnchecked
restart()return value may lead to unnecessary 60s wait.When
seed()bounces the API viarestart(cfg, ["api"]), the return value is ignored. If the API fails to restart (e.g., port conflict, binary crash),seed()will wait the full 60s before timing out at line 698 rather than failing fast.Consider checking the return value and failing early:
Proposed fix
if _component_pid(p, "api") is not None: log("restarting api so it re-runs its initialize* cycle...") - restart(cfg, ["api"]) + if restart(cfg, ["api"]) != 0: + err("api failed to restart — cannot complete seed") + return 1 else:🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@apps/infra-local/compose/native.py` around lines 688 - 690, The `restart(cfg, ["api"])` call on line 690 ignores its return value, which causes unnecessary delays when the API fails to restart. Check the return value of the `restart()` function call and add error handling to fail early if the restart operation fails. This will prevent the code from waiting the full 60 seconds before timing out at line 698 when the API encounters issues like port conflicts or crashes.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@apps/infra-local/compose/native.py`:
- Around line 688-690: The `restart(cfg, ["api"])` call on line 690 ignores its
return value, which causes unnecessary delays when the API fails to restart.
Check the return value of the `restart()` function call and add error handling
to fail early if the restart operation fails. This will prevent the code from
waiting the full 60 seconds before timing out at line 698 when the API
encounters issues like port conflicts or crashes.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 6cffec55-a4d9-4ef3-979c-79033b6f8df2
📒 Files selected for processing (3)
apps/infra-local/README.mdapps/infra-local/compose/native.pysdks/python/tests/test_volume_port_persistence.py
🚧 Files skipped from review as they are similar to previous changes (2)
- apps/infra-local/README.md
- sdks/python/tests/test_volume_port_persistence.py
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com> Signed-off-by: dorianzheng <8065637+DorianZheng@users.noreply.github.com>
Signed-off-by: dorianzheng <8065637+DorianZheng@users.noreply.github.com>
What
Simplifies
apps/infra-local(the BoxLite-based local dev stack) and unifies its two-language orchestrator into one Python package. ~4,500 → ~2,190 lines (~50%), one entry point (python -m compose), flat structure.Commits
BOXLITE_*_HOST_PORTknobs never moved a bound port;pg_url/tcp_port/Severity.WARN; deadconfigs/minio/init.sh), dedupe the SDK import shim, delete the staleCONNECTIONS.md(445 lines restating the code), rewrite the README, and remove the infra-local test suite (not in CI; mostly asserted dataclass defaults) — preserving the one unique capability as a native-gated SDK test.scripts/) intocompose/native.py(subprocess supervision: detached daemons, pidfiles,killpgteardown). Packageboxlite_local→compose.scripts/+configs/deleted;api.env→ root. CLI is 7 verbs (up/down/status/logs/restart/reset/nuke); Makefile ~19 → 9 thin aliases.export KEY=valparser bug in the.envreader; the SDK capability test was rewritten (alpinebusybox lackshttpd) to apython:3-alpinehttp.server box and verified live.Final structure
4 root files (
api.env,Makefile,README.md,pyproject.toml) + thecompose/package (config / services / orchestrator [L1] / native [L2] / doctor / _sdk / main). Noscripts/,configs/, ortests/.Verification
python -m compose --help+ all 7 subcommands;py_compile+ import all modules.pytest -m integration→ 1 passed, 30s): RW-volume persistence + host-port mapping.ZERO_FINDINGS; CLAUDE.md auditor PASS on every commit; pre-push integration suite green.compose/native.pyis a faithful behavior-port of the deleted bash, but process supervision is runtime-only behavior. It compiles/imports/parses and the adversarial review cleared it against the bash spec, but a livemake up && make status(thenmake down/make restart COMPONENTS=runner) on an Apple Silicon Mac is the final proof the L2 daemons spawn/reap correctly. Please run that before merging.No Cargo/Rust/core-SDK source changes; scope is
apps/infra-local/+ onesdks/python/tests/file.Summary by CodeRabbit
Release Notes
Refactor
python -m compose, with Makefile aliases forup,down,status,logs,restart,reset, andnuke..apps-local/.Documentation
apps/infra-localREADME for the new compose workflow, commands, and endpoints.Tests