Skip to content

fix(extensions): probe unmonitored user exts; surface recreate warnings; activate-path progress#1104

Draft
yasinBursali wants to merge 1 commit intoLight-Heart-Labs:mainfrom
yasinBursali:fix/extension-status-state-machine
Draft

fix(extensions): probe unmonitored user exts; surface recreate warnings; activate-path progress#1104
yasinBursali wants to merge 1 commit intoLight-Heart-Labs:mainfrom
yasinBursali:fix/extension-status-state-machine

Conversation

@yasinBursali
Copy link
Copy Markdown
Contributor

What

Fixes three real defects in the dashboard-api / host-agent extension lifecycle state machine, plus a previously-unflagged activate-path gap discovered during code-reading:

  • Replaces the synthetic status="healthy" for unmonitored user extensions (extensions whose manifest declares no health: field) with a concurrent TCP probe so containers that aren't actually listening fall through to "stopped" instead of showing as green "enabled".
  • Surfaces previously-silent _post_install_core_recreate failures via a new warnings: list[str] field in the progress JSON.
  • Fixes the activate-from-disabled flow which previously left the frontend polling indefinitely on _call_agent("start") failure with no error progress and no terminal status.

Why

  • The extensions_catalog and extension_detail endpoints synthesized ServiceStatus(..., status="healthy") unconditionally for every user extension without a manifest health: field — purely because the directory existed on disk. Stopped/crashed containers showed as green "enabled" in the dashboard. Functional regression in the catalog → UI contract.
  • _post_install_core_recreate failures (e.g., openclaw post-install recreating open-webui to integrate the agent into the chat UI) only logged a warning and returned silently. Progress stayed at "started". Frontend showed openclaw as "enabled" even though the integration silently didn't register. No user feedback.
  • enable_extension's activate-from-disabled flow at line 1462+ had no _write_initial_progress and only set agent_ok = False on _call_agent("start") failure with no _write_error_progress. The frontend's pollProgress starts polling on any successful enable response, reads {"status": "idle"} from /api/extensions/<sid>/progress, and silently spins forever. No terminal toast, no error feedback, no UI state change.

How

Fix 1 — TCP probe replaces synthetic "healthy" (routers/extensions.py):

  • New async helper _probe_unmonitored_user_ext(host, port) using asyncio.wait_for(asyncio.open_connection(host, port), timeout=0.5).
  • Catches asyncio.TimeoutError, OSError (parent of ConnectionRefusedError, covers DNS/socket failures).
  • Writer cleanup wrapped in contextlib.suppress(OSError); layered defensively.
  • extensions_catalog and extension_detail synthesis blocks now fan out probes via asyncio.gather(..., return_exceptions=True). Total catalog cost stays at ~500ms regardless of unmonitored-ext count.
  • On connect-success → ServiceStatus(..., status="healthy", response_time_ms=<measured>). On connect-failure → ext dropped from services_by_id so _compute_extension_status falls through to "stopped".
  • Probe targets cfg.get("host") or "127.0.0.1" and the manifest-declared port (only what the user already installed).

Fix 2 — warnings field in progress JSON (bin/dream-host-agent.py):

  • _write_progress accepts a new warnings: list[str] | None = None kwarg. Pre-existing warnings on the file are preserved (read + filter for strings + append). Field is omitted entirely when empty for backwards-compatibility.
  • _post_install_core_recreate writes status="started" + warning text on docker_compose_recreate failure: "Installed, but post-install recreate of open-webui failed: <err>. Run 'dream restart' to retry.". openclaw itself remains running — status stays terminal.
  • Frontend doesn't read warnings yet; this is forward-compatible groundwork. A future PR will add a non-blocking toast for non-empty warnings arrays.

Fix 3 — Activate-path progress writes (routers/extensions.py):

  • Per-svc _write_initial_progress(svc_id) at the top of the activate loop (every enabled svc starts with a fresh "starting" record).
  • _write_error_progress(svc_id, "Host agent failed to start extension. Run 'dream restart' to recover.") on _call_agent("start") failure, mirroring the stopped-path pattern. continue so post_start doesn't fire on a failed start.
  • pre_start branch already wrote _write_error_progress on the base — preserved. post_start failures remain non-terminal (warning accumulation only).

Sanity check on #462 carry-forward: All three sub-bugs originally listed in fork issue #462 (stopped-path missing _write_error_progress, restart-guidance unification, and the unhealthy status return) were verified to already be in upstream/main via independent git diff upstream/main HEAD -- routers/extensions.py inspection (the difference was 34 lines of unrelated _call_agent exception broadening from the local staging branch). NO carry-forward action taken.

Testing

Pytest in the worktree:

  • test_unmonitored_user_ext_probe_refused_falls_through_to_stopped
  • test_unmonitored_user_ext_probe_success_reports_enabled
  • test_enable_disabled_writes_error_progress_on_agent_start_failure
  • test_recreate_failure_writes_warning_to_progress
  • Adjusted: test_recreate_failure_is_swallowed adds monkeypatch.setattr(_mod, "DATA_DIR", tmp_path) to prevent the new progress-write side-effect from leaking a file into the working tree.

Full suite: 702 passed, 13 failed. All 13 failures verified pre-existing on the base via stash-and-rerun (test_workflows.py × 9, test_gpu_detailed.py × 2, test_main.py × 1, test_extensions.py::TestCallAgentErrorNarrowing × 1 — the last one caused by the local-staging-only _call_agent exception broadening which will rebase out). None are in code paths touched by this PR.

make lint — passed.

Live smoke skipped — docker-compose v1 is not on this Mac and dashboard-api isn't currently listening; pytest mocks of asyncio.open_connection are the equivalent assertion.

Review

Critique Guardian: APPROVED. CG independently verified the ALREADY-IN-BASE claim for the originally-listed sub-bugs, the TCP probe semantics, the warnings kwarg backwards-compatibility, and the activate-path mirror of the stopped-path pattern.

Known Considerations

  • This PR is DRAFT and must merge after the in-flight async-cache-miss branch (touches the same extensions_catalog / extension_detail line range) and the in-flight one-shot-CLI-extensions branch (touches the host-agent state machine). Mechanical safeguard.
  • Branch base is the local staging branch with the full upstream-PR queue merged; the PR diff currently includes those changes too. After the prerequisite branches land, a rebase onto upstream/main will reduce the diff to just this PR's scope.
  • The frontend warnings consumer is forward-compatible groundwork — a future PR will add a non-blocking toast for non-empty warnings arrays in the progress JSON.
  • Potential SSRF surface: the TCP probe targets manifest-declared host:port. Bounded by 0.5s timeout, connect-only (no banner read, no payload), already-trusted boundary (user installed the extension). Worth hardening if community extension installation is ever exposed to less-trusted callers.
  • Activate-path success leaves polling running for the TTL — same UX issue exists in the stopped-path. Out of scope here; future cleanup.

Platform Impact

  • macOS (launchd-managed host-agent + Apple Silicon dashboard-api in Docker): no impact — asyncio.open_connection is uniform; _write_progress uses os.rename for atomic-write.
  • Linux (systemd host-agent + Docker dashboard-api): no impact — same.
  • Windows-WSL2 (systemd-in-WSL host-agent + Docker Desktop dashboard-api): no impact — same.

…gs; activate-path progress

Three real fixes plus a previously-unflagged activate-path gap:

1. routers/extensions.py — replace the unconditional "healthy" synthesis
   for unmonitored user extensions (those with no manifest health: field)
   with a concurrent async TCP probe (0.5s timeout per probe, fanned out
   via asyncio.gather). On connect-failure the ext is dropped from
   services_by_id so _compute_extension_status falls through to "stopped"
   instead of showing as green "enabled" for stopped/crashed containers.

2. bin/dream-host-agent.py — _post_install_core_recreate failures now
   surface via a new warnings: list[str] field in the progress JSON
   (status stays "started" since openclaw itself is running). _write_progress
   accepts a backwards-compatible warnings kwarg; pre-existing warnings
   are preserved on append; field is omitted when empty.

3. routers/extensions.py activate-from-disabled flow — per-svc
   _write_initial_progress at the top of the loop and _write_error_progress
   on _call_agent("start") failure, mirroring the stopped-path's restart
   guidance message. Previously the path silently spun the frontend's
   pollProgress on any "idle" status with no terminal feedback.

Sanity check: the three sub-bugs originally listed in the tracker were
independently verified to already be in upstream/main (the local staging
branch's _call_agent exception broadening accounted for the 34-line diff).
No carry-forward needed.

5 new + 1 adjusted test in test_extensions.py / test_host_agent.py;
702 passed, 13 pre-existing baseline failures preserved.

PR is DRAFT until the in-flight async-cache-miss + one-shot-CLI branches
land — both touch line-adjacent regions in the same functions.
@Lightheartdevs
Copy link
Copy Markdown
Collaborator

Merge-pass recheck: this remains draft/stacked, so I am not merging it yet even though the checks are green.

The feature set is useful, but the PR body says the branch includes local staging-queue changes and must land after the async cache-miss / one-shot extension state-machine work. Please rebase onto current main once those prerequisites are actually merged, reduce the diff to only this PR's scope, and then undraft for review. The warnings progress field and unmonitored-extension probe should get a close pass once the diff is clean.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants