fix(extensions): probe unmonitored user exts; surface recreate warnings; activate-path progress#1104
Draft
yasinBursali wants to merge 1 commit intoLight-Heart-Labs:mainfrom
Draft
Conversation
…gs; activate-path progress
Three real fixes plus a previously-unflagged activate-path gap:
1. routers/extensions.py — replace the unconditional "healthy" synthesis
for unmonitored user extensions (those with no manifest health: field)
with a concurrent async TCP probe (0.5s timeout per probe, fanned out
via asyncio.gather). On connect-failure the ext is dropped from
services_by_id so _compute_extension_status falls through to "stopped"
instead of showing as green "enabled" for stopped/crashed containers.
2. bin/dream-host-agent.py — _post_install_core_recreate failures now
surface via a new warnings: list[str] field in the progress JSON
(status stays "started" since openclaw itself is running). _write_progress
accepts a backwards-compatible warnings kwarg; pre-existing warnings
are preserved on append; field is omitted when empty.
3. routers/extensions.py activate-from-disabled flow — per-svc
_write_initial_progress at the top of the loop and _write_error_progress
on _call_agent("start") failure, mirroring the stopped-path's restart
guidance message. Previously the path silently spun the frontend's
pollProgress on any "idle" status with no terminal feedback.
Sanity check: the three sub-bugs originally listed in the tracker were
independently verified to already be in upstream/main (the local staging
branch's _call_agent exception broadening accounted for the 34-line diff).
No carry-forward needed.
5 new + 1 adjusted test in test_extensions.py / test_host_agent.py;
702 passed, 13 pre-existing baseline failures preserved.
PR is DRAFT until the in-flight async-cache-miss + one-shot-CLI branches
land — both touch line-adjacent regions in the same functions.
8d4f651 to
7d49893
Compare
Collaborator
|
Merge-pass recheck: this remains draft/stacked, so I am not merging it yet even though the checks are green. The feature set is useful, but the PR body says the branch includes local staging-queue changes and must land after the async cache-miss / one-shot extension state-machine work. Please rebase onto current |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Fixes three real defects in the dashboard-api / host-agent extension lifecycle state machine, plus a previously-unflagged activate-path gap discovered during code-reading:
status="healthy"for unmonitored user extensions (extensions whose manifest declares nohealth:field) with a concurrent TCP probe so containers that aren't actually listening fall through to"stopped"instead of showing as green "enabled"._post_install_core_recreatefailures via a newwarnings: list[str]field in the progress JSON._call_agent("start")failure with no error progress and no terminal status.Why
extensions_catalogandextension_detailendpoints synthesizedServiceStatus(..., status="healthy")unconditionally for every user extension without a manifesthealth:field — purely because the directory existed on disk. Stopped/crashed containers showed as green "enabled" in the dashboard. Functional regression in the catalog → UI contract._post_install_core_recreatefailures (e.g., openclaw post-install recreatingopen-webuito integrate the agent into the chat UI) only logged a warning and returned silently. Progress stayed at"started". Frontend showed openclaw as "enabled" even though the integration silently didn't register. No user feedback.enable_extension's activate-from-disabled flow at line 1462+ had no_write_initial_progressand only setagent_ok = Falseon_call_agent("start")failure with no_write_error_progress. The frontend'spollProgressstarts polling on any successful enable response, reads{"status": "idle"}from/api/extensions/<sid>/progress, and silently spins forever. No terminal toast, no error feedback, no UI state change.How
Fix 1 — TCP probe replaces synthetic
"healthy"(routers/extensions.py):_probe_unmonitored_user_ext(host, port)usingasyncio.wait_for(asyncio.open_connection(host, port), timeout=0.5).asyncio.TimeoutError,OSError(parent ofConnectionRefusedError, covers DNS/socket failures).contextlib.suppress(OSError); layered defensively.extensions_catalogandextension_detailsynthesis blocks now fan out probes viaasyncio.gather(..., return_exceptions=True). Total catalog cost stays at ~500ms regardless of unmonitored-ext count.ServiceStatus(..., status="healthy", response_time_ms=<measured>). On connect-failure → ext dropped fromservices_by_idso_compute_extension_statusfalls through to"stopped".cfg.get("host") or "127.0.0.1"and the manifest-declared port (only what the user already installed).Fix 2 —
warningsfield in progress JSON (bin/dream-host-agent.py):_write_progressaccepts a newwarnings: list[str] | None = Nonekwarg. Pre-existing warnings on the file are preserved (read + filter for strings + append). Field is omitted entirely when empty for backwards-compatibility._post_install_core_recreatewritesstatus="started"+ warning text ondocker_compose_recreatefailure:"Installed, but post-install recreate of open-webui failed: <err>. Run 'dream restart' to retry.". openclaw itself remains running — status stays terminal.warningsyet; this is forward-compatible groundwork. A future PR will add a non-blocking toast for non-empty warnings arrays.Fix 3 — Activate-path progress writes (
routers/extensions.py):_write_initial_progress(svc_id)at the top of the activate loop (every enabled svc starts with a fresh "starting" record)._write_error_progress(svc_id, "Host agent failed to start extension. Run 'dream restart' to recover.")on_call_agent("start")failure, mirroring the stopped-path pattern.continueso post_start doesn't fire on a failed start._write_error_progresson the base — preserved. post_start failures remain non-terminal (warning accumulation only).Sanity check on #462 carry-forward: All three sub-bugs originally listed in fork issue #462 (stopped-path missing
_write_error_progress, restart-guidance unification, and the unhealthy status return) were verified to already be inupstream/mainvia independentgit diff upstream/main HEAD -- routers/extensions.pyinspection (the difference was 34 lines of unrelated_call_agentexception broadening from the local staging branch). NO carry-forward action taken.Testing
Pytest in the worktree:
test_unmonitored_user_ext_probe_refused_falls_through_to_stopped✓test_unmonitored_user_ext_probe_success_reports_enabled✓test_enable_disabled_writes_error_progress_on_agent_start_failure✓test_recreate_failure_writes_warning_to_progress✓test_recreate_failure_is_swallowedaddsmonkeypatch.setattr(_mod, "DATA_DIR", tmp_path)to prevent the new progress-write side-effect from leaking a file into the working tree.Full suite: 702 passed, 13 failed. All 13 failures verified pre-existing on the base via stash-and-rerun (test_workflows.py × 9, test_gpu_detailed.py × 2, test_main.py × 1, test_extensions.py::TestCallAgentErrorNarrowing × 1 — the last one caused by the local-staging-only
_call_agentexception broadening which will rebase out). None are in code paths touched by this PR.make lint— passed.Live smoke skipped —
docker-compose v1is not on this Mac and dashboard-api isn't currently listening; pytest mocks ofasyncio.open_connectionare the equivalent assertion.Review
Critique Guardian: APPROVED. CG independently verified the ALREADY-IN-BASE claim for the originally-listed sub-bugs, the TCP probe semantics, the
warningskwarg backwards-compatibility, and the activate-path mirror of the stopped-path pattern.Known Considerations
extensions_catalog/extension_detailline range) and the in-flight one-shot-CLI-extensions branch (touches the host-agent state machine). Mechanical safeguard.upstream/mainwill reduce the diff to just this PR's scope.warningsconsumer is forward-compatible groundwork — a future PR will add a non-blocking toast for non-emptywarningsarrays in the progress JSON.host:port. Bounded by 0.5s timeout, connect-only (no banner read, no payload), already-trusted boundary (user installed the extension). Worth hardening if community extension installation is ever exposed to less-trusted callers.Platform Impact
asyncio.open_connectionis uniform;_write_progressusesos.renamefor atomic-write.