fix(swarm): surface errors, output contract, Windows-safe store, redact paths #119
run_swarm / get_swarm_status / get_run_result and the in-process run_swarm tool each hand-maintained a field allowlist that omitted SwarmTask.error. A misconfigured provider returned status="failed" with no diagnosable reason, though the error was captured on disk the whole time. Add src/swarm/serialization.py as the single source of truth for the per-task projection (now includes error + iterations) plus a run-level error summary; route all three boundaries through it so the allowlist cannot silently drift again. Additive JSON change; the only shape normalization is _format_result missing-summary "" -> null. (cherry picked from commit 95f2717ac32bd02829f650bc971e3cd6dd7fffa6)
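A minimal sketch of the single-projection idea described above, assuming a simplified `SwarmTask` with hypothetical field names; the real `src/swarm/serialization.py` may differ. Every read boundary would call the same function, so a field cannot be added to one boundary and forgotten in another.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class SwarmTask:
    """Hypothetical stand-in for the real model; field names are assumed."""
    id: str
    status: str
    summary: Optional[str] = None
    error: Optional[str] = None
    iterations: int = 0


def serialize_task(task: SwarmTask) -> Dict[str, Any]:
    """The one place that decides which task fields cross a read boundary."""
    return {
        "id": task.id,
        "status": task.status,
        "summary": task.summary or None,  # normalize a missing "" to null
        "error": task.error,
        "iterations": task.iterations,
    }


def summarize_run_error(tasks: List[SwarmTask]) -> Optional[str]:
    """Run-level summary: one entry per task that recorded an error."""
    failures = [f"{t.id}: {t.error}" for t in tasks if t.error]
    return "; ".join(failures) if failures else None
```

With this shape, a failed task carries its reason in the `error` field and the run gets a one-line summary, while callers that ignore the new keys are unaffected.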
A worker that emitted only its plan ("Phase 1 — Plan"), mock data,
unparsed tool markup, or (for a data agent) made no tool call and
wrote no report was returned status="completed". runtime.py also
folded timeout/token_limit into completed. The run reported success
with a stub as final_report (P01); degraded workers were invisible
(P03).
Add WorkerStatus.incomplete and a hybrid output contract in worker.py:
content-sanity for every agent, tool-evidence only for data agents so
tool-less synthesis/editor roles are not false-rejected. Fail-fast to
incomplete (no corrective retry — left for a follow-up). runtime.py
only maps a true completed to TaskStatus.completed; anything else
fails the run, so final_report can no longer be a non-deliverable.
(cherry picked from commit 774b26824509678c953cd0200ef7a56c03f23201)
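The hybrid contract above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `worker.py`: the length threshold, the markup markers, and the function signature are all hypothetical, and mock-data detection is omitted.

```python
from enum import Enum
from typing import Optional


class WorkerStatus(str, Enum):
    completed = "completed"
    incomplete = "incomplete"


# Hypothetical threshold and markers, chosen only for illustration.
_MIN_REPORT_CHARS = 200
_TOOL_MARKUP = ("<tool_call", "<function_call")


def check_output_contract(
    report: Optional[str],
    *,
    is_data_agent: bool,
    tool_calls: int,
    wrote_report: bool,
) -> WorkerStatus:
    text = (report or "").strip()
    lowered = text.lower()

    # Content sanity: applied to every agent.
    if not text:
        return WorkerStatus.incomplete
    if lowered.startswith("phase 1") and len(text) < _MIN_REPORT_CHARS:
        return WorkerStatus.incomplete  # the worker stopped after its plan
    if any(marker in lowered for marker in _TOOL_MARKUP):
        return WorkerStatus.incomplete  # unparsed tool markup leaked out

    # Tool evidence: data agents only, so tool-less roles are not rejected.
    if is_data_agent and tool_calls == 0 and not wrote_report:
        return WorkerStatus.incomplete

    return WorkerStatus.completed


def task_succeeded(status: WorkerStatus) -> bool:
    """Only a true completed counts; everything else fails the run."""
    return status is WorkerStatus.completed
```

Keeping the tool-evidence check behind `is_data_agent` is what lets a synthesis or editor role pass with zero tool calls.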
The MCP run_swarm wrapper called start_run() without include_shell_tools, so stdio swarm workers silently lost bash and could not run the scripts their own execution rules mandate (P03-B). build_filtered_registry also dropped a requested-but-unavailable tool with no log, hiding the contradiction. Pass the server's shell-tool policy into start_run and warn on a dropped tool. Add swarm_runs_root() as the single source of truth for the run store location, shared by mcp_server and the path_utils run-dir allow-list so the two can no longer drift (P03-A: an installed-wheel layout had put every worker run_dir outside it). (cherry picked from commit 2975fda09b5ad44d15580d9330890f72f43c00ce)
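To illustrate the "warn on a dropped tool" half of this change, here is a small sketch. The signature of `build_filtered_registry` below is an assumption made for the example (a plain dict of available tools and a list of requested names); the real function likely takes different arguments.

```python
import logging
from typing import Any, Dict, Iterable

logger = logging.getLogger("swarm.tools")


def build_filtered_registry(
    available: Dict[str, Any], requested: Iterable[str]
) -> Dict[str, Any]:
    """Keep only requested tools, logging each one that cannot be provided."""
    registry: Dict[str, Any] = {}
    for name in requested:
        tool = available.get(name)
        if tool is None:
            # Previously this case was silent, which hid the contradiction
            # between a worker's execution rules and its actual tool set.
            logger.warning("requested tool %r is unavailable; dropping it", name)
            continue
        registry[name] = tool
    return registry
```

A worker that asks for `bash` while shell tools are disabled would still run, but the log now records why the tool is missing.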
SwarmStore._atomic_write did tmp.replace(target) with no retry. On Windows os.replace raises PermissionError WinError 5/32 when the target is concurrently open by load_run on the poll path, crashing the worker thread and leaving the run stuck pending (P13). POSIX os.replace is atomic and never raises these. Add a WinError-scoped, bounded retry around the rename and a matching transient read/parse retry in load_run (the other side of the same race). Non-transient errors re-raise immediately; off-Windows the loop runs once, so POSIX behavior is unchanged. Pre-existing, not introduced by the P01/P03/P04 fixes — surfaced by E2E dogfooding. (cherry picked from commit 46a779cf436353591e7840879f8c32f763dde826)
Remove an unused `from pathlib import Path` in the PR03 test added by 2975fda, and the pre-existing unused `from src.swarm.mailbox import Mailbox` in runtime.py (file already in this branch's diff). Net ruff delta for PR1-PR4 is now -1. The remaining mcp_server.py:41 E402 is pre-existing on upstream and structural — left out of scope. (cherry picked from commit e96f56af1ead9b5b7058c90ec0d4df738b45e473)
Unknown preset / workspace-escape / missing run-or-task dir errors embedded the full install path (OS username, .venvs/site-packages topology) — CWE-209/497, reachable via MCP/API/agent/CLI (P10). Drop the absolute path while keeping the actionable bits: preset name + Available list, the workspace-root boundary, and the logical run/task id (basename only). Two test_path_safety assertions that matched the old leaking message are updated to the redacted wording (behavior unchanged — still raises and rejects). Verified-clean sites (caller's own input / env-var / names list) left untouched. (cherry picked from commit 69ea45295410cd25ea3079c25b7e3c4c73390b80)
Add agent/src/tools/redaction.py: anchored prefix redaction (no regex, idempotent, None-safe) that hides only known-internal root prefixes while keeping the relative tail for diagnosability. Wire it through the swarm error read boundaries (serialization, runtime, store) and the file tools' final broad except so a leaked absolute path cannot reach a user-facing string (CWE-209/497, P10). ValueError branches and the protected providers/llm.py are left untouched. (cherry picked from commit 21dc9cdfacf17cccd31cacfbe34481f62f1390af)
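The prefix-redaction approach can be sketched as below. This assumes the caller passes the known-internal roots explicitly and that the placeholder is `<internal>`; both are choices made for the example, and the real `redaction.py` discovers its roots itself.

```python
from typing import Optional, Sequence

PLACEHOLDER = "<internal>"  # hypothetical marker text


def redact_internal_paths(text: Optional[str], roots: Sequence[str]) -> Optional[str]:
    """Replace known-internal root prefixes, keeping the relative tail."""
    if text is None:
        return None  # None-safe: callers can pass an absent error through
    # Longest root first, so a nested root is hidden before its parent
    # and as little of the tail as possible is exposed.
    for root in sorted(roots, key=len, reverse=True):
        if not root:
            continue  # replacing "" would insert the placeholder everywhere
        # Plain substring replacement, no regex. A stricter version would
        # also check that the match ends at a path separator.
        text = text.replace(root, PLACEHOLDER)
    return text
```

For example, with the root `/home/alice/.venvs/app/lib/site-packages/agent`, the message `Unknown preset 'x' in /home/alice/.venvs/app/lib/site-packages/agent/presets` becomes `Unknown preset 'x' in <internal>/presets`. Running the function again on its own output changes nothing, since no root survives the first pass.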
Thanks, will review this week.
# Conflicts: # agent/mcp_server.py
Merged after a quick origin/main merge into the branch (purely style conflicts in Three things worth calling out:
The Windows-vs-POSIX split in
_internal_roots resolves 8 candidate paths + builds 3 separator variants + sorts on every redact_internal_paths call. On the worker hot path this runs once per task error / event / file-tool failure. Result is process-stable (paths don't change without a restart), so @cache it. Follow-up to #119.
The substring head-match (a) false-positived on nested status fields and (b) false-negated when the error envelope sat past the 160-char head. Parse the result as JSON and check only the top-level status field; fall back to the original substring scan when the payload is truncated / unparseable so the classifier never raises on the worker hot path. Adds 7 unit tests pinning the new behavior — nested-status non-match, past-head match, truncated fallback, non-error statuses (degenerate / warning) correctly classified as non-error. Follow-up to #119.
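A minimal sketch of the parse-then-fall-back classifier described above. The function name, the `"error"` status value, and the exact substring used in the fallback are assumptions for illustration; only the 160-character head comes from the commit message.

```python
import json

_HEAD_CHARS = 160  # head length mentioned in the commit message


def is_error_result(result: str) -> bool:
    """Classify a tool result as an error using only its top-level status."""
    try:
        payload = json.loads(result)
    except (TypeError, ValueError):
        # Truncated or non-JSON payload: fall back to a substring scan of
        # the head so the classifier never raises on the hot path.
        if not isinstance(result, str):
            return False
        head = result[:_HEAD_CHARS].replace(" ", "")
        return '"status":"error"' in head
    if isinstance(payload, dict):
        return payload.get("status") == "error"
    return False  # valid JSON but not an object, e.g. a list or a number
```

A payload such as `{"status": "ok", "data": {"status": "error"}}` is no longer misread as an error, and an error envelope whose `status` key sits after a long first field is still detected.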
…ct paths (HKUDS#119)
* fix(swarm): surface task error at read boundaries
* fix(swarm): enforce worker output contract
* fix(swarm): thread shell tools + unify runs root
* fix(swarm): make store atomic write Windows-safe
* style(swarm): drop unused imports (ruff F401)
* fix: redact internal absolute paths in errors
* fix(security): redact internal paths in errors
* fix(security): redact swarm event error payloads
Co-authored-by: Haozhe Wu <haozhe_wu@connect.hku.hk>
Summary
* `completed` status.
* `SwarmStore` atomic write/read Windows-safe (WinError-scoped bounded retry; POSIX behavior unchanged).
* `src/tools/redaction.py` (anchored, idempotent, no regex) and redact internal absolute paths from user-facing errors and events (CWE-209/497).
Why
A run could report `completed` with no grounded deliverable; raised errors and `task_failed` / `run_error` event payloads leaked the OS user and venv/install topology; on Windows a concurrent `os.replace` race crashed the swarm.
Changes
`swarm/{serialization,runtime,store,worker,models}`, `tools/{redaction,read_file,write_file,edit_file}`
Test Plan
`pytest --ignore=agent/tests/e2e_backtest -q` -> 1080 passed, 1 skipped
Checklist
`src/agent|session|providers`)
Note: follow-up will harden `_classify_deliverable` / `_is_error_result` (substring -> JSON parse); out of scope here.