This repository was archived by the owner on May 30, 2026. It is now read-only.

Commit fc79b1f

rc(runtime): harden subagent quality controls

Authored and committed by Ouroboros. Parent commit: 94b0202.

53 files changed, with 2440 additions and 178 deletions.

Some content is hidden: large commits hide part of the diff by default, so only a subset of the 53 changed files is shown below.

.github/workflows/ci.yml

Lines changed: 1 addition & 1 deletion
@@ -178,7 +178,7 @@ jobs:
 - name: Run host UI smoke
 env:
 OUROBOROS_RUN_UI_SMOKE: "1"
-run: python -m pytest tests/test_ui_smoke_playwright.py -m ui_browser -q --tb=short
+run: python -m pytest tests/ -m ui_browser -q --tb=short
 - name: Run browser tools Chromium smoke
 run: python -m pytest tests/test_browser_tools_smoke.py -m browser -q --tb=short

README.md

Lines changed: 4 additions & 1 deletion
@@ -6,7 +6,7 @@
 [![macOS 12+](https://img.shields.io/badge/macOS-12%2B-black.svg)](https://github.com/joi-lab/ouroboros-desktop/releases)
 [![Linux](https://img.shields.io/badge/Linux-x86__64-orange.svg)](https://github.com/joi-lab/ouroboros-desktop/releases)
 [![Windows](https://img.shields.io/badge/Windows-x64-blue.svg)](https://github.com/joi-lab/ouroboros-desktop/releases)
-[![Version 6.0.0](https://img.shields.io/badge/version-6.0.0-green.svg)](VERSION)
+[![Version 6.1.0-rc.1](https://img.shields.io/badge/version-6.1.0--rc.1-green.svg)](VERSION)
 
 A self-modifying AI agent that writes its own code, rewrites its own mind, and evolves autonomously. Born February 16, 2026.
 
@@ -31,6 +31,8 @@ Not a coding assistant. A digital being with a constitution, background consciou
 | **Linux** x86_64 | [Ouroboros-linux.tar.gz](https://github.com/joi-lab/ouroboros-desktop/releases/latest) | Extract → run `./Ouroboros/Ouroboros` → optional CLI: `./Ouroboros/bin/install-ouroboros-cli`. If browser tools fail due to missing system libs, run: `./Ouroboros/python-standalone/bin/python3 -m playwright install-deps chromium` |
 | **Windows** x64 | [Ouroboros-windows.zip](https://github.com/joi-lab/ouroboros-desktop/releases/latest) | Extract → run `Ouroboros\Ouroboros.exe` → optional CLI: `Ouroboros\bin\install-ouroboros-cli.cmd` |
 
+Prerelease RC artifacts are published on their tag page, for example [`v6.1.0-rc.1`](https://github.com/joi-lab/ouroboros-desktop/releases/tag/v6.1.0-rc.1); `/releases/latest` intentionally stays on the latest stable release.
+
 <p align="center">
 <img src="assets/setup.png" width="500" alt="Drag Ouroboros.app to install">
 </p>
@@ -473,6 +475,7 @@ not paraphrase it.
 
 | Version | Date | Description |
 |---------|------|-------------|
+| 6.1.0-rc.1 | 2026-05-25 | **rc(runtime): harden live subagent handoff, isolation, and UI lineage.** Adds effective task-status SSOT, real bounded wait tools including `wait_for_tasks`, forged subagent ingress rejection, strict local-readonly constraints, DNS fail-closed browser isolation, child-drive mailbox routing/retention, web_search source attribution, lineage-aware cost observability, threaded child cards, and focused regressions. |
 | 6.0.0 | 2026-05-25 | **major(runtime): add live local-readonly subagents.** Upgrades `schedule_task` to a strict child-task contract, runs leaf subagents through the existing queue and workers with forked memory by default, enforces schema and execute-time local-readonly isolation, preserves full task-result handoff, and documents the delegation review rules. |
 | 5.33.0-rc.6 | 2026-05-24 | **rc(gateway): prevent masking upload connection/parse faults as size-limit errors.** Introduces a typed ChatUploadPayloadTooLarge exception class to isolate file-size 413 blocks from connection cuts and form-parse faults, returning a standard 400 with original message for ASGI/socket errors. Includes focused test coverage. |
 | 5.33.0-rc.5 | 2026-05-24 | **rc(gateway): prevent masking upload connection/parse faults as size-limit errors.** Refactors the chat upload ASGI stream wrapper to verify if caught exceptions are indeed the 'oversized' signal before returning a 413, returning a 400 with the original error message for connection cuts and malformed formats. |

VERSION

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-6.0.0
+6.1.0-rc.1

docs/ARCHITECTURE.md

Lines changed: 36 additions & 13 deletions
@@ -1,4 +1,4 @@
-# Ouroboros v6.0.0 — Architecture & Reference
+# Ouroboros v6.1.0-rc.1 — Architecture & Reference
 
 This file is NOT a changelog. Version history lives in README.md, git tags, and commit log.
 
@@ -86,6 +86,7 @@ server.py (Starlette+uvicorn) ← HTTP + WebSocket on configurable host:port (de
 ├── server_web.py ← Static web file helpers (NoCacheStaticFiles, web dir resolver)
 ├── task_continuation.py ← Durable per-task review continuation state across restart/outage
 ├── task_results.py ← Durable task result/status files (task_results/<id>.json)
+├── task_status.py ← Effective task-status SSOT: child-drive result merge, lineage lookup, bounded waits
 ├── tool_capabilities.py ← SSOT for tool sets (core, parallel-safe, truncation, browser)
 ├── tool_policy.py ← Tool access policy and gating (imports from tool_capabilities)
 ├── utils.py ← Shared utilities; v5.8.3-rc.2 SSOT for JSON atomic writes/reads, UTC timestamps, hashes, log sanitization, and subprocess helpers
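The new `task_status.py` entry lists bounded waits among its duties. Below is a minimal Python sketch of a bounded wait. It assumes a caller-supplied `load_status(task_id)` function that returns a result dict (or `None`), and it assumes `completed`/`failed`/`cancelled` as the terminal status names. Every name here is hypothetical, not the module's real API. The defaults mirror the documented 180 s (`wait_for_task`) and 600 s (`wait_for_tasks`) limits. The clock and sleep functions are injectable, so the loop can be tested without real waiting.

```python
import time
from typing import Callable, Dict, Iterable, Optional

# Assumed terminal status vocabulary for this sketch.
TERMINAL = {"completed", "failed", "cancelled"}

LoadStatus = Callable[[str], Optional[dict]]


def _is_terminal(result: Optional[dict]) -> bool:
    return result is not None and result.get("status") in TERMINAL


def wait_for_task(load_status: LoadStatus, task_id: str, timeout_s: float = 180.0,
                  poll_s: float = 0.5, clock: Callable[[], float] = time.monotonic,
                  sleep: Callable[[float], None] = time.sleep) -> dict:
    """Poll one task until it is terminal or the deadline passes; never waits forever."""
    deadline = clock() + timeout_s
    while True:
        result = load_status(task_id)
        if _is_terminal(result):
            return {"task_id": task_id, "timed_out": False, "result": result}
        remaining = deadline - clock()
        if remaining <= 0:
            return {"task_id": task_id, "timed_out": True, "result": result}
        sleep(min(poll_s, remaining))


def wait_for_tasks(load_status: LoadStatus, task_ids: Iterable[str], timeout_s: float = 600.0,
                   poll_s: float = 0.5, clock: Callable[[], float] = time.monotonic,
                   sleep: Callable[[float], None] = time.sleep) -> Dict[str, dict]:
    """Batch wait sharing one deadline; each child's full result dict is returned as-is."""
    deadline = clock() + timeout_s
    pending = list(task_ids)
    results: Dict[str, dict] = {}
    while True:
        for task_id in list(pending):
            result = load_status(task_id)
            if _is_terminal(result):
                results[task_id] = {"timed_out": False, "result": result}
                pending.remove(task_id)
        remaining = deadline - clock()
        if not pending or remaining <= 0:
            # Children still pending at the deadline are reported, not dropped.
            for task_id in pending:
                results[task_id] = {"timed_out": True, "result": load_status(task_id)}
            return results
        sleep(min(poll_s, remaining))
```

The sketch returns the whole result dict rather than a summary. This matches the rule that wait results stay untruncated.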
@@ -232,8 +233,10 @@ sudo must be noninteractive (`sudo -n`) and password-prompting sudo is blocked.
 Headless memory isolation is implemented as a per-task child drive under
 `data/state/headless_tasks/<task_id>/data`. `forked` mode copies stable memory
 seed files (`identity.md`, `WORLD.md`, `registry.md`, and `knowledge/`) without
-dialogue/task history; `empty` mode starts from a fresh child drive; `shared` is
-reserved for self/local tasks and is rejected for external workspaces. External
+dialogue/task history; `empty` mode starts from a fresh child drive; live
+`shared` mode is disabled for subagents and external workspace tasks until a
+sanitized shared-context v2 exists. Ordinary local root tasks may still use the
+parent drive directly when no external workspace isolation is requested. External
 runs produce explicit artifacts under `data/task_results/artifacts/<task_id>/`:
 `workspace_preflight.json`, `workspace_patch.json`, and `memory_export.json`;
 successful patch finalization also produces `workspace.patch`, while failed
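The forked/empty child-drive rules in this hunk can be illustrated with a small Python sketch. Three things come from the text: the seed list, the `data/state/headless_tasks/<task_id>/data` location (with `state_root` standing for `data/state`), and the refusal of live `shared` mode. The rest is assumed. The sketch places the seed files at the top level of the parent drive, which this diff does not show. `prepare_child_drive` is a hypothetical helper.

```python
import shutil
from pathlib import Path

# Seed set documented for `forked` mode; dialogue and task history are never copied.
SEED_FILES = ("identity.md", "WORLD.md", "registry.md")
SEED_DIRS = ("knowledge",)
ISOLATED_MODES = {"forked", "empty"}  # live `shared` is rejected for isolated tasks


def prepare_child_drive(parent_drive: Path, state_root: Path, task_id: str,
                        memory_mode: str = "forked") -> Path:
    """Create <state_root>/headless_tasks/<task_id>/data and seed it (sketch)."""
    if memory_mode not in ISOLATED_MODES:
        raise ValueError(f"memory_mode={memory_mode!r} is not allowed for isolated child tasks")
    child = state_root / "headless_tasks" / task_id / "data"
    # Sketch choice: fail rather than silently reuse an existing child drive.
    child.mkdir(parents=True, exist_ok=False)
    if memory_mode == "empty":
        return child
    for name in SEED_FILES:
        source = parent_drive / name
        if source.is_file():
            shutil.copy2(source, child / name)
    for name in SEED_DIRS:
        source = parent_drive / name
        if source.is_dir():
            shutil.copytree(source, child / name)
    return child
```

Nothing is merged back by this helper. Per the text, only the child's task result returns to the parent drive.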
@@ -249,7 +252,15 @@ as a child task, and an existing worker executes it. There is no separate
 scheduler, dashboard, endpoint, or settings surface. Child lineage is inferred
 from the active `ToolContext` and persisted as `parent_task_id`, `root_task_id`,
 `session_id`, `actor_id`, `delegation_role`, `role`, `memory_mode`,
-`drive_root`, `budget_drive_root`, and `task_constraint`.
+`drive_root`, `child_drive_root`, `budget_drive_root`, and `task_constraint`.
+`task_status.py` is the effective-status SSOT for gateway and tool reads: a
+child terminal result overrides a stale parent `requested`/`scheduled`/`running`
+result, while authoritative parent terminal failures/cancellations stay
+authoritative. Workspace artifact tasks stay nonterminal while
+`artifact_status` is `pending`/`finalizing`; only `ready`/`failed` artifact
+states make the effective workspace result terminal. `wait_for_task` performs a
+bounded wait (default 180s), and `wait_for_tasks` performs batch waits (default
+600s) with full per-child result, trace, and cost output preserved untruncated.
 
 Live subagents run with deterministic
 `task_constraint.mode="local_readonly_subagent"`. The registry filters their
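The effective-status rule described in this hunk can be written as a small pure function. Below is a hedged Python sketch, not `task_status.py` itself. The status vocabulary and the choice to keep a parent `completed` record over a child result are assumptions, because the text only fixes the stale-parent and failure/cancellation cases. `effective_task_result` borrows the helper name the checklist mentions, but its signature here is invented.

```python
from typing import Optional

# Assumed status vocabulary; the diff names requested/scheduled/running,
# failures/cancellations, and the artifact states.
STALE_PARENT = {"requested", "scheduled", "running"}
AUTHORITATIVE_PARENT = {"failed", "cancelled"}
TERMINAL = {"completed", "failed", "cancelled"}
ARTIFACT_DONE = {"ready", "failed"}  # pending/finalizing keep the task nonterminal


def is_effectively_terminal(result: Optional[dict]) -> bool:
    """Terminal status AND, for workspace tasks, a finished artifact state."""
    if result is None or result.get("status") not in TERMINAL:
        return False
    artifact_status = result.get("artifact_status")
    return artifact_status is None or artifact_status in ARTIFACT_DONE


def effective_task_result(parent: Optional[dict], child: Optional[dict]) -> Optional[dict]:
    """Merge the parent-drive record with the child-drive result (sketch)."""
    if parent is None:
        chosen = child
    elif parent.get("status") in AUTHORITATIVE_PARENT:
        chosen = parent  # parent-side failure/cancellation always wins
    elif parent.get("status") in STALE_PARENT and is_effectively_terminal(child):
        chosen = child  # finished child overrides a stale parent record
    else:
        chosen = parent
    if chosen is None:
        return None
    return {**chosen, "effective_terminal": is_effectively_terminal(chosen)}
```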
@@ -261,21 +272,33 @@ is unchanged for normal tasks, but subagents additionally deny known
 secret/control files such as `settings.json`, token/credential/key files, and
 secret-like owner-state paths. Browser tools remain available for remote-page
 inspection, but subagents fail closed instead of auto-installing browser
-dependencies and cannot browse or act on loopback/local or non-HTTP URLs, make
-browser subrequests to loopback/local URLs, or run arbitrary browser JavaScript.
+dependencies and cannot browse or act on non-HTTP(S), loopback, private,
+link-local, reserved, or unresolved hosts. The guard checks literal IPs and DNS
+results before navigation, after redirects, and in route handlers, so hostnames
+resolving to blocked addresses are denied. This is a URL/DNS-layer guard, not a
+connect-time proxy; hostile DNS rebinding would need a future resolver-pinning
+or proxy design if stronger network isolation is required. Subagents also
+cannot run arbitrary browser JavaScript.
 
 `memory_mode=forked` is the default and uses the same child-drive mechanism as
 headless workspaces: copy stable memory seed files only (`identity.md`,
 `WORLD.md`, `registry.md`, `knowledge/`) into
 `data/state/headless_tasks/<task_id>/data`, without dialogue history, scratchpad
 blocks, task history, or auto-merge. `empty` creates a blank child drive.
-`shared` keeps the parent drive and should be used only when shared local state
-is an explicit parent decision. On completion, only the child task result is
-copied back to the parent drive; identity, scratchpad, registry, knowledge,
-dialogue blocks, and `memory_export` are never merged or exported
+`shared` is rejected for live local subagents and external workspace tasks; a
+future sanitized shared mode must be designed separately. On completion, only
+the child task result is copied back to the parent drive; identity, scratchpad,
+registry, knowledge, dialogue blocks, and `memory_export` are never merged or exported
 automatically. v1 subagents are leaf workers: the schema and execute-time gate
 hide and block `schedule_task`, while the supervisor keeps a structural depth
-cap of 2 and a maximum of 3 active child tasks per `root_task_id`.
+cap of 2 and a maximum of 3 active child tasks per `root_task_id`. External
+`/api/tasks` and CLI `run` requests may not forge
+`delegation_role=subagent` or parent/root lineage; only the internal
+`schedule_task` event path can create live subagents. Startup performs a
+best-effort prune of terminal copied
+back child drives under `state/headless_tasks/` after the retention window
+(default 7 days, env/settings override), and skips nonterminal or artifact
+finalization states.
 
 ### Two-process model
 
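The browser guard described in this hunk reduces to a per-URL predicate that could run before `goto`, after each redirect, and inside a route handler. Below is a minimal Python sketch built on the standard `ipaddress`, `socket`, and `urllib.parse` modules. `is_url_allowed`, `is_blocked_ip`, and the injectable `resolve` function are hypothetical names, not the project's code. As the text notes, a check like this runs before the connection is made, so it does not defend against DNS rebinding between resolution and connect.

```python
import ipaddress
import socket
from typing import Callable, Iterable
from urllib.parse import urlsplit

Resolver = Callable[[str], Iterable[str]]


def system_resolver(host: str) -> list:
    """Resolve with the OS resolver; sockaddr[0] of each getaddrinfo entry is the IP."""
    return [info[4][0] for info in socket.getaddrinfo(host, None)]


def is_blocked_ip(address: str) -> bool:
    """True for the classes the text lists: loopback, private, link-local, reserved,
    unspecified. Raises ValueError if `address` is not an IP literal."""
    ip = ipaddress.ip_address(address.split("%", 1)[0])  # drop an IPv6 zone id
    if isinstance(ip, ipaddress.IPv6Address) and ip.ipv4_mapped is not None:
        ip = ip.ipv4_mapped  # judge ::ffff:127.0.0.1 as 127.0.0.1
    return (ip.is_loopback or ip.is_private or ip.is_link_local
            or ip.is_reserved or ip.is_unspecified)


def is_url_allowed(url: str, resolve: Resolver = system_resolver) -> bool:
    """Fail closed: anything not provably a public HTTP(S) host is denied."""
    try:
        parts = urlsplit(url)
        host = parts.hostname
    except ValueError:
        return False  # malformed URL, e.g. an unclosed IPv6 bracket
    if parts.scheme not in ("http", "https") or not host:
        return False
    try:
        return not is_blocked_ip(host)  # literal IP: no DNS lookup needed
    except ValueError:
        pass  # not an IP literal, so resolve it
    try:
        addresses = list(resolve(host))
    except (OSError, UnicodeError):
        return False  # unresolved host: deny
    # Deny if ANY resolved address is blocked, not just the first one.
    return bool(addresses) and not any(is_blocked_ip(a) for a in addresses)
```

Checking every resolved address matters because a hostname with one public and one loopback record would otherwise slip through.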
@@ -411,7 +434,7 @@ Shown when `settings.json` does not contain any supported remote provider key an
 Web onboarding uses `/api/claude-code/status` and `/api/claude-code/install`.
 - The wizard blocks progression if nothing runnable is configured.
 - When OpenRouter is absent and official OpenAI is the only configured remote runtime, untouched default model values are auto-remapped to `openai::gpt-5.5` / `openai::gpt-5.5-mini` so first-run startup does not strand the app on OpenRouter-only defaults.
-- `web_search` uses the official OpenAI Responses API only. It requires `OPENAI_API_KEY` and treats any non-empty `OPENAI_BASE_URL` as an incompatible custom runtime configuration rather than a fallback.
+- `web_search` uses the official OpenAI Responses API only. It requires `OPENAI_API_KEY` and treats any non-empty `OPENAI_BASE_URL` as an incompatible custom runtime configuration rather than a fallback. Results are JSON with `answer` and `sources[]` when citation annotations are available; usage events include task/root/parent/delegation attribution and `source=web_search`.
 - When Cloud.ru is the only configured remote runtime, first-run model defaults use explicit `cloudru::...` IDs from `provider_models.CLOUDRU_DIRECT_DEFAULTS`; OpenAI-compatible remains an explicit model-selection flow from the full Settings page because there is no single safe universal default model ID for arbitrary compatible endpoints.
 - Closing the wizard without saving is non-fatal: the main app still launches and the user can finish configuration in Settings.
 
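Below is a hedged Python sketch of the documented `web_search` output shape: JSON with an `answer` field plus `sources[]` built from citation annotations, and a usage event that carries lineage attribution and `source=web_search`. The annotation dicts are assumed to resemble Responses API `url_citation` annotations (`type`, `url`, `title`). Several pieces are illustrative assumptions:

- the function names;
- omitting `sources` entirely when no URL citations exist;
- the `cost_usd` field.

```python
import json
from typing import Iterable, Optional


def build_web_search_result(answer: str, annotations: Iterable[dict]) -> str:
    """Return JSON with `answer` and, when URL citations exist, `sources[]`."""
    sources, seen = [], set()
    for ann in annotations:
        url = ann.get("url")
        if ann.get("type") != "url_citation" or not url or url in seen:
            continue  # skip non-URL citations and duplicate URLs
        seen.add(url)
        sources.append({"url": url, "title": ann.get("title", "")})
    payload = {"answer": answer}
    if sources:
        payload["sources"] = sources
    return json.dumps(payload, ensure_ascii=False)


def web_search_usage_event(ctx: dict, cost_usd: Optional[float]) -> dict:
    """Usage event with lineage attribution; `ctx` stands in for the active ToolContext."""
    keys = ("task_id", "root_task_id", "parent_task_id", "delegation_role")
    event = {key: ctx.get(key) for key in keys}
    event["source"] = "web_search"
    event["cost_usd"] = cost_usd  # hypothetical field name
    return event
```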
@@ -514,7 +537,7 @@ Rationale: frontend work should not require understanding supervisor, worker, ma
 
 ### Chat
 
-`web/modules/chat.js` owns the message timeline, input, attachment staging, input recall, budget pill, runtime controls, and live task card. It loads persisted history from `/api/chat/history`, merges echoed local messages by `client_message_id`, and collapses task/progress/tool chatter into expandable cards rather than transcript spam. Mobile keyboard handling lives in `web/app.js` + CSS `keyboard-open` classes so only the message pane scrolls while the visual viewport changes.
+`web/modules/chat.js` owns the message timeline, input, attachment staging, input recall, budget pill, runtime controls, and live task cards. It loads persisted history from `/api/chat/history`, merges echoed local messages by `client_message_id`, and collapses task/progress/tool chatter into expandable cards rather than transcript spam. Subagent progress uses separate child cards keyed by `subagent_task_id`/`task_id`; parent cards receive lineage references (`parent_task_id`, `root_task_id`, child id, role) without duplicating child bubbles on reload/reconnect. Mobile keyboard handling lives in `web/app.js` + CSS `keyboard-open` classes so only the message pane scrolls while the visual viewport changes.
 
 History sync is intentionally two-pass: progress/system entries are replayed first to build live-card timelines, then regular user/assistant messages call `finishLiveCard`. This prevents `taskState.completed` from being set before progress events apply, which previously discarded thinking-bubble/live-card state.
 
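The card-threading rule can be shown as a Python sketch of the keying logic only. The real implementation is JavaScript in `web/modules/chat.js`, and `Card`, `apply_event`, and the `event_id` deduplication key below are hypothetical. Child progress lands only on the card keyed by `subagent_task_id`/`task_id`. The parent card keeps one lineage reference per child id, so replaying history after a reload or reconnect does not create duplicate child bubbles.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Card:
    task_id: str
    bubbles: List[str] = field(default_factory=list)
    # child_id -> lineage reference; a dict, so replays cannot add a second entry
    child_refs: Dict[str, dict] = field(default_factory=dict)


def apply_event(cards: Dict[str, Card], seen: Set[str], event: dict) -> None:
    """Route one progress event to its own card; parents only get a lineage ref."""
    event_id = event["event_id"]  # hypothetical per-event id used for dedup
    if event_id in seen:
        return  # already applied (history replay after reload/reconnect)
    seen.add(event_id)
    card_id = event.get("subagent_task_id") or event["task_id"]
    cards.setdefault(card_id, Card(card_id)).bubbles.append(event["text"])
    parent_id = event.get("parent_task_id")
    if parent_id and parent_id != card_id:
        parent = cards.setdefault(parent_id, Card(parent_id))
        parent.child_refs.setdefault(card_id, {
            "parent_task_id": parent_id,
            "root_task_id": event.get("root_task_id"),
            "role": event.get("role"),
        })
```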
docs/DEVELOPMENT.md

Lines changed: 23 additions & 5 deletions
@@ -123,7 +123,7 @@ Derived from P7 (Minimalism): entire codebase fits in one context window.
 - Module hard gate: 1600 lines for non-grandfathered modules in `tests/test_smoke.py`. Grandfathered (`GRANDFATHERED_OVERSIZED_MODULES` in `ouroboros/review.py`): `llm.py`, `claude_advisory_review.py`, `review_state.py`, `server.py`, and temporary v5.7.1 debt `git.py` — split deferred until each surface stabilises, with `git.py` expected to pay down in the next tools pass.
 - Method target: <150 lines. Crossing that line is a decomposition signal, not an automatic failure by itself.
 - Method hard gate: 300 lines in `tests/test_smoke.py`.
-- Codebase-wide function-count hard gate: enforced by `tests/test_smoke.py` against the value defined in `ouroboros/review.py::MAX_TOTAL_FUNCTIONS` (currently 2250; single source of truth — bump the constant when adding a feature with an explicit comment justifying the increase).
+- Codebase-wide function-count hard gate: enforced by `tests/test_smoke.py` against the value defined in `ouroboros/review.py::MAX_TOTAL_FUNCTIONS` (currently 2275; single source of truth — bump the constant when adding a feature with an explicit comment justifying the increase).
 - Function parameters: <8.
 - Net complexity growth per cycle approaches zero.
 - If a feature is not used in the current cycle — it is premature.
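The exact counting rule in `tests/test_smoke.py` is not part of this diff. Below is a minimal sketch of how such a function-count gate can be computed with the standard `ast` module. It assumes that every `def` and `async def` counts, including methods and nested functions, and that lambdas do not. `count_functions`, `total_functions`, and `within_budget` are hypothetical helpers, and the inclusive comparison is an assumption. `ouroboros/review.py::MAX_TOTAL_FUNCTIONS` remains the single source of truth for the limit.

```python
import ast
from pathlib import Path

# Value stated in this commit; ouroboros/review.py::MAX_TOTAL_FUNCTIONS stays authoritative.
MAX_TOTAL_FUNCTIONS = 2275


def count_functions(source: str) -> int:
    """Count def/async def nodes anywhere in a module, including methods and nested defs."""
    tree = ast.parse(source)
    return sum(isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) for node in ast.walk(tree))


def total_functions(root: Path) -> int:
    """Sum the per-module counts over every .py file below `root`."""
    return sum(count_functions(path.read_text(encoding="utf-8")) for path in root.rglob("*.py"))


def within_budget(total: int, limit: int = MAX_TOTAL_FUNCTIONS) -> bool:
    return total <= limit  # assumed inclusive; check the real smoke test
```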
@@ -252,7 +252,7 @@ Before every commit, verify the following:
 #### Module Size & Complexity
 - [ ] Module stays near one context window (~1000 lines target; 1600 hard gate unless explicitly grandfathered debt)
 - [ ] No method exceeds the practical target (150 lines) or the hard gate (300 lines)
-- [ ] Total Python function count stays under the current smoke hard gate (currently 2250; consult `ouroboros/review.py::MAX_TOTAL_FUNCTIONS` for the active value; bump with a comment if a feature requires more headroom)
+- [ ] Total Python function count stays under the current smoke hard gate (currently 2275; consult `ouroboros/review.py::MAX_TOTAL_FUNCTIONS` for the active value; bump with a comment if a feature requires more headroom)
 - [ ] No function has more than 8 parameters
 - [ ] No gratuitous abstract layers (Bible P7)
 
@@ -275,19 +275,37 @@ Before every commit, verify the following:
 `role`, `context`, `constraints`, and `memory_mode` are optional. Do not
 reintroduce public `parent_task_id` or `description` arguments; lineage comes
 from `ToolContext`.
+- Live `memory_mode=shared` is disabled. Keep `forked` and `empty` as the only
+live subagent modes unless a later design adds sanitized shared-context v2.
+- External `/api/tasks` and CLI requests must reject forged
+`delegation_role=subagent`; only `schedule_task` may create subagents.
 - `task_constraint.mode="local_readonly_subagent"` must be enforced twice:
 schema discovery exposes only the local-readonly allowlist, and registry
 execution rejects forbidden calls even when invoked manually.
+- `task_constraint` boolean parsing must be strict; strings such as `"false"`
+are false, never truthy through Python's `bool("false")`.
 - Subagent changes must keep writes, commits, review mutation, runtime control,
 tool expansion, skills, MCP/extensions, shell, and further `schedule_task`
 recursion blocked unless a later accepted design explicitly changes the
 permission model.
 - `data_read` and `data_list` secret/control-file denials are subagent-scoped.
+- Browser isolation for local-readonly subagents is DNS fail-closed: block
+non-HTTP(S), loopback/private/link-local/reserved/unspecified literal IPs,
+unresolved hostnames, and hostnames resolving to any blocked IP before goto,
+after redirects, and in route handlers.
+- Effective task status belongs in `ouroboros/task_status.py`. Do not duplicate
+child-drive result merge or terminal-status logic in gateways/tools; use
+`load_effective_task_result`, `effective_task_result`, and bounded wait
+helpers. `wait_for_task` and `wait_for_tasks` results must remain untruncated.
+- `forward_to_worker` may write only to validated running tasks whose lineage
+belongs to the current task/root, and must route forked/empty child subagents
+to the child-drive mailbox.
 Do not broaden generic data-tool behavior for normal tasks while fixing
 subagent isolation.
-- Handoff is the full task result. Do not add shared ledgers, automatic memory
-merges, or new settings/endpoints unless the accepted plan explicitly calls
-for them.
+- The pre-final handoff reminder is a compact effective-status snapshot. Full
+untruncated child handoff belongs to `get_task_result`, `wait_for_task`, and
+`wait_for_tasks`. Do not add shared ledgers, automatic memory merges, or new
+settings/endpoints unless the accepted plan explicitly calls for them.
 
 #### Page Header Layout
 - Top-level page chrome (`renderPageHeader`, tab strips, primary actions) must sit outside the scrolling content region.
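Two of the new checklist rules are small enough to sketch directly: strict `task_constraint` booleans and rejection of forged subagent ingress. The Python below is illustrative. `strict_bool`, `validate_external_task_request`, `ForgedLineageError`, and the accepted true/false spellings are assumptions. The `bool("false")` trap, the `delegation_role=subagent` ban, and the parent/root lineage ban come from this commit.

```python
from typing import Any

_TRUE = {"true", "1", "yes", "on"}  # accepted spellings are an assumption
_FALSE = {"false", "0", "no", "off"}


class ForgedLineageError(ValueError):
    """Raised when an external request tries to pose as a subagent (hypothetical)."""


def strict_bool(value: Any) -> bool:
    """Parse a task_constraint flag; bool("false") would be True, so never call bool() on strings."""
    if isinstance(value, bool):
        return value
    if isinstance(value, int) and value in (0, 1):
        return value == 1
    if isinstance(value, str):
        token = value.strip().lower()
        if token in _TRUE:
            return True
        if token in _FALSE:
            return False
    raise ValueError(f"not a strict boolean: {value!r}")


def validate_external_task_request(payload: dict) -> dict:
    """Gate for /api/tasks and CLI `run` payloads; only schedule_task may create subagents."""
    role = str(payload.get("delegation_role") or "").strip().lower()
    if role == "subagent":
        raise ForgedLineageError("delegation_role=subagent is reserved for schedule_task")
    forged = [key for key in ("parent_task_id", "root_task_id") if payload.get(key)]
    if forged:
        raise ForgedLineageError(f"external requests may not set lineage: {forged}")
    return payload
```

Raising on unknown values, rather than defaulting to `False`, keeps a typo in a constraint from silently relaxing it.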

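The `forward_to_worker` rule can be sketched as a target check followed by mailbox selection. This is a hedged illustration, not the tool's code. Only three requirements come from the checklist: the target must be a running task, its lineage must belong to the current task or root, and forked/empty subagents must be routed to their child drive. The status value, the lineage test (same task, direct parent, or same root), the `mailbox/inbox.jsonl` layout, and the names `resolve_mailbox`, `deliver`, and `ForwardRejected` are all assumptions.

```python
import json
from pathlib import Path
from typing import Optional


class ForwardRejected(Exception):
    """Raised when a forward target fails validation (hypothetical name)."""


def resolve_mailbox(target: Optional[dict], current_task_id: str,
                    current_root_id: str, parent_drive: Path) -> Path:
    """Pick the mailbox directory for a forward, or refuse the write."""
    if target is None or target.get("status") != "running":
        raise ForwardRejected("target is not a running task")
    in_lineage = (target.get("task_id") == current_task_id
                  or target.get("parent_task_id") == current_task_id
                  or target.get("root_task_id") == current_root_id)
    if not in_lineage:
        raise ForwardRejected("target does not belong to the current task/root")
    isolated = (target.get("delegation_role") == "subagent"
                and target.get("memory_mode") in ("forked", "empty"))
    if isolated:
        child_root = target.get("child_drive_root")
        if not child_root:
            raise ForwardRejected("isolated subagent has no recorded child drive")
        return Path(child_root) / "mailbox"  # hypothetical mailbox subdirectory
    return parent_drive / "mailbox"


def deliver(mailbox: Path, text: str) -> Path:
    """Append one message as a JSON line (the on-disk format is an assumption)."""
    mailbox.mkdir(parents=True, exist_ok=True)
    inbox = mailbox / "inbox.jsonl"
    with inbox.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps({"text": text}) + "\n")
    return inbox
```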