Skip to content

fix(evaluate): resolve project working dir fallback#1492

Open
Q00 wants to merge 6 commits into
mainfrom
warden/issue-1490-evaluate-working-dir
Open

fix(evaluate): resolve project working dir fallback#1492
Q00 wants to merge 6 commits into
mainfrom
warden/issue-1490-evaluate-working-dir

Conversation

@Q00

@Q00 Q00 commented Jun 20, 2026

Copy link
Copy Markdown
Owner

Summary

  • Document working_dir in the evaluate skill and clarify that it controls Stage 2 source-file visibility as well as Stage 1 commands.
  • Resolve evaluate working directories from explicit working_dir, registered brownfield default, seed metadata/context, then cwd.
  • Pass the resolved project root through plugin/subagent dispatch and in-process evaluation.
  • Add regression coverage for explicit, brownfield-default, seed metadata, and cwd fallback paths.

Test plan

  • uv run ruff check src/ouroboros/mcp/tools/evaluation_handlers.py tests/unit/mcp/tools/test_evaluate_multi_ac.py
  • uv run pytest tests/unit/mcp/tools/test_evaluate_multi_ac.py tests/unit/mcp/tools/test_definitions.py -q — 148 passed
  • uv run pytest tests/unit/mcp/tools/test_evaluate_multi_ac.py tests/unit/mcp/tools/test_definitions.py tests/unit/mcp/tools/test_start_evaluate.py tests/unit/mcp/tools/test_checklist_verify.py tests/unit/mcp/tools/test_handler_subagent_wiring.py -q — 210 passed

Closes #1490
Refs #1491

@ouroboros-agent ouroboros-agent Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review — ouroboros-agent[bot]

Verdict: REQUEST_CHANGES

Metadata

| Field | Value |
|### Recovery Notes
First recoverable review artifact generated from codex analysis log.

---|---|
| PR | #1492 |
| HEAD checked | b4e0f2461028e20639e64e53f82147575e2ef9dd |
| Request ID | req_1781961952_131 |
| Review record | f0b79dca-54ca-4743-9e06-1fc2a0f0f60f |

What Improved

  • Adds evaluation working_dir fallback resolution so Stage 1 mechanical checks and Stage 2 artifact collection can use a project root even when the caller omits working_dir.
  • Documents that working_dir controls both command execution and source-file visibility.

Issue Requirements

Requirement Status
Add/advertise working_dir as controlling Stage 1 commands and Stage 2 source visibility Met
If working_dir is omitted, fall back through brownfield default, seed project metadata, then MCP server cwd Partially met
Preserve safe project-boundary behavior for seed-derived paths Not met

Prior Findings Status

Prior bot review metadata was present only as history. I did not copy prior findings; the blocker above is based on current source evidence.

Blockers

# File:Line Severity Finding
1 src/ouroboros/mcp/tools/evaluation_handlers.py:107 BLOCKING _seed_project_dir() resolves seed-encoded metadata.project_dir / working_directory / context_references[].path with resolve_path_against_base(... ) but does not enable containment. The shared project-path helper explicitly says callers handling untrusted seed paths must opt into containment, and this new fallback now feeds the resolved path into artifact writes/collection and mechanical command execution. A seed can therefore redirect an omitted-working_dir evaluation outside the intended project root, causing Stage 2 to scan/read another tree and Stage 1 to load or run .ouroboros/mechanical.toml from that location. Use the central seed resolver/containment contract or reject escaped seed paths instead of silently trusting them.

Follow-up Findings

# File:Line Priority Confidence Suggestion
1 tests/unit/mcp/tools/test_evaluate_multi_ac.py:77 Medium Medium Add a regression test for the conflict case where both a registered brownfield default and seed metadata are present, since that precedence is now part of the documented contract.

Non-blocking Suggestions

| 1 | tests/unit/mcp/tools/test_evaluate_multi_ac.py:77 | Tests | The new tests cover each fallback source independently, but not seed path containment/rejection or default-vs-seed precedence conflicts. |

Test Coverage Notes

  • Ran SETUPTOOLS_SCM_PRETEND_VERSION=0.0.0 /root/.local/bin/uv run python -m pytest tests/unit/mcp/tools/test_evaluate_multi_ac.py successfully: 16 passed.
  • Reviewed source paths for resolver, artifact collection, mechanical command cwd, and existing shared project-path containment contract.

Design Notes

The feature belongs in the evaluation boundary, but it duplicates seed project-path resolution instead of reusing the existing containment-aware contract.

Design / Roadmap Gate

The changed boundary controls stateful artifact writes, source-file visibility, and subprocess cwd. Explicit working_dir can remain caller-controlled, but seed-derived fallback paths are not equivalent: seeds may be generated/imported and must not silently escape the intended stable base. Current HEAD lacks that gate.

Directional Notes

I focused on resolver precedence, stale/global defaults, seed trust boundaries, artifact locations, and mechanical command cwd because this PR changes evaluation runtime boundaries.

Test Coverage

  • Ran SETUPTOOLS_SCM_PRETEND_VERSION=0.0.0 /root/.local/bin/uv run python -m pytest tests/unit/mcp/tools/test_evaluate_multi_ac.py successfully: 16 passed.
  • Reviewed source paths for resolver, artifact collection, mechanical command cwd, and existing shared project-path containment contract.

Merge Recommendation

Do not merge until seed-derived project paths are contained or explicitly rejected, with a regression test proving escaped seed metadata cannot redirect evaluation.

Review-Metadata:
verdict: REQUEST_CHANGES
head_sha: b4e0f24
request_id: req_1781961952_131
review_profile: memory-aware-zero-trust-v2
advisory_memory_only: true


Reviewed by ouroboros-agent[bot] via Codex deep analysis

@Q00

Q00 commented Jun 20, 2026

Copy link
Copy Markdown
Owner Author

Updated this PR in 5b99937 to address the review blocker: seed-derived working-dir fallback now reuses the shared containment-aware seed project-path resolver, and regression coverage now covers brownfield-default precedence and escaped seed metadata fallback.\n\nVerification run locally:\n- SETUPTOOLS_SCM_PRETEND_VERSION=0.0.0 uv run python -m pytest tests/unit/mcp/tools/test_evaluate_multi_ac.py -q (18 passed)\n- SETUPTOOLS_SCM_PRETEND_VERSION=0.0.0 uv run ruff check src/ouroboros/mcp/tools/evaluation_handlers.py tests/unit/mcp/tools/test_evaluate_multi_ac.py (passed)\n\n_Posted by agentos-roadmap-warden — bot. Reply with /warden ignore to suppress further comments on this thread._

@ouroboros-agent ouroboros-agent Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review — ouroboros-agent[bot]

Verdict: REQUEST_CHANGES

Metadata

| Field | Value |
|### Recovery Notes
First recoverable review artifact generated from codex analysis log.

---|---|
| PR | #1492 |
| HEAD checked | 5b999371ec2d9e83b981610ab73f5008b227af1a |
| Request ID | req_1781964380_133 |
| Review record | e4e71516-2cbc-434a-a0bb-532950ac505f |

What Improved

  • Documents that working_dir controls both Stage 1 command execution and Stage 2 source visibility.
  • Adds shared working-directory resolution for evaluation with explicit, brownfield default, seed metadata, and cwd fallback order.
  • Adds regression coverage for explicit/default/seed/cwd precedence and escaped seed metadata fallback.

Issue Requirements

Requirement Status
PR body linked issue requirement N/A
working_dir controls both Stage 1 command execution and Stage 2 source-file visibility Partially met
If working_dir is omitted, fallback uses registered brownfield default, then seed project metadata, then MCP server cwd Partially met
Regression coverage covers brownfield-default precedence and escaped seed metadata fallback Met

Prior Findings Status

A prior bot review existed and the contributor says this revision addressed the earlier seed-derived fallback blocker. I did not reuse prior findings; the current blocker is based on the current source and changed fallback boundary.

Blockers

# File:Line Severity Finding
1 src/ouroboros/mcp/tools/evaluation_handlers.py:76 BLOCKING _default_brownfield_project_dir() returns the persisted default path after .resolve() without verifying that it still exists and is a directory, and _resolve_evaluate_working_dir() then returns that default before trying seed metadata at lines 110-114. A stale brownfield default is enough to block fallback to a valid seed project root. Downstream, evaluation writes/collects artifacts and runs mechanical discovery from that chosen root (evaluation_handlers.py:594, :601, :615); on a nonexistent root, artifact writes fail, directory scanning returns nothing, detector sees no manifests, and build_mechanical_config() falls back to all commands None, which Stage 1 treats as skipped (src/ouroboros/evaluation/languages.py:257-260). This can silently evaluate the wrong/empty project boundary and approve without the intended mechanical checks. Validate the brownfield default with is_dir() before accepting it, or fall through/error so seed metadata/cwd can be used.

Follow-up Findings

# File:Line Priority Confidence Suggestion
None.

Non-blocking Suggestions

| None. |

Test Coverage Notes

  • Ran SETUPTOOLS_SCM_PRETEND_VERSION=0.0.0 /root/.local/bin/uv run python -m pytest tests/unit/mcp/tools/test_evaluate_multi_ac.py -q: 18 passed.
  • Ran SETUPTOOLS_SCM_PRETEND_VERSION=0.0.0 /root/.local/bin/uv run python -m pytest tests/unit/core/test_project_paths.py tests/unit/core/test_seed_project_path_callers.py -q: 41 passed.
  • Missing critical coverage for stale/nonexistent brownfield default path falling through to seed metadata or cwd.

Design Notes

The centralization around the shared project-path resolver is the right direction, but the new brownfield default boundary needs stale-path validation because it now gates both mechanical verification and semantic source visibility.

Design / Roadmap Gate

Affected boundary: MCP evaluate working-dir resolution now feeds plugin/subagent payloads and in-process Stage 1/Stage 2 evaluation. Explicit paths can be trusted as caller intent, but persisted brownfield defaults are operator state and can become stale after repo moves/deletion. Accepting stale state before seed metadata breaks compatibility with seed-scoped evaluation and can replay/evaluate against an empty boundary.

Directional Notes

Review focused on runtime boundary safety: fallback order, stale persisted state, artifact locations, and whether Stage 1 can silently skip checks. Maintainer memory was advisory only; the blocker is from current HEAD evidence.

Test Coverage

  • Ran SETUPTOOLS_SCM_PRETEND_VERSION=0.0.0 /root/.local/bin/uv run python -m pytest tests/unit/mcp/tools/test_evaluate_multi_ac.py -q: 18 passed.
  • Ran SETUPTOOLS_SCM_PRETEND_VERSION=0.0.0 /root/.local/bin/uv run python -m pytest tests/unit/core/test_project_paths.py tests/unit/core/test_seed_project_path_callers.py -q: 41 passed.
  • Missing critical coverage for stale/nonexistent brownfield default path falling through to seed metadata or cwd.

Merge Recommendation

Do not merge until stale or non-directory brownfield defaults are rejected or skipped with regression coverage proving fallback to seed metadata/cwd still preserves Stage 1 and Stage 2 visibility.

Review-Metadata:
verdict: REQUEST_CHANGES
head_sha: 5b99937
request_id: req_1781964380_133
review_profile: memory-aware-zero-trust-v2
advisory_memory_only: true


Reviewed by ouroboros-agent[bot] via Codex deep analysis

@Q00

Q00 commented Jun 20, 2026

Copy link
Copy Markdown
Owner Author

Pushed follow-up commit 571f3b91 to address the latest review blocker: evaluate fallback now ignores stale or non-directory brownfield defaults and continues to seed metadata before falling back to the MCP server cwd.

Verification run locally:

  • SETUPTOOLS_SCM_PRETEND_VERSION=0.0.0 uv run python -m pytest tests/unit/mcp/tools/test_evaluate_multi_ac.py -q (20 passed)
  • SETUPTOOLS_SCM_PRETEND_VERSION=0.0.0 uv run ruff check src/ouroboros/mcp/tools/evaluation_handlers.py tests/unit/mcp/tools/test_evaluate_multi_ac.py (passed)

Posted by agentos-roadmap-warden — bot. Reply with /warden ignore to suppress further comments on this thread.

@ouroboros-agent ouroboros-agent Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review — ouroboros-agent[bot]

Verdict: APPROVE

Metadata

| Field | Value |
|### Recovery Notes
First recoverable review artifact generated from codex analysis log.

---|---|
| PR | #1492 |
| HEAD checked | 571f3b91fc140f591f529f4ff96fd58600051e8b |
| Request ID | req_1781967196_136 |
| Review record | 5573b1b5-5d39-4ddb-a322-e635e1bfad21 |

What Improved

  • ouroboros_evaluate now resolves a project root before plugin dispatch and in-process evaluation, so Stage 1 commands and Stage 2 artifact collection use the same working directory.
  • The evaluate skill docs now recommend passing an absolute working_dir and document fallback order.
  • Regression coverage now covers explicit working-dir precedence, brownfield default precedence, stale/non-directory defaults, seed metadata fallback, containment escape fallback, and cwd fallback.

Issue Requirements

Requirement Status
No linked issue / PR body requirement captured N/A

Prior Findings Status

Prior bot reviews existed and requested changes. I did not reuse prior findings. Current source shows the latest concerns described in contributor comments are addressed: stale/non-directory brownfield defaults fall through to seed metadata, escaped seed metadata falls back to cwd, and the handler uses the shared containment-aware resolver.

Blockers

No in-scope blocking findings remained after policy filtering.

Follow-up Findings

# File:Line Priority Confidence Suggestion
1 tests/unit/mcp/tools/test_evaluate_multi_ac.py:148 Low High Minor formatting nit: there is an extra blank line inside TestEvaluateWorkingDirResolution. Ruff accepts it, so this is non-blocking.

Non-blocking Suggestions

| 1 | tests/unit/mcp/tools/test_evaluate_multi_ac.py:148 | Style | Extra blank line inside the test class; cosmetic only. |

Test Coverage Notes

  • Reviewed changed tests in tests/unit/mcp/tools/test_evaluate_multi_ac.py, especially working-dir resolution coverage at lines 77-224.
  • Ran SETUPTOOLS_SCM_PRETEND_VERSION=0.0.0 /root/.local/bin/uv run python -m pytest tests/unit/mcp/tools/test_evaluate_multi_ac.py -q: 20 passed.
  • Ran resolver-adjacent coverage: tests/unit/core/test_project_paths.py tests/unit/core/test_seed_project_path_callers.py -q: 41 passed.
  • Ran SETUPTOOLS_SCM_PRETEND_VERSION=0.0.0 /root/.local/bin/uv run ruff check src/ouroboros/mcp/tools/evaluation_handlers.py tests/unit/mcp/tools/test_evaluate_multi_ac.py: passed.

Design Notes

The design is scoped and coherent: one handler-level resolver establishes the project root, then both plugin dispatch and in-process evaluation receive that resolved path. It reuses the shared seed path resolver instead of creating a parallel containment implementation.

Design / Roadmap Gate

Affected boundary is ouroboros_evaluate working-directory selection. Current HEAD evidence supports the intended contract: explicit working_dir wins, usable brownfield default is next, seed metadata/context is resolved through resolve_seed_project_path, and cwd remains the final compatibility fallback. The resolved path is passed to subagent context and used for artifact collection plus mechanical config in the in-process path.

Directional Notes

I focused on runtime boundary safety: working-dir precedence, stale brownfield defaults, seed containment, Stage 1 mechanical config, Stage 2 artifact visibility, and plugin dispatch payload shape. Maintainer memory was advisory only; no blocker was raised from memory.

Test Coverage

  • Reviewed changed tests in tests/unit/mcp/tools/test_evaluate_multi_ac.py, especially working-dir resolution coverage at lines 77-224.
  • Ran SETUPTOOLS_SCM_PRETEND_VERSION=0.0.0 /root/.local/bin/uv run python -m pytest tests/unit/mcp/tools/test_evaluate_multi_ac.py -q: 20 passed.
  • Ran resolver-adjacent coverage: tests/unit/core/test_project_paths.py tests/unit/core/test_seed_project_path_callers.py -q: 41 passed.
  • Ran SETUPTOOLS_SCM_PRETEND_VERSION=0.0.0 /root/.local/bin/uv run ruff check src/ouroboros/mcp/tools/evaluation_handlers.py tests/unit/mcp/tools/test_evaluate_multi_ac.py: passed.

Merge Recommendation

Approve. I found no blocking runtime, persistence, plugin-dispatch, or test-coverage issue in the changed boundary, and targeted verification passed.

Review-Metadata:
verdict: APPROVE
head_sha: 571f3b9
request_id: req_1781967196_136
review_profile: memory-aware-zero-trust-v2
advisory_memory_only: true


Reviewed by ouroboros-agent[bot] via Codex deep analysis

@Q00

Q00 commented Jun 20, 2026

Copy link
Copy Markdown
Owner Author

Pushed follow-up commit 8734e453 after CI's Ruff format job failed on the new regression-test block. No behavior changes; local verification now includes format check:

  • SETUPTOOLS_SCM_PRETEND_VERSION=0.0.0 uv run python -m pytest tests/unit/mcp/tools/test_evaluate_multi_ac.py -q (20 passed)
  • SETUPTOOLS_SCM_PRETEND_VERSION=0.0.0 uv run ruff check src/ouroboros/mcp/tools/evaluation_handlers.py tests/unit/mcp/tools/test_evaluate_multi_ac.py (passed)
  • SETUPTOOLS_SCM_PRETEND_VERSION=0.0.0 uv run ruff format --check src/ouroboros/mcp/tools/evaluation_handlers.py tests/unit/mcp/tools/test_evaluate_multi_ac.py (passed)

Posted by agentos-roadmap-warden — bot. Reply with /warden ignore to suppress further comments on this thread.

@ouroboros-agent ouroboros-agent Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review — ouroboros-agent[bot]

Verdict: APPROVE

Metadata

| Field | Value |
|### Recovery Notes
First recoverable review artifact generated from codex analysis log.

---|---|
| PR | #1492 |
| HEAD checked | 8734e4535f7567da8e2e9e301539a240edde748e |
| Request ID | req_1781967773_137 |
| Review record | 80c5bf03-ec45-4a4e-a932-8dd8a3a2bf3f |

What Improved

  • Evaluation now resolves one project root for both Stage 1 commands and Stage 2 source visibility.
  • Omitted working_dir now falls back through brownfield default, seed metadata, then MCP server cwd.
  • Tests cover explicit override, brownfield precedence, stale/non-directory brownfield fallback, seed metadata, escaped seed metadata, and cwd fallback.

Issue Requirements

Requirement Status
Ensure omitted evaluate working_dir can fall back to the registered brownfield default Met
Fall back from unusable brownfield defaults to seed metadata Met
Use containment-aware seed project-path resolution for seed-derived fallback Met
Document that working_dir controls Stage 1 and Stage 2 visibility Met
Add regression coverage for precedence and fallback behavior Met

Prior Findings Status

Prior bot reviews existed, including earlier changes-requested states and a later approval. I rechecked the current source independently; the stale/non-directory brownfield fallback and seed containment cases are now covered by current tests, so no prior blocker is maintained.

Blockers

No in-scope blocking findings remained after policy filtering.

Follow-up Findings

# File:Line Priority Confidence Suggestion
1 skills/evaluate/SKILL.md:76 Low Medium The skill example still documents only singular acceptance_criterion; documenting acceptance_criteria would make the newer checklist path easier to discover.

Non-blocking Suggestions

| 1 | skills/evaluate/SKILL.md:76 | Documentation | Consider adding the plural acceptance_criteria argument to the usage block. |

Test Coverage Notes

  • Reviewed changed source, diff, PR comments, prior review metadata, and affected path-resolution contracts.
  • Ran SETUPTOOLS_SCM_PRETEND_VERSION=0.0.0 /root/.local/bin/uv run python -m pytest tests/unit/mcp/tools/test_evaluate_multi_ac.py -q: 20 passed.
  • Ran SETUPTOOLS_SCM_PRETEND_VERSION=0.0.0 /root/.local/bin/uv run ruff check src/ouroboros/mcp/tools/evaluation_handlers.py tests/unit/mcp/tools/test_evaluate_multi_ac.py: passed.
  • Ran SETUPTOOLS_SCM_PRETEND_VERSION=0.0.0 /root/.local/bin/uv run ruff format --check src/ouroboros/mcp/tools/evaluation_handlers.py tests/unit/mcp/tools/test_evaluate_multi_ac.py: passed.

Design Notes

The change keeps the fallback logic centralized in the MCP handler and reuses the shared seed path resolver instead of adding a parallel path parser. The runtime split between plugin dispatch and in-process evaluation now receives the same resolved root.

Design / Roadmap Gate

Affected boundary: ouroboros_evaluate working-directory resolution across plugin/subagent dispatch, mechanical verification, artifact collection, and seed metadata recovery. Current HEAD preserves explicit operator override, validates brownfield defaults as existing directories, applies contained seed metadata, and keeps cwd as the final compatibility fallback. The added tests exercise the critical precedence and stale-path cases.

Directional Notes

Review focus followed the maintainer posture around runtime boundaries, fallback ordering, stale project state, and duplicate path-resolution logic. Memory was advisory only; no blocker was raised from memory.

Test Coverage

  • Reviewed changed source, diff, PR comments, prior review metadata, and affected path-resolution contracts.
  • Ran SETUPTOOLS_SCM_PRETEND_VERSION=0.0.0 /root/.local/bin/uv run python -m pytest tests/unit/mcp/tools/test_evaluate_multi_ac.py -q: 20 passed.
  • Ran SETUPTOOLS_SCM_PRETEND_VERSION=0.0.0 /root/.local/bin/uv run ruff check src/ouroboros/mcp/tools/evaluation_handlers.py tests/unit/mcp/tools/test_evaluate_multi_ac.py: passed.
  • Ran SETUPTOOLS_SCM_PRETEND_VERSION=0.0.0 /root/.local/bin/uv run ruff format --check src/ouroboros/mcp/tools/evaluation_handlers.py tests/unit/mcp/tools/test_evaluate_multi_ac.py: passed.

Merge Recommendation

Approve. I found no current HEAD blocker, and the targeted runtime and formatting checks pass.

Review-Metadata:
verdict: APPROVE
head_sha: 8734e45
request_id: req_1781967773_137
review_profile: memory-aware-zero-trust-v2
advisory_memory_only: true


Reviewed by ouroboros-agent[bot] via Codex deep analysis

@Q00

Q00 commented Jun 20, 2026

Copy link
Copy Markdown
Owner Author

Pushed follow-up commit 621f9224 after the full test matrix exposed the missing capability metadata expectation for evaluate working_dir context.

Additional local verification:

  • SETUPTOOLS_SCM_PRETEND_VERSION=0.0.0 uv run python -m pytest tests/unit/orchestrator/test_capabilities.py::test_each_ouroboros_owned_tool_metadata_is_tool_specific_not_generic_inference tests/unit/orchestrator/test_capabilities.py::test_ouroboros_owned_tools_have_explicit_capability_metadata tests/unit/orchestrator/test_capabilities.py::test_ouroboros_capability_registry_covers_every_owned_tool_identifier -q (31 passed)
  • SETUPTOOLS_SCM_PRETEND_VERSION=0.0.0 uv run python -m pytest tests/unit/mcp/tools/test_evaluate_multi_ac.py -q (20 passed)
  • SETUPTOOLS_SCM_PRETEND_VERSION=0.0.0 uv run ruff check src/ouroboros/mcp/tools/evaluation_handlers.py tests/unit/mcp/tools/test_evaluate_multi_ac.py tests/unit/orchestrator/test_capabilities.py (passed)
  • SETUPTOOLS_SCM_PRETEND_VERSION=0.0.0 uv run ruff format --check src/ouroboros/mcp/tools/evaluation_handlers.py tests/unit/mcp/tools/test_evaluate_multi_ac.py tests/unit/orchestrator/test_capabilities.py (passed)

Posted by agentos-roadmap-warden — bot. Reply with /warden ignore to suppress further comments on this thread.

@ouroboros-agent ouroboros-agent Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review — ouroboros-agent[bot]

Verdict: REQUEST_CHANGES

Metadata

| Field | Value |
|### Recovery Notes
First recoverable review artifact generated from codex analysis log.

---|---|
| PR | #1492 |
| HEAD checked | 621f9224d73294e3b2a9feae932643dfd344ef64 |
| Request ID | req_1781968981_140 |
| Review record | 50bf5252-8d4e-4d97-99e8-339b2b110e1a |

What Improved

  • ouroboros_evaluate now resolves omitted working_dir through explicit arg, brownfield default, seed metadata, then MCP cwd for the direct/blocking path.
  • The evaluate skill documents that working_dir affects both Stage 1 commands and Stage 2 source visibility.
  • Regression tests cover direct evaluate fallback precedence and capability metadata expectations.

Issue Requirements

Requirement Status
Document working_dir as controlling Stage 1 command execution and Stage 2 source visibility Met
Omitted direct ouroboros_evaluate working_dir falls back to brownfield default, seed project metadata, then MCP cwd Met
Ignore stale/non-directory brownfield defaults before seed fallback Met
Keep capability metadata aware of evaluate working_dir context Met
Apply the same omitted-working_dir fallback across evaluate MCP entrypoints, including start/plugin dispatch Not met

Prior Findings Status

Prior bot review text was not reused. Human comments indicate earlier direct-evaluate fallback blockers were addressed; current source and tests support that for EvaluateHandler. This review maintains a blocker only for the sibling StartEvaluateHandler plugin boundary still bypassing the new resolver.

Blockers

# File:Line Severity Finding
1 src/ouroboros/mcp/tools/evaluation_handlers.py:1871 BLOCKING StartEvaluateHandler still bypasses the new fallback resolver in plugin mode. Its definition reuses EvaluateHandler().definition.parameters, so ouroboros_start_evaluate now exposes the same working_dir contract, but the plugin-dispatch branch returns before the inner EvaluateHandler.handle() runs and forwards only arguments.get("working_dir"). When callers omit working_dir, plugin-mode start-evaluate payloads get None instead of the brownfield default, seed project metadata, or cwd that direct ouroboros_evaluate now computes at lines 519 and 528. This leaves Stage 2 source visibility wrong for the documented long-running/plugin evaluate boundary.

Follow-up Findings

# File:Line Priority Confidence Suggestion
1 src/ouroboros/mcp/tools/evaluation_handlers.py:113 Medium Medium Consider adding a regression for explicit working_dir that points to a missing or non-directory path, so the intended behavior is locked down instead of inherited from downstream best-effort failures.

Non-blocking Suggestions

| 1 | src/ouroboros/mcp/tools/evaluation_handlers.py:118 | Cleanup | _default_brownfield_project_dir() already filters unusable defaults, so the repeated is_dir() branch in _resolve_evaluate_working_dir() is defensive but currently redundant. |

Test Coverage Notes

  • Ran SETUPTOOLS_SCM_PRETEND_VERSION=0.0.0 /root/.local/bin/uv run python -m pytest tests/unit/mcp/tools/test_evaluate_multi_ac.py -q: 20 passed.
  • Ran targeted capability metadata tests: 31 passed.
  • Ran ruff check and ruff format --check on changed source/tests: passed.
  • Missing coverage: ouroboros_start_evaluate plugin-mode fallback path.

Design Notes

The resolver is placed correctly for direct evaluation before plugin and in-process dispatch, but the start-evaluate plugin shortcut now diverges from the shared contract.

Design / Roadmap Gate

Affected boundary: evaluate working-directory resolution feeds mechanical config, artifact collection, semantic source visibility, and plugin payload context. Direct and background non-plugin paths use the new resolver, but ouroboros_start_evaluate plugin mode returns before that shared resolution and preserves a stale raw-argument path.

Directional Notes

Review focus was on runtime boundaries, fallback precedence, plugin dispatch, and capability metadata. Maintainer memory was advisory only; the blocker is based on current source evidence.

Test Coverage

  • Ran SETUPTOOLS_SCM_PRETEND_VERSION=0.0.0 /root/.local/bin/uv run python -m pytest tests/unit/mcp/tools/test_evaluate_multi_ac.py -q: 20 passed.
  • Ran targeted capability metadata tests: 31 passed.
  • Ran ruff check and ruff format --check on changed source/tests: passed.
  • Missing coverage: ouroboros_start_evaluate plugin-mode fallback path.

Merge Recommendation

REQUEST_CHANGES until StartEvaluateHandler plugin dispatch resolves working_dir with the same fallback logic, with a regression test covering omitted working_dir plus brownfield or seed-derived project root.

Review-Metadata:
verdict: REQUEST_CHANGES
head_sha: 621f922
request_id: req_1781968981_140
review_profile: memory-aware-zero-trust-v2
advisory_memory_only: true


Reviewed by ouroboros-agent[bot] via Codex deep analysis

@Q00

Q00 commented Jun 20, 2026

Copy link
Copy Markdown
Owner Author

Pushed follow-up commit d55f5c02 to address the latest review blocker: ouroboros_start_evaluate plugin dispatch now uses the same omitted-working_dir fallback resolver before building the subagent payload, with a regression for brownfield-default fallback.

Additional local verification:

  • SETUPTOOLS_SCM_PRETEND_VERSION=0.0.0 uv run python -m pytest tests/unit/mcp/tools/test_start_evaluate.py tests/unit/mcp/tools/test_evaluate_multi_ac.py -q (36 passed)
  • SETUPTOOLS_SCM_PRETEND_VERSION=0.0.0 uv run python -m pytest tests/unit/orchestrator/test_capabilities.py::test_each_ouroboros_owned_tool_metadata_is_tool_specific_not_generic_inference tests/unit/orchestrator/test_capabilities.py::test_ouroboros_owned_tools_have_explicit_capability_metadata tests/unit/orchestrator/test_capabilities.py::test_ouroboros_capability_registry_covers_every_owned_tool_identifier -q (31 passed)
  • SETUPTOOLS_SCM_PRETEND_VERSION=0.0.0 uv run ruff check src/ouroboros/mcp/tools/evaluation_handlers.py tests/unit/mcp/tools/test_start_evaluate.py tests/unit/mcp/tools/test_evaluate_multi_ac.py tests/unit/orchestrator/test_capabilities.py (passed)
  • SETUPTOOLS_SCM_PRETEND_VERSION=0.0.0 uv run ruff format --check src/ouroboros/mcp/tools/evaluation_handlers.py tests/unit/mcp/tools/test_start_evaluate.py tests/unit/mcp/tools/test_evaluate_multi_ac.py tests/unit/orchestrator/test_capabilities.py (passed)

Posted by agentos-roadmap-warden — bot. Reply with /warden ignore to suppress further comments on this thread.

@ouroboros-agent ouroboros-agent Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review — ouroboros-agent[bot]

Verdict: APPROVE

Metadata

| Field | Value |
|### Recovery Notes
First recoverable review artifact generated from codex analysis log.

---|---|
| PR | #1492 |
| HEAD checked | d55f5c02f2e5a9ce62abf732de4bb2353d54dd1d |
| Request ID | req_1781969887_142 |
| Review record | 92b4cead-e816-4199-b861-f96f8fa9c29b |

What Improved

  • ouroboros_evaluate now resolves omitted working_dir from brownfield default, seed metadata, then MCP cwd.
  • ouroboros_start_evaluate plugin dispatch now uses the same fallback before building the subagent payload.
  • Evaluate skill docs now state that working_dir gates both Stage 1 commands and Stage 2 source visibility.

Issue Requirements

Requirement Status
Document that evaluate working_dir controls Stage 1 command execution and Stage 2 source visibility Met
When evaluate working_dir is omitted, fall back to registered brownfield default before cwd Met
If brownfield default is stale or not a directory, continue to seed metadata/cwd fallback Met
Apply omitted-working_dir fallback to ouroboros_start_evaluate plugin dispatch payloads Met
Keep capability metadata aware of evaluate working_dir context Met
Linked issue requirement captured N/A

Prior Findings Status

Prior review concerns are not re-raised. Current HEAD shows the seed-derived fallback now uses the shared containment-aware resolver, stale/non-directory brownfield defaults fall through, and start_evaluate plugin dispatch resolves working_dir before payload construction.

Blockers

No in-scope blocking findings remained after policy filtering.

Follow-up Findings

# File:Line Priority Confidence Suggestion
None.

Non-blocking Suggestions

| - | - | - | None. |

Test Coverage Notes

  • Reviewed changed tests for evaluate working-dir fallback, stale/non-directory brownfield defaults, seed metadata containment fallback, start_evaluate plugin payload, and capability metadata.
  • Ran SETUPTOOLS_SCM_PRETEND_VERSION=0.0.0 /root/.local/bin/uv run python -m pytest tests/unit/mcp/tools/test_evaluate_multi_ac.py tests/unit/mcp/tools/test_start_evaluate.py -q — 36 passed.
  • Ran targeted capability metadata pytest — 31 passed.
  • Ran Ruff check and Ruff format check on changed source/tests — passed.

Design Notes

The change centralizes evaluate project-root fallback in a small helper and reuses existing project-path containment logic for seed-derived paths. The MCP handlers remain thin enough, with behavior guarded by focused unit tests.

Design / Roadmap Gate

Affected boundary is evaluate project-root selection. Explicit working_dir remains highest precedence; omitted values resolve through brownfield default, seed metadata, then cwd. Brownfield defaults are required to exist as directories. Seed metadata resolution is containment-aware through resolve_seed_project_path. Plugin and in-process evaluate paths now receive the resolved directory string/path before dispatch or pipeline execution.

Directional Notes

Review focused on runtime boundary safety: stale operator defaults, seed metadata containment, plugin dispatch payload parity, and Stage 1/Stage 2 project-root consistency. Maintainer memory was advisory only; no blocker was based on memory alone.

Test Coverage

  • Reviewed changed tests for evaluate working-dir fallback, stale/non-directory brownfield defaults, seed metadata containment fallback, start_evaluate plugin payload, and capability metadata.
  • Ran SETUPTOOLS_SCM_PRETEND_VERSION=0.0.0 /root/.local/bin/uv run python -m pytest tests/unit/mcp/tools/test_evaluate_multi_ac.py tests/unit/mcp/tools/test_start_evaluate.py -q — 36 passed.
  • Ran targeted capability metadata pytest — 31 passed.
  • Ran Ruff check and Ruff format check on changed source/tests — passed.

Merge Recommendation

Approve. Current HEAD covers the previously risky evaluate working-dir fallback and plugin dispatch boundary with passing targeted tests and lint/format checks.

Review-Metadata:
verdict: APPROVE
head_sha: d55f5c0
request_id: req_1781969887_142
review_profile: memory-aware-zero-trust-v2
advisory_memory_only: true


Reviewed by ouroboros-agent[bot] via Codex deep analysis

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

evaluate: working_dir undocumented, cwd fallback breaks Stage 2 for greenfield/Hermes callers

1 participant