
feat: Better subagent contracts; lighter refinement for subagents too#996

Merged
Henry-811 merged 1 commit into dev/v0.1.63 from dev/v0.1.62-p1
Mar 13, 2026
Conversation

@ncrispino (Collaborator) commented Mar 13, 2026

PR Title Format

Your PR title must follow the format: <type>: <brief description>

Valid types:

  • fix: - Bug fixes
  • feat: - New features
  • breaking: - Breaking changes
  • docs: - Documentation updates
  • refactor: - Code refactoring
  • test: - Test additions/modifications
  • chore: - Maintenance tasks
  • perf: - Performance improvements
  • style: - Code style changes
  • ci: - CI/CD configuration changes

Examples:

  • fix: resolve memory leak in data processing
  • feat: add export to CSV functionality
  • breaking: change API response format
  • docs: update installation guide

Description

Brief description of the changes in this PR

Type of change

  • Bug fix (fix:) - Non-breaking change which fixes an issue
  • New feature (feat:) - Non-breaking change which adds functionality
  • Breaking change (breaking:) - Fix or feature that would cause existing functionality to not work as expected
  • Documentation (docs:) - Documentation updates
  • Code refactoring (refactor:) - Code changes that neither fix a bug nor add a feature
  • Tests (test:) - Adding missing tests or correcting existing tests
  • Chore (chore:) - Maintenance tasks, dependency updates, etc.
  • Performance improvement (perf:) - Code changes that improve performance
  • Code style (style:) - Changes that do not affect the meaning of the code (formatting, missing semi-colons, etc.)
  • CI/CD (ci:) - Changes to CI/CD configuration files and scripts

Checklist

  • I have run pre-commit on my changed files and all checks pass
  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Pre-commit status

# Paste the output of running pre-commit on your changed files:
# uv run pre-commit install
# git diff --name-only HEAD~1 | xargs uv run pre-commit run --files # for last commit
# git diff --name-only origin/<base branch>...HEAD | xargs uv run pre-commit run --files # for all commits in PR
# git add <your file> # if any fixes were applied
# git commit -m "chore: apply pre-commit fixes"
# git push origin <branch-name>

How to Test

Add test method for this PR.

Test CLI Command

Write down the test bash command. If there are any prerequisites, please call them out.

Expected Results

Description/screenshots of expected results.

Additional context

Add any other context about the PR here.

Summary by CodeRabbit

  • New Features

    • Added Ensemble coordination pattern (Produce → Vote → Synthesize) as the default for multi-agent subagent runs, increasing answer diversity.
    • Introduced round evaluator transformation pressure settings (gentle/balanced/aggressive) to control self-improvement aggressiveness.
  • Configuration

    • New coordination configuration options for round evaluator and ensemble mode behavior.
    • Updated YAML schema with ensemble pattern guidance and transformation pressure controls.
  • Documentation

    • Enhanced composition pattern, coordination workflow, and subagent documentation with new coordination strategies.
    • Expanded round evaluator guidance with material self-improvement and verification contract details.
  • Improvements

    • Adjusted default reasoning effort for GPT-5.4 from xhigh to high.
    • Improved background MCP client handling for subagent configurations.
    • Enhanced verification contract structure and orchestrator timeout handling.
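
As a rough illustration of the configuration surface this PR adds, here is a minimal YAML sketch. Only the option names and the gentle/balanced/aggressive values come from this summary; the exact placement of the keys in the config tree is an assumption.

```yaml
# Hypothetical placement; option names taken from this PR's walkthrough.
coordination:
  # Controls how boldly the round evaluator pushes rewrites.
  round_evaluator_transformation_pressure: balanced  # gentle | balanced | aggressive
  # Ensemble defaults are propagated only when refine is off.
  round_evaluator_refine: false
  round_evaluator_before_checklist: true
  orchestrator_managed_round_evaluator: true
```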


coderabbitai bot commented Mar 13, 2026

📝 Walkthrough

This PR introduces ensemble coordination pattern (Produce → Vote → Synthesize) and round-evaluator transformation-pressure feature. Changes include new configuration options, updated documentation, enhanced orchestrator state tracking, validation logic, subagent model extensions, and comprehensive test coverage for the new coordination mechanics.

Changes

Cohort / File(s) Summary
Documentation Updates
docs/modules/composition.md, docs/modules/coordination_workflow.md, docs/modules/subagents.md, docs/source/reference/yaml_schema.rst
Added Composition Pattern 5 (Ensemble), documented round-evaluator normal path with transformation-pressure modes, updated YAML schema with new coordination options (round_evaluator_before_checklist, orchestrator_managed_round_evaluator, round_evaluator_refine, round_evaluator_transformation_pressure), and ensemble configuration guidance.
Configuration & Validation
massgen/agent_config.py, massgen/config_validator.py, massgen/cli.py, massgen/config_builder.py
Added round_evaluator_transformation_pressure field (gentle/balanced/aggressive) to CoordinationConfig with validation, updated CLI parsing to propagate the new field, adjusted default reasoning effort from xhigh to high for GPT-5.4.
Backend & API Parameter Handling
massgen/backend/base.py, massgen/api_params_handler/_api_params_handler_base.py, massgen/backend/codex.py
Excluded round_evaluator_transformation_pressure from backend/API params, updated CodexBackend reasoning default to high, added helpers to normalize and rewrite subagent runtime modes in background MCP servers (delegated → isolated for host-side Codex clients).
Prompt & System Message Building
massgen/message_templates.py, massgen/system_message_builder.py, massgen/system_prompt_sections.py
Extended final_presentation_message with had_voting parameter to adjust synthesis guidance based on voting outcome, threaded round_evaluator_transformation_pressure through EvaluationSection and ChangedocSection constructors, added _build_round_evaluator_transformation_pressure_guidance to embed transformation-pressure bias in evaluator prompts.
Subagent Models & Manager
massgen/subagent/models.py, massgen/subagent/manager.py
Added disable_injection and defer_voting_until_all_answered to SubagentOrchestratorConfig for ensemble defaults, extended RoundEvaluatorResult with next_tasks_strategy_mode, next_tasks_incremental_override_reason, and next_tasks_success_contract fields, updated YAML serialization/deserialization to propagate new fields, added manager logic to propagate ensemble defaults only when refine=False.
MCP Tools & Planning
massgen/mcp_tools/checklist_tools_server.py, massgen/mcp_tools/custom_tools_server.py, massgen/mcp_tools/planning/_planning_mcp_server.py
Enhanced checklist redirect to propagate round_evaluator_strategy_mode and success_contract details, refined custom_tools mode-stripping logic to distinguish explicit vs. default controls, updated terminal verification task description template to use structured sections (Verification Contract, Replay Steps, Latest Verification Result, Stale If).
Subagent Type Guidance
massgen/subagent_types/builder/SUBAGENT.md, massgen/subagent_types/round_evaluator/SUBAGENT.md
Refined builder focus on single deliverable with one coherent goal, substantially expanded round_evaluator SUBAGENT.md with material self-improvement emphasis, transformation-pressure modes (gentle/balanced/aggressive), explicit success_contract structure, strategy_mode field (thesis_shift), and enriched task schemas with strategy_role, success_criteria, failure_signals, required_evidence.
Orchestrator Core Logic
massgen/orchestrator.py
Significantly enhanced state tracking (agent timeout states, answer streaks, revision counts, worktree branches, workspace paths), refined final-answer presentation with answer selection based on votes/timestamps, added timeout monitoring (soft/hard logic) and cleanup for worktrees, introduced snapshot save/restore for answer revisions and partial snapshots on early termination, extended hook management for native and MCP support, added startup rate-limiting, improved error handling in stream execution with tool-call validation and retry logic, extended post-evaluation with rework support, integrated comprehensive metrics collection and structured logging.
Test Coverage — Core Features
massgen/tests/test_ensemble_mode.py, massgen/tests/test_round_evaluator_loop.py, massgen/tests/test_round_evaluator_verdict.py
New test module for ensemble mode validates SubagentOrchestratorConfig defaults, skip-vote logic, synthesize prompt variants with had_voting flag, waiting logic with killed agents, manager propagation of ensemble defaults. Extended round-evaluator tests with transformation-pressure validation, richer next-tasks payload schema, success_contract requirements, serialization/deserialization of strategy metadata.
Test Coverage — Configuration & Validation
massgen/tests/test_api_params_exclusion.py, massgen/tests/test_codex_reasoning_config.py, massgen/tests/test_config_builder.py, massgen/tests/test_quickstart_wizard_config_filename.py, massgen/tests/test_web_setup_status_api.py, massgen/tests/test_web_quickstart_reasoning_sync.py
Updated tests to reflect xhigh → high default reasoning effort for GPT-5.4, adjusted Copilot/Codex metadata defaults and recommendations, added round_evaluator_transformation_pressure to KNOWN_INTERNAL_PARAMS.
Test Coverage — Orchestrator & Subagents
massgen/tests/test_orchestrator_consensus.py, massgen/tests/test_checklist_label_refresh_ordering.py, massgen/tests/test_checklist_tools_server.py, massgen/tests/test_custom_tools.py, massgen/tests/test_codex_mcp_tool_visibility.py, massgen/tests/unit/test_orchestrator_unit.py, massgen/tests/test_subagent_manager.py
Added tests for ensemble configuration with defer_voting_until_all_answered toggle, round_evaluator_strategy_summary preservation in checklist refresh, checklist submission with thesis_shift strategy and success_contract details, synthetic mode stripping logic, host-side MCP rewriting of delegated→isolated runtime modes, next_tasks payload schema evolution with success_contract/strategy_mode/approach_assessment, timeout handling with graceful fallback.
Test Coverage — Prompts & Sections
massgen/tests/test_planning_injection.py, massgen/tests/test_system_message_builder_phases.py, massgen/tests/test_system_prompt_sections.py, massgen/tests/test_specialized_subagents.py
Updated memory task assertions to verify Verification Contract section headers, added transformation-pressure flow-through in coordination message tests, expanded checklist-gated decision and round-evaluator prompt tests covering managed modes and transformation bias, verified single-thesis and narrow-scope guidance in builder and round_evaluator subagent type prompts.
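
The gentle/balanced/aggressive validation described above for the Configuration & Validation cohort could look roughly like this. This is a sketch: the constant and function names are hypothetical, not the actual identifiers in massgen/config_validator.py.

```python
# Valid modes per this PR's walkthrough; names below are illustrative.
VALID_TRANSFORMATION_PRESSURES = {"gentle", "balanced", "aggressive"}


def validate_transformation_pressure(value):
    """Validate a round_evaluator_transformation_pressure setting.

    Args:
        value: Raw config value, expected to be one of the valid mode names.

    Returns:
        str: The validated mode name.

    Raises:
        ValueError: If the value is not a recognized pressure mode.
    """
    if not isinstance(value, str) or value not in VALID_TRANSFORMATION_PRESSURES:
        raise ValueError(
            f"round_evaluator_transformation_pressure must be one of "
            f"{sorted(VALID_TRANSFORMATION_PRESSURES)}, got {value!r}",
        )
    return value
```

Note the isinstance() check: one of the failed pre-merge checks in this review flags its absence in the real validator.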

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Orchestrator
    participant Agent1 as Agent 1
    participant Agent2 as Agent 2
    participant RoundEval as Round Evaluator
    participant Synthesizer as User (Synthesis)

    rect rgba(100, 150, 255, 0.5)
    Note over Orchestrator,Agent2: Phase 1: Produce<br/>(disable_injection=true, defer_voting=true)
    Orchestrator->>Agent1: Execute task (independent context)
    Orchestrator->>Agent2: Execute task (independent context)
    Agent1->>Orchestrator: Answer 1
    Agent2->>Orchestrator: Answer 2
    end

    rect rgba(150, 200, 150, 0.5)
    Note over Orchestrator,RoundEval: Phase 2: Vote & Evaluate
    Orchestrator->>RoundEval: Present all answers<br/>Material self-improvement brief
    RoundEval->>RoundEval: Evaluate answers<br/>Apply transformation_pressure<br/>(gentle/balanced/aggressive)
    RoundEval->>Orchestrator: Vote results + next_tasks
    end

    rect rgba(200, 150, 100, 0.5)
    Note over Orchestrator,Synthesizer: Phase 3: Synthesize
    Orchestrator->>Synthesizer: Present winner answer<br/>with synthesis guidance
    Note over Synthesizer: If had_voting=true:<br/>Primary: winner's answer<br/>Secondary: strengths from others
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

The diff spans 40+ files with substantial heterogeneity: orchestrator.py introduces dense state-tracking and timeout logic; subagent models and round-evaluator SUBAGENT.md contain rich new contract/metadata requirements; prompt-building logic branches on transformation pressure; configuration validation and serialization across multiple layers; test updates across 15+ test files with evolving schema expectations. Mixed complexity requiring careful review of orchestrator control flow, payload schema evolution, and cross-layer configuration threading.

Suggested reviewers

  • a5507203
🚥 Pre-merge checks | ✅ 3 | ❌ 3

❌ Failed checks (3 warnings)

  • Description check (⚠️ Warning): The PR description is merely a template with no actual content filled in; it lacks any real description of changes, test details, or context specific to this PR. Resolution: replace the template placeholder with a concrete description of the changes, including what the new subagent contracts entail, how refinement was lightened, key files modified, and testing approach.
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 77.07%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
  • Documentation Updated (⚠️ Warning): PR missing critical documentation: helper methods lack proper Google-style docstrings, a test helper function is missing a docstring, a referenced example config file does not exist, the config validator is missing an isinstance() type check, and release notes are not updated. Resolution: add the missing docstrings, create the missing config file or update the documentation reference, add isinstance() validation for the round_evaluator_transformation_pressure parameter, and update the release notes.
✅ Passed checks (3 passed)
  • Title check (✅ Passed): The title clearly describes the main feature: introducing better subagent contracts and lighter refinement for subagents, matching the substantial changes across documentation, configuration, and testing.
  • Capabilities Registry Check (✅ Passed): Backend changes are limited to configuration adjustments for the existing GPT-5.4 model, MCP server normalization, and coordination parameter exclusion. No new models or capabilities are added.
  • Config Parameter Sync (✅ Passed): The pull request correctly adds the round_evaluator_transformation_pressure parameter to both required exclusion lists, in backend base.py and in the API params handler base.py.
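
The exclusion-list mechanism referenced in the Config Parameter Sync check can be sketched as follows. The set and function names are illustrative, not the actual identifiers in base.py; only the parameter name comes from this review.

```python
# Coordination-only settings that must never be forwarded to a model API.
INTERNAL_COORDINATION_PARAMS = {
    "round_evaluator_transformation_pressure",
}


def build_api_params(config):
    """Drop internal coordination params before calling the backend API.

    Args:
        config: Merged configuration dict for a backend call.

    Returns:
        dict: A copy of config with internal coordination params removed.
    """
    return {k: v for k, v in config.items() if k not in INTERNAL_COORDINATION_PARAMS}
```

Keeping both exclusion lists (backend and API params handler) in sync is exactly what the check verifies, since a miss in either layer would leak the setting into a model request.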



@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 11

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (7)
massgen/mcp_tools/checklist_tools_server.py (1)

128-165: ⚠️ Potential issue | 🟠 Major

Normalize optional success-contract lists before iterating.

If fail_if_any or required_evidence is serialized as null, this redirect helper raises TypeError before submit_checklist / propose_improvements can return the handoff message.

🛠️ Proposed fix
     if isinstance(success_contract, dict):
         outcome_statement = str(success_contract.get("outcome_statement") or "").strip()
         quality_bar = str(success_contract.get("quality_bar") or "").strip()
-        fail_if_any = [str(item).strip() for item in success_contract.get("fail_if_any", []) if str(item).strip()]
-        required_evidence = [str(item).strip() for item in success_contract.get("required_evidence", []) if str(item).strip()]
+        raw_fail_if_any = success_contract.get("fail_if_any") or []
+        raw_required_evidence = success_contract.get("required_evidence") or []
+        fail_if_any = (
+            [str(item).strip() for item in raw_fail_if_any if str(item).strip()]
+            if isinstance(raw_fail_if_any, list)
+            else []
+        )
+        required_evidence = (
+            [str(item).strip() for item in raw_required_evidence if str(item).strip()]
+            if isinstance(raw_required_evidence, list)
+            else []
+        )
         if outcome_statement:
             parts.append(f"Success contract outcome: {outcome_statement}.")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@massgen/mcp_tools/checklist_tools_server.py` around lines 128 - 165, The code
fails when success_contract["fail_if_any"] or ["required_evidence"] is null
because the list comprehensions expect an iterable; normalize these fields to an
empty list before iterating. Update the handling in the success_contract block
(around variables fail_if_any and required_evidence in
checklist_tools_server.py) to use something like
success_contract.get("fail_if_any") or [] and
success_contract.get("required_evidence") or [] (then apply the existing list
comprehensions) so null/None becomes an empty list and avoids TypeError while
building parts.
massgen/tests/test_round_evaluator_verdict.py (1)

479-487: ⚠️ Potential issue | 🟡 Minor

Cover incremental_override_reason in this load test too.

This path was updated for the new next_tasks fields, but one of the newly added fields is still unasserted. If deserialization starts dropping incremental_override_reason, this test stays green.

💡 Suggested assertion
         assert result.next_tasks_execution_scope == SAMPLE_NEXT_TASKS["execution_scope"]
         assert result.next_tasks_strategy_mode == SAMPLE_NEXT_TASKS["strategy_mode"]
+        assert (
+            result.next_tasks["incremental_override_reason"]
+            == SAMPLE_NEXT_TASKS["incremental_override_reason"]
+        )
         assert result.next_tasks_success_contract == SAMPLE_NEXT_TASKS["success_contract"]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@massgen/tests/test_round_evaluator_verdict.py` around lines 479 - 487, Add an
assertion covering the incremental_override_reason field in the same block that
verifies other next_tasks fields: check that result.next_tasks is not None and
then assert result.next_tasks_incremental_override_reason ==
SAMPLE_NEXT_TASKS["incremental_override_reason"] (matching the existing pattern
of result.next_tasks_objective, result.next_tasks_primary_strategy, etc.) so
deserialization of incremental_override_reason is validated in
test_round_evaluator_verdict.py.
massgen/tests/test_subagent_manager.py (1)

1967-1987: ⚠️ Potential issue | 🟡 Minor

Align these fixtures with the canonical next_tasks contract.

These blocks now add the richer v2-style fields, but they still live in a separate hand-maintained payload instead of reusing the canonical helper added in massgen/tests/test_round_evaluator_verdict.py. That helper is already modeling the newer schema, so drift here will let the hoist path pass against an artifact the evaluator should no longer emit.

Also applies to: 2116-2136

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@massgen/tests/test_subagent_manager.py` around lines 1967 - 1987, The test
fixtures embed v2-style fields manually (e.g., success_contract, strategy_mode,
approach_assessment, tasks) instead of reusing the canonical next_tasks helper
used elsewhere; replace the hand-maintained payload with a call to the canonical
next_tasks helper (the shared next_tasks fixture used by the round evaluator
tests) and apply any small overrides via dict/merge only when necessary so the
fixture matches the canonical next_tasks contract; make the same change for the
second occurrence noted (the block around the other similar lines).
massgen/cli.py (1)

3450-3454: ⚠️ Potential issue | 🟡 Minor

Add Google-style Args/Returns to this changed function docstring.

Line 3450 has only a summary docstring; since _parse_coordination_config changed, it should include Google-style parameter and return sections.

Proposed docstring update
 def _parse_coordination_config(coord_cfg: dict[str, Any]) -> "CoordinationConfig":
-    """Parse a coordination config dict into a CoordinationConfig object.
-
-    Centralizes the parsing logic used by run_question_with_history,
-    run_single_question, and run_textual_interactive_mode.
-    """
+    """Parse a coordination config dict into a CoordinationConfig object.
+
+    Centralizes parsing logic used by run_question_with_history,
+    run_single_question, and run_textual_interactive_mode.
+
+    Args:
+        coord_cfg: Raw coordination configuration dictionary.
+
+    Returns:
+        CoordinationConfig: Parsed coordination configuration object.
+    """

As per coding guidelines, **/*.py: For new or changed functions, include Google-style docstrings.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@massgen/cli.py` around lines 3450 - 3454, The docstring for
_parse_coordination_config is missing Google-style Args/Returns sections; update
the function's docstring to include an Args section describing the
coordination_config parameter (type and meaning) and any other parameters used
by _parse_coordination_config, and a Returns section describing the
CoordinationConfig object (type and what it contains). Keep the existing summary
lines, add sections labeled "Args:" and "Returns:" following Google-style
formatting, and mirror descriptions used by callers like
run_question_with_history, run_single_question, and run_textual_interactive_mode
to ensure consistency.
massgen/tests/unit/test_orchestrator_unit.py (1)

603-623: ⚠️ Potential issue | 🟡 Minor

Update these fixtures to the schema-v2 next_tasks.json contract.

Both payloads still use schema_version: "1" and skip newer task fields like strategy_role / execution. That lets the managed auto-injection path go green without ever parsing the handoff shape this PR documents and validates elsewhere.

Also applies to: 739-758

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@massgen/tests/unit/test_orchestrator_unit.py` around lines 603 - 623, The
fixtures in test_orchestrator_unit.py use the old schema_version "1" and omit
new schema-v2 task fields; update each next_tasks payload (e.g., the object with
id "rewrite_core" and the other payload mentioned) to "schema_version": "2" and
add the missing schema-v2 fields (for example include strategy_role, execution,
any new task-level fields required by next_tasks.json v2 and ensure
approach_assessment/strategy_mode conform to v2 shape) so the unit tests
exercise the new handoff shape rather than relying on auto-injection.
massgen/subagent_types/round_evaluator/SUBAGENT.md (1)

307-325: ⚠️ Potential issue | 🟠 Major

Define incremental_override_reason in the JSON contract itself.

The rules require this field when an elevated-ceiling plan stays in incremental_refinement, but the sample object never shows where it belongs or what shape it has. Right now the subagent would have to guess, which is a bad failure mode for a machine-parsed artifact.

Also applies to: 419-426

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@massgen/subagent_types/round_evaluator/SUBAGENT.md` around lines 307 - 325,
Add a concrete definition for incremental_override_reason to the JSON contract
so machines don't have to guess its presence or shape: include an
incremental_override_reason field (string or small object) at the same level as
success_contract (or inside success_contract if your schema groups plan metadata
there), provide a "type"/"description" and an "example" value, and document that
this field is required when strategy_mode or plan state equals
"incremental_refinement" (referencing incremental_refinement and
success_contract in the doc); mirror the same addition in the other sample block
mentioned (lines 419-426).
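
One possible shape for the missing field, purely illustrative since the PR does not yet pin this down, placing incremental_override_reason at the same level as success_contract:

```json
{
  "strategy_mode": "incremental_refinement",
  "incremental_override_reason": "Elevated ceiling: the current thesis already meets the quality bar; a rewrite would risk regressions.",
  "success_contract": {
    "outcome_statement": "...",
    "quality_bar": "...",
    "fail_if_any": [],
    "required_evidence": []
  }
}
```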
massgen/message_templates.py (1)

720-772: ⚠️ Potential issue | 🟡 Minor

Update docstring to document the new had_voting parameter.

The method signature now includes had_voting: bool = False, which changes synthesize-mode behavior, but the docstring remains a single-line summary without documenting this parameter. Add a Google-style docstring with parameter documentation to match the coding guidelines for **/*.py files.

Additionally, verify the call site at test_round_evaluator_verdict.py:877 to ensure it intentionally omits the flag or update it to explicitly pass had_voting if needed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@massgen/message_templates.py` around lines 720 - 772, The function
build_final_presentation_message adds a new parameter had_voting but the
docstring is still a single-line summary; update its docstring to a Google-style
multi-line docstring documenting all parameters (original_task, vote_summary,
all_answers, selected_agent_id, agent_changedocs, final_answer_strategy,
had_voting) and the return value, and briefly describe how had_voting alters
synthesize behavior; then check the call site referenced
(test_round_evaluator_verdict.py at the test around line 877) and either leave
the call as-is if omission is intentional (relies on default False) or update
that test invocation to pass had_voting explicitly to reflect the intended test
scenario.
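
A Google-style docstring for the changed method could read roughly like this. The signature is reduced to a few of the parameters the review names, the body is a stand-in, and the synthesis guidance text is taken from the sequence diagram above; none of this is the actual implementation.

```python
def final_presentation_message(original_task, vote_summary, had_voting=False):
    """Build the final-presentation prompt for the synthesizing agent.

    Args:
        original_task: The user's original request text.
        vote_summary: Summary of the voting round, if any.
        had_voting: Whether a voting phase actually ran. When True, synthesis
            guidance treats the winner's answer as primary and the strengths
            of other answers as secondary input.

    Returns:
        str: The assembled presentation message.
    """
    guidance = "Primary: winner's answer." if had_voting else "No vote occurred."
    return f"Task: {original_task}\n{vote_summary}\n{guidance}"
```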
🧹 Nitpick comments (10)
massgen/configs/features/round_evaluator_example.yaml (1)

66-70: Add “What happens” comments for the new evaluator knobs.

The new round_evaluator_transformation_pressure and max_new_answers entries are clear to core contributors, but brief execution-flow comments here will make the example safer to copy/paste.

♻️ Suggested comment-only improvement
-    round_evaluator_transformation_pressure: balanced
+    # What happens: controls evaluator rewrite boldness (gentle|balanced|aggressive).
+    round_evaluator_transformation_pressure: balanced
@@
-      max_new_answers: 3
+      # What happens: maximum evaluator re-attempts per spawned evaluator agent.
+      max_new_answers: 3

As per coding guidelines: massgen/configs/**/*.yaml should include "What happens" comments explaining execution flow.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@massgen/configs/features/round_evaluator_example.yaml` around lines 66 - 70,
Add "What happens" comments explaining execution flow for the new evaluator
knobs: describe how round_evaluator_transformation_pressure modifies
transformation selection/priority during round evaluation (e.g., controls bias
toward aggressive vs conservative transformations) and how
subagent_orchestrator.max_new_answers limits the number of new candidate answers
spawned per orchestration cycle; place these comments adjacent to the
round_evaluator_transformation_pressure and
subagent_orchestrator.max_new_answers entries in the example YAML so copy/pasted
configs include guidance for runtime behavior and expected effect on evaluation
flow.
massgen/tests/test_planning_injection.py (1)

281-283: Expand header assertions to cover the full memo contract.

Given the updated verification memo structure, consider also asserting ## Inputs and Artifacts and ## Replay Steps to catch partial prompt regressions.

🧪 Suggested assertion expansion
         assert "## Verification Contract" in out[-1]["description"]
+        assert "## Inputs and Artifacts" in out[-1]["description"]
+        assert "## Replay Steps" in out[-1]["description"]
         assert "## Latest Verification Result" in out[-1]["description"]
         assert "## Stale If" in out[-1]["description"]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@massgen/tests/test_planning_injection.py` around lines 281 - 283, Update the
assertions that check the verification memo in the test to include the new
headers; specifically, in the block that asserts on out[-1]["description"] (the
verification memo), add assertions for "## Inputs and Artifacts" and "## Replay
Steps" in addition to the existing "## Verification Contract", "## Latest
Verification Result", and "## Stale If" checks so the test fails on partial
prompt regressions.
massgen/backend/codex.py (1)

650-659: Handle inline runtime-mode form to make normalization more robust.

Current rewrite only catches the split-token shape (--runtime-mode delegated). If args arrive as --runtime-mode=delegated, Line 651-656 won’t rewrite and background subagents may still run delegated.

🛠️ Suggested robustness patch
 @staticmethod
 def _rewrite_subagent_runtime_mode_for_background(args: list[Any]) -> list[Any]:
@@
-    for index, token in enumerate(rewritten[:-1]):
-        if token != "--runtime-mode":
-            continue
-        runtime_mode = rewritten[index + 1]
-        if str(runtime_mode).strip().lower() != "delegated":
-            continue
-        rewritten[index + 1] = "isolated"
-        changed = True
+    for index, token in enumerate(rewritten):
+        token_str = str(token).strip()
+
+        # Support inline form: --runtime-mode=delegated
+        if token_str.lower().startswith("--runtime-mode="):
+            mode = token_str.split("=", 1)[1].strip().lower()
+            if mode == "delegated":
+                rewritten[index] = "--runtime-mode=isolated"
+                changed = True
+            continue
+
+        # Support split form: --runtime-mode delegated
+        if token_str != "--runtime-mode" or index + 1 >= len(rewritten):
+            continue
+        runtime_mode = rewritten[index + 1]
+        if str(runtime_mode).strip().lower() != "delegated":
+            continue
+        rewritten[index + 1] = "isolated"
+        changed = True

Based on learnings: For cross-backend tool-calling behavior, backend-side normalization/validation should be the source of truth and should tolerate argument-shape variance.
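
As a standalone sketch of handling both argument shapes, the rewrite can be expressed as below. The helper name mirrors the one in the diff, but this version is illustrative rather than the actual CodexBackend code.

```python
def rewrite_subagent_runtime_mode(args):
    """Rewrite --runtime-mode delegated -> isolated in split and inline forms.

    Args:
        args: CLI argument tokens for a background MCP server command.

    Returns:
        list: A copy of args (stringified) with any delegated mode forced to isolated.
    """
    rewritten = [str(a) for a in args]
    for i, token in enumerate(rewritten):
        t = token.strip()
        if t.lower().startswith("--runtime-mode="):
            # Inline form: --runtime-mode=delegated
            if t.split("=", 1)[1].strip().lower() == "delegated":
                rewritten[i] = "--runtime-mode=isolated"
        elif t == "--runtime-mode" and i + 1 < len(rewritten):
            # Split form: --runtime-mode delegated
            if rewritten[i + 1].strip().lower() == "delegated":
                rewritten[i + 1] = "isolated"
    return rewritten
```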

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@massgen/backend/codex.py` around lines 650 - 659, The current normalization
loop in the code that rewrites command args (variables/identifiers: rewritten,
args, token, runtime_mode) only handles the separated form "--runtime-mode
delegated" and misses the inline form "--runtime-mode=delegated"; update the
loop in the function performing this rewrite so it also recognizes tokens that
start with "--runtime-mode=" (or contains "="), parse the RHS value, and if it
equals "delegated" (case-insensitive) replace it with "isolated" by either
setting rewritten[index] to "--runtime-mode=isolated" or splitting and updating
rewritten[index + 1] consistently, preserve the changed flag, and return
rewritten if changed else args as before.
massgen/tests/test_specialized_subagents.py (1)

347-421: Use Google-style docstrings for the newly added test functions.

These new test docstrings are summary-only; please convert them to Google-style format to match repo standards.

Suggested pattern (apply similarly to the new tests in this block)
 def test_round_evaluator_prompt_defines_transformation_pressure_contract():
-    """round_evaluator should define how transformation pressure biases thesis selection."""
+    """Validate round_evaluator transformation-pressure contract.
+
+    Returns:
+        None.
+    """

As per coding guidelines: **/*.py: For new or changed functions, include Google-style docstrings.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@massgen/tests/test_specialized_subagents.py` around lines 347 - 421, Replace
the summary-only triple-quoted docstrings in the four new test functions
(test_round_evaluator_prompt_commits_to_one_material_next_round_thesis,
test_round_evaluator_prompt_defines_transformation_pressure_contract,
test_round_evaluator_prompt_requires_narrow_builder_task_scopes,
test_builder_prompt_requires_one_coherent_goal_and_split_same_file_shopping_lists)
with Google-style docstrings: start with a one-line summary, then a blank line
and an "Args:" section (can be omitted or empty for tests) and/or a "Returns:"/
"Raises:" section only if applicable; keep the content concise and preserve the
intent of the original descriptions. Ensure each function's docstring follows
the repository's Google-style format consistently.
massgen/tests/test_config_builder.py (1)

542-580: Add docstrings to these touched reasoning-profile tests.

The expectation updates look good, but the three modified test functions in this block still lack docstrings, so the changed surface falls short of the repo-wide Python docstring rule.

As per coding guidelines, "**/*.py: For new or changed functions, include Google-style docstrings".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@massgen/tests/test_config_builder.py` around lines 542 - 580, Add
Google-style docstrings to the three modified test functions:
test_reasoning_profile_for_codex_includes_xhigh,
test_reasoning_profile_for_codex_gpt54_defaults_high, and
test_reasoning_profile_for_copilot_uses_runtime_metadata. For each, insert a
short docstring immediately under the def that summarizes the test purpose
(one-line summary plus optional one-sentence detail), and include Args/Returns
sections only if relevant per repo convention; keep them concise and
Google-style so the functions comply with the repo-wide Python docstring rule.
massgen/tests/test_subagent_manager.py (1)

2004-2006: Assert on parsed JSON instead of a raw substring.

A malformed or truncated next_tasks.json would still pass as long as that one text fragment survived. Load the file and assert the new structured fields so these tests actually cover the schema change.

💡 Suggested assertion pattern
-        assert '"strategy_mode": "thesis_shift"' in next_tasks_path.read_text(encoding="utf-8")
+        persisted = json.loads(next_tasks_path.read_text(encoding="utf-8"))
+        assert persisted["strategy_mode"] == "thesis_shift"
+        assert persisted["incremental_override_reason"] == ""
+        assert "success_contract" in persisted

Also applies to: 2153-2155
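A standalone illustration of why the parsed-JSON assertion is stricter than the substring check. The payload values here are invented for the demo; only the field names mirror the suggestion:

```python
import json
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    next_tasks_path = Path(tmp) / "next_tasks.json"
    next_tasks_path.write_text(
        json.dumps(
            {
                "strategy_mode": "thesis_shift",
                "incremental_override_reason": "",
                "success_contract": {"fail_if_any": []},
            }
        ),
        encoding="utf-8",
    )

    # Fragile: passes as long as this one fragment survives, even if the
    # surrounding JSON were malformed or truncated.
    assert '"strategy_mode": "thesis_shift"' in next_tasks_path.read_text(encoding="utf-8")

    # Robust: json.loads fails fast on malformed files, and the assertions
    # check the actual structure rather than a text fragment.
    persisted = json.loads(next_tasks_path.read_text(encoding="utf-8"))
    assert persisted["strategy_mode"] == "thesis_shift"
    assert "success_contract" in persisted
```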

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@massgen/tests/test_subagent_manager.py` around lines 2004 - 2006, The test
currently asserts a raw substring in next_tasks.json which can yield false
positives; instead read and parse the JSON from next_tasks_path (via
next_tasks_path.read_text + json.loads) and assert the structured fields such as
the top-level "strategy_mode" equals "thesis_shift" and any other expected
keys/values; update both occurrences (around next_tasks_path and the similar
block at the other location) to parse the file and use direct dict assertions
rather than substring matching.
massgen/subagent/models.py (1)

650-652: Consider documenting the new fields in the class docstring.

The new next_tasks_strategy_mode, next_tasks_incremental_override_reason, and next_tasks_success_contract fields are not documented in the class docstring. While existing fields like verdict and scores also lack documentation, adding descriptions for these new strategy-related fields would improve maintainability.
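One way the docstring could read, sketched on an illustrative dataclass. The class name and field defaults are assumptions; only the three attribute names come from the review finding:

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class RoundEvaluatorResult:
    """Illustrative result model for the round-evaluator handoff.

    Attributes:
        next_tasks_strategy_mode: One of "incremental_refinement",
            "thesis_shift", or None when no strategy was emitted.
        next_tasks_incremental_override_reason: Optional human-readable
            explanation for why the incremental default was overridden.
        next_tasks_success_contract: Optional dict describing the success
            criteria the produced tasks must satisfy.
    """

    next_tasks_strategy_mode: Optional[str] = None
    next_tasks_incremental_override_reason: Optional[str] = None
    next_tasks_success_contract: Optional[dict[str, Any]] = None
```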

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@massgen/subagent/models.py` around lines 650 - 652, Add short descriptions
for the new fields next_tasks_strategy_mode,
next_tasks_incremental_override_reason, and next_tasks_success_contract into the
class docstring where the model is defined (document their purpose, allowed
values/types and when they are used); reference the exact attribute names in the
docstring so future readers can map them to the attributes, e.g., state that
next_tasks_strategy_mode is one of "incremental_refinement"|"thesis_shift"|None,
next_tasks_incremental_override_reason is an optional human-readable string
explaining an override, and next_tasks_success_contract is an optional dict
describing expected success criteria/contract for produced tasks.
massgen/system_prompt_sections.py (2)

4293-4304: Fix malformed bullet rendering for injected pressure guidance.

pressure_guidance already contains bullet lines. Prefixing it with "- " produces a duplicated first marker (- - ...) and inconsistent indentation for following lines, which can degrade prompt readability.

💡 Suggested fix
-        pressure_guidance = _build_round_evaluator_transformation_pressure_guidance(
+        pressure_guidance = _build_round_evaluator_transformation_pressure_guidance(
             round_evaluator_transformation_pressure,
-        ).replace("Transformation-pressure contract:\n", "")
+        ).replace("Transformation-pressure contract:\n", "").strip()
+        pressure_guidance_block = "\n".join(
+            f"   {line}" for line in pressure_guidance.splitlines()
+        )
         gate_step = (
             "2. **Choose the correct gate before you start building.**\n"
             "   - The orchestrator-managed round evaluator is for material self-improvement.\n"
             "   - If the round-evaluator header says tasks were auto-injected into your task plan:\n"
             "     this is the normal path. Call `get_task_plan`, implement that one committed\n"
             "     next-round thesis, verify it, and do NOT call `submit_checklist` or\n"
             "     `propose_improvements`.\n"
-            f"   - {pressure_guidance.strip()}\n"
+            f"{pressure_guidance_block}\n"
             "   - Otherwise, the checklist branch is degraded fallback: evaluate the existing\n"
             "     answers, identify gaps and improvements, then `submit_checklist` with your\n"
             "     scores. Do NOT make edits to the deliverable before the checklist verdict —\n"
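The re-indentation step in the suggested fix can be isolated into a small helper. This is a minimal sketch with invented guidance text; only the header string and the three-space indent mirror the diff above:

```python
def indent_guidance(guidance: str, prefix: str = "   ") -> str:
    """Re-indent pre-bulleted guidance so it aligns with sibling list items.

    The guidance already carries its own "- " markers, so no extra marker
    is added; each line just receives the surrounding list's indentation.
    """
    stripped = guidance.replace("Transformation-pressure contract:\n", "").strip()
    return "\n".join(f"{prefix}{line}" for line in stripped.splitlines())


pressure_guidance = (
    "Transformation-pressure contract:\n"
    "- Prefer one material thesis per round.\n"
    "- Reject cosmetic-only changes."
)
block = indent_guidance(pressure_guidance)
# Every guidance line keeps its single "-" marker and gains uniform indentation,
# so no "- -" double marker can appear when the block is spliced into gate_step.
```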
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@massgen/system_prompt_sections.py` around lines 4293 - 4304, The bullet
marker is being prepended to pressure_guidance, causing a duplicated "- -" and
misaligned bullets; update the gate_step assembly to insert pressure_guidance
(built by _build_round_evaluator_transformation_pressure_guidance and stored in
pressure_guidance) without adding an extra "- " prefix and ensure its lines are
properly indented to match the surrounding bullets (e.g., trim and inject
pressure_guidance directly or re-indent its lines to include the same leading
spaces used by the other list items in gate_step).

769-770: Align docstrings for changed functions with Google style.

The new/changed signatures introduce parameters without Google-style Args/Returns blocks, which makes these high-traffic prompt builders harder to maintain.

As per coding guidelines, **/*.py: For new or changed functions, include Google-style docstrings.

Also applies to: 809-823, 4273-4275

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@massgen/system_prompt_sections.py` around lines 769 - 770, Update the
docstring for _build_round_evaluator_transformation_pressure_guidance to follow
Google style: add an "Args" section documenting the parameter pressure (type and
allowed values/meaning) and a "Returns" section describing the returned str, and
ensure the one-line summary stays as the first line; apply the same Google-style
Args/Returns docstring pattern to the other changed prompt-builder functions
mentioned (those around lines 809-823 and 4273-4275) so all new/modified
functions include clear Args and Returns blocks.
massgen/orchestrator.py (1)

11015-11016: Minor: Redundant string conversion.

Each item is converted with str(item).strip() twice: once for filtering and once for the output. Consider using the walrus operator to evaluate once.

♻️ Proposed refactor
-                fail_if_any = [str(item).strip() for item in success_contract.get("fail_if_any", []) if str(item).strip()]
-                required_evidence = [str(item).strip() for item in success_contract.get("required_evidence", []) if str(item).strip()]
+                fail_if_any = [s for item in success_contract.get("fail_if_any", []) if (s := str(item).strip())]
+                required_evidence = [s for item in success_contract.get("required_evidence", []) if (s := str(item).strip())]
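The walrus pattern generalizes to any filter-and-collect comprehension. A quick standalone check, with invented contract data:

```python
success_contract = {"fail_if_any": ["  tests fail  ", "", "   ", "lint errors"]}

# Single pass: (s := str(item).strip()) binds the stripped value once,
# so it serves both as the filter condition and as the collected element.
fail_if_any = [
    s
    for item in success_contract.get("fail_if_any", [])
    if (s := str(item).strip())
]
```

Blank and whitespace-only entries are filtered out without calling `str(item).strip()` a second time.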
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@massgen/orchestrator.py` around lines 11015 - 11016, The two list
comprehensions building fail_if_any and required_evidence perform
str(item).strip() twice; refactor them to compute the stripped value once using
the walrus operator so you both filter and collect the stripped string in a
single expression (operate on success_contract.get(...) and update the variables
fail_if_any and required_evidence accordingly), referencing the existing
success_contract, fail_if_any, and required_evidence names to locate and replace
the redundant comprehensions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/modules/composition.md`:
- Around line 146-147: The docs reference an example config
"configs/voting/ensemble_mode.yaml" that doesn't exist and also omits the
project prefix; update the documentation or repo: either add the missing example
file at massgen/configs/voting/ensemble_mode.yaml or change the reference in
docs/modules/composition.md to point to an existing file (e.g.,
massgen/configs/voting/gemini_gpt_voting_sensitivity.yaml) and correct the path
to include the "massgen/" prefix; also ensure the related reference to
docs/source/reference/yaml_schema.rst remains correct.

In `@docs/modules/subagents.md`:
- Around line 188-215: Update the opening of this section to explicitly state
that both coordination.orchestrator_managed_round_evaluator and the
round_evaluator_before_checklist flag are required together to enable the
orchestrator-managed round-evaluator flow; revise the first sentence/validation
list to mention both flags by name (round_evaluator_before_checklist and
orchestrator_managed_round_evaluator) and note that omitting either produces a
degraded fallback, and ensure any subsequent sentences that implied a
single-flag enablement are adjusted to reflect the dual-flag prerequisite.

In `@docs/source/reference/yaml_schema.rst`:
- Around line 1206-1209: The inline literal for defer_voting_until_all_answered:
true is split across a newline causing RST parsing warnings; update the sentence
so the entire inline literal (and related literal `disable_injection: true`)
stays on a single line or use a single inline literal span for the whole clause
(e.g., ``disable_injection: true, defer_voting_until_all_answered: true``) and
keep the reference to `subagent_orchestrator` unchanged so Sphinx parses
correctly.
- Around line 1043-1047: The description for round_evaluator_refine is unclear
due to the phrase "run iterate"; update the text to a clearer wording such as
"allow the evaluator child to perform iterative refinement before producing its
packet (default: ``false``)" so it accurately conveys that the evaluator will
execute its iterative refinement process prior to emitting the packet; modify
the schema doc entry for round_evaluator_refine accordingly.
- Around line 1179-1238: Update the v0.1.61 release notes in the Recent Releases
section to mention the new ensemble coordination pattern and the related
configuration options introduced in this release: reference the ensemble
defaults (disable_injection, defer_voting_until_all_answered,
max_new_answers_per_agent, final_answer_strategy) and the round evaluator option
round_evaluator_transformation_pressure alongside the existing round
evaluator/subagent notes so the entry covers both the ensemble pattern and the
round evaluator subagent features.

In `@massgen/backend/codex.py`:
- Around line 625-646: Replace the single-line docstrings on the two helper
methods with Google-style docstrings: update
_normalize_background_server_config(server_cfg: dict[str, Any]) -> dict[str,
Any] and _rewrite_subagent_runtime_mode_for_background(args: list[Any]) ->
list[Any] to include a short summary, Args and Returns sections (and Raises if
applicable), describing the parameters (server_cfg / args), what the function
returns (normalized config / rewritten args), and the behavior (rewrites server
config for subagent background mode / forces isolated runtime mode for subagent
MCP servers); keep wording concise and follow existing project Google-style
examples for formatting.
- Around line 625-660: ClaudeCodeBackend lacks the same background MCP server
config normalization found in CodexBackend, so add a private
_normalize_background_server_config(server_cfg: dict[str, Any]) (or reuse the
existing logic from CodexBackend including the helper
_rewrite_subagent_runtime_mode_for_background) to ClaudeCodeBackend and ensure
_get_background_mcp_servers() calls it before returning server configs; then add
a unit test mirroring
test_codex_background_mcp_client_rewrites_delegated_subagent_runtime_mode_for_host
(e.g.,
test_claude_code_background_mcp_client_rewrites_delegated_subagent_runtime_mode_for_host)
to assert delegated→isolated rewriting for ClaudeCodeBackend background MCP
servers.

In `@massgen/config_validator.py`:
- Line 166: The validator currently does a membership test against
VALID_ROUND_EVALUATOR_TRANSFORMATION_PRESSURE which will raise TypeError if
round_evaluator_transformation_pressure is a list/dict; update the check in the
validation function that reads the round_evaluator_transformation_pressure field
to first ensure the value is a str (or None) and, if it's not, append a
ValidationIssue indicating an invalid type, otherwise perform the membership
test against VALID_ROUND_EVALUATOR_TRANSFORMATION_PRESSURE (and do the same
pattern for the other occurrences around the 1136-1144 area) so malformed YAML
yields a ValidationIssue instead of crashing.

In `@massgen/tests/test_ensemble_mode.py`:
- Around line 84-87: The skip-guard in the ModuleNotFoundError handlers is
checking for the wrong module name ("claude_code_sdk") so missing
"claude_agent_sdk" raises instead of skipping; update the checks in the except
ModuleNotFoundError blocks (the handlers that inspect str(e)) to look for
"claude_agent_sdk" instead of "claude_code_sdk" in all occurrences in the test
suite (including the handlers in test_ensemble_mode.py and the similar guard in
test_final_presentation_fallback.py) so the tests correctly call pytest.skip
when claude_agent_sdk is not installed.

In `@massgen/tests/test_round_evaluator_loop.py`:
- Around line 40-124: The fixture _make_valid_next_tasks_payload() has a
mismatch: strategy_mode is "incremental_refinement" but the task entries use
strategy_role "thesis_shift"; update the two task dicts (the entries in
"evolution_tasks" and "tasks") to use strategy_role "incremental_refinement"
(replace "thesis_shift") so the overall payload is coherent; locate and edit the
task objects in _make_valid_next_tasks_payload() by updating their
"strategy_role" fields to match the top-level "strategy_mode".

In `@massgen/tests/test_round_evaluator_verdict.py`:
- Around line 82-83: Add a Google-style docstring to the helper function
_make_valid_next_tasks_payload(): insert a concise one-line summary of what the
helper returns, a blank line, an Args section if any parameters (none in this
case) or omit, and a Returns section describing the dict shape and types (e.g.,
keys and value types for the next-tasks payload), and optionally an Example
showing typical usage; place this docstring immediately below the def
_make_valid_next_tasks_payload() declaration so it satisfies the repo's
docstring requirement.

---

Outside diff comments:
In `@massgen/cli.py`:
- Around line 3450-3454: The docstring for _parse_coordination_config is missing
Google-style Args/Returns sections; update the function's docstring to include
an Args section describing the coordination_config parameter (type and meaning)
and any other parameters used by _parse_coordination_config, and a Returns
section describing the CoordinationConfig object (type and what it contains).
Keep the existing summary lines, add sections labeled "Args:" and "Returns:"
following Google-style formatting, and mirror descriptions used by callers like
run_question_with_history, run_single_question, and run_textual_interactive_mode
to ensure consistency.

In `@massgen/mcp_tools/checklist_tools_server.py`:
- Around line 128-165: The code fails when success_contract["fail_if_any"] or
["required_evidence"] is null because the list comprehensions expect an
iterable; normalize these fields to an empty list before iterating. Update the
handling in the success_contract block (around variables fail_if_any and
required_evidence in checklist_tools_server.py) to use something like
success_contract.get("fail_if_any") or [] and
success_contract.get("required_evidence") or [] (then apply the existing list
comprehensions) so null/None becomes an empty list and avoids TypeError while
building parts.

In `@massgen/message_templates.py`:
- Around line 720-772: The function build_final_presentation_message adds a new
parameter had_voting but the docstring is still a single-line summary; update
its docstring to a Google-style multi-line docstring documenting all parameters
(original_task, vote_summary, all_answers, selected_agent_id, agent_changedocs,
final_answer_strategy, had_voting) and the return value, and briefly describe
how had_voting alters synthesize behavior; then check the call site referenced
(test_round_evaluator_verdict.py at the test around line 877) and either leave
the call as-is if omission is intentional (relies on default False) or update
that test invocation to pass had_voting explicitly to reflect the intended test
scenario.

In `@massgen/subagent_types/round_evaluator/SUBAGENT.md`:
- Around line 307-325: Add a concrete definition for incremental_override_reason
to the JSON contract so machines don't have to guess its presence or shape:
include an incremental_override_reason field (string or small object) at the
same level as success_contract (or inside success_contract if your schema groups
plan metadata there), provide a "type"/"description" and an "example" value, and
document that this field is required when strategy_mode or plan state equals
"incremental_refinement" (referencing incremental_refinement and
success_contract in the doc); mirror the same addition in the other sample block
mentioned (lines 419-426).

In `@massgen/tests/test_round_evaluator_verdict.py`:
- Around line 479-487: Add an assertion covering the incremental_override_reason
field in the same block that verifies other next_tasks fields: check that
result.next_tasks is not None and then assert
result.next_tasks_incremental_override_reason ==
SAMPLE_NEXT_TASKS["incremental_override_reason"] (matching the existing pattern
of result.next_tasks_objective, result.next_tasks_primary_strategy, etc.) so
deserialization of incremental_override_reason is validated in
test_round_evaluator_verdict.py.

In `@massgen/tests/test_subagent_manager.py`:
- Around line 1967-1987: The test fixtures embed v2-style fields manually (e.g.,
success_contract, strategy_mode, approach_assessment, tasks) instead of reusing
the canonical next_tasks helper used elsewhere; replace the hand-maintained
payload with a call to the canonical next_tasks helper (the shared next_tasks
fixture used by the round evaluator tests) and apply any small overrides via
dict/merge only when necessary so the fixture matches the canonical next_tasks
contract; make the same change for the second occurrence noted (the block around
the other similar lines).

In `@massgen/tests/unit/test_orchestrator_unit.py`:
- Around line 603-623: The fixtures in test_orchestrator_unit.py use the old
schema_version "1" and omit new schema-v2 task fields; update each next_tasks
payload (e.g., the object with id "rewrite_core" and the other payload
mentioned) to "schema_version": "2" and add the missing schema-v2 fields (for
example include strategy_role, execution, any new task-level fields required by
next_tasks.json v2 and ensure approach_assessment/strategy_mode conform to v2
shape) so the unit tests exercise the new handoff shape rather than relying on
auto-injection.

---

Nitpick comments:
In `@massgen/backend/codex.py`:
- Around line 650-659: The current normalization loop in the code that rewrites
command args (variables/identifiers: rewritten, args, token, runtime_mode) only
handles the separated form "--runtime-mode delegated" and misses the inline form
"--runtime-mode=delegated"; update the loop in the function performing this
rewrite so it also recognizes tokens that start with "--runtime-mode=" (or
contains "="), parse the RHS value, and if it equals "delegated"
(case-insensitive) replace it with "isolated" by either setting rewritten[index]
to "--runtime-mode=isolated" or splitting and updating rewritten[index + 1]
consistently, preserve the changed flag, and return rewritten if changed else
args as before.

In `@massgen/configs/features/round_evaluator_example.yaml`:
- Around line 66-70: Add "What happens" comments explaining execution flow for
the new evaluator knobs: describe how round_evaluator_transformation_pressure
modifies transformation selection/priority during round evaluation (e.g.,
controls bias toward aggressive vs conservative transformations) and how
subagent_orchestrator.max_new_answers limits the number of new candidate answers
spawned per orchestration cycle; place these comments adjacent to the
round_evaluator_transformation_pressure and
subagent_orchestrator.max_new_answers entries in the example YAML so copy/pasted
configs include guidance for runtime behavior and expected effect on evaluation
flow.

In `@massgen/orchestrator.py`:
- Around line 11015-11016: The two list comprehensions building fail_if_any and
required_evidence perform str(item).strip() twice; refactor them to compute the
stripped value once using the walrus operator so you both filter and collect the
stripped string in a single expression (operate on success_contract.get(...) and
update the variables fail_if_any and required_evidence accordingly), referencing
the existing success_contract, fail_if_any, and required_evidence names to
locate and replace the redundant comprehensions.

In `@massgen/subagent/models.py`:
- Around line 650-652: Add short descriptions for the new fields
next_tasks_strategy_mode, next_tasks_incremental_override_reason, and
next_tasks_success_contract into the class docstring where the model is defined
(document their purpose, allowed values/types and when they are used); reference
the exact attribute names in the docstring so future readers can map them to the
attributes, e.g., state that next_tasks_strategy_mode is one of
"incremental_refinement"|"thesis_shift"|None,
next_tasks_incremental_override_reason is an optional human-readable string
explaining an override, and next_tasks_success_contract is an optional dict
describing expected success criteria/contract for produced tasks.

In `@massgen/system_prompt_sections.py`:
- Around line 4293-4304: The bullet marker is being prepended to
pressure_guidance, causing a duplicated "- -" and misaligned bullets; update the
gate_step assembly to insert pressure_guidance (built by
_build_round_evaluator_transformation_pressure_guidance and stored in
pressure_guidance) without adding an extra "- " prefix and ensure its lines are
properly indented to match the surrounding bullets (e.g., trim and inject
pressure_guidance directly or re-indent its lines to include the same leading
spaces used by the other list items in gate_step).
- Around line 769-770: Update the docstring for
_build_round_evaluator_transformation_pressure_guidance to follow Google style:
add an "Args" section documenting the parameter pressure (type and allowed
values/meaning) and a "Returns" section describing the returned str, and ensure
the one-line summary stays as the first line; apply the same Google-style
Args/Returns docstring pattern to the other changed prompt-builder functions
mentioned (those around lines 809-823 and 4273-4275) so all new/modified
functions include clear Args and Returns blocks.

In `@massgen/tests/test_config_builder.py`:
- Around line 542-580: Add Google-style docstrings to the three modified test
functions: test_reasoning_profile_for_codex_includes_xhigh,
test_reasoning_profile_for_codex_gpt54_defaults_high, and
test_reasoning_profile_for_copilot_uses_runtime_metadata. For each, insert a
short docstring immediately under the def that summarizes the test purpose
(one-line summary plus optional one-sentence detail), and include Args/Returns
sections only if relevant per repo convention; keep them concise and
Google-style so the functions comply with the repo-wide Python docstring rule.

In `@massgen/tests/test_planning_injection.py`:
- Around line 281-283: Update the assertions that check the verification memo in
the test to include the new headers; specifically, in the block that asserts on
out[-1]["description"] (the verification memo), add assertions for "## Inputs
and Artifacts" and "## Replay Steps" in addition to the existing "##
Verification Contract", "## Latest Verification Result", and "## Stale If"
checks so the test fails on partial prompt regressions.

In `@massgen/tests/test_specialized_subagents.py`:
- Around line 347-421: Replace the summary-only triple-quoted docstrings in the
four new test functions
(test_round_evaluator_prompt_commits_to_one_material_next_round_thesis,
test_round_evaluator_prompt_defines_transformation_pressure_contract,
test_round_evaluator_prompt_requires_narrow_builder_task_scopes,
test_builder_prompt_requires_one_coherent_goal_and_split_same_file_shopping_lists)
with Google-style docstrings: start with a one-line summary, then a blank line
and an "Args:" section (can be omitted or empty for tests) and/or a "Returns:"/
"Raises:" section only if applicable; keep the content concise and preserve the
intent of the original descriptions. Ensure each function's docstring follows
the repository's Google-style format consistently.

In `@massgen/tests/test_subagent_manager.py`:
- Around line 2004-2006: The test currently asserts a raw substring in
next_tasks.json which can yield false positives; instead read and parse the JSON
from next_tasks_path (via next_tasks_path.read_text + json.loads) and assert the
structured fields such as the top-level "strategy_mode" equals "thesis_shift"
and any other expected keys/values; update both occurrences (around
next_tasks_path and the similar block at the other location) to parse the file
and use direct dict assertions rather than substring matching.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: dd594487-1353-4e80-99da-16265d81eda4

📥 Commits

Reviewing files that changed from the base of the PR and between 37b6f84 and c636a22.

📒 Files selected for processing (43)
  • docs/modules/composition.md
  • docs/modules/coordination_workflow.md
  • docs/modules/subagents.md
  • docs/source/reference/yaml_schema.rst
  • massgen/agent_config.py
  • massgen/api_params_handler/_api_params_handler_base.py
  • massgen/backend/base.py
  • massgen/backend/codex.py
  • massgen/cli.py
  • massgen/config_builder.py
  • massgen/config_validator.py
  • massgen/configs/features/round_evaluator_example.yaml
  • massgen/mcp_tools/checklist_tools_server.py
  • massgen/mcp_tools/custom_tools_server.py
  • massgen/mcp_tools/planning/_planning_mcp_server.py
  • massgen/message_templates.py
  • massgen/orchestrator.py
  • massgen/subagent/manager.py
  • massgen/subagent/models.py
  • massgen/subagent_types/builder/SUBAGENT.md
  • massgen/subagent_types/round_evaluator/SUBAGENT.md
  • massgen/system_message_builder.py
  • massgen/system_prompt_sections.py
  • massgen/tests/frontend/test_quickstart_wizard_config_filename.py
  • massgen/tests/integration/test_orchestrator_consensus.py
  • massgen/tests/test_api_params_exclusion.py
  • massgen/tests/test_checklist_label_refresh_ordering.py
  • massgen/tests/test_checklist_tools_server.py
  • massgen/tests/test_codex_mcp_tool_visibility.py
  • massgen/tests/test_codex_reasoning_config.py
  • massgen/tests/test_config_builder.py
  • massgen/tests/test_custom_tools.py
  • massgen/tests/test_ensemble_mode.py
  • massgen/tests/test_planning_injection.py
  • massgen/tests/test_round_evaluator_loop.py
  • massgen/tests/test_round_evaluator_verdict.py
  • massgen/tests/test_specialized_subagents.py
  • massgen/tests/test_subagent_manager.py
  • massgen/tests/test_system_message_builder_phases.py
  • massgen/tests/test_system_prompt_sections.py
  • massgen/tests/test_web_quickstart_reasoning_sync.py
  • massgen/tests/test_web_setup_status_api.py
  • massgen/tests/unit/test_orchestrator_unit.py

Comment on lines +146 to +147
See ``docs/source/reference/yaml_schema.rst`` and ``configs/voting/ensemble_mode.yaml`` for configuration details.


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

fd 'ensemble_mode.yaml'
rg -n "ensemble_mode.yaml" docs/modules/composition.md docs/source/reference/yaml_schema.rst

Repository: massgen/MassGen

Length of output: 205


🏁 Script executed:

fd . configs/ -name "*ensemble*" -o -name "*voting*"
ls -la configs/ 2>/dev/null || echo "No configs directory found"

Repository: massgen/MassGen

Length of output: 333


🏁 Script executed:

find . -name "*.yaml" -o -name "*.yml" | head -20
find . -type d -name "config*" 2>/dev/null
cat docs/source/reference/yaml_schema.rst | head -50

Repository: massgen/MassGen

Length of output: 3065


🏁 Script executed:

rg -i "voting|ensemble" docs/ massgen/
ls -la massgen/configs/
find massgen/configs -type f -name "*.yaml" | sort

Repository: massgen/MassGen

Length of output: 50374


🏁 Script executed:

find . -type d -name "voting"
find . -name "*ensemble*mode*" -type f
# Check what configs actually exist
ls -la massgen/configs/
head -20 docs/modules/composition.md | tail -10

Repository: massgen/MassGen

Length of output: 2438


🏁 Script executed:

ls -la massgen/configs/voting/
find massgen/configs/voting -type f

Repository: massgen/MassGen

Length of output: 314


🏁 Script executed:

sed -n '130,160p' docs/modules/composition.md

Repository: massgen/MassGen

Length of output: 2020


🏁 Script executed:

# Check if ensemble_mode.yaml is mentioned anywhere else in docs
rg "ensemble_mode" docs/ massgen/

Repository: massgen/MassGen

Length of output: 201


The referenced config file ensemble_mode.yaml does not exist.

massgen/configs/voting/ensemble_mode.yaml is not in the repository. The voting directory contains only gemini_gpt_voting_sensitivity.yaml. Either create this example config file or correct the documentation reference to point to an existing config.

Additionally, the path is missing the massgen/ prefix: it should be massgen/configs/voting/, not configs/voting/.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/modules/composition.md` around lines 146 - 147, The docs reference an
example config "configs/voting/ensemble_mode.yaml" that doesn't exist and also
omits the project prefix; update the documentation or repo: either add the
missing example file at massgen/configs/voting/ensemble_mode.yaml or change the
reference in docs/modules/composition.md to point to an existing file (e.g.,
massgen/configs/voting/gemini_gpt_voting_sensitivity.yaml) and correct the path
to include the "massgen/" prefix; also ensure the related reference to
docs/source/reference/yaml_schema.rst remains correct.

Comment on lines +188 to +215
orchestrator-managed round-evaluator stage:

- round 1: parent builds and submits its first answer normally
- round 2+: the parent launches one blocking `round_evaluator` subagent before
checklist submission unless the separate orchestrator-managed gate is enabled
- round 2+: the orchestrator launches one blocking `round_evaluator` subagent
before checklist submission
- the round evaluator returns a critique/spec packet with `criteria_interpretation`, `criterion_findings`, `cross_answer_synthesis`, `preserve`, `improvement_spec`, `verification_plan`, and `evidence_gaps`
- the parent saves or copies that packet into its workspace as the diagnostic
report used for `submit_checklist`
- core path: if valid `next_tasks.json` is present, the parent treats it as the
one committed next-round thesis, calls `get_task_plan`, implements, verifies,
and submits via `new_answer`
- degraded fallback: if valid `next_tasks.json` is missing or invalid, the
parent uses `critique_packet.md` as the diagnostic basis for `submit_checklist`
- the parent does not run a second full self-evaluation pass; additional
verification is only for explicit `evidence_gaps`
- the parent still owns `submit_checklist`, `propose_improvements`, `new_answer`, and `vote`
- generated child YAML for `round_evaluator` always mounts the shared
temp-workspace root read-only
- with `refine: false`, `round_evaluator` remains a quick critique-only child
run and omits checklist-gated child settings
- with `refine: true`, `round_evaluator` may inherit the parent checklist gate;
when no child-specific criteria are configured, it falls back to a built-in
`round_evaluator` criteria preset for judging the critique packet itself
- if the evaluator child times out before producing `critique_packet.md`, the
orchestrator degrades back to the normal parent-owned checklist flow for that
answer set instead of terminating coordination immediately
- `round_evaluator_refine` remains an advanced/non-default branch
- when the child run is using presenter-stage `synthesize`/`winner_present`, it
keeps `skip_final_presentation: false`
- `round_evaluator_transformation_pressure` controls how aggressively the
evaluator seeks a larger thesis shift: `gentle`, `balanced`, or `aggressive`

`coordination.orchestrator_managed_round_evaluator: true` is a separate,
currently gated mode that lets the orchestrator launch that same blocking
`round_evaluator` before round 2+.
`coordination.orchestrator_managed_round_evaluator: true` remains required for
this stage and keeps the launch reserved for the orchestrator rather than a
manual parent prompt pattern.

⚠️ Potential issue | 🟡 Minor

Make the dual-flag prerequisite explicit at the top of this section.

This now reads as if round_evaluator_before_checklist enables the stage by itself, but a few lines later orchestrator_managed_round_evaluator is still described as required. Please state both required flags together in the opening sentence or validation list so readers do not land on an invalid partial config.

Based on learnings, module documentation in docs/modules/ is the authoritative source for complex subsystems and should be updated precisely when behavior changes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/modules/subagents.md` around lines 188 - 215, Update the opening of this
section to explicitly state that both
coordination.orchestrator_managed_round_evaluator and the
round_evaluator_before_checklist flag are required together to enable the
orchestrator-managed round-evaluator flow; revise the first sentence/validation
list to mention both flags by name (round_evaluator_before_checklist and
orchestrator_managed_round_evaluator) and note that omitting either produces a
degraded fallback, and ensure any subsequent sentences that implied a
single-flag enablement are adjusted to reflect the dual-flag prerequisite.
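
To make the prerequisite concrete, a minimal YAML sketch with both flags set together (flag names are taken from this comment; their exact placement in the schema is assumed, not confirmed):

```yaml
coordination:
  # Both flags are required together; setting only one leaves the
  # orchestrator-managed round-evaluator stage disabled.
  orchestrator_managed_round_evaluator: true
  round_evaluator_before_checklist: true
```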

Comment on lines +1043 to +1047
* - ``round_evaluator_refine``
- boolean
- No
- Advanced/non-default option that lets the evaluator child run iterate before producing its packet (default: ``false``).
* - ``round_evaluator_transformation_pressure``

⚠️ Potential issue | 🟡 Minor

Fix wording in round_evaluator_refine description.

The phrase “run iterate” reads as a typo and is confusing in the schema docs.

✏️ Suggested wording fix
-  - Advanced/non-default option that lets the evaluator child run iterate before producing its packet (default: ``false``).
+  - Advanced/non-default option that lets the evaluator child iterate before producing its packet (default: ``false``).
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
* - ``round_evaluator_refine``
- boolean
- No
- Advanced/non-default option that lets the evaluator child run iterate before producing its packet (default: ``false``).
* - ``round_evaluator_transformation_pressure``
* - ``round_evaluator_refine``
- boolean
- No
- Advanced/non-default option that lets the evaluator child iterate before producing its packet (default: ``false``).
* - ``round_evaluator_transformation_pressure``
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/source/reference/yaml_schema.rst` around lines 1043 - 1047, The
description for round_evaluator_refine is unclear due to the phrase "run
iterate"; update the text to a clearer wording such as "allow the evaluator
child to perform iterative refinement before producing its packet (default:
``false``)" so it accurately conveys that the evaluator will execute its
iterative refinement process prior to emitting the packet; modify the schema doc
entry for round_evaluator_refine accordingly.

Comment on lines +1179 to +1238
Ensemble pattern (recommended defaults):

.. code-block:: yaml

orchestrator:
# Agents work independently — no peer answer injection
disable_injection: true
# Wait for all agents to finish before voting begins
defer_voting_until_all_answered: true
# Each agent produces 1 answer (adjustable)
max_new_answers_per_agent: 1
# Winner synthesizes from all answers
final_answer_strategy: "synthesize"

The **ensemble pattern** is a coordination strategy where agents produce answers
independently (no peer visibility), then vote on the best answer, and the winner
synthesizes insights from all others into a refined final answer.

**When to use ensemble mode:**

- You want diverse, independent perspectives without agents anchoring on each
other's work
- The task benefits from competitive parallel attempts rather than iterative
refinement (e.g., creative writing, design proposals, solution brainstorming)
- You want faster coordination — single round of production + vote, no
multi-round iteration

**Subagent default:** Multi-agent subagent runs use ensemble defaults
automatically (``disable_injection: true``, ``defer_voting_until_all_answered:
true``). Override by setting these fields explicitly in
``subagent_orchestrator`` config.

.. list-table:: Ensemble vs Standard Voting vs Decomposition
:header-rows: 1

* - Aspect
- Standard voting
- Ensemble pattern
- Decomposition
* - Peer visibility
- Agents see each other's answers
- Agents work in isolation
- Agents see subtask assignments
* - Iteration
- Multiple refinement rounds
- Single round of production
- Multiple rounds per subtask
* - Voting
- After iterative refinement
- After all answers produced
- No voting (presenter assembles)
* - Final answer
- Winner presents
- Winner synthesizes from all
- Presenter integrates subtasks
* - Best for
- Deep quality refinement
- Diverse perspectives, speed
- Complex multi-part tasks


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Verify Recent Releases section exists and contains feature keywords.
rg -n "Recent Releases|round_evaluator_transformation_pressure|Ensemble" docs/source/index.rst docs/source/reference/yaml_schema.rst

Repository: massgen/MassGen

Length of output: 449


🏁 Script executed:

cat -n docs/source/index.rst | sed -n '195,250p'

Repository: massgen/MassGen

Length of output: 2605


Update docs/source/index.rst v0.1.61 release notes to include ensemble pattern and transformation_pressure features.

The ensemble pattern coordination strategy with round_evaluator_transformation_pressure and ensemble defaults (documented in yaml_schema.rst lines 1179–1238) is not mentioned in the Recent Releases section. The v0.1.61 entry currently only describes the round evaluator subagent; update it to also reference the new ensemble pattern and coordination options introduced in this release.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/source/reference/yaml_schema.rst` around lines 1179 - 1238, Update the
v0.1.61 release notes in the Recent Releases section to mention the new ensemble
coordination pattern and the related configuration options introduced in this
release: reference the ensemble defaults (disable_injection,
defer_voting_until_all_answered, max_new_answers_per_agent,
final_answer_strategy) and the round evaluator option
round_evaluator_transformation_pressure alongside the existing round
evaluator/subagent notes so the entry covers both the ensemble pattern and the
round evaluator subagent features.

Comment on lines +1206 to +1209
**Subagent default:** Multi-agent subagent runs use ensemble defaults
automatically (``disable_injection: true``, ``defer_voting_until_all_answered:
true``). Override by setting these fields explicitly in
``subagent_orchestrator`` config.

⚠️ Potential issue | 🟠 Major

Avoid splitting inline literals across lines in RST.

The inline literal around defer_voting_until_all_answered: true is broken by a newline, which can cause Sphinx/RST parsing warnings.

🧩 Suggested RST-safe formatting
-**Subagent default:** Multi-agent subagent runs use ensemble defaults
-automatically (``disable_injection: true``, ``defer_voting_until_all_answered:
-true``). Override by setting these fields explicitly in
-``subagent_orchestrator`` config.
+**Subagent default:** Multi-agent subagent runs use ensemble defaults
+automatically (``disable_injection: true``,
+``defer_voting_until_all_answered: true``). Override by setting these fields
+explicitly in ``subagent_orchestrator`` config.
As per coding guidelines: `docs/source/**/*.rst`: “Ensure proper RST syntax”.
📝 Committable suggestion


Suggested change
**Subagent default:** Multi-agent subagent runs use ensemble defaults
automatically (``disable_injection: true``, ``defer_voting_until_all_answered:
true``). Override by setting these fields explicitly in
``subagent_orchestrator`` config.
**Subagent default:** Multi-agent subagent runs use ensemble defaults
automatically (``disable_injection: true``,
``defer_voting_until_all_answered: true``). Override by setting these fields
explicitly in ``subagent_orchestrator`` config.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/source/reference/yaml_schema.rst` around lines 1206 - 1209, The inline
literal for defer_voting_until_all_answered: true is split across a newline
causing RST parsing warnings; update the sentence so the entire inline literal
(and related literal `disable_injection: true`) stays on a single line or use a
single inline literal span for the whole clause (e.g., ``disable_injection:
true, defer_voting_until_all_answered: true``) and keep the reference to
`subagent_orchestrator` unchanged so Sphinx parses correctly.
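
As a usage note, overriding those subagent ensemble defaults would look something like this (key names from the quoted passage; nesting directly under ``subagent_orchestrator`` is assumed):

```yaml
subagent_orchestrator:
  # Explicit overrides of the automatic ensemble defaults:
  disable_injection: false
  defer_voting_until_all_answered: false
```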

Comment on lines +625 to +660
@staticmethod
def _normalize_background_server_config(server_cfg: dict[str, Any]) -> dict[str, Any]:
"""Rewrite server config details that only make sense in model-facing runtimes."""
name = server_cfg.get("name")
if not isinstance(name, str) or not name.startswith("subagent_"):
return server_cfg

args = server_cfg.get("args")
if not isinstance(args, list):
return server_cfg

rewritten_args = CodexBackend._rewrite_subagent_runtime_mode_for_background(args)
if rewritten_args is args:
return server_cfg

normalized = server_cfg.copy()
normalized["args"] = rewritten_args
return normalized

@staticmethod
def _rewrite_subagent_runtime_mode_for_background(args: list[Any]) -> list[Any]:
"""Host-side background clients should launch subagent MCP servers in isolated mode."""
rewritten = list(args)
changed = False

for index, token in enumerate(rewritten[:-1]):
if token != "--runtime-mode":
continue
runtime_mode = rewritten[index + 1]
if str(runtime_mode).strip().lower() != "delegated":
continue
rewritten[index + 1] = "isolated"
changed = True

return rewritten if changed else args


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify test coverage for delegated->isolated runtime-mode rewrite behavior
# and backend parity references.

rg -n -C2 --type=py '_rewrite_subagent_runtime_mode_for_background|runtime[-_ ]mode|delegated|isolated'
rg -n -C2 --type=py 'CodexBackend|claude_code|base_with_custom_tool_and_mcp|chat_completions|response|gemini|grok'

Repository: massgen/MassGen

Length of output: 50373


🏁 Script executed:

# Search specifically for tests of the _rewrite_subagent_runtime_mode_for_background function
rg -l '_rewrite_subagent_runtime_mode_for_background' --type=py

Repository: massgen/MassGen

Length of output: 83


🏁 Script executed:

# Search for test files that test CodexBackend
fd -e py 'test.*codex|codex.*test'

Repository: massgen/MassGen

Length of output: 228


🏁 Script executed:

# Search for MCP or custom_tool tests
fd -e py 'test.*mcp|mcp.*test|test.*custom.*tool'

Repository: massgen/MassGen

Length of output: 228


🏁 Script executed:

# Look at the codex.py file to understand the context
cat -n massgen/backend/codex.py | head -50

Repository: massgen/MassGen

Length of output: 2993


🏁 Script executed:

# Check if there are any backend parity tests
fd -e py 'test.*backend|backend.*test' | head -20

Repository: massgen/MassGen

Length of output: 228


🏁 Script executed:

# Find test files using find instead of fd
find massgen/tests -name "*codex*" -o -name "*backend*" | grep -i test | head -20

Repository: massgen/MassGen

Length of output: 778


🏁 Script executed:

# Search for any tests that test CodexBackend
rg -l 'CodexBackend' --type py

Repository: massgen/MassGen

Length of output: 429


🏁 Script executed:

# Search for tests that test MCP server configuration
rg -l 'mcp_servers|server_cfg|normalize.*config' --type py | grep test

Repository: massgen/MassGen

Length of output: 882


🏁 Script executed:

# Check test directory structure
ls -la massgen/tests/

Repository: massgen/MassGen

Length of output: 16728


🏁 Script executed:

# Check if test_custom_tools_server_background.py tests CodexBackend
rg -n 'CodexBackend|claude_code|_rewrite_subagent_runtime_mode_for_background|delegated|isolated' massgen/tests/test_custom_tools_server_background.py | head -30

Repository: massgen/MassGen

Length of output: 41


🏁 Script executed:

# Check test_background_delegate.py
rg -n 'CodexBackend|claude_code|_normalize_background_server_config|delegated|runtime' massgen/tests/test_background_delegate.py | head -30

Repository: massgen/MassGen

Length of output: 41


🏁 Script executed:

# Check if there are integration tests covering multiple backends
rg -n 'BACKEND_CONFIGS|test.*backend.*parity|claude.*codex|codex.*claude' massgen/tests/test_subagent_mcp_server.py | head -20

Repository: massgen/MassGen

Length of output: 41


🏁 Script executed:

# Search for what backends are tested in background tools tests
rg -n 'class.*Backend|def test.*codex|def test.*background' massgen/tests/test_custom_tools_server_background.py | head -15

Repository: massgen/MassGen

Length of output: 1326


🏁 Script executed:

# Look at the structure of test_background_delegate.py to understand what's tested
head -100 massgen/tests/test_background_delegate.py

Repository: massgen/MassGen

Length of output: 3266


🏁 Script executed:

# Search for CodexBackend test files
find massgen/tests -name "*codex*" -type f

Repository: massgen/MassGen

Length of output: 188


🏁 Script executed:

# Check what test_custom_tools.py tests
rg -n 'CodexBackend|claude_code|backend.*parity' massgen/tests/test_custom_tools.py | head -20

Repository: massgen/MassGen

Length of output: 41


🏁 Script executed:

# Look at test_codex_mcp_tool_visibility.py to see how CodexBackend is tested
head -80 massgen/tests/test_codex_mcp_tool_visibility.py

Repository: massgen/MassGen

Length of output: 2538


🏁 Script executed:

# Search for any backend parity pattern in tests
rg -n 'BACKEND_CONFIGS|all.*backend|backend.*test.*parity' massgen/tests/*.py | head -30

Repository: massgen/MassGen

Length of output: 3461


🏁 Script executed:

# Check test_custom_tools.py to see what backends it tests
rg -n 'def test_|class Test' massgen/tests/test_custom_tools.py | head -40

Repository: massgen/MassGen

Length of output: 2815


🏁 Script executed:

# Check if there are any tests for _normalize_background_server_config
rg -n '_normalize_background_server_config|_rewrite_subagent_runtime_mode' massgen/tests/*.py

Repository: massgen/MassGen

Length of output: 41


🏁 Script executed:

# Check test_subagent_mcp_server.py to see backend coverage patterns
rg -n 'def test_.*background|def test_.*server.*config|CodexBackend|claude_code' massgen/tests/test_subagent_mcp_server.py | head -30

Repository: massgen/MassGen

Length of output: 1298


🏁 Script executed:

# Look at which backends are tested in test_streaming_buffer.py pattern
rg -n 'backend.*classes|get_all.*backend' massgen/tests/test_streaming_buffer.py -A 10 | head -50

Repository: massgen/MassGen

Length of output: 2843


🏁 Script executed:

# Check if test_custom_tools_server_background.py or test_claude_code_background_tools.py test CodexBackend
rg -n 'CodexBackend|class.*Backend' massgen/tests/test_custom_tools_server_background.py

Repository: massgen/MassGen

Length of output: 41


🏁 Script executed:

# Check the exact code around lines 625-660 in codex.py to understand what needs testing
sed -n '620,665p' massgen/backend/codex.py

Repository: massgen/MassGen

Length of output: 1808


🏁 Script executed:

# Check claude_code.py for similar background server config handling
rg -n '_normalize_background_server_config|_rewrite_subagent_runtime_mode|background.*server.*config' massgen/backend/claude_code.py

Repository: massgen/MassGen

Length of output: 41


🏁 Script executed:

# Check base_with_custom_tool_and_mcp.py for similar behavior
rg -n '_normalize_background_server_config|_rewrite_subagent_runtime_mode|background.*server.*config' massgen/backend/base_with_custom_tool_and_mcp.py

Repository: massgen/MassGen

Length of output: 41


🏁 Script executed:

# Check if any tests import or instantiate CodexBackend for background/server config testing
rg -n 'CodexBackend.*background|background.*CodexBackend' massgen/tests/*.py

Repository: massgen/MassGen

Length of output: 41


🏁 Script executed:

# Look at the broader context of where _normalize_background_server_config is called in codex.py
rg -n '_normalize_background_server_config|_filter_background_servers' massgen/backend/codex.py -B 2 -A 2

Repository: massgen/MassGen

Length of output: 588


🏁 Script executed:

# Check what calls _get_background_mcp_servers in codex.py and when it's used
rg -n '_get_background_mcp_servers|background.*mcp.*server' massgen/backend/codex.py -B 2 -A 2

Repository: massgen/MassGen

Length of output: 946


🏁 Script executed:

# Search for where CodexBackend is instantiated in tests to understand test patterns
rg -lF 'CodexBackend(' massgen/tests/*.py

Repository: massgen/MassGen

Length of output: 135


🏁 Script executed:

# Check if test_codex_mcp_tool_visibility.py has any background-related tests
grep -n 'def test_\|background' massgen/tests/test_codex_mcp_tool_visibility.py

Repository: massgen/MassGen

Length of output: 2288


🏁 Script executed:

# Look at test coverage for background tool management across backends
rg -n 'background.*tool|_normalize_background|server.*config.*background' massgen/tests/test_custom_tools_server_background.py | head -20

Repository: massgen/MassGen

Length of output: 41


🏁 Script executed:

# Check if there are integration tests that span multiple backends for subagent/background behavior
rg -n 'CodexBackend.*subagent|subagent.*CodexBackend|claude_code.*background|background.*claude_code' massgen/tests/*.py

Repository: massgen/MassGen

Length of output: 2288


🏁 Script executed:

# Look at the actual test that tests runtime mode rewriting
sed -n '159,220p' massgen/tests/test_codex_mcp_tool_visibility.py

Repository: massgen/MassGen

Length of output: 1957


🏁 Script executed:

# Check if test_claude_code_background_tools.py tests similar runtime mode rewriting
rg -n 'delegated|isolated|runtime.*mode|_normalize_background' massgen/tests/test_claude_code_background_tools.py

Repository: massgen/MassGen

Length of output: 41


🏁 Script executed:

# Check what backend the ResponseBackend tests use for background tools
grep -n 'class.*Backend\|ResponseBackend\|ChatCompletionsBackend' massgen/tests/test_custom_tools.py | head -15

Repository: massgen/MassGen

Length of output: 752


🏁 Script executed:

# Search for any base_with_custom_tool_and_mcp background server config tests
rg -n 'ResponseBackend.*background|ChatCompletionsBackend.*background|base_with_custom_tool.*background' massgen/tests/*.py

Repository: massgen/MassGen

Length of output: 41


🏁 Script executed:

# Verify the test name and location we found
grep -n 'test_codex_background_mcp_client_rewrites_delegated_subagent_runtime_mode_for_host' massgen/tests/test_codex_mcp_tool_visibility.py

Repository: massgen/MassGen

Length of output: 156


🏁 Script executed:

# Check if claude_code.py has _get_background_mcp_client or similar background config handling
rg -n '_get_background_mcp_client|_collect_background_mcp_servers' massgen/backend/claude_code.py

Repository: massgen/MassGen

Length of output: 247


🏁 Script executed:

# Check response.py and chat_completions.py for background MCP server handling
rg -n '_get_background_mcp_client|_collect_background_mcp_servers|background.*server' massgen/backend/response.py massgen/backend/chat_completions.py

Repository: massgen/MassGen

Length of output: 41


🏁 Script executed:

# Verify if any tests test runtime mode rewriting for other backends
rg -n 'runtime.*mode.*delegated|rewrite.*runtime.*isolated' massgen/tests/*.py

Repository: massgen/MassGen

Length of output: 1527


🏁 Script executed:

# Check if the learning mentions any existing test patterns we should follow
grep -r 'parity.*test\|backend.*parity' massgen/tests/*.py | head -10

Repository: massgen/MassGen

Length of output: 41


🏁 Script executed:

# Let me verify if ClaudeCodeBackend needs the same normalization by checking its implementation
sed -n '1850,1930p' massgen/backend/claude_code.py

Repository: massgen/MassGen

Length of output: 3666


🏁 Script executed:

# Check if ClaudeCodeBackend._get_background_mcp_servers does normalization
rg -n '_get_background_mcp_servers' massgen/backend/claude_code.py -A 20 | head -50

Repository: massgen/MassGen

Length of output: 2165


🏁 Script executed:

# Check if there's a _normalize method in ClaudeCodeBackend
rg -n '_normalize_background|_filter_background|_get_background_mcp_servers' massgen/backend/claude_code.py

Repository: massgen/MassGen

Length of output: 535


🏁 Script executed:

# Compare base_with_custom_tool_and_mcp to see the inheritance pattern
rg -n '_get_background_mcp_servers|_collect_background_mcp_servers' massgen/backend/base_with_custom_tool_and_mcp.py

Repository: massgen/MassGen

Length of output: 41


🏁 Script executed:

# Check if ClaudeCodeBackend should implement the same normalization
# by looking at whether it receives delegated subagent configs
rg -n 'subagent_.*runtime.*mode|runtime.*delegated' massgen/backend/claude_code.py

Repository: massgen/MassGen

Length of output: 41


🏁 Script executed:

# Check test_claude_code_background_tools.py for any similar test patterns
sed -n '1,100p' massgen/tests/test_claude_code_background_tools.py

Repository: massgen/MassGen

Length of output: 3584


🏁 Script executed:

# Verify the exact code flow: does ClaudeCodeBackend get subagent configs with delegated mode?
rg -n 'mcp_servers.*subagent|subagent.*mcp' massgen/backend/claude_code.py -B 2 -A 2

Repository: massgen/MassGen

Length of output: 41


🏁 Script executed:

# Final verification: Check if ClaudeCodeBackend config can receive delegated mode subagent configs
rg -n 'subagent_.*delegated|--runtime-mode.*delegated' massgen/tests/test_claude_code_background_tools.py

Repository: massgen/MassGen

Length of output: 41


🏁 Script executed:

# Confirm what the original review should have referenced
grep -A 5 'test_codex_background_mcp_client_rewrites_delegated' massgen/tests/test_codex_mcp_tool_visibility.py | head -15

Repository: massgen/MassGen

Length of output: 333


🏁 Script executed:

# Check if ClaudeCodeBackend ever receives subagent configs in background scenario
rg -n 'subagent_agent|subagent_' massgen/backend/claude_code.py

Repository: massgen/MassGen

Length of output: 41


Add runtime mode normalization and parity test for ClaudeCodeBackend's background MCP servers.

A test already exists for CodexBackend (test_codex_background_mcp_client_rewrites_delegated_subagent_runtime_mode_for_host in test_codex_mcp_tool_visibility.py) validating that delegated→isolated runtime mode rewriting works for background MCP execution. However, ClaudeCodeBackend's _get_background_mcp_servers() method does not implement equivalent normalization logic. Add _normalize_background_server_config() to ClaudeCodeBackend and add a corresponding test to ensure both native backends handle delegated subagent MCP server configs consistently when executing background targets.

🧰 Tools
🪛 Ruff (0.15.5)

[error] 651-651: Possible hardcoded password assigned to: "token"

(S105)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@massgen/backend/codex.py` around lines 625 - 660, ClaudeCodeBackend lacks the
same background MCP server config normalization found in CodexBackend, so add a
private _normalize_background_server_config(server_cfg: dict[str, Any]) (or
reuse the existing logic from CodexBackend including the helper
_rewrite_subagent_runtime_mode_for_background) to ClaudeCodeBackend and ensure
_get_background_mcp_servers() calls it before returning server configs; then add
a unit test mirroring
test_codex_background_mcp_client_rewrites_delegated_subagent_runtime_mode_for_host
(e.g.,
test_claude_code_background_mcp_client_rewrites_delegated_subagent_runtime_mode_for_host)
to assert delegated→isolated rewriting for ClaudeCodeBackend background MCP
servers.
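
For reference, the delegated-to-isolated rewrite being asked for can be exercised standalone; the sketch below copies the logic from the quoted CodexBackend helper into a free function (the argv values are invented for illustration), which a ClaudeCodeBackend parity test could mirror:

```python
# Sketch of a parity test for the delegated -> isolated rewrite.
# `rewrite` copies the body of the quoted CodexBackend helper
# (_rewrite_subagent_runtime_mode_for_background).
def rewrite(args):
    rewritten = list(args)
    changed = False
    for index, token in enumerate(rewritten[:-1]):
        if token != "--runtime-mode":
            continue
        if str(rewritten[index + 1]).strip().lower() != "delegated":
            continue
        rewritten[index + 1] = "isolated"
        changed = True
    # A copy is returned only when something changed; otherwise the
    # original list object is handed back untouched.
    return rewritten if changed else args

args = ["serve", "--runtime-mode", "delegated", "--port", "8080"]
out = rewrite(args)
assert out == ["serve", "--runtime-mode", "isolated", "--port", "8080"]
assert out is not args  # rewrite produced a copy

untouched = ["serve", "--runtime-mode", "isolated"]
assert rewrite(untouched) is untouched  # no-op returns the same object
```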

VALID_WRITE_MODES = {"auto", "worktree", "isolated", "legacy"}
VALID_DRIFT_CONFLICT_POLICIES = {"skip", "prefer_presenter", "fail"}
VALID_NOVELTY_INJECTION = {"none", "gentle", "moderate", "aggressive"}
VALID_ROUND_EVALUATOR_TRANSFORMATION_PRESSURE = {"gentle", "balanced", "aggressive"}

⚠️ Potential issue | 🟠 Major

Type-check round_evaluator_transformation_pressure before validating it.

The new allowlist is the right shape, but pressure not in ... will raise TypeError for list/dict YAML values, so malformed configs can still crash the validator instead of producing a ValidationIssue.

🛠️ Proposed fix
                 if "round_evaluator_transformation_pressure" in coordination:
                     pressure = coordination["round_evaluator_transformation_pressure"]
-                    if pressure not in self.VALID_ROUND_EVALUATOR_TRANSFORMATION_PRESSURE:
-                        valid_values = ", ".join(sorted(self.VALID_ROUND_EVALUATOR_TRANSFORMATION_PRESSURE))
+                    valid_values = ", ".join(sorted(self.VALID_ROUND_EVALUATOR_TRANSFORMATION_PRESSURE))
+                    if not isinstance(pressure, str):
+                        result.add_error(
+                            "'round_evaluator_transformation_pressure' must be a string",
+                            f"{location}.coordination.round_evaluator_transformation_pressure",
+                            f"Use one of: {valid_values}",
+                        )
+                    elif pressure not in self.VALID_ROUND_EVALUATOR_TRANSFORMATION_PRESSURE:
                         result.add_error(
                             f"Invalid round_evaluator_transformation_pressure: '{pressure}'. Supported values: {valid_values}",
                             f"{location}.coordination.round_evaluator_transformation_pressure",
                             f"Use one of: {valid_values}",
                         )

As per coding guidelines, "massgen/config_validator.py: Config validator changes - CHECK: ... Ensure validation error messages are helpful".

Also applies to: 1136-1144
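
The failure mode described here can be reproduced directly; a minimal sketch (set contents copied from the quoted allowlist, `check` is an illustrative stand-in for the validator branch, not the project's actual API):

```python
# Reproduces the crash described above: membership tests against a set
# raise TypeError for unhashable YAML values such as lists and dicts.
VALID = {"gentle", "balanced", "aggressive"}

assert "balanced" in VALID  # a string value validates cleanly

crashed = False
try:
    ["balanced"] in VALID  # unhashable -> TypeError, not a clean failure
except TypeError:
    crashed = True
assert crashed

# With an isinstance guard, malformed values become reportable errors.
def check(pressure):
    if not isinstance(pressure, str):
        return "must be a string"
    if pressure not in VALID:
        return "invalid value"
    return "ok"

assert check(["balanced"]) == "must be a string"
assert check({"level": "balanced"}) == "must be a string"
assert check("extreme") == "invalid value"
assert check("gentle") == "ok"
```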

🧰 Tools
🪛 Ruff (0.15.5)

[warning] 166-166: Mutable default value for class attribute

(RUF012)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@massgen/config_validator.py` at line 166, The validator currently does a
membership test against VALID_ROUND_EVALUATOR_TRANSFORMATION_PRESSURE which will
raise TypeError if round_evaluator_transformation_pressure is a list/dict;
update the check in the validation function that reads the
round_evaluator_transformation_pressure field to first ensure the value is a str
(or None) and, if it's not, append a ValidationIssue indicating an invalid type,
otherwise perform the membership test against
VALID_ROUND_EVALUATOR_TRANSFORMATION_PRESSURE (and do the same pattern for the
other occurrences around the 1136-1144 area) so malformed YAML yields a
ValidationIssue instead of crashing.

Comment on lines +84 to +87
except ModuleNotFoundError as e:
if "claude_code_sdk" in str(e):
pytest.skip("Skipping: optional dependency 'claude_code_sdk' not installed")
raise

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Optional SDK imports in orchestrator =="
rg -n --type=py -C2 'claude_(agent|code)_sdk' massgen/orchestrator.py

echo
echo "== Skip-guard checks in ensemble tests =="
rg -n --type=py -C2 'claude_(agent|code)_sdk' massgen/tests/test_ensemble_mode.py

Repository: massgen/MassGen

Length of output: 1489


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== All occurrences of claude_code_sdk in repo =="
rg -n --type=py 'claude_code_sdk' .

echo
echo "== All occurrences of claude_agent_sdk in repo =="
rg -n --type=py 'claude_agent_sdk' .

echo
echo "== Check if orchestrator has other optional SDK imports =="
rg -n --type=py -C1 'except (ImportError|ModuleNotFoundError)' massgen/orchestrator.py | head -50

Repository: massgen/MassGen

Length of output: 5914


Optional dependency skip guard checks for the wrong module name.

Lines 85, 108, 237, and 269 check for "claude_code_sdk" in the error message, but orchestrator.py:1100 imports claude_agent_sdk. When claude_agent_sdk is missing, the ModuleNotFoundError message contains "claude_agent_sdk", not "claude_code_sdk", so the skip guard never matches and the tests raise instead of skipping.

🔧 Proposed fix
-        if "claude_code_sdk" in str(e):
-            pytest.skip("Skipping: optional dependency 'claude_code_sdk' not installed")
+        if "claude_agent_sdk" in str(e):
+            pytest.skip("Skipping: optional dependency 'claude_agent_sdk' not installed")

Also applies to: 107-110, 237-239, 269-271. Same issue exists in test_final_presentation_fallback.py:28-29.
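One way to keep the module name from drifting again across the five call sites is a single shared predicate (the `should_skip_for` helper and `OPTIONAL_SDK` constant below are hypothetical, offered as a sketch rather than the repo's actual API):

```python
# Hypothetical shared guard: one constant, one predicate, reused by
# every test that imports the optional SDK.
OPTIONAL_SDK = "claude_agent_sdk"  # the module orchestrator.py actually imports


def should_skip_for(exc: ModuleNotFoundError) -> bool:
    """Return True when the missing module is the optional SDK."""
    return OPTIONAL_SDK in str(exc)


# Usage in a test (pytest assumed available in the test suite):
#     except ModuleNotFoundError as e:
#         if should_skip_for(e):
#             pytest.skip(f"Skipping: optional dependency '{OPTIONAL_SDK}' not installed")
#         raise
```

Centralizing the string means a future SDK rename is a one-line change instead of a grep across the test suite.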

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@massgen/tests/test_ensemble_mode.py` around lines 84 - 87, The skip-guard in
the ModuleNotFoundError handlers is checking for the wrong module name
("claude_code_sdk") so missing "claude_agent_sdk" raises instead of skipping;
update the checks in the except ModuleNotFoundError blocks (the handlers that
inspect str(e)) to look for "claude_agent_sdk" instead of "claude_code_sdk" in
all occurrences in the test suite (including the handlers in
test_ensemble_mode.py and the similar guard in
test_final_presentation_fallback.py) so the tests correctly call pytest.skip
when claude_agent_sdk is not installed.

Comment on lines +40 to +124
def _make_valid_next_tasks_payload() -> dict:
return {
"schema_version": "2",
"objective": "Rebuild the deliverable around a stronger editorial thesis",
"primary_strategy": "editorial_reframe",
"why_this_strategy": "A stronger organizing concept fixes multiple weak criteria at once.",
"strategy_mode": "incremental_refinement",
"approach_assessment": {
"ceiling_status": "ceiling_not_reached",
"ceiling_explanation": "The current approach still has headroom if executed with more discipline.",
"breakthroughs": [],
"paradigm_shift": {
"recommended": False,
"current_limitation": "",
"alternative_approach": "",
"transferable_elements": [],
},
},
"success_contract": {
"outcome_statement": "The next revision should feel intentionally reauthored, not just patched.",
"quality_bar": "The weakest section should feel deliberate and premium in isolation.",
"fail_if_any": [
"The output still reads like the same template with only local polish changes.",
],
"required_evidence": [
"Fresh rendered screenshots of the revised sections",
"Replayable verification notes for the key changed areas",
],
},
"deprioritize_or_remove": ["generic filler modules"],
"execution_scope": {"active_chunk": "c1"},
"fix_tasks": [],
"evolution_tasks": [
{
"id": "reframe_midpage",
"task_category": "evolution",
"strategy_role": "thesis_shift",
"description": "Rebuild the weak middle sections around one stronger editorial concept.",
"implementation_guidance": "Replace repeated generic card modules with one stronger narrative section model.",
"priority": "high",
"depends_on": [],
"chunk": "c1",
"execution": {"mode": "delegate", "subagent_type": "builder"},
"verification": "The mid-page is organized around one stronger editorial concept.",
"verification_method": "Render the page and compare the changed sections against the previous layout.",
"success_criteria": "A reviewer can describe one clear new organizing thesis behind the revised sections.",
"failure_signals": [
"The sections are still recognizably the same layout with only copy and styling tweaks.",
],
"required_evidence": [
"Before/after screenshots of the rebuilt sections",
],
"metadata": {
"impact": "transformative",
"relates_to": ["E7", "E8"],
},
},
],
"tasks": [
{
"id": "reframe_midpage",
"task_category": "evolution",
"strategy_role": "thesis_shift",
"description": "Rebuild the weak middle sections around one stronger editorial concept.",
"implementation_guidance": "Replace repeated generic card modules with one stronger narrative section model.",
"priority": "high",
"depends_on": [],
"chunk": "c1",
"execution": {"mode": "delegate", "subagent_type": "builder"},
"verification": "The mid-page is organized around one stronger editorial concept.",
"verification_method": "Render the page and compare the changed sections against the previous layout.",
"success_criteria": "A reviewer can describe one clear new organizing thesis behind the revised sections.",
"failure_signals": [
"The sections are still recognizably the same layout with only copy and styling tweaks.",
],
"required_evidence": [
"Before/after screenshots of the rebuilt sections",
],
"metadata": {
"impact": "transformative",
"relates_to": ["E7", "E8"],
},
},
],
}

⚠️ Potential issue | 🟡 Minor

Make the canonical “valid” payload coherent.

_make_valid_next_tasks_payload() says the overall handoff is incremental_refinement, but the only task is a strategy_role: "thesis_shift" editorial reframe. That mixed signal weakens the normalization/round-trip tests because the baseline fixture is not one consistent contract case.

Suggested minimal fix
-        "strategy_mode": "incremental_refinement",
+        "strategy_mode": "thesis_shift",
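Whichever direction the fix goes (changing `strategy_mode` or the tasks' `strategy_role`), a small regression check could keep the fixture coherent going forward; the `assert_strategy_coherent` helper below is a hypothetical sketch, not existing test code:

```python
# Hypothetical coherence check for the canonical payload fixture:
# every task's strategy_role must match the top-level strategy_mode.
def assert_strategy_coherent(payload: dict) -> None:
    """Assert that all tasks agree with the payload's strategy_mode."""
    mode = payload["strategy_mode"]
    for key in ("evolution_tasks", "tasks"):
        for task in payload.get(key, []):
            assert task["strategy_role"] == mode, (
                f"{key}: task {task['id']!r} has strategy_role "
                f"{task['strategy_role']!r}, expected {mode!r}"
            )
```

Calling this on the current fixture would fail (mode `"incremental_refinement"` vs role `"thesis_shift"`), which is exactly the mixed signal flagged above.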
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@massgen/tests/test_round_evaluator_loop.py` around lines 40 - 124, The
fixture _make_valid_next_tasks_payload() has a mismatch: strategy_mode is
"incremental_refinement" but the task entries use strategy_role "thesis_shift";
update the two task dicts (the entries in "evolution_tasks" and "tasks") to use
strategy_role "incremental_refinement" (replace "thesis_shift") so the overall
payload is coherent; locate and edit the task objects in
_make_valid_next_tasks_payload() by updating their "strategy_role" fields to
match the top-level "strategy_mode".

Comment on lines +82 to +83
def _make_valid_next_tasks_payload() -> dict:
return {

⚠️ Potential issue | 🟡 Minor

Add a Google-style docstring to _make_valid_next_tasks_payload().

This helper is new and currently undocumented, which violates the repo's Python docstring requirement.

💡 Suggested docstring
 def _make_valid_next_tasks_payload() -> dict:
+    """Build a valid round-evaluator `next_tasks` payload for tests.
+
+    Returns:
+        dict: Canonical schema-v2 payload used across verdict parsing tests.
+    """
     return {
As per coding guidelines, `**/*.py`: For new or changed functions, include Google-style docstrings.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def _make_valid_next_tasks_payload() -> dict:
return {
def _make_valid_next_tasks_payload() -> dict:
"""Build a valid round-evaluator `next_tasks` payload for tests.
Returns:
dict: Canonical schema-v2 payload used across verdict parsing tests.
"""
return {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@massgen/tests/test_round_evaluator_verdict.py` around lines 82 - 83, Add a
Google-style docstring to the helper function _make_valid_next_tasks_payload():
insert a concise one-line summary of what the helper returns, a blank line, an
Args section if any parameters (none in this case) or omit, and a Returns
section describing the dict shape and types (e.g., keys and value types for the
next-tasks payload), and optionally an Example showing typical usage; place this
docstring immediately below the def _make_valid_next_tasks_payload() declaration
so it satisfies the repo's docstring requirement.

@Henry-811 Henry-811 changed the base branch from main to dev/v0.1.63 March 13, 2026 15:57
@Henry-811 Henry-811 merged commit 72325a8 into dev/v0.1.63 Mar 13, 2026
21 of 22 checks passed
@coderabbitai coderabbitai bot mentioned this pull request Mar 13, 2026
18 tasks