Skip to content

chore: promote staging to staging-promote/66ccafb9-24321596712 (2026-04-13 03:02 UTC)#2384

Open
ironclaw-ci[bot] wants to merge 1 commit intostaging-promote/66ccafb9-24321596712from
staging-promote/4529f009-24323637408
Open

chore: promote staging to staging-promote/66ccafb9-24321596712 (2026-04-13 03:02 UTC)#2384
ironclaw-ci[bot] wants to merge 1 commit intostaging-promote/66ccafb9-24321596712from
staging-promote/4529f009-24323637408

Conversation

@ironclaw-ci
Copy link
Copy Markdown
Contributor

@ironclaw-ci ironclaw-ci bot commented Apr 13, 2026

Auto-promotion from staging CI

Batch range: a53eac5c2dec6b6cd5c08189086093fde64aa9cb..4529f009d4e239e06590f083662ac8804bcfaf22
Promotion branch: staging-promote/4529f009-24323637408
Base: staging-promote/66ccafb9-24321596712
Triggered by: Staging CI batch at 2026-04-13 03:02 UTC

Commits in this batch (13):

Current commits in this promotion (1)

Current base: staging-promote/66ccafb9-24321596712
Current head: staging-promote/4529f009-24323637408
Current range: origin/staging-promote/66ccafb9-24321596712..origin/staging-promote/4529f009-24323637408

Auto-updated by staging promotion metadata workflow

Waiting for gates:

  • Tests: pending
  • E2E: pending
  • Claude Code review: pending (will post comments on this PR)

Auto-created by staging-ci workflow

…ath (#2325) (#2340)

The Python orchestrator had no error counting for structured action calls
(Tier 0), allowing threads to loop indefinitely on failing tool calls and
complete "successfully" even when every tool call failed. This adds a
consecutive_action_errors counter that increments when all actions in a
batch fail, resets when any succeeds, injects a nudge at the threshold,
and transitions to failed at threshold + 2. Also prefixes error outputs
with [ACTION FAILED] for visibility and persists the counter in
checkpoints.

Closes #2325
Related: #2279, #2240

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added size: L 200-499 changed lines risk: low Changes to docs, tests, or low-risk modules contributor: core 20+ merged PRs labels Apr 13, 2026
@claude
Copy link
Copy Markdown

claude bot commented Apr 13, 2026

Code review

Found 12 issues across security, architecture, bugs, and performance:

CRITICAL (1)

  1. [CRITICAL:88] Race condition: consecutive_action_errors counter not atomic — If orchestrator ever runs multiple concurrent Monty VMs with shared state (e.g., parallel sub-agents loading state from DB), the consecutive_action_errors counter will have lost-update bugs. Checkpoint persistence makes inconsistencies durable. Current single-VM-per-thread architecture is safe, but this becomes a data corruption hazard if concurrency is added.

https://github.com/anthropics/ironclaw/blob/3f727c41afda86b7a1c74f354cefb27d9d192dce/crates/ironclaw_engine/orchestrator/default.py#L473-L474


HIGH (3)

  1. [HIGH:92] String concatenation for error messages instead of templates — Error messages at lines 874 and 880-885 use inline string concatenation instead of template files. CLAUDE.md requires multi-line prompt strings to live in crates/ironclaw_engine/prompts/*.md and be loaded via include_str!(). This violates the stated architecture and makes localization/A/B testing difficult.

https://github.com/anthropics/ironclaw/blob/3f727c41afda86b7a1c74f354cefb27d9d192dce/crates/ironclaw_engine/orchestrator/default.py#L874-L885

  1. [HIGH:88] Magic number fragmentation: max_consecutive_errors + 2 threshold lacks documentation — Failure at >= max_consecutive_errors + 2 vs nudge at >= max_consecutive_errors introduces unexplained logic. The + 2 offset is unmotivated, unparameterized, and inconsistent with Tier 1 error handling. This diverges from design goal of "Consistency with Rust loop" (issue Orchestrator: add action error counting to Python execution loop #2325).

https://github.com/anthropics/ironclaw/blob/3f727c41afda86b7a1c74f354cefb27d9d192dce/crates/ironclaw_engine/orchestrator/default.py#L869-L876

  1. [HIGH:78] Missing timeout on Python VM executionrun_python_final() (refactored test helper) has memory limits (500_000 allocations) but no wall-clock timeout. If orchestrator Python code ever includes user-influenced loops, it can hang indefinitely. Current code is safe only because orchestrator has no dynamic loops.

https://github.com/anthropics/ironclaw/blob/3f727c41afda86b7a1c74f354cefb27d9d192dce/crates/ironclaw_engine/src/executor/orchestrator.rs#L2481-L2483


MEDIUM (7)

  1. [MEDIUM:85] Unbounded string concatenation in hot path — Nudge message constructed via string concatenation at every failed batch crossing threshold. Repeated append_message() calls accumulate context weight unnecessarily.

https://github.com/anthropics/ironclaw/blob/3f727c41afda86b7a1c74f354cefb27d9d192dce/crates/ironclaw_engine/orchestrator/default.py#L880-L885

  1. [MEDIUM:78] Test extraction refactoring introduces untyped Monty bridgerun_python_final() loses type safety by returning MontyObject and forcing callers to pattern-match with panics. Better approach: generic eval_python_value<T: From<MontyObject>> with a proper conversion trait.

https://github.com/anthropics/ironclaw/blob/3f727c41afda86b7a1c74f354cefb27d9d192dce/crates/ironclaw_engine/src/executor/orchestrator.rs#L2475-L2516

  1. [MEDIUM:75] Test refactoring introduces code duplicationeval_python_bool() and eval_python_int() both duplicate the helpers extraction logic (searching for \ndef run_loop(). Should extract extract_helpers() function.

https://github.com/anthropics/ironclaw/blob/3f727c41afda86b7a1c74f354cefb27d9d192dce/crates/ironclaw_engine/src/executor/orchestrator.rs#L2518-L2525

  1. [MEDIUM:72] DRY violation: checkpoint persistence repeated 7 times__save_checkpoint__() dict is copy-pasted at lines 615, 635, 671, 769, 790, 807, 890. Future counter additions require hunting down all sites. Should extract _save_action_error_checkpoint() helper.

https://github.com/anthropics/ironclaw/blob/3f727c41afda86b7a1c74f354cefb27d9d192dce/crates/ironclaw_engine/orchestrator/default.py#L612-L617

  1. [MEDIUM:72] Repeated JSON parsing in checkpoint serializationmonty_to_json() called twice per checkpoint; then .as_u64() chained 4 times. Low practical impact (checkpoints are infrequent), but indicates opportunity for optimization.

https://github.com/anthropics/ironclaw/blob/3f727c41afda86b7a1c74f354cefb27d9d192dce/crates/ironclaw_engine/src/executor/orchestrator.rs#L1741-L1744

  1. [MEDIUM:70] Batch error/success counting assumes parallel results alignment — No assertion verifies len(results) == len(executable_calls). If Rust batch handler ever reorders results (e.g., by completion time), errors will pair with wrong calls silently.

https://github.com/anthropics/ironclaw/blob/3f727c41afda86b7a1c74f354cefb27d9d192dce/crates/ironclaw_engine/orchestrator/default.py#L734-L757

  1. [MEDIUM:68] Partial batch success logic lacks explicit documentation — Reset on ANY success, increment only on ALL-fail design is sound but not explained. Code comment should justify why.

https://github.com/anthropics/ironclaw/blob/3f727c41afda86b7a1c74f354cefb27d9d192dce/crates/ironclaw_engine/orchestrator/default.py#L864-L867


LOW (1)

  1. [LOW:52] Docstring synchronization debt — Old docstring at line 2470 describes bool-only behavior; new one describes generic behavior. Orphaned first line should be removed.

https://github.com/anthropics/ironclaw/blob/3f727c41afda86b7a1c74f354cefb27d9d192dce/crates/ironclaw_engine/src/executor/orchestrator.rs#L2470-L2474


Recommendations (by priority)

IMMEDIATE (production safety):

  • Add timeout wrapper to run_python_final() to prevent infinite-loop hangs
  • Add assertion for len(results) == len(executable_calls) to catch result alignment bugs

BEFORE MERGE (architectural consistency):

  • Move error messages to crates/ironclaw_engine/prompts/error_messages.md
  • Document or parameterize the + 2 threshold offset
  • Extract _save_action_error_checkpoint() helper to reduce DRY violations

NICE-TO-HAVE (cleanup):

  • Fix docstring at line 2470
  • Extract extract_helpers() to reduce test code duplication
  • Add generic eval_python_value<T> test helper for type safety

Overall: Feature is sound and tests are comprehensive. These are maintainability and clarity improvements, not correctness blockers. Most LOW/MEDIUM issues are maintenance debt rather than production bugs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

contributor: core 20+ merged PRs risk: low Changes to docs, tests, or low-risk modules size: L 200-499 changed lines staging-promotion

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant