3ebb resilience landing (direct-to-main, manual bypass-merge) — fixes live e1bf + lands DD1-3/9d2c/284b#688
Conversation
3ebb DD1 (resilience-harden, part 1): codify the review-gate decidability lattice PASS/FAIL/INDETERMINATE and retrofit the first named Goal-1 gate. - review-tristate-lib.sh: lattice helper (generalizes precondition-gate.sh exit-78). tristate_classify_verdict (unreviewed>0 -> FAIL/1 safe bottom; errors-only -> INDETERMINATE/75; clean -> PASS/0), tristate_is_transient_error (5xx/429/rate-limit/timeout/reset/gateway only), tristate_code_to_name. - review-tristate-lattice.md: review-specific contract + security guardrail. - review-coverage-invariant.sh: separates could-not-confirm (INDETERMINATE/75) from a genuine unreviewed SHA (FAIL/1); both still block under enforce. TDD: test-review-tristate-lib.sh 20/20 (new); coverage-invariant T4 -> 75, T4b violation-dominates -> 1; harness hermetic. Refs 3ebb-fa93-f980-40de DD1. DSO-Story: 3ebb-fa93-f980-40de
…arker-absent Completes 3ebb DD1: retrofit the second named Goal-1 gate to the tristate lattice. The marker-absent path (provenance verifier crashed / never ran) previously exited 1 — conflated with the normal 'dispatch runner' exit 1. It is genuinely 'could not COMPUTE the dispatch decision' -> INDETERMINATE (exit 75): still blocks (never the 8a77 silent-skip exit 0), but a distinct signal the orchestrator may retry on a transient cause / route to escalation (DD3). Sources review-tristate-lib.sh script-relative (hermetic under test). TDD: test-llm-review-dispatch-or-skip.sh Test 9 updated (marker-absent -> 75); 43/43 green. Refs 3ebb-fa93-f980-40de DD1. DSO-Story: 3ebb-fa93-f980-40de
… never authoritative-empty 3ebb DD2 (durable --resume): treat a GitHub PR-state read as authoritative only when the call succeeded AND returned well-formed JSON. A non-zero rc, an empty payload, or truncated/malformed JSON (a page cut short) is INDETERMINATE. - resume_pr_query_trustworthy (pure, INC-008 guard): rc 0 + valid JSON (incl. a legitimately-empty '[]') -> trustworthy; otherwise -> 75 (INDETERMINATE). The empty-but-valid '[]' case is deliberately distinguished from a truncated read. - advance-to-PR2 wiring: the PR1-state read now routes through the guard. On an indeterminate read it re-fetches once, then HALTS as INDETERMINATE (exit 75) rather than GUESS a direction — collapsing partial state into '0 PRs' would advance onto unverified state (skip review) and into '>0' would mint a duplicate staged-*+PR1. Neither is safe to guess (INC-008 / INC-015 stall, hit twice this session). TDD: test-merge-to-main-resume-inc008.sh 8/8 (new); resume regressions green (staged-advance 12/12, existing-pr-discovery 6/6). No new shellcheck findings. Refs 3ebb-fa93-f980-40de DD2. DSO-Story: 3ebb-fa93-f980-40de
…dicts 3ebb DD3: an INDETERMINATE verdict from any gate routes to /dso:fp-recovery via a single standardized, greppable (telemetry-ready) marker — never manual git surgery. Last-resort by design: fires only after transient retries are spent; a genuine FAIL never escalates here. - tristate_indeterminate_escalation (review-tristate-lib.sh): emits the INDETERMINATE_ESCALATION JSON marker + an actionable banner; JSON-safe reason. - wired at all three INDETERMINATE sites: review-coverage-invariant.sh (INDETERMINATE-enforce exit), llm-review-dispatch-or-skip.sh (marker-absent), merge-to-main-pr.sh (INC-008 resume halt, after one re-fetch). - merge-to-main-pr.sh now sources review-tristate-lib.sh (script-relative). - contract doc documents the escalation surface + the anti-friction stance (escalation rate is a defect signal, not normal operation). TDD: test-review-tristate-lib.sh 23/23 (T8/T8b/T8c added); coverage-invariant 14/14, dispatcher 43/43, inc008 8/8, staged-advance 12/12 — all green. Refs 3ebb-fa93-f980-40de DD3. DSO-Story: 3ebb-fa93-f980-40de
…/tmp is a cache) Completes 3ebb DD2: treat /tmp/merge-to-main-state-*.json as a CACHE, not the source of truth (GitHub + git are), and kill the INC-015 stale-checkout stall. - resume_staged_ref_is_spent (pure): a staged-* ref with 0 commits ahead of origin/main is SPENT (already merged / empty). Fail-safe: unknown count is NOT spent (never skip/GC on uncertainty). - resume_gc_stale_staged_state: GC /tmp cache entries for staged-* branches that are GONE from origin or SPENT, so a later --resume never reconstructs off a completed promotion's stale state. - current-branch assertion (INC-015 (a)): before the PR1/PR2 phase, if the worktree sits on a SPENT staged-* ref, emit a concrete recovery (checkout your feature branch) and exit 1 + GC the stale cache — instead of the confusing 'No commits between main and staged-*' stall. Deliberately NOT routed to /dso:fp-recovery: a wrong-branch checkout is a DETERMINATE operational error, not an INDETERMINATE verdict (anti-over-reliance, per the friction review). TDD: test-merge-to-main-resume-inc008.sh 17/17 (predicate T8-T11 + GC T12-T14); resume regressions green (staged-advance 12, resume-and-stale 7, cwd 7, pr2-statefile 2, existing-pr-discovery 6, freshness 6). No new shellcheck. Refs 3ebb-fa93-f980-40de DD2. DSO-Story: 3ebb-fa93-f980-40de
CI ShellCheck (warning threshold) flagged SC2120 — resume_gc_stale_staged_state references $1 but production callers use the /tmp default; only the test passes an explicit dir. Add a scoped shellcheck disable documenting the intentional optional arg. Comment-only; no behavior change. DSO-Story: 3ebb-fa93-f980-40de
…branch fix) CI Script Tests failed T14_gc_keeps_live: the fixture used 'git branch staged-spent main' / 'checkout -b staged-live main', but CI's init.defaultBranch=master means no local 'main' existed -> 'src refspec does not match any' -> branches never created -> GC removed their (gone) cache. Push HEAD directly to named remote refs instead; no local-branch-name dependency. Verified 17/17 both normally and with init.defaultBranch=master. Production GC code unchanged (it was correct). DSO-Story: 3ebb-fa93-f980-40de
…nate conflict recovery cca8-DD1 follow-on (vetted via the epic 3-lens panel + convergence reviewer). When origin/main advanced mid-PR2, _phase_merge rebased the staged-* ref and force-pushed it; the staged-* ruleset requires review-sub-pr on pushes (a check that only runs on pull_request events, never on a push), so the force-push was structurally un-satisfiable -> GH013 rejection -> pipeline died with a misleading CONFLICT_DATA. - _branch_is_staged_promotion: single-source-of-truth predicate; folds the 4 scattered '[[ $BRANCH == staged-* ]]' literals (N1) so PR2-phase semantics cannot drift. - R-A: in _phase_merge, SKIP the entire publish block (sync + source-branch version-bump + push) when on a staged-* promotion. staged was already created/pushed/reviewed by PR1; GitHub PR2 rebase-merge replays it onto current main at merge time. Prevents the GH013 from relocating to the plain-push path. - R-C: a true PR2 CONFLICTING is a DETERMINATE operational error -> concrete re-stage RECOVERY + exit 1, NOT routed to /dso:fp-recovery (which cannot rebase staged); poll until mergeable is computed first (N2). - BEHIND-base refresh is intentionally NOT duplicated here — the existing auto-merge poll loop already calls gh pr update-branch (jira-dig-2529). TDD: test-merge-to-main-resume-inc008.sh +6 predicate cases (both polarities, 23/23); linear-sync 6/6, staged-advance 12/12, resume-and-stale 7/7 green. shellcheck -S warning clean. Refs 3ebb-fa93-f980-40de; origin cca8-f1ed-ddd3-48f1 DD1. DSO-Story: 3ebb-fa93-f980-40de
…imitations Document the two race classes the two-tier-over-merge-queue choice does NOT serialize away (consequences of declining a merge queue, mitigated by quiescing): - L1: concurrent PR2s can be individually-green but semantically conflict on main (the merge-queue serialization this ADR traded away). Mitigation: one main-bound promotion at a time. - L2: the 3ee4 enforce-flip races in-flight PR2s during the live<->file skew window. Mitigation: runbook quiesce (no PR2 in flight during the swap). Both are correctness/operational limits, not integrity regressions (no unreviewed content can reach main). Surfaced per a workflow-resilience review so they are not rediscovered as surprises; the 3ebb DD3 drift-reconciler stretch goal is the place to revisit if the concurrency cost becomes material. DSO-Story: 3ebb-fa93-f980-40de
CI ShellCheck (warning) flagged SC2120 — _branch_is_staged_promotion references $1 but production callers use the $BRANCH default; only the test passes an explicit arg. Add the scoped disable (same pattern as resume_gc_stale_staged_state). Comment-only; verified SC2120=0 at default severity (CI-equivalent); remaining 8 findings are pre-existing SC2004/SC2006 style (below CI's failure threshold). DSO-Story: 3ebb-fa93-f980-40de
- F3 (real): add a cleanup trap for the inc008 GC temp dir so /tmp is not left stale on early exit/interrupt. - F4 (real): resolve the tristate test's lib path via $BASH_SOURCE (script- relative) instead of 'git rev-parse --show-toplevel', robust under worktrees. - F1 (relaxed): T4/T4b assert the behavioral exit code (75=INDETERMINATE, 1=FAIL-dominates) + the offending SHA's presence, dropping the prose verdict word to reduce output-string coupling. F2 (dispatcher marker assertion) defended: it asserts exit 75 + the stable 'provenance-complete.marker' filename (a contract path, not prose) — kept. Tests green: tristate 23/23, inc008 23/23, coverage-invariant 14/14. DSO-Story: 3ebb-fa93-f980-40de
…er, GC fail-safe, poll sleep) Three real findings from review-sub-pr on PR #676 (production code): - tristate_is_transient_error: the 5xx/429 globs (*50[0-9]*, *429*) matched a status code EMBEDDED in a larger number (e.g. '504800 bytes') -> could flip a non-transient INDETERMINATE into a retry, violating lattice rule 4. Anchor the numeric match on non-digit boundaries (grep -E '(^|[^0-9])(5[0-9]{2}|429)([^0-9]|$)'). +4 edge-case tests (byte counts / offsets / SHA fragments stay non-transient); 27/27. - resume_gc_stale_staged_state: an empty ls-remote was treated unconditionally as 'branch gone -> GC cache', so a TRANSIENT ls-remote failure could delete a live staged ref's cache. Now distinguishes rc!=0 (transient -> keep, fail-safe) from rc==0+empty (truly gone -> GC). - mergeability poll: skip the trailing sleep after the final (6th) attempt. DSO-Story: 3ebb-fa93-f980-40de
- test-llm-review-dispatch-or-skip (T9 no-marker path): assert stderr emits the INDETERMINATE_ESCALATION: machine-routing marker (DD3), not just exit 75 — a partial impl that exits 75 without escalating would otherwise pass silently. - test-review-coverage-invariant (T4b): assert BOTH range SHAs are walked (UNREVIEWED sha1 + INDETERMINATE sha2 + summary unreviewed=1 errors=1) so the violation-dominates-error scenario is genuinely exercised, not just sha2. - test-merge-to-main-resume-inc008: isolate git config (GIT_CONFIG_GLOBAL/SYSTEM =/dev/null) so ambient ~/.gitconfig or /etc/gitconfig hooks/signing/aliases cannot leak into the init/clone/commit/push test setup. All green: dispatch 44, coverage-invariant 14, inc008 23. shellcheck SC2120=0. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The schema-correction-exhausted fail-closed path in the dso_ci_review runner emitted only a generic "exhausted all N attempt(s)" message, omitting _schema_result.errors. The violating field/value lived solely in the synthetic schema_error finding description (written to the findings JSON, which CI does not upload), so the actual violation was unrecoverable from the CI job log. - runner.py: append the (truncated, max 10 + overflow) schema errors to the fail-closed stderr message. Enforcement is unchanged — still returns 1. - runner.py: reconcile get_schema_correction_max_attempts docstring (default 1 -> 2) to match the code (read_config_int default 2). - add test_runner_schema_exhaustion_stderr.py (RED->GREEN): asserts the schema error detail appears in stderr AND exit is non-zero (fail-closed preserved). Diagnosability-only: does not change which findings block. Follow-up 284b-2f0a-3b12-41d6 tracks the functional convergence fix (frozen-field re-rejection), blocked until a captured payload exists — which this enables. Antipattern-Scan: fail-closed-swallows-schema-errors root=plugins/dso/scripts/dso_ci_review matches=0 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… (9d2c) Address review finding on PR #680: the single-error test did not exercise the truncation logic. Add test_schema_exhaustion_stderr_truncates_and_counts_overflow (12 distinct errors) asserting the REAL formatting in runner.py — first 10 present, 11th/12th truncated, "(+2 more)" overflow suffix — so a broken slice or overflow count now fails the test. Refactor shared setup into _run_exhaustion_path. The pre-existing single-error test already pins `_err_preview` interpolation (it fails if the errors are not interpolated into the message). Mocking _validate_findings_schema (external subprocess) and dispatch_schema_correction (live LLM dispatch) is appropriate unit isolation; the assertions verify the runner's own formatting output, not mock passthrough. Antipattern-Scan: fail-closed-swallows-schema-errors root=plugins/dso/scripts/dso_ci_review matches=0 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
DSO-Story: 3ebb-fa93-f980-40de
…284b) The dso_ci_review dispatcher appends a `fallback_hops` provenance key to the result dict when a provider-fallback hop occurs (dispatch.py:817/852) — AFTER the LLM output, so the schema-correction loop cannot strip it. The validator's top-level allowlist omitted it, so any review that hopped providers failed schema validation, exhausted the 2-attempt corrector, and blocked merge fail-closed with 0 content findings (the recurring "schema correction exhausted" failure; payload captured on PR #684 via the 9d2c diagnosability logging). Fix: add fallback_hops to optional_top in validate-review-output.sh. Per the epic-588e panel, ALLOWLIST (not strip) — stripping unknown top-level keys would mask a malformed/empty payload into a false pass. RED->GREEN: test_code_review_dispatch_with_fallback_hops_passes; validator suite 96/96. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- runner.py: str()-coerce schema-error entries before join so a non-string entry (None/dict/int) cannot raise TypeError and crash the schema-exhaustion fail-closed branch. +regression test (non-string + truncation cases). - merge-to-main-pr.sh: the PR mergeability poll now fails CLOSED when `gh pr view` exits non-zero (404 deleted PR / 401 revoked-or-expired token / API outage) and leaves mergeability unresolved, instead of silently proceeding as if mergeable. A legitimately-UNKNOWN-then-timeout with gh rc 0 keeps the prior behavior. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Warning Review limit reached
More reviews will be available in 47 minutes and 27 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (15)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
CI failed: The CI build failed because the PR is targeting the 'main' branch directly, which violates the required 'staged-*' branch promotion policy.OverviewOne failure identified in the FailuresBranch Naming Policy Violation (confidence: high)
Summary
Code Review ✅ ApprovedDeploys 3ebb resilience updates, including a fix for the live e1bf regression and improved schema error diagnostics. No issues found. Tip Comment OptionsAuto-apply is off → Gitar will not commit updates to this branch. Comment with these commands to change:
Was this helpful? React with 👍 / 👎 | Gitar |
3ebb resilience landing — direct PR to main for manual bypass-merge
Why this PR exists (not the normal two-tier flow): the staged two-tier
merge-to-mainflow livelocked —review-sub-prFP-amplified (escalating, re-litigated findings) so each landing attempt was slow enough thatmainadvanced and re-stale-d the staged branch (3× determinate version-only conflicts: PRs #682/#683/#687). Per maintainer instruction, opening a direct PR tomainfor a manual bypass-merge.The content is twice fp-recovery-cleared (0 findings):
926898d3…), examining the exact areasreview-sub-prlater FP-flagged.What lands here (15 files; rebased onto current
main):--resumeINC-008/INC-015, escalation)_branch_is_staged_promotion— fixes the livee1bfGH013 regression onmain9d2c) schema-error diagnosability + O5 (284b)fallback_hopsallowlistBypass note:
review-sub-pr/llm-reviewhere is the FP-amplified check; all other required checks should be green. Manual web-UI bypass-merge by the named bypass actor. Closes the dd123-5/-6/-7 re-stage loop.