3ebb resilience landing (direct-to-main, manual bypass-merge) — fixes live e1bf + lands DD1-3/9d2c/284b by joeoakhart · Pull Request #688 · navapbc/digital-service-orchestra

joeoakhart · 2026-06-06T03:12:33Z

3ebb resilience landing — direct PR to main for manual bypass-merge

Why this PR exists (not the normal two-tier flow): the staged two-tier merge-to-main flow livelocked — review-sub-pr FP-amplified (escalating, re-litigated findings) so each landing attempt was slow enough that main advanced and re-stale-d the staged branch (3× determinate version-only conflicts: PRs #682/#683/#687). Per maintainer instruction, opening a direct PR to main for a manual bypass-merge.

The content is twice fp-recovery-cleared (0 findings):

PR staged: story/588e-abec-221f-42c0/3ebb-resilience-dd123-5 -> staged-9c51981f3a40-1780696804 #676 neutral opus re-review → 0 findings
PR staged: story/588e-abec-221f-42c0/3ebb-resilience-dd123-6 -> staged-793b29ce7031-1780709220 #684 neutral opus re-review → 0 findings (reviewer hash 926898d3…), examining the exact areas review-sub-pr later FP-flagged.

What lands here (15 files; rebased onto current main):

3ebb DD1–3 (tristate review-gate lattice, durable --resume INC-008/INC-015, escalation)
staged-force-push fix _branch_is_staged_promotion — fixes the live e1bf GH013 regression on main
fix A (9d2c) schema-error diagnosability + O5 (284b) fallback_hops allowlist
2 hardening fixes (str()-coerce schema errors; structural-error fail-closed on mergeability poll)

Bypass note: review-sub-pr/llm-review here is the FP-amplified check; all other required checks should be green. Manual web-UI bypass-merge by the named bypass actor. Closes the dd123-5/-6/-7 re-stage loop.

3ebb DD1 (resilience-harden, part 1): codify the review-gate decidability lattice PASS/FAIL/INDETERMINATE and retrofit the first named Goal-1 gate. - review-tristate-lib.sh: lattice helper (generalizes precondition-gate.sh exit-78). tristate_classify_verdict (unreviewed>0 -> FAIL/1 safe bottom; errors-only -> INDETERMINATE/75; clean -> PASS/0), tristate_is_transient_error (5xx/429/rate-limit/timeout/reset/gateway only), tristate_code_to_name. - review-tristate-lattice.md: review-specific contract + security guardrail. - review-coverage-invariant.sh: separates could-not-confirm (INDETERMINATE/75) from a genuine unreviewed SHA (FAIL/1); both still block under enforce. TDD: test-review-tristate-lib.sh 20/20 (new); coverage-invariant T4 -> 75, T4b violation-dominates -> 1; harness hermetic. Refs 3ebb-fa93-f980-40de DD1. DSO-Story: 3ebb-fa93-f980-40de

…arker-absent Completes 3ebb DD1: retrofit the second named Goal-1 gate to the tristate lattice. The marker-absent path (provenance verifier crashed / never ran) previously exited 1 — conflated with the normal 'dispatch runner' exit 1. It is genuinely 'could not COMPUTE the dispatch decision' -> INDETERMINATE (exit 75): still blocks (never the 8a77 silent-skip exit 0), but a distinct signal the orchestrator may retry on a transient cause / route to escalation (DD3). Sources review-tristate-lib.sh script-relative (hermetic under test). TDD: test-llm-review-dispatch-or-skip.sh Test 9 updated (marker-absent -> 75); 43/43 green. Refs 3ebb-fa93-f980-40de DD1. DSO-Story: 3ebb-fa93-f980-40de

… never authoritative-empty 3ebb DD2 (durable --resume): treat a GitHub PR-state read as authoritative only when the call succeeded AND returned well-formed JSON. A non-zero rc, an empty payload, or truncated/malformed JSON (a page cut short) is INDETERMINATE. - resume_pr_query_trustworthy (pure, INC-008 guard): rc 0 + valid JSON (incl. a legitimately-empty '[]') -> trustworthy; otherwise -> 75 (INDETERMINATE). The empty-but-valid '[]' case is deliberately distinguished from a truncated read. - advance-to-PR2 wiring: the PR1-state read now routes through the guard. On an indeterminate read it re-fetches once, then HALTS as INDETERMINATE (exit 75) rather than GUESS a direction — collapsing partial state into '0 PRs' would advance onto unverified state (skip review) and into '>0' would mint a duplicate staged-*+PR1. Neither is safe to guess (INC-008 / INC-015 stall, hit twice this session). TDD: test-merge-to-main-resume-inc008.sh 8/8 (new); resume regressions green (staged-advance 12/12, existing-pr-discovery 6/6). No new shellcheck findings. Refs 3ebb-fa93-f980-40de DD2. DSO-Story: 3ebb-fa93-f980-40de

…dicts 3ebb DD3: an INDETERMINATE verdict from any gate routes to /dso:fp-recovery via a single standardized, greppable (telemetry-ready) marker — never manual git surgery. Last-resort by design: fires only after transient retries are spent; a genuine FAIL never escalates here. - tristate_indeterminate_escalation (review-tristate-lib.sh): emits the INDETERMINATE_ESCALATION JSON marker + an actionable banner; JSON-safe reason. - wired at all three INDETERMINATE sites: review-coverage-invariant.sh (INDETERMINATE-enforce exit), llm-review-dispatch-or-skip.sh (marker-absent), merge-to-main-pr.sh (INC-008 resume halt, after one re-fetch). - merge-to-main-pr.sh now sources review-tristate-lib.sh (script-relative). - contract doc documents the escalation surface + the anti-friction stance (escalation rate is a defect signal, not normal operation). TDD: test-review-tristate-lib.sh 23/23 (T8/T8b/T8c added); coverage-invariant 14/14, dispatcher 43/43, inc008 8/8, staged-advance 12/12 — all green. Refs 3ebb-fa93-f980-40de DD3. DSO-Story: 3ebb-fa93-f980-40de

…/tmp is a cache) Completes 3ebb DD2: treat /tmp/merge-to-main-state-*.json as a CACHE, not the source of truth (GitHub + git are), and kill the INC-015 stale-checkout stall. - resume_staged_ref_is_spent (pure): a staged-* ref with 0 commits ahead of origin/main is SPENT (already merged / empty). Fail-safe: unknown count is NOT spent (never skip/GC on uncertainty). - resume_gc_stale_staged_state: GC /tmp cache entries for staged-* branches that are GONE from origin or SPENT, so a later --resume never reconstructs off a completed promotion's stale state. - current-branch assertion (INC-015 (a)): before the PR1/PR2 phase, if the worktree sits on a SPENT staged-* ref, emit a concrete recovery (checkout your feature branch) and exit 1 + GC the stale cache — instead of the confusing 'No commits between main and staged-*' stall. Deliberately NOT routed to /dso:fp-recovery: a wrong-branch checkout is a DETERMINATE operational error, not an INDETERMINATE verdict (anti-over-reliance, per the friction review). TDD: test-merge-to-main-resume-inc008.sh 17/17 (predicate T8-T11 + GC T12-T14); resume regressions green (staged-advance 12, resume-and-stale 7, cwd 7, pr2-statefile 2, existing-pr-discovery 6, freshness 6). No new shellcheck. Refs 3ebb-fa93-f980-40de DD2. DSO-Story: 3ebb-fa93-f980-40de

CI ShellCheck (warning threshold) flagged SC2120 — resume_gc_stale_staged_state references $1 but production callers use the /tmp default; only the test passes an explicit dir. Add a scoped shellcheck disable documenting the intentional optional arg. Comment-only; no behavior change. DSO-Story: 3ebb-fa93-f980-40de

…branch fix) CI Script Tests failed T14_gc_keeps_live: the fixture used 'git branch staged-spent main' / 'checkout -b staged-live main', but CI's init.defaultBranch=master means no local 'main' existed -> 'src refspec does not match any' -> branches never created -> GC removed their (gone) cache. Push HEAD directly to named remote refs instead; no local-branch-name dependency. Verified 17/17 both normally and with init.defaultBranch=master. Production GC code unchanged (it was correct). DSO-Story: 3ebb-fa93-f980-40de

…nate conflict recovery cca8-DD1 follow-on (vetted via the epic 3-lens panel + convergence reviewer). When origin/main advanced mid-PR2, _phase_merge rebased the staged-* ref and force-pushed it; the staged-* ruleset requires review-sub-pr on pushes (a check that only runs on pull_request events, never on a push), so the force-push was structurally un-satisfiable -> GH013 rejection -> pipeline died with a misleading CONFLICT_DATA. - _branch_is_staged_promotion: single-source-of-truth predicate; folds the 4 scattered '[[ $BRANCH == staged-* ]]' literals (N1) so PR2-phase semantics cannot drift. - R-A: in _phase_merge, SKIP the entire publish block (sync + source-branch version-bump + push) when on a staged-* promotion. staged was already created/pushed/reviewed by PR1; GitHub PR2 rebase-merge replays it onto current main at merge time. Prevents the GH013 from relocating to the plain-push path. - R-C: a true PR2 CONFLICTING is a DETERMINATE operational error -> concrete re-stage RECOVERY + exit 1, NOT routed to /dso:fp-recovery (which cannot rebase staged); poll until mergeable is computed first (N2). - BEHIND-base refresh is intentionally NOT duplicated here — the existing auto-merge poll loop already calls gh pr update-branch (jira-dig-2529). TDD: test-merge-to-main-resume-inc008.sh +6 predicate cases (both polarities, 23/23); linear-sync 6/6, staged-advance 12/12, resume-and-stale 7/7 green. shellcheck -S warning clean. Refs 3ebb-fa93-f980-40de; origin cca8-f1ed-ddd3-48f1 DD1. DSO-Story: 3ebb-fa93-f980-40de

…imitations Document the two race classes the two-tier-over-merge-queue choice does NOT serialize away (consequences of declining a merge queue, mitigated by quiescing): - L1: concurrent PR2s can be individually-green but semantically conflict on main (the merge-queue serialization this ADR traded away). Mitigation: one main-bound promotion at a time. - L2: the 3ee4 enforce-flip races in-flight PR2s during the live<->file skew window. Mitigation: runbook quiesce (no PR2 in flight during the swap). Both are correctness/operational limits, not integrity regressions (no unreviewed content can reach main). Surfaced per a workflow-resilience review so they are not rediscovered as surprises; the 3ebb DD3 drift-reconciler stretch goal is the place to revisit if the concurrency cost becomes material. DSO-Story: 3ebb-fa93-f980-40de

CI ShellCheck (warning) flagged SC2120 — _branch_is_staged_promotion references $1 but production callers use the $BRANCH default; only the test passes an explicit arg. Add the scoped disable (same pattern as resume_gc_stale_staged_state). Comment-only; verified SC2120=0 at default severity (CI-equivalent); remaining 8 findings are pre-existing SC2004/SC2006 style (below CI's failure threshold). DSO-Story: 3ebb-fa93-f980-40de

- F3 (real): add a cleanup trap for the inc008 GC temp dir so /tmp is not left stale on early exit/interrupt. - F4 (real): resolve the tristate test's lib path via $BASH_SOURCE (script- relative) instead of 'git rev-parse --show-toplevel', robust under worktrees. - F1 (relaxed): T4/T4b assert the behavioral exit code (75=INDETERMINATE, 1=FAIL-dominates) + the offending SHA's presence, dropping the prose verdict word to reduce output-string coupling. F2 (dispatcher marker assertion) defended: it asserts exit 75 + the stable 'provenance-complete.marker' filename (a contract path, not prose) — kept. Tests green: tristate 23/23, inc008 23/23, coverage-invariant 14/14. DSO-Story: 3ebb-fa93-f980-40de

…er, GC fail-safe, poll sleep) Three real findings from review-sub-pr on PR #676 (production code): - tristate_is_transient_error: the 5xx/429 globs (*50[0-9]*, *429*) matched a status code EMBEDDED in a larger number (e.g. '504800 bytes') -> could flip a non-transient INDETERMINATE into a retry, violating lattice rule 4. Anchor the numeric match on non-digit boundaries (grep -E '(^|[^0-9])(5[0-9]{2}|429)([^0-9]|$)'). +4 edge-case tests (byte counts / offsets / SHA fragments stay non-transient); 27/27. - resume_gc_stale_staged_state: an empty ls-remote was treated unconditionally as 'branch gone -> GC cache', so a TRANSIENT ls-remote failure could delete a live staged ref's cache. Now distinguishes rc!=0 (transient -> keep, fail-safe) from rc==0+empty (truly gone -> GC). - mergeability poll: skip the trailing sleep after the final (6th) attempt. DSO-Story: 3ebb-fa93-f980-40de

- test-llm-review-dispatch-or-skip (T9 no-marker path): assert stderr emits the INDETERMINATE_ESCALATION: machine-routing marker (DD3), not just exit 75 — a partial impl that exits 75 without escalating would otherwise pass silently. - test-review-coverage-invariant (T4b): assert BOTH range SHAs are walked (UNREVIEWED sha1 + INDETERMINATE sha2 + summary unreviewed=1 errors=1) so the violation-dominates-error scenario is genuinely exercised, not just sha2. - test-merge-to-main-resume-inc008: isolate git config (GIT_CONFIG_GLOBAL/SYSTEM =/dev/null) so ambient ~/.gitconfig or /etc/gitconfig hooks/signing/aliases cannot leak into the init/clone/commit/push test setup. All green: dispatch 44, coverage-invariant 14, inc008 23. shellcheck SC2120=0. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The schema-correction-exhausted fail-closed path in the dso_ci_review runner emitted only a generic "exhausted all N attempt(s)" message, omitting _schema_result.errors. The violating field/value lived solely in the synthetic schema_error finding description (written to the findings JSON, which CI does not upload), so the actual violation was unrecoverable from the CI job log. - runner.py: append the (truncated, max 10 + overflow) schema errors to the fail-closed stderr message. Enforcement is unchanged — still returns 1. - runner.py: reconcile get_schema_correction_max_attempts docstring (default 1 -> 2) to match the code (read_config_int default 2). - add test_runner_schema_exhaustion_stderr.py (RED->GREEN): asserts the schema error detail appears in stderr AND exit is non-zero (fail-closed preserved). Diagnosability-only: does not change which findings block. Follow-up 284b-2f0a-3b12-41d6 tracks the functional convergence fix (frozen-field re-rejection), blocked until a captured payload exists — which this enables. Antipattern-Scan: fail-closed-swallows-schema-errors root=plugins/dso/scripts/dso_ci_review matches=0 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… (9d2c) Address review finding on PR #680: the single-error test did not exercise the truncation logic. Add test_schema_exhaustion_stderr_truncates_and_counts_overflow (12 distinct errors) asserting the REAL formatting in runner.py — first 10 present, 11th/12th truncated, "(+2 more)" overflow suffix — so a broken slice or overflow count now fails the test. Refactor shared setup into _run_exhaustion_path. The pre-existing single-error test already pins `_err_preview` interpolation (it fails if the errors are not interpolated into the message). Mocking _validate_findings_schema (external subprocess) and dispatch_schema_correction (live LLM dispatch) is appropriate unit isolation; the assertions verify the runner's own formatting output, not mock passthrough. Antipattern-Scan: fail-closed-swallows-schema-errors root=plugins/dso/scripts/dso_ci_review matches=0 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

DSO-Story: 3ebb-fa93-f980-40de

…284b) The dso_ci_review dispatcher appends a `fallback_hops` provenance key to the result dict when a provider-fallback hop occurs (dispatch.py:817/852) — AFTER the LLM output, so the schema-correction loop cannot strip it. The validator's top-level allowlist omitted it, so any review that hopped providers failed schema validation, exhausted the 2-attempt corrector, and blocked merge fail-closed with 0 content findings (the recurring "schema correction exhausted" failure; payload captured on PR #684 via the 9d2c diagnosability logging). Fix: add fallback_hops to optional_top in validate-review-output.sh. Per the epic-588e panel, ALLOWLIST (not strip) — stripping unknown top-level keys would mask a malformed/empty payload into a false pass. RED->GREEN: test_code_review_dispatch_with_fallback_hops_passes; validator suite 96/96. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- runner.py: str()-coerce schema-error entries before join so a non-string entry (None/dict/int) cannot raise TypeError and crash the schema-exhaustion fail-closed branch. +regression test (non-string + truncation cases). - merge-to-main-pr.sh: the PR mergeability poll now fails CLOSED when `gh pr view` exits non-zero (404 deleted PR / 401 revoked-or-expired token / API outage) and leaves mergeability unresolved, instead of silently proceeding as if mergeable. A legitimately-UNKNOWN-then-timeout with gh rc 0 keeps the prior behavior. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-06-06T03:12:39Z

Warning

Review limit reached

@joeoakhart, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 47 minutes and 27 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 4819fbdd-4db7-44ca-960f-f8dd130b9d52

📥 Commits

Reviewing files that changed from the base of the PR and between efe667c and beee290.

📒 Files selected for processing (15)

docs/adr/0020-two-tier-hardening-over-merge-queue.md
plugins/dso/.claude-plugin/plugin.json
plugins/dso/docs/contracts/review-tristate-lattice.md
plugins/dso/scripts/ci/review-coverage-invariant.sh
plugins/dso/scripts/dso_ci_review/runner.py
plugins/dso/scripts/lib/review-tristate-lib.sh
plugins/dso/scripts/llm-review-dispatch-or-skip.sh
plugins/dso/scripts/merge-to-main-pr.sh
plugins/dso/scripts/validate-review-output.sh
tests/hooks/test-validate-review-output.sh
tests/scripts/test-llm-review-dispatch-or-skip.sh
tests/scripts/test-merge-to-main-resume-inc008.sh
tests/scripts/test-review-coverage-invariant.sh
tests/scripts/test-review-tristate-lib.sh
tests/skills/dso_ci_review/test_runner_schema_exhaustion_stderr.py

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch story/588e-abec-221f-42c0/3ebb-resilience-dd123-7

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gitar-bot · 2026-06-06T03:13:26Z

CI failed: The CI build failed because the PR is targeting the 'main' branch directly, which violates the required 'staged-*' branch promotion policy.

Overview

One failure identified in the check-staged-head job. The build is correctly enforcing a branch naming convention to ensure the two-tier promotion workflow is followed.

Failures

Branch Naming Policy Violation (confidence: high)

Type: configuration
Affected jobs: 79846423777
Related to change: yes
Root cause: The source branch story/588e-abec-221f-42c0/3ebb-resilience-dd123-7 does not match the staged-* pattern required for PRs targeting main.
Suggested fix: Either retarget this PR to a staged-* branch, or merge this PR into a staged-* branch first and then open a follow-up PR into main.

Summary

Change-related failures: 1 (Branch naming policy enforcement).
Infrastructure/flaky failures: 0
Recommended action: Update the base branch of the PR to conform to the required promotion workflow as defined in the repository's CI configuration.

Code Review ✅ Approved

Deploys 3ebb resilience updates, including a fix for the live e1bf regression and improved schema error diagnostics. No issues found.

Tip

Comment Gitar fix CI or enable auto-apply: gitar auto-apply:on

Options

Auto-apply is off → Gitar will not commit updates to this branch.
Display: compact → Showing less information.

Comment with these commands to change:

`Auto-apply`	`Compact`
`gitar auto-apply:on`	`gitar display:verbose`

_{Was this helpful? React with 👍 / 👎 | Gitar}

Test and others added 18 commits June 5, 2026 20:11

chore: bump version to v1.17.141

8ec1f84

DSO-Story: 3ebb-fa93-f980-40de

JoeOakhartNava merged commit 5452cad into main Jun 6, 2026
26 of 27 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

3ebb resilience landing (direct-to-main, manual bypass-merge) — fixes live e1bf + lands DD1-3/9d2c/284b#688

3ebb resilience landing (direct-to-main, manual bypass-merge) — fixes live e1bf + lands DD1-3/9d2c/284b#688
JoeOakhartNava merged 18 commits into
mainfrom
story/588e-abec-221f-42c0/3ebb-resilience-dd123-7

joeoakhart commented Jun 6, 2026

Uh oh!

coderabbitai Bot commented Jun 6, 2026

Review limit reached

Uh oh!

gitar-bot Bot commented Jun 6, 2026 •

edited

Loading

Overview

Failures

Branch Naming Policy Violation (confidence: high)

Summary

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

joeoakhart commented Jun 6, 2026

3ebb resilience landing — direct PR to main for manual bypass-merge

Uh oh!

coderabbitai Bot commented Jun 6, 2026

Review limit reached

Uh oh!

gitar-bot Bot commented Jun 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview

Failures

Branch Naming Policy Violation (confidence: high)

Summary

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

gitar-bot Bot commented Jun 6, 2026 •

edited

Loading