
v1.100 PR-22B — lifecycle truth repair (dry-run honesty + history gating + authority alignment + panel consent)#482

Merged
itcmsgr merged 3 commits into main from fix/v1.100-pr-22b-lifecycle-truth-repair
Apr 19, 2026


@itcmsgr
Owner

@itcmsgr itcmsgr commented Apr 19, 2026

Why this PR exists

A second independent audit of the install/update lifecycle found systemic issues beyond what PR-22A fixed for uninstall. The install path silently ignored --dry-run. The update path wrote three files under /var/lib/nftban/ during every "dry-run" and recorded successful previews as install_fail. The authority classifier had no Ambiguous state on the install side and implicitly auto-approved takeover on panel-managed hosts. Two different definitions of "nftban authoritative" coexisted. CI whitelisted writes under /var/lib/nftban/state/ and never snapshotted the history file.

This PR is boundary-repair ONLY. It does not advance uninstall mutation, does not add install preview capability, does not begin PR-23 work.

PR-22B does not add install preview capability; it removes false dry-run semantics by refusing unsupported install dry-run invocations.

Depends on: #481 (PR-22A uninstall boundary repair).
Scope lock: 12 items from the repair contract seed — nothing else.


What changed (12-item mapping)

1. Dry-run honesty

| Surface | Before | After |
| --- | --- | --- |
| `--mode=install --dry-run` | Silently ignored; all 5 mutating phases executed | Refused with explicit error + non-zero exit |
| `update_dryrun.go`: `os.WriteFile(update_plan.json)` | Written every run | Removed; stdout only |
| `phaseDetect`: `sf.Transition(StateDetectComplete)` during update dry-run | Wrote install_state | In-memory only; `StateFile.DryRun=true` suppresses `WriteAtomic` |
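
The dry-run guard in the last row can be sketched as follows. This is a minimal illustration, not the installer's actual code: the `StateFile` fields, the string value of `StateDetectComplete`, and the `writeAtomic` helper are assumptions made so the example compiles on its own.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// State is a hypothetical stand-in for the installer's state type.
type State string

// The constant name comes from the PR text; the string value is illustrative.
const StateDetectComplete State = "DETECT_COMPLETE"

// StateFile holds the current state. When DryRun is true, Transition
// updates the in-memory field only and never touches the filesystem.
type StateFile struct {
	Path   string
	State  State
	DryRun bool
}

// Transition records the new state and persists it unless DryRun is set.
func (sf *StateFile) Transition(next State) error {
	sf.State = next
	if sf.DryRun {
		return nil // observational run: no write
	}
	return writeAtomic(sf.Path, []byte(next))
}

// writeAtomic (hypothetical helper) writes to a temp file in the target
// directory and renames it into place.
func writeAtomic(path string, data []byte) error {
	tmp, err := os.CreateTemp(filepath.Dir(path), ".state-*")
	if err != nil {
		return err
	}
	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		os.Remove(tmp.Name())
		return err
	}
	if err := tmp.Close(); err != nil {
		os.Remove(tmp.Name())
		return err
	}
	return os.Rename(tmp.Name(), path)
}

func main() {
	dir, err := os.MkdirTemp("", "nftban-sketch")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	sf := &StateFile{Path: filepath.Join(dir, "install_state"), DryRun: true}
	if err := sf.Transition(StateDetectComplete); err != nil {
		panic(err)
	}
	_, statErr := os.Stat(sf.Path)
	fmt.Println("state file exists after dry-run:", statErr == nil)
}
```

Putting the guard inside `Transition` rather than at each call site means shared phase functions inherit the behaviour without needing to know which mode they run in.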

2. History gating

  • New state.IsApplyTerminal(s) helper — explicit allowlist of states representing a completed apply attempt (Committed / Degraded / FailedSSH / FailedAbort / FailedRender / FailedRebuild / FailedNoFirewall / FailedTakeover).
  • main.go gate: if !cfg.dryRun && state.IsApplyTerminal(sf.State) { writeHistory(...) }. Structural, not mode-based. Every dry-run + every non-terminal state is excluded by construction.
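
As a rough sketch of the gate described above: the state names follow the allowlist in this section, while the `InstallState` string values and the `shouldWriteHistory` wrapper are hypothetical and exist only to make the example self-contained.

```go
package main

import "fmt"

// InstallState is a hypothetical stand-in for the installer's state type.
type InstallState string

// Names follow the PR text; the string values are illustrative only.
const (
	StateDetectComplete    InstallState = "DETECT_COMPLETE"
	StateUninstallPlanning InstallState = "UNINSTALL_PLANNING"
	StateCommitted         InstallState = "COMMITTED"
	StateDegraded          InstallState = "DEGRADED"
	StateFailedSSH         InstallState = "FAILED_SSH"
	StateFailedAbort       InstallState = "FAILED_ABORT"
	StateFailedRender      InstallState = "FAILED_RENDER"
	StateFailedRebuild     InstallState = "FAILED_REBUILD"
	StateFailedNoFirewall  InstallState = "FAILED_NO_FIREWALL"
	StateFailedTakeover    InstallState = "FAILED_TAKEOVER"
)

// IsApplyTerminal is an explicit allowlist: only states that represent a
// completed apply attempt return true. Anything new defaults to false.
func IsApplyTerminal(s InstallState) bool {
	switch s {
	case StateCommitted, StateDegraded,
		StateFailedSSH, StateFailedAbort, StateFailedRender,
		StateFailedRebuild, StateFailedNoFirewall, StateFailedTakeover:
		return true
	default:
		return false
	}
}

// shouldWriteHistory mirrors the main.go gate: structural, not mode-based.
func shouldWriteHistory(dryRun bool, s InstallState) bool {
	return !dryRun && IsApplyTerminal(s)
}

func main() {
	fmt.Println(shouldWriteHistory(false, StateCommitted))      // true
	fmt.Println(shouldWriteHistory(true, StateCommitted))       // false
	fmt.Println(shouldWriteHistory(false, StateDetectComplete)) // false
}
```

Because the helper is an allowlist, a state added later is excluded from history until someone deliberately lists it, which is the safer failure direction for this bug class.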

3. Authority predicate + Ambiguous + panel consent

  • authority.IsNftbanAuthoritative(exec) — canonical shared predicate requiring table + chain + active daemon. update.Preflight P-1 now uses this single source of truth. Previously two weaker definitions existed.
  • authority.Decision gains Ambiguous. Orphan table or daemon-without-table routes here. phaseSwitch treats Ambiguous with the same emergency-SSH injection as Takeover/Fresh — never silent-continue.
  • authority.Classify now takes a panelAutoApprove bool parameter (wired from cfg.panelAutoTakeover). Panel detection alone no longer auto-approves takeover. Operators must pass --panel-auto-takeover or NFTBAN_PANEL_AUTO_TAKEOVER=1.
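
The three bullets above can be illustrated with a small sketch. It is written under loud assumptions: the real `Classify` and `IsNftbanAuthoritative` take an executor and probe the host, whereas this version takes a hypothetical `Probe` struct of pre-collected facts, and both the decision string values and the ordering of the cases are guesses made for illustration.

```go
package main

import "fmt"

// Decision mirrors the classifier outcomes named in this PR.
// The string values are illustrative.
type Decision string

const (
	Fresh     Decision = "FRESH"
	Update    Decision = "UPDATE"
	Takeover  Decision = "TAKEOVER"
	Ambiguous Decision = "AMBIGUOUS"
	Abort     Decision = "ABORT"
)

// Probe is a hypothetical bundle of host facts, used here in place of
// the executor-based probing the real code performs.
type Probe struct {
	TablePresent  bool
	ChainPresent  bool
	DaemonActive  bool
	PanelDetected bool
}

// IsNftbanAuthoritative requires all three signals: table, chain, daemon.
func IsNftbanAuthoritative(p Probe) bool {
	return p.TablePresent && p.ChainPresent && p.DaemonActive
}

// Classify is a sketch of the decision order (an assumption):
// full authority, then partial evidence, then panel consent, then fresh.
func Classify(p Probe, panelAutoApprove bool) Decision {
	switch {
	case IsNftbanAuthoritative(p):
		return Update
	case p.TablePresent || p.ChainPresent || p.DaemonActive:
		// Orphan table, daemon-without-table, table-without-chain.
		return Ambiguous
	case p.PanelDetected && panelAutoApprove:
		return Takeover
	case p.PanelDetected:
		// Panel detection alone is no longer consent.
		return Abort
	default:
		return Fresh
	}
}

func main() {
	fmt.Println(Classify(Probe{TablePresent: true}, false))  // AMBIGUOUS
	fmt.Println(Classify(Probe{PanelDetected: true}, false)) // ABORT
	fmt.Println(Classify(Probe{PanelDetected: true}, true))  // TAKEOVER
}
```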

4. Lifecycle bridge truthfulness

  • observeResult.DryRun now wired from cfg.dryRun (was hard-coded false).
  • observePlan / mapAuthority switches: previously compared authority.Decision values (UPPERCASE) against lowercase string literals, so every switch silently hit its default arm and every consumer saw PreserveAuthority regardless of the real decision. The cases are now pinned to the authority package constants. Ambiguous routes through ActionTakeAuthority + new AuthorityUnknown owner.
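
The case-mismatch bug is easy to reproduce in isolation. The sketch below is illustrative only: the `Decision` values, the `Action` type, and `ActionPreserveAuthority` are assumptions (only `ActionTakeAuthority` and the name PreserveAuthority appear in the PR text), and the mapping of decisions other than Takeover and Ambiguous is not taken from the source.

```go
package main

import "fmt"

// Decision values are uppercase, as described in the PR text.
type Decision string

const (
	Update    Decision = "UPDATE"
	Takeover  Decision = "TAKEOVER"
	Ambiguous Decision = "AMBIGUOUS"
)

// Action is a hypothetical lifecycle action type.
type Action string

const (
	ActionPreserveAuthority Action = "preserve_authority"
	ActionTakeAuthority     Action = "take_authority"
)

// mapAuthorityBuggy compares against a lowercase literal. It compiles,
// but no uppercase Decision ever matches, so every input hits default.
func mapAuthorityBuggy(d Decision) Action {
	switch d {
	case "takeover":
		return ActionTakeAuthority
	default:
		return ActionPreserveAuthority
	}
}

// mapAuthorityFixed pins the cases to the package constants, so a change
// to the constant values cannot silently detach the switch again.
func mapAuthorityFixed(d Decision) Action {
	switch d {
	case Takeover, Ambiguous:
		return ActionTakeAuthority
	default:
		return ActionPreserveAuthority
	}
}

func main() {
	fmt.Println(mapAuthorityBuggy(Takeover)) // preserve_authority
	fmt.Println(mapAuthorityFixed(Takeover)) // take_authority
}
```

The compiler accepts the buggy version because an untyped string literal converts implicitly to `Decision`, which is why this class of bug needs a test rather than a type check to catch it.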

5. Flag validation

| Combination | Before | After |
| --- | --- | --- |
| `--mode=install --dry-run` | Silently proceeds as real install | Refused |
| `--takeover --dry-run` | Both applied (meaningless) | Refused |
| `--rpm --deb` | Silently picks one | Refused |
| `--force-delete-operator-config` without `--purge` | Silently dropped | Refused |
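
A validation pass covering the four combinations in the table might look like the sketch below. The `config` struct, its field names, and the error wording are hypothetical; only the flag names and the refuse-on-contradiction behaviour come from the PR text.

```go
package main

import (
	"errors"
	"fmt"
)

// config is a hypothetical parsed-flags struct.
type config struct {
	mode                      string
	dryRun                    bool
	takeover                  bool
	rpm                       bool
	deb                       bool
	purge                     bool
	forceDeleteOperatorConfig bool
}

// validateFlags rejects contradictory combinations before any phase runs.
// The caller is expected to print the error and exit non-zero.
func validateFlags(c config) error {
	switch {
	case c.mode == "install" && c.dryRun:
		return errors.New("--mode=install does not support --dry-run")
	case c.takeover && c.dryRun:
		return errors.New("--takeover cannot be combined with --dry-run")
	case c.rpm && c.deb:
		return errors.New("--rpm and --deb are mutually exclusive")
	case c.forceDeleteOperatorConfig && !c.purge:
		return errors.New("--force-delete-operator-config requires --purge")
	}
	return nil
}

func main() {
	err := validateFlags(config{mode: "install", dryRun: true})
	fmt.Println("refused:", err != nil)
}
```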

6. CI truth surfaces

  • ci-update-canonization.yml G3-U3: seeds update-history.json, snapshots install_state, hard-asserts byte-identical after dry-run. Removed || true soft diff handling. Widened G3-U5..U10 structural grep scope to include update_dryrun.go; extended pattern list (os.WriteFile, os.Create, os.MkdirAll, os.Rename, nft create, apt-get purge, dnf erase).
  • ci-install-canonization.yml NEW: G3-IN-REFUSE-DRY-RUN (install dry-run exits non-zero with refusal message; history untouched). G3-IN-FLAG-COMBOS (invalid combos rejected).

7. Purity tests + reusable audit harness

  • internal/installer/audit/harness.go NEW: reusable PurityHarness with AssertNoExecutorWrites / AssertNoDirectoryCreations / AssertNoMutationCommands / AssertNoStateDirEntries + AssertAllPurity convenience method. Self-tests prove it catches each class.
  • state.IsApplyTerminal allowlist tests + DryRun Transition regression test.
  • authority.Classify ambiguous-on-orphan-table + symmetric daemon-without-table + table-without-chain + predicate-requires-all-three tests.
  • uninstall.Plan render ↔ JSON equivalence grid (audit item 11).
  • history_test.go write-gate predicate unit tests.

Falsifiability proof

Could the audit's install/update findings pass this CI? No.

| Previous bug | Gate that would now catch it |
| --- | --- |
| `--mode=install --dry-run` silently proceeds | G3-IN-REFUSE-DRY-RUN |
| `update_dryrun.go` writes update_plan.json | G3-U5..U10 extended `os.WriteFile(` grep |
| `phaseDetect` Transition writes install_state during update dry-run | G3-U3 hard assertion on install_state sha256 |
| `writeHistory` records update dry-run as install_fail | G3-U3 hard assertion on update-history.json sha256 |
| Orphan nftban table classifies as Update | TestClassify_Ambiguous_OrphanTable |
| Panel silently auto-approves takeover | TestClassify_PanelDetected_NoFlag_Aborts |
| Two "nftban authoritative" predicates drift | update.Preflight now imports authority.IsNftbanAuthoritative (single source) |

Explicit non-goals

This PR does not add uninstall mutation, does not add install preview capability, does not change purge/remove semantics, does not redesign installer history generally (only the write-gate predicate), does not tighten PriorRecordUsable semantics (deferred), does not add an exec-trace CI gate (deferred), and does not begin PR-23 work.


Acceptance criteria (repair contract §6 extended)

  • All dry-run paths are observational or explicitly refused
  • No dry-run writes history or install_state by default
  • Authority ambiguity is surfaced conservatively (Ambiguous + emergency-SSH in phaseSwitch)
  • Silent panel takeover removed; explicit opt-in required
  • Canonical authority predicate; no duplicate definitions
  • Reusable purity harness exists + proves itself in self-tests
  • CI would fail on the exact bugs the audit found

Test plan

  • ci-install-canonization matrix green (ubuntu-24.04 + almalinux-9)
  • ci-update-canonization matrix green (including new hard-asserts)
  • ci-uninstall-canonization matrix green (regression-safe)
  • Runtime Truth matrix green
  • Build & Test green (all unit tests including new harness self-tests)
  • Classifier tests pass with new 5-state enum
  • Flag-combo CI refusals assert correctly

🤖 Generated with Claude Code

…y gating, authority alignment, panel consent

Extended audit found systemic issues beyond uninstall that would invite
the exact class of false confidence PR-22A closed, but for install and
update paths. This PR is boundary-repair ONLY — does not add uninstall
mutation, does not add install preview capability, does not begin
PR-23 work.

PR-22B does not add install preview capability; it removes false
dry-run semantics by refusing unsupported install dry-run invocations.

Scope (12 items from the repair contract):

## 1. Dry-run honesty

- flags.go: refuse `--mode=install --dry-run` with an explicit error.
  An honest install dry-run orchestrator is deferred — until it lands,
  the flag combination has no truthful meaning.
- state.StateFile.DryRun field + guard in Transition(): when DryRun is
  true, in-memory fields are updated but no file write occurs. main.go
  wires sf.DryRun = cfg.dryRun at construction. Any shared phase
  function (phaseDetect in particular) that was persisting install_state
  during update dry-run is now in-memory-only.
- update_dryrun.go: removed os.WriteFile(update_plan.json). Plan renders
  to stdout only; no default filesystem persistence.

## 2. History gating (terminal-state allowlist)

- state.IsApplyTerminal(s) helper — explicit allowlist of states
  representing a completed apply attempt. Catch-all default mapped
  StateDetectComplete / StateUninstallPlanning / intermediate states to
  install_fail; now they are excluded from history writes entirely.
- main.go: writeHistory gated on `!cfg.dryRun && IsApplyTerminal(sf.State)`.
  Removes the PR-22A mode-name heuristic (never mode-based, always
  structural).

## 3. Authority predicate + Ambiguous + panel consent

- authority.IsNftbanAuthoritative(exec) — canonical shared predicate
  requiring table + chain + active daemon. update.Preflight P-1 now
  uses this single source of truth; no two different definitions of
  "nftban authoritative."
- authority.Decision gains `Ambiguous` — orphan table or
  daemon-without-table routes here instead of silently classifying as
  Update or Fresh.
- authority.Classify signature adds panelAutoApprove bool. Panel
  detection alone no longer auto-approves takeover — operators who
  want the old behaviour must pass --panel-auto-takeover explicitly
  (flag + NFTBAN_PANEL_AUTO_TAKEOVER=1 env mirror).
- phases.go phaseSwitch treats Ambiguous with the same pre-switch
  emergency-SSH injection as Takeover/Fresh. Never silent-continue,
  never skip safety paths.

## 4. Lifecycle bridge truthfulness

- lifecycle_bridge observeResult: DryRun is wired from cfg.dryRun
  (was hard-coded false).
- observePlan / mapAuthority: switches now compare against
  authority.Decision constants. Previous version compared UPPERCASE
  enum values against lowercase literals — every switch silently hit
  default, so every lifecycle consumer saw PreserveAuthority regardless
  of the real decision.
- internal/lifecycle: new AuthorityUnknown owner for the Ambiguous case.

## 5. Flag validation (reject contradictory combinations)

- refuse --mode=install --dry-run
- refuse --takeover with --dry-run
- reject --rpm with --deb
- reject --force-delete-operator-config without --purge (both for
  install paths and the uninstall early-return block)

## 6. CI truth surfaces

- ci-update-canonization.yml G3-U3: snapshot + hard-assert
  /var/lib/nftban/update-history.json and /var/lib/nftban/state/
  install_state byte-identical after dry-run. Removed || true soft
  diff handling. Widened G3-U5..U10 structural grep to include
  update_dryrun.go and extended the pattern list (os.WriteFile,
  os.Create, os.MkdirAll, os.Rename, nft create, apt-get purge, dnf
  erase).
- ci-install-canonization.yml NEW: G3-IN-REFUSE-DRY-RUN gate asserts
  `--mode=install --dry-run` exits non-zero with an explicit refusal
  message; G3-IN-FLAG-COMBOS asserts invalid combos are rejected; no
  history pollution from refused runs.

## 7. Forbidden-side-effect tests + reusable audit harness

- internal/installer/audit/harness.go: reusable PurityHarness with
  AssertNoExecutorWrites / AssertNoDirectoryCreations /
  AssertNoMutationCommands / AssertNoStateDirEntries. Self-tests
  in harness_test.go prove it catches each class.
- state machine_test.go: IsApplyTerminal allowlist + DryRun-Transition
  regression tests.
- authority classify_test.go: Ambiguous on orphan-table, symmetric on
  daemon-without-table, table-without-chain routed to Ambiguous,
  predicate requires all three (table+chain+daemon).
- uninstall_test.go: render ↔ JSON equivalence grid (audit item 11).
- history_test.go: write-gate predicate unit tests (dry-run skips,
  non-terminal skips, apply-terminal writes).

Explicit non-goals:
- no uninstall mutation added
- no install preview capability added (PR-22B only removes false
  dry-run semantics by refusing)
- no purge/remove semantics change
- no validator contract change
- no generic history redesign beyond the terminal-state gate
- no PR-23 work begun

Depends on: #481 (PR-22A uninstall boundary repair)
Refs: V1100_LIFECYCLE_COMPLETION_CONTRACT.md §13 (frozen 2026-04-19)
Refs: internal/installer/uninstall/contract.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Contributor

github-actions Bot commented Apr 19, 2026

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

OpenSSF Scorecard

| Package | Version | Score |
| --- | --- | --- |
| actions/actions/checkout | 34e114876b0b11c390a56381ad16ebd13914f8d5 | 🟢 5.7 |
| actions/actions/setup-go | d35c59abb061a4a6fb18e82ac0862c26744d6ab5 | 🟢 5.7 |

Details: actions/actions/checkout

| Check | Score | Reason |
| --- | --- | --- |
| Maintained | ⚠️ 0 | 0 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 0 |
| Dangerous-Workflow | 🟢 10 | no dangerous workflow patterns detected |
| Binary-Artifacts | 🟢 10 | no binaries found in the repo |
| Code-Review | 🟢 10 | all changesets reviewed |
| Token-Permissions | ⚠️ 0 | detected GitHub workflow tokens with excessive permissions |
| CII-Best-Practices | ⚠️ 0 | no effort to earn an OpenSSF best practices badge detected |
| Fuzzing | ⚠️ 0 | project is not fuzzed |
| Packaging | ⚠️ -1 | packaging workflow not detected |
| License | 🟢 10 | license file detected |
| Signed-Releases | ⚠️ -1 | no releases found |
| Pinned-Dependencies | 🟢 3 | dependency not pinned by hash detected -- score normalized to 3 |
| Security-Policy | 🟢 9 | security policy file detected |
| Branch-Protection | 🟢 5 | branch protection is not maximal on development and all release branches |
| SAST | 🟢 8 | SAST tool detected but not run on all commits |

Details: actions/actions/setup-go

| Check | Score | Reason |
| --- | --- | --- |
| Maintained | 🟢 6 | 7 commit(s) and 1 issue activity found in the last 90 days -- score normalized to 6 |
| Code-Review | 🟢 10 | all changesets reviewed |
| Binary-Artifacts | 🟢 10 | no binaries found in the repo |
| Packaging | ⚠️ -1 | packaging workflow not detected |
| Dangerous-Workflow | 🟢 10 | no dangerous workflow patterns detected |
| CII-Best-Practices | ⚠️ 0 | no effort to earn an OpenSSF best practices badge detected |
| Token-Permissions | ⚠️ 0 | detected GitHub workflow tokens with excessive permissions |
| Pinned-Dependencies | ⚠️ 0 | dependency not pinned by hash detected -- score normalized to 0 |
| Fuzzing | ⚠️ 0 | project is not fuzzed |
| License | 🟢 10 | license file detected |
| Signed-Releases | ⚠️ -1 | no releases found |
| Security-Policy | 🟢 9 | security policy file detected |
| Branch-Protection | ⚠️ 0 | branch protection not enabled on development/release branches |
| SAST | 🟢 10 | SAST tool is run on all commits |

Scanned Files

  • .github/workflows/ci-install-canonization.yml

itcmsgr and others added 2 commits April 20, 2026 00:05
Two CI failures from the first PR-22B push:

1. go vet failure in internal/installer/audit/harness_test.go — my
   recordingT mock could not satisfy testing.TB (unexported method).
   Switch the harness method signatures from *testing.T to testing.TB
   so any TB implementation works; rewrite self-tests to use t.Run's
   return value (expectInnerFail helper) to observe subtest failures
   without implementing TB externally.

2. G3-U3 FAIL because the history-file seed happened AFTER the before
   snapshot, so the seeded file appeared as "new" in after. Move the
   seed BEFORE snapshot so both before and after include it; use
   size-only snapshot (drop mtime/%T@) so stat-touch no-ops don't
   trigger the hard diff assertion.
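
The ordering fix in item 2 can be shown with a small Go sketch. Note the swap: the actual gate lives in a CI workflow and is presumably shell-based, so this is the same idea expressed in Go, with hypothetical `snapshot` and `diff` helpers and a temp directory standing in for the state directory.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"sort"
)

// snapshot records relative path -> size for every regular entry under
// root. Timestamps are deliberately ignored so touch-only no-ops do not
// register as changes.
func snapshot(root string) (map[string]int64, error) {
	sizes := make(map[string]int64)
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() {
			return nil
		}
		info, err := d.Info()
		if err != nil {
			return err
		}
		rel, err := filepath.Rel(root, path)
		if err != nil {
			return err
		}
		sizes[rel] = info.Size()
		return nil
	})
	return sizes, err
}

// diff lists every path that was added, removed, or changed size.
func diff(before, after map[string]int64) []string {
	var out []string
	for p, size := range after {
		old, ok := before[p]
		switch {
		case !ok:
			out = append(out, "added: "+p)
		case old != size:
			out = append(out, "resized: "+p)
		}
	}
	for p := range before {
		if _, ok := after[p]; !ok {
			out = append(out, "removed: "+p)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	dir, err := os.MkdirTemp("", "g3u3-sketch")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	// Seed FIRST, then snapshot, so the seed is part of the baseline.
	seed := filepath.Join(dir, "update-history.json")
	if err := os.WriteFile(seed, []byte("[]"), 0o600); err != nil {
		panic(err)
	}
	before, err := snapshot(dir)
	if err != nil {
		panic(err)
	}
	// The dry-run under test would execute here.
	after, err := snapshot(dir)
	if err != nil {
		panic(err)
	}
	fmt.Println("changes:", diff(before, after))
}
```

Seeding after the baseline snapshot is what produced the original false failure: the seed itself showed up as an added file.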

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ist gap

Re-audit of PR-22A+PR-22B stack returned GO WITH CONDITIONS with three
must-fix blockers before merge. All three are closed here, plus the
ApplyWhitelist entry needed for the new IsNftbanAuthoritative probe.

N-1 — refuse --repair --dry-run
--------------------------------
Previously the validation block was skipped entirely for repair mode,
so --dry-run was silently accepted while phaseSwitch continued to
mutate kernel + services. sf.DryRun=true suppresses state-file writes
but not kernel mutation. flags.go now refuses the combination with an
explicit error before the validation block.

N-2 — adopt audit.PurityHarness + add update dry-run purity test
----------------------------------------------------------------
uninstall_dryrun_test.go had its own inline forbidden-command list and
inline state-dir check — parallel code to audit.PurityHarness. Replace
with a single AssertAllPurity call. No more list-drift risk.

cmd/nftban-installer/update_dryrun_test.go NEW — invokes runUpdateDryRun
under MockExecutor + temp stateDir and asserts AssertAllPurity. Covers
both preflight-pass (happy path) and preflight-fail branches. Closes
the audit gap where update dry-run was defended only at CI
filesystem-snapshot granularity.

Harness redesign for testability: Check* methods return []string of
violation messages; Assert* methods call them and report via t.Errorf.
Self-tests now exercise Check* directly (no testing.TB mock needed).
Previous attempt used a recordingT mock that could not satisfy
testing.TB due to its unexported private() method.
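
The Check*/Assert* split can be sketched as below. This is an assumption-heavy illustration: the real harness takes `testing.TB`, whereas the sketch declares a narrow hypothetical `reporter` interface (which `*testing.T` would satisfy) so it runs as a plain program, and the forbidden-prefix list is borrowed from the CI grep patterns rather than from the harness itself.

```go
package main

import (
	"fmt"
	"strings"
)

// reporter is the small slice of testing.TB this sketch needs.
// *testing.T and *testing.B both satisfy it.
type reporter interface {
	Helper()
	Errorf(format string, args ...any)
}

// PurityHarness holds what a mock executor recorded during a run.
type PurityHarness struct {
	Executed []string
}

// forbiddenPrefixes is illustrative; the real list may differ.
var forbiddenPrefixes = []string{"nft create", "apt-get purge", "dnf erase"}

// CheckNoMutationCommands returns one message per violation and has no
// dependency on the testing package, so it can be unit-tested directly.
func (h *PurityHarness) CheckNoMutationCommands() []string {
	var violations []string
	for _, cmd := range h.Executed {
		for _, p := range forbiddenPrefixes {
			if strings.HasPrefix(cmd, p) {
				violations = append(violations,
					fmt.Sprintf("forbidden command executed: %q", cmd))
			}
		}
	}
	return violations
}

// AssertNoMutationCommands is a thin wrapper that reports each violation.
func (h *PurityHarness) AssertNoMutationCommands(t reporter) {
	t.Helper()
	for _, v := range h.CheckNoMutationCommands() {
		t.Errorf("%s", v)
	}
}

func main() {
	clean := &PurityHarness{Executed: []string{"nft list ruleset"}}
	dirty := &PurityHarness{Executed: []string{"nft list ruleset", "dnf erase nftban"}}
	fmt.Println(len(clean.CheckNoMutationCommands()), len(dirty.CheckNoMutationCommands())) // 0 1
}
```

Keeping the logic in a pure function that returns data is what removes the need for a `testing.TB` mock: the self-tests inspect the returned slice instead of intercepting failure reports.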

N-3 — doc comment typo in state/machine.go
------------------------------------------
The package-level IsApplyTerminal func doc said "alias for
IsApplyTerminal" (self-referential). Reworded to "alias for the
(InstallState) IsApplyTerminal method."

Bonus — ApplyWhitelist gap uncovered by re-audit CI
---------------------------------------------------
update.IsNftbanAuthoritative added a "nft list chain ip nftban input"
probe (required daemon+chain+table predicate). Preflight runs this on
every apply entry. The contract-audit harness in update_apply_test.go
rejected the command as non-whitelisted. Add "nft list chain ip nftban
input" to ApplyWhitelist — read-only, part of preflight P-1.
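
A minimal sketch of the kind of whitelist audit described here, assuming the whitelist is an exact-match set of command strings. The real `ApplyWhitelist` type and matching rules are not shown in this PR, and `auditCommands` is a hypothetical helper.

```go
package main

import "fmt"

// ApplyWhitelist is modelled here as an exact-match set (an assumption).
var ApplyWhitelist = map[string]bool{
	// Read-only probe used by preflight P-1.
	"nft list chain ip nftban input": true,
}

// auditCommands returns every executed command that is not whitelisted.
func auditCommands(executed []string) []string {
	var rejected []string
	for _, cmd := range executed {
		if !ApplyWhitelist[cmd] {
			rejected = append(rejected, cmd)
		}
	}
	return rejected
}

func main() {
	rejected := auditCommands([]string{
		"nft list chain ip nftban input",
		"nft flush ruleset",
	})
	fmt.Println(rejected) // [nft flush ruleset]
}
```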

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@itcmsgr itcmsgr merged commit bd7ea75 into main Apr 19, 2026
60 checks passed
@itcmsgr itcmsgr deleted the fix/v1.100-pr-22b-lifecycle-truth-repair branch April 19, 2026 21:22
itcmsgr added a commit that referenced this pull request Apr 19, 2026
…3 blockers + standing lifecycle-truth rule

Post-PR-22B hygiene per approved plan. One tight commit, no code changes.

CHANGELOG.md — new [Unreleased] section:
- summary of PR-22A + PR-22B structural repair
- data-integrity note on the lifecycle-bridge authority-mapping bug:
  pre-PR-22B `observePlan`/`mapAuthority` switches silently hit default
  arms because of UPPERCASE-vs-lowercase comparison. Between v1.98 and
  the merge of PR-22B (#482), any lifecycle-telemetry consumer saw
  `PreserveAuthority`/`AuthorityNone` regardless of real decision.
  Kernel behavior + install_state + update-history unaffected — only
  the lifecycle bridge's external reporting surface. Forensic
  interpretation of pre-PR-22B lifecycle output should treat the
  authority decision as "unknown," not "preserve."

internal/installer/uninstall/contract.md — two new sections:

1. Standing lifecycle-truth rule: codifies the merge-discipline
   constraint — no new lifecycle code may bypass the shared authority
   predicate, history gate, or dry-run contract. Enumerates the five
   concrete requirements that every new lifecycle PR must respect, and
   points at the CI gates that should catch bypass attempts.

2. Pre-PR-23 blockers: explicit table of the six follow-up PRs that
   must land before PR-23 (uninstall mutation) can start:
     (1) prior-authority record hardening
     (2) external-firewall detection unification
     (3) kernel/service snapshot CI gate
     (4) exec-trace CI gate
     (5) auto-elevate shim removal gate
     (6) payload integrity minimum checks
   Plus the Phase 3 gating rule: verification audit after items 1-6
   land, with three focused questions, no exploratory scope.

No code changes. No behavior changes. Institutional-memory commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
itcmsgr added a commit that referenced this pull request Apr 19, 2026
…23 blockers + standing lifecycle-truth rule (#483)
