Skip to content

v1.100 PR-22A — uninstall dry-run boundary repair (no-mutation + history + classifier fix)#481

Merged
itcmsgr merged 2 commits intomainfrom
fix/v1.100-pr-22a-uninstall-boundary-repair
Apr 19, 2026
Merged

v1.100 PR-22A — uninstall dry-run boundary repair (no-mutation + history + classifier fix)#481
itcmsgr merged 2 commits intomainfrom
fix/v1.100-pr-22a-uninstall-boundary-repair

Conversation

@itcmsgr
Copy link
Copy Markdown
Owner

@itcmsgr itcmsgr commented Apr 19, 2026

Why this PR exists

Independent audit of merged PR #480 (commit 547aa089) found that PR-22's observational scaffold shipped with three concrete contract violations plus a classifier false-negative plus a CI blind spot. This PR is a boundary-repair PR, not a feature PR. It does not advance uninstall mutation.

Scope: R1–R5 from the repair contract seed.
Depends on: 547aa089 (PR #480).


A. Before/after behavior — filesystem writes during dry-run

Surface Before (PR #480) After (this PR)
`/uninstall_plan.json` Written every dry-run (`os.WriteFile`) Not written — plan renders to stdout only
`/var/lib/nftban/state/install_state` Written every dry-run (`sf.Transition(StateUninstallPlanning, …)`) Not written — no state transition during observational run
`/var/lib/nftban/update-history.json` Appended on every dry-run as status `install_fail` (silent audit-trail pollution) Unchanged — `writeHistory()` guarded by `cfg.mode != "uninstall"`

Option B chosen from the repair contract: installer-state persistence during uninstall dry-run is removed entirely. A read-only planning run has no reason to mutate install_state.


B. Classifier bug proof — partial-nftban false-negative

Input Before After
`ip nftban` table present, `nftband.service` down, no external firewall `AuthorityNone` + note "host is unprotected" `AuthorityAmbiguous` + note "kernel still holds nftban artifacts — host is in an indeterminate transition"
`nftband.service` up, no `ip nftban` table, no external firewall `AuthorityNone` `AuthorityAmbiguous`

Why it matters: PR-23's release logic will read `CurrentAuthority`. If the classifier says `AuthorityNone` while the kernel still holds an orphan `ip nftban` table, release skips kernel cleanup and the table persists. This is exactly the silent-takeover-setup risk the PR-22 contract promised to prevent.

Regression tests added:

  • `TestClassify_PartialNFTBan_NoExternal_IsAmbiguous` — table present, daemon down, no external → `Ambiguous`
  • `TestClassify_DaemonUpNoTable_NoExternal_IsAmbiguous` — symmetric case

Existing `TestClassify_PartialNFTBan_WithExternal_IsAmbiguous` already covered the with-external case.

Code change: `authority.go` now has an explicit `case nftbanPartial && !extPresent: AuthorityAmbiguous`. The none-case is tightened to `!nftbanPresent && !nftbanPartial && !extPresent` so the pattern is exhaustive by construction.


C. Purity proof — falsifiable tests that catch the previous failure class

New unit test: `cmd/nftban-installer/uninstall_dryrun_test.go`

Test What it asserts
`TestRunUninstallDryRun_Purity_NoMutation` Calls `runUninstallDryRun` with MockExecutor + real `t.TempDir()` as stateDir. Asserts (1) zero `WriteFileAtomic` calls, (2) zero `MkdirAll` calls, (3) zero commands matching mutation patterns (nft add/delete/flush, systemctl lifecycle verbs, ufw/firewall-cmd/iptables-restore, package-manager removal, userdel/groupdel), (4) every command matches a read-only whitelist, (5) temp state dir is empty — catches any direct `os.WriteFile` that bypasses the executor
`TestRunUninstallDryRun_Purity_WithAmbiguousHost` Same purity guarantees on the repaired classifier path

New CI jobs (on top of existing `G3-UN-NO-MUTATION` + `G3-UN-PLAN-RENDERS`):

Job What it detects that PR #480 missed
`Unit tests — runUninstallDryRun purity (R4)` Orchestrator-level direct writes of any kind
`G3-UN-NO-MUTATION` (extended grep) Added patterns: `os.WriteFile(`, `os.Create(`, `os.MkdirAll(`, `os.Mkdir(`, `os.Rename(`, `sf.Transition(`, `nft create`, `apt-get purge`, `dnf erase`
`G3-UN-PLAN-RENDERS — no-mutation` (widened snapshot) Before: snapshot only `/etc/nftban` + `/usr/lib/nftban` + `/usr/sbin/nftban*`. After: also `/var/lib/nftban/` — the directory that PR #480 polluted
`G3-UN-HISTORY-PURITY` (new gate) Seeds `/var/lib/nftban/update-history.json`, runs dry-run twice (explicit + implicit), asserts byte-identical hash

Falsifiability checkcould PR #480's bug have passed this CI? No.

  • `os.WriteFile(dst, data, 0600)` would trip the extended `G3-UN-NO-MUTATION` grep
  • `sf.Transition(StateUninstallPlanning, …)` would trip the new `sf.Transition` grep pattern
  • The unconditional `writeHistory` → `install_fail` would trip `G3-UN-HISTORY-PURITY`
  • Any of the three would also trip the widened `/var/lib/nftban/` snapshot diff

D. Scope proof — what this PR does NOT do

This PR does not add uninstall mutation, does not flush or remove nftables tables, does not disable services or timers, does not delete files from `/etc/nftban/`, does not restore external firewalls, does not change purge/remove semantics, does not expand prior-authority restore logic, does not redesign installer history generally, does not tighten `PriorRecordUsable` semantics (audit item 9 — deferred to PR-23/24), does not add flag-combination validation (audit item 10 — deferred), does not touch the auto-elevation block in `flags.go` (documented in `contract.md` as PR-23 gate), and does not begin PR-23 work in any form.

If any proposed change is not required to close R1–R5, it is out of scope for this PR.


Acceptance criteria (repair contract §6)

  • uninstall dry-run no longer writes misleading plan/history artifacts under `/var/lib/nftban/`
  • successful uninstall dry-run does not create `install_fail`
  • partial nftban without external firewall returns `AuthorityAmbiguous`
  • a purity-focused test exists and passes
  • CI would fail on the exact write-pattern class that escaped in PR feat(v1.100 PR-22): uninstall lifecycle scaffold — detect + dry-run plan only #480
  • no uninstall mutation capability is added
  • scope remains boundary repair only

Merge standard

Per the repair contract §9: this PR is a repair gate for PR-23. PR-23 must not start until this PR is merged and verified. A green result means only this:

PR-22 is once again a bounded observational scaffold, and its CI/tests can detect the exact contamination class that escaped in PR #480.

It does not authorize mutation work by itself.


Test plan

  • `ci-uninstall-canonization` matrix green on ubuntu-24.04 + almalinux-9
  • `Unit tests — runUninstallDryRun purity (R4)` green in CI
  • `G3-UN-NO-MUTATION` extended grep green
  • `G3-UN-PLAN-RENDERS` with widened snapshot green
  • `G3-UN-HISTORY-PURITY` new gate green
  • all existing `ci-go-tests` still green (no regression elsewhere)

🤖 Generated with Claude Code

Independent audit of merged PR #480 found three concrete violations of
the PR-22 contract Stop Condition ("no writes under /var/lib/nftban/"):

  1. os.WriteFile(<stateDir>/uninstall_plan.json, …) on every dry-run
  2. sf.Transition(StateUninstallPlanning, …) persisted install_state
  3. unconditional writeHistory() recorded a successful plan preview as
     "install_fail" in update-history.json — silently poisoning every
     operator dashboard that alerts on install_fail

Plus one classifier false-negative:

  4. partial nftban (table OR daemon, not both) WITHOUT external
     firewall fell through to AuthorityNone — PR-23 release logic would
     skip kernel cleanup of the orphan table

Plus CI blind spot:

  5. G3-UN-NO-MUTATION grep missed os.WriteFile/Create/MkdirAll/Rename;
     snapshot step did not cover /var/lib/nftban/ at all

This PR is BOUNDARY REPAIR ONLY. It does not add uninstall mutation,
does not change purge/remove semantics, does not expand prior-authority
logic, does not begin PR-23. Scope locked to R1-R5 from the repair
contract seed.

R1 (uninstall_dryrun.go): remove os.WriteFile + sf.Transition. Plan
renders to stdout only. Option B — no installer-state persistence
during dry-run.

R2 (main.go): guard writeHistory with cfg.mode != "uninstall" so a
successful plan preview is not recorded as install_fail.

R3 (authority.go): explicit case nftbanPartial && !extPresent →
AuthorityAmbiguous with diagnostic note; tighten the AuthorityNone
case with !nftbanPartial. Regression test added
(TestClassify_PartialNFTBan_NoExternal_IsAmbiguous) plus symmetric
daemon-up-no-table case.

R4 (uninstall_dryrun_test.go): new falsifiable purity test calls
runUninstallDryRun with MockExecutor + real tempdir and asserts:
  - zero executor writes (WriteFileAtomic)
  - zero directory creates
  - zero mutation commands (nft add/flush/delete, systemctl lifecycle,
    ufw/firewall-cmd, package-manager removal)
  - zero files on disk under temp stateDir
Two tests: no-authority host and ambiguous host.

R5 (ci-uninstall-canonization.yml):
  - extend grep patterns: os.WriteFile/Create/MkdirAll/Mkdir/Rename,
    sf.Transition, nft create, apt-get purge, dnf erase
  - widen E2E snapshot to include /var/lib/nftban/ (was only
    /etc/nftban + /usr/lib/nftban + /usr/sbin/nftban*)
  - new G3-UN-HISTORY-PURITY gate: seeds realistic update-history.json,
    runs dry-run twice (explicit + implicit), asserts byte-identical
    hash — catches the exact class that escaped PR #480
  - run the new orchestrator purity test in CI

Acceptance criteria (all pass by construction):
  - uninstall dry-run no longer writes misleading plan/history
    artifacts under /var/lib/nftban/
  - successful uninstall dry-run does not create install_fail
  - partial nftban without external returns AuthorityAmbiguous
  - purity test exists and would fail on regression
  - CI would fail on the exact write-pattern class that escaped in
    PR #480 (os.WriteFile + history pollution + snapshot blind spot)
  - no uninstall mutation capability added
  - scope remains boundary repair only

Depends on: 547aa08 (PR #480)
Refs: V1100_LIFECYCLE_COMPLETION_CONTRACT.md §13 (frozen 2026-04-19)
Refs: internal/installer/uninstall/contract.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 19, 2026

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

The G3-UN-NO-MUTATION grep extension added in PR-22A matched historical
references in the doc comment that explains what was removed from the
orchestrator. The grep is text-based so it cannot distinguish between
code and comments.

Reword the comment to describe removed behavior in prose rather than
literal Go identifiers. Zero behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@itcmsgr itcmsgr merged commit 6d79246 into main Apr 19, 2026
57 checks passed
@itcmsgr itcmsgr deleted the fix/v1.100-pr-22a-uninstall-boundary-repair branch April 19, 2026 20:45
itcmsgr added a commit that referenced this pull request Apr 19, 2026
…y gating, authority alignment, panel consent

Extended audit found systemic issues beyond uninstall that would invite
the exact class of false confidence PR-22A closed, but for install and
update paths. This PR is boundary-repair ONLY — does not add uninstall
mutation, does not add install preview capability, does not begin
PR-23 work.

PR-22B does not add install preview capability; it removes false
dry-run semantics by refusing unsupported install dry-run invocations.

Scope (12 items from the repair contract):

## 1. Dry-run honesty

- flags.go: refuse `--mode=install --dry-run` with an explicit error.
  An honest install dry-run orchestrator is deferred — until it lands,
  the flag combination has no truthful meaning.
- state.StateFile.DryRun field + guard in Transition(): when DryRun is
  true, in-memory fields are updated but no file write occurs. main.go
  wires sf.DryRun = cfg.dryRun at construction. Any shared phase
  function (phaseDetect in particular) that was persisting install_state
  during update dry-run is now in-memory-only.
- update_dryrun.go: removed os.WriteFile(update_plan.json). Plan renders
  to stdout only; no default filesystem persistence.

## 2. History gating (terminal-state allowlist)

- state.IsApplyTerminal(s) helper — explicit allowlist of states
  representing a completed apply attempt. Catch-all default mapped
  StateDetectComplete / StateUninstallPlanning / intermediate states to
  install_fail; now they are excluded from history writes entirely.
- main.go: writeHistory gated on `!cfg.dryRun && IsApplyTerminal(sf.State)`.
  Removes the PR-22A mode-name heuristic (never mode-based, always
  structural).

## 3. Authority predicate + Ambiguous + panel consent

- authority.IsNftbanAuthoritative(exec) — canonical shared predicate
  requiring table + chain + active daemon. update.Preflight P-1 now
  uses this single source of truth; no two different definitions of
  "nftban authoritative."
- authority.Decision gains `Ambiguous` — orphan table or
  daemon-without-table routes here instead of silently classifying as
  Update or Fresh.
- authority.Classify signature adds panelAutoApprove bool. Panel
  detection alone no longer auto-approves takeover — operators who
  want the old behaviour must pass --panel-auto-takeover explicitly
  (flag + NFTBAN_PANEL_AUTO_TAKEOVER=1 env mirror).
- phases.go phaseSwitch treats Ambiguous with the same pre-switch
  emergency-SSH injection as Takeover/Fresh. Never silent-continue,
  never skip safety paths.

## 4. Lifecycle bridge truthfulness

- lifecycle_bridge observeResult: DryRun is wired from cfg.dryRun
  (was hard-coded false).
- observePlan / mapAuthority: switches now compare against
  authority.Decision constants. Previous version compared UPPERCASE
  enum values against lowercase literals — every switch silently hit
  default, so every lifecycle consumer saw PreserveAuthority regardless
  of the real decision.
- internal/lifecycle: new AuthorityUnknown owner for the Ambiguous case.

## 5. Flag validation (reject contradictory combinations)

- refuse --mode=install --dry-run
- refuse --takeover with --dry-run
- reject --rpm with --deb
- reject --force-delete-operator-config without --purge (both for
  install paths and the uninstall early-return block)

## 6. CI truth surfaces

- ci-update-canonization.yml G3-U3: snapshot + hard-assert
  /var/lib/nftban/update-history.json and /var/lib/nftban/state/
  install_state byte-identical after dry-run. Removed || true soft
  diff handling. Widened G3-U5..U10 structural grep to include
  update_dryrun.go and extended the pattern list (os.WriteFile,
  os.Create, os.MkdirAll, os.Rename, nft create, apt-get purge, dnf
  erase).
- ci-install-canonization.yml NEW: G3-IN-REFUSE-DRY-RUN gate asserts
  `--mode=install --dry-run` exits non-zero with an explicit refusal
  message; G3-IN-FLAG-COMBOS asserts invalid combos are rejected; no
  history pollution from refused runs.

## 7. Forbidden-side-effect tests + reusable audit harness

- internal/installer/audit/harness.go: reusable PurityHarness with
  AssertNoExecutorWrites / AssertNoDirectoryCreations /
  AssertNoMutationCommands / AssertNoStateDirEntries. Self-tests
  in harness_test.go prove it catches each class.
- state machine_test.go: IsApplyTerminal allowlist + DryRun-Transition
  regression tests.
- authority classify_test.go: Ambiguous on orphan-table, symmetric on
  daemon-without-table, table-without-chain routed to Ambiguous,
  predicate requires all three (table+chain+daemon).
- uninstall_test.go: render ↔ JSON equivalence grid (audit item 11).
- history_test.go: write-gate predicate unit tests (dry-run skips,
  non-terminal skips, apply-terminal writes).

Explicit non-goals:
- no uninstall mutation added
- no install preview capability added (PR-22B only removes false
  dry-run semantics by refusing)
- no purge/remove semantics change
- no validator contract change
- no generic history redesign beyond the terminal-state gate
- no PR-23 work begun

Depends on: #481 (PR-22A uninstall boundary repair)
Refs: V1100_LIFECYCLE_COMPLETION_CONTRACT.md §13 (frozen 2026-04-19)
Refs: internal/installer/uninstall/contract.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
itcmsgr added a commit that referenced this pull request Apr 19, 2026
…ry gating + authority alignment + panel consent (#482)

Closes the systemic install/update audit findings. Scope-locked boundary repair:

- Dry-run honesty: `--mode=install --dry-run` refused explicitly; `--repair --dry-run` refused; `StateFile.DryRun` suppresses `Transition` writes; `update_dryrun.go` no longer writes `update_plan.json`.
- History gating: new `state.IsApplyTerminal(s)` explicit allowlist; `writeHistory` gated `!cfg.dryRun && IsApplyTerminal(sf.State)` — structural, not mode-heuristic.
- Authority: canonical `IsNftbanAuthoritative(exec)` (table + chain + daemon) — used by both `authority.Classify` and `update.Preflight` P-1; new `Ambiguous` variant for orphan-table/daemon-down; `phaseSwitch` routes `Ambiguous` through emergency-SSH injection.
- Panel consent: silent panel auto-approve replaced with `--panel-auto-takeover` default-off flag + `NFTBAN_PANEL_AUTO_TAKEOVER` env mirror.
- Lifecycle bridge: real `DryRun` wired from `cfg.dryRun`; `observePlan`/`mapAuthority` switches pinned to `authority.Decision` constants (fixes pre-existing UPPERCASE-vs-lowercase bug that silently mapped every consumer to `PreserveAuthority` from v1.98).
- Flag validation: `--takeover+--dry-run`, `--rpm+--deb`, `--force-delete-operator-config` without `--purge` all rejected.
- CI: new `ci-install-canonization.yml` (G3-IN-REFUSE-DRY-RUN + G3-IN-FLAG-COMBOS); tightened `ci-update-canonization.yml` G3-U3 with history + state hard assertions, no `|| true`; extended G3-U5..U10 grep scope + patterns.
- Reusable `audit.PurityHarness` at `internal/installer/audit/harness.go`, adopted by uninstall + new update purity tests.

Three re-audit blockers closed in commit 966a175: refuse `--repair --dry-run`; port uninstall test to `PurityHarness` + add update purity test; fix `IsApplyTerminal` alias doc typo; plus `nft list chain ip nftban input` added to `ApplyWhitelist` for the tightened predicate.

Pre-PR-23 blockers (not in scope; tracked as follow-up PRs):
1. Prior-authority record hardening (timestamp + installer version + active_at_install semantics)
2. External-firewall detection unification
3. Kernel/service snapshot CI gate
4. Exec-trace CI gate
5. Auto-elevate uninstall shim removal gate
6. Payload integrity minimum checks

Depends on: #481 (PR-22A)
Refs: V1100_LIFECYCLE_COMPLETION_CONTRACT.md §13 (frozen 2026-04-19)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant