release: v1.96.0 — rebuild recovery bridge #456

Merged: itcmsgr merged 8 commits into main from feat/v1.96-rebuild-recovery on Apr 17, 2026

@itcmsgr (Owner) commented Apr 17, 2026

Summary

Implements v1.96 rebuild recovery bridge — the missing recovery semantics between safe rebuild (v1.70+) and lifecycle canonization (v1.97+).

Core additions:

  • Failure classification (12 classes: PREVALIDATION_FAILED through RETRY_EXHAUSTED)
  • Recovery markers (JSON persistence at /var/lib/nftban/state/rebuild_recovery.json)
  • Bounded immediate retry (max 1, transient classes only)
  • Deferred retry (systemd timer, 60s after boot, max 2 attempts, non-recurring)
  • Module restore verification (Level 1+2: structure + wiring)
  • Two-truths model: operation result vs system health (separated)

Includes hotfix: a15bcf80 — smoke validator path resolution + module gating truth alignment (v1.95.1 patch, will deduplicate on rebase if merged separately)

Lab4-derived fix (78f01a32): corrects the module verification chain names to the actual kernel names (ddos_protection, portscan_detection, botguard_filter) and treats idle as a valid post-rebuild state (structurally equivalent to protected, so it no longer triggers a false rollback).

Commit Stack (7 commits)

| Commit | Scope |
|---|---|
| PR-01 6b669fb5 | Types + marker foundation (13 Go tests PASS on lab4) |
| hotfix a15bcf80 | Smoke validator path + module gating fix |
| PR-02 7f3b706e | Failure classification + recovery markers wired to all exit paths |
| PR-03 e618c4e9 | Bounded immediate retry (1 attempt, transient classes only) |
| PR-04 789045d2 | Deferred retry service + timer + polkit whitelist update |
| PR-05 a091e237 | Module restore truth + post-restore verification |
| fix 78f01a32 | Chain names + idle state fix (lab4 testing) |

Contract

V196_REBUILD_RECOVERY_CONTRACT.md (locked 2026-04-17, 550 lines)

10 invariants (INV-RR-001 through INV-RR-010), 3 tightenings applied:

  1. PREVALIDATION_FAILED excluded from recovery flow entirely
  2. POSTVALIDATION_REGRESSION retry only if daemon-related
  3. Module restore verified at 3 levels (structure, wiring, activation evidence)

Lab4 Validation Matrix

| Test | Result | Exit | Marker | Verdict |
|---|---|---|---|---|
| Clean success | PROTECTED/IDLE | 0 | Cleared | PASS |
| Pre-validation failure (broken template) | No kernel mutation | 1 | NOT written (INV-RR-005) | PASS |
| Daemon down + module restore | Auto-recovered via socket activation | 0 | Cleared | PASS |

Not fully destructive-tested on lab4

  • Forced post-validation regression → rollback: Code path verified by review. Rollback logic unchanged from v1.78 (proven in production). Classification wiring verified by marker content inspection.
  • Rollback failure / exit 3 fatal: Recovery instructions output verified by code review. Hard to simulate safely on live host without risking SSH lockout.

Files Changed

New files (9):

  • internal/rebuild/types.go — OperationResult, FailureClass, ModuleRestoreResult enums
  • internal/rebuild/policy.go — Retry policy constants + disposition logic
  • internal/rebuild/marker.go — RecoveryMarker JSON persistence
  • internal/rebuild/policy_test.go — Unit tests
  • internal/rebuild/marker_test.go — Unit tests
  • cli/lib/nftban/core/nftban_rebuild_classify.sh — Shell classification helpers
  • cli/lib/nftban/core/nftban_rebuild_recovery.sh — Deferred retry script
  • install/systemd/nftban-rebuild-recovery.service — Oneshot recovery service
  • install/systemd/nftban-rebuild-recovery.timer — Boot-triggered timer (once, 60s delay)

Modified files (2):

  • cli/lib/nftban/cli/cmd_firewall.sh — Classification wiring + retry wrapper + module verification
  • packaging/polkit-1/rules.d/10-nftban-systemd.rules — Added recovery service to operator whitelist

Test plan

  • 13/13 Go unit tests PASS on lab4 (policy + marker)
  • Shell classification library loads and tracks correctly (lab4)
  • Marker write/read/clear lifecycle verified (lab4)
  • INV-RR-005: PREVALIDATION_FAILED does NOT write marker (lab4)
  • Clean rebuild succeeds and clears marker (lab4)
  • Daemon-down scenario self-heals via socket activation (lab4)
  • Pre-commit hooks pass (SPDX, inventory, pipefail)
  • Systemd units hardened (PrivateTmp, NoNewPrivileges, ProtectKernel*, etc.)

🤖 Generated with Claude Code

Add internal/rebuild package with:
- OperationResult enum (SUCCESS, FAILED_RECOVERED, FAILED_DEGRADED, FAILED_FATAL)
- FailureClass enum (12 classes: PREVALIDATION_FAILED through RETRY_EXHAUSTED)
- ModuleRestoreResult enum (3-level verification: structure, wiring, activation)
- RetryDisposition enum + GetRetryDisposition() policy function
- RecoveryMarker struct with JSON persistence (read/write/clear)
- ModuleRestoreReport with per-module tracking
- Comprehensive tests for policy logic, marker lifecycle, serialization

Contract: V196_REBUILD_RECOVERY_CONTRACT.md
Invariants: INV-RR-001 through INV-RR-010
No behavior change — foundation types only.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@github-actions (bot, Contributor) commented Apr 17, 2026

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

Comment thread internal/rebuild/marker.go
Comment thread internal/rebuild/marker.go
itcmsgr and others added 3 commits April 17, 2026 09:32
…arkers

Wire explicit failure classes into every rebuild exit path:

- PREVALIDATION_FAILED (lines 1201, 1220): no marker written (INV-RR-005)
- APPLY_FAILED (line 1248): marker written, retryable
- DAEMON_RESTART_FAILED (line 1315): tracked for conditional retry
- MODULE_RESTORE_FAILED: per-module tracking (ddos, portscan, botguard)
- POSTVALIDATION_REGRESSION (line 1413): marker + rollback result + restored health
- ROLLBACK_FAILED (line 1418): marker + enhanced exit 3 recovery instructions
- SUCCESS (line 1438): stale marker cleared
- DEGRADED (line 1443): classified by root cause (module/daemon/hard-fail)

New file: nftban_rebuild_classify.sh — failure class constants, module
restore tracking, recovery marker read/write/clear helpers.

No retry activated. No systemd changes. Classification + markers only.
Preserves existing exit-code behavior (0/1/2/3).

Contract: V196_REBUILD_RECOVERY_CONTRACT.md §5-§7
Invariants: INV-RR-005 (prevalidation excluded), INV-RR-007 (module visible)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two fixes for smoke prerequisite detection discovered during v1.95 lab validation:

1. Validator path resolution: smoke used exec.LookPath() (PATH-based) for
   nftban-validate, but the binary is installed at /usr/lib/nftban/bin/ which
   is not in PATH. Now uses constants.ValidatorBinPath (shared with evidence
   layer). T1/T2 truth tests now PASS instead of false-SKIP.

2. Module-enabled detection: smoke checked wrong config keys (BOTGUARD_ENABLED
   instead of HTTP_BOTGUARD_ENABLED, LOGINMON_ENABLED instead of
   NFTBAN_LOGIN_ALERT_ENABLED). Replaced ad-hoc config parsing with
   validator-backed detection (single validator call, cached via sync.Once,
   config-file fallback when validator binary is missing). Module gating now
   matches `nftban health --json` exactly.

Adds internal/constants/paths.go for single canonical binary path definition.

Verified on RHEL-family and Debian-family hosts:
- Before: 6/10 PASS, 4 SKIP (false SKIPs)
- After: 9-10/10 PASS, 0-1 SKIP (only genuinely disabled modules)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ilures

Add retry wrapper around rebuild core:
- firewall_rebuild() becomes retry envelope
- _firewall_rebuild_core() is the original rebuild logic (renamed, unchanged)
- At most 1 immediate retry (INV-RR-006)
- Only retries eligible classes: APPLY_FAILED, DAEMON_RESTART_FAILED,
  MODULE_RESTORE_FAILED, MODULE_RESTORE_INCOMPLETE
- POSTVALIDATION_REGRESSION retried only if daemon-related (tightening #2)
- Never retries: PREVALIDATION_FAILED, ROLLBACK_FAILED, structural failures
- Updates marker retry_count on failed retry
- Marks exhausted after cap reached

Exit codes unchanged. No systemd changes. No deferred retry yet.

Contract: V196_REBUILD_RECOVERY_CONTRACT.md §5.2, §10 PR-03

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
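
The retry eligibility rules in this commit can be restated as a small policy function. The sketch below is illustrative only: the PR names a `GetRetryDisposition()` function but does not show its signature, so `dispositionFor`, its parameters, and the `RetryDisposition` values are assumptions. The actual wrapper lives in shell (cmd_firewall.sh).

```go
package main

import "fmt"

type FailureClass string

const (
	PrevalidationFailed      FailureClass = "PREVALIDATION_FAILED"
	ApplyFailed              FailureClass = "APPLY_FAILED"
	DaemonRestartFailed      FailureClass = "DAEMON_RESTART_FAILED"
	ModuleRestoreFailed      FailureClass = "MODULE_RESTORE_FAILED"
	ModuleRestoreIncomplete  FailureClass = "MODULE_RESTORE_INCOMPLETE"
	PostvalidationRegression FailureClass = "POSTVALIDATION_REGRESSION"
	RollbackFailed           FailureClass = "ROLLBACK_FAILED"
)

// RetryDisposition values are hypothetical names for this sketch.
type RetryDisposition int

const (
	NoRetry RetryDisposition = iota
	RetryImmediate
)

// maxImmediateRetries mirrors the "at most 1 immediate retry" rule.
const maxImmediateRetries = 1

// dispositionFor decides whether one more immediate retry is allowed.
// attempts is the number of immediate retries already made.
func dispositionFor(class FailureClass, daemonRelated bool, attempts int) RetryDisposition {
	if attempts >= maxImmediateRetries {
		return NoRetry
	}
	switch class {
	case ApplyFailed, DaemonRestartFailed, ModuleRestoreFailed, ModuleRestoreIncomplete:
		return RetryImmediate
	case PostvalidationRegression:
		// Tightening #2: retry only when the regression is daemon-related.
		if daemonRelated {
			return RetryImmediate
		}
	}
	// PREVALIDATION_FAILED, ROLLBACK_FAILED, and anything unknown: never retry.
	return NoRetry
}

func main() {
	fmt.Println(dispositionFor(ApplyFailed, false, 0) == RetryImmediate)
	fmt.Println(dispositionFor(PostvalidationRegression, false, 0) == RetryImmediate)
	fmt.Println(dispositionFor(ApplyFailed, false, 1) == RetryImmediate)
}
```

Defaulting to `NoRetry` for unlisted classes keeps the policy closed: a class added later has to be opted in explicitly before it is retried.
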
Comment thread internal/smoke/prereqs.go
Comment thread internal/smoke/prereqs.go
itcmsgr and others added 3 commits April 17, 2026 09:47
Add systemd-driven deferred rebuild recovery:

- nftban-rebuild-recovery.timer: fires ONCE 60s after boot (not recurring)
- nftban-rebuild-recovery.service: oneshot, reads marker, attempts rebuild
- nftban_rebuild_recovery.sh: recovery script with full safety checks
  - Exits immediately if no marker, exhausted, or non-retryable class
  - Checks daemon availability before attempting recovery
  - Clears marker on success
  - Marks exhausted after cap (3 total = 1 immediate + 2 deferred)
  - No boot loop: oneshot service, non-repeating timer

Polkit: added nftban-rebuild-recovery.service/.timer to operator
whitelist in 10-nftban-systemd.rules.

Systemd hardening: full hardening applied (PrivateTmp, NoNewPrivileges,
ProtectKernel*, RestrictAddressFamilies, etc.)

Contract: V196_REBUILD_RECOVERY_CONTRACT.md §10 PR-04
INV-RR-006: Retry bounded and persisted

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
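
The gate checks and retry budget described for the deferred recovery script (3 total = 1 immediate + 2 deferred) can be sketched as below. The real logic is a shell script, so this Go version is only a restatement under assumptions: the `marker` fields, the function names, the reading of `retry_count` as the total of immediate plus deferred retries, and skipping when the daemon is unavailable are all guesses made for the example.

```go
package main

import "fmt"

const (
	maxImmediate = 1
	maxDeferred  = 2
	maxTotal     = maxImmediate + maxDeferred // 3 total, per the commit message
)

// marker is a hypothetical in-memory view of the recovery marker.
type marker struct {
	Retryable  bool
	Exhausted  bool
	RetryCount int // assumed to count immediate + deferred retries
}

// shouldAttempt applies the gate checks in order and returns the reason.
func shouldAttempt(m *marker, daemonUp bool) (bool, string) {
	switch {
	case m == nil:
		return false, "no marker"
	case m.Exhausted || m.RetryCount >= maxTotal:
		return false, "retries exhausted"
	case !m.Retryable:
		return false, "non-retryable class"
	case !daemonUp:
		return false, "daemon unavailable" // assumed outcome of the daemon check
	}
	return true, "attempt rebuild"
}

// recordFailedAttempt bumps the counter and marks exhaustion at the cap.
func recordFailedAttempt(m *marker) {
	m.RetryCount++
	if m.RetryCount >= maxTotal {
		m.Exhausted = true
	}
}

func main() {
	// Start after the single immediate retry has already failed.
	m := &marker{Retryable: true, RetryCount: maxImmediate}
	for boot := 1; ; boot++ {
		ok, reason := shouldAttempt(m, true)
		fmt.Printf("boot %d: %s\n", boot, reason)
		if !ok {
			break
		}
		recordFailedAttempt(m) // pretend every deferred rebuild fails
	}
}
```

With every deferred attempt failing, the loop attempts on the first two boots and stops on the third, which matches the two-deferred-attempt cap.
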
…fication

Add post-module-restore verification step between steps 8-12 and POST
validation. Closes the silent daemon-dependent module restoration gap.

Verification checks (Level 1+2 per contract):
- DDoS: nft list chain ip nftban nftban_ddos_filter
- Portscan: nft list chain ip nftban nftban_portscan
- BotGuard: nft list chain ip nftban nftban_botguard

If a module reported RESTORE_OK but its chain is missing from kernel,
result is downgraded to RESTORE_INCOMPLETE. This prevents false
PROTECTED when module enable command returned 0 but the chain was
not actually created (daemon dependency failure).

Level 3 (activation evidence) is not checked here — requires traffic
and produces WARNING only, not DEGRADED (per contract tightening #3).

Contract: V196_REBUILD_RECOVERY_CONTRACT.md §8
INV-RR-007: Module restore failure is surfaced, not silent

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
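
The downgrade rule in this commit (a module that reported RESTORE_OK but whose chain is missing becomes RESTORE_INCOMPLETE) can be sketched as follows. The function name, the map-based inputs, and the injected `chainExists` probe are assumptions for illustration; the real check runs `nft list chain` from shell. Chain names are passed in rather than hard-coded, and the demo uses the names from the later lab4 fix in this PR.

```go
package main

import (
	"fmt"
	"sort"
)

type RestoreResult string

const (
	RestoreOK         RestoreResult = "RESTORE_OK"
	RestoreIncomplete RestoreResult = "RESTORE_INCOMPLETE"
)

// verifyRestore downgrades RESTORE_OK to RESTORE_INCOMPLETE when the
// module's chain is not present. chainExists is a hypothetical probe that
// would wrap the nft lookup in a real implementation.
func verifyRestore(
	reported map[string]RestoreResult,
	chains map[string]string,
	chainExists func(chain string) bool,
) map[string]RestoreResult {
	out := make(map[string]RestoreResult, len(reported))
	for module, result := range reported {
		chain, known := chains[module]
		if result == RestoreOK && known && !chainExists(chain) {
			result = RestoreIncomplete
		}
		out[module] = result
	}
	return out
}

func main() {
	chains := map[string]string{
		"ddos":     "ddos_protection",
		"portscan": "portscan_detection",
		"botguard": "botguard_filter",
	}
	reported := map[string]RestoreResult{
		"ddos": RestoreOK, "portscan": RestoreOK, "botguard": RestoreOK,
	}
	// Pretend the botguard chain was never created.
	present := map[string]bool{"ddos_protection": true, "portscan_detection": true}

	verified := verifyRestore(reported, chains, func(c string) bool { return present[c] })

	modules := make([]string, 0, len(verified))
	for m := range verified {
		modules = append(modules, m)
	}
	sort.Strings(modules)
	for _, m := range modules {
		fmt.Println(m, verified[m])
	}
}
```
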
…state

Fixes from lab4 failure matrix testing:

1. Module verification chain names corrected to actual kernel names:
   - DDoS: nftban_ddos_filter → ddos_protection
   - Portscan: nftban_portscan → portscan_detection
   - BotGuard: nftban_botguard → botguard_filter

2. Post-rebuild regression check now treats 'idle' as acceptable:
   - idle = structurally equivalent to protected (all checks pass,
     no traffic observed yet after flush+reload)
   - protected → idle is NOT a regression, should not trigger rollback
   - protected → degraded/down still triggers rollback (unchanged)

3. Success path now accepts both protected and idle as exit 0.

Discovered during v1.96 failure matrix testing on lab4.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
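
The regression rule from fix 2 above (protected → idle is not a regression, protected → degraded/down is) can be expressed as a rank comparison. This is a sketch under assumptions: the `Health` type, the rank values, and the ordering of degraded above down are illustrative choices, not taken from the PR.

```go
package main

import "fmt"

type Health string

const (
	Protected Health = "protected"
	Idle      Health = "idle"
	Degraded  Health = "degraded"
	Down      Health = "down"
)

// rank places protected and idle at the same level, since the commit
// treats idle as structurally equivalent to protected. Ranking degraded
// above down is an assumption of this sketch.
func rank(h Health) int {
	switch h {
	case Protected, Idle:
		return 2
	case Degraded:
		return 1
	default:
		return 0
	}
}

// isRegression reports whether the post-rebuild state is worse than the
// pre-rebuild state and should therefore trigger a rollback.
func isRegression(before, after Health) bool {
	return rank(after) < rank(before)
}

// isSuccess mirrors fix 3: both protected and idle count as exit 0.
func isSuccess(after Health) bool {
	return after == Protected || after == Idle
}

func main() {
	fmt.Println(isRegression(Protected, Idle))     // false
	fmt.Println(isRegression(Protected, Degraded)) // true
	fmt.Println(isSuccess(Idle))                   // true
}
```
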

@itcmsgr changed the title from "feat(rebuild): v1.96 PR-01 — rebuild recovery types + marker foundation" to "release: v1.96.0 — rebuild recovery bridge" on Apr 17, 2026
Constants in nftban_rebuild_classify.sh are used by sourcing scripts
(cmd_firewall.sh), not within the file itself. ShellCheck correctly
flags them as unused within scope. Add per-line SC2034 disable
directives to document intent.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@itcmsgr itcmsgr merged commit 338ccef into main Apr 17, 2026
46 of 48 checks passed
@itcmsgr itcmsgr deleted the feat/v1.96-rebuild-recovery branch April 17, 2026 07:47