fix(installer): update cascade fix — rebuild exit codes + daemon recovery + dpkg resilience by itcmsgr · Pull Request #438 · itcmsgr/nftban

itcmsgr · 2026-04-16T17:58:06Z

Summary — 4 P0 fixes for the update failure cascade

Root cause chain (lab2 v1.87.1 → v1.90.0):

1. Daemon hits start-limit-hit from restart cycles during update
2. Module re-enable commands fail silently (daemon down)
3. POST validation sees missing chains → DEGRADED (exit 1)
4. Go installer treats exit 1 as FAILED_REBUILD
5. postinst set -e aborts on non-zero → dpkg marks package broken (iF)
6. Repair crashes on nil distro (Detect phase skipped)

Fix 1: rebuild.go — align with shell exit-code contract

Shell contract: 0=PROTECTED, 1=DEGRADED, 2=FAILED, 3=FATAL
Before: any non-zero → FAILED_REBUILD
After: exit 1 → log warning, continue. Exit 2+ → FAILED_REBUILD
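
The contract above can be sketched from the caller's side as follows. This is a minimal illustration, not the code in rebuild.go (which is Go): the function names `rebuild_state_name` and `handle_rebuild_exit` and the log wording are hypothetical, and treating any code above 3 as a failure is an assumption.

```shell
# Hypothetical helper: map a rebuild exit code to the contract's state name.
rebuild_state_name() {
    case "$1" in
        0) echo "PROTECTED" ;;
        1) echo "DEGRADED" ;;
        2) echo "FAILED" ;;
        3) echo "FATAL" ;;
        *) echo "UNKNOWN" ;;   # assumption: codes outside the contract
    esac
}

# Hypothetical helper: decide whether the caller may continue.
# Returns 0 for exit 0 and exit 1 (DEGRADED is logged as a warning),
# returns 1 for exit 2 and above (treated as FAILED_REBUILD).
handle_rebuild_exit() {
    code="$1"
    case "$code" in
        0)
            return 0
            ;;
        1)
            echo "WARN: rebuild finished $(rebuild_state_name "$code") (exit 1), continuing" >&2
            return 0
            ;;
        *)
            echo "ERROR: FAILED_REBUILD: rebuild $(rebuild_state_name "$code") (exit $code)" >&2
            return 1
            ;;
    esac
}
```

With this shape, a call such as `handle_rebuild_exit 1` only warns, while `handle_rebuild_exit 2` reports a failure to the caller.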

Fix 2: daemon.go — reset-failed before start retry loop

Before: raw ServiceStart retries, fails on start-limit-hit
After: systemctl reset-failed nftband.{service,socket} first
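
A rough shell equivalent of the "reset-failed first, then retry" ordering is shown below. The real change lives in daemon.go; the function name `start_daemon_with_reset`, the `SYSTEMCTL` and `RETRY_DELAY` variables, the default attempt count, and the choice to start `nftband.service` (rather than the socket) are assumptions made for this sketch.

```shell
# Assumption: SYSTEMCTL can be overridden so the logic can be exercised
# without a running systemd. RETRY_DELAY is the pause between attempts.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"
RETRY_DELAY="${RETRY_DELAY:-2}"

# Hypothetical helper: clear failed state, then retry the start.
start_daemon_with_reset() {
    max_attempts="${1:-3}"

    # Clear start-limit-hit / failed state on both units before any start.
    # Errors are ignored: reset-failed on a healthy unit is harmless.
    "$SYSTEMCTL" reset-failed nftband.service nftband.socket 2>/dev/null || true

    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        if "$SYSTEMCTL" start nftband.service; then
            return 0
        fi
        echo "WARN: start attempt $attempt/$max_attempts failed" >&2
        attempt=$((attempt + 1))
        # Pause only when another attempt will follow.
        if [ "$attempt" -le "$max_attempts" ]; then
            sleep "$RETRY_DELAY"
        fi
    done
    return 1
}
```

Without the reset-failed call, every iteration of the loop would be refused by systemd once the unit is in start-limit-hit, which is the behaviour described under "Before".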

Fix 3: postinst — capture exit without set -e abort

Before: set -e kills script on installer exit 1, if/elif dead code
After: || INSTALLER_EXIT=$? captures exit code, script continues
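
The capture pattern can be illustrated with a small stand-alone sketch. `fake_installer` is a hypothetical stand-in for the real installer call, `postinst_configure` is a hypothetical wrapper, and the branch messages and the decision to propagate exit 2+ are assumptions, not the actual postinst behaviour.

```shell
# Hypothetical stand-in for the installer binary: exits with the given code.
fake_installer() { return "$1"; }

# Runs in a subshell so that set -e stays local to this sketch.
postinst_configure() (
    set -e

    INSTALLER_EXIT=0
    # A command on the left of '||' is in a tested context, so set -e
    # does not abort the script when it returns non-zero.
    fake_installer "$1" || INSTALLER_EXIT=$?

    # These branches are reachable because the script is still running.
    if [ "$INSTALLER_EXIT" -eq 0 ]; then
        echo "installer: ok"
    elif [ "$INSTALLER_EXIT" -eq 1 ]; then
        echo "installer: degraded, continuing"
    else
        echo "installer: failed (exit $INSTALLER_EXIT)"
        exit "$INSTALLER_EXIT"   # assumption: hard failures still propagate
    fi
)
```

By contrast, a bare `fake_installer 1` followed by `INSTALLER_EXIT=$?` under `set -e` would terminate the script on the first line, so the if/elif chain would never run.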

Fix 4: cmd_firewall.sh — preflight daemon liveness before module re-enable

Detects daemon down, attempts reset-failed + restart recovery
If still down, warns explicitly instead of failing silently
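
A possible shape for such a preflight check is sketched below. It is not the code in cmd_firewall.sh: the function name `preflight_daemon`, the overridable `SYSTEMCTL` variable, the warning text, and the use of `nftband.service` as the unit to probe are assumptions.

```shell
# Assumption: SYSTEMCTL can be overridden for testing without systemd.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

# Hypothetical helper: returns 0 if the daemon is (or becomes) active,
# 1 if it is still down after one recovery attempt.
preflight_daemon() {
    if "$SYSTEMCTL" is-active --quiet nftband.service; then
        return 0
    fi

    echo "WARN: nftband is not active, attempting recovery" >&2
    # Clear start-limit-hit first, otherwise the restart may be refused.
    "$SYSTEMCTL" reset-failed nftband.service nftband.socket 2>/dev/null || true
    "$SYSTEMCTL" restart nftband.service 2>/dev/null || true

    if "$SYSTEMCTL" is-active --quiet nftband.service; then
        return 0
    fi

    # Explicit warning instead of a silent failure further down.
    echo "WARN: nftband is still down, module re-enable may not take effect" >&2
    return 1
}
```

A caller could run `preflight_daemon` once before the module re-enable block and decide, based on its return value, whether to proceed or only report the degraded state.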

Also included (from earlier commits)

Shell-side nftban_service_clear_failed() helper + safe restart wrappers
systemd StartLimitBurst 5→10, Interval 300→600s
Repair nil-pointer fix (always run Detect before resume)

Lab verification

lab4: Go build PASS (installer + daemon + go vet)
lab2: dpkg --configure PASS → COMMITTED → PROTECTED

Acceptance criteria

Update with daemon in start-limit-hit does not leave dpkg broken
Rebuild exit 1 logged as DEGRADED, not FAILED_REBUILD
Module re-enable succeeds after daemon recovery or warns explicitly
End state is never broken-package + dead-daemon + stuck-repair

🤖 Generated with Claude Code

When nftband crashes or gets repeatedly started/stopped (e.g. during update, socket activation, or health auto-heal), systemd's start-limit can block all future start attempts with 'start-limit-hit'. This leaves the daemon permanently failed until manual intervention.

Fixes:
- Add nftban_service_clear_failed() helper: clears failed state before starting any service
- Add nftban_daemon_restart() and nftban_daemon_start() safe wrappers that always clear start-limit-hit first
- Use safe wrappers in service_control, autoheal, and health checks
- Increase systemd StartLimitBurst from 5→10 and interval from 300→600s to accommodate update/install restart cycles

The root cause on lab2: socket activation triggered repeated start/stop cycles during an update, exhausting the 5-restart limit in 5 minutes. Subsequent health auto-heal attempts also failed because systemd refused to start the failed unit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

github-actions · 2026-04-16T17:58:57Z

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

When --repair resumes from SWITCH phase, it skipped Detect, leaving pd.distro=nil. The Switch phase then called EnableNftables(nil) which panicked on distro.NftConfPath (nil pointer dereference).

Root cause: repair mode skipped all phases before the resume point, but later phases depend on Detect results (distro, panel, conflicts).

Fixes:
- Always run phaseDetect before resuming from any phase in repair mode
- Add nil guard in EnableNftables for defense-in-depth

This was the root cause of the lab2 update failure: dpkg post-install ran the installer, rebuild failed (daemon down from start-limit-hit), then --repair crashed on nil distro because Detect was skipped.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Four P0 fixes for the update failure cascade discovered on lab2.

Root cause: during upgrade, daemon hits start-limit-hit from restart cycles. Module re-enable commands (ddos, portscan, botguard) fail silently because daemon is down. POST validation sees missing chains, reports DEGRADED (exit 1). Go installer treats exit 1 as FAILED. postinst set -e aborts on non-zero. dpkg marks package broken.

Fix 1 (rebuild.go): align with shell rebuild exit-code contract
Shell contract: 0=PROTECTED, 1=DEGRADED, 2=FAILED, 3=FATAL
Before: any non-zero → FAILED_REBUILD
After: exit 1 (DEGRADED) → log warning, continue
       exit 2+ → FAILED_REBUILD (unchanged)

Fix 2 (daemon.go): reset-failed before daemon start retry loop
Before: raw ServiceStart retries, fails on start-limit-hit
After: systemctl reset-failed nftband.{service,socket} first

Fix 3 (postinst): capture installer exit without set -e abort
Before: set -e kills script on installer exit 1, if/elif dead code
After: || INSTALLER_EXIT=$? captures exit code, script continues

Fix 4 (cmd_firewall.sh): preflight daemon liveness before module re-enable block. Attempts reset-failed + start recovery. If daemon still down, warns explicitly instead of failing silently.

Acceptance criteria:
- Update from 1.87.x with daemon in start-limit-hit does not leave dpkg broken
- Rebuild exit 1 logged as DEGRADED, not FAILED_REBUILD
- Module re-enable either succeeds after recovery or warns explicitly
- End state is never broken-package + dead-daemon + stuck-repair

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

TestRebuild_Failure expected exit 1 to return error. With the new contract alignment, exit 1 = DEGRADED (non-fatal). Updated test to use exit 2 for the failure case. Added TestRebuild_Degraded to verify exit 1 is accepted without error.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

itcmsgr and others added 2 commits April 16, 2026 21:03

itcmsgr changed the title ~~fix(daemon): clear start-limit-hit before restart + increase burst limit~~ fix(installer): update cascade fix — rebuild exit codes + daemon recovery + dpkg resilience Apr 16, 2026

itcmsgr and others added 2 commits April 16, 2026 21:27

merge: update with v1.91 from main

654b8fb

itcmsgr merged commit b90e342 into main Apr 16, 2026
47 of 48 checks passed

itcmsgr deleted the fix/daemon-restart-safety branch April 16, 2026 18:40

This was referenced Apr 16, 2026

release: v1.91.0 — pipeline unification + update cascade fix #439

Merged

feat(update): PKG-STATE-INCONSISTENT auto-recovery for DEB installs #384

Closed

itcmsgr merged 5 commits into main from fix/daemon-restart-safety
