
fix(installer): update cascade fix — rebuild exit codes + daemon recovery + dpkg resilience#438

Merged
itcmsgr merged 5 commits into main from fix/daemon-restart-safety on Apr 16, 2026


@itcmsgr itcmsgr commented Apr 16, 2026

Summary — 4 P0 fixes for the update failure cascade

Root cause chain (lab2 v1.87.1 → v1.90.0):

  1. Daemon hits start-limit-hit from restart cycles during update
  2. Module re-enable commands fail silently (daemon down)
  3. POST validation sees missing chains → DEGRADED (exit 1)
  4. Go installer treats exit 1 as FAILED_REBUILD
  5. postinst set -e aborts on non-zero → dpkg marks package broken (iF)
  6. Repair crashes on nil distro (Detect phase skipped)

Fix 1: rebuild.go — align with shell exit-code contract

  • Shell contract: 0=PROTECTED, 1=DEGRADED, 2=FAILED, 3=FATAL
  • Before: any non-zero → FAILED_REBUILD
  • After: exit 1 → log warning, continue. Exit 2+ → FAILED_REBUILD
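The contract above can be sketched as a small classifier. This is a shell illustration of the mapping only, not the code in rebuild.go; the function names `classify_rebuild_exit` and `rebuild_should_continue` are hypothetical, and treating codes above 3 as failures is an assumption based on the "Exit 2+" wording.

```shell
#!/bin/sh
# Map a rebuild exit code to the state named in the shell contract.
# classify_rebuild_exit is a hypothetical helper name.
classify_rebuild_exit() {
    case "$1" in
        0) echo "PROTECTED" ;;
        1) echo "DEGRADED" ;;
        2) echo "FAILED" ;;
        3) echo "FATAL" ;;
        *) echo "UNKNOWN" ;;   # assumption: anything else is treated as a failure
    esac
}

# Return 0 when the caller may continue (exit 0 or 1), 1 otherwise.
rebuild_should_continue() {
    case "$(classify_rebuild_exit "$1")" in
        PROTECTED) return 0 ;;
        DEGRADED)
            echo "WARNING: rebuild reported DEGRADED (exit 1); continuing" >&2
            return 0 ;;
        *) return 1 ;;
    esac
}
```

A caller would run the rebuild, pass `$?` to `rebuild_should_continue`, and only record FAILED_REBUILD when it returns non-zero.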

Fix 2: daemon.go — reset-failed before start retry loop

  • Before: raw ServiceStart retries, fails on start-limit-hit
  • After: systemctl reset-failed nftband.{service,socket} first

Fix 3: postinst — capture exit without set -e abort

  • Before: set -e kills script on installer exit 1, if/elif dead code
  • After: || INSTALLER_EXIT=$? captures exit code, script continues
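A minimal sketch of the capture pattern described above. The real postinst is not shown in this PR, so `run_installer_step` is a hypothetical stand-in, the installer command is passed in as arguments, and the branch bodies are placeholders that only print a message.

```shell
#!/bin/sh
# run_installer_step is a hypothetical stand-in for the postinst body.
# "$@" is the installer command; the actual invocation is an assumption.
run_installer_step() (
    set -e
    INSTALLER_EXIT=0
    # A command on the left of || is exempt from set -e, so a non-zero
    # exit is recorded here instead of aborting the script.
    "$@" || INSTALLER_EXIT=$?

    # With the exit code captured, these branches are reachable again.
    if [ "$INSTALLER_EXIT" -eq 0 ]; then
        echo "installer: ok"
    elif [ "$INSTALLER_EXIT" -eq 1 ]; then
        echo "installer: degraded (exit 1), continuing"
    else
        echo "installer: failed (exit $INSTALLER_EXIT)"
    fi
)
```

Without the `||` capture, `set -e` would terminate the script at the installer call on any non-zero exit, which is why the `if`/`elif` handling was dead code before the fix.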

Fix 4: cmd_firewall.sh — preflight daemon liveness before module re-enable

  • Detects daemon down, attempts reset-failed + restart recovery
  • If still down, warns explicitly instead of failing silently
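The preflight described above might look roughly like the following. This is a sketch under assumptions, not the code in cmd_firewall.sh: `preflight_daemon` is a hypothetical name, the warning text is invented, and `SYSTEMCTL` is made overridable only so the logic can be exercised without systemd.

```shell
#!/bin/sh
# SYSTEMCTL is overridable so the sketch can run without a real systemd.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

# preflight_daemon (hypothetical name) returns 0 when the daemon is up,
# possibly after recovery, and 1 with an explicit warning when it is not.
preflight_daemon() {
    unit="${1:-nftband.service}"
    if "$SYSTEMCTL" is-active --quiet "$unit"; then
        return 0
    fi
    echo "daemon $unit is down; attempting recovery" >&2
    # Clear start-limit-hit first, otherwise systemd refuses the restart.
    "$SYSTEMCTL" reset-failed "$unit" 2>/dev/null || true
    "$SYSTEMCTL" restart "$unit" 2>/dev/null || true
    if "$SYSTEMCTL" is-active --quiet "$unit"; then
        return 0
    fi
    echo "WARNING: $unit is still down; module re-enable may not take effect" >&2
    return 1
}
```

The caller can then decide whether to attempt the module re-enable commands or skip them, but either way the operator sees why the chains are missing.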

Also included (from earlier commits)

  • Shell-side nftban_service_clear_failed() helper + safe restart wrappers
  • systemd StartLimitBurst 5→10, Interval 300→600s
  • Repair nil-pointer fix (always run Detect before resume)
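For reference, the start-limit values listed above correspond to the `StartLimitBurst=` and `StartLimitIntervalSec=` directives in the `[Unit]` section of a systemd unit. The PR changes the shipped unit itself; the sketch below instead writes an equivalent drop-in, which is an assumption made for illustration. The function name `write_start_limit_dropin` and the file name `start-limit.conf` are hypothetical.

```shell
#!/bin/sh
# write_start_limit_dropin (hypothetical) writes a drop-in with the new
# limits into the directory given as $1, for example
# /etc/systemd/system/nftband.service.d on a real host.
write_start_limit_dropin() {
    dir="$1"
    mkdir -p "$dir"
    cat > "$dir/start-limit.conf" <<'EOF'
[Unit]
StartLimitBurst=10
StartLimitIntervalSec=600
EOF
}
```

On a real system, `systemctl daemon-reload` would be needed afterwards for the new limits to apply.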

Lab verification

  • lab4: Go build PASS (installer + daemon + go vet)
  • lab2: dpkg --configure PASS → COMMITTED → PROTECTED

Acceptance criteria

  • Update with daemon in start-limit-hit does not leave dpkg broken
  • Rebuild exit 1 logged as DEGRADED, not FAILED_REBUILD
  • Module re-enable succeeds after daemon recovery or warns explicitly
  • End state is never broken-package + dead-daemon + stuck-repair

🤖 Generated with Claude Code

When nftband crashes or gets repeatedly started/stopped (e.g. during
update, socket activation, or health auto-heal), systemd's start-limit
can block all future start attempts with 'start-limit-hit'. This leaves
the daemon permanently failed until manual intervention.

Fixes:
- Add nftban_service_clear_failed() helper — clears failed state before
  starting any service
- Add nftban_daemon_restart() and nftban_daemon_start() safe wrappers
  that always clear start-limit-hit first
- Use safe wrappers in service_control, autoheal, and health checks
- Increase systemd StartLimitBurst from 5→10 and interval from 300→600s
  to accommodate update/install restart cycles
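The helper and wrappers listed above are not shown in this conversation, so the following is only a guess at their shape. The `sketch_` names are deliberately different from the real `nftban_*` functions, and `SYSTEMCTL` is an overridable variable introduced here for testing.

```shell
#!/bin/sh
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

# Clear failed state (which also resets the start-limit counter) for each
# unit given. Errors are ignored so a healthy unit does not block the start.
sketch_clear_failed() {
    for unit in "$@"; do
        "$SYSTEMCTL" reset-failed "$unit" 2>/dev/null || true
    done
}

# Start the daemon, always clearing start-limit-hit first.
sketch_daemon_start() {
    sketch_clear_failed nftband.service nftband.socket
    "$SYSTEMCTL" start nftband.service
}

# Restart the daemon, always clearing start-limit-hit first.
sketch_daemon_restart() {
    sketch_clear_failed nftband.service nftband.socket
    "$SYSTEMCTL" restart nftband.service
}
```

Routing every start and restart through wrappers like these means callers such as auto-heal cannot forget the `reset-failed` step.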

The root cause on lab2: socket activation triggered repeated start/stop
cycles during an update, exhausting the 5-restart limit in 5 minutes.
Subsequent health auto-heal attempts also failed because systemd refused
to start the failed unit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

github-actions Bot commented Apr 16, 2026

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

itcmsgr and others added 2 commits April 16, 2026 21:03
When --repair resumes from SWITCH phase, it skipped Detect, leaving
pd.distro=nil. The Switch phase then called EnableNftables(nil) which
panicked on distro.NftConfPath (nil pointer dereference).

Root cause: repair mode skipped all phases before the resume point,
but later phases depend on Detect results (distro, panel, conflicts).

Fixes:
- Always run phaseDetect before resuming from any phase in repair mode
- Add nil guard in EnableNftables for defense-in-depth

This was the root cause of the lab2 update failure: dpkg post-install
ran the installer, rebuild failed (daemon down from start-limit-hit),
then --repair crashed on nil distro because Detect was skipped.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Four P0 fixes for the update failure cascade discovered on lab2.

Root cause: during upgrade, daemon hits start-limit-hit from restart
cycles. Module re-enable commands (ddos, portscan, botguard) fail
silently because daemon is down. POST validation sees missing chains,
reports DEGRADED (exit 1). Go installer treats exit 1 as FAILED_REBUILD.
postinst set -e aborts on non-zero. dpkg marks package broken.

Fix 1 — rebuild.go: align with shell rebuild exit-code contract
  Shell contract: 0=PROTECTED, 1=DEGRADED, 2=FAILED, 3=FATAL
  Before: any non-zero → FAILED_REBUILD
  After:  exit 1 (DEGRADED) → log warning, continue
          exit 2+ → FAILED_REBUILD (unchanged)

Fix 2 — daemon.go: reset-failed before daemon start retry loop
  Before: raw ServiceStart retries, fails on start-limit-hit
  After:  systemctl reset-failed nftband.{service,socket} first

Fix 3 — postinst: capture installer exit without set -e abort
  Before: set -e kills script on installer exit 1, if/elif dead code
  After:  || INSTALLER_EXIT=$? captures exit code, script continues

Fix 4 — cmd_firewall.sh: preflight daemon liveness before module
  re-enable block. Attempts reset-failed + start recovery. If daemon
  still down, warns explicitly instead of failing silently.

Acceptance criteria:
- Update from 1.87.x with daemon in start-limit-hit does not leave
  dpkg broken
- Rebuild exit 1 logged as DEGRADED, not FAILED_REBUILD
- Module re-enable either succeeds after recovery or warns explicitly
- End state is never broken-package + dead-daemon + stuck-repair

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@itcmsgr itcmsgr changed the title from "fix(daemon): clear start-limit-hit before restart + increase burst limit" to "fix(installer): update cascade fix — rebuild exit codes + daemon recovery + dpkg resilience" on Apr 16, 2026
itcmsgr and others added 2 commits April 16, 2026 21:27
TestRebuild_Failure expected exit 1 to return error. With the new
contract alignment, exit 1 = DEGRADED (non-fatal). Updated test to
use exit 2 for the failure case. Added TestRebuild_Degraded to verify
exit 1 is accepted without error.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@itcmsgr itcmsgr merged commit b90e342 into main Apr 16, 2026
47 of 48 checks passed
@itcmsgr itcmsgr deleted the fix/daemon-restart-safety branch April 16, 2026 18:40