When nftband crashes or gets repeatedly started/stopped (e.g. during update, socket activation, or health auto-heal), systemd's start-limit can block all future start attempts with 'start-limit-hit'. This leaves the daemon permanently failed until manual intervention.

Fixes:
- Add nftban_service_clear_failed() helper: clears failed state before starting any service
- Add nftban_daemon_restart() and nftban_daemon_start() safe wrappers that always clear start-limit-hit first
- Use safe wrappers in service_control, autoheal, and health checks
- Increase systemd StartLimitBurst from 5→10 and interval from 300→600s to accommodate update/install restart cycles

The root cause on lab2: socket activation triggered repeated start/stop cycles during an update, exhausting the 5-restart limit in 5 minutes. Subsequent health auto-heal attempts also failed because systemd refused to start the failed unit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
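The clear-then-start pattern this commit describes could look roughly like the sketch below. It is not the code from the PR: the function names `service_clear_failed` and `safe_start` and the `SYSTEMCTL` override variable are assumptions made for illustration. Only `systemctl reset-failed` (which also resets a unit's start rate limit counter) and `systemctl start` are real systemd commands.

```shell
# Hypothetical sketch of a "clear failed state, then start" wrapper.
# SYSTEMCTL is an assumed override hook so the logic can be exercised
# without a real systemd; it defaults to the real binary.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

# Clear the failed state (and with it the start rate limit counter) of a unit.
# A failure here is ignored: the unit may simply not be in a failed state.
service_clear_failed() {
    "$SYSTEMCTL" reset-failed "$1" 2>/dev/null || true
}

# Start a unit, always clearing a possible start-limit-hit first.
# The exit status is that of the start command.
safe_start() {
    service_clear_failed "$1"
    "$SYSTEMCTL" start "$1"
}
```

A caller would use `safe_start nftband.service` wherever it previously called `systemctl start` directly, so a unit stuck in start-limit-hit gets one clean start attempt instead of an immediate refusal.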
Dependency Review: ✅ No vulnerabilities, license issues, or OpenSSF Scorecard issues found. Scanned files: none.
When --repair resumes from SWITCH phase, it skipped Detect, leaving pd.distro=nil. The Switch phase then called EnableNftables(nil) which panicked on distro.NftConfPath (nil pointer dereference).

Root cause: repair mode skipped all phases before the resume point, but later phases depend on Detect results (distro, panel, conflicts).

Fixes:
- Always run phaseDetect before resuming from any phase in repair mode
- Add nil guard in EnableNftables for defense-in-depth

This was the root cause of the lab2 update failure: dpkg post-install ran the installer, rebuild failed (daemon down from start-limit-hit), then --repair crashed on nil distro because Detect was skipped.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
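The resume rule from this commit (always run Detect, then continue from the resume point) can be illustrated with a small planning function. The real fix lives in the Go installer; this is only a shell sketch of the ordering logic. The phase list and the `repair_plan` name are hypothetical: the commit names only the Detect and Switch phases.

```shell
# Hypothetical phase list; only "detect" and "switch" are named in the commit.
PHASES="detect switch rebuild validate"

# Print the phases a repair run should execute when resuming from phase $1.
# Detect always runs first because later phases depend on its results.
repair_plan() {
    resume_from="$1"
    plan="detect"
    seen=0
    for phase in $PHASES; do
        [ "$phase" = "$resume_from" ] && seen=1
        # Append every phase from the resume point onward, without
        # listing detect a second time.
        if [ "$seen" -eq 1 ] && [ "$phase" != "detect" ]; then
            plan="$plan $phase"
        fi
    done
    echo "$plan"
}
```

With this ordering, resuming from `switch` never reaches the Switch phase with empty detection results, which is the condition that triggered the nil dereference.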
Four P0 fixes for the update failure cascade discovered on lab2.
Root cause: during upgrade, daemon hits start-limit-hit from restart
cycles. Module re-enable commands (ddos, portscan, botguard) fail
silently because daemon is down. POST validation sees missing chains,
reports DEGRADED (exit 1). Go installer treats exit 1 as FAILED.
postinst set -e aborts on non-zero. dpkg marks package broken.

Fix 1 - rebuild.go: align with shell rebuild exit-code contract
  Shell contract: 0=PROTECTED, 1=DEGRADED, 2=FAILED, 3=FATAL
  Before: any non-zero → FAILED_REBUILD
  After:  exit 1 (DEGRADED) → log warning, continue
          exit 2+ → FAILED_REBUILD (unchanged)

Fix 2 - daemon.go: reset-failed before daemon start retry loop
  Before: raw ServiceStart retries, fails on start-limit-hit
  After:  systemctl reset-failed nftband.{service,socket} first

Fix 3 - postinst: capture installer exit without set -e abort
  Before: set -e kills script on installer exit 1, if/elif dead code
  After:  || INSTALLER_EXIT=$? captures exit code, script continues

Fix 4 - cmd_firewall.sh: preflight daemon liveness before module
  re-enable block. Attempts reset-failed + start recovery. If daemon
  still down, warns explicitly instead of failing silently.

Acceptance criteria:
- Update from 1.87.x with daemon in start-limit-hit does not leave
  dpkg broken
- Rebuild exit 1 logged as DEGRADED, not FAILED_REBUILD
- Module re-enable either succeeds after recovery or warns explicitly
- End state is never broken-package + dead-daemon + stuck-repair

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
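
Fix 3 relies on a standard shell idiom: under `set -e`, a failing command on the left side of `||` does not abort the script, so its exit code can be captured. The sketch below shows the idiom in isolation. `run_installer` and `postinst_sketch` are hypothetical stand-ins, not the actual postinst; only the `|| INSTALLER_EXIT=$?` form comes from the commit message.

```shell
# Hypothetical stand-in for the real installer; exits with the code in $1.
run_installer() { return "$1"; }

# The body runs in a subshell so that set -e stays local to the sketch.
postinst_sketch() (
    set -e
    INSTALLER_EXIT=0
    run_installer "$1" || INSTALLER_EXIT=$?
    # Reached even when the installer failed: the failure was consumed
    # by the || list instead of tripping set -e.
    echo "installer exit: $INSTALLER_EXIT"
)
```

Without the `||` capture, `set -e` would end the script at the failing call, and any `if`/`elif` handling of the exit code placed after it would never run.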
TestRebuild_Failure expected exit 1 to return error. With the new contract alignment, exit 1 = DEGRADED (non-fatal). Updated test to use exit 2 for the failure case. Added TestRebuild_Degraded to verify exit 1 is accepted without error.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
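
The exit-code contract these tests exercise (0=PROTECTED, 1=DEGRADED, 2=FAILED, 3=FATAL, with only exit 2 and above treated as FAILED_REBUILD) can be written as a small classifier. This is a sketch, not the code in rebuild.go, which is Go. The function names are hypothetical, and the `UNKNOWN` label for codes outside the contract is an assumption.

```shell
# Map a rebuild exit code to the state named in the shell contract.
# "UNKNOWN" for codes outside 0-3 is an assumption of this sketch.
rebuild_state() {
    case "$1" in
        0) echo "PROTECTED" ;;
        1) echo "DEGRADED" ;;
        2) echo "FAILED" ;;
        3) echo "FATAL" ;;
        *) echo "UNKNOWN" ;;
    esac
}

# Succeeds (returns 0) when the code should be treated as FAILED_REBUILD.
# Exit 1 (DEGRADED) is a warning, so only codes of 2 and above qualify.
rebuild_should_fail() {
    [ "$1" -ge 2 ]
}
```

Under these rules, an exit code of 1 yields `DEGRADED` and `rebuild_should_fail 1` is false, which matches the behavior TestRebuild_Degraded is described as verifying.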
This was referenced Apr 16, 2026
Summary — 4 P0 fixes for the update failure cascade
Root cause chain (lab2 v1.87.1 → v1.90.0):
  set -e aborts on non-zero → dpkg marks package broken (iF)
Fix 1: rebuild.go - align with shell exit-code contract
Fix 2: daemon.go - reset-failed before start retry loop
  After: systemctl reset-failed nftband.{service,socket} first
Fix 3: postinst - capture exit without set -e abort
  Before: set -e kills script on installer exit 1, if/elif dead code
  After: || INSTALLER_EXIT=$? captures exit code, script continues
Fix 4: cmd_firewall.sh - preflight daemon liveness before module re-enable
Also included (from earlier commits)
  nftban_service_clear_failed() helper + safe restart wrappers
Lab verification
Acceptance criteria
🤖 Generated with Claude Code