Description
Important notices
Before you add a new report, we ask you kindly to acknowledge the following:
- I have read the contributing guidelines at https://github.com/opnsense/core/blob/master/CONTRIBUTING.md
- I am convinced that my issue is new after having checked both open and closed issues at https://github.com/opnsense/core/issues?q=is%3Aissue
Describe the bug
After an OPNsense update (from 26.1_4 to 26.1.1), Kea DHCP fails to bind to its sockets on startup due to a race condition. The old Kea process's sockets appear to still be in use (likely TIME_WAIT state) when the new process attempts to start. Kea then runs in a broken state (process alive but unable to serve any DHCP requests) for hours, until someone intervenes manually.
Critically, Kea does not retry binding to sockets after the initial failure, and there is no alerting that the DHCP server is non-functional. The only indication in the logs is a WARN level message, and Kea continues running its housekeeping tasks (Lease File Cleanup) as if everything is normal.
Last known working version: 26.1_4 (Kea was working correctly before the upgrade)
Note: Related to #9609 but with a distinct root cause. In #9609, dnsmasq was re-enabled during the update. In my case, ISC DHCP was already disabled and not running; the socket conflict was caused by the old Kea process's sockets not being released before the new process started (a race condition during service restart).
To Reproduce
- Have Kea DHCP running and serving multiple VLANs/subnets
- Perform an OPNsense update that triggers a Kea service restart
- Kea shuts down and restarts within a few minutes
- New Kea instance fails to bind to port 67 with "Address already in use" on all interfaces
- Kea runs but serves zero DHCP requests
- Network connectivity is lost as client leases expire (about 67 minutes with the default 4000-second lease time)
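To make the outage window in the last step concrete, here is a small sketch of the lease timeline for a 4000-second lease. The renew (T1) and rebind (T2) fractions are an assumption: they are the RFC 2131 defaults of 0.5 and 0.875, and a given server or client may be configured differently. The function name `lease_milestones` is made up for this illustration.

```python
from datetime import timedelta


def lease_milestones(valid_lifetime_s, t1_fraction=0.5, t2_fraction=0.875):
    """Return offsets from lease grant to renew, rebind, and expiry.

    The default fractions are the RFC 2131 defaults (an assumption here);
    they are not read from any Kea or OPNsense configuration.
    """
    lifetime = timedelta(seconds=valid_lifetime_s)
    return {
        "renew (T1)": lifetime * t1_fraction,
        "rebind (T2)": lifetime * t2_fraction,
        "expiry": lifetime,
    }


if __name__ == "__main__":
    # 4000 seconds is the lease time quoted in this report.
    for name, offset in lease_milestones(4000).items():
        print(f"{name}: {offset}")
```

With these assumptions, a client that got its lease right before the shutdown would start asking for a renewal after about 33 minutes and lose its address after 1 hour, 6 minutes, and 40 seconds if nothing answers.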
Expected behavior
- Kea should wait for sockets to be fully released before attempting to bind, OR
- Kea should retry binding to sockets after initial failure, OR
- OPNsense should detect that Kea failed to bind and alert the administrator / attempt a restart, OR
- At minimum, the failure should be logged at ERROR level, not just WARN
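As a rough illustration of the "wait, then retry" behavior described above, the following sketch probes whether a UDP address and port can be bound, and retries a few times before giving up. This is not how Kea or OPNsense implement service startup; the function names, attempt count, and delay are hypothetical. Binding to port 67 itself needs root, and a probe like this is still racy (the port could be taken again between the probe and the real start), so treat it as a pre-start guard only.

```python
import socket
import time


def udp_port_is_free(address: str, port: int) -> bool:
    """Return True if a UDP socket can currently be bound to (address, port).

    Any OSError counts as "not free", which includes "Address already in use"
    but also permission errors when probing a privileged port without root.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as probe:
        try:
            probe.bind((address, port))
        except OSError:
            return False
    return True


def wait_for_udp_port(address: str, port: int,
                      attempts: int = 10, delay: float = 1.0) -> bool:
    """Probe up to `attempts` times, sleeping `delay` seconds between probes."""
    for attempt in range(attempts):
        if udp_port_is_free(address, port):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


if __name__ == "__main__":
    # 10.12.225.1 is the vlan08 address from the log excerpt in this report.
    if wait_for_udp_port("10.12.225.1", 67):
        print("port 67 is free, safe to start the DHCP server")
    else:
        print("port 67 still in use, not starting; alert the administrator")
```

A service wrapper could call something like `wait_for_udp_port` before launching the daemon and raise an alert when it returns False, instead of letting the daemon start without any listening sockets.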
Describe alternatives you considered
- Reverted to ISC DHCP to restore network connectivity
- Considered manually restarting Kea after updates, but this defeats the purpose of automatic updates
Screenshots
N/A
Relevant log files
See attached log file showing:
- 00:27-00:33: Normal DHCP operation, devices receiving leases
- 00:33:52: Kea shutdown command received (triggered by update)
- 00:36:01: Kea restart fails with DHCPSRV_NO_SOCKETS_OPEN and "Address already in use" on all interfaces
- 01:36-06:36: Only LFC housekeeping tasks running, zero DHCP traffic served
- 07:30: Manual recovery attempt
Key error messages:
DHCPSRV_OPEN_SOCKET_FAIL failed to open socket: Failed to open socket on interface vlan08, reason: failed to bind fallback socket to address 10.12.225.1, port 67, reason: Address already in use - is another DHCP server running?
DHCP4_OPEN_SOCKETS_FAILED maximum number of open service sockets attempts: 0, has been exhausted without success
DHCPSRV_NO_SOCKETS_OPEN no interface configured to listen to DHCP traffic
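Since these messages are the only sign of the failure, a monitoring check could look for them directly. The sketch below scans log lines for the three message identifiers quoted above and flags the server as unserviceable when DHCPSRV_NO_SOCKETS_OPEN appears. The function names are hypothetical, and where the Kea log lives on a given OPNsense install is deliberately left out; the functions take any iterable of lines.

```python
# Message identifiers taken from the log excerpt in this report.
FAILURE_IDS = (
    "DHCPSRV_OPEN_SOCKET_FAIL",
    "DHCP4_OPEN_SOCKETS_FAILED",
    "DHCPSRV_NO_SOCKETS_OPEN",
)


def find_socket_failures(lines):
    """Group log lines by the socket-failure message ID they contain."""
    hits = {}
    for line in lines:
        for msg_id in FAILURE_IDS:
            if msg_id in line:
                hits.setdefault(msg_id, []).append(line.rstrip("\n"))
    return hits


def dhcp_is_unserviceable(lines):
    """True if the log shows that no DHCP listening socket could be opened."""
    return "DHCPSRV_NO_SOCKETS_OPEN" in find_socket_failures(lines)


if __name__ == "__main__":
    sample = [
        "DHCP4_OPEN_SOCKETS_FAILED maximum number of open service sockets "
        "attempts: 0, has been exhausted without success",
        "DHCPSRV_NO_SOCKETS_OPEN no interface configured to listen to DHCP traffic",
    ]
    if dhcp_is_unserviceable(sample):
        print("Kea is running without listening sockets; restart or alert")
```

In practice the check would need to consider only log lines written since the most recent Kea start, otherwise an old failure would keep triggering it.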
Additional context
- ISC DHCP was disabled and not running at the time of the failure
- The "Address already in use" error at 00:36:01 was caused by the old Kea process's sockets not being fully released, not by another DHCP server
- Kea ran for nearly 7 hours in this broken state, performing hourly Lease File Cleanup but serving zero DHCP requests
- This resulted in complete network connectivity loss as leases expired
- Configuration: 8 subnets across multiple VLANs, multi-threading enabled with 8 threads
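The "attempts: 0" in the DHCP4_OPEN_SOCKETS_FAILED message quoted under Key error messages suggests that socket retries were not enabled in the running configuration. Kea's documentation describes `service-sockets-max-retries` and `service-sockets-retry-wait-time` (in milliseconds) under `interfaces-config` for this purpose. The sketch below only builds such a fragment as JSON; the helper name and the values 5 and 5000 are illustrative, and whether OPNsense's generated Kea configuration exposes these settings is an assumption I have not verified.

```python
import json


def interfaces_config_with_retries(interfaces, max_retries=5, retry_wait_ms=5000):
    """Build a Dhcp4 interfaces-config fragment with socket retries enabled.

    The retry values are illustrative and are not OPNsense defaults.
    """
    return {
        "Dhcp4": {
            "interfaces-config": {
                "interfaces": list(interfaces),
                # Retry opening service sockets instead of giving up at once.
                "service-sockets-max-retries": max_retries,
                # Wait time between retries, in milliseconds.
                "service-sockets-retry-wait-time": retry_wait_ms,
            }
        }
    }


if __name__ == "__main__":
    # vlan08 is the interface named in the log excerpt in this report.
    print(json.dumps(interfaces_config_with_retries(["vlan08"]), indent=2))
```

With retries enabled, a short-lived bind conflict during a service restart would get several more attempts rather than leaving the daemon running with no sockets.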
Environment
OPNsense 26.1.1 (amd64, upgraded from 26.1_4)
Kea DHCP 3.0.2
Deciso DEC3860 (AMD EPYC 3201, 32GB DDR4 RAM, 4x GbE + 2x SFP+ 10Gbps)