fix: refuse to start a second lemond when the port is already in use (#2255)#2258
fix: refuse to start a second lemond when the port is already in use (#2255)#2258siavashhub wants to merge 25 commits into
Conversation
…ns, allows TIME_WAIT rebinds for faster restart which will fix the ci tests
The exclusive socket-options override and the over-aggressive bind-failure handling broke startup on dual-stack hosts (e.g. macOS CI), which surfaced as backend subprocesses like whisper-server "failing to start or become ready" even though the main server briefly reported reachable. Two issues: 1. The real listeners dropped SO_REUSEPORT via exclusive_socket_options. Because httplib binds IPv6 with IPV6_V6ONLY=0, the IPv6 wildcard "::" overlaps the already-bound IPv4 wildcard "0.0.0.0", and only SO_REUSEPORT lets the two coexist. Restore httplibs default socket options on the real listeners and rely solely on the preflight port_is_available() probe for duplicate-instance detection. 2. The run loop treated ANY listener bind failure as fatal and called stop(), so on a dual-stack host an IPv6 bind failure would tear down a working IPv4 listener right after the readiness check passed. Only treat a TOTAL failure (no family bound) as fatal; a partial failure now keeps serving on the family that bound successfully, matching the original behavior.
fl0rianr
left a comment
There was a problem hiding this comment.
Thanks, this looks like the right direction for the sequential duplicate-start case from #2255.
One concern: the new check is a preflight bind probe, but the probe socket is closed before the real httplib listeners bind, and the real listeners intentionally keep the default reuse options. That means two lemond processes started at nearly the same time can both pass the preflight and then both bind via the real listener path. If the intended guarantee is only “second process started after the first is already listening”, this is probably fine, but it may be worth documenting the remaining TOCTOU window or adding a Lemonade-specific per-host/port lock.
I’d also strongly suggest adding a direct regression test for #2255: start one lemond on a free port, attempt a second lemond on the same port, assert non-zero exit + expected stderr, and assert the first server is still healthy.
Thanks for the hard work iterating with CI, the remaining error is unrelated and will be fixed once #2303 is merged.
No problem. |
|
added regression test as well |
fl0rianr
left a comment
There was a problem hiding this comment.
LGTM — thanks for adding the direct duplicate-port regression test.
The implementation now covers the reported #2255 scenario: an existing lemond is already listening, a second lemond on the same port exits non-zero with a clear error, and the original server remains healthy. The added endpoint test exercises exactly that behavior.
Assuming the remaining CI failures are unrelated, this looks good to merge.
Summary
Fixes #2255
Starting a second
lemondon a port that's already in use (e.g. 13305) used to silently succeed. The second instance now detects the conflict, prints error and exits with a non-zero code.Root cause
The front HTTP listeners inherited cpp-httplib's
default_socket_options, which enable address/port reuse:SO_REUSEPORT: multiple sockets may bind the samehost:portand the kernel load-balances connections across them.SO_REUSEADDR: the second socket can hijack the port.Because the bind appeared to succeed, the existing duplicate-instance guard in the run loop never fired and the process kept running as a confusing second listener.
Changes
Server (core fix)
Exclusive bind: A duplicate bind now fails with
EADDRINUSE.Early, readable failure: Added a preflight bind probe at the top of
Server::run()that detects an in-use port before the WebSocket server starts.Fast exit: On a port conflict, will write to
stderrand exit immediately viastd::_Exit(1)(gated by a newstartup_failed()flag), skipping the multi-second destructor teardown so the error stays as the last visible line and the process returns a non-zero exit code for scripting.Files:
src/cpp/server/server.cpp,src/cpp/include/lemon/server.h,src/cpp/server/main.cpp.CI pipeline (required by the stricter bind)
Making
lemondrefuse an in-use port means the non-containerized self-hosted jobs could now hard-fail: they share the host network, so a concurrent job may already hold the default port 13305..github/actions/install-lemonade-deb: new optionalportinput (default13305, orauto). Inautomode it picks a free loopback port, retries on a fresh port if it loses the pick→bind race, and exportsLEMONADE_PORT/LEMONADE_TEST_PORTso the downstream test steps and thelemonadeCLI target the resolved port..github/workflows/cpp_server_build_test_release.yml: the self-hosted jobs now passport: autoto the action. (The containerizedtest-cli-endpoints-linuxjob keeps the default port — it has network isolation.)Tests (required by the stricter bind)
test/test_llamacpp_system_backend.pyrepeatedly stops and restartslemond. With the stricter bind, two latent problems became real failures:lemondshuts downllama-serverhandles--versionSetting thesystembackend makeslemondrunllama-server --versionto read its version. The test's fakellama-serverdidn't handle--versionand instead ran forever, so that call hung. It now prints a version and exits, like the real binary. (This only started failing once the fresh-port change above stopped accidentally hiding it.)Behavior
Before:
the second instance kept running and competed for requests.
After:
(The reported IP is whichever of the resolved IPv4/IPv6 loopback addresses is in use.)
Testing
lemondon Windows (MSVC).line and exits with code 1; instance 1 keeps serving uninterrupted.