
fix: refuse to start a second lemond when the port is already in use (#2255) #2258

Open
siavashhub wants to merge 25 commits into lemonade-sdk:main from siavashhub:fix/2255

@siavashhub siavashhub commented Jun 15, 2026

Collaborator

Summary

Fixes #2255

Starting a second lemond on a port that's already in use (e.g. 13305) used to silently succeed. The second instance now detects the conflict, prints an error, and exits with a non-zero code.

Root cause

The front HTTP listeners inherited cpp-httplib's default_socket_options, which enable address/port reuse:

  • Linux/macOS (SO_REUSEPORT): multiple sockets may bind the same host:port and the kernel load-balances connections across them.
  • Windows (SO_REUSEADDR): the second socket can hijack the port.

Because the bind appeared to succeed, the existing duplicate-instance guard in the run loop never fired and the process kept running as a confusing second listener.
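To illustrate the root cause, here is a minimal Python sketch (not the lemond or cpp-httplib code) showing that a second bind to the same loopback port succeeds when both sockets set SO_REUSEPORT and fails with EADDRINUSE otherwise. The helper name `second_bind_succeeds` is hypothetical, and the sketch assumes a platform where Python exposes `socket.SO_REUSEPORT` (typically Linux and macOS).

```python
import errno
import socket


def second_bind_succeeds(reuse_port: bool) -> bool:
    """Return True if two sockets can bind the same 127.0.0.1 port."""
    option = getattr(socket, "SO_REUSEPORT", None)
    if reuse_port and option is None:
        raise OSError("SO_REUSEPORT is not available on this platform")

    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        if reuse_port:
            # Both sockets must opt in for the kernel to allow the shared bind.
            first.setsockopt(socket.SOL_SOCKET, option, 1)
            second.setsockopt(socket.SOL_SOCKET, option, 1)
        first.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        first.listen()
        port = first.getsockname()[1]
        try:
            second.bind(("127.0.0.1", port))
            second.listen()
            return True
        except OSError as exc:
            if exc.errno != errno.EADDRINUSE:
                raise
            return False
    finally:
        first.close()
        second.close()
```

With reuse disabled, the second bind is rejected, which is the behavior a duplicate-instance guard relies on.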

Changes

Server (core fix)

  1. Exclusive bind: A duplicate bind now fails with EADDRINUSE.

  2. Early, readable failure: Added a preflight bind probe at the top of Server::run() that detects an in-use port before the WebSocket server starts.

  3. Fast exit: On a port conflict, lemond writes the error to stderr and exits immediately via std::_Exit(1) (gated by a new startup_failed() flag), skipping the multi-second destructor teardown so the error stays as the last visible line and the process returns a non-zero exit code for scripting.

Files: src/cpp/server/server.cpp, src/cpp/include/lemon/server.h, src/cpp/server/main.cpp.
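The preflight probe and fast exit described above can be sketched in Python as follows. This is an illustration of the idea only, not the C++ implementation: `port_is_available` mirrors the name used in this PR, while `exit_on_port_conflict` is a hypothetical helper, and the sketch handles IPv4 only. `os._exit(1)` plays the role of `std::_Exit(1)` by skipping normal interpreter teardown.

```python
import os
import socket
import sys


def port_is_available(host: str, port: int) -> bool:
    """Try to bind a throwaway socket without any reuse options."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False
    # The probe socket is closed here, before any real listener binds.
    return True


def exit_on_port_conflict(host: str, port: int) -> None:
    """Print a final error line and exit immediately if the port is taken."""
    if not port_is_available(host, port):
        sys.stderr.write(
            f"[Server] ERROR: Port {port} on {host} is already in use.\n"
        )
        sys.stderr.flush()
        os._exit(1)  # skip cleanup so the error stays the last visible line
```

Because the probe is closed before the real bind happens, a check like this only detects a server that is already listening when the probe runs.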

CI pipeline (required by the stricter bind)

Making lemond refuse an in-use port means the non-containerized self-hosted jobs could now hard-fail: they share the host network, so a concurrent job may already hold the default port 13305.

  • .github/actions/install-lemonade-deb: new optional port input (default 13305, or auto). In auto mode it picks a free loopback port, retries on a fresh port if it loses the pick→bind race, and exports LEMONADE_PORT / LEMONADE_TEST_PORT so the downstream test steps and the lemonade CLI target the resolved port.
  • .github/workflows/cpp_server_build_test_release.yml: the self-hosted jobs now pass port: auto to the action. (The containerized test-cli-endpoints-linux job keeps the default port — it has network isolation.)
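The auto mode's pick-then-retry logic could look roughly like the Python sketch below. It is an assumption-laden illustration rather than the action's actual script: `pick_free_port`, `start_on_free_port`, `port_env`, and the `start` callback are hypothetical names, and only the two environment variable names come from this PR.

```python
import socket
from typing import Callable, Dict


def pick_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for a currently free loopback port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))
        return s.getsockname()[1]


def start_on_free_port(start: Callable[[int], bool], attempts: int = 5) -> int:
    """Call start(port) on fresh ports until it reports success.

    The port can be taken between picking it and binding it, so a failed
    start is retried on a newly picked port instead of the same one.
    """
    for _ in range(attempts):
        port = pick_free_port()
        if start(port):
            return port
    raise RuntimeError(f"no free port could be bound after {attempts} attempts")


def port_env(port: int) -> Dict[str, str]:
    """Environment variables that downstream steps would read."""
    return {"LEMONADE_PORT": str(port), "LEMONADE_TEST_PORT": str(port)}
```

Retrying on a fresh port, rather than waiting for the lost port to free up, keeps the job independent of whatever else is running on the shared host.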

Tests (required by the stricter bind)

test/test_llamacpp_system_backend.py repeatedly stops and restarts lemond. With the stricter bind, two latent problems became real failures:

  1. Fresh port per instance: allows an immediate rebind when lemond shuts down.
  2. Mock llama-server handles --version: Setting the system backend makes lemond run llama-server --version to read its version. The test's fake llama-server didn't handle --version and instead ran forever, so that call hung. It now prints a version and exits, like the real binary. (This only started failing once the fresh-port change above stopped accidentally hiding it.)
  3. Pull targets the live port: One test helper was still pointing at the old fixed port; it now uses the current instance's port.
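As a rough illustration of the mock fix in item 2, a fake llama-server entry point might branch on --version before doing anything long-running. The sketch below is hypothetical: the function names and the version string are placeholders, not the test's real mock.

```python
import time
from typing import List

# Placeholder output; the real mock's version string is not shown in this PR.
FAKE_VERSION = "version: 0 (fake)"


def serve_forever() -> None:
    """Stand-in for the mock's long-running server loop."""
    while True:
        time.sleep(1)


def main(argv: List[str]) -> int:
    """Handle --version first so version queries return immediately."""
    if "--version" in argv:
        print(FAKE_VERSION)
        return 0
    serve_forever()
    return 0
```

A script wrapper would call `sys.exit(main(sys.argv[1:]))`, so a `--version` query exits with code 0 instead of hanging.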

Behavior

Before:
the second instance kept running and competed for requests.

After:

[Server] ERROR: Port 13305 on 127.0.0.1 is already in use. Another Lemonade server (lemond) is likely already running on this port. This instance will now exit.

(The reported IP is whichever of the resolved IPv4/IPv6 loopback addresses is in use.)

Testing

  • Built lemond on Windows (MSVC).
  • Started instance 1 on port 13305 (confirmed listening).
  • Started instance 2 on the same port → prints the error above as the final
    line and exits with code 1; instance 1 keeps serving uninterrupted.

@github-actions github-actions bot added the bug (Something isn't working) and cpp labels Jun 15, 2026
…ns, allows TIME_WAIT rebinds for faster restart which will fix the ci tests
The exclusive socket-options override and the over-aggressive bind-failure
handling broke startup on dual-stack hosts (e.g. macOS CI), which surfaced as
backend subprocesses like whisper-server "failing to start or become ready"
even though the main server briefly reported reachable.
Two issues:

1. The real listeners dropped SO_REUSEPORT via exclusive_socket_options.
   Because httplib binds IPv6 with IPV6_V6ONLY=0, the IPv6 wildcard "::"
   overlaps the already-bound IPv4 wildcard "0.0.0.0", and only SO_REUSEPORT
   lets the two coexist. Restore httplib's default socket options on the real
   listeners and rely solely on the preflight port_is_available() probe for
   duplicate-instance detection.

2. The run loop treated ANY listener bind failure as fatal and called stop(),
   so on a dual-stack host an IPv6 bind failure would tear down a working IPv4
   listener right after the readiness check passed. Only treat a TOTAL failure
   (no family bound) as fatal; a partial failure now keeps serving on the
   family that bound successfully, matching the original behavior.
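
The total-versus-partial failure rule in point 2 can be summarized with a small Python sketch. The function name `bind_outcome` and the result labels are hypothetical and are not taken from the C++ run loop.

```python
from typing import Dict


def bind_outcome(bound: Dict[str, bool]) -> str:
    """Classify per-family bind results, e.g. {"ipv4": True, "ipv6": False}."""
    if not any(bound.values()):
        return "fatal"      # no family bound: stop the server
    if all(bound.values()):
        return "ok"         # every family bound
    return "partial"        # keep serving on the family that bound


# Example: an IPv6 failure on a dual-stack host no longer tears down IPv4.
print(bind_outcome({"ipv4": True, "ipv6": False}))
```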
@siavashhub siavashhub self-assigned this Jun 17, 2026

@fl0rianr fl0rianr left a comment

Collaborator


Thanks, this looks like the right direction for the sequential duplicate-start case from #2255.

One concern: the new check is a preflight bind probe, but the probe socket is closed before the real httplib listeners bind, and the real listeners intentionally keep the default reuse options. That means two lemond processes started at nearly the same time can both pass the preflight and then both bind via the real listener path. If the intended guarantee is only “second process started after the first is already listening”, this is probably fine, but it may be worth documenting the remaining TOCTOU window or adding a Lemonade-specific per-host/port lock.

I’d also strongly suggest adding a direct regression test for #2255: start one lemond on a free port, attempt a second lemond on the same port, assert non-zero exit + expected stderr, and assert the first server is still healthy.

Thanks for the hard work iterating with CI, the remaining error is unrelated and will be fixed once #2303 is merged.
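
A regression test along the lines suggested above might be shaped like the following Python sketch. It deliberately uses a tiny stand-in server (`FAKE_SERVER`) instead of the real lemond binary, so the command line, the readiness signal, and the helper names are all assumptions; a real test would launch lemond and probably check its health endpoint.

```python
import socket
import subprocess
import sys
import textwrap

# Stand-in for lemond: binds without reuse options and fails loudly on conflict.
FAKE_SERVER = textwrap.dedent('''
    import socket, sys
    port = int(sys.argv[1])
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind(("127.0.0.1", port))
    except OSError:
        print(f"ERROR: Port {port} on 127.0.0.1 is already in use.", file=sys.stderr)
        sys.exit(1)
    sock.listen()
    print("READY", flush=True)
    sys.stdin.read()  # stay alive until the parent closes stdin
''')


def free_port() -> int:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


def run_duplicate_start_check():
    """Start one server, try a second on the same port, report what happened."""
    port = free_port()
    cmd = [sys.executable, "-c", FAKE_SERVER, str(port)]
    first = subprocess.Popen(
        cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True
    )
    try:
        assert first.stdout.readline().strip() == "READY"
        second = subprocess.run(
            cmd, stdin=subprocess.DEVNULL, capture_output=True, text=True, timeout=30
        )
        # The first server must still be running and accepting connections.
        with socket.create_connection(("127.0.0.1", port), timeout=5):
            reachable = True
        healthy = reachable and first.poll() is None
        return second.returncode, second.stderr, healthy
    finally:
        first.stdin.close()
        first.wait(timeout=30)
        first.stdout.close()
```

The three returned values map onto the three assertions requested: non-zero exit, expected stderr, and a first server that is still healthy.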

@siavashhub siavashhub requested a review from jeremyfowers June 18, 2026 16:10
@siavashhub

Collaborator Author

> I’d also strongly suggest adding a direct regression test for #2255: start one lemond on a free port, attempt a second lemond on the same port, assert non-zero exit + expected stderr, and assert the first server is still healthy.
>
> Thanks for the hard work iterating with CI, the remaining error is unrelated and will be fixed once #2303 is merged.

No problem.
I don't have access to my PC today; I will add the test tomorrow or over the weekend.

@siavashhub

Collaborator Author

Added a regression test as well.

@fl0rianr fl0rianr left a comment

Collaborator


LGTM — thanks for adding the direct duplicate-port regression test.

The implementation now covers the reported #2255 scenario: an existing lemond is already listening, a second lemond on the same port exits non-zero with a clear error, and the original server remains healthy. The added endpoint test exercises exactly that behavior.

Assuming the remaining CI failures are unrelated, this looks good to merge.


Development

Successfully merging this pull request may close these issues.

Raise an error when lemond is started on a port that already has lemond