fix hang when server startup fails mid-sequence#176
Open
jgknight wants to merge 1 commit into
Open
Conversation
70506cb to
fb96f19
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #176 +/- ##
==========================================
+ Coverage 72.84% 72.96% +0.12%
==========================================
Files 50 50
Lines 13141 13165 +24
==========================================
+ Hits 9572 9606 +34
+ Misses 2942 2931 -11
- Partials 627 628 +1 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
When a server (e.g. OSCAR) fails to bind its listener, the errgroup context is cancelled and main triggers Shutdown on all servers before their ListenAndServe goroutines have been scheduled. If a server's ListenAndServe then runs after its Shutdown has already been called, the new listener never gets added to s.listeners before cleanupListeners ran, so it is never closed and acceptLoop blocks forever. Fix by introducing tryAddListener, which holds a mutex shared with cleanupListeners. This makes the two operations mutually exclusive: either the listener is appended before cleanup runs (and cleanup closes it), or cleanup runs first (setting listenersDone, causing tryAddListener to close the listener and return false). There is no window in which a listener can be bound but escape cleanup. Add a unit test for each server verifying that ListenAndServe returns cleanly when Shutdown has already run before the goroutine is scheduled. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
fb96f19 to
5203e5a
Compare
mk6i
reviewed
Mar 20, 2026
| return fmt.Errorf("unable to start TOC server: %w", err) | ||
| } | ||
|
|
||
| // Atomically append the listener, or bail out if Shutdown already ran. |
Owner
There was a problem hiding this comment.
If the listener length is 2 and the second call to tryAddListener for the second item returns false, won't the server just continue running since the errgroup has something to block on?
Owner
There was a problem hiding this comment.
im gonna take a bit more time to understand
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
I noticed that with an invalid listener config (such as entering the wrong bind IP address), that sometimes the server would immediately print shutdown messages and hang until a kill -9 was used.
After analyzing the issue with Claude, it was determined to be a race condition. If a bind failure occurs, we start firing the Shutdown sequence on the TOC and OSCAR servers, but it's possible that ListenAndServe has not added the listener yet.
Using a mutex prevents the race occurring. All of the server listeners will be appended before cleanup runs.
Testing
I setup my config using a listen address of 1.0.0.0. After restarting oos several times, I reproduce the hang.
With the fix in place, I restarted oos several times and can no longer reproduce the hang.
Unit tests were added to exercise the change. If I undo the bug fix and run the new unit tests (with a timeout) they fail
With a properly configured settings.env (valid listener IP), oos starts up successfully and I'm able to connect to the server using a windows AIM client.