Skip to content

fix hang when server startup fails mid-sequence#176

Open
jgknight wants to merge 1 commit into
mk6i:mainfrom
jgknight:claude/suspicious-ishizaka
Open

fix hang when server startup fails mid-sequence#176
jgknight wants to merge 1 commit into
mk6i:mainfrom
jgknight:claude/suspicious-ishizaka

Conversation

@jgknight
Copy link
Copy Markdown
Contributor

Summary

I noticed that with an invalid listener config (such as entering the wrong bind IP address), that sometimes the server would immediately print shutdown messages and hang until a kill -9 was used.

After analyzing the issue with Claude, it was determined to be a race condition. If a bind failure occurs, we start firing the Shutdown sequence on the TOC and OSCAR servers, but it's possible that ListenAndServe has not added the listener yet.

If a server's ListenAndServe then runs after its Shutdown has already been called, the new listener never gets added to s.listeners before cleanupListeners ran, so it is never closed and acceptLoop blocks forever.

Using a mutex prevents the race occurring. All of the server listeners will be appended before cleanup runs.

Testing

I setup my config using a listen address of 1.0.0.0. After restarting oos several times, I reproduce the hang.

export OSCAR_LISTENERS=LAN1://1.0.0.0:5190,LAN2://192.168.77.1:5190

./ras-toc2-mac
Successfully loaded config file (settings.env)
time=2026-03-19T19:45:14.768-04:00 level=DEBUG msg="no kerberos listeners defined" svc=Kerberos
time=2026-03-19T19:45:14.768-04:00 level=DEBUG msg="Initiating graceful shutdown..." svc=OSCAR
time=2026-03-19T19:45:14.768-04:00 level=INFO msg="shutdown complete" svc=OSCAR
time=2026-03-19T19:45:14.768-04:00 level=INFO msg="shutdown complete" svc=API
time=2026-03-19T19:45:14.768-04:00 level=DEBUG msg="Initiating graceful shutdown..." svc=TOC
time=2026-03-19T19:45:14.768-04:00 level=INFO msg="starting server" svc=API addr=127.0.0.1:8080
time=2026-03-19T19:45:14.769-04:00 level=INFO msg="shutdown complete" svc=TOC
time=2026-03-19T19:45:14.769-04:00 level=INFO msg="starting server" svc=TOC listen_host=0.0.0.0:9898


^C^C^C[1]    30164 killed     ./ras-toc2-mac

With the fix in place, I restarted oos several times and can no longer reproduce the hang.

Unit tests were added to exercise the change. If I undo the bug fix and run the new unit tests (with a timeout) they fail

go test ./server/oscar/ ./server/toc/ -run "TestServer_ListenAndServe_ShutdownBeforeStart" -v -timeout 15s 
=== RUN   TestServer_ListenAndServe_ShutdownBeforeStart
2026/03/19 20:59:32 INFO starting server listen_address=:0 advertised_host_plain=localhost
panic: test timed out after 15s
        running tests:
                TestServer_ListenAndServe_ShutdownBeforeStart (15s)

goroutine 25 [running]:
testing.(*M).startAlarm.func1()
        /opt/homebrew/Cellar/go/1.25.5/libexec/src/testing/testing.go:2682 +0x2b0
created by time.goFunc
        /opt/homebrew/Cellar/go/1.25.5/libexec/src/time/sleep.go:215 +0x38

goroutine 1 [chan receive]:
testing.(*T).Run(0x14000003c00, {0x1049bef44?, 0x1400002fb38?}, 0x104bcfcb8)
        /opt/homebrew/Cellar/go/1.25.5/libexec/src/testing/testing.go:2005 +0x378
testing.runTests.func1(0x14000003c00)
        /opt/homebrew/Cellar/go/1.25.5/libexec/src/testing/testing.go:2477 +0x38
testing.tRunner(0x14000003c00, 0x1400002fc68)
        /opt/homebrew/Cellar/go/1.25.5/libexec/src/testing/testing.go:1934 +0xc8
testing.runTests(0x1400000e210, {0x1050b6800, 0x50, 0x50}, {0x1400038fe60?, 0x7?, 0x1050bc3e0?})
        /opt/homebrew/Cellar/go/1.25.5/libexec/src/testing/testing.go:2475 +0x3b8
testing.(*M).Run(0x140000d6f00)
        /opt/homebrew/Cellar/go/1.25.5/libexec/src/testing/testing.go:2337 +0x530
main.main()
        _testmain.go:203 +0x80

goroutine 23 [chan receive]:
github.com/mk6i/open-oscar-server/server/oscar.(*Server).ListenAndServe(0x1400015c480)
        /Users/josh/code/open-oscar-server/server/oscar/server.go:110 +0x358
github.com/mk6i/open-oscar-server/server/oscar.TestServer_ListenAndServe_ShutdownBeforeStart(0x14000003dc0)
        /Users/josh/code/open-oscar-server/server/oscar/server_test.go:43 +0x140
testing.tRunner(0x14000003dc0, 0x104bcfcb8)
        /opt/homebrew/Cellar/go/1.25.5/libexec/src/testing/testing.go:1934 +0xc8
created by testing.(*T).Run in goroutine 1
        /opt/homebrew/Cellar/go/1.25.5/libexec/src/testing/testing.go:1997 +0x364

goroutine 24 [IO wait]:
internal/poll.runtime_pollWait(0x105499200, 0x72)
        /opt/homebrew/Cellar/go/1.25.5/libexec/src/runtime/netpoll.go:351 +0xa0
internal/poll.(*pollDesc).wait(0x1400018c880?, 0x105056ee0?, 0x0)
        /opt/homebrew/Cellar/go/1.25.5/libexec/src/internal/poll/fd_poll_runtime.go:84 +0x28
internal/poll.(*pollDesc).waitRead(...)
        /opt/homebrew/Cellar/go/1.25.5/libexec/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Accept(0x1400018c880)
        /opt/homebrew/Cellar/go/1.25.5/libexec/src/internal/poll/fd_unix.go:613 +0x21c
net.(*netFD).accept(0x1400018c880)
        /opt/homebrew/Cellar/go/1.25.5/libexec/src/net/fd_unix.go:161 +0x28
net.(*TCPListener).accept(0x140000fab00)
        /opt/homebrew/Cellar/go/1.25.5/libexec/src/net/tcpsock_posix.go:159 +0x24
net.(*TCPListener).Accept(0x140000fab00)
        /opt/homebrew/Cellar/go/1.25.5/libexec/src/net/tcpsock.go:380 +0x2c
github.com/mk6i/open-oscar-server/server/oscar.(*Server).acceptLoop(0x1400015c480, {0x104bd8b80, 0x140000fab00}, {{0x1049a29de, 0x2}, {0x1049a706c, 0x9}, {0x0, 0x0}, {0x0, ...}, ...})
        /Users/josh/code/open-oscar-server/server/oscar/server.go:143 +0x88
created by github.com/mk6i/open-oscar-server/server/oscar.(*Server).ListenAndServe in goroutine 23
        /Users/josh/code/open-oscar-server/server/oscar/server.go:107 +0x60
FAIL    github.com/mk6i/open-oscar-server/server/oscar  15.259s
=== RUN   TestServer_ListenAndServe_ShutdownBeforeStart
--- FAIL: TestServer_ListenAndServe_ShutdownBeforeStart (0.00s)
panic: runtime error: invalid memory address or nil pointer dereference [recovered, repanicked]
[signal SIGSEGV: segmentation violation code=0x2 addr=0x0 pc=0x1045ec630]

goroutine 7 [running]:
testing.tRunner.func1.2({0x104cf4520, 0x10534a940})
        /opt/homebrew/Cellar/go/1.25.5/libexec/src/testing/testing.go:1872 +0x190
testing.tRunner.func1()
        /opt/homebrew/Cellar/go/1.25.5/libexec/src/testing/testing.go:1875 +0x31c
panic({0x104cf4520?, 0x10534a940?})
        /opt/homebrew/Cellar/go/1.25.5/libexec/src/runtime/panic.go:783 +0x120
log/slog.(*Logger).Handler(...)
        /opt/homebrew/Cellar/go/1.25.5/libexec/src/log/slog/logger.go:121
log/slog.(*Logger).Enabled(0x140000a4ae8?, {0x104dc1bb0?, 0x140000f8af0?}, 0x104dc1a28?)
        /opt/homebrew/Cellar/go/1.25.5/libexec/src/log/slog/logger.go:168 +0x20
log/slog.(*Logger).log(0x0, {0x104dc1bb0?, 0x140000f8af0?}, 0x0, {0x104afa640, 0xf}, {0x1400020fea8, 0x2, 0x2})
        /opt/homebrew/Cellar/go/1.25.5/libexec/src/log/slog/logger.go:244 +0x6c
log/slog.(*Logger).InfoContext(...)
        /opt/homebrew/Cellar/go/1.25.5/libexec/src/log/slog/logger.go:214
github.com/mk6i/open-oscar-server/server/toc.(*Server).ListenAndServe(0x14000208000)
        /Users/josh/code/open-oscar-server/server/toc/server.go:196 +0x160
github.com/mk6i/open-oscar-server/server/toc.TestServer_ListenAndServe_ShutdownBeforeStart(0x14000485180)
        /Users/josh/code/open-oscar-server/server/toc/server_test.go:31 +0x134
testing.tRunner(0x14000485180, 0x104db4568)
        /opt/homebrew/Cellar/go/1.25.5/libexec/src/testing/testing.go:1934 +0xc8
created by testing.(*T).Run in goroutine 1
        /opt/homebrew/Cellar/go/1.25.5/libexec/src/testing/testing.go:1997 +0x364
FAIL    github.com/mk6i/open-oscar-server/server/toc    0.463s
FAIL

With a properly configured settings.env (valid listener IP), oos starts up successfully and I'm able to connect to the server using a windows AIM client.

@jgknight jgknight force-pushed the claude/suspicious-ishizaka branch from 70506cb to fb96f19 Compare March 20, 2026 01:09
@codecov
Copy link
Copy Markdown

codecov Bot commented Mar 20, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.96%. Comparing base (82e8793) to head (5203e5a).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #176      +/-   ##
==========================================
+ Coverage   72.84%   72.96%   +0.12%     
==========================================
  Files          50       50              
  Lines       13141    13165      +24     
==========================================
+ Hits         9572     9606      +34     
+ Misses       2942     2931      -11     
- Partials      627      628       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

When a server (e.g. OSCAR) fails to bind its listener, the errgroup
context is cancelled and main triggers Shutdown on all servers before
their ListenAndServe goroutines have been scheduled. If a server's
ListenAndServe then runs after its Shutdown has already been called,
the new listener never gets added to s.listeners before cleanupListeners
ran, so it is never closed and acceptLoop blocks forever.

Fix by introducing tryAddListener, which holds a mutex shared with
cleanupListeners. This makes the two operations mutually exclusive:
either the listener is appended before cleanup runs (and cleanup closes
it), or cleanup runs first (setting listenersDone, causing tryAddListener
to close the listener and return false). There is no window in which a
listener can be bound but escape cleanup.

Add a unit test for each server verifying that ListenAndServe returns
cleanly when Shutdown has already run before the goroutine is scheduled.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@jgknight jgknight force-pushed the claude/suspicious-ishizaka branch from fb96f19 to 5203e5a Compare March 20, 2026 01:28
Comment thread server/toc/server.go
return fmt.Errorf("unable to start TOC server: %w", err)
}

// Atomically append the listener, or bail out if Shutdown already ran.
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the listener length is 2 and the second call to tryAddListener for the second item returns false, won't the server just continue running since the errgroup has something to block on?

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

im gonna take a bit more time to understand

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants