fix(BucketWatcherManager): release mu before watcher.Start to avoid self-deadlock by omer9564 · Pull Request #3 · permitio/opa-nats

omer9564 · 2026-05-04T15:06:14Z

Summary

GetOrCreateWatcher held gwm.mu (write) for the entire body, including the call to watcher.Start(). Start() ultimately writes to the OPA inmem store via Commit, which acquires the store's RWMutex.Lock.

watchBucketBuiltin invokes GetOrCreateWatcher from a goroutine spawned inside a Rego builtin call. The parent goroutine still holds the inmem store's RLock for the lifetime of its read transaction. The spawned writer therefore cannot acquire the store's write lock until the parent's read transaction closes — and meanwhile holds gwm.mu (write). The parent's next watch_bucket call hits HasWatcher(), which needs gwm.mu.RLock() — blocked behind the spawned writer — and the parent never finishes, so the writer never proceeds. Classic 2-resource circular wait.

Once the writer is queued on the inmem store mutex, all new readers (including OPA's /health handler, which goes through Server.canEval → Rego.Eval → getTxn → inmem.store.NewTransaction) are blocked to prevent writer starvation. Within ~180s the kubelet liveness probe fails 36× and kills the pod (exit 137, "Failed to shutdown server gracefully").
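The writer-preference behaviour described above can be reproduced with a bare sync.RWMutex. The following is a minimal sketch, not code from this repository: the `store` variable stands in for the inmem store's lock, and `pendingWriterBlocksReaders` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// pendingWriterBlocksReaders reports whether a new reader is refused while a
// writer is queued behind an existing reader on the same sync.RWMutex.
func pendingWriterBlocksReaders() bool {
	var store sync.RWMutex // stand-in for the inmem store's lock

	store.RLock() // the parent query's read transaction

	writerDone := make(chan struct{})
	go func() {
		defer close(writerDone)
		store.Lock() // the spawned Commit; parks until the reader leaves
		store.Unlock()
	}()

	// Poll until the writer has queued. From then on TryRLock fails even
	// though only a reader currently holds the lock.
	blocked := false
	deadline := time.Now().Add(2 * time.Second)
	for time.Now().Before(deadline) {
		if !store.TryRLock() {
			blocked = true
			break
		}
		store.RUnlock()
		time.Sleep(time.Millisecond)
	}

	store.RUnlock() // the parent finishes; the writer can now proceed
	<-writerDone
	return blocked
}

func main() {
	fmt.Println("new readers blocked by queued writer:", pendingWriterBlocksReaders())
}
```

In the incident the parent reader never reached its RUnlock, so every later reader (including the health check) stayed parked behind the queued writer.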

Evidence

Captured goroutine dump from a hung permit-opa pod in production (433KB, 106 goroutines):

 41  [sync.RWMutex.RLock]                 ← new readers piling up
 39  [sync.RWMutex.RLock, 2 minutes]      ← readers stuck on inmem.store.NewTransaction
  1  [sync.RWMutex.Lock,  2 minutes]      ← THE WRITER (this code path)

The writer (opa-nats/natsstore.(*DataTransformer).LoadBucketDataBulk → inmem.(*store).Commit) was spawned by goroutine N which was itself blocked at BucketWatcherManager.HasWatcher waiting on gwm.mu.RLock. "Created by" line of the writer pointed at the same parent N. Pure self-deadlock.

Fix

Hold gwm.mu only for the cache lookup and final ContainsOrAdd. Run NewBucketWatcher and watcher.Start() outside the lock. ContainsOrAdd already resolves the duplicate-create race; the loser is stopped and the winning watcher is returned to the caller.

This is purely a lock-scope reduction — no behavior change in the single-caller fast path. The race window for two concurrent first-time callers on the same bucket is handled exactly the same way it was before (ContainsOrAdd + stop the loser).
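As a rough illustration of the lock-scope reduction described in this section, the sketch below uses a plain map instead of the real LRU cache. The `manager` and `watcher` types and the `start` field are hypothetical stand-ins, not the actual BucketWatcherManager.

```go
package main

import (
	"fmt"
	"sync"
)

type watcher struct {
	bucket  string
	stopped bool
}

type manager struct {
	mu       sync.RWMutex
	watchers map[string]*watcher
	start    func(*watcher) error // hypothetical seam for the slow Start call
}

func (m *manager) HasWatcher(bucket string) bool {
	m.mu.RLock()
	defer m.mu.RUnlock()
	_, ok := m.watchers[bucket]
	return ok
}

func (m *manager) GetOrCreateWatcher(bucket string) (*watcher, error) {
	// Lookup: the lock is held only for the map read.
	m.mu.RLock()
	w, ok := m.watchers[bucket]
	m.mu.RUnlock()
	if ok {
		return w, nil
	}

	// Create and start with no manager lock held.
	w = &watcher{bucket: bucket}
	if err := m.start(w); err != nil {
		return nil, err
	}

	// Insert: the lock is held only for the map write.
	m.mu.Lock()
	defer m.mu.Unlock()
	if existing, ok := m.watchers[bucket]; ok {
		w.stopped = true // we lost the race; hand back the cached watcher
		return existing, nil
	}
	m.watchers[bucket] = w
	return w, nil
}

// demoLockScope checks that HasWatcher stays responsive while start is blocked.
func demoLockScope() (duringStart, afterStart bool) {
	gate := make(chan struct{})
	entered := make(chan struct{})
	m := &manager{
		watchers: map[string]*watcher{},
		start: func(*watcher) error {
			close(entered)
			<-gate // simulate a Start stuck behind the store's write lock
			return nil
		},
	}
	done := make(chan struct{})
	go func() {
		defer close(done)
		m.GetOrCreateWatcher("members")
	}()
	<-entered
	duringStart = m.HasWatcher("members") // does not block on the create
	close(gate)
	<-done
	return duringStart, m.HasWatcher("members")
}

func main() {
	during, after := demoLockScope()
	fmt.Println(during, after) // prints "false true"
}
```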

Test plan

go build ./...
go vet ./...
golangci-lint (via pre-commit) clean
Existing tests pass on a NATS-enabled CI (the local failures are pre-existing — they require a NATS server)
After merge & release: bump in permit-opa and verify the permit-opa pods in prod-us-east stop restarting (they currently restart every ~5–8 minutes due to this bug)

🤖 Generated with Claude Code

…elf-deadlock

GetOrCreateWatcher held gwm.mu (write) for the entire body, including the call to watcher.Start(), which writes to the OPA inmem store via Commit (acquiring the store's RWMutex.Lock). When watchBucketBuiltin spawns this function in a goroutine inside a Rego builtin invocation, the parent goroutine still holds the inmem store's RLock for the lifetime of its read transaction. The spawned writer therefore cannot acquire the store's write lock until the parent's read transaction closes, and meanwhile holds gwm.mu (write). The parent's next watch_bucket call hits HasWatcher(), which needs gwm.mu.RLock() and is blocked behind the spawned writer, so the parent never finishes and the writer never proceeds. Classic 2-resource circular wait.

Once the writer is queued on the inmem store mutex, all new readers (including OPA's /health handler) are blocked to prevent writer starvation. The pod stops responding, and within 180s the kubelet kills it (exit 137).

Fix: hold gwm.mu only for the cache lookup and final cache insert. Run NewBucketWatcher and watcher.Start outside the lock. ContainsOrAdd already resolves the duplicate-create race; the loser is stopped and the winning watcher is returned to the caller.

Verified against a captured goroutine dump from prod permit-opa where 39+ readers were stuck on inmem.store.NewTransaction's RLock, 1 writer (LoadBucketDataBulk -> Commit) was stuck on the inmem store's Lock for >2 minutes, and the parent goroutine that spawned the writer was itself blocked on gwm.mu.RLock inside HasWatcher.

Copilot

Pull request overview

This PR updates the NATS bucket watcher manager to reduce the scope of gwm.mu inside GetOrCreateWatcher, aiming to avoid a production self-deadlock between watcher startup and OPA in-memory store transactions.

Changes:

Moves the initial watcher cache lookup to a read-locked fast path.
Creates and starts new bucket watchers outside gwm.mu.
Reconciles concurrent first-time creation with ContainsOrAdd, stopping the loser and returning the cached watcher.

… before docker compose up CI repeatedly failed with: Bind for 0.0.0.0:4222 failed: port is already allocated The integration test starts an example docker-compose stack that publishes host ports 4222, 8222, 8181, and 31311. When a previous run on the same CI host left containers behind (or another concurrent job is using one of those ports), `docker compose up` fails before any test code runs. Fix: before bringing the stack up, run `docker compose down -v --remove-orphans` and forcibly `docker rm -f` any container still publishing one of our ports. This makes the test self-cleaning and resilient to leaked containers from prior runs.

…ntainers before docker compose up" This reverts commit 0c5f270.

…opping flag

Address review feedback on the previous deadlock fix:

1. Without serialization, two callers could both create+Start a watcher for the same bucket, and stopping the loser would call cleanOPAStore which RemoveOps everything under /nats/kv/<bucket>, wiping the winner's just-loaded data. Add a dedicated createMu mutex that serializes the create-and-start path so only one watcher is ever created per bucket; the second caller sees the cache via a double-check and returns the canonical watcher.
2. Without a stopping flag, BucketWatcherManager.Stop could complete while a slow GetOrCreateWatcher (blocked on the OPA inmem store's write lock) was still in progress, and the in-flight create would then insert a new watcher into the now-purged cache and leave a background watch loop running after shutdown. Add a stopping flag protected by mu, set under createMu by Stop. GetOrCreateWatcher checks it both before Start and again before insert, abandoning a half-built watcher if Stop won the race.
3. createMu is distinct from gwm.mu, so HasWatcher and the fast path are NOT blocked while we are creating. The original self-deadlock fix is preserved: Start runs without gwm.mu held, the parent Rego query can finish and release the inmem store's RLock, and our Commit then proceeds.

Tests:
- TestGetOrCreateWatcher_FastPath
- TestGetOrCreateWatcher_RefusedAfterStop
- TestGetOrCreateWatcher_FastPathAfterStopReturnsCached
- TestStop_IsIdempotent
- TestGetOrCreateWatcher_ConcurrentFastPath

omer9564 · 2026-05-04T15:30:52Z

Thanks @copilot-pull-request-reviewer — all three concerns are valid and now addressed in commit 02f5d4e (force-pushed on top of the original 32cf15e):

Loser's Stop() wipes winner's data (pkg/natsstore/bucket_watcher_manager.go:307 of the old code) — confirmed: BucketWatcher.Stop calls cleanOPAStore which RemoveOps everything under /nats/kv/<bucket>, so a duplicate-create race would have wiped the winner's just-loaded data. Fix: added createMu sync.Mutex that serializes the create-and-start path, so only one watcher per bucket is ever created. The double-check after createMu.Lock() returns the canonical watcher to the second caller. The duplicate-create branch is gone entirely.
Stop() racing with the slow path (line 300) — confirmed: in-flight creates could complete and insert a watcher into a now-purged cache, leaving a background loop running. Fix: added stopping bool (protected by mu). Stop() acquires createMu first to drain any in-flight create, then sets stopping=true under mu.Lock(). GetOrCreateWatcher checks stopping both right after acquiring createMu (early bail) and again under mu.Lock() right before insert (late bail — tears down the half-built watcher if Stop won the race).
No regression test (line 299) — added pkg/natsstore/bucket_watcher_manager_concurrency_test.go with 5 tests:
- TestGetOrCreateWatcher_FastPath
- TestGetOrCreateWatcher_RefusedAfterStop
- TestGetOrCreateWatcher_FastPathAfterStopReturnsCached
- TestStop_IsIdempotent
- TestGetOrCreateWatcher_ConcurrentFastPath

Note that createMu is intentionally distinct from gwm.mu — the original self-deadlock fix is preserved: Start runs without gwm.mu held, HasWatcher callers (using gwm.mu.RLock) are not blocked by an in-flight create, the parent Rego query can finish and release the OPA inmem store's RLock, and the spawned Commit then proceeds.
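The two-mutex arrangement described in this comment might look roughly like the sketch below. It is a simplification under stated assumptions: a plain map replaces the cache, the `creates` counter exists only for the demo, and only the early stopping check is kept. The type and helper names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errStopping = errors.New("manager is stopping")

type watcher struct{ bucket string }

type manager struct {
	createMu sync.Mutex   // serializes the slow create-and-start path
	mu       sync.RWMutex // guards watchers and stopping only
	watchers map[string]*watcher
	stopping bool
	creates  int // demo-only counter, guarded by createMu
}

func (m *manager) GetOrCreateWatcher(bucket string) (*watcher, error) {
	// Fast path: a short read lock, never blocked by an in-flight create.
	m.mu.RLock()
	w, ok := m.watchers[bucket]
	m.mu.RUnlock()
	if ok {
		return w, nil
	}

	m.createMu.Lock()
	defer m.createMu.Unlock()

	// Double-check: another caller may have created it while we waited.
	m.mu.RLock()
	w, ok = m.watchers[bucket]
	stopping := m.stopping
	m.mu.RUnlock()
	if ok {
		return w, nil
	}
	if stopping {
		return nil, errStopping
	}

	// Stand-in for NewBucketWatcher + Start; mu is NOT held here.
	w = &watcher{bucket: bucket}
	m.creates++

	m.mu.Lock()
	m.watchers[bucket] = w
	m.mu.Unlock()
	return w, nil
}

func (m *manager) Stop() {
	m.createMu.Lock() // drains any in-flight create first
	defer m.createMu.Unlock()
	m.mu.Lock()
	m.stopping = true
	m.watchers = map[string]*watcher{}
	m.mu.Unlock()
}

func demoSerialized() (creates int, same bool, afterStop error) {
	m := &manager{watchers: map[string]*watcher{}}
	const callers = 8
	got := make([]*watcher, callers)
	var wg sync.WaitGroup
	for i := 0; i < callers; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			got[i], _ = m.GetOrCreateWatcher("members")
		}(i)
	}
	wg.Wait()
	same = true
	for _, w := range got {
		if w != got[0] {
			same = false
		}
	}
	m.Stop()
	_, afterStop = m.GetOrCreateWatcher("members")
	return m.creates, same, afterStop
}

func main() {
	creates, same, err := demoSerialized()
	fmt.Println(creates, same, err)
}
```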

…rt bindings TestIntegration's example docker-compose published host ports for NATS (4222, 8222) and the NATS UI (31311). The integration test only talks to OPA at 8181, so those bindings are not needed for the test to function — and they collide with the GHA workflow's `services: nats:` (which already binds 4222), making CI fail every time with: Bind for 0.0.0.0:4222 failed: port is already allocated Add a test-only docker-compose override at cmd/opa-nats/docker-compose.test.override.yaml that resets the host-port mappings on `nats` and `nats-ui` to []. The example file itself is not changed, so end users still get the published ports outside of CI. The test now invokes `docker compose -f docker-compose.yaml -f <override>` and also runs an explicit `down -v --remove-orphans` before `up` to clear any leftover containers from previous runs.

Last CI run got past the port-collision but OPA never became ready and all we saw was "context deadline exceeded" with no idea why. Add a helper that dumps `docker compose logs --tail 200` and `docker compose ps -a` when the test fails or when waitForOPA times out, so the next failure surfaces the actual reason in the GHA log.

The docker-compose example pointed OPA at nats://localhost:4222, but inside the OPA container `localhost` is the OPA container itself — not the NATS container. OPA therefore failed at startup with: Failed to connect to NATS: ... nats: no servers available for connection and exited before opening the HTTP server, which is why TestIntegration in CI got "context deadline exceeded" while waiting for /health. Use the docker-compose service name `nats` so OPA can reach NATS via the docker network. A comment notes the standalone (host) value.

…n failure

The previous CI run made it past waitForOPA but then hung in evaluatePolicy's http.Post for 10 minutes, until the Go test framework's default timeout killed the process, before t.Cleanup could run dumpComposeLogs.

- Give evaluatePolicy a 30s per-request HTTP timeout, so a hung OPA surfaces a clear test failure within 30s and t.Cleanup runs.
- Enable OPA's --pprof in the test compose override.
- On failure / OPA-not-ready, dump OPA's full goroutine stack from /debug/pprof/goroutine?debug=2 BEFORE the compose logs. This is the highest-signal diagnostic for the deadlock class of hangs whose fix we are trying to verify.
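A per-request timeout of this kind can be sketched with the standard library alone. The package-level http.Post uses a client with no timeout, so the sketch builds its own client. `postWithTimeout` and `demoHungServer` are hypothetical names, the hung server is simulated with httptest, and the demo uses 50ms rather than 30s so it finishes quickly.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// postWithTimeout posts a JSON body and gives up after timeout, covering
// connect, request, and response-body time.
func postWithTimeout(url string, body []byte, timeout time.Duration) (int, error) {
	client := &http.Client{Timeout: timeout}
	resp, err := client.Post(url, "application/json", bytes.NewReader(body))
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	_, _ = io.Copy(io.Discard, resp.Body)
	return resp.StatusCode, nil
}

// demoHungServer posts to a handler that never answers and returns the
// resulting client error.
func demoHungServer() error {
	release := make(chan struct{})
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		<-release // simulate a server that hangs
	}))
	defer srv.Close()    // runs second: waits for the handler to return
	defer close(release) // runs first: lets the handler return
	_, err := postWithTimeout(srv.URL, []byte(`{"input":{}}`), 50*time.Millisecond)
	return err
}

func main() {
	fmt.Println("error from hung server:", demoHungServer())
}
```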

…erting on call N The previous assertions assumed the second call would already see bucket_watched=true. nats.kv.watch_bucket spawns the watcher registration in a goroutine that completes off the request-handling path — there is no guarantee a follow-up call sees the watcher in the cache yet. Before the deadlock fix this happened to work because the parent's HasWatcher blocked behind the spawned writer (the very deadlock this PR is fixing); after the fix it is a race the test should not be asserting against. Use require.Eventually to poll for bucket_watched=true within 30s. If the watcher registration never completes within the deadline the test still fails loudly, but normal async timing is no longer treated as a bug.
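The polling approach can be illustrated without testify. The `eventually` helper below is a hypothetical standard-library stand-in that mirrors the parameter order of require.Eventually (condition, waitFor, tick); the atomic flag stands in for the asynchronous watcher registration.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// eventually polls cond every tick until it returns true or waitFor elapses.
func eventually(cond func() bool, waitFor, tick time.Duration) bool {
	deadline := time.Now().Add(waitFor)
	for {
		if cond() {
			return true
		}
		if time.Now().After(deadline) {
			return false
		}
		time.Sleep(tick)
	}
}

func demoEventually() (becameTrue, neverTrue bool) {
	var watched atomic.Bool
	go func() {
		time.Sleep(20 * time.Millisecond) // stand-in for the async registration
		watched.Store(true)
	}()
	becameTrue = eventually(watched.Load, time.Second, 5*time.Millisecond)
	neverTrue = eventually(func() bool { return false }, 30*time.Millisecond, 5*time.Millisecond)
	return becameTrue, neverTrue
}

func main() {
	fmt.Println(demoEventually()) // prints "true false"
}
```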

The example rego intentionally returns different shapes for `x` in the two states:

- bucket_watched=false → x = nats.kv.get_data(bucket, "members") (just the members map)
- bucket_watched=true → x = data.nats.kv[bucket] (the entire bucket: members + metadata + permissions)

The original assertion `x1 == x2` assumed otherwise and was always wrong; before the deadlock fix the second call deadlocked instead of returning, so we never got to compare. Now that calls return correctly in both states, compare the members submap explicitly: x_unwatched vs x_watched["members"]

zeevmoney

Check comments, there is a critical bug here.

@zeevmoney

zeevmoney pointed out that ContainsOrAdd was called while gwm.mu.Lock was held, and the LRU was created with NewWithEvict, so an eviction fired the onEviction callback synchronously under the lock. The callback ran BucketWatcher.Stop → cleanOPAStore → opaStore.Commit, which takes the inmem store's write lock and blocks on outstanding read transactions. Same deadlock as the original bug, just shifted from Start to eviction:

1. Goroutine A (this call, spawned from watchBucketBuiltin): holds gwm.mu.Lock, blocked on db.rmu.Lock for the eviction Commit.
2. Goroutine B (the parent Rego query, or any concurrent reader): holds db.rmu.RLock, calls another watch_bucket → HasWatcher → blocked on gwm.mu.RLock.

Default MaxBucketsWatchers = 10, so the eviction path is hot in any deployment with more than 10 buckets, exactly the workload that produced the original incident.

Fix: drop the LRU's eviction callback (lru.New instead of lru.NewWithEvict) and perform eviction manually. Peek capacity and RemoveOldest under gwm.mu (no callbacks fire), drop the lock, then call Stop on the evicted watcher with no manager lock held.

Also addressed in this commit (related cleanups in the same area):
- Remove the dead `if gwm.stopping` re-check after Start. createMu is held for the full slow path and Stop() takes createMu BEFORE setting stopping=true, so stopping cannot transition false→true here.
- Add a comment to Stop()'s Purge() noting that it is now a plain map clear with no eviction callbacks (each watcher was already stopped in the loop above).
- Introduce a `newWatcher` function field on BucketWatcherManager so the regression tests can exercise Start without a real NATS connection. Production behavior unchanged.

Addresses review comments:
- #3 (comment) (@zeevmoney)
- #3 (comment) (@zeevmoney)
- #3 (comment) (@zeevmoney)
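The manual-eviction pattern can be sketched as follows. This is an illustration only: a slice of keys replaces the real LRU, and the `onStop` hook is a hypothetical addition that lets the demo call HasWatcher from inside Stop, which would deadlock if the manager lock were still held.

```go
package main

import (
	"fmt"
	"sync"
)

type watcher struct {
	bucket  string
	stopped bool
	onStop  func() // hypothetical hook so the demo can probe the lock state
}

// Stop stands in for the slow Stop that commits to the OPA store.
func (w *watcher) Stop() {
	if w.onStop != nil {
		w.onStop()
	}
	w.stopped = true
}

type manager struct {
	mu       sync.RWMutex
	capacity int
	order    []string // oldest first; stand-in for the LRU ordering
	watchers map[string]*watcher
}

func (m *manager) HasWatcher(bucket string) bool {
	m.mu.RLock()
	defer m.mu.RUnlock()
	_, ok := m.watchers[bucket]
	return ok
}

func (m *manager) add(w *watcher) {
	var evicted *watcher

	m.mu.Lock()
	if len(m.order) >= m.capacity {
		oldest := m.order[0]
		m.order = m.order[1:]
		evicted = m.watchers[oldest]
		delete(m.watchers, oldest) // plain removal; no callback fires
	}
	m.watchers[w.bucket] = w
	m.order = append(m.order, w.bucket)
	m.mu.Unlock()

	if evicted != nil {
		evicted.Stop() // runs with no manager lock held
	}
}

func demoEviction() (oldestStopped, sawNewDuringStop, oldestCached bool) {
	m := &manager{capacity: 2, watchers: map[string]*watcher{}}
	a := &watcher{bucket: "a"}
	a.onStop = func() { sawNewDuringStop = m.HasWatcher("c") }
	m.add(a)
	m.add(&watcher{bucket: "b"})
	m.add(&watcher{bucket: "c"}) // evicts "a"
	return a.stopped, sawNewDuringStop, m.HasWatcher("a")
}

func main() {
	fmt.Println(demoEviction()) // prints "true true false"
}
```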

@zeevmoney

…enarios

The previous concurrency test file only exercised fast-path lookups, the stopping flag set manually, and Stop() idempotency on an empty manager; it covered none of the genuinely concurrent scenarios that motivated this PR. Reviewers correctly pointed out that a future change could re-introduce the deadlock without any test failing.

Add four regression tests, all running under -race:

- TestGetOrCreateWatcher_DoesNotDeadlockOnReader: The original bug. A reader holds the OPA inmem store's RLock while another goroutine enters GetOrCreateWatcher and tries to Commit (which needs the store's WLock). Asserts both that the create eventually completes after the reader releases its RLock AND that concurrent HasWatcher calls remain responsive throughout, proving gwm.mu is not held across watcher.Start.
- TestGetOrCreateWatcher_DoesNotDeadlockOnEviction: The eviction-deadlock variant zeevmoney flagged. Fills the cache to capacity, then triggers an eviction-during-create with concurrent HasWatcher calls. Asserts every HasWatcher call returns within 100ms, proving gwm.mu is not held across the evicted watcher's Stop. Also asserts the cache contents reflect the eviction.
- TestGetOrCreateWatcher_SerializesSameBucket: Asserts that N concurrent callers for the same bucket cause newWatcher to be invoked exactly once and all callers receive the same (canonical) watcher, proving createMu serializes creates and the double-check returns the canonical watcher to the loser(s).
- TestStop_DrainsInflightCreates: Asserts that Stop blocks until any in-flight GetOrCreateWatcher completes, and that subsequent creates are refused. This is the contract that prevents a watcher from being inserted into the cache after Stop has torn the manager down.

Tests use the new newWatcher seam (defaults to NewBucketWatcher in production) to exercise Start without a real NATS connection. The deadlock test substitutes a Start that opens a write transaction on the supplied opa store and Commits it, faithfully exercising the same lock chain the real cleanOPAStore touches.

Addresses review comments:
- #3 (comment) (@zeevmoney)
- #3 (comment) (@copilot-pull-request-reviewer)

@zeevmoney

…pose.yaml in CI

zeevmoney pointed out (correctly) that an example config should default to the form most users will be running (OPA on the host, reaching NATS at localhost:4222), not the in-compose-network form. The earlier change that flipped the default to nats://nats:4222 was a workaround for the CI integration test, which OPA-inside-compose needed.

This commit:
- Reverts examples/opa-nats/config.yaml back to nats://localhost:4222 so `opa run -c examples/opa-nats/config.yaml` works out of the box on the host.
- Updates examples/opa-nats/config-compose.yaml (which already existed and shipped with the same broken localhost URL) to actually be the in-compose config, with server_url=nats://nats:4222 and a header comment explaining when to use which file.
- Updates cmd/opa-nats/docker-compose.test.override.yaml to mount examples/opa-nats/config-compose.yaml in place of the default config.yaml inside the OPA container, so the CI integration test keeps working with no extra files in cmd/opa-nats/.

Addresses review comment:
- #3 (comment) (@zeevmoney)

Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

zeevmoney

Looks good, see comments.

The newWatcher field comment claimed it was reassigned after Reconfigure, but BucketWatcherManager has no Reconfigure method — Plugin.Reconfigure discards the entire manager and constructs a fresh one. The compose override header described mounting test-config.yaml; the override actually mounts ./config-compose.yaml. Addresses review comments: - #3 (comment) (@copilot-pull-request-reviewer) - #3 (comment) (@copilot-pull-request-reviewer) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@zeevmoney

CreateRootWatcher previously bypassed every concurrency safety mechanism this PR introduced: it did not take createMu, did not check stopping, and the root watcher itself was never stopped from BucketWatcherManager.Stop. Today the only caller is Plugin.Start during init, but if Plugin.Stop ran concurrently with Plugin.Start (e.g. rapid Reconfigure during tests), the root watcher could finish creation after stopping=true and then leak its watch loop past teardown. Any future caller invoking CreateRootWatcher on a live OPA instance would also re-introduce the deadlock the rest of the PR set out to fix. Fold CreateRootWatcher into the same createMu/stopping flow as GetOrCreateWatcher and stop the root watcher in BucketWatcherManager.Stop outside gwm.mu (its Stop takes the OPA store WLock). Addresses review comments: - #3 (comment) (@zeevmoney) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@zeevmoney

…ow-path insert createMu is held end-to-end through GetOrCreateWatcher's slow path, and the cache double-check after acquiring createMu already returned the existing watcher if any. With createMu held, no concurrent writer can populate the cache for bucketName between the double-check and the insert — so gwm.watchers.Contains(bucketName) cannot be true here. The risk is the silent fall-through the code did not write: if Contains ever did fire, both the Add and any Stop of the freshly-created watcher would be skipped while the function still returned watcher with err=nil. The caller would receive a started-but-uncached watcher whose goroutine runs unbounded, holding a NATS subscription and writing to OPA store. Drop the guard and let Add run unconditionally. The explanatory comment above the block is updated to spell out the invariant rather than relying on a defensive check. Addresses review comments: - #3 (comment) (@zeevmoney) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The previous handshake used a single buffered channel for both directions: Stop sent into watcherLoopStopSignal, then received from it expecting watchLoop to have closed it. If Stop's two statements ran before watchLoop reached its select, Stop's own receive drained the buffered value and watchLoop blocked forever waiting for another send, a goroutine leak that also held the watcher's NATS subscription open.

In production this typically worked because watchLoop had been parked in its select for a long time before any Stop ran. The race is real but scheduling-dependent. A test that drives watchLoop and Stop close together exposes it deterministically under -race.

Replace the single channel with a request/ack pair:
- stopReq is closed by Stop (a broadcast: every receiver observes it).
- watcherLoopExited is closed by watchLoop in a defer when the loop returns; Stop blocks on it as the "loop has exited" confirmation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
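The request/ack pair can be sketched in isolation as below. The `loop` type, its `updates` channel, and the sync.Once guard are assumptions made for the illustration; only the roles of the two channels (one closed by Stop, one closed by the loop on exit) come from the description above.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type loop struct {
	stopReq  chan struct{} // closed by Stop: a broadcast every receiver sees
	exited   chan struct{} // closed by run when it returns
	stopOnce sync.Once     // assumption: lets Stop be called more than once
	updates  chan string
}

func newLoop() *loop {
	l := &loop{
		stopReq: make(chan struct{}),
		exited:  make(chan struct{}),
		updates: make(chan string),
	}
	go l.run()
	return l
}

func (l *loop) run() {
	defer close(l.exited) // the ack: fires on every exit path
	for {
		select {
		case <-l.stopReq:
			return
		case <-l.updates:
			// handle an update
		}
	}
}

// Stop requests shutdown and waits for the loop to confirm it has exited.
// Because stopReq is closed rather than sent on, it does not matter whether
// run has reached its select yet.
func (l *loop) Stop() {
	l.stopOnce.Do(func() { close(l.stopReq) })
	<-l.exited
}

// stopsPromptly calls Stop right after starting the loop, the ordering that
// exposed the race in the single-channel handshake.
func stopsPromptly() bool {
	l := newLoop()
	done := make(chan struct{})
	go func() {
		l.Stop()
		l.Stop() // a second call returns immediately
		close(done)
	}()
	select {
	case <-done:
		return true
	case <-time.After(time.Second):
		return false
	}
}

func main() {
	fmt.Println("stop returned promptly:", stopsPromptly())
}
```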

@zeevmoney

…chLoop

Three test changes that respond to review feedback that previous tests were either flaky or did not actually exercise the contract they claimed to verify:

- TestStop_DrainsInflightCreates: replace the 50ms sleep that assumed the in-flight create had acquired createMu before Stop ran with an explicit channel signal from newWatcher. The previous timing-based sync was racy under load/slow CI.
- TestGetOrCreateWatcher_DoesNotDeadlockOnEviction: previously the preloaded watcher was a stub with started=false, so its Stop short-circuited and the eviction-Stop code path was never actually exercised; a regression that moved evictedWatcher.Stop back inside gwm.mu.Lock would have passed silently. Use a real BucketWatcher backed by a stub nats.KeyWatcher whose Stop blocks on a gate so the eviction's Stop is observably in-flight, then assert HasWatcher remains responsive while gated.
- TestBucketWatcher_StopExitsWatchLoop: new test that drives a real BucketWatcher with started=true and a stub KeyWatcher whose Updates never close, then calls Stop and asserts both that Stop returns within 1s and that watchLoop's deferred close fires. This exercises the watcherLoopExited handshake that the rest of the suite (which uses started=false stubs) skips entirely.

This test and the eviction test share infrastructure through stubKeyWatcher and newRealBucketWatcher.

Addresses review comments:
- #3 (comment) (@copilot-pull-request-reviewer)
- #3 (comment) (@zeevmoney)
- #3 (comment) (@zeevmoney)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@zeevmoney

…g daemon dumpComposeLogs runs docker logs and docker ps via exec.Command without a context or deadline. If the docker daemon itself is wedged (rare but seen in CI under OOM-kill aftermath) the cleanup hangs past the test framework's 10-minute kill — and zero diagnostics get logged because the t.Logf calls only fire after CombinedOutput returns. Wrap each docker invocation in exec.CommandContext with a 10s timeout. On context deadline, whatever partial output was captured is still logged. Matches the 10s timeout already used by fetchOPAGoroutineDump. Addresses review comments: - #3 (comment) (@zeevmoney) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

omer9564 requested a review from Copilot May 4, 2026 15:07

Copilot started reviewing on behalf of omer9564 May 4, 2026 15:07

Copilot AI reviewed May 4, 2026

Comment thread pkg/natsstore/bucket_watcher_manager.go Outdated

Comment thread pkg/natsstore/bucket_watcher_manager.go Outdated

Comment thread pkg/natsstore/bucket_watcher_manager.go Outdated

omer9564 added 3 commits May 4, 2026 18:16

Revert "test(integration): aggressively clean up host-port-binding co…

8573092

…ntainers before docker compose up" This reverts commit 0c5f270.

Copilot started work on behalf of omer9564 May 4, 2026 15:31

Copilot finished work on behalf of omer9564 May 4, 2026 15:35

omer9564 added 6 commits May 4, 2026 20:02

omer9564 requested review from Zivxx and zeevmoney May 4, 2026 18:13

zeevmoney requested changes May 5, 2026

omer9564 added 3 commits May 6, 2026 14:48

omer9564 requested review from Copilot and zeevmoney May 6, 2026 12:08

Copilot started reviewing on behalf of omer9564 May 6, 2026 12:09

Copilot AI reviewed May 6, 2026

Comment thread pkg/natsstore/bucket_watcher_manager.go Outdated

Comment thread cmd/opa-nats/docker-compose.test.override.yaml Outdated

Comment thread pkg/natsstore/bucket_watcher_manager_concurrency_test.go Outdated

zeevmoney approved these changes May 10, 2026

omer9564 and others added 3 commits May 10, 2026 15:57

omer9564 and others added 3 commits May 10, 2026 15:59

omer9564 merged commit 24867e8 into main May 10, 2026
4 checks passed

omer9564 deleted the omer/fix-getorcreatewatcher-deadlock branch May 10, 2026 13:06


3 participants