fix(BucketWatcherManager): release mu before watcher.Start to avoid self-deadlock#3
Conversation
…elf-deadlock GetOrCreateWatcher held gwm.mu (write) for the entire body, including the call to watcher.Start() which writes to the OPA inmem store via Commit (acquiring the store's RWMutex.Lock). When watchBucketBuiltin spawns this function in a goroutine inside a Rego builtin invocation, the parent goroutine still holds the inmem store's RLock for the lifetime of its read transaction. The spawned writer therefore cannot acquire the store's write lock until the parent's read transaction closes, and meanwhile holds gwm.mu (write). The parent's next watch_bucket call hits HasWatcher(), which needs gwm.mu.RLock() — blocked behind the spawned writer — and the parent never finishes, so the writer never proceeds. Classic 2-resource circular wait. Once the writer is queued on the inmem store mutex, all new readers (including OPA's /health handler) are blocked to prevent writer starvation, and the pod stops responding within 180s the kubelet kills it (exit 137). Fix: hold gwm.mu only for the cache lookup and final cache insert. Run NewBucketWatcher and watcher.Start outside the lock. ContainsOrAdd already resolves the duplicate-create race; the loser is stopped and the winning watcher is returned to the caller. Verified against a captured goroutine dump from prod permit-opa where 39+ readers were stuck on inmem.store.NewTransaction's RLock, 1 writer (LoadBucketDataBulk -> Commit) on the inmem store's Lock for >2 minutes, and the parent goroutine that spawned the writer was itself blocked on gwm.mu.RLock inside HasWatcher.
There was a problem hiding this comment.
Pull request overview
This PR updates the NATS bucket watcher manager to reduce the scope of gwm.mu inside GetOrCreateWatcher, aiming to avoid a production self-deadlock between watcher startup and OPA in-memory store transactions.
Changes:
- Moves the initial watcher cache lookup to a read-locked fast path.
- Creates and starts new bucket watchers outside
gwm.mu. - Reconciles concurrent first-time creation with
ContainsOrAdd, stopping the loser and returning the cached watcher.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
… before docker compose up CI repeatedly failed with: Bind for 0.0.0.0:4222 failed: port is already allocated The integration test starts an example docker-compose stack that publishes host ports 4222, 8222, 8181, and 31311. When a previous run on the same CI host left containers behind (or another concurrent job is using one of those ports), `docker compose up` fails before any test code runs. Fix: before bringing the stack up, run `docker compose down -v --remove-orphans` and forcibly `docker rm -f` any container still publishing one of our ports. This makes the test self-cleaning and resilient to leaked containers from prior runs.
…ntainers before docker compose up" This reverts commit 0c5f270.
…opping flag Address review feedback on the previous deadlock fix: 1. Without serialization, two callers could both create+Start a watcher for the same bucket, and stopping the loser would call cleanOPAStore which RemoveOps everything under /nats/kv/<bucket> — wiping the winner's just-loaded data. Add a dedicated createMu mutex that serializes the create-and-start path so only one watcher is ever created per bucket; the second caller sees the cache via a double- check and returns the canonical watcher. 2. Without a stopping flag, BucketWatcherManager.Stop could complete while a slow GetOrCreateWatcher (blocked on the OPA inmem store's write lock) was still in progress, and the in-flight create would then insert a new watcher into the now-purged cache and leave a background watch loop running after shutdown. Add a stopping flag protected by mu, set under createMu by Stop. GetOrCreateWatcher checks it both before Start and again before insert, abandoning a half-built watcher if Stop won the race. 3. createMu is distinct from gwm.mu, so HasWatcher and the fast path are NOT blocked while we are creating. The original self-deadlock fix is preserved: Start runs without gwm.mu held, the parent Rego query can finish and release the inmem store's RLock, and our Commit then proceeds. Tests: - TestGetOrCreateWatcher_FastPath - TestGetOrCreateWatcher_RefusedAfterStop - TestGetOrCreateWatcher_FastPathAfterStopReturnsCached - TestStop_IsIdempotent - TestGetOrCreateWatcher_ConcurrentFastPath
|
Thanks @copilot-pull-request-reviewer — all three concerns are valid and now addressed in commit 02f5d4e (force-pushed on top of the original 32cf15e):
Note that |
…rt bindings TestIntegration's example docker-compose published host ports for NATS (4222, 8222) and the NATS UI (31311). The integration test only talks to OPA at 8181, so those bindings are not needed for the test to function — and they collide with the GHA workflow's `services: nats:` (which already binds 4222), making CI fail every time with: Bind for 0.0.0.0:4222 failed: port is already allocated Add a test-only docker-compose override at cmd/opa-nats/docker-compose.test.override.yaml that resets the host-port mappings on `nats` and `nats-ui` to []. The example file itself is not changed, so end users still get the published ports outside of CI. The test now invokes `docker compose -f docker-compose.yaml -f <override>` and also runs an explicit `down -v --remove-orphans` before `up` to clear any leftover containers from previous runs.
Last CI run got past the port-collision but OPA never became ready and all we saw was "context deadline exceeded" with no idea why. Add a helper that dumps `docker compose logs --tail 200` and `docker compose ps -a` when the test fails or when waitForOPA times out, so the next failure surfaces the actual reason in the GHA log.
The docker-compose example pointed OPA at nats://localhost:4222, but inside the OPA container `localhost` is the OPA container itself — not the NATS container. OPA therefore failed at startup with: Failed to connect to NATS: ... nats: no servers available for connection and exited before opening the HTTP server, which is why TestIntegration in CI got "context deadline exceeded" while waiting for /health. Use the docker-compose service name `nats` so OPA can reach NATS via the docker network. A comment notes the standalone (host) value.
…n failure The previous CI run made it past waitForOPA but then hung in evaluatePolicy's http.Post for 10 minutes, until the Go test framework's default timeout killed the process — before t.Cleanup could run dumpComposeLogs. - Give evaluatePolicy a 30s per-request HTTP timeout, so a hung OPA surfaces a clear test failure within 30s and t.Cleanup runs. - Enable OPA's --pprof in the test compose override. - On failure / OPA-not-ready, dump OPA's full goroutine stack from /debug/pprof/goroutine?debug=2 BEFORE the compose logs. This is the highest-signal diagnostic for the deadlock-class of hangs we are trying to verify the fix for.
…erting on call N The previous assertions assumed the second call would already see bucket_watched=true. nats.kv.watch_bucket spawns the watcher registration in a goroutine that completes off the request-handling path — there is no guarantee a follow-up call sees the watcher in the cache yet. Before the deadlock fix this happened to work because the parent's HasWatcher blocked behind the spawned writer (the very deadlock this PR is fixing); after the fix it is a race the test should not be asserting against. Use require.Eventually to poll for bucket_watched=true within 30s. If the watcher registration never completes within the deadline the test still fails loudly, but normal async timing is no longer treated as a bug.
The example rego intentionally returns different shapes for `x` in the
two states:
bucket_watched=false → x = nats.kv.get_data(bucket, "members")
(just the members map)
bucket_watched=true → x = data.nats.kv[bucket]
(the entire bucket: members + metadata + permissions)
The original assertion `x1 == x2` assumed otherwise and was always
wrong; before the deadlock fix the second call deadlocked instead of
returning, so we never got to compare. Now that calls return correctly
in both states, compare the members submap explicitly:
x_unwatched vs x_watched["members"]
zeevmoney
left a comment
There was a problem hiding this comment.
Check comments, there is a critical bug here.
zeevmoney pointed out that ContainsOrAdd was called while gwm.mu.Lock
was held, and the LRU was created with NewWithEvict — so an eviction
fired the onEviction callback synchronously under the lock. The
callback ran BucketWatcher.Stop → cleanOPAStore → opaStore.Commit,
which takes the inmem store's write lock and blocks on outstanding
read transactions. Same deadlock as the original bug, just shifted
from Start to eviction:
1. Goroutine A (this call, spawned from watchBucketBuiltin): holds
gwm.mu.Lock, blocked on db.rmu.Lock for the eviction Commit.
2. Goroutine B (the parent Rego query, or any concurrent reader):
holds db.rmu.RLock, calls another watch_bucket → HasWatcher →
blocked on gwm.mu.RLock.
Default MaxBucketsWatchers = 10, so the eviction path is hot in any
deployment with more than 10 buckets — exactly the workload that
produced the original incident.
Fix: drop the LRU's eviction callback (lru.New instead of
lru.NewWithEvict) and perform eviction manually. Peek capacity and
RemoveOldest under gwm.mu (no callbacks fire), drop the lock, then
call Stop on the evicted watcher with no manager lock held.
Also addressed in this commit (related cleanups in the same area):
- Remove the dead `if gwm.stopping` re-check after Start. createMu is
held for the full slow path and Stop() takes createMu BEFORE setting
stopping=true, so stopping cannot transition false→true here.
- Add a comment to Stop()'s Purge() noting that it is now a plain map
clear with no eviction callbacks (each watcher was already stopped
in the loop above).
- Introduce a `newWatcher` function field on BucketWatcherManager so
the regression tests can exercise Start without a real NATS
connection. Production behavior unchanged.
Addresses review comments:
- #3 (comment) (@zeevmoney)
- #3 (comment) (@zeevmoney)
- #3 (comment) (@zeevmoney)
…enarios The previous concurrency test file only exercised fast-path lookups, the stopping flag set manually, and Stop() idempotency on an empty manager — none of the genuinely concurrent scenarios that motivated this PR. Reviewers correctly pointed out that a future change could re-introduce the deadlock without any test failing. Add four regression tests, all running under -race: - TestGetOrCreateWatcher_DoesNotDeadlockOnReader The original bug. A reader holds the OPA inmem store's RLock while another goroutine enters GetOrCreateWatcher and tries to Commit (which needs the store's WLock). Asserts both that the create eventually completes after the reader releases its RLock AND that concurrent HasWatcher calls remain responsive throughout — proving gwm.mu is not held across watcher.Start. - TestGetOrCreateWatcher_DoesNotDeadlockOnEviction The eviction-deadlock variant zeevmoney flagged. Fills the cache to capacity, then triggers an eviction-during-create with concurrent HasWatcher calls. Asserts every HasWatcher call returns within 100ms — proving gwm.mu is not held across the evicted watcher's Stop. Also asserts the cache contents reflect the eviction. - TestGetOrCreateWatcher_SerializesSameBucket Asserts that N concurrent callers for the same bucket cause newWatcher to be invoked exactly once and all callers receive the same (canonical) watcher — proving createMu serializes creates and the double-check returns the canonical watcher to the loser(s). - TestStop_DrainsInflightCreates Asserts that Stop blocks until any in-flight GetOrCreateWatcher completes, and that subsequent creates are refused. This is the contract that prevents a watcher from being inserted into the cache after Stop has torn the manager down. Tests use the new newWatcher seam (defaults to NewBucketWatcher in production) to exercise Start without a real NATS connection. The deadlock test substitutes a Start that opens a write transaction on the supplied opa store and Commits it, faithfully exercising the same lock chain the real cleanOPAStore touches. Addresses review comments: - #3 (comment) (@zeevmoney) - #3 (comment) (@copilot-pull-request-reviewer)
…pose.yaml in CI
zeevmoney pointed out (correctly) that an example config should default
to the form most users will be running — OPA on the host, reaching NATS
at localhost:4222 — not the in-compose-network form. The earlier change
that flipped the default to nats://nats:4222 was a workaround for the CI
integration test, which OPA-inside-compose needed.
This commit:
- Reverts examples/opa-nats/config.yaml back to nats://localhost:4222
so `opa run -c examples/opa-nats/config.yaml` works out of the box
on the host.
- Updates examples/opa-nats/config-compose.yaml (which already
existed and shipped with the same broken localhost URL) to actually
be the in-compose config, with server_url=nats://nats:4222 and a
header comment explaining when to use which file.
- Updates cmd/opa-nats/docker-compose.test.override.yaml to mount
examples/opa-nats/config-compose.yaml in place of the default
config.yaml inside the OPA container, so the CI integration test
keeps working with no extra files in cmd/opa-nats/.
Addresses review comment:
- #3 (comment) (@zeevmoney)
The newWatcher field comment claimed it was reassigned after Reconfigure, but BucketWatcherManager has no Reconfigure method — Plugin.Reconfigure discards the entire manager and constructs a fresh one. The compose override header described mounting test-config.yaml; the override actually mounts ./config-compose.yaml. Addresses review comments: - #3 (comment) (@copilot-pull-request-reviewer) - #3 (comment) (@copilot-pull-request-reviewer) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CreateRootWatcher previously bypassed every concurrency safety mechanism this PR introduced: it did not take createMu, did not check stopping, and the root watcher itself was never stopped from BucketWatcherManager.Stop. Today the only caller is Plugin.Start during init, but if Plugin.Stop ran concurrently with Plugin.Start (e.g. rapid Reconfigure during tests), the root watcher could finish creation after stopping=true and then leak its watch loop past teardown. Any future caller invoking CreateRootWatcher on a live OPA instance would also re-introduce the deadlock the rest of the PR set out to fix. Fold CreateRootWatcher into the same createMu/stopping flow as GetOrCreateWatcher and stop the root watcher in BucketWatcherManager.Stop outside gwm.mu (its Stop takes the OPA store WLock). Addresses review comments: - #3 (comment) (@zeevmoney) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ow-path insert createMu is held end-to-end through GetOrCreateWatcher's slow path, and the cache double-check after acquiring createMu already returned the existing watcher if any. With createMu held, no concurrent writer can populate the cache for bucketName between the double-check and the insert — so gwm.watchers.Contains(bucketName) cannot be true here. The risk is the silent fall-through the code did not write: if Contains ever did fire, both the Add and any Stop of the freshly-created watcher would be skipped while the function still returned watcher with err=nil. The caller would receive a started-but-uncached watcher whose goroutine runs unbounded, holding a NATS subscription and writing to OPA store. Drop the guard and let Add run unconditionally. The explanatory comment above the block is updated to spell out the invariant rather than relying on a defensive check. Addresses review comments: - #3 (comment) (@zeevmoney) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous handshake used a single buffered channel for both directions:
Stop sent into watcherLoopStopSignal, then received from it expecting
watchLoop to have closed it. If Stop's two statements ran before watchLoop
reached its select, Stop's own receive drained the buffered value and
watchLoop blocked forever waiting for another send — a goroutine leak that
also held the watcher's NATS subscription open.
In production this typically worked because watchLoop had been parked in
its select for a long time before any Stop ran. The race is real but
scheduling-dependent. A test that drives watchLoop and Stop close together
exposes it deterministically under -race.
Replace the single channel with a request/ack pair:
- stopReq is closed by Stop (broadcast — every receiver observes it).
- watcherLoopExited is closed by watchLoop in a defer when the loop
returns; Stop blocks on it as the "loop has exited" confirmation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…chLoop
Three test changes that respond to review feedback that previous tests
were either flaky or did not actually exercise the contract they claimed
to verify:
- TestStop_DrainsInflightCreates: replace the 50ms sleep that assumed
the in-flight create had acquired createMu before Stop ran with an
explicit channel signal from newWatcher. The previous timing-based
sync was racy under load/slow CI.
- TestGetOrCreateWatcher_DoesNotDeadlockOnEviction: previously the
preloaded watcher was a stub with started=false, so its Stop short-
circuited and the eviction-Stop code path was never actually
exercised; a regression that moved evictedWatcher.Stop back inside
gwm.mu.Lock would have passed silently. Use a real BucketWatcher
backed by a stub nats.KeyWatcher whose Stop blocks on a gate so the
eviction's Stop is observably in-flight, then assert HasWatcher
remains responsive while gated.
- TestBucketWatcher_StopExitsWatchLoop: new test that drives a real
BucketWatcher with started=true and a stub KeyWatcher whose Updates
never close, then calls Stop and asserts both that Stop returns
within 1s and that watchLoop's deferred close fires. This exercises
the watcherLoopExited handshake that the rest of the suite (which
uses started=false stubs) skips entirely. Sharing infrastructure
between this test and the eviction test is provided by stubKeyWatcher
and newRealBucketWatcher.
Addresses review comments:
- #3 (comment) (@copilot-pull-request-reviewer)
- #3 (comment) (@zeevmoney)
- #3 (comment) (@zeevmoney)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…g daemon dumpComposeLogs runs docker logs and docker ps via exec.Command without a context or deadline. If the docker daemon itself is wedged (rare but seen in CI under OOM-kill aftermath) the cleanup hangs past the test framework's 10-minute kill — and zero diagnostics get logged because the t.Logf calls only fire after CombinedOutput returns. Wrap each docker invocation in exec.CommandContext with a 10s timeout. On context deadline, whatever partial output was captured is still logged. Matches the 10s timeout already used by fetchOPAGoroutineDump. Addresses review comments: - #3 (comment) (@zeevmoney) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
GetOrCreateWatcherheldgwm.mu(write) for the entire body, including the call towatcher.Start().Start()ultimately writes to the OPA inmem store viaCommit, which acquires the store'sRWMutex.Lock.watchBucketBuiltininvokesGetOrCreateWatcherfrom a goroutine spawned inside a Rego builtin call. The parent goroutine still holds the inmem store'sRLockfor the lifetime of its read transaction. The spawned writer therefore cannot acquire the store's write lock until the parent's read transaction closes — and meanwhile holdsgwm.mu(write). The parent's nextwatch_bucketcall hitsHasWatcher(), which needsgwm.mu.RLock()— blocked behind the spawned writer — and the parent never finishes, so the writer never proceeds. Classic 2-resource circular wait.Once the writer is queued on the inmem store mutex, all new readers (including OPA's
/healthhandler, which goes throughServer.canEval→Rego.Eval→getTxn→inmem.store.NewTransaction) are blocked to prevent writer starvation. Within ~180s the kubelet liveness probe fails 36× and kills the pod (exit 137, "Failed to shutdown server gracefully").Evidence
Captured goroutine dump from a hung
permit-opapod in production (433KB, 106 goroutines):The writer (
opa-nats/natsstore.(*DataTransformer).LoadBucketDataBulk→inmem.(*store).Commit) was spawned by goroutine N which was itself blocked atBucketWatcherManager.HasWatcherwaiting ongwm.mu.RLock. "Created by" line of the writer pointed at the same parent N. Pure self-deadlock.Fix
Hold
gwm.muonly for the cache lookup and finalContainsOrAdd. RunNewBucketWatcherandwatcher.Start()outside the lock.ContainsOrAddalready resolves the duplicate-create race; the loser is stopped and the winning watcher is returned to the caller.This is purely a lock-scope reduction — no behavior change in the single-caller fast path. The race window for two concurrent first-time callers on the same bucket is handled exactly the same way it was before (
ContainsOrAdd+ stop the loser).Test plan
permit-opaand verify thepermit-opapods in prod-us-east stop restarting (they currently restart every ~5–8 minutes due to this bug)🤖 Generated with Claude Code