fix(distributed): don't let a dead worker pin the model-load advisory lock #10600
Open
localai-bot wants to merge 1 commit into
… lock

In distributed mode a chat request could fail with:

    failed to route model with internal loader: routing model ...: loading model ...: advisorylock: acquiring lock <id>: ERROR: canceling statement due to lock timeout (SQLSTATE 55P03)

Root cause is two independent defects in the cross-replica model-load path:

1. SmartRouter.Route holds a per-model PostgreSQL advisory lock for the whole cold-load sequence, which includes installBackendOnNode -> InstallBackend, a NATS request-reply with a 15m deadline (DefaultBackendInstallTimeout) that ignored ctx. When the chosen worker died mid-install, the holder sat on the lock for up to 15m. The detached loadCtx (WithoutCancel) had no deadline, so nothing capped the hold.

2. The acquiring statement, pg_advisory_lock(), is subject to any deployment-global lock_timeout. A common operator setting (e.g. 10s) aborts the wait with SQLSTATE 55P03, so every other replica's request for that model hard-errored instead of waiting for the in-progress load and reusing it.

For the ~15m window the model was effectively unroutable.

Fixes:

- advisorylock.WithLockCtx (postgres): SET lock_timeout = 0 on its dedicated connection (RESET before it returns to the pool) so the Go context, not a deployment-wide GUC, governs how long we wait. Waiters now block and then re-check, reusing the model another replica just loaded.
- SmartRouter: bound the detached loadCtx with a single ModelLoadCeiling so the lock is always released in bounded time even if a sub-step wedges. Default is the configured backend.install deadline + 10m (staging + LoadModel margin), so a legitimately slow load is never cut.
- installBackendOnNode: use singleflight.DoChan + select on ctx.Done() so the install wait honors cancellation; the ceiling can then actually free a caller pinned behind a dead worker. The shared install still coalesces via singleflight.

Reproduced both defects as failing tests first (a real 55P03 against a testcontainer with a short lock_timeout; a wedged install that blocks Route) and confirmed green.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Problem
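In distributed mode a chat request could fail with:

    failed to route model with internal loader: routing model ...: loading model ...: advisorylock: acquiring lock <id>: ERROR: canceling statement due to lock timeout (SQLSTATE 55P03)

Observed live: a worker node that the model was routed to went down, and every subsequent request to that model hard-errored with 55P03 for a ~15-minute window. Bringing the worker back did not help, because a still-running coordinator goroutine held the lock.

Root cause (two independent defects)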
1. The per-model advisory lock is held across a 15-minute, ctx-ignoring NATS install. SmartRouter.Route wraps the entire cold-load (scheduleNewModel → installBackendOnNode → InstallBackend) in advisorylock.WithLockCtx. InstallBackend is a NATS request-reply with a 15m deadline (DefaultBackendInstallTimeout) that ignored ctx. When the chosen worker died mid-install, the holder sat on the lock for up to 15m. The detached loadCtx (context.WithoutCancel) had no deadline, so nothing capped the hold.

2. A deployment-global lock_timeout turns the intentional cross-replica wait into a hard error. The acquiring statement pg_advisory_lock() is subject to any lock_timeout GUC set on the role/database (a common operator setting, e.g. 10s). That aborts the wait with 55P03, so other replicas hard-errored instead of waiting for the in-progress load and reusing it.

Fix
- advisorylock.WithLockCtx (postgres path): SET lock_timeout = 0 on its dedicated connection (RESET before it returns to the pool) so the Go context is the single source of truth for how long to wait. Waiters now block and then re-check, reusing the model another replica just loaded.
- SmartRouter: bound the detached loadCtx with a single ModelLoadCeiling so the lock is always released in bounded time even if a sub-step wedges. Default is the configured backend.install deadline + 10m (staging + LoadModel margin), derived from the install timeout so raising it for slow links widens the ceiling too and never cuts a legitimately slow load.
- installBackendOnNode: use singleflight.DoChan + select on ctx.Done() so the install wait honors cancellation; the ceiling can then actually free a caller pinned behind a dead worker. The shared install still coalesces concurrent identical callers via singleflight.

Tests (TDD)
Both defects were reproduced as failing tests first, then made green:
- advisorylock: against a PostgreSQL testcontainer with a short database-level lock_timeout, a waiter previously failed with the real 55P03; it now waits out the holder and runs.
- nodes: a wedged install (worker that never replies) previously blocked Route indefinitely; it now aborts at the ModelLoadCeiling. The existing singleflight coalescing tests still pass under DoChan.

core/services/nodes and core/services/advisorylock suites pass; golangci-lint (new-from-merge-base) clean.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]