fix(distributed): don't let a dead worker pin the model-load advisory lock by localai-bot · Pull Request #10600 · mudler/LocalAI

localai-bot · 2026-06-29T22:53:14Z

Problem

In distributed mode a chat request could fail with:

failed to route model with internal loader: routing model .../<model>.gguf:
loading model <model>: advisorylock: acquiring lock <id>:
ERROR: canceling statement due to lock timeout (SQLSTATE 55P03)

Observed live: a worker node that the model was routed to went down, and every subsequent request to that model hard-errored with 55P03 for a ~15-minute window. Bringing the worker back did not help, because a still-running coordinator goroutine held the lock.

Root cause (two independent defects)

1. The per-model advisory lock is held across a 15-minute, ctx-ignoring NATS install.
SmartRouter.Route wraps the entire cold-load (scheduleNewModel → installBackendOnNode → InstallBackend) in advisorylock.WithLockCtx. InstallBackend is a NATS request-reply with a 15m deadline (DefaultBackendInstallTimeout) that ignored ctx. When the chosen worker died mid-install, the holder sat on the lock for up to 15m. The detached loadCtx (context.WithoutCancel) had no deadline, so nothing capped the hold.
2. A deployment-global lock_timeout turns the intentional cross-replica wait into a hard error.
The acquiring statement pg_advisory_lock() is subject to any lock_timeout GUC set on the role/database (a common operator setting, e.g. 10s). That aborts the wait with 55P03, so other replicas hard-errored instead of waiting for the in-progress load and reusing it.

Fix

advisorylock.WithLockCtx (postgres path): SET lock_timeout = 0 on its dedicated connection (RESET before it returns to the pool) so the Go context is the single source of truth for how long to wait. Waiters now block and then re-check, reusing the model another replica just loaded.
SmartRouter: bound the detached loadCtx with a single ModelLoadCeiling so the lock is always released in bounded time even if a sub-step wedges. The default is the configured backend.install deadline + 10m (staging + LoadModel margin). Because it is derived from the install timeout, raising that timeout for slow links widens the ceiling too, so a legitimately slow load is never cut.
installBackendOnNode: use singleflight.DoChan + select on ctx.Done() so the install wait honors cancellation; the ceiling can then actually free a caller pinned behind a dead worker. The shared install still coalesces concurrent identical callers via singleflight.

Tests (TDD)

Both defects were reproduced as failing tests first, then made green:

advisorylock: against a PostgreSQL testcontainer with a short database-level lock_timeout, a waiter previously failed with the real 55P03; it now waits out the holder and runs.
nodes: a wedged install (worker that never replies) previously blocked Route indefinitely; it now aborts at the ModelLoadCeiling. The existing singleflight coalescing tests still pass under DoChan.

core/services/nodes and core/services/advisorylock suites pass; golangci-lint (new-from-merge-base) clean.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]

… lock

In distributed mode a chat request could fail with:

failed to route model with internal loader: routing model ...: loading model ...: advisorylock: acquiring lock <id>: ERROR: canceling statement due to lock timeout (SQLSTATE 55P03)

Root cause is two independent defects in the cross-replica model-load path:

1. SmartRouter.Route holds a per-model PostgreSQL advisory lock for the whole cold-load sequence, which includes installBackendOnNode -> InstallBackend, a NATS request-reply with a 15m deadline (DefaultBackendInstallTimeout) that ignored ctx. When the chosen worker died mid-install, the holder sat on the lock for up to 15m. The detached loadCtx (WithoutCancel) had no deadline, so nothing capped the hold.
2. The acquiring statement, pg_advisory_lock(), is subject to any deployment-global lock_timeout. A common operator setting (e.g. 10s) aborts the wait with SQLSTATE 55P03, so every other replica's request for that model hard-errored instead of waiting for the in-progress load and reusing it. For the ~15m window the model was effectively unroutable.

Fixes:

- advisorylock.WithLockCtx (postgres): SET lock_timeout = 0 on its dedicated connection (RESET before it returns to the pool) so the Go context, not a deployment-wide GUC, governs how long we wait. Waiters now block and then re-check, reusing the model another replica just loaded.
- SmartRouter: bound the detached loadCtx with a single ModelLoadCeiling so the lock is always released in bounded time even if a sub-step wedges. Default is the configured backend.install deadline + 10m (staging + LoadModel margin), so a legitimately slow load is never cut.
- installBackendOnNode: use singleflight.DoChan + select on ctx.Done() so the install wait honors cancellation; the ceiling can then actually free a caller pinned behind a dead worker. The shared install still coalesces via singleflight.

Reproduced both defects as failing tests first (a real 55P03 against a testcontainer with a short lock_timeout; a wedged install that blocks Route) and confirmed green.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

localai-bot wants to merge 1 commit into master from fix/distributed-model-load-lock-timeout
