
fix(distributed): don't let a dead worker pin the model-load advisory lock #10600

Open
localai-bot wants to merge 1 commit into master from fix/distributed-model-load-lock-timeout

@localai-bot

Collaborator

Problem

In distributed mode a chat request could fail with:

failed to route model with internal loader: routing model .../<model>.gguf:
loading model <model>: advisorylock: acquiring lock <id>:
ERROR: canceling statement due to lock timeout (SQLSTATE 55P03)

Observed live: a worker node that the model was routed to went down, and every subsequent request to that model hard-errored with 55P03 for a ~15-minute window. Bringing the worker back did not help, because a still-running coordinator goroutine held the lock.

Root cause (two independent defects)

  1. The per-model advisory lock is held across a 15-minute, ctx-ignoring NATS install.
    SmartRouter.Route wraps the entire cold-load (scheduleNewModel → installBackendOnNode → InstallBackend) in advisorylock.WithLockCtx. InstallBackend is a NATS request-reply with a 15m deadline (DefaultBackendInstallTimeout) that ignored ctx. When the chosen worker died mid-install, the holder sat on the lock for up to 15m. The detached loadCtx (context.WithoutCancel) had no deadline, so nothing capped the hold.

  2. A deployment-global lock_timeout turns the intentional cross-replica wait into a hard error.
    The acquiring statement pg_advisory_lock() is subject to any lock_timeout GUC set on the role/database (a common operator setting, e.g. 10s). That aborts the wait with 55P03, so other replicas hard-errored instead of waiting for the in-progress load and reusing it.

Fix

  • advisorylock.WithLockCtx (postgres path): SET lock_timeout = 0 on its dedicated connection (RESET before it returns to the pool) so the Go context is the single source of truth for how long to wait. Waiters now block and then re-check, reusing the model another replica just loaded.
  • SmartRouter: bound the detached loadCtx with a single ModelLoadCeiling so the lock is always released in bounded time even if a sub-step wedges. Default is the configured backend.install deadline + 10m (staging + LoadModel margin), derived from the install timeout so raising it for slow links widens the ceiling too and never cuts a legitimately slow load.
  • installBackendOnNode: use singleflight.DoChan + select on ctx.Done() so the install wait honors cancellation; the ceiling can then actually free a caller pinned behind a dead worker. The shared install still coalesces concurrent identical callers via singleflight.

Tests (TDD)

Both defects were reproduced as failing tests first, then made green:

  • advisorylock: against a PostgreSQL testcontainer with a short database-level lock_timeout, a waiter previously failed with the real 55P03; it now waits out the holder and runs.
  • nodes: a wedged install (worker that never replies) previously blocked Route indefinitely; it now aborts at the ModelLoadCeiling. The existing singleflight coalescing tests still pass under DoChan.

core/services/nodes and core/services/advisorylock suites pass; golangci-lint (new-from-merge-base) clean.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]

… lock

In distributed mode a chat request could fail with:

  failed to route model with internal loader: routing model ...:
  loading model ...: advisorylock: acquiring lock <id>:
  ERROR: canceling statement due to lock timeout (SQLSTATE 55P03)

Root cause is two independent defects in the cross-replica model-load path:

1. SmartRouter.Route holds a per-model PostgreSQL advisory lock for the whole
   cold-load sequence, which includes installBackendOnNode -> InstallBackend,
   a NATS request-reply with a 15m deadline (DefaultBackendInstallTimeout) that
   ignored ctx. When the chosen worker died mid-install, the holder sat on the
   lock for up to 15m. The detached loadCtx (WithoutCancel) had no deadline, so
   nothing capped the hold.

2. The acquiring statement, pg_advisory_lock(), is subject to any
   deployment-global lock_timeout. A common operator setting (e.g. 10s) aborts
   the wait with SQLSTATE 55P03, so every other replica's request for that
   model hard-errored instead of waiting for the in-progress load and reusing
   it. For the ~15m window the model was effectively unroutable.

Fixes:

- advisorylock.WithLockCtx (postgres): SET lock_timeout = 0 on its dedicated
  connection (RESET before it returns to the pool) so the Go context, not a
  deployment-wide GUC, governs how long we wait. Waiters now block and then
  re-check, reusing the model another replica just loaded.

- SmartRouter: bound the detached loadCtx with a single ModelLoadCeiling so the
  lock is always released in bounded time even if a sub-step wedges. Default is
  the configured backend.install deadline + 10m (staging + LoadModel margin),
  so a legitimately slow load is never cut.

- installBackendOnNode: use singleflight.DoChan + select on ctx.Done() so the
  install wait honors cancellation; the ceiling can then actually free a caller
  pinned behind a dead worker. The shared install still coalesces via
  singleflight.

Reproduced both defects as failing tests first (a real 55P03 against a
testcontainer with a short lock_timeout; a wedged install that blocks Route)
and confirmed green.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
