adapter-web: pending stuck after leader handoff (LeaderAheadError loop) #955

Summary

After the leader tab exits (or during a concurrent boot handoff), the follower's pending set never clears. The new leader's UpdateMessagePort never completes, so the shared worker never installs a new leader worker context. Requests to the leader (PullStream/GetLeaderSyncState) hang, so the follower never receives the missing events and keeps retrying pushes that are rejected as invalid.

Repro (automated, minimal)

  1. From repo root, run:
    • env CI=1 LIVESTORE_SYNC_LEADER_EXIT_REPRO=1 direnv exec . bunx vitest run tests/integration/src/tests/adapter-web/adapter-web.test.ts --testNamePattern "leader exit"
  2. The test passes, but the logs show pending stuck after the leader exits.

Evidence (from latest run)

  • client-session-sync leader exit results: leader=a followerTimedOut=true pendingAfterWait=5
  • New leader UpdateMessagePort never completes:
    • Session b logs TMP shared worker update port start, but never logs TMP shared worker update port done.
    • Shared worker logs only one updateMessagePort:* sequence (initial leader), none after handoff.
  • GetLeaderSyncState from session b times out after handoff:
    • TMP client sync heads after-wait session=b error=(FiberFailure) TimeoutException: Operation timed out after '1s'

Additional repro (race)

  • env CI=1 LIVESTORE_SYNC_RACE_REPRO=1 direnv exec . bunx vitest run tests/integration/src/tests/adapter-web/adapter-web.test.ts --testNamePattern "race attempt"
  • This test intentionally fails today to surface the bug; it reports pendingAfterWait=5 for the tab that becomes leader after handoff.

Expected

After UpdatePort(B) (handoff), B‑CS should rebase/refresh its pending set against LW‑B's head (see the sketch after this list) so that:

  • B‑CS PushToLeader(pending) is accepted by LW‑B
  • pending drains to 0
  • no repeated LeaderAheadError loop
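
A conceptual sketch of that rebase (PendingEvent, seqNum, and rebasePending are made-up names for illustration, not LiveStore's actual sync-state API):

// Renumber pending events on top of the new leader head so the next PushToLeader
// is accepted instead of looping on LeaderAheadError.
type PendingEvent = { readonly seqNum: number; readonly payload: unknown }

const rebasePending = (
  pending: ReadonlyArray<PendingEvent>,
  leaderHead: number,
): ReadonlyArray<PendingEvent> =>
  pending.map((event, index) => ({ ...event, seqNum: leaderHead + index + 1 }))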

Actual

After UpdatePort(B) (handoff):

  • B‑CS sends PushToLeader(pending) via SW to LW‑B
  • LW‑B rejects with LeaderAheadError (⚠️)
  • B‑CS rebases locally, but pending keeps the same global range (still 5) (❗)
  • B‑CS retries PushToLeader(pending) and hits LeaderAheadError again (⚠️)
  • UpdateMessagePort for the new leader never completes, so PullStream/GetLeaderSyncState hang (❗)
  • This loop repeats; pending never clears (❗)

Hypothesis (root cause)

  • When a follower becomes leader, SharedWorkerUpdateMessagePort is sent but never completes.
  • Shared worker never installs the new leader worker context (leaderWorkerContextSubRef stays undefined).
  • forwardRequest waits for a worker context, so PullStream/GetLeaderSyncState hang.
  • Since the follower never receives the missing events, it cannot advance beyond pushHead and pending stays > 0 (modeled in the sketch below).
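
A minimal Effect model of the suspected hang (illustrative only; the names mirror the hypothesis above, not the actual shared-worker source):

import { Duration, Effect, Option, Stream, SubscriptionRef } from 'effect'

interface LeaderWorkerContext {
  readonly send: (req: unknown) => Effect.Effect<unknown>
}

const program = Effect.gen(function* () {
  // After resetCurrentWorkerCtx, the ref holds Option.none() again.
  const leaderWorkerContextSubRef = yield* SubscriptionRef.make(Option.none<LeaderWorkerContext>())

  // forwardRequest suspends until a leader worker context is installed.
  const forwardRequest = (req: unknown) =>
    leaderWorkerContextSubRef.changes.pipe(
      Stream.filterMap((ctx) => ctx), // wait for Option.some
      Stream.take(1),
      Stream.runHead,
      Effect.flatMap(
        Option.match({
          onNone: () => Effect.dieMessage('unreachable: changes stream ended without a context'),
          onSome: (ctx) => ctx.send(req),
        }),
      ),
    )

  // If UpdateMessagePort(B) never completes, nothing ever sets the ref, so this request
  // (PullStream / GetLeaderSyncState) only ends via the timeout, matching the observed
  // TimeoutException: Operation timed out after '1s'.
  yield* forwardRequest({ _tag: 'GetLeaderSyncState' }).pipe(Effect.timeout(Duration.seconds(1)))
})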

Timing diagram (hypothesis)

Legend: A-CS/B-CS = client session in tab A/B; LW-A/LW-B = leader worker started by tab A/B; SW = shared worker (proxy between client sessions and the current leader worker).

Time ↓
A-CS (tab A)                    | LW-A (leader)                 | SW (proxy)                    | LW-B (leader)                 | B-CS (tab B)
--------------------------------|-------------------------------|-------------------------------|-------------------------------|------------------------------
A-CS -> SW: UpdatePort(A)       |                               |                               |                               | 
                                |                               | SW -> LW-A: UpdatePort(A)     |                               |
                                | LW-A init                     |                               |                               |
                                |                               |                               |                               | B-CS local: commit x5 (p=5)
                                |                               |                               |                               | B-CS -> SW: PushToLeader(e1..e5)
                                |                               | SW -> LW-A: PushToLeader      |                               |
                                | LW-A: apply ok / advance      |                               |                               |

A-CS local: shutdown            |                               |                               |                               |
                                | LW-A stops                    |                               |                               |
                                |                               | SW: resetCurrentWorkerCtx     |                               |
                                |                               |                               |                               | B-CS -> SW: UpdatePort(B)
                                |                               | (UpdatePort(B) hangs) ❗       |                               |
                                |                               |                               |                               | B-CS -> SW: PullStream
                                |                               | (waitForWorker blocks) ❗      |                               |
                                |                               |                               |                               | B-CS local: retry PushToLeader
                                |                               | (waitForWorker blocks) ❗      |                               |

Repro code (not on dev yet)

I added a minimal fixture + test on branch schickling/pangyo to keep this repro deterministic. These files do not exist on origin/dev yet, so here are the key snippets to recreate locally:

Route wiring

tests/integration/src/tests/playwright/fixtures/main.tsx

{
  path: '/adapter-web/client-session-sync',
  component: React.lazy(() =>
    import('./adapter-web/client-session-sync/Root.tsx').then((m) => ({ default: m.Root })),
  ),
}

Test (leader exit)

tests/integration/src/tests/adapter-web/adapter-web.test.ts

Vitest.scopedLive('client session sync pending sticks after leader exit (explicit)', (test) =>
  Effect.gen(function* () {
    const storeId = `adapter-web-sync-leader-exit-${Date.now()}`
    const url = appUrl('/adapter-web/client-session-sync')
    const progressChannel = `ls-webtest-client-session-sync-progress-${storeId}`
    const shutdownChannel = `ls-webtest-client-session-sync-shutdown-${storeId}`
    const baseQuery =
      `barrier=1&storeId=${storeId}&commitCount=5&timeoutMs=8000&disableFastPath=1` +
      `&manualShutdown=1&shutdownChannel=${shutdownChannel}&progressChannel=${progressChannel}`

    // page1/page2 are Playwright pages from the surrounding test setup; the navigations are
    // kicked off without awaiting so both tabs boot concurrently (bootDelayMs staggers tab B).
    page1.goto(`${url}?${baseQuery}&sessionId=a&clientId=A&bootDelayMs=0`)
    page2.goto(`${url}?${baseQuery}&sessionId=b&clientId=B&bootDelayMs=80`)

    // wait for lock-start from both, pick leader, wait for follower committed
    // send shutdown to leader session via shutdownChannel
    // assert follower timedOut=true and pendingAfterWait>0
  }).pipe(withTestCtx(test)),
)

Fixture (core repro loop)

tests/integration/src/tests/playwright/fixtures/adapter-web/client-session-sync/Root.tsx

// Assumes Effect and SubscriptionRef are imported from 'effect'; StoreInternalsSymbol and
// createProgressReporter are in scope via the fixture's other imports.
const runRepro = async (store, options) => {
  const syncState = store[StoreInternalsSymbol].syncProcessor.syncState
  const lockStatusRef = store[StoreInternalsSymbol].clientSession.lockStatus
  const sessionId = store[StoreInternalsSymbol].clientSession.sessionId
  const progress = createProgressReporter(options.progressChannel, { storeId: store.storeId, sessionId })

  const initialState = await syncState.pipe(Effect.runPromise)
  const lockStatusStart = await SubscriptionRef.get(lockStatusRef).pipe(Effect.runPromise)
  progress.send('lock-start', { status: lockStatusStart })

  // commit N events (options.commitCount)

  const afterCommitState = await syncState.pipe(Effect.runPromise)
  progress.send('committed', { pendingAfterCommit: afterCommitState.pending.length })

  // wait for pending to clear or timeout
  // progress.send('done', { timedOut, pendingAfterWait })
}
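
One possible shape for the elided wait step (illustrative; the actual fixture on schickling/pangyo may differ):

// Polls the pending count until it drains or the deadline passes, then reports the result.
// readPending would be something like
// () => syncState.pipe(Effect.runPromise).then((s) => s.pending.length)
const waitForPendingToClear = async (
  readPending: () => Promise<number>,
  timeoutMs: number,
): Promise<{ timedOut: boolean; pendingAfterWait: number }> => {
  const deadline = Date.now() + timeoutMs
  let pending = await readPending()
  while (pending > 0 && Date.now() < deadline) {
    await new Promise((resolve) => setTimeout(resolve, 100))
    pending = await readPending()
  }
  return { timedOut: pending > 0, pendingAfterWait: pending }
}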

Logs (leader-exit repro)

These lines show the new leader’s UpdateMessagePort never completes, so GetLeaderSyncState/PullStream hang.

[leader-exit-tab-2] TMP shared worker update port start (storeId: adapter-web-sync-leader-exit-..., sessionId: b)
[leader-exit-tab-1] TMP shared worker update port done (storeId: adapter-web-sync-leader-exit-..., sessionId: a)
[leader-exit-listener] TMP shared-worker: updateMessagePort:makeWorkerLayer:start
[leader-exit-listener] TMP shared-worker: updateMessagePort:ready
[leader-exit-tab-2] TMP client sync heads after-wait session=b error=(FiberFailure) TimeoutException: Operation timed out after '1s'
client-session-sync leader exit results: leader=a followerTimedOut=true pendingAfterWait=5

Logs (race attempt)

These lines show the follower becomes leader but cannot clear pending after handoff.

client-session-sync race results: tab1 status=done lockStart=has-lock lockEnd=has-lock timeline=2ms:has-lock|2ms:has-lock|59ms:has-lock timedOut=false pendingAfterWait=0; tab2 status=done lockStart=no-lock lockEnd=has-lock timeline=2ms:no-lock|2ms:no-lock|65ms:has-lock|9037ms:has-lock timedOut=true pendingAfterWait=5
[race-tab-2] TMP client sync heads after-wait session=b error=(FiberFailure) TimeoutException: Operation timed out after '1s'

Solution ideas

1) Make UpdateMessagePort non-blocking / ack early

  • Change: install leaderWorkerContextSubRef as soon as the worker pool is created, then run devtools connect + any slow setup in forked fibers (sketched below).
  • Pros: forwardRequest unblocks immediately; minimal behavior change; reduces handoff stalls.
  • Cons: devtools errors become async; need a clear “ready” boundary for requests.
  • Challenges: ensure requests are safe before devtools setup completes; avoid silent failures.
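
A rough sketch of the ack-early shape (assuming Effect-based plumbing; makeWorkerPool, connectDevtools, and the ref type are placeholders, not the real adapter-web code):

import { Effect, Option, Scope, SubscriptionRef } from 'effect'

interface LeaderWorkerContext {
  readonly send: (req: unknown) => Effect.Effect<unknown>
}

declare const makeWorkerPool: (port: MessagePort) => Effect.Effect<LeaderWorkerContext, never, Scope.Scope>
declare const connectDevtools: (ctx: LeaderWorkerContext) => Effect.Effect<void>
declare const leaderWorkerContextSubRef: SubscriptionRef.SubscriptionRef<Option.Option<LeaderWorkerContext>>

const onUpdateMessagePort = (port: MessagePort) =>
  Effect.gen(function* () {
    const ctx = yield* makeWorkerPool(port)

    // Install the context as soon as the worker pool exists so forwardRequest unblocks ...
    yield* SubscriptionRef.set(leaderWorkerContextSubRef, Option.some(ctx))

    // ... and move devtools connect / other slow setup off the request path.
    yield* Effect.forkScoped(connectDevtools(ctx))

    // UpdateMessagePort can now ack here instead of waiting for the slow setup to finish.
  })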

2) Timeout + reset on UpdateMessagePort

  • Change: wrap the shared worker UpdateMessagePort flow in a timeout; on timeout, clear invariants + reset the worker context and optionally retry (sketched below).
  • Pros: recovers from hangs; surfaces the failure; avoids permanent stuck state.
  • Cons: may abort legitimate slow startups; risk of flapping if underlying issue persists.
  • Challenges: pick safe timeout; ensure prior worker is fully closed; avoid data loss.
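
A sketch of the timeout-and-reset wrapper (again assuming Effect; updateMessagePort and resetCurrentWorkerCtx stand in for the real shared-worker internals):

import { Effect } from 'effect'

declare const updateMessagePort: (port: MessagePort) => Effect.Effect<void>
declare const resetCurrentWorkerCtx: Effect.Effect<void>

const updateMessagePortSafe = (port: MessagePort) =>
  updateMessagePort(port).pipe(
    Effect.timeoutFail({
      duration: '5 seconds', // must comfortably exceed legitimate slow startups
      onTimeout: () => new Error('UpdateMessagePort timed out during leader handoff'),
    }),
    // Tear the half-installed worker context down before retrying so the next
    // attempt starts from a clean state.
    Effect.tapError(() => resetCurrentWorkerCtx),
    Effect.retry({ times: 2 }),
  )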

3) Bypass shared worker for leader APIs

  • Change: the leader session calls its dedicated worker directly for push/pull; the shared worker only proxies for followers (sketched below).
  • Pros: removes shared-worker bottleneck during handoff; clearer ownership.
  • Cons: larger refactor; more code paths; higher regression risk.
  • Challenges: keep APIs consistent; manage leader/follower transitions; keep devtools wiring intact.
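
A sketch of the routing split (illustrative placeholders only):

import { Effect } from 'effect'

type LeaderRequest = { readonly _tag: 'PushToLeader' | 'PullStream' | 'GetLeaderSyncState' }

interface LeaderEndpoint {
  readonly send: (req: LeaderRequest) => Effect.Effect<unknown>
}

declare const dedicatedLeaderWorker: LeaderEndpoint // owned by the current leader session
declare const sharedWorkerProxy: LeaderEndpoint // today's forwardRequest path
declare const isCurrentLeader: Effect.Effect<boolean>

// Leader sessions talk to their dedicated worker directly; followers keep using the proxy.
const sendLeaderRequest = (req: LeaderRequest) =>
  Effect.flatMap(isCurrentLeader, (leader) =>
    leader ? dedicatedLeaderWorker.send(req) : sharedWorkerProxy.send(req),
  )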

Filed by an AI assistant on behalf of the user.
