Summary
After the leader tab exits (or during a concurrent boot handoff), the follower's pending set never clears. The new leader's `UpdateMessagePort` never completes, so the shared worker never installs a new leader worker context. Requests to the leader (`PullStream`/`GetLeaderSyncState`) hang, so the follower never receives the missing events and keeps retrying invalid pushes.
Repro (automated, minimal)
- From repo root, run:
  `env CI=1 LIVESTORE_SYNC_LEADER_EXIT_REPRO=1 direnv exec . bunx vitest run tests/integration/src/tests/adapter-web/adapter-web.test.ts --testNamePattern "leader exit"`
- The test passes, but the logs show pending stuck after the leader exits.
Evidence (from latest run)
- `client-session-sync leader exit results: leader=a followerTimedOut=true pendingAfterWait=5`
- The new leader's `UpdateMessagePort` never completes:
  - Session `b` logs `TMP shared worker update port start`, but never logs `TMP shared worker update port done`.
  - The shared worker logs only one `updateMessagePort:*` sequence (for the initial leader), none after handoff.
- `GetLeaderSyncState` from session `b` times out after handoff: `TMP client sync heads after-wait session=b error=(FiberFailure) TimeoutException: Operation timed out after '1s'`
Additional repro (race)
- `env CI=1 LIVESTORE_SYNC_RACE_REPRO=1 direnv exec . bunx vitest run tests/integration/src/tests/adapter-web/adapter-web.test.ts --testNamePattern "race attempt"`
- This test intentionally fails today to surface the bug; it reports `pendingAfterWait=5` for the tab that becomes leader after handoff.
Expected
After `UpdatePort(B)` (handoff), B‑CS should rebase/refresh its pending set against LW‑B's head (a generic sketch follows this list) so that:
- B‑CS `PushToLeader(pending)` is accepted by LW‑B
- `pending` drains to 0
- no repeated `LeaderAheadError` loop
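To make the expected behavior concrete, here is a minimal, generic sketch of what "rebase/refresh the pending set against the new leader's head" would mean. This is not LiveStore's actual rebase code; `PendingEvent`, `rebasePending`, and a plain numeric head are illustrative assumptions:

```ts
// Hypothetical simplified model: pending events carry a global sequence number,
// and the new leader (LW-B) reports its current head after UpdatePort(B).
interface PendingEvent {
  readonly seqNum: number
  readonly payload: unknown
}

// Re-number the follower's pending events so they start directly after the
// leader's head. Once this holds, PushToLeader(pending) should no longer hit
// LeaderAheadError and the pending set can drain to 0 as pushes are accepted.
const rebasePending = (
  pending: ReadonlyArray<PendingEvent>,
  leaderHead: number,
): ReadonlyArray<PendingEvent> =>
  pending.map((event, index) => ({ ...event, seqNum: leaderHead + 1 + index }))
```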
Actual
After `UpdatePort(B)` (handoff):
- B‑CS sends `PushToLeader(pending)` via SW to LW‑B
- LW‑B rejects with `LeaderAheadError` (⚠️)
- B‑CS rebases locally, but `pending` keeps the same global range (still 5) (❗)
- B‑CS retries `PushToLeader(pending)` and hits `LeaderAheadError` again (⚠️)
- `UpdateMessagePort` for the new leader never completes, so `PullStream`/`GetLeaderSyncState` hang (❗)
- This loop repeats; `pending` never clears (❗)
Hypothesis (root cause)
- When a follower becomes leader, `UpdateMessagePort` is sent to the shared worker but never completes.
- The shared worker never installs the new leader worker context (`leaderWorkerContextSubRef` stays `undefined`).
- `forwardRequest` waits for a worker context, so `PullStream`/`GetLeaderSyncState` hang (a minimal sketch of this blocking pattern follows this list).
- Since the follower never receives the missing events, it cannot advance beyond `pushHead` and `pending` stays > 0.
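A minimal sketch of the suspected blocking pattern, assuming the shared worker keeps the current leader worker context in a `SubscriptionRef` and `forwardRequest` waits for it to be populated. Names and shapes are illustrative, not the actual implementation:

```ts
import { Effect, Option, Stream, SubscriptionRef } from 'effect'

type WorkerCtx = { send: (req: unknown) => Effect.Effect<unknown> }

// Illustrative stand-in for the shared worker's leader context slot. It is reset
// to Option.none() when the old leader shuts down and only re-populated once
// UpdateMessagePort for the new leader completes.
declare const leaderWorkerContextSubRef: SubscriptionRef.SubscriptionRef<Option.Option<WorkerCtx>>

// forwardRequest-style helper: wait until a worker context is installed, then
// forward the request. If UpdateMessagePort never completes, the ref stays None
// forever, so this suspends indefinitely and every proxied request
// (PullStream, GetLeaderSyncState, PushToLeader) hangs with it.
const forwardRequest = (req: unknown) =>
  leaderWorkerContextSubRef.changes.pipe(
    Stream.filter(Option.isSome),
    Stream.map((ctx) => ctx.value),
    Stream.runHead,
    Effect.flatMap(
      Option.match({
        onNone: () => Effect.dieMessage('worker context stream ended'),
        onSome: (ctx) => ctx.send(req),
      }),
    ),
  )
```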
Timing diagram (hypothesis)
```
Time ↓
A-CS (tab A)                    | LW-A (leader)                 | SW (proxy)                    | LW-B (leader)                 | B-CS (tab B)
--------------------------------|-------------------------------|-------------------------------|-------------------------------|------------------------------
A-CS -> SW: UpdatePort(A)       |                               |                               |                               |
                                |                               | SW -> LW-A: UpdatePort(A)     |                               |
                                | LW-A init                     |                               |                               |
                                |                               |                               |                               | B-CS local: commit x5 (p=5)
                                |                               |                               |                               | B-CS -> SW: PushToLeader(e1..e5)
                                |                               | SW -> LW-A: PushToLeader      |                               |
                                | LW-A: apply ok / advance      |                               |                               |
A-CS local: shutdown            |                               |                               |                               |
                                | LW-A stops                    |                               |                               |
                                |                               | SW: resetCurrentWorkerCtx     |                               |
                                |                               |                               |                               | B-CS -> SW: UpdatePort(B)
                                |                               | (UpdatePort(B) hangs) ❗      |                               |
                                |                               |                               |                               | B-CS -> SW: PullStream
                                |                               | (waitForWorker blocks) ❗     |                               |
                                |                               |                               |                               | B-CS local: retry PushToLeader
                                |                               | (waitForWorker blocks) ❗     |                               |
```
Repro code (not on dev yet)
I added a minimal fixture + test on branch schickling/pangyo to keep this repro deterministic. These files do not exist on origin/dev yet, so here are the key snippets to recreate locally:
Route wiring
tests/integration/src/tests/playwright/fixtures/main.tsx
```tsx
{
  path: '/adapter-web/client-session-sync',
  component: React.lazy(() =>
    import('./adapter-web/client-session-sync/Root.tsx').then((m) => ({ default: m.Root })),
  ),
}
```
Test (leader exit)
tests/integration/src/tests/adapter-web/adapter-web.test.ts
```ts
Vitest.scopedLive('client session sync pending sticks after leader exit (explicit)', (test) =>
  Effect.gen(function* () {
    const storeId = `adapter-web-sync-leader-exit-${Date.now()}`
    const url = appUrl('/adapter-web/client-session-sync')
    const progressChannel = `ls-webtest-client-session-sync-progress-${storeId}`
    const shutdownChannel = `ls-webtest-client-session-sync-shutdown-${storeId}`
    const baseQuery =
      `barrier=1&storeId=${storeId}&commitCount=5&timeoutMs=8000&disableFastPath=1` +
      `&manualShutdown=1&shutdownChannel=${shutdownChannel}&progressChannel=${progressChannel}`
    page1.goto(`${url}?${baseQuery}&sessionId=a&clientId=A&bootDelayMs=0`)
    page2.goto(`${url}?${baseQuery}&sessionId=b&clientId=B&bootDelayMs=80`)
    // wait for lock-start from both, pick leader, wait for follower committed
    // send shutdown to leader session via shutdownChannel
    // assert follower timedOut=true and pendingAfterWait>0
  }).pipe(withTestCtx(test)),
)
```
Fixture (core repro loop)
tests/integration/src/tests/playwright/fixtures/adapter-web/client-session-sync/Root.tsx
```tsx
const runRepro = async (store, options) => {
  const syncState = store[StoreInternalsSymbol].syncProcessor.syncState
  const lockStatusRef = store[StoreInternalsSymbol].clientSession.lockStatus
  const sessionId = store[StoreInternalsSymbol].clientSession.sessionId
  const progress = createProgressReporter(options.progressChannel, { storeId: store.storeId, sessionId })
  const initialState = await syncState.pipe(Effect.runPromise)
  const lockStatusStart = await SubscriptionRef.get(lockStatusRef).pipe(Effect.runPromise)
  progress.send('lock-start', { status: lockStatusStart })
  // commit N events (options.commitCount)
  const afterCommitState = await syncState.pipe(Effect.runPromise)
  progress.send('committed', { pendingAfterCommit: afterCommitState.pending.length })
  // wait for pending to clear or timeout
  // progress.send('done', { timedOut, pendingAfterWait })
}
```
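For completeness, the elided "wait for pending to clear or timeout" step could look roughly like the sketch below. It is only an illustration of the polling loop the fixture performs; `waitForPendingToClear` and its parameters are hypothetical, not the actual fixture code:

```ts
import { Effect } from 'effect'

// Poll the sync state until pending drains or the deadline passes, then report
// the outcome on the progress channel (mirrors the 'done' event above).
const waitForPendingToClear = async (
  syncState: Effect.Effect<{ pending: ReadonlyArray<unknown> }>,
  report: (event: string, data: Record<string, unknown>) => void,
  timeoutMs: number,
): Promise<void> => {
  const deadline = Date.now() + timeoutMs
  let pendingAfterWait = -1
  while (Date.now() < deadline) {
    const state = await Effect.runPromise(syncState)
    pendingAfterWait = state.pending.length
    if (pendingAfterWait === 0) {
      report('done', { timedOut: false, pendingAfterWait: 0 })
      return
    }
    await new Promise((resolve) => setTimeout(resolve, 100))
  }
  report('done', { timedOut: true, pendingAfterWait })
}
```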
Logs (leader-exit repro)
These lines show that the new leader's `UpdateMessagePort` never completes, so `GetLeaderSyncState`/`PullStream` hang:
```
[leader-exit-tab-2] TMP shared worker update port start (storeId: adapter-web-sync-leader-exit-..., sessionId: b)
[leader-exit-tab-1] TMP shared worker update port done (storeId: adapter-web-sync-leader-exit-..., sessionId: a)
[leader-exit-listener] TMP shared-worker: updateMessagePort:makeWorkerLayer:start
[leader-exit-listener] TMP shared-worker: updateMessagePort:ready
[leader-exit-tab-2] TMP client sync heads after-wait session=b error=(FiberFailure) TimeoutException: Operation timed out after '1s'
client-session-sync leader exit results: leader=a followerTimedOut=true pendingAfterWait=5
```
Logs (race attempt)
These lines show that the follower becomes the leader but cannot clear its pending set after handoff:
```
client-session-sync race results: tab1 status=done lockStart=has-lock lockEnd=has-lock timeline=2ms:has-lock|2ms:has-lock|59ms:has-lock timedOut=false pendingAfterWait=0; tab2 status=done lockStart=no-lock lockEnd=has-lock timeline=2ms:no-lock|2ms:no-lock|65ms:has-lock|9037ms:has-lock timedOut=true pendingAfterWait=5
[race-tab-2] TMP client sync heads after-wait session=b error=(FiberFailure) TimeoutException: Operation timed out after '1s'
```
Solution ideas
1) Make `UpdateMessagePort` non-blocking / ack early
- Change: install `leaderWorkerContextSubRef` as soon as the worker pool is created, then run devtools connect and any slow setup in forked fibers (see the sketch after this list).
- Pros: `forwardRequest` unblocks immediately; minimal behavior change; reduces handoff stalls.
- Cons: devtools errors become async; need a clear “ready” boundary for requests.
- Challenges: ensure requests are safe before devtools setup completes; avoid silent failures.
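A rough sketch of option 1, assuming the `UpdateMessagePort` handler currently performs devtools setup inline before completing. `makeWorkerPool`, `connectDevtools`, and the handler shape are illustrative placeholders, not the actual code:

```ts
import { Effect, Option, SubscriptionRef } from 'effect'

type WorkerCtx = { send: (req: unknown) => Effect.Effect<unknown> }

// Illustrative placeholders for the existing pieces.
declare const leaderWorkerContextSubRef: SubscriptionRef.SubscriptionRef<Option.Option<WorkerCtx>>
declare const makeWorkerPool: (port: MessagePort) => Effect.Effect<WorkerCtx>
declare const connectDevtools: (ctx: WorkerCtx) => Effect.Effect<void>

// UpdateMessagePort handler: install the worker context as soon as the pool
// exists (so forwardRequest unblocks), then run slow / failure-prone setup in a
// forked fiber instead of before acknowledging.
const handleUpdateMessagePort = (port: MessagePort) =>
  Effect.gen(function* () {
    const ctx = yield* makeWorkerPool(port)
    yield* SubscriptionRef.set(leaderWorkerContextSubRef, Option.some(ctx))
    yield* Effect.forkDaemon(
      connectDevtools(ctx).pipe(
        Effect.catchAllCause((cause) => Effect.logError('devtools connect failed', cause)),
      ),
    )
  })
```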
2) Timeout + reset on UpdateMessagePort
- Change: wrap the shared worker `UpdateMessagePort` flow in a timeout; on timeout, clear invariants, reset the worker context, and optionally retry (see the sketch after this list).
- Pros: recovers from hangs; surfaces the failure; avoids a permanently stuck state.
- Cons: may abort legitimately slow startups; risk of flapping if the underlying issue persists.
- Challenges: pick a safe timeout; ensure the prior worker is fully closed; avoid data loss.
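A sketch of option 2, again with illustrative names (`updateMessagePort` for the existing flow, `resetCurrentWorkerCtx` for the teardown seen in the timing diagram); the concrete timeout and retry policy would need tuning:

```ts
import { Effect, Schedule } from 'effect'

// Illustrative placeholders for the existing shared-worker operations.
declare const updateMessagePort: (port: MessagePort) => Effect.Effect<void, Error>
declare const resetCurrentWorkerCtx: Effect.Effect<void>

// Bound the UpdateMessagePort flow with a timeout. On failure or timeout, tear
// down any partially installed worker context and retry a few times with
// backoff, so a hang cannot leave the shared worker permanently without a
// leader context.
const updateMessagePortWithRecovery = (port: MessagePort) =>
  updateMessagePort(port).pipe(
    Effect.timeoutFail({
      duration: '5 seconds',
      onTimeout: () => new Error('UpdateMessagePort timed out'),
    }),
    Effect.tapError((error) =>
      Effect.logWarning('UpdateMessagePort did not complete, resetting worker context', error).pipe(
        Effect.zipRight(resetCurrentWorkerCtx),
      ),
    ),
    Effect.retry(Schedule.exponential('250 millis').pipe(Schedule.intersect(Schedule.recurs(3)))),
  )
```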
3) Bypass shared worker for leader APIs
- Change: the leader session calls its dedicated worker directly for push/pull; the shared worker only proxies for followers (see the sketch after this list).
- Pros: removes the shared-worker bottleneck during handoff; clearer ownership.
- Cons: larger refactor; more code paths; higher regression risk.
- Challenges: keep APIs consistent; manage leader/follower transitions; keep devtools wiring intact.
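A sketch of the routing change in option 3, with illustrative types; `hasLeaderLock`, `dedicatedWorkerCtx`, and `sharedWorkerProxy` are placeholders. The point is only that the session holding the leader lock talks to its own dedicated worker, so a handoff affects follower routing but not the new leader's own push/pull path:

```ts
import { Effect } from 'effect'

// Illustrative placeholders.
type LeaderRequest = {
  readonly _tag: 'PushToLeader' | 'PullStream' | 'GetLeaderSyncState'
  readonly payload: unknown
}
type WorkerCtx = { send: (req: LeaderRequest) => Effect.Effect<unknown> }

declare const hasLeaderLock: Effect.Effect<boolean>
declare const dedicatedWorkerCtx: WorkerCtx // this session's own leader worker
declare const sharedWorkerProxy: WorkerCtx // proxy path used by follower sessions

// Leader session bypasses the shared worker and calls its dedicated worker
// directly; followers keep proxying through the shared worker.
const sendLeaderRequest = (req: LeaderRequest) =>
  Effect.flatMap(hasLeaderLock, (isLeader) =>
    isLeader ? dedicatedWorkerCtx.send(req) : sharedWorkerProxy.send(req),
  )
```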
Filed by an AI assistant on behalf of the user.