adapter-web: pending stuck after leader handoff (LeaderAheadError loop) #955

Summary

After the leader tab exits (or during a concurrent boot handoff), the follower's pending set never clears. The new leader's UpdateMessagePort never completes, so the shared worker never installs a new leader worker context. Requests to the leader (PullStream/GetLeaderSyncState) hang, so the follower never receives the missing events and keeps retrying pushes that are rejected as invalid.

Repro (automated, minimal)

  1. From repo root, run:
    • env CI=1 LIVESTORE_SYNC_LEADER_EXIT_REPRO=1 direnv exec . bunx vitest run tests/integration/src/tests/adapter-web/adapter-web.test.ts --testNamePattern "leader exit"
  2. The test passes, but the logs show pending stuck after the leader exits.

Evidence (from latest run)

  • client-session-sync leader exit results: leader=a followerTimedOut=true pendingAfterWait=5
  • New leader UpdateMessagePort never completes:
    • Session b logs TMP shared worker update port start, but never logs TMP shared worker update port done.
    • Shared worker logs only one updateMessagePort:* sequence (initial leader), none after handoff.
  • GetLeaderSyncState from session b times out after handoff:
    • TMP client sync heads after-wait session=b error=(FiberFailure) TimeoutException: Operation timed out after '1s'

Additional repro (race)

  • env CI=1 LIVESTORE_SYNC_RACE_REPRO=1 direnv exec . bunx vitest run tests/integration/src/tests/adapter-web/adapter-web.test.ts --testNamePattern "race attempt"
  • This test intentionally fails today to surface the bug; it reports pendingAfterWait=5 for the tab that becomes leader after handoff.

Expected

After UpdatePort(B) (handoff), B‑CS should rebase/refresh its pending set against LW‑B's head (see the sketch after this list) so that:

  • B‑CS PushToLeader(pending) is accepted by LW‑B
  • pending drains to 0
  • no repeated LeaderAheadError loop
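
A conceptual sketch of that rebase (PendingEvent, seqNum, and rebasePending are made-up names for illustration, not LiveStore's actual sync-state API):

// Renumber pending events on top of the new leader head so the next PushToLeader
// is accepted instead of looping on LeaderAheadError.
type PendingEvent = { readonly seqNum: number; readonly payload: unknown }

const rebasePending = (
  pending: ReadonlyArray<PendingEvent>,
  leaderHead: number,
): ReadonlyArray<PendingEvent> =>
  pending.map((event, index) => ({ ...event, seqNum: leaderHead + index + 1 }))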

Actual

After UpdatePort(B) (handoff):

  • B‑CS sends PushToLeader(pending) via SW to LW‑B
  • LW‑B rejects with LeaderAheadError (⚠️)
  • B‑CS rebases locally, but pending keeps the same global range (still 5) (❗)
  • B‑CS retries PushToLeader(pending) and hits LeaderAheadError again (⚠️)
  • UpdateMessagePort for the new leader never completes, so PullStream/GetLeaderSyncState hang (❗)
  • This loop repeats; pending never clears (❗)

Hypothesis (root cause)

  • When a follower becomes leader, SharedWorkerUpdateMessagePort is sent but never completes.
  • Shared worker never installs the new leader worker context (leaderWorkerContextSubRef stays undefined).
  • forwardRequest waits for a worker context, so PullStream/GetLeaderSyncState hang.
  • Since the follower never receives the missing events, it cannot advance beyond pushHead and pending stays > 0 (modeled in the sketch below).
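
A minimal Effect model of the suspected hang (illustrative only; the names mirror the hypothesis above, not the actual shared-worker source):

import { Duration, Effect, Option, Stream, SubscriptionRef } from 'effect'

interface LeaderWorkerContext {
  readonly send: (req: unknown) => Effect.Effect<unknown>
}

const program = Effect.gen(function* () {
  // After resetCurrentWorkerCtx, the ref holds Option.none() again.
  const leaderWorkerContextSubRef = yield* SubscriptionRef.make(Option.none<LeaderWorkerContext>())

  // forwardRequest suspends until a leader worker context is installed.
  const forwardRequest = (req: unknown) =>
    leaderWorkerContextSubRef.changes.pipe(
      Stream.filterMap((ctx) => ctx), // wait for Option.some
      Stream.take(1),
      Stream.runHead,
      Effect.flatMap(
        Option.match({
          onNone: () => Effect.dieMessage('unreachable: changes stream ended without a context'),
          onSome: (ctx) => ctx.send(req),
        }),
      ),
    )

  // If UpdateMessagePort(B) never completes, nothing ever sets the ref, so this request
  // (PullStream / GetLeaderSyncState) only ends via the timeout, matching the observed
  // TimeoutException: Operation timed out after '1s'.
  yield* forwardRequest({ _tag: 'GetLeaderSyncState' }).pipe(Effect.timeout(Duration.seconds(1)))
})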

Timing diagram (hypothesis)

Legend: A-CS/B-CS = client session in tab A/B; LW-A/LW-B = leader worker started by tab A/B; SW = shared worker (proxy between client sessions and the current leader worker).

Time ↓
A-CS (tab A)                    | LW-A (leader)                 | SW (proxy)                    | LW-B (leader)                 | B-CS (tab B)
--------------------------------|-------------------------------|-------------------------------|-------------------------------|------------------------------
A-CS -> SW: UpdatePort(A)       |                               |                               |                               | 
                                |                               | SW -> LW-A: UpdatePort(A)     |                               |
                                | LW-A init                     |                               |                               |
                                |                               |                               |                               | B-CS local: commit x5 (p=5)
                                |                               |                               |                               | B-CS -> SW: PushToLeader(e1..e5)
                                |                               | SW -> LW-A: PushToLeader      |                               |
                                | LW-A: apply ok / advance      |                               |                               |

A-CS local: shutdown            |                               |                               |                               |
                                | LW-A stops                    |                               |                               |
                                |                               | SW: resetCurrentWorkerCtx     |                               |
                                |                               |                               |                               | B-CS -> SW: UpdatePort(B)
                                |                               | (UpdatePort(B) hangs) ❗       |                               |
                                |                               |                               |                               | B-CS -> SW: PullStream
                                |                               | (waitForWorker blocks) ❗      |                               |
                                |                               |                               |                               | B-CS local: retry PushToLeader
                                |                               | (waitForWorker blocks) ❗      |                               |

Repro code (not on dev yet)

I added a minimal fixture + test on branch schickling/pangyo to keep this repro deterministic. These files do not exist on origin/dev yet, so here are the key snippets to recreate locally:

Route wiring

tests/integration/src/tests/playwright/fixtures/main.tsx

{
  path: '/adapter-web/client-session-sync',
  component: React.lazy(() =>
    import('./adapter-web/client-session-sync/Root.tsx').then((m) => ({ default: m.Root })),
  ),
}

Test (leader exit)

tests/integration/src/tests/adapter-web/adapter-web.test.ts

Vitest.scopedLive('client session sync pending sticks after leader exit (explicit)', (test) =>
  Effect.gen(function* () {
    const storeId = `adapter-web-sync-leader-exit-${Date.now()}`
    const url = appUrl('/adapter-web/client-session-sync')
    const progressChannel = `ls-webtest-client-session-sync-progress-${storeId}`
    const shutdownChannel = `ls-webtest-client-session-sync-shutdown-${storeId}`
    const baseQuery =
      `barrier=1&storeId=${storeId}&commitCount=5&timeoutMs=8000&disableFastPath=1` +
      `&manualShutdown=1&shutdownChannel=${shutdownChannel}&progressChannel=${progressChannel}`

    // page1/page2 are Playwright pages from the surrounding test setup; the navigations are
    // kicked off without awaiting so both tabs boot concurrently (bootDelayMs staggers tab B).
    page1.goto(`${url}?${baseQuery}&sessionId=a&clientId=A&bootDelayMs=0`)
    page2.goto(`${url}?${baseQuery}&sessionId=b&clientId=B&bootDelayMs=80`)

    // wait for lock-start from both, pick leader, wait for follower committed
    // send shutdown to leader session via shutdownChannel
    // assert follower timedOut=true and pendingAfterWait>0
  }).pipe(withTestCtx(test)),
)

Fixture (core repro loop)

tests/integration/src/tests/playwright/fixtures/adapter-web/client-session-sync/Root.tsx

// Assumes Effect and SubscriptionRef are imported from 'effect'; StoreInternalsSymbol and
// createProgressReporter are in scope via the fixture's other imports.
const runRepro = async (store, options) => {
  const syncState = store[StoreInternalsSymbol].syncProcessor.syncState
  const lockStatusRef = store[StoreInternalsSymbol].clientSession.lockStatus
  const sessionId = store[StoreInternalsSymbol].clientSession.sessionId
  const progress = createProgressReporter(options.progressChannel, { storeId: store.storeId, sessionId })

  const initialState = await syncState.pipe(Effect.runPromise)
  const lockStatusStart = await SubscriptionRef.get(lockStatusRef).pipe(Effect.runPromise)
  progress.send('lock-start', { status: lockStatusStart })

  // commit N events (options.commitCount)

  const afterCommitState = await syncState.pipe(Effect.runPromise)
  progress.send('committed', { pendingAfterCommit: afterCommitState.pending.length })

  // wait for pending to clear or timeout
  // progress.send('done', { timedOut, pendingAfterWait })
}
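
One possible shape for the elided wait step (illustrative; the actual fixture on schickling/pangyo may differ):

// Polls the pending count until it drains or the deadline passes, then reports the result.
// readPending would be something like
// () => syncState.pipe(Effect.runPromise).then((s) => s.pending.length)
const waitForPendingToClear = async (
  readPending: () => Promise<number>,
  timeoutMs: number,
): Promise<{ timedOut: boolean; pendingAfterWait: number }> => {
  const deadline = Date.now() + timeoutMs
  let pending = await readPending()
  while (pending > 0 && Date.now() < deadline) {
    await new Promise((resolve) => setTimeout(resolve, 100))
    pending = await readPending()
  }
  return { timedOut: pending > 0, pendingAfterWait: pending }
}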

Logs (leader-exit repro)

These lines show the new leader’s UpdateMessagePort never completes, so GetLeaderSyncState/PullStream hang.

[leader-exit-tab-2] TMP shared worker update port start (storeId: adapter-web-sync-leader-exit-..., sessionId: b)
[leader-exit-tab-1] TMP shared worker update port done (storeId: adapter-web-sync-leader-exit-..., sessionId: a)
[leader-exit-listener] TMP shared-worker: updateMessagePort:makeWorkerLayer:start
[leader-exit-listener] TMP shared-worker: updateMessagePort:ready
[leader-exit-tab-2] TMP client sync heads after-wait session=b error=(FiberFailure) TimeoutException: Operation timed out after '1s'
client-session-sync leader exit results: leader=a followerTimedOut=true pendingAfterWait=5

Logs (race attempt)

These lines show the follower becomes leader but cannot clear pending after handoff.

client-session-sync race results: tab1 status=done lockStart=has-lock lockEnd=has-lock timeline=2ms:has-lock|2ms:has-lock|59ms:has-lock timedOut=false pendingAfterWait=0; tab2 status=done lockStart=no-lock lockEnd=has-lock timeline=2ms:no-lock|2ms:no-lock|65ms:has-lock|9037ms:has-lock timedOut=true pendingAfterWait=5
[race-tab-2] TMP client sync heads after-wait session=b error=(FiberFailure) TimeoutException: Operation timed out after '1s'

Solution ideas

1) Make UpdateMessagePort non-blocking / ack early

  • Change: install leaderWorkerContextSubRef as soon as the worker pool is created, then run devtools connect + any slow setup in forked fibers (sketched below).
  • Pros: forwardRequest unblocks immediately; minimal behavior change; reduces handoff stalls.
  • Cons: devtools errors become async; need a clear “ready” boundary for requests.
  • Challenges: ensure requests are safe before devtools setup completes; avoid silent failures.
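
A rough sketch of the ack-early shape (assuming Effect-based plumbing; makeWorkerPool, connectDevtools, and the ref type are placeholders, not the real adapter-web code):

import { Effect, Option, Scope, SubscriptionRef } from 'effect'

interface LeaderWorkerContext {
  readonly send: (req: unknown) => Effect.Effect<unknown>
}

declare const makeWorkerPool: (port: MessagePort) => Effect.Effect<LeaderWorkerContext, never, Scope.Scope>
declare const connectDevtools: (ctx: LeaderWorkerContext) => Effect.Effect<void>
declare const leaderWorkerContextSubRef: SubscriptionRef.SubscriptionRef<Option.Option<LeaderWorkerContext>>

const onUpdateMessagePort = (port: MessagePort) =>
  Effect.gen(function* () {
    const ctx = yield* makeWorkerPool(port)

    // Install the context as soon as the worker pool exists so forwardRequest unblocks ...
    yield* SubscriptionRef.set(leaderWorkerContextSubRef, Option.some(ctx))

    // ... and move devtools connect / other slow setup off the request path.
    yield* Effect.forkScoped(connectDevtools(ctx))

    // UpdateMessagePort can now ack here instead of waiting for the slow setup to finish.
  })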

2) Timeout + reset on UpdateMessagePort

  • Change: wrap the shared worker UpdateMessagePort flow in a timeout; on timeout, clear invariants + reset the worker context and optionally retry (sketched below).
  • Pros: recovers from hangs; surfaces the failure; avoids permanent stuck state.
  • Cons: may abort legitimate slow startups; risk of flapping if underlying issue persists.
  • Challenges: pick safe timeout; ensure prior worker is fully closed; avoid data loss.
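
A sketch of the timeout-and-reset wrapper (again assuming Effect; updateMessagePort and resetCurrentWorkerCtx stand in for the real shared-worker internals):

import { Effect } from 'effect'

declare const updateMessagePort: (port: MessagePort) => Effect.Effect<void>
declare const resetCurrentWorkerCtx: Effect.Effect<void>

const updateMessagePortSafe = (port: MessagePort) =>
  updateMessagePort(port).pipe(
    Effect.timeoutFail({
      duration: '5 seconds', // must comfortably exceed legitimate slow startups
      onTimeout: () => new Error('UpdateMessagePort timed out during leader handoff'),
    }),
    // Tear the half-installed worker context down before retrying so the next
    // attempt starts from a clean state.
    Effect.tapError(() => resetCurrentWorkerCtx),
    Effect.retry({ times: 2 }),
  )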

3) Bypass shared worker for leader APIs

  • Change: the leader session calls its dedicated worker directly for push/pull; the shared worker only proxies for followers (sketched below).
  • Pros: removes shared-worker bottleneck during handoff; clearer ownership.
  • Cons: larger refactor; more code paths; higher regression risk.
  • Challenges: keep APIs consistent; manage leader/follower transitions; keep devtools wiring intact.
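
A sketch of the routing split (illustrative placeholders only):

import { Effect } from 'effect'

type LeaderRequest = { readonly _tag: 'PushToLeader' | 'PullStream' | 'GetLeaderSyncState' }

interface LeaderEndpoint {
  readonly send: (req: LeaderRequest) => Effect.Effect<unknown>
}

declare const dedicatedLeaderWorker: LeaderEndpoint // owned by the current leader session
declare const sharedWorkerProxy: LeaderEndpoint // today's forwardRequest path
declare const isCurrentLeader: Effect.Effect<boolean>

// Leader sessions talk to their dedicated worker directly; followers keep using the proxy.
const sendLeaderRequest = (req: LeaderRequest) =>
  Effect.flatMap(isCurrentLeader, (leader) =>
    leader ? dedicatedLeaderWorker.send(req) : sharedWorkerProxy.send(req),
  )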

Filed by an AI assistant on behalf of the user.
