fix(engine): close concurrent preempt race in target-state ownership transfer #1994

Merged: georgeh0 merged 1 commit into main from g/preempt-race-fix on May 19, 2026

Conversation

@georgeh0
Member

Summary

  • Adds a pending_process_token: Option<u128> to TargetStateInfoItem, written by pre_commit whenever it queues a sink action and cleared by commit_in_txn's retention pass. A detection sub-pass at the top of pre_commit peeks the token on every preempt-source item. A live token (one matching this process's token) makes pre_commit return PreCommitOutcome::PendingRetry, so submit() re-runs pre_commit after an exponential backoff (5ms → 200ms, max 8 retries). A dead token (left by a crashed prior process) forces prev_may_be_missing=true on reconcile.
  • Caches old-owner tracking_info bytes in a per-call HashMap<StablePath, Vec<u8>> shared between the detection sub-pass and the Phase 1 preempt branch — each owner is read once, modified in place across multiple preempts, and emitted as a single DeferredWrite at the end of Phase 1.
  • On sink_apply / commit_in_txn failure, rollback_pending_tokens re-reads the component's tracking_info and clears every token matching this process's startup token. Retried indefinitely with logged backoff; the process-exits-mid-rollback case is covered by the dead-token recovery branch on next startup. set_provider_generation (OnceLock-backed) is deferred past the last PendingRetry exit so it runs at most once per successful lifecycle.
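
The retry and deferral behaviour described in the bullets above can be sketched roughly as follows. This is an illustrative sketch, not the code in this PR: the doubling factor, the `Ready` variant, the `sleep` callback, the `u64` generation type, and the signature of `submit` are all assumptions. Only the 5ms start, the 200ms cap, the 8-retry budget, `PendingRetry`, and the use of a `OnceLock` come from the description.

```rust
use std::sync::OnceLock;
use std::time::Duration;

// Figures quoted in the summary; the constant names are hypothetical.
const INITIAL_BACKOFF: Duration = Duration::from_millis(5);
const MAX_BACKOFF: Duration = Duration::from_millis(200);
const MAX_RETRIES: u32 = 8;

/// Hypothetical stand-in for the real outcome type. Only `PendingRetry`
/// is named in the PR; `Ready` is invented for this sketch.
enum PreCommitOutcome {
    PendingRetry,
    Ready,
}

/// Delay before retry number `attempt` (0-based). Assumes the delay
/// doubles on each retry and saturates at MAX_BACKOFF.
fn backoff_delay(attempt: u32) -> Duration {
    INITIAL_BACKOFF
        .saturating_mul(2u32.saturating_pow(attempt))
        .min(MAX_BACKOFF)
}

/// Re-runs `pre_commit` while it reports `PendingRetry`, and only touches
/// the one-shot generation slot after the last possible retry exit.
fn submit(
    generation_slot: &OnceLock<u64>,
    generation: u64,
    mut pre_commit: impl FnMut() -> PreCommitOutcome,
    mut sleep: impl FnMut(Duration),
) -> Result<(), &'static str> {
    let mut retries = 0;
    loop {
        match pre_commit() {
            PreCommitOutcome::Ready => break,
            PreCommitOutcome::PendingRetry if retries < MAX_RETRIES => {
                sleep(backoff_delay(retries));
                retries += 1;
            }
            PreCommitOutcome::PendingRetry => return Err("retry budget exhausted"),
        }
    }
    // A OnceLock accepts exactly one `set`, so it must not run on a
    // path that may loop back into pre_commit.
    generation_slot
        .set(generation)
        .map_err(|_| "generation already set")
}
```

Passing `sleep` as a callback keeps the sketch testable without real delays; a caller on tokio would presumably await a timer there instead.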

See specs/target_state_ownership_transfer/concurrent_preempt_race_fix.md for the full design.

Test plan

  • cargo test -p cocoindex_core --lib: 31/31 ✅
  • python/tests/core/test_ownership_transfer.py: 8/8 across 5 consecutive runs ✅
  • Full python/tests/core/: 358/358 ✅
  • CI

…transfer

When two components race to own the same target state, the prior fix
relied on tokio scheduling to serialize their `pre_commit` ->
`sink_apply` -> `commit` sequences. That holds for LMDB (microsecond
ops) but breaks under PG latency, where the loser's `delete` lands
after the winner's `upsert` and the target state is lost.

Add a `pending_process_token: Option<u128>` to `TargetStateInfoItem`,
written by `pre_commit` whenever it queues a sink action and cleared
by `commit_in_txn`'s retention pass. A detection sub-pass at the top
of `pre_commit` peeks the token on every preempt-source item: live
(same process token) -> return `PendingRetry` with the unconsumed
declared map so `submit()` can re-invoke after a backoff; dead
(crashed prior process) -> force `prev_may_be_missing=true` on
reconcile.
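
As a rough illustration of the live/dead distinction above, the token peek could be modelled like this. `TokenState` and `classify_token` are hypothetical names, and the sketch assumes any token that does not match the current process's token was left by a crashed prior process.

```rust
/// Hypothetical result of peeking a preempt-source item's token.
#[derive(Debug, PartialEq, Eq)]
enum TokenState {
    /// No sink action is recorded as in flight for this item.
    Clear,
    /// Token matches this process: an in-flight commit may still land,
    /// so the caller should return `PendingRetry`.
    Live,
    /// Token was left by a prior process: the caller should force
    /// `prev_may_be_missing=true` on reconcile.
    Dead,
}

/// Classifies a `pending_process_token` against this process's token.
fn classify_token(pending: Option<u128>, current_process_token: u128) -> TokenState {
    match pending {
        None => TokenState::Clear,
        Some(t) if t == current_process_token => TokenState::Live,
        // Assumption: every non-matching token belongs to a dead process.
        Some(_) => TokenState::Dead,
    }
}
```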

The detection sub-pass also caches old-owner `tracking_info` bytes
into a per-call `HashMap<StablePath, Vec<u8>>` shared with the Phase 1
preempt branch -- each owner is read once, modified in place across
multiple preempts, and emitted as a single `DeferredWrite` at the end
of Phase 1.
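
A minimal sketch of the read-once, write-once cache described above, under stated assumptions: `StablePath` is aliased to `String` as a stand-in for the real type, and `OwnerCache`, `get_or_load`, and `into_deferred_writes` are hypothetical names. The `reads` counter exists only to make the read-once property visible.

```rust
use std::collections::HashMap;

// Stand-in for the real StablePath type (assumption).
type StablePath = String;

/// Per-call cache of old-owner tracking_info bytes.
struct OwnerCache {
    entries: HashMap<StablePath, Vec<u8>>,
    reads: usize,
}

impl OwnerCache {
    fn new() -> Self {
        OwnerCache { entries: HashMap::new(), reads: 0 }
    }

    /// Returns the cached bytes for `owner`, calling `load` only on the
    /// first access. Later preempts mutate the same buffer in place.
    fn get_or_load(
        &mut self,
        owner: &StablePath,
        load: impl FnOnce() -> Vec<u8>,
    ) -> &mut Vec<u8> {
        if !self.entries.contains_key(owner) {
            self.reads += 1;
            self.entries.insert(owner.clone(), load());
        }
        self.entries.get_mut(owner).expect("entry inserted above")
    }

    /// Consumes the cache, yielding one (owner, bytes) pair per owner,
    /// analogous to emitting a single deferred write for each.
    fn into_deferred_writes(self) -> Vec<(StablePath, Vec<u8>)> {
        self.entries.into_iter().collect()
    }
}
```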

`set_provider_generation` (OnceLock-backed, can't run twice) is
deferred until after the last possible PendingRetry exit. On
`sink_apply` / `commit_in_txn` failure, `rollback_pending_tokens`
re-reads tracking_info and clears every token matching this process's
token; retried indefinitely with logged backoff until success. The
process-exits-mid-rollback case is covered by the dead-token recovery
path on next startup.
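
The token-clearing step of the rollback could look roughly like the sketch below. It omits the re-read of `tracking_info` and the retry-with-backoff wrapper, and it models `TargetStateInfoItem` with only the one field named in this PR. `clear_own_tokens` is a hypothetical helper name.

```rust
/// Reduced model of the real item: only the field this PR adds.
#[derive(Debug, Clone, PartialEq, Eq)]
struct TargetStateInfoItem {
    pending_process_token: Option<u128>,
}

/// Clears every token equal to `process_token` and returns how many were
/// cleared. Tokens from other processes are left untouched, so the
/// dead-token path can still detect them later.
fn clear_own_tokens(items: &mut [TargetStateInfoItem], process_token: u128) -> usize {
    let mut cleared = 0;
    for item in items.iter_mut() {
        if item.pending_process_token == Some(process_token) {
            item.pending_process_token = None;
            cleared += 1;
        }
    }
    cleared
}
```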

`contained_target_state_paths` is wrapped in `Arc` to avoid a full
`HashSet` rehash on every retry iteration.
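
To illustrate why the `Arc` helps, here is a small sketch assuming the set holds `String` paths. The element type and the `share_paths` helper are hypothetical. Cloning the `Arc` only bumps a reference count, whereas cloning the `HashSet` itself would rebuild the whole table.

```rust
use std::collections::HashSet;
use std::sync::Arc;

/// Hands out one handle per retry iteration. Every handle points at the
/// same underlying set; no element is copied or rehashed.
fn share_paths(paths: HashSet<String>, retries: usize) -> Vec<Arc<HashSet<String>>> {
    let shared = Arc::new(paths);
    (0..retries).map(|_| Arc::clone(&shared)).collect()
}
```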

See `specs/target_state_ownership_transfer/concurrent_preempt_race_fix.md`.
@georgeh0 georgeh0 merged commit 4421582 into main May 19, 2026
14 checks passed
@georgeh0 georgeh0 deleted the g/preempt-race-fix branch May 19, 2026 00:57