fix(engine): close concurrent preempt race in target-state ownership transfer#1994
Merged
Conversation
…transfer When two components race to own the same target state, the prior fix relied on tokio scheduling to serialize their `pre_commit` -> `sink_apply` -> `commit` sequences. That holds for LMDB (microsecond ops) but breaks under PG latency, where the loser's `delete` lands after the winner's `upsert` and the target state is lost. Add a `pending_process_token: Option<u128>` to `TargetStateInfoItem`, written by `pre_commit` whenever it queues a sink action and cleared by `commit_in_txn`'s retention pass. A detection sub-pass at the top of `pre_commit` peeks the token on every preempt-source item: live (same process token) -> return `PendingRetry` with the unconsumed declared map so `submit()` can re-invoke after a backoff; dead (crashed prior process) -> force `prev_may_be_missing=true` on reconcile. The detection sub-pass also caches old-owner `tracking_info` bytes into a per-call `HashMap<StablePath, Vec<u8>>` shared with the Phase 1 preempt branch -- each owner is read once, modified in place across multiple preempts, and emitted as a single `DeferredWrite` at the end of Phase 1. `set_provider_generation` (OnceLock-backed, can't run twice) is deferred until after the last possible PendingRetry exit. On `sink_apply` / `commit_in_txn` failure, `rollback_pending_tokens` re-reads tracking_info and clears every token matching this process's token; retried indefinitely with logged backoff until success. The process-exits-mid-rollback case is covered by the dead-token recovery path on next startup. `contained_target_state_paths` wrapped in `Arc` to avoid full HashSet rehash on every retry iteration. See `specs/target_state_ownership_transfer/concurrent_preempt_race_fix.md`.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
pending_process_token: Option<u128>toTargetStateInfoItem, written bypre_commitwhenever it queues a sink action and cleared bycommit_in_txn's retention pass. A detection sub-pass at the top ofpre_commitpeeks the token on every preempt-source item: live token in the same process → returnPreCommitOutcome::PendingRetrysosubmit()re-runspre_commitafter an exponential backoff (5ms → 200ms, max 8 retries); dead token from a crashed prior process → forceprev_may_be_missing=trueon reconcile.tracking_infobytes in a per-callHashMap<StablePath, Vec<u8>>shared between the detection sub-pass and the Phase 1 preempt branch — each owner is read once, modified in place across multiple preempts, and emitted as a singleDeferredWriteat the end of Phase 1.sink_apply/commit_in_txnfailure,rollback_pending_tokensre-reads the component's tracking_info and clears every token matching this process's startup token. Retried indefinitely with logged backoff; the process-exits-mid-rollback case is covered by the dead-token recovery branch on next startup.set_provider_generation(OnceLock-backed) is deferred past the last PendingRetry exit so it runs at most once per successful lifecycle.See
specs/target_state_ownership_transfer/concurrent_preempt_race_fix.mdfor the full design.Test plan
cargo test -p cocoindex_core --lib: 31/31 ✅python/tests/core/test_ownership_transfer.py: 8/8 across 5 consecutive runs ✅python/tests/core/: 358/358 ✅