When the lease container is partitioned by /partitionKey, PartitionSynchronizerCore.HandlePartitionGoneAsync creates child lease documents through DocumentServiceLeaseManagerCosmos.CreateLeaseIfNotExistAsync, using Guid.NewGuid().ToString() as the partitionKey value. Cosmos DB enforces id uniqueness only within a single partition key value, and every split retry (host restart mid-split, concurrent hosts, or error retries) picks a new GUID, so the conflict check in TryCreateItemAsync never sees the earlier document and a second child lease with the same id is created under a different partition key. Once duplicates exist, EqualPartitionsBalancingStrategy.CategorizeLeases throws from Dictionary.Add on every balance tick, which blocks all lease acquisition for the container until the extra documents are deleted manually (IcM 768856224).
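The failure mode described above can be reproduced without the SDK. The following is a minimal sketch, not the SDK's code: `FakeLeaseContainer` is a hypothetical in-memory stand-in that, like a container partitioned by /partitionKey, rejects a create only when the same (partitionKey, id) pair already exists. The lease id used is illustrative.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory stand-in for a lease container partitioned by /partitionKey.
// It enforces id uniqueness only within one partition key value.
public sealed class FakeLeaseContainer
{
    private readonly HashSet<(string PartitionKey, string Id)> items =
        new HashSet<(string PartitionKey, string Id)>();

    public int Count => this.items.Count;

    // Returns true if the item was created, false on a (partitionKey, id) conflict.
    public bool TryCreate(string id, string partitionKey)
    {
        return this.items.Add((partitionKey, id));
    }
}

public static class DuplicateLeaseDemo
{
    public static void Main()
    {
        FakeLeaseContainer container = new FakeLeaseContainer();
        string childLeaseId = "lease-prefix..1"; // illustrative id, not a real lease id

        // Two split attempts create the same child lease id,
        // each with a fresh GUID as the partition key value.
        bool first = container.TryCreate(childLeaseId, Guid.NewGuid().ToString());
        bool second = container.TryCreate(childLeaseId, Guid.NewGuid().ToString());

        // Both creates succeed, so the container now holds two documents with the same id.
        Console.WriteLine($"first={first}, second={second}, documents={container.Count}");

        // With a stable partition key value, the second create is rejected.
        bool third = container.TryCreate("lease-prefix..2", "stable-pk");
        bool fourth = container.TryCreate("lease-prefix..2", "stable-pk");
        Console.WriteLine($"third={third}, fourth={fourth}, documents={container.Count}");
    }
}
```

The second pair of calls shows why a stable partition key value would let the existing conflict check do its job.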
Two fixes:
1. EqualPartitionsBalancingStrategy.CategorizeLeases: tolerate duplicate CurrentLeaseToken entries (keep the first, log a warning with the conflicting ids/PKs and remediation pointer). This unblocks the load balancer even when duplicates already exist in a customer's lease container.
2. PartitionSynchronizerCore.HandlePartitionGoneAsync (both DocumentServiceLeaseCore and DocumentServiceLeaseCoreEpk overloads): pre-query existing leases once and skip CreateLeaseIfNotExistAsync for ranges whose child lease token is already present, mirroring the dedup in CreateLeasesAsync.
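The idea behind both fixes can be sketched as follows. This is an illustration under stated assumptions, not the code in this commit: `Lease` is a hypothetical simplified type (the SDK's DocumentServiceLease has more members), and `IndexByToken` and `MissingChildTokens` are made-up helper names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical simplified lease shape used only for this sketch.
public sealed class Lease
{
    public Lease(string id, string partitionKey, string currentLeaseToken)
    {
        this.Id = id;
        this.PartitionKey = partitionKey;
        this.CurrentLeaseToken = currentLeaseToken;
    }

    public string Id { get; }
    public string PartitionKey { get; }
    public string CurrentLeaseToken { get; }
}

public static class LeaseDedupSketch
{
    // Fix 1 idea: keep the first lease per token and warn about the rest instead of throwing.
    public static Dictionary<string, Lease> IndexByToken(IEnumerable<Lease> leases, Action<string> warn)
    {
        Dictionary<string, Lease> byToken = new Dictionary<string, Lease>();
        foreach (Lease lease in leases)
        {
            // Calling Dictionary.Add unconditionally would throw ArgumentException
            // on the second lease that carries the same token.
            if (byToken.TryGetValue(lease.CurrentLeaseToken, out Lease kept))
            {
                warn($"Duplicate lease for token '{lease.CurrentLeaseToken}': keeping '{kept.Id}' " +
                     $"(partitionKey '{kept.PartitionKey}'), ignoring '{lease.Id}' " +
                     $"(partitionKey '{lease.PartitionKey}').");
                continue;
            }

            byToken.Add(lease.CurrentLeaseToken, lease);
        }

        return byToken;
    }

    // Fix 2 idea: query existing leases once, then create only the child tokens that are missing.
    public static List<string> MissingChildTokens(IEnumerable<string> childTokens, IEnumerable<Lease> existingLeases)
    {
        HashSet<string> existing = new HashSet<string>(existingLeases.Select(l => l.CurrentLeaseToken));
        return childTokens.Where(token => !existing.Contains(token)).ToList();
    }
}
```

For example, given three leases with tokens "1", "1", and "2", `IndexByToken` returns two entries and invokes `warn` once. Given child tokens "1" and "2" and an existing lease with token "1", `MissingChildTokens` returns only "2".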
Tests added:
- EqualPartitionsBalancingStrategyTests.CalculateLeasesToTake_DuplicateLeaseTokens_DoesNotThrow
- PartitionSynchronizerCoreTests.HandlePartitionGoneAsync_PKRangeBasedLease_Split_DoesNotCreateDuplicateChildLeases
- PartitionSynchronizerCoreTests.HandlePartitionGoneAsync_EpkBasedLease_Split_DoesNotCreateDuplicateChildLeases
- PartitionSynchronizerCoreTests.HandlePartitionGoneAsync_PKRangeBasedLease_Split_CreatesOnlyMissingChildLeases
- Microsoft.Azure.Cosmos.EmulatorTests.ChangeFeed.DuplicateLeaseRegressionTests (emulator-based end-to-end regression)
Existing HandlePartitionGoneAsync tests were updated to mock GetAllLeasesAsync (via a CreateEmptyLeaseContainer helper) since the synchronizer now consults the lease container before creating children.
Out of scope: deriving the partition key deterministically from LeaseToken as the long-term fix. That change is backward-incompatible for existing lease containers (customers have GUID-based partition keys on all current lease documents) and needs a migration story, so it is deferred to follow-up work.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
"Duplicate lease document detected for lease token '{0}'. Keeping lease with id '{1}' (partitionKey '{2}') and ignoring duplicate with id '{3}' (partitionKey '{4}'). To fully resolve, delete the duplicate lease document(s) (same id, different partitionKey values) from the lease container; if feasible, migrate the lease container to /id partitioning which avoids this condition entirely.",