Fix IndexOutOfRangeException Observed in AddressEnumerator::MoveFailedReplicasToTheEnd() #5090

Closed
dibahlfi wants to merge 1 commit into master from users/dibahl/addressEnumeratorFix
@dibahlfi
Member

The IndexOutOfRangeException is caused by a race condition between writes to and reads from the shared FailedEndpoints collection, which is instantiated in the DocumentServiceRequestContext class. Multiple threads appear to access it at the same time: one adds to it in the AddToFailedEndpoints() method of DocumentServiceRequestContext, while another reads from it in AddressEnumerator through GetEffectiveStatus(), which is called within MoveFailedReplicasToTheEnd(). The concurrent read is what throws the IndexOutOfRangeException. We need to move to a thread-safe data structure.
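
To illustrate the kind of change described above, here is a minimal sketch rather than the actual SDK code. It assumes a ConcurrentDictionary used as a thread-safe set, as discussed later in this thread. Uri stands in for TransportAddressUri, and the FailedEndpointsSketch class and its IsFailed and Run methods are hypothetical names introduced only for this example.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public static class FailedEndpointsSketch
{
    // Hypothetical stand-in for the shared FailedEndpoints collection.
    // Uri replaces the SDK's TransportAddressUri; the bool value is unused
    // because the dictionary only serves as a concurrent set of keys.
    private static readonly Lazy<ConcurrentDictionary<Uri, bool>> FailedEndpoints =
        new Lazy<ConcurrentDictionary<Uri, bool>>(
            () => new ConcurrentDictionary<Uri, bool>());

    public static void AddToFailedEndpoints(Uri endpoint)
    {
        // TryAdd is atomic; adding a key that already exists is a no-op.
        FailedEndpoints.Value.TryAdd(endpoint, true);
    }

    public static bool IsFailed(Uri endpoint)
    {
        // ContainsKey is safe to call while other threads are adding keys.
        return FailedEndpoints.IsValueCreated
            && FailedEndpoints.Value.ContainsKey(endpoint);
    }

    // Writes and reads from many threads at once, then returns the number
    // of distinct endpoints recorded.
    public static int Run()
    {
        Parallel.For(0, 1000, i =>
        {
            Uri endpoint = new Uri($"rntbd://replica-{i % 10}.example:14000/");
            AddToFailedEndpoints(endpoint);
            IsFailed(endpoint);
        });

        return FailedEndpoints.Value.Count;
    }

    public static void Main()
    {
        Console.WriteLine(Run());
    }
}
```

HashSet<T> does not support concurrent reads and writes, which is why the sketch swaps it for a concurrent collection instead of adding a lock around each access. Either approach would remove the race; the concurrent collection keeps the call sites unchanged apart from Contains becoming ContainsKey.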

  • [] Bug fix (non-breaking change which fixes an issue)

Closing issues

To automatically close an issue: closes #5046

@github-actions github-actions Bot left a comment

Please follow the required format: "[Internal] Category: (Adds|Fixes|Refactors|Removes) Description"

Internal should be used for PRs that have no customer impact. This flag is used to help generate the changelog to know which PRs should be included. Examples:
Diagnostics: Adds GetElapsedClientLatency to CosmosDiagnostics
PartitionKey: Fixes null reference when using default(PartitionKey)
[v4] Client Encryption: Refactors code to external project
[Internal] Query: Adds code generator for CosmosNumbers for easy additions in the future.

@kundadebdatta
Member

kundadebdatta commented Mar 28, 2025

For shipping, let's keep both methods as overloads. See the example below:

       private static void SetTransportAddressUrisToUnhealthy(
           PartitionAddressInformation stalePartitionAddressInformation,
           Lazy<HashSet<TransportAddressUri>> failedEndpoints)
       {
           if (stalePartitionAddressInformation == null ||
               failedEndpoints == null ||
               !failedEndpoints.IsValueCreated)
           {
               return;
           }

           IReadOnlyList<TransportAddressUri> perProtocolPartitionAddressInformation = stalePartitionAddressInformation.Get(Protocol.Tcp)?.ReplicaTransportAddressUris;
           if (perProtocolPartitionAddressInformation == null)
           {
               return;
           }

           foreach (TransportAddressUri failed in perProtocolPartitionAddressInformation)
           {
               if (failedEndpoints.Value.Contains(failed))
               {
                   failed.SetUnhealthy();
               }
           }
       }

       private static void SetTransportAddressUrisToUnhealthy(
           PartitionAddressInformation stalePartitionAddressInformation,
           Lazy<ConcurrentDictionary<TransportAddressUri, bool>> failedEndpoints)
       {
           if (stalePartitionAddressInformation == null ||
               failedEndpoints == null ||
               !failedEndpoints.IsValueCreated)
           {
               return;
           }

           IReadOnlyList<TransportAddressUri> perProtocolPartitionAddressInformation = stalePartitionAddressInformation.Get(Protocol.Tcp)?.ReplicaTransportAddressUris;
           if (perProtocolPartitionAddressInformation == null)
           {
               return;
           }

           foreach (TransportAddressUri failed in perProtocolPartitionAddressInformation)
           {
               if (failedEndpoints.Value.ContainsKey(failed))
               {
                   failed.SetUnhealthy();
               }
           }
       }

Once this is merged to master, let's cherry-pick the v3 master commit to OSS and create a new OSS release to use the OSS commit in the msdata PR to unblock the build failures from OSS.

Cc: @kirankumarkolli

Development

Successfully merging this pull request may close these issues.

Fundamentals: Troubleshoot, Identify and Fix IndexOutOfRangeException Observed in AddressEnumerator::MoveFailedReplicasToTheEnd() Method

2 participants