ServiceUnavailable "client resource starvation" on cross-partition queries with large continuation tokens #5881

@bkobel

Describe the bug
Cross-partition queries in direct mode fail with ServiceUnavailable (503) on the second and subsequent pages of a FeedIterator.

To Reproduce
Run a paged FeedIterator query in direct mode without setting ResponseContinuationTokenLimitInKb (or with it set above 10):

var query = new QueryDefinition("SELECT * FROM c WHERE c.family = @family")
    .WithParameter("@family", "simple-objects");

using var iterator = container.GetItemQueryIterator<Item>(
    query,
    requestOptions: new QueryRequestOptions
    {
        // ResponseContinuationTokenLimitInKb = 10  // <-- omit, or set > 10, to reproduce
    });

while (iterator.HasMoreResults)
{
    var page = await iterator.ReadNextAsync(); // throws on 2nd page
}

I was able to reproduce it only with a WHERE clause in the query: SELECT * FROM c successfully retrieves all pages, whereas SELECT * FROM c WHERE c.family = @family fails with the exception shown under "Actual behavior" when fetching the second page. Reproduced across different machines, operating systems and ISPs.

Only reproducible against a large container (millions of documents); smaller containers with the same schema and a subset of the data do not exhibit the issue.
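
For comparison, here is a sketch of the same query with the continuation token capped. It assumes, based only on the comment in the repro above, that a limit of 10 KB avoids the failure; this is not a confirmed fix. The Item class and the CappedTokenQuery and ContinuationTokenSizeKb names are hypothetical, and the logged size of page.ContinuationToken is only a rough proxy for what the SDK sends on the wire.

```csharp
using System;
using System.Text;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

// Hypothetical item shape; the issue does not show the real Item type.
public class Item
{
    public string id { get; set; }
    public string family { get; set; }
}

public static class CappedTokenQuery
{
    // Size of a continuation token in KB when encoded as UTF-8 (0 for a null token).
    public static double ContinuationTokenSizeKb(string token) =>
        token == null ? 0 : Encoding.UTF8.GetByteCount(token) / 1024.0;

    public static async Task RunAsync(Container container)
    {
        var query = new QueryDefinition("SELECT * FROM c WHERE c.family = @family")
            .WithParameter("@family", "simple-objects");

        using var iterator = container.GetItemQueryIterator<Item>(
            query,
            requestOptions: new QueryRequestOptions
            {
                // Assumption: per the repro comment, a 10 KB cap avoids the failure.
                ResponseContinuationTokenLimitInKb = 10
            });

        var pageNumber = 0;
        while (iterator.HasMoreResults)
        {
            FeedResponse<Item> page = await iterator.ReadNextAsync();
            pageNumber++;

            // Log the token size per page to see how it grows between pages.
            Console.WriteLine(
                $"page {pageNumber}: {page.Count} items, " +
                $"token ~{ContinuationTokenSizeKb(page.ContinuationToken):F1} KB");
        }
    }
}
```

Running the same loop with the limit removed and comparing the logged token sizes may show whether the failure correlates with the token growing past a particular size.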

Expected behavior
All pages are fetched without exceptions.

Actual behavior

Microsoft.Azure.Cosmos.CosmosException: Response status code does not indicate success: ServiceUnavailable (503); Substatus: 20001; ActivityId: c4776d10-baa3-4e72-b460-9154c51d1fc1;
Reason: (The request failed because the client was unable to establish connections to 4 endpoints across 1 regions.
        Please check for client resource starvation issues and verify connectivity between client and server.
        More info: https://aka.ms/cosmosdb-tsg-service-unavailable ...);
 ---> GoneException: The requested resource is no longer available at the server.
 ---> TransportException: A client transport error occurred: The connection failed. ... error code: ConnectionBroken [0x0012] ... payload sent: True
 ---> TransportException: The remote system closed the connection. ... error code: ReceiveStreamClosed [0x0011]
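
The fields in the message above (status code, substatus, ActivityId) and the full diagnostics can be captured at the point of failure. A minimal sketch, assuming the iterator from the repro; the QueryFailureLogging, FormatFailure and DrainAsync names are hypothetical:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public static class QueryFailureLogging
{
    // One-line summary built from the status code, substatus and ActivityId.
    public static string FormatFailure(CosmosException ex) =>
        $"{(int)ex.StatusCode}/{ex.SubStatusCode} activity={ex.ActivityId}";

    // Drains the iterator; on failure, logs the summary and the full
    // diagnostics string, then rethrows so the caller still sees the error.
    public static async Task DrainAsync<T>(FeedIterator<T> iterator)
    {
        var pageNumber = 0;
        try
        {
            while (iterator.HasMoreResults)
            {
                pageNumber++; // counts the page being requested
                await iterator.ReadNextAsync();
            }
        }
        catch (CosmosException ex)
        {
            Console.Error.WriteLine($"page {pageNumber} failed: {FormatFailure(ex)}");
            Console.Error.WriteLine(ex.Diagnostics?.ToString());
            throw;
        }
    }
}
```

Usage would be `await QueryFailureLogging.DrainAsync(iterator);` in place of the while loop in the repro.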

Environment summary
SDK Version: Microsoft.Azure.Cosmos v3.35.4, v3.60.0
Ubuntu 24.04 (host), Windows 11

Additional context
Every direct (RNTBD) call to the partition fails identically; every gateway call (HTTPS:443) succeeds. Every failure reports payload sent: True.

Cosmos Diagnostics summary:

DirectCalls: { "(410, 20001)": 80 }
GatewayCalls: { "(200, 0)": 26 }
duration: 31189 ms
System Info: CPU 7-14%, isThreadStarving: False, availableThreads: 32763+, openTcpConnections: 3-4
ContactedReplicas: []
FailedReplicas: [all 4 replicas of the partition]
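
Since only the direct (RNTBD) calls fail, one way to narrow this down might be to run the identical query through a client in gateway mode. The sketch below is a diagnostic step, not a fix, and whether the query pages successfully in gateway mode has not been tested here. The GatewayComparison class and its method names are hypothetical; connectionString is a placeholder for the account's connection string.

```csharp
using Microsoft.Azure.Cosmos;

public static class GatewayComparison
{
    // Client options that route all requests over HTTPS through the gateway
    // instead of direct (RNTBD) connections to the replicas.
    public static CosmosClientOptions BuildGatewayOptions() =>
        new CosmosClientOptions { ConnectionMode = ConnectionMode.Gateway };

    // Builds a client for the comparison run; dispose it when done.
    public static CosmosClient CreateGatewayClient(string connectionString) =>
        new CosmosClient(connectionString, BuildGatewayOptions());
}
```

If the second page succeeds with this client and the same query and request options, that would point at the direct transport path rather than the query itself.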
