[@azure/cosmos] Reuse shared partition key range cache for cross-partition queries #39144

Open
amanrao23 wants to merge 1 commit into main from amanrao23/cosmos-pkrange-cache-reuse

@amanrao23 (Member)

Problem

Every cross-partition query issues a redundant GET /pkranges metadata call instead of reusing the cache. CosmosClient keeps a long-lived shared PartitionKeyRangeCache on ClientContext (used by reads, bulk, change feed), but SmartRoutingMapProvider — the helper that resolves overlapping ranges for parallel/ORDER BY queries — constructed its own cache, and a new provider is created per query. Each provider started cold and re-fetched. Hybrid queries are worst-hit: the global-statistics query plus each component query spun up its own cold cache, so one hybrid query triggered several /pkranges fetches.

Impact: an extra metadata round-trip and added latency on every cross-partition query, so the cost scales with query volume.
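
The cold-cache pattern described above can be sketched in a few lines of TypeScript. This is an illustration, not the SDK's code: `RangeCache`, `Provider`, `fetchRanges`, and `runQueries` are hypothetical stand-ins for PartitionKeyRangeCache, SmartRoutingMapProvider, the GET /pkranges call, and a loop of sequential queries.

```typescript
// Hypothetical stand-ins; none of these names are the SDK's real types.
type RangeMap = string[];

let pkRangeFetches = 0;

// Stand-in for the GET /pkranges metadata call.
async function fetchRanges(collectionId: string): Promise<RangeMap> {
  pkRangeFetches++;
  return [`${collectionId}:0`, `${collectionId}:1`];
}

class RangeCache {
  private readonly maps = new Map<string, RangeMap>();

  async get(collectionId: string): Promise<RangeMap> {
    const cached = this.maps.get(collectionId);
    if (cached !== undefined) return cached; // warm: no network call
    const fresh = await fetchRanges(collectionId);
    this.maps.set(collectionId, fresh);
    return fresh;
  }
}

class Provider {
  private readonly cache: RangeCache;

  // With no argument the provider builds its own cold cache (old behavior);
  // passing a cache models reuse of the long-lived shared one.
  constructor(cache: RangeCache = new RangeCache()) {
    this.cache = cache;
  }

  getRanges(collectionId: string): Promise<RangeMap> {
    return this.cache.get(collectionId);
  }
}

// Creates one provider per query and returns how many fetches were issued.
async function runQueries(count: number, shared?: RangeCache): Promise<number> {
  pkRangeFetches = 0;
  for (let i = 0; i < count; i++) {
    await new Provider(shared).getRanges("coll");
  }
  return pkRangeFetches;
}
```

Under these assumptions, `await runQueries(3)` returns 3 (one fetch per query), while `await runQueries(3, new RangeCache())` returns 1.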

Fix

  • SmartRoutingMapProvider now uses the shared clientContext.partitionKeyRangeCache. Parallel, ORDER BY, and hybrid all share one warm cache.
  • Split recovery uses an explicit forceRefresh instead of allocating a fresh provider.
  • Made the shared cache failure-safe (partitionKeyRangeCache.ts): concurrent fetches (cold or forceRefresh) dedupe to one request; the map is published only on success, so a transient failure no longer poisons later lookups (the next call retries) and a failed forceRefresh keeps serving the last known-good map.
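
A minimal sketch of the failure-safe behavior in the last bullet, assuming a simplified cache keyed by collection id. The class, field, and parameter names (`FailureSafeRangeCache`, `resolved`, `pending`, `fetcher`) are hypothetical, and the real partitionKeyRangeCache.ts may be structured differently.

```typescript
// Hypothetical sketch; names are illustrative, not the SDK's.
type PkRangeMap = string[];
type Fetcher = (collectionId: string) => Promise<PkRangeMap>;

class FailureSafeRangeCache {
  // Known-good maps, written only after a fetch succeeds.
  private readonly resolved = new Map<string, PkRangeMap>();
  // In-flight fetches, so concurrent callers share one request.
  private readonly pending = new Map<string, Promise<PkRangeMap>>();
  private readonly fetcher: Fetcher;

  constructor(fetcher: Fetcher) {
    this.fetcher = fetcher;
  }

  async get(collectionId: string, forceRefresh = false): Promise<PkRangeMap> {
    const known = this.resolved.get(collectionId);
    if (known !== undefined && !forceRefresh) return known;

    let inFlight = this.pending.get(collectionId);
    if (inFlight === undefined) {
      const request = this.fetchAndPublish(collectionId);
      this.pending.set(collectionId, request);
      // Drop the pending entry on success or failure, so a rejected
      // promise is never served to later callers.
      const clear = (): void => {
        this.pending.delete(collectionId);
      };
      request.then(clear, clear);
      inFlight = request;
    }

    try {
      return await inFlight;
    } catch (err) {
      // A failed forced refresh falls back to the last known-good map.
      if (known !== undefined) return known;
      throw err;
    }
  }

  private async fetchAndPublish(collectionId: string): Promise<PkRangeMap> {
    const map = await this.fetcher(collectionId);
    this.resolved.set(collectionId, map); // publish only on success
    return map;
  }
}
```

Keeping resolved maps and in-flight promises in separate maps is what lets a failed refresh leave the previous map untouched: the failure only ever affects the pending entry, which is cleared either way.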

Testing

  • test/internal/unit/partitionKeyRangeCache.spec.ts (5 tests): dedupe, cache hit, evict-on-failure, keep-prior-on-failed-refresh, dedupe concurrent forceRefresh.
  • test/public/functional/partitionKeyRangeCacheReuse.spec.ts: counts /pkranges requests — 1 with the fix, 4 without.
  • Full unit suite (497) + split integration tests pass. No public API change.

Cross-partition queries re-fetched /pkranges on every query because SmartRoutingMapProvider built its own cache and was recreated per query. Use the shared ClientContext cache (hybrid queries previously fetched per component query). Make the cache failure-safe: dedupe concurrent fetches, evict on failure so transient errors don't poison later lookups, and keep the last known-good map until a forceRefresh succeeds.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@amanrao23 amanrao23 requested a review from aditishree1 as a code owner June 30, 2026 12:50
Copilot AI review requested due to automatic review settings June 30, 2026 12:50
@amanrao23 amanrao23 changed the title [cosmos] Reuse shared partition key range cache for cross-partition queries [@azure/cosmos] Reuse shared partition key range cache for cross-partition queries Jun 30, 2026

Copilot AI (Contributor) left a comment

Pull request overview

This PR eliminates a redundant GET /pkranges metadata round-trip on every cross-partition query in @azure/cosmos. Previously SmartRoutingMapProvider constructed its own PartitionKeyRangeCache and was recreated per query, so every parallel/ORDER BY/hybrid query started with a cold cache. The provider now reuses the long-lived clientContext.partitionKeyRangeCache that reads, bulk, and change feed already share. The shared cache is also hardened to dedupe concurrent fetches and to avoid cache poisoning on transient failures.

Changes:

  • SmartRoutingMapProvider reuses clientContext.partitionKeyRangeCache and threads a new forceRefresh parameter through getOverlappingRanges; split recovery in parallelQueryExecutionContextBase now passes forceRefresh = true instead of allocating a fresh provider.
  • PartitionKeyRangeCache separates known-good maps from in-flight fetches: concurrent lookups (cold or forced) dedupe to one request, the map is published only on success, and a failed refresh keeps serving the last known-good map.
  • Adds unit tests for dedupe/eviction/forced-refresh behavior and a functional test asserting /pkranges is fetched once across repeated cross-partition queries; updates MockedClientContext to expose the shared cache; adds a changelog entry.
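
The split-recovery change in the first bullet can be sketched as follows. Only the names `getOverlappingRanges` and `forceRefresh` come from the PR text; the simplified signature, the `PkRange` shape, `RoutingProvider`, `isGoneError`, and `queryWithSplitRecovery` are hypothetical. The sketch also assumes that a split surfaces as an error whose `code` is 410 (Gone).

```typescript
// Hypothetical shapes; the signature is simplified for illustration.
interface PkRange {
  id: string;
  min: string; // inclusive
  max: string; // exclusive
}

interface RangeCacheLike {
  get(collectionId: string, forceRefresh?: boolean): Promise<PkRange[]>;
}

class RoutingProvider {
  private readonly cache: RangeCacheLike;

  constructor(cache: RangeCacheLike) {
    this.cache = cache;
  }

  // forceRefresh is forwarded to the shared cache unchanged.
  async getOverlappingRanges(
    collectionId: string,
    min: string,
    max: string,
    forceRefresh = false,
  ): Promise<PkRange[]> {
    const all = await this.cache.get(collectionId, forceRefresh);
    return all.filter((r) => r.min < max && min < r.max);
  }
}

// Assumption: a stale routing map shows up as an error with code 410.
function isGoneError(err: unknown): boolean {
  return typeof err === "object" && err !== null && (err as { code?: unknown }).code === 410;
}

async function queryWithSplitRecovery<T>(
  provider: RoutingProvider,
  collectionId: string,
  min: string,
  max: string,
  execute: (ranges: PkRange[]) => Promise<T>,
): Promise<T> {
  const ranges = await provider.getOverlappingRanges(collectionId, min, max);
  try {
    return await execute(ranges);
  } catch (err) {
    if (!isGoneError(err)) throw err;
    // Same provider, same cache: only the stale map is refetched.
    const refreshed = await provider.getOverlappingRanges(collectionId, min, max, true);
    return execute(refreshed);
  }
}
```

The design point is that recovery no longer discards the provider: the forced refresh updates the shared cache, so other queries on the same client also see the post-split ranges.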

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| src/routing/partitionKeyRangeCache.ts | Adds pendingByCollectionId dedup map; publishes resolved maps only on success and clears pending in finally for failure-safety. |
| src/routing/smartRoutingMapProvider.ts | Switches to import type, reuses the shared client cache, and forwards a new forceRefresh flag. |
| src/queryExecutionContext/parallelQueryExecutionContextBase.ts | Split recovery reuses the existing provider with forceRefresh = true instead of constructing a new provider. |
| test/internal/unit/partitionKeyRangeCache.spec.ts | New unit tests: dedupe, cache hit, evict-on-failure, keep-prior-on-failed-refresh, concurrent forceRefresh dedupe. |
| test/public/functional/partitionKeyRangeCacheReuse.spec.ts | New functional test counting /pkranges requests to verify cache reuse across queries. |
| test/public/common/MockClientContext.ts | Mock now exposes a real partitionKeyRangeCache so provider/cache tests share one instance. |
| CHANGELOG.md | Documents the redundant-fetch fix and failure-safety improvement under Bugs Fixed. |
