Diagnostics: Fixes GetQueryMetrics returning null inside custom RequestHandler#5879
Conversation
|
There is a slight change in the behavior of DiagnosticString where previously it was aggregating as fanned out queries would come back (it would aggregate as they would come back), where now only the current response in handler will expose only it's own details, which I believe is the right approach. This is noted in changelog.md as well. |
|
@copilot resolve the merge conflicts in this pull request |
| // makes `response.Diagnostics.GetQueryMetrics()` work inside handlers | ||
| // (issue #5117) and prevents cross-partition fan-out responses from | ||
| // leaking siblings' metrics into each handler invocation. | ||
| this.Value = trace; |
There was a problem hiding this comment.
🔴 Blocking · Correctness: Public API Regression
response.Diagnostics is silently narrowed for every point-op / typed response, not just inside handlers
Removing the while (rootTrace.Parent != null) climb here, combined with the pre-existing request.Trace = processMessageAsyncTrace; in TransportHandler.ProcessMessageAsync (never restored), means ResponseMessage.Trace is always the transport-leaf subtree. Verified call-chain at HEAD:
ResponseMessage.cs:56—this.Trace = requestMessage?.Trace ?? NoOpTrace.SingletonResponseMessage.cs:177— theDiagnosticsgetter wrapsthis.Tracewith this new ctor every callCosmosResponseFactoryCore.cs— all 11 typed responses (ItemResponse<T>,DatabaseResponse,ContainerResponse,PermissionResponse,UserResponse,ThroughputResponse,StoredProcedureResponse,TriggerResponse,UserDefinedFunctionResponse,ClientEncryptionKeyResponse,ReadFeedResponse<T>) passcosmosResponseMessage.Diagnosticsstraight through.
After this PR, itemResponse.Diagnostics.ToString() / .GetClientElapsedTime() / .GetContactedRegions() from any ReadItemAsync / CreateItemAsync / ReplaceItemAsync / UpsertItemAsync / DeleteItemAsync / Database / Container / Throughput / StoredProcedure / Trigger / UDF operation, plus ReadFeedResponse<T> from non-query iterators, sees only the transport subtree — missing client-config setup, address resolution, gateway lookups, retry context, and intermediate handler datums.
The PR description claims "User-facing responses (FeedResponse from ReadNextAsync, Response from point operations) are constructed with the operation-root trace". This is only true for QueryIterator (verified: QueryIterator.cs:206/254/285 call QueryResponse.CreateSuccess(trace: trace) with the iterator-root). It is false for every point-op Response<T> because cosmosResponseMessage.Diagnostics is computed from the mutated transport trace, not the operation root.
Action: Either (a) restore request.Trace after the using block in TransportHandler.ProcessMessageAsync, or (b) build a separate, scoped CosmosTraceDiagnostics inside the handler block and expose only that to handlers — keeping the public ResponseMessage.Diagnostics rooted at the operation. Option (b) is preferred because it solves #5117 without touching shared semantics.
| // makes `response.Diagnostics.GetQueryMetrics()` work inside handlers | ||
| // (issue #5117) and prevents cross-partition fan-out responses from | ||
| // leaking siblings' metrics into each handler invocation. | ||
| this.Value = trace; |
There was a problem hiding this comment.
🔴 Blocking · Correctness: Public API Regression
CosmosException.Diagnostics is silently narrowed for transport-thrown exceptions
At CosmosExceptions/CosmosException.cs:42: this.Diagnostics = new CosmosTraceDiagnostics(this.Trace ?? NoOpTrace.Singleton); — and this.Trace for transport-thrown exceptions is the same mutated transport-leaf trace described in the previous comment. Customers (and Azure Monitor exporters) rely on ex.Diagnostics.ToString() to surface the full operation context (gateway, address resolution, retries) for triage. After this PR they get only the leaf transport subtree.
This is not mentioned in changelog.md. Either fix (preferred — via option (b) on the sibling comment) or add an explicit "Breaking diagnostics scope" entry. The silent-failure path here (existing telemetry pipelines suddenly losing operation-root context with no warning) is materially worse than the bug being fixed.
| queryMetricsText, | ||
| IndexUtilizationInfo.Empty, | ||
| ClientSideMetrics.Empty))); | ||
| processMessageAsyncTrace.AddDatum(TraceDatumKeys.QueryMetrics, queryMetricsDatum); |
There was a problem hiding this comment.
🔴 Blocking · Correctness: Missing Guard
TransportHandler attaches QueryMetrics for any operation type with a QueryMetricsText header
TransportHandler runs for every operation type (point read/write, batch, store procedure, etc.). The new code gates only on !string.IsNullOrEmpty(responseMessage?.Headers?.QueryMetricsText). If a non-query operation ever surfaces x-ms-documentdb-query-metrics (planned hybrid ops, query-plan calls, server-side leaks), it will receive a misleading QueryMetricsTraceDatum, and downstream consumers like MetricsAccumulator will start aggregating non-query data into query metrics.
Add an operation-type guard:
| processMessageAsyncTrace.AddDatum(TraceDatumKeys.QueryMetrics, queryMetricsDatum); | |
| // Attach query metrics to the trace at the transport layer so they are | |
| // discoverable from custom RequestHandlers as the response unwinds back | |
| // through the handler pipeline (issue #5117). For cross-partition queries | |
| // each partition response adds its metrics to its own transport trace, | |
| // giving handlers a strict per-partition view via response.Diagnostics. | |
| if ((request.OperationType == Documents.OperationType.Query | |
| || request.OperationType == Documents.OperationType.SqlQuery) | |
| && !string.IsNullOrEmpty(responseMessage?.Headers?.QueryMetricsText)) | |
| { | |
| string queryMetricsText = responseMessage.Headers.QueryMetricsText; | |
| QueryMetricsTraceDatum queryMetricsDatum = new QueryMetricsTraceDatum( | |
| new Lazy<QueryMetrics>(() => new QueryMetrics( | |
| queryMetricsText, | |
| IndexUtilizationInfo.Empty, | |
| ClientSideMetrics.Empty))); | |
| processMessageAsyncTrace.AddDatum(TraceDatumKeys.QueryMetrics, queryMetricsDatum); | |
| } |
Past PR #5266 (the issue that motivated centralizing the TraceDatumKeys.QueryMetrics key) had to fix a similar over-attach issue — keeping this tightly scoped avoids reopening that class of bugs.
| using System.Linq; | ||
| using System.Threading; | ||
| using System.Threading.Tasks; | ||
| using Microsoft.Azure.Cosmos.Query.Core.Metrics; |
There was a problem hiding this comment.
🟡 Recommendation · Maintainability: Layering
Handlers now depends on Query.Core.Metrics
Microsoft.Azure.Cosmos.Handlers previously had no dependency on Microsoft.Azure.Cosmos.Query.Core. With this using the transport handler becomes query-aware, which couples a generic transport layer to a specific feature area.
Consider one of:
- Move the datum construction to a small adapter in the
Query.Corenamespace (e.g.QueryMetricsTraceDatum.TryAttach(processMessageAsyncTrace, headers)) that the handler calls — preserves layering and gives you a single seam to extend forIndexUtilizationTextandQueryAdviceTextlater. - Inject an
IQueryMetricsExtractorvia constructor so the transport handler stays unaware of the query model.
Either preserves the existing architectural boundary that has kept Handlers thin.
| queryMetricsText, | ||
| IndexUtilizationInfo.Empty, | ||
| ClientSideMetrics.Empty))); | ||
| processMessageAsyncTrace.AddDatum(TraceDatumKeys.QueryMetrics, queryMetricsDatum); |
There was a problem hiding this comment.
🟡 Recommendation · Maintainability: Comment / Invariant
CosmosDistributedQueryClient.CreatePage still attaches TraceDatumKeys.QueryMetrics
The thin-client / distributed-gateway path bypasses TransportHandler (uses DocumentClient.ExecuteQueryAsync directly), so there is no double-attach in normal flow today. But the comment that used to live in CosmosQueryClientCore.GetCosmosElementResponse ("attach the datum once per query roundtrip") no longer holds globally — there are now two unrelated attach sites for the same key.
Please:
- Add a code comment on both attach sites describing which transport each owns (REST/RNTBD via
TransportHandlervs distributed-gateway viaCosmosDistributedQueryClient). - Consider a lightweight
Debug.Assert(!processMessageAsyncTrace.Data.ContainsKey(TraceDatumKeys.QueryMetrics))here to catch future cross-wiring early.
Otherwise the next person to wire a new query transport will silently double-attach and break MetricsAccumulator aggregation.
| Assert.AreEqual(20, totalItems); | ||
|
|
||
| Assert.IsTrue( | ||
| partitionCountsObservedInHandler.Count >= 2, |
There was a problem hiding this comment.
🟡 Recommendation · Reliability: Test Flakiness
Cross-partition test relies on the emulator actually fanning out
Assert.IsTrue(partitionCountsObservedInHandler.Count >= 2, ...) assumes the emulator has actually fanned out across 2 physical partitions. On low-RU containers, single-physical-partition emulators, or future emulator versions with different split heuristics, this can fail intermittently in CI without indicating a real product bug.
The container is provisioned at 15000 RU here, which usually splits, but it is not guaranteed across emulator versions. Two safer options:
- Replace the partition-count assertion with the invariant the PR is actually proving — "each handler invocation observed exactly its own partition's metrics" — combined with a check that the per-page aggregated count 1 (which is true even on a single partition).
- Force a split via
CreateContainerIfNotExistsAsync(... throughput: 20000)plus a deterministic ingest pattern, and gate the test on confirming a split actually happened (pollGetFeedRangesAsync().Count >= 2with a timeout, fail withAssert.Inconclusiverather thanAssert.Failif the split did not occur).
Either preserves the test's value without making CI noisy.
| !string.IsNullOrEmpty(response.Headers?.QueryMetricsText)) | ||
| { | ||
| queryResponseCallbackCount++; | ||
| metricsFromHandler = response.Diagnostics.GetQueryMetrics(); |
There was a problem hiding this comment.
🟡 Recommendation · Concurrency: Test Race
metricsFromHandler has a last-writer-wins race in the cross-partition test
metricsFromHandler is assigned from inside the handler callback without any synchronization. For the single-partition test this is mostly fine because there is at most one query response, but if retries or background SDK round-trips cause the handler to be invoked multiple times the assertion Assert.IsNotNull(metricsFromHandler, ...) could still pass while the value reflects an unrelated retry's metrics.
For the cross-partition sibling test where MaxConcurrency = -1 makes concurrent handler invocations the norm, the pattern (partitionCountsObservedInHandler inside a lock) already does this correctly — apply the same pattern here:
| metricsFromHandler = response.Diagnostics.GetQueryMetrics(); | |
| ConcurrentBag<ServerSideCumulativeMetrics> handlerMetrics = new ConcurrentBag<ServerSideCumulativeMetrics>(); | |
| RequestHandlerHelper testHandler = new RequestHandlerHelper | |
| { | |
| CallBackOnResponse = (request, response) => | |
| { | |
| if (request.OperationType == OperationType.Query && | |
| response.IsSuccessStatusCode && | |
| !string.IsNullOrEmpty(response.Headers?.QueryMetricsText)) | |
| { | |
| handlerMetrics.Add(response.Diagnostics.GetQueryMetrics()); | |
| } |
Then assert handlerMetrics.Count 1 and pick the first element. Same fix pattern PR #5361 had to retrofit for thread safety on the trace itself.
| queryMetricsText, | ||
| IndexUtilizationInfo.Empty, | ||
| ClientSideMetrics.Empty))); | ||
| processMessageAsyncTrace.AddDatum(TraceDatumKeys.QueryMetrics, queryMetricsDatum); |
There was a problem hiding this comment.
🟡 Recommendation · Maintainability: Asymmetric Header Handling
IndexUtilizationText and QueryAdviceText are not attached symmetrically
The pre-PR CosmosQueryClientCore.GetCosmosElementResponse only attached QueryMetrics so this PR is consistent with the prior behaviour. But moving the attach point to the transport layer is the right time to also surface the sibling headers:
x-ms-cosmos-index-utilization→IndexUtilizationInfox-ms-cosmos-query-advice→QueryAdvice
Both are already exposed via QueryMetrics aggregation downstream, so users invoking response.Diagnostics.GetQueryMetrics() inside a handler still get an empty IndexUtilizationInfo / no advice — defeating part of the diagnostic value the PR is trying to surface.
A small follow-up issue is fine; calling it out so it does not get forgotten.
| queryMetricsText, | ||
| IndexUtilizationInfo.Empty, | ||
| ClientSideMetrics.Empty))); | ||
| processMessageAsyncTrace.AddDatum(TraceDatumKeys.QueryMetrics, queryMetricsDatum); |
There was a problem hiding this comment.
🟢 Suggestion · Maintainability: Design Praise (conditional)
Per-partition trace as the isolation unit is elegant
Once the universal-narrowing regression (sibling blocking comments) is addressed, attaching QueryMetrics at processMessageAsyncTrace (one per round-trip, one per partition for fan-out) cleanly maps "one round-trip = one metrics datum". That is a nicer mental model than the previous CosmosQueryClientCore-level attach, and naturally gives handlers a strict per-partition view without any extra plumbing. Worth keeping.
|
|
||
| #### Bugs Fixed | ||
|
|
||
| - Diagnostics: Fixes `response.Diagnostics.GetQueryMetrics()` returning null when called inside a custom `RequestHandler.SendAsync` for query operations. Query metrics are now attached to the transport-layer trace, so they are visible to handlers as the response unwinds through the pipeline. For cross-partition queries, each handler invocation now sees only its own partition's metrics (strict per-partition isolation) while the iterator-level `FeedResponse.Diagnostics.GetQueryMetrics()` continues to return the aggregated view across all partitions. As a related scope change, inside a custom handler `response.Diagnostics.ToString()` and `response.Diagnostics.GetClientElapsedTime()` now reflect only the transport request subtree rather than the full operation tree; calls made after the SDK operation returns are unchanged. |
There was a problem hiding this comment.
💬 Observation · Maintainability: Changelog Completeness
Changelog only mentions handler-scope; missing the broader impact
The entry mentions the handler-scope change but does not mention:
Response<T>.Diagnosticsfor all point operations andReadFeedResponse<T>for non-query iterators is also narrowed (see the blocking comment onCosmosTraceDiagnostics.cs).CosmosException.Diagnosticsis narrowed for transport-thrown exceptions.
Either:
- Adopt option (b) of the suggested fix (build a scoped diagnostics object only for handler consumption) and the changelog as-written becomes accurate.
- Or keep this PR's design and explicitly call out the public-API behaviour change in the changelog, including
CosmosException.Diagnostics— this is a breaking change in diagnostics output for telemetry consumers and should be discoverable in release notes.
…stHandler Fixes Azure#5117. `response.Diagnostics.GetQueryMetrics()` previously returned `null` when invoked inside a custom `RequestHandler.SendAsync` for query operations. The `QueryMetricsTraceDatum` was attached to the trace tree in `CosmosQueryClientCore.GetCosmosElementResponse` -- i.e. after the response had finished unwinding back through the handler pipeline -- so handlers walking the trace saw nothing. Changes: - TransportHandler.ProcessMessageAsync now attaches the QueryMetricsTraceDatum to the per-request transport trace immediately after the ResponseMessage is constructed, guarded on OperationType (Query / SqlQuery) so non-query ops that may ever surface a query-metrics header are not mis-tagged. - Removes the now-redundant attach in CosmosQueryClientCore.GetCosmosElementResponse and the now-unused Query.Core.Metrics using. - Cross-references the two attach sites (REST/RNTBD via TransportHandler; thin-client / distributed-gateway via CosmosDistributedQueryClient.CreatePage) with code comments so future query transports stay in sync. - CosmosTraceDiagnostics is unchanged: it continues to climb to the operation root, so response.Diagnostics retains the full operation context for every typed Response<T>, ReadFeedResponse<T>, FeedResponse<T>, and CosmosException.Diagnostics. Inside a custom handler the walk also climbs to root, so GetQueryMetrics returns the per-partition metrics for the current response plus any sibling partition responses that have already completed. Adds emulator tests covering both single-partition and cross-partition fan-out scenarios. The cross-partition test verifies via GetFeedRangesAsync that the emulator actually split into >= 2 physical partitions (Assert.Inconclusive otherwise) and uses a ConcurrentBag to safely capture handler observations across parallel partition responses. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
31b19ff to
6f3a581
Compare
Description
Fixes #5117.
response.Diagnostics.GetQueryMetrics()previously returnednullwhen invoked inside a customRequestHandler.SendAsyncfor query operations. The reason:QueryMetricsTraceDatumwas attached to the trace tree inCosmosQueryClientCore.GetCosmosElementResponse— i.e. after the response had finished unwinding back through the handler pipeline — so handlers walking the trace saw nothing.Changes
TransportHandler.ProcessMessageAsyncnow attaches theQueryMetricsTraceDatumto the transport-layer trace immediately after theResponseMessageis constructed, making query metrics visible to handlers as the response unwinds.CosmosQueryClientCore.GetCosmosElementResponse— removed the now-redundantAddDatumblock (and the now-unusedMicrosoft.Azure.Cosmos.Query.Core.Metricsusing).CosmosTraceDiagnosticsctor no longer climbs the trace parent chain to the operation root. It uses the provided trace as-is. Iterator-level callers (FeedResponse.Diagnostics) continue to see the aggregated view because they receive the operation-root trace viaQueryResponse.CreateSuccess(trace: rootTrace). Custom-handler callers see a per-request (per-partition, in the cross-partition case) view — strict isolation between partition responses.Behavior
response.Diagnostics.GetQueryMetrics()inside a customRequestHandler(single-partition)nullnull(handler invoked N times)feedResponse.Diagnostics.GetQueryMetrics()afterReadNextAsyncTests
Added two emulator tests in
CosmosHandlersTests:QueryMetricsAvailableInsideRequestHandler_SinglePartition— asserts non-nullGetQueryMetrics()inside the handler, with exactly onePartitionedMetricsentry.QueryMetricsAvailableInsideRequestHandler_CrossPartitionIsolation— runs a cross-partitionSELECT * FROM cagainst a 15000 RU container with 20 items, asserts the handler is invoked for ≥2 partition responses with exactly 1PartitionedMetricsentry per invocation, and that the iterator-level view aggregates ≥2 partitions on at least one page.Verification on Windows + .NET 6 + running Cosmos Emulator:
BinaryEncodingOverTheWireTests(which exercise the query/trace path end-to-end against the emulator) pass.master.Changelog
Added entry under
### Unreleased/#### Bugs Fixedinchangelog.md.