Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
24 changes: 24 additions & 0 deletions Microsoft.Azure.Cosmos/src/Handler/TransportHandler.cs
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,7 @@ namespace Microsoft.Azure.Cosmos.Handlers
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos.Query.Core.Metrics;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Recommendation · Maintainability: Layering

Handlers now depends on Query.Core.Metrics

Microsoft.Azure.Cosmos.Handlers previously had no dependency on Microsoft.Azure.Cosmos.Query.Core. With this using the transport handler becomes query-aware, which couples a generic transport layer to a specific feature area.

Consider one of:

  1. Move the datum construction to a small adapter in the Query.Core namespace (e.g. QueryMetricsTraceDatum.TryAttach(processMessageAsyncTrace, headers)) that the handler calls — preserves layering and gives you a single seam to extend for IndexUtilizationText and QueryAdviceText later.
  2. Inject an IQueryMetricsExtractor via constructor so the transport handler stays unaware of the query model.

Either preserves the existing architectural boundary that has kept Handlers thin.


⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

using Microsoft.Azure.Cosmos.Resource.CosmosExceptions;
using Microsoft.Azure.Cosmos.Tracing;
using Microsoft.Azure.Cosmos.Tracing.TraceData;
Expand Down Expand Up @@ -155,6 +156,29 @@ internal async Task<ResponseMessage> ProcessMessageAsync(
request,
serviceRequest.RequestContext.RequestChargeTracker);

// Attach query metrics to the per-request transport trace so they are
// reachable from a custom RequestHandler.SendAsync as the response unwinds
// back through the handler pipeline (issue #5117). Before this, the datum
// was added in CosmosQueryClientCore.GetCosmosElementResponse, which runs
// after the handler pipeline has fully unwound, so handlers always saw
// a null GetQueryMetrics result. Guarded on operation type to avoid
// mis-tagging non-query operations that may ever surface a query-metrics
// header. The thin-client / distributed-gateway path adds the same datum
// from CosmosDistributedQueryClient.CreatePage (that path bypasses
// TransportHandler).
if ((request.OperationType == OperationType.Query
|| request.OperationType == OperationType.SqlQuery)
&& !string.IsNullOrEmpty(responseMessage?.Headers?.QueryMetricsText))
{
string queryMetricsText = responseMessage.Headers.QueryMetricsText;
QueryMetricsTraceDatum queryMetricsDatum = new QueryMetricsTraceDatum(
new Lazy<QueryMetrics>(() => new QueryMetrics(
queryMetricsText,
IndexUtilizationInfo.Empty,
ClientSideMetrics.Empty)));
processMessageAsyncTrace.AddDatum(TraceDatumKeys.QueryMetrics, queryMetricsDatum);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Blocking · Correctness: Missing Guard

TransportHandler attaches QueryMetrics for any operation type with a QueryMetricsText header

TransportHandler runs for every operation type (point read/write, batch, store procedure, etc.). The new code gates only on !string.IsNullOrEmpty(responseMessage?.Headers?.QueryMetricsText). If a non-query operation ever surfaces x-ms-documentdb-query-metrics (planned hybrid ops, query-plan calls, server-side leaks), it will receive a misleading QueryMetricsTraceDatum, and downstream consumers like MetricsAccumulator will start aggregating non-query data into query metrics.

Add an operation-type guard:

Suggested change
processMessageAsyncTrace.AddDatum(TraceDatumKeys.QueryMetrics, queryMetricsDatum);
// Attach query metrics to the trace at the transport layer so they are
// discoverable from custom RequestHandlers as the response unwinds back
// through the handler pipeline (issue #5117). For cross-partition queries
// each partition response adds its metrics to its own transport trace,
// giving handlers a strict per-partition view via response.Diagnostics.
if ((request.OperationType == Documents.OperationType.Query
|| request.OperationType == Documents.OperationType.SqlQuery)
&& !string.IsNullOrEmpty(responseMessage?.Headers?.QueryMetricsText))
{
string queryMetricsText = responseMessage.Headers.QueryMetricsText;
QueryMetricsTraceDatum queryMetricsDatum = new QueryMetricsTraceDatum(
new Lazy<QueryMetrics>(() => new QueryMetrics(
queryMetricsText,
IndexUtilizationInfo.Empty,
ClientSideMetrics.Empty)));
processMessageAsyncTrace.AddDatum(TraceDatumKeys.QueryMetrics, queryMetricsDatum);
}

Past PR #5266 (the issue that motivated centralizing the TraceDatumKeys.QueryMetrics key) had to fix a similar over-attach issue — keeping this tightly scoped avoids reopening that class of bugs.


⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Recommendation · Maintainability: Comment / Invariant

CosmosDistributedQueryClient.CreatePage still attaches TraceDatumKeys.QueryMetrics

The thin-client / distributed-gateway path bypasses TransportHandler (uses DocumentClient.ExecuteQueryAsync directly), so there is no double-attach in normal flow today. But the comment that used to live in CosmosQueryClientCore.GetCosmosElementResponse ("attach the datum once per query roundtrip") no longer holds globally — there are now two unrelated attach sites for the same key.

Please:

  • Add a code comment on both attach sites describing which transport each owns (REST/RNTBD via TransportHandler vs distributed-gateway via CosmosDistributedQueryClient).
  • Consider a lightweight Debug.Assert(!processMessageAsyncTrace.Data.ContainsKey(TraceDatumKeys.QueryMetrics)) here to catch future cross-wiring early.

Otherwise the next person to wire a new query transport will silently double-attach and break MetricsAccumulator aggregation.


⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Recommendation · Reliability: Trace Shape Change

MetricsAccumulator / ServerSideMetricsTraceExtractor assumptions change shape

Tracing/TraceData/MetricsAccumulator.cs and ServerSideMetricsTraceExtractor.cs walk the trace tree assuming QueryMetrics datums sit one level inside the per-partition query subtree (the historical placement under CosmosQueryClientCore.GetCosmosElementResponse). Moving the attach point to TransportHandler changes both the depth and the siblings of the datum (now alongside RequestCharge / FeedRange datums, which were affected by PR #4252).

The new unit tests in this PR only validate that the handler is invoked — they do not validate that aggregation still produces correct per-partition rollups. Please:

  1. Run the existing aggregation tests (QueryResponseFactoryTests, ClientSideRequestStatisticsTraceDatumTests, ServerSideMetricsTraceExtractorTests if present) and confirm they still pass.
  2. Add at least one test asserting that the iterator-level FeedResponse.Diagnostics.GetQueryMetrics() PartitionedMetrics rollup matches the sum of per-partition handler observations.

Otherwise a future trace-shape change can silently break per-partition aggregation without any signal.


⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Recommendation · Maintainability: Asymmetric Header Handling

IndexUtilizationText and QueryAdviceText are not attached symmetrically

The pre-PR CosmosQueryClientCore.GetCosmosElementResponse only attached QueryMetrics so this PR is consistent with the prior behaviour. But moving the attach point to the transport layer is the right time to also surface the sibling headers:

  • x-ms-cosmos-index-utilizationIndexUtilizationInfo
  • x-ms-cosmos-query-adviceQueryAdvice

Both are already exposed via QueryMetrics aggregation downstream, so users invoking response.Diagnostics.GetQueryMetrics() inside a handler still get an empty IndexUtilizationInfo / no advice — defeating part of the diagnostic value the PR is trying to surface.

A small follow-up issue is fine; calling it out so it does not get forgotten.


⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Suggestion · Maintainability: Design Praise (conditional)

Per-partition trace as the isolation unit is elegant

Once the universal-narrowing regression (sibling blocking comments) is addressed, attaching QueryMetrics at processMessageAsyncTrace (one per round-trip, one per partition for fan-out) cleanly maps "one round-trip = one metrics datum". That is a nicer mental model than the previous CosmosQueryClientCore-level attach, and naturally gives handlers a strict per-partition view without any extra plumbing. Worth keeping.


⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

}

// Enrich diagnostics context in-case of auth failures
if (responseMessage?.StatusCode == System.Net.HttpStatusCode.Unauthorized || responseMessage?.StatusCode == System.Net.HttpStatusCode.Forbidden)
{
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -120,8 +120,11 @@ private DocumentServiceRequest CreateRequest(
return request;
}

private static TryCatch<QueryPage> CreatePage(DocumentServiceResponse response, Tracing.ITrace trace)
{
private static TryCatch<QueryPage> CreatePage(DocumentServiceResponse response, Tracing.ITrace trace)
{
// Attach QueryMetricsTraceDatum for the thin-client / distributed-gateway path
// (this path bypasses Handler/TransportHandler.cs which writes the same datum
// for the REST/RNTBD path). Keep both call sites in sync — see issue #5117.
string queryMetricsText = response.Headers[HttpConstants.HttpHeaders.QueryMetrics];
if (queryMetricsText != null)
{
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -16,7 +16,6 @@ namespace Microsoft.Azure.Cosmos
using Microsoft.Azure.Cosmos.CosmosElements;
using Microsoft.Azure.Cosmos.Json;
using Microsoft.Azure.Cosmos.Query.Core;
using Microsoft.Azure.Cosmos.Query.Core.Metrics;
using Microsoft.Azure.Cosmos.Query.Core.Monads;
using Microsoft.Azure.Cosmos.Query.Core.Pipeline.Pagination;
using Microsoft.Azure.Cosmos.Query.Core.QueryClient;
Expand Down Expand Up @@ -322,15 +321,11 @@ private static TryCatch<QueryPage> GetCosmosElementResponse(
{
using (cosmosResponseMessage)
{
if (cosmosResponseMessage.Headers.QueryMetricsText != null)
{
QueryMetricsTraceDatum datum = new QueryMetricsTraceDatum(
new Lazy<QueryMetrics>(() => new QueryMetrics(
cosmosResponseMessage.Headers.QueryMetricsText,
IndexUtilizationInfo.Empty,
ClientSideMetrics.Empty)));
trace.AddDatum(TraceDatumKeys.QueryMetrics, datum);
}
// Note: QueryMetricsTraceDatum is attached at the transport layer
// (Handler/TransportHandler.cs) for the REST/RNTBD path so the datum
// is reachable from custom RequestHandlers as the response unwinds.
// The thin-client path attaches it in
// Query/Core/QueryClient/CosmosDistributedQueryClient.CreatePage.

if (!cosmosResponseMessage.IsSuccessStatusCode)
{
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,9 @@
namespace Microsoft.Azure.Cosmos.SDK.EmulatorTests
{
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.Documents;
Expand Down Expand Up @@ -95,6 +97,197 @@ public async Task TestBatchRequiredHeadersWithHandler()
}
}

[TestMethod]
public async Task QueryMetricsAvailableInsideRequestHandler_SinglePartition()
{
// Regression test for issue #5117:
// response.Diagnostics.GetQueryMetrics() must be non-null when invoked inside
// a custom RequestHandler.SendAsync for a query operation.
ConcurrentBag<ServerSideCumulativeMetrics> observedMetrics = new ConcurrentBag<ServerSideCumulativeMetrics>();

RequestHandlerHelper testHandler = new RequestHandlerHelper
{
CallBackOnResponse = (request, response) =>
{
if (request.OperationType == OperationType.Query
&& response.IsSuccessStatusCode
&& !string.IsNullOrEmpty(response.Headers?.QueryMetricsText))
{
ServerSideCumulativeMetrics metrics = response.Diagnostics.GetQueryMetrics();
Assert.IsNotNull(metrics, "GetQueryMetrics() must be non-null inside the handler (issue #5117).");
observedMetrics.Add(metrics);
}

return response;
}
};

CosmosClient customClient = TestCommon.CreateCosmosClient(
builder => builder.AddCustomHandlers(testHandler));

try
{
Container container = customClient.GetContainer(this.database.Id, this.Container.Id);
QueryMetricsItem item = new QueryMetricsItem
{
id = Guid.NewGuid().ToString(),
pk = Guid.NewGuid().ToString(),
description = "single-partition"
};
await container.CreateItemAsync(item, new Cosmos.PartitionKey(item.pk));

FeedIterator<QueryMetricsItem> iterator = container.GetItemQueryIterator<QueryMetricsItem>(
$"SELECT * FROM c WHERE c.id = '{item.id}'",
requestOptions: new QueryRequestOptions
{
PartitionKey = new Cosmos.PartitionKey(item.pk)
});

int totalItems = 0;
while (iterator.HasMoreResults)
{
FeedResponse<QueryMetricsItem> page = await iterator.ReadNextAsync();
totalItems += page.Count;
}

Assert.AreEqual(1, totalItems);
Assert.IsTrue(
observedMetrics.Count >= 1,
"Handler must be invoked for at least one query response.");

foreach (ServerSideCumulativeMetrics metrics in observedMetrics)
{
Assert.IsNotNull(metrics);
Assert.AreEqual(
1,
metrics.PartitionedMetrics.Count,
"Single-partition query should report exactly one PartitionedMetrics entry.");
}
}
finally
{
customClient.Dispose();
}
}

[TestMethod]
public async Task QueryMetricsAvailableInsideRequestHandler_CrossPartition()
{
// Companion to the single-partition test. Verifies that for cross-partition
// queries the handler still sees non-null metrics on every query response.
// Option C semantics: response.Diagnostics walks from the operation root, so
// each handler invocation sees its own partition's metrics plus any sibling
// partitions that have already completed at that point. The exact per-page
// distribution is timing-dependent; what we guarantee is (a) metrics are not
// null inside the handler and (b) the iterator-level aggregated view exposes
// metrics from multiple partitions.
ConcurrentBag<int> observedPartitionCounts = new ConcurrentBag<int>();

RequestHandlerHelper testHandler = new RequestHandlerHelper
{
CallBackOnResponse = (request, response) =>
{
if (request.OperationType == OperationType.Query
&& response.IsSuccessStatusCode
&& !string.IsNullOrEmpty(response.Headers?.QueryMetricsText))
{
ServerSideCumulativeMetrics metrics = response.Diagnostics.GetQueryMetrics();
Assert.IsNotNull(metrics, "GetQueryMetrics() must be non-null inside the handler (issue #5117).");
observedPartitionCounts.Add(metrics.PartitionedMetrics.Count);
}

return response;
}
};

CosmosClient customClient = TestCommon.CreateCosmosClient(
builder => builder.AddCustomHandlers(testHandler));

Cosmos.Database multiPartitionDatabase = null;
try
{
multiPartitionDatabase = await customClient.CreateDatabaseAsync(
"MultiPkDb_" + Guid.NewGuid().ToString());

Container multiPartitionContainer = await multiPartitionDatabase.CreateContainerAsync(
new ContainerProperties(id: Guid.NewGuid().ToString(), partitionKeyPath: "/pk"),
throughput: 15000);

IReadOnlyList<FeedRange> feedRanges = await multiPartitionContainer.GetFeedRangesAsync();
if (feedRanges.Count < 2)
{
Assert.Inconclusive(
$"Emulator did not split the 15000 RU container into >= 2 physical partitions (got {feedRanges.Count}); cannot exercise cross-partition fan-out.");
}

for (int i = 0; i < 30; i++)
{
QueryMetricsItem item = new QueryMetricsItem
{
id = Guid.NewGuid().ToString(),
pk = Guid.NewGuid().ToString(),
description = "cross-partition-fanout"
};
await multiPartitionContainer.CreateItemAsync(item, new Cosmos.PartitionKey(item.pk));
}

FeedIterator<QueryMetricsItem> iterator = multiPartitionContainer.GetItemQueryIterator<QueryMetricsItem>(
"SELECT * FROM c",
requestOptions: new QueryRequestOptions
{
MaxConcurrency = -1
});

int totalItems = 0;
int maxPartitionedCountOnIterator = 0;
bool iteratorSawMetrics = false;
while (iterator.HasMoreResults)
{
FeedResponse<QueryMetricsItem> page = await iterator.ReadNextAsync();
totalItems += page.Count;

ServerSideCumulativeMetrics pageMetrics = page.Diagnostics.GetQueryMetrics();
if (pageMetrics != null)
{
iteratorSawMetrics = true;
if (pageMetrics.PartitionedMetrics.Count > maxPartitionedCountOnIterator)
{
maxPartitionedCountOnIterator = pageMetrics.PartitionedMetrics.Count;
}
}
}

Assert.AreEqual(30, totalItems);

Assert.IsTrue(
observedPartitionCounts.Count >= 2,
$"Cross-partition query should invoke the handler for at least 2 partition responses, got {observedPartitionCounts.Count}.");

foreach (int count in observedPartitionCounts)
{
Assert.IsTrue(
count >= 1,
"Each handler invocation should see at least one partition's metrics (issue #5117).");
}

Assert.IsTrue(
iteratorSawMetrics,
"Expected at least one page's iterator-level diagnostics to expose query metrics.");
Assert.IsTrue(
maxPartitionedCountOnIterator >= 2,
$"Iterator-level diagnostics should aggregate metrics from multiple partitions on at least one page (max observed was {maxPartitionedCountOnIterator}).");
}
finally
{
if (multiPartitionDatabase != null)
{
await multiPartitionDatabase.DeleteAsync();
}

customClient.Dispose();
}
}

private async Task<IList<ToDoActivity>> CreateRandomItems(int pkCount, int perPKItemCount = 1, bool randomPartitionKey = true)
{
Assert.IsFalse(!randomPartitionKey && perPKItemCount > 1);
Expand Down Expand Up @@ -146,5 +339,12 @@ public class ToDoActivity
public string description { get; set; }
public string status { get; set; }
}

private class QueryMetricsItem
{
public string id { get; set; }
public string pk { get; set; }
public string description { get; set; }
}
}
}
1 change: 1 addition & 0 deletions changelog.md
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
#### Bugs Fixed

- [5827](https://github.com/Azure/azure-cosmos-dotnet-v3/pull/5827) ChangeFeedEstimator: Change feed estimator threw `ArgumentNullException` when an inmemory lease container was being used. Update validations so in-memory lease containers work with change feed estimator
- [5879](https://github.com/Azure/azure-cosmos-dotnet-v3/pull/5879) Diagnostics: Fixes `response.Diagnostics.GetQueryMetrics()` returning `null` when called inside a custom `RequestHandler.SendAsync` for query operations. Query metrics for the REST/RNTBD path are now attached at the transport layer, so they are visible to handlers as the response unwinds through the pipeline. `response.Diagnostics` continues to expose the full operation context (handlers see metrics from any sibling partition responses that have already completed). `FeedResponse.Diagnostics.GetQueryMetrics()` at the iterator level is unchanged.

#### Other Changes

Expand Down
Loading