Deployment and TaskQueue Stats API #571

stephanos · 2025-04-16T04:10:21Z

READ BEFORE MERGING: All PRs require approval by both Server AND SDK teams before merging! This is why the number of required approvals is "2" and not "1"--two reviewers from the same team is NOT sufficient. If your PR is not approved by someone in BOTH teams, it may be summarily reverted.

What changed?

Added two new APIs:

GetWorkerDeploymentStats
GetTaskQueueStats

Why?

To return task queue stats to users for making scaling decisions.

Breaking changes

No.

Server PR

temporalio/temporal#7581 (feature complete, but not yet production-ready)

stephanos · 2025-04-16T04:12:01Z

temporal/api/workflowservice/v1/request_response.proto

+    repeated temporal.api.taskqueue.v1.TaskQueueStatDetails per_queue_stats = 1;
+
+    // Aggregated backlog count across all task queues in this deployment version.
+    int64 approximate_total_backlog_count = 2;


Name is derived from TaskQueueStatDetails's approximate_backlog_count field.

Is this just a convenience value that is easily derived from the per queue metrics field? If so, is there any need for "get worker deployment stats" if it's just "list task queues with deployment name filter and opt-in to stats"?

I suggest we take a different approach. I think we should add a ListTaskQueues as a multi-task-queue equivalent to DescribeTaskQueue, have it accept filters (can limit right now to requiring deployment name at runtime if you don't want full implementation), and allow opting in to getting stats while listing. The DescribeX for single + ListXs for multiple is a much more consistent API and has better reuse potential and is clearer to use/understand.

> Is this just a convenience value that is easily derived from the per queue metrics field?

Yes, it's really just for convenience. There was a long discussion about this at the DevX review (happy to provide details) and this was the compromise that both camps (no aggregation vs lots of aggregation) were okay with.

> I think we should add a ListTaskQueues

The major issue with that option is that we cannot list task queues. Since they are "unloaded" after a few minutes without active use, the API wouldn't be able to return stats for unloaded task queues. For that reason, users are expected to provide the task queues they are interested in. On the other hand, with deployments we actually track the task queues, so that's why that API is more convenient/powerful.

> Yes, it's really just for convenience. There was a long discussion about this at the DevX review (happy to provide details) and this was the compromise that both camps (no aggregation vs lots of aggregation) were okay with.

I think it's rough to half-aggregate things for people that can do it themselves (that's probably even worse than all or none). But regardless, this may be fine if we move to a list task queue model. But why would we not ask the various clients to do the aggregation instead of the server?
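For illustration, the client-side aggregation being discussed could look roughly like the sketch below. It assumes only that each per-queue entry exposes an `approximate_backlog_count`, as named earlier in this thread; the `QueueStats` class, its other fields, and the sample queue names are hypothetical and not part of the proposed API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueStats:
    """Hypothetical stand-in for one per-queue stats entry."""
    task_queue: str
    task_queue_type: str
    approximate_backlog_count: int


def total_backlog(per_queue_stats: list[QueueStats]) -> int:
    """Sum the per-queue backlog counts on the client side."""
    return sum(s.approximate_backlog_count for s in per_queue_stats)


# Made-up sample data, for illustration only.
stats = [
    QueueStats("orders", "WORKFLOW", 12),
    QueueStats("orders", "ACTIVITY", 30),
    QueueStats("emails", "ACTIVITY", 0),
]
print(total_backlog(stats))  # 42
```

Because every input is itself approximate, the sum is approximate as well, whichever side computes it.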

> The major issue with that option is that we cannot list task queues [...] On the other hand, with deployments we actually track the task queues, so that's why that API is more convenient/powerful.

To confirm, you can list task queues if deployment name is present? This is why I added "can limit right now to requiring deployment name at runtime if you don't want full implementation" to my comment. IMO it is better to have a limited form of ListTaskQueues than to have worker deployment have its own "list task queues". I would like to see an issue open to make the full list task queues (i.e. if a user doesn't provide deployment name) work.
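As a rough sketch of what "requiring deployment name at runtime" could mean, the validation below rejects a list request that has no deployment name. Everything here is an assumption made for illustration: `ListTaskQueuesRequest`, its fields, and the error types are hypothetical and do not exist in the API today.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ListTaskQueuesRequest:
    """Hypothetical request shape; the field names are illustrative only."""
    namespace: str
    deployment_name: Optional[str] = None
    include_stats: bool = False


def validate(req: ListTaskQueuesRequest) -> None:
    """Reject requests that a limited first implementation could not serve."""
    if not req.namespace:
        raise ValueError("namespace is required")
    if not req.deployment_name:
        # Without a deployment there is no tracked set of task queues to list.
        raise NotImplementedError(
            "listing task queues without deployment_name is not supported yet"
        )


# Accepted: scoped to a deployment, with stats opted in.
validate(ListTaskQueuesRequest("default", deployment_name="checkout", include_stats=True))
```

Lifting the restriction later would only relax the check, so callers written against the limited form would keep working.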

stephanos · 2025-04-28T21:38:40Z

temporal/api/workflowservice/v1/request_response.proto

+
+    // Optional. Deployment version to return stats for.
+    // If left empty, the current version will be queried.
+    string deployment_version = 3;


Need to double-check that build_id isn't a better field name here.

We're in the middle of that discussion right now in the Versioning crew, we have not fully converged yet. If you want to merge this asap, I think build_id is the safer alternative.

stephanos · 2025-04-28T21:47:00Z

temporal/api/taskqueue/v1/message.proto

+message TaskQueueStatsInfo {
+    temporal.api.taskqueue.v1.TaskQueue task_queue = 1;
+    temporal.api.enums.v1.TaskQueueType task_queue_type = 2;
+    temporal.api.taskqueue.v1.TaskQueueStats task_queue_stats = 3;
+}


This message only exists because TaskQueueStats doesn't contain the task queue name and task_queue_type.

I considered adding them to the existing TaskQueueStats, but was discouraged from doing so.

Hrmm, I wonder if we can come up with a better name for this that just wraps stats and has a qualified set. Would you consider making this message a sub-message of GetWorkerDeploymentStatsResponse instead of top-level where it is a bit confusing/ambiguous?

Though if you look at my suggestion later concerning ListTaskQueues, you may find that this message may not be necessary.

stephanos · 2025-04-28T21:51:27Z

temporal/api/workflowservice/v1/service.proto

+    // GetWorkerDeploymentStats returns the stats for all task queues in a deployment version.
+    rpc GetWorkerDeploymentStats (GetWorkerDeploymentStatsRequest) returns (GetWorkerDeploymentStatsResponse) {
+        option (google.api.http) = {
+            get: "/namespaces/{namespace}/worker-deployments/{deployment_name}/{deployment_version}/stats"


Since deployment_version can be empty, I'm not sure this is right yet.

I think we should only put the name in the url.

This shows I'm not familiar with how the grpc-http bridge works. I suppose the other args are passed as query parameters?

I was just figuring out the grpc-http stuff yesterday! This convo with ChatGPT helped (start reading from where I ask "What is the benefit of putting deployment_name in the http path, and what is the problem with not including it"): https://chatgpt.com/share/68113ebc-e300-8004-9732-3b1bd38a1029

Basically, namespace must be included in the path because it is used for both routing and auth. These resource names are not used for routing or auth, so it is technically not required to put them in the path. However, it is more "RESTful" to put resource names in the http path, because it makes using the APIs via curl more idiomatic.
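To make the path versus query-parameter split concrete, here is a small sketch of how a client might build the URL for a GET binding: fields named in the path template are substituted into the path, and the remaining non-empty fields are sent as query parameters. The helper `build_get_url` and the sample values are hypothetical; this is not the bridge's actual implementation, only an approximation of the mapping for a GET with no request body.

```python
from urllib.parse import quote, urlencode


def build_get_url(template: str, fields: dict[str, str]) -> str:
    """Bind fields named in the path template; send the rest as query params."""
    remaining = dict(fields)
    path = template
    for name in list(remaining):
        placeholder = "{" + name + "}"
        if placeholder in path:
            # Percent-encode the value so it stays within one path segment.
            path = path.replace(placeholder, quote(remaining.pop(name), safe=""))
    # Empty optional fields are simply omitted from the query string.
    query = urlencode({k: v for k, v in remaining.items() if v})
    return path + ("?" + query if query else "")


template = "/namespaces/{namespace}/worker-deployments/{deployment_name}/stats"
print(build_get_url(template, {
    "namespace": "default",
    "deployment_name": "checkout",
    "deployment_version": "v42",
}))
# /namespaces/default/worker-deployments/checkout/stats?deployment_version=v42
```

With only the name in the path, an empty `deployment_version` just drops out of the query string instead of leaving an empty path segment.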

stephanos · 2025-04-28T21:53:17Z

temporal/api/workflowservice/v1/service.proto

+    // GetTaskQueueStats returns stats for a single task queue.
+    rpc GetTaskQueueStats (GetTaskQueueStatsRequest) returns (GetTaskQueueStatsResponse) {
+        option (google.api.http) = {
+            post: "/namespaces/{namespace}/task-queues/{task_queue.name}/type/{task_queue_type}/get-stats"


Does this work with the task_queue_type enum in there?

Copilot says we can and caller would put the numeric values of the enum in the URL.

Hm, I don't know if the numeric enum representation is a good choice here.
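To show what the two spellings would look like to a caller, the sketch below renders the path once with the numeric enum value and once with the enum name. The `TaskQueueType` class mirrors the proto enum only for illustration, and its member names and numbers are assumptions; whether the HTTP bridge accepts the name form is not confirmed in this thread.

```python
from enum import IntEnum


class TaskQueueType(IntEnum):
    """Illustrative mirror of the proto enum; names and numbers are assumed."""
    TASK_QUEUE_TYPE_UNSPECIFIED = 0
    TASK_QUEUE_TYPE_WORKFLOW = 1
    TASK_QUEUE_TYPE_ACTIVITY = 2
    TASK_QUEUE_TYPE_NEXUS = 3


def stats_path(namespace: str, queue: str, tq_type: TaskQueueType, numeric: bool) -> str:
    """Render the get-stats path with either the enum number or the enum name."""
    segment = str(tq_type.value) if numeric else tq_type.name
    return f"/namespaces/{namespace}/task-queues/{queue}/type/{segment}/get-stats"


activity = TaskQueueType.TASK_QUEUE_TYPE_ACTIVITY
print(stats_path("default", "orders", activity, numeric=True))
# /namespaces/default/task-queues/orders/type/2/get-stats
print(stats_path("default", "orders", activity, numeric=False))
# /namespaces/default/task-queues/orders/type/TASK_QUEUE_TYPE_ACTIVITY/get-stats
```

The numeric form is shorter but opaque when read in a curl command or an access log, which is the readability concern raised above.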

cretz · 2025-04-30T14:01:59Z

temporal/api/taskqueue/v1/message.proto

+message TaskQueueStatsInfo {
+    temporal.api.taskqueue.v1.TaskQueue task_queue = 1;
+    temporal.api.enums.v1.TaskQueueType task_queue_type = 2;
+    temporal.api.taskqueue.v1.TaskQueueStats task_queue_stats = 3;
+}


Hrmm, I wonder if we can come up with a better name for this that just wraps stats and has a qualified set. Would you consider making this message a sub-message of GetWorkerDeploymentStatsResponse instead of top-level where it is a bit confusing/ambiguous?

Though if you look at my suggestion later concerning ListTaskQueues, you may find that this message may not be necessary.

cretz · 2025-04-30T14:06:13Z

temporal/api/workflowservice/v1/service.proto

+    // GetWorkerDeploymentStats returns the stats for all task queues in a deployment version.
+    rpc GetWorkerDeploymentStats (GetWorkerDeploymentStatsRequest) returns (GetWorkerDeploymentStatsResponse) {
+        option (google.api.http) = {
+            get: "/namespaces/{namespace}/worker-deployments/{deployment_name}/stats"


#579 seems like it may move away from this deployment_name concept. Want to make sure whichever PR is merged second is changed to apply to the first.

👍 I'll check in with the team about that.

Hi! In #579 I have removed deployment_name from APIs that also take a WorkerDeploymentVersion (which has its own deployment_name field), so that it will be impossible to create a request object that has deployment_name != version.deployment_name.

APIs that do not take a version, such as DescribeWorkerDeployment, still take a string deployment_name, and I can't think of any reason that would change.
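The inconsistency that #579 is described as ruling out can be sketched as follows. The classes are hypothetical simplifications: `RequestBefore` and `RequestAfter` are made-up names, and the `build_id` field on `WorkerDeploymentVersion` is an assumption, since this thread only mentions its `deployment_name` field.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerDeploymentVersion:
    deployment_name: str
    build_id: str  # assumed field, not confirmed in this thread


@dataclass(frozen=True)
class RequestBefore:
    """Carries the name twice, so the two copies can disagree."""
    deployment_name: str
    version: WorkerDeploymentVersion


@dataclass(frozen=True)
class RequestAfter:
    """Carries the name once, inside the version."""
    version: WorkerDeploymentVersion

    @property
    def deployment_name(self) -> str:
        return self.version.deployment_name


def is_consistent(req: RequestBefore) -> bool:
    """The server-side check that the old shape would need."""
    return req.deployment_name == req.version.deployment_name


version = WorkerDeploymentVersion("checkout", "build-7")
print(is_consistent(RequestBefore("billing", version)))  # False
print(RequestAfter(version).deployment_name)  # checkout
```

With a single source for the name, the mismatch cannot be expressed, so no validation or error path is needed for it.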

cretz · 2025-04-30T14:07:29Z

temporal/api/workflowservice/v1/request_response.proto

+    repeated temporal.api.taskqueue.v1.TaskQueueStatDetails per_queue_stats = 1;
+
+    // Aggregated backlog count across all task queues in this deployment version.
+    int64 approximate_total_backlog_count = 2;


Is this just a convenience value that is easily derived from the per queue metrics field? If so, is there any need for "get worker deployment stats" if it's just "list task queues with deployment name filter and opt-in to stats"?

I suggest we take a different approach. I think we should add a ListTaskQueues as a multi-task-queue equivalent to DescribeTaskQueue, have it accept filters (can limit right now to requiring deployment name at runtime if you don't want full implementation), and allow opting in to getting stats while listing. The DescribeX for single + ListXs for multiple is a much more consistent API and has better reuse potential and is clearer to use/understand.

cretz · 2025-04-30T14:08:23Z

temporal/api/workflowservice/v1/service.proto

@@ -582,6 +582,16 @@ service WorkflowService {
        };
    }

+    // GetTaskQueueStats returns stats for a single task queue.
+    rpc GetTaskQueueStats (GetTaskQueueStatsRequest) returns (GetTaskQueueStatsResponse) {


What does this provide that DescribeTaskQueue doesn't?

I'll send you the DevX review; it details this in depth! The short version is that DescribeTaskQueue has been designed into a corner and we need a refresh.

👍 I have disagreements with this assessment (we already did a refresh, this is another refresh with no benefit from a user POV, only yet another way to do the same thing). I'll comment internally on the details there.

~~To clarify: any objections to the Deployment Stats API part?~~ Saw the other comment now.

Shivs11 · 2025-05-01T15:40:02Z

temporal/api/workflowservice/v1/request_response.proto

+    repeated temporal.api.taskqueue.v1.TaskQueueStatsInfo per_queue_metrics = 1;
+
+    // The sum of backlog counts across all task queues in the deployment.
+    // Note that same as the task queue metrics, this value is


nit: is this an unfinished comment?

stephanos changed the title ~~Task Queue Stats API~~ [WiP] Task Queue Stats API Apr 16, 2025

stephanos force-pushed the task-stats branch 8 times, most recently from 6b25fed to fa2df53 Compare April 22, 2025 21:21

stephanos force-pushed the task-stats branch 4 times, most recently from f46f542 to 9778bd9 Compare April 28, 2025 17:15

stephanos mentioned this pull request Apr 28, 2025

[WiP] Deployment and TaskQueue Stats API temporalio/temporal#7581

Closed

stephanos changed the title ~~[WiP] Task Queue Stats API~~ [WiP] Deployment and TaskQueue Stats API Apr 28, 2025

stephanos force-pushed the task-stats branch 2 times, most recently from ba4700d to 2c3d3cf Compare April 28, 2025 21:51

Deployment and TaskQueue Stats API

789de85

stephanos force-pushed the task-stats branch from 2c3d3cf to 789de85 Compare April 28, 2025 21:52

address feedback

69424ea

stephanos changed the title ~~[WiP] Deployment and TaskQueue Stats API~~ Deployment and TaskQueue Stats API Apr 29, 2025

stephanos marked this pull request as ready for review April 29, 2025 23:56

stephanos requested review from a team as code owners April 29, 2025 23:56

cretz reviewed Apr 30, 2025

Shivs11 reviewed May 1, 2025

stephanos closed this May 27, 2025
