Skip to content

Deployment and TaskQueue Stats API #571

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

stephanos
Copy link
Contributor

@stephanos stephanos commented Apr 16, 2025

READ BEFORE MERGING: All PRs require approval by both Server AND SDK teams before merging! This is why the number of required approvals is "2" and not "1"--two reviewers from the same team is NOT sufficient. If your PR is not approved by someone in BOTH teams, it may be summarily reverted.

What changed?

Added two new APIs:

  • GetWorkerDeploymentStats
  • GetTaskQueueStats

Why?

To return task queue stats to users for making scaling decisions.

Breaking changes

No.

Server PR

temporalio/temporal#7581 (feature complete, but not yet production-ready)

@stephanos stephanos changed the title Task Queue Stats API [WiP] Task Queue Stats API Apr 16, 2025
repeated temporal.api.taskqueue.v1.TaskQueueStatDetails per_queue_stats = 1;

// Aggregated backlog count across all task queues in this deployment version.
int64 approximate_total_backlog_count = 2;
Copy link
Contributor Author

@stephanos stephanos Apr 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Name is derived from TaskQueueStatDetails's approximate_backlog_count field.

Copy link
Member

@cretz cretz Apr 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this just a convenience value that is easily derived from the per queue metrics field? If so, is there any need for "get worker deployment stats" if it's just "list task queues with deployment name filter and opt-in to stats"?

I suggest we take a different approach. I think we should add a ListTaskQueues as a multi-task-queue equivalent to DescribeTaskQueue, have it accept filters (can limit right now to requiring deployment name at runtime if you don't want full implementation), and allow opting in to getting stats while listing. The DescribeX for single + ListXs for multiple is a much more consistent API and has better reuse potential and is clearer to use/understand.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this just a convenience value that is easily derived from the per queue metrics field?

Yes, it's really just for convenience. There was a long discussion about this at the DevX review (happy to provide details) and this was the compromise that both camps (no aggregation vs lots of aggregation) were okay with.

I think we should add a ListTaskQueues

The major issue with that option is that we cannot list task queues. Since their are "unloaded" when not in active use after a few minutes, the API wouldn't be able to return stats for unloaded task queues. Users are expected to provide the task queues they are interested in for that reason. On the other hands with deployments we actually track the task queues, so that's why that API is more convenient/powerful.

Copy link
Member

@cretz cretz Apr 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's really just for convenience. There was a long discussion about this at the DevX review (happy to provide details) and this was the compromise that both camps (no aggregation vs lots of aggregation) were okay with.

I think it's rough to half-aggregate things for people that can do it themselves (that's probably even worse than all or none). But regardless, this may be fine if we move to a list task queue model. But why would we not ask the various clients to do the aggregation instead of the server?

The major issue with that option is that we cannot list task queues [...] On the other hands with deployments we actually track the task queues, so that's why that API is more convenient/powerful.

To confirm, you can list task queues if deployment name is present? This is why I added "can limit right now to requiring deployment name at runtime if you don't want full implementation" to my comment. IMO it is better to have a limited form of ListTaskQueues than to have worker deployment have its own "list task queues". I would like to see an issue open to make the full list task queues (i.e. if a user doesn't provide deployment name) work.

@stephanos stephanos force-pushed the task-stats branch 8 times, most recently from 6b25fed to fa2df53 Compare April 22, 2025 21:21
@stephanos stephanos force-pushed the task-stats branch 4 times, most recently from f46f542 to 9778bd9 Compare April 28, 2025 17:15
@stephanos stephanos changed the title [WiP] Task Queue Stats API [WiP] Deployment and TaskQueue Stats API Apr 28, 2025

// Optional. Deployment version to return stats for.
// If left empty, the current version will be queried.
string deployment_version = 3;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to double-check that build_id isn't a better field name here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're in the middle of that discussion right now in the Versioning crew, we have not fully converged yet. If you want to merge this asap, I think build_id is the safer alternative.

Comment on lines +340 to +344
message TaskQueueStatsInfo {
temporal.api.taskqueue.v1.TaskQueue task_queue = 1;
temporal.api.enums.v1.TaskQueueType task_queue_type = 2;
temporal.api.taskqueue.v1.TaskQueueStats task_queue_stats = 3;
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This message only exists because TaskQueueStats doesn't contain the task queue name and task_queue_type.

I considered adding them to the existing TaskQueueStats, but was discouraged to do so.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hrmm, I wonder if we can come up with a better name for this that just wraps stats and has a qualified set. Would you consider making this message a sub-message of GetWorkerDeploymentStatsResponse instead of top-level where it is a bit confusing/ambiguous?

Though if you look at my suggestion later concerning ListTaskQueues, you may find that this message may not be necessary.

@stephanos stephanos force-pushed the task-stats branch 2 times, most recently from ba4700d to 2c3d3cf Compare April 28, 2025 21:51
// GetWorkerDeploymentStats returns the stats for all task queues in a deployment version.
rpc GetWorkerDeploymentStats (GetWorkerDeploymentStatsRequest) returns (GetWorkerDeploymentStatsResponse) {
option (google.api.http) = {
get: "/namespaces/{namespace}/worker-deployments/{deployment_name}/{deployment_version}/stats"
Copy link
Contributor Author

@stephanos stephanos Apr 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since deployment_version can be empty, I'm not sure this is right yet.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should only put the name in the url.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shows I'm not familiar with how the grpc-http bridge works. I suppose the other args are passed as query parameters?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was just figuring out the grpc-http stuff yesterday! This convo with ChatGPT helped (start reading from where I ask "What is the benefit of putting deployment_name in the http path, and what is the problem with not including it"): https://chatgpt.com/share/68113ebc-e300-8004-9732-3b1bd38a1029

Basically, namespace must be included in the path because it is used for both routing and auth. These resource names are not used for routing or auth, so it is technically not required to put them in the path. However, it is more "RESTful" to put resource names in the http path, because it makes using the APIs via curl more idiomatic.

// GetTaskQueueStats returns stats for a single task queue.
rpc GetTaskQueueStats (GetTaskQueueStatsRequest) returns (GetTaskQueueStatsResponse) {
option (google.api.http) = {
post: "/namespaces/{namespace}/task-queues/{task_queue.name}/type/{task_queue_type}/get-stats"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this work with the task_queue_type enum in there?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot says we can and caller would put the numeric values of the enum in the URL.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I don't know if the numeric enum representation is a good choice here.

@stephanos stephanos changed the title [WiP] Deployment and TaskQueue Stats API Deployment and TaskQueue Stats API Apr 29, 2025
@stephanos stephanos marked this pull request as ready for review April 29, 2025 23:56
@stephanos stephanos requested review from a team as code owners April 29, 2025 23:56
Comment on lines +340 to +344
message TaskQueueStatsInfo {
temporal.api.taskqueue.v1.TaskQueue task_queue = 1;
temporal.api.enums.v1.TaskQueueType task_queue_type = 2;
temporal.api.taskqueue.v1.TaskQueueStats task_queue_stats = 3;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hrmm, I wonder if we can come up with a better name for this that just wraps stats and has a qualified set. Would you consider making this message a sub-message of GetWorkerDeploymentStatsResponse instead of top-level where it is a bit confusing/ambiguous?

Though if you look at my suggestion later concerning ListTaskQueues, you may find that this message may not be necessary.

// GetWorkerDeploymentStats returns the stats for all task queues in a deployment version.
rpc GetWorkerDeploymentStats (GetWorkerDeploymentStatsRequest) returns (GetWorkerDeploymentStatsResponse) {
option (google.api.http) = {
get: "/namespaces/{namespace}/worker-deployments/{deployment_name}/stats"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#579 seems like it may move away from this deployment_name concept. Want to make sure whichever PR is merged second is changed to apply to the first.

Copy link
Contributor Author

@stephanos stephanos Apr 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I'll check-in with the team about that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi! In #579 I have removed deployment_name from APIs that also take a WorkerDeploymentVersion (with it's own deployment_name field, so that it will be impossible to create a request object that has deployment_name != version.deployment_name.

APIs that do not take a version, such as DescribeWorkerDeployment still take a string deployment_name and I can't think of any reason that would change.

repeated temporal.api.taskqueue.v1.TaskQueueStatDetails per_queue_stats = 1;

// Aggregated backlog count across all task queues in this deployment version.
int64 approximate_total_backlog_count = 2;
Copy link
Member

@cretz cretz Apr 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this just a convenience value that is easily derived from the per queue metrics field? If so, is there any need for "get worker deployment stats" if it's just "list task queues with deployment name filter and opt-in to stats"?

I suggest we take a different approach. I think we should add a ListTaskQueues as a multi-task-queue equivalent to DescribeTaskQueue, have it accept filters (can limit right now to requiring deployment name at runtime if you don't want full implementation), and allow opting in to getting stats while listing. The DescribeX for single + ListXs for multiple is a much more consistent API and has better reuse potential and is clearer to use/understand.

@@ -582,6 +582,16 @@ service WorkflowService {
};
}

// GetTaskQueueStats returns stats for a single task queue.
rpc GetTaskQueueStats (GetTaskQueueStatsRequest) returns (GetTaskQueueStatsResponse) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this provide that DescribeTaskQueue doesn't?

Copy link
Contributor Author

@stephanos stephanos Apr 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll send you the DevX review; it details this in depth! The short version is that the DescribeTaskQueue has been designed into a corner and we need a refresh.

Copy link
Member

@cretz cretz Apr 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I have disagreements with this assessment (we already did a refresh, this is another refresh with no benefit from a user POV, only yet another way to do the same thing). I'll comment internally on the details there.

Copy link
Contributor Author

@stephanos stephanos Apr 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify; any objections towards the Deployment Stats API part? So the other comment now.

repeated temporal.api.taskqueue.v1.TaskQueueStatsInfo per_queue_metrics = 1;

// The sum of backlog counts across all task queues in the deployment.
// Note that same as the task queue metrics, this value is
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: is this an unfinished comment?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants