-
Notifications
You must be signed in to change notification settings - Fork 69
Deployment and TaskQueue Stats API #571
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
repeated temporal.api.taskqueue.v1.TaskQueueStatDetails per_queue_stats = 1; | ||
|
||
// Aggregated backlog count across all task queues in this deployment version. | ||
int64 approximate_total_backlog_count = 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Name is derived from TaskQueueStatDetails
's approximate_backlog_count
field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this just a convenience value that is easily derived from the per queue metrics field? If so, is there any need for "get worker deployment stats" if it's just "list task queues with deployment name filter and opt-in to stats"?
I suggest we take a different approach. I think we should add a ListTaskQueues
as a multi-task-queue equivalent to DescribeTaskQueue
, have it accept filters (can limit right now to requiring deployment name at runtime if you don't want full implementation), and allow opting in to getting stats while listing. The DescribeX
for single + ListXs
for multiple is a much more consistent API and has better reuse potential and is clearer to use/understand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this just a convenience value that is easily derived from the per queue metrics field?
Yes, it's really just for convenience. There was a long discussion about this at the DevX review (happy to provide details) and this was the compromise that both camps (no aggregation vs lots of aggregation) were okay with.
I think we should add a
ListTaskQueues
The major issue with that option is that we cannot list task queues. Since their are "unloaded" when not in active use after a few minutes, the API wouldn't be able to return stats for unloaded task queues. Users are expected to provide the task queues they are interested in for that reason. On the other hands with deployments we actually track the task queues, so that's why that API is more convenient/powerful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's really just for convenience. There was a long discussion about this at the DevX review (happy to provide details) and this was the compromise that both camps (no aggregation vs lots of aggregation) were okay with.
I think it's rough to half-aggregate things for people that can do it themselves (that's probably even worse than all or none). But regardless, this may be fine if we move to a list task queue model. But why would we not ask the various clients to do the aggregation instead of the server?
The major issue with that option is that we cannot list task queues [...] On the other hands with deployments we actually track the task queues, so that's why that API is more convenient/powerful.
To confirm, you can list task queues if deployment name is present? This is why I added "can limit right now to requiring deployment name at runtime if you don't want full implementation" to my comment. IMO it is better to have a limited form of ListTaskQueues
than to have worker deployment have its own "list task queues". I would like to see an issue open to make the full list task queues (i.e. if a user doesn't provide deployment name) work.
6b25fed
to
fa2df53
Compare
f46f542
to
9778bd9
Compare
|
||
// Optional. Deployment version to return stats for. | ||
// If left empty, the current version will be queried. | ||
string deployment_version = 3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to double-check that build_id
isn't a better field name here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're in the middle of that discussion right now in the Versioning crew, we have not fully converged yet. If you want to merge this asap, I think build_id
is the safer alternative.
message TaskQueueStatsInfo { | ||
temporal.api.taskqueue.v1.TaskQueue task_queue = 1; | ||
temporal.api.enums.v1.TaskQueueType task_queue_type = 2; | ||
temporal.api.taskqueue.v1.TaskQueueStats task_queue_stats = 3; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This message only exists because TaskQueueStats
doesn't contain the task queue name and task_queue_type.
I considered adding them to the existing TaskQueueStats
, but was discouraged to do so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hrmm, I wonder if we can come up with a better name for this that just wraps stats and has a qualified set. Would you consider making this message a sub-message of GetWorkerDeploymentStatsResponse
instead of top-level where it is a bit confusing/ambiguous?
Though if you look at my suggestion later concerning ListTaskQueues
, you may find that this message may not be necessary.
ba4700d
to
2c3d3cf
Compare
// GetWorkerDeploymentStats returns the stats for all task queues in a deployment version. | ||
rpc GetWorkerDeploymentStats (GetWorkerDeploymentStatsRequest) returns (GetWorkerDeploymentStatsResponse) { | ||
option (google.api.http) = { | ||
get: "/namespaces/{namespace}/worker-deployments/{deployment_name}/{deployment_version}/stats" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since deployment_version
can be empty, I'm not sure this is right yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should only put the name in the url.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shows I'm not familiar with how the grpc-http bridge works. I suppose the other args are passed as query parameters?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was just figuring out the grpc-http stuff yesterday! This convo with ChatGPT helped (start reading from where I ask "What is the benefit of putting deployment_name in the http path, and what is the problem with not including it"): https://chatgpt.com/share/68113ebc-e300-8004-9732-3b1bd38a1029
Basically, namespace must be included in the path because it is used for both routing and auth. These resource names are not used for routing or auth, so it is technically not required to put them in the path. However, it is more "RESTful" to put resource names in the http path, because it makes using the APIs via curl more idiomatic.
// GetTaskQueueStats returns stats for a single task queue. | ||
rpc GetTaskQueueStats (GetTaskQueueStatsRequest) returns (GetTaskQueueStatsResponse) { | ||
option (google.api.http) = { | ||
post: "/namespaces/{namespace}/task-queues/{task_queue.name}/type/{task_queue_type}/get-stats" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this work with the task_queue_type
enum in there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copilot says we can and caller would put the numeric values of the enum in the URL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, I don't know if the numeric enum representation is a good choice here.
message TaskQueueStatsInfo { | ||
temporal.api.taskqueue.v1.TaskQueue task_queue = 1; | ||
temporal.api.enums.v1.TaskQueueType task_queue_type = 2; | ||
temporal.api.taskqueue.v1.TaskQueueStats task_queue_stats = 3; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hrmm, I wonder if we can come up with a better name for this that just wraps stats and has a qualified set. Would you consider making this message a sub-message of GetWorkerDeploymentStatsResponse
instead of top-level where it is a bit confusing/ambiguous?
Though if you look at my suggestion later concerning ListTaskQueues
, you may find that this message may not be necessary.
// GetWorkerDeploymentStats returns the stats for all task queues in a deployment version. | ||
rpc GetWorkerDeploymentStats (GetWorkerDeploymentStatsRequest) returns (GetWorkerDeploymentStatsResponse) { | ||
option (google.api.http) = { | ||
get: "/namespaces/{namespace}/worker-deployments/{deployment_name}/stats" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#579 seems like it may move away from this deployment_name
concept. Want to make sure whichever PR is merged second is changed to apply to the first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 I'll check-in with the team about that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi! In #579 I have removed deployment_name
from APIs that also take a WorkerDeploymentVersion
(with it's own deployment_name
field, so that it will be impossible to create a request object that has deployment_name != version.deployment_name
.
APIs that do not take a version, such as DescribeWorkerDeployment
still take a string deployment_name
and I can't think of any reason that would change.
repeated temporal.api.taskqueue.v1.TaskQueueStatDetails per_queue_stats = 1; | ||
|
||
// Aggregated backlog count across all task queues in this deployment version. | ||
int64 approximate_total_backlog_count = 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this just a convenience value that is easily derived from the per queue metrics field? If so, is there any need for "get worker deployment stats" if it's just "list task queues with deployment name filter and opt-in to stats"?
I suggest we take a different approach. I think we should add a ListTaskQueues
as a multi-task-queue equivalent to DescribeTaskQueue
, have it accept filters (can limit right now to requiring deployment name at runtime if you don't want full implementation), and allow opting in to getting stats while listing. The DescribeX
for single + ListXs
for multiple is a much more consistent API and has better reuse potential and is clearer to use/understand.
@@ -582,6 +582,16 @@ service WorkflowService { | |||
}; | |||
} | |||
|
|||
// GetTaskQueueStats returns stats for a single task queue. | |||
rpc GetTaskQueueStats (GetTaskQueueStatsRequest) returns (GetTaskQueueStatsResponse) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this provide that DescribeTaskQueue
doesn't?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll send you the DevX review; it details this in depth! The short version is that the DescribeTaskQueue
has been designed into a corner and we need a refresh.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 I have disagreements with this assessment (we already did a refresh, this is another refresh with no benefit from a user POV, only yet another way to do the same thing). I'll comment internally on the details there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To clarify; any objections towards the Deployment Stats API part? So the other comment now.
repeated temporal.api.taskqueue.v1.TaskQueueStatsInfo per_queue_metrics = 1; | ||
|
||
// The sum of backlog counts across all task queues in the deployment. | ||
// Note that same as the task queue metrics, this value is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: is this an unfinished comment?
READ BEFORE MERGING: All PRs require approval by both Server AND SDK teams before merging! This is why the number of required approvals is "2" and not "1"--two reviewers from the same team is NOT sufficient. If your PR is not approved by someone in BOTH teams, it may be summarily reverted.
What changed?
Added two new APIs:
GetWorkerDeploymentStats
GetTaskQueueStats
Why?
To return task queue stats to users for making scaling decisions.
Breaking changes
No.
Server PR
temporalio/temporal#7581 (feature complete, but not yet production-ready)