-
Notifications
You must be signed in to change notification settings - Fork 94
Enhancement/allow custom metric buckets #781
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enhancement/allow custom metric buckets #781
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, minor suggestion
poetry.lock
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to merge/rebase with main
to fix conflicts
@@ -181,3 +183,54 @@ async def has_log() -> bool: | |||
assert record.levelno == logging.WARNING | |||
assert record.name == f"{logger.name}-sdk_core::temporal_sdk_core::worker::workflow" | |||
assert record.temporal_log.fields["run_id"] == handle.result_run_id # type: ignore | |||
|
|||
|
|||
async def test_prometheus_histogram_bucket_overrides(client: Client): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just for completeness sake, can you also add a check for custom metric? Basically just make a histogram override for your custom metric too, just assign Runtime
to a var, and in addition to all that you're doing below, use runtime.metric_meter()
to create/record a custom histogram metric value and confirm it too gets the histogram override.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added a custom_histogram
and verified the buckets are updated.
This work is actually NOT accomplishing what I want in the ability to control the binning of temporal_workflow_endtoend_latency_bucket
and temporal_activity_execution_latency_[milliseconds]_bucket
Are you able to tell if I will be able to do this in a PR on this repository or if it will require an update to sdk-core
for that functionality?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You may have to remove the temporal_
prefix. If that is indeed the case, may be a good thing to document where the override attr is defined.
f656462
to
af83eea
Compare
tests/test_runtime.py
Outdated
histogram_overrides = { | ||
"temporal_long_request_latency": [special_value / 2, special_value], | ||
"custom_histogram": [special_value / 2, special_value], | ||
# "temporal_workflow_endtoend_latency": [special_value / 2, special_value], # This still does not work :( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Drawing attention here. If i include this in the check it will fail. temporal_workflow_endtoend_latency
still appears in the metrics endpoint but the binning is not updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you tried "workflow_endtoend_latency": [special_value / 2, special_value]
? And does temporal_long_request_latency
work as expected?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
temporal_long_request_latency
does work as expected. This test passes as is.
workflow_endtoend_latency
does not update the buckets. In the metrics endpoint I see
# TYPE temporal_workflow_endtoend_latency histogram
temporal_workflow_endtoend_latency_bucket{namespace="default",service_name="temporal-core-sdk",task_queue="task-queue-5af26b55-afbc-4f20-ad5d-2d900f7fe453",workflow_type="HelloWorkflow",le="100"} 1
I have tried severl variations ... temporal_workflow_endtoend_latency
, workflow_endtoend_latency
. I see in the sdk-core
that the binning is defined as as function and am concerned that the histogram override does not apply in those but am not familiar enough with rust to know definitively.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I see the problem. You are not passing your client_with_overrides
to the worker, you're passing the client
that comes from the test session that does not use this Runtime
. Change the first parameter of the Worker
to be client_with_overrides
. Also use client_with_overrides
as the one to execute_workflow
instead of client
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that the local definition of client
within run_workflow
is already set to client_with_overrides
.
I did try it though but still do not seem to be able to effect binning on the workflow end to end histogram
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this seem related to temporalio/sdk-core#873?
Seems similar that custom binning is applied to some histograms but not others.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you know of a reason that the application of custom binning would work for some histograms but not the workflow or activity related ones?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No it's not immediately clear to me. I will need to look into that bug when I have a moment, which hopefully would be next week sometime
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TheCodeWrangler - now that we have upgraded Core with this fix, there should be no histograms missing these buckets. I have merged main
back into this branch. Want to uncomment and see if your test now passes? If so, we can merge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests now pass! Thank you!
@TheCodeWrangler - may also need to run |
There is a flake at:
Granted we have other test flakes, but this is the one I noticed here. For some reason there is a flake where the value is not there in some cases, hrmm. |
What is the local command to recreate the flake test? I noticed poe is using ruff and did not see flake in any of the github workflows? I will try to get the flake error resolved in my test if I have a way to test locally. |
For the flake happening for instance on 3.9 macos-intel, it would be a command like:
But unsure if you'll be able to replicate. As for the other flakes, we apologize for those, we are trying to work through them. |
Test passes locally 🤷 |
Something is strange where in a rare case |
I could just use a different metric (but that was one i saw that did not get custom binning applied previously) |
Yeah, it seems to be failing fairly reliably on some platforms and Python versions. I believe this metric may be eventually consistent. I would suggest either using a different histogram metric, or changing the assertion to check eventually, e.g. using the |
(reopened in case close was accidental, but if there's a separate PR, can close this one or if you're wanting us to help and/or take more control, we can help there too, thanks for all the patience!) |
Confirmed, some (most?) metrics may take a (literal) second, but we don't want explicit sleeps. Can switch metrics or do a repeated assertion over a few seconds until it appears (e.g. the |
Co-authored-by: Chad Retz <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, will merge if/when CI passes (we have some flakes in other places we are working on, so I may end up running multiple times). Thanks for seeing this through!
Hrmm, seems even after several seconds it is not showing up. Something else is happening, I will set aside time to replicate on 3.9 and see if I can figure out the issue. It may take a few days to get back to this, sorry to leave this PR hanging on that. |
I am looking into this now, I hope you don't mind as I push to your branch during this effort |
Ok, I believe this is a rust Core issue. I have removed |
(sorry, doing more CI flake investigation in your branch, can ignore basically everything from here on out) |
Merged, thanks again! |
Working to allow for custom histogram binning to be applied to prometheus metrics. Ideally this would be able to be applied to activities as well as workflows. The current implementation appears to not be applied to workflow end-to-end metrics.
What was changed
Added parameter for "histogram_bucket_overrides" to
PrometheusConfig
as well as the bridge to the sdk-core.Added a test case for checking that custom binning is applied. Confirmed that the metrics endpoint for custom metrics were updated but not updated for workflow end to end latencies.
Why?
I was facing a limitation on viewing long running applications due to the default maximum activity bin being 60 seconds.
Checklist
Closes 777
How was this tested: