Enhancement/allow custom metric buckets #781

TheCodeWrangler · 2025-03-03T16:45:36Z

This PR allows custom histogram bucket boundaries to be applied to Prometheus metrics. Ideally the overrides would apply to activity metrics as well as workflow metrics. The current implementation does not appear to apply to workflow end-to-end metrics.

What was changed

Added a histogram_bucket_overrides parameter to PrometheusConfig and to the bridge to sdk-core.

Added a test case that checks that custom binning is applied. Confirmed at the metrics endpoint that the buckets are updated for custom metrics but not for workflow end-to-end latencies.
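As a rough illustration of the shape of this parameter, the overrides can be thought of as a mapping from metric name to a list of bucket upper bounds. The sketch below is standalone Python rather than SDK code: validate_bucket_overrides is a hypothetical helper, the bounds are made-up values, and whether the SDK or sdk-core performs any comparable validation is an assumption.

```python
import math
from typing import Dict, List, Mapping, Sequence


def validate_bucket_overrides(
    overrides: Mapping[str, Sequence[float]],
) -> Dict[str, List[float]]:
    """Hypothetical helper (not part of the SDK): check that every override
    is a non-empty, strictly increasing list of finite upper bounds."""
    cleaned: Dict[str, List[float]] = {}
    for metric_name, bounds in overrides.items():
        values = [float(b) for b in bounds]
        if not values:
            raise ValueError(f"{metric_name}: no bucket bounds given")
        if any(not math.isfinite(v) for v in values):
            raise ValueError(f"{metric_name}: bounds must be finite")
        if any(a >= b for a, b in zip(values, values[1:])):
            raise ValueError(f"{metric_name}: bounds must be strictly increasing")
        cleaned[metric_name] = values
    return cleaned


# Illustrative bounds only; the unit depends on how the metric is recorded.
overrides = validate_bucket_overrides(
    {"custom_histogram": [1_000, 60_000, 600_000, 3_600_000]}
)
```

Per the description above, a mapping of this shape is what would be supplied as histogram_bucket_overrides on PrometheusConfig.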

Why?

I could not get useful latency data for long-running applications because the largest default activity histogram bucket is 60 seconds.
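To make the limitation concrete, here is a small standalone sketch of how cumulative "le" buckets behave. It does not use the SDK; the bucket bounds and durations are made-up values, with only the 60-second cap taken from the description above. Every observation above the largest bound lands in the implicit +Inf bucket, so all long-running activities become indistinguishable.

```python
from bisect import bisect_left
from typing import List, Sequence


def cumulative_bucket_counts(
    observations: Sequence[float], bounds: Sequence[float]
) -> List[int]:
    """Return Prometheus-style cumulative counts: one entry per upper
    bound, plus a final entry for the implicit +Inf bucket."""
    per_bucket = [0] * (len(bounds) + 1)
    for value in observations:
        # bisect_left finds the first bound >= value, matching "le" semantics
        per_bucket[bisect_left(bounds, value)] += 1
    cumulative: List[int] = []
    running = 0
    for count in per_bucket:
        running += count
        cumulative.append(running)
    return cumulative


# Made-up activity durations in seconds: 5 s, 90 s, 10 min, 45 min
durations = [5, 90, 600, 2700]

# Bounds capped at 60 s: three of the four durations are only visible in +Inf
capped = cumulative_bucket_counts(durations, [1, 10, 60])

# Extended bounds: each long duration falls into its own bucket
extended = cumulative_bucket_counts(durations, [1, 10, 60, 300, 1800, 3600])
```

With the capped bounds the counts are [0, 1, 1, 4], while the extended bounds give [0, 1, 1, 2, 3, 4, 4], which is the kind of resolution the override is meant to make possible.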

Checklist

Closes #777
How was this tested:

Any docs updates needed?

cretz

Looks great, minor suggestion

cretz · 2025-03-07T12:06:22Z

poetry.lock

Need to merge/rebase with main to fix conflicts

cretz · 2025-03-07T12:09:51Z

tests/test_runtime.py

@@ -181,3 +183,54 @@ async def has_log() -> bool:
    assert record.levelno == logging.WARNING
    assert record.name == f"{logger.name}-sdk_core::temporal_sdk_core::worker::workflow"
    assert record.temporal_log.fields["run_id"] == handle.result_run_id  # type: ignore
+
+
+async def test_prometheus_histogram_bucket_overrides(client: Client):


Just for completeness' sake, can you also add a check for a custom metric? Make a histogram override for your custom metric too: assign the Runtime to a variable and, in addition to everything you're doing below, use runtime.metric_meter() to create and record a custom histogram metric value, then confirm it also gets the histogram override.

I have added a custom_histogram and verified the buckets are updated.

This work is actually NOT accomplishing what I want, which is the ability to control the binning of temporal_workflow_endtoend_latency_bucket and temporal_activity_execution_latency_[milliseconds]_bucket.

Are you able to tell if I will be able to do this in a PR on this repository or if it will require an update to sdk-core for that functionality?

You may have to remove the temporal_ prefix. If that is indeed the case, it may be worth documenting that where the override attribute is defined.

TheCodeWrangler · 2025-03-07T14:17:55Z

tests/test_runtime.py

+    histogram_overrides = {
+        "temporal_long_request_latency": [special_value / 2, special_value],
+        "custom_histogram": [special_value / 2, special_value],
+        # "temporal_workflow_endtoend_latency": [special_value / 2, special_value],  # This still does not work :(


Drawing attention here. If I include this in the check, it will fail. temporal_workflow_endtoend_latency still appears in the metrics endpoint, but the binning is not updated.

Have you tried "workflow_endtoend_latency": [special_value / 2, special_value]? And does temporal_long_request_latency work as expected?

temporal_long_request_latency does work as expected. This test passes as is.

workflow_endtoend_latency does not update the buckets. In the metrics endpoint I see

# TYPE temporal_workflow_endtoend_latency histogram
temporal_workflow_endtoend_latency_bucket{namespace="default",service_name="temporal-core-sdk",task_queue="task-queue-5af26b55-afbc-4f20-ad5d-2d900f7fe453",workflow_type="HelloWorkflow",le="100"} 1

I have tried several variations, including temporal_workflow_endtoend_latency and workflow_endtoend_latency. I see in sdk-core that the binning is defined as a function, and I am concerned that the histogram override does not apply in those cases, but I am not familiar enough with Rust to know definitively.
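For anyone checking this by hand, a minimal sketch of pulling the "le" bounds for one histogram out of the metrics endpoint text might look like the following. It is standalone Python with simplified parsing (it assumes label values contain no commas or escaped quotes), bucket_bounds is a hypothetical helper name, and the sample text is abbreviated from the output above with an invented le="500" line added for illustration.

```python
import re
from typing import List

# Matches lines such as: some_metric_bucket{label="x",le="100"} 1
_BUCKET_LINE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)_bucket\{(?P<labels>.*)\}\s+\S+$'
)
# Anchor on start-of-labels or a comma so labels ending in "le" do not match
_LE_LABEL = re.compile(r'(?:^|,)le="([^"]+)"')


def bucket_bounds(metrics_text: str, metric_name: str) -> List[str]:
    """Collect the distinct le values reported for metric_name,
    in order of first appearance."""
    seen: List[str] = []
    for line in metrics_text.splitlines():
        match = _BUCKET_LINE.match(line.strip())
        if not match or match.group("name") != metric_name:
            continue
        le = _LE_LABEL.search(match.group("labels"))
        if le and le.group(1) not in seen:
            seen.append(le.group(1))
    return seen


# Abbreviated sample; the le="500" line is invented for illustration
sample = """# TYPE temporal_workflow_endtoend_latency histogram
temporal_workflow_endtoend_latency_bucket{namespace="default",workflow_type="HelloWorkflow",le="100"} 1
temporal_workflow_endtoend_latency_bucket{namespace="default",workflow_type="HelloWorkflow",le="500"} 1
temporal_workflow_endtoend_latency_bucket{namespace="default",workflow_type="HelloWorkflow",le="+Inf"} 1
"""

bounds = bucket_bounds(sample, "temporal_workflow_endtoend_latency")
```

Comparing the returned list against the configured override (plus "+Inf") shows directly whether a given histogram picked up the custom bounds.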

Oh, I see the problem. You are not passing your client_with_overrides to the worker, you're passing the client that comes from the test session that does not use this Runtime. Change the first parameter of the Worker to be client_with_overrides. Also use client_with_overrides as the one to execute_workflow instead of client.

I think that the local definition of client within run_workflow is already set to client_with_overrides.

I did try it, though, and I still do not seem to be able to affect the binning of the workflow end-to-end histogram.

Does this seem related to temporalio/sdk-core#873?

Seems similar that custom binning is applied to some histograms but not others.

@Sushisource

Do you know of a reason that the application of custom binning would work for some histograms but not the workflow or activity related ones?

No, it's not immediately clear to me. I will need to look into that bug when I have a moment, which will hopefully be sometime next week.

@TheCodeWrangler - now that we have upgraded Core with this fix, there should be no histograms missing these buckets. I have merged main back into this branch. Want to uncomment and see if your test now passes? If so, we can merge.

Tests now pass! Thank you!

cretz · 2025-04-03T20:21:30Z

@TheCodeWrangler - may also need to run poe format on the source

cretz · 2025-04-15T13:26:57Z

There is a flake at:

FAILED tests/test_runtime.py::test_prometheus_histogram_bucket_overrides - assert 'temporal_workflow_endtoend_latency' in '# HELP custom_histogram Custom histogram\n# TYPE custom_histogram

Granted we have other test flakes, but this is the one I noticed here. For some reason there is a flake where the value is not there in some cases, hrmm.

TheCodeWrangler · 2025-04-15T13:48:44Z

There is a flake at:

FAILED tests/test_runtime.py::test_prometheus_histogram_bucket_overrides - assert 'temporal_workflow_endtoend_latency' in '# HELP custom_histogram Custom histogram\n# TYPE custom_histogram

Granted we have other test flakes, but this is the one I noticed here. For some reason there is a flake where the value is not there in some cases, hrmm.

What is the local command to recreate the flake test? I noticed poe is using ruff and did not see flake in any of the GitHub workflows.

I will try to get the flake error resolved in my test if I have a way to test locally.

cretz · 2025-04-15T15:08:46Z

For the flake happening for instance on 3.9 macos-intel, it would be a command like:

poe test -s --workflow-environment time-skipping --log-cli-level=DEBUG -k test_prometheus_histogram_bucket_overrides

But unsure if you'll be able to replicate. As for the other flakes, we apologize for those, we are trying to work through them.

TheCodeWrangler · 2025-04-15T15:19:44Z

poe test -s --workflow-environment time-skipping --log-cli-level=DEBUG -k test_prometheus_histogram_bucket_overrides

Test passes locally 🤷

cretz · 2025-04-15T15:28:30Z

Something is strange: in rare cases temporal_workflow_endtoend_latency isn't in the Prometheus output. I am checking with peers internally and re-running to see if it fails reliably on certain platforms.

TheCodeWrangler · 2025-04-15T15:46:26Z

Something is strange: in rare cases temporal_workflow_endtoend_latency isn't in the Prometheus output. I am checking with peers internally and re-running to see if it fails reliably on certain platforms.

I could just use a different metric (but that was the one I saw that did not get custom binning applied previously).

cretz · 2025-04-15T15:47:38Z

Yeah, it seems to be failing fairly reliably on some platforms and Python versions. I believe this metric may be eventually consistent. I would suggest either using a different histogram metric, or changing the assertion to check eventually, e.g. using the assert_eventually helper.

cretz · 2025-04-15T15:48:30Z

(reopened in case close was accidental, but if there's a separate PR, can close this one or if you're wanting us to help and/or take more control, we can help there too, thanks for all the patience!)

cretz · 2025-04-15T15:57:12Z

Confirmed, some (most?) metrics may take a (literal) second, but we don't want explicit sleeps. Can switch metrics or do a repeated assertion over a few seconds until it appears (e.g. the assert_eventually helper).
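The "repeated assertion" idea can be sketched roughly as below. This is not the repository's assert_eventually helper, whose signature is not shown in this thread; assert_eventually_sketch, its parameters, and the simulated metrics check are all hypothetical. The point is only that the check is polled until it passes or a deadline is reached, rather than sleeping for one fixed period and asserting once.

```python
import asyncio
import time
from typing import Awaitable, Callable


async def assert_eventually_sketch(
    check: Callable[[], Awaitable[None]],
    *,
    timeout: float = 10.0,
    interval: float = 0.2,
) -> None:
    """Re-run an async check until it stops raising AssertionError,
    re-raising the last failure once the timeout has elapsed."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            await check()
            return
        except AssertionError:
            if time.monotonic() >= deadline:
                raise
        await asyncio.sleep(interval)


async def _demo() -> int:
    polls = 0

    async def metric_is_present() -> None:
        nonlocal polls
        polls += 1
        # Simulated endpoint: the metric only shows up on the third poll
        text = "temporal_workflow_endtoend_latency_bucket" if polls >= 3 else ""
        assert "temporal_workflow_endtoend_latency" in text

    await assert_eventually_sketch(metric_is_present, timeout=2.0, interval=0.01)
    return polls


polls_needed = asyncio.run(_demo())
```

In a real test the check would fetch the Prometheus endpoint on each poll, so the test finishes as soon as the metric appears and only fails if it never does within the timeout.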

tests/test_runtime.py

Co-authored-by: Chad Retz <[email protected]>

cretz

Looks great, will merge if/when CI passes (we have some flakes in other places we are working on, so I may end up running multiple times). Thanks for seeing this through!

cretz · 2025-04-15T18:34:13Z

Hrmm, seems even after several seconds it is not showing up. Something else is happening, I will set aside time to replicate on 3.9 and see if I can figure out the issue. It may take a few days to get back to this, sorry to leave this PR hanging on that.

cretz · 2025-04-16T14:26:43Z

I am looking into this now, I hope you don't mind as I push to your branch during this effort

cretz · 2025-04-16T20:04:40Z

OK, I believe this is a Rust Core issue. I have removed temporal_workflow_endtoend_latency from the test and opened temporalio/sdk-core#902. Will merge if/when it passes CI.

cretz · 2025-04-17T14:51:41Z

(sorry, doing more CI flake investigation in your branch, can ignore basically everything from here on out)

cretz · 2025-04-21T15:43:49Z

Merged, thanks again!

TheCodeWrangler requested a review from a team as a code owner March 3, 2025 16:45

TheCodeWrangler mentioned this pull request Mar 3, 2025

[Feature Request] Allow custom metric buckets #777

Closed

cretz reviewed Mar 7, 2025

TheCodeWrangler marked this pull request as draft March 7, 2025 13:31

TheCodeWrangler added 7 commits March 7, 2025 07:33

WIP

0e9666b

WIP

b4d4706

Updated

3052733

Removed format edits

a7a318e

Removed format edits

97331d8

Removed format edits

d760fb7

Removed format edits

af83eea

TheCodeWrangler force-pushed the enhancement/allow-custom-metric-buckets branch from f656462 to af83eea Compare March 7, 2025 13:34

TheCodeWrangler added 2 commits March 7, 2025 08:10

Added a custom histogram

4b12391

Match prior linting

b1ae028

TheCodeWrangler marked this pull request as ready for review March 7, 2025 14:12

Removed saving out metrics endpoint

ff0cb7a

TheCodeWrangler commented Mar 7, 2025

Merge branch 'main' into enhancement/allow-custom-metric-buckets

5570148

TheCodeWrangler and others added 3 commits April 14, 2025 08:03

Merge branch 'main' into enhancement/allow-custom-metric-buckets

b9871bd

Updated after rebase and unit tests working on workflow endtoend metrics.

e41206b

Updated format with poe

4f3a1b4

TheCodeWrangler requested a review from cretz April 14, 2025 17:47

Removed invalid comment

7f5f0cd

TheCodeWrangler closed this Apr 15, 2025

cretz reopened this Apr 15, 2025

Updated to assert eventually on metrics check

b9a43e9

cretz reviewed Apr 15, 2025

Update tests/test_runtime.py

319d758

Co-authored-by: Chad Retz <[email protected]>

cretz approved these changes Apr 15, 2025

Work on test flake

5313c1d

Removed temporal_workflow_endtoend_latency from test

7d2199f

Merge branch 'main' into enhancement/allow-custom-metric-buckets

dbb2e22

cretz added 2 commits April 17, 2025 09:55

CI flake investigations

77a900e

Merge branch 'main' into enhancement/allow-custom-metric-buckets

04bc99e

cretz merged commit 3d10ba6 into temporalio:main Apr 21, 2025
13 checks passed
