Skip to content

Enhancement/allow custom metric buckets #781

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

TheCodeWrangler
Copy link
Contributor

@TheCodeWrangler TheCodeWrangler commented Mar 3, 2025

Working to allow for custom histogram binning to be applied to prometheus metrics. Ideally this would be able to be applied to activities as well as workflows. The current implementation appears to not be applied to workflow end-to-end metrics.

What was changed

Added parameter for "histogram_bucket_overrides" to PrometheusConfig as well as the bridge to the sdk-core.

Added a test case for checking that custom binning is applied. Confirmed that the metrics endpoint for custom metrics were updated but not updated for workflow end to end latencies.

Why?

I was facing a limitation on viewing long running applications due to the default maximum activity bin being 60 seconds.

Checklist

  1. Closes 777

  2. How was this tested:

  1. Any docs updates needed?

Copy link
Member

@cretz cretz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, minor suggestion

poetry.lock Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to merge/rebase with main to fix conflicts

@@ -181,3 +183,54 @@ async def has_log() -> bool:
assert record.levelno == logging.WARNING
assert record.name == f"{logger.name}-sdk_core::temporal_sdk_core::worker::workflow"
assert record.temporal_log.fields["run_id"] == handle.result_run_id # type: ignore


async def test_prometheus_histogram_bucket_overrides(client: Client):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for completeness sake, can you also add a check for custom metric? Basically just make a histogram override for your custom metric too, just assign Runtime to a var, and in addition to all that you're doing below, use runtime.metric_meter() to create/record a custom histogram metric value and confirm it too gets the histogram override.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added a custom_histogram and verified the buckets are updated.

This work is actually NOT accomplishing what I want in the ability to control the binning of temporal_workflow_endtoend_latency_bucket and temporal_activity_execution_latency_[milliseconds]_bucket

Are you able to tell if I will be able to do this in a PR on this repository or if it will require an update to sdk-core for that functionality?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may have to remove the temporal_ prefix. If that is indeed the case, may be a good thing to document where the override attr is defined.

@TheCodeWrangler TheCodeWrangler marked this pull request as draft March 7, 2025 13:31
@TheCodeWrangler TheCodeWrangler force-pushed the enhancement/allow-custom-metric-buckets branch from f656462 to af83eea Compare March 7, 2025 13:34
@TheCodeWrangler TheCodeWrangler marked this pull request as ready for review March 7, 2025 14:12
histogram_overrides = {
"temporal_long_request_latency": [special_value / 2, special_value],
"custom_histogram": [special_value / 2, special_value],
# "temporal_workflow_endtoend_latency": [special_value / 2, special_value], # This still does not work :(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drawing attention here. If i include this in the check it will fail. temporal_workflow_endtoend_latency still appears in the metrics endpoint but the binning is not updated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you tried "workflow_endtoend_latency": [special_value / 2, special_value]? And does temporal_long_request_latency work as expected?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

temporal_long_request_latency does work as expected. This test passes as is.

workflow_endtoend_latency does not update the buckets. In the metrics endpoint I see

# TYPE temporal_workflow_endtoend_latency histogram
temporal_workflow_endtoend_latency_bucket{namespace="default",service_name="temporal-core-sdk",task_queue="task-queue-5af26b55-afbc-4f20-ad5d-2d900f7fe453",workflow_type="HelloWorkflow",le="100"} 1

I have tried severl variations ... temporal_workflow_endtoend_latency, workflow_endtoend_latency. I see in the sdk-core that the binning is defined as as function and am concerned that the histogram override does not apply in those but am not familiar enough with rust to know definitively.

Copy link
Member

@cretz cretz Mar 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see the problem. You are not passing your client_with_overrides to the worker, you're passing the client that comes from the test session that does not use this Runtime. Change the first parameter of the Worker to be client_with_overrides. Also use client_with_overrides as the one to execute_workflow instead of client.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the local definition of client within run_workflow is already set to client_with_overrides.

I did try it though but still do not seem to be able to effect binning on the workflow end to end histogram

Copy link
Contributor Author

@TheCodeWrangler TheCodeWrangler Mar 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this seem related to temporalio/sdk-core#873?

Seems similar that custom binning is applied to some histograms but not others.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Sushisource

Do you know of a reason that the application of custom binning would work for some histograms but not the workflow or activity related ones?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No it's not immediately clear to me. I will need to look into that bug when I have a moment, which hopefully would be next week sometime

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TheCodeWrangler - now that we have upgraded Core with this fix, there should be no histograms missing these buckets. I have merged main back into this branch. Want to uncomment and see if your test now passes? If so, we can merge.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests now pass! Thank you!

@cretz
Copy link
Member

cretz commented Apr 3, 2025

@TheCodeWrangler - may also need to run poe format on the source

@TheCodeWrangler TheCodeWrangler requested a review from cretz April 14, 2025 17:47
@cretz
Copy link
Member

cretz commented Apr 15, 2025

There is a flake at:

FAILED tests/test_runtime.py::test_prometheus_histogram_bucket_overrides - assert 'temporal_workflow_endtoend_latency' in '# HELP custom_histogram Custom histogram\n# TYPE custom_histogram

Granted we have other test flakes, but this is the one I noticed here. For some reason there is a flake where the value is not there in some cases, hrmm.

@TheCodeWrangler
Copy link
Contributor Author

There is a flake at:

FAILED tests/test_runtime.py::test_prometheus_histogram_bucket_overrides - assert 'temporal_workflow_endtoend_latency' in '# HELP custom_histogram Custom histogram\n# TYPE custom_histogram

Granted we have other test flakes, but this is the one I noticed here. For some reason there is a flake where the value is not there in some cases, hrmm.

What is the local command to recreate the flake test? I noticed poe is using ruff and did not see flake in any of the github workflows?

I will try to get the flake error resolved in my test if I have a way to test locally.

@cretz
Copy link
Member

cretz commented Apr 15, 2025

For the flake happening for instance on 3.9 macos-intel, it would be a command like:

poe test -s --workflow-environment time-skipping --log-cli-level=DEBUG -k test_prometheus_histogram_bucket_overrides

But unsure if you'll be able to replicate. As for the other flakes, we apologize for those, we are trying to work through them.

@TheCodeWrangler
Copy link
Contributor Author

poe test -s --workflow-environment time-skipping --log-cli-level=DEBUG -k test_prometheus_histogram_bucket_overrides

Test passes locally 🤷

@cretz
Copy link
Member

cretz commented Apr 15, 2025

Something is strange where in a rare case temporal_workflow_endtoend_latency isn't in the Prometheus output in rare cases. I am checking with peers internally and re-running to see if it fails reliably on certain platform.

@TheCodeWrangler
Copy link
Contributor Author

Something is strange where in a rare case temporal_workflow_endtoend_latency isn't in the Prometheus output in rare cases. I am checking with peers internally and re-running to see if it fails reliably on certain platform.

I could just use a different metric (but that was one i saw that did not get custom binning applied previously)

@cretz
Copy link
Member

cretz commented Apr 15, 2025

Yeah, it seems to be failing fairly reliably on some platforms and Python versions. I believe this metric may be eventually consistent. I would suggest either using a different histogram metric, or changing the assertion to check eventually, e.g. using the assert_eventually helper.

@cretz cretz reopened this Apr 15, 2025
@cretz
Copy link
Member

cretz commented Apr 15, 2025

(reopened in case close was accidental, but if there's a separate PR, can close this one or if you're wanting us to help and/or take more control, we can help there too, thanks for all the patience!)

@cretz
Copy link
Member

cretz commented Apr 15, 2025

Confirmed, some (most?) metrics may take a (literal) second, but we don't want explicit sleeps. Can switch metrics or do a repeated assertion over a few seconds until it appears (e.g. the assert_eventually helper).

Copy link
Member

@cretz cretz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, will merge if/when CI passes (we have some flakes in other places we are working on, so I may end up running multiple times). Thanks for seeing this through!

@cretz
Copy link
Member

cretz commented Apr 15, 2025

Hrmm, seems even after several seconds it is not showing up. Something else is happening, I will set aside time to replicate on 3.9 and see if I can figure out the issue. It may take a few days to get back to this, sorry to leave this PR hanging on that.

@cretz
Copy link
Member

cretz commented Apr 16, 2025

I am looking into this now, I hope you don't mind as I push to your branch during this effort

@cretz
Copy link
Member

cretz commented Apr 16, 2025

Ok, I believe this is a rust Core issue. I have removed temporal_workflow_endtoend_latency from the test and opened temporalio/sdk-core#902. Will merge if/when passes CI.

@cretz
Copy link
Member

cretz commented Apr 17, 2025

(sorry, doing more CI flake investigation in your branch, can ignore basically everything from here on out)

@cretz cretz merged commit 3d10ba6 into temporalio:main Apr 21, 2025
13 checks passed
@cretz
Copy link
Member

cretz commented Apr 21, 2025

Merged, thanks again!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants