Merged
furkhat
reviewed
Aug 11, 2025
6b9a202 to
0157399
Compare
furkhat
approved these changes
Aug 20, 2025
This PR adds the following metrics:
- `action_started_total` - count of started actions by type
- `action_executed_duration_seconds` - summary metric of the action duration for quantiles 0.5, 0.9 and 0.99

It also adds support to export histogram and summary type metrics.
For `action_executed_duration_seconds` I decided to use a summary instead of a histogram to limit the number of series we send. Our action duration can range from milliseconds to dozens of seconds, so we might need up to 16 buckets to get useful data (plus the `_count` and `_sum` series). If we want to have action type as a dimension (we have 14 of those), that means:

3k clusters * 2 pods * (16 + 2) series * 14 action types = 1 512 000 series.

With a summary for 3 quantiles, we get:

3k clusters * 2 pods * (3 + 2) series * 14 action types = 420 000 series.

Still a lot, so we might consider keeping the metric export disabled by default and enabling it via an env var (or maybe remotely from Cast AI).
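The series estimates above can be reproduced with a small sketch. The function name `seriesCount` and the constant names are hypothetical and only illustrate the arithmetic; they are not part of this PR's code.

```go
package main

import "fmt"

// seriesCount estimates how many time series one metric produces:
// every cluster/pod/action-type combination exports seriesPerLabelSet series.
// Hypothetical helper, used only to check the numbers in the description.
func seriesCount(clusters, podsPerCluster, seriesPerLabelSet, actionTypes int) int {
	return clusters * podsPerCluster * seriesPerLabelSet * actionTypes
}

func main() {
	const (
		clusters    = 3000 // "3k clusters"
		pods        = 2    // pods per cluster
		actionTypes = 14   // action type label values
		buckets     = 16   // histogram buckets assumed in the description
		quantiles   = 3    // 0.5, 0.9 and 0.99
		extra       = 2    // the _count and _sum series
	)

	// Histogram: 16 bucket series plus _count and _sum per label set.
	fmt.Println("histogram:", seriesCount(clusters, pods, buckets+extra, actionTypes)) // 1512000

	// Summary: 3 quantile series plus _count and _sum per label set.
	fmt.Println("summary:", seriesCount(clusters, pods, quantiles+extra, actionTypes)) // 420000
}
```

With these inputs the summary sends a bit under 28% of the series the histogram would (5 instead of 18 per label set).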
Another drawback of summaries compared with histograms is that their quantiles cannot be aggregated across pods or clusters, because they are precalculated on the client side.
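A minimal sketch of why precalculated quantiles do not aggregate, using made-up sample data for two pods. The helper names `quantile` and `repeat` are hypothetical, and the exact nearest-rank calculation is a simplification: a real summary estimates quantiles over a stream, but the aggregation problem is the same.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// quantile returns the nearest-rank q-quantile (0 < q <= 1) of a
// non-empty slice of samples. The input slice is not modified.
func quantile(samples []float64, q float64) float64 {
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)
	rank := int(math.Ceil(q * float64(len(sorted))))
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

// repeat builds a slice holding n copies of v.
func repeat(v float64, n int) []float64 {
	out := make([]float64, n)
	for i := range out {
		out[i] = v
	}
	return out
}

func main() {
	// Illustrative data: pod A ran 90 fast actions (10 ms each),
	// pod B ran 10 slow actions (10 s each).
	podA := repeat(0.01, 90)
	podB := repeat(10.0, 10)

	// What a naive aggregation of two exported summaries would do:
	// average the per-pod medians.
	averaged := (quantile(podA, 0.5) + quantile(podB, 0.5)) / 2

	// What we actually want: the median over all observations.
	merged := append(append([]float64(nil), podA...), podB...)

	fmt.Println("average of per-pod medians:", averaged)
	fmt.Println("median of all samples:", quantile(merged, 0.5))
}
```

Here the averaged median lands near 5 s while the true median over all 100 actions is 10 ms, because each pod's quantile has already discarded how many observations it came from. Histogram buckets are plain counters, so they can be summed across pods before a quantile is estimated.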