feat: add action duration metric by Trojan295 · Pull Request #196 · castai/cluster-controller

Trojan295 · 2025-08-06T07:43:48Z

This PR adds the following metrics:

action_started_total - count of started actions by type
action_executed_duration_seconds - summary metric of the action duration for quantiles 0.5, 0.9 and 0.99.

It also adds support for exporting histogram and summary type metrics.

For action_executed_duration_seconds I decided to use a summary instead of a histogram to limit the number of series we send. Our action durations can range from milliseconds to tens of seconds, so a histogram might need up to 16 buckets to give useful resolution, plus the _count and _sum series. If we want to have action type as a dimension (we have 14 of those), that means:
3k clusters * 2 pods * (16 + 2) series * 14 action types = 1 512 000 series.

With a summary for 3 quantiles, we get:
3k clusters * 2 pods * (3 + 2) series * 14 action types = 420 000 series.
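
The two estimates above can be reproduced with a few lines of Go. This is only a sketch of the arithmetic: the function name `seriesCount` and the constant names are made up for illustration and are not part of this PR, and the inputs (3k clusters, 2 pods, 14 action types, 16 buckets, 3 quantiles) are the estimates quoted above.

```go
package main

import "fmt"

// seriesCount estimates how many time series a single labelled metric
// produces across the fleet. seriesPerLabelSet is the number of series
// emitted for one label combination (buckets or quantiles, plus _count
// and _sum).
func seriesCount(clusters, podsPerCluster, seriesPerLabelSet, actionTypes int) int {
	return clusters * podsPerCluster * seriesPerLabelSet * actionTypes
}

func main() {
	const (
		clusters    = 3000 // "3k clusters" from the description
		pods        = 2    // pods per cluster
		actionTypes = 14   // action type label values
		extra       = 2    // the _count and _sum series
	)

	histogram := seriesCount(clusters, pods, 16+extra, actionTypes)
	summary := seriesCount(clusters, pods, 3+extra, actionTypes)

	fmt.Println("histogram series:", histogram) // 1512000
	fmt.Println("summary series:", summary)     // 420000
}
```

Under these assumptions the summary sends roughly 3.6 times fewer series than a 16-bucket histogram.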

That is still a lot, so we might consider keeping the metric export disabled by default and enabling it via an env var (or maybe remotely from Cast AI).
Another drawback of summaries compared with histograms is that they cannot be aggregated across pods or clusters, because the quantiles are precomputed on the client side.
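
To illustrate the aggregation problem, here is a small sketch with invented sample data. It compares the average of two per-pod p90 values with the p90 of the combined observations. The `quantile` helper uses a simple nearest-rank calculation, which is an assumption made for this example and not how a Prometheus client computes its streaming quantiles; the durations are made up.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// quantile returns the q-quantile of values using the nearest-rank method.
// It copies the input so the caller's slice is left unsorted.
func quantile(values []float64, q float64) float64 {
	sorted := append([]float64(nil), values...)
	sort.Float64s(sorted)
	rank := int(math.Ceil(q * float64(len(sorted))))
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

func main() {
	// Hypothetical action durations in seconds, as seen by two pods.
	podA := []float64{0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1} // nine fast actions
	podB := []float64{30}                                          // one slow action

	// What a backend could do with client-side quantiles: average them.
	averaged := (quantile(podA, 0.9) + quantile(podB, 0.9)) / 2

	// What we actually want: the quantile over all observations.
	merged := append(append([]float64(nil), podA...), podB...)
	actual := quantile(merged, 0.9)

	fmt.Printf("average of per-pod p90: %.2fs\n", averaged)
	fmt.Printf("p90 of all observations: %.2fs\n", actual)
}
```

The averaged value ignores how many observations each pod contributed, so it lands far from the quantile of the combined data. Histogram buckets are plain counters, so they can be summed across pods before a quantile is estimated.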

internal/config/config.go

internal/actions/drain_node_handler_test.go

Makefile

internal/castai/client.go

internal/castai/types.go

internal/controller/metricexporter/metricexporter.go

internal/metrics/metrics.go

internal/monitor/monitor.go

Trojan295 changed the title ~~Kube 1330/add action duration metric~~ feat: add action duration metric Aug 6, 2025

furkhat reviewed Aug 11, 2025

feat: add more metrics

0157399

Trojan295 force-pushed the kube-1330/add-action-duration-metric branch from 6b9a202 to 0157399 Compare August 18, 2025 12:30

address review comments

c2b6b06

Trojan295 requested a review from furkhat August 18, 2025 12:45

Trojan295 marked this pull request as ready for review August 18, 2025 12:45

Trojan295 requested a review from a team as a code owner August 18, 2025 12:45

furkhat approved these changes Aug 20, 2025

Trojan295 merged commit e3b15f1 into main Aug 25, 2025
6 checks passed

Trojan295 deleted the kube-1330/add-action-duration-metric branch August 25, 2025 08:10

Trojan295 merged 2 commits into main from kube-1330/add-action-duration-metric
