Emit queue heartbeat metric for monitoring queue processor liveness #7117

Draft: yycptt wants to merge 1 commit into main

yycptt (Member) commented Jan 17, 2025

What changed?

  • Emit queue heartbeat duration metric

Why?

  • For monitoring whether a queue processor was accidentally shut down due to a bug.
  • We previously relied on the queue backlog metric for this purpose, but the queue ack level may not move and the backlog may keep growing in some expected cases (e.g. a namespace gets throttled, or replication is delayed). We still want the lag metric for visibility into the backlog, but we need a new metric for monitoring queue liveness.

How did you test it?

  • WIP (will run server locally and check the metric)

Potential risks

  • Some extra metric load. But the increase should be small, as each shard only emits one data point per queue every 5 minutes.

Documentation

Is hotfix candidate?

  • NO.

@yycptt yycptt requested a review from prathyushpv January 17, 2025 22:46
dnr (Member) commented Jan 20, 2025

> But the increase should be small, as each shard only emits one data point per queue every 5 minutes.

Is that true? Regardless of how it's emitted, it still gets scraped regularly, right?

Also, I would have thought a gauge of some kind would make more sense for monitoring liveness. It's fewer time series too. Although if you want the histogram for other reasons, that makes sense.

yycptt (Member, Author) commented Jan 27, 2025

> > But the increase should be small, as each shard only emits one data point per queue every 5 minutes.
>
> Is that true? Regardless of how it's emitted, it still gets scraped regularly, right?
>
> Also, I would have thought a gauge of some kind would make more sense for monitoring liveness. It's fewer time series too. Although if you want the histogram for other reasons, that makes sense.

The metric is emitted every queueMetricUpdateInterval, which is 5 minutes. But yeah, you are right, each scrape still gets the data :(. The good news is that we don't have any high-cardinality tags on the metric, so it will just be one number per configured bucket * # of queues, which is not that bad. I can also reduce the number of buckets we use for this metric.

I think the issue with a gauge is that liveness is per shard per queue, so I would need to add a shardID tag to the metric as well, which would cause even more data to be scraped every time (# of shards * # of queues).

One thing I need to add to the histogram approach is emitting a log with the shardID tag.
