Emit queue heartbeat metric for monitoring queue processor liveness #7117

yycptt · 2025-01-17T22:46:41Z

What changed?

Emit queue heartbeat duration metric

Why?

To monitor whether a queue processor was accidentally shut down due to a bug.
We previously relied on the queue backlog metric for this purpose, but the queue ack level may not move and the backlog may keep increasing in some expected cases (e.g. a namespace gets throttled, or replication is delayed). We still want the lag metric for visibility into the backlog, but we need a new metric for monitoring queue liveness.
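To illustrate the idea of a heartbeat that is independent of the ack level, here is a minimal sketch in Go. It is not the code in this PR: `heartbeatTracker`, `beat`, `sinceLastBeat`, and `demo` are hypothetical names, and "time since the processor last reported progress" is only one possible reading of "heartbeat duration".

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// heartbeatTracker is a hypothetical helper, not a type from this PR.
// It remembers when the queue processor last completed a processing loop.
type heartbeatTracker struct {
	mu       sync.Mutex
	lastBeat time.Time
}

// beat is called by the processor each time it makes progress.
func (h *heartbeatTracker) beat(now time.Time) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.lastBeat = now
}

// sinceLastBeat returns how long ago the processor last reported progress.
// A periodic emitter could record this value as a duration metric.
func (h *heartbeatTracker) sinceLastBeat(now time.Time) time.Duration {
	h.mu.Lock()
	defer h.mu.Unlock()
	return now.Sub(h.lastBeat)
}

// demo returns the value seen by the emitter for a healthy processor
// and for a processor that stopped beating. All timestamps are made up.
func demo() (healthy, stalled time.Duration) {
	start := time.Date(2025, 1, 17, 0, 0, 0, 0, time.UTC)
	tracker := &heartbeatTracker{}

	// Healthy: the processor beat one second before the emitter fires.
	tracker.beat(start.Add(4*time.Minute + 59*time.Second))
	healthy = tracker.sinceLastBeat(start.Add(5 * time.Minute))

	// Stalled: no further beats, so the value keeps growing across emits.
	stalled = tracker.sinceLastBeat(start.Add(15 * time.Minute))
	return healthy, stalled
}

func main() {
	healthy, stalled := demo()
	fmt.Println("healthy:", healthy)
	fmt.Println("stalled:", stalled)
}
```

Unlike the backlog, this value stays small while the processor loop is running, even if a throttled namespace keeps the ack level from moving.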

How did you test it?

WIP (will run server locally and check the metric)

Potential risks

Some extra metric load, but the increase should be small, as each shard only emits one data point per queue every 5 minutes.

Documentation

Is hotfix candidate?

NO.

dnr · 2025-01-20T22:59:24Z

> But the increase should be small as each shard only emit one data pointer per queue every 5mins.

Is that true? Regardless of how it's emitted, it still gets scraped regularly, right?

Also, I would have thought a gauge of some kind would make more sense for monitoring liveness. It's fewer timeseries too. Although if you want the histogram for other reasons, that makes sense.

yycptt · 2025-01-27T23:47:28Z

> > But the increase should be small as each shard only emit one data pointer per queue every 5mins.
>
> Is that true? Regardless of how it's emitted, it still gets scraped regularly, right?
>
> Also, I would have thought a gauge of some kind would make more sense for monitoring liveness. It's fewer timeseries too. Although if you want the histogram for other reasons that makes sense

The metric is emitted every queueMetricUpdateInterval, which is 5 minutes. But yeah, you are right, each scrape still gets the data :(. The good news is that we don't have any high-cardinality tags on the metric, so it will just be one number for each bucket we configured * # of queues, so not that bad. And I can reduce the number of buckets we use for this metric.

I think the issue with a gauge is that liveness is per shard per queue, so I would need to add a shardID tag to the metric as well, which would cause even more data to be scraped every time (# of shards * # of queues).

One thing I need to add to the histogram approach is to emit a log with the shardID tag.
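The two estimates in this comment can be compared with a little arithmetic. The sketch below uses the formulas as stated (buckets * queues for the untagged histogram, shards * queues for a gauge tagged with shardID). The bucket, queue, and shard counts are assumed values for illustration only, and any extra series a metrics backend adds per histogram are ignored.

```go
package main

import "fmt"

// histogramSeries estimates the values scraped for a histogram without a
// shardID tag: one number per configured bucket per queue.
func histogramSeries(buckets, queues int) int {
	return buckets * queues
}

// gaugeSeries estimates the values scraped for a gauge that carries a
// shardID tag: one number per shard per queue.
func gaugeSeries(shards, queues int) int {
	return shards * queues
}

func main() {
	// Assumed values, not taken from the PR or any real deployment.
	const buckets, queues, shards = 10, 5, 512

	fmt.Println("histogram:", histogramSeries(buckets, queues))
	fmt.Println("gauge:", gaugeSeries(shards, queues))
}
```

Under these assumptions the histogram is cheaper whenever the bucket count is below the shard count, which is why reducing the number of buckets helps further.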

github-actions · 2025-05-28T00:09:03Z

This PR was marked as stale. Please update or close it.

Emit queue heartbeat metric for monitoring queue processor liveness

dec8d50

yycptt requested a review from prathyushpv January 17, 2025 22:46

github-actions bot added the stale label May 28, 2025
