Description
Description
Ingestion delay is defined as delay between time of the sample accepted in Cortex and time the sample gets scraped
When there are ingestion delays, there can be a difference between recording rule and the underlying metric. This is because when looking back, it would appear as though the samples were there at the time of rule evaluation when in fact the sample might not have been ingested. Even though the samples arrive late, since the timestamp of samples are scrape timestamps, currently there is no way to know if there was a delay in ingestion. This can be confusing to troubleshoot and reason about why the recording rules differ from underlying metric.
Proposed Solution
I'm proposing to create a metric to track delays at the user/tenant level. For example, a histogram metric with buckets that can track at 5s, 30s, and > 30s can be helpful to know how much delay a tenant/user is experiencing. This metric can be used to set rule_query_offset
. This new metric can be emitted from Distributor. To reduce # of time series, we might be able to use a native histogram