Description
Is your feature request related to a problem? Please describe.
The ISM rollup action is a terminal operation, preventing the creation of multi-tiered rollups on already aggregated data. This limitation forces users with tiered retention strategies to build complex external automation.
The common workaround involves a cumbersome chain of: Rollup -> _reindex -> Second Rollup Job. This process is difficult to manage, error-prone, and requires external scheduling, which undermines the goal of an integrated lifecycle system like ISM.
A common use case is a tiered retention strategy for observability data:
- Last 6 Hours: Data aggregated at 1-minute granularity.
- 6-24 Hours: Data re-aggregated into 10-minute granularity.
- 24+ Hours: Data re-aggregated into 1-hour granularity for long-term storage.
Describe the solution you'd like
Enhance the ISM policy to support a multi-layered rollup action, where an index can transition through sequential rollup states. The system must be intelligent and safe by design, validating the compatibility of metrics before performing a consecutive rollup.
Key capabilities:
- Chained Rollup States: Allow a rollup action's target index to be the source for a subsequent rollup action in a later state.
- Metric Compatibility Validation: Automatically check if source rollup metrics are re-aggregatable.
  - Allow: Support metrics like sum, value_count, min, and max.
  - Intelligent avg Handling: For an avg metric, the initial rollup should store the underlying sum and value_count (if not specified by the user) to allow for accurate, weighted re-calculation in subsequent tiers.
  - Disallow (Initially): Fail with a clear error for unsupported metrics like cardinality or percentiles for now.
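To make the validation and avg handling above concrete, here is a minimal Python sketch. It is not ISM code: the function names and the two metric sets are hypothetical, and the sketch assumes avg is never stored directly but always derived from a stored sum and value_count.

```python
# Hypothetical helper names; nothing here exists in the ISM codebase.
REAGGREGATABLE = frozenset({"sum", "value_count", "min", "max"})
NOT_YET_SUPPORTED = frozenset({"cardinality", "percentiles"})


def metrics_to_store(requested):
    """Return the sorted metrics a first-tier rollup would persist."""
    stored = set()
    for metric in requested:
        if metric in NOT_YET_SUPPORTED:
            # Fail early with a clear error instead of producing wrong data.
            raise ValueError(f"metric '{metric}' cannot be re-aggregated yet")
        if metric == "avg":
            # Store the building blocks so later tiers can recompute avg.
            stored.update({"sum", "value_count"})
        elif metric in REAGGREGATABLE:
            stored.add(metric)
        else:
            raise ValueError(f"unknown metric '{metric}'")
    return sorted(stored)


def merge_avg(buckets):
    """Weighted average from (sum, value_count) pairs of finer buckets."""
    total = sum(s for s, _ in buckets)
    count = sum(c for _, c in buckets)
    return total / count if count else None


print(metrics_to_store(["avg", "max"]))
print(merge_avg([(10.0, 1), (180.0, 9)]))
```

With one bucket averaging 10 over a single document and another averaging 20 over nine documents, a plain average of the two averages would give 15, while the weighted result from the stored sum and value_count is 19. That gap is the reason the first tier needs to keep both sub-metrics.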
Future Compatibility for Advanced Metrics (RFC)
This feature should be designed to be forward-compatible. A later RFC could add support for advanced metrics like cardinality and percentiles by providing an option to store the raw underlying sketches (e.g., HyperLogLog++ or T-Digest) as binary objects. This would enable them to be merged in subsequent rollups. I plan to open a separate RFC/FR to discuss advanced metrics beyond the currently supported ones in more detail.
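A small Python illustration of why cardinality needs mergeable state rather than a stored number. Exact Python sets stand in for HyperLogLog++ sketches here, which is a simplification: a real sketch is approximate and fixed-size, but it merges in the same union-like way. The sample data is made up.

```python
# Distinct users seen in two adjacent 1-minute buckets (made-up data).
minute_1 = {"alice", "bob"}
minute_2 = {"bob", "carol"}

# If the first rollup stores only the cardinality per bucket, the coarser
# bucket can only be bounded, not computed.
counts = [len(minute_1), len(minute_2)]
lower_bound = max(counts)   # every user overlaps
upper_bound = sum(counts)   # no user overlaps

# If the first rollup stores mergeable state (a set here, a sketch in
# practice), the coarser bucket is the merge of the finer ones.
merged = minute_1 | minute_2
true_cardinality = len(merged)

print(lower_bound, upper_bound, true_cardinality)
```

The true value falls strictly between the two bounds, so neither summing nor taking the maximum of stored cardinalities is safe, which is consistent with rejecting these metrics until sketches can be stored.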
Describe alternatives you've considered
The primary alternative is the current manual workaround using _reindex and external scripts. This alternative adds significant operational overhead and is not a scalable solution.
Proposed Implementation Strategy (can use some brainstorming)
As someone who is not deeply familiar with the ISM codebase, this is a high-level suggestion intended as a starting point for discussion.
A potential approach could involve updating the policy to permit sequential rollup states while enhancing the rollup action itself. The action would need to first detect if its source is already a rollup index. If so, it would trigger new logic to validate metric compatibility (e.g., sum, avg) and correctly build re-aggregation queries that target the appropriate sub-fields from the source documents.
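As a starting point for that discussion, the query-building step might look roughly like the Python sketch below. It builds a standard date_histogram search body; everything else is an assumption made for illustration: the function name, the aggregation names, and especially the sub-field naming scheme (field.metric), which may not match how rollup indices actually store their values.

```python
# Which aggregation re-combines each stored sub-metric. Note that stored
# counts are re-aggregated with "sum", not with "value_count".
REAGG_OP = {"sum": "sum", "value_count": "sum", "min": "min", "max": "max"}


def build_reaggregation(field, stored_metrics, interval,
                        timestamp_field="@timestamp"):
    """Build a search body that re-aggregates an existing rollup index.

    `stored_metrics` lists the sub-metrics present in the source rollup.
    The "<field>.<metric>" sub-field layout is hypothetical.
    """
    sub_aggs = {}
    for metric in sorted(stored_metrics):
        if metric not in REAGG_OP:
            raise ValueError(f"metric '{metric}' is not re-aggregatable")
        sub_aggs[f"{field}_{metric}"] = {
            REAGG_OP[metric]: {"field": f"{field}.{metric}"}
        }
    return {
        "size": 0,
        "aggs": {
            "by_time": {
                "date_histogram": {
                    "field": timestamp_field,
                    "fixed_interval": interval,
                },
                "aggs": sub_aggs,
            }
        },
    }


body = build_reaggregation("latency_stats", ["max", "sum", "value_count"], "10m")
print(body["aggs"]["by_time"]["aggs"])
```

An avg for each coarser bucket would then be computed as the re-aggregated sum divided by the re-aggregated count, rather than by averaging the finer averages.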
Additional context: Proposed Policy Structure
This conceptual policy structure defines a multi-tier policy where data is progressively rolled up.
{
  "policy": {
    "description": "A multi-tier rollup policy for observability-logs.",
    "default_state": "live",
    "states": [
      {
        "name": "live",
        "actions": [],
        "transitions": [
          {
            "state_name": "rollup_tier1_1m",
            "conditions": { "min_index_age": "6h" }
          }
        ]
      },
      {
        "name": "rollup_tier1_1m",
        "actions": [
          {
            "rollup": {
              "source_index": "observability-logs-{{ctx.index}}",
              "target_index": "rollup-1m-{{ctx.index}}",
              "dimensions": [{"date_histogram": {"source_field": "@timestamp", "fixed_interval": "1m"}}],
              "metrics": [
                {"name": "latency_stats", "source_field": "latency_ms", "metrics": ["avg", "max"]}
              ]
            }
          }
        ],
        "transitions": [
          {
            "state_name": "rollup_tier2_10m",
            "conditions": { "min_index_age": "24h" }
          }
        ]
      },
      {
        "name": "rollup_tier2_10m",
        "actions": [
          {
            "rollup": {
              "source_index": "rollup-1m-{{ctx.index}}",
              "target_index": "rollup-10m-{{ctx.index}}",
              "dimensions": [{"date_histogram": {"source_field": "@timestamp", "fixed_interval": "10m"}}],
              "metrics": [
                {"name": "latency_stats", "source_field": "latency_stats", "metrics": ["avg", "max"]}
              ]
            }
          }
        ],
        "transitions": [
          {
            "state_name": "delete",
            "conditions": { "min_index_age": "30d" }
          }
        ]
      },
      {
        "name": "delete",
        "actions": [{"delete": {}}]
      }
    ]
  }
}
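For discussion, a policy of this conceptual shape could be scanned for chained rollup states with something like the following Python sketch. The function name is hypothetical, the embedded policy is a trimmed copy of the structure above, and the sketch assumes that the order of the states list matches the order in which an index moves through them.

```python
def chained_rollup_states(policy):
    """Return names of states whose rollup source is an earlier rollup target."""
    seen_targets = set()
    chained = []
    for state in policy["policy"]["states"]:
        for action in state.get("actions", []):
            rollup = action.get("rollup")
            if rollup is None:
                continue  # not a rollup action (e.g. delete)
            if rollup["source_index"] in seen_targets:
                chained.append(state["name"])
            seen_targets.add(rollup["target_index"])
    return chained


# Trimmed version of the conceptual policy (transitions and metrics omitted).
policy = {
    "policy": {
        "states": [
            {"name": "live", "actions": []},
            {"name": "rollup_tier1_1m", "actions": [{"rollup": {
                "source_index": "observability-logs-{{ctx.index}}",
                "target_index": "rollup-1m-{{ctx.index}}"}}]},
            {"name": "rollup_tier2_10m", "actions": [{"rollup": {
                "source_index": "rollup-1m-{{ctx.index}}",
                "target_index": "rollup-10m-{{ctx.index}}"}}]},
            {"name": "delete", "actions": [{"delete": {}}]},
        ]
    }
}

print(chained_rollup_states(policy))
```

States flagged this way are the ones where the metric compatibility check would need to run before the rollup starts.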