[FEATURE REQUEST] Multi-Tier Rollup Support #1490

@sandeshkr419

Is your feature request related to a problem? Please describe.

The ISM rollup action is a terminal operation, preventing the creation of multi-tiered rollups on already aggregated data. This limitation forces users with tiered retention strategies to build complex external automation.

The common workaround involves a cumbersome chain of: Rollup -> _reindex -> Second Rollup Job. This process is difficult to manage, error-prone, and requires external scheduling, which undermines the goal of an integrated lifecycle system like ISM.

A common use case is a tiered retention strategy for observability data:

  • Last 6 Hours: Data aggregated at 1-minute granularity.
  • 6-24 Hours: Data re-aggregated into 10-minute granularity.
  • 24+ Hours: Data re-aggregated into 1-hour granularity for long-term storage.

Describe the solution you'd like

Enhance the ISM policy to support a multi-layered rollup action, where an index can transition through sequential rollup states. The action should be safe by design: before performing a consecutive rollup, it validates that the metrics stored in the source rollup can be re-aggregated.

Key capabilities:

  • Chained Rollup States: Allow a rollup action's target index to be the source for a subsequent rollup action in a later state.
  • Metric Compatibility Validation: Automatically check if source rollup metrics are re-aggregatable.
    • Allow: Support metrics like sum, value_count, min, and max.
    • Intelligent avg Handling: For an avg metric, the initial rollup should store the underlying sum and value_count (if not specified by the user) to allow for accurate, weighted re-calculation in subsequent tiers.
    • Disallow (Initially): Fail with a clear error for metrics that cannot be re-aggregated from their stored values, such as cardinality or percentiles.
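
To illustrate why avg needs its underlying sum and value_count, here is a minimal Python sketch of re-aggregating two fine-grained buckets into one coarser bucket. The `Bucket` shape, the `REAGGREGATABLE` set, and the function names are hypothetical and exist only for this example; they are not part of ISM or the rollup plugin.

```python
from dataclasses import dataclass

# Metrics whose pre-aggregated values can be combined again without raw data.
REAGGREGATABLE = {"sum", "value_count", "min", "max", "avg"}


@dataclass
class Bucket:
    """One pre-aggregated bucket (hypothetical shape, for illustration only)."""
    sum: float
    value_count: int
    min: float
    max: float


def validate_metrics(requested):
    """Raise a clear error if any requested metric cannot be re-aggregated."""
    unsupported = sorted(set(requested) - REAGGREGATABLE)
    if unsupported:
        raise ValueError(f"metrics not supported on a rollup source: {unsupported}")


def merge(buckets):
    """Combine fine-grained buckets into one coarser bucket."""
    return Bucket(
        sum=sum(b.sum for b in buckets),
        value_count=sum(b.value_count for b in buckets),
        min=min(b.min for b in buckets),
        max=max(b.max for b in buckets),
    )


def avg(bucket):
    """Weighted average derived from the stored sum and value_count."""
    return bucket.sum / bucket.value_count


# Two 1-minute buckets with very different document counts.
a = Bucket(sum=100.0, value_count=1, min=100.0, max=100.0)  # avg 100
b = Bucket(sum=90.0, value_count=9, min=5.0, max=20.0)      # avg 10

merged = merge([a, b])
naive = (avg(a) + avg(b)) / 2   # average of averages: 55.0
weighted = avg(merged)          # 190 / 10 = 19.0
```

The average of averages (55.0) overweights the bucket that holds a single document, while the value derived from sum and value_count (19.0) matches what an avg over the raw documents would return.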

Future Compatibility for Advanced Metrics (RFC)

This feature should be designed to be forward-compatible. A later RFC could add support for advanced metrics like cardinality and percentiles by providing an option to store the raw underlying sketches (e.g., HyperLogLog++ or T-Digest) as binary objects. This would enable them to be merged in subsequent rollups. I plan to open a separate RFC/FR to discuss advanced metrics beyond the currently supported ones in more detail.
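
The property that makes stored sketches useful here is mergeability: merging the sketches of two time buckets gives the same result as sketching the union of their values. The toy Python example below demonstrates this with HyperLogLog-style registers. It is a deliberately simplified illustration, not HyperLogLog++ and not how OpenSearch stores cardinality state; the register count, the hash choice, and all names are assumptions.

```python
import hashlib

P = 4        # 2**4 = 16 registers; far too few for real use
M = 1 << P


def _hash64(value):
    """Stable 64-bit hash of a string."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


def sketch(values):
    """Build toy HyperLogLog-style registers from an iterable of strings."""
    registers = [0] * M
    for v in values:
        h = _hash64(v)
        idx = h >> (64 - P)                       # top P bits pick the register
        rest = h & ((1 << (64 - P)) - 1)          # remaining 60 bits
        rank = (64 - P) - rest.bit_length() + 1   # leading zeros in rest, plus 1
        registers[idx] = max(registers[idx], rank)
    return registers


def merge(a, b):
    """Merge two sketches by taking the element-wise max of their registers."""
    return [max(x, y) for x, y in zip(a, b)]


hour1 = {"u1", "u2", "u3"}
hour2 = {"u3", "u4"}

merged = merge(sketch(hour1), sketch(hour2))
same_as_union = merged == sketch(hour1 | hour2)   # True
```

Because each register is a maximum over the values that hash to it, the merge is lossless with respect to the sketch itself. A finished cardinality number has no such property, which is why it cannot be re-aggregated in a later tier.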

Describe alternatives you've considered

The primary alternative is the current manual workaround using _reindex and external scripts. This alternative adds significant operational overhead and is not a scalable solution.

Proposed Implementation Strategy (can use some brainstorming)

As someone who is not deeply familiar with the ISM codebase, this is a high-level suggestion intended as a starting point for discussion.

A potential approach could involve updating the policy to permit sequential rollup states while enhancing the rollup action itself. The action would need to first detect if its source is already a rollup index. If so, it would trigger new logic to validate metric compatibility (e.g., sum, avg) and correctly build re-aggregation queries that target the appropriate sub-fields from the source documents.
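
As a rough sketch of the query-building step, the Python below maps each requested metric to aggregations over the sub-fields stored by a previous tier. The aggregation bodies use the standard `{"<type>": {"field": ...}}` form; everything else is an assumption made for illustration, in particular the `<field>.<metric>` sub-field layout, the `REAGG` table, and the `source_is_rollup` flag (how the action detects a rollup source is left out).

```python
# How each stored metric is combined when the source is already a rollup.
# Stored counts are summed; avg is rebuilt from its sum and value_count parts.
REAGG = {
    "sum": [("sum", "sum")],
    "value_count": [("value_count", "sum")],
    "min": [("min", "min")],
    "max": [("max", "max")],
    "avg": [("sum", "sum"), ("value_count", "sum")],
}


def build_aggs(field, metrics, source_is_rollup):
    """Return an OpenSearch-style "aggs" body for one metric field."""
    aggs = {}
    for metric in metrics:
        if not source_is_rollup:
            # Raw source: apply the metric directly to the field.
            aggs[f"{field}_{metric}"] = {metric: {"field": field}}
            continue
        if metric not in REAGG:
            raise ValueError(f"cannot re-aggregate '{metric}' from a rollup source")
        for stored, op in REAGG[metric]:
            # Hypothetical sub-field layout: "<field>.<stored metric>".
            aggs[f"{field}_{stored}"] = {op: {"field": f"{field}.{stored}"}}
    return aggs


tier2 = build_aggs("latency_stats", ["avg", "max"], source_is_rollup=True)
# tier2 == {
#     "latency_stats_sum": {"sum": {"field": "latency_stats.sum"}},
#     "latency_stats_value_count": {"sum": {"field": "latency_stats.value_count"}},
#     "latency_stats_max": {"max": {"field": "latency_stats.max"}},
# }
```

The avg value for the coarser bucket is then the summed sum divided by the summed value_count, so the second tier never reads a stored average directly.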

Additional context: Proposed Policy Structure

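This conceptual policy structure defines a multi-tier policy in which data is progressively rolled up. For brevity it shows only two rollup tiers (1-minute and 10-minute); a 1-hour tier would be one more state of the same shape.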

{
  "policy": {
    "description": "A multi-tier rollup policy for observability-logs.",
    "default_state": "live",
    "states": [
      {
        "name": "live",
        "actions": [],
        "transitions": [
          {
            "state_name": "rollup_tier1_1m",
            "conditions": { "min_index_age": "6h" }
          }
        ]
      },
      {
        "name": "rollup_tier1_1m",
        "actions": [
          {
            "rollup": {
              "source_index": "observability-logs-{{ctx.index}}",
              "target_index": "rollup-1m-{{ctx.index}}",
              "dimensions": [{"date_histogram": {"source_field": "@timestamp", "fixed_interval": "1m"}}],
              "metrics": [
                {"name": "latency_stats", "source_field": "latency_ms", "metrics": ["avg", "max"]}
              ]
            }
          }
        ],
        "transitions": [
          {
            "state_name": "rollup_tier2_10m",
            "conditions": { "min_index_age": "24h" }
          }
        ]
      },
      {
        "name": "rollup_tier2_10m",
        "actions": [
          {
            "rollup": {
              "source_index": "rollup-1m-{{ctx.index}}",
              "target_index": "rollup-10m-{{ctx.index}}",
              "dimensions": [{"date_histogram": {"source_field": "@timestamp", "fixed_interval": "10m"}}],
              "metrics": [
                {"name": "latency_stats", "source_field": "latency_stats", "metrics": ["avg", "max"]}
              ]
            }
          }
        ],
        "transitions": [
          {
            "state_name": "delete",
            "conditions": { "min_index_age": "30d" }
          }
        ]
      },
      {
        "name": "delete",
        "actions": [{"delete": {}}]
      }
    ]
  }
}
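
A policy of this shape could be checked before it is accepted. The Python sketch below walks the states, collects the rollup actions, and verifies that each tier after the first reads from the previous tier's target index. It works on a trimmed copy of the conceptual policy above and assumes the states are listed in transition order; the function names are hypothetical and no such check exists in ISM today.

```python
def rollup_chain(policy):
    """Return (state, source_index, target_index) per rollup action, in state order."""
    chain = []
    for state in policy["policy"]["states"]:
        for action in state.get("actions", []):
            if "rollup" in action:
                r = action["rollup"]
                chain.append((state["name"], r["source_index"], r["target_index"]))
    return chain


def check_chain(chain):
    """Verify every tier after the first reads from the previous tier's target."""
    for (_, _, prev_target), (name, source, _) in zip(chain, chain[1:]):
        if source != prev_target:
            raise ValueError(
                f"state '{name}' reads '{source}', expected '{prev_target}'"
            )


# Trimmed copy of the conceptual policy: only the fields the check needs.
policy = {
    "policy": {
        "states": [
            {"name": "live", "actions": []},
            {
                "name": "rollup_tier1_1m",
                "actions": [{"rollup": {
                    "source_index": "observability-logs-{{ctx.index}}",
                    "target_index": "rollup-1m-{{ctx.index}}",
                }}],
            },
            {
                "name": "rollup_tier2_10m",
                "actions": [{"rollup": {
                    "source_index": "rollup-1m-{{ctx.index}}",
                    "target_index": "rollup-10m-{{ctx.index}}",
                }}],
            },
            {"name": "delete", "actions": [{"delete": {}}]},
        ]
    }
}

chain = rollup_chain(policy)
check_chain(chain)  # passes: tier 2 reads the index that tier 1 wrote
```

The comparison is done on the unrendered index name templates, which is enough to catch a tier that points at the wrong source when the policy is written.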
