[WIP] Failure store - Lifecycle Management #125658

gmarouli · 2025-03-26T12:09:17Z

The failure store is a set of data stream indices used to store certain types of ingestion failures. Until now, these indices shared the configuration of the backing indices. However, the two data sets have different lifecycle needs.

We believe that failures will typically need to be retained for much less time than the data. Given this, the lifecycle needs of the failures are also more limited, and they fit the simplicity of the data stream lifecycle feature better.

This allows the user to set only the desired retention, and we perform the rollover and other maintenance tasks without the user having to think about them. Furthermore, having only one lifecycle management feature allows us to ensure that this data is managed by default.

This PR introduces the following:

Configuration

We extend the failure store configuration to also accept a lifecycle configuration. This configuration reflects only what the user has set, as shown below:

PUT _data_stream/*/options
{
  "failure_store": {
     "lifecycle": {
       "retention": "5d"
     }
  }
}

GET _data_stream/*/options

{
  "data_streams": [
    {
      "name": "my-ds",
      "options": {
        "failure_store": {
          "lifecycle": {
            "retention": "5d"
          }
        }
      }
    }
  ]
}

To retrieve the effective configuration, use the GET data streams API; see #126668.
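
As a rough illustration of the request shown above, the following sketch builds the options body for a given retention. The helper name `failure_store_lifecycle_body` is hypothetical and not part of this PR, and the client-side check against Elasticsearch time units is an assumption added for convenience; the server remains the authority on what is accepted.

```python
import json
import re

# Time-unit suffixes used by Elasticsearch time values (days down to nanoseconds).
_TIME_VALUE = re.compile(r"\d+(d|h|m|s|ms|micros|nanos)")


def failure_store_lifecycle_body(retention: str) -> dict:
    """Build a data stream options body that sets only the failures retention.

    Hypothetical helper: it mirrors the PUT body from the description and
    performs a light client-side sanity check on the time value.
    """
    if not _TIME_VALUE.fullmatch(retention):
        raise ValueError(f"not a valid time value: {retention!r}")
    return {"failure_store": {"lifecycle": {"retention": retention}}}


if __name__ == "__main__":
    # Prints the same JSON document as the PUT example in the description.
    print(json.dumps(failure_store_lifecycle_body("5d"), indent=2))
```

Only the retention is set here, which matches the intent that rollover and other maintenance tasks are handled without user input.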

Functionality

- The data stream lifecycle (DLM) manages the failure indices regardless of whether the failure store is enabled. This ensures that we do not leave stagnant data behind if the failure store gets disabled.
- The data stream options APIs reflect only the user's configuration.
- The GET data stream API should be used to check the current state of the effective failure store configuration.
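
The difference between the user's configuration and the effective configuration can be sketched as follows. This is an assumption-laden illustration, not the code in this PR: the function name is hypothetical, and the resolution order (configured retention, else the failures default, then capped by the global maximum when one is defined) is assumed from how global retention is described here.

```python
from typing import Optional

DAY_MILLIS = 24 * 60 * 60 * 1000


def effective_failures_retention_millis(
    configured: Optional[int],
    failures_default: int,
    global_max: Optional[int],
) -> int:
    """Resolve the retention that would actually apply to failure indices.

    Assumptions (not confirmed by the PR text):
    - the user's configured retention wins over the default when present;
    - the failures default is always defined (non-nullable);
    - a defined global maximum caps whatever value was chosen.
    """
    retention = configured if configured is not None else failures_default
    if global_max is not None:
        retention = min(retention, global_max)
    return retention


if __name__ == "__main__":
    # No user configuration: the failures default applies.
    print(effective_failures_retention_millis(None, 30 * DAY_MILLIS, None))
    # User configuration above the global maximum: the maximum applies.
    print(effective_failures_retention_millis(90 * DAY_MILLIS, 30 * DAY_MILLIS, 60 * DAY_MILLIS))
```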

Telemetry

We extend the data stream failure store telemetry to also include the lifecycle telemetry.

{
  "data_streams": {
    "available": true,
    "enabled": true,
    "data_streams": 10,
    "indices_count": 50,
    "failure_store": {
      "explicitly_enabled_count": 1,
      "effectively_enabled_count": 15,
      "failure_indices_count": 30,
      "lifecycle": {
        "explicitly_enabled_count": 5,
        "effectively_enabled_count": 20,
        "data_retention": {
          "configured_data_streams": 5,
          "minimum_millis": X,
          "maximum_millis": Y,
          "average_millis": Z
        },
        "effective_retention": {
          "retained_data_streams": 20,
          "minimum_millis": X,
          "maximum_millis": Y,
          "average_millis": Z
        },
        "global_retention": {
          "max": {
            "defined": false
          },
          "default": {
            "defined": true,
            "millis": X
          }
        }
      }
    }
  }
}

The `default` entry under `global_retention` is the default value applicable to the failure store.
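
To make the shape of the retention statistics concrete, here is a minimal sketch of how the `minimum_millis`, `maximum_millis`, and `average_millis` fields could be derived from a list of per-data-stream retentions. The function name is hypothetical, and omitting the three statistics when no data stream contributes a value is an assumption rather than documented behaviour.

```python
from typing import Iterable


def retention_stats(retentions_millis: Iterable[int], count_key: str) -> dict:
    """Summarise retentions (in milliseconds) in the telemetry's field layout.

    count_key is the name of the counter field, e.g. "configured_data_streams"
    or "retained_data_streams" as seen in the telemetry example.
    """
    values = list(retentions_millis)
    stats = {count_key: len(values)}
    if values:  # Assumption: statistics are left out when there is nothing to summarise.
        stats["minimum_millis"] = min(values)
        stats["maximum_millis"] = max(values)
        stats["average_millis"] = sum(values) / len(values)
    return stats


if __name__ == "__main__":
    print(retention_stats([1000, 3000, 5000], "configured_data_streams"))
```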

gmarouli · 2025-04-09T16:43:58Z

Work that has been done here is also extracted in:

The class `DataStreamLifecycle` currently captures the lifecycle configuration that manages all data stream indices, but soon it will be split into two variants: the data lifecycle and the failures lifecycle. Some pre-work has already been done, but as we progress with our POC we see that it would be really useful for `DataStreamLifecycle` to be "aware" of the target index component. This would allow us to correctly apply global retention, or to throw an error if a downsampling configuration is provided to a failure lifecycle.

In this PR, we perform a small refactoring to reduce the noise in #125658. Here we introduce the following:

- A factory method that creates a data lifecycle; for now it is trivial, but it will be more useful soon.
- We rename the "empty" builder to explicitly mention the index component it refers to.
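
The idea of a lifecycle that is "aware" of its target index component can be sketched in a few lines. This is a language-neutral illustration written in Python, not the Java code in the PR: `LifecycleType`, `Lifecycle`, `create_data_lifecycle`, and `create_failures_lifecycle` are hypothetical names, and the failures factory in particular is an extrapolation from the description.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Sequence


class LifecycleType(Enum):
    """The index component a lifecycle targets (hypothetical naming)."""
    DATA = "data"
    FAILURES = "failures"


@dataclass(frozen=True)
class Lifecycle:
    lifecycle_type: LifecycleType
    enabled: bool = True
    retention: Optional[str] = None
    downsampling: Optional[Sequence[dict]] = None

    def __post_init__(self) -> None:
        # Knowing the target component lets us reject invalid combinations early.
        if self.lifecycle_type is LifecycleType.FAILURES and self.downsampling is not None:
            raise ValueError("a failures lifecycle does not support downsampling")


def create_data_lifecycle(retention=None, downsampling=None) -> Lifecycle:
    """Factory for the data variant; trivial today, a single place to extend later."""
    return Lifecycle(LifecycleType.DATA, retention=retention, downsampling=downsampling)


def create_failures_lifecycle(retention=None) -> Lifecycle:
    """Factory for the failures variant (assumed counterpart, not named in the PR)."""
    return Lifecycle(LifecycleType.FAILURES, retention=retention)


if __name__ == "__main__":
    print(create_data_lifecycle(retention="30d"))
    print(create_failures_lifecycle(retention="5d"))
```

Routing construction through factories keeps the component type out of call sites, so a later split into two variants would touch fewer places.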

This refactoring was also backported (cherry picked from commit b991708), with conflicts in:

- modules/data-streams/src/test/java/org/elasticsearch/datastreams/lifecycle/DataStreamLifecycleServiceTests.java
- server/src/test/java/org/elasticsearch/cluster/metadata/MetadataDataStreamsServiceTests.java
- server/src/test/java/org/elasticsearch/cluster/metadata/MetadataIndexTemplateServiceTests.java

gmarouli added 3 commits March 26, 2025 14:00

Merge getBackingIndicesPastRetention & getFailureIndicesPastRetention

b87eebf

Introduce failure store configuration

7f37ad1

Retrieve the failure lifecycle from a data stream.

42efb21

elasticsearchmachine added the v9.1.0 label Mar 26, 2025

gmarouli added 12 commits March 26, 2025 17:53

Preliminary testing

4ace797

Merge with main

839d874

Fix existing tests

7debf5a

Remove ILM policy from the supported settings

ca8a642

Test the failures lifecycle retrieval

fec79c8

Test failure lifecycle with the global retention

6e24f58

Merge with main

b36df2d

Extend the global retention tests to work with failure lifecycle

63283f2

Make the backingIndexEqualTo check the prefixes too

864d710

Test and fix bugs in recording DLM errors for failure store

aa75489

Rename new yaml test to preserve number sequence

d65ec83

Merge branch 'main' into failure-store/lifecycle

a441c6f

elasticsearchmachine added the serverless-linked (Added by automation, don't add manually) label Apr 3, 2025

Consistently name the failures lifecycle

df1c573

gmarouli changed the title ~~[Failure store] Introduce dedicated lifecycle~~ Failure store - Lifecycle Management Apr 3, 2025

gmarouli added 8 commits April 3, 2025 19:23

Move PutDataStreamOptionsAction to server to be accessible by other modules

3c0a756

Merge branch 'main' into failure-store/lifecycle

301f3e5

Fix camel case format

85e3ae3

Merge branch 'main' into failure-store/lifecycle

31c5fea

Small fixes

2189447

Merge branch 'main' into failure-store/lifecycle

46ba862

Add lifecycle in the edit data stream options

8375dc4

Extend documentation

c9780c4

gmarouli added the >non-issue and :Data Management/Data streams (Data streams and their lifecycles) labels Apr 7, 2025

gmarouli requested a review from jbaiera April 7, 2025 11:05

gmarouli added 5 commits April 9, 2025 13:50

Rename random lifecycle generators too

0d3d038

Introduce lifecycle type to data stream lifecycle

3e5cae4

Add a label to the lifecycle type that can be used messages.

038129f

Extend data stream lifecycle service security test to test the failure store.

24c5abe

Bug fix: pass along system descriptor when creating a failure index

a4f919f

gmarouli marked this pull request as draft April 9, 2025 16:43

gmarouli added 3 commits April 10, 2025 18:17

Merge branch 'main' into failure-store/lifecycle

508094a

Merge branch 'main' into failure-store/lifecycle

a52295e

Expose failure store lifecycle information via the GET data stream API

dc5f3cd

gmarouli changed the title ~~Failure store - Lifecycle Management~~ [WIP] Failure store - Lifecycle Management Apr 11, 2025

gmarouli added 12 commits April 11, 2025 12:55

Add test case for when we display ilm

cd4e3fa

Move test to a more appropriate place

3fa4335

Merge branch 'main' into failure-store/lifecycle

b0151d9

Merge branch 'failure-lifecycle/expose-in-get-data-streams' into failure-store/lifecycle

5cfc23e

Adjust tests

9ab3cbd

Merge branch 'main' into failure-store/lifecycle

db93f59

Test fix: ensure lifecycle remains enabled

08fa539

Test fix: ensure downsampling is not null

bab5b27

Add failures default retention

17446df

Make explicit that failures default is not nullable.

1129985

Fix tests that did not anticipate not null failures_default

f487c37

Merge branch 'main' into failure-store/lifecycle

13f935b

Merge branch 'main' into failure-store/lifecycle

bb2d66f

gmarouli added 2 commits April 23, 2025 16:54

Merge branch 'main' into failure-store/lifecycle

8d9de30

Skip certain test in rest compatibility

6e2907d

gmarouli closed this Apr 29, 2025
