[WIP] Failure store - Lifecycle Management #125658

Closed
wants to merge 58 commits into from

Conversation

gmarouli

@gmarouli gmarouli commented Mar 26, 2025

The failure store is a set of data stream indices used to store certain types of ingestion failures. Until now, these indices shared the configuration of the backing indices, but the two data sets have different lifecycle needs.

We expect that failures will typically need to be retained for much less time than the data. Given this, the lifecycle needs of the failures are also more limited, and they are a better fit for the simplicity of the data stream lifecycle feature.

This allows the user to set only the desired retention, while we perform the rollover and other maintenance tasks without the user having to think about them. Furthermore, having only one lifecycle management feature allows us to ensure that this data is managed by default.

This PR introduces the following:

Configuration

We extend the failure store configuration to also allow a lifecycle configuration. The data stream options APIs reflect only the user's configuration, as shown below:

PUT _data_stream/*/options
{
  "failure_store": {
     "lifecycle": {
       "retention": "5d"
     }
  }
}

GET _data_stream/*/options

{
  "data_streams": [
    {
      "name": "my-ds",
      "options": {
        "failure_store": {
          "lifecycle": {
            "retention": "5d"
          }
        }
      }
    }
  ]
}
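
As a minimal sketch of the request shown above, the following Python builds the method, path and body for setting the failure store retention. The function name `failure_store_lifecycle_request` is hypothetical and only illustrates the payload shape from this description; it does not send anything to a cluster.

```python
import json


def failure_store_lifecycle_request(data_stream: str, retention: str):
    """Build (method, path, body) for setting the failure store retention.

    Hypothetical helper: it only mirrors the request shape in this PR
    description and performs no network call.
    """
    body = {"failure_store": {"lifecycle": {"retention": retention}}}
    path = f"_data_stream/{data_stream}/options"
    return "PUT", path, json.dumps(body, indent=2)


method, path, body = failure_store_lifecycle_request("my-ds", "5d")
print(method, path)
print(body)
```

Passing `*` as the data stream name would reproduce the wildcard request from the example above.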

To retrieve the effective configuration, use the GET data streams API; see #126668.

Functionality

  • The data stream lifecycle (DLM) will manage the failure indices regardless of whether the failure store is enabled. This ensures that if the failure store gets disabled, we will not be left with stagnant data.
  • The data stream options APIs reflect only the user's configuration.
  • The GET data stream API should be used to check the current state of the effective failure store configuration.
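
The difference between the user's configuration and the effective configuration can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: it assumes the configured retention wins when present, the default global retention applies otherwise, and a defined maximum global retention caps the result. The function name `effective_retention_millis` is hypothetical.

```python
from typing import Optional


def effective_retention_millis(
    configured: Optional[int],
    default_global: Optional[int],
    max_global: Optional[int],
) -> Optional[int]:
    """Resolve the retention that would actually apply, in milliseconds.

    Assumed rules (not confirmed by the PR text):
      1. use the user's configured retention if there is one,
      2. otherwise fall back to the default global retention,
      3. cap the result by the maximum global retention when it is defined.
    None means "not defined" (or "retain forever" for the result).
    """
    retention = configured if configured is not None else default_global
    if max_global is not None and (retention is None or retention > max_global):
        retention = max_global
    return retention


DAY = 24 * 60 * 60 * 1000
print(effective_retention_millis(5 * DAY, 30 * DAY, None))
print(effective_retention_millis(None, 30 * DAY, None))
```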

Telemetry

We extend the data stream failure store telemetry to also include the lifecycle telemetry.

{
  "data_streams": {
    "available": true,
    "enabled": true,
    "data_streams": 10,
    "indices_count": 50,
    "failure_store": {
      "explicitly_enabled_count": 1,
      "effectively_enabled_count": 15,
      "failure_indices_count": 30,
      "lifecycle": {
        "explicitly_enabled_count": 5,
        "effectively_enabled_count": 20,
        "data_retention": {
          "configured_data_streams": 5,
          "minimum_millis": X,
          "maximum_millis": Y,
          "average_millis": Z
        },
        "effective_retention": {
          "retained_data_streams": 20,
          "minimum_millis": X,
          "maximum_millis": Y,
          "average_millis": Z
        },
        "global_retention": {
          "max": {
            "defined": false
          },
          "default": {
            "defined": true,  <------ this is the default value applicable for the failure store
            "millis": X
          }
        }
      }
    }
  }
}
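
The `minimum_millis`, `maximum_millis` and `average_millis` fields above can be read as simple aggregates over the per-data-stream retentions. A rough sketch, assuming retentions arrive as strings such as `5d`; the helpers `to_millis` and `retention_stats` are hypothetical, and the handling of an empty input is an assumption:

```python
# Supported time units for this sketch, in milliseconds.
_UNIT_MILLIS = {"ms": 1, "s": 1_000, "m": 60_000, "h": 3_600_000, "d": 86_400_000}


def to_millis(value: str) -> int:
    """Convert a retention string such as '5d' or '12h' to milliseconds."""
    unit = "ms" if value.endswith("ms") else value[-1]
    number = value[: -len(unit)]
    return int(number) * _UNIT_MILLIS[unit]


def retention_stats(retentions: list) -> dict:
    """Aggregate retentions into the shape of the `data_retention` object."""
    millis = [to_millis(r) for r in retentions]
    if not millis:
        # Assumption: with no configured retention only the count is reported.
        return {"configured_data_streams": 0}
    return {
        "configured_data_streams": len(millis),
        "minimum_millis": min(millis),
        "maximum_millis": max(millis),
        "average_millis": sum(millis) / len(millis),
    }


print(retention_stats(["5d", "1d", "3d"]))
```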

@elasticsearchmachine elasticsearchmachine added the serverless-linked label Apr 3, 2025
@gmarouli gmarouli changed the title from "[Failure store] Introduce dedicated lifecycle" to "Failure store - Lifecycle Management" Apr 3, 2025
@gmarouli gmarouli added the >non-issue and :Data Management/Data streams labels Apr 7, 2025
@gmarouli gmarouli requested a review from jbaiera April 7, 2025 11:05
@gmarouli gmarouli marked this pull request as draft April 9, 2025 16:43
@gmarouli gmarouli changed the title from "Failure store - Lifecycle Management" to "[WIP] Failure store - Lifecycle Management" Apr 11, 2025
elasticsearchmachine pushed a commit that referenced this pull request Apr 23, 2025
The class `DataStreamLifecycle` is currently capturing the lifecycle
configuration that currently manages all data stream indices, but soon
enough it will be split into two variants, the data and the failures
lifecycle. 

Some pre-work has been done already but as we are progressing in our
POC, we see that it will be really useful if the `DataStreamLifecycle`
is "aware" of the target index component. This will allow us to
correctly apply global retention or to throw an error if a downsampling
configuration is provided to a failure lifecycle.

In this PR, we perform a small refactoring to reduce the noise in
#125658. Here we introduce
the following:

- A factory method that creates a data lifecycle, for now it's trivial but it will be more useful soon.
- We rename the "empty" builder to explicitly mention the index component it refers to.
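
To illustrate the idea of a lifecycle that is "aware" of its target index component, here is a small sketch. It is written in Python for brevity rather than in the project's Java, and every name in it (`TargetComponent`, `Lifecycle`, `data_lifecycle`, `failures_lifecycle`) is hypothetical; it does not reflect the actual `DataStreamLifecycle` API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class TargetComponent(Enum):
    DATA = "data"
    FAILURES = "failures"


@dataclass(frozen=True)
class Lifecycle:
    """Hypothetical lifecycle that knows which index component it targets."""

    target: TargetComponent
    retention: Optional[str] = None
    downsampling: Optional[tuple] = None

    def __post_init__(self):
        # Downsampling only makes sense for data, so reject it for failures.
        if self.target is TargetComponent.FAILURES and self.downsampling is not None:
            raise ValueError("downsampling is not supported for a failures lifecycle")


def data_lifecycle(retention=None, downsampling=None) -> Lifecycle:
    """Factory for a lifecycle that targets the backing (data) indices."""
    return Lifecycle(TargetComponent.DATA, retention, downsampling)


def failures_lifecycle(retention=None) -> Lifecycle:
    """Factory for a lifecycle that targets the failure indices."""
    return Lifecycle(TargetComponent.FAILURES, retention)


print(data_lifecycle("30d"))
print(failures_lifecycle("5d"))
```

Keeping the target component on the object lets validation happen at construction time, so an invalid failures lifecycle cannot be created in the first place.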
gmarouli added a commit to gmarouli/elasticsearch that referenced this pull request Apr 23, 2025
The class `DataStreamLifecycle` is currently capturing the lifecycle
configuration that currently manages all data stream indices, but soon
enough it will be split into two variants, the data and the failures
lifecycle.

Some pre-work has been done already but as we are progressing in our
POC, we see that it will be really useful if the `DataStreamLifecycle`
is "aware" of the target index component. This will allow us to
correctly apply global retention or to throw an error if a downsampling
configuration is provided to a failure lifecycle.

In this PR, we perform a small refactoring to reduce the noise in
elastic#125658. Here we introduce
the following:

- A factory method that creates a data lifecycle, for now it's trivial but it will be more useful soon.
- We rename the "empty" builder to explicitly mention the index component it refers to.

(cherry picked from commit b991708)

# Conflicts:
#	modules/data-streams/src/test/java/org/elasticsearch/datastreams/lifecycle/DataStreamLifecycleServiceTests.java
#	server/src/test/java/org/elasticsearch/cluster/metadata/MetadataDataStreamsServiceTests.java
#	server/src/test/java/org/elasticsearch/cluster/metadata/MetadataIndexTemplateServiceTests.java
gmarouli added a commit that referenced this pull request Apr 23, 2025
@gmarouli gmarouli closed this Apr 29, 2025
Labels
:Data Management/Data streams, >non-issue, serverless-linked, Team:Data Management, v9.1.0
2 participants