
Conversation

@belimawr
Contributor

@belimawr belimawr commented Dec 11, 2025

Proposed commit message

See title

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works. Where relevant, I have used the stresstest.sh script to run them under stress conditions and race detector to verify their stability.
  • I have added an entry in ./changelog/fragments using the changelog tool.


How to test this PR locally

Manual Testing Procedure: Failure Store Metric

Prerequisites

  1. Elasticsearch cluster (version 8.11.0+) with failure store support enabled (see the Docker sketch after this list if you need one)
  2. A Beat instance (Filebeat, Metricbeat, etc.) configured to output to Elasticsearch
  3. Access to Elasticsearch API and Beat monitoring/metrics endpoint
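
If you don't have a cluster at hand, a single-node Docker container is enough. This is a minimal sketch, not part of the PR; the image tag and password are assumptions, adjust them to your environment:

docker run -d --name es-failure-store -p 9200:9200 \
  -e "discovery.type=single-node" \
  -e "ELASTIC_PASSWORD=changeme" \
  docker.elastic.co/elasticsearch/elasticsearch:9.1.0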

Test Setup

1. Create a Data Stream with Failure Store Enabled

Create an index template with failure store enabled and strict mappings:

PUT _index_template/test-failure-store-template
{
  "index_patterns": ["test-failure-store-*"],
  "data_stream": {},
  "template": {
    "data_stream_options": {
      "failure_store": {
        "enabled": true
      }
    },
    "mappings": {
      "properties": {
        "method":{
            "type": "integer"
        }
      }
    }
  }
}
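
Optionally, confirm the template was created before continuing:

GET _index_template/test-failure-store-template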

2. Initialize the Data Stream

Create the data stream by indexing two documents:

POST test-failure-store-ds/_bulk
{"create":{}}
{"@timestamp":"2025-12-12T15:42:00Z", "foo":"bar"}
{"create":{}}
{"@timestamp":"2025-12-12T15:42:00Z", "method": "POST"}

Ensure one of the documents went to the failure store: "POST" cannot be
indexed into the integer-mapped method field, so Elasticsearch redirects
that document. Look for "failure_store": "used" in the response, which
should look like this:

{
  "errors": false,
  "took": 200,
  "items": [
    {
      "create": {
        "_index": ".ds-test-failure-store-ds-2025.12.12-000001",
        "_id": "SIdcFJsBIoqQtrd2QGHk",
        "_version": 1,
        "result": "created",
        "_shards": {
          "total": 2,
          "successful": 1,
          "failed": 0
        },
        "_seq_no": 2,
        "_primary_term": 1,
        "status": 201
      }
    },
    {
      "create": {
        "_index": ".fs-test-failure-store-ds-2025.12.12-000002",
        "_id": "SodcFJsBIoqQtrd2QGH3",
        "_version": 1,
        "result": "created",
        "_shards": {
          "total": 2,
          "successful": 1,
          "failed": 0
        },
        "_seq_no": 268,
        "_primary_term": 1,
        "failure_store": "used",
        "status": 201
      }
    }
  ]
}

Ensure there is one document in the failure store:

GET test-failure-store-ds::failures/_search
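
If you're using curl rather than Kibana Dev Tools, an equivalent request is (credentials from the setup above; -k because these examples assume a self-signed certificate):

curl -sku elastic:changeme "https://localhost:9200/test-failure-store-ds::failures/_search" | jq '.hits.total.value'

It should report 1 at this point.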

3. Generate some logs that will cause a mapping conflict

You can use Docker and flog for this:

docker run -it --rm mingrammer/flog -f json -d 1 -s 1 -l > /tmp/flog.ndjson
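
flog emits Apache-style access log records as JSON, where method is a string such as "GET". Because the template above maps method as an integer, every one of these documents should end up in the failure store. A generated line looks roughly like this (values illustrative):

{"host":"203.0.113.7","user-identifier":"-","datetime":"12/Dec/2025:16:04:40 +0000","method":"GET","request":"/synergize/next-generation","protocol":"HTTP/1.1","status":200,"bytes":1234,"referer":"https://example.com/"}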

4. Run Filebeat

Build Filebeat from this PR and run it using the following
configuration (adjust the output settings as necessary):

filebeat.yml

filebeat.inputs:
  - type: filestream
    id: a-very-unique-id
    enabled: true
    paths:
      - /tmp/flog.ndjson
    parsers:
      - ndjson:
          keys_under_root: true
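    # Send events to the data stream created in step 1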
    index: test-failure-store-ds
    file_identity.native: ~
    prospector.scanner:
      fingerprint.enabled: false

queue.mem:
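  # flush.timeout: 0 publishes events immediately instead of waiting for a full batch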
  flush.timeout: 0

output.elasticsearch:
  hosts:
    - "https://localhost:9200"
  username: elastic
  password: changeme
  ssl.verification_mode: none

logging:
  metrics.period: 5s
  to_stderr: true

You can run Filebeat and pipe its logs through jq to extract the relevant fields:

go run . --path.home=$PWD 2>&1 | jq '{"ts": ."@timestamp", "lvl": ."log.level", "logger": ."log.logger", "m": .message, "fs": .monitoring.metrics.libbeat.output.events.failure_store}' -c

You should see some 5s metrics like this:

{"ts":"2025-12-12T16:04:40.795-0500","lvl":"info","logger":"monitoring","m":"Non-zero metrics in the last 5s","fs":10}
{"ts":"2025-12-12T16:04:45.794-0500","lvl":"info","logger":"monitoring","m":"Non-zero metrics in the last 5s","fs":4}
{"ts":"2025-12-12T16:04:50.794-0500","lvl":"info","logger":"monitoring","m":"Non-zero metrics in the last 5s","fs":6}

Here fs is the count of events sent to the failure store.

The metrics are also published on the HTTP stats endpoint (this requires http.enabled: true in the Beat configuration):

curl http://localhost:5066/stats | jq '.libbeat.output.events'

which will output something like this:

{
  "acked": 105,
  "active": 0,
  "batches": 21,
  "dead_letter": 0,
  "dropped": 0,
  "duplicates": 0,
  "failed": 0,
  "failure_store": 105,
  "toomany": 0,
  "total": 105
}
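
To watch the counter move while Filebeat is running, a convenience one-liner (not part of the PR):

watch -n 5 "curl -s http://localhost:5066/stats | jq .libbeat.output.events.failure_store"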


@belimawr belimawr self-assigned this Dec 11, 2025
@belimawr belimawr added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label Dec 11, 2025
@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Dec 11, 2025
@github-actions
Contributor

🤖 GitHub comments

Just comment with:

  • run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)

@mergify
Contributor

mergify bot commented Dec 11, 2025

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @belimawr? 🙏
To do so, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fix up this pull request, add the backport labels for the needed branches, such as:

  • backport-8.\d is the label to automatically backport to the 8.\d branch, where \d is the minor version digit
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

@belimawr belimawr changed the title from "[WIP] add failure store usage count" to "Add events.failure_store metric to track events sent to Elasticsearch failure store" Dec 12, 2025
@belimawr belimawr marked this pull request as ready for review December 12, 2025 21:46
@belimawr belimawr requested review from a team as code owners December 12, 2025 21:46
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

| `.output.events.failed` | Integer | Number of events that Auditbeat tried to send to the output destination, but the destination failed to receive them. | Generally, we want this field to be absent or its value to be zero. When the value is greater than zero, it’s useful to check Auditbeat’s logs right before this log entry’s `@timestamp` to see if there are any connectivity issues with the output destination. Note that failed events are not lost or dropped; they will be sent back to the publisher pipeline for retrying later. |
| `.output.events.dropped` | Integer | Number of events that Auditbeat gave up sending to the output destination because of a permanent (non-retryable) error. |
| `.output.events.dead_letter` | Integer | Number of events that Auditbeat successfully sent to a configured dead letter index after they failed to ingest in the primary index. |
| `.output.events.failure_store` | Integer | Number of events that were sent to the failure store. The failure store is a feature in Elasticsearch data streams that stores events that fail mapping or ingestion. Events sent to the failure store are still counted as acknowledged. | This metric indicates how many events encountered mapping or ingestion errors but were successfully stored in the failure store. A non-zero value suggests there may be mapping issues or data type mismatches that need to be addressed. |
Member

Will this only be applicable to 9.3+? If so, we should add an applies_to badge to each of the beat reference pages, similar to what is shown here:

Suggested change
| `.output.events.failure_store` | Integer | Number of events that were sent to the failure store. The failure store is a feature in Elasticsearch data streams that stores events that fail mapping or ingestion. Events sent to the failure store are still counted as acknowledged. | This metric indicates how many events encountered mapping or ingestion errors but were successfully stored in the failure store. A non-zero value suggests there may be mapping issues or data type mismatches that need to be addressed. |
| `.output.events.failure_store` {applies_to}`stack: ga 9.3` | Integer | Number of events that were sent to the failure store. The failure store is a feature in Elasticsearch data streams that stores events that fail mapping or ingestion. Events sent to the failure store are still counted as acknowledged. | This metric indicates how many events encountered mapping or ingestion errors but were successfully stored in the failure store. A non-zero value suggests there may be mapping issues or data type mismatches that need to be addressed. |


Development

Successfully merging this pull request may close these issues.

Add Elasticsearch output metric to track the number of documents accepted by the failure store
