
Conversation

@belimawr
Contributor

@belimawr belimawr commented Dec 11, 2025

Proposed commit message

See title

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works. Where relevant, I have used the stresstest.sh script to run them under stress conditions and race detector to verify their stability.
  • I have added an entry in ./changelog/fragments using the changelog tool.


How to test this PR locally

Manual Testing Procedure: Failure Store Metric

Prerequisites

  1. Elasticsearch cluster (version 8.11.0+) with failure store support enabled (see the Docker sketch after this list if you need one)
  2. A Beat instance (Filebeat, Metricbeat, etc.) configured to output to Elasticsearch
  3. Access to Elasticsearch API and Beat monitoring/metrics endpoint
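
If you don't have a cluster at hand, a single-node Docker container is enough. This is a minimal sketch, not part of the PR; the image tag and password are assumptions, adjust them to your environment:

docker run -d --name es-failure-store -p 9200:9200 \
  -e "discovery.type=single-node" \
  -e "ELASTIC_PASSWORD=changeme" \
  docker.elastic.co/elasticsearch/elasticsearch:9.1.0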

Test Setup

1. Create a Data Stream with Failure Store Enabled

Create an index template with failure store enabled and strict mappings:

PUT _index_template/test-failure-store-template
{
  "index_patterns": ["test-failure-store-*"],
  "data_stream": {},
  "template": {
    "data_stream_options": {
      "failure_store": {
        "enabled": true
      }
    },
    "mappings": {
      "properties": {
        "method":{
            "type": "integer"
        }
      }
    }
  }
}
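
Optionally, confirm the template was created before continuing:

GET _index_template/test-failure-store-template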

2. Initialize the Data Stream

Create the data stream by indexing two documents:

POST test-failure-store-ds/_bulk
{"create":{}}
{"@timestamp":"2025-12-12T15:42:00Z", "foo":"bar"}
{"create":{}}
{"@timestamp":"2025-12-12T15:42:00Z", "method": "POST"}

Ensure one of the documents went to the failure store: "POST" cannot be
indexed into the integer-mapped method field, so Elasticsearch redirects
that document. Look for "failure_store": "used" in the response, which
should look like this:

{
  "errors": false,
  "took": 200,
  "items": [
    {
      "create": {
        "_index": ".ds-test-failure-store-ds-2025.12.12-000001",
        "_id": "SIdcFJsBIoqQtrd2QGHk",
        "_version": 1,
        "result": "created",
        "_shards": {
          "total": 2,
          "successful": 1,
          "failed": 0
        },
        "_seq_no": 2,
        "_primary_term": 1,
        "status": 201
      }
    },
    {
      "create": {
        "_index": ".fs-test-failure-store-ds-2025.12.12-000002",
        "_id": "SodcFJsBIoqQtrd2QGH3",
        "_version": 1,
        "result": "created",
        "_shards": {
          "total": 2,
          "successful": 1,
          "failed": 0
        },
        "_seq_no": 268,
        "_primary_term": 1,
        "failure_store": "used",
        "status": 201
      }
    }
  ]
}

Ensure there is one document in the failure store:

GET test-failure-store-ds::failures/_search
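
If you're using curl rather than Kibana Dev Tools, an equivalent request is (credentials from the setup above; -k because these examples assume a self-signed certificate):

curl -sku elastic:changeme "https://localhost:9200/test-failure-store-ds::failures/_search" | jq '.hits.total.value'

It should report 1 at this point.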

3. Generate some logs that will cause a mapping conflict

You can use Docker and flog for this:

docker run -it --rm mingrammer/flog -f json -d 1 -s 1 -l > /tmp/flog.ndjson
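
flog emits Apache-style access log records as JSON, where method is a string such as "GET". Because the template above maps method as an integer, every one of these documents should end up in the failure store. A generated line looks roughly like this (values illustrative):

{"host":"203.0.113.7","user-identifier":"-","datetime":"12/Dec/2025:16:04:40 +0000","method":"GET","request":"/synergize/next-generation","protocol":"HTTP/1.1","status":200,"bytes":1234,"referer":"https://example.com/"}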

4. Run Filebeat

Build Filebeat from this PR and run it using the following
configuration (adjust the output settings as necessary):

filebeat.yml

filebeat.inputs:
  - type: filestream
    id: a-very-unique-id
    enabled: true
    paths:
      - /tmp/flog.ndjson
    parsers:
      - ndjson:
          keys_under_root: true
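    # Send events to the data stream created in step 1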
    index: test-failure-store-ds
    file_identity.native: ~
    prospector.scanner:
      fingerprint.enabled: false

queue.mem:
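  # flush.timeout: 0 publishes events immediately instead of waiting for a full batch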
  flush.timeout: 0

output.elasticsearch:
  hosts:
    - "https://localhost:9200"
  username: elastic
  password: changeme
  ssl.verification_mode: none

logging:
  metrics.period: 5s
  to_stderr: true

You can run Filebeat and pipe its logs through jq to extract the relevant fields:

go run . --path.home=$PWD 2>&1 | jq '{"ts": ."@timestamp", "lvl": ."log.level", "logger": ."log.logger", "m": .message, "fs": .monitoring.metrics.libbeat.output.events.failure_store}' -c

You should see some 5s metrics like this:

{"ts":"2025-12-12T16:04:40.795-0500","lvl":"info","logger":"monitoring","m":"Non-zero metrics in the last 5s","fs":10}
{"ts":"2025-12-12T16:04:45.794-0500","lvl":"info","logger":"monitoring","m":"Non-zero metrics in the last 5s","fs":4}
{"ts":"2025-12-12T16:04:50.794-0500","lvl":"info","logger":"monitoring","m":"Non-zero metrics in the last 5s","fs":6}

Here fs is the count of events sent to the failure store.

The metrics are also published on the HTTP stats endpoint (this requires http.enabled: true in the Beat configuration):

curl http://localhost:5066/stats | jq '.libbeat.output.events'

which will output something like this:

{
  "acked": 105,
  "active": 0,
  "batches": 21,
  "dead_letter": 0,
  "dropped": 0,
  "duplicates": 0,
  "failed": 0,
  "failure_store": 105,
  "toomany": 0,
  "total": 105
}
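
To watch the counter move while Filebeat is running, a convenience one-liner (not part of the PR):

watch -n 5 "curl -s http://localhost:5066/stats | jq .libbeat.output.events.failure_store"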


@belimawr belimawr self-assigned this Dec 11, 2025
@belimawr belimawr added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label Dec 11, 2025
@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Dec 11, 2025
@github-actions
Contributor

🤖 GitHub comments

Just comment with:

  • run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)

@mergify
Contributor

mergify bot commented Dec 11, 2025

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @belimawr? 🙏
To do so, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fix up this pull request, add the backport labels for the needed branches, such as:

  • backport-8.\d is the label to automatically backport to the 8.\d branch, where \d is the minor version digit
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

@belimawr belimawr changed the title from "[WIP] add failure store usage count" to "Add events.failure_store metric to track events sent to Elasticsearch failure store" Dec 12, 2025
@belimawr belimawr marked this pull request as ready for review December 12, 2025 21:46
@belimawr belimawr requested review from a team as code owners December 12, 2025 21:46
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

| `.output.events.failed` | Integer | Number of events that Auditbeat tried to send to the output destination, but the destination failed to receive them. | Generally, we want this field to be absent or its value to be zero. When the value is greater than zero, it’s useful to check Auditbeat’s logs right before this log entry’s `@timestamp` to see if there are any connectivity issues with the output destination. Note that failed events are not lost or dropped; they will be sent back to the publisher pipeline for retrying later. |
| `.output.events.dropped` | Integer | Number of events that Auditbeat gave up sending to the output destination because of a permanent (non-retryable) error. |
| `.output.events.dead_letter` | Integer | Number of events that Auditbeat successfully sent to a configured dead letter index after they failed to ingest in the primary index. |
| `.output.events.failure_store` | Integer | Number of events that were sent to the failure store. The failure store is a feature in Elasticsearch data streams that stores events that fail mapping or ingestion. Events sent to the failure store are still counted as acknowledged. | This metric indicates how many events encountered mapping or ingestion errors but were successfully stored in the failure store. A non-zero value suggests there may be mapping issues or data type mismatches that need to be addressed. |
Member

Will this only be applicable to 9.3+? If so, we should add an applies_to badge to each of the beat reference pages, similar to what is shown here:

Suggested change
| `.output.events.failure_store` | Integer | Number of events that were sent to the failure store. The failure store is a feature in Elasticsearch data streams that stores events that fail mapping or ingestion. Events sent to the failure store are still counted as acknowledged. | This metric indicates how many events encountered mapping or ingestion errors but were successfully stored in the failure store. A non-zero value suggests there may be mapping issues or data type mismatches that need to be addressed. |
| `.output.events.failure_store` {applies_to}`stack: ga 9.3` | Integer | Number of events that were sent to the failure store. The failure store is a feature in Elasticsearch data streams that stores events that fail mapping or ingestion. Events sent to the failure store are still counted as acknowledged. | This metric indicates how many events encountered mapping or ingestion errors but were successfully stored in the failure store. A non-zero value suggests there may be mapping issues or data type mismatches that need to be addressed. |


Development

Successfully merging this pull request may close these issues.

Add Elasticsearch output metric to track the number of documents accepted by the failure store
