@swiatekm
Contributor

@swiatekm swiatekm commented Dec 10, 2025

Proposed commit message

Report subcomponent status for beats receivers

Make beats receivers report otel status for their subcomponents - inputs for filebeat and modules for metricbeat.

This is done via the otel status Event Attributes field. Under the inputs key, we add a map to the attributes, where input ids are keys, and statuses are values. The status structure is the same as the one used for streams in the control protocol. This is a temporary measure until there's a standard convention for doing this in upstream otel - then we'll switch to that. The purpose of this change is to let elastic-agent report per-stream and per-input status for inputs running in a beat receiver.
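As a rough illustration of the attribute layout described above, here is a minimal sketch using only the Go standard library. The `InputStatus` type and the `BuildAttributes` helper are hypothetical names for illustration, not the receiver's actual code, and the error text is shortened. The real change attaches the map to the otel status event rather than to a plain Go map.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// InputStatus is a hypothetical per-input entry with the two fields
// shown in the PR description: a status name and an error message.
type InputStatus struct {
	Status string
	Error  string
}

// BuildAttributes (hypothetical helper) nests the per-input statuses
// under the "inputs" key, with input ids as keys and statuses as values.
func BuildAttributes(inputs map[string]InputStatus) map[string]any {
	byID := make(map[string]any, len(inputs))
	for id, st := range inputs {
		byID[id] = map[string]any{"status": st.Status, "error": st.Error}
	}
	return map[string]any{"inputs": byID}
}

func main() {
	attrs := BuildAttributes(map[string]InputStatus{
		"unique-system-metrics-input-cpu":     {Status: "Running"},
		"unique-system-metrics-input-process": {Status: "Degraded", Error: "illustrative error text"},
	})
	out, err := json.MarshalIndent(attrs, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Keying the map by input id keeps lookups cheap for a consumer that already knows which inputs it configured.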

We currently need to do a hacky workaround to ensure status events are delivered in spite of the component status not changing. This is due to open-telemetry/opentelemetry-collector#14282.

The output of the healthcheckv2 extension with this change looks like the following:

{
  "components": {
    "receiver:metricbeatreceiver/_agent-component/system/metrics-default": {
      "healthy": true,
      "status": "StatusRecoverableError",
      "error": "Error fetching data for metricset system.process: error fetching process list: non fatal error; reporting partial metrics: error fetching PID metrics for 377 processes, most likely a \"permission denied\" error. Enable debug logging to determine the exact cause.",
      "status_time": "2025-12-10T18:19:53.552220344+01:00",
      "attributes": {
        "inputs": {
          "unique-system-metrics-input-2-cpu": {
            "error": "",
            "status": "Running"
          },
          "unique-system-metrics-input-2-process": {
            "error": "Error fetching data for metricset system.process: error fetching process list: non fatal error; reporting partial metrics: error fetching PID metrics for 377 processes, most likely a \"permission denied\" error. Enable debug logging to determine the exact cause.",
            "status": "Degraded"
          },
          "unique-system-metrics-input-cpu": {
            "error": "",
            "status": "Running"
          },
          "unique-system-metrics-input-process": {
            "error": "Error fetching data for metricset system.process: error fetching process list: non fatal error; reporting partial metrics: error fetching PID metrics for 377 processes, most likely a \"permission denied\" error. Enable debug logging to determine the exact cause.",
            "status": "Degraded"
          }
        }
      }
    }
  }
}
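For a consumer of this output, a minimal decoding sketch might look like the following. It assumes the response shape is exactly what the sample shows (a real healthcheckv2 response may carry more fields or nesting), and `InputsNotIn` is a hypothetical helper name. The healthy status value is passed in as a parameter because the sample uses control protocol names such as `Running`, while a later commit in this PR switches inputs to otel status names.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// inputStatus mirrors one entry under attributes.inputs in the sample output.
type inputStatus struct {
	Status string `json:"status"`
	Error  string `json:"error"`
}

// componentHealth covers only the fields visible in the sample output.
type componentHealth struct {
	Healthy    bool   `json:"healthy"`
	Status     string `json:"status"`
	Error      string `json:"error"`
	Attributes struct {
		Inputs map[string]inputStatus `json:"inputs"`
	} `json:"attributes"`
}

type healthResponse struct {
	Components map[string]componentHealth `json:"components"`
}

// InputsNotIn (hypothetical helper) returns the sorted ids of inputs whose
// status differs from okStatus, across all components in the response.
func InputsNotIn(raw []byte, okStatus string) ([]string, error) {
	var resp healthResponse
	if err := json.Unmarshal(raw, &resp); err != nil {
		return nil, fmt.Errorf("decoding health response: %w", err)
	}
	ids := []string{}
	for _, comp := range resp.Components {
		for id, in := range comp.Attributes.Inputs {
			if in.Status != okStatus {
				ids = append(ids, id)
			}
		}
	}
	sort.Strings(ids)
	return ids, nil
}

func main() {
	// Shortened, illustrative payload in the same shape as the sample above.
	raw := []byte(`{"components": {"receiver:example": {"healthy": true,
		"status": "StatusRecoverableError",
		"attributes": {"inputs": {
			"input-cpu": {"status": "Running", "error": ""},
			"input-process": {"status": "Degraded", "error": "partial metrics"}}}}}}`)
	ids, err := InputsNotIn(raw, "Running")
	if err != nil {
		panic(err)
	}
	fmt.Println(ids)
}
```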

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works. Where relevant, I have used the stresstest.sh script to run them under stress conditions and race detector to verify their stability.
  • I have added an entry in ./changelog/fragments using the changelog tool.

Related issues

@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Dec 10, 2025
@github-actions
Contributor

🤖 GitHub comments

Just comment with:

  • run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)

@swiatekm swiatekm changed the title Feat/beat receiver subcomponent status Report subcomponent status for beats receivers Dec 10, 2025
@mergify
Contributor

mergify bot commented Dec 10, 2025

This pull request is now in conflicts. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b feat/beat-receiver-subcomponent-status upstream/feat/beat-receiver-subcomponent-status
git merge upstream/main
git push upstream feat/beat-receiver-subcomponent-status

@mergify
Contributor

mergify bot commented Dec 10, 2025

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @swiatekm? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fix up this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

@swiatekm swiatekm force-pushed the feat/beat-receiver-subcomponent-status branch from d7a630c to b5b3a30 Compare December 10, 2025 12:44
@swiatekm swiatekm added Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team and removed needs_team Indicates that the issue/PR needs a Team:* label labels Dec 10, 2025
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Dec 10, 2025
@swiatekm swiatekm added the backport-8.19 Automated backport to the 8.19 branch label Dec 10, 2025
@swiatekm swiatekm force-pushed the feat/beat-receiver-subcomponent-status branch from b5b3a30 to ac56fe5 Compare December 10, 2025 18:08
@swiatekm swiatekm marked this pull request as ready for review December 10, 2025 18:19
@swiatekm swiatekm requested a review from a team as a code owner December 10, 2025 18:19
@swiatekm swiatekm requested review from AndersonQ and faec December 10, 2025 18:19
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

Contributor

@leehinman leehinman left a comment

LGTM

@cmacknz
Member

cmacknz commented Dec 10, 2025

          "unique-system-metrics-input-2-process": {
            "error": "Error fetching data for metricset system.process: error fetching process list: non fatal error; reporting partial metrics: error fetching PID metrics for 377 processes, most likely a \"permission denied\" error. Enable debug logging to determine the exact cause.",
            "status": "Degraded"
          },

Instead of keeping the control protocol statuses, can we use the healthcheck extension statuses? E.g. StatusRecoverableError instead of Degraded? This way we don't leak control protocol states into the OTel healthcheck extension.

This would almost certainly be a change required if sub-component status gets standardized upstream anyway.

@swiatekm
Contributor Author

Sure, we can. It's the more idiomatic choice, even if it creates more work for elastic agent to convert it back. I figured that since we'll be changing this again once the upstream convention is in place, I'd just do the most convenient thing for us right now. I don't mind changing it if you think we should be more idiomatic, though.
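To make the conversion cost concrete, here is a minimal sketch of a two-way status name mapping. Only the `Degraded` to `StatusRecoverableError` pairing comes from this thread. The remaining pairs are assumptions for illustration, and `ToOTel` and `FromOTel` are hypothetical helper names. Plain strings are used so the sketch runs without any collector or agent dependencies.

```go
package main

import "fmt"

// toOTel maps control protocol style status names to otel component
// status names. Only the "Degraded" pairing is taken from the discussion
// above; every other pairing here is an assumption.
var toOTel = map[string]string{
	"Starting": "StatusStarting",
	"Running":  "StatusOK",
	"Degraded": "StatusRecoverableError",
	"Failed":   "StatusPermanentError",
	"Stopping": "StatusStopping",
	"Stopped":  "StatusStopped",
}

// fromOTel is the inverse mapping, which a consumer such as elastic agent
// would need in order to convert the statuses back.
var fromOTel = invert(toOTel)

func invert(m map[string]string) map[string]string {
	out := make(map[string]string, len(m))
	for k, v := range m {
		out[v] = k
	}
	return out
}

// ToOTel converts a control protocol style name; ok is false if unknown.
func ToOTel(s string) (string, bool) {
	v, ok := toOTel[s]
	return v, ok
}

// FromOTel converts an otel status name back; ok is false if unknown.
func FromOTel(s string) (string, bool) {
	v, ok := fromOTel[s]
	return v, ok
}

func main() {
	otel, _ := ToOTel("Degraded")
	back, _ := FromOTel(otel)
	fmt.Println(otel, back)
}
```

The round trip only works while the mapping stays one-to-one; if several otel statuses ever collapse onto one control protocol state, the inverse needs an explicit table instead.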

@swiatekm
Contributor Author

Done: b31ea4c.

@swiatekm swiatekm requested a review from cmacknz December 11, 2025 15:08
@swiatekm swiatekm merged commit 6ba7b47 into main Dec 11, 2025
209 checks passed
@swiatekm swiatekm deleted the feat/beat-receiver-subcomponent-status branch December 11, 2025 17:12
mergify bot pushed a commit that referenced this pull request Dec 11, 2025
* Add input statuses to beat receiver status

# Conflicts:
#	x-pack/filebeat/fbreceiver/receiver_test.go

* Emit dummy status to force otel core to process it

* Add unit tests

* Add changelog entry

* Switch to otel statuses for inputs

(cherry picked from commit 6ba7b47)
swiatekm added a commit that referenced this pull request Dec 12, 2025
#48056)

* Report subcomponent status for beats receivers (#48015)

* Add input statuses to beat receiver status

# Conflicts:
#	x-pack/filebeat/fbreceiver/receiver_test.go

* Emit dummy status to force otel core to process it

* Add unit tests

* Add changelog entry

* Switch to otel statuses for inputs

(cherry picked from commit 6ba7b47)

* Fix linter warnings

---------

Co-authored-by: Mikołaj Świątek <[email protected]>
Labels

backport-8.19 Automated backport to the 8.19 branch enhancement Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team
