pillar/types: add NumKmsgDropped metric to NewlogMetrics #5625

Draft
rucoder wants to merge 1 commit into lf-edge:master from rucoder:fix/kmsg-drop-metric

@rucoder rucoder commented Feb 24, 2026

Description

Add a NumKmsgDropped field to NewlogMetrics to track kernel messages lost due to kernel ring buffer overflow. Once the field is populated, kernel log loss will be observable via the controller.

Under heavy system load, newlogd can fall behind reading /dev/kmsg, causing the kernel ring buffer (128KB by default, CONFIG_LOG_BUF_SHIFT=17) to overflow and silently drop messages. The lost messages are typically the earliest ones, which often contain the root cause of the problem being debugged.
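
For reference, the 128KB figure follows directly from the shift value: the kernel log buffer is 2^CONFIG_LOG_BUF_SHIFT bytes. A minimal sketch of that arithmetic (the helper name `logBufBytes` is made up for illustration and is not part of this PR):

```go
package main

import "fmt"

// logBufBytes returns the kernel log buffer size implied by a
// CONFIG_LOG_BUF_SHIFT value: the buffer holds 2^shift bytes.
// Hypothetical helper, for illustration only.
func logBufBytes(shift uint) uint64 {
	return uint64(1) << shift
}

func main() {
	size := logBufBytes(17)
	fmt.Printf("%d bytes = %d KiB\n", size, size/1024) // prints "131072 bytes = 128 KiB"
}
```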

Currently there is no metric to detect this loss. The new field will be populated by newlogd using /dev/kmsg sequence number gap detection (in a follow-up PR to pkg/newlog).
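
To illustrate what sequence number gap detection could look like, here is a rough sketch. It is not the follow-up implementation. It assumes the documented /dev/kmsg record header layout, `prio,seq,timestamp,flags[,...];message`, and the names `parseKmsgSeq` and `gapCounter` are hypothetical:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseKmsgSeq extracts the sequence number from one /dev/kmsg record,
// whose header has the form "prio,seq,timestamp,flags[,...];message".
func parseKmsgSeq(record string) (uint64, error) {
	header, _, found := strings.Cut(record, ";")
	if !found {
		return 0, fmt.Errorf("kmsg record has no ';' separator: %q", record)
	}
	fields := strings.Split(header, ",")
	if len(fields) < 2 {
		return 0, fmt.Errorf("kmsg header has too few fields: %q", header)
	}
	return strconv.ParseUint(fields[1], 10, 64)
}

// gapCounter is a hypothetical helper that accumulates how many records
// were skipped between consecutive sequence numbers.
type gapCounter struct {
	haveLast bool
	lastSeq  uint64
	dropped  uint64
}

// observe records a sequence number and returns the number of records
// missed since the previously observed one.
func (g *gapCounter) observe(seq uint64) uint64 {
	var missed uint64
	if g.haveLast && seq > g.lastSeq+1 {
		missed = seq - g.lastSeq - 1
	}
	g.haveLast = true
	g.lastSeq = seq
	g.dropped += missed
	return missed
}

func main() {
	// Sample records; sequence numbers 102-104 are missing.
	records := []string{
		"6,100,5140900,-;first message",
		"6,101,5140950,-;second message",
		"4,105,5141200,-;message after an overflow",
	}
	var g gapCounter
	for _, r := range records {
		seq, err := parseKmsgSeq(r)
		if err != nil {
			fmt.Println("skip:", err)
			continue
		}
		g.observe(seq)
	}
	fmt.Println("dropped:", g.dropped) // prints "dropped: 3"
}
```

This sketch only counts gaps between records it has actually seen, so messages overwritten before the first read are not counted.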

PR dependencies

None. This is the first PR in a two-PR sequence:

  1. This PR — adds the metric field to pillar types
  2. Follow-up PR to pkg/newlog — implements the kernel log pipeline improvements and populates the metric (depends on this PR being merged and vendored)

How to test and validate this PR

This PR only adds a new field to a struct and has no behavioral change on its own. Validation:

  • cd pkg/pillar && go build ./types/ — passes
  • No existing tests affected (additive struct field change)
  • Full validation will be done with the follow-up pkg/newlog PR
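
As a sketch of why an additive field is low risk: a new field in a Go struct takes its zero value wherever existing code does not set it, including when an older payload is decoded. The struct below is a hypothetical stand-in, not the real NewlogMetrics. Only the NumKmsgDropped name comes from this PR, and the `uint64` type, the `ExistingCounter` field, and the JSON encoding are assumptions:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// newlogMetricsSketch is a hypothetical stand-in for NewlogMetrics.
// Only the NumKmsgDropped name is taken from this PR; its type and the
// other field are assumptions made for illustration.
type newlogMetricsSketch struct {
	ExistingCounter uint64 // placeholder for fields that already exist
	NumKmsgDropped  uint64 // kernel messages lost to ring buffer overflow
}

// decode unmarshals a JSON payload into the sketch struct.
func decode(payload string) (newlogMetricsSketch, error) {
	var m newlogMetricsSketch
	err := json.Unmarshal([]byte(payload), &m)
	return m, err
}

func main() {
	// A payload produced before the field existed still decodes cleanly.
	old, err := decode(`{"ExistingCounter": 7}`)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println(old.ExistingCounter, old.NumKmsgDropped) // prints "7 0"
}
```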

Changelog notes

Added NumKmsgDropped metric to track kernel message loss due to ring buffer overflow. This metric will be populated by newlogd once the companion newlog changes land.

PR Backports

  • 16.0-stable: To be backported (together with the follow-up newlog PR).
  • 14.5-stable: To be backported (together with the follow-up newlog PR).
  • 13.4-stable: To be backported (together with the follow-up newlog PR).

Checklist

  • I've provided a proper description

  • I've added the proper documentation

  • I've tested my PR on amd64 device

  • I've tested my PR on arm64 device

  • I've written the test verification instructions

  • I've set the proper labels to this PR

  • I've checked the boxes above, or I've provided a good reason why I didn't
    check them.

Add NumKmsgDropped field to NewlogMetrics to track kernel messages
lost due to kernel ring buffer overflow. This makes kernel log loss
observable via the controller.

Under heavy system load, newlogd can fall behind reading /dev/kmsg,
causing the kernel ring buffer (128KB by default) to overflow and
silently drop messages. Currently there is no metric to detect this.
The new field will be populated by newlogd using /dev/kmsg sequence
number gap detection.

Signed-off-by: Mikhail Malyshev <mike.malyshev@gmail.com>

rucoder commented Feb 24, 2026

We decided to update eve-api as well. Moving to draft for now.

@rucoder rucoder marked this pull request as draft February 24, 2026 21:32

codecov bot commented Feb 28, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 29.49%. Comparing base (2281599) to head (06cbc44).
⚠️ Report is 305 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5625      +/-   ##
==========================================
+ Coverage   19.52%   29.49%   +9.96%     
==========================================
  Files          19       18       -1     
  Lines        3021     2417     -604     
==========================================
+ Hits          590      713     +123     
+ Misses       2310     1552     -758     
- Partials      121      152      +31     
