
newlog: prevent kernel log loss under heavy system load#5626

Draft
rucoder wants to merge 1 commit into lf-edge:master from rucoder:fix/newlog-kernel-backpressure

Conversation


@rucoder rucoder commented Feb 24, 2026

Description

Under heavy system load, newlogd loses the beginning of kernel logs. The root cause is a cascading backpressure chain, compounded by slow system boot, that prevents getKernelMsg from reading /dev/kmsg in time, causing the kernel ring buffer (128 KB, CONFIG_LOG_BUF_SHIFT=17) to overflow and silently drop the earliest messages — exactly the ones needed to diagnose the problem.

The backpressure chain:

doMoveCompressFile slow (disk I/O) -> movefileChan full (cap 5) ->
trigMoveToGzip blocks in writelogFile -> unbuffered downstream channels block ->
sendLogsToVector blocks -> loggerChan full (cap 10) -> getKernelMsg blocks ->
/dev/kmsg not read -> kernel ring buffer overflows -> EPIPE -> early messages gone

This PR fixes all choke points:

  1. Dedicated kernel buffer channel (cap 500): getKernelMsg now writes to its own channel with non-blocking sends, so it always drains /dev/kmsg regardless of downstream backpressure. A merge goroutine forwards messages to the main pipeline.

  2. Sequence number gap detection: track /dev/kmsg sequence numbers to detect and count ring buffer overflows. Populate the new NumKmsgDropped metric so kernel log loss becomes observable via the controller.

  3. Buffer downstream channels: appLogChan, uploadLogChan, keepLogChan changed from unbuffered to cap 100. A single blocked app log send could previously freeze the entire pipeline.

  4. Increase loggerChan from 10 to 500 and movefileChan from 5 to 20.

  5. Non-blocking gzip handoff: trigMoveToGzip uses non-blocking send to moveChan. If the main goroutine is busy compressing, the log file grows beyond the size limit until the next attempt succeeds, rather than blocking the entire write pipeline.

Kernel message timestamps come from the kernel's own monotonic clock (not from when newlogd reads them), so buffering does not affect timestamp correctness — logs can still be sorted chronologically.

Note: the vendored copy of newlogtypes.go is manually updated to include the NumKmsgDropped field. This should be re-vendored properly after the companion pillar PR is merged.

PR dependencies

Depends on #5625 (pillar/types: add NumKmsgDropped metric to NewlogMetrics) — the vendored copy needs to be regenerated after that PR is merged.

How to test and validate this PR

  1. Build and run tests: cd pkg/newlog && go test ./cmd/ — all 15 tests pass.
  2. Deploy on a device under heavy load (multiple VMs with I/O, CPU stress).
  3. Check that kernel logs (especially early boot messages or OOM events) are no longer truncated at the beginning.
  4. Monitor NumKmsgDropped in newlog metrics — if non-zero, it indicates the kernel ring buffer overflowed but the loss was detected and counted (previously invisible).
  5. Check device logs for "getKernelMsg: detected kernel log gap" warning messages.

Changelog notes

Fixed kernel log loss under heavy system load. Previously, when the device was under heavy I/O or CPU load, the beginning of kernel logs (dmesg) could be silently lost due to backpressure in the logging pipeline causing the kernel ring buffer to overflow. Kernel log loss is now detected and reported via the NumKmsgDropped metric.

PR Backports

  • 16.0-stable: To be backported.
  • 14.5-stable: To be backported.
  • 13.4-stable: To be backported.

Checklist

  • I've provided a proper description

  • I've added the proper documentation

  • I've tested my PR on amd64 device

  • I've tested my PR on arm64 device

  • I've written the test verification instructions

  • I've set the proper labels to this PR

  • I've checked the boxes above, or I've provided a good reason why I didn't
    check them.

Signed-off-by: Mikhail Malyshev <mike.malyshev@gmail.com>

codecov bot commented Feb 24, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 29.49%. Comparing base (2281599) to head (9a95639).
⚠️ Report is 298 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5626      +/-   ##
==========================================
+ Coverage   19.52%   29.49%   +9.96%     
==========================================
  Files          19       18       -1     
  Lines        3021     2417     -604     
==========================================
+ Hits          590      713     +123     
+ Misses       2310     1552     -758     
- Partials      121      152      +31     



@europaul europaul left a comment


It looks good to me, but I would remove most of those AI-generated comments.

// /dev/kmsg assigns a monotonically increasing sequence number
// to each message. A gap means the kernel ring buffer overflowed
// and messages were lost (EPIPE from the reader's perspective).
if lastSeqNum >= 0 && msg.SequenceNumber > lastSeqNum+1 {

we don't need the check lastSeqNum >= 0


rucoder commented Feb 24, 2026

we decided to update eve-api as well. moving to draft for now

@rucoder rucoder marked this pull request as draft February 24, 2026 21:32
@eriknordmark
Contributor

Can one do additional manual stress testing with the fix by giving newlogd a SIGSTOP and letting it sit for a while before a SIGCONT, to make sure log entries pile up in the system?

@europaul
Contributor

Can one do additional manual stress testing with the fix by giving newlogd a SIGSTOP and letting it sit for a while before a SIGCONT, to make sure log entries pile up in the system?

just be sure not to trigger the watchdog
