
newlog: prevent kernel log loss under heavy system load#5626

Draft
rucoder wants to merge 1 commit into lf-edge:master from rucoder:fix/newlog-kernel-backpressure

Conversation


@rucoder rucoder commented Feb 24, 2026

Description

Under heavy system load, newlogd loses the beginning of kernel logs. The root cause is a cascading backpressure chain, compounded by slow system boot, that prevents getKernelMsg from reading /dev/kmsg in time, causing the kernel ring buffer (128 KB, CONFIG_LOG_BUF_SHIFT=17) to overflow and silently drop the earliest messages — exactly the ones needed to diagnose the problem.

The backpressure chain:

doMoveCompressFile slow (disk I/O) -> movefileChan full (cap 5) ->
trigMoveToGzip blocks in writelogFile -> unbuffered downstream channels block ->
sendLogsToVector blocks -> loggerChan full (cap 10) -> getKernelMsg blocks ->
/dev/kmsg not read -> kernel ring buffer overflows -> EPIPE -> early messages gone

This PR fixes all choke points:

  1. Dedicated kernel buffer channel (cap 500): getKernelMsg now writes to its own channel with non-blocking sends, so it always drains /dev/kmsg regardless of downstream backpressure. A merge goroutine forwards messages to the main pipeline.

  2. Sequence number gap detection: track /dev/kmsg sequence numbers to detect and count ring buffer overflows. Populate the new NumKmsgDropped metric so kernel log loss becomes observable via the controller.

  3. Buffer downstream channels: appLogChan, uploadLogChan, keepLogChan changed from unbuffered to cap 100. A single blocked app log send could previously freeze the entire pipeline.

  4. Increase loggerChan from 10 to 500 and movefileChan from 5 to 20.

  5. Non-blocking gzip handoff: trigMoveToGzip uses non-blocking send to moveChan. If the main goroutine is busy compressing, the log file grows beyond the size limit until the next attempt succeeds, rather than blocking the entire write pipeline.

Kernel message timestamps come from the kernel's own monotonic clock (not from when newlogd reads them), so buffering does not affect timestamp correctness — logs can still be sorted chronologically.

Note: the vendored copy of newlogtypes.go is manually updated to include the NumKmsgDropped field. This should be re-vendored properly after the companion pillar PR is merged.

PR dependencies

Depends on #5625 (pillar/types: add NumKmsgDropped metric to NewlogMetrics) — the vendored copy needs to be regenerated after that PR is merged.

How to test and validate this PR

  1. Build and run tests: cd pkg/newlog && go test ./cmd/ — all 15 tests pass.
  2. Deploy on a device under heavy load (multiple VMs with I/O, CPU stress).
  3. Check that kernel logs (especially early boot messages or OOM events) are no longer truncated at the beginning.
  4. Monitor NumKmsgDropped in newlog metrics — if non-zero, it indicates the kernel ring buffer overflowed but the loss was detected and counted (previously invisible).
  5. Check device logs for "getKernelMsg: detected kernel log gap" warning messages.

Changelog notes

Fixed kernel log loss under heavy system load. Previously, when the device was under heavy I/O or CPU load, the beginning of kernel logs (dmesg) could be silently lost due to backpressure in the logging pipeline causing the kernel ring buffer to overflow. Kernel log loss is now detected and reported via the NumKmsgDropped metric.

PR Backports

  • 16.0-stable: To be backported.
  • 14.5-stable: To be backported.
  • 13.4-stable: To be backported.

Checklist

  • I've provided a proper description

  • I've added the proper documentation

  • I've tested my PR on amd64 device

  • I've tested my PR on arm64 device

  • I've written the test verification instructions

  • I've set the proper labels to this PR

  • I've checked the boxes above, or I've provided a good reason why I didn't
    check them.

Signed-off-by: Mikhail Malyshev <mike.malyshev@gmail.com>

codecov bot commented Feb 24, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 29.49%. Comparing base (2281599) to head (9a95639).
⚠️ Report is 298 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5626      +/-   ##
==========================================
+ Coverage   19.52%   29.49%   +9.96%     
==========================================
  Files          19       18       -1     
  Lines        3021     2417     -604     
==========================================
+ Hits          590      713     +123     
+ Misses       2310     1552     -758     
- Partials      121      152      +31     



@europaul europaul left a comment


It looks good to me, but I would remove most of those AI-generated comments.

// /dev/kmsg assigns a monotonically increasing sequence number
// to each message. A gap means the kernel ring buffer overflowed
// and messages were lost (EPIPE from the reader's perspective).
if lastSeqNum >= 0 && msg.SequenceNumber > lastSeqNum+1 {

we don't need the check lastSeqNum >= 0


rucoder commented Feb 24, 2026

we decided to update eve-api as well. moving to draft for now

@rucoder rucoder marked this pull request as draft February 24, 2026 21:32
@eriknordmark
Contributor

Can one do additional manual stress testing with the fix by giving newlogd a SIGSTOP and letting it sit for a while before a SIGCONT, to make sure log entries pile up in the system?

@europaul
Contributor

Can one do additional manual stress testing with the fix by giving newlogd a SIGSTOP and letting it sit for a while before a SIGCONT, to make sure log entries pile up in the system?

just be sure not to trigger the watchdog
