[Bug]: syslog-health-monitor can miss XIDs when journald rotates before scan #1417

Description

@XRFXLP

Prerequisites

  • I searched existing issues
  • I can reproduce this issue

Code of Conduct

  • I agree to follow NVSentinel's Code of Conduct

Bug Description

syslog-health-monitor can miss XID events when journald rotates or vacuums entries before the monitor reaches them.
In an observed incident, XIDs were still visible through dmesg and detected by DCGM, but the retained journald range started after the XID timestamps. Since syslog-health-monitor reads /var/log/journal via journald rather than dmesg or DCGM, it had no retained journal entries to parse.

Expected behavior: XID-related syslog checks should process only kernel-origin journal entries so the monitor can catch up quickly and avoid falling behind due to unrelated audit/container log volume.

Proposed fix: add kernel-only journald filtering for syslog checks, for example SYSLOG_FACILITY=0, by default for XID/GPU-fallen/SXID checks.
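A minimal sketch of what the proposed kernel-only filter could look like, written in Python for illustration only (it is not NVSentinel's implementation). It assumes the monitor could shell out to journalctl with a SYSLOG_FACILITY=0 match and JSON output, and that XID lines follow the common "NVRM: Xid (PCI:...): <code>, ..." shape. The helper names build_journalctl_args and parse_xid are hypothetical.

```python
import json
import re

# Assumption: kernel-origin entries carry SYSLOG_FACILITY=0, as proposed in this issue.
KERNEL_MATCH = "SYSLOG_FACILITY=0"

# Assumption: XID messages look like "NVRM: Xid (PCI:0000:3b:00): 79, ...".
XID_PATTERN = re.compile(r"NVRM: Xid \(([^)]*)\): (\d+)", re.IGNORECASE)


def build_journalctl_args(journal_dir, after_cursor=None):
    """Return a journalctl argv that yields only kernel-facility entries as JSON."""
    args = [
        "journalctl",
        f"--directory={journal_dir}",
        "--output=json",
        "--no-pager",
    ]
    if after_cursor is not None:
        # Resume after the last processed entry instead of rescanning everything.
        args.append(f"--after-cursor={after_cursor}")
    # The field match restricts the scan to kernel-facility entries only.
    args.append(KERNEL_MATCH)
    return args


def parse_xid(json_line):
    """Return (pci_id, xid_code) for an XID entry, or None for any other entry."""
    entry = json.loads(json_line)
    message = entry.get("MESSAGE")
    # journald emits non-UTF-8 messages as byte arrays; skip those here.
    if not isinstance(message, str):
        return None
    match = XID_PATTERN.search(message)
    if match is None:
        return None
    return match.group(1), int(match.group(2))
```

With the match in place, audit and container entries never reach the parser, so the scan cost scales with kernel log volume rather than total journal volume.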

Component

Health Monitor

Steps to Reproduce

  1. Run syslog-health-monitor on a GPU node where it reads from /var/log/journal.
  2. Generate enough non-kernel journald traffic to put journald under retention pressure.
  3. Trigger or inject an NVIDIA XID kernel log entry.
  4. Continue generating journald traffic until the journal segment containing the XID is rotated or vacuumed before syslog-health-monitor processes it.
  5. Confirm the XID was produced by checking dmesg -T | grep -i "nvrm: xid".
  6. Confirm the XID is no longer available through journald using journalctl | grep -i "nvrm: xid" or by checking that journalctl --list-boots starts after the XID timestamp.
  7. Observe that syslog-health-monitor does not emit a corresponding health event.

The idea is to delay syslog-health-monitor's log processing long enough that the journal segment containing the XID is rotated or vacuumed before the monitor can process it.
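The check in step 6 can be sketched as a timestamp comparison. This is an illustrative Python helper, not part of NVSentinel: it assumes the caller supplies the oldest retained journal entry as one JSON line (for instance the first line of journalctl JSON output) and the XID time taken from dmesg. The function names entry_time and xid_lost_to_rotation are hypothetical.

```python
import json
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def entry_time(json_line):
    """Convert an entry's __REALTIME_TIMESTAMP (microseconds since the epoch) to UTC."""
    micros = int(json.loads(json_line)["__REALTIME_TIMESTAMP"])
    return EPOCH + timedelta(microseconds=micros)


def xid_lost_to_rotation(xid_time, oldest_retained_json):
    """Return True if the XID predates the oldest entry journald still retains."""
    return xid_time < entry_time(oldest_retained_json)
```

A True result matches the incident described above: the XID exists in dmesg, but the retained journal range starts after it, so the monitor has nothing left to parse.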

Environment

  • NVSentinel version: v1.9.0
  • Kubernetes version:
  • Deployment method:

Logs/Output

No response
