Prerequisites
Code of Conduct
Bug Description
syslog-health-monitor can miss XID events when journald rotates or vacuums entries before the monitor reaches them.
In an observed incident, XIDs were still visible through dmesg and detected by DCGM, but the retained journald range started after the XID timestamps. Since syslog-health-monitor reads /var/log/journal via journald rather than dmesg or DCGM, it had no retained journal entries to parse.
Expected behavior: XID-related syslog checks should process only kernel-origin journal entries so the monitor can catch up quickly and avoid falling behind due to unrelated audit/container log volume.
Proposed fix: add kernel-only journald filtering for syslog checks, for example SYSLOG_FACILITY=0, by default for XID/GPU-fallen/SXID checks.
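As a rough illustration of the proposed filter (not the monitor's actual code), the Python sketch below keeps only journal entries with SYSLOG_FACILITY=0 and extracts XID codes from them. It assumes input in the JSON-lines form produced by `journalctl -o json`; the names `XID_RE`, `is_kernel_entry`, and `extract_xids` are hypothetical, and the regular expression covers only the common "NVRM: Xid (PCI:...): N" message shape.

```python
import json
import re
import sys

# Hypothetical pattern for the common NVRM XID kernel message shape.
XID_RE = re.compile(r"NVRM: Xid \(([^)]+)\): (\d+)", re.IGNORECASE)


def is_kernel_entry(entry):
    """Return True for entries logged with the kernel syslog facility (0)."""
    # journalctl -o json renders field values as strings.
    return entry.get("SYSLOG_FACILITY") == "0"


def extract_xids(json_lines):
    """Return (pci_id, xid_code) pairs found in kernel-origin entries only."""
    found = []
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        if not is_kernel_entry(entry):
            continue  # skip audit, container, and other non-kernel traffic
        message = entry.get("MESSAGE")
        if not isinstance(message, str):
            continue  # non-text payloads are not rendered as plain strings
        match = XID_RE.search(message)
        if match:
            found.append((match.group(1), int(match.group(2))))
    return found


if __name__ == "__main__":
    # Example: journalctl -o json | python3 this_script.py
    for pci_id, xid in extract_xids(sys.stdin):
        print(pci_id, xid)
```

Passing the match to journald itself, for example `journalctl SYSLOG_FACILITY=0 -o json`, would avoid reading non-kernel entries at all, which is the point of the proposal; the in-process check above only shows the selection rule.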
Component
Health Monitor
Steps to Reproduce
- Run syslog-health-monitor on a GPU node where it reads from /var/log/journal.
- Generate enough non-kernel journald traffic to put journald under retention pressure.
- Trigger or inject an NVIDIA XID kernel log entry.
- Continue generating journald traffic until the journal segment containing the XID is rotated or vacuumed before syslog-health-monitor processes it.
- Confirm the XID was produced by checking dmesg -T | grep -i "nvrm: xid".
- Confirm the XID is no longer available through journald using journalctl | grep -i "nvrm: xid" or by checking that journalctl --list-boots starts after the XID timestamp.
- Observe that syslog-health-monitor does not emit a corresponding health event.
The idea is to delay syslog-health-monitor's log processing long enough that the journal is rotated before the monitor can process the XID entry.
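The journald check in the steps above can also be expressed as a timestamp comparison. The Python sketch below is a minimal illustration, assuming the oldest retained entry is obtained as the first line of `journalctl -o json` and that its `__REALTIME_TIMESTAMP` field holds microseconds since the Unix epoch; the function names are hypothetical.

```python
import json
from datetime import datetime, timezone


def oldest_retained_time(first_json_line):
    """Return the UTC time of the oldest retained journal entry."""
    entry = json.loads(first_json_line)
    # __REALTIME_TIMESTAMP is a string of microseconds since the epoch.
    micros = int(entry["__REALTIME_TIMESTAMP"])
    return datetime.fromtimestamp(micros / 1_000_000, tz=timezone.utc)


def xid_rotated_out(first_json_line, xid_time):
    """Return True if the XID predates everything journald still retains."""
    return xid_time < oldest_retained_time(first_json_line)
```

If `xid_rotated_out` returns True for an XID timestamp taken from dmesg, the entry was rotated or vacuumed before the monitor could have read it from journald.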
Environment
- NVSentinel version: v1.9.0
- Kubernetes version:
- Deployment method:
Logs/Output
No response