The Syslog Health Monitor watches system logs for GPU-related errors that DCGM may not catch. It monitors journald/syslog for XID errors, SXID errors (NVSwitch/NVLink errors), and GPU fallen-off-bus events, all critical failures that indicate serious GPU, NVSwitch, or driver problems. It also watches for other GPU-related events, such as GPU resets, which indicate that a required remediation action has completed.
Think of it as a log analyzer that reads between the lines - catching GPU and NVSwitch problems recorded in system logs that other monitoring might miss.
Some GPU and NVSwitch failures or events manifest in system logs before DCGM can detect them:
- XID errors: GPU hardware errors logged by the NVIDIA driver
- SXID errors: NVSwitch errors related to NVSwitch and NVLink interconnects
- GPU fallen off the bus: GPU became inaccessible to the system
- GPU Reset: A GPU reset was executed by nvidia-smi
These errors and events often appear in system logs first. Errors can signal imminent GPU or fabric failure, so detecting them early helps prevent workload disruptions; reset events confirm that remediation has finished, so GPUs can be returned to service promptly.
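The error categories listed above can be recognized with a few regular expressions. The following is a minimal sketch, not the monitor's actual parser: the log shapes are assumptions modeled on typical NVIDIA driver kernel messages, and the names `classify`, `XID_RE`, `SXID_RE`, and `FALLEN_RE` are hypothetical. GPU reset messages are left out because their exact log format is not given here.

```python
import re
from typing import Optional, Tuple

# Assumed log shapes, modeled on typical NVIDIA driver kernel messages.
# Real log lines may differ between driver versions.
XID_RE = re.compile(r"NVRM: Xid \((?P<pci>PCI:[0-9a-fA-F:.]+)\): (?P<code>\d+)")
SXID_RE = re.compile(r"SXid \((?P<pci>PCI:[0-9a-fA-F:.]+)\): (?P<code>\d+)")
FALLEN_RE = re.compile(r"GPU has fallen off the bus", re.IGNORECASE)


def classify(line: str) -> Optional[Tuple[str, Optional[int]]]:
    """Return (kind, code) for a GPU-related log line, or None if unrelated."""
    m = XID_RE.search(line)
    if m:
        return ("XID", int(m.group("code")))
    m = SXID_RE.search(line)
    if m:
        return ("SXID", int(m.group("code")))
    if FALLEN_RE.search(line):
        # No numeric code is attached to a bare fallen-off-bus message.
        return ("GPU_FALLEN_OFF", None)
    return None
```

XID patterns are checked first, so a line that carries an XID code and also mentions the bus is reported as an XID event with its code preserved.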
The Syslog Health Monitor runs as a DaemonSet on GPU nodes:
- Reads journald logs from the host system
- Parses log entries for GPU-related error patterns (XID, SXID, fallen-off-bus, GPU reset)
- Maintains cursor position to avoid re-processing old logs
- For XID errors, uses the embedded NVIDIA XID Catalog spreadsheet to determine recommended actions
- Optionally analyzes XID errors via an XID analyzer sidecar for custom logic
- Sends health events to Platform Connectors via gRPC
The monitor maintains persistent state across restarts, ensuring logs are processed exactly once even if the pod is restarted.
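Cursor-based resumption can be sketched as follows. This is an illustration under stated assumptions, not the monitor's implementation: entries are modeled as `(cursor, message)` pairs, the cursor is stored in a small JSON file, and `load_cursor`, `save_cursor`, and `process_new_entries` are hypothetical names. A real journald reader would seek directly to the saved cursor (as `journalctl --after-cursor` does) rather than scanning from the start.

```python
import json
import os
import tempfile
from typing import Callable, Iterable, Optional, Tuple


def load_cursor(path: str) -> Optional[str]:
    """Return the last saved cursor, or None if no usable state exists."""
    try:
        with open(path) as f:
            return json.load(f)["cursor"]
    except (FileNotFoundError, KeyError, json.JSONDecodeError):
        return None


def save_cursor(path: str, cursor: str) -> None:
    """Write the cursor atomically so a crash never leaves a torn file."""
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"cursor": cursor}, f)
    os.replace(tmp, path)


def process_new_entries(
    entries: Iterable[Tuple[str, str]],
    state_path: str,
    handle: Callable[[str], None],
) -> int:
    """Handle only entries after the saved cursor; return how many were handled."""
    last = load_cursor(state_path)
    skipping = last is not None
    processed = 0
    for cursor, message in entries:
        if skipping:
            # Skip everything up to and including the saved cursor.
            if cursor == last:
                skipping = False
            continue
        handle(message)
        save_cursor(state_path, cursor)
        processed += 1
    return processed
```

Saving the cursor after each handled entry keeps the window for reprocessing after a crash to at most one entry in this sketch.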
Configure the Syslog Health Monitor through Helm values:

    syslog-health-monitor:
      enabled: true
      enabledChecks:
        - SysLogsXIDError       # GPU XID hardware errors and GPU reset events
        - SysLogsSXIDError      # NVSwitch/NVLink SXID errors
        - SysLogsGPUFallenOff   # GPU fallen off the bus
      logLevel: info
      # Optional XID analyzer for custom error analysis logic
      xidSideCar:
        enabled: false
        image:
          repository: ""
          tag: ""

- Enabled Checks: Select which types of errors to monitor (XID, SXID, fallen-off-bus)
- Log Level: Control logging verbosity (info, debug, warn, error)
- XID Analyzer Sidecar: Optional sidecar for injecting custom XID analysis logic
- Polling Interval: Configure how frequently to check system logs
XID errors are critical GPU hardware failures logged by the NVIDIA driver. The monitor uses the embedded NVIDIA XID Catalog (Excel spreadsheet) to map XID codes to recommended actions:
- XID 48: Double-bit ECC error (memory corruption, requires GPU replacement)
- XID 64: Page retirement limit (GPU memory degradation)
- XID 79: GPU has fallen off the bus
- Many other XID codes with appropriate remediation actions from NVIDIA's official catalog
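The catalog lookup amounts to a table keyed by XID code with a fallback for unlisted codes. The sketch below is illustrative only: the descriptions are taken from the examples above, while the action labels (`REPLACE_GPU`, `INVESTIGATE`, `CONTACT_SUPPORT`) and the names `CatalogEntry`, `XID_CATALOG`, and `recommended_action` are hypothetical placeholders, not values from NVIDIA's catalog.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    description: str
    action: str  # hypothetical action label, not from the real catalog


# Descriptions follow the examples in the text; actions are placeholders.
XID_CATALOG = {
    48: CatalogEntry("Double-bit ECC error", "REPLACE_GPU"),
    64: CatalogEntry("Page retirement limit", "INVESTIGATE"),
    79: CatalogEntry("GPU has fallen off the bus", "INVESTIGATE"),
}

# Assumed fallback for codes missing from the table.
UNKNOWN = CatalogEntry("Unknown XID", "CONTACT_SUPPORT")


def recommended_action(xid: int) -> CatalogEntry:
    """Look up an XID code, falling back to a generic entry."""
    return XID_CATALOG.get(xid, UNKNOWN)
```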
SXID errors are NVSwitch errors related to the high-speed NVLink interconnect fabric:
- NVSwitch hardware errors
- NVLink connection failures
- Fabric-level issues affecting multi-GPU communication
GPU fallen off the bus: a GPU became inaccessible to the system, a critical failure requiring immediate attention.
A GPU was reset by nvidia-smi, indicating that a remediation action for a previous GPU failure has completed.
The monitor parses journald logs efficiently for error patterns and maintains its cursor position across restarts, so logs are processed exactly once.
It uses the official NVIDIA XID Error Catalog (Excel spreadsheet) embedded in the binary to determine recommended actions for each XID code, so no external dependencies are required.
An optional XID analyzer sidecar allows injecting custom logic for XID analysis:
- Override default remediation actions
- Add custom error categorization
- Integrate with proprietary error handling systems
- Expose an HTTP API for extensibility
When enabled, the sidecar receives XID messages and can return custom recommended actions, allowing you to tailor remediation to your environment without modifying the main monitor.
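The override flow can be sketched as "ask the analyzer, otherwise use the default". This is a minimal sketch under assumptions: the sidecar's HTTP transport is omitted and modeled as a plain callable, `resolve_action` and `Analyzer` are hypothetical names, and falling back to the default when the analyzer fails is an assumed policy rather than documented behavior.

```python
from typing import Callable, Optional

# Stand-in for a call to the sidecar: takes the raw XID message and
# returns a custom action, or None to keep the default.
Analyzer = Callable[[str], Optional[str]]


def resolve_action(
    message: str,
    default_action: str,
    analyzer: Optional[Analyzer] = None,
) -> str:
    """Prefer the analyzer's action; fall back to the catalog default."""
    if analyzer is None:
        # Sidecar disabled: the catalog action is used as-is.
        return default_action
    try:
        custom = analyzer(message)
    except Exception:
        # Assumed policy: an unreachable or failing analyzer must not
        # block event reporting, so the default action is used.
        return default_action
    return custom or default_action
```

Modeling the sidecar as a callable keeps the decision logic testable without a running HTTP service.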
A separate sidecar monitors NVIDIA driver pod logs for driver-specific issues.
Enable or disable specific check types based on your monitoring needs, so you monitor only what is relevant to your hardware.
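Filtering events by the configured checks could look like the sketch below. The check names come from the `enabledChecks` values in the Helm configuration, where GPU reset events are grouped under `SysLogsXIDError`; the event kind labels and the names `CHECK_FOR_EVENT` and `is_enabled` are hypothetical.

```python
from typing import Iterable

# Event kind labels are illustrative; check names match enabledChecks.
CHECK_FOR_EVENT = {
    "XID": "SysLogsXIDError",
    "GPU_RESET": "SysLogsXIDError",  # resets are covered by the XID check
    "SXID": "SysLogsSXIDError",
    "GPU_FALLEN_OFF": "SysLogsGPUFallenOff",
}


def is_enabled(event_kind: str, enabled_checks: Iterable[str]) -> bool:
    """Return True if the event's check is in the configured list."""
    check = CHECK_FOR_EVENT.get(event_kind)
    return check is not None and check in set(enabled_checks)
```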