[Bug]: race condition between platform connector and syslog health monitor #599

@romachalm

Prerequisites

  • I searched existing issues
  • I can reproduce this issue

Bug Description

I upgraded NVSentinel on a cluster with 250 nodes, which rolled out all the DaemonSets.
50 syslog health monitors then crashed with the error:

{"time":"2025-12-17T10:11:20.435006509Z","level":"ERROR","msg":"Fatal error","module":"syslog-health-monitor","version":"dev","error":"failed to create gRPC client after retries: platform connector socket file not found after retries: stat /var/run/nvsentinel.sock: no such file or directory"}

Since platform connector is responsible for creating the socket, I had to restart platform connector on each affected node and then restart syslog health monitor. After that, it worked.

So there seems to be a race condition, and we must ensure that platform connector is ready before syslog health monitor starts.

Component

Health Monitor

Steps to Reproduce

Cannot reproduce

Environment

  • NVSentinel version: v0.3.0
  • Kubernetes version: 1.29
  • Deployment method: argocd/kustomize

Logs/Output

{"time":"2025-12-17T10:11:20.435006509Z","level":"ERROR","msg":"Fatal error","module":"syslog-health-monitor","version":"dev","error":"failed to create gRPC client after retries: platform connector socket file not found after retries: stat /var/run/nvsentinel.sock: no such file or directory"}
