
reschedule the reconcile loop when DOCA PCC process dies #286

Merged
almaslennikov merged 1 commit into Mellanox:main from almaslennikov:restart-pcc
Feb 13, 2026

@almaslennikov
Collaborator

We need to restart the PCC and reapply its params

Signed-off-by: Alexander Maslennikov <amaslennikov@nvidia.com>
@greptile-apps

greptile-apps bot commented Feb 12, 2026

Greptile Overview

Greptile Summary

This PR implements automatic recovery when the DOCA PCC (Programmable Congestion Control) process dies unexpectedly at runtime. It adds a termination notification channel that bridges the SpectrumX manager (which has no Kubernetes dependencies) to the controller's event system.

Key changes:

  • Added GetCCTerminationChannel() method to SpectrumXManager interface that returns a read-only channel carrying RDMA interface names
  • Implemented startupCheckPassed flag in ccProcess to distinguish between startup failures (which return errors) and runtime crashes (which send notifications)
  • Modified controller's SetupWithManager to watch the termination channel via WatchesRawSource, triggering reconciliation when CC processes die
  • Added comprehensive tests covering both startup failures and runtime crashes
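The startup-versus-runtime distinction described in the changes above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: only the names `ccProcess` and `startupCheckPassed` come from the summary, while the struct layout, `handleExit`, `errExited`, the `notify` callback, and the interface name `mlx5_0` are assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// errExited is a hypothetical sentinel standing in for the real exit error.
var errExited = errors.New("process exited")

// ccProcess is a simplified stand-in for the type named in the summary.
// Only the startupCheckPassed flag is taken from the PR description.
type ccProcess struct {
	rdmaIface          string
	startupCheckPassed atomic.Bool
}

// handleExit (hypothetical) decides what a process exit means.
// Before the startup check has passed, the exit is a startup failure and is
// reported to the caller as an error. Afterwards it is a runtime crash and is
// reported through notify instead, so that a reconcile can be scheduled.
func (p *ccProcess) handleExit(exitErr error, notify func(string)) error {
	if !p.startupCheckPassed.Load() {
		return fmt.Errorf("cc process for %s failed during startup: %w", p.rdmaIface, exitErr)
	}
	notify(p.rdmaIface)
	return nil
}

func main() {
	var notified []string
	notify := func(name string) { notified = append(notified, name) }

	p := &ccProcess{rdmaIface: "mlx5_0"} // assumed interface name

	// Exit during the startup window: error returned, nothing notified.
	fmt.Println(p.handleExit(errExited, notify), notified)

	// Exit after the startup check passed: notification, no error.
	p.startupCheckPassed.Store(true)
	fmt.Println(p.handleExit(errExited, notify), notified)
}
```

Using an atomic flag lets the goroutine that waits on the process read the flag without taking a lock, which matches the summary's mention of `atomic.Bool`.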

How it works:

  1. When a CC process terminates after the 3-second startup window, it sends the RDMA interface name to the termination channel
  2. A goroutine in the controller bridges this plain Go channel to controller events
  3. The reconcile loop is triggered, which reapplies runtime configuration and restarts the CC process
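The bridging step in the list above might look something like the sketch below. It uses only the standard library so that it runs on its own: `reconcileEvent` and `bridge` are hypothetical names, and the output channel stands in for whatever event source the controller framework actually consumes. The PR itself wires the channel in through `WatchesRawSource`, which is not reproduced here.

```go
package main

import "fmt"

// reconcileEvent is a hypothetical stand-in for a controller event that
// identifies which device needs to be reconciled.
type reconcileEvent struct {
	RDMAInterface string
}

// bridge forwards every interface name received on terminations as a
// reconcileEvent. The goroutine stops, and the returned channel is closed,
// when done is closed or terminations is closed.
func bridge(done <-chan struct{}, terminations <-chan string) <-chan reconcileEvent {
	events := make(chan reconcileEvent)
	go func() {
		defer close(events)
		for {
			select {
			case <-done:
				return
			case name, ok := <-terminations:
				if !ok {
					return
				}
				// Also watch done while forwarding, so shutdown is never
				// blocked by a consumer that has stopped reading.
				select {
				case events <- reconcileEvent{RDMAInterface: name}:
				case <-done:
					return
				}
			}
		}
	}()
	return events
}

func main() {
	done := make(chan struct{})
	defer close(done)

	terminations := make(chan string, 10)
	events := bridge(done, terminations)

	terminations <- "mlx5_0" // assumed interface name
	ev := <-events
	fmt.Println("reconcile requested for", ev.RDMAInterface)
}
```

Keeping the manager's side a plain `<-chan string` is what lets the SpectrumX package stay free of Kubernetes imports; only the bridge needs to know about controller events.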

The implementation correctly handles edge cases: startup failures don't trigger reconciliation (they return errors immediately), and the channel has a buffer of 10 with non-blocking sends to prevent deadlocks.
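As a rough illustration of the buffered, non-blocking send mentioned above, the following sketch shows the general Go pattern. The buffer size of 10 comes from the summary; `notifyTermination`, `terminationBuffer`, and the interface names are hypothetical and not taken from the PR.

```go
package main

import "fmt"

// terminationBuffer mirrors the buffer size of 10 stated in the summary.
const terminationBuffer = 10

// notifyTermination (hypothetical) attempts a non-blocking send. It returns
// true if the name was queued and false if the buffer was full and the
// notification was dropped.
func notifyTermination(ch chan<- string, rdmaIface string) bool {
	select {
	case ch <- rdmaIface:
		return true
	default:
		// Buffer full and no receiver ready: give up rather than block the
		// goroutine that is cleaning up after the dead process.
		return false
	}
}

func main() {
	ch := make(chan string, terminationBuffer)

	sent := 0
	for i := 0; i < terminationBuffer+2; i++ {
		if notifyTermination(ch, fmt.Sprintf("mlx5_%d", i)) {
			sent++
		}
	}
	fmt.Println(sent, len(ch)) // prints: 10 10
}
```

The trade-off of this pattern is that a notification sent while the buffer is full is silently dropped, so the sender never blocks but the receiver may miss an event if it falls far behind.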

Confidence Score: 5/5

  • This PR is safe to merge with no identified issues
  • The implementation is well-designed with proper concurrency primitives (atomic.Bool, mutex-protected errors), comprehensive test coverage for both happy and error paths, and correct handling of channel operations to prevent deadlocks. The startup vs runtime crash distinction is sound, and the non-blocking channel send prevents goroutine leaks.
  • No files require special attention

Important Files Changed

| Filename | Overview |
| --- | --- |
| internal/controller/nicdevice_controller.go | Added event watcher to trigger reconciliation when CC process terminates, bridges termination channel to controller events |
| pkg/spectrumx/spectrumx.go | Added termination channel and startup check logic to distinguish startup failures from runtime crashes, sends notifications only for runtime crashes |
| internal/controller/nicdevice_controller_test.go | Added comprehensive test verifying reconcile triggers when CC termination channel fires |
| pkg/spectrumx/spectrumx_test.go | Added tests for runtime crash notifications and startup failure handling, verifies channel behavior |

Sequence Diagram

sequenceDiagram
    participant Controller as NicDevice Controller
    participant Manager as SpectrumX Manager
    participant Process as DOCA CC Process
    participant Channel as Termination Channel
    participant Reconcile as Reconcile Loop

    Note over Controller,Manager: Setup Phase
    Controller->>Manager: GetCCTerminationChannel()
    Manager-->>Controller: <-chan string
    Controller->>Controller: Start goroutine to bridge channel to events

    Note over Manager,Process: Runtime Configuration
    Controller->>Manager: ApplyRuntimeConfig(device)
    Manager->>Process: RunDocaSpcXCC(port)
    Process->>Process: Start CC process
    Process->>Process: Wait 3s for startup
    alt Startup Success
        Process->>Process: Set startupCheckPassed = true
        Process-->>Manager: nil (success)
    else Startup Failure
        Process-->>Manager: error
    end

    Note over Process,Reconcile: Runtime Crash Detection
    Process->>Process: CC process dies unexpectedly
    alt After Startup (startupCheckPassed = true)
        Process->>Channel: Send RDMA interface name
        Channel->>Controller: Notification received
        Controller->>Reconcile: Trigger reconcile event
        Reconcile->>Manager: ApplyRuntimeConfig(device)
        Manager->>Process: Restart CC process
    else During Startup (startupCheckPassed = false)
        Process->>Process: No notification sent
        Note over Process: Startup failures handled via error return
    end

@greptile-apps greptile-apps bot left a comment


8 files reviewed, no comments


@almaslennikov almaslennikov merged commit e1c5c7c into Mellanox:main Feb 13, 2026
9 of 10 checks passed