reschedule the reconcile loop when DOCA PCC process dies #289

Merged
almaslennikov merged 1 commit into network-operator-26.1.x from restart-pcc-release
Feb 13, 2026

@almaslennikov
Collaborator

We need to restart the PCC and reapply its params

(cherry picked from commit 9f5de2e)

Signed-off-by: Alexander Maslennikov <amaslennikov@nvidia.com>
@greptile-apps
greptile-apps bot commented Feb 12, 2026

Greptile Overview

Greptile Summary

This PR implements automatic recovery when the DOCA PCC (Precision Congestion Control) process terminates unexpectedly. The implementation adds a notification mechanism that triggers controller reconciliation to restart the PCC process and reapply its configuration parameters.

Key Changes:

  • Added ccTerminationChan buffered channel (size 10) in SpectrumXManager to carry RDMA interface names when PCC processes die
  • Introduced startupCheckPassed atomic flag to distinguish between startup failures (within 3s) and runtime crashes (after 3s)
  • Bridged the termination channel into the Kubernetes controller event loop via a goroutine that converts channel events into TypedGenericEvent instances
  • When PCC dies after startup, a reconcile is triggered which calls ApplyDeviceRuntimeSpec(), ultimately restarting the PCC process via RunDocaSpcXCC()
  • Includes comprehensive test coverage for both the notification mechanism and controller integration

Implementation Quality:

  • Clean separation between k8s-free package (plain Go channels) and controller events
  • Proper use of non-blocking channel send with fallback to prevent goroutine blocking
  • Atomic flags ensure thread-safe state checks
  • Test coverage validates both happy path (runtime crash triggers reconcile) and edge cases (startup failures don't notify)
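
The combination of an atomic flag and a non-blocking send mentioned above can be sketched like this. The type `ccMonitor` and the method `onProcessExit` are hypothetical names for illustration; the actual `SpectrumXManager` fields and control flow may differ.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// ccMonitor is a hypothetical stand-in for the per-process state the
// summary describes; it is not the real SpectrumXManager.
type ccMonitor struct {
	rdmaInterface      string
	startupCheckPassed atomic.Bool
	terminationChan    chan string // buffered; carries RDMA interface names
}

// onProcessExit is called when the supervised process exits. It reports
// whether a termination notification was queued.
func (m *ccMonitor) onProcessExit() bool {
	if !m.startupCheckPassed.Load() {
		// Exit during the startup window: no notification is sent.
		return false
	}
	select {
	case m.terminationChan <- m.rdmaInterface:
		return true
	default:
		// Buffer is full: drop the notification instead of blocking
		// the goroutine that watches the process.
		return false
	}
}

func main() {
	m := &ccMonitor{rdmaInterface: "mlx5_0", terminationChan: make(chan string, 10)}

	fmt.Println(m.onProcessExit()) // false: startup check not passed yet

	m.startupCheckPassed.Store(true)
	fmt.Println(m.onProcessExit())   // true: runtime crash is reported
	fmt.Println(<-m.terminationChan) // mlx5_0
}
```

Dropping on a full buffer is a trade-off: the watcher never stalls, at the cost of possibly losing a notification if the consumer falls far behind.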

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The implementation is well-designed with proper concurrency primitives (atomic flags, buffered channels, non-blocking sends), comprehensive test coverage for both the core mechanism and edge cases, and clean integration with the existing controller reconciliation pattern. The change is a cherry-pick from main, indicating it has already been reviewed and validated.
  • No files require special attention

Important Files Changed

| Filename | Overview |
|---|---|
| pkg/spectrumx/spectrumx.go | Added CC termination channel and logic to notify controller when DOCA PCC process dies after startup |
| internal/controller/nicdevice_controller.go | Integrated CC termination channel into controller event loop to trigger reconcile when PCC process dies |
| internal/controller/nicdevice_controller_test.go | Added test coverage for CC termination event triggering reconcile loop |
| pkg/spectrumx/spectrumx_test.go | Added tests for CC termination channel behavior during runtime crashes vs startup failures |

Sequence Diagram

sequenceDiagram
    participant PCC as DOCA PCC Process
    participant SM as SpectrumXManager
    participant Chan as ccTerminationChan
    participant Bridge as Bridge Goroutine
    participant Controller as NicDeviceController
    participant Reconcile as Reconcile Loop

    Note over PCC,SM: PCC process starts successfully
    PCC->>SM: Process running (3s startup check passed)
    SM->>SM: Set startupCheckPassed = true
    
    Note over PCC: Process dies unexpectedly
    PCC->>SM: Process termination detected
    SM->>SM: Check startupCheckPassed.Load()
    SM->>Chan: Send RdmaInterface name
    
    Chan->>Bridge: Receive RdmaInterface
    Bridge->>Bridge: Log CC termination event
    Bridge->>Controller: Send TypedGenericEvent
    
    Controller->>Controller: Enqueue reconcile request
    Controller->>Reconcile: Trigger reconciliation
    
    Reconcile->>Reconcile: ApplyDeviceRuntimeSpec()
    Reconcile->>SM: RunDocaSpcXCC()
    SM->>PCC: Restart PCC process
    Note over PCC,SM: PCC restarted with params reapplied
@greptile-apps greptile-apps bot left a comment

7 files reviewed, no comments

@almaslennikov almaslennikov merged commit acbfe66 into network-operator-26.1.x Feb 13, 2026
12 of 14 checks passed
@almaslennikov almaslennikov deleted the restart-pcc-release branch February 13, 2026 15:34