reschedule the reconcile loop when DOCA PCC process dies #286
Merged
almaslennikov merged 1 commit into Mellanox:main on Feb 13, 2026
We need to restart the PCC and reapply its params

Signed-off-by: Alexander Maslennikov <amaslennikov@nvidia.com>
Greptile Overview
Greptile Summary
This PR implements automatic recovery when the DOCA PCC (Programmable Congestion Control) process dies unexpectedly at runtime. It adds a termination notification channel that bridges the SpectrumX manager (which is k8s-free) to the controller's event system. Key changes:
How it works:
The implementation correctly handles edge cases: startup failures don't trigger reconciliation (they return errors immediately), and the channel has a buffer of 10 with non-blocking sends to prevent deadlocks.
Confidence Score: 5/5
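The non-blocking send mentioned above can be illustrated with a minimal sketch. This is not the PR's code: `notifyTermination`, `terminationBufferSize`, and the `mlx5_N` interface names are hypothetical, chosen only to show how a `select` with a `default` branch keeps the sender from blocking once a buffer of 10 is full.

```go
package main

import "fmt"

// terminationBufferSize mirrors the buffer of 10 mentioned in the summary.
// The constant name is an assumption made for this sketch.
const terminationBufferSize = 10

// notifyTermination is a hypothetical helper. It tries to enqueue the RDMA
// interface name and reports whether the send succeeded. If the buffer is
// full, the notification is dropped instead of blocking the caller.
func notifyTermination(ch chan<- string, rdmaInterface string) bool {
	select {
	case ch <- rdmaInterface:
		return true
	default:
		return false
	}
}

func main() {
	ch := make(chan string, terminationBufferSize)

	sent, dropped := 0, 0
	// Nobody is reading from ch, so only the first 10 sends fit in the buffer.
	for i := 0; i < 12; i++ {
		if notifyTermination(ch, fmt.Sprintf("mlx5_%d", i)) {
			sent++
		} else {
			dropped++
		}
	}
	fmt.Printf("sent=%d dropped=%d\n", sent, dropped) // prints "sent=10 dropped=2"
}
```

The trade-off in this pattern is that a notification can be lost if the consumer falls far behind. Whether and how the PR compensates for that is not visible from this page.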
Important Files Changed
Sequence Diagram
sequenceDiagram
participant Controller as NicDevice Controller
participant Manager as SpectrumX Manager
participant Process as DOCA CC Process
participant Channel as Termination Channel
participant Reconcile as Reconcile Loop
Note over Controller,Manager: Setup Phase
Controller->>Manager: GetCCTerminationChannel()
Manager-->>Controller: <-chan string
Controller->>Controller: Start goroutine to bridge channel to events
Note over Manager,Process: Runtime Configuration
Controller->>Manager: ApplyRuntimeConfig(device)
Manager->>Process: RunDocaSpcXCC(port)
Process->>Process: Start CC process
Process->>Process: Wait 3s for startup
alt Startup Success
Process->>Process: Set startupCheckPassed = true
Process-->>Manager: nil (success)
else Startup Failure
Process-->>Manager: error
end
Note over Process,Reconcile: Runtime Crash Detection
Process->>Process: CC process dies unexpectedly
alt After Startup (startupCheckPassed = true)
Process->>Channel: Send RDMA interface name
Channel->>Controller: Notification received
Controller->>Reconcile: Trigger reconcile event
Reconcile->>Manager: ApplyRuntimeConfig(device)
Manager->>Process: Restart CC process
else During Startup (startupCheckPassed = false)
Process->>Process: No notification sent
Note over Process: Startup failures handled via error return
end
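The flow in the diagram could be sketched roughly as follows, using only the standard library. This is an illustration under stated assumptions, not the PR's implementation: `ccManager`, `newCCManager`, `runCC`, `demoStartupWait`, and the `exited` channel (a stand-in for the real process exiting) are all hypothetical, and only `GetCCTerminationChannel` and its `<-chan string` return type come from the diagram. Instead of a `startupCheckPassed` flag, the sketch starts the crash watcher only after the startup wait has elapsed, which gives the same two outcomes.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// demoStartupWait stands in for the 3s startup wait shown in the diagram.
// It is shortened so the example finishes quickly.
const demoStartupWait = 50 * time.Millisecond

// ccManager is a hypothetical stand-in for the SpectrumX manager.
type ccManager struct {
	terminations chan string
	startupWait  time.Duration
}

func newCCManager(startupWait time.Duration) *ccManager {
	return &ccManager{
		terminations: make(chan string, 10), // buffer of 10, as in the summary
		startupWait:  startupWait,
	}
}

// GetCCTerminationChannel exposes the receive-only side to the controller.
func (m *ccManager) GetCCTerminationChannel() <-chan string {
	return m.terminations
}

// runCC simulates starting the CC process. The exited channel is closed when
// the simulated process dies. A death during the startup wait is returned as
// an error. A later death is reported on the termination channel.
func (m *ccManager) runCC(rdmaInterface string, exited <-chan struct{}) error {
	select {
	case <-exited:
		return errors.New("CC process exited during startup")
	case <-time.After(m.startupWait):
		// Startup check passed; fall through and watch for a later crash.
	}

	go func() {
		<-exited
		select {
		case m.terminations <- rdmaInterface:
		default: // buffer full: drop rather than block
		}
	}()
	return nil
}

func main() {
	m := newCCManager(demoStartupWait)

	// Case 1: the process is already dead during startup, so runCC returns
	// an error and no notification is sent.
	early := make(chan struct{})
	close(early)
	fmt.Println("startup result:", m.runCC("mlx5_0", early))

	// Case 2: the process dies after startup, so the interface name arrives
	// on the termination channel and a reconcile would be triggered.
	late := make(chan struct{})
	if err := m.runCC("mlx5_1", late); err != nil {
		fmt.Println("unexpected startup error:", err)
		return
	}
	close(late)
	fmt.Println("reconcile requested for:", <-m.GetCCTerminationChannel())
}
```

In the real controller, the receiving side would presumably translate each interface name into a reconcile event for the matching NicDevice. That Kubernetes-specific bridge is left out here to keep the sketch dependency-free.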