refactor: fix unimplemented concurrency in CheckComponentStatus#5522
refactor: fix unimplemented concurrency in CheckComponentStatus#5522yoonseo-han wants to merge 8 commits into
Conversation
Signed-off-by: Yoonseo Han <yooncer00@gmail.com>
Signed-off-by: Yoonseo Han <yooncer00@gmail.com>
…ests Signed-off-by: Yoonseo Han <yooncer00@gmail.com>
…round Signed-off-by: Yoonseo Han <yooncer00@gmail.com>
…adability Signed-off-by: Yoonseo Han <yooncer00@gmail.com>
There was a problem hiding this comment.
Pull request overview
Refactors CheckComponentStatus in the ChaosCenter subscriber to actually perform infra component health checks concurrently, simplifying the previous (effectively synchronous) WaitGroup pattern and making the core checks more testable.
Changes:
- Replaces the old single-goroutine + immediate
Wait()pattern with per-selector goroutines inallDeploymentsHealthy. - Introduces a pure helper
deploymentHealthyto evaluate readiness for a single label selector. - Adds unit tests for
deploymentHealthyandallDeploymentsHealthy, and updates existing tests to use renamed fake client imports.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
chaoscenter/subscriber/pkg/k8s/operations.go |
Refactors component status checking into concurrent, testable helper functions and simplifies orchestration logic. |
chaoscenter/subscriber/pkg/k8s/operations_test.go |
Adds unit tests covering the new helper functions and updates fake client imports. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| for retry := range LiveCheckMaxTries { | ||
| if allDeploymentsHealthy(ctx, clientset, InfraNamespace, components.Deployments) { | ||
| logrus.Info("All infra deployments are up") | ||
| return nil | ||
| } | ||
| if retry < LiveCheckMaxTries-1 { | ||
| time.Sleep(30 * time.Second) | ||
| } | ||
| } |
| healthy: false, | ||
| }, | ||
| { | ||
| name: "Selector doesnt match any pod", |
|
hey @yoonseo-han |
|
@yoonseo-han Trivy check is failing for this PR. Could you please raise a new PR for this and link it here? |
Signed-off-by: Yoonseo Han <yooncer00@gmail.com>
|
Hey @PriteshKiri Thanks for the check. Had a look at the copilot comments and changed the codes respectively. As for the CI, it seems ok based on the rerun triggerred thanks to you. Ready for review when you feel free. Thanks |
Proposed changes
CheckComponentStatusinpkg/k8s/operations.goclaimed to check infra components concurrently but launched a single goroutine and immediately blocked onWaitGroup.Wait()making it fully synchronous.The
sync.MutexguardingLiveStatushad no concurrent writer to protect against, making it dead complexity.The comment // add all agent components to waitgroup directly contradicted the wait.Add(1) below it.
This PR replaces the misleading pattern with correct goroutine usage and seperation of concerns by splitting functions for easiser testing:
deploymentHealthy: pure function, single selector, single API callallDeploymentsHealthy: spawns one goroutine per selector, usesatomic.Int32for race-safe failure counting,sync.WaitGroupas barrierCheckComponentStatus: retry orchestration logic executed only; dead state (LiveStatus, AccessLiveStatus) removed and replaced by return valuesTypes of changes
What types of changes does your code introduce to Litmus? Put an
xin the boxes that applyChecklist
Put an
xin the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your code.Dependency
NA
Special notes for your reviewer:
The full
CheckComponentStatusretry loop is intentionally left untested as implementing would require clock injection or 150s+ real sleeps; coverage sits at the two pure helper layers instead.kubernetes.Interfaceis used in place of*kubernetes.Clientsetto allow fake client injection in tests. No production impact.