fix: increase concurrency limits for 500 pods fault injection (#1279) #282

Open
Conversation
- exec/model/parallelizer.go: change maxWorkers from const to configurable variable
- pkg/runtime/runtime.go: add --max-workers flag, update --max-concurrent-reconciles and --qps defaults
- pkg/controller/chaosblade/daemonset.go: add resource requests/limits for chaosblade-tool
- cmd/manager/main.go: call operator.Init() after pflag.Parse()

Fixes: #1279
Change-Id: I7af6d7d4a28e486b2c4bab89130300fcc02a7676
Pull request overview
Adjusts operator/runtime concurrency knobs to improve the success rate when injecting faults into very large (500+) Kubernetes pod sets, raising controller and Kubernetes-client throughput and making exec parallelism configurable at runtime.
Changes:
- Add a `--max-workers` flag and sync it into the exec parallelizer (replacing a hard-coded worker limit).
- Increase defaults for controller reconciliation concurrency and Kubernetes client QPS.
- Add CPU/memory requests & limits for the chaosblade-tool DaemonSet container.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| pkg/runtime/runtime.go | Adds flags for max-workers / higher defaults, plus a new Init() sync point into exec/model. |
| exec/model/parallelizer.go | Replaces constant worker count with configurable global MaxWorkers. |
| cmd/manager/main.go | Calls operator.Init() after parsing flags to apply runtime configuration. |
| pkg/controller/chaosblade/daemonset.go | Adds resource requests/limits to the chaosblade-tool DaemonSet container spec. |
Comments suppressed due to low confidence (2)
exec/model/parallelizer.go:42
- ParallelizeExec uses MaxWorkers directly. If MaxWorkers is 0 (e.g., runtime.Init not called) no goroutines run and the work is silently skipped; if it’s negative, wg.Add(workers) will panic. Add a safe default/clamp (e.g., if MaxWorkers <= 0 set to a sensible default like 64, and optionally cap at workCount).
```go
func ParallelizeExec(workCount int, doWork DoWorkFunc) {
	workers := MaxWorkers
	toExec := make(chan int, workCount)
	for i := 0; i < workCount; i++ {
		toExec <- i
	}
	close(toExec)
	if workCount < workers {
		workers = workCount
	}
	wg := sync.WaitGroup{}
	wg.Add(workers)
```
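A minimal sketch of the suggested clamp (not the code merged in this PR), assuming `DoWorkFunc` is `func(i int)`; the goroutine body after `wg.Add` is cut off in the excerpt above and is reconstructed here from the visible worker-pool pattern:

```go
package model

import "sync"

// DoWorkFunc is assumed to take the index of the work item.
type DoWorkFunc func(i int)

// MaxWorkers is set from the --max-workers flag at startup.
var MaxWorkers int

const defaultMaxWorkers = 64 // mirrors the previous hard-coded limit

func ParallelizeExec(workCount int, doWork DoWorkFunc) {
	workers := MaxWorkers
	if workers <= 0 {
		// Guard against runtime.Init() never running (MaxWorkers == 0)
		// or a negative flag value, which would panic in wg.Add.
		workers = defaultMaxWorkers
	}
	if workCount < workers {
		workers = workCount
	}
	toExec := make(chan int, workCount)
	for i := 0; i < workCount; i++ {
		toExec <- i
	}
	close(toExec)
	var wg sync.WaitGroup
	wg.Add(workers)
	for w := 0; w < workers; w++ {
		go func() {
			defer wg.Done()
			// Drain the closed channel until all work items are consumed.
			for i := range toExec {
				doWork(i)
			}
		}()
	}
	wg.Wait()
}
```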
exec/model/parallelizer.go:39
- There are unit tests in exec/model, but ParallelizeExec’s behavior is now dependent on MaxWorkers being set correctly. Add focused tests covering MaxWorkers <= 0 and MaxWorkers > workCount to ensure the function still executes all work items and doesn’t panic.
```go
func ParallelizeExec(workCount int, doWork DoWorkFunc) {
	workers := MaxWorkers
	toExec := make(chan int, workCount)
	for i := 0; i < workCount; i++ {
		toExec <- i
	}
	close(toExec)
	if workCount < workers {
		workers = workCount
	}
```
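A sketch of the focused tests suggested above; the test names are hypothetical, and the `MaxWorkers = 0` case only passes once a clamp like the one sketched earlier is in place (without it, the work is silently skipped):

```go
package model

import (
	"sync/atomic"
	"testing"
)

func TestParallelizeExecWithZeroMaxWorkers(t *testing.T) {
	MaxWorkers = 0 // simulate runtime.Init() never being called
	var count int32
	ParallelizeExec(10, func(i int) { atomic.AddInt32(&count, 1) })
	if count != 10 {
		t.Fatalf("expected all 10 work items to run, got %d", count)
	}
}

func TestParallelizeExecWithMoreWorkersThanWork(t *testing.T) {
	MaxWorkers = 128 // more workers than work items
	var count int32
	ParallelizeExec(3, func(i int) { atomic.AddInt32(&count, 1) })
	if count != 3 {
		t.Fatalf("expected all 3 work items to run, got %d", count)
	}
}
```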
```diff
-const (
-	maxWorkers = 64 // magic number
+var (
+	// MaxWorkers can be configured via environment variable or flag
```
Comment on lines 50 to +59
```go
func initRuntimeData() {
	chaosblade.Constant = chaosblade.Products[version.Product]
	// Set default value for parallelizer.MaxWorkers
	model.MaxWorkers = MaxWorkers
}

// Init initializes the runtime by syncing flag values to dependent packages.
// This should be called after flag.Parse() in main().
func Init() {
	model.MaxWorkers = MaxWorkers
```
Comment on lines +40 to +42
```go
flagSet.IntVar(&MaxConcurrentReconciles, "max-concurrent-reconciles", 50, "Max concurrent reconciles count, default value is 50")
flagSet.Float32Var(&QPS, "qps", 100, "qps of kubernetes client, increased from 20 to 100 for better performance")
flagSet.IntVar(&MaxWorkers, "max-workers", 64, "Max workers for parallel execution, default value is 64")
```
Change-Id: I3a9d9ca028d170c9dc3cd805f25ff7cfcc90522d
Signed-off-by: xcaspar <changjun.xcj@alibaba-inc.com>
Summary
Fixes #1279 - Failed to inject faults into 500 pods in the k8s environment.
Root Cause Analysis
Primary Issue: Concurrency Execution Limits
- `maxWorkers = 64` in `parallelizer.go` limited concurrent execution
- `MaxConcurrentReconciles = 20` restricted controller reconciliation throughput
- `QPS = 20` caused Kubernetes API rate limiting under high load (see the client-config sketch below)

Secondary Issue: DaemonSet Resource Constraints
- The chaosblade-tool DaemonSet container had no CPU/memory requests or limits defined
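For context, a minimal sketch of where a raised `--qps` value typically takes effect, assuming controller-runtime (which this operator builds on); `buildRESTConfig` and the `Burst` heuristic are illustrative, not this PR's actual wiring:

```go
package main

import (
	"k8s.io/client-go/rest"
	ctrl "sigs.k8s.io/controller-runtime"
)

// buildRESTConfig applies the --qps flag value to the rest.Config
// that is later handed to the controller manager.
func buildRESTConfig(qps float32) (*rest.Config, error) {
	cfg, err := ctrl.GetConfig()
	if err != nil {
		return nil, err
	}
	cfg.QPS = qps            // this PR raises the flag default from 20 to 100
	cfg.Burst = int(qps * 2) // hypothetical heuristic: allow short bursts above QPS
	return cfg, nil
}
```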
Changes
1. exec/model/parallelizer.go
- Changed `const maxWorkers = 64` to `var MaxWorkers int` (no default value)
- The value is now supplied via the `--max-workers` flag

2. pkg/runtime/runtime.go
- Added `--max-workers` flag (default: 64)
- Raised the `--max-concurrent-reconciles` default from 20 to 50
- Raised the `--qps` default from 20 to 100
- Added an `Init()` function to sync MaxWorkers after flag parsing

3. pkg/controller/chaosblade/daemonset.go
- Added CPU/memory requests and limits for the chaosblade-tool container (see the sketch below)
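The concrete request/limit values are not shown in this excerpt; a hypothetical sketch of what the daemonset.go change might look like, with placeholder quantities:

```go
package chaosblade

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// toolResources bounds the chaosblade-tool container so it cannot
// starve node workloads during large fault injections.
var toolResources = corev1.ResourceRequirements{
	Requests: corev1.ResourceList{
		corev1.ResourceCPU:    resource.MustParse("100m"),  // hypothetical value
		corev1.ResourceMemory: resource.MustParse("128Mi"), // hypothetical value
	},
	Limits: corev1.ResourceList{
		corev1.ResourceCPU:    resource.MustParse("500m"),  // hypothetical value
		corev1.ResourceMemory: resource.MustParse("512Mi"), // hypothetical value
	},
}
```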
4. cmd/manager/main.go
- Added an `operator.Init()` call after `pflag.Parse()` to ensure runtime configuration is applied (ordering sketch below)
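A sketch of the required ordering in cmd/manager/main.go; the import path for the runtime package is an assumption:

```go
package main

import (
	"github.com/spf13/pflag"

	operator "github.com/chaosblade-io/chaosblade-operator/pkg/runtime" // assumed path
)

func main() {
	pflag.Parse()   // resolve --max-workers, --max-concurrent-reconciles, --qps first
	operator.Init() // then sync MaxWorkers into exec/model's parallelizer
	// ... construct and start the controller manager as before ...
}
```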
Verification

- `go fmt ./...`
- `go vet ./...`
- `go build ./...`

Recommended Configuration
For large-scale scenarios (500+ pods), recommended operator startup parameters:
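The exact recommended values are not preserved in this excerpt; as an illustration, the flags this PR introduces or retunes, shown at their new defaults (larger fleets may warrant higher values):

```
--max-workers=64 --max-concurrent-reconciles=50 --qps=100
```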
Commits
Testing
Tested with 500 pods network packet loss fault injection scenario.
Target: ≥99% success rate, completion time < 5 minutes.