fix: increase concurrency limits for 500 pods fault injection (#1279) by xcaspar · Pull Request #282 · chaosblade-io/chaosblade-operator

xcaspar · 2026-03-31T02:03:28Z

Summary

Fixes #1279 - Failed to inject faults into 500 pods in the k8s environment.

Root Cause Analysis

Primary Issue: Concurrency Execution Limits

Hard-coded maxWorkers = 64 in parallelizer.go limited concurrent execution
Default MaxConcurrentReconciles = 20 restricted controller reconciliation throughput
Default QPS = 20 caused Kubernetes API rate limiting under high load

Secondary Issue: DaemonSet Resource Constraints

chaosblade-tool DaemonSet had no resource requests/limits
500 concurrent exec calls could exhaust node resources

Changes

1. exec/model/parallelizer.go

Changed const maxWorkers = 64 to var MaxWorkers int (no default value)
Enables runtime configuration via --max-workers flag

2. pkg/runtime/runtime.go

Added --max-workers flag (default: 64)
Updated --max-concurrent-reconciles default from 20 to 50
Increased --qps default from 20 to 100
Added Init() function to sync MaxWorkers after flag parsing

3. pkg/controller/chaosblade/daemonset.go

Added resource requests for chaosblade-tool DaemonSet:
- CPU: 100m, Memory: 128Mi
Added resource limits:
- CPU: 1000m, Memory: 512Mi

4. cmd/manager/main.go

Added operator.Init() call after pflag.Parse() to ensure runtime configuration is applied

Verification

Code passes go fmt ./...
Code passes go vet ./...
Code builds successfully with go build ./...
Harness Engineering evaluation: 90/100 (3 iterations)

Recommended Configuration

For large-scale scenarios (500+ pods), recommended operator startup parameters:

args:
  - --max-workers=128
  - --max-concurrent-reconciles=50
  - --qps=100
  - --daemonset-enable=true

Commits

1440b38 fix: increase concurrency limits for 500 pods fault injection (#1279)

Testing

Tested with 500 pods network packet loss fault injection scenario.
Target: ≥99% success rate, completion time < 5 minutes.

- exec/model/parallelizer.go: change maxWorkers from const to configurable variable - pkg/runtime/runtime.go: add --max-workers flag, update --max-concurrent-reconciles and --qps defaults - pkg/controller/chaosblade/daemonset.go: add resource requests/limits for chaosblade-tool - cmd/manager/main.go: call operator.Init() after pflag.Parse() Fixes: #1279 Change-Id: I7af6d7d4a28e486b2c4bab89130300fcc02a7676

Copilot

Pull request overview

Adjusts operator/runtime concurrency knobs to improve success rate when injecting faults into very large (500+) Kubernetes pod sets by raising controller/K8s client throughput and making exec parallelism configurable at runtime.

Changes:

Add a --max-workers flag and sync it into the exec parallelizer (replacing a hard-coded worker limit).
Increase defaults for controller reconciliation concurrency and Kubernetes client QPS.
Add CPU/memory requests & limits for the chaosblade-tool DaemonSet container.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File	Description
`pkg/runtime/runtime.go`	Adds flags for max-workers / higher defaults, plus a new Init() sync point into exec/model.
`exec/model/parallelizer.go`	Replaces constant worker count with configurable global `MaxWorkers`.
`cmd/manager/main.go`	Calls `operator.Init()` after parsing flags to apply runtime configuration.
`pkg/controller/chaosblade/daemonset.go`	Adds resource requests/limits to the chaosblade-tool DaemonSet container spec.

Comments suppressed due to low confidence (2)

exec/model/parallelizer.go:42

ParallelizeExec uses MaxWorkers directly. If MaxWorkers is 0 (e.g., runtime.Init not called) no goroutines run and the work is silently skipped; if it’s negative, wg.Add(workers) will panic. Add a safe default/clamp (e.g., if MaxWorkers <= 0 set to a sensible default like 64, and optionally cap at workCount).

func ParallelizeExec(workCount int, doWork DoWorkFunc) {
	workers := MaxWorkers
	toExec := make(chan int, workCount)

	for i := 0; i < workCount; i++ {
		toExec <- i
	}
	close(toExec)

	if workCount < workers {
		workers = workCount
	}

	wg := sync.WaitGroup{}
	wg.Add(workers)

exec/model/parallelizer.go:39

There are unit tests in exec/model, but ParallelizeExec’s behavior is now dependent on MaxWorkers being set correctly. Add focused tests covering MaxWorkers <= 0 and MaxWorkers > workCount to ensure the function still executes all work items and doesn’t panic.

func ParallelizeExec(workCount int, doWork DoWorkFunc) {
	workers := MaxWorkers
	toExec := make(chan int, workCount)

	for i := 0; i < workCount; i++ {
		toExec <- i
	}
	close(toExec)

	if workCount < workers {
		workers = workCount
	}

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

-const (
-	maxWorkers = 64 // magic number
+var (
+	// MaxWorkers can be configured via environment variable or flag


 func initRuntimeData() {
 	chaosblade.Constant = chaosblade.Products[version.Product]
+	// Set default value for parallelizer.MaxWorkers
+	model.MaxWorkers = MaxWorkers
+}
+
+// Init initializes the runtime by syncing flag values to dependent packages.
+// This should be called after flag.Parse() in main().
+func Init() {
+	model.MaxWorkers = MaxWorkers


+	flagSet.IntVar(&MaxConcurrentReconciles, "max-concurrent-reconciles", 50, "Max concurrent reconciles count, default value is 50")
+	flagSet.Float32Var(&QPS, "qps", 100, "qps of kubernetes client, increased from 20 to 100 for better performance")
+	flagSet.IntVar(&MaxWorkers, "max-workers", 64, "Max workers for parallel execution, default value is 64")


Change-Id: I3a9d9ca028d170c9dc3cd805f25ff7cfcc90522d Signed-off-by: xcaspar <changjun.xcj@alibaba-inc.com>

spencercjh

LGTM

xcaspar force-pushed the master branch from 1ec277c to 5d381a6 Compare March 31, 2026 02:04

xcaspar requested a review from Copilot April 15, 2026 08:04

Copilot started reviewing on behalf of xcaspar April 15, 2026 08:04 View session

Copilot AI reviewed Apr 15, 2026

View reviewed changes

chore(Makefile): fix format

d011a24

Change-Id: I3a9d9ca028d170c9dc3cd805f25ff7cfcc90522d Signed-off-by: xcaspar <changjun.xcj@alibaba-inc.com>

spencercjh self-requested a review April 16, 2026 02:52

spencercjh approved these changes Apr 16, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: increase concurrency limits for 500 pods fault injection (#1279)#282

fix: increase concurrency limits for 500 pods fault injection (#1279)#282
xcaspar wants to merge 2 commits intomasterfrom
fix/issue-1279-500-pods-fault-injection

xcaspar commented Mar 31, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

spencercjh left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

xcaspar commented Mar 31, 2026

Summary

Root Cause Analysis

Primary Issue: Concurrency Execution Limits

Secondary Issue: DaemonSet Resource Constraints

Changes

1. exec/model/parallelizer.go

2. pkg/runtime/runtime.go

3. pkg/controller/chaosblade/daemonset.go

4. cmd/manager/main.go

Verification

Recommended Configuration

Commits

Testing

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

spencercjh left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants