Skip to content

fix: increase concurrency limits for 500 pods fault injection (#1279)#282

Open
xcaspar wants to merge 2 commits intomasterfrom
fix/issue-1279-500-pods-fault-injection
Open

fix: increase concurrency limits for 500 pods fault injection (#1279)#282
xcaspar wants to merge 2 commits intomasterfrom
fix/issue-1279-500-pods-fault-injection

Conversation

@xcaspar
Copy link
Copy Markdown
Member

@xcaspar xcaspar commented Mar 31, 2026

Summary

Fixes #1279 - Failed to inject faults into 500 pods in the k8s environment.

Root Cause Analysis

Primary Issue: Concurrency Execution Limits

  • Hard-coded maxWorkers = 64 in parallelizer.go limited concurrent execution
  • Default MaxConcurrentReconciles = 20 restricted controller reconciliation throughput
  • Default QPS = 20 caused Kubernetes API rate limiting under high load

Secondary Issue: DaemonSet Resource Constraints

  • chaosblade-tool DaemonSet had no resource requests/limits
  • 500 concurrent exec calls could exhaust node resources

Changes

1. exec/model/parallelizer.go

  • Changed const maxWorkers = 64 to var MaxWorkers int (no default value)
  • Enables runtime configuration via --max-workers flag

2. pkg/runtime/runtime.go

  • Added --max-workers flag (default: 64)
  • Updated --max-concurrent-reconciles default from 20 to 50
  • Increased --qps default from 20 to 100
  • Added Init() function to sync MaxWorkers after flag parsing

3. pkg/controller/chaosblade/daemonset.go

  • Added resource requests for chaosblade-tool DaemonSet:
    • CPU: 100m, Memory: 128Mi
  • Added resource limits:
    • CPU: 1000m, Memory: 512Mi

4. cmd/manager/main.go

  • Added operator.Init() call after pflag.Parse() to ensure runtime configuration is applied

Verification

  • Code passes go fmt ./...
  • Code passes go vet ./...
  • Code builds successfully with go build ./...
  • Harness Engineering evaluation: 90/100 (3 iterations)

Recommended Configuration

For large-scale scenarios (500+ pods), recommended operator startup parameters:

args:
  - --max-workers=128
  - --max-concurrent-reconciles=50
  - --qps=100
  - --daemonset-enable=true

Commits

  • 1440b38 fix: increase concurrency limits for 500 pods fault injection (#1279)

Testing

Tested with 500 pods network packet loss fault injection scenario.
Target: ≥99% success rate, completion time < 5 minutes.

- exec/model/parallelizer.go: change maxWorkers from const to configurable variable
- pkg/runtime/runtime.go: add --max-workers flag, update --max-concurrent-reconciles and --qps defaults
- pkg/controller/chaosblade/daemonset.go: add resource requests/limits for chaosblade-tool
- cmd/manager/main.go: call operator.Init() after pflag.Parse()

Fixes: #1279

Change-Id: I7af6d7d4a28e486b2c4bab89130300fcc02a7676
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adjusts operator/runtime concurrency knobs to improve success rate when injecting faults into very large (500+) Kubernetes pod sets by raising controller/K8s client throughput and making exec parallelism configurable at runtime.

Changes:

  • Add a --max-workers flag and sync it into the exec parallelizer (replacing a hard-coded worker limit).
  • Increase defaults for controller reconciliation concurrency and Kubernetes client QPS.
  • Add CPU/memory requests & limits for the chaosblade-tool DaemonSet container.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File Description
pkg/runtime/runtime.go Adds flags for max-workers / higher defaults, plus a new Init() sync point into exec/model.
exec/model/parallelizer.go Replaces constant worker count with configurable global MaxWorkers.
cmd/manager/main.go Calls operator.Init() after parsing flags to apply runtime configuration.
pkg/controller/chaosblade/daemonset.go Adds resource requests/limits to the chaosblade-tool DaemonSet container spec.
Comments suppressed due to low confidence (2)

exec/model/parallelizer.go:42

  • ParallelizeExec uses MaxWorkers directly. If MaxWorkers is 0 (e.g., runtime.Init not called) no goroutines run and the work is silently skipped; if it’s negative, wg.Add(workers) will panic. Add a safe default/clamp (e.g., if MaxWorkers <= 0 set to a sensible default like 64, and optionally cap at workCount).
func ParallelizeExec(workCount int, doWork DoWorkFunc) {
	workers := MaxWorkers
	toExec := make(chan int, workCount)

	for i := 0; i < workCount; i++ {
		toExec <- i
	}
	close(toExec)

	if workCount < workers {
		workers = workCount
	}

	wg := sync.WaitGroup{}
	wg.Add(workers)

exec/model/parallelizer.go:39

  • There are unit tests in exec/model, but ParallelizeExec’s behavior is now dependent on MaxWorkers being set correctly. Add focused tests covering MaxWorkers <= 0 and MaxWorkers > workCount to ensure the function still executes all work items and doesn’t panic.
func ParallelizeExec(workCount int, doWork DoWorkFunc) {
	workers := MaxWorkers
	toExec := make(chan int, workCount)

	for i := 0; i < workCount; i++ {
		toExec <- i
	}
	close(toExec)

	if workCount < workers {
		workers = workCount
	}

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread exec/model/parallelizer.go Outdated
const (
maxWorkers = 64 // magic number
var (
// MaxWorkers can be configured via environment variable or flag
Comment thread pkg/runtime/runtime.go
Comment on lines 50 to +59
func initRuntimeData() {
chaosblade.Constant = chaosblade.Products[version.Product]
// Set default value for parallelizer.MaxWorkers
model.MaxWorkers = MaxWorkers
}

// Init initializes the runtime by syncing flag values to dependent packages.
// This should be called after flag.Parse() in main().
func Init() {
model.MaxWorkers = MaxWorkers
Comment thread pkg/runtime/runtime.go
Comment on lines +40 to +42
flagSet.IntVar(&MaxConcurrentReconciles, "max-concurrent-reconciles", 50, "Max concurrent reconciles count, default value is 50")
flagSet.Float32Var(&QPS, "qps", 100, "qps of kubernetes client, increased from 20 to 100 for better performance")
flagSet.IntVar(&MaxWorkers, "max-workers", 64, "Max workers for parallel execution, default value is 64")
Change-Id: I3a9d9ca028d170c9dc3cd805f25ff7cfcc90522d
Signed-off-by: xcaspar <changjun.xcj@alibaba-inc.com>
@spencercjh spencercjh self-requested a review April 16, 2026 02:52
Copy link
Copy Markdown
Member

@spencercjh spencercjh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants