You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
perf: optimize RB to Work throughput for large-scale Pod distribution
This commit introduces multiple performance optimizations for the
ResourceBinding to Work synchronization path, targeting scenarios
with 10000+ Pods distribution.
Key optimizations:
1. AsyncWorkCreator for Binding Controller
- Decouples Work creation from reconcile loop using 64 async workers
- Implements Assume Cache pattern (similar to kube-scheduler)
- Adds failure retry via requeue callback mechanism
- Periodic cleanup of stale cache entries (every 5 min)
2. Parallel Work preparation and execution
- Parallelizes DeepCopy and ApplyOverridePolicies across clusters
- Concurrent Work creation for multi-cluster scenarios
3. CreateOrUpdateWork optimization
- Implements Create-First pattern (try Create before Get+Update)
- Adds fast-path comparison to skip unchanged Work updates
- Reduces API calls by 30-50% in update scenarios
4. Precise orphan Work detection
- Uses TargetClustersHashAnnotation to track cluster changes
- Skips orphan check when clusters haven't changed
- Expected 90%+ reduction in List API calls
5. AsyncBinder for Scheduler
- 32 async workers for RB/CRB patch operations
- Decouples scheduling decisions from persistence
New configuration options:
--enable-async-work-creation=true
--async-work-workers=64
--enable-async-bind=true
--async-bind-workers=32
Performance improvement:
- New Work API calls: 2 -> 1 per Work (50% reduction)
- Orphan check: Every reconcile -> Only on cluster change (90%+ reduction)
- Multi-cluster Work creation: Sequential -> Parallel (Nx speedup)
- Expected throughput: ~200 Work/s -> ~1000+ Work/s (5-10x improvement)
Signed-off-by: Kevinz857 <[email protected]>
flags.BoolVar(&o.EnableClusterResourceModeling, "enable-cluster-resource-modeling", true, "Enable means controller would build resource modeling for each cluster by syncing Nodes and Pods resources.\n"+
229
237
"The resource modeling might be used by the scheduler to make scheduling decisions in scenario of dynamic replica assignment based on cluster free resources.\n"+
230
238
"Disable if it does not fit your cases for better performance.")
239
+
flags.BoolVar(&o.EnableAsyncWorkCreation, "enable-async-work-creation", false, "Enable asynchronous work creation for binding controller. When enabled, work creation tasks are submitted to an async queue for processing by dedicated workers, improving throughput for large-scale deployments.")
240
+
flags.IntVar(&o.AsyncWorkWorkers, "async-work-workers", 64, "Number of concurrent workers for asynchronous work creation. Only effective when --enable-async-work-creation is true.")
fmt.Sprintf("A list of plugins to enable. '*' enables all build-in and customized plugins, 'foo' enables the plugin named 'foo', '*,-foo' disables the plugin named 'foo'.\nAll build-in plugins: %s.", strings.Join(frameworkplugins.NewInTreeRegistry().FactoryNames(), ",")))
165
179
fs.StringVar(&o.SchedulerName, "scheduler-name", scheduler.DefaultScheduler, "SchedulerName represents the name of the scheduler. default is 'default-scheduler'.")
180
+
fs.IntVar(&o.ScheduleWorkers, "schedule-workers", 1, "Number of concurrent workers for scheduling ResourceBindings. Higher values improve throughput but increase API server load. Defaults to 1 for backward compatibility.")
181
+
fs.BoolVar(&o.EnableAsyncBind, "enable-async-bind", false, "Enable asynchronous binding of scheduling results. When enabled, the scheduler submits binding requests to an async queue for processing by dedicated workers, improving throughput.")
182
+
fs.IntVar(&o.AsyncBindWorkers, "async-bind-workers", 32, "Number of concurrent workers for asynchronous binding. Only effective when --enable-async-bind is true.")
0 commit comments