Description
What would you like to be added:
Optimize ResourceBinding to Work synchronization throughput for large-scale Pod distribution scenarios (10000+ Pods).
Why is this needed:
Problem Description
In large-scale resource distribution scenarios (e.g., distributing 10000+ Pods to multiple member clusters), we observed a significant bottleneck in the ResourceBinding to Work synchronization path.
Observed data during stress testing:
etcd object counts:
- Pod: 12649
- ResourceBinding: 12645 ✅ Almost caught up with Pods
- Work: 6646 ❌ Lagging ~6000 behind ResourceBindings
Bottleneck Analysis
We analyzed the complete data path and identified where the latency occurs:
| Stage | Component | Latency | Status |
|---|---|---|---|
| Pod → ResourceBinding | resource-detector | ~ms | ✅ Normal |
| ResourceBinding → RB.spec.clusters | karmada-scheduler | ~ms | ✅ Normal |
| ResourceBinding → Work | binding-controller | seconds to minutes | ❌ Bottleneck |
Root Cause Analysis
The binding-controller has several performance issues in the RB → Work path:
1. **Synchronous Work Creation**
   - Each ResourceBinding reconcile blocks on Work creation
   - No decoupling between the scheduling decision and persistence
   - Redundant reconciles while Works are being created

2. **Inefficient API Call Pattern** (see the create-first sketch after this list)
   - Uses `controllerutil.CreateOrUpdate`, which always does a Get before Create
   - For new Work objects: 2 API calls (Get + Create) instead of 1
   - No fast path to skip unchanged Work updates

3. **Unnecessary Orphan Work Checks**
   - `removeOrphanWorks()` is called on every reconcile
   - Each check triggers a List API call via `GetWorksByBindingID()`
   - With 12645 RBs, this means 12645+ unnecessary List operations

4. **Sequential Multi-Cluster Work Creation**
   - When distributing to N clusters, Work objects are created sequentially
   - For a dual-cluster scenario: 2 sequential API calls instead of parallel ones

5. **Excessive Event Recording**
   - Records 2 Events per successful sync (binding + workload)
   - In high-throughput scenarios, this creates significant API load
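To make point 2 concrete, here is a minimal sketch of a create-first write path for Work objects, assuming a controller-runtime client. `createFirstWork` is a hypothetical helper used only for illustration, not the existing binding-controller code:

```go
// A minimal sketch of the create-first pattern, assuming a controller-runtime
// client. createFirstWork is a hypothetical helper, not existing code.
package sketch

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"

	workv1alpha1 "github.com/karmada-io/karmada/pkg/apis/work/v1alpha1"
)

// createFirstWork tries Create directly and only falls back to Get + Update
// when the Work already exists. A brand-new Work therefore costs 1 API call
// instead of the 2 (Get + Create) issued by controllerutil.CreateOrUpdate.
func createFirstWork(ctx context.Context, c client.Client, desired *workv1alpha1.Work) error {
	err := c.Create(ctx, desired)
	if err == nil {
		return nil // fast path: new Work, single API call
	}
	if !apierrors.IsAlreadyExists(err) {
		return err
	}

	// Slow path: the Work already exists, so fetch and update it.
	existing := &workv1alpha1.Work{}
	if err := c.Get(ctx, client.ObjectKeyFromObject(desired), existing); err != nil {
		return err
	}
	existing.Labels = desired.Labels
	existing.Annotations = desired.Annotations
	existing.Spec = desired.Spec
	return c.Update(ctx, existing)
}
```

Since most Works in a large-scale distribution are brand new, the fast path dominates and the per-Work write cost roughly halves.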
API Call Analysis (Dual-Cluster Distribution)
Before optimization (per ResourceBinding):

```
├── GetWorksByBindingID (List)   = 1 call
├── FetchResourceTemplate (Get)  = 1 call
├── Cluster A: CreateOrUpdateWork
│   ├── Get (NotFound)           = 1 call
│   └── Create                   = 1 call
├── Cluster B: CreateOrUpdateWork (sequential!)
│   ├── Get (NotFound)           = 1 call
│   └── Create                   = 1 call
├── Event(binding)               = 1 call
└── Event(workload)              = 1 call

Total: 8 API calls (sequential)
```
After optimization (per ResourceBinding):

```
├── FetchResourceTemplate (Get)  = 1 call
├── Parallel:
│   ├── Cluster A: Create        = 1 call
│   └── Cluster B: Create        = 1 call
└── Log (no API call)

Total: 3 API calls (2 parallel)
```
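The "Parallel" step above could look roughly like the following sketch, which reuses the hypothetical `createFirstWork` helper from the earlier snippet; the function name and error handling are illustrative only:

```go
package sketch

import (
	"context"

	"golang.org/x/sync/errgroup"
	"sigs.k8s.io/controller-runtime/pkg/client"

	workv1alpha1 "github.com/karmada-io/karmada/pkg/apis/work/v1alpha1"
)

// createWorksParallel issues the per-cluster creates concurrently instead of
// one after another, so a dual-cluster spread costs one round-trip of
// wall-clock latency instead of two.
func createWorksParallel(ctx context.Context, c client.Client, works []*workv1alpha1.Work) error {
	g, gctx := errgroup.WithContext(ctx)
	for _, w := range works {
		w := w // capture the loop variable for the goroutine
		g.Go(func() error {
			return createFirstWork(gctx, c, w) // create-first helper from the earlier sketch
		})
	}
	// Wait returns the first non-nil error; in that case the ResourceBinding
	// is requeued and the remaining clusters converge on retry.
	return g.Wait()
}
```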
Configuration Bottleneck
We also discovered that the default RateLimiter configuration is too conservative:
Default: `--rate-limiter-qps=10`

For 6160 ResourceBindings, the theoretical minimum processing time at this rate is 6160 / 10 = 616 seconds ≈ 10 minutes.
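As a rough illustration of the gap (not a tuning recommendation), the same arithmetic gives the client-side rate budget needed to drain such a backlog within a given window:

Target: clear 6160 ResourceBindings in ~60 seconds
Required QPS ≈ 6160 / 60 ≈ 103, i.e. roughly 10x the default of 10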
Expected Improvements
| Metric | Before | After | Improvement |
|---|---|---|---|
| New Work API calls | 2 per Work | 1 per Work | 50% |
| Orphan check frequency | Every reconcile | Only on cluster change | 90%+ |
| Multi-cluster Work creation | Sequential | Parallel | Nx |
| Event recording | 2 per success | 0 per success | 100% |
| Total API calls per RB (dual-cluster) | ~8 (sequential) | ~3 (2 parallel) | 60%+ |
| Expected throughput | ~200 Work/s | ~1000+ Work/s | 5-10x |
Proposed Solution
- AsyncWorkCreator - Decouple Work creation from the reconcile loop with async workers (see the worker-pool sketch below)
- Assume Cache - Skip redundant reconciles for in-flight work creation (similar to kube-scheduler)
- Create-First Pattern - Try Create before Get+Update
- Precise Orphan Detection - Use hash annotation to skip unchanged cluster checks
- Parallel Work Creation - Create Works for multiple clusters concurrently
- AsyncBinder for Scheduler - Async workers for RB/CRB patch operations
All optimizations should be behind feature flags for backward compatibility.
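For discussion, here is a minimal sketch of how the AsyncWorkCreator and assume-cache ideas could fit together, assuming a controller-runtime client and reusing the hypothetical `createFirstWork` helper from above. The type name, queue size, and worker layout are illustrative, not the proposed implementation:

```go
package sketch

import (
	"context"
	"sync"

	"k8s.io/klog/v2"
	"sigs.k8s.io/controller-runtime/pkg/client"

	workv1alpha1 "github.com/karmada-io/karmada/pkg/apis/work/v1alpha1"
)

// AsyncWorkCreator decouples Work persistence from the reconcile loop:
// reconcile only enqueues the desired Work, a fixed worker pool performs the
// API writes, and an assume cache suppresses redundant requests while a
// creation is still in flight.
type AsyncWorkCreator struct {
	client  client.Client
	queue   chan *workv1alpha1.Work
	assumed sync.Map // keys of Works whose creation is in flight
}

func NewAsyncWorkCreator(c client.Client, workers int) *AsyncWorkCreator {
	a := &AsyncWorkCreator{client: c, queue: make(chan *workv1alpha1.Work, 1024)}
	for i := 0; i < workers; i++ {
		go a.run()
	}
	return a
}

// Enqueue returns immediately, so the reconcile loop never blocks on API latency.
func (a *AsyncWorkCreator) Enqueue(w *workv1alpha1.Work) {
	key := w.Namespace + "/" + w.Name
	if _, inFlight := a.assumed.LoadOrStore(key, struct{}{}); inFlight {
		return // an identical creation is already queued or running
	}
	a.queue <- w
}

func (a *AsyncWorkCreator) run() {
	for w := range a.queue {
		key := w.Namespace + "/" + w.Name
		// Reuse the create-first write path sketched earlier; a real
		// implementation would also retry or requeue on failure.
		if err := createFirstWork(context.TODO(), a.client, w); err != nil {
			klog.ErrorS(err, "failed to create Work asynchronously", "work", key)
		}
		a.assumed.Delete(key)
	}
}
```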