
Optimize ResourceBinding to Work synchronization throughput for large-scale Pod distribution scenarios (10000+ Pods). #7062

@Kevinz857

Description


What would you like to be added:

Optimize ResourceBinding to Work synchronization throughput for large-scale Pod distribution scenarios (10000+ Pods).

Why is this needed:

Problem Description

In large-scale resource distribution scenarios (e.g., distributing 10000+ Pods to multiple member clusters), we observed a significant bottleneck in the ResourceBinding to Work synchronization path.

Observed data during stress testing:
etcd object counts:

  • Pod: 12649
  • ResourceBinding: 12645 ✅ Almost caught up with Pods
  • Work: 6646 ❌ ~6000 lagging behind ResourceBindings

Bottleneck Analysis

We analyzed the complete data path and identified where the latency occurs:

| Stage | Component | Latency | Status |
|-------|-----------|---------|--------|
| Pod → ResourceBinding | resource-detector | ~ms | ✅ Normal |
| ResourceBinding → RB.spec.clusters | karmada-scheduler | ~ms | ✅ Normal |
| ResourceBinding → Work | binding-controller | seconds to minutes | ⚠️ Bottleneck |

Root Cause Analysis

The binding-controller has several performance issues in the RB → Work path:

  1. Synchronous Work Creation

    • Each ResourceBinding reconcile blocks on Work creation
    • No decoupling between scheduling decision and persistence
    • Redundant reconciles while Works are still being created
  2. Inefficient API Call Pattern

    • Uses controllerutil.CreateOrUpdate, which always issues a Get before Create
    • For new Work objects: 2 API calls (Get + Create) instead of 1
    • No fast path to skip updates when the Work is unchanged (see the create-first sketch after this list)
  3. Unnecessary Orphan Work Checks

    • removeOrphanWorks() is called on every reconcile
    • Each check triggers a List API call via GetWorksByBindingID()
    • With 12645 RBs, this means 12645+ unnecessary List operations
  4. Sequential Multi-Cluster Work Creation

    • When distributing to N clusters, Work objects are created sequentially
    • For a dual-cluster scenario: 2 sequential API calls instead of 2 parallel ones (see the parallel-creation sketch after the API call analysis below)
  5. Excessive Event Recording

    • Records 2 Events per successful sync (binding + workload)
    • In high-throughput scenarios, this creates significant API load
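
For illustration, a minimal create-first sketch, assuming a controller-runtime client and the Karmada Work API. The helper name and the fields copied on update are placeholders, not the existing CreateOrUpdate code path:

```go
package workutil

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"

	workv1alpha1 "github.com/karmada-io/karmada/pkg/apis/work/v1alpha1"
)

// createFirstWork tries Create directly; only when the Work already exists
// does it fall back to Get + Update. For the common case (a brand-new Work)
// this halves the API calls compared to Get-then-Create.
func createFirstWork(ctx context.Context, c client.Client, desired *workv1alpha1.Work) error {
	err := c.Create(ctx, desired)
	if err == nil || !apierrors.IsAlreadyExists(err) {
		// Fast path: created with a single call, or a real error to surface.
		return err
	}

	// Slow path: the Work exists, so fetch the current object and update it.
	existing := &workv1alpha1.Work{}
	if err := c.Get(ctx, client.ObjectKeyFromObject(desired), existing); err != nil {
		return err
	}
	existing.Labels = desired.Labels
	existing.Annotations = desired.Annotations
	existing.Spec = desired.Spec
	return c.Update(ctx, existing)
}
```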

API Call Analysis (Dual-Cluster Distribution)

Before optimization (per ResourceBinding):

  ├── GetWorksByBindingID (List)     = 1 call
  ├── FetchResourceTemplate (Get)    = 1 call
  ├── Cluster A: CreateOrUpdateWork
  │   ├── Get (NotFound)             = 1 call
  │   └── Create                     = 1 call
  ├── Cluster B: CreateOrUpdateWork (sequential!)
  │   ├── Get (NotFound)             = 1 call
  │   └── Create                     = 1 call
  ├── Event(binding)                 = 1 call
  └── Event(workload)                = 1 call
  Total: 8 API calls (sequential)

After optimization (per ResourceBinding):

  ├── FetchResourceTemplate (Get)    = 1 call
  ├── Parallel:
  │   ├── Cluster A: Create          = 1 call
  │   └── Cluster B: Create          = 1 call
  └── Log (no API call)
  Total: 3 API calls (2 parallel)
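
A sketch of the parallel fan-out under the same assumptions, reusing the hypothetical createFirstWork helper above, so per-binding latency is bounded by the slowest cluster rather than the sum over all clusters:

```go
// Continuing the illustrative package above; additionally imports
// "golang.org/x/sync/errgroup".

// ensureWorksParallel creates or updates the Work for every target cluster
// concurrently; errgroup cancels the shared context on the first error and
// Wait returns that error.
func ensureWorksParallel(ctx context.Context, c client.Client, works []*workv1alpha1.Work) error {
	g, gctx := errgroup.WithContext(ctx)
	for _, w := range works {
		w := w // capture the loop variable (needed before Go 1.22)
		g.Go(func() error {
			return createFirstWork(gctx, c, w)
		})
	}
	return g.Wait()
}
```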

Configuration Bottleneck

We also discovered that the default RateLimiter configuration is too conservative:

Default: --rate-limiter-qps=10

For 6160 ResourceBindings at 10 QPS, the theoretical minimum processing time is 6160 / 10 = 616 seconds ≈ 10 minutes.
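
For illustration only (the exact flag wiring inside karmada-controller-manager may differ), client-go's token-bucket limiter makes the bound concrete: at 10 QPS the sustained write rate is capped regardless of how many reconcile workers run.

```go
package main

import (
	"fmt"

	"k8s.io/client-go/util/flowcontrol"
)

func main() {
	// A client-side token bucket at 10 QPS caps sustained writes at 10/s,
	// so a backlog of 6160 ResourceBindings needs at least ~616s to drain,
	// no matter how many workers run in parallel.
	const qps, burst = 10, 100 // burst value is an assumption for illustration
	limiter := flowcontrol.NewTokenBucketRateLimiter(qps, burst)
	_ = limiter

	backlog := 6160.0
	fmt.Printf("theoretical minimum drain time: %.0fs\n", backlog/qps)
}
```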

Expected Improvements

| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| New Work API calls | 2 per Work | 1 per Work | 50% |
| Orphan check frequency | Every reconcile | Only on cluster change | 90%+ |
| Multi-cluster Work creation | Sequential | Parallel | Nx |
| Event recording | 2 per success | 0 per success | 100% |
| Total API calls per RB (dual-cluster) | ~8 (sequential) | ~3 (2 parallel) | 60%+ |
| Expected throughput | ~200 Work/s | ~1000+ Work/s | 5-10x |

Proposed Solution

  1. AsyncWorkCreator - Decouple Work creation from reconcile loop with async workers
  2. Assume Cache - Skip redundant reconciles for in-flight Work creation (similar to kube-scheduler's assumed-pod cache; a minimal sketch follows below)
  3. Create-First Pattern - Try Create before Get+Update
  4. Precise Orphan Detection - Use hash annotation to skip unchanged cluster checks
  5. Parallel Work Creation - Create Works for multiple clusters concurrently
  6. AsyncBinder for Scheduler - Async workers for RB/CRB patch operations

All optimizations should be behind feature flags for backward compatibility.
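
As a rough illustration of idea (2) only, a minimal assume-cache sketch (all names hypothetical, not the proposed implementation): remember which Works already have a creation in flight so repeat reconciles of the same ResourceBinding can skip them until the informer observes the created object.

```go
package workutil

import (
	"sync"
	"time"
)

// assumeCache records Works whose creation has been issued but not yet
// observed through the informer, so redundant reconciles can skip them.
type assumeCache struct {
	mu       sync.Mutex
	inFlight map[string]time.Time // key: bindingID + "/" + clusterName
}

func newAssumeCache() *assumeCache {
	return &assumeCache{inFlight: make(map[string]time.Time)}
}

// Assume marks a Work creation as in flight.
// It returns false if a creation for the same key is already pending.
func (c *assumeCache) Assume(key string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, pending := c.inFlight[key]; pending {
		return false
	}
	c.inFlight[key] = time.Now()
	return true
}

// Forget drops the entry once the Work is observed via the informer,
// or when the creation failed and must be retried.
func (c *assumeCache) Forget(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.inFlight, key)
}
```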

Labels

kind/feature: Categorizes issue or PR as related to a new feature.
