
Commit efc0e62

Optimize snapshot documents and add flowcharts
Signed-off-by: handan-yxh <[email protected]>
1 parent bee7818 commit efc0e62

2 files changed: +16 -8 lines changed


docs/design/agent-scheduler.md

Lines changed: 16 additions & 8 deletions
@@ -87,19 +87,27 @@ The workflow diagram is shown as follows:
![](images/agent-scheduler/scheduling-queue-diagram.png)

## Snapshot maintenance
-To improve the overall scheduling throughput of the Agent scheduler, we optimized the Cache and Snapshot data structures and algorithms.
-1. **Generation-based Incremental Updates**: Each node in the cache maintains a monotonically increasing generation number. When a node's state changes (pod addition/removal, resource updates), its generation is incremented and the node is moved to the head of a doubly linked list. During snapshot updates, only nodes with generation numbers greater than the snapshot's current generation are processed.Significantly reduces the number of nodes that need to be processed for each update.
-2. **Dual Data Structure for Efficient Updates**:
-   - Doubly Linked List: Maintains nodes in recency order (most recently updated at head).
-   - Hash Map: Provides O(1) node lookup by name.
-   - Indexed Snapshot: Snapshot maintains maps for quick node position lookup.
-3. **Concurrent Access Support**: The read-write lock separation design supports multiple workers concurrently accessing and updating their individual snapshots. Each worker maintains an independent snapshot instance, avoiding contention during scheduling cycles.Significantly improves concurrency performance in multithreaded scenarios.
+During each scheduling cycle, plugins repeatedly read node-level state such as resources, pods, affinities, and NUMA topology. Directly accessing and locking the global cache for every scheduling decision would introduce excessive lock contention and severely limit throughput, especially when multiple scheduler workers run concurrently. To address this, the Agent Scheduler adopts a dual-layer data structure to optimize scheduling performance:
+
+1. **SchedulerCache (Primary Cache)**: Maintains the authoritative and up-to-date cluster state, including nodes, pods, and resource information. It is continuously updated by Kubernetes informers and shared across all scheduling workers.
+2. **Snapshot (Worker-local View)**: Each scheduling worker maintains an independent, read-only snapshot representing the cluster state at a specific point in time. Snapshots are initialized when workers start and are incrementally updated before each scheduling cycle.
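To make the two layers concrete, the following is a minimal Go sketch of how such a dual-layer structure might be laid out. It is illustrative only and not part of this commit; the type and field names (`SchedulerCache`, `Snapshot`, `nodeInfoListItem`, and so on) are assumptions rather than the project's actual identifiers.

```go
// Illustrative sketch only; type and field names are hypothetical.
package cache

import "sync"

// NodeInfo is a simplified stand-in for the per-node state that plugins read
// (resources, pods, affinities, NUMA topology, ...).
type NodeInfo struct {
	Name       string
	Generation int64 // bumped every time this node's state changes
}

// nodeInfoListItem wraps a NodeInfo as an element of a doubly linked list
// kept in recency order (most recently updated node at the head).
type nodeInfoListItem struct {
	info *NodeInfo
	prev *nodeInfoListItem
	next *nodeInfoListItem
}

// SchedulerCache is the primary cache: the authoritative cluster state,
// updated by informers and shared by all scheduling workers.
type SchedulerCache struct {
	mu         sync.RWMutex
	headNode   *nodeInfoListItem            // most recently updated node
	nodes      map[string]*nodeInfoListItem // O(1) lookup by node name
	generation int64                        // last generation number handed out
}

// Snapshot is a worker-local, read-only view taken from the cache and
// incrementally refreshed before each scheduling cycle.
type Snapshot struct {
	nodeInfoMap map[string]*NodeInfo // indexed for fast node access
	generation  int64                // cache generation at the last sync
}

// NewSnapshot returns an empty snapshot that has never been synced.
func NewSnapshot() *Snapshot {
	return &Snapshot{nodeInfoMap: map[string]*NodeInfo{}}
}
```

In this layout the cache is guarded by a read-write mutex and shared by all workers, while each worker owns exactly one snapshot that it refreshes at the start of a scheduling cycle.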
+
+### Key Design Mechanisms
+
+1. **Generation-based Incremental Updates**: Each node in the cache tracks a monotonically increasing generation number. Whenever a node’s state changes (e.g., node/pod addition/removal or resource updates), its generation is incremented and the node is moved to the head of a doubly linked list. During snapshot synchronization, only nodes with generation numbers greater than the snapshot’s current generation are processed, significantly reducing the update cost.
+2. **Dual Data Structures for Efficient Access**:
+   - Doubly Linked List: Maintains nodes in recency order, with the most recently updated nodes at the head.
+   - Hash Map: Enables O(1) lookup of nodes by name.
+   - Indexed Snapshot: Snapshots maintain internal maps to support fast node access and position tracking.
+3. **Concurrent Access Optimization**: The separation of read-write locks and worker-local snapshots allows multiple scheduler workers to operate concurrently without contending on shared state. Each worker updates and reads its own snapshot instance, substantially improving scalability and throughput in multi-threaded scheduling scenarios.
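Continuing the same hypothetical types, the sketch below shows one way these three mechanisms could fit together: every update bumps a global generation counter and moves the node to the head of the list, and snapshot synchronization walks the list from the head, stopping at the first node that is no newer than the snapshot.

```go
// Continues the hypothetical types from the previous sketch.

// updateNode is what an informer event handler would call when a node's
// state changes: bump its generation and move it to the head of the list.
func (c *SchedulerCache) updateNode(info *NodeInfo) {
	c.mu.Lock()
	defer c.mu.Unlock()

	item, ok := c.nodes[info.Name]
	if !ok {
		item = &nodeInfoListItem{}
		c.nodes[info.Name] = item
	}
	c.generation++
	info.Generation = c.generation
	item.info = info
	c.moveToHead(item)
}

// moveToHead unlinks item from its current position and re-links it at the
// head, keeping the list ordered by recency of update.
func (c *SchedulerCache) moveToHead(item *nodeInfoListItem) {
	if c.headNode == item {
		return
	}
	if item.prev != nil {
		item.prev.next = item.next
	}
	if item.next != nil {
		item.next.prev = item.prev
	}
	item.prev = nil
	item.next = c.headNode
	if c.headNode != nil {
		c.headNode.prev = item
	}
	c.headNode = item
}

// UpdateSnapshot incrementally refreshes a worker-local snapshot. Walking
// from the head (most recently updated node), it stops at the first node
// whose generation is not newer than the snapshot's, so unchanged nodes are
// never touched.
func (c *SchedulerCache) UpdateSnapshot(s *Snapshot) {
	c.mu.RLock()
	defer c.mu.RUnlock()

	for item := c.headNode; item != nil; item = item.next {
		if item.info.Generation <= s.generation {
			break // older nodes are already reflected in the snapshot
		}
		clone := *item.info // copy so the worker reads a stable view
		s.nodeInfoMap[clone.Name] = &clone
	}
	s.generation = c.generation
}
```

A worker would refresh its snapshot this way once per scheduling cycle, so the cost of each refresh scales with the number of nodes that changed since the previous cycle rather than with the cluster size.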
+
+The interaction flow between Cache and Snapshot is shown in the following diagram:
+![](images/agent-scheduler/cache-snapshot-design.png)

## Multi-Worker scheduling
A single scheduling process becomes a performance bottleneck when a large number of Pods need to be scheduled. To improve scheduling throughput, multiple workers can be enabled to perform parallel scheduling. The worker count can be configured via the startup parameter `agent_scheduler_worker_count=x`.
Parallel scheduling may introduce scheduling conflicts when the cluster lacks resources, so the Binder component is involved to resolve conflicts before executing the actual binding.

-
Workers pop Pods from the scheduling queue and perform scheduling. After predicates and node ordering, multiple candidate nodes (configurable in number) are stored in the scheduling result for allocation. The scheduling results are then passed to the Binder for final binding. The Binder processes allocation results from multiple workers, using optimistic concurrency control to resolve scheduling conflicts and executing Bind for non-conflicting results.

![](images/agent-scheduler/binder-flow.png)
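As a rough illustration of the worker-to-Binder hand-off described above, here is a small Go sketch. The names (`SchedulingResult`, `Binder`, `Resolve`) are hypothetical, and the conflict check is reduced to a whole-node reservation; a real binder would compare the resources requested by each allocation against what the node still has available.

```go
// Illustrative sketch of the worker/Binder hand-off; all names are
// hypothetical and the conflict check is deliberately simplified.
package binder

// SchedulingResult is what a worker produces for one Pod: an ordered list of
// candidate nodes that passed predicates and node ordering.
type SchedulingResult struct {
	PodUID     string
	Candidates []string // best candidate first
}

// Binder serializes the final step. It optimistically accepts results from
// many workers and rejects the ones that conflict with what it has already
// committed in the current round.
type Binder struct {
	// reserved records nodes already claimed in this round. A real
	// implementation would track remaining resources per node instead of
	// reserving whole nodes.
	reserved map[string]bool
}

// Resolve returns the first candidate node that does not conflict with
// previously accepted results, or "" if every candidate conflicts and the
// Pod has to be scheduled again.
func (b *Binder) Resolve(r SchedulingResult) string {
	if b.reserved == nil {
		b.reserved = map[string]bool{}
	}
	for _, node := range r.Candidates {
		if !b.reserved[node] {
			b.reserved[node] = true // claim before issuing the real Bind call
			return node
		}
	}
	return ""
}
```

Keeping several candidate nodes per result lets the Binder fall back to the next candidate when the best one conflicts, instead of sending the Pod back through a full scheduling cycle.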