Summary
Enhance the router's load balancing by replacing simple request counting with multi-dimensional metrics that consider request length, KV cache utilization, and worker role (Prefill vs. Decode).
The current load balancing has two main limitations:
- Single-dimension tracking: only counts requests, ignoring request size differences.
- No KV cache awareness: workers with high KV utilization keep receiving requests.
```
Worker A: 2 requests (each 10K tokens) → load = 2
Worker B: 5 requests (each 100 tokens) → load = 5
```
The router picks A (2 < 5), but A actually holds 20,000 active tokens versus B's 500, roughly 40× the work!
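The gap is easy to demonstrate; a minimal sketch (hypothetical helper names) contrasting request counting with token-weighted load:

```python
def naive_load(requests):
    # Current behavior: every active request counts as 1, regardless of size.
    return len(requests)

def token_load(requests):
    # Token-aware alternative: sum the prompt tokens of active requests.
    return sum(r["prompt_tokens"] for r in requests)

worker_a = [{"prompt_tokens": 10_000}] * 2  # 2 large requests → 20,000 tokens
worker_b = [{"prompt_tokens": 100}] * 5     # 5 tiny requests  →    500 tokens

# Request counting ranks A below B (2 < 5), yet A holds 40x more tokens.
```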
Proposed Solution
Core Idea: Local Tracking + Remote Calibration
Load Perception:
```
┌─────────────────────────────────────────────────────┐
│ Router                                              │
├─────────────────────────────────────────────────────┤
│ Local Tracking (real-time, per request)             │
│ ├─ prompt   : active prompt tokens being processed  │
│ ├─ max_tok  : max output tokens requested           │
│ ├─ kv_est   : estimated KV blocks to be consumed    │
│ └─ req      : number of active requests             │
│                                                     │
│ Remote Calibration (async, every 5s, from /metrics) │
│ ├─ kv_usage : real KV cache utilization (0.0~1.0)   │
│ └─ wait     : requests waiting in queue             │
└─────────────────────────────────────────────────────┘
```
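The tracked state could live in one small per-worker struct; a sketch with hypothetical field and method names, using the illustrative block size of 16:

```python
import time
from dataclasses import dataclass, field

@dataclass
class WorkerLoad:
    # Local tracking: updated synchronously on every request.
    prompt: int = 0        # active prompt tokens being processed
    max_tok: int = 0       # max output tokens requested
    kv_est: float = 0.0    # estimated KV blocks to be consumed
    req: int = 0           # number of active requests
    # Remote calibration: refreshed asynchronously (~every 5 s) from /metrics.
    kv_usage: float = 0.0  # real KV cache utilization, 0.0~1.0
    wait: int = 0          # requests waiting in the worker's queue
    last_calibrated: float = field(default_factory=time.monotonic)

    def on_request_start(self, prompt_tokens: int, max_tokens: int,
                         block_size: int = 16) -> None:
        # Worst case: prompt plus the full requested output lands in KV cache.
        self.prompt += prompt_tokens
        self.max_tok += max_tokens
        self.kv_est += (prompt_tokens + max_tokens) / block_size
        self.req += 1
```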
Role-Aware Load Calculation (Adaptive Log Compression)
Key insight: different metrics have vastly different scales, so each one is normalized before mixing. We use adaptive log compression, with an EMA (exponential moving average) as the normalization base:

Formula: ln(1 + value/base), where base = EMA(recent_values)
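A minimal sketch of this normalization, assuming a simple fixed-alpha EMA (class and parameter names are illustrative):

```python
import math

class EmaNorm:
    """Log-compress a raw metric relative to its recent magnitude:
    score = ln(1 + value / base), with base = EMA(recent values)."""

    def __init__(self, alpha: float = 0.1, init: float = 1.0):
        self.alpha = alpha  # EMA smoothing factor
        self.base = init    # running EMA of observed values

    def update(self, value: float) -> float:
        self.base = (1 - self.alpha) * self.base + self.alpha * value
        # Floor the base so a cold start or all-zero stream cannot divide by 0.
        return math.log1p(value / max(self.base, 1e-9))
```

Under a steady stream the base converges to the stream's value, so the score settles near ln 2 regardless of the metric's absolute scale, which is what makes the weighted mix across metrics meaningful.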
| Role | Formula | Variables Used |
|---|---|---|
| Prefill | 0.6×ln(1+prompt/ema_p) + 0.3×ln(1+kv_est/ema_k) + 0.1×ln(1+req/ema_r) | prompt, kv_est, req |
| Decode | 0.4×ln(1+max_tok/ema_t) + 0.5×kv_usage + 0.1×ln(1+wait/ema_w) | max_tok, kv_usage, wait |
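The two formulas translate directly into code; a sketch using the weights from the table, with the EMA baselines passed in as plain floats:

```python
import math

def prefill_load(prompt, kv_est, req, ema_p, ema_k, ema_r):
    # Prefill is compute-bound: prompt tokens dominate the score.
    return (0.6 * math.log1p(prompt / ema_p)
            + 0.3 * math.log1p(kv_est / ema_k)
            + 0.1 * math.log1p(req / ema_r))

def decode_load(max_tok, kv_usage, wait, ema_t, ema_w):
    # Decode is memory-bound: kv_usage is already in 0..1, so it enters
    # linearly instead of being log-compressed.
    return (0.4 * math.log1p(max_tok / ema_t)
            + 0.5 * kv_usage
            + 0.1 * math.log1p(wait / ema_w))
```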
Predictive Compensation (for calibration latency)
Remote calibration has 0-5s latency. We use local tracking to predict real KV usage between calibrations:
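A runnable sketch of the compensation arithmetic, assuming the illustrative block size of 16 and a capacity of 10,000 KV blocks:

```python
def predict_kv(calibrated_kv, tokens_since_calibration,
               block_size=16, capacity_blocks=10_000):
    # Extrapolate KV utilization between /metrics refreshes: every
    # block_size tokens consumes roughly one KV block out of capacity_blocks.
    local_delta = tokens_since_calibration / block_size / capacity_blocks
    return calibrated_kv + local_delta

# 3,000 new tokens, 2 s after a calibration that read 30% utilization:
# 3000 / 16 / 10000 = 0.01875, so the prediction lands near 31.9%.
```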
```
// Core formula:
predicted_kv = calibrated_kv + local_delta

// Where:
//   calibrated_kv = last value from /metrics
//   local_delta   = tokens_since_calibration / block_size / capacity

// Example:
//   T=0s: calibrated_kv = 30%, tokens_since = 0
//   T=2s: request +3000 tokens → local_delta = 3000/16/10000 ≈ 1.9%
//         predicted_kv = 30% + 1.9% = 31.9%
//   T=5s: actual KV = 32% (very close to prediction!)
```

To be Discussed
- Is `chars/4` sufficient for `prompt` token estimation, or do we need a tokenizer?
- Weight defaults for Prefill: are (0.6/0.3/0.1) reasonable for compute-bound workloads?
- Weight defaults for Decode: are (0.4/0.5/0.1) reasonable for memory-bound workloads?