
[RFC] Multi-Dimensional Load Balancing with Request-Aware Metrics #51

@tobking-rain

Description

Summary

Enhance the router's load balancing by replacing simple request counting with multi-dimensional metrics that consider request length, KV cache utilization, and worker role (Prefill vs Decode).

The current load balancing has two key limitations:

- Single-dimension tracking: only the number of active requests is counted, ignoring differences in request size.
- No KV cache awareness: workers with high KV utilization keep receiving new requests.

Worker A: 2 requests (each 10K tokens) → load = 2
Worker B: 5 requests (each 100 tokens) → load = 5
The router picks A (2 < 5), but A is actually carrying 40x more tokens (20,000 vs 500)!
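The failure mode above can be sketched in a few lines. This is an illustrative comparison only; the worker names and the token counts come from the example, while the scoring helpers are hypothetical, not the router's actual API:

```python
# Count-based vs token-aware load scoring for the example workers.
workers = {
    "A": [10_000, 10_000],   # 2 requests, 10K tokens each
    "B": [100] * 5,          # 5 requests, 100 tokens each
}

def load_by_count(reqs: list[int]) -> int:
    return len(reqs)                  # what the router does today

def load_by_tokens(reqs: list[int]) -> int:
    return sum(reqs)                  # a size-aware alternative

by_count = min(workers, key=lambda w: load_by_count(workers[w]))
by_tokens = min(workers, key=lambda w: load_by_tokens(workers[w]))

print(by_count)    # "A" — count-based routing prefers A (2 < 5)
print(by_tokens)   # "B" — token-aware routing prefers B (500 < 20,000)
```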

Proposed Solution

Core Idea: Local Tracking + Remote Calibration

Load Perception:

┌───────────────────────────────────────────────────────────────────────┐
│                              Router                                   │
├───────────────────────────────────────────────────────────────────────┤
│  Local Tracking (real-time, per request)                              │
│  ├─ prompt      : active prompt tokens being processed                │
│  ├─ max_tok     : max output tokens requested                         │
│  ├─ kv_est      : estimated KV blocks to be consumed                  │
│  └─ req         : number of active requests                           │
│                                                                       │
│  Remote Calibration (async, every 5s, from /metrics)                  │
│  ├─ kv_usage    : real KV cache utilization (0.0~1.0)                 │
│  └─ wait        : requests waiting in queue                           │
└───────────────────────────────────────────────────────────────────────┘
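The per-worker state in the diagram could be held in a small structure like the following. Field names mirror the diagram; the class name, the `on_request_start` hook, and the EMA-free bookkeeping are assumptions for illustration, not the router's actual types:

```python
import time
from dataclasses import dataclass, field

@dataclass
class WorkerLoadState:
    # Local tracking (updated synchronously on every request)
    prompt: int = 0        # active prompt tokens being processed
    max_tok: int = 0       # sum of max output tokens requested
    kv_est: int = 0        # estimated KV blocks to be consumed
    req: int = 0           # number of active requests

    # Remote calibration (refreshed asynchronously from /metrics, e.g. every 5s)
    kv_usage: float = 0.0  # real KV cache utilization (0.0..1.0)
    wait: int = 0          # requests waiting in the worker's queue
    calibrated_at: float = field(default_factory=time.monotonic)

    def on_request_start(self, prompt_tokens: int, max_tokens: int, kv_blocks: int) -> None:
        """Local tracking: account for a newly routed request."""
        self.prompt += prompt_tokens
        self.max_tok += max_tokens
        self.kv_est += kv_blocks
        self.req += 1
```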

Role-Aware Load Calculation (Adaptive Log Compression)

Key insight: different metrics have vastly different scales. We normalize each one with adaptive log compression, where the base adapts via an EMA (Exponential Moving Average) of recent values:

Formula: ln(1 + value/base)
Where:  base = EMA(recent_values)
| Role | Formula | Variables Used |
|------|---------|----------------|
| Prefill | 0.6×ln(1+prompt/ema_p) + 0.3×ln(1+kv_est/ema_k) + 0.1×ln(1+req/ema_r) | prompt, kv_est, req |
| Decode | 0.4×ln(1+max_tok/ema_t) + 0.5×kv_usage + 0.1×ln(1+wait/ema_w) | max_tok, kv_usage, wait |
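The two role formulas can be sketched directly. The weights and the ln(1 + value/base) form come from the table; the EMA decay factor (0.1) and the zero-base guard are assumptions:

```python
import math

def log_compress(value: float, ema_base: float) -> float:
    """Adaptive log compression: ln(1 + value/base)."""
    return math.log1p(value / max(ema_base, 1e-9))  # guard against base == 0

class Ema:
    """Tracks base = EMA(recent_values); alpha is an assumed default."""
    def __init__(self, alpha: float = 0.1):
        self.alpha = alpha
        self.value = None
    def update(self, x: float) -> float:
        self.value = x if self.value is None else (1 - self.alpha) * self.value + self.alpha * x
        return self.value

def prefill_load(prompt, kv_est, req, ema_p, ema_k, ema_r) -> float:
    # Compute-bound: weight prompt tokens most heavily.
    return (0.6 * log_compress(prompt, ema_p)
            + 0.3 * log_compress(kv_est, ema_k)
            + 0.1 * log_compress(req, ema_r))

def decode_load(max_tok, kv_usage, wait, ema_t, ema_w) -> float:
    # Memory-bound: kv_usage is already in [0, 1], so it enters uncompressed.
    return (0.4 * log_compress(max_tok, ema_t)
            + 0.5 * kv_usage
            + 0.1 * log_compress(wait, ema_w))
```

A useful sanity check: when every metric equals its EMA base, each compressed term is ln 2, so a Prefill worker scores exactly ln 2 ≈ 0.693 regardless of absolute scale, which is the point of the adaptive base.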

Predictive Compensation (for calibration latency)

Remote calibration has 0-5s latency. We use local tracking to predict real KV usage between calibrations:

// Core formula:
predicted_kv = calibrated_kv + local_delta

// Where:
//   calibrated_kv = last value from /metrics
//   local_delta = tokens_since_calibration / block_size / capacity

// Example:
// T=0s: calibrated_kv = 30%, tokens_since = 0
// T=2s: request +3000 tokens → local_delta = 3000/16/10000 = 1.9%
//       predicted_kv = 30% + 1.9% = 31.9%
// T=5s: actual KV = 32% (very close to prediction!)
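The worked example above can be reproduced directly. The block_size of 16 tokens and the capacity of 10,000 KV blocks are the values implied by the example's arithmetic, not confirmed defaults, and the clamp to 100% is an added safety assumption:

```python
BLOCK_SIZE = 16        # tokens per KV block (implied by the example)
CAPACITY = 10_000      # total KV blocks on the worker (implied by the example)

def predicted_kv(calibrated_kv: float, tokens_since_calibration: int) -> float:
    """predicted_kv = calibrated_kv + local_delta, clamped to [0, 1]."""
    local_delta = tokens_since_calibration / BLOCK_SIZE / CAPACITY
    return min(calibrated_kv + local_delta, 1.0)

# T=2s: 3000 new tokens since the last /metrics scrape at 30% usage
print(round(predicted_kv(0.30, 3000), 5))  # 0.31875, i.e. ~31.9%
```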

To be Discussed

  1. Is chars/4 sufficient for prompt token estimation, or is a real tokenizer needed?
  2. Weight defaults for Prefill: Are (0.6/0.3/0.1) reasonable for compute-bound workloads?
  3. Weight defaults for Decode: Are (0.4/0.5/0.1) reasonable for memory-bound workloads?
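For discussion item 1, the heuristic in question is a few lines; the helper name and the 4-chars-per-token constant are the standard rough rule of thumb for English-like text, and the example string is illustrative:

```python
def estimate_tokens(prompt: str) -> int:
    # Cheap heuristic: ~4 characters per token for English-like text.
    # Code, CJK text, and whitespace-heavy prompts can deviate substantially,
    # which is exactly the trade-off discussion item 1 raises.
    return max(1, len(prompt) // 4)

print(estimate_tokens("The quick brown fox jumps over the lazy dog"))  # 10
```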
