
[RFC] Multi-Dimensional Load Balancing with Request-Aware Metrics #51

@tobking-rain

Description

Summary

Enhance the router's load balancing by replacing simple request counting with multi-dimensional metrics that consider request length, KV cache utilization, and worker role (Prefill vs Decode).

The current load balancing has two key limitations:

- Single-dimension tracking: only the number of active requests is counted, ignoring differences in request size.
- No KV cache awareness: workers with high KV utilization keep receiving new requests.

Worker A: 2 requests (each 10K tokens) → load = 2
Worker B: 5 requests (each 100 tokens) → load = 5
The router picks A (2 < 5), but A is actually carrying 40x more tokens (20,000 vs 500)!
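The failure mode above can be sketched in a few lines. This is an illustrative comparison only; the worker names and the token counts come from the example, while the scoring helpers are hypothetical, not the router's actual API:

```python
# Count-based vs token-aware load scoring for the example workers.
workers = {
    "A": [10_000, 10_000],   # 2 requests, 10K tokens each
    "B": [100] * 5,          # 5 requests, 100 tokens each
}

def load_by_count(reqs: list[int]) -> int:
    return len(reqs)                  # what the router does today

def load_by_tokens(reqs: list[int]) -> int:
    return sum(reqs)                  # a size-aware alternative

by_count = min(workers, key=lambda w: load_by_count(workers[w]))
by_tokens = min(workers, key=lambda w: load_by_tokens(workers[w]))

print(by_count)    # "A" — count-based routing prefers A (2 < 5)
print(by_tokens)   # "B" — token-aware routing prefers B (500 < 20,000)
```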

Proposed Solution

Core Idea: Local Tracking + Remote Calibration

Load Perception:

┌───────────────────────────────────────────────────────────────────────┐
│                              Router                                   │
├───────────────────────────────────────────────────────────────────────┤
│  Local Tracking (real-time, per request)                              │
│  ├─ prompt      : active prompt tokens being processed                │
│  ├─ max_tok     : max output tokens requested                         │
│  ├─ kv_est      : estimated KV blocks to be consumed                  │
│  └─ req         : number of active requests                           │
│                                                                       │
│  Remote Calibration (async, every 5s, from /metrics)                  │
│  ├─ kv_usage    : real KV cache utilization (0.0~1.0)                 │
│  └─ wait        : requests waiting in queue                           │
└───────────────────────────────────────────────────────────────────────┘
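The per-worker state in the diagram could be held in a small structure like the following. Field names mirror the diagram; the class name, the `on_request_start` hook, and the EMA-free bookkeeping are assumptions for illustration, not the router's actual types:

```python
import time
from dataclasses import dataclass, field

@dataclass
class WorkerLoadState:
    # Local tracking (updated synchronously on every request)
    prompt: int = 0        # active prompt tokens being processed
    max_tok: int = 0       # sum of max output tokens requested
    kv_est: int = 0        # estimated KV blocks to be consumed
    req: int = 0           # number of active requests

    # Remote calibration (refreshed asynchronously from /metrics, e.g. every 5s)
    kv_usage: float = 0.0  # real KV cache utilization (0.0..1.0)
    wait: int = 0          # requests waiting in the worker's queue
    calibrated_at: float = field(default_factory=time.monotonic)

    def on_request_start(self, prompt_tokens: int, max_tokens: int, kv_blocks: int) -> None:
        """Local tracking: account for a newly routed request."""
        self.prompt += prompt_tokens
        self.max_tok += max_tokens
        self.kv_est += kv_blocks
        self.req += 1
```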

Role-Aware Load Calculation (Adaptive Log Compression)

Key insight: different metrics have vastly different scales. We normalize each one with adaptive log compression, where the base adapts via an EMA (Exponential Moving Average) of recent values:

Formula: ln(1 + value/base)
Where:  base = EMA(recent_values)
| Role | Formula | Variables Used |
|------|---------|----------------|
| Prefill | 0.6×ln(1+prompt/ema_p) + 0.3×ln(1+kv_est/ema_k) + 0.1×ln(1+req/ema_r) | prompt, kv_est, req |
| Decode | 0.4×ln(1+max_tok/ema_t) + 0.5×kv_usage + 0.1×ln(1+wait/ema_w) | max_tok, kv_usage, wait |
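The two role formulas can be sketched directly. The weights and the ln(1 + value/base) form come from the table; the EMA decay factor (0.1) and the zero-base guard are assumptions:

```python
import math

def log_compress(value: float, ema_base: float) -> float:
    """Adaptive log compression: ln(1 + value/base)."""
    return math.log1p(value / max(ema_base, 1e-9))  # guard against base == 0

class Ema:
    """Tracks base = EMA(recent_values); alpha is an assumed default."""
    def __init__(self, alpha: float = 0.1):
        self.alpha = alpha
        self.value = None
    def update(self, x: float) -> float:
        self.value = x if self.value is None else (1 - self.alpha) * self.value + self.alpha * x
        return self.value

def prefill_load(prompt, kv_est, req, ema_p, ema_k, ema_r) -> float:
    # Compute-bound: weight prompt tokens most heavily.
    return (0.6 * log_compress(prompt, ema_p)
            + 0.3 * log_compress(kv_est, ema_k)
            + 0.1 * log_compress(req, ema_r))

def decode_load(max_tok, kv_usage, wait, ema_t, ema_w) -> float:
    # Memory-bound: kv_usage is already in [0, 1], so it enters uncompressed.
    return (0.4 * log_compress(max_tok, ema_t)
            + 0.5 * kv_usage
            + 0.1 * log_compress(wait, ema_w))
```

A useful sanity check: when every metric equals its EMA base, each compressed term is ln 2, so a Prefill worker scores exactly ln 2 ≈ 0.693 regardless of absolute scale, which is the point of the adaptive base.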

Predictive Compensation (for calibration latency)

Remote calibration has 0-5s latency. We use local tracking to predict real KV usage between calibrations:

// Core formula:
predicted_kv = calibrated_kv + local_delta

// Where:
//   calibrated_kv = last value from /metrics
//   local_delta = tokens_since_calibration / block_size / capacity

// Example:
// T=0s: calibrated_kv = 30%, tokens_since = 0
// T=2s: request +3000 tokens → local_delta = 3000/16/10000 = 1.9%
//       predicted_kv = 30% + 1.9% = 31.9%
// T=5s: actual KV = 32% (very close to prediction!)
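The worked example above can be reproduced directly. The block_size of 16 tokens and the capacity of 10,000 KV blocks are the values implied by the example's arithmetic, not confirmed defaults, and the clamp to 100% is an added safety assumption:

```python
BLOCK_SIZE = 16        # tokens per KV block (implied by the example)
CAPACITY = 10_000      # total KV blocks on the worker (implied by the example)

def predicted_kv(calibrated_kv: float, tokens_since_calibration: int) -> float:
    """predicted_kv = calibrated_kv + local_delta, clamped to [0, 1]."""
    local_delta = tokens_since_calibration / BLOCK_SIZE / CAPACITY
    return min(calibrated_kv + local_delta, 1.0)

# T=2s: 3000 new tokens since the last /metrics scrape at 30% usage
print(round(predicted_kv(0.30, 3000), 5))  # 0.31875, i.e. ~31.9%
```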

To be Discussed

  1. Is chars/4 sufficient for prompt token estimation, or is a real tokenizer needed?
  2. Weight defaults for Prefill: Are (0.6/0.3/0.1) reasonable for compute-bound workloads?
  3. Weight defaults for Decode: Are (0.4/0.5/0.1) reasonable for memory-bound workloads?
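For discussion item 1, the heuristic in question is a few lines; the helper name and the 4-chars-per-token constant are the standard rough rule of thumb for English-like text, and the example string is illustrative:

```python
def estimate_tokens(prompt: str) -> int:
    # Cheap heuristic: ~4 characters per token for English-like text.
    # Code, CJK text, and whitespace-heavy prompts can deviate substantially,
    # which is exactly the trade-off discussion item 1 raises.
    return max(1, len(prompt) // 4)

print(estimate_tokens("The quick brown fox jumps over the lazy dog"))  # 10
```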
