Commit b5d8486

feat: add support for routing-profiles (#1944)

Signed-off-by: varungupta <varungup90@gmail.com>

1 parent 304ffd6 · commit b5d8486

24 files changed: +1499 −293 lines

development/app/config/mock/config-profile.yaml
Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mock-qwen3-8b
  labels:
    model.aibrix.ai/name: "qwen3-8b"
    model.aibrix.ai/port: "8000"
    adapter.model.aibrix.ai/enabled: "true"
spec:
  replicas: 1
  selector:
    matchLabels:
      adapter.model.aibrix.ai/enabled: "true"
      model.aibrix.ai/name: "qwen3-8b"
      app: "mock-qwen3-8b"
  template:
    metadata:
      labels:
        adapter.model.aibrix.ai/enabled: "true"
        model.aibrix.ai/name: "qwen3-8b"
        app: "mock-qwen3-8b"
      annotations:
        model.aibrix.ai/config: |
          {
            "defaultProfile": "least-request",
            "profiles": {
              "least-request": {
                "routingStrategy": "least-request"
              },
              "throughput": {
                "routingStrategy": "throughput"
              }
            }
          }
    spec:
      serviceAccountName: mocked-app-sa
      containers:
        - name: llm-engine
          image: aibrix/vllm-mock:nightly
          command:
            - python3
            - app.py
            - --api_key
            - test-key-1234567890

development/app/config/mock/kustomization.yaml

Lines changed: 1 addition & 0 deletions
@@ -1,6 +1,7 @@
 resources:
 - ../templates/deployment
 - components.yaml
+- config-profile.yaml

 # enable following patch when we test lora + api-key
 patches:

docs/source/designs/model-config-profiles.rst
Lines changed: 162 additions & 0 deletions
@@ -0,0 +1,162 @@
.. _model_config_profiles:

=========================
Model Config and Profiles
=========================

This design describes how to supply **model/gateway configuration** (routing strategy, PD bucket bounds, combined mode, etc.) via a **single annotation** (or ConfigMap), with support for **multiple named profiles** selectable at **runtime** by the client.

Motivation
----------

Today, options are encoded as many pod labels (e.g. ``model.aibrix.ai/name``, ``model.aibrix.ai/port``, ``model.aibrix.ai/routing-strategy``, ``prompt-min-length``, etc.). Adding new options requires new labels and gateway changes to read them. This does not scale. Using a single structured annotation with **multiple profiles** allows:

* One place to add new options (extend the JSON schema).
* Different configurations for the same model (e.g. ``default``, ``pd``, ``low-latency``), selectable per request via a header.

Overview
--------

* **Annotation** (on the pod): ``model.aibrix.ai/config`` holds a JSON object with a ``profiles`` map. Each profile is a set of gateway options: ``routingStrategy``, ``promptLenBucketMinLength``, ``promptLenBucketMaxLength``, ``combined``.
* **Runtime selection**: the client sends the header ``config-profile: <profile-name>`` (e.g. ``pd``, ``low-latency``). If omitted, the ``defaultProfile`` (or ``"default"``) is used.

JSON Schema (Implementation)
----------------------------

The implementation parses the following structure. Extra fields in the JSON (e.g. ``name``, ``port``, ``engine``) are ignored.

Root object:

* ``defaultProfile`` (string, optional): Profile name to use when the header is empty or the named profile is not found. Default: ``"default"``.
* ``profiles`` (object, required): Map of profile name → profile object.

Profile object (``ModelConfigProfile``):

* ``routingStrategy`` (string): e.g. ``random``, ``pd``, ``least-latency``.
* ``promptLenBucketMinLength`` (int, optional): Lower bound for bucketing. Default: ``0``. Negative values are normalized to ``0``.
* ``promptLenBucketMaxLength`` (int, optional): Upper bound for bucketing. Default: ``math.MaxInt32`` when ``0`` or omitted.
* ``combined`` (bool, optional): When true, indicates a combined prefill/decode pod for PD routing.

Single profile (backward compatible):

.. code-block:: json

   {
     "profiles": {
       "default": {
         "routingStrategy": "pd",
         "promptLenBucketMinLength": 0,
         "promptLenBucketMaxLength": 2048
       }
     }
   }

Multiple profiles with a default:

.. code-block:: json

   {
     "defaultProfile": "pd",
     "profiles": {
       "default": {
         "routingStrategy": "random",
         "promptLenBucketMinLength": 0,
         "promptLenBucketMaxLength": 4096
       },
       "pd": {
         "routingStrategy": "pd",
         "promptLenBucketMinLength": 0,
         "promptLenBucketMaxLength": 2048
       },
       "low-latency": {
         "routingStrategy": "least-latency",
         "promptLenBucketMinLength": 0,
         "promptLenBucketMaxLength": 2048
       }
     }
   }
Runtime Behavior
----------------

1. The gateway resolves config from the pod annotation ``model.aibrix.ai/config`` (ConfigMap lookup is not yet implemented). If there is no annotation, it falls back to the existing label-based resolution.
2. The gateway reads ``config-profile`` from the request headers. If missing, it uses ``defaultProfile`` from the JSON, or ``"default"``.
3. The gateway selects the profile via ``GetProfile(profileName)``: exact match first, then fallback to ``defaultProfile``, then ``"default"``.
4. The resolved profile is stored on ``RoutingContext.ConfigProfile`` (``ResolvedConfigProfile``) for the request.
5. The routing strategy is derived from: request headers → ``ConfigProfile.RoutingStrategy`` → env ``ROUTING_ALGORITHM``.
6. The PD router uses ``ResolveProfileFromPod(pod, routingCtx.ReqConfigProfile)`` with fallback to the default profile; prompt bounds and ``combined`` are read from the selected profile.
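The profile-selection fallback in step 3 can be sketched as follows (a simplified standalone sketch; the real ``GetProfile`` is part of the ``configprofiles`` package and its types carry more fields):

```go
package main

import "fmt"

type ModelConfigProfile struct {
	RoutingStrategy string
}

type ModelConfigProfiles struct {
	DefaultProfile string
	Profiles       map[string]ModelConfigProfile
}

// GetProfile tries an exact match on name, then falls back to
// DefaultProfile, then to the literal "default". Returns nil if none match.
func (c *ModelConfigProfiles) GetProfile(name string) *ModelConfigProfile {
	for _, candidate := range []string{name, c.DefaultProfile, "default"} {
		if candidate == "" {
			continue // empty header or unset defaultProfile
		}
		if p, ok := c.Profiles[candidate]; ok {
			return &p
		}
	}
	return nil
}

func main() {
	cfg := &ModelConfigProfiles{
		DefaultProfile: "pd",
		Profiles: map[string]ModelConfigProfile{
			"default": {RoutingStrategy: "random"},
			"pd":      {RoutingStrategy: "pd"},
		},
	}
	fmt.Println(cfg.GetProfile("pd").RoutingStrategy)      // exact match: pd
	fmt.Println(cfg.GetProfile("missing").RoutingStrategy) // falls back to defaultProfile: pd
	fmt.Println(cfg.GetProfile("").RoutingStrategy)        // empty header → defaultProfile: pd
}
```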
Annotation Example (StormService pod template)
----------------------------------------------

.. code-block:: yaml

   template:
     metadata:
       labels:
         app: sglang-qwen3-8b-1p1d-0-2k
         model.aibrix.ai/name: qwen3-8B
       annotations:
         prometheus.io/scrape: "true"
         prometheus.io/port: "30000"
         prometheus.io/path: "/metrics"
         model.aibrix.ai/config: |
           {
             "defaultProfile": "pd",
             "profiles": {
               "default": {
                 "routingStrategy": "random",
                 "promptLenBucketMinLength": 0,
                 "promptLenBucketMaxLength": 4096
               },
               "pd": {
                 "routingStrategy": "pd",
                 "promptLenBucketMinLength": 0,
                 "promptLenBucketMaxLength": 2048
               }
             }
           }

Client Usage
------------

* Use the default profile: do not set any header (or set ``config-profile: default``).
* Use a specific profile: set the header ``config-profile: pd`` or ``config-profile: low-latency``.
Implementation
--------------

Package: ``pkg/plugins/gateway/configprofiles/``

* ``ModelConfigProfile``: struct with ``RoutingStrategy``, ``PromptLenBucketMinLength``, ``PromptLenBucketMaxLength``, ``Combined``.
* ``ModelConfigProfiles``: struct with ``DefaultProfile`` and ``Profiles map[string]ModelConfigProfile``.
* ``ParseModelConfig(jsonStr)``: parses the JSON and normalizes the prompt length bounds: ``promptLenBucketMinLength`` (< 0 → 0) and ``promptLenBucketMaxLength`` (0 → ``math.MaxInt32``).
* ``GetProfile(name)``: returns the profile by name; falls back to ``defaultProfile``, then ``"default"``.
* ``ResolveProfile(pods, headerProfile)``: iterates pods and returns the first non-nil result from ``ResolveProfileFromPod``.
* ``ResolveProfileFromPod(pod, headerProfile)``: reads ``model.aibrix.ai/config`` from the pod, parses it, and returns ``GetProfile(headerProfile)``.

Constants: ``ModelAnnoConfig`` (pkg/constants/model.go), ``HeaderConfigProfile`` (pkg/plugins/gateway/types.go).

Gateway flow:

* ``HandleRequestHeaders``: captures ``config-profile`` into ``ReqConfigProfile``.
* ``HandleRequestBody``: calls ``applyConfigProfile``, which resolves the config from the pod annotation, sets ``routingCtx.ConfigProfile``, and provides the routing strategy to ``deriveRoutingStrategyFromContext``.
* ``deriveRoutingStrategyFromContext``: chooses the routing strategy for the request using this precedence: (1) the request header ``routing-strategy``, if present and non-empty; (2) ``routingCtx.ConfigProfile.RoutingStrategy`` from the resolved profile (config-profile header + pod annotation); (3) the environment default. It returns the strategy and whether it was explicitly set (used to validate and set ``routingCtx.Algorithm`` in ``HandleRequestBody``).
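The precedence in ``deriveRoutingStrategyFromContext`` can be sketched as a pure function. The signature here is simplified for illustration (the real function takes the routing context rather than plain strings):

```go
package main

import "fmt"

// deriveRoutingStrategy sketches the documented precedence:
// request header → resolved config profile → environment default.
// The boolean reports whether the strategy was explicitly requested.
func deriveRoutingStrategy(headerStrategy, profileStrategy, envDefault string) (string, bool) {
	if headerStrategy != "" {
		return headerStrategy, true // (1) routing-strategy request header
	}
	if profileStrategy != "" {
		return profileStrategy, true // (2) resolved config profile
	}
	return envDefault, false // (3) env ROUTING_ALGORITHM fallback
}

func main() {
	s, explicit := deriveRoutingStrategy("least-request", "pd", "random")
	fmt.Println(s, explicit) // least-request true

	s, explicit = deriveRoutingStrategy("", "pd", "random")
	fmt.Println(s, explicit) // pd true

	s, explicit = deriveRoutingStrategy("", "", "random")
	fmt.Println(s, explicit) // random false
}
```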
PD router:

* ``isPodSuitableForPromptLength(routingCtx, pod, promptLength)``: uses ``ResolveProfileFromPod(pod, routingCtx.ReqConfigProfile)`` for ``promptLenBucketMinLength``/``promptLenBucketMaxLength``.
* ``isCombinedPod(routingCtx, pod)``: uses ``ResolveProfileFromPod(pod, routingCtx.ReqConfigProfile)`` for ``combined``.

Backward Compatibility
----------------------

If no annotation is present, ``ResolveProfile`` returns nil and the gateway continues to use the existing pod labels and environment variables for routing strategy, port, engine, etc.

Future Work
-----------

* ConfigMap lookup (wire it in when the gateway config supports it).
* Extend the profile schema with ``port``, ``metricPort``, ``engine``, and ``name`` for full parity with labels.
* Use the request-level ``ConfigProfile`` (from ``config-profile``) for PD bucketing instead of the per-pod ``"pd"`` profile.

pkg/constants/model.go

Lines changed: 5 additions & 0 deletions
@@ -45,4 +45,9 @@ const (
 	// ModelAnnoRouterCustomPath is the anno for add PathPrefixes in httpRoute, split by comma
 	// Example: "model.aibrix.ai/model-router-custom-paths": "/score,/version"
 	ModelAnnoRouterCustomPath = "model.aibrix.ai/model-router-custom-paths"
+
+	// ModelAnnoConfig is the annotation holding JSON model config with multiple profiles.
+	// Clients select a profile at runtime via the config-profile header; otherwise defaultProfile is used.
+	// See docs/source/designs/model-config-profiles.rst for the schema.
+	ModelAnnoConfig = "model.aibrix.ai/config"
 )

pkg/plugins/gateway/algorithms/pd_disaggregation.go

Lines changed: 27 additions & 31 deletions
@@ -33,6 +33,7 @@ import (
 	"github.com/vllm-project/aibrix/pkg/cache"
 	"github.com/vllm-project/aibrix/pkg/constants"
 	"github.com/vllm-project/aibrix/pkg/metrics"
+	"github.com/vllm-project/aibrix/pkg/plugins/gateway/configprofiles"
 	"github.com/vllm-project/aibrix/pkg/types"
 	"github.com/vllm-project/aibrix/pkg/utils"
 	"github.com/vllm-project/aibrix/pkg/utils/prefixcacheindexer"
@@ -50,11 +51,10 @@ const (
 	LLMEngineIdentifier string = constants.ModelLabelEngine
 	PDRoleSetIdentifier string = "roleset-name"
 	PDRoleIdentifier    string = "role-name"
-	CombinedIdentifier  string = "model.aibrix.ai/combined"
 	RoleReplicaIndex    string = "stormservice.orchestration.aibrix.ai/role-replica-index"
 	PodGroupIndex       string = "stormservice.orchestration.aibrix.ai/pod-group-index"
-	PromptMinLength     string = "prompt-min-length"
-	PromptMaxLength     string = "prompt-max-length"
+	PromptLenBucketMinLength string = "prompt-len-bucket-min-length"
+	PromptLenBucketMaxLength string = "prompt-len-bucket-max-length"
 	defaultPrefillRequestTimeout int = 30

 	defaultMaxRequest float64 = 32
@@ -73,6 +73,9 @@ const (
 	// KV connector types for different backends
 	KVConnectorTypeSHFS = "shfs" // Default - AIBrix SHFS/KVCacheManager (GPU)
 	KVConnectorTypeNIXL = "nixl" // NIXL for Neuron (uses disagg_prefill_resp wrapper)
+
+	HeaderPrefillTargetPodIP = "prefill-target-pod-ip"
+	HeaderPrefillTargetPod   = "prefill-target-pod"
 )

 var (
@@ -172,6 +175,11 @@ func (r *pdRouter) Route(ctx *types.RoutingContext, readyPodList types.PodList)

 	if prefillPod != nil {
 		klog.InfoS("selected prefill/decode pods", "request_id", ctx.RequestID, "prefill_pod", prefillPod.Name, "decode_pod", decodePod.Name)
+		if ctx.RespHeaders == nil {
+			ctx.RespHeaders = make(map[string]string)
+		}
+		ctx.RespHeaders[HeaderPrefillTargetPod] = prefillPod.Name
+		ctx.RespHeaders[HeaderPrefillTargetPodIP] = prefillPod.Status.PodIP
 		err = r.doPrefillRequest(ctx, prefillPod, llmEngine)
 		if err != nil {
 			metrics.EmitMetricToPrometheus(ctx, nil, metrics.GatewayPrefillRequestFailTotal, &metrics.SimpleMetricValue{Value: 1.0},
@@ -203,7 +211,7 @@ func (r *pdRouter) filterPrefillDecodePods(routingCtx *types.RoutingContext, rea
 		klog.V(4).InfoS("prompt length based filtering enabled", "request_id", routingCtx.RequestID, "prompt_length", promptLength)
 	}

-	prefillPods, decodePods, promptLengthBucketingPrefillPods, promptLengthBucketingDecodePods, combinedPods := r.collectAndBucketPods(readyPods, promptLength)
+	prefillPods, decodePods, promptLengthBucketingPrefillPods, promptLengthBucketingDecodePods, combinedPods := r.collectAndBucketPods(routingCtx, readyPods, promptLength)
 	combinedAvailable := aibrixPromptLengthBucketing && len(combinedPods) > 0
 	if len(prefillPods) == 0 && !combinedAvailable {
 		return nil, nil, fmt.Errorf("prefill pods are not ready: prefill=%d, decode=%d", len(prefillPods), len(decodePods))
@@ -932,8 +940,12 @@ func (t *PrefillRequestTracker) GetPrefillRequestCountsForPod(podname string) in
 	return int(countInterface.(*atomic.Int32).Load())
 }

-func (r *pdRouter) isPodSuitableForPromptLength(pod *v1.Pod, promptLength int) bool {
-	minLength, maxLength := r.getPodPromptRange(pod)
+func (r *pdRouter) isPodSuitableForPromptLength(routingCtx *types.RoutingContext, pod *v1.Pod, promptLength int) bool {
+	profile := configprofiles.ResolveProfileFromPod(pod, routingCtx.ReqConfigProfile)
+	if profile == nil {
+		return false
+	}
+	minLength, maxLength := profile.PromptLenBucketMinLength, profile.PromptLenBucketMaxLength

 	if minLength > maxLength {
 		return false
@@ -946,31 +958,15 @@ func (r *pdRouter) isPodSuitableForPromptLength(pod *v1.Pod, promptLength int) b
 	return promptLength >= minLength && promptLength <= maxLength
 }

-// getPodPromptRange retrieves the minimum and maximum prompt lengths from pod labels.
-func (r *pdRouter) getPodPromptRange(pod *v1.Pod) (int, int) {
-	minLength := 0
-	maxLength := math.MaxInt32
-
-	if val, ok := pod.Labels[PromptMinLength]; ok {
-		if parsed, err := strconv.Atoi(val); err == nil {
-			minLength = parsed
-		}
-	}
-
-	if val, ok := pod.Labels[PromptMaxLength]; ok {
-		if parsed, err := strconv.Atoi(val); err == nil {
-			maxLength = parsed
-		}
+func isCombinedPod(routingCtx *types.RoutingContext, pod *v1.Pod) bool {
+	profile := configprofiles.ResolveProfileFromPod(pod, routingCtx.ReqConfigProfile)
+	if profile == nil {
+		return false
 	}
-
-	return minLength, maxLength
-}
-
-func isCombinedPod(pod *v1.Pod) bool {
-	return pod != nil && pod.Labels[CombinedIdentifier] == "true"
+	return profile.Combined
 }

-func (r *pdRouter) collectAndBucketPods(readyPods []*v1.Pod, promptLength int) ([]*v1.Pod, []*v1.Pod, []*v1.Pod, []*v1.Pod, []*v1.Pod) {
+func (r *pdRouter) collectAndBucketPods(routingCtx *types.RoutingContext, readyPods []*v1.Pod, promptLength int) ([]*v1.Pod, []*v1.Pod, []*v1.Pod, []*v1.Pod, []*v1.Pod) {
 	prefillPods, decodePods := []*v1.Pod{}, []*v1.Pod{}
 	promptLengthBucketingPrefillPods, promptLengthBucketingDecodePods, promptLengthBucketingCombinedPods := []*v1.Pod{}, []*v1.Pod{}, []*v1.Pod{}

@@ -991,16 +987,16 @@ func (r *pdRouter) collectAndBucketPods(readyPods []*v1.Pod, promptLength int) (
 		switch pod.Labels[PDRoleIdentifier] {
 		case "prefill":
 			prefillPods = append(prefillPods, pod)
-			if aibrixPromptLengthBucketing && r.isPodSuitableForPromptLength(pod, promptLength) {
+			if aibrixPromptLengthBucketing && r.isPodSuitableForPromptLength(routingCtx, pod, promptLength) {
 				promptLengthBucketingPrefillPods = append(promptLengthBucketingPrefillPods, pod)
 			}
 		case "decode":
 			decodePods = append(decodePods, pod)
-			if aibrixPromptLengthBucketing && r.isPodSuitableForPromptLength(pod, promptLength) {
+			if aibrixPromptLengthBucketing && r.isPodSuitableForPromptLength(routingCtx, pod, promptLength) {
 				promptLengthBucketingDecodePods = append(promptLengthBucketingDecodePods, pod)
 			}
 		default:
-			if aibrixPromptLengthBucketing && isCombinedPod(pod) && r.isPodSuitableForPromptLength(pod, promptLength) {
+			if aibrixPromptLengthBucketing && isCombinedPod(routingCtx, pod) && r.isPodSuitableForPromptLength(routingCtx, pod, promptLength) {
 				promptLengthBucketingCombinedPods = append(promptLengthBucketingCombinedPods, pod)
 			}
 		}