Commit 7f838c4

vishakha-ramani, atantawi, and mamy-CS
authored
SLO-driven model-based scaling (#791)
* Basic model tuner functionality
* queue analyzer engine skeleton
* tuner integration init
* initial implementation of computeAllVariantCapacities
* Register queueing model metrics queries for arrival rate and max batch size
* integration v2
* wire queueing model analyzer to engine v1
* handle error cases
* set go version to 1.24.0
* code refactoring
* add comments and rearrange code
* fix openshift ci e2e gate for fork (#843)
  * fix openshift ci e2e gate for fork
  * clarify openshift e2e triggers
  * update comments to be explicit
* document wva release process (#845)
  * document wva release process
  * nit
* wire queueing model analyzer to engine v1
* set go version to 1.24.0
* add QueueingModelScalingConfig struct
* add QueueingModelConfigMapName constant and helper
* add config map functionality that lets user set config for queue model
* rename QueueingModel config types and functions to QMAnalyzer: rename types, struct fields, and functions to use the QMAnalyzer naming convention instead of QueueingModel for config-related code in the internal/config and internal/controller packages, plus callers in internal/engines/saturation
* rename the internal analyzer config type QueueingModelConfig to QMConfig for consistency with the QMAnalyzer naming used in the config/controller layer
* guess slo using all variants of model
* added unit tests for parameters and utils
* fix lint-and-test errors
* add model_name filtering to dispatch rate query and use 1m rate window
* added pod based tuning and weighted sum of metrics
* move queueing model configmap to deploy/ directory

Signed-off-by: vishakha-ramani <92736776+vishakha-ramani@users.noreply.github.com>
Signed-off-by: Mohammed Abdi <mohammed.munir.abdi@ibm.com>
Co-authored-by: tantawi <tantawi@us.ibm.com>
Co-authored-by: Mohammed Munir Abdi <abdimamy@gmail.com>
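The tuner work described above learns parameters (alpha/beta/gamma) online from observed metrics via a Kalman filter. As a purely illustrative sketch, not the project's implementation, a one-dimensional Kalman update of that style looks like:

```go
package main

import "fmt"

// ScalarKalman is a minimal one-dimensional Kalman filter, illustrating the
// kind of online parameter learning the commit describes (a tuner refining a
// model parameter from noisy observed TTFT/ITL samples). Textbook sketch only.
type ScalarKalman struct {
	X float64 // current parameter estimate
	P float64 // estimate variance
	Q float64 // process noise variance
	R float64 // measurement noise variance
}

// Update folds one noisy observation z into the estimate and returns it.
func (k *ScalarKalman) Update(z float64) float64 {
	k.P += k.Q             // predict: variance grows by process noise
	g := k.P / (k.P + k.R) // Kalman gain
	k.X += g * (z - k.X)   // correct estimate toward the observation
	k.P *= 1 - g           // shrink variance after incorporating z
	return k.X
}

func main() {
	f := &ScalarKalman{X: 0, P: 1, Q: 0.001, R: 0.1}
	for _, z := range []float64{0.52, 0.48, 0.51, 0.49} { // noisy samples near 0.5
		fmt.Printf("estimate: %.3f\n", f.Update(z))
	}
}
```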
1 parent b8d93b2 commit 7f838c4

28 files changed

Lines changed: 2549 additions & 39 deletions
Lines changed: 51 additions & 0 deletions
@@ -0,0 +1,51 @@
+apiVersion: v1
+kind: ConfigMap
+# This ConfigMap configures the queueing model analyzer for the WVA controller.
+# Its presence (with a "default" key) activates the queueing model analyzer.
+# Delete this ConfigMap to fall back to the analyzer selected in
+# wva-saturation-scaling-config (V1 or V2 saturation analyzer).
+#
+# Configuration structure:
+# - 'default' key: Global default parameters applied to all models.
+# - Additional keys: Per-model overrides. The key name is arbitrary; the
+#   model_id and namespace fields inside the value identify the target model.
+#   Add one key per model override. Per-model entries must set both targetTTFT
+#   and targetITL, or neither (0 means infer from metrics).
+metadata:
+  name: wva-queueing-model-config
+  namespace: workload-variant-autoscaler-system
+  labels:
+    app.kubernetes.io/name: workload-variant-autoscaler
+    app.kubernetes.io/managed-by: kustomize
+data:
+  # Global defaults applied to all models unless overridden.
+  default: |
+    # rho = 1 - 1/k
+    # k=3.0 -> rho ~= 0.67, k=2.0 -> rho = 0.50, k=5.0 -> rho = 0.80
+    # where rho is the fraction of server capacity consumed by arrivals.
+    # Must be > 1.0.
+    # Default: 3.0
+    sloMultiplier: 3.0
+    # Enable online parameter learning via Kalman filter.
+    # When true, the tuner learns alpha/beta/gamma from observed metrics.
+    # When false, relies on explicit SLO targets or fallback heuristics.
+    tuningEnabled: true
+
+  # Per-model overrides: add one entry per model that needs custom parameters.
+  # The key name (e.g. "llama-prod") is arbitrary; model_id + namespace identify the model.
+  #
+  # llama-prod: |
+  #   model_id: "unsloth/Meta-Llama-3.1-8B"
+  #   namespace: "llm-d-prod"
+  #   targetTTFT: 500.0  # ms; must be set together with targetITL, or both omitted
+  #   targetITL: 50.0    # ms
+  #   sloMultiplier: 2.5
+  #   tuningEnabled: true
+  #
+  # mistral-staging: |
+  #   model_id: "mistralai/Mistral-7B-Instruct-v0.2"
+  #   namespace: "llm-d-staging"
+  #   targetTTFT: 800.0
+  #   targetITL: 80.0
+  #   sloMultiplier: 3.0
+  #   tuningEnabled: false
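The `sloMultiplier` comments above encode a simple relation between the multiplier k and the target utilization rho. A minimal Go sketch reproducing the values quoted in the comments (the helper name `RhoFromMultiplier` is hypothetical, not a WVA API):

```go
package main

import "fmt"

// RhoFromMultiplier returns the target server utilization rho implied by
// sloMultiplier k, per the ConfigMap comment: rho = 1 - 1/k.
// (Hypothetical helper name for illustration only.)
func RhoFromMultiplier(k float64) float64 {
	return 1.0 - 1.0/k
}

func main() {
	// Reproduces the values in the ConfigMap comment:
	// k=2.0 -> 0.50, k=3.0 -> 0.67, k=5.0 -> 0.80
	for _, k := range []float64{2.0, 3.0, 5.0} {
		fmt.Printf("k=%.1f -> rho=%.2f\n", k, RhoFromMultiplier(k))
	}
}
```

A larger k leaves less headroom: at k=5 the analyzer sizes capacity so arrivals consume 80% of each server, while k=2 reserves half of it.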

go.mod

Lines changed: 4 additions & 4 deletions
@@ -19,6 +19,7 @@ require (
 	github.com/Masterminds/semver/v3 v3.4.0 // indirect
 	github.com/cenkalti/backoff/v5 v5.0.3 // indirect
 	github.com/go-viper/mapstructure/v2 v2.4.0 // indirect
+	github.com/gogo/protobuf v1.3.2 // indirect
 	github.com/pelletier/go-toml/v2 v2.2.4 // indirect
 	github.com/pmezard/go-difflib v1.0.1-0.20181226105442-5d4384ee4fb2 // indirect
 	github.com/sagikazarmark/locafero v0.11.0 // indirect
@@ -32,7 +33,7 @@ require (
 	go.yaml.in/yaml/v3 v3.0.4 // indirect
 	golang.org/x/mod v0.32.0 // indirect
 	sigs.k8s.io/randfill v1.0.0 // indirect
-	sigs.k8s.io/structured-merge-diff/v6 v6.3.0 // indirect
+	sigs.k8s.io/structured-merge-diff/v6 v6.3.2-0.20260122202528-d9cc6641c482 // indirect
 )
 
 require (
@@ -54,7 +55,6 @@ require (
 	github.com/go-openapi/jsonreference v0.21.0 // indirect
 	github.com/go-openapi/swag v0.23.1 // indirect
 	github.com/go-task/slim-sprig/v3 v3.0.0 // indirect
-	github.com/gogo/protobuf v1.3.2 // indirect
 	github.com/google/btree v1.1.3 // indirect
 	github.com/google/cel-go v0.26.0 // indirect
 	github.com/google/gnostic-models v0.7.0 // indirect
@@ -74,7 +74,7 @@ require (
 	github.com/prometheus/client_model v0.6.2 // indirect
 	github.com/prometheus/common v0.67.5
 	github.com/prometheus/procfs v0.17.0 // indirect
-	github.com/spf13/cobra v1.9.1 // indirect
+	github.com/spf13/cobra v1.10.0 // indirect
 	github.com/spf13/pflag v1.0.10
 	github.com/stoewer/go-strcase v1.3.0 // indirect
 	github.com/x448/float16 v0.8.4 // indirect
@@ -110,7 +110,7 @@ require (
 	k8s.io/apiserver v0.34.3 // indirect
 	k8s.io/component-base v0.34.3 // indirect
 	k8s.io/klog/v2 v2.130.1 // indirect
-	k8s.io/kube-openapi v0.0.0-20250814151709-d7b6acb124c3 // indirect
+	k8s.io/kube-openapi v0.0.0-20250910181357-589584f1c912 // indirect
 	k8s.io/utils v0.0.0-20251002143259-bc988d571ff4
 	sigs.k8s.io/apiserver-network-proxy/konnectivity-client v0.31.2 // indirect
 	sigs.k8s.io/json v0.0.0-20250730193827-2d320260d730 // indirect

go.sum

Lines changed: 7 additions & 7 deletions
@@ -149,9 +149,9 @@ github.com/spf13/afero v1.15.0 h1:b/YBCLWAJdFWJTN9cLhiXXcD7mzKn9Dm86dNnfyQw1I=
 github.com/spf13/afero v1.15.0/go.mod h1:NC2ByUVxtQs4b3sIUphxK0NioZnmxgyCrfzeuq8lxMg=
 github.com/spf13/cast v1.10.0 h1:h2x0u2shc1QuLHfxi+cTJvs30+ZAHOGRic8uyGTDWxY=
 github.com/spf13/cast v1.10.0/go.mod h1:jNfB8QC9IA6ZuY2ZjDp0KtFO2LZZlg4S/7bzP6qqeHo=
-github.com/spf13/cobra v1.9.1 h1:CXSaggrXdbHK9CF+8ywj8Amf7PBRmPCOJugH954Nnlo=
-github.com/spf13/cobra v1.9.1/go.mod h1:nDyEzZ8ogv936Cinf6g1RU9MRY64Ir93oCnqb9wxYW0=
-github.com/spf13/pflag v1.0.6/go.mod h1:McXfInJRrz4CZXVZOBLb0bTZqETkiAhM9Iw0y3An2Bg=
+github.com/spf13/cobra v1.10.0 h1:a5/WeUlSDCvV5a45ljW2ZFtV0bTDpkfSAj3uqB6Sc+0=
+github.com/spf13/cobra v1.10.0/go.mod h1:9dhySC7dnTtEiqzmqfkLj47BslqLCUPMXjG2lj/NgoE=
+github.com/spf13/pflag v1.0.8/go.mod h1:McXfInJRrz4CZXVZOBLb0bTZqETkiAhM9Iw0y3An2Bg=
 github.com/spf13/pflag v1.0.10 h1:4EBh2KAYBwaONj6b2Ye1GiHfwjqyROoF4RwYO+vPwFk=
 github.com/spf13/pflag v1.0.10/go.mod h1:McXfInJRrz4CZXVZOBLb0bTZqETkiAhM9Iw0y3An2Bg=
 github.com/spf13/viper v1.21.0 h1:x5S+0EU27Lbphp4UKm1C+1oQO+rKx36vfCoaVebLFSU=
@@ -296,8 +296,8 @@ k8s.io/component-base v0.34.3 h1:zsEgw6ELqK0XncCQomgO9DpUIzlrYuZYA0Cgo+JWpVk=
 k8s.io/component-base v0.34.3/go.mod h1:5iIlD8wPfWE/xSHTRfbjuvUul2WZbI2nOUK65XL0E/c=
 k8s.io/klog/v2 v2.130.1 h1:n9Xl7H1Xvksem4KFG4PYbdQCQxqc/tTUyrgXaOhHSzk=
 k8s.io/klog/v2 v2.130.1/go.mod h1:3Jpz1GvMt720eyJH1ckRHK1EDfpxISzJ7I9OYgaDtPE=
-k8s.io/kube-openapi v0.0.0-20250814151709-d7b6acb124c3 h1:liMHz39T5dJO1aOKHLvwaCjDbf07wVh6yaUlTpunnkE=
-k8s.io/kube-openapi v0.0.0-20250814151709-d7b6acb124c3/go.mod h1:UZ2yyWbFTpuhSbFhv24aGNOdoRdJZgsIObGBUaYVsts=
+k8s.io/kube-openapi v0.0.0-20250910181357-589584f1c912 h1:Y3gxNAuB0OBLImH611+UDZcmKS3g6CthxToOb37KgwE=
+k8s.io/kube-openapi v0.0.0-20250910181357-589584f1c912/go.mod h1:kdmbQkyfwUagLfXIad1y2TdrjPFWp2Q89B3qkRwf/pQ=
 k8s.io/utils v0.0.0-20251002143259-bc988d571ff4 h1:SjGebBtkBqHFOli+05xYbK8YF1Dzkbzn+gDM4X9T4Ck=
 k8s.io/utils v0.0.0-20251002143259-bc988d571ff4/go.mod h1:OLgZIPagt7ERELqWJFomSt595RzquPNLL48iOWgYOg0=
 sigs.k8s.io/apiserver-network-proxy/konnectivity-client v0.31.2 h1:jpcvIRr3GLoUoEKRkHKSmGjxb6lWwrBlJsXc+eUYQHM=
@@ -310,7 +310,7 @@ sigs.k8s.io/json v0.0.0-20250730193827-2d320260d730 h1:IpInykpT6ceI+QxKBbEflcR5E
 sigs.k8s.io/json v0.0.0-20250730193827-2d320260d730/go.mod h1:mdzfpAEoE6DHQEN0uh9ZbOCuHbLK5wOm7dK4ctXE9Tg=
 sigs.k8s.io/randfill v1.0.0 h1:JfjMILfT8A6RbawdsK2JXGBR5AQVfd+9TbzrlneTyrU=
 sigs.k8s.io/randfill v1.0.0/go.mod h1:XeLlZ/jmk4i1HRopwe7/aU3H5n1zNUcX6TM94b3QxOY=
-sigs.k8s.io/structured-merge-diff/v6 v6.3.0 h1:jTijUJbW353oVOd9oTlifJqOGEkUw2jB/fXCbTiQEco=
-sigs.k8s.io/structured-merge-diff/v6 v6.3.0/go.mod h1:M3W8sfWvn2HhQDIbGWj3S099YozAsymCo/wrT5ohRUE=
+sigs.k8s.io/structured-merge-diff/v6 v6.3.2-0.20260122202528-d9cc6641c482 h1:2WOzJpHUBVrrkDjU4KBT8n5LDcj824eX0I5UKcgeRUs=
+sigs.k8s.io/structured-merge-diff/v6 v6.3.2-0.20260122202528-d9cc6641c482/go.mod h1:M3W8sfWvn2HhQDIbGWj3S099YozAsymCo/wrT5ohRUE=
 sigs.k8s.io/yaml v1.6.0 h1:G8fkbMSAFqgEFgh4b1wmtzDnioxFCUgTZhlbj5P9QYs=
 sigs.k8s.io/yaml v1.6.0/go.mod h1:796bPqUfzR/0jLAl6XjHl3Ck7MiyVv8dbTdyT3/pMf4=
Lines changed: 76 additions & 0 deletions
@@ -0,0 +1,76 @@
+// This file provides queueing model analyzer metrics collection using the source
+// infrastructure with registered query templates.
+package registration
+
+import (
+	"github.com/llm-d/llm-d-workload-variant-autoscaler/internal/collector/source"
+)
+
+// Query name constants for queueing model analyzer metrics.
+const (
+	// QuerySchedulerDispatchRate is the query name for per-endpoint request dispatch rate from the scheduler.
+	// This represents the arrival rate (requests/sec) being dispatched to each replica by the scheduler.
+	// Source: inference_extension_scheduler_attempts_total (gateway-api-inference-extension)
+	QuerySchedulerDispatchRate = "scheduler_dispatch_rate"
+
+	// QueryAvgTTFT is the query name for average time-to-first-token per pod (in seconds).
+	// Source: vllm:time_to_first_token_seconds histogram
+	QueryAvgTTFT = "avg_ttft"
+
+	// QueryAvgITL is the query name for average inter-token latency per pod (in seconds).
+	// Source: vllm:time_per_output_token_seconds histogram
+	QueryAvgITL = "avg_itl"
+)
+
+// RegisterQueueingModelQueries registers queries used by the queueing model analyzer.
+func RegisterQueueingModelQueries(sourceRegistry *source.SourceRegistry) {
+	registry := sourceRegistry.Get("prometheus").QueryList()
+
+	// Scheduler dispatch rate per endpoint (per-pod arrival rate).
+	// Records successful scheduling attempts with endpoint and model information.
+	// Metric labels: status, pod_name, namespace, port, model_name, target_model_name
+	// We filter by status="success" and match model identity using target_model_name
+	// (the resolved model after routing, e.g. a specific LoRA adapter), falling back to
+	// model_name (the original request model) when target_model_name is not set.
+	// This follows the same pattern as the scheduler flow control queries.
+	// Uses sum (not max) because dispatch rate is an additive counter: multiple
+	// series per pod should be summed. Uses rate() over a 1m window for requests/sec.
+	registry.MustRegister(source.QueryTemplate{
+		Name: QuerySchedulerDispatchRate,
+		Type: source.QueryTypePromQL,
+		Template: `sum by (pod_name, namespace) (rate(inference_extension_scheduler_attempts_total{status="success",namespace="{{.namespace}}",target_model_name="{{.modelID}}"}[1m]))` +
+			` or sum by (pod_name, namespace) (rate(inference_extension_scheduler_attempts_total{status="success",namespace="{{.namespace}}",model_name="{{.modelID}}",target_model_name=""}[1m]))`,
+		Params: []string{source.ParamNamespace, source.ParamModelID},
+		Description: "Request dispatch rate per endpoint (requests/sec) from scheduler, " +
+			"representing the arrival rate to each replica for a specific model",
+	})
+
+	// Average time-to-first-token per pod (seconds).
+	// Uses histogram _sum/_count from vLLM over a 1m rate window.
+	// Used by the queueing model tuner as the observed TTFT for Kalman filter updates.
+	registry.MustRegister(source.QueryTemplate{
+		Name:     QueryAvgTTFT,
+		Type:     source.QueryTypePromQL,
+		Template: `max by (pod) (rate(vllm:time_to_first_token_seconds_sum{namespace="{{.namespace}}",model_name="{{.modelID}}"}[1m]) / rate(vllm:time_to_first_token_seconds_count{namespace="{{.namespace}}",model_name="{{.modelID}}"}[1m]))`,
+		Params:   []string{source.ParamNamespace, source.ParamModelID},
+		Description: "Average time-to-first-token per pod (seconds), " +
+			"used by queueing model tuner for parameter learning",
+	})
+
+	// Average inter-token latency per pod (seconds).
+	// Uses histogram _sum/_count from vLLM over a 1m rate window.
+	// Used by the queueing model tuner as the observed ITL for Kalman filter updates.
+	registry.MustRegister(source.QueryTemplate{
+		Name:     QueryAvgITL,
+		Type:     source.QueryTypePromQL,
+		Template: `max by (pod) (rate(vllm:time_per_output_token_seconds_sum{namespace="{{.namespace}}",model_name="{{.modelID}}"}[1m]) / rate(vllm:time_per_output_token_seconds_count{namespace="{{.namespace}}",model_name="{{.modelID}}"}[1m]))`,
+		Params:   []string{source.ParamNamespace, source.ParamModelID},
+		Description: "Average inter-token latency per pod (seconds), " +
+			"used by queueing model tuner for parameter learning",
+	})
+
+	// Note: MaxBatchSize (max_num_seqs) is not available as a Prometheus metric from vLLM.
+	// It is sourced from the Deployment's container args using the deployment parser
+	// (see saturation_v2.ParseVLLMArgs). The collector populates ReplicaMetrics.MaxBatchSize
+	// by parsing the --max-num-seqs flag from the pod's parent Deployment spec.
+}
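The registered templates above use `{{.namespace}}` and `{{.modelID}}` placeholders. As an illustrative sketch of how such a template expands into a concrete PromQL query (the project's actual renderer lives in the `source` package and may differ), Go's standard `text/template` suffices:

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// ExpandQuery renders a PromQL query template with named parameters,
// substituting {{.namespace}} and {{.modelID}} placeholders.
// Illustrative sketch only, not the project's renderer.
func ExpandQuery(tmpl string, params map[string]string) (string, error) {
	t, err := template.New("query").Parse(tmpl)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, params); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	// The avg_ttft template registered above, expanded for a sample model.
	const avgTTFT = `max by (pod) (rate(vllm:time_to_first_token_seconds_sum{namespace="{{.namespace}}",model_name="{{.modelID}}"}[1m]) / rate(vllm:time_to_first_token_seconds_count{namespace="{{.namespace}}",model_name="{{.modelID}}"}[1m]))`
	q, err := ExpandQuery(avgTTFT, map[string]string{
		"namespace": "llm-d-prod",
		"modelID":   "unsloth/Meta-Llama-3.1-8B",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(q)
}
```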

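The closing note in the file explains that MaxBatchSize is recovered by parsing `--max-num-seqs` from the Deployment's container args. A minimal sketch of that kind of flag extraction (the helper name `ParseMaxNumSeqs` is hypothetical; the actual parser is `saturation_v2.ParseVLLMArgs`):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// ParseMaxNumSeqs scans a container's args for the vLLM --max-num-seqs flag
// and returns its value. It handles both the "--max-num-seqs=256" and
// "--max-num-seqs 256" forms. Hypothetical stand-in for the real parser.
func ParseMaxNumSeqs(args []string) (int, bool) {
	for i, a := range args {
		if v, ok := strings.CutPrefix(a, "--max-num-seqs="); ok {
			if n, err := strconv.Atoi(v); err == nil {
				return n, true
			}
		}
		if a == "--max-num-seqs" && i+1 < len(args) {
			if n, err := strconv.Atoi(args[i+1]); err == nil {
				return n, true
			}
		}
	}
	return 0, false // flag absent or malformed
}

func main() {
	args := []string{"vllm", "serve", "--max-num-seqs=256", "--port", "8000"}
	if n, ok := ParseMaxNumSeqs(args); ok {
		fmt.Println("max batch size:", n)
	}
}
```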