Saturation v2 fails to scale inference servers as it silently fails to get total demand#982

Closed
asm582 wants to merge 1 commit into llm-d:main from asm582:fix-epp-namespace-fallback

Conversation

@asm582
Collaborator

@asm582 asm582 commented Apr 7, 2026

The EPP issue mentions adding a namespace label key to the metrics exposed by inference servers that run in a different namespace.

The code change we applied previously acts as a graceful-degradation safety net: it uses PromQL's or operator to bypass the namespace requirement when necessary. In this fallback layer, we strip out the namespace= filter entirely and remove namespace from the sum by() aggregation grouping.
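To make the fallback shape concrete, here is a minimal sketch of how such a query string could be assembled. It is not the code from this PR: the helper name build_demand_query and the metric name example_queue_size are hypothetical, and the use of on (model_name) is my assumption. PromQL's or matches on full label sets by default, so without a matching modifier the two sides (grouped by different labels) would both be returned instead of the second acting as a fallback.

```python
def escape_label_value(value: str) -> str:
    """Escape backslashes and double quotes for use inside a PromQL string literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def build_demand_query(metric: str, namespace: str, model_name: str) -> str:
    """Build a namespace-scoped query with a namespace-free fallback.

    Hypothetical helper for illustration only; the metric name and the
    exact matcher set are assumptions, not taken from the PR.
    """
    ns = escape_label_value(namespace)
    model = escape_label_value(model_name)

    # Primary path: filter on namespace and keep it in the grouping.
    primary = (
        f'sum by (namespace, model_name) '
        f'({metric}{{namespace="{ns}", model_name="{model}"}})'
    )
    # Fallback path: namespace filter and grouping label both removed.
    fallback = f'sum by (model_name) ({metric}{{model_name="{model}"}})'

    # "or on (model_name)" returns the fallback series only when the
    # primary side has no series with the same model_name.
    return f"{primary} or on (model_name) {fallback}"


query = build_demand_query("example_queue_size", "asmalvan-test", "Qwen/Qwen3-0.6B")
print(query)
```

The fallback side only contributes when the namespace-scoped side returns nothing for that model, which is the "safety net" behavior described above.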

Note: Because this fallback explicitly bypasses namespace-based tenant isolation in Prometheus, it requires the upstream model_name or target_model_name (e.g., an InferencePool name or a HuggingFace identifier string) to be unique across the entire Kubernetes cluster. If the same model string is deployed in two separate namespaces and the fallback path triggers, the autoscaler will ingest the combined traffic counts of both deployments, resulting in inaccurate scaling calculations.
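A small in-memory illustration of that risk, assuming two hypothetical namespaces (team-a, team-b) that serve the same model string with made-up sample values. The sum_by helper only mimics a label-filtered sum by() aggregation; it is not a Prometheus client call.

```python
from collections import defaultdict

MODEL = "Qwen/Qwen3-0.6B"

# Hypothetical samples: (labels, value). Namespaces and values are invented.
SAMPLES = [
    ({"namespace": "team-a", "model_name": MODEL}, 40.0),
    ({"namespace": "team-b", "model_name": MODEL}, 25.0),
]


def sum_by(samples, matchers, by):
    """Mimic `sum by (<by>) (metric{<matchers>})` over in-memory samples."""
    totals = defaultdict(float)
    for labels, value in samples:
        # Keep only samples whose labels satisfy every equality matcher.
        if all(labels.get(key) == want for key, want in matchers.items()):
            group = tuple(labels.get(key) for key in by)
            totals[group] += value
    return dict(totals)


# Namespace-scoped path: only team-a's demand is counted.
scoped = sum_by(
    SAMPLES,
    {"namespace": "team-a", "model_name": MODEL},
    by=("namespace", "model_name"),
)

# Fallback path: namespace matcher and grouping removed, so both
# deployments are blended into one total.
blended = sum_by(SAMPLES, {"model_name": MODEL}, by=("model_name",))

print(scoped)
print(blended)
```

In this toy data the scoped total for team-a is 40.0, while the fallback total is 65.0, so an autoscaler acting on the fallback value for team-a would over-count demand by team-b's share.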

@lionelvillard
Collaborator

The namespace label is automatically added by Prometheus (see doc).

What error are you getting?

@asm582
Collaborator Author

asm582 commented Apr 8, 2026

> The namespace label is automatically added by prometheus (see doc).
>
> What error are you getting?

Thanks. The error is that total demand is always zero, so WVA fails to scale.

@asm582
Collaborator Author

asm582 commented Apr 8, 2026

/hold

github-actions bot added the hold label (PRs that are blocked on design, other features, release cycle, etc.) Apr 8, 2026
@asm582
Collaborator Author

asm582 commented Apr 8, 2026

I did a clean deploy with the controller and EPP all in the same namespace, and I now see the log line below:

2026-04-08T21:44:11Z    INFO    saturation/engine_v2.go:65      V2 saturation analysis completed        
{
  "modelID": "Qwen/Qwen3-0.6B", 
  "totalSupply": 6553, 
  "totalDemand": 7319, 
  "utilization": 1.116893, 
  "requiredCapacity": 2595.75, 
  "spareCapacity": 0
}
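As a sanity check on the values in that log line, the reported utilization matches totalDemand / totalSupply. The requiredCapacity value is also arithmetically consistent with a target utilization of 0.8, but that target is only my inference from the numbers; the log does not state it, and the actual formula in engine_v2.go may differ.

```python
# Values copied from the log line above.
total_supply = 6553
total_demand = 7319
logged_utilization = 1.116893
logged_required_capacity = 2595.75

# Utilization as demand over supply.
utilization = total_demand / total_supply

# ASSUMPTION: a target utilization of 0.8, inferred from the arithmetic only.
assumed_target_utilization = 0.8
required_capacity = total_demand / assumed_target_utilization - total_supply

print(round(utilization, 6), logged_utilization)
print(round(required_capacity, 2), logged_required_capacity)
```

Either way, a non-zero totalDemand here shows the demand query returning data in the same-namespace deployment.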

Below are HPA logs:

Normal   SuccessfulRescale             2m56s                 horizontal-pod-autoscaler  New size: 1; reason: All metrics below target
Normal   SuccessfulRescale             101s                  horizontal-pod-autoscaler  New size: 2; reason: external metric wva_desired_replicas(&LabelSelector{MatchLabels:map[string]string{controller_instance: asmalvan-test,exported_namespace: asmalvan-test,variant_name: workload-variant-autoscaler-va,},MatchExpressions:[]LabelSelectorRequirement{},}) above target

Closing this PR for now.

@asm582 asm582 closed this Apr 8, 2026