-
Notifications
You must be signed in to change notification settings - Fork 822
Description
Describe the Bug
When using Istio VirtualService traffic mirroring to send shadow traffic to Dynamo (with vLLM backend), the time_to_first_token_seconds histogram metric reports higher values compared to direct requests (without mirroring).
The difference is approximately ~12ms additional latency in TTFT when requests are mirrored vs sent directly.
Steps to Reproduce
Test Setup
- Dynamo frontend with vLLM backend
- Istio VirtualService configured to mirror traffic
Istio Mirroring Configuration
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: llm-mirror
spec:
gateways:
- llm-gateway
hosts:
- "*"
http:
- route:
- destination:
host: primary-llm-service
port:
number: 8000
mirror:
host: dynamo-frontend
port:
number: 8000
mirrorPercentage:
value: 100.0Test Configuration
- Request type:
streaming=false(non-streaming) - Requests sent through Istio Ingress Gateway
Observations
| Scenario | TTFT & Request Duration Difference (Dynamo - vLLM) |
|---|---|
| Without mirroring (direct requests) | ~11ms |
| With Istio mirroring | ~23ms |
The ~11ms baseline is expected Dynamo overhead (HTTP handling, tokenization, routing, etc.). The additional ~12ms when mirroring is the bug.
Expected Behavior
TTFT should be consistent between mirrored and non-mirrored requests, with only the expected Dynamo processing overhead (~11ms) compared to vLLM's native TTFT metric.
Actual Behavior
TTFT shows approximately 12ms additional latency when requests are received via Istio mirroring compared to direct requests.
Environment
I am using HGX H100s with a K3s Cluster I built the container for vLLM from the source code for v0.8.0 (some commits ahead of v0.8.0)
Additional Context
Key Findings
- ITL is identical in both scenarios.
- Envoy sidecar logs show no errors - the sidecar appears to be handling the mirrored traffic normally
- Non-streaming requests: The issue was observed with streaming=false, so it's not related to SSE stream handling
Screenshots
No response