Skip to content

[BUG]: TTFT metric shows inconsistent values when using Istio traffic mirroring #5610

@chay1045

Description

@chay1045

Describe the Bug

When using Istio VirtualService traffic mirroring to send shadow traffic to Dynamo (with vLLM backend), the time_to_first_token_seconds histogram metric reports higher values compared to direct requests (without mirroring).
The difference is approximately ~12ms additional latency in TTFT when requests are mirrored vs sent directly.

Steps to Reproduce

Test Setup

  • Dynamo frontend with vLLM backend
  • Istio VirtualService configured to mirror traffic

Istio Mirroring Configuration

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: llm-mirror
spec:
  gateways:
    - llm-gateway
  hosts:
    - "*"
  http:
    - route:
        - destination:
            host: primary-llm-service
            port:
              number: 8000
      mirror:
        host: dynamo-frontend
        port:
          number: 8000
      mirrorPercentage:
        value: 100.0

Test Configuration

  • Request type: streaming=false (non-streaming)
  • Requests sent through Istio Ingress Gateway

Observations

Scenario TTFT & Request Duration Difference (Dynamo - vLLM)
Without mirroring (direct requests) ~11ms
With Istio mirroring ~23ms

The ~11ms baseline is expected Dynamo overhead (HTTP handling, tokenization, routing, etc.). The additional ~12ms when mirroring is the bug.

Expected Behavior

TTFT should be consistent between mirrored and non-mirrored requests, with only the expected Dynamo processing overhead (~11ms) compared to vLLM's native TTFT metric.

Actual Behavior

TTFT shows approximately 12ms additional latency when requests are received via Istio mirroring compared to direct requests.

Environment

I am using HGX H100s with a K3s Cluster I built the container for vLLM from the source code for v0.8.0 (some commits ahead of v0.8.0)

Additional Context

Key Findings

  • ITL is identical in both scenarios.
  • Envoy sidecar logs show no errors - the sidecar appears to be handling the mirrored traffic normally
  • Non-streaming requests: The issue was observed with streaming=false, so it's not related to SSE stream handling

Screenshots

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions