[Bug]: LiteLLM triton provider adds ~5s delay to time-to-first-token (TTFT) #26699

@djangodesmet

Check for existing issues

  • I have searched the existing issues and checked that my issue is not a duplicate.

What happened?

When using LiteLLM with the Triton provider to proxy requests to a Triton server (vLLM backend), every request shows roughly 5.0s of extra TTFT compared with calling Triton directly.

What I see:

  • Request via LiteLLM → Triton: TTFT 5.10s
  • Same request directly to Triton: TTFT 0.10s
  • The ~5s delay occurs for every model served by Triton when proxied through LiteLLM.
  • Using LiteLLM → vLLM (with the openai provider) shows no meaningful extra delay compared with direct requests.
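
TTFT figures like the ones above could be gathered with a small timing helper along the following lines. This is a minimal sketch, not the exact script used for the numbers in this report. `measure_ttft` and `fake_stream` are hypothetical names, and `fake_stream` only stands in for a real streaming response body (for example, the line iterator of an HTTP client pointed at LiteLLM or at Triton).

```python
import time
from typing import Callable, Iterable, Optional


def measure_ttft(
    open_stream: Callable[[], Iterable[bytes]],
    clock: Callable[[], float] = time.perf_counter,
) -> Optional[float]:
    """Return seconds from opening the stream to its first non-empty chunk.

    Returns None if the stream ends without yielding any content.
    """
    start = clock()
    for chunk in open_stream():
        if chunk.strip():  # ignore blank keep-alive lines
            return clock() - start
    return None


def fake_stream(delay_s: float) -> Iterable[bytes]:
    """Hypothetical stand-in for a streaming HTTP response body."""
    time.sleep(delay_s)  # simulated wait before the first token
    yield b"data: first-token"
    yield b"data: [DONE]"


if __name__ == "__main__":
    ttft = measure_ttft(lambda: fake_stream(0.05))
    print(f"TTFT: {ttft:.2f}s")
```

Passing the stream as a callable means the clock starts before the request is opened, so connection setup and any proxy-side waiting are included in the measurement. Running the same helper once against LiteLLM and once against Triton gives directly comparable numbers.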

Environment

  • LiteLLM version: v1.83.7-stable
  • Triton with vLLM backend
  • Both LiteLLM and Triton running in the same Kubernetes namespace
  • Fully air-gapped (no Internet access)

Debugging done

  • Checked LiteLLM logs: no obvious errors or warnings pointing to the delay.
  • Observed a repeated internal API call that includes the header Authorization: Bearer ; it is unclear whether this call is related to the delay.
  • Timing is consistently ~5.0s, which suggests a fixed timeout or retry interval rather than load-dependent latency.

Impact

The consistent ~5s added to TTFT prevents us from using LiteLLM with Triton in production.

Requested help

  • Any pointers to where LiteLLM might add a fixed ~5s wait (timeouts, retries, health checks, auth calls)?
  • Guidance on which additional logs/traces or configuration settings would be most useful to capture next.
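
One measurement that could be captured next is how long name resolution and TCP connection setup take from inside the LiteLLM pod, since a fixed delay can sit in either phase. The sketch below is only an assumption about where to look, not a confirmed cause. `time_network_phases` is a hypothetical helper, and the host and port in the usage comment are placeholders for the Triton service address.

```python
import socket
import time
from typing import Dict


def time_network_phases(host: str, port: int, timeout_s: float = 10.0) -> Dict[str, float]:
    """Time name resolution and TCP connect separately for host:port."""
    t0 = time.perf_counter()
    # Resolve the host name; this is where slow or timing-out DNS would show up.
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    t1 = time.perf_counter()

    # Connect to the first resolved address only.
    family, socktype, proto, _canonname, sockaddr = infos[0]
    with socket.socket(family, socktype, proto) as sock:
        sock.settimeout(timeout_s)
        sock.connect(sockaddr)
    t2 = time.perf_counter()

    return {"dns_s": t1 - t0, "connect_s": t2 - t1}


# Example (placeholder host and port, to be replaced with the Triton service):
# print(time_network_phases("triton.example.svc", 8000))
```

If both phases come back in milliseconds, the wait is more likely inside the proxy's request handling than in the network path.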

Steps to Reproduce

  1. Send a request through LiteLLM to a model hosted in Triton, using the triton provider.

Relevant log output

What part of LiteLLM is this about?

Proxy

What LiteLLM version are you on?

v1.83.7-stable
