
[Feature]: Optional gateway-side queue metadata / queue status protocol for overloaded self-hosted backends #26693

@d4n-sec

Description

Check for existing issues

  • I have searched the existing issues and checked that my issue is not a duplicate.

The Feature

Add an optional gateway-side queue protocol for overloaded/self-hosted backends, so LiteLLM Proxy can expose structured queue metadata instead of only returning a raw 429.

Today, when a self-hosted backend such as vLLM, Ollama, or an internal inference gateway is saturated, the practical outcomes are usually:

  • the upstream returns 429 Too Many Requests
  • the client retries blindly based on the Retry-After header
  • or the request is simply rejected and the client has no visibility into whether work is waiting, how long it may take, or whether the server is just overloaded

I am proposing a provider-agnostic, opt-in mechanism for the Proxy layer to expose structured waiting state, for example:

{
  "queue_position": 5,
  "estimated_wait_seconds": 18,
  "message": "Server busy, waiting for resources"
}

This does not have to be tied to one exact transport. Possible designs:

  1. SSE event during streaming requests, for example:

event: queue_status
data: {"queue_position": 5, "estimated_wait_seconds": 18, "message": "Server busy, waiting for resources"}

  2. Structured JSON response body for non-streaming overload cases
  3. Standardized response headers such as Retry-After plus optional queue metadata headers
  4. A pluggable hook/adapter interface for custom backends/gateways to supply queue metadata

The key point is: LiteLLM should be able to expose more than just a bare 429 when the upstream system actually has a real queue and can describe it.
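To make the non-streaming shape concrete, here is a minimal sketch of what the gateway side could return. All names here (build_queue_status, overloaded_response, the X-Queue-* headers) are hypothetical illustrations of the proposal, not existing LiteLLM APIs:

```python
import json

# Hypothetical sketch only: none of these names exist in LiteLLM today.
# A backend adapter reports its queue state, and the gateway folds it into
# the 429 response body and headers instead of returning a bare error.

def build_queue_status(position: int, avg_seconds_per_request: float) -> dict:
    """Build the proposed structured queue metadata payload."""
    return {
        "queue_position": position,
        "estimated_wait_seconds": round(position * avg_seconds_per_request),
        "message": "Server busy, waiting for resources",
    }

def overloaded_response(position: int, avg_seconds_per_request: float):
    """Return (status, headers, body) for a non-streaming overload case."""
    status = build_queue_status(position, avg_seconds_per_request)
    headers = {
        "Retry-After": str(status["estimated_wait_seconds"]),
        # Hypothetical queue metadata headers from design option 3:
        "X-Queue-Position": str(status["queue_position"]),
        "X-Estimated-Wait-Seconds": str(status["estimated_wait_seconds"]),
    }
    return 429, headers, json.dumps(status)
```

The same payload could be reused verbatim as the data field of a streaming queue_status SSE event, so clients only need one parser for both paths.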

Motivation, pitch

I am working on a setup where the problem is naturally split across two layers:

  1. Gateway/server side: decide whether requests should be queued, rejected, or retried, and expose structured waiting metadata
  2. Client side: render that metadata so the user sees something like "server busy, queue position: 5" instead of assuming the request is frozen

Right now LiteLLM already has strong support for:

  • rate limiting
  • max_parallel_requests
  • retry/backoff
  • fallbacks
  • retry-after style signaling

But from what I can tell, it does not currently provide a standard way to expose queue position / estimated wait information from overloaded self-hosted gateways to downstream clients.

This would be especially useful for:

  • teams running self-hosted vLLM clusters
  • internal company gateways in front of multiple model servers
  • clients that want to provide better UX than "request failed with 429" or "spinning until retry succeeds"

One concrete downstream use case is an OpenCode client integration on the consumer side:

  • LiteLLM / gateway side: expose structured queue metadata
  • OpenCode side: render it in the TUI while the request is waiting
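On the consumer side, handling the proposed streaming transport is small. This is a sketch under the assumption that queue_status SSE events carry the JSON payload shown earlier; the parser name and event shape are illustrative, not an existing OpenCode or LiteLLM interface:

```python
import json

def parse_sse_queue_events(stream_lines):
    """Yield parsed queue_status payloads from a raw SSE line stream.

    Assumes the hypothetical `queue_status` event shape proposed above.
    Other events are ignored here; a real client would dispatch them too.
    """
    event = None
    for line in stream_lines:
        line = line.strip()
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data = line[len("data:"):].strip()
            if event == "queue_status":
                yield json.loads(data)
            # Simplification: treat each data line as ending the event.
            event = None
```

A TUI could then render each yielded payload as "server busy, queue position: {queue_position}" while the request waits.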

Related client-side discussion:

Some implementation questions I would love feedback on:

  • Should this be modeled as a new optional structured error / event contract?
  • Should LiteLLM standardize SSE queue_status events for streaming clients?
  • Should queue metadata be available on both streaming and non-streaming paths?
  • Should this be limited to proxy mode, where LiteLLM is acting as the gateway?
  • Would the maintainers prefer a hook-based approach for custom providers/backends first?

I think this is a natural extension of LiteLLM's existing gateway responsibilities. It would let LiteLLM remain provider-agnostic while still supporting a much better overload experience for self-hosted deployments.

If there is interest in this direction, I would be happy to help refine the proposal further.

What part of LiteLLM is this about?

Proxy

LiteLLM is hiring a founding backend engineer, are you interested in joining us and shipping to all our users?

No

Twitter / LinkedIn details

No response
