Check for existing issues
The Feature
Add an optional gateway-side queue protocol for overloaded/self-hosted backends, so LiteLLM Proxy can expose structured queue metadata instead of only returning a raw 429.
Today, when a self-hosted backend such as vLLM, Ollama, or an internal inference gateway is saturated, the practical outcomes are usually:
- the upstream returns 429 Too Many Requests
- the client retries blindly using retry-after
- or the request is simply rejected, and the client has no visibility into whether work is waiting, how long it may take, or whether the server is just overloaded
I am proposing a provider-agnostic, opt-in mechanism for the Proxy layer to expose structured waiting state, for example:
    {
      "queue_position": 5,
      "estimated_wait_seconds": 18,
      "message": "Server busy, waiting for resources"
    }
This does not have to be tied to one exact transport. Possible designs:
- SSE event during streaming requests, for example:
      event: queue_status
      data: {"queue_position": 5, "estimated_wait_seconds": 18, "message": "Server busy, waiting for resources"}
- Structured JSON response body for non-streaming overload cases
- Standardized response headers, such as retry-after plus optional queue metadata headers
- A pluggable hook/adapter interface for custom backends/gateways to supply queue metadata (see the sketch below)
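To make the hook idea concrete, here is a minimal sketch of what such an adapter interface could look like. This is purely illustrative: QueueStatus, QueueStatusAdapter, and the X-Queue-Position header are hypothetical names, not existing LiteLLM APIs.

    # Hypothetical sketch only: none of these names exist in LiteLLM today.
    from dataclasses import dataclass
    from typing import Mapping, Optional, Protocol

    @dataclass
    class QueueStatus:
        queue_position: int
        estimated_wait_seconds: Optional[int] = None
        message: Optional[str] = None

    class QueueStatusAdapter(Protocol):
        """Implemented by a custom backend/gateway integration so the proxy can
        translate upstream overload signals into structured queue metadata."""

        def extract_queue_status(
            self,
            status_code: int,
            headers: Mapping[str, str],
            body: Optional[dict],
        ) -> Optional[QueueStatus]:
            """Return QueueStatus when the upstream describes a real queue,
            or None to fall back to ordinary 429 handling."""
            ...

    class HeaderQueueAdapter:
        """Toy adapter: reads hypothetical X-Queue-Position / Retry-After headers."""

        def extract_queue_status(self, status_code, headers, body):
            if status_code != 429 or "x-queue-position" not in headers:
                return None
            return QueueStatus(
                queue_position=int(headers["x-queue-position"]),
                estimated_wait_seconds=int(headers.get("retry-after", 0)) or None,
                message="Server busy, waiting for resources",
            )

The proxy would call the adapter when an upstream request is rejected and forward any returned QueueStatus over whichever transport (SSE event, JSON body, headers) is enabled.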
The key point is: LiteLLM should be able to expose more than just a bare 429 when the upstream system actually has a real queue and can describe it.
Motivation, pitch
I am working on a setup where the problem is naturally split across two layers:
- Gateway/server side: decide whether requests should be queued, rejected, or retried, and expose structured waiting metadata
- Client side: render that metadata so the user sees something like "server busy, queue position: 5" instead of assuming the request is frozen
Right now LiteLLM already has strong support for:
- rate limiting (max_parallel_requests)
- retry/backoff
- fallbacks
- retry-after style signaling
But from what I can tell, it does not currently provide a standard way to expose queue position / estimated wait information from overloaded self-hosted gateways to downstream clients.
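For context, the existing retry/fallback side is already straightforward to configure. Roughly, as a sketch using litellm.Router with made-up deployment names and endpoints:

    # Sketch of existing behavior: retries plus fallback to another deployment.
    # Model names and the api_base below are placeholders.
    from litellm import Router

    router = Router(
        model_list=[
            {
                "model_name": "my-vllm",
                "litellm_params": {
                    "model": "openai/llama-3-8b",            # OpenAI-compatible vLLM endpoint
                    "api_base": "http://localhost:8000/v1",
                    "api_key": "dummy",
                },
            },
            {
                "model_name": "my-fallback",
                "litellm_params": {"model": "gpt-3.5-turbo"},
            },
        ],
        num_retries=2,
        fallbacks=[{"my-vllm": ["my-fallback"]}],
    )

    response = router.completion(
        model="my-vllm",
        messages=[{"role": "user", "content": "hello"}],
    )

None of these knobs, however, tell the caller where it sits in a queue or how long the wait is likely to be, which is the gap this proposal targets.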
This would be especially useful for:
- teams running self-hosted vLLM clusters
- internal company gateways in front of multiple model servers
- clients that want to provide better UX than "request failed with 429" or "spinning until retry succeeds"
One concrete downstream use case is an OpenCode client integration on the consumer side:
- LiteLLM / gateway side: expose structured queue metadata
- OpenCode side: render it in the TUI while the request is waiting
Related client-side discussion:
Some implementation questions I would love feedback on:
- Should this be modeled as a new optional structured error / event contract?
- Should LiteLLM standardize SSE queue_status events for streaming clients? (see the client-side sketch after this list)
- Should queue metadata be available on both streaming and non-streaming paths?
- Should this be limited to proxy mode, where LiteLLM is acting as the gateway?
- Would the maintainers prefer a hook-based approach for custom providers/backends first?
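To make the SSE question concrete, here is a minimal client-side sketch, assuming a hypothetical queue_status event carrying the JSON payload shown above; none of this is an existing LiteLLM contract:

    # Hypothetical: assumes the proxy emits "queue_status" SSE events before the
    # first completion chunk. No such event exists in LiteLLM today.
    import json
    import httpx

    def stream_with_queue_updates(url: str, payload: dict) -> None:
        with httpx.stream("POST", url, json=payload, timeout=None) as resp:
            event_name = None
            for line in resp.iter_lines():
                if line.startswith("event:"):
                    event_name = line.split(":", 1)[1].strip()
                elif line.startswith("data:"):
                    data = line.split(":", 1)[1].strip()
                    if event_name == "queue_status":
                        status = json.loads(data)
                        # A TUI client (e.g. OpenCode) could render this instead of a spinner.
                        print(f"queued at position {status['queue_position']}, "
                              f"~{status['estimated_wait_seconds']}s wait")
                    elif data and data != "[DONE]":
                        chunk = json.loads(data)  # normal OpenAI-style streaming chunk handling
                    event_name = None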
I think this is a natural extension of LiteLLM's existing gateway responsibilities. It would let LiteLLM remain provider-agnostic while still supporting a much better overload experience for self-hosted deployments.
If there is interest in this direction, I would be happy to help refine the proposal further.
What part of LiteLLM is this about?
Proxy
LiteLLM is hiring a founding backend engineer, are you interested in joining us and shipping to all our users?
No
Twitter / LinkedIn details
No response