Check for existing issues
The Feature
Add an optional gateway-side queue protocol for overloaded/self-hosted backends, so LiteLLM Proxy can expose structured queue metadata instead of only returning a raw 429.
Today, when a self-hosted backend such as vLLM, Ollama, or an internal inference gateway is saturated, the practical outcomes are usually:
- the upstream returns 429 Too Many Requests
- the client retries blindly using retry-after
- or the request is simply rejected, and the client has no visibility into whether work is waiting, how long it may take, or whether the server is just overloaded
I am proposing a provider-agnostic, opt-in mechanism for the Proxy layer to expose structured waiting state, for example:
    {
      "queue_position": 5,
      "estimated_wait_seconds": 18,
      "message": "Server busy, waiting for resources"
    }
This does not have to be tied to one exact transport. Possible designs:
- SSE event during streaming requests, for example:
      event: queue_status
      data: {"queue_position": 5, "estimated_wait_seconds": 18, "message": "Server busy, waiting for resources"}
- Structured JSON response body for non-streaming overload cases
- Standardized response headers, such as retry-after plus optional queue metadata headers
- A pluggable hook/adapter interface for custom backends/gateways to supply queue metadata (see the sketch below)
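To make the hook idea concrete, here is a minimal sketch of what such an adapter interface could look like. This is purely illustrative: QueueStatus, QueueStatusAdapter, and the X-Queue-Position header are hypothetical names, not existing LiteLLM APIs.

    # Hypothetical sketch only: none of these names exist in LiteLLM today.
    from dataclasses import dataclass
    from typing import Mapping, Optional, Protocol

    @dataclass
    class QueueStatus:
        queue_position: int
        estimated_wait_seconds: Optional[int] = None
        message: Optional[str] = None

    class QueueStatusAdapter(Protocol):
        """Implemented by a custom backend/gateway integration so the proxy can
        translate upstream overload signals into structured queue metadata."""

        def extract_queue_status(
            self,
            status_code: int,
            headers: Mapping[str, str],
            body: Optional[dict],
        ) -> Optional[QueueStatus]:
            """Return QueueStatus when the upstream describes a real queue,
            or None to fall back to ordinary 429 handling."""
            ...

    class HeaderQueueAdapter:
        """Toy adapter: reads hypothetical X-Queue-Position / Retry-After headers."""

        def extract_queue_status(self, status_code, headers, body):
            if status_code != 429 or "x-queue-position" not in headers:
                return None
            return QueueStatus(
                queue_position=int(headers["x-queue-position"]),
                estimated_wait_seconds=int(headers.get("retry-after", 0)) or None,
                message="Server busy, waiting for resources",
            )

The proxy would call the adapter when an upstream request is rejected and forward any returned QueueStatus over whichever transport (SSE event, JSON body, headers) is enabled.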
The key point is: LiteLLM should be able to expose more than just a bare 429 when the upstream system actually has a real queue and can describe it.
Motivation, pitch
I am working on a setup where the problem is naturally split across two layers:
- Gateway/server side: decide whether requests should be queued, rejected, or retried, and expose structured waiting metadata
- Client side: render that metadata so the user sees something like "server busy, queue position: 5" instead of assuming the request is frozen
Right now LiteLLM already has strong support for:
- rate limiting (max_parallel_requests)
- retry/backoff
- fallbacks
- retry-after style signaling
But from what I can tell, it does not currently provide a standard way to expose queue position / estimated wait information from overloaded self-hosted gateways to downstream clients.
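For context, the existing retry/fallback side is already straightforward to configure. Roughly, as a sketch using litellm.Router with made-up deployment names and endpoints:

    # Sketch of existing behavior: retries plus fallback to another deployment.
    # Model names and the api_base below are placeholders.
    from litellm import Router

    router = Router(
        model_list=[
            {
                "model_name": "my-vllm",
                "litellm_params": {
                    "model": "openai/llama-3-8b",            # OpenAI-compatible vLLM endpoint
                    "api_base": "http://localhost:8000/v1",
                    "api_key": "dummy",
                },
            },
            {
                "model_name": "my-fallback",
                "litellm_params": {"model": "gpt-3.5-turbo"},
            },
        ],
        num_retries=2,
        fallbacks=[{"my-vllm": ["my-fallback"]}],
    )

    response = router.completion(
        model="my-vllm",
        messages=[{"role": "user", "content": "hello"}],
    )

None of these knobs, however, tell the caller where it sits in a queue or how long the wait is likely to be, which is the gap this proposal targets.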
This would be especially useful for:
- teams running self-hosted vLLM clusters
- internal company gateways in front of multiple model servers
- clients that want to provide better UX than "request failed with 429" or "spinning until retry succeeds"
One concrete downstream use case is an OpenCode client integration on the consumer side:
- LiteLLM / gateway side: expose structured queue metadata
- OpenCode side: render it in the TUI while the request is waiting
Related client-side discussion:
Some implementation questions I would love feedback on:
- Should this be modeled as a new optional structured error / event contract?
- Should LiteLLM standardize SSE queue_status events for streaming clients? (see the client-side sketch after this list)
- Should queue metadata be available on both streaming and non-streaming paths?
- Should this be limited to proxy mode, where LiteLLM is acting as the gateway?
- Would the maintainers prefer a hook-based approach for custom providers/backends first?
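To make the SSE question concrete, here is a minimal client-side sketch, assuming a hypothetical queue_status event carrying the JSON payload shown above; none of this is an existing LiteLLM contract:

    # Hypothetical: assumes the proxy emits "queue_status" SSE events before the
    # first completion chunk. No such event exists in LiteLLM today.
    import json
    import httpx

    def stream_with_queue_updates(url: str, payload: dict) -> None:
        with httpx.stream("POST", url, json=payload, timeout=None) as resp:
            event_name = None
            for line in resp.iter_lines():
                if line.startswith("event:"):
                    event_name = line.split(":", 1)[1].strip()
                elif line.startswith("data:"):
                    data = line.split(":", 1)[1].strip()
                    if event_name == "queue_status":
                        status = json.loads(data)
                        # A TUI client (e.g. OpenCode) could render this instead of a spinner.
                        print(f"queued at position {status['queue_position']}, "
                              f"~{status['estimated_wait_seconds']}s wait")
                    elif data and data != "[DONE]":
                        chunk = json.loads(data)  # normal OpenAI-style streaming chunk handling
                    event_name = None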
I think this is a natural extension of LiteLLM's existing gateway responsibilities. It would let LiteLLM remain provider-agnostic while still supporting a much better overload experience for self-hosted deployments.
If there is interest in this direction, I would be happy to help refine the proposal further.
What part of LiteLLM is this about?
Proxy
LiteLLM is hiring a founding backend engineer, are you interested in joining us and shipping to all our users?
No
Twitter / LinkedIn details
No response