[Feature Request] Add request queueing + load balancing for high-concurrency inference requests

I’d like to propose a feature for handling high request concurrency more gracefully.

### Problem

When sending a large burst of requests (e.g. hundreds of inference requests in parallel), it’s easy to overload any of the following:

* the LamaSwap server
* the backend model workers
* the client machine itself, if requests pile up uncontrollably

Right now, high concurrency can lead to failures, instability, or inefficient routing.

### Proposed feature

Add a built-in request queue + load balancing layer.

#### 1. Request queue / backpressure

Instead of immediately attempting to process every incoming request, LamaSwap could:

* place excess requests into an internal queue
* process them according to configurable concurrency limits
* apply backpressure rather than letting the system get overwhelmed

Example config ideas:

```yaml
max_concurrent_requests: 32
max_queue_size: 1000
queue_strategy: fifo   # fifo | priority | fair
request_timeout_seconds: 300
```

This would make LamaSwap behave like a resilient gateway rather than a pass-through router.
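To make the intended behaviour concrete, here is a rough Python sketch of the queue + backpressure idea. It is only an illustration, not a proposal for how LamaSwap should implement it: `BoundedGate`, `QueueFull`, and `run` are hypothetical names, and I'm assuming the timeout applies to processing a request once it has a slot (the config key above could equally mean time spent waiting in the queue).

```python
import asyncio


class QueueFull(Exception):
    """Raised when the waiting queue is at capacity (load shedding)."""


class BoundedGate:
    """Hypothetical gate: at most `max_concurrent` requests run at once,
    and at most `max_queue_size` further requests may wait for a slot."""

    def __init__(self, max_concurrent, max_queue_size):
        self._slots = asyncio.Semaphore(max_concurrent)
        self._max_queue_size = max_queue_size
        self._waiting = 0

    async def run(self, handler, timeout=None):
        # Backpressure: reject right away when every slot is busy and the
        # queue is already full, instead of accepting unbounded work.
        if self._slots.locked() and self._waiting >= self._max_queue_size:
            raise QueueFull("too many queued requests")
        self._waiting += 1
        try:
            await self._slots.acquire()
        finally:
            self._waiting -= 1
        try:
            # Assumption: the timeout covers processing only, not queue wait.
            return await asyncio.wait_for(handler(), timeout)
        finally:
            self._slots.release()
```

A gateway would presumably translate `QueueFull` into an HTTP 429 or 503 so that clients can back off and retry.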

#### 2. Load balancing across identical model backends

If multiple backends serve the same model, route requests to the least busy backend instead of simple static routing.

Possible strategies:

* least-connections
* least-queue-depth
* weighted round robin
* latency-aware routing

Example:

If I have 3 instances serving `llama-3-70b`, and one already has 50 queued requests while another has 5, the new request should go to the less loaded instance.
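As a minimal sketch of the selection logic for that example (Python, purely illustrative): `Backend`, `in_flight`, `queued`, and `pick_least_loaded` are made-up names, and the load figure for the third instance is invented for the demo.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    """Hypothetical view of one backend instance serving a model."""
    name: str
    in_flight: int = 0  # requests currently being processed
    queued: int = 0     # requests waiting on this backend


def pick_least_loaded(backends):
    """Return the backend with the fewest in-flight plus queued requests."""
    return min(backends, key=lambda b: b.in_flight + b.queued)


# Three instances serving the same model, as in the example above.
instances = [
    Backend("llama-3-70b-a", queued=50),
    Backend("llama-3-70b-b", queued=5),
    Backend("llama-3-70b-c", queued=20),  # invented value for illustration
]
chosen = pick_least_loaded(instances)
print(chosen.name)
```

Here the new request goes to `llama-3-70b-b`. A weighted or latency-aware strategy would only need a different `key` function.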

#### 3. Optional request deduplication / batching (future enhancement)

For identical or compatible requests, optional batching/coalescing could improve throughput even further.
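For the deduplication half, one possible shape is a "single-flight" map: concurrent requests with the same key share one backend call. The sketch below is only an assumption about how this could look. `SingleFlight` and `do` are hypothetical names, how to derive a key from a request is left open, and sharing a response probably only makes sense when generation is deterministic.

```python
import asyncio


class SingleFlight:
    """Hypothetical coalescer: concurrent calls with the same key share
    a single in-flight call instead of each hitting the backend."""

    def __init__(self):
        self._in_flight = {}

    async def do(self, key, fn):
        task = self._in_flight.get(key)
        if task is None:
            # First caller for this key starts the real work.
            task = asyncio.ensure_future(fn())
            self._in_flight[key] = task
            # Forget the entry once finished so later calls run fresh.
            task.add_done_callback(lambda _: self._in_flight.pop(key, None))
        # shield() keeps one cancelled caller from cancelling the shared call.
        return await asyncio.shield(task)
```

Batching of merely compatible requests is a separate problem and is not covered by this sketch.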

### Why this would help

This would make LamaSwap much more production-friendly for:

* bursty workloads
* agent systems firing many parallel calls
* benchmarking
* multi-user inference gateways
* self-hosted deployments where resource protection matters

Without this, users have to build their own queueing/load-shedding layer externally.

### Questions

* Is there already a recommended pattern for this?
* Would this fit within LamaSwap’s intended scope, or should it live as a separate proxy layer?

[Feature Request] Add request queueing + load balancing for high-concurrency inference requests #772
