I’d like to propose a feature for handling high request concurrency more gracefully.
Problem
When sending a large burst of requests (e.g. hundreds of inference requests in parallel), it’s easy to overload any of:
- the LamaSwap server
- the backend model workers
- or even the client machine if requests pile up uncontrollably
Right now, high concurrency can lead to failures, instability, or inefficient routing.
Proposed feature
Add a built-in request queue + load balancing layer.
1. Request queue / backpressure
Instead of immediately attempting to process every incoming request, LamaSwap could:
- place excess requests into an internal queue
- process them according to configurable concurrency limits
- apply backpressure rather than letting the system get overwhelmed
Example config ideas:
max_concurrent_requests: 32
max_queue_size: 1000
queue_strategy: fifo # fifo | priority | fair
request_timeout_seconds: 300
This would make LamaSwap behave more like a resilient gateway rather than a pass-through router.
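To make the queue + backpressure idea concrete, here is a rough Python sketch of the behavior I have in mind. It is not based on LamaSwap's actual code: `AdmissionGate`, `QueueFullError`, and `run` are hypothetical names, and only the parameter names mirror the config ideas above. In this sketch the timeout covers only the handler, not the time spent waiting in the queue, which is one possible choice among several.

```python
import asyncio


class QueueFullError(Exception):
    """Raised when the waiting queue is already at max_queue_size."""


class AdmissionGate:
    """Hypothetical gate: caps in-flight requests and bounds how many may wait."""

    def __init__(self, max_concurrent_requests: int, max_queue_size: int) -> None:
        self._slots = asyncio.Semaphore(max_concurrent_requests)
        self._max_queue_size = max_queue_size
        self._waiting = 0  # requests waiting for a slot, not yet running

    async def run(self, handler, request_timeout_seconds: float):
        # Backpressure: reject right away instead of queueing without bound.
        if self._slots.locked() and self._waiting >= self._max_queue_size:
            raise QueueFullError("queue full, retry later")

        self._waiting += 1
        try:
            # asyncio.Semaphore wakes waiters in FIFO order.
            await self._slots.acquire()
        finally:
            self._waiting -= 1

        try:
            # Assumption: the timeout applies to processing only.
            return await asyncio.wait_for(handler(), request_timeout_seconds)
        finally:
            self._slots.release()
```

A gateway built this way could map `QueueFullError` to an HTTP 429 or 503 so that clients back off instead of piling up more requests.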
2. Load balancing across identical model backends
If multiple backends serve the same model, route requests to the least busy backend instead of simple static routing.
Possible strategies:
- least-connections
- least-queue-depth
- weighted round robin
- latency-aware routing
Example:
If I have 3 instances serving llama-3-70b, and one already has 50 queued requests while another has 5, the new request should go to the less loaded instance.
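A minimal sketch of the selection logic for the example above, assuming each backend exposes its in-flight and queued counts. The `Backend` class and `pick_least_loaded` function are hypothetical and not part of any existing LamaSwap API. The load of 20 on the third instance is made up for illustration, since the example only gives numbers for two of them.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str          # hypothetical identifier for one model instance
    in_flight: int = 0  # requests currently being processed
    queued: int = 0     # requests waiting on this instance

    @property
    def load(self) -> int:
        return self.in_flight + self.queued


def pick_least_loaded(backends):
    """Return the backend with the smallest load; ties go to the first one listed."""
    if not backends:
        raise ValueError("no backends available for this model")
    return min(backends, key=lambda b: b.load)


instances = [
    Backend("llama-3-70b-a", queued=50),
    Backend("llama-3-70b-b", queued=5),
    Backend("llama-3-70b-c", queued=20),  # illustrative value
]
chosen = pick_least_loaded(instances)  # selects "llama-3-70b-b"
```

The same shape would work for least-connections (count only `in_flight`) or latency-aware routing (swap the key for a moving average of response times).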
3. Optional request deduplication / batching (future enhancement)
For identical or compatible requests, optional batching/coalescing could improve throughput even further.
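For the deduplication part, one possible shape is to coalesce identical requests that are in flight at the same time, so only one of them reaches the backend. The sketch below is purely illustrative: `request_key` and `Coalescer` are hypothetical names. It would probably only make sense for deterministic requests, since with sampling enabled two identical requests can legitimately return different outputs.

```python
import asyncio
import hashlib
import json


def request_key(model: str, payload: dict) -> str:
    """Stable key for a request; key order in the payload does not matter."""
    body = json.dumps(payload, sort_keys=True)
    return hashlib.sha256(f"{model}\n{body}".encode()).hexdigest()


class Coalescer:
    """Hypothetical helper: concurrent calls with the same key share one result."""

    def __init__(self) -> None:
        self._in_flight = {}  # key -> running asyncio task

    async def run(self, key: str, handler):
        task = self._in_flight.get(key)
        if task is None:
            # First caller for this key starts the real work.
            task = asyncio.ensure_future(handler())
            self._in_flight[key] = task
            # Forget the entry once the work finishes, so later calls run fresh.
            task.add_done_callback(lambda _t: self._in_flight.pop(key, None))
        # shield: one caller being cancelled does not cancel the shared work.
        return await asyncio.shield(task)
```

True batching of merely compatible requests is a bigger design question and would likely depend on what the backend supports.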
Why this would help
This would make LamaSwap much more production-friendly for:
- bursty workloads
- agent systems firing many parallel calls
- benchmarking
- multi-user inference gateways
- self-hosted deployments where resource protection matters
Without this, users have to build their own queueing/load-shedding layer externally.
Questions
- Is there already a recommended pattern for this?
- Would this fit within LamaSwap’s intended scope, or should it live as a separate proxy layer?