[Feature Request] Add request queueing + load balancing for high-concurrency inference requests #772

@0xrushi

Description

I’d like to propose a feature for handling high request concurrency more gracefully.

Problem

When sending a large burst of requests (e.g. hundreds of inference requests in parallel), it’s easy to overload any of the following:

  • the LamaSwap server
  • the backend model workers
  • or even the client machine if requests pile up uncontrollably

Right now, high concurrency can lead to failures, instability, or inefficient routing.

Proposed feature

Add a built-in request queue + load balancing layer.

1. Request queue / backpressure

Instead of immediately attempting to process every incoming request, LamaSwap could:

  • place excess requests into an internal queue
  • process them according to configurable concurrency limits
  • apply backpressure rather than letting the system get overwhelmed

Example config ideas:

max_concurrent_requests: 32
max_queue_size: 1000
queue_strategy: fifo   # fifo | priority | fair
request_timeout_seconds: 300

This would make LamaSwap behave more like a resilient gateway rather than a pass-through router.
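
To make the idea concrete, here is a minimal Python sketch of the queue + backpressure behaviour described above. It is only an illustration, not LamaSwap code: `BoundedGate`, `QueueFull`, and `demo` are hypothetical names, and the constructor arguments simply mirror the example config keys. One assumption to flag: the timeout here only bounds the time a request spends waiting in the queue, not the whole request.

```python
import asyncio
import collections


class QueueFull(Exception):
    """Raised when all slots are busy and the waiting queue is full."""


class BoundedGate:
    # Hypothetical helper (not part of LamaSwap): runs at most
    # max_concurrent handlers at once and lets at most max_queue_size wait.
    def __init__(self, max_concurrent, max_queue_size, timeout_seconds):
        self._max_concurrent = max_concurrent
        self._max_queue_size = max_queue_size
        self._timeout = timeout_seconds
        self._active = 0
        self._waiters = collections.deque()  # FIFO queue of futures

    async def run(self, handler):
        if self._active < self._max_concurrent:
            self._active += 1  # a slot is free, take it immediately
        else:
            if len(self._waiters) >= self._max_queue_size:
                raise QueueFull("queue is full")  # shed load instead of piling up
            fut = asyncio.get_running_loop().create_future()
            self._waiters.append(fut)
            try:
                # Wait until a finishing request hands its slot over.
                await asyncio.wait_for(fut, self._timeout)
            except BaseException:
                if fut in self._waiters:
                    self._waiters.remove(fut)  # gave up while still queued
                elif fut.done() and not fut.cancelled():
                    self._release()  # slot arrived just as we gave up
                raise
        try:
            return await handler()
        finally:
            self._release()

    def _release(self):
        # Hand the slot to the oldest live waiter, or free it.
        while self._waiters:
            fut = self._waiters.popleft()
            if not fut.done():
                fut.set_result(None)
                return
        self._active -= 1


async def demo():
    gate = BoundedGate(max_concurrent=2, max_queue_size=3, timeout_seconds=5)

    async def fake_inference():
        await asyncio.sleep(0.01)  # stand-in for a real backend call
        return "ok"

    # Fire 10 requests at once: 2 run, 3 queue, 5 are rejected.
    return await asyncio.gather(
        *(gate.run(fake_inference) for _ in range(10)),
        return_exceptions=True,
    )
```

Calling `asyncio.run(demo())` sends a burst of 10 requests through a gate with 2 slots and 3 queue positions. In a real gateway the `QueueFull` case would probably map to an HTTP 429 or 503 so that clients can back off.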

2. Load balancing across identical model backends

If multiple backends serve the same model, LamaSwap could route each request to the least busy backend instead of using simple static routing.

Possible strategies:

  • least-connections
  • least-queue-depth
  • weighted round robin
  • latency-aware routing

Example:

If I have 3 instances serving llama-3-70b, and one already has 50 queued requests while another has 5, the new request should go to the less loaded instance.
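
A rough sketch of the least-connections variant, in Python for illustration only. The `Backend` class, the instance names, and the load of 20 on the third instance are made up for the example; only the 50 and 5 come from the scenario above.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    # Hypothetical record for one instance serving a model.
    name: str
    in_flight: int = 0  # requests currently running or queued on this instance


def pick_least_loaded(backends):
    # Least-connections: choose the backend with the fewest outstanding
    # requests. min() returns the first such backend when there is a tie.
    return min(backends, key=lambda b: b.in_flight)


backends = [
    Backend("llama-3-70b-a", in_flight=50),
    Backend("llama-3-70b-b", in_flight=5),
    Backend("llama-3-70b-c", in_flight=20),  # assumed value for the third instance
]

chosen = pick_least_loaded(backends)
chosen.in_flight += 1  # the router would decrement this when the request finishes
```

Here `chosen` is the instance with 5 outstanding requests. The other strategies in the list would mostly change the `key` function (e.g. queue depth or a recent latency estimate) rather than the overall shape.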

3. Optional request deduplication / batching (future enhancement)

For identical or compatible requests, optional batching/coalescing could improve throughput even further.
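
For the coalescing part, one possible shape is a "single-flight" wrapper: while a request with a given payload is in flight, identical requests wait for its result instead of hitting the backend again. The sketch below is hypothetical (`Coalescer` and the hashing scheme are my own illustration), and it assumes that identical payloads really do produce interchangeable responses, which for LLM inference is only true for deterministic sampling settings.

```python
import asyncio
import hashlib
import json


class Coalescer:
    # Hypothetical helper: shares one backend call among identical
    # requests that arrive while that call is still running.
    def __init__(self):
        self._in_flight = {}  # payload hash -> asyncio.Task

    async def run(self, payload, handler):
        key = hashlib.sha256(
            json.dumps(payload, sort_keys=True).encode()
        ).hexdigest()
        task = self._in_flight.get(key)
        if task is None:
            # First request with this payload: start the real call.
            task = asyncio.ensure_future(handler(payload))
            self._in_flight[key] = task
            task.add_done_callback(lambda _t: self._in_flight.pop(key, None))
        # shield() keeps one caller's cancellation from cancelling the
        # shared call that other callers are still waiting on.
        return await asyncio.shield(task)


async def demo_coalesce():
    calls = 0

    async def fake_inference(payload):
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)  # stand-in for a real backend call
        return "echo: " + payload["prompt"]

    c = Coalescer()
    same = {"model": "llama-3-70b", "prompt": "hi", "temperature": 0}
    other = {"model": "llama-3-70b", "prompt": "bye", "temperature": 0}
    results = await asyncio.gather(
        *(c.run(same, fake_inference) for _ in range(5)),
        c.run(other, fake_inference),
    )
    return calls, results
```

In `demo_coalesce`, six requests arrive together but the backend handler only runs twice, once per distinct payload. True batching of merely "compatible" requests would need backend support and is not covered by this sketch.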

Why this would help

This would make LamaSwap much more production-friendly for:

  • bursty workloads
  • agent systems firing many parallel calls
  • benchmarking
  • multi-user inference gateways
  • self-hosted deployments where resource protection matters

Without this, users have to build their own queueing/load-shedding layer externally.

Questions

  • Is there already a recommended pattern for this?
  • Would this fit within LamaSwap’s intended scope, or should it live as a separate proxy layer?
