@AWarno
Contributor

@AWarno AWarno commented Sep 29, 2025

Enabling Multi-Instance Deployment with HAProxy

Why HAProxy

HAProxy is a lightweight, reliable, and widely used load balancer, and it generalizes well to all server types. The vLLM documentation officially recommends using an external load balancer (see vLLM Data Parallel Deployment); its example uses NGINX, but HAProxy should work similarly.
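
To illustrate the external load balancer approach, here is a minimal sketch that renders a round-robin HAProxy configuration for several local server instances. The function name `render_haproxy_config`, the ports, the `/health` check path, and the timeout values are assumptions chosen for illustration; this is not the template shipped in this PR.

```python
from typing import Sequence


def render_haproxy_config(
    backend_ports: Sequence[int],
    listen_port: int = 8000,      # assumed port the load balancer listens on
    host: str = "127.0.0.1",      # assumed: all instances run on one node
    health_path: str = "/health", # assumed health endpoint of each server
) -> str:
    """Render a minimal HAProxy config that balances HTTP traffic round-robin."""
    if not backend_ports:
        raise ValueError("at least one backend port is required")

    lines = [
        "defaults",
        "    mode http",
        "    timeout connect 5s",
        # Long timeouts (illustrative values) because LLM generations can be slow.
        "    timeout client 10m",
        "    timeout server 10m",
        "",
        "frontend llm_frontend",
        f"    bind *:{listen_port}",
        "    default_backend llm_backend",
        "",
        "backend llm_backend",
        "    balance roundrobin",
        f"    option httpchk GET {health_path}",
    ]
    # One `server` line per instance; `check` enables the health check above.
    for index, port in enumerate(backend_ports):
        lines.append(f"    server instance{index} {host}:{port} check")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_haproxy_config([8001, 8002]))
```

The rendered text could be written to a file and passed to `haproxy -f <file>`. Because the backends are plain `host:port` pairs, the same sketch applies regardless of which server type is behind them.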

Alternative Solutions

  • Ray
    Useful for multi-node deployments when a model is too large for a single node. It can also be used for multi-instance setups, but that requires knowing how to launch and manage each server type individually (vLLM and SGLang may take different CLI arguments for this), so it does not generalize as well as an external load balancer. We may still want to provide an example of using it for multi-node deployment of large models.

  • LiteLLM
    Offers backend orchestration but is generally overkill for simple load balancing. The project evolves quickly, which may affect stability.

  • NGINX
    Very similar to HAProxy for this use case and officially recommended in the vLLM documentation (see vLLM Data Parallel Deployment). In my experience, however, HAProxy is slightly simpler and nicer to use in practice.

Literature

TODO

  • Run on longer tasks to validate stability and performance (only ifeval has been checked so far).
  • Check that the HAProxy template is correctly included in the pip wheel (consider renaming it).
  • Documentation
  • Fix the dataclass in types.

Next Steps

  • Add a multi-node deployment example using Ray server. This will likely just require creating one example configuration file under examples/.

@AWarno AWarno requested review from a team and agronskiy as code owners September 29, 2025 11:45
@copy-pr-bot

copy-pr-bot bot commented Sep 29, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

ko3n1g and others added 24 commits September 29, 2025 13:56
Signed-off-by: Oliver Koenig <[email protected]>
Signed-off-by: Anna Warno <[email protected]>
Signed-off-by: Oliver Koenig <[email protected]>
Signed-off-by: Anna Warno <[email protected]>
1. Add total stats.
2. Add reasoning token stats (if provided): see
https://platform.openai.com/docs/guides/reasoning, or "reasoning_tokens"
in usage (completion_tokens_details, output_tokens_details).
3. Make stats cache-resistant — do not include stats if the response is
from cache.

---------

Signed-off-by: Anna Warno <[email protected]>
checkbox added

Signed-off-by: AWarno <[email protected]>
Signed-off-by: Anna Warno <[email protected]>
Signed-off-by: Oliver Koenig <[email protected]>
Signed-off-by: Anna Warno <[email protected]>
It unblocks us to use new Eval Factory containers in the launcher — they
don't have `nv-eval`/`nv_eval` alias anymore.

Signed-off-by: Piotr Januszewski <[email protected]>
Signed-off-by: Anna Warno <[email protected]>
Signed-off-by: Wojciech Prazuch <[email protected]>
Signed-off-by: Anna Warno <[email protected]>
This is a very basic migration of the readme content + adding a minimal
toctree to the home index page so that the sphinx site produces a
sidebar. The sidebar will mature and break out in the future into
sections such as About, Get Started, etc.

We will also add more sections/cards to this page after all other basic
edits have been checked in, so it won't be a direct copy of the README,
instead it will become a proper docs site home page.

---------

Signed-off-by: Lawrence Lane <[email protected]>
Signed-off-by: L.B. <[email protected]>
Co-authored-by: jgerh <[email protected]>
Signed-off-by: Anna Warno <[email protected]>
Signed-off-by: Wojciech Prazuch <[email protected]>
Signed-off-by: Anna Warno <[email protected]>
Docs update

---------

Signed-off-by: Anna Warno <[email protected]>
Signed-off-by: Oliver Koenig <[email protected]>
Signed-off-by: Wojciech Prazuch <[email protected]>
Signed-off-by: AWarno <[email protected]>
Co-authored-by: Oliver Koenig <[email protected]>
Co-authored-by: Alexey Gronskiy <[email protected]>
Co-authored-by: Wojciech Prazuch <[email protected]>
Signed-off-by: Anna Warno <[email protected]>
Signed-off-by: Wojciech Prazuch <[email protected]>
Signed-off-by: Anna Warno <[email protected]>
Signed-off-by: Oliver Koenig <[email protected]>
Signed-off-by: Anna Warno <[email protected]>
Signed-off-by: oliver könig <[email protected]>
Signed-off-by: Anna Warno <[email protected]>
Signed-off-by: Marta Stepniewska-Dziubinska <[email protected]>
Signed-off-by: Anna Warno <[email protected]>
@AWarno
Contributor Author

AWarno commented Oct 23, 2025

/ok to test b14f3d7

@copy-pr-bot

copy-pr-bot bot commented Oct 23, 2025

/ok to test b14f3d7

@AWarno, there was an error processing your request: E2

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

@AWarno
Contributor Author

AWarno commented Oct 23, 2025

/ok to test e2fb5ce

@github-actions github-actions bot added the tests label Oct 23, 2025
fgalko-oss previously approved these changes Oct 27, 2025
@AWarno
Contributor Author

AWarno commented Oct 27, 2025

/ok to test 10d1c02

@AWarno
Contributor Author

AWarno commented Oct 27, 2025

/ok to test f74f6c4

@fgalko-oss fgalko-oss self-requested a review October 28, 2025 04:49
fgalko-oss previously approved these changes Oct 28, 2025