feat: deploy auxiliary endpoints #830

Merged
AdamRajfer merged 16 commits into main from wprazuch/judge-deployment
Mar 18, 2026
Conversation

@wprazuch
Contributor

No description provided.

@copy-pr-bot

copy-pr-bot Bot commented Mar 10, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

gchlebus and others added 11 commits March 16, 2026 13:04
Add judge_deployment config section that deploys a judge model alongside
the model under test, on dedicated SLURM nodes within the same allocation.

Config structure:
- judge_deployment.type: vllm | sglang | none (default: none)
- judge_deployment.image, checkpoint_path, served_model_name, port, etc.
- judge_deployment.num_nodes: number of nodes for the judge server
- execution.judge_deployment.n_tasks: srun task count for judge
- execution.mounts.judge_deployment: container mounts for judge

When judge_deployment.type != none:
- Total SLURM allocation = num_nodes (model) + judge_deployment.num_nodes
- Nodes are split: first N for model, remaining M for judge
- Model deployment uses --nodelist to stay on its assigned nodes
- Judge server starts, health-checks, then eval runs
- JUDGE_ENDPOINT_URL and JUDGE_MODEL_ID env vars are auto-exported
  to evaluation containers for downstream use
- Judge server is terminated after evaluation completes

Motivation: Ultra posttraining team needs to run 20+ checkpoints x 3-4
judge-dependent evals (arenahard, LCR, tau2, HLE, IFBench). NVCF judge
capacity becomes a bottleneck at this scale, especially for heavy judges
like DeepSeek-3.2 and Qwen-235B. On-cluster deployment eliminates the
external dependency.

Ref: feedback from Ultra evaluation meeting (2026-03-04)
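
Putting the keys above together, a run config enabling an on-cluster judge might look like this (an illustrative sketch only: the key names come from the description above, all values are invented):

```yaml
# Illustrative sketch -- key names from the commit message, values invented.
judge_deployment:
  type: vllm                       # vllm | sglang | none (default: none)
  image: vllm/vllm-openai:latest   # example container image
  checkpoint_path: /checkpoints/judge
  served_model_name: my-judge
  port: 9000
  num_nodes: 1                     # nodes reserved for the judge server

execution:
  judge_deployment:
    n_tasks: 8                     # srun task count for the judge
  mounts:
    judge_deployment: /checkpoints:/checkpoints
```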

Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
…ployment

Replace duplicated judge_deployment and user_deployment code with a generic
auxiliary_deployments system. Adding a new deployment type is now config-only.

- Add AuxDeploymentState dataclass and auxiliary_deployments.py module
- Add configs/auxiliary_deployment/ shared templates (vllm, none)
- Refactor executor.py: single _generate_auxiliary_deployment_srun_command()
  with haproxy/multi-instance support replaces 2 duplicated functions
- Normalization shim translates legacy judge_deployment/user_deployment keys
- Validation for duplicate ports, prefixes, required fields
- 137/137 tests pass

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
- Remove unused collect_judge/user_deployment_env_vars from env_vars.py
- Remove unused get_judge_endpoint_url/get_judge_served_model_name from helpers.py
- Add shell variable resolution for config_ef.yaml (auxiliary_deployments refs)
- Fix normalization shim to also migrate execution.mounts.{name}_deployment
- 137/137 tests pass

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
No more per-type Hydra config groups. Auxiliary deployments are configured
inline via auxiliary_deployments: {} in the user's run config.
The normalization shim still handles legacy top-level judge_deployment/
user_deployment keys for backward compatibility at the config level.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
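
Under this inline API, a config entry might be sketched as follows (hedged: only the `auxiliary_deployments` key itself is documented above; the `judge` entry name and its fields are invented for illustration):

```yaml
# Illustrative sketch: auxiliary_deployments is the documented key;
# the "judge" entry and its fields are invented for the example.
auxiliary_deployments:
  judge:
    type: vllm
    checkpoint_path: /checkpoints/judge
    port: 9000
    num_nodes: 1
```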
…n shim

These features were never merged to main, so backward compatibility code
is unnecessary. auxiliary_deployments is the only supported API.

- Remove normalize_auxiliary_deployments() from auxiliary_deployments.py
- Remove its call and import from executor.py
- Remove legacy mount validation fallback for judge/user_deployment keys
- Update tests to use auxiliary_deployments directly
- 137/137 tests pass

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
@wprazuch force-pushed the wprazuch/judge-deployment branch from d96608a to 7501aa4 on March 16, 2026 12:33
@copy-pr-bot

copy-pr-bot Bot commented Mar 16, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@wprazuch
Contributor Author

/ok to test 7501aa4

Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
@wprazuch
Contributor Author

/ok to test f3a3950

Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
@AdamRajfer
Contributor

/ok to test ff6156b

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
@AdamRajfer
Contributor

/ok to test 1014f2e

- Fix bug where service_name was passed as timeout positional arg in
  _generate_aux_haproxy_srun_command (caught by @agronskiy)
- Convert all _get_wait_for_server_handler calls to keyword arguments
- Merge duplicate auxiliary_deployments loops in helpers.py

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
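
The first fix above is an instance of a classic positional-argument hazard: a string intended for a later parameter silently binds to an earlier one. A minimal Python sketch of the bug class (all names here are illustrative, not the actual project code):

```python
# Hypothetical sketch of the bug class fixed in this commit: a helper with an
# optional timeout, where a positional call silently binds the service name
# into the timeout slot. Names and signature are invented for illustration.

def _get_wait_for_server_handler(host, port, timeout=300, service_name="server"):
    """Return a shell snippet that polls a health endpoint until it responds."""
    return (
        f"until curl -sf http://{host}:{port}/health; do sleep 5; done  "
        f"# wait up to {timeout}s for {service_name}"
    )

# Buggy: "judge" binds positionally to timeout, not service_name.
buggy = _get_wait_for_server_handler("localhost", 8000, "judge")

# Fixed: keyword arguments make the binding explicit and robust to
# signature changes -- the approach the commit applies to all call sites.
fixed = _get_wait_for_server_handler("localhost", 8000, service_name="judge")
```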
@AdamRajfer
Contributor

/ok to test d9fd198

Collaborator

@agronskiy left a comment


Thanks Voytek and Adam!

@AdamRajfer marked this pull request as ready for review March 18, 2026 13:39
@AdamRajfer requested review from a team as code owners March 18, 2026 13:39
@AdamRajfer merged commit 4c14381 into main Mar 18, 2026
48 checks passed
@AdamRajfer deleted the wprazuch/judge-deployment branch March 18, 2026 13:40