feat: deploy auxiliary endpoints #830

Merged
AdamRajfer merged 16 commits into main from wprazuch/judge-deployment
Mar 18, 2026
Conversation

@wprazuch
Contributor

No description provided.

@copy-pr-bot

copy-pr-bot Bot commented Mar 10, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

gchlebus and others added 11 commits March 16, 2026 13:04
Add judge_deployment config section that deploys a judge model alongside
the model under test, on dedicated SLURM nodes within the same allocation.

Config structure:
- judge_deployment.type: vllm | sglang | none (default: none)
- judge_deployment.image, checkpoint_path, served_model_name, port, etc.
- judge_deployment.num_nodes: number of nodes for the judge server
- execution.judge_deployment.n_tasks: srun task count for judge
- execution.mounts.judge_deployment: container mounts for judge

When judge_deployment.type != none:
- Total SLURM allocation = num_nodes (model) + judge_deployment.num_nodes
- Nodes are split: first N for model, remaining M for judge
- Model deployment uses --nodelist to stay on its assigned nodes
- Judge server starts, health-checks, then eval runs
- JUDGE_ENDPOINT_URL and JUDGE_MODEL_ID env vars are auto-exported
  to evaluation containers for downstream use
- Judge server is terminated after evaluation completes

Motivation: Ultra posttraining team needs to run 20+ checkpoints x 3-4
judge-dependent evals (arenahard, LCR, tau2, HLE, IFBench). NVCF judge
capacity becomes a bottleneck at this scale, especially for heavy judges
like DeepSeek-3.2 and Qwen-235B. On-cluster deployment eliminates the
external dependency.

Ref: feedback from Ultra evaluation meeting (2026-03-04)
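
Putting the keys above together, a run config enabling an on-cluster judge might look like this (an illustrative sketch only: the key names come from the description above, all values are invented):

```yaml
# Illustrative sketch -- key names from the commit message, values invented.
judge_deployment:
  type: vllm                       # vllm | sglang | none (default: none)
  image: vllm/vllm-openai:latest   # example container image
  checkpoint_path: /checkpoints/judge
  served_model_name: my-judge
  port: 9000
  num_nodes: 1                     # nodes reserved for the judge server

execution:
  judge_deployment:
    n_tasks: 8                     # srun task count for the judge
  mounts:
    judge_deployment: /checkpoints:/checkpoints
```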

Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
…ployment

Replace duplicated judge_deployment and user_deployment code with a generic
auxiliary_deployments system. Adding a new deployment type is now config-only.

- Add AuxDeploymentState dataclass and auxiliary_deployments.py module
- Add configs/auxiliary_deployment/ shared templates (vllm, none)
- Refactor executor.py: single _generate_auxiliary_deployment_srun_command()
  with haproxy/multi-instance support replaces 2 duplicated functions
- Normalization shim translates legacy judge_deployment/user_deployment keys
- Validation for duplicate ports, prefixes, required fields
- 137/137 tests pass

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
- Remove unused collect_judge/user_deployment_env_vars from env_vars.py
- Remove unused get_judge_endpoint_url/get_judge_served_model_name from helpers.py
- Add shell variable resolution for config_ef.yaml (auxiliary_deployments refs)
- Fix normalization shim to also migrate execution.mounts.{name}_deployment
- 137/137 tests pass

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
No more per-type Hydra config groups. Auxiliary deployments are configured
inline via auxiliary_deployments: {} in the user's run config.
The normalization shim still handles legacy top-level judge_deployment/
user_deployment keys for backward compatibility at the config level.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
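
Under this inline API, a config entry might be sketched as follows (hedged: only the `auxiliary_deployments` key itself is documented above; the `judge` entry name and its fields are invented for illustration):

```yaml
# Illustrative sketch: auxiliary_deployments is the documented key;
# the "judge" entry and its fields are invented for the example.
auxiliary_deployments:
  judge:
    type: vllm
    checkpoint_path: /checkpoints/judge
    port: 9000
    num_nodes: 1
```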
…n shim

These features were never merged to main, so backward compatibility code
is unnecessary. auxiliary_deployments is the only supported API.

- Remove normalize_auxiliary_deployments() from auxiliary_deployments.py
- Remove its call and import from executor.py
- Remove legacy mount validation fallback for judge/user_deployment keys
- Update tests to use auxiliary_deployments directly
- 137/137 tests pass

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
@wprazuch force-pushed the wprazuch/judge-deployment branch from d96608a to 7501aa4 on March 16, 2026 12:33
@copy-pr-bot

copy-pr-bot Bot commented Mar 16, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@wprazuch
Contributor Author

/ok to test 7501aa4

Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
@wprazuch
Contributor Author

/ok to test f3a3950

Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
@AdamRajfer
Contributor

/ok to test ff6156b

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
@AdamRajfer
Contributor

/ok to test 1014f2e

- Fix bug where service_name was passed as timeout positional arg in
  _generate_aux_haproxy_srun_command (caught by @agronskiy)
- Convert all _get_wait_for_server_handler calls to keyword arguments
- Merge duplicate auxiliary_deployments loops in helpers.py

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
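
The first fix above is an instance of a classic positional-argument hazard: a string intended for a later parameter silently binds to an earlier one. A minimal Python sketch of the bug class (all names here are illustrative, not the actual project code):

```python
# Hypothetical sketch of the bug class fixed in this commit: a helper with an
# optional timeout, where a positional call silently binds the service name
# into the timeout slot. Names and signature are invented for illustration.

def _get_wait_for_server_handler(host, port, timeout=300, service_name="server"):
    """Return a shell snippet that polls a health endpoint until it responds."""
    return (
        f"until curl -sf http://{host}:{port}/health; do sleep 5; done  "
        f"# wait up to {timeout}s for {service_name}"
    )

# Buggy: "judge" binds positionally to timeout, not service_name.
buggy = _get_wait_for_server_handler("localhost", 8000, "judge")

# Fixed: keyword arguments make the binding explicit and robust to
# signature changes -- the approach the commit applies to all call sites.
fixed = _get_wait_for_server_handler("localhost", 8000, service_name="judge")
```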
@AdamRajfer
Contributor

/ok to test d9fd198

Collaborator

@agronskiy left a comment


Thanks Voytek and Adam!

@AdamRajfer marked this pull request as ready for review March 18, 2026 13:39
@AdamRajfer requested review from a team as code owners March 18, 2026 13:39
@AdamRajfer merged commit 4c14381 into main Mar 18, 2026
48 checks passed
@AdamRajfer deleted the wprazuch/judge-deployment branch March 18, 2026 13:40