feat(launcher): add judge deployment support for on-cluster judge model serving #802
Closed
Conversation
Add a `judge_deployment` config section that deploys a judge model alongside the model under test, on dedicated SLURM nodes within the same allocation.

Config structure:
- `judge_deployment.type`: vllm | sglang | none (default: none)
- `judge_deployment.image`, `checkpoint_path`, `served_model_name`, `port`, etc.
- `judge_deployment.num_nodes`: number of nodes for the judge server
- `execution.judge_deployment.n_tasks`: srun task count for the judge
- `execution.mounts.judge_deployment`: container mounts for the judge

When `judge_deployment.type != none`:
- Total SLURM allocation = `num_nodes` (model) + `judge_deployment.num_nodes`
- Nodes are split: the first N for the model, the remaining M for the judge
- Model deployment uses `--nodelist` to stay on its assigned nodes
- The judge server starts and passes a health check, then the eval runs
- `JUDGE_ENDPOINT_URL` and `JUDGE_MODEL_ID` env vars are auto-exported to evaluation containers for downstream use
- The judge server is terminated after evaluation completes

Motivation: the Ultra posttraining team needs to run 20+ checkpoints × 3-4 judge-dependent evals (ArenaHard, LCR, Tau2, HLE, IFBench). NVCF judge capacity becomes a bottleneck at this scale, especially for heavy judges like DeepSeek-3.2 and Qwen-235B. On-cluster deployment eliminates the external dependency.

Ref: feedback from the Ultra evaluation meeting (2026-03-04)
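The node split described above can be sketched as follows. `split_nodelist` is an illustrative name, not the launcher's actual helper:

```python
# Hypothetical sketch of the node split: the first N allocated nodes go to
# the model under test, the remaining M to the judge server.
# split_nodelist is an illustrative name, not the launcher's actual helper.
def split_nodelist(nodes, model_num_nodes, judge_num_nodes):
    expected = model_num_nodes + judge_num_nodes
    if len(nodes) != expected:
        raise ValueError(f"allocation has {len(nodes)} nodes, expected {expected}")
    return nodes[:model_num_nodes], nodes[model_num_nodes:]


model_nodes, judge_nodes = split_nodelist(
    ["pool0-01460", "pool0-01461"], model_num_nodes=1, judge_num_nodes=1
)
print(model_nodes, judge_nodes)  # ['pool0-01460'] ['pool0-01461']
```

The model's srun would then receive `--nodelist` built from `model_nodes`, keeping it off the judge's nodes.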
Contributor

Feature introduced in #830
Summary

Add a `judge_deployment` config section to the launcher that deploys a judge model on dedicated SLURM nodes within the same allocation as the model under test. This eliminates the dependency on external judge APIs (e.g. NVCF) for judge-dependent evaluations.

Motivation
The Ultra posttraining team needs to run 20+ checkpoints × 3-4 judge-dependent evals (ArenaHard, LCR, Tau2, HLE, IFBench). At this scale, NVCF judge capacity becomes a bottleneck — especially for heavy judges like DeepSeek-3.2 and Qwen-235B. On-cluster deployment removes that dependency entirely.
Feedback collected during the Ultra evaluation meeting (2026-03-04).
What changed

New config section: `judge_deployment`

How it works
- Total SLURM allocation = `num_nodes` (model) + `judge_deployment.num_nodes`
- Model deployment uses `--nodelist` to stay on assigned nodes
- `JUDGE_ENDPOINT_URL` and `JUDGE_MODEL_ID` env vars are auto-exported to evaluation containers
- Evals use `pre_cmd` to patch their configs at runtime

Files changed (11 files, +888/-33)

- `configs/judge_deployment/none.yaml` — no-op default
- `configs/judge_deployment/vllm.yaml` — vLLM judge config
- `configs/judge_deployment/sglang.yaml` — SGLang judge config
- `configs/default.yaml` — added judge_deployment default
- `configs/execution/slurm/default.yaml` — judge_deployment mounts + n_tasks
- `executors/slurm/executor.py` — core implementation (node splitting, srun commands, health checks, env var export)
- `common/env_vars.py` — judge deployment env var handling
- `common/helpers.py` — helper utilities
- `tests/unit_tests/test_slurm_executor.py` — 348 lines of new tests

E2E test results
CW-DFW — MT-Bench with on-cluster judge ✅ (2026-03-05)
Invocation `1255ed21432f2ab4` | Slurm job `9814265` | Cluster CW-DFW | 2 nodes (1 model + 1 judge)

- `JUDGE_ENDPOINT_URL` export: `http://pool0-01461:8001/v1/chat/completions`
- `JUDGE_MODEL_ID` export: `meta-llama/Llama-3.1-8B-Instruct`
- `pre_cmd` config patching: `__JUDGE_BASE_URL__` and `__JUDGE_MODEL_ID__` replaced in `config_ef.yaml`
- `report.html`, `report.json`, judgements JSONL, and response stats all written

Stats: avg 656 tokens/response, 40 successful responses, 0 failures. Scores ranged 2–9 across both single-turn and multi-turn judgements.
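The `pre_cmd` patching step above can be sketched in Python. The placeholder names and file name come from this run; `patch_eval_config` itself is a hypothetical helper, not the launcher's actual code:

```python
import os
from pathlib import Path

# Illustrative sketch of what the pre_cmd substitution does: replace the
# __JUDGE_BASE_URL__ / __JUDGE_MODEL_ID__ placeholders in an eval config
# with the values the launcher exported. patch_eval_config is a
# hypothetical name, not the launcher's actual helper.
def patch_eval_config(path: str, judge_url: str, judge_model: str) -> None:
    cfg = Path(path)
    text = cfg.read_text()
    text = text.replace("__JUDGE_BASE_URL__", judge_url)
    text = text.replace("__JUDGE_MODEL_ID__", judge_model)
    cfg.write_text(text)

# Inside the evaluation container, the auto-exported env vars would feed it:
# patch_eval_config(
#     "config_ef.yaml",
#     os.environ["JUDGE_ENDPOINT_URL"],
#     os.environ["JUDGE_MODEL_ID"],
# )
```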
Earlier CW-DFW validation runs

Invocations `33bf1b53927ad35d`, `c6ce965c44d14482`
348 lines of new tests covering:
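One behavior in scope here is the judge health check. A simple poll loop like the one below illustrates the idea; the `/health` path, timeouts, and function name are assumptions, not the executor's actual settings:

```python
import time
import urllib.error
import urllib.request

# Illustrative poll loop for the judge health check: probe the server
# until it answers 200 or the deadline passes. The timeout values and
# function name are assumptions, not the executor's actual settings.
def wait_for_judge(health_url: str, timeout_s: float = 600.0,
                   poll_s: float = 10.0) -> bool:
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(health_url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; retry after a short sleep
        time.sleep(poll_s)
    return False
```

vLLM and SGLang both expose a health endpoint on the serving port, so a call such as `wait_for_judge("http://pool0-01461:8001/health")` would gate the eval launch.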
Example usage
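A minimal config sketch assembled from the keys named in this PR; all values are illustrative assumptions, not defaults shipped by the launcher:

```yaml
# Sketch only: key names come from the PR description, values are examples.
judge_deployment:
  type: vllm                       # vllm | sglang | none (default: none)
  image: vllm/vllm-openai:latest   # assumed image
  checkpoint_path: /checkpoints/judge-model
  served_model_name: meta-llama/Llama-3.1-8B-Instruct
  port: 8001
  num_nodes: 1

execution:
  judge_deployment:
    n_tasks: 1                     # srun task count for the judge
  mounts:
    judge_deployment:
      - /lustre/checkpoints:/checkpoints   # assumed mount format
```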
TODO before merge