
feat(launcher): add judge deployment support for on-cluster judge model serving#802

Closed
gchlebus wants to merge 1 commit into main from gchlebus/judge-deployment

Conversation


@gchlebus gchlebus commented Mar 5, 2026

Summary

Add a judge_deployment config section to the launcher that deploys a judge model on dedicated SLURM nodes within the same allocation as the model under test. This eliminates the dependency on external judge APIs (e.g. NVCF) for judge-dependent evaluations.

Motivation

The Ultra posttraining team needs to run 20+ checkpoints × 3-4 judge-dependent evals (ArenaHard, LCR, Tau2, HLE, IFBench). At this scale, NVCF judge capacity becomes a bottleneck — especially for heavy judges like DeepSeek-3.2 and Qwen-235B. On-cluster deployment removes that dependency entirely.

Feedback collected during the Ultra evaluation meeting (2026-03-04).

What changed

New config section: judge_deployment

```yaml
judge_deployment:
  type: vllm          # vllm | sglang | none (default: none)
  image: vllm/vllm-openai:v0.16.0
  checkpoint_path: /path/to/model
  hf_model_handle: Qwen/Qwen2.5-1.5B-Instruct
  served_model_name: Qwen/Qwen2.5-1.5B-Instruct
  tensor_parallel_size: 1
  data_parallel_size: 8
  num_nodes: 1
  port: 8001
  env_vars:
    HF_TOKEN: host:HF_TOKEN
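
The `host:HF_TOKEN` value forwards the submitting host's `$HF_TOKEN` into the judge container. A minimal sketch of how such a convention can be resolved — `resolve_env_vars` is a hypothetical helper name, not necessarily the one used in `common/env_vars.py`:

```python
import os

def resolve_env_vars(env_vars: dict) -> dict:
    """Resolve 'host:<NAME>' values from the launcher host's environment.

    Illustrative only: a plain value is passed through verbatim, while
    'host:HF_TOKEN' is replaced with the value of $HF_TOKEN on the
    submitting host at launch time.
    """
    resolved = {}
    for key, value in env_vars.items():
        if isinstance(value, str) and value.startswith("host:"):
            # Look up the named variable on the host; empty if unset.
            resolved[key] = os.environ.get(value[len("host:"):], "")
        else:
            resolved[key] = value
    return resolved
```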

How it works

  1. Total SLURM allocation = num_nodes (model) + judge_deployment.num_nodes
  2. Nodes are split: first N for model deployment, remaining M for judge
  3. Model deployment uses --nodelist to stay on assigned nodes
  4. Judge server starts on its dedicated nodes, health-checks pass
  5. JUDGE_ENDPOINT_URL and JUDGE_MODEL_ID env vars are auto-exported to evaluation containers
  6. Evaluation tasks can reference these via pre_cmd to patch their configs at runtime
  7. Judge server is terminated after evaluation completes
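
Steps 2 and 5 can be sketched in a few lines — hypothetical helper names for illustration; the real logic lives in `executors/slurm/executor.py`:

```python
def split_nodes(nodelist: list[str], model_nodes: int, judge_nodes: int):
    """Split the SLURM allocation: the first N nodes host the model
    under test, the remaining M host the judge (step 2 above)."""
    assert len(nodelist) == model_nodes + judge_nodes, "allocation size mismatch"
    return nodelist[:model_nodes], nodelist[model_nodes:]

def judge_env_vars(judge_host: str, port: int, served_model_name: str) -> dict:
    """Env vars auto-exported to evaluation containers (step 5)."""
    return {
        "JUDGE_ENDPOINT_URL": f"http://{judge_host}:{port}/v1/chat/completions",
        "JUDGE_MODEL_ID": served_model_name,
    }
```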

Files changed (11 files, +888/-33)

  • configs/judge_deployment/none.yaml — no-op default
  • configs/judge_deployment/vllm.yaml — vLLM judge config
  • configs/judge_deployment/sglang.yaml — SGLang judge config
  • configs/default.yaml — added judge_deployment default
  • configs/execution/slurm/default.yaml — judge_deployment mounts + n_tasks
  • executors/slurm/executor.py — core implementation (node splitting, srun commands, health checks, env var export)
  • common/env_vars.py — judge deployment env var handling
  • common/helpers.py — helper utilities
  • tests/unit_tests/test_slurm_executor.py — 348 lines of new tests

E2E test results

CW-DFW — MT-Bench with on-cluster judge ✅ (2026-03-05)

Invocation 1255ed21432f2ab4 | Slurm job 9814265 | Cluster CW-DFW | 2 nodes (1 model + 1 judge)

| Phase | Status | Details |
| --- | --- | --- |
| Model deployment | ✅ | Llama-3.1-8B-Instruct on node 1, port 8000, DP=8 |
| Judge deployment | ✅ | Llama-3.1-8B-Instruct on node 2, port 8001, vLLM v0.16.0, health check passed |
| JUDGE_ENDPOINT_URL export | ✅ | http://pool0-01461:8001/v1/chat/completions |
| JUDGE_MODEL_ID export | ✅ | meta-llama/Llama-3.1-8B-Instruct |
| pre_cmd config patching | ✅ | __JUDGE_BASE_URL__ and __JUDGE_MODEL_ID__ replaced in config_ef.yaml |
| Generation (20 samples × 2 turns) | ✅ | 40/40 requests, all HTTP 200, ~9 min inference time |
| Judging (20 questions × 2 turns) | ✅ | 40/40 judgements scored, ~67 s judging time |
| Artifacts | ✅ | report.html, report.json, judgements JSONL, response stats all written |

Stats: avg 656 tokens/response, 40 successful responses, 0 failures. Scores ranged 2–9 across both single-turn and multi-turn judgements.

Note: The Slurm job exited with code 1 due to a pre-existing mtbench container bug — the output parser looks for result.json at root level but the mtbench container writes it to mtbench/<model>/first_20/<model>-result.json. This is unrelated to judge deployment — all generation, judging, and artifact output completed successfully.

Earlier CW-DFW validation runs

| Run | Invocation | What was validated |
| --- | --- | --- |
| Attempt 3 | 33bf1b53927ad35d | Judge deploys ✅, health check ✅, env vars ✅, pre_cmd ✅, generation completes ✅. Failed: walltime too short (30 min) |
| Attempt 4 | c6ce965c44d14482 | Killed manually to move testing to CW-PDX |

Unit tests

348 lines of new tests covering:

  • Node splitting logic (model vs judge nodes)
  • Judge deployment srun command construction
  • Environment variable generation and export
  • Health check configuration with separate PID tracking
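
The health-check behavior tested above boils down to polling the judge server until it answers. A minimal sketch, assuming an HTTP health endpoint and hypothetical timeout defaults (the actual endpoint path and timeouts are implementation details of the executor):

```python
import time
import urllib.request

def wait_for_judge(url: str, timeout_s: float = 600.0, interval_s: float = 5.0) -> bool:
    """Poll the judge server's health URL until it returns HTTP 200
    or the timeout expires. Returns True on success, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval_s) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # Server not up yet (connection refused, DNS, timeout): retry.
            pass
        time.sleep(interval_s)
    return False
```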

Example usage

```yaml
# MT-Bench with on-cluster judge
defaults:
  - execution: slurm
  - deployment: vllm
  - judge_deployment: vllm
  - _self_

execution:
  num_nodes: 1          # for model under test
  walltime: 01:00:00

deployment:
  hf_model_handle: meta-llama/Llama-3.1-8B-Instruct
  tensor_parallel_size: 1
  data_parallel_size: 8

judge_deployment:
  hf_model_handle: Qwen/Qwen2.5-72B-Instruct
  tensor_parallel_size: 4
  data_parallel_size: 2
  num_nodes: 1

evaluation:
  tasks:
    - name: mtbench.mtbench
      pre_cmd: >-
        sed -i "s|__JUDGE_BASE_URL__|${JUDGE_ENDPOINT_URL%/chat/completions}|g" config_ef.yaml &&
        sed -i "s|__JUDGE_MODEL_ID__|${JUDGE_MODEL_ID}|g" config_ef.yaml
```
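
The `${JUDGE_ENDPOINT_URL%/chat/completions}` shell expansion in `pre_cmd` strips the trailing path segment so tasks that expect an OpenAI-style base URL get `.../v1`. The equivalent transformation, shown in Python for clarity (the URL value mirrors the E2E run above):

```python
# Strip the trailing "/chat/completions" segment, as the pre_cmd's
# ${JUDGE_ENDPOINT_URL%/chat/completions} parameter expansion does.
url = "http://pool0-01461:8001/v1/chat/completions"
base = url.removesuffix("/chat/completions")
assert base == "http://pool0-01461:8001/v1"
```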

TODO before merge

  • Full E2E completion on CW-DFW (generation + judging ✅)
  • Multi-node judge deployment test (TP>1 across nodes)
  • Documentation update

Add judge_deployment config section that deploys a judge model alongside
the model under test, on dedicated SLURM nodes within the same allocation.

Config structure:
- judge_deployment.type: vllm | sglang | none (default: none)
- judge_deployment.image, checkpoint_path, served_model_name, port, etc.
- judge_deployment.num_nodes: number of nodes for the judge server
- execution.judge_deployment.n_tasks: srun task count for judge
- execution.mounts.judge_deployment: container mounts for judge

When judge_deployment.type != none:
- Total SLURM allocation = num_nodes (model) + judge_deployment.num_nodes
- Nodes are split: first N for model, remaining M for judge
- Model deployment uses --nodelist to stay on its assigned nodes
- Judge server starts, health-checks, then eval runs
- JUDGE_ENDPOINT_URL and JUDGE_MODEL_ID env vars are auto-exported
  to evaluation containers for downstream use
- Judge server is terminated after evaluation completes

Motivation: Ultra posttraining team needs to run 20+ checkpoints x 3-4
judge-dependent evals (arenahard, LCR, tau2, HLE, IFBench). NVCF judge
capacity becomes a bottleneck at this scale, especially for heavy judges
like DeepSeek-3.2 and Qwen-235B. On-cluster deployment eliminates the
external dependency.

Ref: feedback from Ultra evaluation meeting (2026-03-04)

copy-pr-bot Bot commented Mar 5, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


marta-sd commented Apr 9, 2026

Feature introduced in #830

@marta-sd marta-sd closed this Apr 9, 2026