
feat(launcher): add judge deployment support for on-cluster judge model serving#802

Closed
gchlebus wants to merge 1 commit into main from gchlebus/judge-deployment

Conversation


@gchlebus gchlebus commented Mar 5, 2026

Summary

Add a judge_deployment config section to the launcher that deploys a judge model on dedicated SLURM nodes within the same allocation as the model under test. This eliminates the dependency on external judge APIs (e.g. NVCF) for judge-dependent evaluations.

Motivation

The Ultra posttraining team needs to run 20+ checkpoints × 3-4 judge-dependent evals (ArenaHard, LCR, Tau2, HLE, IFBench). At this scale, NVCF judge capacity becomes a bottleneck — especially for heavy judges like DeepSeek-3.2 and Qwen-235B. On-cluster deployment removes that dependency entirely.

Feedback collected during the Ultra evaluation meeting (2026-03-04).

What changed

New config section: judge_deployment

```yaml
judge_deployment:
  type: vllm          # vllm | sglang | none (default: none)
  image: vllm/vllm-openai:v0.16.0
  checkpoint_path: /path/to/model
  hf_model_handle: Qwen/Qwen2.5-1.5B-Instruct
  served_model_name: Qwen/Qwen2.5-1.5B-Instruct
  tensor_parallel_size: 1
  data_parallel_size: 8
  num_nodes: 1
  port: 8001
  env_vars:
    HF_TOKEN: host:HF_TOKEN
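
The `host:HF_TOKEN` value forwards the submitting host's `$HF_TOKEN` into the judge container. A minimal sketch of how such a convention can be resolved — `resolve_env_vars` is a hypothetical helper name, not necessarily the one used in `common/env_vars.py`:

```python
import os

def resolve_env_vars(env_vars: dict) -> dict:
    """Resolve 'host:<NAME>' values from the launcher host's environment.

    Illustrative only: a plain value is passed through verbatim, while
    'host:HF_TOKEN' is replaced with the value of $HF_TOKEN on the
    submitting host at launch time.
    """
    resolved = {}
    for key, value in env_vars.items():
        if isinstance(value, str) and value.startswith("host:"):
            # Look up the named variable on the host; empty if unset.
            resolved[key] = os.environ.get(value[len("host:"):], "")
        else:
            resolved[key] = value
    return resolved
```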

How it works

  1. Total SLURM allocation = num_nodes (model) + judge_deployment.num_nodes
  2. Nodes are split: first N for model deployment, remaining M for judge
  3. Model deployment uses --nodelist to stay on assigned nodes
  4. Judge server starts on its dedicated nodes, health-checks pass
  5. JUDGE_ENDPOINT_URL and JUDGE_MODEL_ID env vars are auto-exported to evaluation containers
  6. Evaluation tasks can reference these via pre_cmd to patch their configs at runtime
  7. Judge server is terminated after evaluation completes
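
Steps 2 and 5 can be sketched in a few lines — hypothetical helper names for illustration; the real logic lives in `executors/slurm/executor.py`:

```python
def split_nodes(nodelist: list[str], model_nodes: int, judge_nodes: int):
    """Split the SLURM allocation: the first N nodes host the model
    under test, the remaining M host the judge (step 2 above)."""
    assert len(nodelist) == model_nodes + judge_nodes, "allocation size mismatch"
    return nodelist[:model_nodes], nodelist[model_nodes:]

def judge_env_vars(judge_host: str, port: int, served_model_name: str) -> dict:
    """Env vars auto-exported to evaluation containers (step 5)."""
    return {
        "JUDGE_ENDPOINT_URL": f"http://{judge_host}:{port}/v1/chat/completions",
        "JUDGE_MODEL_ID": served_model_name,
    }
```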

Files changed (11 files, +888/-33)

  • configs/judge_deployment/none.yaml — no-op default
  • configs/judge_deployment/vllm.yaml — vLLM judge config
  • configs/judge_deployment/sglang.yaml — SGLang judge config
  • configs/default.yaml — added judge_deployment default
  • configs/execution/slurm/default.yaml — judge_deployment mounts + n_tasks
  • executors/slurm/executor.py — core implementation (node splitting, srun commands, health checks, env var export)
  • common/env_vars.py — judge deployment env var handling
  • common/helpers.py — helper utilities
  • tests/unit_tests/test_slurm_executor.py — 348 lines of new tests

E2E test results

CW-DFW — MT-Bench with on-cluster judge ✅ (2026-03-05)

Invocation 1255ed21432f2ab4 | Slurm job 9814265 | Cluster CW-DFW | 2 nodes (1 model + 1 judge)

| Phase | Status | Details |
| --- | --- | --- |
| Model deployment | ✅ | Llama-3.1-8B-Instruct on node 1, port 8000, DP=8 |
| Judge deployment | ✅ | Llama-3.1-8B-Instruct on node 2, port 8001, vLLM v0.16.0, health check passed |
| JUDGE_ENDPOINT_URL export | ✅ | http://pool0-01461:8001/v1/chat/completions |
| JUDGE_MODEL_ID export | ✅ | meta-llama/Llama-3.1-8B-Instruct |
| pre_cmd config patching | ✅ | __JUDGE_BASE_URL__ and __JUDGE_MODEL_ID__ replaced in config_ef.yaml |
| Generation (20 samples × 2 turns) | ✅ | 40/40 requests, all HTTP 200, ~9 min inference time |
| Judging (20 questions × 2 turns) | ✅ | 40/40 judgements scored, ~67 s judging time |
| Artifacts | ✅ | report.html, report.json, judgements JSONL, response stats all written |

Stats: avg 656 tokens/response, 40 successful responses, 0 failures. Scores ranged 2–9 across both single-turn and multi-turn judgements.

Note: The Slurm job exited with code 1 due to a pre-existing mtbench container bug — the output parser looks for result.json at root level but the mtbench container writes it to mtbench/<model>/first_20/<model>-result.json. This is unrelated to judge deployment — all generation, judging, and artifact output completed successfully.

Earlier CW-DFW validation runs

| Run | Invocation | What was validated |
| --- | --- | --- |
| Attempt 3 | 33bf1b53927ad35d | Judge deploys ✅, health check ✅, env vars ✅, pre_cmd ✅, generation completes ✅. Failed: walltime too short (30 min) |
| Attempt 4 | c6ce965c44d14482 | Killed manually to move testing to CW-PDX |

Unit tests

348 lines of new tests covering:

  • Node splitting logic (model vs judge nodes)
  • Judge deployment srun command construction
  • Environment variable generation and export
  • Health check configuration with separate PID tracking
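
The health-check behavior tested above boils down to polling the judge server until it answers. A minimal sketch, assuming an HTTP health endpoint and hypothetical timeout defaults (the actual endpoint path and timeouts are implementation details of the executor):

```python
import time
import urllib.request

def wait_for_judge(url: str, timeout_s: float = 600.0, interval_s: float = 5.0) -> bool:
    """Poll the judge server's health URL until it returns HTTP 200
    or the timeout expires. Returns True on success, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval_s) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # Server not up yet (connection refused, DNS, timeout): retry.
            pass
        time.sleep(interval_s)
    return False
```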

Example usage

```yaml
# MT-Bench with on-cluster judge
defaults:
  - execution: slurm
  - deployment: vllm
  - judge_deployment: vllm
  - _self_

execution:
  num_nodes: 1          # for model under test
  walltime: 01:00:00

deployment:
  hf_model_handle: meta-llama/Llama-3.1-8B-Instruct
  tensor_parallel_size: 1
  data_parallel_size: 8

judge_deployment:
  hf_model_handle: Qwen/Qwen2.5-72B-Instruct
  tensor_parallel_size: 4
  data_parallel_size: 2
  num_nodes: 1

evaluation:
  tasks:
    - name: mtbench.mtbench
      pre_cmd: >-
        sed -i "s|__JUDGE_BASE_URL__|${JUDGE_ENDPOINT_URL%/chat/completions}|g" config_ef.yaml &&
        sed -i "s|__JUDGE_MODEL_ID__|${JUDGE_MODEL_ID}|g" config_ef.yaml
```
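
The `${JUDGE_ENDPOINT_URL%/chat/completions}` shell expansion in `pre_cmd` strips the trailing path segment so tasks that expect an OpenAI-style base URL get `.../v1`. The equivalent transformation, shown in Python for clarity (the URL value mirrors the E2E run above):

```python
# Strip the trailing "/chat/completions" segment, as the pre_cmd's
# ${JUDGE_ENDPOINT_URL%/chat/completions} parameter expansion does.
url = "http://pool0-01461:8001/v1/chat/completions"
base = url.removesuffix("/chat/completions")
assert base == "http://pool0-01461:8001/v1"
```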

TODO before merge

  • Full E2E completion on CW-DFW (generation + judging ✅)
  • Multi-node judge deployment test (TP>1 across nodes)
  • Documentation update

Add judge_deployment config section that deploys a judge model alongside
the model under test, on dedicated SLURM nodes within the same allocation.

Config structure:
- judge_deployment.type: vllm | sglang | none (default: none)
- judge_deployment.image, checkpoint_path, served_model_name, port, etc.
- judge_deployment.num_nodes: number of nodes for the judge server
- execution.judge_deployment.n_tasks: srun task count for judge
- execution.mounts.judge_deployment: container mounts for judge

When judge_deployment.type != none:
- Total SLURM allocation = num_nodes (model) + judge_deployment.num_nodes
- Nodes are split: first N for model, remaining M for judge
- Model deployment uses --nodelist to stay on its assigned nodes
- Judge server starts, health-checks, then eval runs
- JUDGE_ENDPOINT_URL and JUDGE_MODEL_ID env vars are auto-exported
  to evaluation containers for downstream use
- Judge server is terminated after evaluation completes

Motivation: Ultra posttraining team needs to run 20+ checkpoints x 3-4
judge-dependent evals (arenahard, LCR, tau2, HLE, IFBench). NVCF judge
capacity becomes a bottleneck at this scale, especially for heavy judges
like DeepSeek-3.2 and Qwen-235B. On-cluster deployment eliminates the
external dependency.

Ref: feedback from Ultra evaluation meeting (2026-03-04)

copy-pr-bot Bot commented Mar 5, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


marta-sd commented Apr 9, 2026

Feature introduced in #830

@marta-sd marta-sd closed this Apr 9, 2026