
Commit 7e15601

Add lm-eval benchmark runner for InferenceX evals (#12) (#122)
Adds support for running lm-eval accuracy evaluations as a post-benchmark step, leveraging the InferenceX benchmark_lib.sh harness.

Co-authored-by: Bryan Shan <58582368+Oseltamivir@users.noreply.github.com>
1 parent a84badf commit 7e15601

8 files changed

Lines changed: 891 additions & 3 deletions


docs/accuracy.md

Lines changed: 84 additions & 1 deletion
@@ -1,6 +1,6 @@
 # Accuracy Benchmarks

-In srt-slurm, users can run different accuracy benchmarks by setting the benchmark section in the config yaml file. Supported benchmarks include `mmlu`, `gpqa`, `longbenchv2`, and AIME (via the script under `configs/aime/`).
+In srt-slurm, users can run different accuracy benchmarks by setting the benchmark section in the config yaml file. Supported benchmarks include `mmlu`, `gpqa`, `longbenchv2`, `lm-eval`, and AIME (via the script under `configs/aime/`).

 ## Table of Contents

@@ -16,6 +16,7 @@ In srt-slurm, users can run different accuracy benchmarks by setting the benchma
 - [Example: Quick Validation](#example-quick-validation)
 - [Output](#output)
 - [Important Notes](#important-notes)
+- [lm-eval (InferenceX)](#lm-eval-inferencex)

 ---

@@ -290,3 +291,85 @@ The output includes per-category scores and aggregate metrics:
3. **Throughput**: Increase `num_threads` for faster evaluation, but monitor for OOM errors
4. **Categories**: Running specific categories is useful for targeted validation (e.g., just testing summarization capabilities)

## lm-eval (InferenceX)

The `lm-eval` benchmark runner integrates [EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) via InferenceX's `benchmark_lib.sh`. Unlike the built-in benchmarks above, this runner sources evaluation logic from an external InferenceX workspace mounted at `/infmax-workspace`.
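Stripped of setup and artifact handling, the runner is a thin wrapper around that harness. A minimal sketch of the core calls, assuming the frontend listens on port 8000 (the port used by the post-eval health check):

```bash
# Minimal sketch of the runner's core flow; model-name discovery, metadata
# exports, and artifact copying are omitted.
source /infmax-workspace/benchmarks/benchmark_lib.sh   # InferenceX harness
run_eval --framework lm-eval --port 8000               # run the lm-eval tasks
append_lm_eval_summary                                 # write the summary table
```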
This is used by InferenceX CI to run evals such as GSM8K and GPQA against NVIDIA multi-node disaggregated deployments on GB200, GB300, B200, B300, H100, and H200. AMD MI355X multi-node evals are handled by InferenceX's upstreamed AMD Slurm path, not by this srt-slurm runner.

In InferenceX CI, recipes normally keep their throughput benchmark configuration. `do_sweep.py` invokes the registered `lm-eval` runner as a post-step when `RUN_EVAL=true`, or as the only benchmark-like step when `EVAL_ONLY=true`. There is no separate `infmax-eval` benchmark type.
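As a hedged sketch, mode selection is purely environment-driven; the values below are illustrative and are normally set by the InferenceX workflow rather than by hand:

```bash
# Run eval as a post-step after the normal throughput benchmark:
export RUN_EVAL=true

# ...or run accuracy validation only, skipping the throughput stage:
export EVAL_ONLY=true
```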
### How it works

1. `RuntimeContext` mounts the host path from `INFMAX_WORKSPACE` at `/infmax-workspace` inside the Slurm container.
2. `do_sweep.py` starts infrastructure, workers, and the frontend for the normal recipe topology.
3. For `EVAL_ONLY=true`, `do_sweep.py` skips the throughput benchmark stage and runs `_run_post_eval()` directly after frontend startup.
4. `_run_post_eval()` waits for the OpenAI-compatible endpoint on port 8000 and, in eval-only mode, performs the full `wait_for_model()` health check for the configured prefill/decode or aggregated topology.
5. `_run_post_eval()` launches the registered `lm-eval` runner on the head node and passes through InferenceX metadata such as framework, precision, sequence length, prefill/decode topology, and eval concurrency.
6. The runner script (`benchmarks/scripts/lm-eval/bench.sh`) uses `MODEL_NAME` from `do_sweep.py`, or auto-discovers the served model from `/v1/models` as a fallback (see the sketch after this list).
7. The runner sources `/infmax-workspace/benchmarks/benchmark_lib.sh`, runs `run_eval --framework lm-eval`, and calls `append_lm_eval_summary`.
8. Eval artifacts are copied to `/logs/eval_results/` for InferenceX launcher-side artifact pickup.
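The auto-discovery fallback mentioned in step 6 is a plain query against the endpoint's model listing, as implemented in `bench.sh`:

```bash
# Take the id of the first model served by the OpenAI-compatible endpoint.
curl -sf http://localhost:8000/v1/models \
    | python3 -c "import sys, json; print(json.load(sys.stdin)['data'][0]['id'])"
```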
### EVAL_ONLY mode

srt-slurm supports an `EVAL_ONLY` mode for CI jobs that should only validate accuracy. This is controlled by environment variables from the InferenceX workflow:

| Env var | Description |
|---------|-------------|
| `EVAL_ONLY` | Set to `true` to skip the throughput benchmark stage and run eval only |
| `RUN_EVAL` | Set to `true` to run eval after the throughput benchmark completes |
| `EVAL_CONC` | Concurrent requests for lm-eval, normally set by InferenceX from the generated `eval-conc` value |
| `INFMAX_WORKSPACE` | Host path to the InferenceX checkout that should be mounted at `/infmax-workspace` |
| `MODEL_NAME` | Served model alias for OpenAI-compatible requests; set by `do_sweep.py` from `config.served_model_name` |
When `EVAL_ONLY=true`:

- Stage 4 skips the throughput benchmark entirely. No throughput result JSON is expected from srt-slurm.
- The eval path uses the full `wait_for_model()` health check before starting lm-eval.
- `_run_post_eval()` launches the `lm-eval` runner and returns its exit code.
- Eval failure is fatal because eval is the only purpose of the job.

When `RUN_EVAL=true` (without `EVAL_ONLY`):

- Throughput benchmark runs normally.
- After the benchmark completes successfully, eval runs as a post-step.
- Eval failure is non-fatal; the benchmark job still succeeds if throughput passed.
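A minimal sketch of that failure-handling rule (illustrative only; the real logic lives in `do_sweep.py`):

```bash
# Illustrative only -- not the actual do_sweep.py code.
# eval_rc holds the lm-eval runner's exit code.
if [[ "${EVAL_ONLY:-false}" == "true" ]]; then
    exit "$eval_rc"                 # eval-only: an eval failure fails the job
elif [[ "$eval_rc" -ne 0 ]]; then
    echo "WARNING: eval failed; keeping the successful throughput result"
fi
```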
### Environment variables

The following env vars are passed through to the lm-eval runner container:

| Env var | Purpose |
|---------|---------|
| `RUN_EVAL`, `EVAL_ONLY`, `IS_MULTINODE` | Control whether eval runs and how InferenceX classifies the artifact |
| `FRAMEWORK`, `PRECISION`, `MODEL_PREFIX`, `RUNNER_TYPE`, `SPEC_DECODING` | Benchmark identity metadata for `meta_env.json` |
| `ISL`, `OSL`, `RESULT_FILENAME` | Sequence length and result-file metadata |
| `MODEL`, `MODEL_PATH`, `MODEL_NAME` | Model metadata and the served model alias used for requests |
| `MAX_MODEL_LEN`, `EVAL_MAX_MODEL_LEN` | Context-length metadata used by InferenceX eval helpers when available |
| `PREFILL_TP`, `PREFILL_EP`, `PREFILL_NUM_WORKERS`, `PREFILL_DP_ATTN` | Prefill-side topology metadata |
| `DECODE_TP`, `DECODE_EP`, `DECODE_NUM_WORKERS`, `DECODE_DP_ATTN` | Decode-side topology metadata |
| `EVAL_CONC`, `EVAL_CONCURRENT_REQUESTS` | Eval concurrency controls |

The runner maps srt-slurm's `PREFILL_DP_ATTN` and `DECODE_DP_ATTN` names to InferenceX's `PREFILL_DP_ATTENTION` and `DECODE_DP_ATTENTION` names before calling `append_lm_eval_summary`. This is required for multi-node summary tables to preserve prefill/decode DPA state.
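Concretely, the remapping is a pair of exports in `bench.sh`:

```bash
# Remap srt-slurm's DP_ATTN names to InferenceX's DP_ATTENTION names
export PREFILL_DP_ATTENTION="${PREFILL_DP_ATTENTION:-${PREFILL_DP_ATTN:-${DP_ATTENTION:-false}}}"
export DECODE_DP_ATTENTION="${DECODE_DP_ATTENTION:-${DECODE_DP_ATTN:-${DP_ATTENTION:-false}}}"
```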
### Concurrency

Eval concurrency is ultimately read by InferenceX's `benchmark_lib.sh` from `EVAL_CONCURRENT_REQUESTS`. The runner script sets that value from `EVAL_CONC` when present, preserves an existing `EVAL_CONCURRENT_REQUESTS` otherwise, and falls back to `256` only if neither variable is set:

```bash
export EVAL_CONCURRENT_REQUESTS="${EVAL_CONC:-${EVAL_CONCURRENT_REQUESTS:-256}}"
```

The InferenceX workflow sets `EVAL_CONC` from the generated `eval-conc` value. For multi-node configs, InferenceX selects the `8k1k` entry with the highest max eligible concurrency for each `(model, runner, framework, precision, spec-decoding, prefill-dp-attn, decode-dp-attn)` group, then sets `eval-conc` to the upper median of that config's eligible concurrency list. If `EVAL_CONC` is not set in the environment, `do_sweep.py` falls back to the max of the recipe benchmark concurrency list.
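As an illustration of the upper-median rule (the list values below are hypothetical): for an eligible concurrency list of `4 8 16 32`, `eval-conc` would be `16`.

```bash
# Upper median of an even-length sorted list is the element at index n/2.
concs=(4 8 16 32)
echo "${concs[$(( ${#concs[@]} / 2 ))]}"   # prints 16
```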
### Output

Eval artifacts are written to `/logs/eval_results/` inside the container:

- `meta_env.json` - metadata used by InferenceX aggregation and summary tables
- `results*.json` - lm-eval scores per task
- `sample*.jsonl` - per-sample outputs

These are collected by the InferenceX NVIDIA launch scripts and uploaded as workflow artifacts. In eval-only mode the InferenceX workflow expects eval artifacts, not throughput benchmark artifacts.
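For illustration only (the filenames below are hypothetical; the exact set depends on the lm-eval tasks run), a successful run leaves the summary metadata and per-task outputs in one place:

```bash
ls /logs/eval_results/
# meta_env.json  results_gsm8k.json  sample_gsm8k.jsonl
```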
### Intricacies

1. **Eval floor of 16**
   - One sweep config uses `conc: [1]`, which would cause evals to take more than 4 hours to complete, so eval concurrency is floored at 16.

src/srtctl/benchmarks/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -8,6 +8,7 @@
     custom,
     gpqa,
     gsm8k,
+    lm_eval,
     longbenchv2,
     mmlu,
     mooncake_router,
@@ -30,6 +31,7 @@
     "register_benchmark",
     # Runners
     "custom",
+    "lm_eval",
     "sa_bench",
     "sglang_bench",
     "mmlu",

src/srtctl/benchmarks/lm_eval.py

Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-FileCopyrightText: Copyright (c) 2026 SemiAnalysis LLC. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

"""lm-eval benchmark runner for InferenceX evals."""

from __future__ import annotations

from typing import TYPE_CHECKING

from srtctl.benchmarks.base import SCRIPTS_DIR, BenchmarkRunner, register_benchmark

if TYPE_CHECKING:
    from srtctl.core.runtime import RuntimeContext
    from srtctl.core.schema import SrtConfig


@register_benchmark("lm-eval")
class LMEvalRunner(BenchmarkRunner):
    """lm-eval accuracy evaluation using InferenceX benchmark_lib.

    Runs lm-eval via the InferenceX benchmark_lib.sh harness,
    which handles task selection, result collection, and summary generation.
    """

    @property
    def name(self) -> str:
        return "lm-eval"

    @property
    def script_path(self) -> str:
        return "/srtctl-benchmarks/lm-eval/bench.sh"

    @property
    def local_script_dir(self) -> str:
        return str(SCRIPTS_DIR / "lm-eval")

    def validate_config(self, config: SrtConfig) -> list[str]:
        # lm-eval has sensible defaults
        return []

    def build_command(
        self,
        config: SrtConfig,
        runtime: RuntimeContext,
    ) -> list[str]:
        endpoint = f"http://localhost:{runtime.frontend_port}"
        # Always use the container mount path, not the host path.
        # INFMAX_WORKSPACE env var contains the host path (used for mount setup
        # in runtime.py), but inside the container it's at /infmax-workspace.
        infmax_workspace = "/infmax-workspace"

        return [
            "bash",
            self.script_path,
            endpoint,
            infmax_workspace,
        ]
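For reference, the command built above resolves to the following invocation (a sketch; the actual port comes from `runtime.frontend_port`):

```bash
# Assuming the frontend listens on the default port 8000:
bash /srtctl-benchmarks/lm-eval/bench.sh http://localhost:8000 /infmax-workspace
```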
Lines changed: 77 additions & 0 deletions
@@ -0,0 +1,77 @@
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-FileCopyrightText: Copyright (c) 2026 SemiAnalysis LLC. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

# lm-eval accuracy evaluation using InferenceX benchmark_lib
# Expects: endpoint [infmax_workspace]

set -e

ENDPOINT=$1
INFMAX_WORKSPACE=${2:-/infmax-workspace}

# Extract HOST and PORT from endpoint (e.g., http://localhost:8000)
HOST=$(echo "$ENDPOINT" | sed -E 's|https?://||; s|:.*||')
PORT=$(echo "$ENDPOINT" | sed -E 's|.*:([0-9]+).*|\1|')

echo "lm-eval Config: endpoint=${ENDPOINT}; host=${HOST}; port=${PORT}; workspace=${INFMAX_WORKSPACE}"

# Auto-discover the served model name from /v1/models if MODEL_NAME is not set.
# This ensures we use the exact name the server recognizes, regardless of what
# $MODEL (the HuggingFace ID from the workflow) is set to.
if [[ -z "${MODEL_NAME:-}" ]]; then
    DISCOVERED_MODEL=$(curl -sf "${ENDPOINT}/v1/models" 2>/dev/null \
        | python3 -c "import sys,json; d=json.load(sys.stdin); print(d['data'][0]['id'])" 2>/dev/null || true)
    if [[ -n "$DISCOVERED_MODEL" ]]; then
        export MODEL_NAME="$DISCOVERED_MODEL"
        echo "Auto-discovered MODEL_NAME from /v1/models: ${MODEL_NAME}"
    else
        echo "WARNING: Could not discover model name from /v1/models, using MODEL_NAME=${MODEL_NAME:-$MODEL}"
    fi
else
    echo "Using MODEL_NAME from environment: ${MODEL_NAME}"
fi

# cd to workspace so that relative paths (e.g., utils/evals/*.yaml) resolve
cd "${INFMAX_WORKSPACE}"

# Source the InferenceX benchmark library
source "${INFMAX_WORKSPACE}/benchmarks/benchmark_lib.sh"

# Run lm-eval via benchmark_lib
# EVAL_CONC is set by the InferenceX workflow (median of conc list).
# benchmark_lib reads concurrency from EVAL_CONCURRENT_REQUESTS env var.
export EVAL_CONCURRENT_REQUESTS="${EVAL_CONC:-${EVAL_CONCURRENT_REQUESTS:-256}}"
echo "Running lm-eval with concurrent-requests=${EVAL_CONCURRENT_REQUESTS}..."
eval_rc=0
run_eval --framework lm-eval --port "$PORT" || eval_rc=$?

# Derive metadata env vars that append_lm_eval_summary needs but do_sweep.py
# does not pass directly (it passes PREFILL_TP/EP/etc, not TP/EP_SIZE/CONC).
export IS_MULTINODE="${IS_MULTINODE:-true}"
export TP="${TP:-${PREFILL_TP:-1}}"
export CONC="${CONC:-${EVAL_CONC:-${EVAL_CONCURRENT_REQUESTS:-1}}}"
export EP_SIZE="${EP_SIZE:-${PREFILL_EP:-1}}"
export DP_ATTENTION="${DP_ATTENTION:-${PREFILL_DP_ATTN:-false}}"
# Remap srt-slurm's DP_ATTN names to InferenceX's DP_ATTENTION names
export PREFILL_DP_ATTENTION="${PREFILL_DP_ATTENTION:-${PREFILL_DP_ATTN:-${DP_ATTENTION:-false}}}"
export DECODE_DP_ATTENTION="${DECODE_DP_ATTENTION:-${DECODE_DP_ATTN:-${DP_ATTENTION:-false}}}"

# Generate the lm-eval summary
echo "Generating lm-eval summary..."
append_lm_eval_summary || true

# Copy eval artifacts to /logs/eval_results/
mkdir -p /logs/eval_results
echo "Copying eval artifacts to /logs/eval_results/..."
cp -v meta_env.json /logs/eval_results/ 2>/dev/null || true
cp -v results*.json /logs/eval_results/ 2>/dev/null || true
cp -v sample*.jsonl /logs/eval_results/ 2>/dev/null || true

if [[ "$eval_rc" -ne 0 ]]; then
    echo "lm-eval evaluation failed with exit code ${eval_rc}"
    exit "$eval_rc"
fi

echo "lm-eval evaluation complete"
