Add lm-eval benchmark runner for InferenceX evals (#12) (#122)

* Add lm-eval benchmark runner for InferenceX evals

Adds support for running lm-eval accuracy evaluations as a post-benchmark step, leveraging the InferenceX benchmark_lib.sh harness.

Co-authored-by: Bryan Shan <58582368+Oseltamivir@users.noreply.github.com>
`docs/accuracy.md` (84 additions, 1 deletion)

# Accuracy Benchmarks

In srt-slurm, users can run different accuracy benchmarks by setting the benchmark section in the config yaml file. Supported benchmarks include `mmlu`, `gpqa`, `longbenchv2`, `lm-eval`, and AIME (via the script under `configs/aime/`).
## lm-eval (InferenceX)
The `lm-eval` benchmark runner integrates [EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) via InferenceX's `benchmark_lib.sh`. Unlike the built-in benchmarks above, this runner sources evaluation logic from an external InferenceX workspace mounted at `/infmax-workspace`.

This is used by InferenceX CI to run evals such as GSM8K and GPQA against NVIDIA multi-node disaggregated deployments on GB200, GB300, B200, B300, H100, and H200. AMD MI355X multi-node evals are handled by InferenceX's upstreamed AMD Slurm path, not by this srt-slurm runner.

In InferenceX CI, recipes normally keep their throughput benchmark configuration. `do_sweep.py` invokes the registered `lm-eval` runner as a post-step when `RUN_EVAL=true`, or as the only benchmark-like step when `EVAL_ONLY=true`. There is no separate `infmax-eval` benchmark type.
### How it works

1. `RuntimeContext` mounts the host path from `INFMAX_WORKSPACE` at `/infmax-workspace` inside the Slurm container.
2. `do_sweep.py` starts infrastructure, workers, and the frontend for the normal recipe topology.
3. For `EVAL_ONLY=true`, `do_sweep.py` skips the throughput benchmark stage and runs `_run_post_eval()` directly after frontend startup.
4. `_run_post_eval()` waits for the OpenAI-compatible endpoint on port 8000 and, in eval-only mode, performs the full `wait_for_model()` health check for the configured prefill/decode or aggregated topology.
5. `_run_post_eval()` launches the registered `lm-eval` runner on the head node and passes through InferenceX metadata such as framework, precision, sequence length, prefill/decode topology, and eval concurrency.
6. The runner script (`benchmarks/scripts/lm-eval/bench.sh`) uses `MODEL_NAME` from `do_sweep.py`, or auto-discovers the served model from `/v1/models` as a fallback.
7. The runner sources `/infmax-workspace/benchmarks/benchmark_lib.sh`, runs `run_eval --framework lm-eval`, and calls `append_lm_eval_summary`.
8. Eval artifacts are copied to `/logs/eval_results/` for InferenceX launcher-side artifact pickup.
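
Steps 6 through 8 amount to a short shell flow. The sketch below is illustrative rather than the actual `bench.sh`: the `/v1/models` parsing and the `EVAL_OUTPUT_DIR` variable are assumptions made for the example.

```bash
#!/usr/bin/env bash
# Condensed, illustrative sketch of benchmarks/scripts/lm-eval/bench.sh (not verbatim).
set -euo pipefail

# Prefer MODEL_NAME from do_sweep.py; otherwise auto-discover the served model
# from the OpenAI-compatible endpoint (assumed reachable on localhost:8000 here).
MODEL_NAME="${MODEL_NAME:-$(curl -s http://localhost:8000/v1/models \
  | python3 -c 'import json, sys; print(json.load(sys.stdin)["data"][0]["id"])')}"
export MODEL_NAME

# Source the InferenceX harness and run the lm-eval evaluation.
source /infmax-workspace/benchmarks/benchmark_lib.sh
run_eval --framework lm-eval

# Record metadata for InferenceX summary tables, then expose artifacts where the
# launcher-side scripts expect to find them.
append_lm_eval_summary
mkdir -p /logs/eval_results
cp -r "${EVAL_OUTPUT_DIR:-./eval_output}"/. /logs/eval_results/  # EVAL_OUTPUT_DIR is an assumption
```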
### EVAL_ONLY mode

srt-slurm supports an `EVAL_ONLY` mode for CI jobs that should only validate accuracy. This is controlled by environment variables from the InferenceX workflow:

| Env var | Description |
|---------|-------------|
| `EVAL_ONLY` | Set to `true` to skip the throughput benchmark stage and run eval only |
| `RUN_EVAL` | Set to `true` to run eval after the throughput benchmark completes |
| `EVAL_CONC` | Concurrent requests for lm-eval, normally set by InferenceX from the generated `eval-conc` value |
| `INFMAX_WORKSPACE` | Host path to the InferenceX checkout that should be mounted at `/infmax-workspace` |
| `MODEL_NAME` | Served model alias for OpenAI-compatible requests; set by `do_sweep.py` from `config.served_model_name` |
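
For illustration, an eval-only CI job ends up with an environment along these lines before the sweep starts; the paths and values below are placeholders rather than values taken from the InferenceX workflow.

```bash
# Illustrative environment for an eval-only job; values are placeholders.
export EVAL_ONLY=true                        # skip the throughput benchmark stage
export EVAL_CONC=64                          # lm-eval concurrent requests (from eval-conc)
export INFMAX_WORKSPACE=/path/to/inferencex  # mounted at /infmax-workspace in the container
# MODEL_NAME is normally set by do_sweep.py from config.served_model_name,
# so it usually does not need to be exported by hand.
```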
When `EVAL_ONLY=true`:

- Stage 4 skips the throughput benchmark entirely. No throughput result JSON is expected from srt-slurm.
- The eval path uses the full `wait_for_model()` health check before starting lm-eval.
- `_run_post_eval()` launches the `lm-eval` runner and returns its exit code.
- Eval failure is fatal because eval is the only purpose of the job.

When `RUN_EVAL=true` (without `EVAL_ONLY`):

- Throughput benchmark runs normally.
- After the benchmark completes successfully, eval runs as a post-step.
- Eval failure is non-fatal; the benchmark job still succeeds if throughput passed.
### Environment variables

The following env vars are passed through to the lm-eval runner container:

| Env var | Purpose |
|---------|---------|
| `RUN_EVAL`, `EVAL_ONLY`, `IS_MULTINODE` | Control whether eval runs and how InferenceX classifies the artifact |

The runner maps srt-slurm's `PREFILL_DP_ATTN` and `DECODE_DP_ATTN` names to InferenceX's `PREFILL_DP_ATTENTION` and `DECODE_DP_ATTENTION` names before calling `append_lm_eval_summary`. This is required for multi-node summary tables to preserve prefill/decode DPA state.
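
A minimal sketch of that renaming, assuming it happens in the runner's shell environment just before the summary step (the real runner may do this differently):

```bash
# Translate srt-slurm variable names into the names benchmark_lib.sh expects,
# so append_lm_eval_summary records prefill/decode DP-attention state correctly.
export PREFILL_DP_ATTENTION="${PREFILL_DP_ATTN:-}"
export DECODE_DP_ATTENTION="${DECODE_DP_ATTN:-}"
append_lm_eval_summary
```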
### Concurrency
Eval concurrency is ultimately read by InferenceX's `benchmark_lib.sh` from `EVAL_CONCURRENT_REQUESTS`. The runner script sets that value from `EVAL_CONC` when present, preserves an existing `EVAL_CONCURRENT_REQUESTS` otherwise, and falls back to `256` only if neither variable is set:
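
In sketch form, that precedence looks roughly like the following (assuming POSIX shell; not the verbatim `bench.sh` logic):

```bash
# EVAL_CONC (from the workflow) wins, an existing EVAL_CONCURRENT_REQUESTS is
# preserved, and 256 is used only when neither variable is set.
if [ -n "${EVAL_CONC:-}" ]; then
  export EVAL_CONCURRENT_REQUESTS="${EVAL_CONC}"
elif [ -z "${EVAL_CONCURRENT_REQUESTS:-}" ]; then
  export EVAL_CONCURRENT_REQUESTS=256
fi
```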
The InferenceX workflow sets `EVAL_CONC` from the generated `eval-conc` value. For multi-node configs, InferenceX selects the `8k1k` entry with the highest max eligible concurrency for each `(model, runner, framework, precision, spec-decoding, prefill-dp-attn, decode-dp-attn)` group, then sets `eval-conc` to the upper median of that config's eligible concurrency list (for example, an eligible list of `[4, 8, 16, 32]` gives an upper median of `16`). If `EVAL_CONC` is not set in the environment, `do_sweep.py` falls back to the max of the recipe benchmark concurrency list.
### Output
Eval artifacts are written to `/logs/eval_results/` inside the container:

- `meta_env.json` - metadata used by InferenceX aggregation and summary tables
- `results*.json` - lm-eval scores per task
- `sample*.jsonl` - per-sample outputs
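
A results directory therefore looks roughly like this (file names are illustrative; the actual lm-eval output names include model and timestamp components):

```
/logs/eval_results/
├── meta_env.json
├── results_<...>.json
└── sample_<...>.jsonl
```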
These are collected by the InferenceX NVIDIA launch scripts and uploaded as workflow artifacts. In eval-only mode the InferenceX workflow expects eval artifacts, not throughput benchmark artifacts.
### Intricacies
1. Eval floor of 16
   - There is one sweep config with `conc: [1]`, which causes evals to take over 4 hours to complete.
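
If the intricacy here is that eval concurrency is clamped to a minimum of 16 (which the item title suggests), the guard might look like this sketch; this is an assumption about the behavior, not code taken from the runner.

```bash
# Assumed behavior: never run lm-eval below 16 concurrent requests, so a sweep
# config with conc: [1] does not stretch the eval out over several hours.
if [ "${EVAL_CONCURRENT_REQUESTS:-0}" -lt 16 ]; then
  export EVAL_CONCURRENT_REQUESTS=16
fi
```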