Add lm-eval benchmark runner for InferenceX evals #40

Closed
Oseltamivir wants to merge 15 commits into NVIDIA:main from Oseltamivir:nvidia-pr

Conversation

@Oseltamivir
Contributor

Summary

Add InferenceX multi-node eval support through an lm-eval benchmark runner and eval-only orchestration path. Lets InferenceX run accuracy-only jobs against existing srt-slurm multi-node disaggregated recipes without running the throughput benchmark stage.

Copied from ishandhanani/srt-slurm#245

How

  • Add an lm-eval benchmark runner that sources InferenceX's benchmarks/benchmark_lib.sh from a mounted /infmax-workspace.
  • Mount INFMAX_WORKSPACE into the container as /infmax-workspace when provided.
  • Add EVAL_ONLY=true handling in do_sweep.py so eval-only jobs start infra/workers/frontend, run the full model health check, skip throughput, and launch lm-eval directly.
  • Keep RUN_EVAL=true behavior as a post-benchmark eval path for normal throughput jobs.
  • Pass model/framework/topology metadata into the eval container, including served MODEL_NAME, prefill/decode TP/EP/DPA/worker counts, sequence length, precision, runner type, and eval concurrency.
  • Map srt-slurm PREFILL_DP_ATTN / DECODE_DP_ATTN env vars to the InferenceX PREFILL_DP_ATTENTION / DECODE_DP_ATTENTION names expected by append_lm_eval_summary (see the sketch after this list).
  • Copy eval outputs (meta_env.json, results*.json, sample*.jsonl) into /logs/eval_results/ for launcher-side artifact pickup.
  • Preserve partial eval artifacts on lm-eval failure while still returning the original eval failure code.
  • Document the InferenceX lm-eval integration in docs/accuracy.md.
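
A minimal Python sketch of the env-var bridging and artifact copy from the bullets above; the helper names (build_eval_env, collect_eval_artifacts) are hypothetical stand-ins, not the actual srtctl API:

```python
import os
import shutil
from glob import glob

# srt-slurm name -> InferenceX name expected by append_lm_eval_summary
ENV_NAME_MAP = {
    "PREFILL_DP_ATTN": "PREFILL_DP_ATTENTION",
    "DECODE_DP_ATTN": "DECODE_DP_ATTENTION",
}

def build_eval_env(base_env: dict[str, str]) -> dict[str, str]:
    """Copy the env and add the InferenceX-style names alongside the originals."""
    env = dict(base_env)
    for src, dst in ENV_NAME_MAP.items():
        if src in env:
            env[dst] = env[src]
    return env

def collect_eval_artifacts(eval_dir: str, out_dir: str = "/logs/eval_results") -> None:
    """Copy eval outputs to where the launcher picks up artifacts."""
    os.makedirs(out_dir, exist_ok=True)
    for pattern in ("meta_env.json", "results*.json", "sample*.jsonl"):
        for path in glob(os.path.join(eval_dir, pattern)):
            shutil.copy2(path, out_dir)
```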

What

For EVAL_ONLY=true:

  • srt-slurm still starts the normal deployment topology.
  • The throughput benchmark runner is skipped.
  • wait_for_model() verifies the configured prefill/decode or aggregated worker counts.
  • lm-eval runs against the OpenAI-compatible endpoint.
  • Eval failure is fatal.
  • A below-threshold eval score also fails the job.
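
A control-flow sketch of the eval-only path; every helper here (start_deployment, wait_for_model, run_lm_eval, eval_score_ok) is a hypothetical stand-in for the real do_sweep.py internals:

```python
def run_eval_only(config) -> int:
    """Eval-only job: full deployment and health check, no throughput stage."""
    start_deployment(config)             # infra/workers/frontend as usual
    wait_for_model(config.health_check)  # full health check, same as the benchmark stage
    rc = run_lm_eval(config)             # hits the OpenAI-compatible endpoint
    if rc != 0:
        return rc                        # eval failure is fatal in this mode
    return 0 if eval_score_ok(config) else 1  # below-threshold score also fails
```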

For RUN_EVAL=true without EVAL_ONLY=true:

  • The normal benchmark runs first.
  • lm-eval runs as a post-step if throughput succeeds.
  • Eval failure is non-fatal to the benchmark result.
  • A below-threshold eval score still leads to failure.
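
And the post-benchmark variant, with the same hypothetical helpers plus logging. Note the asymmetry: an eval crash is only warned about, but per the bullet above a below-threshold score still fails:

```python
import logging

log = logging.getLogger(__name__)

def run_benchmark_with_eval(config) -> int:
    """Normal throughput job with lm-eval as a post-step."""
    rc = run_throughput_benchmark(config)
    if rc != 0:
        return rc                      # no eval if throughput failed
    eval_rc = run_lm_eval(config)      # post-step eval
    if eval_rc != 0:
        # non-fatal to the benchmark result
        log.warning("lm-eval failed (rc=%s); keeping benchmark result", eval_rc)
    elif not eval_score_ok(config):
        return 1                       # a below-threshold score still fails
    return rc
```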

Validation run

https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24059388771

InferenceX PR

SemiAnalysisAI/InferenceX#1000

Albert Cheng (Engrg-Hardware 1) and others added 15 commits April 2, 2026 14:17
Auto-detect container type at runtime: if /sgl-workspace exists (SGLang),
use original install path unchanged; otherwise use portable /tmp build path
with conditional dependency installation for non-SGLang containers.
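
A hedged sketch of that runtime detection (the /tmp path below is illustrative; the commit only says "portable /tmp build path"):

```python
from pathlib import Path

def select_install_path() -> tuple[str, bool]:
    """Return (build_path, needs_dep_install) based on the container type."""
    if Path("/sgl-workspace").exists():
        # SGLang container: keep the original install path unchanged
        return "/sgl-workspace", False
    # non-SGLang container: portable build path, install missing deps
    return "/tmp/build", True
```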
* Add Kimi-K2.5 vLLM recipes and fix NIXL side channel host

- Add kimi-k2.5 1k1k and 8k1k disagg GB200 recipes (from NVIDIA#7)
- Fix vLLM NIXL handshake failures: set VLLM_NIXL_SIDE_CHANNEL_HOST to the
  node's routable IP in get_process_environment() instead of leaving it as
  0.0.0.0/localhost, which caused transfer handshake failures (see the sketch
  after this commit message)
- Update test_vllm_get_process_environment to cover NIXL host env var

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* ci: run checks on PRs targeting sa-submission-q2-2026

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
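
For the NIXL fix above, one common way to pick the node's routable IP is a connected UDP socket; this sketches the idea and is not necessarily how get_process_environment() does it:

```python
import socket

def routable_ip() -> str:
    """Pick the outbound interface's IP; no packets are actually sent."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.connect(("8.8.8.8", 80))
        return s.getsockname()[0]

def get_process_environment(base_env: dict[str, str]) -> dict[str, str]:
    env = dict(base_env)
    # avoid 0.0.0.0/localhost, which broke NIXL transfer handshakes
    env["VLLM_NIXL_SIDE_CHANNEL_HOST"] = routable_ip()
    return env
```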
Adds support for running lm-eval accuracy evaluations as a post-benchmark
step, leveraging the InferenceX benchmark_lib.sh harness.

- New LMEvalRunner registered as "lm-eval" benchmark type
- bench.sh script sources benchmark_lib.sh and calls run_eval/append_lm_eval_summary
- Post-benchmark eval hook in SweepOrchestrator.run() triggered by RUN_EVAL=true
- Auto-mount INFMAX_WORKSPACE into container when env var is set

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
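
A sketch of the auto-mount behavior, assuming mounts are host:container strings; the function shape is illustrative:

```python
import os

def maybe_mount_infmax_workspace(mounts: list[str]) -> list[str]:
    """Append the INFMAX_WORKSPACE mount when the env var is set."""
    workspace = os.environ.get("INFMAX_WORKSPACE")
    if workspace:
        mounts = [*mounts, f"{workspace}:/infmax-workspace"]
    return mounts
```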
In eval-only mode the benchmark stage is skipped, which also skips
its model health check. The 30s port check in _run_post_eval is
insufficient — workers are still loading. Use wait_for_model() with
the full health check config (same as benchmark stage) when
EVAL_ONLY=true.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instead of capping eval examples with --limit to avoid timeouts,
use the highest benchmark concurrency for eval requests. This runs
the full eval set faster by matching the throughput the server was
already benchmarked at.

do_sweep.py computes max(config.benchmark.concurrencies) and passes
it as EVAL_CONC to the lm-eval bench script.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
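
The selection itself is a one-liner; assuming config.benchmark.concurrencies is a list of ints, as the commit implies:

```python
def eval_concurrency(config) -> int:
    # run the full eval set at the highest concurrency the server was
    # already benchmarked at, instead of capping examples with --limit
    return max(config.benchmark.concurrencies)

# passed into the lm-eval bench script's environment:
# env["EVAL_CONC"] = str(eval_concurrency(config))
```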
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
NVIDIA#24)

* Add Kimi K2.5 disagg STP and MTP recipes for GB200 NVfp4 (ISL8K_OSL1K and ISL1K_OSL1K)

Add optimized disaggregated inference recipes for Kimi K2.5 model with NVfp4
precision on GB200 GPUs. Includes both STP and MTP configurations for
ISL8K_OSL1K and ISL1K_OSL1K workloads covering concurrency points from 5
to 2253, with Eagle speculative decoding for MTP variants.

* Update Kimi K2.5 recipes: container, model path, concurrency format, and env cleanup

- Update container to tensorrtllm-runtime-1.1.0-dev.2.sqsh
- Point model path to shared /mnt/lustre01/models/kimi-k2.5-nvfp4
- Update Eagle model mount path for MTP configs
- Remove HF_HOME (defaults to ~/.cache/huggingface)
- Fix concurrency separator from space to 'x' for sa-bench compatibility
- Enable multiple frontends for ctx1dep4_gen1dep32_batch64

* Use generic model path and container aliases for cluster portability

Replace cluster-specific paths with generic alias names that are resolved
via srtslurm.yaml model_paths and containers mappings, as per upstream convention.

* Add extra_mount alias resolution and use generic Eagle model path

Add model_paths alias resolution for extra_mount host paths in config.py,
enabling MTP recipes to use the generic name "kimi-k2.5-eagle3" instead of
the cluster-specific path for the Eagle speculative decoding model (see the
sketch after this commit message).

* Use HuggingFace model names and full NVCR container paths

Per review feedback, update model paths to HuggingFace format
(nvidia/Kimi-K2.5-NVFP4) and container to full NVCR registry path
(nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2) so recipes
are portable and work without pre-built sqsh files.

---------

Co-authored-by: nlevin-ui <nlevin@nvidia.com>
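
A hedged sketch of the extra_mount alias resolution mentioned above; the model_paths mapping mirrors the srtslurm.yaml convention, but the function shape is assumed:

```python
def resolve_extra_mounts(extra_mounts: list[str],
                         model_paths: dict[str, str]) -> list[str]:
    """Resolve host-path aliases (e.g. "kimi-k2.5-eagle3") via model_paths."""
    resolved = []
    for mount in extra_mounts:
        host, _, container = mount.partition(":")
        host = model_paths.get(host, host)  # alias -> cluster path, else as-is
        resolved.append(f"{host}:{container}" if container else host)
    return resolved
```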
* recipes for minimax m2.5 fp4 b200 agg vllm

* commit for signature
Covers Codecov gaps: lm_eval.py (100%), do_sweep.py eval paths, runtime.py INFMAX mount.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
# Conflicts:
#	src/srtctl/benchmarks/lm_eval.py
#	src/srtctl/benchmarks/scripts/lm-eval/bench.sh