feat: add vLLM GB200 GSM8K repro configs #106

Draft
alec-flowers wants to merge 2 commits into main from aflowers/vllm-gb200-gsm8k-repro

Collaborator

@alec-flowers commented Apr 28, 2026

Summary

Adds a reproducible vLLM DeepSeek-V4-Pro GB200 GSM8K eval path and the two exact 1P1D configs used for the SA smoke comparison.

  • Adds a self-contained benchmark.type: lm-eval runner that evaluates OpenAI-compatible chat endpoints with EleutherAI lm-eval.
  • Runs the eval in the same runtime container as the server, matching the InferenceX/srt-slurm multi-node flow.
  • Vendors the GSM8K task YAML, score threshold, and score validator used by the repro configs.
  • Adds two exact vLLM GB200 1P1D eval recipes:
    • DEP8 prefill + TP8 decode: disagg-gb200-1p1d-dep8-tp8-gsm8k-smoke.yaml
    • DEP8 prefill + DEP8 decode: disagg-gb200-1p1d-dep8-dep8-gsm8k-smoke.yaml
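As a rough illustration of what a score validator of this kind might do, here is a minimal sketch. It is not the validator vendored in this PR. It assumes the lm-eval results JSON nests metrics under `results` and then the task name, that the task is named `gsm8k`, and it uses a hypothetical threshold of 0.80, since the actual vendored threshold is not stated here.

```python
# Hypothetical threshold for illustration only; the PR does not state the vendored value.
ASSUMED_THRESHOLD = 0.80

# Metric keys as they appear in the results reported in this PR.
METRICS = ("exact_match,strict-match", "exact_match,flexible-extract")


def validate_scores(results: dict, task: str = "gsm8k",
                    threshold: float = ASSUMED_THRESHOLD) -> list[str]:
    """Return a list of failure messages; an empty list means every metric passed."""
    # Assumed layout: {"results": {<task>: {<metric>: <score>, ...}}}
    task_metrics = results.get("results", {}).get(task)
    if task_metrics is None:
        return [f"{task}: no results found"]

    failures = []
    for name in METRICS:
        score = task_metrics.get(name)
        if score is None:
            failures.append(f"{task}: metric {name!r} missing")
        elif score < threshold:
            failures.append(f"{task}: {name} = {score:.4f} is below {threshold}")
    return failures


# Sample payload built from the TP8 decode scores reported in this PR.
sample = {
    "results": {
        "gsm8k": {
            "exact_match,strict-match": 0.8635,
            "exact_match,flexible-extract": 0.8597,
        }
    }
}
```

Calling `validate_scores(sample)` returns an empty list under the assumed threshold, while a stricter `threshold=0.9` would flag both metrics. Returning messages instead of raising lets a caller report every failing metric in one pass.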

SA Validation

Successful runs on the SA GB200 cluster:

  • TP8 decode repro: job 15558
    • exact_match,strict-match: 0.8635
    • exact_match,flexible-extract: 0.8597
  • DEP8 decode comparison: job 15559
    • exact_match,strict-match: 0.9636087945413192
    • exact_match,flexible-extract: 0.9628506444275967

Both runs use the full GSM8K test split (1319 examples), 5 shots from the task YAML, EVAL_CONC=128, max_length=9472, and max_tokens=5376.
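To make the gap between the two runs concrete, the reported accuracies can be converted back into approximate correct-answer counts on the 1319-example split. This is a small arithmetic sketch; the helper name `implied_correct` is illustrative, and the counts are implied by rounding rather than read from the job logs.

```python
N_EXAMPLES = 1319  # full GSM8K test split, as stated above

# Scores copied from the two jobs reported in this PR.
reported = {
    "TP8 decode (job 15558)": {
        "exact_match,strict-match": 0.8635,
        "exact_match,flexible-extract": 0.8597,
    },
    "DEP8 decode (job 15559)": {
        "exact_match,strict-match": 0.9636087945413192,
        "exact_match,flexible-extract": 0.9628506444275967,
    },
}


def implied_correct(score: float, n: int = N_EXAMPLES) -> int:
    """Nearest whole number of correct answers implied by an accuracy score."""
    return round(score * n)


for run, metrics in reported.items():
    for name, score in metrics.items():
        print(f"{run}: {name} ~ {implied_correct(score)}/{N_EXAMPLES}")

# Difference in implied strict-match correct answers between the two runs.
strict_gap = (
    implied_correct(reported["DEP8 decode (job 15559)"]["exact_match,strict-match"])
    - implied_correct(reported["TP8 decode (job 15558)"]["exact_match,strict-match"])
)
```

Under this rounding, the strict-match scores correspond to roughly 1139 and 1271 correct answers, a gap of about 132 examples.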

Test Plan

  • PYTHONPATH=src /home/aflowers/Documents/agent/srt-slurm-gsm8k-worktree/.venv/bin/python -m pytest tests/ -q
    • 639 passed, 2 skipped, 6 deselected
  • /home/aflowers/Documents/agent/srt-slurm-gsm8k-worktree/.venv/bin/ruff check src/srtctl tests/test_benchmarks.py
  • Parsed both new recipes with load_config; both resolve as benchmark=lm-eval, served model deepseek-ai/DeepSeek-V4-Pro, EVAL_CONC=128.
  • Ran the repo copyright-check logic locally; all checked files have NVIDIA SPDX headers.

Note: a fresh uv run pytest ... in the new worktree could not resolve jinja2 from the configured package index, so validation used the existing, already-populated srt-slurm dev venv with PYTHONPATH=src pointing at this worktree.

Preflight Note

This PR also keeps srtctl apply from running blocking preflight implicitly. Some GB200 clusters keep model paths on per-compute-node local storage, so login-node preflight can fail even though the job would run correctly once scheduled. Explicit validation remains available via srtctl preflight -f ....
