Commit d2b5aa6

Cleanup
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
1 parent 56d1a52 commit d2b5aa6

18 files changed

Lines changed: 434 additions & 35 deletions

AGENTS.md

Lines changed: 1 addition & 1 deletion
@@ -96,7 +96,7 @@ Dataset Manager --> Load Generator --> Endpoint Client --> External Endpoint
 | **CLI** | `src/inference_endpoint/main.py`, `commands/benchmark/cli.py` | cyclopts-based, auto-generated from `schema.py` and `HTTPClientConfig` Pydantic models. Flat shorthands via `cyclopts.Parameter(alias=...)` |
 | **Async Utils** | `src/inference_endpoint/async_utils/` | `LoopManager` (uvloop + eager_task_factory), ZMQ transport layer, generic `MessageCodec[T]`-parametrized pub/sub, event publisher |
 | **OpenAI/SGLang** | `src/inference_endpoint/openai/`, `sglang/` | Protocol adapters and response accumulators for different API formats. `openai_completions` adapter (`completions_adapter.py`) sends pre-tokenized token IDs to `/v1/completions`, bypassing the server chat template - required for gpt-oss-120b on vLLM. `sglang` adapter sends to `/generate` via `input_ids`. Both apply `Harmonize()` client-side. |
-| **DeepSeek-R1 (MLPerf)** | `src/inference_endpoint/evaluation/scoring.py` (`DeepSeekR1Scorer`), `examples/10_DeepSeekR1_Example/` | MLPerf DeepSeek-R1 accuracy. TensorRT-LLM is OpenAI-compatible, so it is served via `api_type: openai` / `openai_completions` (no dedicated trtllm adapter). The combined multi-subset eval (`math500`/`aime`/`gpqa`/`mmlu_pro`/`livecodebench`) is the official MLCommons `eval_accuracy.py`, run out-of-process via `uv run --project` against the isolated subproject at `examples/10_DeepSeekR1_Example/accuracy/` (mirrors the VBench pattern). The example feeds the exact MLPerf prompt via pre-tokenized `input_tokens` to `/v1/completions`. |
+| **DeepSeek-R1 (MLPerf)** | `src/inference_endpoint/evaluation/scoring.py` (`DeepSeekR1Scorer`), `examples/10_DeepSeekR1_Example/` | MLPerf DeepSeek-R1 accuracy. TensorRT-LLM is OpenAI-compatible, so it is served via `api_type: openai` / `openai_completions` (no dedicated trtllm adapter). The combined multi-subset eval (`math500`/`aime`/`gpqa`/`mmlu_pro`/`livecodebench`) is the official MLCommons `eval_accuracy.py`, run out-of-process via `uv run --project` against the isolated subproject at `src/inference_endpoint/evaluation/deepseek_r1/` (a uv subproject excluded from the parent wheel; mirrors the VBench pattern). The example feeds the exact MLPerf prompt via pre-tokenized `input_tokens` to `/v1/completions`. |
 | **VideoGen** | `src/inference_endpoint/videogen/` | Adapter for video-generation endpoints (e.g. trtllm-serve `POST /v1/videos/generations`, used by MLPerf WAN2.2-T2V-A14B). Defaults to `response_format=video_path` (server saves video to shared storage and returns path) to avoid large byte payloads. Accuracy mode also runs on `video_path`: the adapter mirrors the path into `response_output` so the event log carries it to `VBenchScorer` (see `evaluation/scoring.py`), which scores videos via VBench from a sibling `uv` subproject at `examples/09_Wan22_VideoGen_Example/accuracy/` (vbench's `transformers==4.33.2` + `numpy<2` pins are incompatible with the parent env, so it runs out-of-process via `uv run --project`). Dataset is ingested via the generic JSONL loader. |

 ### Hot-Path Architecture

examples/10_DeepSeekR1_Example/README.md

Lines changed: 16 additions & 15 deletions
@@ -7,8 +7,9 @@ accuracy dataset** (4388 samples) with this repo's `inference-endpoint` tool.
 Accuracy is the official MLCommons _combined-subset_ evaluation - `math500`,
 `aime`, `gpqa`, `mmlu_pro`, `livecodebench`, each graded by its own parser and
 aggregated into one `exact_match` plus `tokens_per_sample`, via the
-`deepseek_r1` scorer (which shells out to the isolated `accuracy/` subproject -
-see [`accuracy/RUNBOOK.md`](accuracy/RUNBOOK.md)).
+`deepseek_r1` scorer (which shells out to the isolated subproject at
+`src/inference_endpoint/evaluation/deepseek_r1/` - see
+[`RUNBOOK.md`](../../src/inference_endpoint/evaluation/deepseek_r1/RUNBOOK.md)).

 | Metric | Golden (FP32) | Pass criterion |
 | ------------------- | ------------- | -------------------------- |
@@ -17,16 +18,16 @@ see [`accuracy/RUNBOOK.md`](accuracy/RUNBOOK.md)).

 ## Files

-| File | Purpose |
-| ------------------------------------------ | ----------------------------------------------------------------- |
-| `prepare_dataset.py` | pkl -> parquet (+ `--subset N` stratified slice, + tiny perf set) |
-| `trtllm_serve_config.yaml` | `trtllm-serve --extra_llm_api_options` for 4 GPUs (TP=4, EP=4) |
-| `launch_and_run.sh` | SLURM launch: serve -> health -> (probe + run \| `SERVER_ONLY`) |
-| `run_client.sh` | Drive the benchmark from the login node (cross-arch clusters) |
-| `score_livecodebench.sh` | Score the LCB subset on a compute node (hardened sandbox) |
-| `offline_deepseek_r1_accuracy.yaml` | Full 4388-sample accuracy config |
-| `offline_deepseek_r1_accuracy_subset.yaml` | ~385-sample representative config (quick estimate) |
-| `accuracy/` | Isolated `uv` subproject wrapping the MLCommons evaluator |
+| File | Purpose |
+| ------------------------------------------------ | ----------------------------------------------------------------- |
+| `prepare_dataset.py` | pkl -> parquet (+ `--subset N` stratified slice, + tiny perf set) |
+| `trtllm_serve_config.yaml` | `trtllm-serve --extra_llm_api_options` for 4 GPUs (TP=4, EP=4) |
+| `launch_and_run.sh` | SLURM launch: serve -> health -> (probe + run \| `SERVER_ONLY`) |
+| `run_client.sh` | Drive the benchmark from the login node (cross-arch clusters) |
+| `score_livecodebench.sh` | Score the LCB subset on a compute node (hardened sandbox) |
+| `offline_deepseek_r1_accuracy.yaml` | Full 4388-sample accuracy config |
+| `offline_deepseek_r1_accuracy_subset.yaml` | ~385-sample representative config (quick estimate) |
+| `src/inference_endpoint/evaluation/deepseek_r1/` | Isolated `uv` subproject wrapping the MLCommons evaluator |

 ## WARNING Read first - verified gotchas on a GB200 SLURM cluster

@@ -75,7 +76,7 @@ unfinished`) and needs a ~21 GB dataset load that OOMs the login cgroup.
 - Parent env synced: `uv sync --extra dev` from the repo root; `uv` on `PATH`.
 - Accuracy subproject set up once (network needed):
 ```bash
-cd examples/10_DeepSeekR1_Example/accuracy && uv sync && bash setup_eval.sh && cd -
+cd src/inference_endpoint/evaluation/deepseek_r1 && uv sync && bash setup_eval.sh && cd -
 ```

 ## Prepare the dataset (once)
@@ -147,7 +148,7 @@ model id) in the chosen YAML, then:

 ```bash
 export MODEL_DIR=/path/to/deepseek_r1-torch-fp4
-export DEEPSEEK_EVAL_PROJECT_PATH=examples/10_DeepSeekR1_Example/accuracy # only if not running from the repo root
+export DEEPSEEK_EVAL_PROJECT_PATH=src/inference_endpoint/evaluation/deepseek_r1 # only if not running from the repo root
 inference-endpoint benchmark from-config \
   --config examples/10_DeepSeekR1_Example/offline_deepseek_r1_accuracy.yaml --mode acc
 ```
@@ -237,7 +238,7 @@ score it afterward on a clean compute node:

 ```bash
 sbatch examples/10_DeepSeekR1_Example/score_livecodebench.sh
-# -> accuracy/lcb_datasets/lcb_results.json {"total_samples": 349, "passed_samples": P, ...}
+# -> src/inference_endpoint/evaluation/deepseek_r1/lcb_datasets/lcb_results.json {"total_samples": 349, "passed_samples": P, ...}
 ```

 It runs the same hardened `lcb_serve` (kill-on-timeout) directly on the node.
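
The `lcb_results.json` summary shown in the comment above can be turned into a pass rate with a few lines of Python. This is a minimal sketch: only the `total_samples` and `passed_samples` keys come from the README; the function name `lcb_pass_rate` is hypothetical, and the file may contain other fields that this sketch ignores.

```python
import json


def lcb_pass_rate(results_json: str) -> float:
    """Return the LiveCodeBench pass rate in percent from a results JSON string."""
    data = json.loads(results_json)
    total = data["total_samples"]
    if total <= 0:
        # Guard against an empty or malformed results file.
        raise ValueError("total_samples must be positive")
    return 100.0 * data["passed_samples"] / total
```

In practice the string would be read from the `lcb_results.json` path printed by the script, for example with `Path(...).read_text()`.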

examples/10_DeepSeekR1_Example/offline_deepseek_r1_accuracy.yaml

Lines changed: 2 additions & 1 deletion
@@ -10,7 +10,8 @@
 #
 # Accuracy is the official MLCommons combined-subset eval (math500 / aime /
 # gpqa / mmlu_pro / livecodebench) via the `deepseek_r1` scorer, which shells
-# out to the isolated subproject in ./accuracy/. Golden FP32 exact_match =
+# out to the isolated subproject at src/inference_endpoint/evaluation/deepseek_r1/.
+# Golden FP32 exact_match =
 # 81.3582; pass threshold = 99% of golden (>= 80.5246). tokens_per_sample
 # golden = 3886.2274 (pass band 90-110%).
 #
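
The pass criteria quoted in the YAML comment can be recomputed from the golden values as a quick sanity check. The sketch below is illustrative only: the function names are hypothetical, and treating both bounds as inclusive is an assumption. Note that 0.99 × 81.3582 evaluates to 80.5446, not the 80.5246 printed in the comment, so one of the two figures may be worth checking against the MLCommons reference.

```python
GOLDEN_EXACT_MATCH = 81.3582          # golden FP32 exact_match (from the YAML comment)
GOLDEN_TOKENS_PER_SAMPLE = 3886.2274  # golden tokens_per_sample (from the YAML comment)


def exact_match_threshold(golden: float = GOLDEN_EXACT_MATCH, fraction: float = 0.99) -> float:
    """Minimum exact_match that counts as a pass (99% of golden by default)."""
    return golden * fraction


def tokens_band(golden: float = GOLDEN_TOKENS_PER_SAMPLE,
                low: float = 0.90, high: float = 1.10) -> tuple:
    """Allowed tokens_per_sample range (90-110% of golden by default)."""
    return golden * low, golden * high


def passes(exact_match: float, tokens_per_sample: float) -> bool:
    """True when both metrics satisfy their criteria (bounds assumed inclusive)."""
    lower, upper = tokens_band()
    return exact_match >= exact_match_threshold() and lower <= tokens_per_sample <= upper
```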

examples/10_DeepSeekR1_Example/run_client.sh

Lines changed: 2 additions & 2 deletions
@@ -20,8 +20,8 @@ SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 REPO_ROOT="$(cd "${SCRIPT_DIR}/../.." && pwd)"
 export PATH="${HOME}/.local/bin:${PATH}"
 # The deepseek_r1 scorer shells out to the accuracy subproject (set up once with
-# `cd accuracy && uv sync && bash setup_eval.sh`).
-export DEEPSEEK_EVAL_PROJECT_PATH="${SCRIPT_DIR}/accuracy"
+# `cd src/inference_endpoint/evaluation/deepseek_r1 && uv sync && bash setup_eval.sh`).
+export DEEPSEEK_EVAL_PROJECT_PATH="${REPO_ROOT}/src/inference_endpoint/evaluation/deepseek_r1"
 # The benchmark config references ${MODEL_DIR} (tokenizer for tokens_per_sample);
 # it is resolved when the YAML is loaded, so it must be set in this environment.
 : "${MODEL_DIR:?Set MODEL_DIR to your DeepSeek-R1 FP4 checkpoint directory}"
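
For readers unfamiliar with the out-of-process pattern, the following sketch shows one way a scorer could resolve the subproject path and assemble a `uv run --project` command. It is not the repo's actual implementation: `resolve_eval_project`, `build_eval_command`, and the example arguments are hypothetical. Only the `DEEPSEEK_EVAL_PROJECT_PATH` variable, the default path, and the `eval_accuracy.py` script name are taken from this commit.

```python
import os
from pathlib import Path
from typing import List, Mapping, Optional

# Default location relative to the repo root, as moved by this commit.
DEFAULT_PROJECT = Path("src/inference_endpoint/evaluation/deepseek_r1")


def resolve_eval_project(env: Optional[Mapping[str, str]] = None) -> Path:
    """Prefer DEEPSEEK_EVAL_PROJECT_PATH; otherwise fall back to the default path."""
    env = os.environ if env is None else env
    override = env.get("DEEPSEEK_EVAL_PROJECT_PATH")
    return Path(override) if override else DEFAULT_PROJECT


def build_eval_command(project: Path, script_args: List[str]) -> List[str]:
    """Build an argv list that runs a Python script inside the subproject's env."""
    return ["uv", "run", "--project", str(project), "python", *script_args]
```

The resulting list could then be passed to `subprocess.run(cmd, check=True)`, which keeps the subproject's pinned dependencies out of the parent interpreter.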

examples/10_DeepSeekR1_Example/score_livecodebench.sh

Lines changed: 1 addition & 1 deletion
@@ -30,7 +30,7 @@ echo "=== node $(hostname) arch $(uname -m) @ $(date -u +%H:%M:%S) ==="
 EX="examples/10_DeepSeekR1_Example"
 OUTPUTS_PARQUET="${OUTPUTS_PARQUET:-${REPO_ROOT}/logs/deepseek_r1_fp4_accuracy/deepseek_eval/deepseek_r1_accuracy_outputs.parquet}"
 LCB_VARIANT="${LCB_VARIANT:-release_v6}" # superset, so all question_ids resolve
-LCBDIR="${REPO_ROOT}/${EX}/accuracy/lcb_datasets"
+LCBDIR="${REPO_ROOT}/src/inference_endpoint/evaluation/deepseek_r1/lcb_datasets"
 LCBIN="${LCBDIR}/lcb_input.parquet"
 RESULTS="${LCBDIR}/lcb_results.json"

pyproject.toml

Lines changed: 7 additions & 1 deletion
@@ -12,7 +12,13 @@ environments = [

 [tool.uv.build-backend]
 module-root = "src"
-source-exclude = ["inference_endpoint/evaluation/livecodebench/_server.py"]
+source-exclude = [
+    "inference_endpoint/evaluation/livecodebench/_server.py",
+    # Isolated uv subproject (own pyproject.toml/uv.lock; pinned transformers /
+    # numpy<2 / prm800k that conflict with the parent env). Invoked only via
+    # `uv run --project`, never imported - keep it out of the parent wheel.
+    "inference_endpoint/evaluation/deepseek_r1/**",
+]

 [project]
 name = "inference-endpoint"
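
One way to confirm that an exclusion like this took effect is to inspect a built wheel, which is an ordinary zip archive. The helper below is a sketch under that assumption; `leaked_entries` is a hypothetical name, and the wheel path would be whatever the build step produces locally.

```python
import zipfile
from typing import List


def leaked_entries(wheel_path: str, prefix: str) -> List[str]:
    """Return archive members under `prefix`; an empty list means nothing leaked."""
    with zipfile.ZipFile(wheel_path) as wheel:
        return [name for name in wheel.namelist() if name.startswith(prefix)]
```

Calling it with the prefix `inference_endpoint/evaluation/deepseek_r1/` on the parent wheel should return an empty list if the subproject was excluded as intended.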

src/inference_endpoint/commands/benchmark/execute.py

Lines changed: 8 additions & 1 deletion
@@ -900,8 +900,15 @@ def finalize_benchmark(ctx: BenchmarkContext, bench: BenchmarkResult) -> None:
             "ground_truth_column": eval_cfg.ground_truth_column,
             "score": score,
             "n_repeats": n_repeats,
+            # False when the scorer produced only a partial headline (e.g.
+            # DeepSeekR1Scorer when the lcb-service container was unreachable),
+            # so a partial number is never mistaken for a complete one.
+            "complete": scorer_instance.complete,
         }
-        logger.info(f"Score for {eval_cfg.dataset_name}: {score} ({n_repeats} repeats)")
+        logger.info(
+            f"Score for {eval_cfg.dataset_name}: {score} "
+            f"({n_repeats} repeats, complete={scorer_instance.complete})"
+        )

     # Report metrics: prefer Report from MetricsSnapshot, fall back to SessionResult
     if report is not None and report.duration_ns is not None:
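
To illustrate how a downstream consumer might use the new `complete` field, here is a small sketch. `ExampleScorer`, `build_result_entry`, and `headline_score` are hypothetical names; only the `score`, `n_repeats`, and `complete` keys mirror the diff. Treating a missing `complete` key as complete (for results written before this change) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ExampleScorer:
    """Hypothetical stand-in for a scorer; only `complete` mirrors the diff."""
    complete: bool = True


def build_result_entry(score: float, n_repeats: int, scorer: ExampleScorer) -> dict:
    """Assemble a result entry carrying the scorer's completeness flag."""
    return {"score": score, "n_repeats": n_repeats, "complete": scorer.complete}


def headline_score(entry: dict) -> float:
    """Return the score, refusing to report a partial one as final."""
    if not entry.get("complete", True):  # assumption: older entries lack the key
        raise ValueError("score is partial; rerun the missing subset before reporting")
    return entry["score"]
```

Raising on a partial entry is one design choice; a reporting tool could equally render the number with a visible "partial" label.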

src/inference_endpoint/dataset_manager/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -24,6 +24,7 @@
 from .multi_turn_dataset import MultiTurnDataset
 from .predefined.aime25 import AIME25
 from .predefined.cnndailymail import CNNDailyMail
+from .predefined.deepseek_r1 import DeepSeekR1
 from .predefined.gpqa import GPQA
 from .predefined.livecodebench import LiveCodeBench
 from .predefined.open_orca import OpenOrca
@@ -56,6 +57,7 @@
     "MakeAdapterCompatible",
     "apply_transforms",
     "AIME25",
+    "DeepSeekR1",
     "GPQA",
     "OpenOrca",
     "LiveCodeBench",
