diff --git a/AGENTS.md b/AGENTS.md
Lines changed: 1 addition & 1 deletion
diff --git a/examples/10_DeepSeekR1_Example/README.md b/examples/10_DeepSeekR1_Example/README.md
Lines changed: 16 additions & 15 deletions
diff --git a/examples/10_DeepSeekR1_Example/offline_deepseek_r1_accuracy.yaml b/examples/10_DeepSeekR1_Example/offline_deepseek_r1_accuracy.yaml
Lines changed: 2 additions & 1 deletion
diff --git a/examples/10_DeepSeekR1_Example/run_client.sh b/examples/10_DeepSeekR1_Example/run_client.sh
Lines changed: 2 additions & 2 deletions
diff --git a/examples/10_DeepSeekR1_Example/score_livecodebench.sh b/examples/10_DeepSeekR1_Example/score_livecodebench.sh
Lines changed: 1 addition & 1 deletion
diff --git a/pyproject.toml b/pyproject.toml
Lines changed: 7 additions & 1 deletion
diff --git a/src/inference_endpoint/commands/benchmark/execute.py b/src/inference_endpoint/commands/benchmark/execute.py
Lines changed: 8 additions & 1 deletion
diff --git a/src/inference_endpoint/dataset_manager/__init__.py b/src/inference_endpoint/dataset_manager/__init__.py
Lines changed: 2 additions & 0 deletions
@@ -96,7 +96,7 @@ Dataset Manager --> Load Generator --> Endpoint Client --> External Endpoint
 | **CLI**                  | `src/inference_endpoint/main.py`, `commands/benchmark/cli.py`                                          | cyclopts-based, auto-generated from `schema.py` and `HTTPClientConfig` Pydantic models. Flat shorthands via `cyclopts.Parameter(alias=...)`                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
 | **Async Utils**          | `src/inference_endpoint/async_utils/`                                                                  | `LoopManager` (uvloop + eager_task_factory), ZMQ transport layer, generic `MessageCodec[T]`-parametrized pub/sub, event publisher                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
 | **OpenAI/SGLang**        | `src/inference_endpoint/openai/`, `sglang/`                                                            | Protocol adapters and response accumulators for different API formats. `openai_completions` adapter (`completions_adapter.py`) sends pre-tokenized token IDs to `/v1/completions`, bypassing the server chat template - required for gpt-oss-120b on vLLM. `sglang` adapter sends to `/generate` via `input_ids`. Both apply `Harmonize()` client-side.                                                                                                                                                                                                                                                                                                                                                                                         |
-| **DeepSeek-R1 (MLPerf)** | `src/inference_endpoint/evaluation/scoring.py` (`DeepSeekR1Scorer`), `examples/10_DeepSeekR1_Example/` | MLPerf DeepSeek-R1 accuracy. TensorRT-LLM is OpenAI-compatible, so it is served via `api_type: openai` / `openai_completions` (no dedicated trtllm adapter). The combined multi-subset eval (`math500`/`aime`/`gpqa`/`mmlu_pro`/`livecodebench`) is the official MLCommons `eval_accuracy.py`, run out-of-process via `uv run --project` against the isolated subproject at `examples/10_DeepSeekR1_Example/accuracy/` (mirrors the VBench pattern). The example feeds the exact MLPerf prompt via pre-tokenized `input_tokens` to `/v1/completions`.                                                                                                                                                                                           |
+| **DeepSeek-R1 (MLPerf)** | `src/inference_endpoint/evaluation/scoring.py` (`DeepSeekR1Scorer`), `examples/10_DeepSeekR1_Example/` | MLPerf DeepSeek-R1 accuracy. TensorRT-LLM is OpenAI-compatible, so it is served via `api_type: openai` / `openai_completions` (no dedicated trtllm adapter). The combined multi-subset eval (`math500`/`aime`/`gpqa`/`mmlu_pro`/`livecodebench`) is the official MLCommons `eval_accuracy.py`, run out-of-process via `uv run --project` against the isolated subproject at `src/inference_endpoint/evaluation/deepseek_r1/` (a uv subproject excluded from the parent wheel; mirrors the VBench pattern). The example feeds the exact MLPerf prompt via pre-tokenized `input_tokens` to `/v1/completions`.                                                                                                                                     |
 | **VideoGen**             | `src/inference_endpoint/videogen/`                                                                     | Adapter for video-generation endpoints (e.g. trtllm-serve `POST /v1/videos/generations`, used by MLPerf WAN2.2-T2V-A14B). Defaults to `response_format=video_path` (server saves video to shared storage and returns path) to avoid large byte payloads. Accuracy mode also runs on `video_path`: the adapter mirrors the path into `response_output` so the event log carries it to `VBenchScorer` (see `evaluation/scoring.py`), which scores videos via VBench from a sibling `uv` subproject at `examples/09_Wan22_VideoGen_Example/accuracy/` (vbench's `transformers==4.33.2` + `numpy<2` pins are incompatible with the parent env, so it runs out-of-process via `uv run --project`). Dataset is ingested via the generic JSONL loader. |
 
 ### Hot-Path Architecture
 
@@ -7,8 +7,9 @@ accuracy dataset** (4388 samples) with this repo's `inference-endpoint` tool.
 Accuracy is the official MLCommons _combined-subset_ evaluation - `math500`,
 `aime`, `gpqa`, `mmlu_pro`, `livecodebench`, each graded by its own parser and
 aggregated into one `exact_match` plus `tokens_per_sample`, via the
-`deepseek_r1` scorer (which shells out to the isolated `accuracy/` subproject -
-see [`accuracy/RUNBOOK.md`](accuracy/RUNBOOK.md)).
+`deepseek_r1` scorer (which shells out to the isolated subproject at
+`src/inference_endpoint/evaluation/deepseek_r1/` - see
+[`RUNBOOK.md`](../../src/inference_endpoint/evaluation/deepseek_r1/RUNBOOK.md)).
 
 | Metric              | Golden (FP32) | Pass criterion             |
 | ------------------- | ------------- | -------------------------- |
@@ -17,16 +18,16 @@ see [`accuracy/RUNBOOK.md`](accuracy/RUNBOOK.md)).
 
 ## Files
 
-| File                                       | Purpose                                                           |
-| ------------------------------------------ | ----------------------------------------------------------------- |
-| `prepare_dataset.py`                       | pkl -> parquet (+ `--subset N` stratified slice, + tiny perf set) |
-| `trtllm_serve_config.yaml`                 | `trtllm-serve --extra_llm_api_options` for 4 GPUs (TP=4, EP=4)    |
-| `launch_and_run.sh`                        | SLURM launch: serve -> health -> (probe + run \| `SERVER_ONLY`)   |
-| `run_client.sh`                            | Drive the benchmark from the login node (cross-arch clusters)     |
-| `score_livecodebench.sh`                   | Score the LCB subset on a compute node (hardened sandbox)         |
-| `offline_deepseek_r1_accuracy.yaml`        | Full 4388-sample accuracy config                                  |
-| `offline_deepseek_r1_accuracy_subset.yaml` | ~385-sample representative config (quick estimate)                |
-| `accuracy/`                                | Isolated `uv` subproject wrapping the MLCommons evaluator         |
+| File                                             | Purpose                                                           |
+| ------------------------------------------------ | ----------------------------------------------------------------- |
+| `prepare_dataset.py`                             | pkl -> parquet (+ `--subset N` stratified slice, + tiny perf set) |
+| `trtllm_serve_config.yaml`                       | `trtllm-serve --extra_llm_api_options` for 4 GPUs (TP=4, EP=4)    |
+| `launch_and_run.sh`                              | SLURM launch: serve -> health -> (probe + run \| `SERVER_ONLY`)   |
+| `run_client.sh`                                  | Drive the benchmark from the login node (cross-arch clusters)     |
+| `score_livecodebench.sh`                         | Score the LCB subset on a compute node (hardened sandbox)         |
+| `offline_deepseek_r1_accuracy.yaml`              | Full 4388-sample accuracy config                                  |
+| `offline_deepseek_r1_accuracy_subset.yaml`       | ~385-sample representative config (quick estimate)                |
+| `src/inference_endpoint/evaluation/deepseek_r1/` | Isolated `uv` subproject wrapping the MLCommons evaluator         |
 
 ## WARNING Read first - verified gotchas on a GB200 SLURM cluster
 
@@ -75,7 +76,7 @@ unfinished`) and needs a ~21 GB dataset load that OOMs the login cgroup.
 - Parent env synced: `uv sync --extra dev` from the repo root; `uv` on `PATH`.
 - Accuracy subproject set up once (network needed):
   ```bash
-  cd examples/10_DeepSeekR1_Example/accuracy && uv sync && bash setup_eval.sh && cd -
+  cd src/inference_endpoint/evaluation/deepseek_r1 && uv sync && bash setup_eval.sh && cd -
   ```
 
 ## Prepare the dataset (once)
@@ -147,7 +148,7 @@ model id) in the chosen YAML, then:
 
 ```bash
 export MODEL_DIR=/path/to/deepseek_r1-torch-fp4
-export DEEPSEEK_EVAL_PROJECT_PATH=examples/10_DeepSeekR1_Example/accuracy  # only if not running from the repo root
+export DEEPSEEK_EVAL_PROJECT_PATH=src/inference_endpoint/evaluation/deepseek_r1  # only if not running from the repo root
 inference-endpoint benchmark from-config \
   --config examples/10_DeepSeekR1_Example/offline_deepseek_r1_accuracy.yaml --mode acc
 ```
@@ -237,7 +238,7 @@ score it afterward on a clean compute node:
 
 ```bash
 sbatch examples/10_DeepSeekR1_Example/score_livecodebench.sh
-# -> accuracy/lcb_datasets/lcb_results.json  {"total_samples": 349, "passed_samples": P, ...}
+# -> src/inference_endpoint/evaluation/deepseek_r1/lcb_datasets/lcb_results.json  {"total_samples": 349, "passed_samples": P, ...}
 ```
 
 It runs the same hardened `lcb_serve` (kill-on-timeout) directly on the node.
 
@@ -10,7 +10,8 @@
 #
 # Accuracy is the official MLCommons combined-subset eval (math500 / aime /
 # gpqa / mmlu_pro / livecodebench) via the `deepseek_r1` scorer, which shells
-# out to the isolated subproject in ./accuracy/. Golden FP32 exact_match =
+# out to the isolated subproject at src/inference_endpoint/evaluation/deepseek_r1/.
+# Golden FP32 exact_match =
 # 81.3582; pass threshold = 99% of golden (>= 80.5246). tokens_per_sample
 # golden = 3886.2274 (pass band 90-110%).
 #
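
The pass criteria quoted in this header comment can be recomputed from the two golden values. The following is a minimal sketch, assuming the rule is a plain multiplicative fraction of golden as the comment describes; the function names are illustrative and do not come from the repo.

```python
# Golden values as quoted in the YAML header comment.
GOLDEN_EXACT_MATCH = 81.3582
GOLDEN_TOKENS_PER_SAMPLE = 3886.2274


def exact_match_threshold(golden: float, fraction: float = 0.99) -> float:
    """Lowest passing exact_match: a fixed fraction of the golden score."""
    return golden * fraction


def tokens_band(golden: float, low: float = 0.90, high: float = 1.10) -> tuple:
    """Inclusive (min, max) band for tokens_per_sample around the golden value."""
    return golden * low, golden * high


def passes(exact_match: float, tokens_per_sample: float) -> bool:
    """True when both metrics satisfy their criteria (hypothetical helper)."""
    lo, hi = tokens_band(GOLDEN_TOKENS_PER_SAMPLE)
    return (
        exact_match >= exact_match_threshold(GOLDEN_EXACT_MATCH)
        and lo <= tokens_per_sample <= hi
    )


if __name__ == "__main__":
    print(f"exact_match threshold: {exact_match_threshold(GOLDEN_EXACT_MATCH):.4f}")
    lo, hi = tokens_band(GOLDEN_TOKENS_PER_SAMPLE)
    print(f"tokens_per_sample band: {lo:.4f} .. {hi:.4f}")
```

Under this reading, 99% of 81.3582 evaluates to roughly 80.5446, whereas the comment quotes 80.5246. The diff does not say which figure is intended, so the number may be worth checking against the MLCommons reference before relying on it.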
 
@@ -20,8 +20,8 @@ SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 REPO_ROOT="$(cd "${SCRIPT_DIR}/../.." && pwd)"
 export PATH="${HOME}/.local/bin:${PATH}"
 # The deepseek_r1 scorer shells out to the accuracy subproject (set up once with
-# `cd accuracy && uv sync && bash setup_eval.sh`).
-export DEEPSEEK_EVAL_PROJECT_PATH="${SCRIPT_DIR}/accuracy"
+# `cd src/inference_endpoint/evaluation/deepseek_r1 && uv sync && bash setup_eval.sh`).
+export DEEPSEEK_EVAL_PROJECT_PATH="${REPO_ROOT}/src/inference_endpoint/evaluation/deepseek_r1"
 # The benchmark config references ${MODEL_DIR} (tokenizer for tokens_per_sample);
 # it is resolved when the YAML is loaded, so it must be set in this environment.
 : "${MODEL_DIR:?Set MODEL_DIR to your DeepSeek-R1 FP4 checkpoint directory}"
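
The path change above matters because the scorer locates the subproject through `DEEPSEEK_EVAL_PROJECT_PATH` and then invokes it with `uv run --project`. A minimal sketch of that lookup follows. How `DeepSeekR1Scorer` actually resolves the path is not shown in this diff, so `resolve_eval_project`, `build_uv_command`, and the fallback to a repo-relative default are assumptions made for illustration.

```python
from __future__ import annotations

import os
from pathlib import Path
from typing import Mapping, Optional, Sequence

# New subproject location introduced by this change, relative to the repo root.
DEFAULT_SUBPROJECT = Path("src/inference_endpoint/evaluation/deepseek_r1")


def resolve_eval_project(
    repo_root: Path, env: Optional[Mapping[str, str]] = None
) -> Path:
    """Prefer the environment override; otherwise use the repo-relative default.

    Hypothetical helper: the real scorer may resolve the path differently.
    """
    env = os.environ if env is None else env
    override = env.get("DEEPSEEK_EVAL_PROJECT_PATH")
    return Path(override) if override else repo_root / DEFAULT_SUBPROJECT


def build_uv_command(project: Path, script_args: Sequence[str]) -> list[str]:
    """Build (but do not run) an out-of-process `uv run --project` command.

    `script_args` is whatever the evaluator expects; it is left to the caller
    because the evaluator's arguments are not part of this diff.
    """
    return ["uv", "run", "--project", str(project), *script_args]
```

Keeping command construction separate from execution makes the path logic easy to check without `uv` or the evaluator installed.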
 
@@ -30,7 +30,7 @@ echo "=== node $(hostname) arch $(uname -m) @ $(date -u +%H:%M:%S) ==="
 EX="examples/10_DeepSeekR1_Example"
 OUTPUTS_PARQUET="${OUTPUTS_PARQUET:-${REPO_ROOT}/logs/deepseek_r1_fp4_accuracy/deepseek_eval/deepseek_r1_accuracy_outputs.parquet}"
 LCB_VARIANT="${LCB_VARIANT:-release_v6}"   # superset, so all question_ids resolve
-LCBDIR="${REPO_ROOT}/${EX}/accuracy/lcb_datasets"
+LCBDIR="${REPO_ROOT}/src/inference_endpoint/evaluation/deepseek_r1/lcb_datasets"
 LCBIN="${LCBDIR}/lcb_input.parquet"
 RESULTS="${LCBDIR}/lcb_results.json"
 
 
@@ -12,7 +12,13 @@ environments = [
 
 [tool.uv.build-backend]
 module-root = "src"
-source-exclude = ["inference_endpoint/evaluation/livecodebench/_server.py"]
+source-exclude = [
+    "inference_endpoint/evaluation/livecodebench/_server.py",
+    # Isolated uv subproject (own pyproject.toml/uv.lock; pinned transformers /
+    # numpy<2 / prm800k that conflict with the parent env). Invoked only via
+    # `uv run --project`, never imported - keep it out of the parent wheel.
+    "inference_endpoint/evaluation/deepseek_r1/**",
+]
 
 [project]
 name = "inference-endpoint"
 
@@ -900,8 +900,15 @@ def finalize_benchmark(ctx: BenchmarkContext, bench: BenchmarkResult) -> None:
             "ground_truth_column": eval_cfg.ground_truth_column,
             "score": score,
             "n_repeats": n_repeats,
+            # False when the scorer produced only a partial headline (e.g.
+            # DeepSeekR1Scorer when the lcb-service container was unreachable),
+            # so a partial number is never mistaken for a complete one.
+            "complete": scorer_instance.complete,
         }
-        logger.info(f"Score for {eval_cfg.dataset_name}: {score} ({n_repeats} repeats)")
+        logger.info(
+            f"Score for {eval_cfg.dataset_name}: {score} "
+            f"({n_repeats} repeats, complete={scorer_instance.complete})"
+        )
 
     # Report metrics: prefer Report from MetricsSnapshot, fall back to SessionResult
     if report is not None and report.duration_ns is not None:
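
The new `complete` field lets downstream tooling refuse a partial headline instead of reporting it as final. A minimal sketch of such a consumer is given below, assuming each result entry is a mapping with the `score` and `complete` keys written in this hunk; `headline_score` is a hypothetical helper, and treating a missing `complete` key as complete (for results written before this change) is an assumption, not documented behavior.

```python
from typing import Any, Mapping


def headline_score(entry: Mapping[str, Any]) -> Any:
    """Return the score only when the scorer marked the result complete.

    Entries without a `complete` key are treated as complete, on the
    assumption that they predate the field.
    """
    if not entry.get("complete", True):
        raise ValueError(
            "partial score: the scorer reported complete=False, "
            "so this number must not be used as a headline result"
        )
    return entry["score"]


if __name__ == "__main__":
    print(headline_score({"score": 81.2, "n_repeats": 1, "complete": True}))
    try:
        headline_score({"score": 74.0, "n_repeats": 1, "complete": False})
    except ValueError as exc:
        print(f"rejected: {exc}")
```

Raising rather than returning `None` keeps a partial number from silently flowing into a report or a pass/fail comparison.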
 
@@ -24,6 +24,7 @@
 from .multi_turn_dataset import MultiTurnDataset
 from .predefined.aime25 import AIME25
 from .predefined.cnndailymail import CNNDailyMail
+from .predefined.deepseek_r1 import DeepSeekR1
 from .predefined.gpqa import GPQA
 from .predefined.livecodebench import LiveCodeBench
 from .predefined.open_orca import OpenOrca
@@ -56,6 +57,7 @@
     "MakeAdapterCompatible",
     "apply_transforms",
     "AIME25",
+    "DeepSeekR1",
     "GPQA",
     "OpenOrca",
     "LiveCodeBench",