Commit 54923f1

qchapp, Copilot, and fabnemEPFL authored
Feature/statistics (#42)
* trying a benchmark
* fixed stats
* small test
* testing something
* slurm config
* display issue with multiple nodes
* small test again
* trying again
* excluding cold start
* now same for gpu
* small corrections
* ready for PR
* Potential fix for pull request finding

  Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* Potential fix for pull request finding

  Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* copilot suggestions
* Potential fix for pull request finding

  Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* Potential fix for pull request finding

  Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* function deduplication
* Potential fix for pull request finding

  Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* copilot changes
* implemented changes requested by fabrice
* fixed TokenCounts logic
* fixed various typing and logic errors
* fixed image base path

---------

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Fabrice Nemo <fabrice.nemo@epfl.ch>
1 parent b956ef6 commit 54923f1

14 files changed

Lines changed: 817 additions & 42 deletions

README.md

Lines changed: 112 additions & 0 deletions
@@ -235,6 +235,117 @@ Key multimodal features:
- `image_base_path`: Base directory for resolving relative image paths
- Supports PIL Images, URLs, and file paths

### Benchmarking shard performance

Pass `--stats` to `run` or `submit` to enable per-shard benchmarking. This activates GPU
utilization polling and throughput tracking on compute nodes. Both are disabled by default to
avoid unnecessary overhead.
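
As a rough illustration of what GPU utilization polling can look like, the sketch below samples `nvidia-smi` and reduces the samples to the `gpu_util_*` fields that appear in the report. It is a minimal sketch and not MMIRAGE's actual implementation: the function names `parse_gpu_util`, `sample_gpu_util`, and `summarize` are hypothetical, and the only external interface relied on is the standard `nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits` query.

```python
from __future__ import annotations

import shutil
import statistics
import subprocess


def parse_gpu_util(csv_text: str) -> list[float]:
    """Parse nvidia-smi CSV output: one utilization percentage per line, one line per GPU."""
    return [float(line) for line in csv_text.splitlines() if line.strip()]


def sample_gpu_util() -> list[float] | None:
    """Return the current utilization of each visible GPU, or None if nvidia-smi is missing."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_util(result.stdout)


def summarize(samples: list[float]) -> dict[str, float | int]:
    """Reduce collected samples to mean/min/max/count (hypothetical field mapping)."""
    return {
        "gpu_util_mean": round(statistics.fmean(samples), 1),
        "gpu_util_min": min(samples),
        "gpu_util_max": max(samples),
        "gpu_util_samples": len(samples),
    }
```

A poller would call `sample_gpu_util()` at a fixed interval while the shard runs, append the values to a list, and call `summarize()` once at the end. Returning `None` when `nvidia-smi` is absent is one way to end up with `null` GPU stats in the report.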

```bash
# Local run with stats collection
mmirage run --config configs/config_mock.yaml --stats
```

After the run completes, inspect the results with:

```bash
mmirage stats --config configs/config_mock.yaml
```

This prints a JSON report with per-shard details and an aggregate summary:

```json
{
  "per_shard": [
    {
      "shard_id": 0,
      "status": "success",
      "started_at": "2026-04-30T10:00:00",
      "finished_at": "2026-04-30T10:01:05",
      "stats": {
        "runtime_seconds": 65.2,
        "runtime_human": "1m 5s",
        "rows_processed": 1024,
        "throughput_rows_per_sec": 15.7,
        "gpu_util_mean": 88.4,
        "gpu_util_min": 72.0,
        "gpu_util_max": 98.0,
        "gpu_util_samples": 13,
        "input_tokens": 512000,
        "output_tokens": 196608,
        "num_gpus": 4,
        "tokens_per_sec_per_gpu": 753.1,
        "gpu_days_per_billion_tokens": 0.0015
      }
    }
  ],
  "aggregate": {
    "total_shards": 1,
    "completed_shards": 1,
    "total_rows_processed": 1000,
    "wall_clock_runtime_seconds": 133.04,
    "wall_clock_runtime_human": "2m 13s",
    "sum_shard_runtime_seconds": 133.04,
    "sum_shard_runtime_human": "2m 13s",
    "min_shard_runtime_seconds": 133.04,
    "min_shard_runtime_human": "2m 13s",
    "max_shard_runtime_seconds": 133.04,
    "max_shard_runtime_human": "2m 13s",
    "overall_throughput_rows_per_sec": 7.52,
    "mean_gpu_util_pct": 86.2,
    "num_gpus": 4,
    "total_input_tokens": 146214,
    "total_output_tokens": 1022046,
    "sum_model_load_seconds": 38.272,
    "sum_inference_runtime_seconds": 94.768,
    "tokens_per_sec_per_gpu": 10784.72,
    "gpu_days_per_billion_tokens": 1.0732
  }
}
```
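
Because the report is plain JSON, it can be post-processed with a few lines of Python. The sketch below finds the slowest shard in a report. The inline `REPORT_JSON` is a hypothetical three-shard sample that reuses the field names from the example. The `"failed"` status and a `null` `stats` entry are assumptions about how an incomplete shard might appear, and `slowest_shard` is an illustrative helper, not part of MMIRAGE.

```python
import json

# Hypothetical sample with the same field layout as the report above.
REPORT_JSON = """
{
  "per_shard": [
    {"shard_id": 0, "status": "success", "stats": {"runtime_seconds": 65.2, "rows_processed": 1024}},
    {"shard_id": 1, "status": "success", "stats": {"runtime_seconds": 71.9, "rows_processed": 1024}},
    {"shard_id": 2, "status": "failed", "stats": null}
  ],
  "aggregate": {"total_shards": 3, "completed_shards": 2}
}
"""


def slowest_shard(report):
    """Return the shard entry with the largest runtime, skipping shards without stats."""
    with_stats = [s for s in report["per_shard"] if s.get("stats")]
    return max(with_stats, key=lambda s: s["stats"]["runtime_seconds"], default=None)


report = json.loads(REPORT_JSON)
slowest = slowest_shard(report)
print(slowest["shard_id"], slowest["stats"]["runtime_seconds"])  # prints: 1 71.9
```

In practice the string would come from the output of `mmirage stats`, for example after redirecting it to a file and reading that file back.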

Key metrics:
- **`runtime_seconds`** / **`runtime_human`**: elapsed time from when the shard started running on the cluster (after dispatch) until it finished; queue wait time is excluded.
- **`overall_throughput_rows_per_sec`**: total rows processed divided by wall-clock time, across all shards running in parallel.
- **`mean_gpu_util_pct`**: mean GPU utilization, in percent, across shards.
- **`tokens_per_sec_per_gpu`**: output tokens generated per second per GPU. This is the primary throughput metric used by frameworks such as [DataTrove](https://github.com/huggingface/datatrove).
- **`gpu_days_per_billion_tokens`**: total GPU-days consumed to generate 1 billion output tokens, useful for cost and scaling comparisons across different hardware configurations.
- Token metrics are `null` when no LLM processor was active, and GPU stats are `null` when `nvidia-smi` is unavailable or `--stats` was not passed.
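
The two token metrics follow from simple arithmetic, sketched below under stated assumptions. The function names mirror the report fields but are hypothetical, and the text above does not say which runtime MMIRAGE divides by (total shard runtime or inference time without model loading), so the sketch takes the runtime as a parameter.

```python
SECONDS_PER_DAY = 86_400


def tokens_per_sec_per_gpu(output_tokens: int, runtime_seconds: float, num_gpus: int) -> float:
    """Output tokens generated per second, normalized by the number of GPUs."""
    return output_tokens / (runtime_seconds * num_gpus)


def gpu_days_per_billion_tokens(output_tokens: int, runtime_seconds: float, num_gpus: int) -> float:
    """GPU-days spent, scaled up to one billion output tokens."""
    gpu_days = runtime_seconds * num_gpus / SECONDS_PER_DAY
    return gpu_days / output_tokens * 1e9


# Hypothetical run: 1,000,000 output tokens in 100 s on 4 GPUs.
tps = tokens_per_sec_per_gpu(1_000_000, 100.0, 4)
days = gpu_days_per_billion_tokens(1_000_000, 100.0, 4)
print(f"{tps:.1f} tokens/s/GPU, {days:.4f} GPU-days per billion tokens")
```

Under these definitions the two values are reciprocal up to a constant: `gpu_days_per_billion_tokens` equals `1e9 / (tokens_per_sec_per_gpu * 86400)`, so a higher per-GPU throughput translates directly into fewer GPU-days.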

Reference benchmark:
- [DataTrove Benchmark](https://github.com/huggingface/datatrove/tree/main/examples/inference/benchmark)

The config `configs/config_benchmark_datatrove.yaml` mirrors the DataTrove inference benchmark conditions:

| Setting | Value |
|---|---|
| Dataset | `simplescaling/s1K-1.1` (train split, 1 000 samples) |
| Prompt | raw `question` field, no system prompt |
| Output | up to 1 024 tokens per sample |
| Context | 2 048-token model max context |
| Model | `Qwen/Qwen3-4B` (DataTrove baseline: tp=1 on a single GPU) |

Download the dataset before running:

```python
from datasets import load_dataset
ds = load_dataset('simplescaling/s1K-1.1', split='train')
ds.save_to_disk('data/s1K-1.1')
```

Then run with stats collection enabled:

```bash
mmirage run --config configs/config_benchmark_datatrove.yaml --stats
```

Inspect results:

```bash
mmirage stats --config configs/config_benchmark_datatrove.yaml
```

## Architecture

MMIRAGE uses a modular architecture:
@@ -258,3 +369,4 @@ mmirage/
- JMESPath for JSON queries: [link](https://jmespath.org/)
- SGLang for fast inference: [link](https://github.com/sgl-project/sglang)
- Performance paper: [link](https://arxiv.org/abs/2408.02442)
- DataTrove Benchmark: [link](https://github.com/huggingface/datatrove/tree/main/examples/inference/benchmark)

Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
# MMIRAGE: DataTrove-compatible throughput benchmark
# See README.md for setup instructions and benchmark details.

processors:
  - type: llm
    server_args:
      model_path: Qwen/Qwen3-4B # same model family as DataTrove baseline
      tp_size: 1 # DataTrove baseline: tp=1
      trust_remote_code: true
      disable_custom_all_reduce: true
      # SGLang engine tuning: equivalents of DataTrove's vLLM mns/mnbt knobs
      extra_engine_args:
        max_running_requests: 1000
    default_sampling_params:
      temperature: 0.0
      max_new_tokens: 1024 # DataTrove: max-tokens=1024

loading_params:
  state_dir: data/benchmark_s1k/_pipeline_state
  datasets:
    - path: data/s1K-1.1 # save_to_disk() target above
      type: loadable
      output_dir: data/benchmark_s1k/output
  num_shards: 1
  shard_id: "$SLURM_ARRAY_TASK_ID"
  batch_size: 1000

processing_params:
  inputs:
    - name: question
      key: question # DataTrove: prompt-column=question

  outputs:
    - name: answer
      type: llm
      output_type: plain
      # Qwen3 thinking is disabled by embedding an empty <think> block in the prompt.
      # This is equivalent to passing enable_thinking=False to the chat template and
      # avoids any dependency on SGLang sampling-param support for that flag.
      prompt: "<|im_start|>user\n{{ question }}\n<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n"

  remove_columns: false
  output_schema:
    question: "{{ question }}"
    answer: "{{ answer }}"

execution_params:
  mode: slurm
  retry: false
  merge: false
  max_retries: 3
  account: a127
  job_name: mmirage-sharded
  nodes: 1
  ntasks_per_node: 1
  gpus: 1
  cpus_per_task: 288
  time_limit: "11:59:59"
  report_dir: "/users/${USER}/reports"
  hf_home: "/capstor/store/cscs/swissai/a127/homes/${USER}/hf"
  edf_env: "/users/${USER}/.edf/mmirage.toml"
  poll_interval_seconds: 30
  settle_time_seconds: 60
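
The config above relies on environment placeholders such as `$SLURM_ARRAY_TASK_ID` and `${USER}`. How MMIRAGE resolves them is not shown in this diff. As a minimal sketch, Python's standard `os.path.expandvars` handles both spellings. The helper name `expand_env` and the values set below are purely illustrative.

```python
import os


def expand_env(value: str) -> str:
    """Expand $VAR and ${VAR} placeholders; unknown variables are left unchanged."""
    return os.path.expandvars(value)


# Illustrative values; on a cluster, Slurm sets SLURM_ARRAY_TASK_ID for array jobs.
os.environ["SLURM_ARRAY_TASK_ID"] = "3"
os.environ["USER"] = "alice"

shard_id = int(expand_env("$SLURM_ARRAY_TASK_ID"))
report_dir = expand_env("/users/${USER}/reports")
print(shard_id, report_dir)  # prints: 3 /users/alice/reports
```

Converting the expanded `shard_id` to an integer right away makes a missing variable fail loudly, because an unexpanded `$SLURM_ARRAY_TASK_ID` string cannot be parsed as a number.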

pyproject.toml

Lines changed: 1 addition & 0 deletions
@@ -38,6 +38,7 @@ dependencies = [
    "jmespath",
    "jinja2>=3.0.0",
    "pillow>=9.0.0",
    "humanize>=4.0.0",
]

[project.optional-dependencies]
