Commit b18d843

Merge branch 'main' into fix/docker-deployment
2 parents 560d384 + 54923f1 commit b18d843

14 files changed

Lines changed: 818 additions & 42 deletions

README.md

Lines changed: 112 additions & 0 deletions
@@ -276,6 +276,117 @@ Key multimodal features:
- `image_base_path`: Base directory for resolving relative image paths
- Supports PIL Images, URLs, and file paths

### Benchmarking shard performance

Pass `--stats` to `run` or `submit` to enable per-shard benchmarking. This activates GPU
utilization polling and throughput tracking on compute nodes; both are disabled by default to
avoid unnecessary overhead.

```bash
# Local run with stats collection
mmirage run --config configs/config_mock.yaml --stats
```

After the run completes, inspect the results with:

```bash
mmirage stats --config configs/config_mock.yaml
```

This prints a JSON report with per-shard details and an aggregate summary:

```json
{
  "per_shard": [
    {
      "shard_id": 0,
      "status": "success",
      "started_at": "2026-04-30T10:00:00",
      "finished_at": "2026-04-30T10:01:05",
      "stats": {
        "runtime_seconds": 65.2,
        "runtime_human": "1m 5s",
        "rows_processed": 1024,
        "throughput_rows_per_sec": 15.7,
        "gpu_util_mean": 88.4,
        "gpu_util_min": 72.0,
        "gpu_util_max": 98.0,
        "gpu_util_samples": 13,
        "input_tokens": 512000,
        "output_tokens": 196608,
        "num_gpus": 4,
        "tokens_per_sec_per_gpu": 753.1,
        "gpu_days_per_billion_tokens": 0.0015
      }
    }
  ],
  "aggregate": {
    "total_shards": 1,
    "completed_shards": 1,
    "total_rows_processed": 1000,
    "wall_clock_runtime_seconds": 133.04,
    "wall_clock_runtime_human": "2m 13s",
    "sum_shard_runtime_seconds": 133.04,
    "sum_shard_runtime_human": "2m 13s",
    "min_shard_runtime_seconds": 133.04,
    "min_shard_runtime_human": "2m 13s",
    "max_shard_runtime_seconds": 133.04,
    "max_shard_runtime_human": "2m 13s",
    "overall_throughput_rows_per_sec": 7.52,
    "mean_gpu_util_pct": 86.2,
    "num_gpus": 4,
    "total_input_tokens": 146214,
    "total_output_tokens": 1022046,
    "sum_model_load_seconds": 38.272,
    "sum_inference_runtime_seconds": 94.768,
    "tokens_per_sec_per_gpu": 10784.72,
    "gpu_days_per_billion_tokens": 1.0732
  }
}
```
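
Because the report is plain JSON, it can be post-processed with a few lines of Python. The sketch below assumes the output of `mmirage stats` has already been captured as a string (for example by redirecting stdout to a file, which is an assumption about usage rather than a documented workflow). `summarize_report` and `fraction` are hypothetical helpers, not part of MMIRAGE; they only rely on field names that appear in the sample report above.

```python
import json


def fraction(part, total):
    """Return part/total, or None when either value is missing or total is zero."""
    if part is None or not total:
        return None
    return part / total


def summarize_report(report_text: str) -> dict:
    """Hypothetical helper: condense the aggregate block of a stats report."""
    agg = json.loads(report_text)["aggregate"]
    total = agg.get("sum_shard_runtime_seconds")
    return {
        "completed": f'{agg["completed_shards"]}/{agg["total_shards"]}',
        # Share of summed shard runtime spent loading the model vs. running inference
        "model_load_fraction": fraction(agg.get("sum_model_load_seconds"), total),
        "inference_fraction": fraction(agg.get("sum_inference_runtime_seconds"), total),
    }


# Aggregate values copied from the sample report; per_shard omitted for brevity.
sample = json.dumps({
    "per_shard": [],
    "aggregate": {
        "total_shards": 1,
        "completed_shards": 1,
        "sum_shard_runtime_seconds": 133.04,
        "sum_model_load_seconds": 38.272,
        "sum_inference_runtime_seconds": 94.768,
    },
})
summary = summarize_report(sample)
print(summary)
```

With the sample numbers, roughly 29% of the summed shard runtime is model loading, which is worth keeping in mind when comparing short benchmark runs.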

Key metrics:
- **`runtime_seconds`** / **`runtime_human`**: elapsed time from the moment the shard starts executing on the cluster (after dispatch) until it finishes; queue wait time is excluded.
- **`overall_throughput_rows_per_sec`**: total rows processed divided by wall-clock time, measured across all shards running in parallel.
- **`mean_gpu_util_pct`**: mean GPU utilization, in percent, averaged across shards.
- **`tokens_per_sec_per_gpu`**: output tokens generated per second per GPU, the primary throughput metric used by frameworks such as [DataTrove](https://github.com/huggingface/datatrove).
- **`gpu_days_per_billion_tokens`**: total GPU-days consumed to generate 1 billion output tokens; useful for cost and scaling comparisons across different hardware configurations.
- Token metrics are `null` when no LLM processor was active, and GPU stats are `null` when `nvidia-smi` is unavailable or `--stats` was not passed.
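
The relationship between the derived metrics can be sketched in a few lines. The formulas below are inferred from the numbers in the sample `aggregate` block, not taken from the MMIRAGE source: the sample values are reproduced when output tokens are divided by `sum_inference_runtime_seconds` and by a single GPU (the benchmark config requests `gpus: 1`), even though the sample lists `num_gpus: 4`. The function names are hypothetical.

```python
SECONDS_PER_DAY = 86_400


def tokens_per_sec_per_gpu(output_tokens: int, inference_seconds: float, num_gpus: int) -> float:
    """Output tokens generated per second, normalised by GPU count (assumed formula)."""
    return output_tokens / inference_seconds / num_gpus


def gpu_days_per_billion_tokens(tok_per_sec_per_gpu: float) -> float:
    """GPU-days one GPU would need to emit 1e9 output tokens at the given rate."""
    seconds_for_one_billion = 1e9 / tok_per_sec_per_gpu
    return seconds_for_one_billion / SECONDS_PER_DAY


def rows_per_sec(total_rows: int, wall_clock_seconds: float) -> float:
    """Overall row throughput: rows processed divided by wall-clock time."""
    return total_rows / wall_clock_seconds


# Inputs copied from the sample aggregate block; num_gpus=1 is an assumption.
tps = tokens_per_sec_per_gpu(1_022_046, 94.768, num_gpus=1)
gpu_days = gpu_days_per_billion_tokens(tps)
throughput = rows_per_sec(1000, 133.04)
print(round(tps, 2), round(gpu_days, 4), round(throughput, 2))
```

Rounded, these come out to the sample's `tokens_per_sec_per_gpu` (10784.72), `gpu_days_per_billion_tokens` (1.0732), and `overall_throughput_rows_per_sec` (7.52).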

Reference benchmark:
- [DataTrove Benchmark](https://github.com/huggingface/datatrove/tree/main/examples/inference/benchmark)

The config `configs/config_benchmark_datatrove.yaml` mirrors the DataTrove inference benchmark conditions:

| Setting | Value |
|---|---|
| Dataset | `simplescaling/s1K-1.1` (train split, 1 000 samples) |
| Prompt | raw `question` field, no system prompt |
| Output | up to 1 024 tokens per sample |
| Context | 2 048-token model max context |
| Model | `Qwen/Qwen3-4B` (DataTrove baseline: tp=1 on a single GPU) |

Download the dataset before running:

```python
from datasets import load_dataset
ds = load_dataset('simplescaling/s1K-1.1', split='train')
ds.save_to_disk('data/s1K-1.1')
```

Then run with stats collection enabled:

```bash
mmirage run --config configs/config_benchmark_datatrove.yaml --stats
```

Inspect results:

```bash
mmirage stats --config configs/config_benchmark_datatrove.yaml
```

## Architecture

MMIRAGE uses a modular architecture:
@@ -299,3 +410,4 @@ mmirage/
- JMESPath for JSON queries: [link](https://jmespath.org/)
- SGLang for fast inference: [link](https://github.com/sgl-project/sglang)
- Performance paper: [link](https://arxiv.org/abs/2408.02442)
- DataTrove Benchmark: [link](https://github.com/huggingface/datatrove/tree/main/examples/inference/benchmark)
Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
# MMIRAGE: DataTrove-compatible throughput benchmark
# See README.md for setup instructions and benchmark details.

processors:
- type: llm
server_args:
model_path: Qwen/Qwen3-4B # same model family as DataTrove baseline
tp_size: 1 # DataTrove baseline: tp=1
trust_remote_code: true
disable_custom_all_reduce: true
# SGLang engine tuning: equivalents of DataTrove's vLLM mns/mnbt knobs
extra_engine_args:
max_running_requests: 1000
default_sampling_params:
temperature: 0.0
max_new_tokens: 1024 # DataTrove: max-tokens=1024

loading_params:
state_dir: data/benchmark_s1k/_pipeline_state
datasets:
- path: data/s1K-1.1 # save_to_disk() target above
type: loadable
output_dir: data/benchmark_s1k/output
num_shards: 1
shard_id: "$SLURM_ARRAY_TASK_ID"
batch_size: 1000

processing_params:
inputs:
- name: question
key: question # DataTrove: prompt-column=question

outputs:
- name: answer
type: llm
output_type: plain
# Qwen3 thinking is disabled by embedding an empty <think> block in the prompt.
# This is equivalent to passing enable_thinking=False to the chat template and
# avoids any dependency on SGLang sampling-param support for that flag.
prompt: "<|im_start|>user\n{{ question }}\n<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n"

remove_columns: false
output_schema:
question: "{{ question }}"
answer: "{{ answer }}"

execution_params:
mode: slurm
retry: false
merge: false
max_retries: 3
account: a127
job_name: mmirage-sharded
nodes: 1
ntasks_per_node: 1
gpus: 1
cpus_per_task: 288
time_limit: "11:59:59"
report_dir: "/users/${USER}/reports"
hf_home: "/capstor/store/cscs/swissai/a127/homes/${USER}/hf"
edf_env: "/users/${USER}/.edf/mmirage.toml"
poll_interval_seconds: 30
settle_time_seconds: 60

pyproject.toml

Lines changed: 1 addition & 0 deletions
@@ -35,6 +35,7 @@ dependencies = [
    "jinja2>=3.0.0",
    "pillow>=9.0.0",
    "typing_extensions>=4.5.0; python_version < '3.12'",
    "humanize>=4.0.0",
]

[project.optional-dependencies]
