README.md: 112 additions, 0 deletions
@@ -276,6 +276,117 @@ Key multimodal features:
- `image_base_path`: Base directory for resolving relative image paths
- Supports PIL Images, URLs, and file paths

### Benchmarking shard performance

Pass `--stats` to `run` or `submit` to enable per-shard benchmarking. This turns on GPU utilization polling and throughput tracking on compute nodes. Both are disabled by default to avoid unnecessary overhead.
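
To make the polling idea concrete, here is a minimal sketch of how GPU utilization could be sampled with `nvidia-smi` and reduced to the `gpu_util_*` fields shown in the report. It is an illustration only, not the tool's actual implementation: the function names `parse_utilization`, `sample_gpu_util`, and `summarize` are hypothetical, and the only external call is the standard `nvidia-smi --query-gpu=utilization.gpu` query.

```python
import shutil
import subprocess
from statistics import mean
from typing import Optional

# Standard nvidia-smi query: prints one utilization percentage per GPU, one per line.
QUERY = ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"]


def parse_utilization(output: str) -> list[float]:
    """Parse nvidia-smi output into one float per GPU (hypothetical helper)."""
    return [float(line) for line in output.splitlines() if line.strip()]


def sample_gpu_util() -> Optional[float]:
    """Return mean utilization across visible GPUs, or None if it cannot be read."""
    if shutil.which("nvidia-smi") is None:
        return None  # no NVIDIA driver tools on this node
    try:
        result = subprocess.run(
            QUERY, capture_output=True, text=True, check=True, timeout=10
        )
        values = parse_utilization(result.stdout)
    except (subprocess.SubprocessError, OSError, ValueError):
        return None  # query failed, timed out, or printed something non-numeric
    return mean(values) if values else None


def summarize(samples: list[float]) -> Optional[dict]:
    """Reduce collected samples to fields named like those in the report."""
    if not samples:
        return None
    return {
        "gpu_util_mean": round(mean(samples), 1),
        "gpu_util_min": min(samples),
        "gpu_util_max": max(samples),
        "gpu_util_samples": len(samples),
    }
```

A caller would invoke `sample_gpu_util()` on a timer while the shard runs, keep the non-`None` results in a list, and pass that list to `summarize` at the end. Each sample spawns a subprocess, which is one reason to keep this kind of polling opt-in.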

```bash
# Local run with stats collection
mmirage run --config configs/config_mock.yaml --stats
```

After the run completes, inspect the results with:

```bash
mmirage stats --config configs/config_mock.yaml
```

This prints a JSON report with per-shard details and an aggregate summary. The values in this sample are illustrative, and the `per_shard` and `aggregate` figures do not reconcile with each other:

```json
{
  "per_shard": [
    {
      "shard_id": 0,
      "status": "success",
      "started_at": "2026-04-30T10:00:00",
      "finished_at": "2026-04-30T10:01:05",
      "stats": {
        "runtime_seconds": 65.2,
        "runtime_human": "1m 5s",
        "rows_processed": 1024,
        "throughput_rows_per_sec": 15.7,
        "gpu_util_mean": 88.4,
        "gpu_util_min": 72.0,
        "gpu_util_max": 98.0,
        "gpu_util_samples": 13,
        "input_tokens": 512000,
        "output_tokens": 196608,
        "num_gpus": 4,
        "tokens_per_sec_per_gpu": 753.1,
        "gpu_days_per_billion_tokens": 0.0015
      }
    }
  ],
  "aggregate": {
    "total_shards": 1,
    "completed_shards": 1,
    "total_rows_processed": 1000,
    "wall_clock_runtime_seconds": 133.04,
    "wall_clock_runtime_human": "2m 13s",
    "sum_shard_runtime_seconds": 133.04,
    "sum_shard_runtime_human": "2m 13s",
    "min_shard_runtime_seconds": 133.04,
    "min_shard_runtime_human": "2m 13s",
    "max_shard_runtime_seconds": 133.04,
    "max_shard_runtime_human": "2m 13s",
    "overall_throughput_rows_per_sec": 7.52,
    "mean_gpu_util_pct": 86.2,
    "num_gpus": 4,
    "total_input_tokens": 146214,
    "total_output_tokens": 1022046,
    "sum_model_load_seconds": 38.272,
    "sum_inference_runtime_seconds": 94.768,
    "tokens_per_sec_per_gpu": 10784.72,
    "gpu_days_per_billion_tokens": 1.0732
  }
}
```
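
Because the report is plain JSON, it can be post-processed with a few lines of Python. The sketch below assumes the report has been captured as text (for example by redirecting the command's output to a file, which presumes nothing else is printed to stdout) and that its field names match the sample above. The helpers `slowest_shard` and `completion_ratio` are hypothetical and not part of the tool.

```python
import json
from typing import Optional


def slowest_shard(report: dict) -> Optional[dict]:
    """Return the per-shard entry with the largest runtime, skipping shards without stats."""
    timed = [
        shard
        for shard in report.get("per_shard", [])
        if shard.get("stats") and shard["stats"].get("runtime_seconds") is not None
    ]
    return max(timed, key=lambda s: s["stats"]["runtime_seconds"], default=None)


def completion_ratio(report: dict) -> Optional[float]:
    """Fraction of shards that completed, or None if the total is missing or zero."""
    aggregate = report.get("aggregate") or {}
    total = aggregate.get("total_shards")
    if not total:
        return None
    return aggregate.get("completed_shards", 0) / total


# Trimmed-down report text using the same field names as the sample above.
report_text = """
{
  "per_shard": [
    {"shard_id": 0, "status": "success", "stats": {"runtime_seconds": 65.2}},
    {"shard_id": 1, "status": "success", "stats": {"runtime_seconds": 80.0}},
    {"shard_id": 2, "status": "failed", "stats": null}
  ],
  "aggregate": {"total_shards": 3, "completed_shards": 2}
}
"""
report = json.loads(report_text)
print(slowest_shard(report)["shard_id"])  # 1
print(completion_ratio(report))
```

Checking for `None` before reading a field matters here, since token and GPU fields may be `null` in a real report.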

Key metrics:

- **`runtime_seconds`** / **`runtime_human`**: elapsed time from when the shard started running on the cluster (after dispatch) until it finished. Queue wait time is excluded.
- **`overall_throughput_rows_per_sec`**: total rows processed divided by wall-clock time, measured across all shards running in parallel.
- **`mean_gpu_util_pct`**: mean GPU utilization, in percent, across shards.
- **`tokens_per_sec_per_gpu`**: output tokens generated per second per GPU. This is the primary throughput metric used by frameworks such as [DataTrove](https://github.com/huggingface/datatrove).
- **`gpu_days_per_billion_tokens`**: total GPU-days consumed to generate 1 billion output tokens. Useful for cost and scaling comparisons across hardware configurations.
- Token metrics are `null` when no LLM processor was active. GPU stats are `null` when `nvidia-smi` is unavailable or `--stats` was not passed.
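
As a worked example of two of these definitions, the sketch below recomputes `overall_throughput_rows_per_sec` and `gpu_days_per_billion_tokens` from the `aggregate` values in the sample report. The formulas are a reading of the metric descriptions above, not code taken from the tool. They do reproduce the sample's aggregate figures, but the function names are hypothetical.

```python
from typing import Optional

SECONDS_PER_DAY = 86_400
ONE_BILLION = 1_000_000_000


def rows_per_sec(total_rows: int, wall_clock_seconds: float) -> Optional[float]:
    """Total rows divided by wall-clock time; None if no time was recorded."""
    if not wall_clock_seconds or wall_clock_seconds <= 0:
        return None
    return total_rows / wall_clock_seconds


def gpu_days_per_billion_tokens(tokens_per_sec_per_gpu: Optional[float]) -> Optional[float]:
    """GPU-days needed to generate one billion tokens at the given per-GPU rate."""
    if tokens_per_sec_per_gpu is None or tokens_per_sec_per_gpu <= 0:
        return None  # mirrors the null token metrics when no LLM processor ran
    gpu_seconds = ONE_BILLION / tokens_per_sec_per_gpu
    return gpu_seconds / SECONDS_PER_DAY


# Values taken from the "aggregate" section of the sample report.
print(round(rows_per_sec(1000, 133.04), 2))              # 7.52
print(round(gpu_days_per_billion_tokens(10784.72), 4))   # 1.0732
```

Because `gpu_days_per_billion_tokens` is the reciprocal of the per-GPU token rate, scaled to days and a billion tokens, halving the rate doubles the cost figure.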