
Commit ecc8c5c

move tuner to benchmark
1 parent 99f20cd commit ecc8c5c

4 files changed

Lines changed: 20 additions & 477 deletions


docs/guide/mscclpp-torch-integration.md

Lines changed: 15 additions & 4 deletions
@@ -475,7 +475,7 @@ All examples are in [`examples/torch-integration/`](../../examples/torch-integra
 
 The default algorithms use a fixed heuristic to select algorithms based on message size. For production workloads, you can achieve significantly better performance by **auto-tuning** — benchmarking every candidate algorithm, block count, and thread count for each message size at startup, then using the fastest configuration at runtime.
 
-**Full example:** [customized_comm_with_tuning.py](../../examples/torch-integration/customized_comm_with_tuning.py)
+**Reference implementation:** MSCCL++ ships a ready-to-use autotuner in [`python/mscclpp_benchmark/bench_collective.py`](../../python/mscclpp_benchmark/bench_collective.py). It benchmarks every candidate algorithm, block count, and thread count per message size, writes the winning configuration to a JSON file, and can replay that file at runtime. The sections below explain the underlying mechanism; see that benchmark for the complete, maintained implementation.
 
 ### How It Works
 

@@ -656,9 +656,20 @@ def benchmark(self, n_warmup=10, n_graph_launches=10, n_iter_per_graph=100):
 self.all_reduce(tensor, op=torch.distributed.ReduceOp.SUM)
 ```
 
-### Running the Tuning Example
+### Running the Autotuner
+
+MSCCL++'s built-in autotuner benchmarks every candidate configuration and saves the best one to JSON. Run it across the ranks of your job, then reuse the generated config:
 
 ```bash
-MSCCLPP_MASTER_ADDR=<ip> MSCCLPP_MASTER_PORT=<port> \
-torchrun --nnodes=1 --nproc_per_node=8 customized_comm_with_tuning.py
+# Autotune and save the tuned config
+mpirun -np 8 --allow-run-as-root \
+python3 -m mscclpp_benchmark.bench_collective \
+--collective allreduce --dtype float16 --autotune \
+--write-config /tmp/mscclpp_tuned_configs.json
+
+# Replay the tuned config in a benchmark
+mpirun -np 8 --allow-run-as-root \
+python3 -m mscclpp_benchmark.bench_collective \
+--collective allreduce --dtype float16 \
+--config-path /tmp/mscclpp_tuned_configs.json
 ```
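
The mechanism the updated guide text describes (time every candidate algorithm, block count, and thread count per message size, keep the fastest, save the result to JSON, reload it later) can be sketched in plain Python. This is a minimal illustration only, not the code in `bench_collective.py`: the function names, the `make_runner` callback, and the JSON layout are all hypothetical, and a real tuner would time GPU collectives across ranks rather than host-side callables.

```python
import itertools
import json
import time


def time_candidate(run, n_warmup=2, n_iter=5):
    """Return the mean wall-clock seconds per call of run() (hypothetical helper)."""
    for _ in range(n_warmup):
        run()
    start = time.perf_counter()
    for _ in range(n_iter):
        run()
    return (time.perf_counter() - start) / n_iter


def autotune(message_sizes, algorithms, block_counts, thread_counts,
             make_runner, timer=time_candidate):
    """Exhaustively search all candidates and keep the fastest per message size.

    make_runner(size, algo, nblocks, nthreads) is an assumed callback that
    returns a zero-argument callable executing one collective call.
    """
    best = {}
    for size in message_sizes:
        fastest = None
        for algo, nblocks, nthreads in itertools.product(
                algorithms, block_counts, thread_counts):
            elapsed = timer(make_runner(size, algo, nblocks, nthreads))
            if fastest is None or elapsed < fastest["seconds"]:
                fastest = {"algorithm": algo, "nblocks": nblocks,
                           "nthreads": nthreads, "seconds": elapsed}
        # JSON object keys must be strings, so store the size as text.
        best[str(size)] = fastest
    return best


def save_config(best, path):
    """Write the winning configurations to a JSON file (layout is an assumption)."""
    with open(path, "w") as f:
        json.dump(best, f, indent=2)


def load_config(path):
    """Read back a configuration file written by save_config()."""
    with open(path) as f:
        return json.load(f)
```

At runtime, a caller would look up `load_config(path)[str(message_size)]` and launch with the stored algorithm, block count, and thread count instead of re-benchmarking. The `timer` parameter is injectable so the search logic can be exercised with a deterministic cost function.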
