All benchmarks are wrong, some will cost you less than others.
Optimum-Benchmark is a unified multi-backend & multi-device utility for benchmarking Transformers, Diffusers, PEFT, TIMM and Optimum flavors, along with all their supported optimizations & quantization schemes, for inference & training, in distributed & non-distributed settings, in the most correct, efficient and scalable way possible (you don't even need to download the weights).
## News 📰
- PyPI release soon.
- Added a simple Python API to run benchmarks with all isolation and tracking features supported by the CLI.
## Motivations
- HF hardware partners wanting to know how their hardware performs compared to other hardware on the same models.
- HF ecosystem users wanting to know how their chosen model performs in terms of latency, throughput, memory usage, energy consumption, etc., compared to another model.
- Experimenting with hardware- & backend-specific optimizations & quantization schemes that can be applied to models to improve their computational/memory/energy efficiency.
## Notes
- If you were using `optimum-benchmark` before and want to keep using the old CLI-only version, you can still do so by installing from the `0.0.1` branch.
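For example, assuming that legacy version is indeed on a branch named `0.0.1` of the GitHub repository, it can be installed directly with pip:

```bash
# Assumes the legacy CLI-only version lives on a branch named 0.0.1
pip install git+https://github.com/huggingface/optimum-benchmark.git@0.0.1
```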
You can install `optimum-benchmark` using pip:

```bash
pip install optimum-benchmark
```
or by cloning the repository and installing it in editable mode:

```bash
git clone https://github.com/huggingface/optimum-benchmark.git
cd optimum-benchmark
pip install -e .
```
Depending on the backends you want to use, you might need to install some extra dependencies:
- Pytorch (default): `pip install optimum-benchmark`
- OpenVINO: `pip install optimum-benchmark[openvino]`
- Torch-ORT: `pip install optimum-benchmark[torch-ort]`
- OnnxRuntime: `pip install optimum-benchmark[onnxruntime]`
- TensorRT-LLM: `pip install optimum-benchmark[tensorrt-llm]`
- OnnxRuntime-GPU: `pip install optimum-benchmark[onnxruntime-gpu]`
- Intel Neural Compressor: `pip install optimum-benchmark[neural-compressor]`
- Text Generation Inference: `pip install optimum-benchmark[text-generation-inference]`
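Multiple extras can be combined in a single install if you plan to benchmark several backends; the combination below is only illustrative:

```bash
# Illustrative only: combine the extras you actually need
pip install optimum-benchmark[onnxruntime-gpu,neural-compressor]
```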
You can run benchmarks from the Python API, using the `launch` function. Here's an example of how to run a benchmark using the `pytorch` backend, `torchrun` launcher and `inference` benchmark.

```python
from optimum_benchmark.logging_utils import setup_logging
from optimum_benchmark.experiment import launch, ExperimentConfig
from optimum_benchmark.backends.pytorch.config import PyTorchConfig
from optimum_benchmark.launchers.torchrun.config import TorchrunConfig
from optimum_benchmark.benchmarks.inference.config import InferenceConfig

if __name__ == "__main__":
    setup_logging(level="INFO")
    launcher_config = TorchrunConfig(nproc_per_node=2)
    benchmark_config = InferenceConfig(latency=True, memory=True)
    backend_config = PyTorchConfig(model="gpt2", device="cuda", device_ids="0,1", no_weights=True)
    experiment_config = ExperimentConfig(
        experiment_name="api-launch",
        benchmark=benchmark_config,
        launcher=launcher_config,
        backend=backend_config,
    )
    benchmark_report = launch(experiment_config)
    experiment_config.push_to_hub("IlyasMoutawwakil/benchmarks")  # pushes experiment_config.json to the hub
    benchmark_report.push_to_hub("IlyasMoutawwakil/benchmarks")  # pushes benchmark_report.json to the hub
```
Yep, it's that simple! Check the supported backends, launchers and benchmarks matrix in the features section.
You can also run a benchmark using the command line by specifying the configuration directory and the configuration name. Both arguments are mandatory for `hydra`. `--config-dir` is the directory where the configuration files are stored and `--config-name` is the name of the configuration file without its `.yaml` extension.

```bash
optimum-benchmark --config-dir examples/ --config-name pytorch_bert
```
This will run the benchmark using the configuration in `examples/pytorch_bert.yaml` and store the results in `runs/pytorch_bert`.

The result files are `benchmark_report.json`, the program's logs `cli.log`, and the configuration that was used, `experiment_config.json`, which includes the backend, launcher, benchmark and environment configurations.
The directory for storing these results can be changed by setting `hydra.run.dir` (and/or `hydra.sweep.dir` in case of a multirun) in the command line or in the config file.
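For example, to store the results of a single run somewhere else (the target directory here is only illustrative):

```bash
# my_runs/pytorch_bert is an illustrative output directory, not a convention
optimum-benchmark --config-dir examples/ --config-name pytorch_bert hydra.run.dir=my_runs/pytorch_bert
```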
It's easy to override the default behavior of a benchmark from the command line.

```bash
optimum-benchmark --config-dir examples/ --config-name pytorch_bert backend.model=gpt2 backend.device=cuda
```
You can easily run configuration sweeps using the `-m` or `--multirun` option. By default, configurations will be executed serially but other kinds of executions are supported with hydra's launcher plugins: `hydra/launcher=submitit`, `hydra/launcher=rays`, etc.

```bash
optimum-benchmark --config-dir examples --config-name pytorch_bert -m backend.device=cpu,cuda
```
You can create custom and more complex configuration files following these examples.
`optimum-benchmark` allows you to run benchmarks with minimal configuration. The only required parameters are:

- The launcher to use (e.g. `process`).
- The type of benchmark (e.g. `training`).
- The backend to run on (e.g. `onnxruntime`).
- The model name or path (e.g. `bert-base-uncased`).
Everything else is optional or inferred at runtime, but can be configured to your needs.
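For instance, these four choices can be expressed directly as Hydra overrides on top of one of the example configs; the combination below is an illustrative sketch, not an official example:

```bash
# Illustrative overrides on top of the pytorch_bert example config
optimum-benchmark --config-dir examples/ --config-name pytorch_bert \
  launcher=process benchmark=training backend=onnxruntime backend.model=bert-base-uncased
```

The lists below summarize the supported launchers, backends & devices, benchmarking features, and backend features.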
- Distributed inference/training (`launcher=torchrun`)
- Process isolation between consecutive runs (`launcher=process`)
- Assertion of GPU device isolation for NVIDIA & AMD devices (`launcher.device_isolation=true`)
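For instance, a sketch of launching a distributed run with device isolation from the CLI; the `launcher.nproc_per_node` override is assumed to map to the `nproc_per_node` field of the Python `TorchrunConfig` shown earlier:

```bash
# launcher.nproc_per_node mirrors TorchrunConfig(nproc_per_node=2) from the Python example
optimum-benchmark --config-dir examples/ --config-name pytorch_bert \
  launcher=torchrun launcher.nproc_per_node=2 launcher.device_isolation=true backend.device=cuda
```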
- Pytorch backend for CPU (`backend=pytorch`, `backend.device=cpu`)
- Pytorch backend for CUDA (`backend=pytorch`, `backend.device=cuda`)
- Pytorch backend for Habana Gaudi Processor (`backend=pytorch`, `backend.device=habana`)
- OnnxRuntime backend for CPUExecutionProvider (`backend=onnxruntime`, `backend.device=cpu`)
- OnnxRuntime backend for CUDAExecutionProvider (`backend=onnxruntime`, `backend.device=cuda`)
- OnnxRuntime backend for ROCMExecutionProvider (`backend=onnxruntime`, `backend.device=cuda`, `backend.provider=ROCMExecutionProvider`)
- OnnxRuntime backend for TensorrtExecutionProvider (`backend=onnxruntime`, `backend.device=cuda`, `backend.provider=TensorrtExecutionProvider`)
- Intel Neural Compressor backend for CPU (`backend=neural-compressor`, `backend.device=cpu`)
- TensorRT-LLM backend for CUDA (`backend=tensorrt-llm`, `backend.device=cuda`)
- OpenVINO backend for CPU (`backend=openvino`, `backend.device=cpu`)
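As an example, a sketch of switching an example config to the OnnxRuntime backend with the TensorRT execution provider, using only the overrides listed above:

```bash
# Switch the example config to OnnxRuntime on CUDA with the TensorRT execution provider
optimum-benchmark --config-dir examples/ --config-name pytorch_bert \
  backend=onnxruntime backend.device=cuda backend.provider=TensorrtExecutionProvider
```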
- Memory tracking (`benchmark.memory=true`)
- Energy and efficiency tracking (`benchmark.energy=true`)
- Latency and throughput tracking (`benchmark.latency=true`)
- Warm-up runs before inference (`benchmark.warmup_runs=20`)
- Warm-up steps during training (`benchmark.warmup_steps=20`)
- Input shapes control (e.g. `benchmark.input_shapes.sequence_length=128`)
- Dataset shapes control (e.g. `benchmark.dataset_shapes.dataset_size=1000`)
- Prefill latency and decoding throughput deduced from the generate and forward passes (auto-enabled for text generation models)
- Forward, call and generate pass kwargs control (e.g. `benchmark.generate_kwargs.max_new_tokens=100` for an LLM, `benchmark.call_kwargs.num_images_per_prompt=4` for a diffusion model)
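For instance, a sketch of an inference benchmark that tracks memory and energy while controlling the input shape and generation length; `gpt2` is substituted for the example's model (as in the earlier CLI override) so that the generation kwargs apply:

```bash
# gpt2 is used here so that generate_kwargs apply (text generation model)
optimum-benchmark --config-dir examples/ --config-name pytorch_bert \
  backend.model=gpt2 benchmark.memory=true benchmark.energy=true \
  benchmark.input_shapes.sequence_length=128 benchmark.generate_kwargs.max_new_tokens=100
```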
- "No weights" to benchmark models without downloading their weights (`backend.no_weights=true`)
- OnnxRuntime Quantization and AutoQuantization (`backend.quantization=true` or `backend.auto_quantization=avx2`, etc.)
- OnnxRuntime Calibration for Static Quantization (`backend.quantization_config.is_static=true`, etc.)
- OnnxRuntime Optimization and AutoOptimization (`backend.optimization=true` or `backend.auto_optimization=O4`, etc.)
- BitsAndBytes quantization scheme (`backend.quantization_scheme=bnb`, `backend.quantization_config.load_in_4bit`, etc.)
- GPTQ quantization scheme (`backend.quantization_scheme=gptq`, `backend.quantization_config.bits=4`, etc.)
- PEFT training (`backend.peft_strategy=lora`, `backend.peft_config.task_type=CAUSAL_LM`, etc.)
- Transformers' Flash Attention V2 (`backend.use_flash_attention_v2=true`)
- Optimum's BetterTransformer (`backend.to_bettertransformer=true`)
- DeepSpeed-Inference support (`backend.deepspeed_inference=true`)
- Dynamo/Inductor compiling (`backend.torch_compile=true`)
- Automatic Mixed Precision (`backend.amp_autocast=true`)
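For example, a sketch that combines a couple of these backend features from the CLI, benchmarking `gpt2` without downloading its weights and with `torch.compile` enabled:

```bash
# "No weights" benchmarking of gpt2 with torch.compile enabled
optimum-benchmark --config-dir examples/ --config-name pytorch_bert \
  backend.model=gpt2 backend.no_weights=true backend.torch_compile=true
```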
Contributions are welcome, and we're happy to help you get started. Feel free to open an issue or a pull request. Things that we'd like to see:
- More backends (TensorFlow, TFLite, JAX, etc.).
- More tests (for optimizations and quantization schemes).
- More hardware support (Habana Gaudi Processor (HPU), etc).
- Task evaluators for the most common tasks (would be great for output regression).