# Benchmarking

## Inference

End-to-end inference benchmarking can be performed using the 🤗 [`optimum-benchmark`](https://github.com/huggingface/optimum-benchmark) library. See the example script in [inference_benchmark.py](inference_benchmark.py).
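For orientation, below is a minimal sketch of how such a benchmark can be wired up with the `optimum-benchmark` Python API. It is not the exact contents of `inference_benchmark.py`; the model, input shapes, and the `quantization_scheme`/`quantization_config` arguments are illustrative assumptions about the PyTorch backend's bitsandbytes integration.

```python
# Minimal sketch: benchmark one bitsandbytes NF4 configuration with
# optimum-benchmark's PyTorch backend (not the exact benchmark script).
from optimum_benchmark import (
    Benchmark,
    BenchmarkConfig,
    InferenceConfig,
    ProcessConfig,
    PyTorchConfig,
)
from optimum_benchmark.logging_utils import setup_logging

setup_logging(level="INFO")

backend_config = PyTorchConfig(
    model="meta-llama/Llama-3.1-8B",  # any causal LM checkpoint
    device="cuda",
    device_ids="0",
    no_weights=True,  # benchmark with randomly initialized weights
    torch_dtype="bfloat16",
    quantization_scheme="bnb",  # assumption: bitsandbytes scheme name
    quantization_config={"load_in_4bit": True, "bnb_4bit_quant_type": "nf4"},
)

scenario_config = InferenceConfig(
    latency=True,
    memory=False,
    input_shapes={"batch_size": 1, "sequence_length": 64},
    generate_kwargs={"max_new_tokens": 128, "min_new_tokens": 128},
)

benchmark_config = BenchmarkConfig(
    name="bnb-nf4-llama-3.1-8b",
    launcher=ProcessConfig(),  # run the benchmark in an isolated process
    scenario=scenario_config,
    backend=backend_config,
)

report = Benchmark.launch(benchmark_config)
report.log()  # per-phase latency/throughput; the report can also be serialized
```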

### Results (as of v0.45.0)

Our overall benchmarking results compared with v0.44.1 provide the following insights:

#### LLM.int8()
* **Turing/Ampere/Ada**: The observed per-token throughput is improved by 60-85%, while latency is decreased by 40-45%.
* **H100**: In our benchmarking of Llama 3.1 70B, we observed the new LLM.int8() to consistently outperform NF4 at batch sizes >= 8.

#### NF4/FP4
* **Turing/Ampere/Ada**: With a batch size of 1, per-token throughput is _improved by 10-25%_ and per-token latency is _decreased by 10-20%_.
* **H100**: Across all batch sizes, per-token throughput is _improved by up to 28%_ and per-token latency is _decreased by up to 22%_.

Summaries of the benchmarking results are provided below. Throughput is reported in tokens/s and equals the batch size divided by the mean latency (e.g. 32 / 0.0601 s ≈ 532 tokens/s for the FP16 baseline on the T4).
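The row labels in the tables map to `transformers` quantization settings. As a hedged illustration (the exact configurations live in `inference_benchmark.py`), the variants correspond roughly to the following `BitsAndBytesConfig` setups; treating `llm_int8_threshold=0.0` as "INT8" and the default `6.0` as "INT8+Decomp" is an assumption about how outlier decomposition was toggled.

```python
# Illustrative mapping from the table row labels to quantization configs.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

CONFIGS = {
    "NF4": BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4"),
    "NF4+DQ": BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,  # DQ = double quantization of the quant constants
    ),
    # Assumption: threshold 0.0 disables mixed-precision outlier decomposition.
    "INT8": BitsAndBytesConfig(load_in_8bit=True, llm_int8_threshold=0.0),
    "INT8+Decomp": BitsAndBytesConfig(load_in_8bit=True, llm_int8_threshold=6.0),
}

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",  # FP16/BF16 rows are the unquantized baseline
    torch_dtype=torch.bfloat16,
    quantization_config=CONFIGS["NF4+DQ"],
    device_map="auto",
)
```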

#### NVIDIA T4 16GB
<details>
<summary>Qwen 2.5 3B Instruct</summary>

| Configuration | Batch Size | Mean Latency (s) <sub>v0.45.0.dev</sub> | Throughput (tokens/s) <sub>v0.45.0.dev</sub> | Mean Latency (s) <sub>v0.44.1</sub> | Latency Improvement | Throughput (tokens/s) <sub>v0.44.1</sub> | Throughput Improvement |
|---------------|------------|------------------------------|------------------------|--------------------------|---------------------|--------------------|------------------------|
| FP16          | 1          | 0.0390 | 25.66  | 0.0390 | 1.00 | 25.66  | 1.000x |
| NF4           | 1          | 0.0608 | 16.45  | 0.0710 | 1.14 | 14.08  | 1.168x |
| NF4+DQ        | 1          | 0.0736 | 13.58  | 0.0905 | 1.19 | 11.05  | 1.229x |
| INT8          | 1          | 0.0902 | 11.08  | 0.1609 | 1.44 | 6.21   | 1.784x |
| INT8+Decomp   | 1          | 0.1672 | 5.98   | 0.2994 | 1.44 | 3.34   | 1.790x |
| FP16          | 8          | 0.0422 | 189.56 | 0.0422 | 1.00 | 189.56 | 1.000x |
| NF4           | 8          | 0.0960 | 83.37  | 0.1010 | 1.05 | 79.17  | 1.053x |
| NF4+DQ        | 8          | 0.1042 | 76.80  | 0.1156 | 1.10 | 69.18  | 1.110x |
| INT8          | 8          | 0.0919 | 87.01  | 0.1640 | 1.44 | 48.78  | 1.784x |
| INT8+Decomp   | 8          | 0.1812 | 44.15  | 0.3296 | 1.45 | 24.28  | 1.818x |
| FP16          | 32         | 0.0601 | 532.30 | 0.0601 | 1.00 | 532.30 | 1.000x |
| NF4           | 32         | 0.1150 | 278.32 | 0.1182 | 1.03 | 270.71 | 1.028x |
| NF4+DQ        | 32         | 0.1215 | 263.36 | 0.1297 | 1.06 | 246.76 | 1.067x |
| INT8          | 32         | 0.0943 | 339.21 | 0.1640 | 1.42 | 195.14 | 1.738x |
| INT8+Decomp   | 32         | 0.1912 | 167.37 | 0.3413 | 1.44 | 93.75  | 1.785x |
</details>

#### NVIDIA RTX 4090 24GB
<details>
<summary>Llama 3.1 8B</summary>

| Configuration | Batch Size | Mean Latency (s) <sub>v0.45.0.dev</sub> | Throughput (tokens/s) <sub>v0.45.0.dev</sub> | Mean Latency (s) <sub>v0.44.1</sub> | Latency Improvement | Throughput (tokens/s) <sub>v0.44.1</sub> | Throughput Improvement |
|---------------|------------|------------------------------|------------------------|--------------------------|---------------------|--------------------|------------------------|
| BF16          | 1          | 0.0211 | 47.46   | 0.0211 | 1.00 | 47.46   | 1.000x |
| NF4           | 1          | 0.0148 | 67.71   | 0.0164 | 1.10 | 61.08   | 1.109x |
| NF4+DQ        | 1          | 0.0175 | 57.08   | 0.0208 | 1.16 | 48.15   | 1.185x |
| INT8          | 1          | 0.0220 | 45.39   | 0.0395 | 1.44 | 25.32   | 1.793x |
| INT8+Decomp   | 1          | 0.0449 | 22.26   | 0.0743 | 1.40 | 13.45   | 1.655x |
| BF16          | 8          | 0.0239 | 334.64  | 0.0239 | 1.00 | 334.64  | 1.000x |
| NF4           | 8          | 0.0425 | 188.08  | 0.0422 | 0.99 | 189.50  | 0.993x |
| NF4+DQ        | 8          | 0.0443 | 180.68  | 0.0437 | 0.99 | 183.02  | 0.987x |
| INT8          | 8          | 0.0221 | 361.61  | 0.0389 | 1.43 | 205.82  | 1.757x |
| INT8+Decomp   | 8          | 0.0478 | 164.55  | 0.0777 | 1.38 | 103.01  | 1.597x |
| BF16          | 32         | 0.0304 | 1054.35 | 0.0304 | 1.00 | 1054.35 | 1.000x |
| NF4           | 32         | 0.0461 | 694.60  | 0.0466 | 1.01 | 686.90  | 1.011x |
| NF4+DQ        | 32         | 0.0471 | 678.73  | 0.0480 | 1.02 | 666.33  | 1.019x |
| INT8          | 32         | 0.0230 | 1390.54 | 0.0390 | 1.41 | 819.99  | 1.696x |
| INT8+Decomp   | 32         | 0.0512 | 624.94  | 0.0835 | 1.39 | 383.18  | 1.631x |
</details>

<details>
<summary>Qwen 2.5 14B Instruct</summary>

| Configuration | Batch Size | Mean Latency (s) <sub>v0.45.0.dev</sub> | Throughput (tokens/s) <sub>v0.45.0.dev</sub> | Mean Latency (s) <sub>v0.44.1</sub> | Latency Improvement | Throughput (tokens/s) <sub>v0.44.1</sub> | Throughput Improvement |
|---------------|------------|------------------------------|------------------------|--------------------------|---------------------|--------------------|------------------------|
| NF4           | 1          | 0.0214 | 46.74  | 0.0256 | 1.16 | 39.10  | 1.195x |
| NF4+DQ        | 1          | 0.0256 | 39.03  | 0.0318 | 1.19 | 31.46  | 1.241x |
| INT8          | 1          | 0.0326 | 30.68  | 0.0596 | 1.45 | 16.79  | 1.827x |
| INT8+Decomp   | 1          | 0.0648 | 15.44  | 0.1105 | 1.41 | 9.05   | 1.706x |
| NF4           | 8          | 0.0696 | 114.95 | 0.0697 | 1.00 | 114.78 | 1.001x |
| NF4+DQ        | 8          | 0.0719 | 111.29 | 0.0723 | 1.01 | 110.70 | 1.005x |
| INT8          | 8          | 0.0325 | 246.22 | 0.0596 | 1.45 | 134.21 | 1.835x |
| INT8+Decomp   | 8          | 0.0721 | 110.95 | 0.1201 | 1.40 | 66.62  | 1.665x |
</details>

#### NVIDIA H100 80GB SXM
<details>
<summary>Llama 3.1 8B</summary>

| Configuration | Batch Size | Mean Latency (s) <sub>v0.45.0.dev</sub> | Throughput (tokens/s) <sub>v0.45.0.dev</sub> | Mean Latency (s) <sub>v0.44.1</sub> | Latency Improvement | Throughput (tokens/s) <sub>v0.44.1</sub> | Throughput Improvement |
|---------------|------------|------------------------------|------------------------|--------------------------|---------------------|--------------------|------------------------|
| BF16          | 1          | 0.0244 | 40.99   | 0.0244 | 1.00 | 40.99   | 1.000x |
| NF4           | 1          | 0.0331 | 30.14   | 0.0391 | 1.15 | 25.60   | 1.177x |
| NF4+DQ        | 1          | 0.0411 | 24.34   | 0.0528 | 1.22 | 18.92   | 1.286x |
| INT8          | 1          | 0.0522 | 19.17   | N/A    | N/A  | N/A     | N/A    |
| INT8+Decomp   | 1          | 0.0817 | 12.24   | N/A    | N/A  | N/A     | N/A    |
| BF16          | 8          | 0.0255 | 313.90  | 0.0255 | 1.00 | 313.90  | 1.000x |
| NF4           | 8          | 0.0476 | 168.05  | 0.0551 | 1.14 | 145.13  | 1.158x |
| NF4+DQ        | 8          | 0.0566 | 141.27  | 0.0663 | 1.15 | 120.67  | 1.171x |
| INT8          | 8          | 0.0515 | 155.44  | N/A    | N/A  | N/A     | N/A    |
| INT8+Decomp   | 8          | 0.0853 | 93.79   | N/A    | N/A  | N/A     | N/A    |
| BF16          | 32         | 0.0261 | 1227.96 | 0.0261 | 1.00 | 1227.96 | 1.000x |
| NF4           | 32         | 0.0486 | 658.65  | 0.0546 | 1.11 | 585.91  | 1.124x |
| NF4+DQ        | 32         | 0.0577 | 555.06  | 0.0665 | 1.13 | 481.04  | 1.154x |
| INT8          | 32         | 0.0545 | 586.26  | N/A    | N/A  | N/A     | N/A    |
| INT8+Decomp   | 32         | 0.0864 | 370.51  | N/A    | N/A  | N/A     | N/A    |

*N/A: v0.44.1 does not support LLM.int8() on H100, so no baseline exists for the INT8 rows.*
</details>

<details>
<summary>Qwen 2.5 32B Instruct</summary>

| Configuration | Batch Size | Mean Latency (s) <sub>v0.45.0.dev</sub> | Throughput (tokens/s) <sub>v0.45.0.dev</sub> |
|---------------|------------|-----------------------------------------|----------------------------------------------|
| BF16          | 1          | 0.0508 | 19.67  |
| NF4           | 1          | 0.0707 | 14.14  |
| NF4+DQ        | 1          | 0.0860 | 11.63  |
| INT8          | 1          | 0.1031 | 9.70   |
| INT8+Decomp   | 1          | 0.1820 | 5.49   |
| BF16          | 8          | 0.0525 | 152.50 |
| NF4           | 8          | 0.1154 | 69.35  |
| NF4+DQ        | 8          | 0.1209 | 66.19  |
| INT8          | 8          | 0.1078 | 74.24  |
| INT8+Decomp   | 8          | 0.1958 | 40.87  |
| BF16          | 32         | 0.0547 | 584.54 |
| NF4           | 32         | 0.1246 | 256.84 |
| NF4+DQ        | 32         | 0.1298 | 246.47 |
| INT8          | 32         | 0.1056 | 302.96 |
| INT8+Decomp   | 32         | 0.2027 | 157.83 |
</details>

<details>
<summary>Llama 3.1 70B</summary>

| Configuration | Batch Size | Mean Latency (s) <sub>v0.45.0.dev</sub> | Throughput (tokens/s) <sub>v0.45.0.dev</sub> |
|---------------|------------|-----------------------------------------|----------------------------------------------|
| NF4           | 1          | 0.0833 | 12.00  |
| NF4+DQ        | 1          | 0.1052 | 9.50   |
| INT8          | 1          | 0.1294 | 7.73   |
| INT8+Decomp   | 1          | 0.1985 | 5.04   |
| NF4           | 8          | 0.2348 | 34.07  |
| NF4+DQ        | 8          | 0.2423 | 33.01  |
| INT8          | 8          | 0.1313 | 60.94  |
| INT8+Decomp   | 8          | 0.2052 | 38.99  |
| NF4           | 32         | 0.2491 | 128.46 |
| NF4+DQ        | 32         | 0.2580 | 124.04 |
| INT8          | 32         | 0.1314 | 243.45 |
| INT8+Decomp   | 32         | 0.2189 | 146.19 |
</details>

#### Software Configuration

We focus on the default PyTorch CUDA backend in 🤗 [`optimum-benchmark`](https://github.com/huggingface/optimum-benchmark). We used commit [`6e6b1036`](https://github.com/huggingface/optimum-benchmark/commit/6e6b10363f3ac65926881f2c6a6113b6cefc06cd).

For all hardware configurations, we used the following dependencies:
* `transformers==4.46.3`
* `accelerate==1.1.1`
* `tokenizers==0.20.3`
* `torch==2.5.1`
* `bitsandbytes==0.44.1` (baseline)
* `bitsandbytes==0.45.0.dev`

In the RTX 4090 setting, we used the CUDA 12.4 build of PyTorch; in the other settings, we used the CUDA 12.1 build.
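As an optional sanity check (a sketch, not part of the benchmark scripts), the pinned versions and the PyTorch CUDA build can be verified before running:

```python
# Verify pinned dependency versions and the PyTorch CUDA build.
from importlib.metadata import version

import torch

PINNED = {
    "transformers": "4.46.3",
    "accelerate": "1.1.1",
    "tokenizers": "0.20.3",
    "torch": "2.5.1",
}

for pkg, expected in PINNED.items():
    installed = version(pkg)
    # startswith() tolerates local build tags such as "2.5.1+cu124".
    marker = "OK" if installed.startswith(expected) else f"expected {expected}"
    print(f"{pkg}=={installed} ({marker})")

# "12.4" in the RTX 4090 setting, "12.1" in the others.
print(f"CUDA build: {torch.version.cuda}")
```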