Commit 14631a4
Vectorize act-and-mul kernels for speedup (vllm-project#207)

Add vectorized memory access to the activation-and-mul kernels using
aligned_vec loads/stores, with dynamic vec_size dispatch (1-16).
Switch from a 3D to a 1D nd_range for simpler indexing.

All four fused ops (silu_and_mul, mul_and_silu, gelu_and_mul,
gelu_tanh_and_mul) now use the vectorized path. The original
scalar kernel is retained as the VEC_SIZE=1 fallback.

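
The idea described above can be sketched on the host in plain C++. This is only an illustration under stated assumptions, not the kernel from this commit: `aligned_vec` here is a hypothetical stand-in for the real type, `pick_vec_size` and its power-of-two selection rule are assumed, the input layout (`[num_tokens, 2*d]`, gate half followed by up half) is assumed, and the `for` loop over a flat index stands in for the 1D nd_range.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for aligned_vec: N elements moved as one aligned unit.
template <typename T, int N>
struct alignas(sizeof(T) * N) aligned_vec {
  T val[N];
};

inline float silu(float x) { return x / (1.0f + std::exp(-x)); }

// Assumed selection rule: the widest of {16, 8, 4, 2} that divides d and
// matches the alignment of both pointers; otherwise 1 (scalar fallback).
inline int pick_vec_size(const float* in, const float* out, std::size_t d) {
  for (int v = 16; v > 1; v /= 2) {
    const std::size_t bytes = sizeof(float) * static_cast<std::size_t>(v);
    if (d % static_cast<std::size_t>(v) == 0 &&
        reinterpret_cast<std::uintptr_t>(in) % bytes == 0 &&
        reinterpret_cast<std::uintptr_t>(out) % bytes == 0) {
      return v;
    }
  }
  return 1;
}

// out[t, i] = silu(in[t, i]) * in[t, d + i], processed VEC elements at a time.
template <int VEC>
void silu_and_mul_vec(const float* in, float* out, std::size_t num_tokens,
                      std::size_t d) {
  assert(d % VEC == 0);
  const std::size_t vecs_per_row = d / VEC;
  const std::size_t total = num_tokens * vecs_per_row;
  // Each iteration plays the role of one work-item in a 1D range.
  for (std::size_t idx = 0; idx < total; ++idx) {
    const std::size_t token = idx / vecs_per_row;
    const std::size_t col = (idx % vecs_per_row) * VEC;
    const float* gate_ptr = in + token * 2 * d + col;
    const float* up_ptr = gate_ptr + d;

    aligned_vec<float, VEC> gate, up, res;
    std::memcpy(gate.val, gate_ptr, sizeof(gate.val));  // one wide load
    std::memcpy(up.val, up_ptr, sizeof(up.val));        // one wide load
    for (int k = 0; k < VEC; ++k) {
      res.val[k] = silu(gate.val[k]) * up.val[k];
    }
    std::memcpy(out + token * d + col, res.val, sizeof(res.val));  // one wide store
  }
}

// Runtime dispatch to a compile-time vector width.
inline void silu_and_mul(const float* in, float* out, std::size_t num_tokens,
                         std::size_t d) {
  switch (pick_vec_size(in, out, d)) {
    case 16: silu_and_mul_vec<16>(in, out, num_tokens, d); break;
    case 8:  silu_and_mul_vec<8>(in, out, num_tokens, d); break;
    case 4:  silu_and_mul_vec<4>(in, out, num_tokens, d); break;
    case 2:  silu_and_mul_vec<2>(in, out, num_tokens, d); break;
    default: silu_and_mul_vec<1>(in, out, num_tokens, d); break;
  }
}
```

With a flat index, the token and column are recovered by one division and one modulo, which is the kind of simplification a 1D range allows. The `VEC=1` instantiation is the same code path as the scalar kernel, so a fallback needs no separate implementation in this sketch.
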
Benchmark results (avg GPU time in us, 200 iterations, no per-iter sync):
| Model | Tokens | Dtype | d (intermediate_size) | Baseline (us) | Vectorized (us) | Change |
|-------|--------|-------|-----------------------|---------------|-----------------|--------|
| llama3-70b | 128 | fp16 | 28672 | 24.01 | 8.96 | -62.7% |
| llama3-70b | 128 | bf16 | 28672 | 27.25 | 11.25 | -58.7% |
| llama3-70b | 512 | fp16 | 28672 | 262.79 | 202.13 | -23.1% |
| llama3-70b | 512 | bf16 | 28672 | 261.46 | 202.67 | -22.5% |
| llama3-70b | 1024 | fp16 | 28672 | 545.11 | 424.03 | -22.2% |
| llama3-70b | 1024 | bf16 | 28672 | 545.13 | 424.82 | -22.1% |
| llama3-70b | 2048 | fp16 | 28672 | 1108.82 | 872.10 | -21.3% |
| llama3-70b | 2048 | bf16 | 28672 | 1108.13 | 872.70 | -21.2% |
| llama3-8b | 128 | fp16 | 14336 | 33.05 | 6.51 | -80.3% |
| llama3-8b | 128 | bf16 | 14336 | 26.65 | 6.15 | -76.9% |
| llama3-8b | 512 | fp16 | 14336 | 169.74 | 92.10 | -45.7% |
| llama3-8b | 512 | bf16 | 14336 | 139.62 | 93.25 | -33.2% |
| llama3-8b | 1024 | fp16 | 14336 | 261.68 | 201.64 | -22.9% |
| llama3-8b | 1024 | bf16 | 14336 | 260.92 | 201.73 | -22.7% |
| llama3-8b | 2048 | fp16 | 14336 | 539.98 | 420.75 | -22.1% |
| llama3-8b | 2048 | bf16 | 14336 | 541.28 | 422.87 | -21.9% |
| qwen-14b | 512 | fp16 | 13824 | 116.04 | 85.32 | -26.5% |
| qwen-14b | 512 | bf16 | 13824 | 114.37 | 85.69 | -25.1% |
| qwen-14b | 1024 | fp16 | 13824 | 238.41 | 193.29 | -18.9% |
| qwen-14b | 1024 | bf16 | 13824 | 254.00 | 193.76 | -23.7% |
| qwen-14b | 2048 | fp16 | 13824 | 527.05 | 407.07 | -22.8% |
| qwen-14b | 2048 | bf16 | 13824 | 521.38 | 403.80 | -22.6% |
| qwen-32b | 128 | fp16 | 27648 | 20.65 | 6.29 | -69.5% |
| qwen-32b | 128 | bf16 | 27648 | 21.35 | 6.89 | -67.7% |
| qwen-32b | 512 | fp16 | 27648 | 253.84 | 193.79 | -23.7% |
| qwen-32b | 512 | bf16 | 27648 | 253.64 | 193.84 | -23.6% |
| qwen-32b | 1024 | fp16 | 27648 | 526.81 | 407.99 | -22.6% |
| qwen-32b | 1024 | bf16 | 27648 | 523.08 | 408.52 | -21.9% |
| qwen-32b | 2048 | fp16 | 27648 | 1069.97 | 838.01 | -21.7% |
| qwen-32b | 2048 | bf16 | 27648 | 1068.91 | 838.34 | -21.6% |

Signed-off-by: Ma, Liangliang <liangliang.ma@intel.com>

1 parent 1b4770e commit 14631a4
1 file changed
Lines changed: 91 additions & 4 deletions

Diff hunks (line contents were not captured):

| Original file line | New file lines | Change |
|---|---|---|
| (none) | 124-166 | 43 lines added |
| (none) | 247-290 | 44 lines added |
| 209 | 296 | 1 line replaced |
| 218 | 305 | 1 line replaced |
| 227 | 314 | 1 line replaced |
| 236 | 323 | 1 line replaced |