Commit 3ae46b4
[compile] Invoke split FX graph by codegen.
Summary:
This PR reduces inference loop runtime overhead by codegen-ing slightly faster Python
code instead of invoking the FX graph directly after compilation.
Context:
Today, VllmBackend returns a callable in the form of an FX GraphModule containing multiple submodules, whose generated forward looks like:
```
def forward(self, ...):
self.submod_0(...)
self.submod_1(...)
...
```
FX graph execution has some overhead because:
1. getattr() calls are needed to fetch each submodule.
2. Each submodule call pushes multiple CPython stack frames before reaching the real kernels.
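The two overhead sources above can be illustrated with a plain-Python sketch (hypothetical names, no torch dependency): a GraphModule-style forward reaches each submodule through attribute lookup and an extra call per step, while a codegen-style runner pre-fetches the submodules once and calls them directly.

```python
# Hypothetical sketch of the two execution styles (not vLLM code).

class SubMod:
    """Stands in for a compiled FX submodule."""
    def __init__(self, offset):
        self.offset = offset
    def __call__(self, x):
        return x + self.offset

class GraphModuleStyle:
    """Mimics fx.GraphModule.forward: getattr lookups + nested frames."""
    def __init__(self):
        self.submod_0 = SubMod(1)
        self.submod_1 = SubMod(2)
    def forward(self, x):
        # Each line performs an attribute lookup before the call.
        x = self.submod_0(x)
        x = self.submod_1(x)
        return x

def make_flat_runner(submods):
    """Codegen-style runner: submodules bound to locals up front."""
    s0, s1 = submods            # attribute lookups happen once, here
    def run(x):
        return s1(s0(x))        # direct calls, no per-step getattr
    return run

gm = GraphModuleStyle()
flat = make_flat_runner([gm.submod_0, gm.submod_1])
assert gm.forward(10) == flat(10) == 13
```

Both paths compute the same result; the codegen'd one just does less interpreter work per call.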
We address this by introducing a new codegen layer that runs after all compiler passes and right before inference runtime. In this layer we gain full control over how the graph is executed.
Sample generated code:
```
submod_0 = _submods[0](l_input_ids_, s72, l_self_modules_embed_tokens_parameters_weight_, l_self_modules_layers_modules_0_modules_input_layernorm_parameters_weight_, l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_, l_positions_, l_self_modules_layers_modules_0_modules_self_attn_modules_rotary_emb_buffers_cos_sin_cache_)
getitem = submod_0[0]
getitem_1 = submod_0[1]
getitem_2 = submod_0[2]
getitem_3 = submod_0[3]
getitem_4 = submod_0[4]
submod_1 = _submods[1](getitem, s72, getitem_1, getitem_2, getitem_3)
submod_2 = _submods[2](getitem_3, s72, l_self_modules_layers_modules_0_modules_self_attn_modules_o_proj_parameters_weight_, l_self_modules_layers_modules_0_modules_post_attention_layernorm_parameters_weight_, getitem_4, l_self_modules_layers_modules_0_modules_mlp_modules_gate_up_proj_parameters_weight_, l_self_modules_layers_modules_0_modules_mlp_modules_down_proj_parameters_weight_, l_self_modules_layers_modules_1_modules_input_layernorm_parameters_weight_, l_self_modules_layers_modules_1_modules_self_attn_modules_qkv_proj_parameters_weight_, l_positions_, l_self_modules_layers_modules_0_modules_self_attn_modules_rotary_emb_buffers_cos_sin_cache_)
getitem_5 = submod_2[0]
getitem_6 = submod_2[1]
getitem_7 = submod_2[2]
getitem_8 = submod_2[3]
getitem_9 = submod_2[4]
submod_3 = _submods[3](getitem_5, s72, getitem_6, getitem_7, getitem_8)
```
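Generated source like the sample above can be turned into a callable with Python's built-in compile()/exec() against a submodule table. A minimal sketch, with illustrative stand-in submodules (this is not the PR's actual implementation):

```python
# Execute generated code against a table of submodules.
# The source string and names below are illustrative only.

submods = [lambda a, b: (a + b, a * b),   # stands in for submod_0
           lambda a, b: a - b]            # stands in for submod_1

src = """
out0 = _submods[0](x, y)
getitem = out0[0]
getitem_1 = out0[1]
result = _submods[1](getitem, getitem_1)
"""

code = compile(src, "<generated>", "exec")
scope = {"_submods": submods, "x": 3, "y": 4}
exec(code, scope)
assert scope["result"] == (3 + 4) - (3 * 4)  # 7 - 12 == -5
```

Because the source is a flat sequence of local assignments and direct calls, there are no per-step getattr() lookups and no wrapper frames between steps.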
This PR reduces runtime overhead regardless of whether VLLM_USE_AOT_COMPILE, VLLM_DISABLE_COMPILE_CACHE, or VLLM_USE_MEGA_AOT_ARTIFACT is enabled or disabled: the codegen path is used in all cases.
In terms of caching, this PR stores two extra pieces of data on disk:
1. Python execution code.
2. FQN of each submodule.
When VLLM_USE_AOT_COMPILE=1, these are loaded and optionally used, depending on whether VLLM_USE_MEGA_ARTIFACT is enabled.
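The cache round trip described above can be sketched as follows: persist the generated execution code plus the submodule FQNs, then on a warm start rebind submodules by FQN and re-exec the code. All names here (the JSON layout, the registry) are hypothetical, not the PR's on-disk format.

```python
# Hypothetical sketch of caching the two extra artifacts: the generated
# Python execution code and the fully qualified name (FQN) of each submodule.
import json
import os
import tempfile

artifact = {
    "execution_code": "result = _submods[0](x) + _submods[1](x)",
    "submod_fqns": ["model.submod_0", "model.submod_1"],
}

cache_dir = tempfile.mkdtemp()
path = os.path.join(cache_dir, "codegen_artifact.json")
with open(path, "w") as f:
    json.dump(artifact, f)

# Warm start: load the code and re-bind submodules by FQN.
with open(path) as f:
    loaded = json.load(f)

registry = {"model.submod_0": lambda x: x * 2,   # stand-in submodules
            "model.submod_1": lambda x: x + 1}
bound = [registry[fqn] for fqn in loaded["submod_fqns"]]
scope = {"_submods": bound, "x": 5}
exec(compile(loaded["execution_code"], "<cached>", "exec"), scope)
assert scope["result"] == (5 * 2) + (5 + 1)  # 16
```

Storing FQNs rather than the objects themselves keeps the artifact small and lets the warm-start path reattach to freshly constructed submodules.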
Based on this change, it is possible to further reduce warm-start time by skipping graph module serialization. To keep code review manageable, we will do that in a separate PR; this PR is self-contained and already reduces runtime overhead on its own.
Benchmark script: https://github.com/zhxchen17/scripts/blob/main/vllm/overhead_bench.py
Test Plan:
<TODO Images>
Reviewers:
Subscribers:
Tasks:
Tags:
Signed-off-by: zhxchen17 <zhxchen17@fb.com>
Parent: 6183cae
3 files changed: +182 −9 lines