Commit cc8112d
fix(kv_pool): free blocks immediately on request finish in layerwise mode
AscendStoreConnector is a SupportsHMA connector, so vLLM invokes
request_finished_all_groups (not request_finished). The HMA path missed
the layerwise early-return present in request_finished, so a layerwise
producer with saved tokens returned delay_free_blocks=True. vLLM then
deferred the free, but layerwise never records a sending event (only
touch_sending_mamba_blocks does), so update_connector_output never freed
those blocks -- GPU KV cache usage climbed monotonically to 100%.
Add the same `if self.use_layerwise: return False` guard so blocks are
freed immediately on request finish. This is safe because layerwise saves
each layer synchronously before the request finishes (save_kv_layer waits
on the last layer's save event).
Signed-off-by: F.Liu <1661888967@qq.com>
1 parent 9d548d7 commit cc8112d
1 file changed
Lines changed: 8 additions & 0 deletions
@@ -1030,6 +1030,14 @@
(diff body not captured: 8 lines added as new lines 1033-1040, directly after original line 1032)
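The failure mode and the guard described in the commit message can be illustrated with a small toy model. This is a sketch under stated assumptions, not the actual patch: `ToyConnector`, `ToyScheduler`, `guard_enabled`, `finished_sending`, `simulate`, and the simplified signature of `request_finished_all_groups` are all hypothetical stand-ins, and the real vLLM and AscendStoreConnector interfaces differ. The only behaviour taken from the commit message is that a layerwise producer never records a sending event, so a deferred free is never completed unless the hook returns False.

```python
from dataclasses import dataclass, field


@dataclass
class ToyConnector:
    """Hypothetical stand-in for a KV connector (not the real AscendStoreConnector)."""
    use_layerwise: bool
    guard_enabled: bool
    # Requests with a recorded "sending" event. In this toy, as in the commit
    # message's description of layerwise mode, nothing ever populates it.
    sending: set = field(default_factory=set)

    def request_finished_all_groups(self, req_id: str, saved_tokens: int) -> bool:
        """Return True to ask the scheduler to delay freeing the request's blocks."""
        if self.guard_enabled and self.use_layerwise:
            # Layers were saved synchronously, so nothing is still in flight.
            return False
        # Pre-fix behaviour: a producer with saved tokens delays the free.
        return saved_tokens > 0

    def finished_sending(self) -> set:
        """Return and clear the requests whose deferred blocks may now be freed."""
        done, self.sending = self.sending, set()
        return done


@dataclass
class ToyScheduler:
    connector: ToyConnector
    total_blocks: int
    in_use: dict = field(default_factory=dict)  # req_id -> number of blocks held

    def run_request(self, req_id: str, n_blocks: int) -> None:
        self.in_use[req_id] = n_blocks
        delay = self.connector.request_finished_all_groups(
            req_id, saved_tokens=n_blocks * 16
        )
        if not delay:
            del self.in_use[req_id]  # immediate free
        # Deferred frees complete only when the connector reports a finished send.
        for done in self.connector.finished_sending():
            self.in_use.pop(done, None)

    def usage(self) -> float:
        return sum(self.in_use.values()) / self.total_blocks


def simulate(guard_enabled: bool, n_requests: int = 10) -> float:
    """Run n_requests layerwise requests of 4 blocks each against a 40-block pool."""
    connector = ToyConnector(use_layerwise=True, guard_enabled=guard_enabled)
    sched = ToyScheduler(connector, total_blocks=40)
    for i in range(n_requests):
        sched.run_request(f"req-{i}", n_blocks=4)
    return sched.usage()
```

Without the guard, every finished request leaves its blocks held and the toy pool fills up; with the guard, usage returns to zero after each request.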