Commit 9990740

fix(kv_pool): free blocks immediately on request finish in layerwise mode
AscendStoreConnector is SupportsHMA, so vLLM invokes request_finished_all_groups (not request_finished).

The HMA path missed the layerwise early-return present in request_finished, so a layerwise producer with saved tokens returned delay_free_blocks=True. vLLM then deferred the free, but layerwise never records a sending event (only touch_sending_mamba_blocks does), so update_connector_output never freed those blocks -- GPU KV cache usage climbed monotonically to 100%.

Add the same `if self.use_layerwise: return False` guard so blocks are freed immediately on request finish. This is safe because layerwise saves each layer synchronously before the request finishes (save_kv_layer waits on the last layer's save event).

Signed-off-by: F.Liu <1661888967@qq.com>
1 parent 932bff5 commit 9990740

1 file changed

Lines changed: 4 additions & 0 deletions

vllm_ascend/distributed/kv_transfer/kv_pool/ascend_store/pool_scheduler.py

@@ -1030,6 +1030,10 @@ def request_finished_all_groups(
         if self.kv_role == "kv_consumer" and not self.consumer_is_to_put:
             self._delayed_free_req_ids.discard(request.request_id)
             return False, None
+        if self.use_layerwise:
+            # Free now: layerwise records no sending event, so delay-free would leak.
+            self._delayed_free_req_ids.discard(request.request_id)
+            return False, None
         tracker = self._request_trackers.get(request.request_id)
         if tracker is not None and tracker.num_saved_tokens <= 0:
             self._delayed_free_req_ids.discard(request.request_id)
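
The leak described in the commit message can be reproduced in a toy model. The sketch below is illustrative only and is not vLLM or vllm-ascend code: ToyConnector, ToyScheduler, free_blocks_after, and the guard_enabled switch are hypothetical names. It also makes a simplifying assumption that every non-layerwise save records a sending event, whereas the commit message says only touch_sending_mamba_blocks does. It shows that deferring the free without ever recording a completion event drains the pool, and that an early return restores it.

```python
from dataclasses import dataclass, field


@dataclass
class ToyConnector:
    """Hypothetical stand-in for the connector's scheduler side (not the real API)."""
    use_layerwise: bool
    guard_enabled: bool  # models whether the early-return guard is present
    sending_events: set[str] = field(default_factory=set)

    def request_finished(self, req_id: str, num_saved_tokens: int) -> bool:
        """Return True when the caller should delay freeing the request's blocks."""
        if self.use_layerwise and self.guard_enabled:
            # Layers were already saved synchronously, so nothing is in flight.
            return False
        if num_saved_tokens <= 0:
            return False
        if not self.use_layerwise:
            # Assumption: only this path records an event that later completes.
            self.sending_events.add(req_id)
        return True

    def finished_sending(self) -> set[str]:
        """Drain and return the requests whose transfers have completed."""
        done, self.sending_events = self.sending_events, set()
        return done


@dataclass
class ToyScheduler:
    connector: ToyConnector
    free_blocks: int
    held: dict[str, int] = field(default_factory=dict)      # blocks of running requests
    deferred: dict[str, int] = field(default_factory=dict)  # blocks awaiting an event

    def run_request(self, req_id: str, num_blocks: int) -> None:
        if num_blocks > self.free_blocks:
            raise RuntimeError("KV cache exhausted")
        self.free_blocks -= num_blocks
        self.held[req_id] = num_blocks

    def finish_request(self, req_id: str, num_saved_tokens: int) -> None:
        blocks = self.held.pop(req_id)
        if self.connector.request_finished(req_id, num_saved_tokens):
            self.deferred[req_id] = blocks  # freed only when an event arrives
        else:
            self.free_blocks += blocks      # freed immediately

    def update_connector_output(self) -> None:
        for req_id in self.connector.finished_sending():
            self.free_blocks += self.deferred.pop(req_id)


def free_blocks_after(n_requests: int, use_layerwise: bool, guard_enabled: bool) -> int:
    """Run n_requests sequential 10-block requests against a 100-block pool."""
    sched = ToyScheduler(ToyConnector(use_layerwise, guard_enabled), free_blocks=100)
    for i in range(n_requests):
        req_id = f"req-{i}"
        sched.run_request(req_id, 10)
        sched.finish_request(req_id, num_saved_tokens=128)
        sched.update_connector_output()
    return sched.free_blocks
```

In this model, free_blocks_after(5, use_layerwise=True, guard_enabled=False) leaves half the pool stranded in deferred, and an eleventh request raises because no blocks remain. With guard_enabled=True the pool returns to its full size after every request.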
