Labels: KV-Cache Management (kv-cache management for efficient LLM inference), bug (Something isn't working)
Description
System Info
- GPU: NVIDIA H200 Node
- TensorRT-LLM version: main (906781b)
- Deployment: Aggregated via NVIDIA Dynamo
- KVBM enabled
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
- Deploy Kimi model using TensorRT-LLM with KVBM enabled in Aggregated mode
- Send a request with prompt length exceeding max_seq_len (e.g. prompt_len=140529 with max_seq_len=131072)
- The request fails with RequestError: default_max_tokens (-N) must be greater than 0
- On subsequent iterations, all ranks hang indefinitely; the HangDetector fires after 300s:
[TRT-LLM] [RANK 0] [E] Hang detected, shutting down immediately.
[TRT-LLM] [RANK 7] [E] Hang detected, shutting down immediately.
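The negative value in the error message is consistent with the frontend deriving default_max_tokens as the remaining token budget after the prompt. This is an assumption about the formula, not confirmed from the frontend code, but the arithmetic for the numbers in this report would be:

```python
max_seq_len = 131072   # configured maximum sequence length
prompt_len = 140529    # length of the oversized prompt in this report

# Hypothetical derivation (assumption): default_max_tokens is the
# remaining generation budget, max_seq_len minus the prompt length.
default_max_tokens = max_seq_len - prompt_len

# A prompt longer than max_seq_len yields a negative budget, which
# would trigger "default_max_tokens (-N) must be greater than 0".
print(default_max_tokens)
```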
Expected behavior
The worker should not hang. A prompt longer than max_seq_len should be rejected cleanly without affecting subsequent requests.
Actual behavior
The worker hangs for 5 minutes, then the following error logs appear:
[TRT-LLM] [RANK 0] [E] Hang detected, shutting down immediately.
[TRT-LLM] [RANK 7] [E] Hang detected, shutting down immediately.
Additional notes
I tried the latest build, and the immediate request failure is resolved by this PR in the frontend. However, I think the root cause is that when a request fails, its in-flight async KVBM transfer state is not cleaned up. On the next iteration, worker.get_finished() on rank 0 blocks waiting for a transfer that will never complete. This prevents rank 0 from reaching the mpi_allgather barrier, causing a collective stall across all TP ranks.
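The stall pattern above can be sketched in miniature. This is a minimal single-process simulation, not TensorRT-LLM code: threads stand in for TP ranks, a threading.Barrier stands in for the mpi_allgather collective, and a never-set Event stands in for the orphaned KVBM transfer. All names (simulate, orphaned_transfer_done, hang_timeout) are hypothetical; timeouts are shortened so the demo terminates where the real workers hang until the 300s HangDetector fires.

```python
import threading

def simulate(ranks=4, hang_timeout=0.2):
    # Stand-in for the mpi_allgather collective: every rank must arrive.
    barrier = threading.Barrier(ranks)
    # Orphaned transfer: a completion signal that is never set, because
    # the failed request's in-flight transfer state was not cleaned up.
    orphaned_transfer_done = threading.Event()
    results = {}

    def rank(i):
        if i == 0:
            # Rank 0 blocks in its get_finished() equivalent, waiting on
            # a transfer that never completes (bounded so the demo ends).
            finished = orphaned_transfer_done.wait(timeout=hang_timeout * 2)
            results[i] = "ok" if finished else "stuck"
            return
        try:
            # The other ranks reach the collective and wait for rank 0.
            barrier.wait(timeout=hang_timeout)
            results[i] = "ok"
        except threading.BrokenBarrierError:
            # Rank 0 never arrives, so the collective stalls for everyone;
            # this mirrors the HangDetector firing on all ranks.
            results[i] = "hang detected"

    threads = [threading.Thread(target=rank, args=(i,)) for i in range(ranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

print(simulate())
```

Rank 0 ends up "stuck" on the phantom transfer and every other rank reports "hang detected", matching the all-ranks shutdown seen in the logs.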
I created this issue to track a PR that fixes the root cause described above.
Before submitting a new issue...
- Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.