[Bug]: KVBM hang when request fails during async KV cache transfer #12116

@zyang-Modular

Description

System Info

  • GPU: NVIDIA H200 Node
  • TensorRT-LLM version: main (906781b)
  • Deployment: Aggregated via NVIDIA Dynamo
  • KVBM enabled

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Deploy Kimi model using TensorRT-LLM with KVBM enabled in Aggregated mode
  2. Send a request with prompt length exceeding max_seq_len (e.g. prompt_len=140529 with max_seq_len=131072)
  3. The request fails with RequestError: default_max_tokens (-N) must be greater than 0
  4. On subsequent iterations, all ranks hang indefinitely — the HangDetector fires after 300s:
[TRT-LLM] [RANK 0] [E] Hang detected, shutting down immediately.
[TRT-LLM] [RANK 7] [E] Hang detected, shutting down immediately.
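The validation error in step 3 follows from the arithmetic of the reproduction values. A minimal sketch (the function name is illustrative, not the actual TensorRT-LLM implementation) of how the default token budget goes negative:

```python
# Hypothetical sketch: how default_max_tokens ends up negative when the
# prompt is longer than max_seq_len. Names are illustrative only.
def default_max_tokens(prompt_len: int, max_seq_len: int) -> int:
    # Tokens left for generation after the prompt fills the context window.
    return max_seq_len - prompt_len

# With the values from the reproduction steps:
tokens = default_max_tokens(prompt_len=140529, max_seq_len=131072)
print(tokens)  # -9457, which trips the "must be greater than 0" check
```

With prompt_len=140529 and max_seq_len=131072, the budget is 131072 - 140529 = -9457, which triggers the RequestError above.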

Expected behavior

The request should fail gracefully, and the worker should continue serving subsequent requests without hanging.

Actual behavior

The worker hangs for 5 minutes (the HangDetector timeout), then logs the following errors:

[TRT-LLM] [RANK 0] [E] Hang detected, shutting down immediately.
[TRT-LLM] [RANK 7] [E] Hang detected, shutting down immediately.

Additional notes

I tried the latest build, and the failure itself is avoided by a frontend PR. However, I believe the root cause is that when the request fails, its in-flight async KVBM transfer state is not cleaned up. On the next iteration, worker.get_finished() on rank 0 blocks waiting for a transfer that will never complete. This prevents rank 0 from reaching the mpi_allgather barrier, causing a collective stall across all TP ranks.
I created this issue to track a fix for that root cause.
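The suspected deadlock can be sketched with a toy transfer tracker. All names here (KVTransferTracker, start_transfer, abort_request, get_finished) are hypothetical and only illustrate the pattern, not the actual KVBM/TensorRT-LLM API:

```python
# Hypothetical sketch of the suspected deadlock: a failed request leaves
# its async KV-cache transfer registered, so get_finished() never sees
# everything complete, and rank 0 stalls before the collective.

class KVTransferTracker:
    def __init__(self):
        self.pending = {}  # request_id -> transfer-complete flag

    def start_transfer(self, request_id):
        self.pending[request_id] = False

    def complete_transfer(self, request_id):
        self.pending[request_id] = True

    def abort_request(self, request_id):
        # The missing cleanup: drop in-flight transfer state when a
        # request fails, so nothing waits on it forever.
        self.pending.pop(request_id, None)

    def get_finished(self):
        # Conceptually blocks until every pending transfer is done;
        # here we just report whether it would block.
        return all(self.pending.values())

tracker = KVTransferTracker()
tracker.start_transfer("req-1")
# The request fails validation before its transfer ever completes.
print(tracker.get_finished())  # False -> rank 0 never reaches mpi_allgather
tracker.abort_request("req-1")
print(tracker.get_finished())  # True -> the barrier can proceed
```

Without the abort_request() cleanup, get_finished() would wait indefinitely on "req-1", which matches the 300-second HangDetector firing across all ranks.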

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.

Metadata

Assignees

No one assigned

    Labels

    KV-Cache Management (kv-cache management for efficient LLM inference), bug (Something isn't working)

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
