Labels: KV-Cache Management (kv-cache management for efficient LLM inference), bug (Something isn't working)
Description
System Info
- GPU: NVIDIA H200 Node
- TensorRT-LLM version: main (906781b)
- Deployment: Aggregated via NVIDIA Dynamo
- KVBM enabled
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
- Deploy Kimi model using TensorRT-LLM with KVBM enabled in Aggregated mode
- Send a request with prompt length exceeding max_seq_len (e.g. prompt_len=140529 with max_seq_len=131072)
- The request fails with RequestError: default_max_tokens (-N) must be greater than 0
- On subsequent iterations, all ranks hang indefinitely; the HangDetector fires after 300s:
[TRT-LLM] [RANK 0] [E] Hang detected, shutting down immediately.
[TRT-LLM] [RANK 7] [E] Hang detected, shutting down immediately.
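The negative value in the error message is consistent with the frontend deriving default_max_tokens as the remaining token budget after the prompt. This is an assumption about the formula, not confirmed from the frontend code, but the arithmetic for the numbers in this report would be:

```python
max_seq_len = 131072   # configured maximum sequence length
prompt_len = 140529    # length of the oversized prompt in this report

# Hypothetical derivation (assumption): default_max_tokens is the
# remaining generation budget, max_seq_len minus the prompt length.
default_max_tokens = max_seq_len - prompt_len

# A prompt longer than max_seq_len yields a negative budget, which
# would trigger "default_max_tokens (-N) must be greater than 0".
print(default_max_tokens)
```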
Expected behavior
The worker should not hang. A prompt longer than max_seq_len should be rejected cleanly without affecting subsequent requests.
Actual behavior
The worker hangs for 5 minutes, then the following error logs appear:
[TRT-LLM] [RANK 0] [E] Hang detected, shutting down immediately.
[TRT-LLM] [RANK 7] [E] Hang detected, shutting down immediately.
Additional notes
I tried the latest build, and the immediate request failure is resolved by this PR in the frontend. However, I think the root cause is that when a request fails, its in-flight async KVBM transfer state is not cleaned up. On the next iteration, worker.get_finished() on rank 0 blocks waiting for a transfer that will never complete. This prevents rank 0 from reaching the mpi_allgather barrier, causing a collective stall across all TP ranks.
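The stall pattern above can be sketched in miniature. This is a minimal single-process simulation, not TensorRT-LLM code: threads stand in for TP ranks, a threading.Barrier stands in for the mpi_allgather collective, and a never-set Event stands in for the orphaned KVBM transfer. All names (simulate, orphaned_transfer_done, hang_timeout) are hypothetical; timeouts are shortened so the demo terminates where the real workers hang until the 300s HangDetector fires.

```python
import threading

def simulate(ranks=4, hang_timeout=0.2):
    # Stand-in for the mpi_allgather collective: every rank must arrive.
    barrier = threading.Barrier(ranks)
    # Orphaned transfer: a completion signal that is never set, because
    # the failed request's in-flight transfer state was not cleaned up.
    orphaned_transfer_done = threading.Event()
    results = {}

    def rank(i):
        if i == 0:
            # Rank 0 blocks in its get_finished() equivalent, waiting on
            # a transfer that never completes (bounded so the demo ends).
            finished = orphaned_transfer_done.wait(timeout=hang_timeout * 2)
            results[i] = "ok" if finished else "stuck"
            return
        try:
            # The other ranks reach the collective and wait for rank 0.
            barrier.wait(timeout=hang_timeout)
            results[i] = "ok"
        except threading.BrokenBarrierError:
            # Rank 0 never arrives, so the collective stalls for everyone;
            # this mirrors the HangDetector firing on all ranks.
            results[i] = "hang detected"

    threads = [threading.Thread(target=rank, args=(i,)) for i in range(ranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

print(simulate())
```

Rank 0 ends up "stuck" on the phantom transfer and every other rank reports "hang detected", matching the all-ranks shutdown seen in the logs.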
I created this issue to track a PR that fixes the root cause described above.
Before submitting a new issue...
- Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.