[#12116][fix] Prevent KVBM hang when requests abort during KV cache transfer (#12117)
zyang-Modular wants to merge 1 commit into NVIDIA:main
Conversation
…nsfer Signed-off-by: Zhu Yang <zyang@modular.com>
📝 Walkthrough

This change adds request abortion capability to the KV cache connector manager, enabling explicit cleanup of request tracking structures during error handling and cancellation. Additionally, exception handling is added to `get_finished` to prevent MPI barrier stalling when workers encounter errors.
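The "request abortion capability" the walkthrough describes can be sketched roughly as follows. This is a hypothetical illustration: the class name, attribute names, and `abort_request` method are assumptions based on the walkthrough text, not the actual connector manager API.

```python
# Hypothetical sketch of explicit request cleanup in a KV cache connector
# manager. A failed or cancelled request is dropped from the tracking
# structures so no in-flight transfer state is left behind. Names are
# illustrative, not the real TensorRT-LLM API.
class KvConnectorManager:
    def __init__(self):
        self.saving = {}   # request_id -> pending save-transfer state
        self.loading = {}  # request_id -> pending load-transfer state

    def abort_request(self, request_id):
        # Explicit cleanup on error handling or cancellation: forget any
        # pending transfer state so get_finished() never waits on this
        # request again.
        self.saving.pop(request_id, None)
        self.loading.pop(request_id, None)
```

The point of the sketch is only that cleanup is an explicit, idempotent operation that error paths can always call.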
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Actionable comments posted: 1
🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/kv_cache_connector.py (1)
553-564: Consider narrowing the exception type if specific failures are known.

The broad `except Exception` catch-all is used here to prevent MPI stalls regardless of failure cause. While this approach achieves the resilience goal described in the PR, the coding guidelines recommend catching specific exceptions. If the expected failure modes are known (e.g., `RuntimeError` from KVBM panics as mentioned in tests), consider narrowing the exception type. However, if truly any failure should trigger this recovery path, the current approach is acceptable as a defensive measure.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/pyexecutor/kv_cache_connector.py` around lines 553 - 564, The current broad except Exception around self.worker.get_finished(...) should be narrowed: catch known failure types first (e.g., RuntimeError) and handle them the same way (set finished_saving/finished_loading = [] and unblock mpi_allgather), then keep a secondary broad fallback only if truly required; also replace logger.error(...) with logger.exception(...) or include the traceback to capture full error details. Update the try/except surrounding worker.get_finished to explicitly handle RuntimeError (and any other known worker/KVBM error classes) before an optional generic fallback, ensuring finished_saving and finished_loading are still set to [] in those branches.
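A minimal sketch of the narrowed handler this comment suggests. The worker object, its `get_finished()` signature, and the KVBM failure mode are assumptions drawn from the review text, not the actual TensorRT-LLM API; `logging.exception` is used here to capture the full traceback, as the prompt recommends.

```python
# Hedged sketch: catch the known failure class first, keep a defensive
# fallback second, and always report empty completion lists so the
# subsequent mpi_allgather is not starved.
import logging

logger = logging.getLogger(__name__)

def get_finished_safely(worker):
    try:
        return worker.get_finished()
    except RuntimeError:
        # Known failure mode (e.g. a KVBM Rust panic surfacing as
        # RuntimeError): log with traceback, report no completions.
        logger.exception("KV connector worker failed with a known error")
        return [], []
    except Exception:
        # Optional broad fallback for truly unexpected failures.
        logger.exception("KV connector worker failed unexpectedly")
        return [], []
```

Either branch returns `([], [])`, so callers can unconditionally unpack `finished_saving, finished_loading` and proceed to the collective.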
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 1ea3c1d1-709e-4a42-a8df-8f9b6e55ae9e
📒 Files selected for processing (3)
- tensorrt_llm/_torch/pyexecutor/kv_cache_connector.py
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tests/unittest/_torch/test_connector.py
```python
except Exception as e:
    logger.error(
        f"KV connector worker get_finished() raised an exception: {e}. "
        "Reporting no completions this iteration to unblock mpi_allgather."
    )
```
logger is not imported — this will raise NameError at runtime.
The exception handler references `logger.error(...)`, but `logger` is not imported in this file. This will cause a `NameError` when an exception occurs, defeating the purpose of the exception handling.
🐛 Proposed fix: Add the logger import
Add at the top of the file with other imports:
```python
from tensorrt_llm.logger import logger
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tensorrt_llm/_torch/pyexecutor/kv_cache_connector.py` around lines 558 - 562,
The except block in get_finished() (the handler that calls logger.error on
exception 'e') references logger which is not imported; add the missing import
at the top of the module with the other imports by importing the shared logger
symbol (e.g., from tensorrt_llm.logger import logger) so the exception handler
and any other usages of logger resolve at runtime.
Description
This PR addresses the root cause of #12116. While the issue can be prevented upstream (fixed by this PR), the engine should not hang if a request fails during an active KV transfer, regardless of the failure reason. This change improves robustness by ensuring failed requests are always cleaned up before `worker.get_finished()` is called, avoiding the collective stall entirely.
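The collective stall and the cleanup-based fix described above can be illustrated with a toy loop. `FakeWorker` and `fake_allgather` are hypothetical stand-ins for the KV connector worker and the `mpi_allgather` barrier, used only to show where the barrier sits; they are not TensorRT-LLM APIs, and raising an exception here models a call that in the real bug would block forever.

```python
# Toy model: one "rank" per worker. A rank whose get_finished() never
# returns would never contribute to the allgather, stalling every rank.
class FakeWorker:
    def __init__(self, aborted=False):
        self.aborted = aborted

    def get_finished(self):
        if self.aborted:
            # In the real hang this call blocks indefinitely on a transfer
            # that will never complete; raising here models that failure.
            raise RuntimeError("request aborted mid-transfer")
        return ["req0"]

def fake_allgather(per_rank):
    # Stand-in for the collective: every rank must contribute before any
    # rank proceeds.
    return [r for rank_result in per_rank for r in rank_result]

def step(workers):
    per_rank = []
    for worker in workers:
        try:
            per_rank.append(worker.get_finished())
        except Exception:
            # The robustness fix: report no completions this iteration so
            # the rank still reaches the allgather instead of hanging the
            # whole TP group.
            per_rank.append([])
    return fake_allgather(per_rank)
```

With the `except` branch removed, a single aborted worker would stop `step` from ever reaching `fake_allgather`, which is the stall this PR prevents.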
When a request fails (e.g. prompt exceeds `max_seq_len`, GPU OOM, or a KVBM Rust panic), its in-flight async KV transfer state is not cleaned up. On the next iteration, `worker.get_finished()` on rank 0 blocks indefinitely waiting for a transfer that will never complete. This prevents rank 0 from reaching the `mpi_allgather` barrier, causing a collective stall across all TP ranks. The `HangDetector` fires after 300s and shuts down all ranks.

Test Coverage
I added 2 new test cases.
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment `/bot help`.