[https://nvbugs/5963423][fix] Fix kv token estimation when ADP is on. #12099
dominicshanshan wants to merge 1 commit into NVIDIA:main from
Conversation
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
📝 Walkthrough
Adds capacity adjustment logic to KV cache block estimation in the pyexecutor utility module. When Attention Data Parallelism is enabled with tensor parallelism, the estimated cache blocks are now scaled by dividing by the tensor parallel size, with ceiling rounding, to account for per-rank work distribution.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
🚥 Pre-merge checks: ❌ Failed checks (1 warning) | ✅ Passed checks (2 passed)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tensorrt_llm/_torch/pyexecutor/_util.py`:
- Around line 314-320: The per-rank KV cache sizing is unsafe because
_create_dummy_context_requests relies on a scheduling invariant that
DefaultADPRouter (and its _balance_requests_across_ranks) does not guarantee;
either set attention_dp_rank on the dummy requests so each rank gets one full
base copy during ADP warmup (ensuring one copy per rank) or change the sizing
math where mapping.enable_attention_dp and mapping.tp_size are used (the
ceil((num_cache_blocks + tp_size - 1)/tp_size) branch) to a conservative bound
that does not assume even distribution—for example compute per-rank capacity as
the max of ceil(total_blocks / tp_size) and the largest single-request block
count (or simply use total_blocks to be safe) and use those symbols:
_create_dummy_context_requests, attention_dp_rank, DefaultADPRouter,
_balance_requests_across_ranks, self._mapping.enable_attention_dp,
self._mapping.tp_size, and num_cache_blocks to locate and implement the fix.
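A minimal Python sketch of the conservative bound the prompt suggests. The function name and arguments here are hypothetical, chosen only to mirror the symbols above, not the actual repository code:

```python
import math

def conservative_per_rank_blocks(total_blocks: int,
                                 per_request_blocks: list[int],
                                 tp_size: int) -> int:
    """Per-rank KV block capacity that does not assume even request distribution.

    Takes the max of the even-split estimate and the largest single request,
    so a rank that happens to receive the biggest request can still fit it.
    """
    even_split = math.ceil(total_blocks / tp_size)
    largest_request = max(per_request_blocks, default=0)
    return max(even_split, largest_request)

# Example: 10 blocks total across 4 ranks, but one request alone needs 6 blocks.
print(conservative_per_rank_blocks(10, [6, 2, 1, 1], 4))  # 6, not ceil(10/4) == 3
```

The even split alone (3 blocks here) would underestimate any rank that receives the 6-block request, which is the failure mode the prompt describes.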
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: cbd8ec95-ba34-4332-806d-2b7c240ef938
📒 Files selected for processing (1)
tensorrt_llm/_torch/pyexecutor/_util.py
# With ADP enabled, _create_dummy_context_requests produces tp_size
# copies so each rank gets work during the estimation warmup. But the
# scheduler distributes them evenly (1 per rank), so each rank's KV
# cache only needs capacity for its own share, not all of them.
if self._mapping.enable_attention_dp and self._mapping.tp_size > 1:
    num_cache_blocks = (num_cache_blocks + self._mapping.tp_size -
                        1) // self._mapping.tp_size
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== ADP warmup request construction =="
rg -n -C3 'requests = requests \* self\._mapping\.tp_size|_create_dummy_context_requests|_create_dummy_mm_context_request' tensorrt_llm/_torch/pyexecutor
echo
echo "== Warmup enqueue / execution path =="
rg -n -C3 'enqueue_requests\(|def enqueue_requests|is_warmup|enable_attention_dp|attention_dp' tensorrt_llm/_torch/pyexecutor
echo
echo "== Scheduler implementations / placement logic =="
rg -n -C3 'class .*Scheduler|round.?robin|least|capacity|attention_dp|schedule' tensorrt_llm/_torch/pyexecutor
Repository: NVIDIA/TensorRT-LLM
Length of output: 50376
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Searching for attention_dp_rank assignment in dummy request creation ==="
rg -n "attention_dp_rank\s*=" tensorrt_llm/_torch/pyexecutor/ -A2 -B2 | head -100
echo
echo "=== Checking where py_scheduling_params is set for warmup/dummy requests ==="
rg -n "py_scheduling_params|scheduling_params\s*=" tensorrt_llm/_torch/pyexecutor/_util.py -A3 -B1 | head -80
echo
echo "=== How ADP router assigns requests to ranks ==="
rg -n "route_requests|target_dp_rank" tensorrt_llm/_torch/pyexecutor/scheduler/adp_router.py -A5 -B2 | head -120
Repository: NVIDIA/TensorRT-LLM
Length of output: 800
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Reading _util.py KvCacheCreator class for dummy request creation ==="
sed -n '240,290p' tensorrt_llm/_torch/pyexecutor/_util.py
echo
echo "=== Check for py_scheduling_params anywhere in pyexecutor ==="
rg -l "py_scheduling_params" tensorrt_llm/_torch/pyexecutor/ --type py
echo
echo "=== Check how requests flow from _create_dummy_context_requests to warmup ==="
rg -n "self._dummy_reqs\|_get_token_num_for_estimation" tensorrt_llm/_torch/pyexecutor/_util.py | head -20
echo
echo "=== Check _pad_attention_dp_dummy_request in py_executor ==="
sed -n '2748,2790p' tensorrt_llm/_torch/pyexecutor/py_executor.py
echo
echo "=== How scheduler processes requests during warmup ==="
rg -n "warmup\|is_warmup" tensorrt_llm/_torch/pyexecutor/py_executor.py | grep -i "scheduler\|enqueue" | head -15
Repository: NVIDIA/TensorRT-LLM
Length of output: 2818
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Full _get_token_num_for_estimation method ==="
sed -n '288,330p' tensorrt_llm/_torch/pyexecutor/_util.py
echo
echo "=== Full _pad_attention_dp_dummy_request method ==="
sed -n '2747,2800p' tensorrt_llm/_torch/pyexecutor/py_executor.py
echo
echo "=== Check if adp_router assigns rank to dummy requests during warmup ==="
cat tensorrt_llm/_torch/pyexecutor/scheduler/adp_router.py | head -200
Repository: NVIDIA/TensorRT-LLM
Length of output: 12706
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Check the _balance_requests_across_ranks method ==="
sed -n '200,300p' tensorrt_llm/_torch/pyexecutor/scheduler/adp_router.py
echo
echo "=== How warmup dummy requests are created and enqueued ==="
rg -n "_create_dummy_context_requests\|_get_token_num_for_estimation" tensorrt_llm/_torch/pyexecutor/_util.py -B3 -A10 | head -60
echo
echo "=== Check estimate_max_kv_cache_tokens usage ==="
rg -n "estimate_max_kv_cache_tokens" tensorrt_llm/_torch/pyexecutor/_util.py -B5 -A10 | head -50
Repository: NVIDIA/TensorRT-LLM
Length of output: 3012
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== estimate_max_kv_cache_tokens usage and warmup context ==="
rg -n "estimate_max_kv_cache_tokens" tensorrt_llm/_torch/pyexecutor/ -B3 -A15 | head -100
echo
echo "=== Check if dummy warmup requests have py_scheduling_params set ==="
rg -n "is_warmup\|dummy_request\|estimate_max_kv_cache" tensorrt_llm/_torch/pyexecutor/py_executor.py -B2 -A2 | head -80
echo
echo "=== Review the full _util.py dummy request creation flow ==="
sed -n '168,290p' tensorrt_llm/_torch/pyexecutor/_util.py
Repository: NVIDIA/TensorRT-LLM
Length of output: 2414
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Full _create_dummy_context_requests and _create_dummy_mm_context_request ==="
sed -n '168,286p' tensorrt_llm/_torch/pyexecutor/_util.py
echo
echo "=== Check where py_scheduling_params is set on ANY request ==="
rg -n "py_scheduling_params\s*=" tensorrt_llm/_torch/pyexecutor/ --type py -B2 -A2 | head -80
echo
echo "=== Check if warmup requests flow through adp_router or scheduler ==="
rg -n "is_warmup.*schedule\|schedule.*is_warmup" tensorrt_llm/_torch/pyexecutor/ --type py -B2 -A2
Repository: NVIDIA/TensorRT-LLM
Length of output: 6873
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Check how warmup estimation phase works - does it actually use the ADP scheduler? ==="
sed -n '420,450p' tensorrt_llm/_torch/pyexecutor/py_executor.py
echo
echo "=== Check estimate_max_kv_cache_tokens in resource_manager ==="
rg -n "estimate_max_kv_cache_tokens" tensorrt_llm/_torch/pyexecutor/ --type py -B5 -A15 | head -100
echo
echo "=== Verify: are dummy requests from _create_dummy_context_requests actually scheduled through ADP? ==="
rg -n "self._dummy_reqs\|_get_token_num_for_estimation" tensorrt_llm/_torch/pyexecutor/_util.py -A20 | head -80
Repository: NVIDIA/TensorRT-LLM
Length of output: 4119
🏁 Script executed:
#!/bin/bash
set -euo pipefixo pipefail
echo "=== Where is _get_token_num_for_estimation called? ==="
rg -n "_get_token_num_for_estimation" tensorrt_llm/_torch/pyexecutor/ --type py -B3 -A3
echo
echo "=== Find estimate_max_kv_cache_tokens definition and usage ==="
rg -n "def estimate_max_kv_cache_tokens\|estimate_max_kv_cache_tokens\(" tensorrt_llm/_torch/pyexecutor/ --type py -B2 -A8 | head -120
Repository: NVIDIA/TensorRT-LLM
Length of output: 116
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Where is _get_token_num_for_estimation called? ==="
rg -n "_get_token_num_for_estimation" tensorrt_llm/_torch/pyexecutor/ --type py -B3 -A3
echo
echo "=== Find estimate_max_kv_cache_tokens definition and usage ==="
rg -n "def estimate_max_kv_cache_tokens\|estimate_max_kv_cache_tokens\(" tensorrt_llm/_torch/pyexecutor/ --type py -B2 -A8 | head -120
Repository: NVIDIA/TensorRT-LLM
Length of output: 1514
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Check the context around line 356 in _util.py ==="
sed -n '340,380p' tensorrt_llm/_torch/pyexecutor/_util.py
echo
echo "=== Search for estimate_max_kv_cache_tokens C++ binding ==="
rg -n "estimate_max_kv_cache_tokens" tensorrt_llm/_torch/pyexecutor/ --type py -B5 -A10 | head -150
echo
echo "=== Check key insight: does the warmup actually route through ADP scheduler? ==="
sed -n '350,390p' tensorrt_llm/_torch/pyexecutor/_util.py
Repository: NVIDIA/TensorRT-LLM
Length of output: 5734
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Search for estimate_max_kv_cache_tokens in C++ bindings or wrappers ==="
find tensorrt_llm -name "*.py" -type f | xargs grep -l "estimate_max_kv_cache_tokens" | head -5
echo
echo "=== Check model_engine.warmup to understand the estimation flow ==="
rg -n "def warmup" tensorrt_llm/_torch/pyexecutor/model_engine.py -A30 | head -60
echo
echo "=== Critical question: are dummy requests actually scheduled through ADP router during estimation? ==="
rg -n "estimate_max_kv_cache_tokens\|self._dummy_reqs" tensorrt_llm/_torch/pyexecutor/model_engine.py -B3 -A3
Repository: NVIDIA/TensorRT-LLM
Length of output: 3689
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Check if there's any explicit documentation or assertion about the scheduling invariant ==="
rg -n "one full base copy|even.*distribute|scheduler.*evenly" tensorrt_llm/_torch/pyexecutor/ --type py -B3 -A3
echo
echo "=== Search for any code that mentions grouping or partitioning dummy requests by base copy ==="
rg -n "base.*copy|copy.*base|partition.*request|group.*request" tensorrt_llm/_torch/pyexecutor/ --type py -B2 -A2 | head -40
echo
echo "=== Final check: is there any test or comment that validates the ceil(total/tp_size) assumption? ==="
rg -n "ceil.*tp_size\|num_cache_blocks.*tp_size\|ADP.*schedule" tensorrt_llm/_torch/pyexecutor/_util.py -B5 -A5
Repository: NVIDIA/TensorRT-LLM
Length of output: 4461
The per-rank KV cache bound relies on an unverified scheduling invariant.
The comment claims the scheduler distributes replicated ADP warmup requests evenly (1 per rank), but the actual DefaultADPRouter uses heap-based load balancing on token count, not request count. Since _create_dummy_context_requests() does not assign attention_dp_rank to dummy requests, they enter the balancing phase where _balance_requests_across_ranks() distributes them to minimize token load per rank. With requests of varying sizes, one rank can receive zero requests while another receives multiple, violating the "1 per rank" assumption. The ceil(total_blocks / tp_size) calculation at line 318 will then underestimate per-rank KV capacity, risking OOM during warmup.
Either ensure attention_dp_rank is assigned to guarantee one full base copy per rank during warmup, or use a safer capacity bound that does not assume even distribution (for example, max(ceil(num_cache_blocks / tp_size), largest_request_blocks)).
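To illustrate why token-load balancing need not give one request per rank, here is a standalone greedy heap sketch. It is hypothetical, not the actual `_balance_requests_across_ranks` implementation: each request is assigned, in arrival order, to the rank with the smallest token load so far.

```python
import heapq

def balance_by_tokens(token_counts: list[int], num_ranks: int) -> list[list[int]]:
    """Greedy token-load balancing: each request goes to the least-loaded rank.

    Illustrative only -- not the actual DefaultADPRouter logic.
    Returns the token counts assigned to each rank.
    """
    heap = [(0, rank) for rank in range(num_ranks)]  # (token load, rank id)
    heapq.heapify(heap)
    assignments: list[list[int]] = [[] for _ in range(num_ranks)]
    for tokens in token_counts:
        load, rank = heapq.heappop(heap)
        assignments[rank].append(tokens)
        heapq.heappush(heap, (load + tokens, rank))
    return assignments

# Four requests, two ranks: token loads end up 3 vs 10, but request counts
# do not match -- rank 0 receives three requests, rank 1 only one.
print(balance_by_tokens([1, 10, 1, 1], 2))  # [[1, 1, 1], [10]]
```

With uneven request counts like this, a per-rank capacity of ceil(total_blocks / tp_size) undercounts the rank holding three requests, which is exactly the concern raised above.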
/bot run
PR_Github #38574 [ run ] triggered by Bot. Commit:
PR_Github #38574 [ run ] completed with state
Summary by CodeRabbit
Release Notes
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.