
[https://nvbugs/5963423][fix] Fix kv token estimation when ADP is on.#12099

Open
dominicshanshan wants to merge 1 commit into NVIDIA:main from dominicshanshan:user/shanshan/oom_adp_kvcache_estimation_fix

Conversation

@dominicshanshan
Collaborator

@dominicshanshan dominicshanshan commented Mar 11, 2026

Summary by CodeRabbit

Release Notes

  • Performance Improvements
    • Improved KV cache estimation accuracy when Attention Data Parallel (ADP) is used with tensor parallelism, by applying per-rank capacity adjustments during warmup calculations.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in the PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
@coderabbitai
Contributor

coderabbitai bot commented Mar 11, 2026

📝 Walkthrough

Walkthrough

Adds capacity adjustment logic to KV cache block estimation in the pyexecutor utility module. When Attention Data Parallelism is enabled with tensor parallelism, the estimated cache blocks are now scaled by dividing by tensor parallel size with ceiling rounding to account for per-rank work distribution.

Changes

Cohort / File(s) Summary
KV Cache Estimation Logic
tensorrt_llm/_torch/pyexecutor/_util.py
Added conditional per-rank capacity adjustment for KV cache blocks when Attention DP is enabled and tensor parallelism is active; applies ceiling division by tp_size to num_cache_blocks.
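The per-rank adjustment described above can be sketched as a small helper. This is a sketch only: `per_rank_cache_blocks` and its argument names are illustrative, not the actual code in `_util.py`, which applies the same ceiling division inline on `num_cache_blocks`.

```python
import math

def per_rank_cache_blocks(num_cache_blocks: int, tp_size: int,
                          enable_attention_dp: bool) -> int:
    """Scale the global block estimate down to a per-rank share.

    With ADP, the estimation warmup produces tp_size copies of the dummy
    requests so every rank gets work; if each rank only handles its own
    share, its KV cache only needs capacity for that share. The
    (n + d - 1) // d form is integer ceiling division.
    """
    if enable_attention_dp and tp_size > 1:
        return (num_cache_blocks + tp_size - 1) // tp_size
    return num_cache_blocks

# Ceiling division agrees with math.ceil on the exact quotient:
assert per_rank_cache_blocks(10, 4, True) == math.ceil(10 / 4) == 3
assert per_rank_cache_blocks(10, 4, False) == 10  # ADP off: keep the total
```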

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Description check — ⚠️ Warning. The PR description is empty: the author checked the PR checklist but left the Description and Test Coverage sections blank. Resolution: fill in the Description section explaining the KV cache estimation issue with ADP and the fix applied, and document the test coverage that validates the change.

✅ Passed checks (2 passed)

  • Title check — ✅ Passed. The title clearly identifies the fix (KV token estimation when ADP is on) and follows the required format with ticket reference and type indicator.
  • Docstring coverage — ✅ Passed. Docstring coverage is 100.00%, above the required 80.00% threshold.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/_torch/pyexecutor/_util.py`:
- Around line 314-320: The per-rank KV cache sizing is unsafe because
_create_dummy_context_requests relies on a scheduling invariant that
DefaultADPRouter (and its _balance_requests_across_ranks) does not guarantee;
either set attention_dp_rank on the dummy requests so each rank gets one full
base copy during ADP warmup (ensuring one copy per rank) or change the sizing
math where mapping.enable_attention_dp and mapping.tp_size are used (the
ceil((num_cache_blocks + tp_size - 1)/tp_size) branch) to a conservative bound
that does not assume even distribution—for example compute per-rank capacity as
the max of ceil(total_blocks / tp_size) and the largest single-request block
count (or simply use total_blocks to be safe) and use those symbols:
_create_dummy_context_requests, attention_dp_rank, DefaultADPRouter,
_balance_requests_across_ranks, self._mapping.enable_attention_dp,
self._mapping.tp_size, and num_cache_blocks to locate and implement the fix.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: cbd8ec95-ba34-4332-806d-2b7c240ef938

📥 Commits

Reviewing files that changed from the base of the PR and between f7255e0 and 25956fb.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/pyexecutor/_util.py

Comment on lines +314 to +320
# With ADP enabled, _create_dummy_context_requests produces tp_size
# copies so each rank gets work during the estimation warmup. But the
# scheduler distributes them evenly (1 per rank), so each rank's KV
# cache only needs capacity for its own share, not all of them.
if self._mapping.enable_attention_dp and self._mapping.tp_size > 1:
    num_cache_blocks = (num_cache_blocks + self._mapping.tp_size -
                        1) // self._mapping.tp_size

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== ADP warmup request construction =="
rg -n -C3 'requests = requests \* self\._mapping\.tp_size|_create_dummy_context_requests|_create_dummy_mm_context_request' tensorrt_llm/_torch/pyexecutor

echo
echo "== Warmup enqueue / execution path =="
rg -n -C3 'enqueue_requests\(|def enqueue_requests|is_warmup|enable_attention_dp|attention_dp' tensorrt_llm/_torch/pyexecutor

echo
echo "== Scheduler implementations / placement logic =="
rg -n -C3 'class .*Scheduler|round.?robin|least|capacity|attention_dp|schedule' tensorrt_llm/_torch/pyexecutor

Repository: NVIDIA/TensorRT-LLM

Length of output: 50376


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Searching for attention_dp_rank assignment in dummy request creation ==="
rg -n "attention_dp_rank\s*=" tensorrt_llm/_torch/pyexecutor/ -A2 -B2 | head -100

echo
echo "=== Checking where py_scheduling_params is set for warmup/dummy requests ==="
rg -n "py_scheduling_params|scheduling_params\s*=" tensorrt_llm/_torch/pyexecutor/_util.py -A3 -B1 | head -80

echo
echo "=== How ADP router assigns requests to ranks ==="
rg -n "route_requests|target_dp_rank" tensorrt_llm/_torch/pyexecutor/scheduler/adp_router.py -A5 -B2 | head -120

Repository: NVIDIA/TensorRT-LLM

Length of output: 800


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Reading _util.py KvCacheCreator class for dummy request creation ==="
sed -n '240,290p' tensorrt_llm/_torch/pyexecutor/_util.py

echo
echo "=== Check for py_scheduling_params anywhere in pyexecutor ==="
rg -l "py_scheduling_params" tensorrt_llm/_torch/pyexecutor/ --type py

echo
echo "=== Check how requests flow from _create_dummy_context_requests to warmup ==="
rg -n "self._dummy_reqs\|_get_token_num_for_estimation" tensorrt_llm/_torch/pyexecutor/_util.py | head -20

echo
echo "=== Check _pad_attention_dp_dummy_request in py_executor ==="
sed -n '2748,2790p' tensorrt_llm/_torch/pyexecutor/py_executor.py

echo
echo "=== How scheduler processes requests during warmup ==="
rg -n "warmup\|is_warmup" tensorrt_llm/_torch/pyexecutor/py_executor.py | grep -i "scheduler\|enqueue" | head -15

Repository: NVIDIA/TensorRT-LLM

Length of output: 2818


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Full _get_token_num_for_estimation method ==="
sed -n '288,330p' tensorrt_llm/_torch/pyexecutor/_util.py

echo
echo "=== Full _pad_attention_dp_dummy_request method ==="
sed -n '2747,2800p' tensorrt_llm/_torch/pyexecutor/py_executor.py

echo
echo "=== Check if adp_router assigns rank to dummy requests during warmup ==="
cat tensorrt_llm/_torch/pyexecutor/scheduler/adp_router.py | head -200

Repository: NVIDIA/TensorRT-LLM

Length of output: 12706


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Check the _balance_requests_across_ranks method ==="
sed -n '200,300p' tensorrt_llm/_torch/pyexecutor/scheduler/adp_router.py

echo
echo "=== How warmup dummy requests are created and enqueued ==="
rg -n "_create_dummy_context_requests\|_get_token_num_for_estimation" tensorrt_llm/_torch/pyexecutor/_util.py -B3 -A10 | head -60

echo
echo "=== Check estimate_max_kv_cache_tokens usage ==="
rg -n "estimate_max_kv_cache_tokens" tensorrt_llm/_torch/pyexecutor/_util.py -B5 -A10 | head -50

Repository: NVIDIA/TensorRT-LLM

Length of output: 3012


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== estimate_max_kv_cache_tokens usage and warmup context ==="
rg -n "estimate_max_kv_cache_tokens" tensorrt_llm/_torch/pyexecutor/ -B3 -A15 | head -100

echo
echo "=== Check if dummy warmup requests have py_scheduling_params set ==="
rg -n "is_warmup\|dummy_request\|estimate_max_kv_cache" tensorrt_llm/_torch/pyexecutor/py_executor.py -B2 -A2 | head -80

echo
echo "=== Review the full _util.py dummy request creation flow ==="
sed -n '168,290p' tensorrt_llm/_torch/pyexecutor/_util.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 2414


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Full _create_dummy_context_requests and _create_dummy_mm_context_request ==="
sed -n '168,286p' tensorrt_llm/_torch/pyexecutor/_util.py

echo
echo "=== Check where py_scheduling_params is set on ANY request ==="
rg -n "py_scheduling_params\s*=" tensorrt_llm/_torch/pyexecutor/ --type py -B2 -A2 | head -80

echo
echo "=== Check if warmup requests flow through adp_router or scheduler ==="
rg -n "is_warmup.*schedule\|schedule.*is_warmup" tensorrt_llm/_torch/pyexecutor/ --type py -B2 -A2

Repository: NVIDIA/TensorRT-LLM

Length of output: 6873


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Check how warmup estimation phase works - does it actually use the ADP scheduler? ==="
sed -n '420,450p' tensorrt_llm/_torch/pyexecutor/py_executor.py

echo
echo "=== Check estimate_max_kv_cache_tokens in resource_manager ==="
rg -n "estimate_max_kv_cache_tokens" tensorrt_llm/_torch/pyexecutor/ --type py -B5 -A15 | head -100

echo
echo "=== Verify: are dummy requests from _create_dummy_context_requests actually scheduled through ADP? ==="
rg -n "self._dummy_reqs\|_get_token_num_for_estimation" tensorrt_llm/_torch/pyexecutor/_util.py -A20 | head -80

Repository: NVIDIA/TensorRT-LLM

Length of output: 4119


🏁 Script executed:

#!/bin/bash
set -euo pipefixo pipefail

echo "=== Where is _get_token_num_for_estimation called? ==="
rg -n "_get_token_num_for_estimation" tensorrt_llm/_torch/pyexecutor/ --type py -B3 -A3

echo
echo "=== Find estimate_max_kv_cache_tokens definition and usage ==="
rg -n "def estimate_max_kv_cache_tokens\|estimate_max_kv_cache_tokens\(" tensorrt_llm/_torch/pyexecutor/ --type py -B2 -A8 | head -120

Repository: NVIDIA/TensorRT-LLM

Length of output: 116


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Where is _get_token_num_for_estimation called? ==="
rg -n "_get_token_num_for_estimation" tensorrt_llm/_torch/pyexecutor/ --type py -B3 -A3

echo
echo "=== Find estimate_max_kv_cache_tokens definition and usage ==="
rg -n "def estimate_max_kv_cache_tokens\|estimate_max_kv_cache_tokens\(" tensorrt_llm/_torch/pyexecutor/ --type py -B2 -A8 | head -120

Repository: NVIDIA/TensorRT-LLM

Length of output: 1514


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Check the context around line 356 in _util.py ==="
sed -n '340,380p' tensorrt_llm/_torch/pyexecutor/_util.py

echo
echo "=== Search for estimate_max_kv_cache_tokens C++ binding ==="
rg -n "estimate_max_kv_cache_tokens" tensorrt_llm/_torch/pyexecutor/ --type py -B5 -A10 | head -150

echo
echo "=== Check key insight: does the warmup actually route through ADP scheduler? ==="
sed -n '350,390p' tensorrt_llm/_torch/pyexecutor/_util.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 5734


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Search for estimate_max_kv_cache_tokens in C++ bindings or wrappers ==="
find tensorrt_llm -name "*.py" -type f | xargs grep -l "estimate_max_kv_cache_tokens" | head -5

echo
echo "=== Check model_engine.warmup to understand the estimation flow ==="
rg -n "def warmup" tensorrt_llm/_torch/pyexecutor/model_engine.py -A30 | head -60

echo
echo "=== Critical question: are dummy requests actually scheduled through ADP router during estimation? ==="
rg -n "estimate_max_kv_cache_tokens\|self._dummy_reqs" tensorrt_llm/_torch/pyexecutor/model_engine.py -B3 -A3

Repository: NVIDIA/TensorRT-LLM

Length of output: 3689


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Check if there's any explicit documentation or assertion about the scheduling invariant ==="
rg -n "one full base copy|even.*distribute|scheduler.*evenly" tensorrt_llm/_torch/pyexecutor/ --type py -B3 -A3

echo
echo "=== Search for any code that mentions grouping or partitioning dummy requests by base copy ==="
rg -n "base.*copy|copy.*base|partition.*request|group.*request" tensorrt_llm/_torch/pyexecutor/ --type py -B2 -A2 | head -40

echo
echo "=== Final check: is there any test or comment that validates the ceil(total/tp_size) assumption? ==="
rg -n "ceil.*tp_size\|num_cache_blocks.*tp_size\|ADP.*schedule" tensorrt_llm/_torch/pyexecutor/_util.py -B5 -A5

Repository: NVIDIA/TensorRT-LLM

Length of output: 4461


The per-rank KV cache bound relies on an unverified scheduling invariant.

The comment claims the scheduler distributes replicated ADP warmup requests evenly (1 per rank), but the actual DefaultADPRouter uses heap-based load balancing on token count, not request count. Since _create_dummy_context_requests() does not assign attention_dp_rank to dummy requests, they enter the balancing phase where _balance_requests_across_ranks() distributes them to minimize token load per rank. With requests of varying sizes, one rank can receive zero requests while another receives multiple, violating the "1 per rank" assumption. The ceil(total_blocks / tp_size) calculation at line 318 will then underestimate per-rank KV capacity, risking OOM during warmup.
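The uneven-distribution scenario described above can be reproduced with a toy balancer. This is a hypothetical sketch; DefaultADPRouter's real routing is more involved, but the idea of greedily assigning each request to the rank with the smallest token load is the same, and it shows how request counts per rank can diverge from "1 per rank".

```python
import heapq

def balance_by_token_load(request_token_counts, num_ranks):
    """Greedily assign each request to the currently least-loaded rank."""
    heap = [(0, rank) for rank in range(num_ranks)]  # (token_load, rank)
    heapq.heapify(heap)
    assignments = [[] for _ in range(num_ranks)]
    for tokens in request_token_counts:
        load, rank = heapq.heappop(heap)
        assignments[rank].append(tokens)
        heapq.heappush(heap, (load + tokens, rank))
    return assignments

# One large request pins rank 0; the three small ones all land on rank 1,
# so the per-rank request counts are 1 and 3 — not "1 per rank".
per_rank = balance_by_token_load([100, 1, 1, 1], num_ranks=2)
assert sorted(len(reqs) for reqs in per_rank) == [1, 3]
```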

Either ensure attention_dp_rank is assigned so that each rank is guaranteed one full base copy during warmup, or use a safer capacity bound that does not assume even distribution (for example, max(ceil(num_cache_blocks / tp_size), largest_single_request_blocks)).
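The safer bound the review proposes could look like the following. This is a sketch under the reviewer's assumptions; `conservative_per_rank_blocks` and its arguments are illustrative names, not existing code.

```python
def conservative_per_rank_blocks(total_blocks: int, tp_size: int,
                                 per_request_blocks: list) -> int:
    """Per-rank capacity that survives uneven request placement.

    The even ceil(total / tp_size) share is the best case; the largest
    single request must still fit on whichever rank receives it, so
    take the max of both bounds.
    """
    even_share = (total_blocks + tp_size - 1) // tp_size  # ceiling division
    largest_request = max(per_request_blocks, default=0)
    return max(even_share, largest_request)

# 10 total blocks over 4 ranks would be 3 per rank, but one request alone
# needs 6 blocks, so the per-rank bound must be at least 6.
assert conservative_per_rank_blocks(10, 4, [6, 2, 1, 1]) == 6
assert conservative_per_rank_blocks(10, 4, [3, 3, 2, 2]) == 3
```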


@dominicshanshan
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #38574 [ run ] triggered by Bot. Commit: 25956fb Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38574 [ run ] completed with state SUCCESS. Commit: 25956fb
/LLM/main/L0_MergeRequest_PR pipeline #29913 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@schetlur-nv schetlur-nv self-requested a review March 11, 2026 22:41
