[None][chore] Mass integration of release/1.2 weekly - 6th #11934
chzblych merged 14 commits into NVIDIA:main from
Conversation
Force-push: b6afb17 → e154350
📝 Walkthrough

Multiple documentation updates standardize TensorRT LLM naming and hyperlinks across deployment and feature guides. Code additions include VSWA window-size memory management and synchronization in resource allocation, click option enhancements for help handling, attribute guarding in tokenizer initialization, and request failure tracking in test reporting. A new release sanity test list was introduced.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed (1 warning, 1 inconclusive)
Force-push: e154350 → eb53b26
Caution: Some comments are outside the diff and can't be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (2)
Lines 422-486: ⚠️ Potential issue | 🟠 Major: Cross-rank block reduction is executed twice.
Lines 455-486 duplicate the same MIN allreduce loop already run in Lines 422-453, which doubles cross-rank synchronization overhead and log noise.
Suggested cleanup
```diff
 blocks_per_window = self.calculate_max_num_blocks_for_vswa(
     kv_cache_config=kv_cache_config,
     model_config=model_config,
     extra_cost_memory=0,
 )
 if mapping.world_size > 1:
+    original_blocks_per_window = blocks_per_window.copy()
     # make sure all ranks use the same number of primary/secondary blocks
     if mpi_disabled():
@@
         blocks_per_window[window_size] = (reduced_primary_blocks,
                                           reduced_secondary_blocks)
     logger.info(
-        f"[MPI rank={mapping.rank}] Original blocks_per_window: {blocks_per_window}"
+        f"[MPI rank={mapping.rank}] Original blocks_per_window: {original_blocks_per_window}"
     )
     logger.info(
         f"[MPI rank={mapping.rank}] Reduced blocks_per_window: {blocks_per_window}"
     )
-
-if mapping.world_size > 1:
-    ...
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/pyexecutor/resource_manager.py` around lines 422 - 486, The code runs the same cross-rank MIN allreduce and logging twice for blocks_per_window; remove the duplicate second if mapping.world_size > 1 block (the one that repeats the allreduce loops and the two logger.info calls) so the reduction and logs only occur once; locate the duplicate by searching for the repeated use of mapping.world_size, mpi_disabled(), torch_comm()/mpi_comm(), MPI.MIN or torch.distributed.ReduceOp.MIN, and logger.info related to "Original blocks_per_window" / "Reduced blocks_per_window" and delete the redundant section.
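For intuition, the surviving reduction pass amounts to an element-wise MIN across ranks so that no rank tries to allocate more blocks than the most memory-constrained rank can hold. A minimal pure-Python sketch of that semantics, with the ranks simulated in-process (the dict shape `window_size -> (primary_blocks, secondary_blocks)` follows the review text; the function name and setup are illustrative, not the project's API):

```python
def reduce_blocks_per_window(per_rank):
    """Element-wise MIN across ranks so every rank ends up with the same
    (primary, secondary) block counts per attention window size.

    per_rank: list of dicts, one per rank,
              mapping window_size -> (primary_blocks, secondary_blocks).
    Returns the reduced mapping that all ranks would agree on.
    """
    reduced = {}
    for window_size in per_rank[0]:
        primaries = [rank[window_size][0] for rank in per_rank]
        secondaries = [rank[window_size][1] for rank in per_rank]
        reduced[window_size] = (min(primaries), min(secondaries))
    return reduced

# Two ranks disagree because their free GPU memory differs; the MIN
# reduction picks the value every rank can actually allocate.
rank0 = {512: (100, 20), 4096: (40, 8)}
rank1 = {512: (96, 22), 4096: (44, 6)}
print(reduce_blocks_per_window([rank0, rank1]))
# → {512: (96, 20), 4096: (40, 6)}
```

In the real code this MIN is performed collectively (e.g. an MPI or `torch.distributed` allreduce), which is exactly why running it twice doubles the synchronization cost the review flags.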
Lines 1178-1333: ⚠️ Potential issue | 🔴 Critical: Fix parser-blocking merge artifacts in VSWA section.
The file cannot be imported due to syntax errors caused by duplicated lines:
- Duplicate `self` parameter in the `adjust_window_sizes_for_vswa` signature (line 1180)
- Duplicated `logger.info(` statements (lines 1214–1215, 1228–1229, and 1222–1223)
- Duplicate `calculate_max_num_blocks_for_vswa` definition (line 1303)
These appear to be unintended merge artifacts. Remove all duplications to restore valid Python syntax.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/pyexecutor/resource_manager.py` around lines 1178 - 1333, The VSWA section contains merge-artifact duplicates causing syntax errors: in adjust_window_sizes_for_vswa remove the extra duplicated self parameter in the signature and delete duplicate statements (the repeated total_kv_heads assignment and all duplicated logger.info calls inside adjust_window_sizes_for_vswa), then remove the duplicate definition of calculate_max_num_blocks_for_vswa so only one valid def remains; ensure adjust_window_sizes_for_vswa and calculate_max_num_blocks_for_vswa have single, correctly formatted signatures and that logging calls are not repeated.
🧹 Nitpick comments (2)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (2)
Lines 415-417: Remove the stray tuple assignment before the real VSWA call.
Lines 415-416 assign `blocks_per_window` to a tuple and then line 417 overwrites it immediately. This is dead/confusing residue.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/pyexecutor/resource_manager.py` around lines 415 - 417, There is a stray tuple assignment to blocks_per_window before the real call; remove the redundant line so blocks_per_window is only set by the actual call to calculate_max_num_blocks_for_vswa(...). Locate the assignments around the calculate_max_num_blocks_for_vswa invocation in resource_manager.py (references: blocks_per_window and calculate_max_num_blocks_for_vswa, and the KvCacheConfig assertion) and delete the dead tuple-assignment statement so the single, correct assignment remains.
Lines 1410-1473: Drop the duplicated pre-pass in `calculate_max_num_blocks_for_vswa`.
This block duplicates work that is executed again in lines 1474-1506, adding redundant compute/logging and increasing drift risk.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/pyexecutor/resource_manager.py` around lines 1410 - 1473, The code contains a duplicated "pre-pass" that computes blocks_per_window and logs window/allocation info inside calculate_max_num_blocks_for_vswa (the loop using window_size_to_layers, window_size_shares, calculate_cache_size_per_token, kv_cache_config, primary_tokens/secondary_tokens, primary_blocks/secondary_blocks and logger) which is repeated later; remove this earlier duplicate pass and keep a single canonical computation (the later block at ~1474-1506), or consolidate both into one function, ensuring calculate_cache_size_per_token remains defined once (used by the single pass), window_size_shares logic (env TRTLLM_WINDOW_SIZE_SHARES fallback) is preserved, and all logging and kv_cache_config/is_vswa handling is performed only in that single location to avoid redundant compute and logs.
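As context for what the single canonical pass computes, here is a hedged sketch of turning a free-memory budget into per-window block counts. The proportional-share policy and the names `window_shares`, `bytes_per_token`, and `tokens_per_block` are assumptions inferred from the review text (which mentions `window_size_shares`, `calculate_cache_size_per_token`, and a `TRTLLM_WINDOW_SIZE_SHARES` fallback), not TensorRT LLM's actual implementation:

```python
def split_blocks_by_share(free_bytes, window_shares, bytes_per_token,
                          tokens_per_block):
    """Divide a KV-cache memory budget across attention window sizes.

    window_shares: mapping window_size -> fraction of the budget (sums to 1.0),
                   e.g. from a TRTLLM_WINDOW_SIZE_SHARES-style override.
    bytes_per_token: per-window KV-cache cost of one token, covering the
                     layers assigned to that window size.
    Returns window_size -> number of whole cache blocks.
    """
    blocks = {}
    for window_size, share in window_shares.items():
        budget = free_bytes * share          # bytes earmarked for this window
        tokens = budget // bytes_per_token[window_size]
        blocks[window_size] = int(tokens // tokens_per_block)
    return blocks

# 1 GiB budget split 25%/75% between a small and a large window
# (per-token costs are illustrative).
free = 1 << 30
shares = {512: 0.25, 4096: 0.75}
cost = {512: 2048, 4096: 8192}
print(split_blocks_by_share(free, shares, cost, tokens_per_block=32))
# → {512: 4096, 4096: 3072}
```

Running this computation once, as the review suggests, also means the "Original"/"Reduced" log lines appear exactly once per rank.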
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 2e576cef-f543-4a34-8abd-dd4512a612ac
📒 Files selected for processing (13)
- docs/source/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.md
- docs/source/deployment-guide/configuring-cpu-affinity.md
- docs/source/features/disagg-serving.md
- docs/source/release-notes.md
- examples/models/core/multimodal/README.md
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
- tensorrt_llm/commands/bench.py
- tensorrt_llm/tokenizer/tokenizer.py
- tests/integration/defs/perf/disagg/execution/executor.py
- tests/integration/defs/perf/disagg/reporting/report.py
- tests/integration/defs/perf/disagg/testlist/release_sanity.txt
- tests/integration/test_lists/waives.txt
- triton_backend/all_models/multimodal/Deprecation_notice.md
Force-push: eb53b26 → fd41d49
/bot run --disable-fail-fast

PR_Github #37847 [ run ] triggered by Bot. Commit:
…ughput_mtp test (NVIDIA#11717) Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
Force-push: fd41d49 → 1689fa1
…mand help to work. (NVIDIA#11722) Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…ing hang with asymmetric PP/TP (NVIDIA#11789) Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…1779) Signed-off-by: Mindy Li <11663212+limin2021@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
Signed-off-by: yuhangh <58161490+heyuhhh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
Signed-off-by: yingguo-trt <244492186+yingguo-trt@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
Force-push: 1689fa1 → 7e92ce0
/bot run --disable-fail-fast

PR_Github #37875 [ run ] triggered by Bot. Commit:
PR_Github #37875 [ run ] completed with state
/bot run --disable-fail-fast

PR_Github #37940 [ run ] triggered by Bot. Commit:

/bot run --disable-fail-fast

PR_Github #37940 [ run ] completed with state

PR_Github #37977 [ run ] triggered by Bot. Commit:

/bot run --disable-fail-fast

PR_Github #37979 [ run ] triggered by Bot. Commit:
PR_Github #37979 [ run ] completed with state
full:DGX_H100/unittest/_torch/auto_deploy/unit/multigpu/transformations/library/test_bmm_sharding.py::test_sharding[1-1] SKIP (https://nvbugs/5936322) is duplicated by unittest/_torch/auto_deploy/unit/multigpu/transformations/library/test_bmm_sharding.py::test_sharding[1-1] SKIP (https://nvbugs/5875203). Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
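The duplicated waive above is the kind of thing a small preflight check over the waives file can catch. A hedged sketch, assuming the format shown in the commit message: one `<test-id> SKIP (<bug-url>)` entry per line, where an optional `full:<platform>/` prefix should be ignored when comparing ids (the function name and the shortened sample paths are illustrative):

```python
def find_duplicate_waives(lines):
    """Return test ids waived more than once, ignoring a 'full:<platform>/'
    prefix, so 'full:DGX_H100/foo::bar SKIP (...)' and 'foo::bar SKIP (...)'
    are recognized as the same waive."""
    seen = set()
    duplicates = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        test_id = line.split(" SKIP", 1)[0]
        if test_id.startswith("full:"):
            # drop the 'full:DGX_H100/' style platform qualifier
            test_id = test_id.split("/", 1)[1]
        if test_id in seen:
            duplicates.append(test_id)
        seen.add(test_id)
    return duplicates

waives = [
    "full:DGX_H100/unittest/test_bmm_sharding.py::test_sharding[1-1] SKIP (https://nvbugs/5936322)",
    "unittest/test_bmm_sharding.py::test_sharding[1-1] SKIP (https://nvbugs/5875203)",
]
print(find_duplicate_waives(waives))
# → ['unittest/test_bmm_sharding.py::test_sharding[1-1]']
```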
/bot reuse-pipeline

PR_Github #38094 [ reuse-pipeline ] triggered by Bot. Commit:
PR_Github #38094 [ reuse-pipeline ] completed with state
Summary by CodeRabbit
Release Notes
Documentation
Bug Fixes
Tests
Description
This is the weekly Mass Integration (MI) for release/1.2. The following PRs will not be cherry-picked back to main:
#11702 and #11778 will be merged into main with #11898 (dependency and container image security update) @yiqingy0
#11744 is a duplicate of #11743 @lancelly
#11771 is a duplicate of #11296 @yechank-nvidia
#11683 @yibinl-nvidia
#11757 @eopXD
#11775 (waive test already in main)
and 13 infra PRs with the title: [None][infra] Check in most recent lock file from nightly pipeline
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update the tava architecture diagram if there is a significant design change in the PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.