[#12071][fix] Replace cudaMemcpy2DAsync with flat copy in copyKvBlockOffsets by wojciech-wais · Pull Request #12111 · NVIDIA/TensorRT-LLM

wojciech-wais · 2026-03-11T13:44:20Z

Replace cudaMemcpy2DAsync with BufferManager::copy (flat async memcpy) for copying KV cache block offsets from host to device. The 2D copy was an optimization to skip zero-padded columns, but it triggered cudaErrorInvalidPitchValue with large beam widths (e.g. beam_width=128).

The flat copy is semantically equivalent since both host and device tensors share the same shape and the host tensor is fully prepared (zeroed then populated). The performance difference is negligible as this is a small metadata buffer.
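
To make the equivalence argument concrete, here is a minimal host-only sketch in plain C++ (no CUDA calls). `copy2D` stands in for the row-by-row semantics of a pitched copy, and `copyFlat` for a single contiguous copy. These helpers, along with `makeHostOffsets`, `copiesAgree`, and the sample values, are illustrative assumptions and not code from this PR. The sketch also assumes the destination buffer starts out zeroed. It only illustrates that both strategies yield the same bytes when the host table is zeroed before it is populated.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Host-side stand-in for a pitched 2D copy (hypothetical helper): copies the
// first `widthBytes` of each of `height` rows, where consecutive rows are
// `srcPitch` / `dstPitch` bytes apart.
void copy2D(void* dst, std::size_t dstPitch, void const* src, std::size_t srcPitch,
    std::size_t widthBytes, std::size_t height)
{
    assert(widthBytes <= dstPitch && widthBytes <= srcPitch);
    auto* d = static_cast<unsigned char*>(dst);
    auto const* s = static_cast<unsigned char const*>(src);
    for (std::size_t row = 0; row < height; ++row)
    {
        std::memcpy(d + row * dstPitch, s + row * srcPitch, widthBytes);
    }
}

// Flat copy (hypothetical helper): one contiguous transfer of the whole buffer.
void copyFlat(void* dst, void const* src, std::size_t totalBytes)
{
    std::memcpy(dst, src, totalBytes);
}

// Builds a zeroed [rows x maxBlocksPerSeq] table and fills only the first
// `usedBlocks` columns of each row, mimicking "zeroed then populated".
std::vector<std::int32_t> makeHostOffsets(
    std::size_t rows, std::size_t maxBlocksPerSeq, std::size_t usedBlocks)
{
    assert(usedBlocks <= maxBlocksPerSeq);
    std::vector<std::int32_t> host(rows * maxBlocksPerSeq, 0);
    for (std::size_t r = 0; r < rows; ++r)
    {
        for (std::size_t c = 0; c < usedBlocks; ++c)
        {
            host[r * maxBlocksPerSeq + c] = static_cast<std::int32_t>(r * 100 + c + 1);
        }
    }
    return host;
}

// Returns true when the pitched copy of the used columns and the flat copy of
// the whole table leave identical contents in zero-initialized destinations.
bool copiesAgree(std::size_t rows, std::size_t maxBlocksPerSeq, std::size_t usedBlocks)
{
    auto const host = makeHostOffsets(rows, maxBlocksPerSeq, usedBlocks);
    std::vector<std::int32_t> via2D(host.size(), 0);
    std::vector<std::int32_t> viaFlat(host.size(), 0);

    std::size_t const pitch = maxBlocksPerSeq * sizeof(std::int32_t);
    copy2D(via2D.data(), pitch, host.data(), pitch, usedBlocks * sizeof(std::int32_t), rows);
    copyFlat(viaFlat.data(), host.data(), host.size() * sizeof(std::int32_t));
    return via2D == viaFlat;
}
```

The flat variant moves the zero-padded columns as well, which is the extra traffic the 2D copy was meant to avoid. For a small table this cost is the trade-off the description above accepts.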

Also add validation assertions to catch block count vs max blocks per sequence inconsistencies early with clear error messages.
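
As a rough illustration of this kind of early check, the sketch below rejects a block count that exceeds the per-sequence limit and reports both values in the message. The function names `checkBlockCount` and `blockCountIsValid` and the exception type are assumptions made for this example; the actual change may use the project's own assertion macros and wording.

```cpp
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical validation helper: throws with a descriptive message when the
// number of blocks to copy exceeds the per-sequence capacity of the table.
void checkBlockCount(std::size_t maxBlockCount, std::size_t maxBlocksPerSeq)
{
    if (maxBlockCount > maxBlocksPerSeq)
    {
        std::ostringstream msg;
        msg << "maxBlockCount (" << maxBlockCount << ") exceeds maxBlocksPerSeq ("
            << maxBlocksPerSeq << ")";
        throw std::invalid_argument(msg.str());
    }
}

// Non-throwing wrapper (also hypothetical) for call sites that prefer a boolean.
bool blockCountIsValid(std::size_t maxBlockCount, std::size_t maxBlocksPerSeq)
{
    try
    {
        checkBlockCount(maxBlockCount, maxBlocksPerSeq);
        return true;
    }
    catch (std::invalid_argument const&)
    {
        return false;
    }
}
```

Failing before the copy is issued keeps the error close to its cause, instead of surfacing later as an opaque CUDA error.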

Summary by CodeRabbit

Bug Fixes
- Enhanced K-V cache block offset validation with runtime checks for block counts and tensor shape configurations, providing clearer error messages when issues occur.
Refactor
- Improved internal memory management for K-V cache block offset transfers for better reliability.

coderabbitai · 2026-03-11T13:46:28Z

📝 Walkthrough

Updated transformerBuffers.cpp to refactor KV block offset handling by replacing low-level CUDA stream operations with manager-based memory operations and adding runtime validation for block count constraints.

Changes

| Cohort / File(s) | Summary |
|---|---|
| KV Block Offsets Refactoring `cpp/tensorrt_llm/batch_manager/transformerBuffers.cpp` | Replaced cudaStream usage with manager for offset transfers; updated to use 4D block offset shape; added runtime validation ensuring maxBlockCount ≤ maxBlocksPerSeq; updated lambda captures and memory copy logic. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly identifies the specific issue (`#12071`), the type of change (fix), and the main modification (replacing cudaMemcpy2DAsync with flat copy in copyKvBlockOffsets). |
| Description check | ✅ Passed | The PR description provides clear context on the issue (cudaErrorInvalidPitchValue with large beam widths), the solution (replacing cudaMemcpy2DAsync with flat copy), and the rationale (negligible performance impact for small metadata buffer). |

…vBlockOffsets

Replace cudaMemcpy2DAsync with BufferManager::copy (flat async memcpy) for copying KV cache block offsets from host to device. The 2D copy was an optimization to skip zero-padded columns, but it triggered cudaErrorInvalidPitchValue with large beam widths (e.g. beam_width=128).

The flat copy is semantically equivalent since both host and device tensors share the same shape and the host tensor is fully prepared (zeroed then populated). The performance difference is negligible as this is a small metadata buffer.

Also add validation assertions to catch block count vs max blocks per sequence inconsistencies early with clear error messages.

Signed-off-by: Wojciech Wais <wojciech.wais@gmail.com>

…2071

wojciech-wais and others added 2 commits March 11, 2026 14:50

Merge branch 'main' into fix/beam-search-cudamemcpy2d-invalid-pitch-1…

f1528c1

…2071

wojciech-wais force-pushed the fix/beam-search-cudamemcpy2d-invalid-pitch-12071 branch from 1d8dd51 to f1528c1 Compare March 11, 2026 13:51

svc-trtllm-gh-bot added the "Community want to contribute" label (PRs initiated from Community) on Mar 11, 2026

wojciech-wais wants to merge 2 commits into NVIDIA:main from wojciech-wais:fix/beam-search-cudamemcpy2d-invalid-pitch-12071
