[#12071][fix] Replace cudaMemcpy2DAsync with flat copy in copyKvBlockOffsets #12111

Open
wojciech-wais wants to merge 2 commits into NVIDIA:main from wojciech-wais:fix/beam-search-cudamemcpy2d-invalid-pitch-12071

wojciech-wais (Contributor) commented Mar 11, 2026

Replace cudaMemcpy2DAsync with BufferManager::copy (flat async memcpy) for copying KV cache block offsets from host to device. The 2D copy was an optimization to skip zero-padded columns, but it triggered cudaErrorInvalidPitchValue with large beam widths (e.g. beam_width=128).

The flat copy is semantically equivalent since both host and device tensors share the same shape and the host tensor is fully prepared (zeroed then populated). The performance difference is negligible as this is a small metadata buffer.
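
To illustrate the equivalence argument, here is a minimal host-only sketch. It is not the code from this PR: `copy2D` emulates a pitched copy with row-wise `std::memcpy`, `copyFlat` stands in for a flat copy such as `BufferManager::copy`, and all names (`Offset`, `makeHostOffsets`, `flatMatches2D`) are hypothetical. It assumes both buffers share the same shape, that the host buffer is zeroed before being populated, and that the destination padding starts out as zero.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for the block offset element type.
using Offset = std::int32_t;

// Emulates a pitched (2D) copy on the host: copies widthBytes from each of height rows.
void copy2D(void* dst, std::size_t dpitch, void const* src, std::size_t spitch,
    std::size_t widthBytes, std::size_t height)
{
    // A pitched copy requires the copied width to fit within both pitches.
    assert(widthBytes <= dpitch && widthBytes <= spitch);
    auto* d = static_cast<unsigned char*>(dst);
    auto const* s = static_cast<unsigned char const*>(src);
    for (std::size_t row = 0; row < height; ++row)
    {
        std::memcpy(d + row * dpitch, s + row * spitch, widthBytes);
    }
}

// Emulates a flat copy of the whole buffer, padding included.
void copyFlat(void* dst, void const* src, std::size_t totalBytes)
{
    std::memcpy(dst, src, totalBytes);
}

// Builds a zeroed rows x maxBlocksPerSeq buffer and fills the first usedBlocks columns.
std::vector<Offset> makeHostOffsets(std::size_t rows, std::size_t maxBlocksPerSeq, std::size_t usedBlocks)
{
    std::vector<Offset> host(rows * maxBlocksPerSeq, 0);
    for (std::size_t r = 0; r < rows; ++r)
    {
        for (std::size_t c = 0; c < usedBlocks; ++c)
        {
            host[r * maxBlocksPerSeq + c] = static_cast<Offset>(r * 1000 + c + 1);
        }
    }
    return host;
}

// Returns true if the flat copy yields the same destination contents as the pitched copy.
bool flatMatches2D(std::size_t rows, std::size_t maxBlocksPerSeq, std::size_t usedBlocks)
{
    auto const host = makeHostOffsets(rows, maxBlocksPerSeq, usedBlocks);
    std::vector<Offset> dst2D(host.size(), 0);   // destination padding assumed zero
    std::vector<Offset> dstFlat(host.size(), 0);
    std::size_t const pitch = maxBlocksPerSeq * sizeof(Offset);
    copy2D(dst2D.data(), pitch, host.data(), pitch, usedBlocks * sizeof(Offset), rows);
    copyFlat(dstFlat.data(), host.data(), host.size() * sizeof(Offset));
    return dst2D == dstFlat;
}
```

Calling `flatMatches2D(4, 8, 3)` compares the two strategies for 4 rows with 3 of 8 columns in use. The flat copy moves the zero padding as well, which is the extra traffic the 2D copy was meant to avoid, but it involves no pitch arguments at all.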

Also add validation assertions to catch block count vs max blocks per sequence inconsistencies early with clear error messages.
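
The kind of check described above can be sketched as follows. This is an illustration only: `checkBlockCount` is a hypothetical helper that throws a standard exception, whereas the actual change may use the project's own assertion macros. The variable names `maxBlockCount` and `maxBlocksPerSeq` follow the review summary.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: rejects a block count that does not fit in the offsets tensor row.
void checkBlockCount(std::size_t maxBlockCount, std::size_t maxBlocksPerSeq)
{
    if (maxBlockCount > maxBlocksPerSeq)
    {
        throw std::invalid_argument("maxBlockCount (" + std::to_string(maxBlockCount)
            + ") exceeds maxBlocksPerSeq (" + std::to_string(maxBlocksPerSeq) + ")");
    }
}
```

Failing here, before any copy is issued, reports the inconsistent values directly instead of surfacing later as a CUDA error.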

Summary by CodeRabbit

  • Bug Fixes

    • Enhanced K-V cache block offset validation with runtime checks for block counts and tensor shape configurations, providing clearer error messages when issues occur.
  • Refactor

    • Improved internal memory management for K-V cache block offset transfers for better reliability.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

coderabbitai bot (Contributor) commented Mar 11, 2026

📝 Walkthrough

Updated transformerBuffers.cpp to refactor KV block offset handling by replacing low-level CUDA stream operations with manager-based memory operations and adding runtime validation for block count constraints.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| KV Block Offsets Refactoring: cpp/tensorrt_llm/batch_manager/transformerBuffers.cpp | Replaced cudaStream usage with manager for offset transfers; updated to use 4D block offset shape; added runtime validation ensuring maxBlockCount ≤ maxBlocksPerSeq; updated lambda captures and memory copy logic. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly identifies the specific issue (#12071), the type of change (fix), and the main modification (replacing cudaMemcpy2DAsync with flat copy in copyKvBlockOffsets). |
| Description check | ✅ Passed | The PR description provides clear context on the issue (cudaErrorInvalidPitchValue with large beam widths), the solution (replacing cudaMemcpy2DAsync with flat copy), and the rationale (negligible performance impact for small metadata buffer). |


wojciech-wais and others added 2 commits March 11, 2026 14:50
…vBlockOffsets

Signed-off-by: Wojciech Wais <wojciech.wais@gmail.com>
wojciech-wais force-pushed the fix/beam-search-cudamemcpy2d-invalid-pitch-12071 branch from 1d8dd51 to f1528c1 on March 11, 2026 13:51
svc-trtllm-gh-bot added the "Community want to contribute" label (PRs initiated from Community) on Mar 11, 2026